Data Engineering using Databricks on AWS and Azure

Name: Data Engineering using Databricks on AWS and Azure
Rating: 4.6 (2078 reviews)

Build Data Engineering Pipelines using Databricks core features such as Spark, Delta Lake, cloudFiles, etc.

Created by Durga Viswanatha Raju Gadiraju, Phani Bhushan Bozzam, Vinay Gadiraju

Last updated 8/2024

English

What you'll learn

Data Engineering leveraging Databricks features
Databricks CLI to manage files, Data Engineering jobs and clusters for Data Engineering Pipelines
Deploying Data Engineering applications developed using PySpark on job clusters
Deploying Data Engineering applications developed using PySpark using Notebooks on job clusters
Perform CRUD Operations leveraging Delta Lake using Spark SQL for Data Engineering Applications or Pipelines
Perform CRUD Operations leveraging Delta Lake using PySpark for Data Engineering Applications or Pipelines
Setting up development environment to develop Data Engineering applications using Databricks
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.

Course content

24 sections • 267 lectures • 18h 57m total length

Overview of the course - Data Engineering using Databricks11:20
Explore key features of Databricks on AWS and Azure, set up development environments, use notebooks and the CLI, and master Spark, Delta Lake, and the Glue Catalog for data engineering.
Where are the resources that are used for this course?0:32

Getting Started with Databricks on Azure - Introduction1:30
Learn to start with Databricks on Azure: sign up, set quotas, create a multi-node cluster and workspace, upload data from Azure Blob, run notebooks, and clean up resources.
Signup for the Azure Account1:52
Sign up for an Azure account and log in to portal.azure.com to start using Databricks on Azure, with $200 credit for one month and quotas to explore.
Login and Increase Quotas for regional vCPUs in Azure3:21
Log in to portal.azure.com and increase your regional vCPU quotas to support Databricks on Azure, starting with 10–20 vCPUs and using a single-node cluster before scaling to multi-node setups.
Create Azure Databricks Workspace3:34
Learn to create an Azure Databricks workspace by selecting a region with quota (East US), creating a resource group, naming the workspace, and choosing standard pricing with no public IP.
Launching Azure Databricks Workspace or Cluster1:40
Begin by signing into portal.azure.com, locate the Azure Databricks dashboard, and launch the workspace. Navigate the Databricks platform to manage clusters and submit data jobs using Azure AD.
Quick Walkthrough of Azure Databricks UI2:44
Explore the Azure Databricks UI, including workspace navigation, notebooks, repos, clusters, jobs, and settings, plus token and Git integrations to develop, run, and monitor data workflows.
Create Azure Databricks Single Node Cluster3:04
Create an Azure Databricks single-node cluster and explore core concepts by configuring a minimal single-node setup, with LTS runtime 9.1, idle termination after 15 minutes, and basic Spark configurations.
Upload Data using Azure Databricks UI4:01
Upload data through the import and export data option of the Azure Databricks UI by dragging and dropping files into a retail_db folder under FileStore. Folders may not upload recursively, so use the CLI for folder uploads.
Overview of Creating Notebook and Validating Files4:30
Create a Python notebook in Azure Databricks (Scala, SQL, and R are also available), navigate FileStore and DBFS, use % magic commands, and apply Spark APIs to process and validate data.
Develop Spark Application using Azure Databricks Notebook7:15
Develop and run a Spark-based data processing app in Azure Databricks notebooks, reading CSV data from Azure Blob Storage via DBFS, then group by date and count orders.
Validate Spark Jobs using Azure Databricks Notebook2:56
Validate Spark jobs by writing the order count by date as CSV to the target location with the DataFrame's write.csv, then read it back with spark.read.csv to verify 364 records.
Export and Import of Azure Databricks Notebooks3:08
Master exporting and importing Azure Databricks notebooks, backing up workspaces and folders as archive or IPython notebooks, then reimporting as dbc files and attaching to running clusters.
Terminating Azure Databricks Cluster and Deleting Configuration1:28
Learn how to back up notebooks, terminate the Azure Databricks cluster, and delete the configuration, then clean up the workspace in the Azure portal.
Delete Azure Databricks Workspace by deleting Resource Group3:25
Learn to clean up an Azure Databricks workspace by deleting the resource group, and when to delete the workspace alone to avoid removing other resources.

Azure Essentials for Databricks - Azure CLI1:41
Learn to use Azure CLI to interact with Azure resources, create resource groups and storage accounts, manage containers, upload data for Databricks on Azure, on Mac or Windows.
Azure CLI using Azure Portal Cloud Shell2:23
Learn to use Azure CLI in Cloud Shell on the Azure Portal to list resource groups, view subcommands with -h, and filter results with query.
Getting Started with Azure CLI on Mac3:37
Install the Azure CLI on macOS using brew, log in to your account, and list resource groups, then retrieve just their names with a --query expression for streamlined management.
Getting Started with Azure CLI on Windows3:53
Install the Azure CLI on Windows with the MSI installer and validate in PowerShell. Then log in and run az group list with --query to show resource group names.
Warming up with Azure CLI - Overview1:28
Begin warming up with Azure CLI by creating a resource group and a storage account, then containers, uploading a directory, validating results, and deleting the storage account and resource group.
Create Resource Group using Azure CLI5:36
Install the Azure CLI on your platform, then create a resource group with az group create using location and name, and verify it with group list and query.
Create ADLS Storage Account within Resource Group6:59
Learn to manage storage accounts within a resource group using Azure CLI, including creating, listing, and deleting storage accounts with practical command examples.
Add Container as part of Storage Account3:55
Use az storage fs to manage file systems in Azure Data Lake Storage Gen2, listing and creating containers in the ITV demo storage account, and note that containers resemble AWS S3 buckets.
Overview of Uploading the data into ADLS File System or Container1:31
Upload data from your local file system into the data container of an Azure storage account using the storage fs directory upload command.
Setup Data Set locally to upload into ADLS File System or Container1:56
Prepare the local retail_db dataset. Clone its GitHub repository and upload the folder to an Azure storage account file system or container via Azure CLI.
Upload local directory into Azure ADLS File System or Container6:59
Upload a local directory to an Azure storage account container or file system using storage fs directory upload, recursively transferring the retail_db dataset and validating via listing commands.
Delete Azure ADLS Storage Account using Azure CLI3:53
Learn to delete an Azure ADLS storage account using Azure CLI, including listing accounts by resource group and confirming deletion, then delete the resource group.
Delete Azure Resource Group using Azure CLI2:55
Delete an Azure resource group with Azure CLI and observe that all resources in it, including storage accounts, are removed; recreating a group with the same name does not bring those resources back.
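
The warm-up flow described in this section (create a resource group, a storage account, and a container, upload a directory, validate, then clean up) can be sketched as a list of Azure CLI invocations. This is a minimal sketch and not the course's own material: the function name and all argument values are hypothetical, authentication flags for the storage commands are left out, and the commands are only assembled here, not executed.

```python
import shlex


def azure_cli_warmup_commands(resource_group, location, storage_account,
                              file_system, local_dir):
    """Return the warm-up sequence as argv lists (nothing is executed)."""
    return [
        # Resource group that holds everything created below.
        ["az", "group", "create", "--name", resource_group,
         "--location", location],
        # ADLS Gen2 means a storage account with hierarchical namespace on.
        ["az", "storage", "account", "create", "--name", storage_account,
         "--resource-group", resource_group, "--location", location,
         "--enable-hierarchical-namespace", "true"],
        # Container (file system) inside the storage account.
        ["az", "storage", "fs", "create", "--name", file_system,
         "--account-name", storage_account],
        # Recursive upload of a local directory.
        ["az", "storage", "fs", "directory", "upload",
         "--file-system", file_system, "--account-name", storage_account,
         "--source", local_dir, "--recursive"],
        # Validate by listing directories in the file system.
        ["az", "storage", "fs", "directory", "list",
         "--file-system", file_system, "--account-name", storage_account],
        # Clean up: storage account first, then the resource group.
        ["az", "storage", "account", "delete", "--name", storage_account,
         "--resource-group", resource_group, "--yes"],
        ["az", "group", "delete", "--name", resource_group, "--yes"],
    ]


# Hypothetical names, printed as copy-pasteable shell lines.
for argv in azure_cli_warmup_commands(
        "demo-rg", "eastus", "demostorageacct", "data", "retail_db"):
    print(shlex.join(argv))
```

Keeping each command as an argv list avoids quoting mistakes if the same list is later passed to subprocess.run.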

Mount ADLS on to Azure Databricks - Introduction2:41
Mount ADLS onto Azure Databricks to access containers through DBFS without supplying credentials each time. Integrate an Azure AD application with the storage account and mount the container onto Databricks workspaces.
[Material] - Mount ADLS on to Azure Databricks0:24
Ensure Azure Databricks Workspace2:03
Create a new Azure Databricks workspace in the Azure portal, configure a resource group and the East US region using standard pricing, launch the Databricks UI, and prepare for storage mounting.
Setup Databricks CLI on Mac or Windows using Python Virtual Environment2:11
Install the Databricks CLI in a Python virtual environment on Mac or Windows and set it up with token-based authentication, in preparation for mounting ADLS onto the Azure Databricks workspace and clusters.
Configure Databricks CLI for new Azure Databricks Workspace3:25
Configure the Databricks CLI to connect an Azure Databricks workspace by generating a token, setting the host, and using a dedicated profile for Databricks file system commands.
Register an Azure Active Directory Application3:36
Register an Azure Active Directory application to connect a storage account with the Databricks workspace, and capture the client secret, application ID, and directory tenant ID.
Create Databricks Secret for AD Application Client Secret4:18
Create a secret scope and store the Azure AD app client secret in Databricks secrets to enable mounting storage into the Azure Databricks workspace using the Databricks CLI.
Create ADLS Storage Account3:42
Use Azure CLI to validate the resource group, create an ADLS storage account named ITV DB Demo in East US, and verify the provisioning state for integration with Azure Databricks.
Assign IAM Role on Storage Account to Azure AD Application2:17
Assign the Storage Blob Data Contributor role to the Azure AD application in the storage account's access control, completing the role assignment for mounting to Azure Databricks.
Setup Retail DB Dataset1:42
Set up the retail_db dataset for mounting on Azure Databricks by cloning the retail_db repo from GitHub, removing the .git folder, and uploading the folder to a storage container.
Create ADLS Container or File System and Upload Data4:40
Create the data container (file system) in the ITV DB demo storage account, then upload the local data directory recursively and verify via fs directory list.
Start Databricks Cluster to mount ADLS2:03
Mount the storage account onto the Databricks workspace by provisioning a single-node Azure Databricks cluster, then access mounted files with DBFS commands even after cluster termination.
Mount ADLS Storage Account on to Azure Databricks6:31
Mount an Azure Data Lake Storage account to Azure Databricks via a Python notebook and a configs dict, providing app and directory IDs and client secret, then validate the mount.
Validate ADLS Mount Point on Azure Databricks Clusters4:20
Demonstrates mounting an ADLS storage container onto Azure Databricks, reads orders data with a defined schema, and writes JSON output to a mounted path for cloud-agnostic pipelines.
Unmount the mount point from Databricks1:01
Unmount the mount point from Databricks and validate removal from /mnt using dbutils.fs.unmount, ensuring the data disappears and resources are cleaned up to avoid charges.
Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks3:10
Remove the Azure resource group after unmounting and stopping the cluster to delete the Databricks workspace and storage accounts, ensuring no charges remain.
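
The mount step in this section builds a configs dict from the application ID, directory ID, and client secret. A minimal sketch of that dict follows, assuming the standard OAuth client-credentials settings for ADLS Gen2; the helper names and all sample values are hypothetical, and the dbutils calls appear only as comments because they exist only inside a Databricks notebook.

```python
def build_adls_mount_configs(application_id, directory_id, client_secret):
    """Return OAuth settings for mounting an ADLS Gen2 container."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": application_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


def build_adls_source(container, storage_account):
    """Return the abfss URI of a container in a storage account."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"


# Hypothetical values; in a notebook the secret would come from a secret scope:
#   client_secret = dbutils.secrets.get(scope="<scope>", key="<key>")
configs = build_adls_mount_configs("<application-id>", "<directory-id>", "<secret>")
source = build_adls_source("data", "<storage-account>")

# Inside a Databricks notebook the mount and unmount would then look like:
#   dbutils.fs.mount(source=source, mount_point="/mnt/<name>", extra_configs=configs)
#   dbutils.fs.unmount("/mnt/<name>")
print(source)
```

Reading the client secret from a Databricks secret scope, as the section describes, keeps it out of the notebook source.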

Introduction to Getting Started with Databricks on AWS0:56
Learn to get started with Databricks on AWS, explore how Databricks sits on a cloud data platform, and upload data using notebooks to build a basic solution.
Signup for AWS Account1:36
Sign up for an AWS account by visiting aws.amazon.com, creating and verifying the account, and adding a credit card; then log in to the console to start using AWS services, or use enterprise login if available.
Login into AWS Management Console2:13
Log into the AWS management console using your email, password, and two-factor authentication, then use the global search to locate services like EMR and view billing.
Setup Databricks Workspace on AWS using QuickStart7:07
Learn to sign up for Databricks on AWS using QuickStart, choose a cloud provider, review plans, and set up a workspace to manage the Databricks environment.
Login into Databricks Workspace on AWS2:16
Learn how to access a Databricks workspace on AWS by navigating to the login page, entering your email, and using bookmarks to streamline authentication and reach the ITV demo workspace.
Cleaning up the workspace2:17
Learn how to clean up Databricks workspaces by viewing details, updating settings, and deleting resources to prevent unexpected charges after your data engineering practice on AWS and Azure.
Quick Walkthrough of Databricks UI on AWS4:53
Explore a quick walkthrough of the Databricks UI on AWS, covering the sidebar, workspace, notebooks, tables, clusters, and data preview from DBFS and S3.
Create Single Node Databricks Cluster on AWS4:41
Learn to set up a single node Databricks cluster on AWS, choosing the latest runtime, configuring inactivity timeout, and selecting a suitable node type for learning and cost control.
Upload Data using AWS Databricks UI6:58
Upload data with the AWS Databricks UI by creating a folder structure in the file store and uploading files into six folders (categories, customers, departments, order items, orders, products).
Overview of Creating Databricks Notebook on AWS and Validating Files4:01
Create your first Databricks notebook on AWS, attach it to a cluster, use Python to explore the file system, and run a simple join between orders and order items.
Develop Spark Application using AWS Databricks Notebook11:54
Learn to build a Spark application in an AWS Databricks notebook, read data into dataframes, join orders with order items, and calculate daily revenue from complete or closed orders.
Review the AWS Databricks Cluster state and restart4:45
Review the AWS Databricks cluster state and restart by starting the cluster from the notebook, then write the daily revenue DataFrame to the path using the specified file format.
Write Data frame to DBFS and Validate using Databricks Notebook and Spark7:30
Persist processing results to DBFS to guard against ephemeral clusters. Write the daily revenue DataFrame to CSV with a header and validate by listing the folder and re-reading the data with Spark.
Export and Import AWS Databricks Notebooks8:53
Export and import AWS Databricks notebooks across accounts by exporting to a local file, then importing into another workspace; learn path adjustments, formats, and cluster attachment to run notebooks.
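
The Spark application in this section joins orders with order items and computes daily revenue from complete or closed orders. The course does this with Spark DataFrames on a cluster; the sketch below expresses the same logic in plain Python so that it runs without one. The column names and the uppercase status values are assumptions about the retail_db dataset, and the sample rows are made up.

```python
from collections import defaultdict


def daily_revenue(orders, order_items):
    """Sum item subtotals per order date for COMPLETE or CLOSED orders."""
    # Filter orders first, keeping a lookup from order id to order date.
    valid_orders = {
        order["order_id"]: order["order_date"]
        for order in orders
        if order["order_status"] in ("COMPLETE", "CLOSED")
    }
    revenue = defaultdict(float)
    # Inner join: items whose order was filtered out are skipped.
    for item in order_items:
        order_date = valid_orders.get(item["order_item_order_id"])
        if order_date is not None:
            revenue[order_date] += item["order_item_subtotal"]
    return dict(sorted(revenue.items()))


# Made-up sample rows for illustration only.
orders = [
    {"order_id": 1, "order_date": "2013-07-25", "order_status": "CLOSED"},
    {"order_id": 2, "order_date": "2013-07-25", "order_status": "PENDING"},
    {"order_id": 3, "order_date": "2013-07-26", "order_status": "COMPLETE"},
]
order_items = [
    {"order_item_order_id": 1, "order_item_subtotal": 300.0},
    {"order_item_order_id": 1, "order_item_subtotal": 200.0},
    {"order_item_order_id": 2, "order_item_subtotal": 250.0},
    {"order_item_order_id": 3, "order_item_subtotal": 130.0},
]
print(daily_revenue(orders, order_items))
```

In Spark the same steps map to a filter on the order status, a join on the order id, and a group-by on the order date with a sum of the subtotal.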

Introduction to Setup Local Environment with AWS CLI and Boto3 on Windows1:26
Set up a local Windows environment using Windows Subsystem for Linux to run AWS CLI and Boto3, enabling Databricks workflows on AWS and Azure.
Overview of Powershell on Windows 10 or Windows 11 4:25
Explore PowerShell on Windows 10 and 11, compare it with the DOS prompt, learn how to launch, basics, customization, and key features like remote machine access and file navigation.
Setup Ubuntu VM on Windows 10 or 11 using wsl6:07
Enable the Windows Subsystem for Linux and install an Ubuntu distribution using PowerShell. Reboot to complete the WSL setup on Windows 10 or 11.
Setup Ubuntu VM on Windows 10 or 11 using wsl5:16
Set up an Ubuntu-based virtual machine on Windows 10 or 11 with WSL, log in, and connect via PowerShell to explore the VM’s home directory.
Setup Python venv and pip on Ubuntu8:49
Set up a Python development environment on Ubuntu by installing Python 3 and pip, creating a virtual environment, updating apt, and validating the setup for data engineering tasks.
Setup AWS CLI on Windows and Ubuntu using Pip3:09
Set up a local development environment by installing the AWS CLI with pip on an Ubuntu-based virtual machine on Windows using WSL, and verify the tool installation and command usage.
Create AWS IAM User and Download Credentials3:49
Create an AWS IAM user with programmatic access, download the access key and secret key, and assign administrator access to enable configuration and credential validation.
Configure AWS CLI on Windows7:36
Configure the AWS CLI on Windows by setting up credentials and config files, managing profiles, and validating access with S3 bucket listings and secure cleanup.
Create Python Virtual Environment for AWS Projects3:14
Create and activate a Python virtual environment, install required libraries, and validate Python can interact with your AWS account, using Linux commands to manage a venv project folder.
Setup Boto3 as part of Python Virtual Environment2:29
Set up a Python virtual environment and pip install boto3 to let Python interact with the services. Validate the installation, create an S3 client, and list buckets.
Setup Jupyter Lab and Validate boto3 6:42
Install JupyterLab, launch the notebook server, and validate boto3 by creating an S3 client to list buckets. Configure credentials with a default profile to enable seamless AWS access.

Introduction to Setup Local Development Environment for AWS on Mac5:05
Set up a Mac-based local development environment for AWS using AWS CLI and Boto3 to interact with S3 buckets.
Setup AWS CLI on Mac2:08
Install the AWS CLI on a Mac via Python 3.8's pip and verify the installation. Configure the AWS CLI by creating a user and downloading the access key to enable account interaction.
Setup AWS IAM User to configure AWS CLI2:40
Create an AWS IAM user with programmatic access, attach administrator permissions, download credentials, and configure the AWS CLI to enable programmatic control of the account.
Configure AWS CLI using IAM User Credentials6:25
Configure the AWS CLI with IAM user credentials, create and switch profiles for multi-account access, set a default region, and validate credentials to interact with AWS services.
Setup Python Virtual Environment on Mac using Python 3 4:43
Set up and activate a Python 3.8 virtual environment on Mac, install required libraries with pip, and validate Python versions to ensure consistent behavior across environments.
Setup Boto3 as part of Python Virtual Environment2:29
Activate the Python virtual environment, install boto3 with pip, verify by importing boto3, and create an S3 client to list buckets, giving Jupyter or other development workflows access to Amazon Web Services.
Setup Jupyter Lab and Validate boto3 6:42
Set up a Jupyter Lab environment and validate boto3 to interact with AWS services using Python, including creating an S3 client, listing buckets, and handling credentials via the default profile.

Getting Started with AWS S3 2:59
Learn Amazon S3, a low-cost cloud storage service, to create buckets, set permissions, store objects from anywhere, and use Standard and Glacier storage with versioning and cross-region replication.
[Instructions] Getting Started with AWS S3 0:07
Setup Data Set locally to upload to s3 2:17
Set up a local data set by cloning the repository, organizing a local folder, and copying files into three S3 buckets as objects, taking care to avoid path mistakes and typos.
[Instructions] Setup Data Set locally to upload to s3 0:18
Adding AWS S3 Buckets and Objects5:49
Create a uniquely named S3 bucket in the AWS console using initials and hyphens, then organize and upload folders and files as objects in the bucket.
[Instruction] Adding AWS s3 Buckets and Objects0:25
Version Control in AWS S3 5:55
Enable bucket versioning in S3 and apply lifecycle rules to delete older versions for a prefix, reducing storage costs while preserving recoverability for disasters, and explore cross-region replication.
[Instructions] Version Control in AWS S3 1:01
AWS S3 Cross-Region Replication for fault tolerance9:15
Configure cross-region replication for an S3 bucket pair: create a destination bucket in another region, enable versioning, and set a replication rule. Replication improves availability and applies only to new objects.
[Instructions] AWS S3 Cross-Region Replication for fault tolerance0:49
Cross-Region Replication for Disaster Recovery of AWS S3 9:15
Explore cross-region replication for disaster recovery with AWS S3 by creating a destination bucket in a different region, configuring replication rules with a retail prefix, and enabling versioning.
Overview of AWS S3 Storage Classes5:58
Explore S3 storage classes: standard, infrequent access, glacier, and glacier deep archive, compare latency and costs, and manage storage with lifecycle and replication rules for cost-efficient backups.
[Instructions] Overview of AWS S3 Storage Classes or Storage Tiers0:51
Overview of AWS S3 Glacier3:08
Compare AWS S3 standard with glacier and glacier deep archive, highlighting cost savings, slower retrieval times, and backups, plus how to apply lifecycle rules to move objects.
[Instructions] Overview of Glacier in AWS s3 0:19
Managing AWS S3 using AWS CLI7:07
Learn to manage AWS S3 with the S3 CLI: list objects, view details, recursive listings, and bucket creation or deletion with force options, plus cp and rm basics.
[Instructions and Commands] Managing AWS S3 buckets and objects using AWS CLI0:27
Managing Objects in AWS S3 using CLI - Lab12:17
Master the AWS S3 CLI to manage objects in S3 buckets by listing contents recursively, deleting subfolders, and copying folders as objects with include and exclude options.
[Instructions] Managing Objects in AWS S3 using AWS CLI - Lab0:34
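
The versioning lecture in this section applies a lifecycle rule that deletes older versions under a prefix. A minimal sketch of such a rule follows, in the dictionary shape that boto3's put_bucket_lifecycle_configuration accepts. The function name, rule ID, prefix, and retention period are hypothetical; the boto3 calls appear only as comments so the block runs without AWS credentials.

```python
def noncurrent_version_expiration(prefix, days, rule_id="expire-old-versions"):
    """Build a lifecycle configuration that expires noncurrent versions."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                # Only objects under this key prefix are affected.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Delete a version this many days after it stops being current.
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


# Hypothetical prefix and retention period.
config = noncurrent_version_expiration("retail_db/", 30)

# With credentials configured, the rule could be applied roughly like this:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_versioning(
#       Bucket="<bucket>", VersioningConfiguration={"Status": "Enabled"})
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="<bucket>", LifecycleConfiguration=config)
print(config)
```

The rule only has an effect on buckets with versioning enabled, since only those keep noncurrent versions.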

Overview of AWS s3 and IAM for Databricks2:42
Master AWS S3 and IAM essentials for Databricks, including the Python SDK called Boto3, the S3 console and buckets, and IAM roles and policies.
Creating AWS IAM Users6:23
Create and configure an AWS IAM user using the management console, assign administrator access or specific policies, and provision programmatic and console access with password, access key, and secret key.
[Instructions] Creating IAM Users0:07
Logging into AWS Management Console using IAM User2:24
Use the ITV Admin IAM user to log in to the AWS management console, retrieve credentials from the CSV file, and complete the first-login password change.
[Instructions] Logging into AWS Management Console using IAM User0:21
Validate Programmatic Access to AWS IAM User2:15
Validate programmatic access to an AWS IAM user by configuring credentials, updating the ITV Admin profile, and creating and listing S3 buckets to confirm access.
[Instructions and Commands] Validate Programmatic Access to IAM User0:29
AWS IAM Identity-based Policies9:08
Explore AWS IAM identity-based policies, attaching multiple predefined or custom policies to users or groups, and defining effect, action, and resource, with hands-on S3 permission tests.
[Instructions and Commands] IAM Identity-based Policies1:04
Managing AWS IAM User Groups6:19
Create and manage AWS IAM user groups by attaching policies, assigning users, and validating inherited permissions across admin and support groups.
[Instructions and Commands] Managing IAM Groups0:50
Managing AWS IAM Roles9:38
Create and attach AWS IAM roles to EC2 instances with policies that grant inherited permissions. Validate access by testing S3 operations and updating the IAM role on running instances.
[Instructions and Commands] Managing IAM Roles0:46
Overview of AWS IAM Custom Policies9:00
Master crafting AWS IAM custom policies to grant precise S3 bucket permissions using JSON, import managed policies, attach to groups, and validate access with real-world demos.
[Instructions and Commands] Overview of Custom Policies0:53
Managing AWS IAM Identities using AWS CLI8:56
Manage AWS IAM identities using the AWS CLI to list, view, and manage users, groups, and policies, create users, assign them to groups, and remove or delete them.
[Instructions and Commands] Managing IAM using AWS CLI0:39
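
The custom-policy lectures in this section define effect, action, and resource in JSON to grant precise S3 bucket permissions. A minimal sketch of such a policy document follows. The function name and bucket name are hypothetical, and the chosen actions are one reasonable read/write set, not necessarily the set used in the course.

```python
import json


def s3_bucket_policy_document(
        bucket_name,
        object_actions=("s3:GetObject", "s3:PutObject", "s3:DeleteObject")):
    """Build an identity-based policy scoped to a single S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": list(object_actions),
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# Hypothetical bucket name; the JSON can be pasted into the IAM policy editor.
print(json.dumps(s3_bucket_policy_document("example-demo-bucket"), indent=2))
```

Bucket-level and object-level actions need different resource ARNs, which is why the document has two statements.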

Introduction to Integrating AWS s3 and Glue Catalog with Databricks1:22
Explore how to integrate AWS S3 and Glue Catalog with the Databricks platform, configure groups and roles, and enable notebooks and jobs to access essential services securely.
Create AWS IAM Group for Databricks Developers1:22
Create an AWS IAM group for Databricks developers, assign permissions and policies, and prepare for users to inherit those permissions to enable integration of AWS services with Databricks.
Create AWS IAM Users and adding to group2:24
Create and manage AWS IAM users and groups for Databricks developers, assign admin and programmatic access, set custom passwords, and prepare permissions for Databricks integration.
Create required AWS s3 Bucket for Databricks Developers5:25
Create and configure an AWS S3 bucket for Databricks developers, naming it the ITV DB demo bucket, selecting a region, and setting permissions via groups and policies.
Grant Permissions on AWS s3 Bucket to the users in group via AWS IAM Inline P8:46
Grant full access to an S3 bucket by creating an AWS IAM inline policy, attaching it to the group, and verifying access by listing and reading bucket contents.
Attach AWS IAM Policy to grant access to Glue to the users via IAM Policy4:19
Attach the AWS Glue full-access policy to the user group to grant permissions. Review policy scope to limit access and keep unrelated S3 buckets restricted.
Upload JSON Dataset to s3 to crawl using AWS Glue Crawler3:13
Upload a JSON dataset to an S3 bucket using the web console, then run an AWS Glue crawler to crawl the data and create databases and tables.
Overview of IAM roles for Glue Crawlers7:19
Explore how IAM roles empower Glue crawlers to access S3 data and create tables while managing permissions.
Create AWS IAM Custom Service Role for Glue Crawlers6:52
Create an AWS IAM service role for Glue crawlers and assign the required S3 bucket permissions on ITV DB demo so crawlers can crawl files and create Glue tables.
Create Glue Crawler to Create Multiple Glue Catalog Tables6:03
Create and run an AWS Glue crawler to scan a retail S3 folder, then automatically create Glue catalog database and tables for each folder.
Overview of Integration of Databricks Clusters and AWS EC2 Instances3:40
Learn how to provision and configure Databricks clusters on AWS to access S3 and Glue, including instance profiles, two-instance setups, and Spark-based data processing.
Create AWS IAM Role or Instance Profile2:11
Create an AWS IAM role and an instance profile, attach policies, and enable a Databricks cluster to access S3 buckets and Glue tables.
Registering AWS IAM Instance Profile with Databricks Account1:43
Register an AWS IAM instance profile with a Databricks account, add the profile via the admin console, and attach it to a cluster so the cluster inherits the permissions.
Attach AWS IAM Instance Profile to new as well as existing Databricks Cluster4:59
Attach an AWS IAM instance profile to a new or running Databricks cluster, restart it, and grant S3 and Glue permissions to access S3 buckets and Glue catalog tables.
Grant Permissions on s3 to Databricks Clusters using AWS IAM Policy and Insta4:36
Grant S3 access to Databricks clusters via AWS IAM policy and instance profiles for the ITV DB demo bucket, enabling Databricks notebooks to read and write. Validate Glue catalog access.
Grant AWS IAM Glue Service Role to Databricks Clusters via Instance Profiles4:06
Grant an AWS IAM Glue service role to Databricks clusters via instance profiles, attaching policies to access the Glue catalog and Glue databases.
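
This section uses two kinds of IAM roles: a service role that Glue crawlers assume, and a role behind the instance profile that the EC2 instances of a Databricks cluster assume. What differs between them is the trust policy. The sketch below builds both, assuming the standard AWS service principals; the function name is hypothetical, and the permissions attached to each role are not shown.

```python
import json


def trust_policy(service_principal):
    """Build a trust policy that lets one AWS service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Role assumed by Glue crawlers.
glue_crawler_trust = trust_policy("glue.amazonaws.com")
# Role wrapped in an instance profile and assumed by EC2 instances.
instance_profile_trust = trust_policy("ec2.amazonaws.com")

print(json.dumps(glue_crawler_trust, indent=2))
print(json.dumps(instance_profile_trust, indent=2))
```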

Requirements

Programming experience using Python
Data Engineering experience using Spark
Ability to write and interpret SQL Queries
This course is ideal for experienced data engineers who want to add Databricks as a key skill to their profile

Description

As part of this course, you will learn Data Engineering using a cloud platform-agnostic technology called Databricks.

About Data Engineering

Data Engineering is the processing of data according to downstream needs. As part of Data Engineering, we build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering. Conventionally, these roles are known as ETL Development, Data Warehouse Development, etc.

About Databricks

Databricks is one of the most popular cloud platform-agnostic data engineering tech stacks. Its engineers are committers on the Apache Spark project. The Databricks Runtime provides Spark while leveraging the elasticity of the cloud. With Databricks, you pay for what you use. Over time, Databricks introduced the idea of the Lakehouse, which provides the features required for traditional BI as well as AI & ML. Here are some of the core features of Databricks.

Spark - Distributed Computing
Delta Lake - Perform CRUD Operations. It is primarily used to build capabilities such as inserting, updating, and deleting the data from files in Data Lake.
cloudFiles - Get the files in an incremental fashion in the most efficient way leveraging cloud features.
Databricks SQL - A Photon-based interface that is fine-tuned for running queries submitted for reporting and visualization by reporting tools. It is also used for Ad-hoc Analysis.
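
To make the cloudFiles feature listed above more concrete, here is a minimal sketch of the reader options that select between the two Auto Loader file discovery modes, directory listing and file notifications. The function name and sample paths are hypothetical. The Spark call appears only as a comment because the cloudFiles source is available only on Databricks clusters.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build reader options for the Auto Loader cloudFiles source."""
    return {
        # Format of the incoming files, for example json or csv.
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # false selects directory listing, true selects file notifications.
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


# Hypothetical paths.
listing_mode = auto_loader_options("json", "/mnt/demo/_schemas")
notification_mode = auto_loader_options(
    "json", "/mnt/demo/_schemas", use_notifications=True)

# On a Databricks cluster the options would feed a streaming read, roughly:
#   df = (spark.readStream.format("cloudFiles")
#         .options(**listing_mode)
#         .load("/mnt/demo/landing"))
print(listing_mode)
print(notification_mode)
```

Directory listing needs no extra cloud setup, while file notifications rely on cloud notification and queue services, a trade-off the course covers when it compares the two modes.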

Course Details

As part of this course, you will be learning Data Engineering using Databricks.

Getting Started with Databricks
Setup Local Development Environment to develop Data Engineering Applications using Databricks
Using Databricks CLI to manage files, jobs, clusters, etc related to Data Engineering Applications
Spark Application Development Cycle to build Data Engineering Applications
Databricks Jobs and Clusters
Deploy and Run Data Engineering Jobs on Databricks Job Clusters as Python Application
Deploy and Run Data Engineering Jobs on Databricks Job Clusters using Notebooks
Deep Dive into Delta Lake using Dataframes on Databricks Platform
Deep Dive into Delta Lake using Spark SQL on Databricks Platform
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
Overview of Databricks SQL for Data Analysis and reporting.

We will be adding a few more modules related to PySpark, Spark with Scala, Spark SQL, and Streaming Pipelines in the coming weeks.

Desired Audience

Here is the desired audience for this advanced course.

Experienced application developers with prior knowledge and experience of Spark who want to gain expertise in Data Engineering.
Experienced Data Engineers to gain enough skills to add Databricks to their profile.
Testers to improve their testing capabilities related to Data Engineering applications using Databricks.

Prerequisites

Logistics
- Computer with decent configuration (At least 4 GB RAM, however 8 GB is highly desired)
- Dual Core is required and Quad-Core is highly desired
- Chrome Browser
- High-Speed Internet
- Valid AWS Account
- Valid Databricks Account (free Databricks Account is not sufficient)
Experience as Data Engineer especially using Apache Spark
Knowledge about some of the cloud concepts such as storage, users, roles, etc.

Associated Costs

As part of the training, you will only get the material. You need to practice on your own or corporate cloud account and Databricks Account.

You need to take care of the associated AWS or Azure costs.
You need to take care of the associated Databricks costs.

Training Approach

Here are the details related to the training approach.

It is self-paced with reference material, code snippets, and videos provided as part of Udemy.
One needs to sign up for their own Databricks environment to practice all the core features of Databricks.
We would recommend completing 2 modules every week by spending 4 to 5 hours per week.
We highly recommend completing all the tasks so that you gain real hands-on experience with Databricks.
Support will be provided through Udemy Q&A.

Here is the detailed course outline.

Getting Started with Databricks on Azure

As part of this section, we will go through the details of signing up for Azure and setting up the Databricks cluster on Azure.

Getting Started with Databricks on Azure
Signup for the Azure Account
Login and Increase Quotas for regional vCPUs in Azure
Create Azure Databricks Workspace
Launching Azure Databricks Workspace or Cluster
Quick Walkthrough of Azure Databricks UI
Create Azure Databricks Single Node Cluster
Upload Data using Azure Databricks UI
Overview of Creating Notebook and Validating Files using Azure Databricks
Develop Spark Application using Azure Databricks Notebook
Validate Spark Jobs using Azure Databricks Notebook
Export and Import of Azure Databricks Notebooks
Terminating Azure Databricks Cluster and Deleting Configuration
Delete Azure Databricks Workspace by deleting Resource Group

Azure Essentials for Databricks - Azure CLI

As part of this section, we will go through the details about setting up Azure CLI to manage Azure resources using relevant commands.

Azure Essentials for Databricks - Azure CLI
Azure CLI using Azure Portal Cloud Shell
Getting Started with Azure CLI on Mac
Getting Started with Azure CLI on Windows
Warming up with Azure CLI - Overview
Create Resource Group using Azure CLI
Create ADLS Storage Account within Resource Group
Add Container as part of Storage Account
Overview of Uploading the data into ADLS File System or Container
Setup Data Set locally to upload into ADLS File System or Container
Upload local directory into Azure ADLS File System or Container
Delete Azure ADLS Storage Account using Azure CLI
Delete Azure Resource Group using Azure CLI

Mount ADLS on to Azure Databricks to access files from Azure Blob Storage

As part of this section, we will go through the details related to mounting Azure Data Lake Storage (ADLS) on to Azure Databricks Clusters.

Mount ADLS on to Azure Databricks - Introduction
Ensure Azure Databricks Workspace
Setup Databricks CLI on Mac or Windows using Python Virtual Environment
Configure Databricks CLI for new Azure Databricks Workspace
Register an Azure Active Directory Application
Create Databricks Secret for AD Application Client Secret
Create ADLS Storage Account
Assign IAM Role on Storage Account to Azure AD Application
Setup Retail DB Dataset
Create ADLS Container or File System and Upload Data
Start Databricks Cluster to mount ADLS
Mount ADLS Storage Account on to Azure Databricks
Validate ADLS Mount Point on Azure Databricks Clusters
Unmount the mount point from Databricks
Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks
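
The mount and unmount steps listed above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the course's own code: the helper names `build_adls_oauth_configs` and `mount_adls` are hypothetical, and the container, storage account, and mount point values are placeholders you supply. The configuration keys are the standard ABFS OAuth settings for an Azure AD application. `dbutils` is only available inside a Databricks notebook or cluster, so it is passed in as a parameter.

```python
def build_adls_oauth_configs(client_id, client_secret, tenant_id):
    """Return the ABFS OAuth settings for an Azure AD application.

    All three arguments are values from your own AD application
    registration; nothing here is specific to the course material.
    """
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_adls(dbutils, container, storage_account, mount_point, configs):
    """Mount an ADLS container; only works on a Databricks cluster."""
    dbutils.fs.mount(
        source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
```

In a notebook, the client secret would typically be read with `dbutils.secrets.get(scope=..., key=...)` rather than hard-coded, and the mount can later be removed with `dbutils.fs.unmount(mount_point)`.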

Setup Local Development Environment for Databricks

As part of this section, we will go through the details of setting up a local development environment for Databricks using tools such as PyCharm, Databricks Connect (dbconnect), Databricks dbutils, etc.

Setup Single Node Databricks Cluster
Install Databricks Connect
Configure Databricks Connect
Integrating PyCharm with Databricks Connect
Integrate Databricks Cluster with Glue Catalog
Setup AWS s3 Bucket and Grant Permissions
Mounting s3 Buckets into Databricks Clusters
Using Databricks dbutils from IDEs such as PyCharm

Using Databricks CLI

As part of this section, we will get an overview of the Databricks CLI and use it to interact with the Databricks File System (DBFS).

Introduction to Databricks CLI
Install and Configure Databricks CLI
Interacting with Databricks File System using Databricks CLI
Getting Databricks Cluster Details using Databricks CLI

Databricks Jobs and Clusters

As part of this section, we will go through the details related to Databricks Jobs and Clusters.

Introduction to Databricks Jobs and Clusters
Creating Pools in Databricks Platform
Create Cluster on Azure Databricks
Request to Increase CPU Quota on Azure
Creating Job on Databricks
Submitting Jobs using Databricks Job Cluster
Create Pool in Databricks
Running Job using Interactive Databricks Cluster Attached to Pool
Running Job Using Databricks Job Cluster Attached to Pool
Exercise - Submit the application as a job using Databricks interactive cluster

Deploy and Run Spark Applications on Databricks

As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications.

Prepare PyCharm for Databricks
Prepare Data Sets
Move files to ghactivity
Refactor Code for Databricks
Validating Data using Databricks
Setup Data Set for Production Deployment
Access File Metadata using Databricks dbutils
Build Deployable bundle for Databricks
Running Jobs using Databricks Web UI
Get Job and Run Details using Databricks CLI
Submitting Databricks Jobs using CLI
Setup and Validate Databricks Client Library
Resetting the Job using Databricks Jobs API
Run Databricks Job programmatically using Python
Detailed Validation of Data using Databricks Notebooks
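
Running a job programmatically can be sketched as below. This is an illustrative sketch only: it calls the Jobs REST endpoint directly with the Python standard library rather than the client library set up in the lectures, and it assumes version 2.1 of the Jobs API. The function names, workspace URL, token, and job id are hypothetical placeholders.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id):
    """Build (but do not send) a Jobs API run-now request.

    host is the workspace URL, token a personal access token,
    job_id the id of an existing job; all are supplied by the caller.
    """
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_job(host, token, job_id):
    """Send the request and return the parsed JSON response."""
    request = build_run_now_request(host, token, job_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload without touching a real workspace.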

Deploy and Run Spark Jobs using Notebooks

As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications using Databricks Notebooks.

Modularizing Databricks Notebooks
Running Job using Databricks Notebook
Refactor application as Databricks Notebooks
Run Notebook using Databricks Development Cluster

Deep Dive into Delta Lake using Spark Data Frames on Databricks

As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark Data Frames.

Introduction to Delta Lake using Spark Data Frames on Databricks
Creating Spark Data Frames for Delta Lake on Databricks
Writing Spark Data Frame using Delta Format on Databricks
Updating Existing Data using Delta Format on Databricks
Delete Existing Data using Delta Format on Databricks
Merge or Upsert Data using Delta Format on Databricks
Deleting using Merge in Delta Lake on Databricks
Point in Snapshot Recovery using Delta Logs on Databricks
Deleting unnecessary Delta Files using Vacuum on Databricks
Compaction of Delta Lake Files on Databricks

Deep Dive into Delta Lake using Spark SQL on Databricks

As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark SQL.

Introduction to Delta Lake using Spark SQL on Databricks
Create Delta Lake Table using Spark SQL on Databricks
Insert Data to Delta Lake Table using Spark SQL on Databricks
Update Data in Delta Lake Table using Spark SQL on Databricks
Delete Data from Delta Lake Table using Spark SQL on Databricks
Merge or Upsert Data into Delta Lake Table using Spark SQL on Databricks
Using Merge Function over Delta Lake Table using Spark SQL on Databricks
Point in Snapshot Recovery using Delta Lake Table using Spark SQL on Databricks
Vacuuming Delta Lake Tables using Spark SQL on Databricks
Compaction of Delta Lake Tables using Spark SQL on Databricks
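
The merge (upsert) step above can be illustrated with a small Python helper that builds a Delta Lake `MERGE INTO` statement to pass to `spark.sql`. This is a sketch under assumptions: the helper name `build_merge_sql` and the table and column names in the usage note are hypothetical and not taken from the course material.

```python
def build_merge_sql(target, source, key, columns):
    """Build an upsert statement for a Delta table.

    Rows in source that match target on key update the listed columns;
    rows without a match are inserted.
    """
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns)
    all_columns = [key, *columns]
    insert_columns = ", ".join(all_columns)
    insert_values = ", ".join(f"s.{c}" for c in all_columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_columns}) "
        f"VALUES ({insert_values})"
    )
```

On a Databricks cluster the result could be run as `spark.sql(build_merge_sql("orders", "orders_stage", "order_id", ["order_status"]))`. The helper does no quoting or validation, so it is only suitable for trusted identifiers.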

Accessing Databricks Cluster Terminal via Web as well as SSH

As part of this section, we will see how to access the terminal of a Databricks Cluster via Web as well as SSH.

Enable Web Terminal in Databricks Admin Console
Launch Web Terminal for Databricks Cluster
Setup SSH for the Databricks Cluster Driver Node
Validate SSH Connectivity to the Databricks Driver Node on AWS
Limitations of SSH and comparison with Web Terminal related to Databricks Clusters

Installing Software on Databricks Clusters using init scripts

As part of this section, we will see how to bootstrap Databricks clusters by installing relevant third-party libraries for our applications.

Setup gen_logs on Databricks Cluster
Overview of Init Scripts for Databricks Clusters
Create Script to install software from git on Databricks Cluster
Copy init script to dbfs location
Create Databricks Standalone Cluster with init script

Quick Recap of Spark Structured Streaming

As part of this section, we will get a quick recap of Spark Structured Streaming.

Validate Netcat on Databricks Driver Node
Push log messages to Netcat Webserver on Databricks Driver Node
Reading Web Server logs using Spark Structured Streaming
Writing Streaming Data to Files

Incremental Loads using Spark Structured Streaming on Databricks

As part of this section, we will understand how to perform incremental loads using Spark Structured Streaming on Databricks.

Overview of Spark Structured Streaming
Steps for Incremental Data Processing on Databricks
Configure Databricks Cluster with Instance Profile
Upload GHArchive Files to AWS s3 using Databricks Notebooks
Read JSON Data using Spark Structured Streaming on Databricks
Write using Delta file format using Trigger Once on Databricks
Analyze GHArchive Data in Delta files using Spark on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Validate Incremental Load on Databricks
Internals of Spark Structured Streaming File Processing on Databricks
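
The read, write with Trigger Once, and incremental load steps above can be sketched as a single PySpark function. This is a minimal sketch, assuming a JSON source directory and a Delta target; the function name `load_incrementally` and all paths are hypothetical, and `spark` is the session that a Databricks notebook provides.

```python
def load_incrementally(spark, schema, source_dir, target_dir, checkpoint_dir):
    """Process files not yet recorded in the checkpoint, then stop.

    The checkpoint directory is what makes the load incremental:
    re-running the function picks up only files added since the last run.
    """
    stream = spark.readStream.schema(schema).json(source_dir)
    query = (
        stream.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_dir)
        .trigger(once=True)  # run one batch over the new files and exit
        .start(target_dir)
    )
    query.awaitTermination()
    return query
```

Calling the function again after new files land in `source_dir` appends only the new data to the target, provided the same `checkpoint_dir` is reused.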

Incremental Loads using Auto Loader cloudFiles on Databricks

As part of this section, we will see how to perform incremental loads using Auto Loader cloudFiles on Databricks Clusters.

Overview of Auto Loader cloudFiles on Databricks
Upload GHArchive Files to s3 on Databricks
Write Data using Auto Loader cloudFiles on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Add New GHActivity JSON files on Databricks
Overview of Handling S3 Events using AWS Services on Databricks
Configure IAM Role for cloudFiles file notifications on Databricks
Incremental Load using cloudFiles File Notifications on Databricks
Review AWS Services for cloudFiles Event Notifications on Databricks
Review Metadata Generated for cloudFiles Checkpointing on Databricks
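
The two file discovery modes covered above differ mainly in one reader option, which can be sketched as follows. This is an illustrative sketch, not the course's code: the helper names `build_autoloader_options` and `read_with_autoloader` and the schema location path are hypothetical, JSON input is assumed, and the `cloudFiles` source is only available on Databricks clusters.

```python
def build_autoloader_options(schema_location, use_notifications=False):
    """Return reader options for the cloudFiles source.

    Directory listing is the default discovery mode; setting
    cloudFiles.useNotifications switches to file notifications.
    """
    options = {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
    }
    if use_notifications:
        options["cloudFiles.useNotifications"] = "true"
    return options


def read_with_autoloader(spark, source_dir, options):
    """Create a streaming DataFrame over source_dir using Auto Loader."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_dir)
    )
```

The returned streaming DataFrame can then be written to a Delta target with a checkpoint location, in the same way as a regular Structured Streaming file source.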

Overview of Databricks SQL Clusters

As part of this section, we will get an overview of Databricks SQL Clusters.

Overview of Databricks SQL Platform - Introduction
Run First Query using SQL Editor of Databricks SQL
Overview of Dashboards using Databricks SQL
Overview of Databricks SQL Data Explorer to review Metastore Databases and Tables
Use Databricks SQL Editor to develop scripts or queries
Review Metadata of Tables using Databricks SQL Platform
Overview of loading data into retail_db tables
Configure Databricks CLI to push data into the Databricks Platform
Copy JSON Data into DBFS using Databricks CLI
Analyze JSON Data using Spark APIs
Analyze Delta Table Schemas using Spark APIs
Load Data from Spark Data Frames into Delta Tables
Run Adhoc Queries using Databricks SQL Editor to validate data
Overview of External Tables using Databricks SQL
Using COPY Command to Copy Data into Delta Tables
Manage Databricks SQL Endpoints

Who this course is for:

Beginner or Intermediate Data Engineers who want to learn Databricks for Data Engineering
Intermediate Application Engineers who want to explore Data Engineering using Databricks
Data and Analytics Engineers who want to learn Data Engineering using Databricks
Testers who want to learn Databricks to test Data Engineering applications built using Databricks


Course content

Introduction to Data Engineering using Databricks: 2 lectures • 12min

Getting Started with Databricks on Azure: 14 lectures • 44min

Azure Essentials for Databricks - Azure CLI: 13 lectures • 47min

Mount ADLS on to Azure Databricks to access files from Azure Blob Storage: 16 lectures • 48min

Getting Started with Databricks on AWS: 14 lectures • 1hr 10min

AWS Essentials for Databricks - Setup Local Development Environment on Windows: 11 lectures • 53min

AWS Essentials for Databricks - Setup Local Development Environment on Mac: 7 lectures • 30min

AWS Essentials for Databricks - Overview of AWS Storage Solutions: 19 lectures • 1hr 9min

AWS Essentials for Databricks - Overview of AWS s3 and IAM Roles for Databricks: 17 lectures • 1hr 2min

AWS Essentials for Databricks - Integrating AWS s3 and Glue Catalog: 16 lectures • 1hr 8min
