
Explore key features of Databricks on AWS and Azure, set up development environments, use notebooks and the CLI, and master Spark, Delta Lake, and the Glue Catalog for data engineering.
Learn to start with Databricks on Azure: sign up, set quotas, create a multi-node cluster and workspace, upload data from Azure Blob, run notebooks, and clean up resources.
Sign up for an Azure account and log in to portal.azure.com to start using Databricks on Azure, with a $200 credit for one month and quotas to explore.
Log in to portal.azure.com and increase your regional vCPU quotas to support Databricks on Azure, starting with 10–20 vCPUs and using a single-node cluster before scaling to multi-node setups.
Learn to create an Azure Databricks workspace by selecting a region with quota (East US), creating a resource group, naming the workspace, and choosing standard pricing with no public IP.
Begin by signing into portal.azure.com, locate the Azure Databricks dashboard, and launch the workspace. Navigate the Databricks platform to manage clusters and submit data jobs using Azure AD.
Explore the Azure Databricks UI, including workspace navigation, notebooks, repos, clusters, jobs, and settings, plus token and Git integrations to develop, run, and monitor data workflows.
Create an Azure Databricks single-node cluster and explore core concepts by configuring a minimal setup with the 9.1 LTS runtime, idle termination after 15 minutes, and basic Spark configurations.
Upload data through the Azure Databricks UI using the import and export data option: drag and drop files into a retail_db folder under FileStore. Folders may not upload recursively, so use the CLI for folder uploads.
Create a Python notebook in Azure Databricks, switch to Scala, SQL, or R where needed, navigate FileStore and DBFS, use % magic commands, and apply Spark APIs to process and validate data.
Develop and run a Spark-based data processing app in Azure Databricks notebooks, reading CSV data from Azure Blob Storage via DBFS, then grouping by date and counting orders.
Validate Spark jobs by writing the order count by date as CSV to the target location with the DataFrame's write.csv, then read it back with spark.read.csv to verify 364 records.
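The write-then-read-back validation described above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the course's notebook code: the function name, the orders column layout, and both paths are hypothetical, and `spark` is the session object that a Databricks notebook provides.

```python
# Hypothetical helper; `spark` is the SparkSession available in a Databricks notebook.
# The column layout below is an assumption about the retail_db orders files.
ORDERS_SCHEMA = (
    "order_id INT, order_date STRING, order_customer_id INT, order_status STRING"
)


def write_and_validate_order_counts(spark, source_path, target_path):
    """Count orders per date, write the result as CSV, and re-read it to validate."""
    orders = spark.read.csv(source_path, schema=ORDERS_SCHEMA)
    order_count_by_date = orders.groupBy("order_date").count()

    # Overwrite so the cell can be re-run without failing on an existing folder.
    order_count_by_date.write.mode("overwrite").csv(target_path)

    # Read the written files back and return how many records they contain.
    written = spark.read.csv(
        target_path, schema="order_date STRING, order_count BIGINT"
    )
    return written.count()
```

In a notebook, the returned number can be compared with the record count the lecture expects for the dataset being used.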
Master exporting and importing Azure Databricks notebooks, backing up workspaces and folders as archive or IPython notebooks, then reimporting as dbc files and attaching to running clusters.
Learn how to back up notebooks, terminate the Azure Databricks cluster, and delete the configuration, then clean up the workspace in the Azure portal.
Learn to clean up an Azure Databricks workspace by deleting the resource group, and when to delete the workspace alone to avoid removing other resources.
Learn to use Azure CLI to interact with Azure resources, create resource groups and storage accounts, manage containers, upload data for Databricks on Azure, on Mac or Windows.
Learn to use Azure CLI in Cloud Shell on the Azure Portal to list resource groups, view subcommands with -h, and filter results with query.
Install the Azure CLI on macOS using Homebrew, log in to your account, and list resource groups, then retrieve just their names with a --query expression for streamlined management.
Install the Azure CLI on Windows with the MSI installer and validate in PowerShell. Then log in and run az group list with --query to show resource group names.
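To make the effect of filtering with --query concrete: an expression such as `az group list --query "[].name"` (assumed here to be the kind of expression used in the lectures) keeps only the name of each resource group. The sketch below does the same thing in plain Python on the CLI's JSON output. The function name and the sample values are hypothetical.

```python
import json


def resource_group_names(az_group_list_json):
    """Return the name of every resource group in `az group list` JSON output."""
    groups = json.loads(az_group_list_json)
    return [group["name"] for group in groups]


# Made-up sample shaped like the CLI's JSON output (real output has more fields).
sample = (
    '[{"name": "demo-rg", "location": "eastus"},'
    ' {"name": "other-rg", "location": "westus"}]'
)
print(resource_group_names(sample))
```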
Begin warming up with Azure CLI by creating a resource group and a storage account, then containers, uploading a directory, validating results, and deleting the storage account and resource group.
Install the Azure CLI on your platform, then create a resource group with az group create using location and name, and verify it with group list and query.
Learn to manage storage accounts within a resource group using Azure CLI, including creating, listing, and deleting storage accounts with practical command examples.
Use az storage fs to manage file systems in Azure Data Lake Storage Gen2, listing and creating containers in an ITV demo storage account, and note that containers resemble AWS S3 buckets.
Upload data from your local file system into the data container of an Azure storage account using the az storage fs directory command.
Prepare the local retail_db dataset. Clone its GitHub repository and upload the folder to an Azure storage account file system or container via Azure CLI.
Upload a local directory to an Azure storage account container or file system using az storage fs directory upload, recursively transferring the retail_db dataset and validating via listing commands.
Learn to delete an Azure ADLS storage account using Azure CLI, including listing accounts by resource group and confirming deletion, then delete the resource group.
Delete an Azure resource group with Azure CLI and observe that all resources in it, including storage accounts, are removed; recreating a group with the same name does not bring those resources back.
Mount ADLS onto Azure Databricks to access containers through DBFS without supplying credentials on every read. Integrate an Azure AD app with storage and mount the container to Databricks workspaces.
Create a new Azure Databricks workspace in the Azure portal, configure a resource group and the East US region using standard pricing, launch the Databricks UI, and prepare for storage mounting.
Install the Databricks CLI in a Python virtual environment on Mac or Windows and set it up with token-based authentication, in preparation for mounting ADLS onto the Azure Databricks workspace and clusters.
Configure the Databricks CLI to connect an Azure Databricks workspace by generating a token, setting the host, and using a dedicated profile for Databricks file system commands.
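A dedicated CLI profile ends up as a named section holding a host and a token in the CLI's configuration file (`~/.databrickscfg` for the legacy Databricks CLI, typically written by `databricks configure --token --profile <name>`). The sketch below only renders such a section so its shape is visible. The function name, profile name, and placeholder values are hypothetical.

```python
import configparser
import io


def render_databrickscfg(profile, host, token):
    """Render one profile section in the INI layout used by ~/.databrickscfg."""
    config = configparser.ConfigParser(interpolation=None)
    config[profile] = {"host": host, "token": token}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholders only; never commit a real token to source control.
print(
    render_databrickscfg(
        "itvdemo",
        "https://<workspace-host>",
        "<personal-access-token>",
    )
)
```

With a profile in place, file system commands can select it explicitly, for example `databricks fs ls dbfs:/ --profile itvdemo`.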
Register an Azure Active Directory application to connect a storage account with the Databricks workspace, and capture the client secret, application ID, and directory tenant ID.
Create a secret scope and store the Azure AD app client secret in Databricks secrets to enable mounting storage into the Azure Databricks workspace using the Databricks CLI.
Use Azure CLI to validate the resource group, create an ADLS storage account named ITV DB Demo in East US, and verify the provisioning state for integration with Azure Databricks.
Assign the Storage Blob Data Contributor role to the Azure AD application in the storage account's access control, completing the role assignment for mounting to Azure Databricks.
Set up the retail_db dataset for mounting on Azure Databricks by cloning the retail_db repo from GitHub, removing the .git folder, and uploading the folder to a storage container.
Create the data container (file system) in the ITV DB demo storage account, then upload the local data directory recursively and verify via az storage fs directory list.
Mount the storage account onto the Databricks workspace by provisioning a single-node Azure Databricks cluster, then access mounted files with DBFS commands even after cluster termination.
Mount an Azure Data Lake Storage account to Azure Databricks via a Python notebook and a configs dict, providing app and directory IDs and client secret, then validate the mount.
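The configs dict mentioned above usually follows the OAuth (service principal) pattern for ADLS Gen2. The sketch below builds that dict and the `abfss://` source URI in plain Python. The helper names and all placeholder values are hypothetical, and the `dbutils` calls are shown only as comments because `dbutils` exists only inside a Databricks notebook.

```python
def build_adls_mount_configs(application_id, directory_id, client_secret):
    """Build the extra_configs dict for an OAuth (service principal) ADLS mount."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": (
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
        ),
        "fs.azure.account.oauth2.client.id": application_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": (
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token"
        ),
    }


def adls_source_uri(container, storage_account):
    """Build the abfss:// URI of a container (file system) in a storage account."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"


# Inside a Databricks notebook (placeholders must be replaced with real values):
# secret = dbutils.secrets.get(scope="<scope-name>", key="<secret-key-name>")
# configs = build_adls_mount_configs("<application-id>", "<directory-id>", secret)
# dbutils.fs.mount(
#     source=adls_source_uri("<container>", "<storage-account>"),
#     mount_point="/mnt/<mount-name>",
#     extra_configs=configs,
# )
# display(dbutils.fs.ls("/mnt/<mount-name>"))  # validate the mount
```

Reading the client secret from a secret scope, rather than pasting it into the notebook, keeps the credential out of exported notebooks.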
Validate the ADLS storage container mounted onto Azure Databricks by reading orders data with a defined schema and writing JSON output to a mounted path for cloud-agnostic pipelines.
Unmount the mount point from Databricks and validate removal from /mnt using dbutils.fs.unmount, ensuring the data disappears and resources are cleaned up to avoid charges.
Remove the Azure resource group after unmounting and stopping the cluster to delete the Databricks workspace and storage accounts, ensuring no charges remain.
Learn to get started with Databricks on AWS, explore how Databricks sits on a cloud data platform, and upload data using notebooks to build a basic solution.
Sign up for an AWS account by visiting aws.amazon.com, creating and verifying the account, and adding a credit card; then log in to the console to start using AWS services, or use an enterprise login if available.
Log into the AWS management console using your email, password, and two-factor authentication, then use the global search to locate services like EMR and view billing.
Learn to sign up for Databricks on AWS using QuickStart, choose a cloud provider, review plans, and set up a workspace to manage the Databricks environment.
Learn how to access a Databricks workspace on AWS by navigating to the login page, entering your email, and using bookmarks to streamline authentication and reach the ITV demo workspace.
Learn how to clean up Databricks workspaces by viewing details, updating settings, and deleting resources to prevent unexpected charges after your data engineering practice on AWS and Azure.
Explore a quick walkthrough of the Databricks UI on AWS, covering the sidebar, workspace, notebooks, tables, clusters, and data preview from DBFS and S3.
Learn to set up a single-node Databricks cluster on AWS, choosing the latest runtime, configuring inactivity timeout, and selecting a suitable node type for learning and cost control.
Upload data with the AWS Databricks UI by creating a folder structure in FileStore and uploading files into six folders (categories, customers, departments, order_items, orders, products).
Create your first Databricks notebook on AWS, attach it to a cluster, use Python to explore the file system, and run a simple join between orders and order items.
Learn to build a Spark application in an AWS Databricks notebook, read data into dataframes, join orders with order items, and calculate daily revenue from complete or closed orders.
Review the AWS Databricks cluster state and restart it by starting the cluster from the notebook, then write the daily revenue DataFrame to the target path using the specified file format.
Persist processing results to DBFS to guard against ephemeral clusters. Write the daily revenue DataFrame to CSV with a header and validate by listing the folder and re-reading the data with Spark.
Export and import AWS Databricks notebooks across accounts by exporting to a local file, then importing into another workspace; learn path adjustments, formats, and cluster attachment to run notebooks.
Set up a local Windows environment using Windows Subsystem for Linux to run AWS CLI and Boto3, enabling Databricks workflows on AWS and Azure.
Explore PowerShell on Windows 10 and 11, compare it with the DOS prompt, learn how to launch, basics, customization, and key features like remote machine access and file navigation.
Enable the Windows Subsystem for Linux and install an Ubuntu distribution using PowerShell. Reboot to complete the WSL setup on Windows 10 or 11.
Set up an Ubuntu-based virtual machine on Windows 10 or 11 with WSL, log in, and connect via PowerShell to explore the VM’s home directory.
Set up a Python development environment on Ubuntu by installing Python 3 and pip, creating a virtual environment, updating apt, and validating the setup for data engineering tasks.
Set up a local development environment by installing Python components with pip on an Ubuntu-based virtual machine on Windows using WSL, and verify the tool installation and command usage.
Create an AWS IAM user with programmatic access, download the access key and secret key, and assign administrator access to enable configuration and credential validation.
Configure the AWS CLI on Windows by setting up credentials and config files, managing profiles, and validating access with S3 bucket listings and secure cleanup.
Create and activate a Python virtual environment, install required libraries, and validate Python can interact with your AWS account, using Linux commands to manage a venv project folder.
Set up a Python virtual environment and pip install boto3 to let Python interact with the services. Validate the installation, create an S3 client, and list buckets.
Install JupyterLab, launch the notebook server, and validate boto3 by creating an S3 client to list buckets. Configure credentials with a default profile to enable seamless AWS access.
Set up a Mac-based local development environment for AWS using AWS CLI and Boto3 to interact with S3 buckets.
Install the AWS CLI on a Mac via Python 3.8's pip and verify the installation. Configure the AWS CLI by creating a user and downloading the access key to enable account interaction.
Create an AWS IAM user with programmatic access, attach administrator permissions, download credentials, and configure the AWS CLI to enable programmatic control of the account.
Configure the AWS CLI with IAM user credentials, create and switch profiles for multi-account access, set a default region, and validate credentials to interact with AWS services.
Set up and activate a Python 3.8 virtual environment on Mac, install required libraries with pip, and validate Python versions to ensure consistent behavior across environments.
Activate the Python virtual environment, install boto3 with pip, verify by importing boto3, and create an S3 client to list buckets for AWS access in a Jupyter or other development-environment workflow.
Set up a Jupyter Lab environment and validate boto3 to interact with AWS services using Python, including creating an S3 client, listing buckets, and handling credentials via the default profile.
Learn Amazon S3, a low-cost cloud storage service, to create buckets, set permissions, store objects from anywhere, and use Standard and Glacier storage with versioning and cross-region replication.
Set up a local data set by cloning the repository, organizing a local folder, and copying files into S3 buckets as objects, taking care to avoid path mistakes and typos.
Create a uniquely named S3 bucket in the AWS console using initials and hyphens, then organize and upload folders and files as objects in the bucket.
Enable bucket versioning in S3 and apply lifecycle rules to delete older versions for a prefix, reducing storage costs while preserving recoverability for disasters, and explore cross-region replication.
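A lifecycle rule that deletes older (noncurrent) versions under a prefix can be expressed as a small configuration document. The sketch below builds one in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The helper name, rule ID, prefix, and retention period are hypothetical, and the boto3 call is left as a comment because it needs credentials and a real bucket.

```python
def noncurrent_version_expiration_rule(prefix, days, rule_id="expire-old-versions"):
    """Build a lifecycle configuration that expires noncurrent object versions."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},  # rule applies only under this prefix
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


lifecycle = noncurrent_version_expiration_rule("retail_db/", 30)
print(lifecycle)

# With boto3 installed and credentials configured (bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="<bucket-name>", LifecycleConfiguration=lifecycle
# )
```

The current version of each object is untouched by this rule, so recent data stays recoverable while superseded versions stop accumulating storage cost.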
Configure cross-region replication for an S3 bucket pair: create a destination bucket in another region, enable versioning, and set a replication rule for high availability, noting that only new objects are replicated.
Explore cross-region replication for disaster recovery with AWS S3 by creating a destination bucket in a different region, configuring replication rules with a retail prefix, and enabling versioning.
Explore S3 storage classes: standard, infrequent access, glacier, and glacier deep archive, compare latency and costs, and manage storage with lifecycle and replication rules for cost-efficient backups.
Compare AWS S3 standard with glacier and glacier deep archive, highlighting cost savings, slower retrieval times, and backups, plus how to apply lifecycle rules to move objects.
Learn to manage AWS S3 with the S3 CLI: list objects, view details, recursive listings, and bucket creation or deletion with force options, plus cp and rm basics.
Master the AWS S3 CLI to manage objects in S3 buckets by listing contents recursively, deleting subfolders, and copying folders as objects with include and exclude options.
Master AWS S3 and IAM essentials for Databricks, including the Python SDK called Boto3, the S3 console and buckets, and IAM roles and policies.
Create and configure an AWS IAM user using the management console, assign administrator access or specific policies, and provision programmatic and console access with password, access key, and secret key.
Use the ITV Admin IAM user to log in to the AWS management console, retrieve credentials from the CSV file, and complete the first-login password change.
Validate programmatic access to an AWS IAM user by configuring credentials, updating the ITV Admin profile, and creating and listing S3 buckets to confirm access.
Explore AWS IAM identity-based policies, attaching multiple predefined or custom policies to users or groups, and defining effect, action, and resource, with hands-on S3 permission tests.
Create and manage AWS IAM user groups by attaching policies, assigning users, and validating inherited permissions across admin and support groups.
Create and attach AWS IAM roles to EC2 instances with policies that grant inherited permissions. Validate access by testing S3 operations and updating the IAM role on running instances.
Master crafting AWS IAM custom policies to grant precise S3 bucket permissions using JSON, import managed policies, attach to groups, and validate access with real-world demos.
Manage AWS IAM identities using the AWS CLI to list, view, and manage users, groups, and policies, create users, assign them to groups, and remove or delete them.
Explore how to integrate AWS S3 and Glue Catalog with the Databricks platform, configure groups and roles, and enable notebooks and jobs to access essential services securely.
Create an AWS Databricks access group for data developers, assign permissions and policies, and prepare for users to inherit those permissions to enable integration with Databricks.
Create and manage AWS IAM users and groups for Databricks developers, assign admin and programmatic access, set custom passwords, and prepare permissions for Databricks integration.
Create and configure an AWS S3 bucket for Databricks developers, naming it as the ITV DB demo bucket, selecting a region, and setting permissions via groups and policies.
Grant full access to an S3 bucket by creating an AWS IAM inline policy, attaching it to the group, and verifying access by listing and reading bucket contents.
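An inline policy granting full access to a single bucket typically names both the bucket itself and the objects inside it as resources. The sketch below builds such a policy document and prints it as JSON that could be pasted into the IAM console's policy editor. The function name and the placeholder bucket name are hypothetical.

```python
import json


def s3_full_access_policy(bucket_name):
    """Build an IAM policy document allowing all S3 actions on one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:*",
                # The bucket ARN covers bucket-level actions such as listing;
                # the /* ARN covers object-level actions such as get and put.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


print(json.dumps(s3_full_access_policy("<bucket-name>"), indent=2))
```

Scoping the resources to one bucket is what keeps unrelated buckets restricted even though the action list is `s3:*`.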
Attach the AWS Glue full-access policy to the user group to grant permissions. Review policy scope to limit access and keep unrelated S3 buckets restricted.
Upload a json dataset to an S3 bucket using the web console, then run an AWS Glue crawler to crawl the data and create databases and tables.
Explore how IAM roles empower Glue crawlers to access S3 data and create tables while managing permissions.
Create an AWS IAM service role for Glue crawlers and assign the required S3 bucket permissions on the ITV DB demo bucket so crawlers can crawl files and create Glue tables.
Create and run an AWS Glue crawler to scan a retail S3 folder, then automatically create Glue catalog database and tables for each folder.
Learn how to provision and configure Databricks clusters on AWS to access S3 and Glue, including instance profiles, two-instance setups, and Spark-based data processing.
Create an AWS IAM role and an instance profile, attach policies, and enable a Databricks cluster to access S3 buckets and Glue tables.
Register an AWS IAM instance profile with a Databricks account, add the profile via the admin console, and attach it to a cluster so the cluster inherits the permissions.
Attach an AWS IAM instance profile to a new or running Databricks cluster, restart it, and grant S3 and Glue permissions to access S3 buckets and Glue catalog tables.
Grant S3 access to Databricks clusters via AWS IAM policy and instance profiles for the ITV DB demo bucket, enabling Databricks notebooks to read and write. Validate Glue catalog access.
Grant an AWS IAM Glue service role to Databricks clusters via instance profiles, attaching policies to access the Glue catalog and Glue databases.
As part of this course, you will learn Data Engineering using Databricks, a cloud platform-agnostic technology.
About Data Engineering
Data Engineering is about processing data to meet downstream needs. It involves building different kinds of pipelines, such as batch pipelines and streaming pipelines. All roles related to data processing are consolidated under Data Engineering; conventionally, these roles were known as ETL Development, Data Warehouse Development, and so on.
About Databricks
Databricks is one of the most popular cloud platform-agnostic data engineering tech stacks. Its team includes committers of the Apache Spark project. The Databricks Runtime provides Spark while leveraging the elasticity of the cloud. With Databricks, you pay for what you use. Over time, Databricks introduced the Lakehouse concept, which provides the features required for traditional BI as well as AI and ML. Here are some of the core features of Databricks.
Spark - Distributed Computing
Delta Lake - Performs CRUD operations. It is primarily used to insert, update, and delete data in files in the Data Lake.
cloudFiles - Picks up new files incrementally and efficiently by leveraging cloud features.
Databricks SQL - A Photon-based interface that is fine-tuned for running queries submitted for reporting and visualization by reporting tools. It is also used for Ad-hoc Analysis.
Course Details
As part of this course, you will be learning Data Engineering using Databricks.
Getting Started with Databricks
Setup Local Development Environment to develop Data Engineering Applications using Databricks
Using Databricks CLI to manage files, jobs, clusters, etc related to Data Engineering Applications
Spark Application Development Cycle to build Data Engineering Applications
Databricks Jobs and Clusters
Deploy and Run Data Engineering Jobs on Databricks Job Clusters as Python Application
Deploy and Run Data Engineering Jobs on Databricks Job Clusters using Notebooks
Deep Dive into Delta Lake using Dataframes on Databricks Platform
Deep Dive into Delta Lake using Spark SQL on Databricks Platform
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of AutoLoader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
Overview of Databricks SQL for Data Analysis and reporting.
We will be adding a few more modules related to PySpark, Spark with Scala, Spark SQL, and Streaming Pipelines in the coming weeks.
Desired Audience
Here is the desired audience for this advanced course.
Experienced application developers with prior knowledge and experience of Spark who want to gain expertise in Data Engineering.
Experienced Data Engineers to gain enough skills to add Databricks to their profile.
Testers to improve their testing capabilities related to Data Engineering applications using Databricks.
Prerequisites
Logistics
Computer with decent configuration (At least 4 GB RAM, however 8 GB is highly desired)
Dual Core is required and Quad-Core is highly desired
Chrome Browser
High-Speed Internet
Valid AWS Account
Valid Databricks Account (free Databricks Account is not sufficient)
Experience as Data Engineer especially using Apache Spark
Knowledge about some of the cloud concepts such as storage, users, roles, etc.
Associated Costs
As part of the training, you will only get the material. You need to practice on your own or corporate cloud account and Databricks Account.
You need to take care of the associated AWS or Azure costs.
You need to take care of the associated Databricks costs.
Training Approach
Here are the details related to the training approach.
It is self-paced with reference material, code snippets, and videos provided as part of Udemy.
One needs to sign up for their own Databricks environment to practice all the core features of Databricks.
We would recommend completing 2 modules every week by spending 4 to 5 hours per week.
It is highly recommended to complete all the tasks so that one can get real experience with Databricks.
Support will be provided through Udemy Q&A.
Here is the detailed course outline.
Getting Started with Databricks on Azure
As part of this section, we will go through the details of signing up for Azure and setting up the Databricks cluster on Azure.
Getting Started with Databricks on Azure
Signup for the Azure Account
Login and Increase Quotas for regional vCPUs in Azure
Create Azure Databricks Workspace
Launching Azure Databricks Workspace or Cluster
Quick Walkthrough of Azure Databricks UI
Create Azure Databricks Single Node Cluster
Upload Data using Azure Databricks UI
Overview of Creating Notebook and Validating Files using Azure Databricks
Develop Spark Application using Azure Databricks Notebook
Validate Spark Jobs using Azure Databricks Notebook
Export and Import of Azure Databricks Notebooks
Terminating Azure Databricks Cluster and Deleting Configuration
Delete Azure Databricks Workspace by deleting Resource Group
Azure Essentials for Databricks - Azure CLI
As part of this section, we will go through the details about setting up Azure CLI to manage Azure resources using relevant commands.
Azure Essentials for Databricks - Azure CLI
Azure CLI using Azure Portal Cloud Shell
Getting Started with Azure CLI on Mac
Getting Started with Azure CLI on Windows
Warming up with Azure CLI - Overview
Create Resource Group using Azure CLI
Create ADLS Storage Account within Resource Group
Add Container as part of Storage Account
Overview of Uploading the data into ADLS File System or Container
Setup Data Set locally to upload into ADLS File System or Container
Upload local directory into Azure ADLS File System or Container
Delete Azure ADLS Storage Account using Azure CLI
Delete Azure Resource Group using Azure CLI
Mount ADLS onto Azure Databricks to access files from Azure Blob Storage
As part of this section, we will go through the details related to mounting Azure Data Lake Storage (ADLS) onto Azure Databricks Clusters.
Mount ADLS on to Azure Databricks - Introduction
Ensure Azure Databricks Workspace
Setup Databricks CLI on Mac or Windows using Python Virtual Environment
Configure Databricks CLI for new Azure Databricks Workspace
Register an Azure Active Directory Application
Create Databricks Secret for AD Application Client Secret
Create ADLS Storage Account
Assign IAM Role on Storage Account to Azure AD Application
Setup Retail DB Dataset
Create ADLS Container or File System and Upload Data
Start Databricks Cluster to mount ADLS
Mount ADLS Storage Account on to Azure Databricks
Validate ADLS Mount Point on Azure Databricks Clusters
Unmount the mount point from Databricks
Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks
Setup Local Development Environment for Databricks
As part of this section, we will go through the details related to setting up a local development environment for Databricks using tools such as PyCharm, Databricks dbconnect, Databricks dbutils, etc.
Setup Single Node Databricks Cluster
Install Databricks Connect
Configure Databricks Connect
Integrating PyCharm with Databricks Connect
Integrate Databricks Cluster with Glue Catalog
Setup AWS S3 Bucket and Grant Permissions
Mounting S3 Buckets into Databricks Clusters
Using Databricks dbutils from IDEs such as PyCharm
Using Databricks CLI
As part of this section, we will get an overview of Databricks CLI to interact with Databricks File System or DBFS.
Introduction to Databricks CLI
Install and Configure Databricks CLI
Interacting with Databricks File System using Databricks CLI
Getting Databricks Cluster Details using Databricks CLI
Databricks Jobs and Clusters
As part of this section, we will go through the details related to Databricks Jobs and Clusters.
Introduction to Databricks Jobs and Clusters
Creating Pools in Databricks Platform
Create Cluster on Azure Databricks
Request to Increase CPU Quota on Azure
Creating Job on Databricks
Submitting Jobs using Databricks Job Cluster
Create Pool in Databricks
Running Job using Interactive Databricks Cluster Attached to Pool
Running Job Using Databricks Job Cluster Attached to Pool
Exercise - Submit the application as a job using Databricks interactive cluster
Deploy and Run Spark Applications on Databricks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications.
Prepare PyCharm for Databricks
Prepare Data Sets
Move files to ghactivity
Refactor Code for Databricks
Validating Data using Databricks
Setup Data Set for Production Deployment
Access File Metadata using Databricks dbutils
Build Deployable bundle for Databricks
Running Jobs using Databricks Web UI
Get Job and Run Details using Databricks CLI
Submitting Databricks Jobs using CLI
Setup and Validate Databricks Client Library
Resetting the Job using Databricks Jobs API
Run Databricks Job programmatically using Python
Detailed Validation of Data using Databricks Notebooks
Deploy and Run Spark Jobs using Notebooks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications using Databricks Notebooks.
Modularizing Databricks Notebooks
Running Job using Databricks Notebook
Refactor application as Databricks Notebooks
Run Notebook using Databricks Development Cluster
Deep Dive into Delta Lake using Spark Data Frames on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark Data Frames.
Introduction to Delta Lake using Spark Data Frames on Databricks
Creating Spark Data Frames for Delta Lake on Databricks
Writing Spark Data Frame using Delta Format on Databricks
Updating Existing Data using Delta Format on Databricks
Delete Existing Data using Delta Format on Databricks
Merge or Upsert Data using Delta Format on Databricks
Deleting using Merge in Delta Lake on Databricks
Point in Snapshot Recovery using Delta Logs on Databricks
Deleting unnecessary Delta Files using Vacuum on Databricks
Compaction of Delta Lake Files on Databricks
Deep Dive into Delta Lake using Spark SQL on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark SQL.
Introduction to Delta Lake using Spark SQL on Databricks
Create Delta Lake Table using Spark SQL on Databricks
Insert Data to Delta Lake Table using Spark SQL on Databricks
Update Data in Delta Lake Table using Spark SQL on Databricks
Delete Data from Delta Lake Table using Spark SQL on Databricks
Merge or Upsert Data into Delta Lake Table using Spark SQL on Databricks
Using Merge Function over Delta Lake Table using Spark SQL on Databricks
Point in Snapshot Recovery using Delta Lake Table using Spark SQL on Databricks
Vacuuming Delta Lake Tables using Spark SQL on Databricks
Compaction of Delta Lake Tables using Spark SQL on Databricks
Accessing Databricks Cluster Terminal via Web as well as SSH
As part of this section, we will see how to access the terminal of a Databricks Cluster via Web as well as SSH.
Enable Web Terminal in Databricks Admin Console
Launch Web Terminal for Databricks Cluster
Setup SSH for the Databricks Cluster Driver Node
Validate SSH Connectivity to the Databricks Driver Node on AWS
Limitations of SSH and comparison with Web Terminal related to Databricks Clusters
Installing Software on Databricks Clusters using init scripts
As part of this section, we will see how to bootstrap Databricks clusters by installing relevant third-party libraries for our applications.
Setup gen_logs on Databricks Cluster
Overview of Init Scripts for Databricks Clusters
Create Script to install software from git on Databricks Cluster
Copy init script to dbfs location
Create Databricks Standalone Cluster with init script
Quick Recap of Spark Structured Streaming
As part of this section, we will get a quick recap of Spark Structured streaming.
Validate Netcat on Databricks Driver Node
Push log messages to Netcat Webserver on Databricks Driver Node
Reading Web Server logs using Spark Structured Streaming
Writing Streaming Data to Files
Incremental Loads using Spark Structured Streaming on Databricks
As part of this section, we will understand how to perform incremental loads using Spark Structured Streaming on Databricks.
Overview of Spark Structured Streaming
Steps for Incremental Data Processing on Databricks
Configure Databricks Cluster with Instance Profile
Upload GHArchive Files to AWS S3 using Databricks Notebooks
Read JSON Data using Spark Structured Streaming on Databricks
Write using Delta file format using Trigger Once on Databricks
Analyze GHArchive Data in Delta files using Spark on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Validate Incremental Load on Databricks
Internals of Spark Structured Streaming File Processing on Databricks
Incremental Loads using Auto Loader cloudFiles on Databricks
As part of this section, we will see how to perform incremental loads using Auto Loader cloudFiles on Databricks Clusters.
Overview of Auto Loader cloudFiles on Databricks
Upload GHArchive Files to S3 on Databricks
Write Data using AutoLoader cloudFiles on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Add New GHActivity JSON files on Databricks
Overview of Handling S3 Events using AWS Services on Databricks
Configure IAM Role for cloudFiles file notifications on Databricks
Incremental Load using cloudFiles File Notifications on Databricks
Review AWS Services for cloudFiles Event Notifications on Databricks
Review Metadata Generated for cloudFiles Checkpointing on Databricks
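The two Auto Loader file discovery modes covered in this section differ mainly in one reader option. The sketch below builds the option dict for either mode. The helper name and placeholder paths are hypothetical, and the streaming read itself is shown as comments because it needs a Databricks cluster.

```python
def auto_loader_options(file_format, use_notifications=False):
    """Build cloudFiles reader options for the chosen file discovery mode."""
    return {
        "cloudFiles.format": file_format,
        # "false" keeps the default directory listing mode;
        # "true" switches to file notification mode.
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


print(auto_loader_options("json"))
print(auto_loader_options("json", use_notifications=True))

# On a Databricks cluster (schema and paths are placeholders):
# df = (
#     spark.readStream.format("cloudFiles")
#     .options(**auto_loader_options("json", use_notifications=True))
#     .schema(schema)
#     .load("<input-path>")
# )
```

File notification mode relies on cloud services to report new files, which is why this section also configures an IAM role before using it.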
Overview of Databricks SQL Clusters
As part of this section, we will get an overview of Databricks SQL Clusters.
Overview of Databricks SQL Platform - Introduction
Run First Query using SQL Editor of Databricks SQL
Overview of Dashboards using Databricks SQL
Overview of Databricks SQL Data Explorer to review Metastore Databases and Tables
Use Databricks SQL Editor to develop scripts or queries
Review Metadata of Tables using Databricks SQL Platform
Overview of loading data into retail_db tables
Configure Databricks CLI to push data into the Databricks Platform
Copy JSON Data into DBFS using Databricks CLI
Analyze JSON Data using Spark APIs
Analyze Delta Table Schemas using Spark APIs
Load Data from Spark Data Frames into Delta Tables
Run Adhoc Queries using Databricks SQL Editor to validate data
Overview of External Tables using Databricks SQL
Using COPY Command to Copy Data into Delta Tables
Manage Databricks SQL Endpoints