Azure Databricks and Spark SQL (Python)

Rating: 4.6 (3652 reviews)

Your Hands-On Guide to Databricks Data Engineering with PySpark and Spark SQL, including a 4-Part Course Project

Bestseller

Highest Rated

Created by Malvik Vaghadia

Last updated 6/2026

English

What you'll learn

How to use Databricks to build and run data engineering workflows
The principles of the Lakehouse architecture with Delta Lake
How to process data with Spark SQL and PySpark
Best practices for Databricks compute, jobs, and orchestration
How to apply governance with Unity Catalog and manage secure access
Working with streaming pipelines using Structured Streaming and Lakeflow
Applying concepts to real-world projects with modular code and version control
Real World Scenarios

Course content

32 sections • 221 lectures • 17h 29m total length

Welcome to the Course / Introduction (6:28)
Explore Databricks and Spark SQL with the Python API, mastering data engineering in the lakehouse, Delta Lake, medallion architecture, and streaming pipelines through hands-on demos and the NYC taxi project.
Connect with me... (0:09)
Major Course Update (0:12)
What is Data Engineering? (2:45)
Design and build reliable data pipelines that ingest, transform, and serve clean, analysis-ready data while enforcing data quality and security across ETL and ELT patterns, batch and streaming.
What is a Data Lakehouse? (4:51)
Explore how a data lakehouse unifies data lakes and data warehouses in an open architecture, delivering ACID compliance, scalable storage and compute, and BI, ML, batch, and streaming support.
What is Databricks? (2:18)
Databricks is a unified, cloud-native data intelligence platform built on Apache Spark, offering a lakehouse architecture and five engines for BI, data warehousing, AI, ETL, and real-time analytics.
A Brief History of Hadoop, MapReduce and Apache Spark (6:08)
Trace the evolution from MapReduce and Hadoop to Apache Spark, covering big data’s four V's, horizontal and vertical scaling, in-memory processing, real-time streaming, SQL queries, and machine learning workloads.
Introduction to Spark Architecture (2:31)
Explore how Apache Spark uses a cluster to enable distributed processing, with the driver orchestrating stages and tasks, executors running on workers, and the cluster manager allocating resources.
Comparing Apache Spark and Databricks (3:51)
Compare Apache Spark and Databricks, where Spark is a self-managed open source engine and Databricks is a fully managed cloud platform with Delta Lake, an integrated workspace, production jobs, security, and integrations.
Overview of the Apache Spark Ecosystem (5:18)
Discover the Spark ecosystem's core components, including Spark Core, RDDs, and Spark SQL with DataFrames. Compare RDDs and DataFrames, and meet the Catalyst optimizer, the Tungsten execution engine, and the pandas API on Spark.

Azure Account Set Up (3:40)
Explore how to set up an Azure account for the Databricks on Azure workflow, choosing free trial or pay-as-you-go, linking a Microsoft account, providing verification, and signing in to portal.azure.com.
Navigating the Azure Portal (4:24)
Sign in to the Azure portal, enable two-factor authentication with the Microsoft Authenticator, and navigate the web-based console to manage resources via the home page and global search.
Azure Resource Hierarchy (8:44)
Organize Azure resources through management groups, subscriptions, resource groups, and resources, and understand inheritance, quotas, and cost management. Create and manage resource groups and storage accounts in selected regions.
Introduction to Entra ID and Azure Role Based Access Control (5:52)
Discover Microsoft Entra ID and Azure role-based access control, including Global Administrator, ownership, role assignments, and job function roles for storage and other resources.
Cost Management and Budgeting in Azure (3:53)
Master cost management in Azure by using cost analysis, monitoring accumulated costs, and setting budgets with alerts to keep pay-as-you-go and free-trial expenses under control.
Azure Naming Conventions (1:17)
Apply a consistent naming convention for Azure resources by prefixing with resource type abbreviations (rg, dbx), then workload, environment, region, and instance number, noting storage accounts restrict hyphens and underscores.
Creating the Resource Group for this Course (1:15)
Create a resource group in the Azure Portal for the course, naming it RG-data-engineering-sandbox-UK South, and review and create the Databricks data engineering sandbox environment in your selected region.
Introduction to Azure Data Lake Storage Gen2 (9:59)
Discover how Azure Data Lake Storage Gen2 provides a scalable, hierarchical, Hadoop-compatible data lake built on Azure Blob Storage, with fine-grained security and integration with Azure Databricks.
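
The naming convention described in the Azure Naming Conventions lecture can be sketched in a few lines of Python. This is only an illustration, not the course's own code: the function name azure_resource_name, the "st" abbreviation for storage accounts, the three-digit instance suffix, and the example workload and environment values are all assumptions. Only the rg and dbx prefixes and the ordering of the parts come from the lecture description.

```python
def azure_resource_name(resource_abbr, workload, environment, region,
                        instance=1, allow_hyphens=True):
    """Build a name as <type>-<workload>-<environment>-<region>-<instance>."""
    # The three-digit instance suffix is an assumed formatting choice.
    parts = [resource_abbr, workload, environment, region, f"{instance:03d}"]
    # Normalise every part: lowercase and without spaces.
    parts = [part.lower().replace(" ", "") for part in parts]
    separator = "-" if allow_hyphens else ""
    return separator.join(parts)


# Resource group and Databricks workspace names keep the hyphens.
print(azure_resource_name("rg", "dataeng", "sandbox", "uksouth"))
print(azure_resource_name("dbx", "dataeng", "sandbox", "uksouth"))

# Storage account names restrict hyphens and underscores, so the parts
# are joined without a separator ("st" is a hypothetical abbreviation).
print(azure_resource_name("st", "de", "dev", "uksouth", allow_hyphens=False))
```

Keeping the separator as a parameter lets one helper cover both the hyphenated names and the storage account case.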

Creating the Databricks Workspace (3:10)
Create an Azure Databricks workspace by selecting your subscription and resource group, naming it with a consistent identifier, opting for the premium tier or the 14-day free trial with serverless compute, and keeping the region consistent.
Azure and Databricks Managed Resources (1:35)
Explore how Azure Databricks automatically creates a managed resource group with a virtual network, network security group, access connector, ADLS Gen2 storage, Unity Catalog with managed tables and volumes, and a managed identity.
Tour of the Databricks Workspace UI (4:56)
Explore the Databricks workspace in the Azure portal, navigate the left sidebar to access SQL, data engineering, and AI and ML experiences, and manage notebooks, dashboards, and data assets.
Personal-Email Users: Bypassing the Databricks Account Console Restriction (4:03)
Show personal email users how to sign in with an onmicrosoft.com domain user principal name to gain account administrator access in Azure Databricks, then use accounts.azuredatabricks.net to access the account console.
Databricks UI Updates (1:24)
Explore how the Databricks UI updates frequently, with changes to jobs and pipelines, workflows, labels, and navigation, while core features and workflows remain consistent.

Note on Managing Costs During Hands-On Compute Activities (0:25)
Introduction to Databricks Notebooks (10:04)
Learn to use multi-language code cells as the primary interface in Databricks notebooks, with real-time collaboration, automatic versioning, and built-in visualizations for hands-on notebook setup and execution.
Mix Languages in Notebooks (2:22)
Databricks notebooks default to Python but can mix languages; use magic commands such as %python or %sql to override the language of a single cell, for example to run Python in a SQL notebook and switch back again.
Adding Comments and Markdown Text to Databricks Notebooks (4:56)
Learn how to document Databricks notebooks by adding Python comments with a hash, SQL comments with a double hyphen, and rich markdown text, headers, and code blocks to improve readability.
Organizing your Workspace Objects (7:08)
Organize your Databricks workspace with the three special folders (workspace, shared, users); manage notebooks, files, and folders; use import options and tabs for collaboration.

Overview of Databricks Compute Types (4:18)
Explore the spectrum of Databricks compute types, from serverless notebooks, jobs, and SQL warehouses to all-purpose compute, with instant spin-up via instance pools for pipelines.
Databricks Pricing (2:59)
Explore Databricks pricing by understanding DBU-based per-second processing costs and cloud infrastructure fees, compare serverless and job compute options, and learn how regional and edition choices affect total bills.
Databricks Runtime (1:53)
Discover Databricks Runtime, a managed, optimized Spark-based execution environment with pre-installed libraries, LTS and ML variants, and versions featuring Spark 3.5.2 and 4.0.0.
Serverless Compute Demo (3:14)
Attach serverless compute to notebooks, files, and workflows to spin up in seconds and run code. Use it to run notebooks and Python files, powering jobs and pipelines for ETL.
Creating an All-Purpose Compute Cluster (10:25)
Create an all-purpose compute cluster by configuring a single-node setup, choosing a Databricks runtime with Photon acceleration, and tuning autoscaling and termination to balance cost and performance.
Serverless Compute vs All-Purpose Compute (4:10)
Compare serverless compute and all-purpose compute clusters in Databricks notebooks, noting serverless starts quickly and is always available, while all-purpose clusters take minutes to start and incur costs when active.
Compute Options for Databricks Jobs (3:32)
Explore how compute powers Databricks jobs, including serverless compute and job clusters, compare costs and spin-up times, and learn how to configure tasks inside a workflow.
Creating an Instance Pool (5:28)
Create and manage an instance pool to keep idle virtual machines ready, improving cluster startup times on all-purpose or job compute clusters, while balancing idle costs.
Creating an SQL Serverless Warehouse (3:38)
Create a SQL serverless warehouse for data analysis, with quick startup and auto-stop, and use the SQL editor to run SQL queries or attach to notebooks for SQL-only execution.
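
The pricing model from the Databricks Pricing lecture (per-second DBU charges plus cloud infrastructure fees) comes down to simple arithmetic. The sketch below is a rough illustration only: the function name, the parameter names, and every rate in the example are hypothetical placeholders, not real Databricks or Azure prices.

```python
def estimate_compute_cost(runtime_seconds, dbus_per_hour, price_per_dbu,
                          vm_price_per_hour):
    """Rough cost estimate: DBU charge plus cloud infrastructure charge."""
    # Per-second billing: convert the runtime to fractional hours.
    hours = runtime_seconds / 3600
    dbu_cost = hours * dbus_per_hour * price_per_dbu
    infrastructure_cost = hours * vm_price_per_hour
    return dbu_cost + infrastructure_cost


# Hypothetical numbers: a 30-minute run on compute rated at 2 DBUs per hour,
# with 0.50 per DBU and 0.40 per hour for the underlying virtual machine.
cost = estimate_compute_cost(
    runtime_seconds=1800,
    dbus_per_hour=2.0,
    price_per_dbu=0.50,
    vm_price_per_hour=0.40,
)
print(f"Estimated cost: {cost:.2f}")
```

Actual DBU rates depend on the compute type, region, and edition, so the inputs should come from the current pricing pages rather than from this example.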

Spark SQL API Reference Documentation (2:23)
Explore the PySpark Python API for Spark SQL, learning to read, write, and transform data with the DataFrame API. Use the API reference to navigate Spark SQL and the SparkSession.
Introduction to the SparkSession (2:53)
Use SparkSession as the entry point to access Spark SQL DataFrames, streaming, machine learning, and graph APIs under a unified object, with SparkSession.builder.getOrCreate() to establish the session.
Anatomy of PySpark Syntax (5:47)
Explore the anatomy of PySpark syntax by using the Spark session to read CSV data into a DataFrame, filter results, and write CSV outputs, with single and chained calls.
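
The read, filter, and write pattern from the lectures above might look like the following minimal sketch. It is not taken from the course notebooks: the function names, the population column, and the paths are hypothetical. In a Databricks notebook a session is already available as spark, so get_spark is only needed when running elsewhere.

```python
def get_spark():
    """Return the active SparkSession, creating one if none exists."""
    # Imported inside the function so the module loads without PySpark.
    from pyspark.sql import SparkSession
    return SparkSession.builder.getOrCreate()


def filter_and_write(spark, in_path, out_path, min_population):
    """Read CSV, keep rows above a threshold, and write the result as CSV."""
    # Step by step: each call returns a new object.
    df = spark.read.csv(in_path, header=True, inferSchema=True)
    # "population" is a hypothetical column name.
    filtered = df.filter(df["population"] > min_population)
    filtered.write.csv(out_path, header=True, mode="overwrite")
    return filtered


def read_filtered_chained(spark, in_path, min_population):
    """The same read and filter written as one chained expression."""
    return (
        spark.read
        .csv(in_path, header=True, inferSchema=True)
        .filter(f"population > {min_population}")
    )
```

Usage in a notebook would be along the lines of filter_and_write(spark, "<input path>", "<output path>", 1_000_000), with the paths pointing at your own files.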

Creating a Catalog, Schema and Volume for our Data Assets (5:10)
Learn to use Unity Catalog in Azure Databricks, exploring the metastore, catalogs, schemas, and volumes, and create a catalog, schema, and volume to manage data with the Python API for Spark SQL.
Uploading the Countries Data Files to our Unity Catalog Volume (5:05)
Upload the extracted countries dataset folder to the Unity Catalog volume in Databricks, preserving the directory structure by dragging the entire folder into the datasets volume.
File Formats (3:33)
Explore reading and writing data across CSV, text, JSON, Avro, Parquet, and Delta Lake formats in Spark. Learn how each format handles schemas, performance, and transactions.
Overview of the DataFrameReader and DataFrameWriter Methods (6:04)
Explore how to read and write data with Spark's DataFrameReader and DataFrameWriter APIs in PySpark, including CSV, JSON, Parquet, Delta, and Unity Catalog table support.
Reading CSV Data into a DataFrame (12:34)
Learn how to read CSV data into a DataFrame with PySpark, including setting the path and header=True, using spark.read.csv, or format with options and load, and displaying the results.
Reading Single vs Multiple Files (3:50)
Read data into Spark DataFrames by providing a path to a single CSV file or a directory; Spark reads all CSV files in subfolders and appends them.
Schema Inference (3:52)
Explore how Spark infers data types from CSV inputs, and when to prefer explicit schema definitions over schema inference for efficient reads of large datasets.
Providing the Schema with an SQL String (4:39)
Define the read schema with an SQL DDL string using the schema method, specifying columns and data types, with an option to store the schema in a variable.
Providing the Schema Programmatically (5:51)
Programmatically define a schema on read using PySpark objects like StructType and StructField, with data types and nullable settings. Import from pyspark.sql.types to enable this approach.
Writing DataFrames to CSV (11:17)
Learn to save a Spark DataFrame to CSV using the csv and save methods, control headers and delimiters, and manage append or overwrite modes for incremental writes.
Working with JSON (7:57)
Read a CSV into a DataFrame and export it as JSON using the json and save methods. Specify a schema with struct types to enforce integers over strings when reading JSON.
Working with ORC (4:04)
Learn to read and write data in ORC format with Spark, leveraging stored schema metadata, schema-on-read, overwrite mode, and Snappy compression, via the orc and load methods.
Working with Parquet (3:23)
Learn to work with Parquet data using Spark, write and read Parquet files with schema on read and Snappy compression, using the parquet method or the save method with overwrite mode.
Working with Delta Lake (5:04)
Explore Delta Lake, built on Parquet, and its ACID transactions, scalable metadata, and schema enforcement to enable reliable streaming and batch workloads within Spark workflows.
Rendering your DataFrame (4:16)
Learn how to render a DataFrame using the show method and the display function in Databricks notebooks, control truncation and row count, and explore interactive data profiling and visualizations.
How to Partition your Data (5:01)
Partition your output data with partitionBy on region ID and sub-region ID, creating per-value folders and speeding up reads by skipping non-matching files.
Databricks File System Utilities (8:44)
Explore the Databricks File System utilities (dbutils.fs) for copying, listing, previewing, moving, writing, and deleting files and directories, with examples of cp, head, ls, mv, and rm.
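
The two ways of supplying a schema covered above (an SQL DDL string and StructType objects), together with a partitioned write, can be sketched as follows. The column names, nullability settings, and function names are assumptions chosen for illustration and are not taken from the course's countries dataset.

```python
# Hypothetical columns: (name, Spark SQL type) pairs.
COUNTRY_COLUMNS = [
    ("country_id", "INT"),
    ("name", "STRING"),
    ("region_id", "INT"),
    ("sub_region_id", "INT"),
]


def ddl_schema(columns):
    """Join (name, type) pairs into a DDL string like 'id INT, name STRING'."""
    return ", ".join(f"{name} {data_type}" for name, data_type in columns)


def struct_schema():
    """Build the equivalent schema programmatically with PySpark types."""
    # Imported inside the function so the module loads without PySpark.
    from pyspark.sql.types import (
        IntegerType, StringType, StructField, StructType,
    )
    return StructType([
        StructField("country_id", IntegerType(), False),  # not nullable
        StructField("name", StringType(), True),
        StructField("region_id", IntegerType(), True),
        StructField("sub_region_id", IntegerType(), True),
    ])


def read_csv_with_schema(spark, path, schema):
    """Read CSV with an explicit schema (a DDL string or a StructType)."""
    return (
        spark.read.format("csv")
        .option("header", True)
        .schema(schema)
        .load(path)
    )


def write_partitioned(df, path):
    """Write Parquet with one folder per region_id / sub_region_id value."""
    (
        df.write.mode("overwrite")
        .partitionBy("region_id", "sub_region_id")
        .parquet(path)
    )


print(ddl_schema(COUNTRY_COLUMNS))
```

Either schema form can be passed to read_csv_with_schema, since the schema method accepts both; the DDL string is shorter, while the StructType version makes nullability explicit.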

Working with Unity Catalog Managed Tables (6:53)
Create and manage Unity Catalog Delta managed tables using a three-level namespace, save DataFrames as tables, and query with spark.read.table while controlling data with append or overwrite modes.
Running SQL Queries with the PySpark API (3:46)
Learn to run SQL queries from the Python API for Apache Spark using the SparkSession sql method in Databricks notebooks, returning a DataFrame and enabling Unity Catalog operations with SQL syntax.
Creating Managed Tables with SQL (4:53)
Create managed tables with SQL to define schemas and store data. Use Spark SQL and the Python APIs to insert, query, and reproduce tables like countries_population_two and countries_population_three.
Creating Views with SQL (4:18)
Create persistent views in Azure Databricks by using CREATE VIEW AS SELECT, persisting a query over your data as a read-only object in a schema and querying it anywhere via the metastore.
Creating Catalogs, Schemas and Volumes with SQL (6:41)
Learn to create catalogs, schemas, and volumes in Azure Databricks using SQL, specify managed locations, and verify results in the Catalog Explorer.
Dropping Unity Catalog Objects with SQL (3:42)
Drop Unity Catalog objects using SQL or the UI, including catalogs, schemas, volumes, tables, and views, with cascade to delete non-empty items, and reference them with the three-level namespace.
Temporary Views (4:57)
Create and use temporary views in Databricks to query DataFrames with local and global scopes, using create temp view, create or replace temp view, and global temp view equivalents.
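
The managed table and view workflow above can be sketched with a handful of PySpark calls. This is an illustrative outline rather than the course's code: the helper names and the catalog, schema, and table names in the usage note are hypothetical.

```python
def full_table_name(catalog, schema, table):
    """Compose the three-level Unity Catalog name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def save_managed_table(df, table_name, mode="overwrite"):
    """Save a DataFrame as a managed table; mode is 'append' or 'overwrite'."""
    df.write.mode(mode).saveAsTable(table_name)


def read_table(spark, table_name):
    """Load a table back into a DataFrame through the DataFrameReader."""
    return spark.read.table(table_name)


def query_table(spark, table_name):
    """Run SQL through the SparkSession; the result is a DataFrame."""
    return spark.sql(f"SELECT * FROM {table_name}")


def register_temp_view(df, view_name):
    """Expose a DataFrame to SQL as a session-scoped temporary view."""
    df.createOrReplaceTempView(view_name)
```

In a notebook this could be used as name = full_table_name("demo", "bronze", "countries"), followed by save_managed_table(df, name) and query_table(spark, name), where demo and bronze stand in for a catalog and schema you have already created.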

Requirements

Basic to intermediate SQL
Basic to intermediate Python

Description

I’m Malvik Vaghadia, a Data Engineer and Architect with nearly 15 years of professional experience. I'm also a recognised Databricks Champion, an honour given to a small global community for deep platform expertise and contribution to the wider ecosystem.

I’ve worked on multiple large-scale lakehouse implementations and consulted for enterprise clients. As an instructor, I’ve taught 200,000+ students worldwide and hold a 4.6+ instructor rating. Since launching this course, it has become one of Udemy’s best-sellers in the Databricks category, and this new version (Sept 2025) has been completely rebuilt with 17 hours of brand-new content.

Why Learn Databricks

Databricks is recognised as a Leader in the Gartner Magic Quadrant for Data & AI platforms. It has become the go-to lakehouse platform for modern data engineering, enabling organisations to build, orchestrate, and optimise pipelines at scale. By mastering Databricks, you’ll be learning one of the most in-demand skills in today’s data landscape.

Course Delivery Style

This course is designed with the right balance of theory, hands-on coding, and practical projects. Every concept is explained clearly, then demonstrated live in Databricks, and reinforced with a multi-phase, end-to-end project that you’ll build step by step. You’ll also get all course notebooks as downloadable materials, containing the full code, step-by-step documentation, and extra resources so you can follow along easily.

Curriculum Highlights:

Four Part Course Project: End-to-end NYC Taxi project and further pipeline builds across multiple parts as you develop your knowledge.
Foundations: What data engineering is, why Databricks, the Spark architecture, PySpark, and the Lakehouse.
Azure setup: Account creation, resources, role-based access control, naming conventions, and cost management.
Databricks setup: Creating and configuring a workspace, navigating the UI, and handling personal email restrictions.
Databricks notebooks and workspace: Markdown, comments, organising objects, mixing languages, and notebook tips.
Databricks compute: Clusters, DBU pricing, runtimes, serverless vs all-purpose compute, instance pools, and SQL warehouses.
Spark SQL (Python): Writing Spark SQL code using both SQL syntax and DataFrame APIs, reading/writing different file formats, defining schemas, and managing tables and views.
PySpark Transformations: Column operations, functions, filtering, sorting, joining, aggregations, pivots, and conditional logic.
Medallion architecture: Bronze, Silver, and Gold layers explained and implemented.
Delta Lake: Transaction log, schema enforcement and evolution, time travel, and DML operations (MERGE, UPDATE, DELETE).
Workflows and jobs: Passing parameters, handling failures, concurrency, conditional tasks, and monitoring.
Git & local development: VS Code setup, linking with GitHub, repos, and workflow best practices.
Functions and modularization: Creating and importing Python modules, UDFs, and project structuring.
Unity Catalog & governance: Metastores, securable objects, workspace roles, external locations, and permissions.
Streaming & Lakeflow pipelines: Structured Streaming concepts, Auto Loader, watermarking, triggers, and the new Lakeflow (DLT) pipeline model.
Performance: Lazy evaluation, explain plans, caching, shuffles, broadcast joins, partitioning, Z-ORDER, and Liquid Clustering.
Automation & CI/CD: Programmatic interaction with Databricks, CLI demo, and high-level CI/CD overview.

By the end of the course, you’ll have both the knowledge and confidence to design, build, and optimise production-grade data pipelines on Databricks.

Who this course is for:

Anyone interested in working with Big Data and Spark
Anyone interested in working with Databricks
Anyone interested in working with cloud platforms
Aspiring Data Engineers

Course content

Course Introduction: 10 lectures • 35min

Azure Set Up: 8 lectures • 39min

Databricks Set Up: 5 lectures • 15min

Databricks Notebooks and Workspace Objects: 5 lectures • 25min

Databricks Compute: 9 lectures • 40min

Course Materials - Important!: 1 lecture • 5min

Getting Started with Spark SQL (Python): 3 lectures • 11min

Reading and Writing Data with Spark SQL (Python): 17 lectures • 1hr 40min

Managed Tables & Views, and SQL: 7 lectures • 35min

Creating DataFrames from Python Objects: 1 lecture • 4min
