Udemy
Azure Databricks and Spark SQL (Python)
Bestseller
Highest Rated
Rating: 4.6 out of 5 (3,652 ratings)
31,336 students

Your Hands-On Guide to Databricks Data Engineering with PySpark and Spark SQL, including a 4-Part Course Project
Created by Malvik Vaghadia
Last updated 6/2026
English

What you'll learn

  • How to use Databricks to build and run data engineering workflows
  • The principles of the Lakehouse architecture with Delta Lake
  • How to process data with Spark SQL and PySpark
  • Best practices for Databricks compute, jobs, and orchestration
  • How to apply governance with Unity Catalog and manage secure access
  • Working with streaming pipelines using Structured Streaming and Lakeflow
  • Applying concepts to real-world projects with modular code and version control
  • How to handle real-world scenarios

Course content

32 sections • 221 lectures • 17h 29m total length
  • Welcome to the Course / Introduction (6:28)

    Explore Databricks and Spark SQL with the Python API, mastering data engineering in the lakehouse, Delta Lake, medallion architecture, and streaming pipelines through hands-on demos and the NYC taxi project.

  • Connect with me... (0:09)
  • Major Course Update (0:12)
  • What is Data Engineering? (2:45)

    Design and build reliable data pipelines that ingest, transform, and serve clean, analysis-ready data while enforcing data quality and security across ETL and ELT patterns, batch and streaming.

  • What is a Data Lakehouse? (4:51)

    Explore how a data lakehouse unifies data lakes and data warehouses in an open architecture, delivering ACID compliance, scalable storage and compute, and BI, ML, batch, and streaming support.

  • What is Databricks? (2:18)

    Databricks is a unified, cloud-native data intelligence platform built on Apache Spark, offering a lakehouse architecture and five engines for BI, data warehousing, AI, ETL, and real-time analytics.

  • A Brief History of Hadoop, MapReduce and Apache Spark (6:08)

    Trace the evolution from MapReduce and Hadoop to Apache Spark, explaining big data’s four V's, horizontal and vertical scaling, in-memory processing, real-time streaming, SQL queries, and machine learning workloads.

  • Introduction to Spark Architecture (2:31)

    Explore how Apache Spark uses a cluster to enable distributed processing, with the driver orchestrating stages and tasks, executors running on workers, and the cluster manager allocating resources.

  • Comparing Apache Spark and Databricks (3:51)

    Compare Apache Spark and Databricks: Spark is a self-managed open source engine, while Databricks is a fully managed cloud platform with Delta Lake, an integrated workspace, production jobs, security, and integrations.

  • Overview of the Apache Spark Ecosystem (5:18)

    Discover the Spark ecosystem's core components, including Spark Core, RDDs, and Spark SQL with DataFrames. Compare RDDs and DataFrames, and meet the Catalyst optimizer, the Tungsten execution engine, and the pandas API on Spark.
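
The driver and executor roles mentioned in the Spark architecture lecture can be illustrated with a small plain-Python sketch. This is not Spark code and not taken from the course: a thread pool stands in for the executors, and the function names (`split_into_partitions`, `run_task`, `driver_sum_of_squares`) are hypothetical. It only shows the general shape of the idea, where a driver splits data into partitions, hands one task per partition to workers, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the dataset into roughly equal partitions."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: one task processes exactly one partition."""
    return sum(x * x for x in partition)


def driver_sum_of_squares(data, num_executors=4):
    """Driver: plan the partitions, schedule the tasks, combine the results."""
    partitions = split_into_partitions(data, num_executors)
    # The thread pool is an assumption standing in for Spark executors.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(run_task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(driver_sum_of_squares(list(range(10))))
```

In real Spark the cluster manager allocates the executors and the driver plans stages from the query, so this sketch covers only the split, process, and combine pattern.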

Requirements

  • Basic to intermediate SQL
  • Basic to intermediate Python

Description

I’m Malvik Vaghadia, a Data Engineer and Architect with nearly 15 years of professional experience. I'm also a recognised Databricks Champion, an honour given to a small global community for deep platform expertise and contribution to the wider ecosystem.


I’ve worked on multiple large-scale lakehouse implementations and consulted for enterprise clients. As an instructor, I’ve taught 200,000+ students worldwide and hold a 4.6+ instructor rating. Since launching this course, it has become one of Udemy’s best-sellers in the Databricks category, and this new version (Sept 2025) has been completely rebuilt with 17 hours of brand-new content.


Why Learn Databricks

Databricks is recognised as a Leader in the Gartner Magic Quadrant for Data & AI platforms. It has become the go-to lakehouse platform for modern data engineering, enabling organisations to build, orchestrate, and optimise pipelines at scale. By mastering Databricks, you’ll be learning one of the most in-demand skills in today’s data landscape.


Course Delivery Style

This course is designed with the right balance of theory, hands-on coding, and practical projects. Every concept is explained clearly, then demonstrated live in Databricks, and reinforced with a multi-phase, end-to-end project that you’ll build step by step. You’ll also get all course notebooks as downloadable materials, containing the full code, step-by-step documentation, and extra resources so you can follow along easily.


Curriculum Highlights:

  • Four-Part Course Project: An end-to-end NYC Taxi project, extended with further pipeline builds as your knowledge develops.

  • Foundations: What data engineering is, why Databricks, the Spark architecture, PySpark, and the Lakehouse.

  • Azure setup: Account creation, resources, role-based access control, naming conventions, and cost management.

  • Databricks setup: Creating and configuring a workspace, navigating the UI, and handling personal email restrictions.

  • Databricks notebooks and workspace: Markdown, comments, organising objects, mixing languages, and notebook tips.

  • Databricks compute: Clusters, DBU pricing, runtimes, serverless vs all-purpose compute, instance pools, and SQL warehouses.

  • Spark SQL (Python): Writing Spark SQL code using both SQL syntax and DataFrame APIs, reading/writing different file formats, defining schemas, and managing tables and views.

  • PySpark Transformations: Column operations, functions, filtering, sorting, joining, aggregations, pivots, and conditional logic.

  • Medallion architecture: Bronze, Silver, and Gold layers explained and implemented.

  • Delta Lake: Transaction log, schema enforcement and evolution, time travel, and DML operations (MERGE, UPDATE, DELETE).

  • Workflows and jobs: Passing parameters, handling failures, concurrency, conditional tasks, and monitoring.

  • Git & local development: VS Code setup, linking with GitHub, repos, and workflow best practices.

  • Functions and modularization: Creating and importing Python modules, UDFs, and project structuring.

  • Unity Catalog & governance: Metastores, securable objects, workspace roles, external locations, and permissions.

  • Streaming & Lakeflow pipelines: Structured Streaming concepts, Auto Loader, watermarking, triggers, and the new Lakeflow (DLT) pipeline model.

  • Performance: Lazy evaluation, explain plans, caching, shuffles, broadcast joins, partitioning, Z-ORDER, and Liquid Clustering.

  • Automation & CI/CD: Programmatic interaction with Databricks, CLI demo, and high-level CI/CD overview.
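
As a rough illustration of the medallion architecture listed above, the following plain-Python sketch moves a few records through Bronze, Silver, and Gold steps. It is not the course's code and uses no Spark or Delta Lake APIs. The records, the field names (`trip_id`, `zone`, `fare`), and the helper functions are all hypothetical, loosely inspired by the NYC Taxi theme.

```python
from collections import defaultdict

# Bronze: raw records kept as ingested (strings, duplicates, missing values).
bronze = [
    {"trip_id": "1", "zone": "Midtown", "fare": "12.50"},
    {"trip_id": "2", "zone": "Harlem", "fare": "8.00"},
    {"trip_id": "2", "zone": "Harlem", "fare": "8.00"},  # duplicate row
    {"trip_id": "3", "zone": "Midtown", "fare": None},   # missing fare
]


def to_silver(rows):
    """Silver: drop incomplete and duplicate rows, then cast to proper types."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["fare"] is None or row["trip_id"] in seen:
            continue
        seen.add(row["trip_id"])
        cleaned.append({
            "trip_id": int(row["trip_id"]),
            "zone": row["zone"],
            "fare": float(row["fare"]),
        })
    return cleaned


def to_gold(rows):
    """Gold: aggregate cleaned rows into a reporting-ready summary per zone."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["zone"]] += row["fare"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

On Databricks each layer would typically be a Delta table written with PySpark or Spark SQL. The sketch only conveys the progression from raw, to cleaned, to aggregated data.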

By the end of the course, you’ll have both the knowledge and confidence to design, build, and optimise production-grade data pipelines on Databricks.


Who this course is for:

  • Anyone interested in working with Big Data and Spark
  • Anyone interested in working with Databricks
  • Anyone interested in working with cloud platforms
  • Aspiring Data Engineers