
Explore Databricks and Spark SQL with the Python API, mastering data engineering in the lakehouse, Delta Lake, medallion architecture, and streaming pipelines through hands-on demos and the NYC taxi project.
Design and build reliable data pipelines that ingest, transform, and serve clean, analysis-ready data while enforcing data quality and security across ETL and ELT patterns, batch and streaming.
Explore how a data lakehouse unifies data lakes and data warehouses in an open architecture, delivering ACID compliance, scalable storage and compute, and BI, ML, batch, and streaming support.
Databricks is a unified cloud native data intelligence platform built on Apache Spark, offering a lakehouse architecture and five engines for BI, data warehousing, AI, ETL, and real-time analytics.
Trace the evolution from MapReduce and Hadoop to Apache Spark, explaining big data’s four V's, horizontal and vertical scaling, in-memory processing, real-time streaming, SQL queries, and machine learning workloads.
Explore how Apache Spark uses a cluster to enable distributed processing, with the driver orchestrating stages and tasks, executors running on workers, and the cluster manager allocating resources.
Compare Apache Spark and Databricks, where Spark is a self-managed open source engine and Databricks a fully managed cloud platform with Delta Lake, integrated workspace, production jobs, security, and integrations.
Discover the Spark ecosystem's core components, including Spark Core, RDDs, and Spark SQL with DataFrames. Compare RDDs and DataFrames, and see where the Catalyst optimizer, the Tungsten execution engine, and the pandas API on Spark fit in.
Explore how to set up an Azure account for the Databricks on Azure workflow, choosing free trial or pay-as-you-go, linking a Microsoft account, providing verification, and signing into portal.azure.com.
Sign in to the Azure portal, enable two-factor authentication with the Microsoft Authenticator, and navigate the web-based console to manage resources via the home page and global search.
Organize Azure resources through management groups, subscriptions, resource groups, and resources, and understand inheritance, quotas, and cost management. Create and manage resource groups and storage accounts in selected regions.
Discover Microsoft Entra ID and Azure role-based access control, including Global Administrator, ownership, role assignments, and job function roles for storage and other resources.
Master cost management in Azure by using cost analysis, monitoring accumulated costs, and setting budgets with alerts to keep pay-as-you-go and free-trial expenses under control.
Apply a consistent naming convention for Azure resources by prefixing with resource type abbreviations (rg, dbx), then workload, environment, region, and instance number, noting storage accounts restrict hyphens and underscores.
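The convention above can be sketched as a small helper. This is only an illustration: the function names, the three-digit instance padding, the "st" prefix for storage accounts, and the sample workload and environment values are assumptions, not names taken from the course.

```python
import re


def resource_name(resource_type, workload, environment, region, instance):
    """Join the components with hyphens: <type>-<workload>-<env>-<region>-<nnn>."""
    parts = [resource_type, workload, environment, region, f"{instance:03d}"]
    return "-".join(part.lower() for part in parts)


def storage_account_name(workload, environment, region, instance):
    """Storage account names allow only lowercase letters and digits, 3 to 24 characters."""
    # "st" is an assumed abbreviation for storage accounts.
    name = f"st{workload}{environment}{region}{instance:03d}".lower()
    if not re.fullmatch(r"[a-z0-9]{3,24}", name):
        raise ValueError(f"invalid storage account name: {name!r}")
    return name


print(resource_name("rg", "dataeng", "sandbox", "uksouth", 1))
print(storage_account_name("de", "dev", "uksouth", 1))
```

Keeping the storage account rule in its own function makes the restriction explicit: a name containing a hyphen or underscore raises an error instead of failing later in the portal.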
Create a resource group in the Azure Portal for the course, naming it RG-data-engineering-sandbox-UK South, and review and create the Databricks data engineering sandbox environment in your selected region.
Discover how Azure Data Lake Storage Gen2 provides a scalable, hierarchical, Hadoop-compatible data lake built on Azure Blob Storage, with fine-grained security and integration with Azure Databricks.
Create an Azure Databricks workspace by selecting your subscription and resource group, naming it with a consistent identifier, opting for the premium tier or the 14-day free trial with serverless compute, and keeping the region consistent.
Explore how Azure Databricks automatically creates a managed resource group with a virtual network, network security group, access connector, ADLS Gen2 storage, Unity Catalog with managed tables and volumes, and a managed identity.
Explore the Databricks workspace in the Azure portal, navigate the left sidebar to access SQL, data engineering, and AI and ML experiences, and manage notebooks, dashboards, and data assets.
Show users who signed up with a personal email how to sign in with an onmicrosoft.com user principal name to gain account administrator access in Azure Databricks, then use accounts.azuredatabricks.net to access the account console.
Explore how the Databricks UI updates frequently, with changes to jobs and pipelines, workflows, labels, and navigation, while core features and workflows remain consistent.
Get introduced to Databricks notebooks and their multi-language code cells as the primary interface, with real-time collaboration, automatic versioning, and built-in visualizations, through hands-on notebook setup and execution.
Databricks notebooks default to Python but can mix in other languages such as SQL; use a magic command such as %sql or %python to override the language of a single cell, for example to run Python in a SQL notebook and switch back again.
Learn how to document Databricks notebooks by adding Python comments with hash, SQL comments with double hyphen, and rich markdown text, headers, and code blocks to improve readability.
Organize your Databricks workspace with the three special folders (workspace, shared, users); manage notebooks, files, and folders; use import options and tabs for collaboration.
Explore the spectrum of Databricks compute types, from serverless notebooks, jobs, and SQL warehouses to all-purpose compute, with instant spin-up via instance pools for pipelines.
Explore Databricks pricing by understanding DBU-based per-second processing costs and cloud infrastructure fees, compare serverless and job compute options, and learn how regional and edition choices affect total bills.
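The DBU-plus-infrastructure arithmetic can be illustrated with a short calculation. This is a minimal sketch: the function name and every rate below are hypothetical placeholders, not real Databricks or Azure prices.

```python
def estimate_cost(dbu_per_hour, price_per_dbu, vm_price_per_hour, runtime_seconds):
    """Estimate a bill as DBU charges plus cloud infrastructure charges.

    Billing is per second, so the runtime is converted to a fraction of an hour.
    """
    hours = runtime_seconds / 3600
    dbu_cost = dbu_per_hour * price_per_dbu * hours
    infrastructure_cost = vm_price_per_hour * hours
    return dbu_cost + infrastructure_cost


# Hypothetical rates: 2 DBU/hour, 0.50 per DBU, 0.40 per VM hour, 30-minute run.
total = estimate_cost(
    dbu_per_hour=2.0,
    price_per_dbu=0.50,
    vm_price_per_hour=0.40,
    runtime_seconds=1800,
)
print(round(total, 2))
```

Swapping in the rates for a given region, edition, and compute type shows how each choice moves the total.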
Discover Databricks Runtime, a managed, optimized Spark-based execution environment with pre-installed libraries, LTS and ML variants, and versions featuring Spark 3.5.2 and 4.0.0.
Attach serverless compute to notebooks, files, and workflows to spin up in seconds and run code. Use it to run notebooks and Python files, powering jobs and pipelines for ETL.
Create an all-purpose compute cluster by configuring a single-node setup, choosing a Databricks runtime with Photon acceleration, and tuning autoscaling and auto-termination to balance cost and performance.
Compare serverless compute and all-purpose compute clusters in Databricks notebooks, noting serverless starts quickly and is always available, while all-purpose clusters take minutes to start and incur costs when active.
Explore how compute powers Databricks jobs, including serverless compute and job clusters, compare costs and spin up times, and learn how to configure tasks inside a workflow.
Create and manage an instance pool to keep idle virtual machines ready, improving cluster startup times on all-purpose or job compute clusters, while balancing idle costs.
Create a SQL serverless warehouse for data analysis, with quick startup and auto-stop, and use the SQL editor to run SQL queries or attach to notebooks for SQL-only execution.
Download and unzip the course materials, then import the zipped folder into your Databricks workspace shared area. Explore sections, notebooks, and hands-on code to learn Spark SQL with Python.
Explore the PySpark Python API for Spark SQL, learning to read, write, and transform data with the DataFrame API. Use the API reference to navigate Spark SQL and Spark Session.
Use SparkSession as the entry point to Spark SQL, DataFrames, streaming, machine learning, and graph APIs through a single unified object, with SparkSession.builder.getOrCreate() to establish the session.
Explore the anatomy of PySpark syntax by using the SparkSession to read CSV data into a DataFrame, filter results, and write CSV outputs, with single and chained calls.
Learn to use Unity Catalog in Azure Databricks, exploring the metastore, catalogs, schemas, and volumes, and create a catalog, schema, and volume to manage data with the Python API for Spark SQL.
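The read, filter, and write steps might look like the sketch below, written once with intermediate variables and once as a single chain. The function names, the population column, and the filter threshold are illustrative assumptions; spark is the session object that Databricks notebooks provide.

```python
def export_large_countries(spark, source_path, target_path):
    """Step-by-step form: each call result is stored in its own variable."""
    reader = spark.read                                   # DataFrameReader
    countries = reader.csv(source_path, header=True, inferSchema=True)
    large = countries.filter("population > 1000000")      # SQL-style condition string
    large.write.csv(target_path, header=True, mode="overwrite")


def export_large_countries_chained(spark, source_path, target_path):
    """Chained form: the same calls expressed as one expression."""
    (
        spark.read.csv(source_path, header=True, inferSchema=True)
        .filter("population > 1000000")
        .write.csv(target_path, header=True, mode="overwrite")
    )
```

Both functions issue the same calls in the same order; the chained form simply avoids naming the intermediate DataFrames.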
Upload the extracted countries dataset folder to the Unity Catalog volume in Databricks, preserving the directory structure by dragging the entire folder into the datasets volume.
Explore reading and writing data across CSV, text, JSON, Avro, Parquet, and Delta Lake formats in Spark. Learn how each format handles schemas, performance, and transactions.
Explore how to read and write data with Spark's DataFrameReader and DataFrameWriter APIs in PySpark, including CSV, JSON, Parquet, Delta, and Unity Catalog table support.
Learn how to read CSV data into a DataFrame with PySpark, including setting the path, header=True, and using spark.read.csv, load, or format with options, and display the results.
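The three reading styles can be compared side by side. This is a sketch only: the function name is an assumption, and spark is the session object available in a Databricks notebook.

```python
def read_csv_three_ways(spark, path):
    """Return the same CSV data read through three equivalent DataFrameReader calls."""
    # 1. Format-specific shortcut method.
    via_csv = spark.read.csv(path, header=True)
    # 2. Generic load method, with the format passed as an argument.
    via_load = spark.read.load(path, format="csv", header=True)
    # 3. Builder style: format, then options, then load.
    via_format = spark.read.format("csv").option("header", True).load(path)
    return via_csv, via_load, via_format
```

In a notebook, passing any of the three results to display renders it as an interactive table.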
Read data into Spark DataFrames by providing a path to a single CSV file or a directory; Spark reads all CSV files in subfolders and appends them.
Explore how Spark infers data types from CSV inputs, and when to prefer explicit schema definitions over inferSchema for efficient reads of large datasets.
Define the read schema with an SQL DDL string using the schema method, specifying columns and data types, with an option to store the schema in a variable.
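A DDL-string schema stored in a variable could look like this. The column names, types, and function name are hypothetical and stand in for whatever the dataset actually contains.

```python
# Hypothetical columns; adjust names and types to match the real file.
COUNTRIES_DDL = "country_id INT, name STRING, population BIGINT"


def read_countries(spark, path, ddl=COUNTRIES_DDL):
    """Read a CSV file with an explicit schema instead of inferring one."""
    # schema() accepts a DDL-formatted string as well as a StructType.
    return spark.read.schema(ddl).csv(path, header=True)
```

Keeping the DDL string in a variable lets several reads share one schema definition.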
Programmatically define a schema on read using PySpark objects like StructType and StructField, with data types and nullable settings. Import them from pyspark.sql.types to enable this approach.
Learn to save a Spark DataFrame to CSV using the csv and save methods, control headers and delimiters, and manage append or overwrite modes for incremental writes.
Read a CSV into a DataFrame and export it as JSON using the json and save methods. Specify a schema with StructType to enforce integers over strings when reading JSON.
Learn to read and write data in ORC format with Spark, leveraging stored schema metadata, schema-on-read, overwrite mode, and Snappy compression, via the orc and load methods.
Learn to work with Parquet data using Spark, writing and reading Parquet files with schema-on-read and Snappy compression, using the parquet method or the save method with overwrite mode.
Explore Delta Lake, built on Parquet, and its ACID transactions, scalable metadata, and schema enforcement to enable reliable streaming and batch workloads within Spark workflows.
Learn how to render a DataFrame using the show method and the display function in Databricks notebooks, control truncation and row count, and explore interactive data profiling and visualizations.
Partition your output data with partitionBy on region ID and sub-region ID, creating one folder per value and speeding up reads by skipping non-matching files.
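The per-value folder layout can be sketched as follows. The column names region_id and sub_region_id, the Parquet output format, and both function names are assumptions made for illustration.

```python
def partition_folder(values):
    """Build the column=value folder path that a partitioned write produces."""
    return "/".join(f"{column}={value}" for column, value in values.items())


def write_partitioned(df, path):
    """Write a DataFrame with one folder level per partition column."""
    (
        df.write.partitionBy("region_id", "sub_region_id")
        .mode("overwrite")
        .parquet(path)
    )


print(partition_folder({"region_id": 10, "sub_region_id": 20}))
```

A later read that filters on region_id only needs to open the folders whose names match, which is where the speed-up comes from.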
Explore the Databricks File System utilities (dbutils.fs) for copying, listing, previewing, moving, writing, and deleting files and directories, with examples of cp, head, ls, mv, and rm.
Create and manage Unity Catalog managed Delta tables using the three-level namespace, save DataFrames as tables, and query them with spark.read.table while controlling writes with append or overwrite modes.
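Saving to and reading from a three-level name might look like the sketch below. The helper names and the catalog, schema, and table values in the example are hypothetical.

```python
def full_table_name(catalog, schema, table):
    """Compose the three-level namespace: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def save_and_reload(spark, df, table_name, mode="overwrite"):
    """Save a DataFrame as a managed table, then read it back by name."""
    # mode is "append" to add rows or "overwrite" to replace the table contents.
    df.write.mode(mode).saveAsTable(table_name)
    return spark.read.table(table_name)


print(full_table_name("demo_catalog", "demo_schema", "countries"))
```

Building the name in one place keeps the catalog and schema consistent across every write and read in a notebook.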
Learn to run SQL queries with the Python API for Apache Spark using the spark.sql method in Databricks notebooks, returning a DataFrame and enabling Unity Catalog operations with SQL syntax.
Create managed tables with SQL to define schemas and store data. Use Spark SQL and the Python APIs to insert, query, and reproduce tables like countries_population_two and countries_population_three.
Create persistent views in Azure Databricks by using CREATE VIEW ... AS SELECT, persisting a Spark DataFrame as a read-only object in a schema and querying it anywhere via the metastore.
Learn to create catalogs, schemas, and volumes in Azure Databricks using SQL, specify managed locations, and verify results in the Catalog Explorer.
Drop Unity Catalog objects using SQL or the UI, including catalogs, schemas, volumes, tables, and views, with cascade to delete non-empty items, and reference them with the three-level namespace.
Create and use temporary views in Databricks to query DataFrames with local and global scopes, using CREATE TEMP VIEW, CREATE OR REPLACE TEMP VIEW, and their global temp view equivalents.
Convert Python data structures into Spark DataFrames with spark.createDataFrame, passing the data either as a list of lists or as a list of dictionaries, and define the schema by column names or with a SQL DDL string.
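The conversion can be sketched with two small sample datasets. The variable names, column names, and sample rows are invented for illustration; spark is the session object available in a Databricks notebook.

```python
# Invented sample rows, shown in the two shapes described above.
rows_as_lists = [[1, "France"], [2, "Japan"]]
rows_as_dicts = [{"id": 1, "name": "France"}, {"id": 2, "name": "Japan"}]


def build_dataframes(spark):
    """Create DataFrames from plain Python data in three ways."""
    # Column names only: Spark infers the data types.
    by_names = spark.createDataFrame(rows_as_lists, schema=["id", "name"])
    # SQL DDL string: names and types are both explicit.
    by_ddl = spark.createDataFrame(rows_as_lists, schema="id INT, name STRING")
    # Dictionaries: keys become column names, types are inferred.
    from_dicts = spark.createDataFrame(rows_as_dicts)
    return by_names, by_ddl, from_dicts
```

The DDL form is the most explicit of the three, since it fixes the types instead of leaving them to inference.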
I’m Malvik Vaghadia, a Data Engineer and Architect with nearly 15 years of professional experience. I'm also a recognised Databricks Champion, an honour given to a small global community for deep platform expertise and contribution to the wider ecosystem.
I’ve worked on multiple large-scale lakehouse implementations and consulted for enterprise clients. As an instructor, I’ve taught 200,000+ students worldwide and hold a 4.6+ instructor rating. Since launching this course, it has become one of Udemy’s best-sellers in the Databricks category, and this new version (Sept 2025) has been completely rebuilt with 17 hours of brand-new content.
Why Learn Databricks
Databricks is recognised as a Leader in the Gartner Magic Quadrant for Data & AI platforms. It has become the go-to lakehouse platform for modern data engineering, enabling organisations to build, orchestrate, and optimise pipelines at scale. By mastering Databricks, you’ll be learning one of the most in-demand skills in today’s data landscape.
Course Delivery Style
This course is designed with the right balance of theory, hands-on coding, and practical projects. Every concept is explained clearly, then demonstrated live in Databricks, and reinforced with a multi-phase, end-to-end project that you’ll build step by step. You’ll also get all course notebooks as downloadable materials, containing the full code, step-by-step documentation, and extra resources so you can follow along easily.
Curriculum Highlights:
Four Part Course Project: End-to-end NYC Taxi project and further pipeline builds across multiple parts as you develop your knowledge.
Foundations: What data engineering is, why Databricks, the Spark architecture, PySpark, and the Lakehouse.
Azure setup: Account creation, resources, role-based access control, naming conventions, and cost management.
Databricks setup: Creating and configuring a workspace, navigating the UI, and handling personal email restrictions.
Databricks notebooks and workspace: Markdown, comments, organising objects, mixing languages, and notebook tips.
Databricks compute: Clusters, DBU pricing, runtimes, serverless vs all-purpose compute, instance pools, and SQL warehouses.
Spark SQL (Python): Writing Spark SQL code using both SQL syntax and DataFrame APIs, reading/writing different file formats, defining schemas, and managing tables and views.
PySpark Transformations: Column operations, functions, filtering, sorting, joining, aggregations, pivots, and conditional logic.
Medallion architecture: Bronze, Silver, and Gold layers explained and implemented.
Delta Lake: Transaction log, schema enforcement and evolution, time travel, and DML operations (MERGE, UPDATE, DELETE).
Workflows and jobs: Passing parameters, handling failures, concurrency, conditional tasks, and monitoring.
Git & local development: VS Code setup, linking with GitHub, repos, and workflow best practices.
Functions and modularization: Creating and importing Python modules, UDFs, and project structuring.
Unity Catalog & governance: Metastores, securable objects, workspace roles, external locations, and permissions.
Streaming & Lakeflow pipelines: Structured Streaming concepts, Auto Loader, watermarking, triggers, and the new Lakeflow (DLT) pipeline model.
Performance: Lazy evaluation, explain plans, caching, shuffles, broadcast joins, partitioning, Z-ORDER, and Liquid Clustering.
Automation & CI/CD: Programmatic interaction with Databricks, CLI demo, and high-level CI/CD overview.
By the end of the course, you’ll have both the knowledge and confidence to design, build, and optimise production-grade data pipelines on Databricks.