What you'll learn

Optimize Spark applications using partitions, caching, execution plans, join strategies, and cluster sizing techniques
Build scalable Delta Lake pipelines using MERGE, SCD Type 1 & 2, Time Travel, Partitioning, and Z-Ordering
Develop incremental and idempotent ETL pipelines using Auto Loader, COPY INTO, schema evolution, and logging
Process complex JSON, XML, and Excel files in Azure Databricks using advanced PySpark transformation techniques

Course content

8 sections • 25 lectures • 30h 53m total length

Spark Partitions | Architecture, Shuffle | Repartition vs Coalesce Explained • 1:30:48
Spark Cache vs Persist vs Unpersist Explained | Why Spark Recomputes Jobs? • 1:31:43
Spark Cache vs Persist Hands-On | Tungsten Optimization | Parquet vs ORC vs Avro • 1:36:08
Spark Execution Plan Explained | Lazy Evaluation | Catalyst & Tungsten Optimizer • 1:09:30
Spark Join Algorithms Explained | Sort Merge vs Broadcast vs Shuffle Hash • 1:30:25
Databricks Cluster Sizing Explained | Memory & Partition Calculation • 58:32

Requirements

Basic knowledge of Azure Databricks, PySpark DataFrames, joins, and Python programming is recommended
Students should be familiar with Databricks notebooks, clusters, and basic ETL development concepts
A free Azure account or Databricks workspace is recommended for hands-on practice and exercises
Completion of a beginner-level Azure Databricks or PySpark course will help learners understand concepts faster

Description

Welcome to the Azure Databricks Intermediate course, designed for data engineers, PySpark developers, and ETL professionals who want to build real-world Azure Databricks skills beyond the beginner level.

This course focuses on intermediate-level Azure Databricks concepts with practical hands-on examples, performance optimization techniques, Delta Lake implementations, and production-oriented ETL development scenarios used in real-world data engineering projects.

You will start by understanding advanced Spark concepts such as partitions, shuffle operations, repartition vs coalesce, caching, persistence, execution plans, Catalyst Optimizer, Tungsten Optimization, and Spark join algorithms including broadcast joins, sort merge joins, and shuffle hash joins.

The course also covers processing complex file formats such as JSON, XML, and Excel using PySpark. You will learn advanced transformation techniques including flattening nested JSON structures using explode and arrays_zip functions.
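To make the flattening step concrete, here is a plain-Python model of what zipping parallel arrays and then exploding them does to one nested record. It is only a sketch of the semantics, not course material: in PySpark the same effect comes from `explode(arrays_zip(...))`, and the helper name `zip_and_explode` and the sample `order` record are hypothetical.

```python
from itertools import zip_longest


def zip_and_explode(record, array_fields):
    """Return one flat dict per array position, mimicking explode(arrays_zip(...))."""
    # Scalar columns are repeated on every output row.
    scalars = {k: v for k, v in record.items() if k not in array_fields}
    # A missing or null array contributes no elements.
    arrays = [record.get(field) or [] for field in array_fields]
    rows = []
    # Shorter arrays are padded with None, as arrays_zip pads with null.
    for values in zip_longest(*arrays, fillvalue=None):
        row = dict(scalars)
        row.update(zip(array_fields, values))
        rows.append(row)
    return rows


# Hypothetical nested record with two parallel arrays.
order = {"order_id": 1, "items": ["pen", "ink"], "qty": [2, 1]}
flat = zip_and_explode(order, ["items", "qty"])
```

Each element pair becomes its own row carrying the parent `order_id`, which is the shape usually wanted before writing nested JSON to a flat table.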

A major focus of this course is Delta Lake. You will learn Delta table creation, managed vs external tables, merge operations, insert, update, delete, time travel, restore, vacuum, partitioning strategies, Z-Ordering, liquid clustering, and performance optimization techniques.

You will also build Slowly Changing Dimension (SCD) Type 1 and Type 2 pipelines using Delta Lake merge logic, audit columns, and hash key generation techniques commonly used in enterprise ETL projects.

Additionally, the course covers Auto Loader, schema evolution, idempotent data pipelines, Databricks cluster types, ADLS integration, ETL logging frameworks, and cluster sizing concepts.
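What "idempotent" means for an ingestion pipeline can be shown with a small plain-Python sketch: re-running the load must not duplicate data. The function and variable names below are hypothetical, and the in-memory set only stands in for the file-tracking state that tools such as Auto Loader (via its checkpoint) and COPY INTO maintain for you.

```python
def load_new_files(available_files, processed_log, target_rows):
    """Append rows only from files that are not yet recorded in processed_log."""
    loaded = []
    for name in sorted(available_files):
        if name in processed_log:
            continue  # already ingested on an earlier run: skip
        target_rows.extend(available_files[name])
        processed_log.add(name)
        loaded.append(name)
    return loaded


# Hypothetical landing-zone contents: file name -> rows in that file.
files = {"a.json": [1, 2], "b.json": [3]}
log, target = set(), []

first_run = load_new_files(files, log, target)
second_run = load_new_files(files, log, target)  # re-run with no new files
```

The second run loads nothing and leaves the target unchanged, which is the property that makes retries and scheduled re-runs safe.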

By the end of this course, you will have practical experience in building scalable, optimized, and production-ready Azure Databricks ETL pipelines using Delta Lake and PySpark.

Who this course is for:

Data engineers and PySpark developers who want to improve Spark optimization and Delta Lake development skills
Azure Databricks professionals looking to build scalable ETL pipelines using Auto Loader and Delta Lake
Working professionals preparing for intermediate Databricks, Spark, and Azure Data Engineering interviews
Learners who already understand Databricks basics and want to move into real-world production concepts

Course content

Advanced Spark Architecture & Performance • 6 lectures • 8hr 17min

Working with File Formats in Databricks • 2 lectures • 2hr 12min

Python for Real-Time ETL Development • 4 lectures • 5hr 14min

Delta Lake Fundamentals & Operations • 4 lectures • 5hr 3min

Delta Lake Performance Optimization • 3 lectures • 3hr 38min

Slowly Changing Dimensions (SCD) • 2 lectures • 2hr 19min

Incremental & Idempotent Data Pipelines • 2 lectures • 2hr 12min

Databricks Compute & ADLS Integration • 2 lectures • 1hr 58min