Spark Streaming - Stream Processing in Lakehouse - PySpark

Name: Spark Streaming - Stream Processing in Lakehouse - PySpark
Rating: 4.6 (2205 reviews)

Master Spark Structured Streaming using Python (PySpark) on Azure Databricks Cloud with an end-to-end project

Created by Prashant Kumar Pandey, Learning Journal

Last updated 8/2024

English

What you'll learn

Real-time Stream Processing Concepts
Spark Structured Streaming APIs and Architecture
Working with Streaming Sources and Sinks
Kafka for Data Engineers
Working With Kafka Source and Integrating Spark with Kafka
Stateless and Stateful Streaming Transformations
Windowing Aggregates using Spark Streaming
Watermarking and State Cleanup
Streaming Joins and Aggregation
Handling Memory Problems with Streaming Joins
Working with Azure Databricks
Capstone Project - Streaming application in Lakehouse

Course content

9 sections • 108 lectures • 22h 23m total length

About the Course • 5:57
Course Prerequisite • 2:27
Source Code and Other Resources • 0:13
Note for Students - Before Start • 2:05

Batch processing to stream processing • 39:58
Explore when batch versus stream processing is needed and master key challenges (back pressure, incremental processing, checkpointing, fault tolerance, and late data) using Spark Structured Streaming.
Your Spark application - Applying Best Practice • 37:24
Your first streaming application - Implementing Stream • 21:06
Transform a batch word count into a streaming application using Spark Structured Streaming, adopting read stream, write stream, incremental data processing, and a checkpoint location.
Stream Processing Model in Spark • 16:25
Create Another Streaming Application • 35:33
Create a Spark streaming application to ingest JSON invoices from a landing zone, explode and flatten line items, and write denormalized records to a Delta table.
Stream Triggers • 11:14
Incremental Batch Processing • 25:28
Streaming Sources and Sinks • 6:13
Creating Chain of Streams • 34:15
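
The "Create Another Streaming Application" lecture above describes exploding and flattening invoice line items into denormalized records. As a rough illustration of that idea only, here is a plain-Python sketch (not PySpark, and not the course's code). The field names `InvoiceNumber`, `StoreID`, `InvoiceLineItems`, `ItemCode`, `ItemQty`, and `ItemPrice` are hypothetical and chosen just for this example.

```python
from typing import Any, Dict, Iterator

# Hypothetical name of the nested array field; the real schema may differ.
LINE_ITEMS_FIELD = "InvoiceLineItems"


def flatten_invoice(invoice: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per line item, copying invoice-level fields onto each."""
    header = {k: v for k, v in invoice.items() if k != LINE_ITEMS_FIELD}
    for item in invoice.get(LINE_ITEMS_FIELD, []):
        # Merge invoice-level fields with the fields of a single line item.
        yield {**header, **item}


# Sample data invented for this sketch.
invoices = [
    {
        "InvoiceNumber": "INV-1",
        "StoreID": "STR-1",
        LINE_ITEMS_FIELD: [
            {"ItemCode": "A", "ItemQty": 2, "ItemPrice": 10.0},
            {"ItemCode": "B", "ItemQty": 1, "ItemPrice": 5.0},
        ],
    },
    # An invoice with no line items produces no output rows.
    {"InvoiceNumber": "INV-2", "StoreID": "STR-2", LINE_ITEMS_FIELD: []},
]

flat_rows = [row for inv in invoices for row in flatten_invoice(inv)]
```

In this sketch an invoice with an empty line-item list contributes no rows, which mirrors how an explode-style operation usually behaves. In a streaming job the same per-record logic would be applied to each micro-batch rather than to an in-memory list.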

An Introduction to Kafka • 24:29
Creating Kafka Cluster in Cloud • 13:13
Kafka Core Concepts • 20:48
Producing Data to Kafka Topic • 36:05
Consuming Data from Kafka Topic • 15:33
Read data from the invoices Kafka topic with the Spark Kafka connector. Load it into a Spark DataFrame and inspect the binary key, value, and timestamps.
Working with Kafka Topic Data • 30:00
How to Implement Idempotence • 28:56
Working with Kafka Sink • 31:35
Learn to build a Spark Structured Streaming Kafka sink by reading JSON invoices from a landing zone, applying a where filter, and writing key-value messages to a Kafka topic.

Introduction to Databricks • 10:22
Creating Azure Free Account • 10:03
Azure Portal Overview • 5:19
Creating Azure Databricks Service • 11:36
Introduction to Azure Databricks Workspace • 3:48
Azure Databricks Architecture • 16:12
Explore the Azure Databricks architecture, detailing control plane and data plane with a single consolidated bill. See how DBFS abstracts storage and enables seamless access to Azure Data Lake Gen2.
Creating Azure Databricks Cluster • 16:34
Learn to create and configure an Azure Databricks cluster, choosing multi-node or single-node setups, and manage runtime, auto scaling, and essential tools like notebooks, libraries, and Spark UI.
Introduction to Databricks Notebooks • 13:08
Notebooks Magic Commands and Utilities • 7:48
Databricks Notebooks Utilities • 17:19
Introduction to Databricks Unity Catalog • 10:31
Unity Catalog is a Databricks premium service for metadata and user management that stores catalogs, schemas, tables, volumes, and storage locations in a metastore, using SQL grants for access control.
Introduction to Databricks Workflow Jobs • 15:02
Introduction to Databricks REST API • 18:45
Introduction to Databricks CLI • 12:22

Project Scope and Background • 8:48
Design a Databricks lakehouse with bronze, silver, and gold layers to ingest device, profile, bpm, login, and workout data and deliver gold tables for workout bpm and gym summaries.
Taking out the operational requirement • 3:56
Design a lakehouse workflow to process five input datasets for analytical use, syncing profile updates from a Kafka topic to the cloud database while decoupling operational workloads.
Storage Design • 6:07
Implement Data Security • 4:55
Implement data security by inserting a Unity Catalog layer between storage and compute, enforcing fine-grained access, blocking direct directory access, and granting read/write on databases and tables via groups.
Implement Resource Policies • 1:56
Decouple Data Ingestion • 3:55
Design Bronze Layer • 3:25
Design Silver and Gold Layer • 3:44
Setup your source control • 3:50
Setup your environment • 2:52
Create a development workspace • 7:23
Create and Configure Storage Layer • 8:19
Create Unity Catalog Metastore • 6:46
Create Catalog and External Locations • 9:35
Start Coding • 10:34
Launch coding in a Databricks development workspace, connect to Azure DevOps, and create a feature branch to develop setup notebooks and DDL scripts for bronze, silver, and gold tables.
Test your code • 12:28
Load historical data • 5:16
Load historical and lookup data with a history loader that populates the date_lookup dimension in the silver layer, using a one-time data load from cloud storage.
Ingest into bronze layer • 8:40
Process the silver layer • 12:27
Handling multiple updates • 5:17
Implementing Gold Layer • 5:41
Creating a run script • 8:47
Preparing for Integration testing • 3:49
Outline an end-to-end integration testing strategy for Spark streaming in the lakehouse, using two payloads, test data preparation, and three notebooks to validate gold layer reports.
Creating Test Data Producer • 2:45
Creating Integration Test for Batch mode • 6:11
Creating Integration Test for Stream mode • 17:42
Implementing CI CD Pipeline • 9:49
Develop Build Pipeline • 7:54
Develop Release Pipeline • 17:38
Learn to create and configure a release pipeline in Azure DevOps, linking a build artifact, using an ubuntu-22.04 agent, and deploying via Databricks CLI with integration testing.
Creating Databricks CLI Script • 4:00

Spark Development Environment • 1:44
Windows User - Spark Installation Prerequisites • 5:25
Install Java JDK on Windows to satisfy Spark prerequisites, set JAVA_HOME and PATH, verify with java -version, and configure Hadoop WinUtils with HADOOP_HOME for Windows environments.
Windows User - Installing Apache Spark • 8:26
Windows User - Setup and test your IDE • 4:26
Configure and test your Spark development IDE on Windows with PyCharm, a conda environment, and Python 3.7. Install PySpark, open the Hello Spark SQL project, and verify the setup.
Mac User - Installing Apache Spark • 12:08
Install and configure Apache Spark on macOS by setting up JDK 8 or 11 and JAVA_HOME, installing Spark and SPARK_HOME, updating PATH, and enabling pyspark with python3.
Mac User - Setup and test your IDE • 7:59
Install and run Apache Kafka • 10:24
Stream processing model in Spark • 8:34
Working with Files and Directories • 11:36
Streaming Sources, Sinks and Output Mode • 12:56
Fault Tolerance and Restarts • 6:29
Introduction to Stream Processing • 9:16
Explore how stream processing extends batch processing with Spark, addressing late-arriving records and incremental calculations, while supporting scheduling and fault handling across streaming workloads.
Spark Streaming APIs - DStream Vs Structured Streaming • 3:50
Creating your first stream processing application • 16:12
Streaming from Kafka as a Source • 16:55
Learn to read from Kafka as a streaming source with Spark Structured Streaming, flatten invoice items, apply from_json to value, explode items, and sink to a file system.
Working with Kafka Sinks • 10:32
Multi-query Streams Application • 4:15
Kafka Serialization and Deserialization for Spark • 5:00
Creating Kafka AVRO Sinks • 4:05
Create a Kafka Avro sink by reading JSON from Kafka, applying a schema, flattening invoices, and writing key-value with Avro payload using spark-avro.
Working with Kafka AVRO Source • 4:53
Stateless Vs Stateful transformations • 10:18
Event time and Windowing • 7:14
Explore event time and windowing in Spark Streaming, mastering tumbling and sliding windows, and learn how trigger time and event time shape 15-minute time-bound aggregates.
Tumbling Window aggregate • 14:03
Watermarking your windows • 12:42
Watermark and output modes • 9:59
Explore Spark Streaming output modes: complete, update, and append, and their effects on state cleanup, windowed aggregates, and watermark behavior with event-time windows.
Sliding Window • 7:38
Joining Stream to static source • 14:24
Joining Stream to another Stream • 9:56
Streaming Watermark • 7:12
Streaming Outer Joins • 11:10
Final Word • 0:50
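
The "Event time and Windowing" lecture above mentions tumbling and sliding windows over 15-minute aggregates. The following plain-Python sketch (not the Spark API, and not taken from the course) shows how an event time could be assigned to such windows. It assumes windows are start-inclusive, end-exclusive, and aligned to the Unix epoch. The 5-minute slide and the function names are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start inclusive, end exclusive)
EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for window boundaries


def tumbling_window(event_time: datetime, size: timedelta) -> Window:
    """Return the single fixed-size window that contains event_time."""
    start = event_time - ((event_time - EPOCH) % size)
    return (start, start + size)


def sliding_windows(event_time: datetime, size: timedelta, slide: timedelta) -> List[Window]:
    """Return every overlapping window that contains event_time, earliest first."""
    # Latest window start at or before the event, aligned to the slide interval.
    start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + size > event_time:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


event = datetime(2024, 8, 1, 10, 7)
tumbling = tumbling_window(event, timedelta(minutes=15))
sliding = sliding_windows(event, timedelta(minutes=15), timedelta(minutes=5))
```

With these inputs the event at 10:07 falls into one tumbling window starting at 10:00, and into three sliding windows starting at 9:55, 10:00, and 10:05. An event lands in exactly one tumbling window but in size/slide sliding windows, which is why sliding aggregates keep more state.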

Requirements

Spark Fundamentals and exposure to Spark Dataframe APIs
Programming Knowledge Using Python Programming Language

Description

About the Course

This course, Apache Spark and Databricks - Stream Processing in Lakehouse, uses the Python language and the PySpark API. It will help you understand real-time stream processing using Apache Spark and Databricks Cloud and apply that knowledge to build real-time stream processing solutions. The course is example-driven and follows a working-session approach: we code live and explain all the needed concepts along the way.

Capstone Project

This course also includes an end-to-end Capstone project. The project will help you understand real-life project design, coding, implementation, testing, and the CI/CD approach.

Who should take this Course?

I designed this course for software engineers who want to develop real-time stream processing pipelines and applications using Apache Spark. It is also for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. A third audience is managers and architects who do not work directly on Spark implementation but work with those implementing Apache Spark at the ground level.

Spark Version used in the Course.

This course uses Apache Spark 3.5. I have tested all the source code and examples used in this course on Azure Databricks Cloud using Databricks Runtime 14.1.

Who this course is for:

Software engineers and architects who want to design and develop Big Data engineering projects using Apache Spark and Databricks Cloud
Programmers and developers who aspire to grow and learn data engineering using Apache Spark and Databricks Cloud

Course content

Before you start • 4 lectures • 11min

Setup your environment • 3 lectures • 40min

Getting Started with Spark Streaming • 9 lectures • 3hr 48min

Kafka for Data Engineers • 8 lectures • 3hr 21min

Streaming Aggregates and State Management • 8 lectures • 3hr 30min

Working with Databricks Platform • 14 lectures • 2hr 49min

Capstone Project - Implementing Real-time Project in Lakehouse • 30 lectures • 3hr 34min

Final Word • 1 lecture • 1min

Archived - Old Course Content • 31 lectures • 4hr 31min
