
Explore Apache Spark development environments across cloud and on-premises platforms, focusing on Databricks and Cloudera, browser notebooks, and local IDEs like PyCharm and VS Code.
Explore when batch versus stream processing is needed and master key challenges (back pressure, incremental processing, checkpointing, fault tolerance, and late data) using Spark Structured Streaming.
Transform a batch word count into a streaming application using Spark Structured Streaming, adopting readStream, writeStream, incremental data processing, and a checkpoint location.
Create a Spark Streaming application to ingest JSON invoices from a landing zone, explode and flatten line items, and write denormalized records to a Delta table.
Read data from the invoices Kafka topic with the Spark Kafka connector. Load it into a Spark DataFrame and inspect the binary key, value, and timestamps.
Learn to build a Spark Structured Streaming Kafka sink by reading JSON invoices from a landing zone, applying a where filter, and writing key-value messages to a Kafka topic.
Explore the Azure Databricks architecture, detailing control plane and data plane with a single consolidated bill. See how DBFS abstracts storage and enables seamless access to Azure Data Lake Gen2.
Learn to create and configure an Azure Databricks cluster, choosing multi-node or single-node setups, and manage runtime, auto scaling, and essential tools like notebooks, libraries, and Spark UI.
Unity Catalog is a Databricks premium service for metadata and user management that stores catalogs, schemas, tables, volumes, and storage locations in a metastore, using SQL grants for access control.
Design a Databricks lakehouse with bronze, silver, and gold layers to ingest device, profile, BPM, login, and workout data and deliver gold tables for workout BPM and gym summaries.
Design a lakehouse workflow to process five input datasets for analytical use, syncing profile updates from a Kafka topic to the cloud database while decoupling operational workloads.
Implement data security by inserting a Unity Catalog layer between storage and compute, enforcing fine-grained access, blocking direct directory access, and granting read/write on databases and tables via groups.
Start coding in a Databricks development workspace, connect to Azure DevOps, and create a feature branch to develop setup notebooks and DDL scripts for bronze, silver, and gold tables.
Load historical and lookup data with a history loader that populates the date_lookup dimension in the silver layer, using a one-time data load from cloud storage.
Explore an end-to-end integration testing strategy for Spark Streaming in the lakehouse, using two payloads, test data preparation, and three notebooks to validate gold layer reports.
Learn to create and configure a release pipeline in Azure DevOps, linking a build artifact, using an ubuntu-22.04 agent, and deploying via Databricks CLI with integration testing.
Install Java JDK on Windows to satisfy Spark prerequisites, set JAVA_HOME and PATH, verify with java -version, and configure Hadoop WinUtils with HADOOP_HOME for Windows environments.
Configure and test your Spark development IDE on Windows with PyCharm, a conda environment, and Python 3.7. Install PySpark, open the Hello Spark SQL project, and verify the setup.
Install and configure Apache Spark on macOS by setting up JDK 8 or 11 and JAVA_HOME, installing Spark and setting SPARK_HOME, updating PATH, and enabling pyspark with python3.
Explore how stream processing extends batch processing with Spark, addressing late-arriving records and incremental calculations, while supporting scheduling and fault handling across streaming workloads.
Learn to read from Kafka as a streaming source with Spark Structured Streaming, flatten invoice items, apply from_json to value, explode items, and sink to a file system.
Create a Kafka Avro sink by reading JSON from Kafka, applying a schema, flattening invoices, and writing key-value messages with an Avro payload using spark-avro.
Explore event time and windowing in Spark Streaming, mastering tumbling and sliding windows, and learn how trigger time and event time shape 15-minute time-bound aggregates.
Explore Spark Streaming output modes (complete, update, and append) and their effects on state cleanup, windowed aggregates, and watermark behavior with event-time windows.
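To give a feel for the 15-minute tumbling windows and late-arriving records mentioned in the lectures above, here is a minimal plain-Python sketch. It is not the Spark API and is not taken from the course material; the names `window_start` and `tumbling_totals` and the sample events are hypothetical, chosen only to illustrate how records are grouped by event time rather than by arrival order.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed window size, matching the 15-minute aggregates mentioned above.
WINDOW = timedelta(minutes=15)


def window_start(event_time: datetime, size: timedelta = WINDOW) -> datetime:
    """Return the start of the tumbling window that contains event_time."""
    epoch = datetime(1970, 1, 1)
    # Time elapsed since the current window opened.
    offset = (event_time - epoch) % size
    return event_time - offset


def tumbling_totals(events):
    """Sum amounts per window, keyed by event time (not arrival order)."""
    totals = defaultdict(int)
    for event_time, amount in events:
        totals[window_start(event_time)] += amount
    return dict(totals)


# Hypothetical events as (event time, amount), listed in arrival order.
# The third record arrives late: its event time is earlier than the second's.
events = [
    (datetime(2024, 1, 1, 10, 2), 100),
    (datetime(2024, 1, 1, 10, 17), 50),
    (datetime(2024, 1, 1, 10, 14), 25),
]

totals = tumbling_totals(events)
for start in sorted(totals):
    print(start.time(), "-", (start + WINDOW).time(), totals[start])
```

The late record carries an event time of 10:14, so it is counted in the 10:00 window even though it arrives after the 10:17 record. In Spark Structured Streaming, a watermark bounds how long a window keeps accepting such late data before its state is cleaned up.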
About the Course
I am creating Apache Spark and Databricks - Stream Processing in Lakehouse using the Python language and the PySpark API. This course will help you understand real-time stream processing using Apache Spark and Databricks Cloud and apply that knowledge to build real-time stream processing solutions. The course is example-driven and follows a working-session approach: we code live and explain all the needed concepts as we go.
Capstone Project
This course also includes an end-to-end capstone project. The project will help you understand real-life project design, coding, implementation, testing, and the CI/CD approach.
Who should take this Course?
I designed this course for software engineers who want to develop real-time stream processing pipelines and applications using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organization’s data-centric infrastructure. Another audience is managers and architects who do not work directly on Spark implementation but work with those implementing Apache Spark at the ground level.
Spark Version Used in the Course
This course uses Apache Spark 3.5. I have tested all the source code and examples used in this course on Azure Databricks Cloud using Databricks Runtime 14.1.