Databricks Stream Processing with PySpark in 15 Days
Rating: 3.8 out of 5 (15 ratings)
28 students

Master Spark Structured Streaming with PySpark on Databricks through a Complete End to End Real Life Project
Created by Md Samiul Islam
Last updated 4/2025
English

What you'll learn

  • Concept of Real-time Stream Processing in Databricks
  • Spark Structured Streaming APIs and Medallion Architecture
  • Working with Different Streaming Sources and Sinks
  • Working With Kafka Source and Integrating with Spark
  • Windowing Aggregates using Spark Stream & Streaming Joins and Aggregation
  • Concept of Stateless and Stateful Streaming Transformations
  • Handling Memory Problems with Streaming
  • Working with Azure Databricks Platform
  • Real Life Final Project - Streaming application in Lakehouse

Course content

4 sections • 17 lectures • 2h 31m total length
  • Introduction of Spark Streaming in Databricks (5:32)

    Hello, and welcome to this course on Spark Streaming on the Databricks platform for data engineers.

    My name is Mohammad Samiul Islam, and I am going to be your instructor for this course. I am a full-time data engineer, currently working as a Lead Data Engineer for a multinational software firm. I have over 10 years of experience working on large data-related projects and data migration projects. I am an Azure Certified Data Engineer Associate, and I have also completed the Informatica Cloud Data Engineering certification. So I hope I am well qualified to teach you in this course, and I will do my best to make your experience enjoyable.

    The main goal of this course is to help you understand Spark Structured Streaming from the beginning to the advanced level using PySpark. Apart from this, I will cover Kafka and Databricks as well as CI/CD pipelines. As for the learning steps, we will first learn Spark Structured Streaming. Then I will help you learn Kafka as well. I will begin with a fundamental overview of Kafka and guide you to a decent level of knowledge about it. I will also discuss what role Kafka plays in our data engineering platform and teach you how to integrate Kafka with Spark Structured Streaming for real-time data processing.

    I will also help you understand how to create a unified program that can be executed either as a batch processing job or as a stream. Spark Structured Streaming is well suited to this, and helping you build such a unified application is one of the highest priorities of this course. Once the application has been built, you can run it as a stream or as a batch job without changing a single line of code. We will investigate this capability and discover how to put it into practice. The technologies for putting this scenario into practice will be covered.
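As a rough illustration of the "unified program" idea, the sketch below applies one transformation function to a bounded list (batch) and to a lazily consumed generator (stream). It uses plain Python iterables rather than Spark, and the record fields and function names are hypothetical. In PySpark the comparable switch is reading the input with `spark.read` versus `spark.readStream` and passing the resulting DataFrame to the same transformation logic.

```python
from typing import Dict, Iterable, Iterator


def to_amount_events(records: Iterable[Dict]) -> Iterator[Dict]:
    """Shared logic: drop invalid rows and compute an amount per record."""
    for rec in records:
        if rec["quantity"] <= 0:
            continue  # skip rows with no traded quantity
        yield {"symbol": rec["symbol"], "amount": rec["quantity"] * rec["price"]}


def event_stream() -> Iterator[Dict]:
    """Hypothetical source that yields records one at a time."""
    yield {"symbol": "ABC", "quantity": 10, "price": 2.5}
    yield {"symbol": "XYZ", "quantity": 0, "price": 9.0}
    yield {"symbol": "ABC", "quantity": 4, "price": 3.0}


# Batch: the whole input is materialized up front.
batch_input = list(event_stream())
batch_result = list(to_amount_events(batch_input))

# Stream: records are pulled lazily, one at a time, through the same function.
stream_result = list(to_amount_events(event_stream()))
```

Because the transformation only depends on the iterable interface, neither call site needs a different version of the logic.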

    We will then learn how to combine batch and stream processing in a big project. For each example, we will cover the corresponding test cases and test suites, how to test various items, and how to automate your integration and unit tests. We will be learning this throughout, and it is a critical component of every successful project. By the end of the project, we will learn how to automate continuous integration and deployment, as well as the fundamentals of Azure DevOps and CI/CD implementation. We will also put a functional CI/CD pipeline into place for our capstone project, that is, our final Lakehouse project.

    So, these are the full steps of our course, and we will cover all of them.

    I believe one thing very strongly: indeed, Allah SWT will not change the condition of a people until they change what is in themselves.

    Who is this course for? This course is for people who work with data: university students who are fascinated by data engineering, IT professionals working in the data sector, and, most of all, data engineers and data architects.

    Why should you choose this course? This course is 100% project-based, ensuring practical learning. We will learn everything from fundamental tools to advanced concepts like Spark Streaming, Kafka, and the Lakehouse, and we will build them into a solution. This course aligns with the Databricks Data Engineer Associate certification, so it will help you achieve the certification and give you a good understanding of what it covers.

    And this comes with full lifetime access.

  • Process Comparison Between Batch & Stream (8:07)

    In this session, we will compare batch processing and stream processing.

    Stream processing becomes essential when we need to process data in real time or near real time, as opposed to batch processing, which handles data in large batches at periodic intervals. Transitioning from batch to stream processing introduces several challenges that need to be addressed to ensure efficient and accurate data processing.

    Let's start with an example to understand these challenges. Imagine you are working for a large stock brokerage company that provides trading services through various platforms such as desktop terminals, web applications, and mobile apps. This company wants to generate a trade summary report that updates every 30 minutes. The report includes the total buy and sell amounts as well as the settlement amount, plotted over time.

    To build this solution, we can break it into three main steps: data ingestion, data processing, and result storage.

    First, we need to ingest data from the trading system into our data engineering platform. Assume there is a magical system that collects data from all trading platforms and stores it centrally. We can create a data ingestion pipeline that pulls the data from that system every 30 minutes and sends it to the landing zone in our platform.

    Once the data is ingested, we process it using the medallion architecture. We will discuss the medallion architecture in a later session, but for a basic understanding, it involves creating a raw data zone called the bronze layer, a high-quality layer called the silver layer, and a final layer called the gold layer. We start by saving the raw data into a bronze table, then clean and standardize it in the silver layer, and finally perform the necessary calculations to generate the trade summary report in the gold layer. The results are stored in a final table, which consumers can access to create visualizations.
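The bronze, silver, and gold steps described here can be sketched in a few lines of plain Python. This is only an illustration of the layering idea, not the course's implementation: the sample records, field names, and function names are hypothetical, and in-memory lists stand in for the tables.

```python
from collections import defaultdict

# Hypothetical raw trade records as they might land from the trading system.
raw_trades = [
    {"side": "buy ", "amount": "100.0", "ts": "11:05"},
    {"side": "SELL", "amount": "40.0", "ts": "11:10"},
    {"side": "buy", "amount": "bad", "ts": "11:12"},  # malformed amount
    {"side": "Buy", "amount": "60.0", "ts": "11:20"},
]


def to_bronze(records):
    """Bronze: keep the raw data exactly as it arrived."""
    return list(records)


def to_silver(bronze):
    """Silver: clean and standardize; drop rows whose amount cannot be parsed."""
    cleaned = []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        cleaned.append(
            {"side": row["side"].strip().lower(), "amount": amount, "ts": row["ts"]}
        )
    return cleaned


def to_gold(silver):
    """Gold: aggregate total amounts per side for the trade summary report."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["side"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(raw_trades)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the raw copy in the bronze step means the silver and gold steps can be rerun later if the cleaning rules change.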

    To execute this workflow, we need an orchestration tool that schedules and runs the jobs in sequence. However, the critical question is: how frequently should this pipeline run? If the customer requires the report to refresh every 30 minutes, then the entire pipeline must complete within that time frame. If you look at the image here, the main point is that if the process starts at 11:30, it needs to complete before 12:00 PM. Within this time frame we need to complete the whole processing, and step by step, every 30 minutes, the process will be completed.

    This is batch processing. But if we want to move to stream processing, this introduces several challenges. The first challenge is called back pressure. As the frequency of processing increases, from every 30 minutes to every second, millisecond, or microsecond, the system must handle the growing volume of data without delays.
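To make the timing constraint concrete, here is a small sketch that tracks how far behind schedule a pipeline falls when each run takes longer than the interval between runs. The function name and the example durations are assumptions chosen for illustration; only the 30-minute interval comes from the lecture.

```python
def simulate_lag(interval_s: float, processing_s: float, runs: int) -> list:
    """Return how many seconds behind schedule the pipeline is after each run."""
    lag = 0.0
    history = []
    for _ in range(runs):
        # Each run adds its overrun to the lag, or uses spare time to catch up,
        # but the pipeline can never get ahead of the schedule.
        lag = max(0.0, lag + processing_s - interval_s)
        history.append(lag)
    return history


# 30-minute interval (1800 s) with runs that take 25 minutes: no lag builds up.
keeps_up = simulate_lag(interval_s=1800, processing_s=1500, runs=4)

# Same interval with runs that take 35 minutes: the lag grows by 300 s per run.
falls_behind = simulate_lag(interval_s=1800, processing_s=2100, runs=4)
```

The second case never recovers on its own, which is why shrinking the interval without shrinking the processing time is a problem.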

    If the processing time exceeds the allowed window, delays accumulate, leading to back pressure, and the system struggles to keep up with the incoming data.

    The next challenge is incremental data processing. To reduce the processing time, we can process only the newly arrived data, which is called incremental processing, instead of reprocessing the entire data set. However, this requires implementing checkpoints, so that the system records what has already been processed. Checkpointing ensures that only new data is processed in each iteration, but it adds complexity, especially when handling failures.
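A minimal sketch of incremental processing with a checkpoint follows. It assumes the source is an append-only list and stores the offset of the next unprocessed record in a small JSON file. The file layout and function names are hypothetical; Spark manages its own checkpoint directory and format.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the offset of the next unprocessed record (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_offset"]


def save_checkpoint(path: str, next_offset: int) -> None:
    """Record how far into the source we have processed."""
    with open(path, "w") as f:
        json.dump({"next_offset": next_offset}, f)


def run_once(source: list, checkpoint_path: str) -> list:
    """Process only the records that arrived since the last checkpoint."""
    start = load_checkpoint(checkpoint_path)
    new_records = source[start:]
    save_checkpoint(checkpoint_path, len(source))
    return new_records


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
source = ["t1", "t2", "t3"]
first_run = run_once(source, checkpoint_path)   # no checkpoint yet: all records
source += ["t4", "t5"]
second_run = run_once(source, checkpoint_path)  # only the two new records
third_run = run_once(source, checkpoint_path)   # nothing new since last run
```

Note that this sketch saves the checkpoint separately from any output, so a crash between the two steps could lose or repeat work.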

    The next challenge is fault tolerance. Fault tolerance means that if a job fails, the system must be able to restart from where it left off, using the checkpoint to track its progress. However, ensuring that the job and the checkpointing operation behave as a single transaction, so that both either succeed or fail together, is challenging.
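One simple way to treat results and checkpoint as a single unit is to write them together to a temporary file and publish it with an atomic rename. The sketch below assumes a local POSIX filesystem and uses hypothetical names throughout. It illustrates the "both succeed or both fail" idea only and is not how Spark implements its guarantees.

```python
import json
import os
import tempfile


def commit(state_path: str, results: list, next_offset: int) -> None:
    """Write results and checkpoint together, then publish them in one step."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"results": results, "next_offset": next_offset}, f)
    os.replace(tmp_path, state_path)  # atomic on POSIX filesystems


def load(state_path: str) -> dict:
    """Return the last committed state, or an empty state on first run."""
    if not os.path.exists(state_path):
        return {"results": [], "next_offset": 0}
    with open(state_path) as f:
        return json.load(f)


def process(source: list, state_path: str, fail_before_commit: bool = False) -> None:
    """Double each new record; optionally simulate a crash before committing."""
    state = load(state_path)
    new_records = source[state["next_offset"]:]
    results = state["results"] + [x * 2 for x in new_records]
    if fail_before_commit:
        raise RuntimeError("simulated crash before commit")
    commit(state_path, results, len(source))


state_path = os.path.join(tempfile.mkdtemp(), "state.json")
source = [1, 2, 3]
process(source, state_path)
source += [4]
try:
    process(source, state_path, fail_before_commit=True)
except RuntimeError:
    pass  # the crash leaves the previously committed state untouched
state_after_crash = load(state_path)
process(source, state_path)  # the restart resumes from the committed offset
final_state = load(state_path)
```

Because the offset and the results live in the same file, a restart can never see one without the other.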

    The next challenge is late arrival. In real-time systems, some data may arrive late due to network delays or retries. For example, a transaction that should be included in the 11:00 AM report might arrive at 11:30 AM. Handling late-arriving data requires reprocessing a previous iteration to correct the result, which adds complexity to the system.

    The next challenge is state management. To handle late-arriving data and ensure accurate results, the system must maintain the state of previous calculations. This allows it to update results when new data arrives, ensuring data consistency and correctness.

    To sum up this session: transitioning from batch to stream processing introduces challenges like back pressure, incremental processing, checkpointing, fault tolerance, late-arriving data, and state management. These challenges become more pronounced as the processing frequency increases, for example from minutes to seconds or milliseconds. Fortunately, tools like Apache Spark Structured Streaming provide built-in solutions for these challenges, making it easier to implement real-time data processing systems. Throughout this course, we will explore how to use Spark to address these challenges efficiently.

    That's all for this session. In the upcoming session, we will discuss our example project, and we will take a deep dive into how Spark Structured Streaming works and how it simplifies stream processing.
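The late-arrival and state-management challenges from this lecture can be sketched together: keep a running total per 30-minute window as state, let late events update an older window, and drop events that are older than a lateness threshold. This is a simplified plain-Python model with hypothetical names and a made-up 60-minute threshold. PySpark exposes a related mechanism through `DataFrame.withWatermark`, whose exact semantics differ from this sketch.

```python
from collections import defaultdict

WINDOW_MIN = 30        # report window length, as in the lecture's example
ALLOWED_LATENESS = 60  # hypothetical: accept events up to 60 minutes late


class WindowedTotals:
    """Keeps per-window running totals so late events can update old windows."""

    def __init__(self):
        self.totals = defaultdict(float)  # state: window start -> total amount
        self.max_event_time = 0           # latest event time seen so far

    def add(self, event_time_min: int, amount: float) -> bool:
        """Apply one event; return False if it was too late and was dropped."""
        self.max_event_time = max(self.max_event_time, event_time_min)
        watermark = self.max_event_time - ALLOWED_LATENESS
        if event_time_min < watermark:
            return False
        window_start = (event_time_min // WINDOW_MIN) * WINDOW_MIN
        self.totals[window_start] += amount
        return True


state = WindowedTotals()
# Event times are minutes since midnight: 660 is 11:00, 690 is 11:30.
state.add(665, 100.0)            # 11:05, on time
state.add(695, 50.0)             # 11:35, on time
late_ok = state.add(670, 25.0)   # 11:10 arrives after 11:35, updates the 11:00 window
state.add(800, 10.0)             # 13:20 moves the watermark to 12:20
too_late = state.add(675, 5.0)   # 11:15 is now older than the watermark, dropped
```

The threshold is a trade-off: a longer one corrects more late data but forces the system to keep more window state in memory.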

  • High Level Project Discussion on Spark Streaming (6:06)

    In this session, we will discuss our project at a high level.

    For our project, we will need data from a source, and the source could be anything. We will push this data to the landing zone, then apply some cleaning and quality improvement processes, and finally store the data in the storage account, or storage layer.

    This full scenario is called the medallion architecture: the data first arrives in the landing zone, then we apply the cleaning process, and finally we store it in the storage account.

    If I go to the high-level view of the medallion architecture, this is the full scenario.

Requirements

  • Python Programming Language

Description

Course Overview

In today's data-driven world, real-time stream processing is a crucial skill for software engineers, data architects, and data engineers. This course, Apache Spark and Databricks - Stream Processing in Lakehouse, is designed to equip learners with hands-on experience in real-time data streaming using Apache Spark, Databricks Cloud, and the PySpark API.

Whether you're a beginner or an experienced professional, this course will provide you with the practical knowledge and skills needed to build real-time data processing pipelines on Databricks, utilizing Apache Spark Structured Streaming for high-performance data processing.

With a live coding approach, you'll gain deep insights into streaming architecture, message queues, event-driven applications, and real-world data processing scenarios.


Why Learn Real-Time Stream Processing?

Real-time stream processing is becoming a critical technology for businesses handling vast amounts of data generated by IoT devices, financial transactions, social media platforms, e-commerce websites, and more. Companies need instant insights and decisions, and Apache Spark Structured Streaming is the best tool for handling large-scale streaming data efficiently.

With the rise of Lakehouse Architecture and platforms like Databricks, enterprises are moving towards unified data analytics where structured and unstructured data can be processed in real time. This course ensures that you stay ahead in the industry by mastering streaming technologies and building scalable, fault-tolerant stream processing applications.

What You'll Learn?

This course takes an example-driven approach to teach real-time stream processing. Here’s what you’ll learn:

  1. Foundations of Stream Processing

    - Introduction to real-time stream processing and its use cases

    - Understanding batch vs. streaming data processing

    - Overview of Apache Spark Structured Streaming

    - Core components of Databricks Cloud and Lakehouse Architecture

  2. Getting Started with Apache Spark & Databricks

    - Setting up a Databricks workspace for real-time streaming

    - Understanding Databricks Runtime and optimized Spark execution

    - Managing data with Delta Lake and Databricks File System (DBFS)

  3. Building Real-Time Streaming Pipelines with PySpark

    - Introduction to PySpark API for streaming

    - Working with Kafka, Event Hubs, and Azure Storage for data ingestion

    - Implementing real-time data transformations and aggregations

    - Writing streaming data to Delta Lake and other storage formats

    - Handling late-arriving data and watermarking

  4. Optimizing Streaming Performance on Databricks

    - Tuning Spark Structured Streaming applications for low latency

    - Implementing checkpointing and stateful processing

    - Understanding fault tolerance and recovery strategies

    - Using Databricks Job Clusters for real-time workloads


  5. Integrating Stream Processing with Databricks Ecosystem

    - Using Databricks SQL for real-time analytics

    - Connecting Power BI, Tableau, and other visualization tools

    - Automating real-time data pipelines with Databricks Workflows

    - Deploying streaming applications with Databricks Jobs


  6. Capstone Project - End-to-End Real-Time Streaming Application

    - Design a real-time data processing pipeline from scratch

    - Implement data ingestion from Kafka or Event Hubs

    - Process streaming data using PySpark transformations

    - Store and analyze real-time insights using Delta Lake & Databricks SQL

    - Deploy your solution using Databricks Workflows & CI/CD Pipelines


Who Should Take This Course?

This course is perfect for:

- Software Engineers who want to develop scalable, real-time applications.

- Data Engineers & Architects who design and build enterprise-level streaming pipelines.

- Machine Learning Engineers looking to process real-time data for AI/ML models.

- Big Data Professionals who work with streaming frameworks like Kafka, Flink, or Spark.

- Managers & Solution Architects who oversee real-time data implementations.


Why Choose This Course?

This course is designed with a practical, hands-on approach, ensuring you not only learn the concepts but also implement them in real-world scenarios.

- Live Coding Sessions - Learn by doing, with step-by-step implementations.

- Real-World Use Cases - Apply your knowledge to industry-relevant examples.

- Optimized for Databricks - Best practices for deploying streaming applications on Azure Databricks.

- Capstone Project - Get hands-on experience building an end-to-end streaming pipeline.


Technology Stack & Environment

This course is built using the latest technologies:

- Apache Spark 3.5 - The most powerful version for structured streaming.

- Databricks Runtime 14.1 - Optimized Spark performance on the cloud.

- Azure Databricks - Scalable, serverless data analytics.

- Delta Lake - Reliable storage for structured streaming.

- Kafka & Event Hubs - Real-time messaging and event-driven architecture.

- CI/CD Pipelines - Deploying real-time applications efficiently.


Enroll Now & Start Your Journey in Real-Time Data Streaming!

By the end of this course, you will be confident in building, deploying, and managing real-time streaming applications using Apache Spark Structured Streaming on Databricks Cloud.

Take the next step in your career and master real-time stream processing today.


Who this course is for:

  • Aspiring programmers and developers seeking to advance their skills and knowledge in Data Engineering with Apache Spark and Databricks Cloud.
  • Software Engineers and Architects eager to design and build Big Data Engineering projects using Apache Spark and Databricks Cloud.