
Hello, and welcome to the course on Spark streaming in Databricks platform for Data Engineers.
My name is Mohammad Samiul Islam and I am going to be your instructor for this course.
I am full-time Data Engineer and currently working as a Lead Data Engineer for our Multinational
Software Firm.
I have over 10 years of experience working on some of the lab data-related project and also
data migration project. I am an Azure Certified Data Engineer Associate and also I have completed
the Informatica Cloud Data Engineering certification. So I hope I am super qualified to teach you
in this course and I will do my best to make your experience enjoyable.
The main goal of this course is to help you to understand Spark Structure Streaming from the
beginning to the advanced level using Pi Spark. Apart from this, I will cover the Kafka and
Databricks as well as CACD pipeline. So if I come to the pinpoint of learning step then firstly
we will learn Spark Streaming or Spark Structure Streaming. Then I will assist you with learning Kafka
as well. I will begin with the fundamental overview of Kafka and guide you to achieve the
descent level of knowledge about Kafka. Also I will discuss about what function the Kafka
in our Data Engineering platform does and will teach you to understand integrating Kafka
with the Spark Structure Streaming in real-time data processing. I will moreover assist you in
comprehending how to create a unified program that can be executed as a best processing job
or as a stream. Therefore Spark Structure Streaming is quite good. Also helping you to
construct a unified application is one of the highest quality. Additionally, you may run those
applications as a stream or as a best processing without changing a single line of code after
that have been built. This capability we will invest and we will discover how to put it into practice.
The technologies of putting the scenario into practice will be covered.
Consequently, we will discuss that and then learn how to combine batch and stream processing
in a big project. Each example, corresponding test case and test suit and how to test
various items and how to automate your integration and unit test. We will always be learning and
that is a critical component of every successful project. By end of the project, we will have
to learn how to automate your continuous integration and deployment as well as the fundamental
of Azure DevOps and CACD implementation. We will also actually put a functional CACD pipeline
into place for our Catastone project or final project or Lakehouse project.
So, this is the full step of our course and we will cover all the things.
I believe one thing, indeed, Allah SWT will not change the condition of people
until they change what is in themselves and I believe it very strongly.
Who is for this course? This course is for data related person, like the university student
who are very fascinated on data engineering, the IT professional person who working in a data
sector as well as mostly preferable for data engineer and data architect.
Why you will choose this course? This course is 100% project-based and ensuring practical learning.
We will learn from the fundamental tools to advance concept like spark streaming,
Kafka, Lakehouse and we will build into a solution. This course aligns with the
Databricks data engineering associate certification so it will help you to achieve the certification
and will give a great conception about the certification of Databricks data engineering associate.
And this is full-time and lifetime access.
In this session, we will discuss about process comparison between batch and stream.
Stream processing becomes essential when we need to process data in real time or near real time
as opposed to batch processing which handle data in large and periodic time.
Transitioning from batch to stream processing introduce several challenges that needs to be
addressed to ensure the efficient and accurate data processing.
Let's with an example to understand these challenges.
Imagine you are working for a large stock brokerage company that provides training service
through videos not from like desktop terminal web application and mobile apps.
This company wants to generate the trade summary report that updates every 30 minutes.
The report includes the total buy and sell amount as well as the settlement amount and the
plot it over time. Build this solution. We can break it into three main steps. The first one
is called data ingestion. The next one is called data processing and the final or third one is called
result storage. First we need to ingest data from training system into our data engineering
platform. As you there is a magical system that collect data from all trading platform and store
it centrally. We can create a data ingestion pipeline that pulls the data from the system
in every 30 minutes and send it into our landing zone in our platform.
So once the data is ingested we process it using medillian architecture.
So what is the medillian architecture that we will discuss in the later session but for basic
understanding which involves creating a raw data zone that is called bronze layer. Another layer
is called high quality layer or silver layer and the final one is called the gold layer.
We start by saving the raw data into bronze table then clean and standardize it in the silver layer
and finally are from the necessary calculation to generate the trade summary report in the gold layer.
The result are stored in a final table which the consumer can access to create visualization.
To execute this workflow we need to orchestrate these tools that schedule and run the job
in sequence. However the critical question is how frequently should this pipeline run?
If the customer requirement the report to refresh every 30 minutes then the entire system pipeline
must complete within that time frame. So if you look into this images here here the main thing is
over the process if it started in 1130 and if it's need to be complete before 12 pm then within
this time frame we need to complete the whole processing and instead by step by during every 30 minutes
the process will be completed. This is on kind of the path and path processing but if we want to move
to the streaming process this introduce several challenges. So the first one challenges is called
back pressure. So as the frequency of processing includes like every 30 minutes to every second
or every millisecond or microsecond the system must handle the growing volume of data without delays.
If the processing time exits that allowed windows delay accumulate relating to back pressure
and the system is struggled to keep up the incoming data. So the next one is called incremental
data processing. So to reduce the processing time we can process only new data or the newly
arrival data that is called incremental processing instead of reprocessing the entire
or whole data set. However this requires implementing checkpoint. So higher the system records what
has already been processed. Checkpointing ensure that only new data is processed in each iteration
but it adds complexity essentially when handling the failure.
So the next one is called fault tolerance. So fault tolerance means if a job fails the system
must be able to restart from where the lift off using the checkpoint to track this progress.
However ensuring the both job and checkpointing operation as a single transaction either both
source, succeed or fail in challenging. So the next one is that called let arrival.
Let arrival means in real-time system some data may arrive led due to network delay or any
retrise. For example a transaction that should be included in the 11AM report might be arrived at
11.30AM handling lateral data require reprocessing previous iteration to correct the result
which adds complexity to the system. The next one is called the state management. So to handle the
led arrival data and ensure the accurate result the system must maintain the state of previous
calculation. This allows it to update result when the new data arrives ensuring the data
consistency at correctness. So now if the pinpoint of the session and the summary is
transitioning from batch to stream processing introduce challenges like back pressure, incremental
processing, checkpointing, fault tolerance, lateral-able data and state management.
This challenges can be more pronounced as the processing frequency increase like from minutes
to second or to millisecond. Fortunately tools like Apaches Park Structure Streaming provide
building solution for these challenges making it easier to implement real-time data processing
system. Throughout this course we will explore how to use Spark to address these challenges
efficiently. That's all for this lecture and session and in the upcoming session we will discuss
about our example process, our project process and we will dip drive into how Spark structure
streaming org and how to it simplifies for streaming process.
In this session, we will discuss about our project and product money.
So for our project, we will need the data from the source and source could be anything.
Then we will push this data to the landing zone.
Then we will apply some cleaning and quality improvement process.
And finally, we will store this data to the storage account or storage layer.
So this full scenario is called the median architecture means the data will come to
fast in the landing zone.
Then we will apply the cleaning process.
Then finally, store to the storage account.
So if I go to the high level architecture of median architecture, then this is the full
scenario of the median architecture.
Hello, and welcome back in this video. I will guide you through sitting up
Thought Databricks community edition. So let's get started. First open your
browser and start for try Databricks and you will likely find a link of the
Databricks free trial. So just click on it and it will relate to you the trial
edition page. So here you need to fill up all the information like the first name
last name your email company title and others. So I suggest you to use your
personal email instead of your company mail and then you will click on the
continue button. So I'm just clicking on the continue button. Then it's
redirecting to new page and Databricks offer a fordindess free trial of its
full cloud product. But this require an existing account with all of the
supported cloud platforms like AWS Microsoft Azure or Google Cloud Platform.
And they will give you 400 dollar as a free carry for 14 days. If you have
already an account with any of those, then you can access the free trial. However,
for this course, we will not using the fordindess trial. Instead of this, we
will use the we like to at Databricks community version. So scroll down a little
bit. You'll see get started with community edition. I'm just clicking on the
community edition.
It's redirecting me to a new path. So and it's send a mail to your email
account. So I just go to my email account and you see a new email is available. So
they have given a permission for the Databricks community version. So I'm just
clicking on the link. And they're redirecting to the login page. I'm just
submitting here my mail and click that continue with email button. Then they will
send again a verification code to your email. So this is the new verification code.
I'm just copy and paste here. Then in turning and it's redirect to a new
page that is called the community cloud Databricks. But for my case, it's showing
to option on is this one and another is this one because by using the same email, I have
signed up for two times. That's why they have created multiple workspace for myself.
But if you were the first time, then it will not redirect to this page. It will
directly redirect to this pages. Okay, so this is the community version. And in the community
version, I have different type of feature like the collapse menu by using the collapse menu,
you can collapse or you can expand your menu and have a lot of menu like new workspace,
recent, such, catalog, workflow, compute, and also machine learning experiment.
So all the things we will discuss in our next session. So in the next session, we will explain
the workspace interface and how to use it for Spark development. So don't worry about for
this now. For now, let's log out and go to your workspace and close again. So this is the
option. I'm just log out again. And this is the log out option. Then finally, you will get
this pages. And if you submit a new email and click on the continue with email, then every time
it will send a verification code to your email. So then you will copy the link.
Then you will paste it here. Then it will redirect to the workspace page because I have to work
workspace. So if I click on the first one, then this one is the newly created workspace.
So if I go back and select the old one, then this is the old one. You see, in my old workspace,
I have created some notebooks. So it's showing here. So if I want to delete one of the workspace,
then I'm just delete the account, delete. So click on the delete button.
They're saying they have deleted it. So now if I try to log in again, then again, it will send
a verification email to my email address. Then I'm just copy and paste it here.
Then it redirect to my existing workspace. So the thing is over here. If you use same email
address for multiple time to create the community, then it will accept because
for every login, for every sign up, they will accept you and they will give you a new workspace.
Hello, and welcome to the new session in this session. We are going to discuss about the
Databricks feature. So Databricks have several key menu that helps user to navigate and manage
their workloads efficiently. So here is the breakdown all of the menu. The first one is the new.
So if I click on the new menu, then some other sub menu is available, like
notebook, table, compute, cluster, machine learning, and experiment.
So this new menu is a quick access menu that allow user to create new notebook, folder,
libraries, SQL query, dashboard, experiment, and ML model, and many more. So it helps
quickly setting up new resources without navigating through the workspace. The next one is
workspace. So if I click on the workspace, then you will be able to see this screen,
and workspace exactly like a file system that organizes your notebook, folder,
others, libraries, and other resources in Databricks. So here, the folder is available in notebooks.
So if I click on the notebook, then you will be able to see two other options,
on it's called shared, and another is user. So users can create personal or share folder to
collaborate with team, like if you want to work in a team, and the team member will access the
notebook or others' script, and you can create this thing in the shared option.
So this is where you can store and manage all your development org, including notebook and
script. And this is the user menu, and in the user menu, only one user is available. If you want to
add more user, then you can add user from here. The next one is recent. So if I click on the recent,
then this recent page is blank because we do not do apply anything here. So the recent menu
provided is to recently access notebook queries and other resources. This is useful for quickly
resuming work without navigating through the multiple menus. Next one is such. So if you want to
search anything in your workspace, then you can write down here. And if the item is available
in your workspace, then it will be visible to here. The next one is the catalog. So if I click
on the catalog, then you will be able to see some notification like no cluster is added.
Okay. So what is the cluster? I will just explain later because we have another option called
cluster. So first, I discuss about catalog. The catalog is a central point for managing data set
in Unity catalog. So what is the Unity catalog? I will discuss in later session that
provides fine gradient access control and data governance. This is the short thing of the
Unity catalog. And it allows organization to manage table schemas views and other data related
object efficiently. And also you can define permission on database tables and views at a granular
level. The next one is workflow. If I click on the workflow, then it's showing a radio database
subscription. So in the community version, we are not able to run the workflow. But if I just say
what is the workflow, then our flow just like called jobs that are used for orchestrating ETL pipeline,
ML training, mass processing, or automation task. User can schedule, monitor and manage job that
run notebook script or SQL query. And it support dependencies and conditional execution to build
complex workflow. So we will see this option in later when we will upgrade the database subscription.
Okay. The last one is called compute. So what is the compute? Compute menu manages the cluster.
So what is the cluster? Where are the underlying infrastructure that is called the virtual machine
and used to run notebook and jobs. So based on this compute or cluster, your job, your notebook,
your workflow, anything you have, it will be run. So user can create, configure, start, stop,
and monitoring the cluster because it's monitoring is very important due to the billing cycle of
compute. So in this session, we will create a cluster as well. So to create a cluster, I'm just
clicking on the create compute. And this is the name of the cluster. If you want, then you can
change it. So I'm just changing training 01. Okay. So database runtime version,
database runtime version is nothing. It's a version of the engine. So I always like to use the
preferable LTS version. So this is the LTS version. If you want, then you can use this one,
this one, and this one as well. Okay. And this is the option. And here we will get the 15GV memory
because community edition provide 15GV memory to process the job and it automatically
terminate after an ideal period like an hour or two hours. So now if I click on the create compute,
then it will take some time to create, maybe it will take two or three minutes to complete
full creation process. Okay. So I'm taking a pause and when the cluster will be created, then
I will come back. Okay. So finally, our cluster is created and this cluster is now ready to run
any notebook or any job or any workflow. So if I just describe here, the driver type is like you
say here, the community optimized. That means community optimized exactly giving to you 15.3
GV memory with the two core. And here spark configuration, they're using the spark databrics.rogsdv
file manager and other things. Okay. This is about the compute. So if I click on the compute,
then our cluster is available here. At the runtime, this is the version active memory,
active core and other things is available here. So if you want to clone or delete or restart
the cluster, then you are able to restart from here, clone from here, delete from here. Okay.
So this is all about for compute and cluster. And if I now go to the new and click on the table,
then you will be able to see here, this option is only showing to upload the folder.
Here, I just want to show you one thing. Like if I click on this point and I'm going to the setting
menu. And setting menu has an option that is called that DBFS like databrics file system.
So I'm just going to the advance option. And in the advance option, scroll down, scroll down,
and you see DBFS file browser. This is the off mood. I just active this option. Then come back to
the previous step and refresh it. You'll be able to see here, DBFS menu is available.
That means we can see the file system of databrics from here. You see, this is the file store and
within the file store have some default table or other things like this one is the table.
Okay. So now come to the upload file and we will upload some file here. Okay. So if I click on the
on this like drop files to upload clicking on here, then go to the desktop. And this is the
like source data. So if I want to take the full folder here, then I'm just dragging and drop here.
Then all the files will be uploaded to the by default directory that is called the file store
table. And the folder name is source data. And within the source data folder, we have another
folder that is called text and within the text folder, we have three files data one, data two,
and data three. Okay. I just show you again. So this is our source folder. And within the source
folder, we have text. We have text and three data file. Okay. So this already uploaded to the
data big file system. Now we are able to process this data and create our notebook workflow and
everything. So this is end of this session. And in the next session, we will write down our program
to process this data.
Hello and welcome to the new session. In this session we are going to start our coding.
So before jump to the coding we need a cluster. That's why we have created a cluster and
already we have created this cluster in the earlier lesson. So now I just go to the workspace
and in the workspace we will create a folder. So this is the workspace and I'm creating a folder
here. Pick on the create, then click on the folder. Then I'm just giving a name is par streaming.
Okay. Okay. Then within this par streaming I will create the notebook. So click
again the create button. Then create the notebook. Just the notebook. I'm just clicking on the
notebook and changing the name of this notebook like 0, 1, read, takes data.
Right. We are going to read the text data. So where we have stored this data,
firstly we will see this thing like if I click on the new and click on the table,
then you'll be able to see here dbfs, the notebook in the dbfs, just see this is a selected
file system dbfs file system and this file system exactly maintained by the data bricks.
And click on the file store, then table, then source data, then txt and within the txt we have
three file like data one, data two and data three. All the file are exactly containing some
sentences that is called paragraph. Every sentence is exactly separated by full stop.
Okay. So just scroll down and you'll able to see the directory part of this one.
Okay. So I'm just clicking a copy of this. I'm going to here and firstly we will define the base directory.
So this will be the base directory, okay. Base directory.
Okay. So in the base directory, we will keep only to this point, file store and tables.
Now print base directory and from the program, then you will see this directory is printed.
But now we want to see the folder and files under this base directory. So where it is db
utils dot fs, fs means file system, then ls means list of the file, then we will pass the
base directory base directory and within the base directory like file store tables and under the
table we have a folder that is called the source data, right. So we are passing here source data
and under the source data we have a text folder, right. So now if I hit enter, then you will
able to see it's showing the data file like data underscore one data two and data three, right.
And it's showing the size of the file also this one and this one. And the modification time is
this, okay. This is an integer number. If we want to convert this integer number into the time
time system format, it's possible. We'll see it later. So now the thing is over here. If we want to
clear the result of this cell, then how we can do this one? Click here and you'll be able to see
the clear output. I'm just clicking on the output, then this will be removed or clear, okay.
Now we will read the data from the data file, right. So now our target is read text data
from base directory with line separation, okay. So what will be our line separation? We will consider
here fully stop as a line separator, okay. So let's see how we can do this. Like first of all we are
going to assign a variable that is called lines. And now we need to write down the program.
So first of all, we are going to use Spark. So why we're going to Spark? Because Spark is an
entity point of apathy's part framework. And it's created a session to execute a program. So
Spark is an entity point of apathy's part framework and it will create a session to execute a
program. That's why we need to write down the Spark. So this Spark will initiate a session,
right. Then we need to read the data. Then with data we want to read, we will assign the format
of the data like format and the format of data is text data. We are going to read, right. Because
we have uploaded the text data. Then we can use option. Option is a method and like line separator.
So here we will use line separator as fully stop. Then we can load the data from the base
directory, right. So this one is the base directory and within the base directory we have source.
We have source data folder. Then source data folder. We have a text file, right. So now if I just display
the lines variable, then we will able to see the data of this three text file like data one and
data two and data three, okay. You see, this is the data like structure streaming and
every line is separated by fully stop. So if I just open the data file from my local environment,
then you'll be able to see the perfect scenario what containing exactly in the data file, right.
So I'm going to my data file in my local machine, okay. This all are my local file that I have
opened here to just see what are containing in this file, right. So you see here this is the file
structured streaming is the robust and scalable stream processing engine build on the spark
SQL. This line exactly available here. You see this line. Then after it always you then it always you
and it's exactly separated by this full stop, okay. This full stop. So if you go to the
every file, then you'll be able to see the same scenario, but the content is different.
Data one, data two and data three. Three folder exactly containing different type of content, okay.
So now just click here and clean the output. Now we want to apply some extra process like to read
the data from this framework, okay. So lines now write a data framework that story all the data
of data one data two and data three files. Now we will read the data from this lines and we will
apply some extra layer of process. So how we can define this like read data from data frame
and declare new data frame, okay. Okay. So how we can read this one like here all the data are
exactly coming as a line. So if I just execute again and you see here all the data is coming as a
line by line, line by line. So now we want to separate this line data into a single word, okay.
In this session, we are going to apply some data cleaning process and we will improve
the quality of the data, okay.
So let's move to our coding part.
In the last session, we write down this program and finally we have generated word as a
row.
Now we will apply some cleaning process.
So how we will apply the cleaning process, let's see.
So I'm just comment out here, apply some cleaning process to improve the data quality.
Okay.
Okay.
So I just take a variable and give a name like quality words, okay.
And now we will apply the cleaning process on the reviewer's data frame, like this is
the previous data frame.
So on this data frame, we will apply some cleaning process.
So I'm just writing here word, it's a data frame, do not select.
Then we will apply some clean process, like if our word have any left or right side
space, then we need to use stream function to remove this space.
So just make a stream and words don't want because we have mentioned this is what, right.
This is the words data frame and what data frame exactly returning a column that is called
word.
That's why we have right down here words, this is a data frame and from the data frame,
we are selecting column, the column name is word.
So this word exactly is a column and it's coming from the words data frame, okay.
So for the more specification, we can do one thing like we can do words underscore
df, just to understand this, this is a data frame and we can make it here quality words
df and this is df and this is df, okay.
So now if I execute this cell, because you need to execute, changes the data frame name,
earlier it was words, now we have converted words underscore df to understand easily, okay.
Now we can do one thing like we can give a alias, alias of this column like clean word,
okay.
Execute and see, stream is not defined, okay.
So we need to import stream, like already we have imported some function and we can
also write down here, stream and execute this again, now come to here, then execute, okay.
Now this is the thing we have applied, the trim, it's work, we want to make sure all the
items will be lower case, then we can apply this thing and make sure this play to quality
words, quality words underscore df, let's see what's happened here, the lower also we
did to include, so lower and now I'm coming back to here and execute this one, okay.
You see this is the clean word, this clean word exactly the alias of the column, so here
the column is the clean word and now if we want to apply some other filtering process,
like we want to apply some condition that we do not pick the null value, right.
So how we can do this one, we can apply higher condition and also we need to apply the
not null function on the word column of the words data frame, so I'm just taking this,
taking this word and this will be is not null.
And also we need to input this call function, otherwise it will give us error, so I'm just
importing here, execute this on a in and come back to the tf, okay, so we are not taking
any null value, so if we want to apply another filtering on this data frame, like we do
not take any numeric value, if this numeric value is existing award, so to apply this
process, what we can do, this is our word column and on the word column, we will take only
the alphabetical value, so we can use our light function and within the our light function,
we will mention only a qz is existing award, if do not have any numeric value, then it will be
populated here, so now apply the execution and let's see it's coming and there is no one
that containing the numeric value, so this is the end of this lesson and in the upcoming lesson,
we will save this data to our data link.
In this session, we will start our project one.
So in the project, we will analyze the text data.
So earlier, lesson, we already saw how to read the data from a text file.
And we have already applied some cleaning process.
Then we have dumped this data to the storage account of Delta Electable, right?
So if we modernize this code, then how we can do this?
So first of all, we will read the data from the storage account for the project one.
Then we will store this data to the landing zone.
But before to store the landing zone, we will apply some directory label cleaning process
under the landing zone, okay?
Like if the landing zone have different type of file, we will apply the landing zone directory
cleaning program.
Then we will store this data or load this data to the landing zone.
Then we will apply some data cleaning process to improve the data quality based on the architecture
of the median architecture.
And finally, we will store this data to the Delta Electable.
And finally, we will apply some unit test case to check the data and validate the data,
right?
Okay.
So let's now move to the coding part, okay?
So already we have write the program.
This is the program that we wrote in earlier sessions.
So just we will formalize and make some professional vites of this project, okay?
So firstly, I'm just creating a folder that is called demo.
And I'm just passing this previous notebook to the demo folder.
I'm just creating another new folder that is prosa to 01 or prosa to 01, okay?
So now I just create a new notebook here, not a folder, it's a notebook.
And we can give a name like 01, text.analysis, okay?
So what we can do here, we can do really copy paste this code from here to our new notebook,
but it will not be a good process because we will apply some professional process to deploy
the code in the production environment, right?
So we need to follow some standard process to write down the code.
So first of all, what we will do?
We will just mention the process that we are going to do here, like read the data from
storage account, right?
Then we will do some directory level cleaning under the landing zone, directory level cleaning
in landing zone.
Then we will apply the store the data to the landing zone, then we will store the data
in landing zone.
Then we will apply some cleaning process, cleaning process to improve the data quality, quality.
And finally we will do one thing like save the data into delta table and finally validation.
So we will do this step here, so first of all, what we need to do?
For the professional coding purpose, we need to mention the class level programming like
concept of the object oriented programming, but in the earlier notebook, we do not mention
any class or anything here, right?
So we will follow the standard process of the coding.
So first of all, we just mention a class, like giving the name word count class, okay?
And in the class, we need to write a conno-stactor, I think already you know about the conno-stactor,
what is conno-stactor and why we exactly write?
So we need to mention here the base directory as a parameter, base directory.
Like our base directory, we will be file store, right?
File store slash tables.
And we will pass a table name because we will store this data to the delta leg table,
that's why we want to pass the table name as a parameter, right?
So this is the table name and we are passing the table name as like TVL work count, okay?
Okay, so now self.bas directory equal to base directory, then here we need to mention
the table name, table name equal to the table name, right?
Okay, so the conno-stactor function is done.
Now we want to read the data from the storage account, right?
So how we can do this?
We can define another function and the function then we can mention get raw work, okay?
And mention the self.
So from this function, we want to return a data frame.
So we want to mention here the return type of this function, this is the standard process
and we can just mention a message like loading the raw work, okay?
We can mention this one.
So now what we will do, we will just copy paste the code from this point, why?
Because this is the reading process of the data from the storage account.
And I will not describe the everything here because in earlier session we already discussed,
right?
Okay, I'm just taking the copy and make it formalize and then we can return this line,
we can return this line's variable, right?
We can return this line's variable because this is a data frame, right?
It's when we have mentioned it's a variable, but exactly after reading the data from the
storage accounts, it's a return or data frame.
So lines now is a data frame.
So we can directly return this, but we will not directly return.
If we write down here lines, then no problem, but we will specify some extra things like
we have mentioned here this one, right?
We have applied some function that is called split, exploit and also we have select alias
to give the perfect definition of the name, okay?
So I'm just removing this one and paste this line here.
So now this is okay because if we want to read the data from the storage account, then
this function is able to need.
So we will do a test.
So we just mention an object of the word class, word count class, then we can call that
get raw word, right?
Get raw word, okay?
And before to this, we need to import some function that is this one is the function,
we need to import.
So before this class, we will apply this function and also we need to import data frame type
by spark dot SQL, import data frame, right?
Because we are passing the data as a data frame and also we are, our return type of this
get raw word is data frame, that's why we need to input this data frame function as well.
So one thing we need to add because this is the self, okay, self dot the base directory
because we have mentioned this base directory in the constructor level of our class, right?
Okay.
So now if we execute this program, then I guess it will return something, so first execute
this one, organize.
You see here the program is executing successfully like loading the raw data, loading the raw data
and it's returning a data frame, right?
This one is the data frame.
You see the return type is the data frame and word and string, right?
So if we want to see the data, then how can see?
So if we use the display, then you will be able to see the data, okay?
So this is end of this lesson, in the next lesson we will apply the cleaning process,
we will define the cleaning function and then we will apply some process to store the
data in the data lake level, thank you and we will see you again.
1) Write Data Cleaning Function for Text Data Streaming
1) Save Data Frame as Delta Table in Databricks for Text Data Streaming
Hello, and welcome to the new session in the session we are going to start our second project and in the second project
We will process the descent data as like the streaming process
So let's start to create a new project for project 2 so I'm creating a new folder in the work space and giving a name like
project 2
I guess the project folder is created under the home directory
Now I just drag and drop this project to our work space under spark streaming
So in this project we are now process the invoice type of descent data
So first of all we need to upload this data
So I'm just clicking on the catalog and this is the dbfs
And under the dbfs we have file stored then source data
Under the source data we have text folder and in our previous project we have uploaded this data to the text folder
So for this project we are going to upload the data in a new folder that is called in voices
And I'm just drag and drop the source file from our desktop to here
I'm just dragging dropping the file is uploading
So the upload is done
Now you see here the invoice is folder and under this folder we have three invoices json
So firstly I want to show you the invoice json data and want to discuss some structure of this json
Then we will move to our coding part
So I'm just opening this one file you see this is our JSON data
So in our JSON data we have different key and value pair
So in the key value pair we need to analyze this some extra layer of thing
So first of all I want to just structure this JSON file to easy or better understand
Now we can do this one in notepad++ we have an option like plugins and under the plugins you will see plugin admin
Here you will just write down JS tools
Already I have installed the JS tools that's why it's now not showing here
But if I go to the install then you will see the JS tools
So you will just install the JS tools then just press control alter M
Then this file will be structured
It will be easier for understand the JSON structure
So this is the base approach
So now you see here this is the JSON data key and value pair
But in every JSON we have some field that is structured type of complex field
So this field we need to call as like the array field
And we need to process this field in a separate way
We will see in our coding time how we are going to process this structured JSON file
So this is actually array like you see here in voice line item
So we will use struct keyword to process this data
So this is the JSON file and in the upcoming lesson we will start our coding part
Thank you and see you soon
Read JSON File in Databricks
Hello, and welcome to the new session.
In this session, we are going to create our schema.
So why we will create our schema?
Because if I just go to our JSON file, then you will see here all the fields in JSON
is a key value pair.
So some of the value is understood this unlike delivery address in first line item.
It's an array and within the array, we have the share type of the data, understood this
and type of.
So that's why we need to use any extra layer of functionality to manage this complex
type of this data.
So to manage this data, what we will use, we will use a struct data type.
It's actually a complex type of the data and deal with the complex data.
So let's move to the data form meeting.
So to format the data, I'm just taking single JSON like this one, right?
And here I will prepare for the schema level of gopap.
So firstly, what I need to do, I need to remove all the semi-clone and single code, a double
code.
So I'm just taking this one and replacing all the things by nothing, okay?
So what will be the data type of all key value pair?
So first of all, I just need to put the data type of each field.
Then we will use the struct for the nested JSON, okay?
So I'm just considering the invoice number, we can consider it's a string or a big integer,
I'm just considering it's string, okay?
So the created time, I'm just considering it's big integer, big int.
And it's 2ID, just I'm considering a string.
Then post ID also strings, so just I'm copy and paste, case here ID string, customer type
also string, customer card number, we can also give it a string.
The total amount it should be double, DOU, PLE, number of item, it should be integer or
big integer, we can consider anything, payment method should be string, then taxable
amount also it's double, DTS double, is DTS also double, ESS also double, ESS also double,
so other data type like IA's delivery type home delivery, we can consider as a string
and delivery address, so this is the point of using a struct complex data type.
So let's give a good and great concentration to understand the scenario, right?
So first of all, we need to know what is the struct, so in data bricks, a struct is
a complex data type that allow you to group the multiple fields like the columns in a
single column, so it's similar to a record or row or an object in other language or systems,
so you can commonly use a struct in PySpurg or database runtime when dealing with the
nested data like JSON or when you want to create structured columns, so here you see this
is a JSON data type and it acts as a nested JSON, that's why to deal with the nested
JSON, what we need to do, we need to use the struct data type, so struct data type can easily
handle the nested JSON or an nested JSON, so to do this what we can do, we can do one thing
like firstly I just remove this one and give a struct keyword, so struct keyword will
be start, then this is the less than symbol we need to use, this is the pattern of the struct
data type writing and now we can remove this one by string, so this is our string data type,
also city string, state string, pin code, also we can define as a string no problem and contact
number, contact number, contact number, we can consider as string no problem, contact number,
it's a string, so when it will be finished then we need to state the closing parenthesis as like
greater symbol, so this is the structure of defining as like struct data type, so now come to
the next point here, this is the invoice line items, so inverse line item has an extra
provided, this is then array, so when we will deal with an array field at the time we need to
deal with some extra layer of the data, so what is the solution, we need to deal with array,
right, and within the array we need to define the struct, we need to define the struct,
and we need to just remove this one and also we can remove this one as well, so it will be
ending point, so here I'm just using two greater than symbol because one for array and another for
struct, so for the item code we can mention it as a string item price also we can mention as double
item quantity, so item description we have missed, so item description we can mention here it
as also string item quantity, we can also mention big integer integer, but integer is good,
we do not need actually the big integer, but I'm mentioning here the big integer, okay,
so another is done, total price it should be double, right, okay, so this is the structure of
our schema, so just now we can do one thing like we need to mention here three double code,
I'm just mentioning here three double code, we can do one thing here like this one,
and finally this one, so this is the process, now I'm just make it a format as like this one,
and just I'm taking a copy and go to my netabrics, now I'm just defining a new function and giving
a function like like this one and get schema, okay, and self, okay, now return just returning this
one, control V, okay, so our schema is prepared, so now what we need to do, we need to just mention here
the schema in our function, so schema, self, because this is a function, and this is a schema name,
get schema, right, so now if we execute this one, then what we will see, we will see the similar
thing as like the similar result as like the previous video, because now we are just using the
schema, but where we will use this type of this is schema's inflection, that we will see in the
next session, so just I just explain a little bit like this is the delivery address, right, and this
is the inverse line item, so using the delivery address tag, we will ping the address line, like
delivery address dot address line, delivery address dot city, delivery address dot state, delivery
address dot pin code, and also item inverse line items dot line code line description, we'll see
in the next session, no problem, so just now run this one and see the result, right, okay, see
it's showing the same result, but what we can do now, you see here, this is the delivery address,
you see this is object also, so now what is our facility, so now what we can do, we can do
call this address line by using the delivery address, like delivery address dot address line,
delivery address dot city, so by using this process, we can structure our data as a single line of
the data, so we will see in the next lesson, thank you,
Course Overview
In today's data-driven world, real-time stream processing is a crucial skill for software engineers, data architects, and data engineers. This course, Apache Spark and Databricks - Stream Processing in Lakehouse, is designed to equip learners with hands-on experience in real-time data streaming using Apache Spark, Databricks Cloud, and the PySpark API.
Whether you're a beginner or an experienced professional, this course will provide you with the practical knowledge and skills needed to build real-time data processing pipelines on Databricks, utilizing Apache Spark Structured Streaming for high-performance data processing.
With a live coding approach, you'll gain deep insights into streaming architecture, message queues, event-driven applications, and real-world data processing scenarios.
Why Learn Real-Time Stream Processing?
Real-time stream processing is becoming a critical technology for businesses handling vast amounts of data generated by IoT devices, financial transactions, social media platforms, e-commerce websites, and more. Companies need instant insights and decisions, and Apache Spark Structured Streaming is the best tool for handling large-scale streaming data efficiently.
With the rise of Lakehouse Architecture and platforms like Databricks, enterprises are moving towards unified data analytics where structured and unstructured data can be processed in real time. This course ensures that you stay ahead in the industry by mastering streaming technologies and building scalable, fault-tolerant stream processing applications.
What You'll Learn?
This course takes an example-driven approach to teach real-time stream processing. Here’s what you’ll learn:
Foundations of Stream Processing
- Introduction to real-time stream processing and its use cases
- Understanding batch vs. streaming data processing
- Overview of Apache Spark Structured Streaming
- Core components of Databricks Cloud and Lakehouse Architecture
Getting Started with Apache Spark & Databricks
- Setting up a Databricks workspace for real-time streaming
- Understanding Databricks Runtime and optimized Spark execution
- Managing data with Delta Lake and Databricks File System (DBFS)
Building Real-Time Streaming Pipelines with PySpark
- Introduction to PySpark API for streaming
- Working with Kafka, Event Hubs, and Azure Storage for data ingestion
- Implementing real-time data transformations and aggregations
- Writing streaming data to Delta Lake and other storage formats
- Handling late-arriving data and watermarking
- Optimizing Streaming Performance on Databricks
- Tuning Spark Structured Streaming applications for low latency
- Implementing checkpointing and stateful processing
- Understanding fault tolerance and recovery strategies
- Using Databricks Job Clusters for real-time workloads
Integrating Stream Processing with Databricks Ecosystem
- Using Databricks SQL for real-time analytics
- Connecting Power BI, Tableau, and other visualization tools
- Automating real-time data pipelines with Databricks Workflows
- Deploying streaming applications with Databricks Jobs
Capstone Project - End-to-End Real-Time Streaming Application
- Design a real-time data processing pipeline from scratch
- Implement data ingestion from Kafka or Event Hubs
- Process streaming data using PySpark transformations
- Store and analyze real-time insights using Delta Lake & Databricks SQL
- Deploy your solution using Databricks Workflows & CI/CD Pipelines
Who Should Take This Course?
This course is perfect for:
- Software Engineers who want to develop scalable, real-time applications.
- Data Engineers & Architects who design and build enterprise-level streaming pipelines.
- Machine Learning Engineers looking to process real-time data for AI/ML models.
- Big Data Professionals who work with streaming frameworks like Kafka, Flink, or Spark.
- Managers & Solution Architects who oversee real-time data implementations.
Why Choose This Course?
This course is designed with a practical, hands-on approach, ensuring you not only learn the concepts but also implement them in real-world scenarios.
- Live Coding Sessions - Learn by doing, with step-by-step implementations.
- Real-World Use Cases - Apply your knowledge to industry-relevant examples.
- Optimized for Databricks - Best practices for deploying streaming applications on Azure Databricks.
- Capstone Project - Get hands-on experience building an end-to-end streaming pipeline.
Technology Stack & Environment
This course is built using the latest technologies:
- Apache Spark 3.5 - The most powerful version for structured streaming.
- Databricks Runtime 14.1 - Optimized Spark performance on the cloud.
- Azure Databricks - Scalable, serverless data analytics.
- Delta Lake - Reliable storage for structured streaming.
- Kafka & Event Hubs - Real-time messaging and event-driven architecture.
- CI/CD Pipelines - Deploying real-time applications efficiently.
Enroll Now & Start Your Journey in Real-Time Data Streaming!
By the end of this course, you will be confident in building, deploying, and managing real-time streaming applications using Apache Spark Structured Streaming on Databricks Cloud.
Take the next step in your career and master real-time stream processing today.