Spark Standalone Cluster Architecture

Imtiaz Ahmad
A free video tutorial from Imtiaz Ahmad
Senior Software Engineer & Trainer @ Job Ready Programmer

Hey there, in this video let's talk about how the code that we wrote in the previous lecture would run in a Spark distributed cluster environment. The code that we wrote had just three main steps, right: ingesting the data, transforming it, and saving the data into the database. Since the sample data was so small, we had no problems running it on our local machine. But let's talk about what happens if we were dealing with real big data. Well, we certainly can't run it on a single node, so let's say that we have access to multiple machines. We could create as big a cluster as we'd like, but for this example let's say that we have four machines on a network, and I'll show you later on in the course how to create a standalone Spark cluster like this. As you can see in this image, three nodes are configured to be worker nodes and there is one master node. The master node's primary responsibility is to ensure that workers are up and running; if the worker application crashes, the master can restart it. Now keep in mind that all the work gets done by workers. OK, I'm going to repeat that: all the work gets done by workers. Workers are the ones that execute your Spark code.

Now, the first step to have a Spark application run on a cluster like this is to submit the code to the master node. OK? Typically this is done by packaging up our code into a jar file and submitting it to the master node. The master node then launches processes on the worker nodes, and the processes running on the worker nodes are the things that actually execute our code. Now, for the sake of explaining this to you at a high level, I've oversimplified some of these things; there's a lot more going on behind the scenes. But let me first finish talking about this example of the code that we wrote in the previous lecture and how it will behave differently in a cluster. We're then going to come back to this towards the end of this lecture and talk about how intra-cluster communication works: what are the different processes that are spawned on the nodes, and who's responsible for what. All of those details are important to understand, but I think we should save them for the end of this lecture. So hold onto that thought. OK.

Now, when it comes to distributed processing of data, partitions are a critical concept. Here's an example of how a very large CSV file could be broken up into partitions. Simply speaking, partitions are just groups of rows, OK, and breaking up a file into partitions like this allows for smooth parallelization. For example, if you have four cores available on your machine, you can launch four threads, and each thread could be working on its own set of rows, its own partition. Now let's talk about how workers can potentially process a file like this. Each worker loads partitions of the file in memory and spawns one thread for each partition. In Spark land, threads are referred to as tasks, but don't get confused by that terminology. You can configure a worker node to use many task threads; in this case, each worker node has been configured with two or three task threads. For the point of explaining this to you, I have kept the slides simple, but in a real-world application dealing with big data there could be thousands of partitions, OK, tens of thousands of partitions, and thousands and thousands of threads running. Now, these tasks represent the code that we wrote in the previous lecture. These tasks are each going to read a partition of the file and perform the transformations and filtering we specified in the code.
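To make those three steps concrete, here is a minimal sketch of what a job like that could look like with the Spark Java API. The file path, table name, and connection details are hypothetical placeholders rather than the exact code from the previous lecture.

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IngestTransformLoad {
    public static void main(String[] args) {
        // Running locally with 4 threads, i.e. up to 4 tasks at a time.
        // On a cluster, the master URL is normally supplied by spark-submit instead.
        SparkSession spark = SparkSession.builder()
                .appName("ingest-transform-load")
                .master("local[4]")
                .getOrCreate();

        // 1. Ingest: each task reads one partition (a group of rows) of the CSV file.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data/sales.csv");            // hypothetical file

        // 2. Transform: the filter runs in parallel, one task per partition.
        Dataset<Row> cleaned = raw.filter("amount > 0");

        // 3. Load: each task opens a JDBC connection and writes its partition to the table.
        Properties props = new Properties();
        props.setProperty("user", "spark");
        props.setProperty("password", "secret");
        cleaned.write()
                .mode("append")
                .jdbc("jdbc:postgresql://dbhost:5432/sales", "clean_sales", props);

        spark.stop();
    }
}
```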
And then finally, each task will form a connection with the database and load its respective partition into the table that we specified in the code. All right, now most databases obviously have restrictions as to how many connections can be formed at one time, so in a real-world environment we're going to be using connection pools for this. But we'll get into that later.

Now, one last very important detail that I left out up to this point: when you're programming real-world Spark applications, you're not going to be reading a file from a disk on a single machine. Real big data is often distributed across several disks. An example of this is the Hadoop Distributed File System or an Amazon S3 bucket. You can write your code to access the file by its logical name, right, but physically that file is most likely distributed. Your worker nodes will know what to do for ingesting the file in a distributed way; Spark takes care of a lot of that overhead. We'll talk more about this later on in this course when we are ready to create our own Spark cluster.

So now that you have the overview of ingesting the data, transforming it, and loading it into a database in a distributed way, let's go back to the topic of internode communication within the cluster. When we start up a Spark cluster, who's responsible for what? And when you submit your application, what happens then? OK, so you need a cluster with the servers configured as either workers or the master node, and these are basically JVM processes: on a worker machine a worker JVM process will be running, and on the master node a master JVM process will be running. These are very, very small, tiny JVM processes, and all they're really responsible for is starting and stopping other, more important processes that are related to the job, the Spark job, such as the executor process; we'll get into that in just a moment. These worker processes communicate with the master, and the master makes sure that they're up and running. You set that up first, and that's your infrastructure, and then you submit your application.

So then what happens next? Once your application is submitted to the master node, the master node will tell one of the workers to launch a very important process called the driver process. This driver process is the heart of the job. OK? The driver contains all of the logic of where to get the data from, how to perform the various transformations, and then finally what to do with the transformed data, such as loading it into a database or viewing it on screen or whatever. The driver process contains all of the code that we wrote. OK, but it's important to realize that the driver process does not execute that code; the worker threads, or tasks, that are spawned on the machines are what actually execute the instructions. So once the application is given to the master, the master tells one of the nodes to spawn the driver process. What's next? The next thing is the driver tells the master node to launch a JVM process on each of the worker nodes so that they can begin executing the code that the driver contains. So the master node will then tell each of the workers to launch the JVM in which execution of the code will take place. OK. And these JVM processes are called executers, or executors, however you want to pronounce that. These executor processes are going to be running the code that we wrote.
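Tying together two of the details above, reading by a logical distributed path and the limit on database connections, here is a small sketch. It assumes the same `spark` session and JDBC `props` as the earlier example; the HDFS path and the cap of 16 connections are made-up values.

```java
// Reading by logical name: the workers pull blocks of this file straight from
// distributed storage (HDFS here; an S3 path such as "s3a://my-bucket/sales/" works similarly).
Dataset<Row> big = spark.read()
        .option("header", "true")
        .csv("hdfs://namenode:8020/data/sales/*.csv");   // hypothetical location

// Each task that writes over JDBC opens its own connection, so reducing the
// number of output partitions caps how many connections hit the database at once.
big.filter("amount > 0")
   .coalesce(16)                                         // at most ~16 concurrent connections
   .write()
   .mode("append")
   .jdbc("jdbc:postgresql://dbhost:5432/sales", "clean_sales", props);
```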
Now you might be wondering: the code only exists on worker 2, right? In this example, the master told worker 2 to launch the driver process, and the driver process contains our code. So how will the other two nodes get the code from the driver process? Well, the driver process will be communicating with all of the nodes. OK, so the code to execute will be sent to every node in the cluster, and the driver process will ensure that each node is working on the tasks that it has been assigned. The driver will be orchestrating the entire job.

So let's summarize everything, OK, let's put everything together. When you start a Spark cluster, basically there's a master node and there are multiple workers. The master tells one of the worker nodes to launch the driver process, which is going to be the orchestrator, the heart of this entire Spark job. OK? Once the driver process is launched, the master will ask all the nodes to launch executor processes, and these executor processes are basically going to be doing the majority of the work, so you want to make sure that you allocate enough memory to these executor processes. And what are these? They're basically JVM containers that get launched. OK? And the Java code that we wrote, or the Scala code that we wrote (it doesn't matter the language), gets executed in these executor processes. And the tasks that we spoke about, right, each worker can have many, many tasks. These tasks are basically threads, task threads, that are spawned within the executors, and the partitions are loaded into the executor JVM's memory. I could go into a little bit more detail by saying specifically that the driver actually tells the master to hand out these instructions, but that's a detail you don't need to worry about, as long as you get the idea that the driver is the one orchestrating this entire job and ensuring that all of the executors are doing what they're supposed to be doing with the instructions they have been handed. OK, so a lot of detail here, and not all of it is equally important. You don't have to think about these things too much, but you've got to understand that the executor JVM is the actual process that executes the code on the worker machines, and you want to make sure that you allocate enough memory to these executor processes.

Now, just a few odds and ends to keep in mind. The master can be made highly available: you can actually configure these workers to also have a master process running, so if one master goes down, we don't have an issue. Another thing I'd like to point out is that you can launch an executor process on the master node as well. OK? So the master doesn't have to just be sitting there ensuring that the workers are up and running; it can actually also be doing work by launching an executor process, and the driver program will allocate tasks to all of the nodes, including the master. OK, so just some odds and ends there. If an executor process crashes, the worker is capable of starting it up. If the worker process crashes, the master is capable of starting up the worker.
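Since the driver and executor processes are plain JVMs, their sizing is just configuration. Here is a minimal sketch of how that could look; the memory and core values, host name, and class name are hypothetical examples, not recommendations from the course.

```java
import org.apache.spark.sql.SparkSession;

public class ClusterSizedJob {
    public static void main(String[] args) {
        // A job like this is usually packaged as a jar and handed to the master with
        // something along the lines of (placeholders throughout):
        //
        //   spark-submit --master spark://master-host:7077 --deploy-mode cluster \
        //                --driver-memory 2g --class com.example.ClusterSizedJob target/job.jar
        //
        // With --deploy-mode cluster, the driver process itself is launched on one
        // of the workers, as described above.
        SparkSession spark = SparkSession.builder()
                .appName("cluster-sized-job")
                .config("spark.executor.memory", "4g") // heap for each executor JVM, where the real work happens
                .config("spark.executor.cores", "4")   // cores (task threads) each executor may use
                .getOrCreate();

        // ... ingest, transform, and load steps would go here ...

        spark.stop();
    }
}
```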
Now keep in mind that there is only one driver process per job, OK, so the driver is the main orchestrator of the entire job. If it goes down, it would have to be reinitiated by the master. You can configure it so that if the driver goes down, the master can relaunch the driver program, but in a situation like that everything would have to restart. So the workers and the master node are still up, but the executor processes would have to be relaunched and the instructions from the driver program would have to be resent to the executors to start processing again.

And I might as well also add the performance implications, depending on the number of cores that you have on a given worker. Experience shows that you want at least two to three times as many tasks as there are cores. So if you have four cores available, it's better to have eight or twelve task threads running on that given worker, and you can configure the number of task threads that the executor will spawn by changing the configuration on the worker nodes. It's just a best practice in terms of performance. Keep that in mind: if you have a certain number of cores, multiply that by two or three, and that should be the number of task threads running on that given worker machine. We'll talk more about performance later, but hopefully I didn't blow up your brain. Some of this is not as important to understand right now; we'll talk about it later. But for those of you that are highly curious, hopefully I was able to quench your thirst about what's going on behind the scenes. In the next lecture, we're going to get back into Eclipse and continue coding Spark applications. So thanks for watching. I'll see you soon.
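As a quick addendum to that performance tip, here is a sketch of how the two-to-three-tasks-per-core rule of thumb might be applied from the Java API. The core counts are made up and it assumes the same `spark` session as before; the number of cores a standalone worker offers is itself set on the worker (for example via SPARK_WORKER_CORES in conf/spark-env.sh).

```java
// Hypothetical cluster: 3 workers x 4 cores each = 12 cores in total.
int totalCores = 12;
int targetTasks = totalCores * 3;   // the 2-3x rule of thumb from the lecture

// Aim shuffle stages at roughly that many tasks as well.
spark.conf().set("spark.sql.shuffle.partitions", String.valueOf(targetTasks));

// Repartition the data so there are ~36 partitions, i.e. ~36 task threads across the cluster.
Dataset<Row> sized = spark.read()
        .option("header", "true")
        .csv("hdfs://namenode:8020/data/sales/*.csv")    // hypothetical location
        .repartition(targetTasks);

System.out.println("partitions: " + sized.rdd().getNumPartitions());
```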