How Data gets Partitioned and Distributed in Spark Cluster

Irfan Elahi
A free video tutorial from Irfan Elahi
Data Scientist in the world's largest consultancy firm


Apache Spark Hands on Specialization for Big Data Analytics

In-depth course to master Apache Spark Development using Scala for Big Data (with 30+ real-world & hands-on examples)

11:44:43 of on-demand video • Updated August 2019

  • Understand the relationship between Apache Spark and Hadoop Ecosystem
  • Understand Apache Spark use-cases and advanced characteristics
  • Understand Apache Spark Architecture and how it works
  • Understand how Apache Spark on YARN (Hadoop) works in multiple modes
  • Understand development life-cycle of Apache Spark Applications in Python and Scala
  • Learn the foundations of Scala programming language
  • Understand Apache Spark's primary data abstraction (RDDs)
  • Understand and use RDDs advanced characteristics (e.g. partitioning)
  • Learn nuances in loading files from the Hadoop Distributed File System in Apache Spark
  • Learn implications of delimiters in text files and their processing in Spark
  • Create and use RDDs by parallelizing Scala's collection objects and implications
  • Learn the usage of Spark and YARN Web UI to gain in-depth operational insights
  • Understand Spark's Directed Acyclic Graph (DAG) based execution model and implications
  • Learn Transformations and their lazy execution semantics
  • Learn Map transformation and master its applications in real-world challenges
  • Learn Filter transformation and master its usage in real-world challenges
  • Learn Apache Spark's advanced Transformations and Actions
  • Learn and use RDDs of different JVM objects including collections and understanding critical nuances
  • Learn and use Apache Spark for statistical analysis
  • Learn and master Key Value Pair RDDs and their applications in complex Big Data problems
  • Learn and master Join Operations on complex Key Value Pair RDDs in Apache Spark
  • Learn how RDDs caching works and use it for advanced performance optimization
  • Learn how to use Apache Spark for Data Ranking problems
  • Learn how to use Apache Spark for handling and processing structured and unstructured data
  • Learn how to use Apache Spark for advanced Business Analytics
  • Learn how to use Apache Spark for advanced data integrity and quality checks
  • Learn how to use Scala's advanced features like functional programming and pattern matching
  • Learn how to use Apache Spark for logs processing
Hi everyone, and welcome back. In this lecture we're going to explore another dimension of the textFile method, which is a part of the SparkContext and which we have been using to load different files: files that are in a directory, or files that have different delimiters. That's good. Now, you know that Spark is a distributed computing framework: a very powerful, very fast, in-memory distributed computing framework where executors run on different nodes of your cluster. The driver program runs on one of the nodes, and the actual processing is done by those executors, which operate on different nodes and send results back to the driver if required, or to a storage system. So the point is that Spark is distributed, and its basic abstraction, the RDD, is also distributed. It's a resilient distributed dataset. So any RDD that you see actually consists of different partitions, and these partitions are distributed on different nodes.

Let's make this point a bit more clear. Currently we are working with the textFile method, so let's take this example. Say you have a big file, for example 1 GB in size, stored on HDFS. I'm talking about the actual size of the file, not the size that comes with replication. Right. And this is your Spark cluster, and it is designed in such a way that there are, let's say, four worker nodes. So this is your Spark cluster, and Spark is currently running on four nodes. And this is the file; to make the math easier, let's say it is a 1 GB file. Right. You have that file, and you know that Spark is distributed.
And you know that when you load this file in Spark, it will be in the form of an RDD, because anything that comes into Spark is an RDD. That's basically the abstraction that Spark provides. [unclear] Now the question is: what do you think happens with this 1 GB file? You know that Spark does in-memory computation, right? So anything that comes into Spark is going to stay in memory. Is it all going to stay on one executor? Do you think this file will live on only one executor? It's actually not the case. This whole 1 GB file will be distributed, by some mechanism, across the different nodes of the Spark cluster. So some part of this RDD will be here, some part will be here, some part will be here, and some part will be here. There is also the possibility that Spark decides, based on its own partitioning scheme, not to use all of the nodes, so that some portions of the RDD would be here and here but not here. It depends on a number of factors, for example the data locality principles. The thing is, the RDD will be distributed, so portions of the RDD will be here, here, here and here. For a simplified example, let's say that our RDD is distributed across all of the nodes, and that it is distributed in a very even way. So what happens if you perform an operation on the RDD? For example, you can convert each line of the RDD of the file to uppercase. That's a very basic transformation. This transformation will happen on all of the partitions separately, each node working on its own portion when the operation runs. Right.
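The idea that a transformation runs on every partition separately can be sketched in plain Python. This is only a toy model, not Spark code: the nested list stands in for an RDD split across four nodes, and `map_partitions` is a hypothetical helper name chosen for illustration.

```python
from typing import Callable, List

def map_partitions(partitions: List[List[str]],
                   fn: Callable[[str], str]) -> List[List[str]]:
    """Apply fn to every record, one partition at a time.

    Each inner list is handled on its own, which mimics how each
    executor would only touch the partition it holds.
    """
    return [[fn(line) for line in part] for part in partitions]

# Toy "RDD": four partitions, imagined to sit on four different nodes.
partitions = [
    ["first line", "second line"],  # imagined node 1
    ["third line"],                 # imagined node 2
    ["fourth line", "fifth line"],  # imagined node 3
    ["sixth line"],                 # imagined node 4
]

# The uppercase transformation from the lecture, applied per partition.
upper = map_partitions(partitions, str.upper)
```

The number of partitions and their boundaries stay the same after the transformation; only the records inside each partition change.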
So this is what happens with an RDD in Spark: it gets distributed. Now, given that, when you make use of the textFile method, number one: how is the partitioning done by Spark? That's the first question. And the second question is: can you control it? Right. First thing: if you have a set of files in your directory, for example ten files, and you use the textFile method, then textFile will create at least one partition for each file. For this file it will create one partition, for this one another partition, and another partition, and another partition. So that's the first point about textFile: if you are loading multiple files from a directory with the textFile method, it will create at least one partition for each of these files. Second thing: you are aware of the fact that in HDFS there is a notion of block size, right? For example, the block size can be 128 MB, but some prefer to have it as 256 MB and some make it 64 MB. Whatever the case, you have that notion of the block size. We saw in the previous example that if you have a directory of multiple files, each file comprises at least one partition, which may exist on one of the nodes. That makes sense. Now, what happens if you have one file of, let's say, 1 GB? What's going to happen for that one? So this is the 1 GB file. You bring it into Spark, and you are not working with multiple files; there is just one file. For multiple files, it creates a separate partition for each file, and each partition goes to an executor on some node. What about this single file? Does the same formula apply, one partition for one file? No. So what is it going to do?
It's going to basically create a partition for each block. So if there is a 128 MB block, it will create a partition for that. It's going to do just that with the 1 GB file as well. So one partition will be approximately equal to 128 MB. For the 1 GB file, 1 GB divided by 128 MB gives 8 partitions. So what you get, for a 1 GB file, knowing that the block size of HDFS is 128 MB, is that Spark will use that block size as the basis of its partitioning scheme and create partitions of this one file, each of approximately 128 MB, and it will store these 8 partitions on different nodes. As we have fewer nodes than partitions, some nodes will have multiple partitions, some will have, let's say, one partition, and there is even the possibility that some node may not have any partition at all. But the thing is, we're discussing the partitioning mechanism: how Spark actually creates an RDD and distributes it on different nodes. We saw that, specifically in the case of textFile, if there are multiple files it creates at least one partition for each file. If there is a big file, greater than 128 MB, it will create partitions of approximately 128 MB each and assign each one to an executor, and that executor will operate on that partition. If it is still not clear exactly how executors operate on partitions: an RDD consists of different parts, and these are the partitions of the RDD. This partition may exist here, this one here, this one here, this one here, and there is the possibility that this partition may exist here or somewhere else. The executor that has a partition operates on it to do whatever you want to do. For example, if you want to convert the lines to uppercase, this executor will convert this section of the overall RDD.
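The arithmetic in this part of the lecture can be checked with a small Python sketch. It is a simplified model under stated assumptions, not Spark's or Hadoop's actual split logic (which also takes settings such as a requested minimum number of partitions into account): it assumes one partition per block and at least one partition per file. The function name `estimate_partitions` is hypothetical.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def estimate_partitions(file_sizes, block_size=128 * MB):
    """Rough partition count for a list of file sizes in bytes.

    Simplified model (an assumption, not Spark's real algorithm):
    one partition per block, and at least one partition per file.
    """
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)

# One 1 GB file with 128 MB blocks: 1024 MB / 128 MB = 8 partitions.
one_big_file = estimate_partitions([1 * GB])

# Ten small files (5 MB each is an arbitrary size chosen for illustration):
# each file is smaller than a block, so each gets one partition.
ten_small_files = estimate_partitions([5 * MB] * 10)

print(one_big_file, ten_small_files)  # prints: 8 10
```

With four worker nodes and 8 partitions, at least one node has to hold more than one partition, which matches the point made in the lecture.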
So this is a very important concept, because someone may ask you how it all works, for example when you have a big cluster but you are under-utilizing it. Or, for example, when you launch a Spark job, you can define the number of executors as well. These things will make much more sense when you have a Spark cluster and develop more knowledge about it. At this level, it's enough for you to understand how the partitioning is done by the textFile function. I hope this has been helpful and interesting for you. I can't wait to resume this exciting course in the next lecture.