RDD Characteristics: Partitions and Immutability

Loony Corn
A free video tutorial from Loony Corn
An ex-Google, Stanford and Flipkart team
4.3 instructor rating • 67 courses • 138,716 students

Lecture description

RDDs are very intuitive to use. What are some of the characteristics that make RDDs performant, resilient and efficient? 

Learn more from the full course

From 0 to 1 : Spark for Data Science with Python

Get your data to fly using Spark for analytics, machine learning and data science

08:18:52 of on-demand video • Updated February 2018

  • Use Spark for a variety of analytics and Machine Learning tasks
  • Implement complex algorithms like PageRank or Music Recommendations
  • Work with a variety of datasets from Airline delays to Twitter, Web graphs, Social networks and Product Ratings
  • Use all the different features and libraries of Spark : RDDs, Dataframes, Spark SQL, MLlib, Spark Streaming and GraphX
To become comfortable and proficient at working with Spark, you need to really understand how Resilient Distributed Datasets work. We've talked a little bit about RDDs before, and we've also seen them in action. Let's take a quick recap of what we've learned before we move on to more details. RDDs are the main programming abstraction in Spark. They are the fundamental data structures: just as objects are to Java, RDDs are to Spark. They're in-memory objects. All data is processed in memory using RDDs, but they're still resilient. That's where the word "resilient" in RDD comes from: they're fault tolerant. When you deal with vast amounts of data, data errors, data corruption and data loss are all part and parcel of the work. RDDs are designed to be resilient in such situations.
There are a lot of complexities involved when you work with datasets of this size. For example, datasets of this size do not fit on a single node. You have to distribute them across a cluster so you can work on them in parallel. If in-memory data is lost (and remember, RDDs are in memory), you have to build a fault-tolerant system to make sure that the data can be retrieved. And finally, when you actually process this data, it has to be done efficiently and fast. These are the challenges for any big data processing.
In traditional data analysis environments, for example Excel, or Java or C++ if you want to go the programmatic route, all of these would be the responsibility of the developer: distributing data, fault tolerance, efficient processing. RDDs in Spark exist to abstract the user from these complexities. You can interact and play with billions of rows of data without worrying about any of these. You're not in charge of distribution, fault tolerance or efficient processing.
You just need to get your job done using RDDs, and Spark takes care of all these complexities under the hood; you as a user are abstracted from them. In this lecture and the few lectures that come after it, we focus on RDDs. Let's get a big-picture understanding first. They have three defining characteristics: partitions, the fact that they are read-only, and the fact that they know their lineage. Let's work on understanding these and how these characteristics help RDDs be the incredibly useful structure that they are.
The first thing that we learned about RDDs was the fact that they are in memory. They don't actually live on disk; instead, they are in RAM. When you have huge amounts of data, it can take very long to process it on a single machine. It's also possible that the data is just too large to fit on that one machine. In distributed processing, one of the first things that you need to do is to divide this data into partitions such that there is one partition of data on every node. Each division of this data is called a partition, and you'll have one partition on one machine, with the entire data existing across this cluster of machines.
These partitions are distributed to multiple machines, that is, multiple nodes in the cluster. Each node will contain one bit of the data. For example, data about Sweden might be present on one node, and data about Germany present on another. Partitioning data, so that it is split and distributed across nodes, allows processing on these nodes to run in parallel, which makes the overall processing much faster than running everything on a single node. Each partition is kept in memory on one of the nodes in the cluster. For example, partition one will be on node one, in the memory of node one. Partition two will be in the memory of node two, and so on. So how is the data partitioned and distributed to these nodes? What formula is used to split up the data and put it on different machines?
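The idea of splitting data into partitions and processing each one independently can be sketched in plain Python. This is only a toy model under stated assumptions, not how Spark itself is implemented: the helper names `split_into_partitions` and `count_delayed` and the flight records are hypothetical, and a thread pool stands in for the separate nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous, roughly equal slices."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional record each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def count_delayed(partition):
    """Work done on one partition only: count flights with a positive delay."""
    return sum(1 for _flight_id, delay in partition if delay > 0)


# Hypothetical records: (flight id, delay in minutes).
flights = [(i, (i % 5) - 2) for i in range(20)]
partitions = split_into_partitions(flights, 4)

# Each worker thread plays the role of one node holding one partition.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_partition = list(pool.map(count_delayed, partitions))

# Combine the partial results, as a driver program would.
total_delayed = sum(per_partition)
print(per_partition, total_delayed)
```

In PySpark, the corresponding ideas are exposed through calls such as `sc.parallelize(data, numSlices)` and `rdd.getNumPartitions()`, with Spark deciding where each partition lives.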
Typically, the data that you would use in Spark comes from a distributed file system like HDFS, where data is already partitioned. Let's take a brief look at how HDFS works and how data is distributed across various nodes using this file system. If you're already familiar with Hadoop and HDFS, you can skip the next three to four minutes. When you store data on a file system like HDFS, it's already partitioned and spread across multiple machines in the Hadoop cluster. HDFS is the Hadoop Distributed File System. Hadoop uses this to store data across multiple disks. If you're working with Spark, it's useful to have a basic knowledge of the whole ecosystem.
Hadoop is normally deployed on a group of machines, and this group of machines is called a cluster. Let's say you have five machines. Each machine in a cluster is called a node, so you have five nodes in the Hadoop cluster. One node is called the master node, or the name node. This name node manages the entire file system, so files distributed across multiple nodes are entirely managed by the one name node in the Hadoop cluster. How does it do it? By storing information on the directory structure, as well as the metadata for every file that exists in HDFS. Both of these pieces of information are stored in the name node. The other nodes are data nodes. You can typically have multiple data nodes, so you have one, two, three and four in the diagram. The data is physically stored on the data nodes.
The Hadoop Distributed File System spans all the nodes in the cluster. One of the nodes is, of course, the name node; the remaining are data nodes, and the data physically resides there. Consider a large text file. It's huge. It doesn't fit on one machine. Let's see how we can store this file on HDFS such that it's partitioned to sit across multiple nodes. The entire file is first broken into blocks of 128 MB. Each block is 128 MB in size.
Here we assume that we have eight blocks. The 128 MB size is specifically chosen as a trade-off between the time to seek a block and the time to retrieve the block from disk. These blocks are then stored across the data nodes, so you can split up these blocks and assume that some of them live on data node one, some on two, some on three and some on four. Then we have the name node, which stores the metadata for this entire file. The name node keeps track of which block lives on which data node, so the block locations for each file are stored in the name node. Let's say the file is called file one. Its block one is stored on data node one, block two on data node one, block three on data node two, and so on. Just think of the name node as the table of contents for the data that lives in HDFS: it allows you to look up where the blocks of a file live.
A file is read using two pieces of information: the metadata in the name node and the blocks in the data nodes. The metadata tells us where each block is, and the blocks give us the actual information. Let's say we have data node three and it has two blocks. It's always possible to lose portions of a file: just one block may be lost or corrupted, or the entire data node could go down. The part of the file that is stored on this data node may then be lost. So what do we do in such a case? Handling such failures is one of the key challenges in distributed storage: keeping track of what lives where and making sure all of it is in a perfect, non-corrupted working state.
In order to achieve this, you can define a replication factor in HDFS. You won't simply store one block on one node; you'll replicate that block in multiple places. Each block is replicated, and the replicas are stored on different data nodes. So block one and block two on data node one will be replicated on data node two. If you lose data node one, you don't lose the entire information.
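The block splitting, the name node's lookup table and the effect of a replication factor described above can be sketched as a small toy model. Everything here is an illustrative assumption rather than HDFS behaviour: the function names, the node names and the simple round-robin placement are all hypothetical, and real HDFS uses its own placement policy.

```python
BLOCK_SIZE_MB = 128


def count_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold the file (ceiling division)."""
    return -(-file_size_mb // block_size_mb)


def place_blocks(file_name, num_blocks, data_nodes, replication):
    """Toy name-node table: (file, block number) -> data nodes holding a copy.

    Assumption: round-robin placement, with each replica of a block on a
    different data node. This is NOT the real HDFS placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    table = {}
    for b in range(num_blocks):
        table[(file_name, b + 1)] = [
            data_nodes[(b + r) % len(data_nodes)] for r in range(replication)
        ]
    return table


def readable_after_failure(table, failed_node):
    """True if every block still has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in nodes) for nodes in table.values()
    )


data_nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
num_blocks = count_blocks(1024)  # a 1024 MB file in 128 MB blocks

replicated = place_blocks("file1", num_blocks, data_nodes, replication=2)
unreplicated = place_blocks("file1", num_blocks, data_nodes, replication=1)

print(num_blocks)
print(replicated[("file1", 1)])
print(readable_after_failure(replicated, "datanode1"))
print(readable_after_failure(unreplicated, "datanode1"))
```

With two copies of every block, losing one data node still leaves the whole file readable; with a single copy, the blocks on the failed node are gone.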
The name node also keeps track of replicas, so where the replica for a particular block is is also tracked in the name node. For every block, it keeps track of where the master copy is and where the replicas are. Managing data stored in partitions is a big deal, and it's a complex operation. HDFS, as a part of the Hadoop ecosystem, takes care of this for you.
So what does all this have to do with RDDs and partitions in RDDs? If Spark happens to be reading from HDFS, that is, you've chosen HDFS as your storage system, then the data is already partitioned. Spark simply piggybacks on the HDFS partitions: the file blocks in HDFS become the partitions in Spark. Every node that has the data on disk loads this data into memory on the same node where the data is present on disk. That is how Spark uses the data: it piggybacks on the partitions present in HDFS. Now, the same node may not have enough memory to hold the data; the RDD might be really huge, and that particular node may not have enough memory. In that case, another node is chosen to store the data for that partition, and the choice is made such that the network transfer time from the disk on one node to memory on another node is minimized.
Back to the three specific characteristics of a Resilient Distributed Dataset. We've discussed partitions. Let's now focus on the read-only nature of RDDs. RDDs are immutable once created. You can't actually perform any operation that changes that RDD. You can only do two things with an RDD once it's created. You can read data from it, which you do using actions; an action called on the airlines RDD, for example, reads rows from the result. If you're not reading data from it, you can transform it into another RDD. Any transformation that you perform on an RDD gives you another RDD as a result; the original RDD does not change. It's immutable.
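The read-only behaviour described above, where transformations return a new dataset and actions only read, can be illustrated with a small toy class. This is a sketch under stated assumptions and not the Spark API: `ReadOnlyDataset` and the sample airline rows are hypothetical, and unlike Spark, which evaluates transformations lazily, this toy version computes each result immediately.

```python
class ReadOnlyDataset:
    """Toy stand-in for an RDD: rows can be read or transformed, never edited."""

    def __init__(self, rows):
        # A tuple cannot be modified in place, so the rows stay fixed.
        self._rows = tuple(rows)

    def map(self, fn):
        """Transformation: returns a NEW dataset, leaving this one untouched."""
        return ReadOnlyDataset(fn(row) for row in self._rows)

    def filter(self, predicate):
        """Transformation: returns a NEW dataset with only the matching rows."""
        return ReadOnlyDataset(row for row in self._rows if predicate(row))

    def collect(self):
        """Action: reads the rows out as a list (a copy, not the internal data)."""
        return list(self._rows)


# Hypothetical rows: (airline code, delay in minutes).
airlines = ReadOnlyDataset([("AA", 12), ("DL", -3), ("UA", 45)])

delayed = airlines.filter(lambda row: row[1] > 0)  # new dataset
codes = delayed.map(lambda row: row[0])            # another new dataset

print(codes.collect())
print(airlines.collect())  # the original is unchanged
```

PySpark RDDs offer methods with the same names (`map`, `filter`, `collect`), and there too each transformation yields a new RDD rather than modifying the one it was called on.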