RDD Characteristics: Partitions and Immutability

Loony Corn
A free video tutorial from Loony Corn
An ex-Google, Stanford and Flipkart team
4.2 instructor rating • 73 courses • 131,400 students

Lecture description

RDDs are very intuitive to use. What are some of the characteristics that make RDDs performant, resilient and efficient? 

Learn more from the full course

From 0 to 1 : Spark for Data Science with Python

Get your data to fly using Spark for analytics, machine learning and data science​

08:18:52 of on-demand video • Updated February 2018

  • Use Spark for a variety of analytics and Machine Learning tasks
  • Implement complex algorithms like PageRank or Music Recommendations
  • Work with a variety of datasets from Airline delays to Twitter, Web graphs, Social networks and Product Ratings
  • Use all the different features and libraries of Spark : RDDs, Dataframes, Spark SQL, MLlib, Spark Streaming and GraphX
English [Auto] Become comfortable and proficient at working with Spock you need to really understand how resilient distributing the sex work. So we've talked a little bit about our babies before. We've also seen it in action. Let's take a quick recap of what we've learned before we move on to more repeats are these other mean programming abstraction Spock. These are the fundamental data structures just like objects are Kujawa are they these are two Spock. They are in memory objects. All meat is processed in memory. They are babies but they are still resilient. That's where the resilient word in our babies come from. Their fault tolerant. When you deal with the vast amount of data data and those data corruption losing data all of these are part and parcel of working with vast amounts of data. All of these are designed to be Foyt are living in such situations. There are a lot of complexities involved when you work with the assets of the scientists for example datasets of this size do not fit on a single note. You have to distribute them across a cluster so you can work on it and probably if in-memory data is lost. And remember our baby is in memory. You have to build up a four star system in order to make sure that that detail can be addicting. And finally when you actually process this data it has to be done efficiently and fast. So these are the challenges for any big data processing in traditional data analysis environments for example Excel or if you want to go the programmatic way in Java or C++ all of these would be the responsibilities of the apple distribution be for tolerance efficient processing and midi's inspan exist to abstract the users from these complexities. You can interact and play with billions of rules of data without caring about any of these. You're not in charge of distribution for tolerance or efficient processing. You just need to get your job done using IDs and spark spark takes care of all these complexities. Under the hood and you as a user are abstracted this lecture and a few lectures which come after this we focus on Archy's. Let's get a big picture understanding first they have the defining characteristics optician's the fact that they are Read-Only. And the fact that they know their lineage let's look on understanding these and how these characteristics help are be neither incredibly useful structure that this. The first thing that we learned about our TVs was the fact that they were in memory datasets. They don't actually live on the disk. Instead they are in a RAM memory. So then you have huge amounts of data. It can be very long to process them on a single machine. It's also possible that the detail is just too large to fit on that one machine. Industry processing one of the first things that you need to do is to divide this data into partitions such that there is one partition of data in every node. Each division of the state is called a partition and you'll have one partition on one machine with the entire eight existing across this cluster of machines. These partitions I disputed do multiple machines are multiple nodes in the cluster. Each node will contain one bit of data for example in Schwed dined with talus present on one node they are generally in the deepest present another partitioning data. When they split and distributed across nodes allows processing on these nodes to run in badly which makes the overall processing much faster than if run on a single each partition is kept in memory on one of the north in the cluster. For example partition 1 will be an old one. It's in the memory of node one action that will be in the memory of no. And so on. So how is that the partitioned and distributed to these nodes. What form allows you to split up the data and put them on different machines. Typically the data that you would use on sparc comes from a distributed file system like HFS that it is or the partition. Let's take a brief look at how the FS worked and how it is distributed across various nodes using this file system. If you're already familiar with her open at the DFS you can skip the next three to four minutes. When you store data on a file system like DFS it's already partitioned and spread across multiple machines on the Hadoop system. It's the FS if the Hadoop distributed file system how do you use this to store data across the disks. If you're working on sparc it's useful to have a basic knowledge of the Hadoop ecosystem are Aruba's normally deployed on a group of machines. So this group of machines is called a cluster. So typically let's say you have five machines each machine in a cluster is called a node so you have five nodes in the whole to just one node. It's called the master node. It's called the nime Nord this name node manages the entire file system so files distributed across multiple nodes is entirely managed by the one named node in the loop tester. How does it do it by storing information on that it actually structure for less storing information on the method data for every file that exists in. Actually if both of this piece of information is stored in the name or the other nodes are detached. So you can typically have multiple be done on. So you have that 1 2 3 and 4 in the diagram that we've set up. The data is physically stored in the downwards the distributed file system spans all the more than the cluster. One of the notes is of course the need. Note that the meaning are the code and the it physically decides that. Consider a large text file. It's huge. It doesn't sit on one machine. Let's see how we can store this file on the FSS that it's partitioned to set across multiple nodes the entire file is first broken into blocks of 128 and B each block is 128 and B in size. And here we have a zoom that you have eight blocks 128 and be specific need to in effect lead off to minimise that time to seek the block and to read the block from disk. These blocks are stored across beaten works. So you can split up these blocks and as you that some of these blocks live on and on on one family one to someone three and four. So we have the need node which stores the method data for this entire fight this this node keeps track of which block lives and which they cannot so block locations for each file are stored in them. So let's say the file is quite a fine one in the same block one is stored on beaten or one block to be done on one block three on the on and so on. Just think of the name nor the table of contents for the data that lives in actually if it allows you to look up where the blocks of the palace a file is read using two pieces of information. The metadata in the node and the blocks in the leader is the method that tells us where the block and the blocks give us actual information. So let's say we have that and all three and it has two blocks. It's always possible to lose portions of a file in this way but just one block is lost or corrupted on the entire to be done or could go on. So that part of the file which is stored in this data and old may be lost. So what do we do in such a case. Such things happen to be one of the key challenges in distributed storage. Keeping track of what gets well and making sure all of them are in the thick nonconductive book extatic in order to achieve this. You can put a fine Legations factor in actually face. So you simply want to store one block on one node you'll replicate that block in multiple places. Each block is replicated and the replicas are stored on different. Than x. So you have block one block doing that and will one day be replicated in. They don't old to. So if you lose the old one you don't lose the entire information. The name node also keeps track of. So there are replica of the particular block. This is also Drachten the them for every block and keep track of where the master copy is and whether the guys managing the stored in partitions is a big deal. And it's a complex operation. It's DFS as a part of the ecosystem. Ex-gang of this for you. So what does all this have to do with our duties and partitions and Biggie's if Sparke happens to be eating the of them actually. FS you've chosen DFS as a storage system then the DS is already partitioned Spock simply D-Backs on that DFS partitions the file blocks and the FS become the partitions in bulk and every node which has the data on disk then loads this data into memory on the scene node where the data is present on disk. That's how SPARC uses the data. It piggybacks on the partitions present in HD if it's not the same node may not have enough memory to hold that the does with the RTD might be really huge. And that particular note may not have enough memory in that case. Another node is chosen to store the data for that partition. And the choice is made such that the network transfer pain from the disk on one node the memory on another node is maze Bactrim discussing the pre-specified got it to the sticks of a resolution distribution dataset and discussed partitions. Let's not focus on the read on the nature of Arbys are these are immutable ones created. You can't actually perform any operations which affects bad RTD. You can only do two things within IDB once it's created you can read data from it which you'll do using actions big airplanes dot ache and that is an action that you retrieve and Andrews from the if you're not reading data from it you can transform it to one another. RBD so any transform operations that you do on an be gives you another oddity the other side that particular IPD does not change it in.