Key/Value RDDs and the Average Friends by Age Example

A free video tutorial from Sundog Education by Frank Kane

Lecture description

You'll learn how to use key/value pairs in RDDs, and the special operations you can perform on them. To make it real, we'll introduce a new example: computing the average number of friends by age using a fake social network data set.

Learn more from the full course

Taming Big Data with Apache Spark and Python - Hands On!

Apache Spark tutorial with 20+ hands-on examples of analyzing large data sets on your desktop or on Hadoop with Python!

06:54:01 of on-demand video • Updated August 2021

OK, so now we've seen a simple example of how Spark works. Let's build upon that and start doing some more complicated things. A powerful thing to do with an RDD is to put some more structured data into it. One thing we can do is put key/value pairs of information into a Spark RDD. Then we can treat it like a very simple database, if you will. Let's walk through an example where we have a fabricated social network set of data. We'll analyze that data to figure out the average number of friends, broken down by age, of people in this fabricated social network. We're gonna use key/value pairs in RDDs to do that. Let's cover the concepts first, and then we'll come back later to actually run the code.

OK. Let's build upon what we've learned about RDDs and extend it to a new kind of RDD: a key/value RDD. And we're gonna do this in the context of a real example, where we're going to look at some fake social network data and try to figure out the average number of friends, broken down by age.

So, RDDs can hold key/value pairs in addition to just single values. In our previous examples, we looked at cases where an RDD included, for example, lines of text from an input data file, or an RDD contained movie ratings. In those cases, every element of the RDD contained a single value: either a line of text or a movie rating. But you can also store more structured information in RDDs. There's a special case where each element is a list of two items, and that is considered a key/value pair RDD. These are really useful things, because once you start storing key/value pairs in an RDD, it looks a lot like a NoSQL database. It's just a giant key/value data store at that point, and you can do things like aggregate information by key. That will come in handy in the example we're about to do.

So, syntactically, there's nothing really special about key/value RDDs.
It's all kind of magic in Python: if you're storing a list of two items as the values in the RDD, then it is a key/value RDD, and you can treat it as such. So here's a simple example. If I want to create a totalsByAge RDD out of an original RDD that contains a single value in each element, my lambda function can take each x, say a rating or a number, whatever happens to be in x, and transform it into a pair of the original number and the number one. The syntax with the parentheses indicates that this is a single entity as far as Python is concerned, but it consists of two elements. It's a tuple: a list of two things. In this example, the first item will be the key and the second item will be the value. And again, the key is important because we can do things like aggregate by key. That's all there is to creating a key/value RDD.

It's also OK to have complex things in the value. So I could keep the key as the original value from the first RDD and make the value itself a list of however many elements I want. I'm not limited to storing just one thing in the value of a key/value RDD; I can store a list of things there if I want to. And we're gonna do that in this example, just to illustrate how that works.

So like I said, one of the most useful things you can do with a key/value RDD is reduceByKey. What that does is combine all the values that are found for the same key, using some function that you define. So, for example, if I want to add up all of the values for a given key, let's say all of the numbers of friends for a given age, just to pull an example out of the air, something like this would do the job. When you call a reduce function in Spark, your function gets not just an x, but an x and a y value. You can actually call those whatever you want, but what you need to do is define a function that says how to combine two values that are found for the same key.
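The slide code referred to here is not part of the transcript, so the following is only a minimal plain-Python sketch of the two ideas: turning single values into (key, value) pairs, and combining values per key. The sample numbers are illustrative, `reduce_by_key` is a hypothetical local helper that imitates what Spark's reduceByKey does on one machine, and the PySpark calls in the comments assume an existing RDD named `rdd`.

```python
from collections import defaultdict
from functools import reduce

# Illustrative sample data: one number per element, standing in for an
# RDD whose elements are single values.
ages = [33, 33, 55]

# PySpark form (assuming an RDD named rdd): rdd.map(lambda x: (x, 1))
# Each element becomes a tuple of the original number and the number 1.
pairs = [(x, 1) for x in ages]


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for rdd.reduceByKey(func)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Combine all values found for the same key, two at a time, using func.
    return {key: reduce(func, values) for key, values in grouped.items()}


# Illustrative (age, number of friends) pairs.
friends_by_age = [(33, 385), (33, 2), (55, 221)]

# PySpark form: rdd.reduceByKey(lambda x, y: x + y)
total_friends = reduce_by_key(friends_by_age, lambda x, y: x + y)

print(pairs)          # [(33, 1), (33, 1), (55, 1)]
print(total_friends)  # {33: 387, 55: 221}
```

The function passed in only ever sees two values for the same key at a time; the key itself never appears in it.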
And in this case, our function is just x + y. So it says we're gonna keep adding things together to combine the values for a given key. So again, if I was doing a reduceByKey on a key/value RDD where the keys were ages and the values were numbers of friends, I could get the total number of friends found for each age with something like this, because every value that's found for that key will be added together using that function, which is just x + y.

There are other things you can do with key/value RDDs as well. You can call groupByKey: if you don't actually want to combine the values together quite yet, you can just get a list of all the values associated with each key. You can also call sortByKey, so if you want to sort your RDD by the key values, that makes it easy to do. And you can split out the keys and the values into their own RDDs using keys() and values(). I don't see that too often, but it's good to know that it exists.

Like I mentioned before, you've kind of created a NoSQL data store here, right? It's a giant key/value data store, so you can start to do SQL-ish sorts of things. Since we have keys and values in play, we can do join, rightOuterJoin, leftOuterJoin, cogroup, and subtractByKey. These are ways to combine two key/value RDDs to create a joined RDD as a result. Later on, when we're looking at making similar-movie recommendations, we'll have an example of doing that by joining one key/value RDD with another to get every possible permutation of movies that were rated together.

Now, this is a very important point with key/value RDDs. If you're not going to actually modify the keys in your transformations on the RDD, make sure you call mapValues or flatMapValues instead of just map or flatMap. That's important because it's more efficient, and this is getting a little bit technical.
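As a rough illustration of what mapValues and flatMapValues do to each pair, here is a plain-Python sketch. The helpers `map_values` and `flat_map_values` are hypothetical local stand-ins, not Spark APIs, and they say nothing about partitioning; they only show that the function receives the value while the key is carried along unchanged.

```python
# Illustrative (age, number of friends) pairs.
pairs = [(33, 385), (33, 2), (55, 221)]


def map_values(pairs, func):
    """Hypothetical stand-in for rdd.mapValues(func): one output per input."""
    return [(key, func(value)) for key, value in pairs]


def flat_map_values(pairs, func):
    """Hypothetical stand-in for rdd.flatMapValues(func).

    func returns a sequence; every item in it becomes its own pair
    under the original key.
    """
    return [(key, item) for key, value in pairs for item in func(value)]


# PySpark form (assuming an RDD named rdd): rdd.mapValues(lambda x: x * 2)
doubled = map_values(pairs, lambda x: x * 2)

# PySpark form: rdd.flatMapValues(lambda x: [x, x + 1])
expanded = flat_map_values(pairs, lambda x: [x, x + 1])

print(doubled)   # [(33, 770), (33, 4), (55, 442)]
print(expanded)  # [(33, 385), (33, 386), (33, 2), (33, 3), (55, 221), (55, 222)]
```

In both cases the lambda is written purely in terms of the value, which is why the keys cannot change.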
Calling mapValues or flatMapValues allows Spark to maintain the partitioning from your original RDD instead of having to shuffle the data around, and shuffling can be very expensive when you're running on a cluster. So any time you're calling map or flatMap on a key/value RDD, ask yourself: am I actually modifying the keys? If the answer is no, you should be calling mapValues instead, or flatMapValues.

And just to review again: mapValues has a one-to-one relationship, so every element in your original RDD will be transformed into one new element using whatever function you define. But flatMap can actually blow that out into multiple elements per original element, so with flatMapValues you can end up with a new RDD that's longer, or contains more values, than your original one.

Now, one thing to keep in mind is that if you're calling mapValues or flatMapValues, all that will be passed into the function you're using to transform the RDD is the values themselves. Don't take that to mean that the keys are getting discarded. They're not; they're just not being modified and not being exposed to you. So even though mapValues and flatMapValues will only receive one value, which is the value of the key/value pair, keep in mind that the key is still there. It's still present; you're just not allowed to touch it.

I realize this is a lot to digest. It's gonna make a lot more sense when we look at a real example, so let's look at a real example. Bear with me.

So, to get back to what we're going to try to do: to illustrate these concepts, I've generated a fake data set, just completely at random, that represents a social network. Every line has a user ID, a user name, the age of that user, and the number of friends that user has. OK, so for example, user ID 0 might be named Will, and he's 33 years old and has 385 friends. These ages and numbers of friends are all assigned completely at random, so don't attach any sort of deep meaning to them. And you might notice that I'm a Star Trek fan here.

So that's the source data we're gonna work with, and our task is to figure out the average number of friends by age. So, for example, what's the average number of friends for the average 33-year-old in our data set? Well, let's figure that out.

Our first step is just to parse that data into what we need. Nothing special here: we're gonna start by creating a lines RDD by calling textFile on our SparkContext with our source data. That just gives us an RDD where every individual line of that comma-separated value list is an individual entry in our RDD.

And now things get interesting. I'm going to transform my lines RDD into an RDD very creatively named rdd by calling map on it, and I'm passing in the parseLine function to actually conduct that mapping. So every line from my lines RDD will be passed into parseLine, one at a time, and I'm gonna parse it out. The first thing we do is split it based on commas, and that will bust out the different fields we need. Then I extract the fields I'm interested in. If I'm just trying to figure out the number of friends by age, all I care about is the number of friends and the age; the user ID and the user name are irrelevant, so I'm just gonna discard those.

So I extract the age from field number two, which is actually the third field, because we start counting from zero, remember. And this is an important point: I'm actually casting it to an integer, because I want to treat it as a numerical value. That allows me to do arithmetic operations on it later.
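The parsing function on the slide is not reproduced in the transcript. A minimal sketch of it, assuming the line layout described in the lecture (user ID, name, age, number of friends, separated by commas), might look like this. The sample lines reuse the ages and friend counts quoted in the lecture, but the second and third names are made up, and the file path in the comment is a placeholder.

```python
def parse_line(line):
    """Turn one comma-separated line into an (age, num_friends) pair."""
    fields = line.split(',')
    age = int(fields[2])          # third field, counting from zero
    num_friends = int(fields[3])  # fourth field
    # Returning a two-item tuple is what makes the result a key/value pair.
    return (age, num_friends)


# Illustrative input lines; names other than Will are placeholders.
sample_lines = [
    "0,Will,33,385",
    "1,UserB,33,2",
    "2,UserC,55,221",
]

# PySpark form (path is a placeholder for wherever your file lives):
#   lines = sc.textFile("path/to/your/data.csv")
#   rdd = lines.map(parse_line)
parsed = [parse_line(line) for line in sample_lines]

print(parsed)  # [(33, 385), (33, 2), (55, 221)]
```

Without the `int(...)` calls, the pairs would hold strings such as `'33'` and `'385'`, and adding two of them would concatenate the text rather than sum the numbers.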
Now, if I didn't do that, Python would just keep treating it as a string, and I wouldn't be able to do things like add the values up and divide them, which I'm gonna have to do if I want to get averages at the end of the day. Similarly, I'm going to cast the number of friends to an integer as well, using the syntax int with the value in between parentheses. So fields[3] gives me back a string value of some number, and int makes sure Python knows that it's actually a number, that it should treat it as such, and that I can perform arithmetic on it.

And here is where we actually transform things into a key/value RDD. Instead of returning a single value, I'm returning a key/value pair of the age and the number of friends. So the RDD I'm creating with this parseLine mapper function is a key/value RDD with a key of age and a value of numFriends. With me so far?

So, for example, transforming the original data that I showed on the previous slide, the output will be a key/value pair RDD that contains something like this: the first user in our data had an age of 33 and 385 friends, the second user had an age of 33 and 2 friends, the third user was 55 years old and had 221 friends, and so on and so forth. With me so far? This is an important concept to grasp, so feel free to hit pause and stare at the slide a little bit longer if you need it to sink in. Let's move on.

All right. So now I'm gonna throw you into the deep end of the pool here. Look at that big scary line. But if we break it down into its components, what's going on here is pretty straightforward. What we need to do next is aggregate this information somehow.
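The compound line itself is on a slide that the transcript does not include. Based on the description in the lecture (mapValues followed by reduceByKey), it plausibly has the shape shown in the comment below; the plain-Python code underneath is only a single-machine imitation of the same two steps, using the three sample pairs quoted in the lecture.

```python
from collections import defaultdict
from functools import reduce

# (age, number of friends) pairs, as produced by the parsing step.
parsed = [(33, 385), (33, 2), (55, 221)]

# Assumed PySpark shape of the compound operation (rdd holds the pairs above):
#   totalsByAge = rdd.mapValues(lambda x: (x, 1)) \
#                    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

# Step 1, the mapValues part: each value becomes (number of friends, 1).
with_counts = [(age, (friends, 1)) for age, friends in parsed]

# Step 2, the reduceByKey part: add the pairs component by component per key.
grouped = defaultdict(list)
for age, value in with_counts:
    grouped[age].append(value)

totals_by_age = {
    age: reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), values)
    for age, values in grouped.items()
}

print(with_counts)    # [(33, (385, 1)), (33, (2, 1)), (55, (221, 1))]
print(totals_by_age)  # {33: (387, 2), 55: (221, 1)}
```

Each resulting value holds the total number of friends and the number of people seen for that age, which is everything needed to compute an average afterwards.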
Let's just break this down one component at a time. You can see we have sort of a compound operation going on here: we're taking our RDD of (age, number of friends) key/value pairs and calling mapValues on it, and then we're taking the resulting RDD and calling reduceByKey on it. So let's take this one step at a time, starting with the mapValues piece.

What this does is transform every value in my key/value pairs. Remember, we're calling mapValues, so the x that's getting passed in is only going to be the value piece of the original RDD. Let's take that first line as an example. The first entry in our rdd RDD is a 33-year-old who had 385 friends. So the value 385 gets passed in through mapValues, as happens for every line, and that's our x here in the lambda function. Our output is going to be a new value that is actually a pair, a list if you will, of 385 and the number 1.

The method behind our madness here is that in order to get an average, we need to count up both the total number of friends seen for a given age and the number of times that age occurred. So later on, if we sum up all of these pairs, we'll get the total number of friends for that age and the total number of times that age occurred, once we look at the totals for a given age. So that's our strategy here: build up a running total of how many times 33-year-olds were seen and the total number of friends that they had. That's what we're trying to accomplish.

But to get back to the syntax, just to review again: mapValues will receive each value, which in our case is the number of friends, and output a new value, which is actually a tuple that contains the original number of friends and the value 1. So again, this is an example of a key/value pair where the value is not just a single number; it's actually a collection of numbers, a list.
And that's perfectly OK. So our output is a new RDD that is still a key/value pair RDD, but the keys are untouched, because we called mapValues, and the values are now transformed from just a number of friends into this pair of 385 and the number 1.

Now we need to add everything up, and that's where the reduceByKey part comes in. reduceByKey just asks us: how do we combine things together for the same key? So again, going back to our example, let's say we're looking at every 33-year-old, so we're looking at keys of 33. Our lambda function here takes in two values and says how to add them up. For example, in this case we have x coming in as (385, 1), and y might be (2, 1). The function just says: add up each component. Take the first element of each value and add them together, then take the second element of each value and add them together. The output in this case would be 385 + 2 = 387 and 1 + 1 = 2. And we'll keep doing that repeatedly, every time we encounter values for the key 33, and add them all up.

So what we have at this stage is, for a given key, the grand total of the number of friends and the number of times we saw that key, and reduceByKey will do that for every single key. That's what reduceByKey does. And with that, we have the information we need to actually compute the averages we want.

We'll do that on the next slide, but it's probably a good idea to hit pause at this point, because even I had a hard time wrapping my head around this at first. If you want to step through this in your head a little bit more, that might be useful. So go ahead and hit pause and let this sink in if you need to. Let's move on.

So the last step is just to transform these pairs of the total number of friends and the number of times that key was encountered into an actual average value.
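The averaging step can be sketched in plain Python as follows, starting from the totals worked out in the example above. The PySpark lines in the comments show an assumed shape for the slide code; `averagesByAge` and `results` are illustrative names, not taken from the transcript.

```python
# Totals per age from the previous step: (total friends, number of people).
totals_by_age = {33: (387, 2), 55: (221, 1)}

# Assumed PySpark shape of the final transformation and the collect call:
#   averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])
#   results = averagesByAge.collect()

# Divide the total number of friends by the number of people, per age.
averages_by_age = {
    age: total / count for age, (total, count) in totals_by_age.items()
}

for age, average in sorted(averages_by_age.items()):
    print(age, average)
# 33 193.5
# 55 221.0
```

As with the earlier mapValues step, the dividing function only sees the (total, count) pair; the age key is carried through unchanged.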
And that's what this final line does with mapValues. Again, we're just receiving the value part of our key/value pair in our lambda function, because we're calling mapValues and leaving the keys untouched. So the age, in this example 33, will remain untouched and won't even be shown to us in our lambda function. Our lambda function will receive the value, which is this pair of the total number of friends and the number of times that age was encountered, and just divide the two to get an average. So the output at this stage for the key 33 would be: we transform 33 and its pair of (387, 2) into 33 and 193.5, meaning 33-year-olds had an average of 193.5 friends.

Now that we have our final results, all we have to do is call collect to gather them and print them out. One thing I want to mention: remember that nothing actually happens in Spark until the first action is called. reduceByKey, like map and mapValues, is a transformation, so Spark evaluates it lazily; the action in our script is the collect call. At that stage, Spark will go out and actually construct that directed acyclic graph and figure out the optimal way to compute the answer we want. So again, that's the key to why Spark is so fast. And there we have it.

So let's take a look at the actual code, dive in, actually run it, and see what happens.

OK. So that's the overview of how we're going to use key/value pairs in our RDDs to analyze our data set of fake social network data. Now that we've walked through how that code is gonna work, let's make it real. In our next lecture, we'll run this code for real on an actual data set. See you there.