Map and Reduce Transformation Functions

Imtiaz Ahmad
A free video tutorial from Imtiaz Ahmad
Senior Software Engineer & Trainer @ Job Ready Programmer
4.6 instructor rating • 12 courses • 254,218 students

Learn more from the full course

Master Apache Spark - Hands On!

Learn how to slice and dice data using the next generation big data platform - Apache Spark!

06:49:21 of on-demand video • Updated October 2020

  • Utilize the most powerful big data batch and stream processing engine to solve big data problems
  • Master the new Spark Java Datasets API to slice and dice big data in an efficient manner
  • Build, deploy and run Spark jobs on the cloud and benchmark performance on various hardware configurations
  • Optimize Spark clusters to work on big data efficiently and understand performance tuning
  • Transform structured and semi-structured data using Spark SQL, Dataframes and Datasets
  • Implement popular Machine Learning algorithms in Spark such as Linear Regression, Logistic Regression, and K-Means Clustering
Hi there, welcome to the lecture. Now we're going to talk about the map function and the reduce function that I briefly introduced to you in the previous lecture. Think of the map function as a transformation: it takes one element and produces one element as output. So if you want to map, for example, the letters A through Z to their number representations, you can have a function that takes a letter and outputs the number for that letter. That's the map function: one input, one output, very straightforward. The reduce function takes multiple inputs and produces one output. So if you had, for example, 10 items going into the reduce function, it would output just a single item.

These are fundamental functions that were a big piece of Hadoop, and they're still a big piece of Spark as well, but they are lower-level functions that are happening all the time behind the scenes in any batch application, not just Spark or Hadoop. With the advancements in the Spark API we normally don't have to think at that granular level, but it's important to understand these two functions, because you're going to be using them to convert your objects, and you need to understand what mapping and reducing actually mean in order to work with your own POJOs.

OK, so let's go over an example. Let me clear away some of the stuff that we worked on in the previous lecture, up to the point where we create a dataset. Now, the map method can be invoked on a Dataset object like this: we can do ds.map and call the function. The map method can also be invoked on RDDs, which we'll talk about later; those are more foundational, underlying data containers. But again, the idea behind a map function is to map each of the items that are inside of the dataset to something else.
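Before we touch Spark itself, the one-in-one-out versus many-in-one-out idea can be sketched with plain Java streams, no cluster required. This is just an analogy of the concept, not Spark code; the class and method names here are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java analogy of the map and reduce concepts:
// map is one element in, one element out; reduce folds many elements into one.
public class MapReduceAnalogy {

    // map: each letter becomes its 1-based position in the alphabet
    static int letterToNumber(char letter) {
        return letter - 'a' + 1;
    }

    // reduce: many numbers go in, a single number comes out
    static int sum(List<Integer> numbers) {
        return numbers.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> mapped = List.of('a', 'b', 'c').stream()
                .map(MapReduceAnalogy::letterToNumber)
                .collect(Collectors.toList());
        System.out.println(mapped);       // [1, 2, 3]
        System.out.println(sum(mapped));  // 6
    }
}
```

The same two shapes, a per-element transformation and a many-to-one fold, are what Spark's map and reduce apply across a distributed dataset.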
So if the input to the map function that we're going to create in just a moment is "banana", then it will produce something else. Let's say it takes "banana" and returns something that concatenates a string to it, or turns it into some other fruit, or whatever. All of those instructions for what to do with each item in the dataset can be defined in the mapping function.

So let's actually do that. If I type ds.map, notice it has a few options here. The second option is the one I'm referring to; you'll see two functions. One is a Scala representation of the map function, and here is a Java representation. In the Java representation it accepts a MapFunction interface as the first argument, and the second argument to map is going to be the Encoder that we're going to use. So let's select this option. In the first argument it accepts an instance of a MapFunction implementation, so we need to define a class that implements this MapFunction. You can also use lambdas; lambdas are serializable, and anything that is a serializable implementation of the MapFunction interface can certainly be used as the argument here. But I don't want to overcomplicate this example with lambdas in case you're not familiar with how to use them, so for now let's keep things simple.

I'm going to define a class outside of here, and we want to make sure that it's a static class, so let's spell "static class" correctly, and I'm going to call it StringMapper. In this class we'll have the logic for what each of these items is going to map to once this function is invoked. So what is StringMapper going to implement? Hover your mouse here; again you'll see that it's MapFunction, parameterized with String and another type that we define here. We're not using our own user-defined types.
So the first type argument is String, and the second type argument is String as well: the input is a String and the output is also going to be a String. We also want to implement Serializable, to make sure this is a serializable class. Hover your mouse here and let's do the import for Serializable from the java.io package. Now, this is a specific Java detail: if you want to create a serializable class nested inside a parent class, like we're doing here, it had better be static. That is why I made this class static. So we're making sure this class is serializable, we're implementing the MapFunction interface with the input and output types, and of course we're also implementing the Serializable interface.

Now hover your mouse over MapFunction and notice it says import, so let's import that. Then it's going to ask us to implement the method: hover your mouse over StringMapper and it will ask, hey, do you want to add the unimplemented methods? And what is the unimplemented method? Let's click on it and notice: we need to implement the call method, and in here is where we define the mapping, what each of these items will return when it's passed into the call function. Of course the return type is going to be String as well: input is String, output is also String.

So what we can do is return the following. We'll create a string expression here, the word "word" with a colon, and we want to concatenate it with value. And what is value? That's the argument to the call function. Each of the elements in the dataset is going to be passed in as value to the call function, and what this function does is take that element and return this expression, which has the word and the colon concatenated before the value.
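The class-based mapper the lecture just built can be sketched like this. Note that MyMapFunction here is a hypothetical stand-in that only mimics the shape of Spark's MapFunction<T, U> interface (a single call method); the real one lives in org.apache.spark.api.java.function and declares throws Exception.

```java
import java.io.Serializable;

public class MapperSketch {

    // Hypothetical stand-in mimicking the shape of Spark's MapFunction<T, U>
    interface MyMapFunction<T, U> extends Serializable {
        U call(T value);
    }

    // The static, serializable nested class from the lecture:
    // input type String, output type String
    static class StringMapper implements MyMapFunction<String, String>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public String call(String value) {
            return "word: " + value;  // tag each element with a prefix
        }
    }

    public static void main(String[] args) {
        // In real Spark this would be invoked as:
        //   ds.map(new StringMapper(), Encoders.STRING())
        StringMapper mapper = new StringMapper();
        System.out.println(mapper.call("banana"));  // word: banana
    }
}
```

Each element of the dataset is fed to call as value, one at a time, and the returned string replaces it in the output dataset.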
Now, since this is Serializable, it's actually saying, hey, you need to add a default serial version ID. I'll just accept that suggestion and we'll leave it at that. So now this class is serializable. That's again a very important detail I'm telling you: make sure your classes are serializable, your mappers as well as your POJOs, which we'll talk about later.

So now we can invoke this StringMapper like this, and the encoder is going to be Encoders.STRING(), just like we were doing before. The instructions for what to do are in this StringMapper, and now we're going to assign the result to the dataset. Let's print what this dataset contains: we can do ds.show, just like we would on a DataFrame; all of those methods are of course still going to work. I'll just print the first 10 rows; I know we have only a couple of elements here, so it doesn't matter. So let's go to the application class and run this. OK, that program completed. Let's scroll up here, and boom, there we go: notice each of the rows has been modified. Instead of just saying banana and so on, it actually tags on the word with a colon, just as we specified in the mapper class.

Now I just want to show you what happens if I make this non-serializable. Remove the Serializable interface, comment out this line here, and instead of static I'll just make this a regular class, so this class is no longer serializable. If I run the same code again, we're going to run into an exception, and I want you to see what that exception looks like. Once that run is complete, let's scroll up, and boom, there you go, you see the exception. It gives a serialization stack, and if you keep scrolling you'll see the actual error: exception in thread main, Task not serializable. So if you see this, it means you need to serialize your mapper, or whatever function it is that you're using.
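The reason Spark insists on Serializable is that shipping your function to the executors means Java-serializing it. You can reproduce the same failure locally with plain java.io serialization; this sketch (class names invented for illustration) shows a serializable class succeeding and a non-serializable one failing the same way a Spark task would.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationDemo {

    // A static nested class that implements Serializable: safe to ship
    static class GoodMapper implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    // No Serializable: attempting to serialize it throws, just like
    // Spark's "Task not serializable" error
    static class BadMapper { }

    // Try to Java-serialize an object, reporting success or failure
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new GoodMapper()));  // true
        System.out.println(canSerialize(new BadMapper()));   // false
    }
}
```

This is also why the nested class must be static: a non-static inner class drags its enclosing instance along, and that enclosing object usually isn't serializable.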
So that is why we need to make this a static class and also implement Serializable; then let's uncomment this line here. Now, for those of you that are lambda fans: I have a special section on Java 8 lambdas in the appendix of this course, where you can learn all about lambdas if you're interested. There's a way to do the same exact thing with a lot less code. Rather than defining a new class that implements the MapFunction interface, you can just put this kind of instruction inside of a lambda.

A lambda expression looks like this: all we need is an implementation of what's going on here in the mapping function. So we can literally take this and paste it inside of a group of parentheses. Inside of these parentheses we paste in that expression, and we're going to say that it's implemented by this row, for example. Each row of the dataset is going to do this: take the word string and concatenate it with whatever the value of the row is. That's it. So instead of defining a separate class, making sure it's serializable, and then invoking the call function, we can just implement the call function right here using this lambda expression, and we won't even need that entire class, because lambda expressions here are serializable by default. We're sort of implementing the StringMapper right here; this is an implementation of the MapFunction interface.

So that's another way of doing it. We don't even need this class anymore; let me get rid of it, and the code will still run as before. Let me run it to prove it to you; we need to run it from the application class. Once that completes, if you scroll up, you'll notice we get the same output as before. So it's up to you which syntax you like better. If you're a lambda fan, you can certainly do this.
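Here is the lambda version of the same idea, again as a plain-Java sketch rather than real Spark code: MyMapFunction is a hypothetical stand-in mimicking Spark's MapFunction, and because the target interface extends Serializable, the lambda itself is serializable, so no separate static class is needed.

```java
import java.io.Serializable;

public class LambdaMapperSketch {

    // Hypothetical stand-in mimicking Spark's MapFunction<T, U>;
    // a lambda targeting a Serializable interface is itself serializable
    interface MyMapFunction<T, U> extends Serializable {
        U call(T value);
    }

    // The whole StringMapper class collapses into one lambda expression
    static final MyMapFunction<String, String> MAPPER = row -> "word: " + row;

    public static void main(String[] args) {
        // In real Spark: ds.map(row -> "word: " + row, Encoders.STRING())
        System.out.println(MAPPER.call("banana"));  // word: banana
    }
}
```

Same input, same output as the class-based version; the only difference is how the MapFunction implementation is written down.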
But if you're not as comfortable with lambdas, use the other approach that I showed you, where you create a serializable class that implements the MapFunction interface with your input and output argument types. So this is how to invoke the map function on a dataset.

Let me show you another function, known as reduce. Let's actually invoke that down here. The reduce function takes in multiple elements and outputs one element. In this case we've got one, two, three, four, five, six elements in this dataset. We can pass all six of these elements to the reduce function, and what the reduce function gives us is one element. And what is the type of that element? That's of course going to be String, because each of these elements is a String, so it's going to return us one String, and we can define in that method how we want to treat all six of those elements, what we want to do with them. We can define that logic in the reduce function.

The way that works is we can do ds.reduce, and of course we've got two implementations: one of them is the Scala representation and the other one is the Java representation. Let's choose the Java representation right here, where it actually gives ReduceFunction as the interface type. So let's select that, and now it expects an implementation of the ReduceFunction interface with the input as String. It's of course smart enough to understand that the input is going to be String, because the dataset is of type String. So it expects us to give it multiple strings so it can give us one string back. If it were a dataset of our own POJO, then it would expect multiple instances of that POJO to be sent to the reduce function, and it would return us one instance of that POJO, however it reduces them. OK, so let's do that. Let's create the class; it's going to be called StringReducer, which will be an implementation of the ReduceFunction interface, and that's it.
That's the only argument that reduce needs. We don't need to give it an encoder, because we're not returning a dataset; we are returning one item. And what is that item? It's going to be of type String, so this is of course going to return a String. Whatever value this returns, say stringValue, is going to be a single value.

So let's actually define what this thing is going to do. We'll create that class outside of this method as a static nested class. So let's do that: static class, we'll call it StringReducer, and it's going to implement the ReduceFunction interface, which accepts a type argument of String, and it's also going to implement the Serializable interface. Make sure that you spell implements correctly, with one s, like that. Then we want to import this: hover your mouse and let's import ReduceFunction. It's of course going to ask us to add the unimplemented methods; accept that suggestion. This method is also called call, and it accepts two arguments, because it's going to take multiple inputs.

Now, it's important to realize that it doesn't just take two inputs. It basically iterates through the entire list of elements, assigning the first one to v1 and the next one to v2; once it's done reducing those two values, it reassigns these variables for the next value in the iteration. So understand that this is not just two values: it accepts a list of values, reassigns these variables to each of the elements in that list, and reduces them in whichever way we define here. So how are we going to reduce this thing? All I want to do is return v1 plus v2. What is this going to return? It's of course going to concatenate one string with the next, and it's going to do that for all the strings in the dataset, so we should expect one string as the output, with all the words concatenated.
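The pairwise folding just described, where the result of call(v1, v2) is fed back in as the next v1, can be sketched in plain Java. MyReduceFunction here is a hypothetical stand-in mimicking the shape of Spark's ReduceFunction<T> (the real one is in org.apache.spark.api.java.function); the loop shows the sequential idea, while Spark actually reduces partitions in parallel.

```java
import java.util.List;

public class ReducerSketch {

    // Hypothetical stand-in mimicking the shape of Spark's ReduceFunction<T>
    interface MyReduceFunction<T> {
        T call(T v1, T v2);
    }

    // Fold the pairwise function across the whole list:
    // each result becomes the next v1
    static String reduce(List<String> data, MyReduceFunction<String> f) {
        String acc = data.get(0);
        for (int i = 1; i < data.size(); i++) {
            acc = f.call(acc, data.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> ds = List.of("word: banana", "word: apple", "word: mango");
        // The lecture's reducer: return v1 + v2, i.e. concatenation
        String stringValue = reduce(ds, (v1, v2) -> v1 + v2);
        System.out.println(stringValue);
        // prints the whole dataset concatenated into one string
    }
}
```

With v1 + v2 as the reducer, six input strings collapse into the single concatenated string the lecture prints at the end.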
So that's it, that's all we need to do, and of course it's going to ask us whether we want a default serial version ID; we'll accept that suggestion. And so that method is complete. That's it, we're done reducing this dataset. Now let's print out what this stringValue is going to be, so I'll do a System.out and print the stringValue right here, after the reduce function. So let's run this code; back to the application class, click run. OK, once that completes, let's scroll up, and boom, there we go. Notice that it took every single element in the dataset, and at the time the reduce function ran, the dataset contained each of these words, each of these elements with the word concatenated before it. So if you examine this, this is the entire dataset concatenated together. I'll remind you, if you scroll up, that each of the elements in the dataset now looks like this. So when we went to reduce this dataset, it took this element concatenated with this one, then concatenated those two with this one, then all three of these with that one, and so on, until we formed this single reduced string representation, which we're calling stringValue here. That's what we printed.

Hopefully that makes sense. Now you understand what it means to map something and what it means to reduce something. It's actually a very simple concept. These map and reduce functions are fundamental to any big data processing; they were first implemented in Hadoop's MapReduce, which is built on these two fundamental operations of mapping and reducing. So now that you understand what mapping is and what reducing is, you're ready to understand how to involve POJOs in your Spark application: how you can map a row in a DataFrame to a POJO in your application, a plain old Java object. So let's wrap it up here. Thanks for watching. I'll see you in the next lecture.