Reduces on RDDs

Richard Chesterwood
A free video tutorial from Richard Chesterwood
Software developer at VirtualPairProgrammers
4.7 instructor rating • 4 courses • 64,752 students

Learn more from the full course

Apache Spark for Java Developers

Get processing Big Data using RDDs, DataFrames, SparkSQL and Machine Learning - and real time streaming with Kafka!

21:40:47 of on-demand video • Updated July 2020

  • Use functional style Java to define complex data processing jobs
  • Learn the differences between the RDD and DataFrame APIs
  • Use an SQL style syntax to produce reports against Big Data sets
  • Use Machine Learning Algorithms with Big Data and SparkML
  • Connect Spark to Apache Kafka to process Streams of Big Data
  • See how Structured Streaming can be used to build pipelines with Kafka
In this chapter we're going to look at a very common operation on an RDD, and that's called a reduce. So where are we? Well, previously we set up one of these RDDs, albeit a very basic data set with just four items. As I mentioned, though, we can very easily switch this data set for a properly massive data set, and we'll be looking at how to do that a little bit later on. For now, this little bit of test data will suffice.

What I want to show you in this chapter is one of the fundamental operations on an RDD, and that's called a reduce. You might be familiar with this already, maybe from Hadoop, or from general functional programming, such as using Java 8 lambdas. If you are familiar with this already, then you can safely skip this chapter; if you're not, then this chapter should be a great introduction. I'll try to keep things nice and straightforward, and I'll assume here that you don't have a lot of experience of using Java 8 lambdas.

Now, a reduce is used when we want to perform an operation against an RDD which is going to transform a big data set, like our massive four items here, into a single answer. So we are going to be crunching down a big data set into a single answer. A nice obvious example: let's say we want to sum up all the values in this collection; well, actually, we want to sum up the values contained inside the resulting RDD. Now, if it were a normal Java collection, we could of course just loop around this input data, have a total, keep adding to the total, and print the total at the end of the loop. Real simple, ridiculously trivial stuff.

But of course we're going to be working with massive data sets, where our data is going to be spread across multiple nodes in a cluster. Here on this slide I'm heavily simplifying, but it's very similar to the picture we used in the introduction: I'm showing that we have a data set spread across three nodes somehow. Actually, I've just randomly scattered the values around. Now, we can't just iterate around this collection, because it's spread across multiple JVMs, multiple physical computers.

So the way a reduce works is that we define a function of some kind, and this function is the reduce function. For a summation, the reduce function is going to be very simple: it's just going to take a value and add it to another value. Now, it might not be obvious to you exactly how this works across the cluster. What happens is that the driver (remember, that's the virtual machine that's actually running this program) is capable of sending that function across to each partition, and each partition is then free to execute that function against just its own data.

So let's take Node 1 as an example. We've got some values in this partition, and we know now that the function has been sent across to this partition. What happens in a reduce is that this function is applied to any two values in the collection. We could, for example (and it doesn't matter which of the values we take), take the first element, the number 7, as being value 1, and the second element, the 4, as being value 2. But crucially, we could have picked any of the values, because a reduce must be able to be done in any order. So we can use any of the values for value 1 and any of the values for value 2. In a reduce there will always be a value 1 and a value 2, and all that happens is incredibly simple; in fact, in code the function is nothing more than the little sketch below.
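Here's that sketch, using Spark's Java API, where a two-argument function is represented by the Function2 interface; the variable name sum is my own choice:

    import org.apache.spark.api.java.function.Function2;

    // The reducing function: takes any two Doubles, returns a single Double.
    // This one line of logic is all that the driver ships out to each partition.
    Function2<Double, Double, Double> sum = (value1, value2) -> value1 + value2;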
That function is applied to those two values. Now, we've determined that what we want to happen is for the values to be added together; the function could be anything, it can be arbitrarily complex, but all that's going to happen here is that the two values will be added together, in this case to give the value 11. Now we're left with three values on this node, and the reduce simply continues. Spark will nominate any two values to be value 1 and value 2 (in this case it's picked the 9 and the 13) and it applies the function to those two values, giving us 22. And it continues: because we've only got two values left, one of them will become value 1 and one of them will become value 2, the function is applied, and for this node we've reduced the problem down to the single answer of 33.

Presumably, in parallel, the same process will have been happening on Node 2 and on Node 3. So we've reduced down to an answer of 18 on Node 2 and 18 on Node 3; that's just a coincidence from the random values I picked when I built the caption. Now, once the reduce has been applied on all of the individual nodes, the next step is that the answers, in this case the three answers, are gathered together onto a single node. I'm being very imprecise here; we will be talking in a lot of detail about how Spark does shuffling later on in the course, but for now it's not really important which node this happens on. I hope you can see that the three subtotals, if you like, that came from the three nodes are now gathered together in the same place, and Spark will now run another reduce: it will nominate any two of these to be value 1 and value 2 and it'll apply the function, so we get 51. And it continues until there's just a single number: so just once more, another value 1 and another value 2, the function is applied, and we now finally have the answer, in this case 69.

I hope you don't mind me going through that in detail. It is a very simple process, but it's so fundamental to working with big data that I thought I ought to expand on it and elaborate. But how do we do that in the code? Well, it depends on whether you've done Java 8 lambdas; if you know lambdas, then what I'm about to do will feel fairly natural and straightforward. So I want to repeat what I was doing on the captions: I have an RDD, and I would like to do a reduce on that RDD. Now, remember, the RDD is distributed across multiple partitions, probably across multiple physical nodes, but to us here in the driver program, we don't need to worry about that; we can simply call the reduce method. It probably shouldn't surprise you that there is a method called reduce on the RDD.

Now, this is where things get a little bit complicated. If you are reading the IntelliSense there, it's telling us that we need to pass in a parameter of type Function2, which has generics of a Double, a Double and a Double. There are full details of this on our Java Advanced course, but I'll translate it into English for you: it's telling us that we need to pass in, as a parameter, the function that we want to be the reducing function. We want to write a function that takes any two values and adds them together. In other words, we need to supply a function which takes two input parameters, both of type Double, and the function must return an answer, again of type Double (written out long-hand below).
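Just to make that IntelliSense description concrete, here's the same call written out long-hand as an anonymous inner class; this is a sketch only, and I'm assuming our RDD variable is called myRdd, a JavaRDD<Double>:

    import org.apache.spark.api.java.function.Function2;

    // reduce takes a Function2<Double, Double, Double>:
    // two Double input parameters, one Double return value.
    Double result = myRdd.reduce(new Function2<Double, Double, Double>() {
        @Override
        public Double call(Double value1, Double value2) {
            return value1 + value2; // add any two values together
        }
    });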
By the way, this is something that I haven't mentioned yet about the reduce function: one of its restrictions is that the return type of the function needs to be the same type as the input types. If you need to do something where the return type is different, then you need to do something other than a reduce. That's OK for us, of course: given two Doubles, if we add them together, we will still have a Double as the answer.

Now, I think that's enough theory; I'm just going to dive in and show you how we express our function as a parameter in Java 8 and above. It'll turn out we don't actually need them, but it might make it clearer if I use these curly brackets here, and I'm going to put my reducing function inside the curly brackets. The syntax of the function is: we have round brackets, and inside those brackets we declare the input parameters; then after the round brackets we have, I'm not sure what the symbol is called, an arrow really, a dash followed by a greater-than sign; and then we continue with the implementation of the function. Again, we can work from the caption: the function is going to take a value 1 and a value 2, and it needs to add them together. So in these round brackets we declare that the function is going to be called with two input parameters, value1 and value2; we could have called those anything we like. And then the implementation of this function will be to take value1 and add it to value2.

Now, I just put those curly brackets in for illustration, to try to get across that this is a block of code that we want sent to each of the partitions in the RDD. But it won't compile with those in place, so if I take those out, I hope you can see that I don't have any red squiggles now: this is valid Java code. I must admit, I do this kind of thing in other languages, but as a longstanding Java programmer, this still feels a little bit unnatural in Java.

You might be wondering, for example, how does it know what the types of value1 and value2 are? Surely we should have to declare them: double value1, double value2. And you can, you can absolutely do that if you want to, but because Java knows that the RDD is working on objects of type Double, it knows the values in the RDD are Doubles already. So these types can be inferred, and you're going to find in Java 8 that it's quite natural to not include the types, because we know what the types are. And we know the end result of the reduce, even if we've got a thousand partitions and it's had to do lots of separate reduces which are combined together, is going to be an answer of type Double.

Now, we're going to find as we go a little further with Spark that, when you're doing some complex operations, it can be quite difficult to work out what the resulting type is going to be. We can work it out by hovering over the reduce: the IntelliSense is telling us that this reduce, yes, takes this complicated function as an input, but the output is going to be of type Double. So that's one way you can work out what the return type is going to be. Anyway, more on that later when things get more complicated; for now, let's just print out the result. So, a very simple program, but our first reduce; the full listing looks something like the sketch below.
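Putting it all together, here's a minimal end-to-end sketch of the kind of program described here; the class name, app name and sample values are my own placeholders, and it assumes spark-core is on the classpath:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FirstReduce {
        public static void main(String[] args) {
            // A tiny stand-in for a properly massive data set.
            List<Double> inputData = Arrays.asList(35.5, 12.49943, 90.32, 20.32);

            // "local[*]" runs Spark inside this JVM using all available cores.
            SparkConf conf = new SparkConf().setAppName("firstReduce").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Load the collection into an RDD of Doubles.
            JavaRDD<Double> myRdd = sc.parallelize(inputData);

            // The types of value1 and value2 are inferred as Double from the RDD.
            Double result = myRdd.reduce((value1, value2) -> value1 + value2);

            System.out.println(result);
            sc.close();
        }
    }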
Let's give this a run. Now, watch out for this: the program has run, but I can't see any output apart from this logging (I'll explain what the "stage 0" is a bit later on), and I can't see my answer. That's just because, if I expand the console, yes, watch out for this: this logging text is actually being output to the error stream, even though it's not an error, while our System.out output is this here in black. Quite often, actually, the two get interwoven, and you'll find they're not neatly aligned, but I hope you're happy with that.

Of course, I could probably have done that a lot quicker on a calculator, but we'll be scaling all of this up later, and you'll have a good chance of practicing with reduces for yourself quite a lot in this course. I hope you're happy for now with the principle of reduces. In the next chapter, we'll look at how to work with mapping operations.
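One quick footnote: if the logging noise makes your answer hard to spot, a common option is to turn down the Log4j level at the top of main(), before creating the context. This is a sketch assuming the Log4j 1.x API that Spark shipped with at the time of recording:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // Quieten Spark's INFO chatter so System.out results stand out.
    Logger.getLogger("org.apache").setLevel(Level.WARN);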