A free video tutorial from Richard Chesterwood
Software developer at VirtualPairProgrammers
4.7 instructor rating • 4 courses • 64,752 students
Learn more from the full course: Apache Spark for Java Developers
Get processing Big Data using RDDs, DataFrames, SparkSQL and Machine Learning - and real time streaming with Kafka!
21:40:47 of on-demand video • Updated July 2020
- Use functional style Java to define complex data processing jobs
- Learn the differences between the RDD and DataFrame APIs
- Use an SQL style syntax to produce reports against Big Data sets
- Use Machine Learning Algorithms with Big Data and SparkML
- Connect Spark to Apache Kafka to process Streams of Big Data
- See how Structured Streaming can be used to build pipelines with Kafka
In this chapter we will get your local development environment set up to use Spark. Really, the great thing about Spark is that you don't need anything special installed on your local development computer: you're going to write a regular Java project, and the Spark distribution is embedded as a regular JAR. Attached to this chapter is a zip file which you can download from the Udemy website. Just extract it and you will find in there a folder called "starting workspace". Now this is really basic; there's very little inside this workspace, and we're going to be doing all of the Spark work by hand from scratch, but there is a little bit of routine work that I've already done for you inside this project. So you need to open that in your development environment. As always, I don't know what development environment you're using, but for the videos we use Eclipse. If you're using a different development environment, then I'm going to assume that you know how to use that development environment. Inside that workspace I've created a folder called "Project", which represents the Java project. In Eclipse, go to File > New > Java Project, and the project name needs to match the exact name of the folder, so that's going to be "Project" with a capital P, and that will allow the wizard to automatically configure the JRE. The thing to say is, it's really important that you're using Java 8 for this course. We're going to be using a lot of Java 8 features, and really, for me, as I said in the introduction chapter, if you don't have Java 8 (if you're working on Java 7, for example), then working in Spark would be a horrible experience. Now, at the time of recording we are in 2018, and right now Spark does not support Java 9. So unless you're absolutely certain that things have changed since I recorded this video and you know that Spark now supports Java 9, you need to make sure that you are running a JDK of 1.8.
So I think I'll give you a quick tour of this project, though there's really very little to show you. There's a folder, or rather a package, called com.virtualpairprogrammers, and there's a class in there called Util.java that we're going to be using later on in the course for one of the worked exercises; we really don't need to think about that right now. Now this class will also have an error on it for a little bit, because the classpath isn't set properly yet. Under the src/main/resources folder there are a few folders, and that's a set of resource files that we're going to be using throughout the course in various worked examples; we can ignore that for now. But the important file in here is the pom.xml. Now when you open that, you might need to click on the pom.xml source tab right here. Of course you don't have to use Maven on a Spark project; you could be using Gradle or really any other Java build tool, but Maven will do the job for us. Really, all there is inside here is a very short dependencies block where we've supplied you with three dependencies. We have Spark Core; we have Spark SQL (many people pronounce that "Spark Sequel"), lots of lovely stuff in there, but we need to do the core first; and we also have a dependency on the Hadoop HDFS file system, and more information on that later in the course. But you might be surprised to discover that those are the only dependencies that we're going to need in order to use Spark. So let's get started then. What we're going to do is create a regular Java class, and it's going to be in the com.virtualpairprogrammers package. I'm going to call my class Main, and we're actually going to be doing most of the work in the course in this single class. Of course it doesn't really matter what you call this class. Before I click on Finish, I'll tick "generate public static void main method", because this program is just going to be a standalone, console based application.
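As a rough sketch, the dependencies block in the supplied pom.xml will look something like the following. The exact artifact names and version numbers here are my illustration only (Spark artifacts carry a Scala version suffix, and the real versions are whatever ships in the downloaded workspace):

```xml
<dependencies>
    <!-- Spark Core: the RDD API used from the start of the course -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
    <!-- Spark SQL: used later for DataFrames and SQL-style queries -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
    <!-- Hadoop HDFS client: more on this later in the course -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.2</version>
    </dependency>
</dependencies>
```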
It really doesn't matter what you call this class, but I should just draw your attention to the fact that inside this Util class I have referred to that class name here on line 28. So if for any reason you've gone for a different class name, then you might need to make a change right there. We don't need that Util class for quite some time anyway. Before we do any serious work with Spark, I just want to test that your Spark JAR files are working correctly. So I'm going to write now a very, very basic Spark application, in fact a Spark application that isn't really going to do anything, but we will build on it through the next few chapters. I just want to check that everything's working OK. For any Spark program we're going to need some input data. As I mentioned in the introduction, this input data is going to be big data. It could be gigabytes, it could be terabytes or exabytes, so it's going to be a lot of files. But of course, when we're working locally we don't want to be dealing with great big unwieldy files that are going to take hours to run. So it's very common when we're developing to create hard coded small data sets that we can hold in memory, just to check the work that we're doing. And a nice feature of Spark is that we can do that by using a regular Java collection. So as a start, let's create a simple list; of what doesn't really matter. I'm going to go for a list of just numbers. So let's create a List of Doubles, and I'll call this inputData, and that's going to be a new ArrayList to initialize it. So I'll do my usual Ctrl or Cmd+Shift+O to organize the imports, and be careful to select java.util.List. Now I know that you're an experienced Java developer, so I'm really patronizing you here at this stage, but I just want to make sure that everybody's at the same point.
Now, we are going to be picking up a lot of new style Java syntax on this course, and the first thing I should mention is, well, I can't remember when they introduced this into Java, but if you have a collection and on both sides of the declaration you're using a generic type, you don't need the second reference to the type. If I leave those angle brackets empty, then Java is going to infer that obviously the type inside here must be the type inside here. So a minor syntactic improvement there. Now what I want to do with inputData is just add some data, some numbers in here. It can be any set of numbers that you like, they've just got to be doubles, and I'm bored already, so that's going to do for now. I promise you that the data will be getting bigger later on, but that's enough to start with. I introduced in the opening chapter the concept of an RDD in Spark, and that really, in Spark Core, is the most important concept in Spark: it's a resilient distributed dataset, and it's this RDD that we're going to use to distribute data across multiple computers in a cluster. We can very easily convert this inputData into a Spark RDD. So we need to create one of these RDDs from our inputData. To get Spark started, we do have to do a little bit of boilerplate code first, but it's not too bad. We need to create an object of type SparkConf, and this, as the name suggests, is going to represent the configuration of Spark. So this is going to be an object which is commonly called conf, and it's an object that we directly instantiate using new. But then, as is the fashion now, we can go on to call various methods in a chain to specify the configuration that we want. Now, already we can see that this SparkConf class is not being recognized, and I'm getting nothing on the content assist. And the reason for that... I'll just add the semicolon for now.
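The hard coded list described above, using the diamond operator so the element type is only written once, looks something like this (the actual numbers are arbitrary; any doubles will do):

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // The diamond operator <> lets Java infer the generic type
        // from the left-hand side of the declaration.
        List<Double> inputData = new ArrayList<>();
        inputData.add(35.5);
        inputData.add(12.49943);
        inputData.add(90.32);
        inputData.add(20.32);
    }
}
```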
The reason for that is, of course, we haven't downloaded any of the JAR files, so to do that we'll need to run the pom. So right click on the pom and Run As > Maven build. I'm working in Eclipse, so we'll use the eclipse:eclipse plugin and run that, and that will run even though the Java code isn't compiling. Now I'm guessing this is the first time that you've used Spark, so you'll probably see a lot of downloads going on at this point. I've been using Spark for a while now, so I'm not seeing any downloads; I'm just getting "wrote Eclipse project", blah blah blah. Hopefully you'll see a BUILD SUCCESS there. And now if you do a refresh on the project, so either F5 or right click and select Refresh, the big difference now will be in the Referenced Libraries. We only had, I think it was, three dependencies in the pom, but of course there were a lot of transitive dependencies that are being pulled in. You should be seeing a lot of JAR files there, some of them referring to Spark. So that's our build environment. OK, so now again Ctrl or Cmd+Shift+O should be importing the SparkConf object, which comes from, not surprisingly, org.apache.spark. So as I was saying, with this SparkConf object we can now hit the dot and call various methods to configure Spark. Now, we don't really have to do much to configure Spark. One of the methods in there is the setAppName method, which allows us to give this application a name; it's just a string. Later in the course you're going to find that this name will appear in reports, and we'll be looking at that later on. One other thing we need to do is call the setMaster method. Now I'm going to be explaining this in a lot more detail in, I think it's going to be, the last chapter of the course, where we look at performance. For now you can just copy what I'm doing: in here we have the string "local", followed by an opening square bracket, an asterisk, and a closing square bracket, and then you can close the string.
Now, what that is saying is that we're using Spark in a local configuration; we don't have a cluster. And what the star means is, and I'm not going to be completely accurate here, I will be explaining in full detail in the chapter on performance, but this means use all the available cores on your machine to run this program. As I say, full details coming up later. But just very briefly, if you didn't have this in place and you just said "local", then what would happen is your Spark program would run on a single thread, and by running on a single thread we're not going to get the full performance of this machine. So I haven't been fully accurate there, but that will do for now. "local" with an asterisk in square brackets, as you can see, is a very common pattern for the opening line of a Spark program, which sets up the configuration. Now, the next line of code is going to be: I need an object of type JavaSparkContext. This JavaSparkContext represents a connection to our Spark cluster. We don't have a Spark cluster just yet, but this object is going to allow us to communicate with Spark. It's common to call this object sc, and again we can directly instantiate this object: this is going to be a new JavaSparkContext. The constructor will take a parameter, which is the conf object that we created on the line above. Now, Ctrl or Cmd+Shift+O should give us an import, so that's great. We're now basically up and running with Spark, and we can start doing operations against Spark. The first thing you will need to do is load some kind of data into Spark. Again, I've already said it, but this stage will usually be where we scan a file from a big file system such as Amazon S3 or an HDFS system. But for now, we're going to use this local inputData, and to load a Java List into an RDD we use a method whose name perhaps isn't very intuitive.
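Putting the last few steps together, the boilerplate sketched so far looks roughly like this. Treat it as a sketch rather than the definitive course listing; the app name string is my choice, and the class layout follows the video:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {
    public static void main(String[] args) {
        // "local[*]" = run Spark locally, using all available cores.
        // Plain "local" would run everything on a single thread.
        SparkConf conf = new SparkConf()
                .setAppName("startingSpark")
                .setMaster("local[*]");

        // JavaSparkContext is our connection to Spark.
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.close();
    }
}
```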
It's called parallelize, which sounds a bit scary, but really it means loading a collection and turning it into an RDD. As you can see here, it's demanding a single parameter, which is the list that we're working with, so that's the inputData, and that's going to return us our RDD. Now, Java has a very different syntax to Scala, and as I mentioned in the introduction, Spark is implemented in Scala, and actually a lot of the objects that we're going to be dealing with on this course are written in Scala. We won't feel that, generally at least, in the Java, but it's worthwhile knowing that that's what's going on. Because Scala has a very different syntax to Java, the creators of Spark have provided various classes to, sort of, bridge the gap between the Java and the Scala, and the first one of those we'll see is the class called JavaRDD. I'm going to call this object myRdd. The JavaRDD is a Java representation of one of these RDDs; it's going to allow us to communicate with the RDD using regular Java methods and Java syntax, but actually, under the hood, this JavaRDD is communicating with a Scala RDD, so you can think of this class as being a wrapper. If you're really interested, and you might be looking at the source code of these classes at some point in the future, it's worth mentioning that the JavaRDD is actually implemented in Scala itself, which you don't really need to know. But the important thing is that this JavaRDD allows us to communicate with the Scala RDD as if it's a regular Java object, and we're going to call regular Java methods on it. We have a couple of compile errors here. The first one will be that we haven't yet imported the JavaRDD class, so Ctrl or Cmd+Shift+O, and can you see that it's coming from the org.apache.spark.api.java package. OK, well, that's compiling now.
I don't know how well it's coming across on the video, but I have some yellow lines underneath there, and that's because I'm being given a warning, even though the warning's not appearing in the Problems view. If I hover, you'll see that this warning is related to generics, and the fact that the JavaRDD currently isn't given a type. Now, that warning is really just saying that if we carry on like this, then we're going to have all kinds of problems downstream working with this RDD, because every time we call a method on the RDD we're going to have to remind it of what type of data we put into the RDD. So the JavaRDD is a generic class, and we're able to specify in the angle brackets what type of data this RDD is working with, and we know that for us, we're working with Doubles. So we'll put Double in the angle brackets, and that's good. We've now got two warnings. This one is fine; the one on myRdd is just saying we never use this variable. That's OK, we're fixing that shortly. One annoying warning here is on the Spark context. Now, in theory we're supposed to close the Spark context when our program is finished, so the very last line of code should probably be sc.close(). As you're probably well aware, the usual recommendation in Java, in case an unchecked exception is thrown in the middle of here, is that the sc.close() should really be in a finally block. In all my experience working with Spark, people don't bother going to those lengths, because generally we're not going to see an unchecked exception in the middle of this code anyway, because we're going to make it work. And if there is an exception anyway, then the whole job has failed and we're going to abandon the whole process, so I don't think you really need the sc.close(), but I'm going to put it in anyway. So we haven't done anything with this data yet, but we have turned it into an RDD.
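With the generic type added, the collection parallelized, and the context closed at the end, the complete little program from this chapter looks something like the following sketch (data values and the app name are my own arbitrary choices):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {
    public static void main(String[] args) {
        // Small hard coded data set, held in memory for local development.
        List<Double> inputData = new ArrayList<>();
        inputData.add(35.5);
        inputData.add(12.49943);
        inputData.add(90.32);
        inputData.add(20.32);

        SparkConf conf = new SparkConf()
                .setAppName("startingSpark")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // parallelize loads a regular Java collection into an RDD.
        // JavaRDD<Double> is the Java wrapper around the underlying Scala RDD.
        JavaRDD<Double> myRdd = sc.parallelize(inputData);

        sc.close();
    }
}
```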
It would be interesting to see if this code will run, so we'll run this as a regular Java application. Well, OK, we have a lot of red text here, but it's just logging, and I hope you can see that most of it is just INFO logging. There is one WARN line here, which you might not see: it's just complaining that it can't find the native Hadoop library. That's really not a problem, because we don't need the native Hadoop library for the work we're doing now. This logging is quite annoying really; none of it's useful, and it will get in the way of readability later. So, what I would normally do on a Spark project is configure the logging before setting up the SparkConf object. Now, of course, you can do this using a log4j.properties file, as you might have done in the past, but actually far quicker on a Spark program is just to get hold of the Logger object. However you do this, be very careful that you get hold of the correct Logger object. It is not the java.util.logging one, and in fact I've got at least a dozen different Logger objects or interfaces on my classpath. The one you're looking for is the org.apache.log4j implementation of Logger, so check very carefully that your import for Logger matches the one I have here. Now, on that Logger object we can call getLogger, passing in the string "org.apache"; that's going to filter out all of the Apache libraries' logging. And we can call the setLevel method on that, and I suggest that what we should do is set Level.WARN. Again, make sure you choose the right one: you need the log4j version, so check the import there. Now, that's going to filter down the logging so we only see warnings. If we run again, the program runs through with just that boring warning that we can safely ignore. But we have at least got started: we've got the boilerplate code here which initializes the Spark context.
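The logging tweak described above is only one line; here is a sketch, assuming the log4j 1.x API that Spark 2.x uses (note the org.apache.log4j imports, not java.util.logging):

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class Main {
    public static void main(String[] args) {
        // Silence the very chatty INFO logging from all org.apache packages,
        // so that only WARN (and more severe) messages are printed.
        Logger.getLogger("org.apache").setLevel(Level.WARN);

        // ... SparkConf / JavaSparkContext setup follows here ...
    }
}
```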
We've got our first JavaRDD set up, and we're going to want to do some operations on this RDD. So in the next chapter I'll show you a very common operation to perform on an RDD, and that's called a reduce. So see you in the next video.