SparkSQL Getting Started

Richard Chesterwood
A free video tutorial from Richard Chesterwood
Software developer at VirtualPairProgrammers
4.7 instructor rating • 4 courses • 78,269 students

Learn more from the full course

Apache Spark for Java Developers

Get processing Big Data using RDDs, DataFrames, SparkSQL and Machine Learning - and real time streaming with Kafka!

21:40:47 of on-demand video • Updated February 2021

  • Use functional style Java to define complex data processing jobs
  • Learn the differences between the RDD and DataFrame APIs
  • Use an SQL style syntax to produce reports against Big Data sets
  • Use Machine Learning Algorithms with Big Data and SparkML
  • Connect Spark to Apache Kafka to process Streams of Big Data
  • See how Structured Streaming can be used to build pipelines with Kafka
So in this chapter, we're going to start working with Spark SQL. Now, I'm assuming that most of you will have come to this section of the course from Section 1 of the course. If that's the case, then you'll have a project in your IDE that we were working on previously. We were doing various experiments previously, but in particular we have a class called Main.java, and for this section we'll just modify the code that we have in there. If for any reason you've come directly to this section of the course and you haven't done the previous work, then don't worry: attached to this course, I've provided a starting workspace just for this section.

And as we covered in Section 1, one of the joys of working with Apache Spark is there isn't really a lot in the way of dependencies that you need. You don't need to install any special software. We just have a Maven pom.xml file here, and if we have a look in there, you'll see that the dependencies are very basic: we have the spark-core dependency; we also added spark-sql in the previous section, just so we were ready for this part of the course; and there's also a dependency on Apache Hadoop as well, which is just so we can access the HDFS file system. So that's really the only file you need to do this section of the course. And just very quickly, in case you have come straight here, the first thing you will need to do is to run this pom as a Maven build, and you're going to run the eclipse:eclipse goal. That's just so that Eclipse can go and download all of the jar files required for Spark.

Now, I realize that the previous versions I gave you were version 2.0.0, which is now quite an old version of Spark. Actually, if we have a look on the Apache Spark page, here's a list of all of the current versions, and at the time of this recording the latest version, in fact only just released, is Spark 2.3.0. I think it probably would make sense just to upgrade these versions to 2.3.0, just so we make sure that the course is fully up to date. Now, of course, you can go and check the latest version for you, which might well be a much later version. The problem with doing that is that you might have variances between what happens in your development environment and what happens on the video. So I would recommend that you stick to version 2.3.0. As Spark version 3 comes out in the future, we will monitor that and we will issue any updates to the course as required.

Now, the complication with this is the artifact IDs for both core and SQL. You'll notice that there is a version number on the artifact ID as well: this _2.10 suffix. Now, that's quite confusing, but there is a clue here on the Spark download page, and it's telling us that from version 2, Spark is built with Scala 2.11. So that's what we're seeing here: we're seeing the version of Scala that the library was built against. Now, if we have a look at this latest version of spark-core, it's actually telling us here that the artifact ID now needs to be spark-core_2.11, and that 2.11 is the Scala version. So if you are going to change these versions, do also change the two artifact IDs. If I run the eclipse:eclipse build again, you should see lots of downloads in your console.
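For reference, here's roughly what the upgraded Spark entries in the pom.xml might look like; the coordinates simply follow what's described above, and you should keep whatever Hadoop dependency your existing pom already declares.

```xml
<!-- Spark dependencies after the upgrade described above.
     Note the Scala version (_2.11) in each artifactId. -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
```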
Now, I had a problem off camera running that build, and I think I'll leave this in the video because it might affect some of you. Although it said BUILD SUCCESS, as the build was running I did notice it was complaining that some jar files could not be written properly. And once the build had finished and I did a refresh on the project, instead of all the code compiling correctly I've got this red exclamation mark on the project, and looking in the main class, which should be compiling cleanly, I'm getting errors on basically all of the Spark objects. Now, all this was, I think, a corruption of my Maven repository. If you use Maven, you'll probably be very familiar with this, and it does happen quite a lot when you upgrade a library from one version to another. The easiest way to fix this is to go into your home directory, wherever that is on your platform, find the hidden folder called .m2, which is the Maven cache, delete that directory, and run the build again. That will force everything to be downloaded again. It will make for a slower build, but that's the easiest way of solving any Maven corruption problems. So I've run that build again, and with a refresh on the project... yeah, it took a while to rebuild the workspace, but I am now back to a clean project. There are six warnings in there, but I'm sure that's just a hangover from what we had before.

So if we go into the Main.java class, which, if you did Section 1, is where we were experimenting with RDDs, we're going to look now at how to convert this into a Spark SQL program. And we need to work with some structured data in this section. The basic data that I'm going to give you, you will find under the src/main/resources folder: under the exams subfolder there is a regular CSV file. I'll double-click on this, and I did an edit there because it did take quite a while to open this file. It's not true big data, but it is still a pretty big file for local development; it's about 60 megabytes in size. Now, I don't know if you have a program like Excel or LibreOffice installed, so I don't know if that will have opened for you. You might just want to have a little look at this data, but you certainly don't need to open it in a spreadsheet. You can always do an Open With > Text Editor in Eclipse. It's going to be a little bit sluggish, because Eclipse doesn't really work very well with files of this size, but you could do that just to have a little peek at the file.

So this file contains the exam results for a list of students. We can see, for example, on this row that the student with ID 1 sat some exams in quarter 1, 2005. So the point of this is that we definitely have structured data. Now, in this video what we're going to do is load that comma-separated file into Apache Spark. So if you are working with the code that we wrote in the previous section, we need to delete lines 29 onwards, basically all the way down to the bottom. If you've come straight to this section, then all you're going to need is a public static void main method with the two lines you can see here, lines 23 and 24, where we set the Hadoop home system property and we also set the logging level for org.apache. And it's as simple as that. Now, the two lines of code that I've left behind here, you'll be familiar with from the RDD section: this is where we're creating a JavaSparkContext.
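If you've jumped straight in, the starting point being described is roughly the sketch below. The hadoop.home.dir path, the log level, and the old app name are assumptions based on a typical Section 1 setup, not values taken from this video.

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {

    public static void main(String[] args) {
        // Only needed on Windows; point this at wherever winutils is installed (path assumed).
        System.setProperty("hadoop.home.dir", "c:/hadoop");
        // Quieten Spark's logging so the console output stays readable.
        Logger.getLogger("org.apache").setLevel(Level.WARN);

        // The two leftover lines from the RDD section, kept for comparison;
        // they get replaced by a SparkSession in the next step.
        SparkConf conf = new SparkConf().setAppName("startingSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        sc.close();
    }
}
```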
And all of the work we did in the RDD library was done through this JavaSparkContext. Now, I'm not going to use these lines of code again, but I do want to leave them there just for a few moments so that we can compare the difference between that and what we need to do to work in Spark SQL. It's pretty much a similar recipe, really; it's just that the end result is that we're going to end up with an object called SparkSession. It's not that obvious, but SparkSession is how we're going to operate Spark SQL (I wish they'd said that in the class name), whereas the JavaSparkContext was how we worked with Java RDDs. Yeah, not that clear, really.

In fact, the way that we get hold of this SparkSession is to use a factory method inside the SparkSession class itself. So if we go SparkSession dot, then eventually the IntelliSense gives us the option of calling a method called builder. Now, the idea is that rather than splitting all of this into several lines of code, we can do all of this in one fluent line of code. Just as we did previously, we can call the appName method on this builder, and that allows us to set the name of this application; I'm going to call it testingSql. Now, I think this step is optional: if you don't do this, then you will get a default name. And just as with Java RDDs, this name is going to appear in reports; in particular, it will appear on the console that we use when we're doing performance analysis. Notice a slight difference here in the APIs: we just call appName here, whereas previously, on the older API, it was setAppName. And exactly the same as before, we also need to call, well, it used to be called setMaster, but now it's just called master, and exactly as before this is going to specify how many of the cores we want to work with. We're working locally, so this will be local with square brackets and a star, or asterisk, inside: local[*].

Now, one thing that we need to add in here that we didn't need previously, and I'm going to do this on a new line (in fact, you only need this if you're running on Windows; if you're on Linux or a Mac then you do not need the line I'm about to type), is the config method, and we need to pass in a configuration parameter. Now, this is a little bit obscure: the parameter is called spark.sql.warehouse.dir. This is just going to be a directory where some temporary files are going to be stored by Spark SQL. It just needs to point to a temporary directory, and it doesn't have to already exist. The syntax for this on Windows is going to be file, colon, and then three forward slashes, and then presumably we're going to do this on the C drive, so c colon. We can use forward slashes here even though we're on Windows. I'm going to just call the folder tmp, with another forward slash on the end, and that folder will be created automatically when we run this code.

In the previous API, we created an object called JavaSparkContext directly, using the new keyword. But in the Spark SQL API, we finish off with a call to the getOrCreate method, and that's going to return us that instance of the SparkSession object. So it's boilerplate code, very similar to what we had before. I think now I can delete those two old lines of code. The last line of code in here is going to be a call to the close method on that SparkSession. Now we need to point the SparkSession object at the data that we're working with, and we do that by calling a method on the spark object.
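Put together, the boilerplate described so far looks something like the sketch below; the testingSql name and the c:/tmp warehouse directory are just the values used on the video, and the config line is only needed on Windows.

```java
// Replaces the SparkConf / JavaSparkContext lines inside main().
// Requires: import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession.builder()
        .appName("testingSql")
        .master("local[*]")
        // Windows only; the directory is created automatically if it doesn't exist.
        .config("spark.sql.warehouse.dir", "file:///c:/tmp/")
        .getOrCreate();

// ... load and process the data here ...

spark.close();
```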
There's a method in there called read. Now, this is a slightly strange method, really, in that read itself isn't going to do anything; we have to go on and make a further call to tell Spark what type of data we're reading. If I add the dot there, the IntelliSense tells us that there is a method called csv, which is for reading in a comma-separated values file, just like the one that we have. But just look at the other options available: we can see there's a jdbc method, which will allow us to connect to a database table; there's a method to allow us to read JSON; and there are methods here to allow us to read text files. We'll look at some of those options later, but for now we want the csv option. And the argument to this is going to be a reference to the file that we're reading. So far this is very similar to the previous section: the path is going to be src/main/resources/exams/students.csv, which will point it to this file here.

But there is one further thing we need to do in this command. Just reviewing the structure of the data again: we do have a header line in this CSV. Now, the header is optional, but the advantage of having a header is that Spark SQL is going to be able to read that line, and when we start manipulating the data we're going to be able to refer to this data by its column header, which proves to be very useful. The only slight problem with that is that, by default, Spark will assume that there is not a header line. To tell it that there is a header line, after the call to read and before the call to csv, the method we can call is the option method. This allows us to set some options on this data set, and the one option that we need is the option called header, which is just a string, and we set it to the value of true.

So this is pointing Spark at the data source. The resulting object that we get back from here is an object that Spark calls a Dataset. So I'll just set up a variable called dataset and assign it to the result of what we've just done. Now, a Dataset in Spark SQL is an abstraction that represents our data, and you can absolutely think of this Dataset as being very much like a table containing our data. Well, we do have to declare this object, and I'll use the quick fix to do that, I think, so Ctrl+1 (or Cmd+1 on a Mac) while hovering over that variable, and then Create local variable dataset. And that's helpfully filled in the data type of this object, which is itself called Dataset. We'll talk in a bit more detail about what the generic is here; you can probably guess that this Dataset is going to contain a series of Rows of our data.

Now, I don't think I need to emphasize this too much, because by now you will have experience of working with Spark RDDs, and you know that when we do things like reading in data for an RDD, we're not actually reading in that data; we're just setting up an in-memory structure representing that data. In fact, it's only when we reach an action that work needs to be done. So we need to go a little further with this Dataset before we can actually see anything interesting happening. We'll come onto that in the next video. But for now, there is a very useful method on a Dataset. We didn't have this in the old version of Spark, but in Spark SQL it is so useful: on a Dataset object you can call the method show before we do anything else with that object. You can probably guess what that's going to do.
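As a sketch, the load described above comes out like this, assuming the spark variable from the builder snippet earlier.

```java
// Goes between getOrCreate() and spark.close() in main().
// Requires: import org.apache.spark.sql.Dataset; and import org.apache.spark.sql.Row;
Dataset<Row> dataset = spark.read()
        .option("header", "true")                        // the file's first line is a header
        .csv("src/main/resources/exams/students.csv");   // lazily points Spark at the file

// show() is an action: it triggers the read and prints the first 20 rows as a readable table.
dataset.show();
```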
Let's give this code a run. Now, bearing in mind this is a multi-megabyte data set, it's going to be pretty huge, many, many thousands of rows, but very helpfully the show method will print out the first 20 rows of that file in a nice readable format. One more thing we can do before we stop this video: we can also get a count of how many rows there are in this data set. We call the count method on the dataset, not surprisingly, store the result in a variable called numberOfRows, and if we output this, we should be able to see how many records there are. Well, I was a little ahead of myself there: I declared this variable as an integer, but in fact, as you can see from the compiler error here, it's going to be of type long, which of course makes sense. An int can only go up to a number of about two billion, and of course we're dealing with big data, so there could well be more than two billion records in a dataset. So it needs to be a long. Let's give that a try.

Now, I did some video editing there just to speed the video up, so please don't worry if that took quite a while for you. Remember, we do have a fairly big data set here, or at least a reasonably big data set to work with in development. But apparently we have about two million rows in this data set. Now, in the previous demonstration we just very casually called this count method on our dataset, and it certainly seemed to return the right kind of answer, but we didn't really think about what was going on when we called that count method. And really, in real life, we don't need to think about what's going on under the hood, but I think it's just worth pausing for reflection a little bit here. In the previous section, when we were working with Java RDDs, we looked in detail at operations such as map-reduce operations, which we would typically use to implement things like a count on big data.

Now, I really want to just draw your attention to the fact that, although we are using this Dataset, which very much feels like just an in-memory object, we are still working with big data. We are still working, in fact, with RDDs; it's just that those RDDs are kind of hidden away inside this Dataset. So what I'm trying to say is that although we can very casually call count and not really think very deeply about what's going on under the hood, you know now that there's an RDD underneath this. So our data might be so big that it's spread across multiple nodes, exactly as we saw in the previous section, and the implementation of this count could well be an operation such as a map-reduce. It's just that in this API we don't need to know exactly how the count is implemented; we just know it's going to be implemented as a distributed count, potentially across multiple nodes. So nothing exciting so far, but we're up and running with Spark SQL. In the next chapter we'll do some proper operations against this data set. I'll see you for that.
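To recap, the count step from this video as a minimal sketch; the printed message is just illustrative.

```java
// count() is also an action; the result can exceed the int range, so it must be a long.
long numberOfRows = dataset.count();
System.out.println("There are " + numberOfRows + " records");
```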