SparkSQL Getting Started

A free video tutorial from Richard Chesterwood
Software developer at VirtualPairProgrammers
7 courses
142,180 students
Learn more from the full course
Apache Spark for Java Developers
Get processing Big Data using RDDs, DataFrames, SparkSQL and Machine Learning - and real time streaming with Kafka!
21:40:47 of on-demand video • Updated February 2021
Use functional style Java to define complex data processing jobs
Learn the differences between the RDD and DataFrame APIs
Use an SQL style syntax to produce reports against Big Data sets
Use Machine Learning Algorithms with Big Data and SparkML
Connect Spark to Apache Kafka to process Streams of Big Data
See how Structured Streaming can be used to build pipelines with Kafka
So in this chapter, we're going to start working with Spark SQL. Now, I'm assuming that most of you will have come to this section from Section One of the course. If that's the case, then you'll have a project in your IDE that we were working on previously. We were doing various experiments in it, but in particular we have a class called Main.java, and my plan for this section is that we'll just modify the code we have in there. If for any reason you've come directly to this section and haven't done the previous work, then don't worry: attached to this course I've provided a starting workspace just for this section.

As we covered in Section One, one of the joys of working with Apache Spark is that there isn't really a lot in the way of dependencies that you need. You don't need to install any special software. We just have a Maven pom.xml file here, and if we have a look in there, you'll see that the dependencies are very basic. We have the spark-core dependency. We also added spark-sql in the previous section, just so we were ready for this part of the course. And there's a dependency on Apache Hadoop as well; that's just so we can access the HDFS file system. So that's really the only file you need to do this section of the course. And just very quickly, in case you have come straight here, the first thing you will need to do is run this POM as a Maven build, using the eclipse:eclipse goal. That's just so Eclipse can go and download all of the jar files required for Spark.

Now, I realize that the previous version I gave you was 2.0.0, which is quite an old version of Spark. If we have a look on the Apache Spark page, here's a list of all of the current versions, and at the time of this recording the latest version, in fact only just released, is Spark 2.3.2. I think it makes sense to upgrade these versions to 2.3.2, just so the course is fully up to date. Now of course you can go and check the latest version yourself, which might well be a much later version. The problem with doing that is that you might have variances between what happens in your development environment and what happens on the video, so I would recommend that you stick to version 2.3.2. If Spark version 3 comes out in the future, then we will monitor that and issue any updates to the course as required.

A complication here is the artifact IDs for both core and SQL: you'll notice that there is a version number on the artifact ID as well, this _2.10 suffix. That's quite confusing, but there is a clue on the Spark download page, which tells us that from version 2, Spark is built with Scala 2.11. So that's what we're seeing here: the version of Scala that the library was built against. And if we have a look at this latest version of spark-core, it's telling us that the artifact ID now needs to be spark-core_2.11, where 2.11 is the Scala version. So if you are going to change these versions, do be sure to update the two artifact IDs as well.
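For reference, here's roughly what the upgraded dependencies section of the pom.xml might look like. This is just a sketch: the Hadoop entry is left as a comment because its exact artifact and version aren't shown here, so check the workspace's own pom.xml for that one.

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.2</version>
    </dependency>
    <!-- Plus the Apache Hadoop dependency already in the workspace pom,
         used for HDFS access; keep whatever artifact and version it declares. -->
</dependencies>
```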
So now, if I run the eclipse:eclipse build again, you should see lots of downloads in your console. Now, I had a problem off camera running that build, and I think I'll leave this in the video because it might affect some of you. Although it said BUILD SUCCESS, as the build was running I did notice it was complaining that some jar files could not be written properly, and once the build had finished and I did a refresh on the project, instead of all the code compiling correctly I've got this red exclamation mark on the project, and looking in the Main class, which should be compiling cleanly, I'm getting errors on basically all of the Spark objects. Now, I think all this was a corruption of my Maven repository. If you use Maven you'll probably be very familiar with this; it does happen quite a lot when you upgrade a library from one version to another. The easiest way to fix it is to go into your home directory, wherever that is on your platform, find the hidden folder called .m2, which is the Maven cache, delete that directory and run the build again. That will force everything to be downloaded again. It will make for a slower build, but it's the easiest way of solving any Maven corruption problems. So I've run that build again, and with a refresh on the project... yeah, it took a while to rebuild the workspace, but I am now back to a clean project. There are six warnings in there, but I'm sure that's just a hangover from what we had before.

So if we go into the Main.java class, which, if you did Section One, is where we were experimenting with RDDs, we're going to look now at how to convert this into a Spark SQL program. We need to work with some structured data in this section, and the basic data I'm going to give you, you will find under the src/main/resources folder: under the exams subfolder there is a regular CSV file. I think if I double-click on this... well, I did an edit there, because it did take quite a while to open this file. It's not true big data, but it is still a pretty big file for local development; it's about 60MB in size. I don't know if you have a program like Excel or LibreOffice installed, so I don't know if that will have opened for you. You might want to have a little look at this data, but you certainly don't need to open it in a spreadsheet. You can always do an Open With > Text Editor in Eclipse. It's going to be a little bit sluggish, because Eclipse doesn't really work very well with files of this size, but you could do that just to have a little peek at the file.

So this file contains the exam results for a list of students. We can see, for example, on this row that the student with ID 1 sat some exams in quarter one, 2005. So the point is that we definitely have structured data. Now, in this video, all we're going to do is load that comma-separated file into Apache Spark. If you are working with the code that we wrote in the previous section, we need to delete lines 29 onwards, basically all the way down to the bottom. If you've come straight to this section, then all you're going to need is a public static void main method with the two lines you can see here on lines 23 and 24, where we set the hadoop.home.dir system property and also set the logging level for org.apache. And it's as simple as that.

Now, the two lines of code that I've left behind here will be familiar to you from RDDs. This is where we're creating a JavaSparkContext, and all of the work we did in the RDD library was done through this JavaSparkContext.
Now, I'm not going to use these lines of code again, but I do want to leave them there for a few moments so that we can compare them with what we need to do to work in Spark SQL. It's pretty much a similar recipe, really; it's just that the end result is that we're going to end up with an object called SparkSession. It's not that obvious, but SparkSession is how we operate Spark SQL (I wish they'd said that in the class name), whereas the JavaSparkContext was how we worked with Java RDDs. Yeah, not that clear really.

In fact, the way that we get hold of this SparkSession is to use a factory method inside the SparkSession class itself. So if we go SparkSession dot, then eventually the IntelliSense gives us the option of calling a method called builder. Now, the idea is that rather than splitting all of this into several lines of code, we can do it all in one fluent chain, just as we did previously. We can call the appName method on this builder, and that allows us to set up the name of this application; I'm going to call it testingSql. I think this step is optional: if you don't do it, you will get a default name. And just as with Java RDDs, this name is going to appear in reports, in particular on the console that we use when we're doing performance analysis. Notice a slight difference in the APIs here: we just call appName, whereas previously, on the older API, it was setAppName. And exactly the same as before, we also need to call... well, it used to be called setMaster, but now it's just called master. Exactly as before, this specifies how many of the cores we want to work with. We're working locally, so this will be "local[*]", that is, local, square brackets, star or asterisk.

Now, one thing we need to add in here that we didn't need previously, and I'm going to do this on a new line (in fact, you only need this if you're running on Windows; if you're on Linux or a Mac then you do not need the line I'm about to type), is the config method, where we pass in a configuration parameter. This is a little bit obscure: the parameter is called spark.sql.warehouse.dir. It's just a directory where some temporary files are going to be stored by Spark SQL. It needs to point to a temporary directory, and that directory doesn't have to already exist. The syntax for this on Windows is going to be file: followed by three forward slashes, then, presumably, we're going to do this on the C drive, so c: (and we can use forward slashes here even though I'm on Windows). I'm going to call the folder tmp, with another forward slash on the end, so file:///c:/tmp/, and that folder will be created automatically when we run this code.

In the previous API, we created an object called JavaSparkContext directly, using the new keyword. But in the Spark SQL API, we finish off with a call to the getOrCreate method, and that returns us the instance of the SparkSession object. So it's boilerplate code, very similar to what we had before. I think now I can delete those two commented-out lines of code. The last line of code in here is going to be a call to the close method on that SparkSession.
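Putting that together, here's a rough sketch of what the main method looks like at this point. The hadoop.home.dir value and the WARN log level are the kind of values carried over from Section One, so treat those two specifics as placeholders.

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.SparkSession;

public class Main {

    public static void main(String[] args) {
        // Only needed on Windows; the exact winutils path here is a placeholder.
        System.setProperty("hadoop.home.dir", "c:/hadoop");
        // Quieten Spark's logging; the WARN level is a placeholder from Section One.
        Logger.getLogger("org.apache").setLevel(Level.WARN);

        SparkSession spark = SparkSession.builder()
                .appName("testingSql")
                .master("local[*]")
                // Only needed on Windows; the folder is created automatically.
                .config("spark.sql.warehouse.dir", "file:///c:/tmp/")
                .getOrCreate();

        // ... data loading will go here ...

        spark.close();
    }
}
```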
Now we need to point the SparkSession object at the data that we're working with, and we do that by calling a method on the spark object called read. This is a slightly strange method really, in that read itself isn't going to do anything; we have to go on and make a further call to tell Spark what type of data we're reading. If I hit the dot there, the IntelliSense tells us that there is a method called csv, which is for reading in a comma-separated values file, just like the one we have. But just to look at the other options available: we can see there's a jdbc method, which will allow us to connect to a database table, there's a method to allow us to read JSON, and there are methods to allow us to read text files. We'll look at some of those options later, but for now we want the csv option. The argument to this is going to be a reference to the file that we're reading, so for us, very similar to the previous section, this is going to be src/main/resources/exams/students.csv, which points it at this file here.

There is one further thing we need to do in this command. Reviewing the structure of the data again, we do have a header line in this CSV. Now, the header is optional, but the advantage of having one is that Spark SQL is going to be able to read it, and when we start manipulating the data we'll be able to refer to the data by its column header, which proves to be very useful. The only slight problem is that, by default, Spark will assume that there is not a header line. To tell it that there is a header line, after the call to read but before the call to csv, we can call the option method. This allows us to set some options on this dataset, and the one option we need is the option called header, which is just a string, and we set it to the value "true".

So this is pointing Spark at the data source. The resulting object we get back from here is an object that Spark calls a Dataset. I'll just set up a variable called dataset and assign it to the result of what we've just done. Now, a Dataset in Spark SQL is an abstraction that represents our data, and you can absolutely think of this Dataset as being very much like a table containing our data. We do have to declare this object, and I'll use the quick fix to do that: Ctrl+1 (or Cmd+1) while hovering over that variable, and I'll select "Create local variable dataset". That has helpfully filled in the data type of this object, which is itself called Dataset. We'll talk in a bit more detail later about what the generic type is here; you can probably guess that this Dataset is going to contain a series of Rows of our data.

Now, I don't think I need to emphasize this too much, because by now you'll have experience of working with Spark RDDs, and you know that when we do things like reading in data for an RDD, we're not actually reading in that data; we're just setting up an in-memory structure representing that data. In fact, it's only when we reach an action that work needs to be done. So we need to go a little further with this Dataset before we can actually see anything interesting happening, and we'll come on to that in the next video. But for now, there is a very useful method on a Dataset. We didn't have this in the RDD version of Spark, but in Spark SQL it is so useful: on a Dataset object you can call the method show. Before we do anything else with that object, you can probably guess what that's going to do.
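As a sketch, those lines look something like this; they go inside the main method from the earlier snippet, between getOrCreate and close.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// ... inside main(), after building the SparkSession ...

// Tell Spark the first line of the CSV is a header, then load the file.
Dataset<Row> dataset = spark.read()
        .option("header", "true")
        .csv("src/main/resources/exams/students.csv");

// Prints the first 20 rows in a readable tabular format.
dataset.show();
```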
Let's give this code a run. Now, bearing in mind this is a multi-megabyte dataset, it's going to be pretty huge, many, many thousands of rows. But very helpfully, we're going to see that the show method will print out the first 20 rows of that file in a nice, readable format.

One more thing we can do before we stop this video is get a count of how many rows there are in this dataset. If we call the count method on the dataset, not surprisingly, we'll get that number back; we'll store it in a variable called numberOfRows, and if we output this, we should be able to see how many records there are. Well, I was a little ahead of myself there: I declared this variable as an integer, but in fact, as you can see from the compile error here, it's going to be of type long, which of course makes sense, since an int can only go up to a number of about 2 billion. This is a big dataset, so there could well be more than 2 billion records in it, so it needs to be a long. Let's give that a try. Now, I did some video editing there just to speed the video up, so please don't worry if that took quite a while for you; remember, we do have a fairly big dataset here, or at least it's a big dataset to work with in development. But apparently we have about 2 million rows in this dataset.

Now, in the previous demonstration we just very casually called this count method on our dataset, and it certainly seemed to return the right kind of answer, but we didn't really think about what was going on when we called it. And really, in real life, we don't need to think about what's going on under the hood. But I think it's worth pausing for a little reflection here. In the previous section, when we were working with Java RDDs, we looked in detail at operations such as map-reduce operations, which we would typically use to implement things such as a count on big data. I really want to draw your attention to the fact that although we're using this Dataset, which very much feels like just an in-memory object, we are still working with big data. We are still working, in fact, with RDDs; it's just that those RDDs are kind of hidden away inside this Dataset. So what I'm trying to say is that although we can very casually call count and not really think very deeply about what's going on under the hood, you know now that there's an RDD underneath this. Our data might be so big that it's spread across multiple nodes, exactly as in the previous section, and the implementation of this count could well be an operation such as a map-reduce. It's just that in this API we don't need to know exactly how the counting is implemented; we just know it's going to be implemented as a distributed count, potentially across multiple nodes.

So nothing exciting so far, but we're up and running with Spark SQL. In the next chapter we'll do some proper operations against this dataset. I'll see you for that.
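To round things off, here's the counting step from this video as a short sketch, using the dataset variable from the snippet above.

```java
// count() triggers a distributed job under the hood and returns a long,
// because the row count could exceed what an int can hold (about 2.1 billion).
long numberOfRows = dataset.count();
System.out.println("There are " + numberOfRows + " records");
```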