Ingesting CSV and JSON Files

Imtiaz Ahmad
A free video tutorial from Imtiaz Ahmad
Senior Software Engineer & Trainer @ Job Ready Programmer
4.6 instructor rating • 12 courses • 255,719 students

Learn more from the full course

Master Apache Spark - Hands On!

Learn how to slice and dice data using the next generation big data platform - Apache Spark!

06:49:21 of on-demand video • Updated October 2020

  • Utilize the most powerful big data batch and stream processing engine to solve big data problems
  • Master the new Spark Java Datasets API to slice and dice big data in an efficient manner
  • Build, deploy and run Spark jobs on the cloud and benchmark performance on various hardware configurations
  • Optimize Spark clusters to work on big data efficiently and understand performance tuning
  • Transform structured and semi-structured data using Spark SQL, Dataframes and Datasets
  • Implement popular Machine Learning algorithms in Spark such as Linear Regression, Logistic Regression, and K-Means Clustering
English [Auto] All right, in this lecture we're going to continue coding with Spark in my Eclipse environment. As I explained earlier, I've already downloaded all of the source code for this course, and I encourage you to do the same — it's available on the GitHub repository, and I've provided instructions on how to get that code. So I've opened up Eclipse and I'm going to import that project into Eclipse. The way to do that is you right-click and go to New > Java Project, and the project name — this is important — the project name is going to be exactly the same as the project name that you downloaded from GitHub. That project is called Project 2 for this lecture, so make sure you use the same one, and then just hit Finish. Now it's saying it requires compliance level 9 or above. You may or may not see this, but basically what it's asking us to do is make sure that we're using the right execution environment. I already set this up to run on Java version 8. The way you can do that is right-click, go to Build Path > Configure Build Path, and under the Libraries section make sure you're using the version of JDK 8 that you downloaded. If it shows something else, you can change it: go to Edit and choose the Java runtime environment that you have available on your machine. Mine is using Java 8, as you can see right there. Again, all of the code in this course needs to run with Java 8, so make sure you have the correct version. Anyway, in the src/main/java folder I have a package com.jobreadyprogrammer.spark. Let's open up our main Application class — it's the one that contains the main method — and here I'm making references to other classes that I've already created. We're going to go through the first example, so let's comment out these two other examples here; we're just going to deal with this first one.
So I'm getting an instance of this InferCsvSchema class, and basically I want to print the schema of a CSV file. Let's go into that class and see what it's doing — that class is right here. Notice that this is going to create a SparkSession, and using that SparkSession it's going to read a file in CSV format. Now there are some various options here, because this is a more complex CSV file than the one you saw earlier. Let's open it up. The file is located in the src/main/resources folder — that's right here — and it's called amazon products. Let's open that up. This is a CSV file where a record actually spans multiple lines. First of all, these are the headers: it has one, two, three, four, five columns. The first column contains the ID, the second column is the product ID, and the third column is basically the title, which has all of this text up to this point; then the publish date is right here and the URL is right here. And notice that this is not really a comma-separated file — it's actually a semicolon-separated file. There are semicolons separating the different data elements. That's one major change from the previous version that we saw. The title column contains all of this text that spans two lines. And notice at the beginning there is this caret — this is known as a caret character — and towards the end of that text there's also a caret. In this example, let's consider the caret to represent quotes: anywhere you see this caret, we want to interpret the enclosed text as quoted. So this is a more complex version of a CSV file than we've seen before. With a similar structure to before, we create a Dataset of Row, which is known as a DataFrame.
We store it in the variable df to represent the DataFrame, and we're doing spark.read().format("csv") with the option header set to true. Then we set up another option called multiline, set to true, meaning that if the value of a given column sits on multiple lines, Spark needs to handle it — remember that the title field could span multiple lines in this example, as we saw. Then the separator is not a comma — it's actually a semicolon. You can specify any separator you want here; we know from the file that it's a semicolon, so that's the option we use. We could also separate using tabs, pipes, and all sorts of other characters, so there's a lot of flexibility here. The next thing is this quote property. We're basically telling Spark: any time you see this caret, consider it a quote. We don't want the caret to show up in the data. It could have been any character, but any time Spark sees that character, we're telling it to treat it as a quote — and that's what we're doing here. Then the dateFormat option is also available to us, so we're telling Spark that there is going to be a month, day, and year format — it's going to look for that anywhere it sees this kind of a date. And then we're setting inferSchema to true, so we're not hard-coding a schema or anything — we're telling Spark to figure out what the data types of these columns should be. I'll show you later how we can create our own schema and customize the schema for a given file. And then finally there's the load method — you've seen this before — where we specify the location of the file. Then right here we're just printing an excerpt of the DataFrame content to the screen, showing the first seven rows. And that's pretty much it. And then over here, df.printSchema().
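Put together, the read in that class looks roughly like the following sketch. The exact file name, date pattern, and class wrapper are my assumptions from the lecture's description; the option names are the standard Spark CSV reader options.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InferCsvSchema {
    public static void printCsvSchema() {
        SparkSession spark = SparkSession.builder()
                .appName("CSV schema inference")
                .master("local")
                .getOrCreate();

        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true")       // first line holds column names
                .option("multiline", "true")    // a field value may span lines
                .option("sep", ";")             // semicolon-separated, not comma
                .option("quote", "^")           // treat the caret as the quote char
                .option("dateFormat", "M/d/y")  // month/day/year dates (assumed pattern)
                .option("inferSchema", "true")  // let Spark guess column types
                .load("src/main/resources/amazon-products.csv"); // illustrative path

        df.show(7);          // first seven rows
        df.printSchema();    // how Spark typed each column
    }
}
```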
This is useful to be able to see how Spark interpreted the values in those columns — you know, what data types it inferred those values as. That's what this printSchema method does. And in the Application class, where we have the main method, we're creating an instance of that class and executing the printCsvSchema method to run all the code that we just went over. All right, so let's right-click and run this application, and bring the console up a bit. OK, that finished executing. If you scroll up, you'll see that it says the DataFrame's schema is this. It took the listing ID and the product ID and treated them as integers. The title it's treating as a string — so great, it's interpreting that. The published date, however — it's treating that as a string. It leaves that up to us to figure out what that data type should be; it's not trying to be too smart. I'll show you how to treat this as a date column in just a moment. And then the URL, of course, is going to be a string as well. So this is being printed from the last portion of the method, which was df.printSchema() — that's what this is showing. Then df.show(7) is actually going to show seven records, but we only have four records in that file, so that's all it shows. And notice it was able to correctly parse those fields out based on the semicolon separator. And the title contained the caret character at the beginning and the end, right? We're not seeing that here, so it's treating the carets as quotes. If we were to get rid of that option, it would show the caret symbol as part of the title. Now, if you want to see more characters for a given column, you can use the second option down here — I mean, uncomment this line: let's comment out the df.show(7) and let's do df.show(7, 90). The second argument specifies how much of each field you want to see.
So it truncates after 90 characters. Without this argument, show truncates after 20 characters by default, and we want to see more of each of these fields. OK, so let's try the 90 option: save this file, go back to the Application class, and run this class. Make sure when you're running, you're always here, because this is the file that contains our main method — I'm sure you're familiar with this already from Java, which I hope you are. OK, now that's done. Let's scroll up. And notice now we see a lot more of each field: the title is showing a lot more, up to 90 characters. Of course, this doesn't give as pleasant a format in the display as before, but that's OK — we're changing the number of characters that are displayed, so it's going to shift things around and it won't be as pretty as the default, but hopefully you get the idea. So now let's move on to the next example. I'm going to comment out this first example and uncomment this second example here. This is referring to a class called DefineCsvSchema, so we can actually manually define the file's schema. In the previous class that we looked at, InferCsvSchema, we weren't specifically giving any kind of names to these fields — they were coming raw from the file — and we were leaving it up to Spark to figure out what the data types of these columns should be. It was pretty much smart enough to figure that out on its own, but we can actually hard-code the data type for each of the columns, and that's what this DefineCsvSchema class does. So let's open this class up. Here is that class. The printDefinedSchema method basically does the same thing: establish a SparkSession, and then there is this thing called StructType. This is a type that comes from — let's scroll up here — org.apache.spark.sql.types. And what that allows us to do is define our own data types.
You use DataTypes.createStructType, and then you specify an array of StructFields containing all the fields we're trying to define for the given schema. We're calling this variable schema, because that's exactly what it is. The first StructField is id — so instead of listing ID, I want it to just show id — and the data type for that is integer. And this false value: we can take a look at what it means if we open up the method definition. Notice that the last argument here is whether we want the field to be nullable or not. When you pass false for nullable, Spark expects a value — meaning every value in that field must exist. Now, in this example, the product ID is set to true, meaning it is nullable; so if there is a product that doesn't have a product ID, that product will still be displayed, because we set this to true. And the same goes for this one right here — for the published-on date, it's OK not to know the date at which a product listing was published. And then down here is the multiline option, specifying that a given field could be split across two or three lines — that's what this represents. Header is of course true, the separator is still a semicolon, the date format is still the same, and we're still treating the caret characters as quotes. And this is the key line here: the previous InferCsvSchema class did not contain this line where it says schema. Over here we're not inferring the schema automatically — we're giving it the schema that we created up above, by tagging on this .schema option down here. And then we're doing the familiar load method. OK, so let's close this file. We're creating an instance of that class and running printDefinedSchema, so we should see a defined schema here, similar to what Spark was able to create for us before.
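A sketch of that hard-coded schema and read. The five column names and the date pattern are my assumptions from the lecture's description; DataTypes.createStructType, createStructField, and the nullable flag are the standard Spark Java API for this.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DefineCsvSchema {
    public static void printDefinedSchema() {
        SparkSession spark = SparkSession.builder()
                .appName("Defined CSV schema")
                .master("local")
                .getOrCreate();

        // Name, data type, and nullable flag for each column.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("id", DataTypes.IntegerType, false),
                DataTypes.createStructField("productId", DataTypes.IntegerType, true),
                DataTypes.createStructField("itemName", DataTypes.StringType, false),
                DataTypes.createStructField("publishedOn", DataTypes.DateType, true),
                DataTypes.createStructField("url", DataTypes.StringType, false)
        });

        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true")
                .option("multiline", "true")
                .option("sep", ";")
                .option("dateFormat", "M/d/y")
                .option("quote", "^")
                .schema(schema)   // use our schema instead of inferring one
                .load("src/main/resources/amazon-products.csv"); // illustrative path

        df.show(7);
        df.printSchema();
    }
}
```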
So let's rerun this and you'll see the difference. Notice over here it's saying the published date is a string, but when I run this you'll see that the published date is now being treated as a date field. So let's right-click and Run As > Java Application. OK, the program completed. If you scroll up a little bit, notice this is our new schema, and the columns have now changed. Instead of listing ID or item ID or whatever it was, I just left it as id, and then productId, itemName, publishedOn — these are the new columns that I defined in the schema, so it's registering those. And more importantly, the date column is now actually the date data type. So this is pretty cool. Scroll up and you'll see the breakdown here: the field names have changed, since we changed them in the schema. Hopefully that's straightforward. The main thing to keep in mind — when we go back to that DefineCsvSchema class — is that when you create a schema like this using StructType, you want to make sure you tag the schema variable you defined onto the Spark read call using the .schema method, just like the other options. You're basically just adding on .schema, and that's when Spark will recognize that we have a hard-coded schema. OK, so let's comment this out now and move on to the third example here. And this is not a CSV file — as you can see by the name, it's JsonLinesParser. Let's open this up and take a look at what it's doing. This is actually straightforward: we're doing the same SparkSession creation here, and then we have spark.read().format("json"). Instead of csv, this is now json, and we're loading the JSON file from right here, which is simple.json. So let's open that up. This is the file. Now, it opened in my default text editor on the system, which is Sublime Text.
For you it may open in Eclipse — it doesn't matter, as long as you can see this file. Each record here is on a different line. That's the key takeaway. This is known as JSON Lines syntax, where an entire JSON document is on one line, the next JSON entry is on another line, and so on. Normally you would need an array with commas to separate the different JSON documents, but here we have four documents — four records — and they're each on their own line. OK, so let's close this file. In this first option here, we're creating a DataFrame for JSON and loading simple.json. The second example I'm just going to comment out for now — we'll get back to it in just a moment. Then df.show is going to show the top five records — we only have four records in there, so it doesn't really matter — and we're showing up to 150 characters for each of the fields. We're also printing the schema. Now, this is not going to print in JSON format; the df.show output is going to look like a table, like this. So let's close this and run it. And by the way, to avoid all the logging — it takes a lot of time to print all the logging messages — I'm skipping ahead in the video, so it's not that my computer is super fast; I'm basically pausing and recording once the log entries are all printed out. So anyway, here is the schema that Spark was able to figure out. This is the schema for a given record, and notice that it has a name, and then the owns field, which is an array of different elements. So let's take a look at the actual data. Here it is — a very simple document where we have the name for a given person. The first name here is "Top" — whatever kind of name that is, but it exists — and Top owns a car, which is a Honda, and a laptop, which is a Dell. So notice the formatting here is a multi-dimensional array.
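Reading a JSON Lines file like that one is just a format swap on the same reader. This is a sketch; the file name, class wrapper, and show arguments follow the lecture's description and are illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonLinesParser {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("JSON Lines")
                .master("local")
                .getOrCreate();

        // simple.json holds one complete JSON document per line,
        // which is the json format's default expectation.
        Dataset<Row> df = spark.read().format("json")
                .load("src/main/resources/simple.json"); // illustrative path

        df.show(5, 150);   // up to 5 rows, up to 150 characters per field
        df.printSchema();
    }
}
```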
In this array we have two arrays: one of them has to do with one item, and the other array has to do with the other item that Top owns. And the same goes for Frank and Peter and the others. Peter here does not own anything, so we load that as an empty array — you can check that in the JSON. Now, a more complex JSON is right here: the multiline JSON file. Let me open that up. Notice in this example — this is known as multi-line JSON — a JSON document, such as this entire thing that I'm highlighting here, with the opening and closing curly braces, can be split across multiple lines. To separate one document from the other there are commas, and all of the documents are surrounded by array brackets, open and close, as you can see on top and bottom. So here is an example of two documents, and we're going to leave it up to Spark to figure out how to parse that. Now let's go back to the JsonLinesParser file real quick. We cannot parse that file the way we did simple.json, so let me comment this out and uncomment this portion right here. The way this is different is, first of all, of course, it's referring to the other file, which is the multiline JSON file, and then there's another property here: multiline set to true. We're tagging on this option to let Spark know that we're going to have documents that span multiple lines — just like you saw in the CSV example, where we also had to add multiline when the value of a given field spills over to the next line. This is important for this kind of complex JSON document. In simple.json, each line represents an entire JSON document; here, that's not the case. So instead of df.show, it's going to be df2.show for this example. Let's save this file, right-click, and run the Application class — and there we go, we're done.
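The multi-line variant only changes the input file and adds the multiline option — again a sketch, with the file name assumed from the lecture:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultilineJsonExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Multi-line JSON")
                .master("local")
                .getOrCreate();

        // Documents span several lines and sit inside a top-level [ ... ] array,
        // so Spark must be told not to assume one document per line.
        Dataset<Row> df2 = spark.read().format("json")
                .option("multiline", "true")
                .load("src/main/resources/multiline.json"); // illustrative path

        df2.show(5, 150);
        df2.printSchema();
    }
}
```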
So if you scroll up, this is the schema definition. You can actually go through the document yourself and see how this matches up with it — this is the correct schema definition; the way it's translated here, if you go back to the JSON file, you'll be able to see that the structure matches the multi-line JSON file. OK, so let's look at the actual table that Spark was able to construct for us. It recognized the two documents. And then there is this column called building, and geolocation, which is basically an array — it's a multi-value field, so it has an array as well as this other value. And in the properties column there's an array as well as two other values. Now notice that it's not actually showing the nested field names in the record it's displaying here, and that is the normal way Spark prints nested JSON documents. If you open up the multi-line JSON real quick — for example, the geolocation column — notice that it printed "exact" and these two values. So "exact" has the field name type; it's not displaying that in the table. And these values in an array belong to the coordinates field; it's not displaying that field name either — it just shows the values. The same goes for this field right here, latitude/longitude — it's just an array. So in the properties column it won't say permit number; it will just show the value, and it will just show the array. OK, so let's go back to the file here. Notice it just gives the address; it doesn't show the full field name for latitude/longitude, and the same goes for the geolocation field. So hopefully you've got the idea: nested field names are not printed in the record — only their values are shown. And that's typically the most important thing in JSON, right? You're usually dealing with the values, and you can access them through their field names — we'll get into that later.
So these were a few examples of how to parse different kinds of files. CSV files are a very common thing, and so is JSON, so you've got the two most important techniques for parsing files. There are also XML files — XML is getting so old, but you can actually parse XML much the same way that you did CSV or JSON. All you have to do is change the format: instead of saying json or csv, you'd write xml here, and Spark will be able to read it in that format (note that XML support comes from the separate spark-xml library rather than from Spark itself, so that package needs to be on your classpath). All right, so this lecture was on ingesting different types of files into Spark. We didn't talk too much about transforming these values, but we'll get into that pretty soon. So let's wrap up this lecture — thanks for watching. I'll see you in the next one.