Using DataFrames with MLlib (Activity)

Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro
4.5 instructor rating • 21 courses • 416,550 students

Lecture description

We'll run our Spark ML example of linear regression, using DataFrames.

Learn more from the full course

Apache Spark with Scala - Hands On with Big Data!

Apache Spark tutorial with 20+ hands-on examples of analyzing large data sets, on your desktop or on Hadoop with Scala!

09:03:23 of on-demand video • Updated September 2020

  • Frame big data analysis problems as Apache Spark scripts
  • Develop distributed code using the Scala programming language
  • Optimize Spark jobs through partitioning, caching, and other techniques
  • Build, deploy, and run Spark scripts on Hadoop clusters
  • Process continual streams of data with Spark Streaming
  • Transform structured data using SparkSQL, DataSets, and DataFrames
  • Traverse and analyze graph structures using GraphX
  • Analyze massive data sets with machine learning on Spark
Narrator: So let's walk through our example of using linear regression in Spark and see if it works. Open up the linear regression DataFrame Dataset script. But before we dive into the code, let's look at the data we're dealing with first. That's always a good first step, right?

If we go into our course materials and into the data folder, there's a regression.txt file here; let's familiarize ourselves with what's in it. As you recall, what we're trying to do is predict how much people spend based on page speed. So the thing we're trying to predict, the label, is amount spent, and the feature we're predicting it from is page speed. The first column represents the amount spent, but it's been normalized and scaled down to fit a bell curve distribution. That's why you're seeing negative values here, even though you wouldn't actually see a negative amount spent unless somebody got a refund or something. As we said in the slides, you need to make sure you're scaling your data down into that consistent range, and then remember to scale it back up again when you're done. The second column corresponds to our feature data, the page speeds. Again, these have been normalized and scaled down to fit a bell curve distribution, so these numbers are not raw values.

That sort of feature engineering, as we call it in the world of machine learning, is a very important step in getting good results. A lot of machine learning models require that your features and your labels are all scaled down into similar ranges, sometimes even more specifically centered around 0 or between 0 and 1. You have to read the documentation for the algorithm you're using to make sure your data is in the right format; otherwise it won't work, you'll get really weird results, and you'll be wondering why. So don't forget that.

All right, let's dive into the code. Our regression schema corresponds to those two columns of double-precision values: label, which again corresponds to the amount spent, and features_raw, which corresponds to the page speeds, both scaled down to a normal distribution. We set up our logger settings, we spin up our SparkSession, nothing new there, and we define a regression schema that matches that case class. That lets us import the data from disk using the DataFrame interface: we just say spark.read with a comma separator, using that schema, load it from the data/regression.txt file we just looked at, and then cast it to a Dataset using the regression schema case class.

Now things get interesting. The format expected by this algorithm in the ML library is very specific, and oftentimes you need to refer to the sample code that comes with Spark to figure out exactly what it wants; that's how I went about writing this code, to be honest. What's expected is that you create a VectorAssembler object, which is also part of the machine learning library, and set its input columns as an array of your feature columns. In our case, we only have one feature column, and it's named features_raw; a sketch of this loading and assembly step follows below.
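To make the loading and feature-assembly steps concrete, here is a minimal Scala sketch of what this part of the script looks like, reconstructed from the narration above. The exact names (RegressionSchema, dsRaw, the file path, and so on) are assumptions based on what's described, not a verbatim copy of the course file.

```scala
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StructType}
import org.apache.spark.ml.feature.VectorAssembler

// Case class matching the two double-precision columns in the file:
// label = normalized amount spent, features_raw = normalized page speed
case class RegressionSchema(label: Double, features_raw: Double)

object LinearRegressionDataFrameSketch {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)  // quiet the log output

    val spark = SparkSession.builder
      .appName("LinearRegressionDF")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Explicit schema matching the case class above
    val regressionSchema = new StructType()
      .add("label", DoubleType, nullable = true)
      .add("features_raw", DoubleType, nullable = true)

    // Read the comma-separated file and cast it to a typed Dataset
    val dsRaw = spark.read
      .option("sep", ",")
      .schema(regressionSchema)
      .csv("data/regression.txt")   // path assumed from the narration
      .as[RegressionSchema]

    // MLlib expects a single vector column of features; VectorAssembler builds it
    val assembler = new VectorAssembler()
      .setInputCols(Array("features_raw"))
      .setOutputCol("features")

    // df now has just two columns: label and features
    val df = assembler.transform(dsRaw).select("label", "features")
  }
}
```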
So we just pass in an array with one column name, features_raw. If you had more than one feature, if you had a multidimensional problem, you would pass in the additional feature columns here as well. The VectorAssembler then outputs into a new column called features, and that's what we're actually going to pass into our model. Once we have that VectorAssembler object, we call transform on it, feeding in the Dataset of our raw data, and it transforms that into the features output column. We then select from it the label column and the features column it produced. So now we have a DataFrame called df that just contains labels and features, where the labels again correspond to the amounts spent and the features to the page speeds, all scaled down to a uniform scale.

Now that we have that, what we're going to do to measure the performance of this algorithm is split the data into two sets. We'll set aside half of the data for training our model and the other half for testing it. The idea is that we train the model using just half of the data we have, and then, using that trained model, we see how well it can predict the known correct values in the other half. We measure that error to get a sense of how good our model really is, which can inform us as to whether we need to try different hyperparameters, for example, or clean our data better. So we take the first split that comes back from randomSplit and call it the training DataFrame, and the second block is the test DataFrame. randomSplit just splits up a DataFrame using an array of percentages that you give it, in this case 50/50. We call the first resulting DataFrame trainingDF and the second testDF; every row of df is randomly assigned to one or the other of these DataFrames.

Now we have everything we need to create our linear regression model. We create a new LinearRegression, call it lir, and set all the hyperparameters. Again, I think I just plucked these starting values out of the documentation and the samples; in the real world, you'd want to iterate on these and find what combination of parameters yields the best results. Once we have our object, we call fit on it, passing in our training DataFrame, the half we set aside for training the model. Then we take that trained model and make predictions with it: we call transform on the model, passing in our test DataFrame, and that creates a new fullPredictions DataFrame containing the predicted amounts spent based on the page speeds seen in the test DataFrame. What's interesting is that we can then compare those predicted amounts spent to the actual amounts spent living in the label column of the test DataFrame.

Note that I'm caching fullPredictions here. It's not strictly necessary for this simple example, but if you were going to use that fullPredictions Dataset repeatedly, you'd want to cache it to speed things up. Generally speaking, once you have a trained model and a set of predictions based on it, you'll probably want to cache those results so you can use them repeatedly.
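Continuing from the df built in the previous sketch, the split, model setup, and prediction steps described above look roughly like this. The specific hyperparameter values are placeholders of the "pulled from the documentation" kind the narration mentions, not tuned or authoritative choices.

```scala
import org.apache.spark.ml.regression.LinearRegression

// Randomly split the data 50/50 into a training set and a test set
val Array(trainingDF, testDF) = df.randomSplit(Array(0.5, 0.5))

// Linear regression model with some starting hyperparameters
// (placeholder values; in practice you'd iterate on these)
val lir = new LinearRegression()
  .setRegParam(0.3)           // regularization strength
  .setElasticNetParam(0.8)    // mix of L1/L2 regularization
  .setMaxIter(100)            // maximum iterations
  .setTol(1E-6)               // convergence tolerance

// Fit the model on the training half only
val model = lir.fit(trainingDF)

// Predict amount spent for the unseen test half; cache the result
// because we may use these predictions more than once
val fullPredictions = model.transform(testDF).cache()
```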
In this case, we just pluck the prediction and label columns out of the fullPredictions DataFrame, call that prediction and label, collect it back to the driver script, and print it all out, so we can take a look at the predicted and actual values for each point. You can probably guess what a good exercise for the reader would be here: actually do the work of comparing those predicted values to the actual values from the test DataFrame and measuring that error. But for now, we're just going to look at the results, because this isn't really a course about machine learning, it's a course about using Spark.

So let's right-click on the linear regression DataFrame Dataset script and run it. And it worked. Again, we didn't do the work to know whether these are reasonable predictions or not, but they are predictions, and you can see the output here. These values are scaled down by whatever scaling factor we used when we pre-processed the data, so to make good use of them, we would have to scale them back up again. But it looks like it worked; these look like real values within a reasonable range.

So there you have it: linear regression on a Dataset using Apache Spark. The power of this is that you can take a truly massive dataset, perform linear regression on it, and create a model using the full power of a whole cluster. The sky's the limit here, pretty much literally.

All right, so that's machine learning in Spark in a nutshell. You'll find that most of the models work in a similar manner: you create a model object with a bunch of hyperparameters, you fit that model to some training data, and then you use the trained model to transform the data you want to make predictions for. It's pretty easy to use. So that's it; let's move on.
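To round things out, here is a sketch of that final step, pulling the predictions and labels back to the driver and printing them, plus one way you might tackle the "measure the error" exercise mentioned above using Spark's RegressionEvaluator. The evaluator part is an illustrative addition, not something the lecture's script actually does.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Pull out just the predicted and actual values and collect them to the driver
val predictionAndLabel = fullPredictions.select("prediction", "label").collect()

// Print each predicted value next to the known label
for (row <- predictionAndLabel) {
  println(row)
}

// Exercise-for-the-reader sketch: quantify the error with RMSE
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
println(s"RMSE on the test set: ${evaluator.evaluate(fullPredictions)}")

spark.stop()
```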