Logistic Regression Example Code Along

A free video tutorial from Jose Portilla
Head of Data Science at Pierian Training
Instructor: Hello, everyone, and welcome to the documentation example lecture for the Logistic Regression section. What we're going to do is quickly walk through the documentation's Logistic Regression example. After reviewing this lecture, as well as the previous lectures, you should begin to pick up on Spark's general pattern and workflow for doing machine learning with Python and Spark's MLlib library. Regardless of whether you're doing a classification task or a regression task, the general workflow should become clear to you after this lecture.

What we're also going to do is introduce the concept of evaluators. Evaluators behave similarly to machine learning algorithm objects, but they're designed to take in evaluation data frames, that is, the data frames produced when we run things like model.evaluate on some test dataset. That produces an evaluation data frame with a prediction column, and the evaluator object can take that in so that we can call metrics off of it. We'll explain that when we actually see the coded-out example. As a quick note, if you view the documentation, evaluators are still technically labeled as experimental, so you should use caution when using them for production code. However, they have been around as part of Spark since version 1.4, and at the time of filming the latest version of Spark is 2.1, so you can assume they have some stability to them. Still, they're labeled experimental for a reason, so keep that in mind.

The files we're gonna be using for this lecture are Logistic_Regression_Example.ipynb for the notebook and sample_libsvm_data.txt for the actual data. Look under the Logistic Regression folder under the Machine Learning folder for access to these files, and all the links we show can be found inside that notebook.

Okay, let's get started in a brand new notebook, and I'll also have some links open in new tabs that we're gonna visit throughout this lecture. All right, here I have the Logistic Regression folder open under Spark for Machine Learning. You can see here we have the Logistic Regression code along, which is gonna be the next lecture, the consulting project, the consulting project solutions, and the example, which is exactly the notebook we're using right now. If you click on that you should see a notebook that looks like this. It has all the code and all the links, as well as some explanatory text for everything we're gonna be doing in this lecture. We also have the sample_libsvm_data.txt file that you're gonna need for this lecture.

I've created a new notebook, it's called Untitled right now, it's a blank notebook, so we can just get started in it. What you need to do is import SparkSession from pyspark.sql and begin a Spark session in order to create a data frame. Hopefully by now this is second nature to you, but we can just say SparkSession, builder, give our app a name. I will call it, I don't know, mylogreg for my Logistic Regression. Then I'm going to call getOrCreate. The next thing I wanna do is say from pyspark.ml, and since Logistic Regression is a classification task, the family it falls under is classification. Then I can say import, and if I hit tab here I can see the various things that are available to me. A lot of these aren't models, they're just general things, but I'm looking for LogisticRegression. So let's run that, and then let's grab our training data.
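For reference, here's a rough sketch of the setup cells just described, using the same app name (mylogreg) mentioned in the lecture:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

# Start a Spark session so we can create data frames
spark = SparkSession.builder.appName('mylogreg').getOrCreate()
```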
So we'll say my_data, and I will say spark.read.format. And this is a libsvm format, remember, 'cause it's a documentation example. And I'm going to load sample_libsvm_data.txt. Let's see what that data looks like by saying .show, and I can see it here, it's already nicely formatted for me. We have labels, which are either zero or one, so this is binary classification. And then I have features as well, so that vector of features. So we'll scroll down here and continue on.

Next I'm going to go straight ahead and actually create my Logistic Regression model. So we'll say my_log_reg_model, maybe call it something shorter, and then we'll say it's an instance of a LogisticRegression model. As always, in this LogisticRegression model you can specify the features column, the label column, and the prediction column. We'll keep everything at the defaults right now, that should be enough for us. What we're going to do next is fit it, so let's call this fitted_logreg. So fitted_logreg is equal to my_log_reg_model.fit, and I'm going to fit it on that training data, which I called my_data. So we'll say my_data and run that. Now what I've done is gone ahead and fitted this Logistic Regression to this data. I didn't bother with the train test split, we'll see that later on; oftentimes the documentation doesn't do that.

Now that my model has been fitted to the data, what I can do is get a summary off of it. So let's call this something like log_summary, which is equal to that fitted_logreg model, and I'm going to call .summary off of it. Since I've called .summary, I can grab my log_summary and grab something like the predictions data frame. If I just hit shift enter here, it says it's a data frame, and we can check out the columns it has. I could just print the schema and check out what it has: the label, the features, the rawPrediction, the probability, and then the prediction. In order to do this, we need data for which we already knew the labels, and in this case we do. So basically what this is saying is: here's the actual correct label, here were the features, here's the rawPrediction value, which has to do with Logistic Regression, here's the probability for that prediction, and then here is what the model predicted. What we really wanna compare, at least in the most straightforward way, is whether this label matches our actual prediction for it.

So let's show this result. Let me zoom out so we can see this nicely formatted. Basically this is even nicer formatted, because I can see label and then prediction kind of stacked right on top of each other, which is actually pretty convenient. So I can see here my label was 0.0 and my prediction was 0.0 as well, and then one and one, et cetera. If you keep going down, it looks like we're doing a pretty darn good job of predicting the labels correctly. So what we can end up doing is checking how we can evaluate this. Now that we've seen the basic way to perform a Logistic Regression, let's expand on this documentation example a little bit and introduce the concept of evaluators.
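Roughly, the cells described above look like this (fitting on the full dataset, with no train test split yet, just as the documentation example does):

```python
# Load the documentation's libsvm-formatted sample data
my_data = spark.read.format('libsvm').load('sample_libsvm_data.txt')
my_data.show()

# Create and fit a logistic regression model, keeping the default
# features, label, and prediction column names
my_log_reg_model = LogisticRegression()
fitted_logreg = my_log_reg_model.fit(my_data)

# The training summary holds a predictions data frame
log_summary = fitted_logreg.summary
log_summary.predictions.printSchema()
log_summary.predictions.show()
```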
In order to really show that, what we're gonna end up doing is grabbing my_data, splitting it into a training set and a test set, running evaluate on the test set, and then passing that evaluation data frame into the evaluator itself. So we're gonna scroll down here and run a new cell. In this new cell what I'm going to do is use randomSplit on all that data. Let's double check what it was called: my_data. Okay, so we'll say my_data.randomSplit, and I'll do a 70 30 split. Then we'll say lr_train for Logistic Regression train, and then lr_test, and set those equal to that. What we're going to do is retrain just on the training data and then evaluate on lr_test.

So if we come back up here, we're basically gonna follow these exact same steps, except we're only going to fit on the training data instead of all the data. We'll come back up here and create a new model for this. So we'll say final_model is equal to a LogisticRegression object. Then we'll say final_model.fit on lr_train, and we'll call this fit_final. Now I have that final model fitted to that Logistic Regression training data. I want the actual predictions and results, so what I'm going to do is say prediction_and_labels is equal to fit_final.evaluate on that test set.

We've seen a really similar process back in the regression lectures, but hopefully these four cells basically show everything you need to know. You take all your data, and once it's formatted well, you randomly split it into whatever ratio you want for your training data and your test data. You create your Logistic Regression model with any parameters you want. Then you fit that model onto your training data and set that as your fitted model. And then using that fitted model, you can evaluate on the test set. This is everything shown in separate steps, but sometimes people like to combine these steps, so they'll call a method on a method, as shown in the sketch after this paragraph.

So I have this prediction_and_labels results object. If we take a look at it, I can call various things off of it, but what I wanna call is predictions, and then show the results here. I can see here I have my label, the features, the rawPrediction value, the probability, and then the prediction again. So essentially this is the exact same thing I had up here with the log summary, but what I did is I called evaluate on this test data. Since I'm on the test data, I may not get the label and prediction matching up perfectly. Right now it looks like it's doing pretty well, especially because it's the documentation's own data; it may be a perfect fit, but we're about to explore that.

What I wanna do to actually explore the evaluation of this prediction_and_labels object is use an evaluator object. So we'll scroll down here, and what we're gonna do is say from pyspark.ml.evaluation import, and we're going to be discussing both multiclass classification and binary classification. So the ones that may interest you, I'm just hitting tab to autocomplete there, are BinaryClassificationEvaluator and, let me put this on a new line so you can see it, MulticlassClassificationEvaluator. To have these on multiple lines, what we can do is just wrap them in parentheses.
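Putting those cells together, a sketch of the train/test workflow described above might look like this, using the same variable names as the lecture:

```python
# 70/30 random split into training and test sets
lr_train, lr_test = my_data.randomSplit([0.7, 0.3])

# Fit a fresh model on the training data only
final_model = LogisticRegression()
fit_final = final_model.fit(lr_train)

# Evaluate on the held-out test set; .predictions is the results data frame
prediction_and_labels = fit_final.evaluate(lr_test)
prediction_and_labels.predictions.show()

# Wrapping the import in parentheses lets it span multiple lines
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)
```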
Now that we've imported both of these evaluator objects, I wanna quickly hop over to the documentation so we can dive a little more into the details about BinaryClassificationEvaluator and MulticlassClassificationEvaluator. Let's start with binary classification. If you come over here and look at ml.evaluation, we have this BinaryClassificationEvaluator. It takes in the raw prediction column, which by default it expects to be called rawPrediction, a label column that it expects to be called label, and then some sort of metric name, in this case areaUnderROC by default. That's the receiver operating characteristic curve. And you can see here that it actually has an example where it creates a data frame for you to kind of play around with. So this is essentially just the basic evaluator for binary classification, and the rawPrediction column can actually be of type double, a zero or one prediction or the probability of label one, or some sort of vector, that length-2 vector of raw predictions, scores, or label probabilities, and we actually had all of those. If we hop back to our notebook and scroll up, we had both rawPrediction and we had the label and prediction. So we have this prediction column, which is just a zero or one, so we could use that as well.

Coming down here, we want to explore the metric name. For the actual metric name, if we scroll down here, we basically have two options, and that's gonna be all the way under metricName right here. The metric it expects is either the area under the receiver operating characteristic curve or the area under the precision-recall curve. Those are the only two metrics it can return to you, since it's the BinaryClassificationEvaluator. It's a little more difficult to grab things like precision or recall directly from this specific evaluator.

Now, speaking of grabbing things like accuracy, precision, or recall directly, we could use the MulticlassClassificationEvaluator for that. So what I'm going to do is hop over to the next tab I have open, which is MulticlassClassificationEvaluator. On this one, I have the expected prediction column, which can no longer just be rawPrediction; it needs the actual prediction column, which is gonna be those zeros, those ones, and since this is multiclass, also twos. You can see here it's basically building one for you with just the straight prediction and the actual label. And then there's the label column, which is the true label, as well as the metric name. For the MulticlassClassificationEvaluator, if you begin to scroll down here you'll eventually get to metricName, and it can return back an f1 score, a weightedPrecision score, a weightedRecall score, or even an accuracy score. So those are all your options for dealing with MulticlassClassificationEvaluators.

Now, they're a little complicated, so don't worry if you don't totally get the idea right now. We're really gonna focus in on these evaluator objects a lot more when we discuss pipeline objects, because that's where they're really gonna come into play. But for right now, what we can do is go ahead and just show you a very basic example of using one. So we'll come back to Untitled. We have prediction_and_labels.predictions.show, so I'll come down here and use this BinaryClassificationEvaluator.
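As a quick sketch of the defaults and metric names being discussed, the two evaluators can be constructed roughly like this (the column names shown are the defaults, and the metric names are the options from the documentation):

```python
# Binary evaluator: expects 'rawPrediction' and 'label' columns,
# and returns areaUnderROC by default
bin_eval = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                         labelCol='label',
                                         metricName='areaUnderROC')  # or 'areaUnderPR'

# Multiclass evaluator: works off the plain 'prediction' column; metricName
# can be 'f1', 'weightedPrecision', 'weightedRecall', or 'accuracy'
multi_eval = MulticlassClassificationEvaluator(predictionCol='prediction',
                                               labelCol='label',
                                               metricName='accuracy')
```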
So I'll create my evaluator object. We'll say my_eval is equal to a BinaryClassificationEvaluator, and we'll just keep the actual defaults, since I have a label column and I have a rawPrediction column. So we'll keep those defaults here, and then what I'm going to do is run this. The next thing I wanna do is say my_eval.evaluate, and I wanna evaluate on this predictions data frame, so that was prediction_and_labels.predictions. We'll pass that in and we'll call the result my_final_ROC, or whatever you wanna call it. We'll run that, and that may take a little bit of time depending on your actual computer. Then what we can do is check this out. So if we say my_final_ROC, well, that happens to be one.

So what does that actually mean? Well, it means the area under the ROC curve was 1.0, meaning this essentially was a perfect fit and it predicted everything accurately. Is that realistic? Probably not, unless you have a lot of faith that everything was highly separable. In this case, this data was highly separable, and you can even see a hint of that: we were getting predictions that matched up with the labels perfectly, even with the train test split. So this wasn't a super good example of what my_final_ROC would typically look like, but it is an indication that we maxed out the area under the ROC curve and basically matched our prediction to every single label perfectly.

What I really want you to get out of this is how to generally use evaluator objects. You'll say from pyspark.ml.evaluation, import BinaryClassificationEvaluator or MulticlassClassificationEvaluator, and then for binary classification you can get back either the area under the ROC curve or the area under the precision-recall curve, and for multiclass classification you can get back accuracy, weighted precision, weighted recall, et cetera.

All right, I hope you found that useful. This was not a super realistic example, since we got everything perfect, so up next let's see a more realistic example using a custom code along. Thanks, and I'll see you at the next lecture.
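To recap, the final evaluation step from this lecture, sketched out with the same names used above, is roughly:

```python
# Default columns match what evaluate() produced: 'rawPrediction' and 'label'
my_eval = BinaryClassificationEvaluator()

# Returns the area under the ROC curve for the test-set predictions
my_final_ROC = my_eval.evaluate(prediction_and_labels.predictions)
print(my_final_ROC)  # 1.0 here, since this toy data is essentially perfectly separable
```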