Logistic Regression Example Code Along

Jose Portilla
A free video tutorial from Jose Portilla
Head of Data Science, Pierian Data Inc.
4.6 instructor rating • 41 courses • 2,591,229 students

Learn more from the full course

Spark and Python for Big Data with PySpark

Learn how to use Spark with Python, including Spark Streaming, Machine Learning, Spark 2.0 DataFrames and more!

10:34:29 of on-demand video • Updated May 2020

  • Use Python and Spark together to analyze Big Data
  • Learn how to use the new Spark 2.0 DataFrame Syntax
  • Work on Consulting Projects that mimic real world situations!
  • Classify Customer Churn with Logistic Regression
  • Use Spark with Random Forests for Classification
  • Learn how to use Spark's Gradient Boosted Trees
  • Use Spark's MLlib to create Powerful Machine Learning Models
  • Learn about the Databricks Platform!
  • Get set up on Amazon Web Services EC2 for Big Data Analysis
  • Learn how to use AWS Elastic MapReduce Service!
  • Learn how to leverage the power of Linux with a Spark Environment!
  • Create a Spam filter using Spark and Natural Language Processing!
  • Use Spark Streaming to Analyze Tweets in Real Time!
Hello everyone, and welcome to the documentation example lecture for the logistic regression section. What we're going to do is quickly walk through the documentation's logistic regression example, and after reviewing this lecture, as well as the previous lectures, you should begin to pick up on Spark's general pattern and workflow for doing machine learning with Spark and Python's MLlib library. Regardless of whether you're doing a classification task or a regression task, the general workflow should become clear to you after this lecture.

What we're also going to do is introduce the concept of evaluators. Evaluators behave similarly to machine learning algorithm objects, but they're designed to take in evaluation DataFrames, that is, the DataFrames produced when we run things like model.evaluate() on some test data set. That produces an evaluation DataFrame with a prediction column, and the evaluator object can take in that DataFrame and let us call metrics off of it. We'll explain that when we actually see the coded example.

As a quick note, if you view the documentation, evaluators are still technically labeled as experimental, so you should use caution when using them for production code. However, they have been around and part of Spark since version 1.4, and at the time of filming the latest version of Spark is 2.1, so you can assume they have some stability to them. Still, definitely use caution; they're labeled experimental for a reason, so keep that in mind.

The files we're going to be using for this lecture are Logistic_Regression_Example.ipynb for the notebook and sample_libsvm_data.txt for the actual data. Look under the Logistic Regression folder, under the Machine Learning folder, for access to these files, and all the links we show can be found inside that notebook. OK, let's get started in a brand new notebook; I also have some links open in new tabs that we're going to visit throughout this lecture.

All right, here I have the Logistic Regression folder open under Spark for Machine Learning, and you can see we have the logistic regression code along (that's going to be the next lecture), the consulting project, the consulting project solutions, and the example, which is exactly the notebook we're using right now. So if you click on that, you should see a notebook that looks like this. It has all the code and all the links, as well as some explanatory text for everything we're going to be doing in this lecture. We also have the sample_libsvm_data.txt that you're going to need for this lecture.

I've created a new notebook; it's called Untitled and right now it's blank, so we can just get started in it. The first thing you need to do is import SparkSession from pyspark.sql and begin a Spark session in order to create a DataFrame — hopefully by now this is second nature to you. We can just say SparkSession.builder, give our app a name (I'll call it, I don't know, 'mylogreg' for my logistic regression), and then getOrCreate(). The next thing I want to do is import from pyspark.ml, and since logistic regression is a classification task, the family it falls under is classification. If I hit tab here, I can see the various things that are available to me; a lot of these aren't models, just general things, but the one I'm looking for is LogisticRegression.
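Put together, the setup walked through so far looks roughly like this minimal sketch; the app name 'mylogreg' is just the placeholder used in the lecture:

```python
# Minimal sketch of the notebook setup described above.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

# Start (or reuse) a SparkSession so we can create DataFrames.
spark = SparkSession.builder.appName('mylogreg').getOrCreate()
```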
So let's run that, and then let's grab our training data. I'll say my_data = spark.read.format(), and this is the libsvm format — remember, because it's the documentation example — and then I'm going to load sample_libsvm_data.txt. Let's see what that data looks like by calling .show(), and I can see it's already nicely formatted for me. We have labels which are either zero or one, so this is binary classification, and then we have features as well, that vector of features. So we'll scroll down here and continue on.

Next, I'm going to go straight ahead and actually create my logistic regression model. We'll say my_log_reg_model (maybe call it something shorter), and it's an instance of a LogisticRegression model. As always, in this LogisticRegression model you can specify the features column, the label column, and the prediction column; we'll keep everything at the defaults right now, and that should be enough for us. What we're going to do next is call fit, so something like fitted_logreg = my_log_reg_model.fit() on that training data, which I called my_data. So we'll pass in my_data and run that.

So now what I've done is fit this logistic regression to this data. I didn't bother with the train/test split — we'll see that later on; oftentimes the documentation doesn't do that. Now that my model has been fitted to the data, what I can do is get a summary off of it, so I'll say something like log_summary = fitted_logreg.summary. That means I can grab log_summary and then grab the predictions DataFrame off of it. If I just shift-enter here, it says it's a DataFrame, and we can check out the columns it has: I can call printSchema() and see it has the label, the features, the raw prediction, the probability, and the prediction.

In order to do this, we needed data for which we already knew the labels, and in this case we did. So basically what this is saying is: here's the actual correct label, here were the features, here's the raw prediction value (which has to do with how logistic regression works), here's the probability for that prediction, and then here is what the model predicted. What we really want to compare, at least in the most straightforward way, is whether the label matches our actual prediction for it. So let's show these results. Hopefully we're not too zoomed in here — OK, let me zoom out one more time and show this again so we can see it nicely formatted. This is actually even nicer because I can see label and prediction stacked right on top of each other, which is pretty convenient. I can see here my label is 0.0 and my prediction was 0.0 as well, and then one and one, etc. If you keep going down, it looks like we're doing a pretty darn good job of predicting the labels correctly. So what we can end up doing is checking how we can evaluate this.
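As a rough recap of the documentation-style steps just described, here is a sketch, assuming sample_libsvm_data.txt sits next to the notebook:

```python
# Load the libsvm-formatted documentation data (label + sparse feature vector).
my_data = spark.read.format('libsvm').load('sample_libsvm_data.txt')
my_data.show()

# Fit a LogisticRegression with the default featuresCol/labelCol/predictionCol
# on ALL of the data -- no train/test split yet, just like the docs example.
my_log_reg_model = LogisticRegression()
fitted_logreg = my_log_reg_model.fit(my_data)

# The summary exposes a predictions DataFrame with label, features,
# rawPrediction, probability, and prediction columns.
log_summary = fitted_logreg.summary
log_summary.predictions.printSchema()
log_summary.predictions.show()
```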
Now that we've seen the basic way to perform logistic regression, let's expand on this documentation example a little bit and introduce the concept of evaluators. In order to really show that, what we're going to end up doing is grabbing my_data, splitting it into a training set and a test set, running evaluate on the test set, and then passing that evaluation DataFrame into the evaluator itself.

So we'll scroll down here and run a new cell, and in this new cell what I'm going to do is use randomSplit on all the data. Let's double check what it was called — my_data. OK, so I'll say my_data.randomSplit() and do a 70/30 split, and then we'll unpack that into lr_train (for logistic regression train) and lr_test. Then what we're going to do is retrain just on the training data and then evaluate on lr_test. If we come back up here, we're basically going to follow the exact same steps, except we're only going to fit on the training data instead of all the data. So we'll create a new model for this: final_model is equal to a LogisticRegression object, then we fit final_model on lr_train and call the result fit_final. So now I have that final model fitted to the training data.

Then I want the actual predictions and results, so what I'm going to do is say prediction_and_labels = fit_final.evaluate() on that test set. You've seen a really similar process back in the regression lectures, but hopefully these four cells basically show everything you need to know: you take all your data, and once it's formatted well, you randomly split it into whatever ratio you want for your training data and test data; you create a logistic regression model with any parameters you want; you fit that model onto your training data and keep that as your fitted model; and then, using that fitted model, you evaluate on the test set. This is everything shown in separate steps, but sometimes people like to combine these steps by calling a method on a method.

So I have this prediction_and_labels object that I've just created, and if we take a look at it, I can call various things off of it, but what I want to call is predictions, and then show the results. I can see here I have my label, the features, the raw prediction value, the probability, and the prediction again. Essentially this is the exact same thing I had up here with log_summary, except here I called evaluate on the test data. Because it's the test data, I may not get the label and prediction matching up perfectly. Right now it looks like it's doing pretty well, especially because it's the documentation's own data — it may be a perfect fit, but we're about to explore that.

So what I want to do next is actually explore the evaluation of this prediction-and-labels DataFrame using an evaluator object. We'll scroll down here, and what we're going to do is say from pyspark.ml.evaluation import, and we're going to be discussing both multiclass classification and binary classification. So the ones that may interest you — I'm just hitting tab to autocomplete — are BinaryClassificationEvaluator and (let me put this on a new line so you can see it) MulticlassClassificationEvaluator.
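Here is a sketch of the train/test version of the same workflow; the 70/30 ratio is just the one chosen in the lecture:

```python
# Randomly split the full data into a 70/30 train/test pair.
lr_train, lr_test = my_data.randomSplit([0.7, 0.3])

# Fit a fresh model on the training portion only...
final_model = LogisticRegression()
fit_final = final_model.fit(lr_train)

# ...then evaluate on the held-out portion. This returns a summary object
# whose .predictions DataFrame we can later hand to an evaluator.
prediction_and_labels = fit_final.evaluate(lr_test)
prediction_and_labels.predictions.show()
```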
To import both of these on multiple lines, what we can do is just wrap them in parentheses. Now that we've imported both of these evaluator objects, I want to quickly hop over to the documentation so we can dive a little more into the details of BinaryClassificationEvaluator and MulticlassClassificationEvaluator.

Let's start with binary classification. If you come over here to pyspark.ml.evaluation, we have this BinaryClassificationEvaluator, which takes in a raw prediction column (by default it expects it to be called rawPrediction), a label column (it expects it to be called label), and then some sort of metric name — in this case it expects areaUnderROC by default, that receiver operating characteristic curve. You can see the documentation actually has an example where it creates a little DataFrame for you to play around with. So this is essentially just the basic evaluator for binary classification. The raw prediction column can actually be of type 0/1 prediction, probability of label 1, or some sort of vector of raw prediction scores or label probabilities. And we actually had all of those: if we go back to our notebook and scroll up, we had rawPrediction, we had the label, and we had this prediction column, which is just a zero or one, so we could use that as well.

Coming down here, we want to explore the metric name. If we scroll all the way down to metricName, we basically have two options: the metric it expects is either the area under the receiver operating characteristic curve (areaUnderROC) or the area under the precision-recall curve (areaUnderPR). Those are the only two metrics it can return to you, since it's the binary classification evaluator; it's a little more difficult to grab things like precision or recall directly from this specific evaluator.

Now, speaking of grabbing things like accuracy, precision, or recall directly, we can use the MulticlassClassificationEvaluator for that. So I'm going to hop over to the next tab I have open, which is MulticlassClassificationEvaluator. This one expects a prediction column — it can no longer just take rawPrediction; it needs the actual prediction column, which is going to be those zeros and ones (and, since this is multiclass, also twos and so on) — and you can see it builds an example for you with just the straight prediction and the actual label. It also takes the label column, which is the true label, as well as the metric name. If you begin to scroll down, you'll eventually get to metricName, and it can return back f1, a weighted precision score, a weighted recall score, or even an accuracy score. So those are all your options when dealing with the multiclass classification evaluator.

These evaluators are a little complicated, so don't worry if you don't totally get the idea right now; we're really going to focus in on these evaluator objects a lot more when we discuss pipeline objects, because that's where they really come into play. But for right now, what we can do is go ahead and just show a very basic example of using one. So we'll come back to the Untitled notebook, where we have prediction_and_labels.predictions.show(), and use this BinaryClassificationEvaluator; the sketch below summarizes both evaluator objects and the metric names they accept.
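A small sketch of both evaluator objects and the metric names discussed above, using the default column names from the documentation:

```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Binary: only two metric options, areaUnderROC (the default) or areaUnderPR.
binary_eval = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                            labelCol='label',
                                            metricName='areaUnderROC')

# Multiclass: works off the plain prediction column and can return
# 'f1', 'weightedPrecision', 'weightedRecall', or 'accuracy'.
multi_eval = MulticlassClassificationEvaluator(predictionCol='prediction',
                                               labelCol='label',
                                               metricName='accuracy')
```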
So I'll create my evaluator object: I'll say my_eval is equal to a BinaryClassificationEvaluator, and we'll just keep the defaults, since I have a label column and a rawPrediction column. We'll keep those defaults and run this. The next thing I want to do is say my_eval.evaluate() and evaluate on this predictions DataFrame — that was prediction_and_labels.predictions. We'll pass that in and call the result my_final_roc, or whatever you want to call it, and run that; it may take a little bit of time depending on your computer.

Then what we can do is check this out. So we look at my_final_roc, and it happens to be 1.0. What does that actually mean? Well, it means the area under the ROC curve was 1.0, meaning this was essentially a perfect fit and it predicted everything accurately. Is that realistic? Probably not, unless you have a lot of faith that everything was highly separable — and in this case the data was highly separable. You can even see a hint of that: we were getting predictions that matched up with the labels perfectly, even with the train/test split. So this wasn't a great example of what my_final_roc would typically look like, but it is an indication that we filled up the area under the ROC curve and basically matched our prediction to every single label perfectly.

What I really want you to get out of this is how to generally use evaluator objects: you say from pyspark.ml.evaluation import BinaryClassificationEvaluator or MulticlassClassificationEvaluator, and then for binary classification you can get back either the area under the ROC curve or the area under the precision-recall curve, while for multiclass classification you can get back accuracy, weighted precision, weighted recall, et cetera. The sketch at the end of this lecture recaps that evaluation step.

All right, I hope you found that useful. This was not a super realistic example, since we got everything perfect, so up next we'll see a more realistic example using a custom code along. Thanks, and I'll see you at the next lecture.
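As mentioned, here is a minimal sketch recapping the evaluation step walked through above, assuming the prediction_and_labels summary from the earlier train/test cells:

```python
# Evaluate the held-out predictions with the binary evaluator (defaults:
# rawPredictionCol='rawPrediction', labelCol='label', metricName='areaUnderROC').
my_eval = BinaryClassificationEvaluator()
my_final_roc = my_eval.evaluate(prediction_and_labels.predictions)
print(my_final_roc)   # ~1.0 here only because this toy documentation data is so separable
```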