Hello everyone, and welcome to the documentation example lecture for the logistic regression section. What we're going to do is quickly walk through the documentation's logistic regression example, and after reviewing this lecture as well as the previous lectures, you should begin to pick up on Spark's general pattern and workflow for using machine learning with Spark and Python's MLlib library. Regardless of whether you're doing a classification task or a regression task, the general workflow should become clear to you after this lecture. What we're also going to do is introduce the concept of evaluators. Evaluators behave similarly to machine learning algorithm objects, but are designed to take in evaluation DataFrames. That is, the DataFrames that are produced when we run things like model.evaluate() on some test data set: that produces an evaluation DataFrame that has that prediction column, and this evaluator object is going to be able to take in that object, and then we can call metrics off of it. We'll explain that when we actually see the coded-out example. As a quick note, if you view the documentation, evaluators are still technically labeled as experimental, so you should use caution when using them for production code. However, they have been around and part of Spark since version 1.4, and at the time of filming, the latest version of Spark is 2.1, so you can assume they have some stability to them. Still, you should definitely use caution; they're labeled experimental for a reason. So keep that in mind. The files we're going to be using for this lecture are the logistic regression example notebook (.ipynb) and sample_libsvm_data.txt for the actual data. Look under the logistic regression folder under the machine learning folder for access to these files, and all links we show can be found inside that notebook. Okay.
Let's get started in a brand new notebook, and I'll also have some links open in new tabs that we're going to visit throughout this lecture. All right. Here I have the logistic regression folder open under Spark for machine learning. You can see here we have the logistic regression code along (that's going to be the next lecture), the consulting project, the consulting project solutions, and the example, which is exactly the notebook we're using right now. So if you click on that, you should see a notebook that looks like this. It has all the code and all the links, as well as some explanatory text for everything that we're going to be doing in this lecture. And then we also have the sample_libsvm_data.txt file that you're going to need for this lecture as well. I've created a new notebook. It's called Untitled right now; it's a blank notebook, so we can just get started in it. What you need to do is import SparkSession from pyspark.sql and begin a Spark session in order to create a DataFrame. Hopefully by now this is second nature to you, but we can just say SparkSession.builder, give our app a name (I'll call it 'mylogreg' for my logistic regression), and then call getOrCreate(). The next thing I want to do is import from pyspark.ml, and since logistic regression is a classification task, the family it falls under is classification. So from pyspark.ml.classification I can say import, and if I hit tab here, I can see the various things that are available to me. A lot of these aren't models, they're just general things, but I'm looking for LogisticRegression. So let's run that, and then let's grab our training data. I'll say my_data = spark.read.format(), and this is the libsvm format, remember, because it's the documentation example, and then load sample_libsvm_data.txt. So let's see what that data looks like by calling .show(). And I can see here it's already nicely formatted for me.
We have labels, which are either zero or one, so this is binary classification, and then I have features as well: that vector of features. So we'll scroll down here and continue on. Next, I'm going to go straight ahead and actually create my logistic regression model. We'll say my_log_reg_model (maybe call it something shorter), and then we'll say it's an instance of a LogisticRegression model. As always, in this LogisticRegression model you can specify the features column, the label column, and the prediction column. We'll keep everything at the defaults right now; that should be enough for us. What we're going to do next is fit it. We'll say fitted_logreg = my_log_reg_model.fit(), and I'm going to fit it on that training data, which I called my_data. So we'll pass in my_data and run that. Now what I've done is I went ahead and fitted this logistic regression to this data. I didn't bother with the train test split; we'll see that later on. Oftentimes the documentation doesn't do that. Now that my model has been fitted to the data, what I can do is get a summary off of it, so I'll say log_summary = fitted_logreg.summary. Since I've called that summary, I can grab log_summary and grab something like the predictions DataFrame. If I just hit Shift+Enter here, it says it's a DataFrame, and we can check out the columns it has. We could just call printSchema() and check out what it has: it has the label, the features, the raw prediction, the probability, and the prediction. So in order to do this, we're going to need data for which we already knew the labels. In this case, we already knew the labels for the data.
So basically what this is saying is: here's the actual correct label, here are the features, here is the raw prediction value (which has to do with how logistic regression works internally), here's the probability for that prediction, and then here is what the model predicted. So what we really want to compare, at least in the most straightforward way, is: does this label match our actual prediction for it? So let's show these results. Hopefully we're not too zoomed in here. Okay, let me zoom out one more time and then show this again so we can see this nicely formatted. Oh, well, basically this is actually even more nicely formatted, because I can see label and then I can see prediction kind of stacked right on top of each other, which is actually pretty convenient. So I can see here my label is 0.0 and my prediction was 0.0 as well, and then one and one, etc. If you keep going down, it looks like we're doing a pretty darn good job of predicting the labels correctly, so what we can end up doing is checking how we can evaluate this. Now that we've seen the basic way to perform a logistic regression, let's expand on this documentation example a little bit and introduce the concept of evaluators. In order to really show that, what we're going to end up doing is grabbing my_data and splitting it into a training set and a test set, then running evaluate on the test set, and then passing that evaluation DataFrame into the evaluator itself. So we're going to scroll down here, and we're actually going to run a new cell. In this new cell, what I'm going to do is use randomSplit on all the data. Let's double check what it was called: my_data. So I'll say my_data.randomSplit() and do a 70/30 split. Then we'll say lr_train (for logistic regression) and lr_test, and set those equal to the result. And then what we're going to do is retrain just on the training data and then evaluate on our test data.
So if we come back up here, we're basically going to follow these exact same steps, except we're only going to fit on the training data instead of all the data. We'll come back up here and create a new model for this. We'll say final_model is equal to a LogisticRegression object, and then we'll call final_model.fit() on lr_train, and we'll call the result fit_final. So now I have that final model; it's fitted to that training data. Next I want the actual predictions and results, so what I'm going to do is say prediction_and_labels = fit_final.evaluate() on that test set. You've seen a really similar process back in the regression lectures, but hopefully these four cells basically show everything you need to know. You take all your data; once it's formatted well, you randomly split it into whatever ratio you want for your training data and your test data. You create your logistic regression model with any parameters you want. Then you fit that model onto your training data and set that as your fitted model. And then, using that fitted model, you can evaluate on the test set. This is everything shown in separate steps, but sometimes people like to combine these steps, so they'll call a method on a method. So I have this prediction_and_labels DataFrame that I'm about to create, and if we take a look at it, I can call various things off of it. But what I want to call is predictions, and then show the results here. I can see here I have my label, the features, the raw prediction value, the probability, and the prediction again. So essentially this is the exact same thing I had up here with the log summary, but what I did is I called evaluate on this test data. What that means is, since I'm on the test data, I may not get the label and prediction matching up perfectly.
Right now it looks like it's doing pretty well, especially because it's kind of just the documentation's own data and may be a perfect fit, but we're about to explore that. So what I want to do, to actually explore the evaluation of this prediction_and_labels DataFrame, is use an evaluator object. We'll scroll down here, and what we're going to do is say from pyspark.ml.evaluation import, and then we're going to be discussing both multiclass classification and binary classification. So the ones that may interest you (I'm just hitting tab to autocomplete there) are BinaryClassificationEvaluator and, let me put this on a new line so you can see it, MulticlassClassificationEvaluator. To have these on multiple lines, what we can do is just wrap them in parentheses. Now that we've imported both of these evaluator objects, I want to quickly hop over to the documentation so we can dive a little more into the details about BinaryClassificationEvaluator and MulticlassClassificationEvaluator. Let's start with binary classification. If you come over here under ml.evaluation, we have this BinaryClassificationEvaluator that takes in the raw prediction column (by default it expects it to be called rawPrediction), a label column (it expects it to be called label), and then some sort of metric name. By default, it expects areaUnderROC: that's the receiver operating characteristic curve. You can see it actually has an example where it creates a DataFrame for you to kind of play around with. So this is essentially just the basic evaluator for binary classification, and the raw prediction column can actually be of type zero or one for a prediction, a probability of label one, or a vector type: that is, a length-two vector of raw prediction scores or label probabilities. And we actually had all of those. If we hop back to our notebook and scroll up, we had some of those.
We had both rawPrediction, and we had the label and prediction. So we have this prediction column, which is just a zero or one, so we could use that as well. Coming down here, we want to explore the metric name. If we scroll down, we basically have two options, and that's going to be all the way under metricName right here. The metric name it expects is either the area under the receiver operating characteristic curve (areaUnderROC) or the area under the precision-recall curve (areaUnderPR). Those are the only two metrics it can return to you. Since it's the binary classification evaluator, it's a little more difficult to grab things like precision or recall directly from this specific evaluator. Now, speaking of grabbing things like accuracy, precision, or recall directly, we could use the MulticlassClassificationEvaluator for that. So what I'm going to do is hop over to the next tab I have open, which is the MulticlassClassificationEvaluator. On this evaluator, the prediction column can no longer just take a raw prediction; it needs the actual prediction column, which is going to be those zeros and ones (and since this is multiclass, also twos, and so on). You can see here it's basically building one for you with just the straight prediction and the actual label, and then the label column, which is the true label, as well as the metric name. So for the MulticlassClassificationEvaluator, if you begin to scroll down here, you'll eventually get to metricName, and it can return back the F1 score, a weighted precision score, a weighted recall score, or even an accuracy score. Those are all your options for dealing with the multiclass classification evaluator.
Now, they're a little complicated, so don't worry if you don't totally get the idea right now. We're really going to focus in on these evaluator objects a lot more when we discuss pipeline objects, because that's where they're really going to come into play. But for right now, what we can do is go ahead and just show you a very basic example of using one. So we'll come back to our Untitled notebook, where we have prediction_and_labels.predictions.show(). I'll come down here and use this BinaryClassificationEvaluator. I'll create my evaluator object: we'll say my_eval is equal to a BinaryClassificationEvaluator, and we'll just keep the actual defaults, since I have a label column and I have a rawPrediction column. We'll keep those defaults here and run this. The next thing I want to do is say my_eval.evaluate(), and I want to evaluate on this predictions DataFrame, so that was prediction_and_labels.predictions. We'll pass that in, and we'll call the result my_final_roc, or whatever you want to call it. We'll run that, and that may take a little bit of time depending on your actual computer. Then what we can do is check this out. So if we say my_final_roc, well, that happens to be one. What does that actually mean? It means the area under the ROC curve was 1.0, meaning this essentially was a perfect fit, and it predicted everything accurately. So is that realistic? Probably not, unless you have a lot of faith that everything was highly separable. In this case, this data was highly separable, and you could even see a hint of that: we were getting predictions that matched up to the labels perfectly, even with the train test split. So this wasn't a super good example of what my_final_roc would look like in practice, but it is an indication that we filled up the ROC curve and basically matched our prediction to every single label perfectly.
But what I really want you to get out of this is how to generally use evaluator objects. You'll say from pyspark.ml.evaluation import BinaryClassificationEvaluator or MulticlassClassificationEvaluator, and then for binary classification you can get back either the area under the ROC curve or the area under the precision-recall curve. For multiclass classification, you can get back accuracy, weighted precision and recall, etc. All right, I hope you found that useful. This was not a super realistic example, since we got everything perfect, so up next let's see a realistic example using a custom code along. Thanks, and I'll see you at the next lecture.