
Transcript:
Let's get started. This is a course on machine learning. We're going to be immersed in machine learning and we are going to learn a lot of things. So first of all, let's contextualize machine learning in the world of artificial intelligence. Artificial intelligence and machine learning are not new disciplines.
Machine learning is a branch of artificial intelligence that concerns itself with learning from large quantities of data. Deep learning is a subset of machine learning that uses a type of model called a neural network to learn from data. What is machine learning? A definition that dates back to the very early days of machine learning is: a field of study that gives computers the ability to learn without being explicitly programmed.
This refers to using data as examples for the computer to learn from. A much more recent definition, which I heard at a conference last year, is: it's a new way of computer programming. And this is really the paradigm shift that we're witnessing these days. More and more, the systems and software we use have components that are not explicitly programmed, but are learned from a large amount of data.
And we see this in many different applications. For example, if you've ever bought a book on Amazon, when Amazon recommends you other products to buy, that's an example of machine learning applied to a web interface. You're buying a book and the computer has learned from your purchase history, as well as other customers' purchase histories, what could be the things you are interested in buying next.
Another very common application is, for example, sentiment analysis, which is used a lot by financial traders to gauge the sentiment of markets and then trade stocks. This involves analyzing large quantities of text and, for each piece of text, deducing or extracting the associated sentiment: is it positive or negative? Another domain where sentiment analysis is used a lot is customer service: when customers leave reviews about your products or services, you can analyze whether they're happy or not and correct your actions accordingly.
Two more very commonly known examples are image classification and machine translation. Machine translation is probably the most common example of machine learning applied to text, where the input is a sentence in one language and the output is a sentence in another language.
There are many other applications: predicting the price of houses, intrusion detection, fraud detection, classifying documents, analyzing application logs or social media. Machine learning is applied more and more frequently by companies and other institutions in many, many different domains.
Transcript:
What are the enablers of machine learning? Machine learning is fundamentally enabled by two big revolutions. One is data. The techniques in machine learning are not new. For sure, some new techniques have come into use recently, but overall they are not new. A lot of the material and tooling we are using dates back decades, if not half a century. However, what's new and only happened in the last 10 years or so is the massive amount of data we've been collecting from a number of sources, including phones, applications, sensors, cameras, microphones, logs, et cetera.
The other is the compute power that's increasingly available to us in the form of cloud computing. Cloud compute can be CPUs, GPUs, or even TPUs, which are chips designed specifically for machine learning. These two things combined, the large amount of data together with the large amount of compute power, have enabled modern machine learning on a variety of different data types, including tabular data, time series, documents, images, sound, and video.
Transcript:
In the previous class, we talked about machine learning and saw that machine learning is a branch of artificial intelligence that is concerned with learning from data. And in particular, what we said is that it also can be considered a way of computer programming.
Machine learning systems are systems that learn from data and therefore improve their performance as we feed them more and more data. So there is a relationship between the amount of data available and the performance of the machine learning system. Machine learning nowadays is quite a well-developed field, but there are three techniques that are the most popular and also the most widely used in industry.
And even if you're a complete novice to machine learning, I want to assure you that you are already familiar with all three of them. And that's because you have a brain and your brain is an amazing pattern recognizer. So I'm going to go over the three main techniques and show you that your brain is capable of recognizing the patterns that we're asking the machine to recognize.
For example, if you look at this chart, what is the pattern that comes immediately to mind? This is a chart displaying humans, so each human is represented by a dot and for each human, we know the height and the weight. And the immediately visible pattern is there is a correlation between height and weight.
Okay. So taller people are on average also heavier, and we can represent that with a straight line going upwards to the right. That's what we call the line of best fit, and I'm sure you're all very familiar with this. This is an example of a regression, and it's called a regression because the output, in this case the weight, is a continuous variable. We are predicting numbers, so we apply a regression technique.
On the other hand, do you see any pattern in this figure? What is represented here? What you can see here is internet service providers. Each dot represents one ISP, and they're characterized by two variables, or features: the download speed and the upload speed.
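To make the line of best fit concrete, here is a minimal sketch in Python. The heights and weights below are invented for illustration (they are not the data on the chart), and NumPy's polyfit is just one of several ways to fit a straight line.

```python
import numpy as np

# Hypothetical heights (cm) and weights (kg), invented for illustration only.
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights = np.array([52.0, 58.0, 67.0, 75.0, 84.0])

# Fit a degree-1 polynomial: weight is approximately slope * height + intercept.
slope, intercept = np.polyfit(heights, weights, deg=1)

# The prediction for a new height is a continuous value (a number).
predicted_weight = slope * 175.0 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_weight:.1f}")
```

A positive slope is the "line going upwards to the right": taller people are predicted to be heavier.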
And there is a third element, which is the shape of the symbol or the color that kind of groups them into two classes. One is the class of fast ISPs, and one is a class of slow ISPs. So if we were to separate them or look for a boundary that would separate them, we would probably draw a line like this, or a similar line that crosses and separates the fast from the slow.
So notice that we're still predicting something here, but unlike the previous case, we are now predicting a categorical quantity. Okay. We're predicting a quantity that is discrete in nature: it's either fast or slow.
The third technique that we're looking at is represented in this chart. Do you see any pattern? Here, people typically say there are two groups. Each dot is a flower, and we know two features of these flowers, the sepal length and the petal length. But there are clearly two groups with distinct behavior: one group of flowers at the bottom, whose petal length is more or less constant, and another group, circled in green here, where the petal length and sepal length are correlated. So these three techniques go by the names of regression, classification, and clustering, and they account for 90% of machine learning commercially used and applied in everyday products. So these are the three main techniques, and they are the three techniques that we will cover in this and the next two lectures.
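As a rough sketch of predicting a category, the example below trains a classifier on made-up ISP speeds. The numbers are invented, and logistic regression is my choice of a model that draws a straight-line boundary; the lecture does not say which model produced the line on the chart.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical ISPs: [download speed, upload speed] in Mbps, invented for illustration.
X = [[10, 2], [15, 3], [20, 4], [12, 1],
     [90, 20], [100, 25], [110, 30], [95, 22]]
y = ["slow", "slow", "slow", "slow", "fast", "fast", "fast", "fast"]

# Logistic regression learns a straight-line boundary between the two classes.
classifier = LogisticRegression()
classifier.fit(X, y)

# The output is a category, not a number.
predictions = classifier.predict([[90, 18], [12, 3]])
print(predictions)
```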
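The third pattern, two groups found without any labels, can be sketched as follows. The flower measurements are invented, and k-means is one common clustering algorithm that I am assuming here; the lecture does not name a specific one.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical flowers: [sepal length, petal length] in cm, invented for illustration.
flowers = np.array([
    [4.6, 1.4], [5.0, 1.5], [5.4, 1.6], [4.8, 1.3],   # short, nearly constant petals
    [5.8, 4.0], [6.3, 4.9], [6.9, 5.4], [7.4, 6.1],   # petal grows with sepal
])

# No labels are given: we only ask for two groups of similar points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(flowers)
print(labels)
```

The cluster numbers themselves (0 or 1) are arbitrary; what matters is which points end up together.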
Transcript:
I want to zoom out a little and talk about different types of machine learning, because there are roughly speaking three families of machine learning. There are actually a few others, but the common ones are three. The most common by far is supervised learning of which we've just seen two examples.
This is when you task a machine with learning from data in the form of input/output pairs, that is, examples with a ground truth answer. Classification and regression are the two main techniques used in the realm of supervised learning, when you want the machine to learn from examples. Practical applications of these are things like spam detection, fraud detection, image recognition, forecasting future values of a time series, and so on. So anything where you're taking data and trying to predict a given output is called supervised learning.
And this is in contrast with unsupervised learning, which is when your goal is not to learn from examples, but to find relationships in the data and represent them in a way that gives you a deeper understanding of the data itself.
So the most commonly used technique here is clustering, where you're grouping data based on similarity. Okay. Practical applications of this are customer segmentation, log analysis, discovery of new diseases, and so on.
A third type of learning that's increasingly relevant, especially for companies that have lots and lots of data, is reinforcement learning.
This is when, instead of giving the machine static input/output pairs, like in supervised learning, we offer the machine an environment in which it can control an agent. And this environment/agent pair allows the machine to learn by trial and error. So the agent can take actions in a certain environment, and the environment will punish or reward the agent based on the action taken. This is used in simulation environments.
For example, video games. And the most common application of this is robotic control. So if you want to build a robot that can navigate an environment, for example a self-driving car, you will apply reinforcement learning in a simulated environment first and then, once the agent has learned to drive, deploy it on a car, and hopefully it will not crash. This course is not about reinforcement learning. So we will actually focus for the remainder of the course mostly on supervised learning and a little bit on unsupervised learning.
Transcript:
Supervised learning is when humans provide labels to the machine, telling it what answer it's supposed to give. And this is typically a column in our data set. So for example, let's say we had two types of images, cats and dogs, and we wanted to recognize the images. We would have to label each image by telling the machine whether it's an image of a cat or an image of a dog, and we would have to do that manually.
Typically there is a human or a team of humans providing the labels. And the goal here is to generalize from examples. So we will evaluate the performance of the machine on how well it can recognize cats and dogs in previously unseen images. Typical examples of this are spam detection, forecasting, and any algorithm where you're asking the machine to predict a quantity.
In contrast unsupervised learning is when you don't have labels and your goal is not to predict something, but to discover something, okay. Understanding your data at a deeper level. For example, here we are applying clustering, which is an unsupervised learning technique to find out that there are four clusters in our dataset of similar points, and one common application of this is customer segmentation.
Transcript:
So when you're faced with a problem in machine learning and you're asking yourself which technique you should use, how do you go about deciding? You ask a series of questions. The first question is: do I have labels? Am I trying to predict something for which I already know what the answer should be?
If the answer is yes, then I'm in "supervised learning world". If the answer is no, then I'm in "unsupervised learning world". So let's say we've answered yes, and we are in supervised learning mode. Then the second question we ask is: are these labels categories? Yes or no. If they are categories, then we have a classification problem.
If they are not categories, meaning they are numerical, we have a regression problem. Conversely, if we are in unsupervised learning, the next question we're going to ask is: are we looking for groups? If we are looking for groups, then we're probably doing clustering.
If on the other hand, we're not looking for groups, that's everything else that is not one of these three techniques. So this is a very quick map of how to select the best technique in machine learning. And we are going to go down the path of supervised learning with numerical labels to talk about a regression problem.
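The series of questions can be written down as a small function. This is only a sketch of the decision map described here; the function name and argument names are hypothetical.

```python
def choose_technique(has_labels, labels_are_categories=False, looking_for_groups=False):
    """Return the technique suggested by the lecture's decision map."""
    if has_labels:
        # Supervised learning: categories mean classification, numbers mean regression.
        return "classification" if labels_are_categories else "regression"
    # Unsupervised learning: groups mean clustering, anything else is out of scope here.
    return "clustering" if looking_for_groups else "other unsupervised technique"


print(choose_technique(has_labels=True, labels_are_categories=False))
```

For example, predicting house prices means having numerical labels, so the function returns "regression", which is the path followed next.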
Transcript:
So we are going down the path of having supervised learning and predicting numbers. And we're going to start with the simplest possible regression, which is a linear regression. So let's take a practical example. Let's say we have a data set of housing data. We have houses, each with a certain size in square feet, and their respective prices in thousands of dollars.
And you can represent these as data points here on a plot. Okay. So we would like to find the line of best fit that allows us to predict the price of a house given its size. By looking at this data, we can make what's called a hypothesis. Okay. And in this case, the hypothesis that we're formulating is that there is a function connecting the input X, in this case the size, to the output Y, in this case the price.
And furthermore, we are making the hypothesis that this relationship between inputs and outputs is a linear equation. It's an equation of the form Y = b + X w. So there are two parameters that control our model: b, which is the intercept, or the price of a house of zero size, and w, which is the rate of increase of the price as the size of the house increases. So the question went from "let's find the line of best fit" to "let's find the optimal values of b and w". And this is actually solvable exactly: you could invert this equation and find the exact values of b and w. However, I'm going to solve it in a different way.
We need to define what's called a loss, or a criterion, to know whether our solution is good or not. We'll do so by looking at how far our predictive model, the line, is from the actual data. We define this as a residual: it's the difference between the actual value and the predicted value. And then we'll sum the squares of all of the residuals, or actually average them, in what's called a cost function, here the mean squared error. So we are averaging the squares of the residuals, and we are using that cost function as our guiding light to find the best values of b and w.
That's because b and w are both inside the prediction. And so when we try to minimize the mean squared error, we're actually changing the values of b and w.
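Here is a small sketch of the mean squared error as a function of b and w. The sizes and prices are invented so that the arithmetic is easy to follow, and the function name is my own.

```python
import numpy as np

def mean_squared_error(b, w, X, y):
    """Average of the squared residuals for the line b + X * w."""
    predictions = b + X * w
    residuals = y - predictions      # actual value minus predicted value
    return np.mean(residuals ** 2)

# Hypothetical sizes and prices, invented so that price = 1 + 2 * size exactly.
X = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

cost_perfect = mean_squared_error(b=1.0, w=2.0, X=X, y=y)   # line passes through every point
cost_worse = mean_squared_error(b=0.0, w=2.0, X=X, y=y)     # every residual equals 1
print(cost_perfect, cost_worse)
```

Changing b or w changes the predictions, and therefore the cost, which is why minimizing the cost amounts to searching for the best b and w.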
Transcript:
So how do we do that? Well, we'll start with random values, a random initialization for both b and w. And then we will calculate the residuals for that particular combination of b and w. This will give us a value for the cost, the mean squared error, that we could, for example, plot as a function of one particular parameter, either b or w.
Next, what we're going to do is change that parameter in the direction of decreasing cost. This is done through a procedure called gradient descent. After one step, we recalculate the cost and also the gradient, and this tells us in which direction we should take the next step in changing the parameter.
And we can keep going with this procedure until we reach a point where we have found the value of that particular parameter for which the cost is minimal. And this should correspond to the line of best fit. This will exactly correspond with the line of best fit in the case of a simple unidimensional linear regression, that's because we only have two parameters and the minimum is well-defined, but also in higher dimensional cases, we will have an exact solution.
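The procedure can be sketched in a few lines of NumPy. The data, the learning rate, and the number of steps below are all assumptions chosen so that this toy example converges; they are not values from the lecture.

```python
import numpy as np

# Hypothetical data, invented so that the true line is y = 1 + 2 * x.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * X

rng = np.random.default_rng(0)       # fixed seed so the run is reproducible
b, w = rng.normal(size=2)            # random initialization of both parameters
learning_rate = 0.05                 # step size, chosen by hand for this data

for step in range(5000):
    residuals = y - (b + w * X)
    # Gradient of the mean squared error with respect to b and w.
    grad_b = -2.0 * np.mean(residuals)
    grad_w = -2.0 * np.mean(residuals * X)
    # Move each parameter a small step in the direction of decreasing cost.
    b -= learning_rate * grad_b
    w -= learning_rate * grad_w

print(f"b={b:.3f}, w={w:.3f}")
```

On this noiseless data the loop ends up very close to b = 1 and w = 2, the line that generated the points.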
So when we have multiple features, what we'll do is simply extend the model to multiple parameters. So for example, here, let's say we want to predict the price of the house as a function of not only the size of the house, but also the age and the number of rooms.
What we will do is build a model where the prediction is a constant plus a weighted sum of all the features. So feature one times weight number one, plus feature two times weight number two, and so on, with one weight for every feature. Okay. If we know that the relationship between inputs and outputs is nonlinear, we can still build a linear model and use a linear regression; however, this time we use nonlinear features, for example polynomials. So instead of having just b + X w, we have b + w1 X + w2 X^2 and so on, up to the power of order N of the feature. Okay. So we can do a nonlinear regression using a linear combination of nonlinear features.
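A minimal sketch of a linear regression on polynomial features is shown below, assuming Scikit Learn's PolynomialFeatures to build the powers of X. The data is invented so that the expected weights are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical nonlinear data, invented so that y = 1 + 2x + 3x^2 exactly.
x = np.arange(-3.0, 4.0).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 3.0 * x[:, 0] ** 2

# Build the nonlinear features [x, x^2]; the model itself stays linear in its weights.
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
print(model.intercept_, model.coef_)
```

The fitted intercept plays the role of b, and the two coefficients play the roles of w1 and w2.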
Transcript:
How do we know if our model is any good? Well, we're starting to dip our toes in machine learning. And as you shall see throughout the course, there are actually many steps in the machine learning process. We are basically at step number three right now, which is the model building phase. In this case, we don't even have a neural net; we have a much simpler model, which is a linear regression, but the principle still applies. We have collected some data, labeled the data, and probably processed it. And we'll talk more about processing and data cleaning. We're at the point where we're training a model, and we want to evaluate the model next.
So how do we do this for the case of a regression and for the general case of any model? Well, the question we have, ultimately is does the model generalize well on new data? That's half of the question. And the second question is, does the model perform better than a dumb guess or, or a random choice?
Okay. So we need to establish a baseline and we need to compare scores across different subsets of the data.
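One way to establish such a baseline, sketched here under my own assumptions, is Scikit Learn's DummyRegressor, which always predicts the mean of the labels. The data is invented, and both models are scored with the default score method, which for regressors returns the R squared metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical noisy linear data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 5.0 + 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

# The "dumb guess": always predict the mean of the labels.
baseline = DummyRegressor(strategy="mean").fit(X, y)
model = LinearRegression().fit(X, y)

# score() returns R squared; the mean predictor scores 0 on the data it was fit on.
baseline_score = baseline.score(X, y)
model_score = model.score(X, y)
print(baseline_score, model_score)
```

A model is only worth keeping if its score is clearly better than that of the dumb guess.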
Transcript:
The second point I mentioned is how do you know that your model is generalizing well? The technique used to assess that is called a train/test split. So we take the data and split it into two parts, typically 80% for training and 20% for testing. Then we train the model on the training set, evaluate it on both the training set and the test set, and compare the two scores.
This is because we hope that the performance of the model on the test set is equivalent to its performance on the training set. This will tell us that the model is good at generalizing what it has learned on the training set to previously unseen data. If that's not the case, and when we compare the training score with the test score the test score is worse, that probably means the model is overfitting, meaning the model has learned the training data really well but is not able to generalize to unseen examples.
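The comparison can be sketched as follows. The data is invented, and the unlimited-depth decision tree is a model I am adding purely to show what overfitting looks like; it is not part of the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy linear data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 5.0 + 3.0 * X[:, 0] + rng.normal(0, 5, size=100)

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)  # unlimited depth

for name, model in [("linear", linear), ("tree", tree)]:
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"{name}: train={train_score:.2f} test={test_score:.2f}")
```

The tree memorizes the training rows, so its training score is essentially perfect while its test score is lower; that gap is the signature of overfitting.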
Transcript:
The library we're going to use in the next lab is called Scikit Learn. This is a very good library for taking your first steps in machine learning, because it's simple and efficient to use for predictive data analysis, and it's built on top of solid libraries such as NumPy, SciPy, and Matplotlib. It has wonderful documentation, and I encourage you to take a look at it. It covers the three techniques that we're interested in here, which are classification, regression, and clustering. There are also a lot of other things that we can do with Scikit Learn, including dimensionality reduction, model selection, and preprocessing of the data.
This library is widely adopted and fairly mature. Its development started over 10 years ago, and it's used by almost a hundred thousand people.
Transcript:
The basic component of most Scikit Learn objects is called an estimator. An estimator object takes a few pieces of information as input and returns some output. What are the pieces of information? Well, it definitely will depend on data; as we've seen, machine learning is based on data. And so data will play an important role in what the estimator does.
Estimators may also depend on one or more parameters, which are kind of like settings that the estimator has and that tell the estimator what to do with the data. To be more granular, these could be parameters that are modified during a training procedure, or they could be fixed settings; the latter are also called hyperparameters. And the behavior of the estimator will also depend on some randomness in some cases. Therefore it will be important to set the seed of the random number generator in order to obtain reproducible results. These three components combined will generate some output.
And I can see that this is very abstract. So let's look at a few examples of estimators. Here's an example of one family of estimators called predictors. These classes all implement a few common methods like fit, predict, and fit_predict. fit is the method that trains the predictor: it receives the input data and allows the estimator to learn. predict, as the name says, asks the predictor to formulate a prediction on new data, and fit_predict does the two things at the same time. Inheriting from this class we have classifiers and regressors, which are the two most common, but also other objects, for example outlier detectors and clusterers.
The important thing about predictors is that they go from a feature space to a prediction space of some sort. Here is an example of how we would use a predictor estimator. We instantiate the estimator, in this case a random forest classifier, then we fit this model on features and labels, and then we use the trained model to predict on something. This is a very typical workflow in Scikit Learn.
How about a different type of estimator, the transformer? The transformer implements three common methods. One is fit, which is similar to the previous one: it learns from data. But then, instead of predicting, it transforms the data, and as usual fit_transform is the combination. So these estimators take input data and return new data.
These could be, for example, new features in the case of a scaler that takes data and changes its scale. For example, imagine you have a data set where one column is measured on a scale that is much, much bigger or much, much smaller than the others. You may want to rescale that column by dividing by a common factor or subtracting the mean or something like that.
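The slide itself is not part of this transcript, so the following is only a sketch of what such a workflow might look like. It assumes the iris dataset bundled with Scikit Learn as a stand-in for the features and labels.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset: features are flower measurements, labels are species (0, 1, 2).
X, y = load_iris(return_X_y=True)

# 1. Instantiate the estimator; random_state fixes the seed for reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 2. Fit the model on features and labels.
model.fit(X, y)

# 3. Use the trained model to predict.
predictions = model.predict(X[:5])
print(predictions)
```

Here n_estimators is an example of a fixed setting (a hyperparameter), while the trees inside the forest are what gets learned during fit.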
But there are other types of transformers that perform feature extraction and vectorization. These typically are transformers that start from raw data and return some sort of feature. Typically these are used, for example, to encode features from text, you would have texts coming in and numbers coming out of the vectorizer or the feature extractor.
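As a small illustration of text coming in and numbers coming out, here is a sketch using Scikit Learn's CountVectorizer, which is one possible vectorizer and my own choice for this example. The two documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, invented for illustration.
documents = ["the cat sat", "the dog sat down"]

# Text comes in, a matrix of word counts comes out.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents).toarray()

# Columns are ordered alphabetically by word.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts)
```

Each row is one document and each column counts how often one word appears in it.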
Here is an example of a transformer. You instantiate the transformer and then you use the fit method to have that transformer learn from your data. And then you would use a transform method on the data or on new data to obtain a different version of the data.
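A sketch of that fit and transform sequence, assuming a StandardScaler as the transformer and a tiny made-up array as the data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second column is on a much larger scale than the first.
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

scaler = StandardScaler()
scaler.fit(X)                      # learns the mean and standard deviation of each column
X_scaled = scaler.transform(X)     # subtracts the mean and divides by the standard deviation

print(scaler.mean_)
print(X_scaled)
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates just because of its units.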
Transcript:
Transformers and predictors can be combined in what are called pipelines. Pipelines are interesting objects because they combine both the transformations and the predictions in a single object that respects the same API, which means that the pipeline object will expose a fit method, a predict method, and a fit_predict method.
But when we call, for example, pipeline.fit, what will happen under the hood is that the first transformer is trained on the data, transforms it, and passes it along the pipeline to the next transformer, which will also learn from the data and then transform it, pass it along, and so on until we get to the estimator part.
And the estimator part will then generate predictions or, in this case, since we called fit, the estimator part will just learn. Same thing for predict: predict will call the whole chain of transformations without fitting, and then call predict at the end. So here is an example of what a pipeline would look like.
We have a pipeline made of two steps. A standard scaler is the first step and a logistic regression is the second step. So the first step is a transformer and the second step is a predictor, and calling model.fit in this case will train the standard scaler, transform the data with the standard scaler, and then train the logistic regression.
And calling model.predict will call transform on the first one and then predict on the second one. So pipelines are really useful objects that allow us to combine multiple operations.
Since we're talking about regressions, there are quite a few regression models available in Scikit Learn, starting from the simple linear regression and going on to fancier ways of doing regressions, for example regularized regressions, or robust regressions that are less sensitive to noise and outliers.
Here are a few links to the documentation, which I strongly encourage you to take a look at, together with a repository called awesome Scikit Learn, which contains a lot of links and references to projects that involve Scikit Learn and expand its capabilities, as well as examples or papers that use Scikit Learn in their results.
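A two-step pipeline like the one described might be sketched as follows. The data and the step names are invented for illustration, since the slide's code is not in the transcript.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features per row, labels 0 or 1, invented for illustration.
X = [[10, 2], [15, 3], [20, 4], [12, 1],
     [90, 20], [100, 25], [110, 30], [95, 22]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Step names ("scaler", "classifier") are arbitrary labels chosen for this sketch.
model = Pipeline([
    ("scaler", StandardScaler()),          # step 1: a transformer
    ("classifier", LogisticRegression()),  # step 2: a predictor
])

model.fit(X, y)                                    # fit and transform with the scaler, then fit the classifier
predictions = model.predict([[90, 18], [12, 3]])   # transform with the scaler, then predict
print(predictions)
```

Because the scaler lives inside the pipeline, new data passed to predict is rescaled with the statistics learned during fit.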
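To give a feel for these variants, the sketch below compares an ordinary linear regression with Ridge (a regularized regression) and HuberRegressor (a robust regression). Picking these two classes is my assumption, as the lecture does not name specific models, and the data with its single outlier is invented.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge

# Hypothetical data, invented for illustration: y is roughly 1 + 2x, plus one large outlier.
rng = np.random.default_rng(0)
X = np.arange(10.0).reshape(-1, 1)
y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 0.5, size=10)
y[9] = 100.0   # the outlier

ols = LinearRegression().fit(X, y)      # ordinary least squares
ridge = Ridge(alpha=1.0).fit(X, y)      # regularized: penalizes large weights
huber = HuberRegressor().fit(X, y)      # robust: less sensitive to the outlier

print("ols slope:  ", ols.coef_[0])
print("ridge slope:", ridge.coef_[0])
print("huber slope:", huber.coef_[0])
```

The outlier drags the ordinary least squares slope well away from 2, while the robust fit stays closer to the trend of the other points.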
So it's a very good starting point to dive deeper into this awesome library.
Welcome to your first lab. You can access the lab by clicking on the link provided, and this will bring you to a Google CoLab session. Google CoLab is a free application that allows you to interact with code and annotate it in a very easy and convenient way. You can access it by clicking on the link or opening your Google drive and looking for CoLab.
The first thing you should do is create a copy of this exercise on your Google drive. This may ask you to log into your Gmail account and make a copy of the notebook. In this way all the changes you make to the notebook will be stored on your local copy.
Google CoLab is an application based on Jupyter notebook, which is open source software that allows you to create these interactive documents called notebooks.
If you prefer to work in Jupyter on your local machine you can simply download a copy of the exercise in the ipynb format, and this will allow you to run the notebook on your local machine.
So how do you work in CoLab? The fundamental thing you can do is execute a code cell. Okay. So here, for example, I have a cell that contains a comment in Python and I can execute it by clicking the play button. And this first cell should just produce nothing, no errors, just, stop spinning at some point. However, something more interesting happens when you start to execute the other cells. Here there is a cell that imports a number of libraries, and if we execute this cell, it will import functionality and you may see some warning message here.
So, how does this lab work? You're encouraged to read through it and try to do what is asked of you. For example, here are some keyboard shortcuts that you can try.
For example, if I hit Cmd+M followed by H, it will open the CoLab help with all of the keyboard shortcuts that you can access. And you can try other commands, like Ctrl or Cmd+M followed by A, or Ctrl or Cmd+M followed by D.
Read through the code. You can save your work in your Google drive; as a matter of fact, to execute the notebook, you will be asked to create a copy of this notebook in your Google drive, so it will be saved there. And the goal here is to read through the code and try to do some of the exercises. Whenever you see a cell containing code, you should hit play or, better, hit Shift+Enter, which will also execute the cell. So I can hit Shift+Enter multiple times and keep executing the cells.
Transcript:
Let's take a look at this lab. As usual, you have here the documentation for all of the libraries that we're going to use, including the documentation for Google CoLab. Let's take a look at the code. First we import a few libraries, and you can see NumPy, Pandas, matplotlib.pyplot, and Seaborn.
These are the libraries we will use to do data manipulation and plotting. Next we create a Pandas data frame, loading data from a URL that's defined here. We load this particular CSV file with the read_csv function. The read_csv function can take a file in the local file system as well as a remote URL, and load directly from that URL.
So we load the data into a Pandas data frame, and with the head method we display the first five rows of the data. You can see there are three columns: gender, height, and weight. Here we have a categorical column and then two numerical columns. For this particular exercise, we will ignore the categorical column and just focus on the two numerical columns. The scatterplot function from Seaborn allows us to display the numerical data in a scatter plot. So we have the height and the weight assigned as the X and Y coordinates in the scatterplot. And what we understand here is that this is a plot of human beings, each represented by two attributes, the height and the weight.
And what we're going to do is build a simple linear regression model that tries to predict the weight as a function of the height. So how do we do that? First, we import a couple of things from the Scikit Learn library. Notice there is a quirk here: the Scikit Learn package actually has a different name in the Python package namespace, where it's called sklearn. So the library's name is Scikit Learn, but we import from sklearn. We import the LinearRegression class, which will be our model, as well as the train_test_split function to split our data into two subsets, a training set and a test set. We then define the features and the labels, or targets, from the data frame itself.
And you see two things here. For the features, we need a two-dimensional object, so we use the double bracket to create a one-column data frame. Whereas for the labels, we are okay with a one-dimensional object, so we can use a single bracket, which leaves us with a Pandas series. Also notice that I am actually extracting the NumPy values out of the data frame.
So just to explain what's happening here, let's create a new cell and print out the values of this. Here we have a single-column data frame with the height. And if we use the single bracket, it would be a series. Regardless of which of the two we use, if you append .values, what you obtain is a NumPy array with the values, without the index that's typical of Pandas. Okay. So it's best practice to use NumPy arrays instead of Pandas objects, especially if you work with other libraries that are not Scikit Learn. Scikit Learn actually works really nicely with Pandas, so if you want, you could pass the Pandas objects directly and it would work just the same.
The advantage of doing the second feature definition would be that your X and Y variables carry the information about columns and rows coming from the original data frame. Okay. So, once we have defined the features and the labels, we split the data into two parts using the train_test_split function. We create a test set of size 20%, so 0.2 stands for 20% here.
And one important thing is that we set the random state of the train_test_split function. So let's take a look at what that means. The random state, as you can see in the documentation, is the seed used by the random number generator. What does this mean? It means that if you and I both set the random state to zero, we would get exactly the same "random" split: you and I would have the exact same test and train subsets. So it's a way to make the train/test split deterministic. We get a sample of 20%, which is randomly selected, but it's always the same random selection. And this is good for reproducibility of results.
Once we have defined the train and test sets, we create an instance of the LinearRegression class and call it model. And then here is where the machine learning happens: model.fit on the training features and the training labels. This operation takes the linear regression and finds the values of b and w, as we discussed in class. Once we have trained the model with the fit method, we can call the score method on the training set or on the test set to assess the R squared score of the model.
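The difference between the double and the single bracket can be checked in a new cell. The sketch below uses a tiny made-up data frame, and the column names "Height" and "Weight" are assumptions about how the lab's CSV is labeled.

```python
import pandas as pd

# Hypothetical rows; the column names "Height" and "Weight" are assumed here.
df = pd.DataFrame({"Height": [170.0, 180.0, 165.0], "Weight": [68.0, 80.0, 60.0]})

X_frame = df[["Height"]]        # double bracket: one-column DataFrame (two-dimensional)
y_series = df["Weight"]         # single bracket: Series (one-dimensional)

X = df[["Height"]].values       # NumPy array of shape (3, 1), no Pandas index
y = df["Weight"].values         # NumPy array of shape (3,)

print(type(X_frame).__name__, type(y_series).__name__)
print(X.shape, y.shape)
```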
Since in this case it's a regression, Scikit Learn knows that the correct metric to use is the R squared score. So when you call .score, it will automatically use the R squared metric for the regression, as you can see in the documentation.
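A quick way to convince yourself of this is to compare the score method with the r2_score function from the metrics module. This is a small check on synthetic data, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data, made up for this check.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)
score_method = model.score(X, y)               # what .score returns for a regressor
score_metric = r2_score(y, model.predict(X))   # R squared computed explicitly
print(score_method, score_metric)
```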
So we want to compare the score on the training set, which is 0.85, with the score on the test set, which is also almost 0.6. So the two scores are very close to one another. That's a good sign that the model is generalizing well from data that it has seen, the training data, to data that it has never seen, the test data. All right. We can also call model.predict on the test data and save the results of the predictions in a variable.
And then do two plots to see how our model is doing. The first plot is the one you're most likely more familiar with, where we plot the original data, the test data, and overlay a line through the features in the test set and the predicted values on the test set. That's the red line you see here, and that's the line of best fit.
You can also see the coefficients here: coef_ corresponds to the w in our notation and intercept_ corresponds to b. A more interesting plot, because it generalizes well when you have multiple features, is a scatterplot of the true versus the predicted values.
So here in this plot, both axes show the same quantity, in this case the weight, which is our target variable, with the true values along the horizontal axis and the predicted values along the vertical axis. And this is interesting because even if you had many, many features in input, in which case you obviously couldn't draw a line of best fit because your input space would be multi-dimensional, you can always do this plot of actual versus predicted and compare them. If your model is perfect, the points should align on a diagonal line.
Transcript:
So in this exercise, you are asked to extend what you've learned for a single variable regression to a dataset with many features. In this case, we have three features and a label. So you're asked to load the data set housing_data.csv, visualize it, using a pair plot from Seaborn. Add more column features to the definition of the feature array.
Then train and evaluate the linear regression model. Compare the predictions and decide if your score is good. You can also change the random state in the train_test_split function to see if the score changes or not. So we're going to do all of that. Loading the data is quite simple. We still use the pd.read_csv function and we pass the URL plus the file name. We see that there are four columns. We are going to use the first three as features and the last one as label. Next, we check the column types and see they're all integers. We also notice that there are quite few values: there are only 47 rows in our dataset, which is going to be a problem.
Seaborn's sns.pairplot allows us to gauge the correlation between the various columns, and what we can see very clearly is that the price seems to be correlated with the size of the house in square feet. As you can see, bigger houses are also more expensive on average. The number of bedrooms also seems to be somewhat correlated with the price: as the number of bedrooms goes up, the price also seems to go up. However, the age does not seem to show any such correlation. There are old and new houses both in the low and in the high price range. So this could inform us as to which features will be important in our model and which will be unimportant.
Alright. Next we define the features and labels arrays. Notice that I decided to perform a dropping operation here. So instead of selecting the columns, I told Pandas which column to drop from the original data frame. I'm dropping price along the axis of the columns, meaning I'm retaining all of the other columns.
Remember that axis equal to zero is the rows and axis equal to one is the columns. And I'm getting the values. Then I do a train/test split very much like before. The rest is exactly the same as the previous code: I create a linear regression, fit it on the training features and labels, and then score the model on the training data and on the test data.
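Here is a small sketch of the dropping operation. The four rows and the column names are invented for illustration; only the pattern (drop the label column along axis 1, then take the values) comes from the exercise.

```python
import pandas as pd

# Hypothetical rows; both the values and the column names are assumptions.
df = pd.DataFrame({
    "sqft": [2104, 1600, 2400, 1416],
    "bdrms": [3, 3, 3, 2],
    "age": [70, 28, 44, 49],
    "price": [399900, 329900, 369000, 232000],
})

X = df.drop("price", axis=1).values   # axis=1: drop a column, keep all the others
y = df["price"].values                # the label column on its own
print(X.shape, y.shape)
```

The advantage of dropping rather than selecting is that the code does not need to change if more feature columns are added later.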
Now immediately, you should notice that the score here is very different in the case of the training set with respect to the case of the test set. And this should be a big alarm in your mind signaling that you need to understand what's going on because it's a big jump from the training score to the test score.
To further investigate that, we can do some plotting: we generate predictions from the model and then do the actual versus predicted plot, which here is price versus predicted price. You can see that this line is highly influenced by a couple of outliers, and so it does not really overlap with the data well.
Also, one of the suggestions was to change the random state. So let's see what happens if we change the random state to something else. As you can see, the two scores change dramatically. So I changed it from one to zero. Now I'm going to select two, and here I get yet another result, where the test score is even better than the training score. So there seems to be a lot of fluctuation, and we're going to investigate that in the next exercise.
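The fluctuation is easy to reproduce on any small dataset. The sketch below uses 47 synthetic rows with three features (a made-up stand-in for the housing data) and loops over a few random states, so the exact scores will differ from the ones in the video.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 47 synthetic rows with 3 features; the third feature carries no signal.
rng = np.random.default_rng(42)
n = 47
X = rng.normal(size=(n, 3))
y = X @ np.array([3.0, 1.0, 0.0]) + rng.normal(0, 2.0, n)

test_scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))
    print(seed, round(model.score(X_tr, y_tr), 3), round(test_scores[-1], 3))
```

With only about ten points in the test set, which points happen to land there has a large effect on the test score.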
Transcript:
In exercise three, we are testing different models on a larger data set. We load the California housing data set from Scikit Learn. This is a data set that's provided within the library. And the goal is to define a function that trains the model and plots the predicted versus actual values.
For this exercise we are skipping the train/test split, and we are going to compare the performance of a few models. Here you are provided with the classes that Scikit Learn makes available. Now, one of the things that's amazing about Scikit Learn is that all of these models respect the exact same API, so they all implement the model.fit and model.score methods.
So it is actually very easy to define a function like this one that takes a model and performs all of the operations on it. We start by importing a number of classes: the fetch_california_housing function and a few models. We get the data with fetch_california_housing, and we set the labels using the target attribute of the dataset.
Now this dataset variable is a Scikit Learn Bunch, which behaves very much like a dictionary. So we extract the target attribute and give it the name y. Then we wrap the data in a Pandas data frame that has the dataset feature names as columns. This brings us back to our usual data frame format, and we then extract the features as the values of the data frame. So, this is what the data looks like. We have a number of features about the houses, and y contains the price. You can also print out the description of the dataset, which explains what the different features are.
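To show what "behaves very much like a dictionary" means without downloading anything, the sketch below builds a tiny Bunch by hand. It is only a stand-in for the object returned by fetch_california_housing: the values are made up and only two feature names are used.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

# Hand-built stand-in for the real dataset object; all values are invented.
dataset = Bunch(
    data=np.array([[8.3, 41.0], [7.2, 21.0], [5.6, 52.0]]),
    target=np.array([4.5, 3.6, 3.4]),
    feature_names=["MedInc", "HouseAge"],
)

y = dataset.target            # attribute access...
same_y = dataset["target"]    # ...or dictionary-style access to the same array

# Wrap the data in a data frame with the feature names as columns.
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
X = df.values
print(df)
```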
So make sure you read that. There are no missing values, so we are ready to roll. Here is the definition of the function. It will take a model, run the fit function on X and y, and then generate predictions using X, saving the result in a variable called y_pred. Then we take the code from above, where we do a scatterplot of the actual versus the predicted values.
We add a couple of labels and, finally, we take the minimum and maximum values and plot a red line. We also add a little title to make things clearer, which takes the class name of the model and prints it out as the title of the plot. Once we have this function, it's quite easy to iterate over the models and see what each of them does.
So for model in models, where models is a list of different models, we run the train_eval_plot function on the model. We can see that the linear regression has quite a broad spread and definitely doesn't do well at quite high values of price. The gradient boosting regressor also does not seem to be doing that well; as you can see, it's wrong here and here. The random forest regressor seems to be doing a little bit better: it's less spread out and it also follows the diagonal line a bit better. Ridge: not so great. Lasso: definitely not good. So I would say there is no ideal model, but the random forest regressor is probably the one that has learned to fit the data in the most accurate way. If you wanted to extend this, you could, for example, quite easily add metrics about how well your fit is doing to the title.
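A function along these lines could look like the sketch below. It is a reconstruction from the description, not the notebook's exact code: the data is synthetic instead of the California housing set, the model settings are arbitrary, and the R squared score in the title is the extension suggested above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in for the housing features and labels (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

def train_eval_plot(model, X, y):
    """Fit the model on X and y, then plot actual versus predicted values."""
    model.fit(X, y)
    y_pred = model.predict(X)
    fig, ax = plt.subplots()
    ax.scatter(y, y_pred, alpha=0.5)
    ax.set_xlabel("actual")
    ax.set_ylabel("predicted")
    lo, hi = y.min(), y.max()
    ax.plot([lo, hi], [lo, hi], color="red")   # perfect predictions lie on this line
    # Class name as the title, plus the R squared score on the same data.
    ax.set_title(f"{model.__class__.__name__} (R2 = {model.score(X, y):.2f})")
    return fig

models = [
    LinearRegression(),
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=50, random_state=0),
    Ridge(),
    Lasso(),
]
figures = [train_eval_plot(model, X, y) for model in models]
titles = [fig.axes[0].get_title() for fig in figures]
print(titles)
```

Because every model exposes the same fit, predict and score methods, the function does not need to know which model it receives. Keep in mind that without a train/test split these scores only say how well each model fits the data it has already seen.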
Transcript:
There's one last piece of information I want to share with you. Let's say our code has a bug. For the sake of simplicity here, I'll introduce just a typo, so I remove one letter. When we run our code, we get an error. Okay. Now, sometimes these error messages are quite easy to read. For example, in this case we would clearly understand what's going on. But in some other cases we want to debug what's going on. So the interesting thing to do is use the %debug magic command, which drops us into a Python debugger at the point where the last error was raised. It will point us to the line, and we can also inspect the values of the variables inside a function.
For example, we can check what y is, what y_pred is, and so on. When we're done and we think we understand what's going on, we can just type q to exit the debugger. So that's a neat trick. It's also not so well documented, so make sure you use it when you want to debug your code.
Transcript:
Classification is a supervised learning technique that deals with outputs that are discrete in nature. We call them categories. Okay. And it means the labels are not necessarily numbers, but they belong to a space that is discretized. The example that we've seen previously is that of the internet service providers that have two features, the download and upload speed and a label that categorizes them into two classes, the slow ISPs and fast ISPs. So here, the model is being asked to produce a boundary between the two classes or a rule that given a new internet service provider with certain upload and download characteristics, will be able to tell whether that's an ISP that is slow or an ISP that is fast. Okay.
So going back to our tabular data terminology, we've talked about features, records and labels. We are in the supervised setting, so we have labels. And the important thing here is that the labels are not numbers.
They are categories. Okay. If you're dealing with tabular data, the label is most likely also a column in the table, but you're not using it as an input to your model, and in that sense it's not a feature for your model. Features typically indicate the inputs to your model, whereas a label, ground truth or target refers to an output.
Okay. So even though it may be a column in the data, and there are cases where it's not represented as a column in the data, it's not used as a feature in the sense that it's not going to go into the input of the model.
Transcript:
All right. So let's see some examples of classification labels. Well, obviously the easiest and most common type of label is a binary label. That's a zero/one, yes/no, true/false type of label. Okay. There are two possible classes and it's either one or the other. An enormous variety of machine learning problems, especially applied machine learning problems, can be represented as binary classification. Think, for example, of some we've seen previously, like sentiment analysis: sentiment can be positive or negative. Those are two labels. Or it could be: is this email spam, yes or no? Is this credit card transaction fraudulent or not? So, all of these are binary classification.
A second example is multi-class classification. This happens when you have more than two classes. Okay. And there are a few ways to represent these labels, especially if we have classes that are mutually exclusive, meaning our samples can belong to one and only one class.
You can either indicate them as numbers like zero, one, two, that's the index of the class, or with letters, A, B, C, D, or with what's called one-hot encoding notation, where, as you can see here, the first class is represented by the vector [1, 0, 0], the second class by the vector [0, 1, 0] and the third class by the vector [0, 0, 1].
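As a small illustration, the same four labels can be written in each of the three notations with a few lines of NumPy. The label values are arbitrary.

```python
import numpy as np

class_names = np.array(["A", "B", "C"])
index_labels = np.array([0, 2, 1, 0])           # class index notation
letter_labels = class_names[index_labels]       # letter notation
one_hot = np.eye(3, dtype=int)[index_labels]    # one-hot notation: one row per sample

print(letter_labels)
print(one_hot)

# Taking the argmax of each row recovers the class index.
recovered = one_hot.argmax(axis=1)
print(recovered)
```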
So all these three notations are valid, and they represent the fact that the samples can only belong to one out of many classes. There's a third case that is still a classification, but it's often neglected, which is the case of non-exclusive classes.
These are also called tags and that's when a single sample can belong to more than one class. Think for example of the practical problem, where you have documents and you have a classifier that classifies them as personal or work. And, the same classifier also has a class for picture versus text.
And you have all possible combinations of personal picture, work picture, work text, and so on. So you have two outputs, picture / not picture, and work / not work. And those two outputs are somewhat independent. Each of them is a binary classification, but you want to output all of them at the same time.
This is to say the output is a vector, but there is not just a single one with the rest being zeros. There can be as many ones and zeros as you need. Every class is an independent binary classification. That's called multi-class non-exclusive, and you will often see it referred to as multi-label classification.
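For the document example, the tag vectors can be built with Scikit Learn's MultiLabelBinarizer. The four documents below are invented, and a document without the "work" tag is read as personal, one without "picture" as text.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each document carries zero, one or two tags (made-up examples).
documents_tags = [("work", "picture"), ("work",), ("picture",), ()]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(documents_tags)   # one column per tag, one row per document

print(mlb.classes_)   # column order of the output vectors
print(Y)
```

Unlike one-hot encoding, a row here can contain several ones or none at all.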
Transcript:
Let's start by looking at the simplest of these classifications, which is binary classification. Just as a reminder, that's when the labels are just true and false. And let's start with a very simple problem, which is that of predicting credit risk. Let's assume we have a dataset with just two features, the age of the client of a bank and the salary this person has, and we have historical data of whether these clients have repaid a loan or not. Okay?
So we plot this data in our two-dimensional feature space, and we'll use the plot as a guideline. We notice that the people who repaid the loan are more aggregated towards the upper right, and the people who didn't repay the loan are more towards the bottom left. Obviously this is an imaginary situation. I'm not claiming that this is real data. How would we go about classifying this data set? Well, one very simple algorithm is called a decision tree. And the decision tree works like this.
It will pick the feature that is most effective in separating the two classes with a single cut. For example, here, suppose I draw a line at age equal to 30. I'm making the number up right now. I see that above the age of 30 there are only blue crosses, whereas below there's a mixture of blue crosses and red dots.
And so what I could do is say, okay, my decision is "is this person's age greater than 30?". Then I have two child nodes. If the answer is yes, I immediately go to a leaf node that says this is not risky, so you should classify that as a blue cross.
On the other hand, if it's a no, then I need to ask a second question. A second question I could ask is, for example, is the salary above a certain number? Again, numbers here are just for illustration. This would partition the remaining space into two areas, one that is labeled as not risky and the other that is labeled as risky.
This is the algorithm used by a decision tree to classify our space based on the features. The splits are typically binary if the feature is numeric: for example, is the body temperature greater or lower than a certain threshold? But you could have multiway splits if you have a categorical feature with more than two options.
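Once trained, the tree from this example is nothing more than two nested questions. The sketch below writes it out by hand. The age cut of 30 is the made-up number from above, and the salary cut of 40,000 is another hypothetical value chosen only for illustration.

```python
def classify_credit_risk(age, salary, age_cut=30, salary_cut=40_000):
    """Hand-written version of the two-cut decision tree (thresholds are made up)."""
    if age > age_cut:          # first question: is age greater than 30?
        return "not risky"     # leaf node
    if salary > salary_cut:    # second question, asked only when the first answer is no
        return "not risky"     # leaf node
    return "risky"             # leaf node

print(classify_credit_risk(age=45, salary=20_000))
print(classify_credit_risk(age=25, salary=60_000))
print(classify_credit_risk(age=25, salary=20_000))
```

A training algorithm chooses the features and thresholds from the data, but the object you end up with has exactly this shape.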
Transcript:
I really like decision trees because they have a lot of advantages, and they are a great model when you're starting to approach a classification problem from scratch and you don't know how the data is going to behave. Let me tell you some of the advantages of decision trees. First, they are somewhat easy to interpret.
Okay. So now that I have the decision tree formed, I can go back and understand where the boundary lies and why some decisions are taken. That's often a good thing. Also, once the decision tree is trained, the only things retained in memory are the decision points. There is no memory of the actual data.
So in other words, I could have used millions of data points, but if I came up with these two simple cuts at the end of the training, I will only have to retain these two simple cuts in memory. These can be implemented very efficiently, even in hardware on an FPGA, or as very fast logic decisions in a program.
This makes for a very low latency predictor, which is often used when you have requirements on how quickly you need the prediction. Also, it's a rule-based, non-parametric model, which has the advantage of making it insensitive to the scale of the features. Okay. So I don't have to worry too much about the type of data that I'm feeding in.
It can work with features that are numerical. It can work with features that are categorical or discrete, and in all these cases it's quite good at generating a starting point. The only downside that I find in decision trees is that they typically tend to overfit.
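Both the insensitivity to scale and the tendency to overfit can be checked with a short experiment. This is a sketch on synthetic data with deliberately noisy labels, so the exact numbers are not meaningful; the dataset parameters are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so a fully grown tree has something to overfit.
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A tree with no depth limit memorizes the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(train_acc, test_acc)

# Multiply the first feature by 1000 and train again: the cuts move with the
# data, so the predictions should not change.
scale = np.array([1000.0, 1.0, 1.0, 1.0, 1.0])
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train * scale, y_train)
agreement = np.mean(tree.predict(X_test) == tree_scaled.predict(X_test * scale))
print(agreement)
```

The gap between the training and test accuracy is the overfitting mentioned above. Limiting max_depth is one common way to reduce it.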
And so I use them as a starting point, knowing that I will have to iterate on the model, but I just want to get a first sense of "can machine learning do anything with this data?". These models are widely used in business, as I said, especially when you have low latency requirements, where you need to make a decision very quickly.
For example, you are a salesperson and you need to decide which person to reach out to while people are on the website. You can have real-time scoring of which lead is more promising. Another situation where decision trees are commonly used is advertisement choice, when the website needs to decide which ads to display on a certain webpage.
Transcript:
What other models can you try to improve the results? And I'm going to go over now a few models that are very useful for classification, but many of them can also be applied to regression problems. So, I'll make sure to mention which of them work in a regression case because we are introducing them in a classification lecture, but keep in mind that linear regression is not the only technique for regression.
Let's start with a little bit of history. If we go back to the 1950s and 1960s, the first idea of a machine learning algorithm is actually that of a neural network in its simplest form, a shallow neural network called a Perceptron. But that idea remained dormant for over two decades because we didn't know how to train it properly, and we didn't have the large data sets that a neural net needs.
So starting from the 80s, we've seen a flourishing of other algorithms, decision trees, support vector machines, logistic regression, and so on. And so I'm going to go over some of them to give you a bit of perspective, as well as some options when you try your model. I will also comment on which, in my opinion, are the most promising ones nowadays.
Transcript:
So, let us start from a model that I strongly encourage you not to use in any real world project. I'm covering it because you may hear about it a lot, and I want to highlight what the advantages are, but K nearest neighbors is, in my opinion, not a production ready model, and I'll explain why.
So how does K nearest neighbors work? It is a model that stores all the training data in memory. Then, when a new data point comes in for which we are asking a prediction, it looks for the K nearest data points, three in this example, and it outputs the majority vote of the labels of those known points in the training data set.
So in other words here, you see, there is a point that is marked in gray, that's the new point. The three nearest neighbors are this blue point here, this other blue point here, and this red point here. And so the new point will be classified as a blue point because the majority of the neighbors are blue.
Now, this approach has a couple of advantages. First of all, it's very easy to understand what's going on. Also, it is able to draw basically any boundary, because it's not really building a model, it's using the training data directly. So it does not have any preconception of what the boundary should look like.
The decision tree, in contrast, builds a boundary out of straight cuts that split a feature into two sides. Okay. The biggest limitation of K nearest neighbors is that it does not scale. This is because, since there is no model and no assumption, it needs to retain all of the training data in memory.
And so it's good if you have a hundred or a thousand points, but when you start to have millions of points, it becomes really cumbersome to find the three nearest neighbors to any new point coming in and so it's not really a scalable approach to predicting classes.
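The whole algorithm fits in a few lines of NumPy, which also makes the scaling problem visible: every prediction computes a distance to every stored training point. The five training points below are made up, arranged so that the new point has two blue neighbors and one red neighbor as in the picture.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Majority vote among the k training points closest to x_new."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # distance to every stored point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up training data: two blue points and three red points.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 3.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["blue", "blue", "red", "red", "red"])

print(knn_predict(X_train, y_train, np.array([2.0, 2.0])))
print(knn_predict(X_train, y_train, np.array([5.5, 5.0])))
```

There is no training step at all here. The cost is paid at prediction time, and it grows with the size of the training set.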
Transcript:
So what are other approaches that are more scalable? Well, a very simple approach is logistic regression. Logistic regression is the sister of the linear regression we've seen in the previous class, where you were building a parametric model. But instead of having the output linearly related to the input, you have a probability in the output space.
So the hypothesis you're building is that the prediction is a sigmoid applied to a linear combination of the features plus a bias. Okay. The sigmoid function, represented here on the right, is a smooth and monotonically increasing function that starts from zero at minus infinity and grows all the way to one at plus infinity, passing through one half when the input crosses zero. So it's basically modeling a smooth step between zero and one, so between two classes. Okay. What this ends up doing is drawing a straight boundary between the two classes. Okay. So it's a simple function that can be parallelized and scales very well to a large number of features and a large number of data points.
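Written out, the hypothesis is the sigmoid of w · x + b. The sketch below implements it directly in NumPy. The weights, the bias and the three input points are arbitrary numbers chosen for illustration, not fitted values.

```python
import numpy as np

def sigmoid(z):
    """Smooth step from 0 (at minus infinity) to 1 (at plus infinity)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression hypothesis: sigmoid of a linear combination plus a bias."""
    return sigmoid(X @ w + b)

# Arbitrary parameters and inputs, for illustration only.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[0.0, 0.5], [3.0, 1.0], [-3.0, 1.0]])

p = predict_proba(X, w, b)
print(p)
```

The first point gives w · x + b = 0, so it sits exactly on the boundary and gets probability one half. The set of points where w · x + b = 0 is a straight line, which is the straight boundary mentioned above.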
Also, since it's parametric, we can extract the weights and therefore do some analysis on the relative influence of the features on the output. It also has the benefit of not just predicting the class, but predicting a probability of belonging to a class. This is a good thing, because in some situations you may want to be more stringent or less stringent on your assignment of the class.
For example, let's say I have data about COVID-19 cases where my labels are zero and one, based on the features that go into a certain test. Okay. Using a model that predicts the probability allows me to decide whether I want to be more stringent or less stringent. So for example, if my test says there is a 23% probability that this particular sample is COVID positive, I could be a little more prudent and decide to classify it as a one, even though it's only 23%, by setting a threshold at 20%. And that would be a more prudent decision than setting the threshold at 50%.
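In code, moving the threshold is a one-line change. The four probabilities below are made up, with the 23% case from the example among them.

```python
import numpy as np

# Hypothetical predicted probabilities of being positive.
probabilities = np.array([0.05, 0.23, 0.48, 0.91])

default_labels = (probabilities >= 0.5).astype(int)   # threshold at 50%
prudent_labels = (probabilities >= 0.2).astype(int)   # more prudent threshold at 20%

print(default_labels)
print(prudent_labels)
```

With the lower threshold the 23% sample is flagged as a one, at the price of more false alarms.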
So in some cases it's a real advantage to have a probability output instead of just the class output. In business this is used in many cases, for example predicting clicks and ranking, since it gives you a continuous value between zero and one: ranking search results, ranking ads, doing recommendations.
These are all examples where it's useful to have a nuanced kind of gray scale between zero and one.
Transcript:
An extension of logistic regression is the neural network. Neural networks basically take the concept of a logistic regression and expand it to multiple nodes and multiple layers. They've been extremely successful in dealing with complex data sources. So in cases like image recognition, where traditional AI had been failing, neural nets have proven superior and got great results. It's a super interesting algorithm.
Transcript:
Before neural networks became famous, the state of the art in machine learning was support vector machines. Although nowadays they are a bit less used and less famous, I think it's worthwhile knowing that they exist and what they do. They are somewhat similar to a logistic regression, but they're able to make a feature transformation under the hood, implicitly mapping the data into a higher dimensional space where the classes can be easier to separate.
Their goal is to separate the data with a maximal margin hyperplane, meaning they look for the boundary that is equidistant from the two classes. This equidistance property makes them quite robust to noise on both sides of the classification. Probably the biggest success case of SVMs is spam filters, which have been using SVMs for a long time.
Transcript:
One topic that you may hear a lot about is ensemble methods. This is when, instead of using a single model, you use a combination of many models. There are several ways of forming ensembles of models. I want to go over the three most common approaches to ensembles.
The first approach, a very common one, is called bagging. So what is bagging? It is selecting a base model, for example, in this case, a decision tree, then creating multiple copies of that decision tree and training each of the copies on a different sample of the data. So for example, the model on the top here will receive 80% of the data points.
Another model here will receive a different 80%, and the third model here at the bottom will also receive a different 80%. So in other words, each of these three models has a picture of the world that is slightly biased by the sample it receives. And so they will each learn a different boundary, influenced by the data they have received.
Now when we do the ensemble, we train the three models independently and then combine their predictions into a single prediction. So if the first model is biased in a certain direction, the second model in a different direction and the third model in a different direction altogether, their combined prediction will kind of average out their respective biases and promote the congruences, the things that are the same, between the three models.
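The three-tree picture can be sketched by hand as follows. The data is synthetic and the 80% samples are drawn without replacement to match the description above, whereas library implementations such as Scikit Learn's BaggingClassifier typically default to bootstrap sampling, so take this as an illustration of the idea rather than a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
rng = np.random.default_rng(0)

# Train three copies of the base model, each on a different 80% of the points.
trees = []
for _ in range(3):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine the three predictions with a majority vote (labels are 0 or 1).
votes = np.stack([tree.predict(X) for tree in trees])   # shape (3, n_samples)
ensemble_pred = (votes.sum(axis=0) >= 2).astype(int)
accuracy = (ensemble_pred == y).mean()
print(accuracy)
```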
Transcript:
An even more common approach is to do a similar thing, but instead of sampling the data, you sample the features. This is so common that it has its own name: it's called the random forest. So each of the copies of the decision tree will receive a random subset of the features.
And it's exactly the same idea. They're forming a view of the world that is biased and they're making decisions based on that bias. Then, when you average out the different base estimators, you get a more precise picture of what's going on.
A different approach from random forest and bagging is what's called gradient boosting, where you train copies of the model, not all together at the same time, but in a sequence where the next copy tries to correct the mistakes of the previous ensemble. So you train the first tree. This tree will be wrong in some predictions, and then you train a second tree where the wrong points are weighted more, so that the tree can correct for those mistakes. Then you train a third tree to correct for the mistakes of the first two. And so on.
This approach is really powerful and recently has proven very effective. So ensemble methods end up drawing a boundary that is a superposition of many simple boundaries and therefore end up giving a very nuanced picture of what's going on.
They remove the bias and they are less prone to overfitting compared to, for example, a decision tree, which has a single estimator.
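To try this out, a single tree can be compared with the two ensembles on the same split. The sketch uses a synthetic two-moons dataset and mostly default settings, both of which are assumptions. Which model comes out ahead depends on the data, so the code only prints the scores without claiming a winner.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-moons data as a stand-in for a real classification problem.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = [
    DecisionTreeClassifier(random_state=0),                     # single estimator
    RandomForestClassifier(n_estimators=100, random_state=0),   # trees trained independently
    GradientBoostingClassifier(random_state=0),                 # trees trained in sequence
]

scores = {}
for model in models:
    model.fit(X_train, y_train)
    name = type(model).__name__
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, scores[name])
```

The thing to look at is the gap between the training and the test accuracy of each model.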
Transcript:
Scikit Learn offers a number of these estimators out of the box. Here you see a comparison of many models, many of which we have just mentioned, on three different datasets: the moons dataset, the circles dataset and the linearly separable classes. As you can see, each of these models performs differently on the three datasets.
It also ends up drawing different boundaries. What we can say, though, is that models that have a straight boundary, like the linear support vector machine, only succeed when the classes are linearly separable, but completely fail in the other cases, whereas more complex models like the nonlinear support vector machine, the random forest, the neural network and so on have a much better grasp of the nonlinear boundaries, and so they can do a much better job.
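The circles case is easy to reproduce. The sketch below fits a linear and a nonlinear (RBF kernel) support vector machine on a synthetic circles dataset and compares their accuracy on the training data. The dataset parameters are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by any straight line.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(linear_acc, rbf_acc)
```

No straight line can do much better than chance here, while the RBF kernel can wrap a boundary around the inner circle.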
How does Scikit Learn implement these methods? If you remember from the previous class, there is this object called an estimator. Estimators take data, parameters and randomness as input, and they return a certain output. Here we need a kind of estimator that's called a predictor.
Okay. The predictor object implements the fit, the predict and the fit predict methods. We've seen one of these when we touched on linear regression. Okay. So an example would be: we create a model from the random forest classifier, and then we call the function fit on inputs and labels.
And then, once we have fit the model, we can call predict on some set of features.
Transcript:
If you recall from the previous class, we then want to evaluate if our model is any good. And the first thing we need to establish is a baseline. Now, for the case of regression, the baseline was built into the metric we were using. We were using the R squared score, which by itself compares the loss of the model to the loss of a baseline.
In the case of classification, we actually need to be a bit more mindful. The baseline could be as simple as 50% if we are dealing with a binary classification with perfectly balanced classes, so a coin toss is our baseline. But consider a case where the training set is highly skewed, for example anomaly detection, where, I don't know, 95% of the samples in the set are in one class and only 5% are in the other class.
In this case the naive baseline is already 95% accuracy, meaning that if we predicted every new sample to be in the majority class, we would get an accuracy of 95%. So you need to be careful when you ask what your baseline is: you need to investigate the label ratios in your dataset, as well as whether you have any other model already in production that you can compare with.
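The 95% figure can be reproduced in a few lines. The labels below are made up to match the ratio in the example.

```python
import numpy as np

# 95 samples in class 0 and 5 samples in class 1, as in the example.
y = np.array([0] * 95 + [1] * 5)

values, counts = np.unique(y, return_counts=True)
majority_class = values[counts.argmax()]

# Naive baseline: always predict the majority class.
baseline_pred = np.full_like(y, majority_class)
baseline_accuracy = (baseline_pred == y).mean()
print(majority_class, baseline_accuracy)
```

Any model with an accuracy around this value has not learned anything useful about the minority class, however good the number looks on its own.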
Whatever your baseline is, what you will do is similar to what we've done for regression. You will split your data into training and test sets, typically 80/20, but that's not a hard-set rule. Then you will evaluate the performance of your model both on the training set and on the test set, and you will want to compare the performance on one set with the performance on the other set.
Transcript:
So which score should we be using for classification tasks? Well, the obvious one is the accuracy, which is the overall percentage of correct predictions. But in the case of binary classification, you can actually be a bit more nuanced and look at how your model is mistaken. So if we put the true values on the vertical axis here, with the classes labeled zero and one, and their respective predictions on the horizontal axis, we can build what's called a confusion matrix, which is a four-quadrant matrix where each of the quadrants contains how many samples end up in that particular combination of true and predicted class.
So for example, we had samples that were labeled zero, and correctly predicted zero. Those are the true negatives. Similarly, the true positives are labeled one predicted one, but then we could be wrong in two ways. We could be wrong because we have labels that are one that are predicted zero and vice versa labels that are zero that are predicted one.
Now these two errors, called type one and type two errors, are very different in nature, and depending on the case, one may be more relevant than the other. For example, let me give you two extreme opposite cases. Let's say we are a social media company and we're building a model that decides whether a certain message is safe for work or should be censored or blocked. Okay. So label one means messages that are toxic or otherwise not appropriate, and prediction one has the implication of blocking that message. Okay. Now a false positive here is a perfectly legitimate message that gets blocked, and the outcome of that is a bad user experience. You're trying to post something and the platform tells you that you cannot because the message is offensive. And you're like, no, it's not. On the other hand, a false negative is a toxic message that gets through the filter and gets broadcast in the wild.
Now, depending on the situation, one can be more negative than the other. And you can clearly see the different approaches taken by different social media platforms, where in some cases they tend to suppress false negatives, and in some other cases they tend to suppress false positives. I'm sure you can make up your mind as to what each platform is doing.
An extreme case is that of fraud. For example, label one is a fraudulent card transaction and label zero is a legitimate card transaction. Again, a false positive would be a blocked legitimate transaction. On the other hand, a false negative would be a stolen credit card that gets used to purchase something by a hacker.
There are ways to mitigate one and the other. For example, a very practical approach to this problem would be to try to find a balance that minimizes both and then purchase insurance to cover the remaining risk. So this is to say that the confusion matrix can give us a more nuanced understanding of how our model is performing, and from the confusion matrix we can derive some auxiliary scores: the accuracy, obviously, but also the precision and the recall. These are fractional measures that are sensitive to the false positives and the false negatives respectively. So the precision, to be precise, is the number of true positives over the total number of predicted positives, so that's true positives plus false positives. Basically it's the blue quadrant here over the sum of the orange and the blue quadrants.
And recall is the blue quadrant over the sum of the yellow and the blue, that is, true positives over true positives plus false negatives. So in both cases it's the fraction of true positives, over either the predicted positives or the actual positives. There are a lot of other metrics that you can use to assess the validity of your model, and scikit-learn implements lots of them. They're all under the submodule called metrics, and I encourage you to take a look at it to see what's available.
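As a rough illustration of the scores just described, here is a minimal sketch that builds a confusion matrix with scikit-learn's metrics submodule and derives accuracy, precision and recall from its four quadrants. The labels and predictions are made-up values, not data from the course.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up example: 1 is the positive class, 0 is the negative class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # true positives over all predicted positives
recall = tp / (tp + fn)     # true positives over all actual positives

print(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The hand-computed values should agree with `precision_score` and `recall_score` from the same submodule, which is a quick way to check that the quadrants were read in the right order.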
Transcript:
What do we do when we have multiple classes? In the case of multiple classes, there are some models that accommodate vector labels, like we've seen, and so they will naturally extend to multi-class classification. This is the case for the logistic regression, but it's also the case for the decision tree or random forest.
They all naturally extend to the multi-class case without any problem. And they can give you probabilities in the case of logistic regression or classes, in the case of decision tree, for all of the classes. If the model you've chosen does not generalize to multi-class, you can always build auxiliary binary problems.
For example, here, you could build problems that are A versus not A, B versus not B and C versus not C and have binary classifications for each of the three classes independently. In all of these cases, you can still build a confusion matrix. In this case, you will have as many rows as there are classes and as many columns as there are classes, and your goal is to have non zero values only along the diagonal.
Whereas, if you have, like here, two data points that belong to class C but are predicted to be in class A, that's an indication of something that your model is not able to classify well, and it should prompt you to investigate how to improve your model. So for example, here we could do two things. We could go and look back at the labels of these two points: is it by any chance possible that they've been mislabeled to be in class C, but they're actually in class A? Or we could realize that some of the features that make these points belong to class C are not well represented, and so the model is not able to pick them up and distinguish them from points that belong to class A that are similar in nature.
So the confusion matrix is also a tool to navigate your model and see how you can actually improve it.
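To make the multi-class case concrete, here is a small sketch with three classes, A, B and C, in which two points of class C are predicted as class A. The label lists are invented for illustration. It also collects the positions of the misclassified points, which is the starting point for checking their labels and features.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["A", "B", "C"]
# Invented labels: two points of class C end up predicted as class A
y_true = ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "A", "A", "C", "C"]

# One row per true class, one column per predicted class
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=[f"true {name}" for name in labels],
    columns=[f"pred {name}" for name in labels],
)
print(cm)

# Positions of the misclassified points, to go back and inspect them
wrong = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]
print("misclassified positions:", wrong)
```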
Transcript:
We import some libraries and load this data, which has four features, three numerical and one categorical, and a binary target. We can look at the counts in source, which is the only categorical variable, and we see they're almost equally distributed, one third in each class, roughly speaking. And for target we have two thirds in class zero and one third in class one.
Also, by looking at this chart, we see that it would be very easy to separate the two classes if our model was able to draw a circular boundary in the lat-lon space, which means lat and lon are probably the two important features in this dataset, whereas elev, whatever that is, does not seem to offer any way to cut the blue dots from the orange dots.
Also if we look at the distribution in the elev coordinate, we see that the two classes are very similar in their distribution. We set the features and the labels by assigning target to be the label and then dropping target from the remainder. We also use this pd.get_dummies function. What it does is expand the source column into three columns that are one hot encoded.
So, whereas earlier we had C, Q and S as three values in the source column. Now we have three columns where it says either a zero or a one, depending on what the value in the source was. Now we have these six features. We can pass them to our model. We will split the data, using a train test split, and then initialize the decision tree classifier with a maximum depth of three and a random state of zero, and then train the model.
Model trains. We can predict values on the test set and generate the confusion matrix that we can then display using pandas. So we wrap the values of the confusion matrix into a data frame, assigning some nice values for indexes and columns, and this is our neat confusion matrix.
So the model is doing okay, but it's not doing great with a decision tree of depth three. We can investigate what's going on by using this little helper function called plot_decision_boundary that takes the model and plots a decision boundary.
So what you can see is that, since our decision tree has a maximum allowed depth of three, it can only make three decisions. And so it ends up cutting the space in these three directions: one, two, and then three. And the reason why the performance is not so good is that all these blue points here end up being classified in the red space because of the depth limit that we've enforced on our decision tree.
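The workflow described in this walkthrough can be sketched roughly as follows. The course dataset is not reproduced here, so the data below is synthetic: the column names lat, lon, elev, source and target come from the walkthrough, while the generated values, the circular rule for the target and the 30% test size are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the course data (assumed values)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "lat": rng.uniform(-1, 1, n),
    "lon": rng.uniform(-1, 1, n),
    "elev": rng.normal(0, 1, n),
    "source": rng.choice(["C", "Q", "S"], n),
})
# Assumed rule: class 1 sits inside a circle in the lat-lon plane
df["target"] = ((df["lat"] ** 2 + df["lon"] ** 2) < 0.4).astype(int)

# Label and features; source is expanded into three one-hot columns
y = df["target"]
X = pd.get_dummies(df.drop("target", axis=1), columns=["source"], dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Wrap the confusion matrix in a data frame with readable labels
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[0, 1]),
    index=["true 0", "true 1"],
    columns=["pred 0", "pred 1"],
)
print(cm)
```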
Transcript:
So what the exercise is asking is, try changing the model and see if you can improve the score. For example, changing the initialization parameters, like the maximum depth, or changing the model altogether, trying with different types of model, like logistic regression, random forest, support vector, et cetera, et cetera.
One suggested way, by no means you have to do this, but one suggested way is to take a lot of the code from above and bundle it into a function called train_eval that will train the model and then evaluate it, and then loop your train_eval function over many models.
Okay. Let's look at one possible way to solve exercise number one. We can import a number of different models. Here I imported the random forest and the extra trees classifier, those are two ensemble models. I imported the logistic regression, the support vector classifier, naive Bayes, and a simple neural network called multi-layer perceptron. I also imported the gradient boosting classifier from a different library called XGBoost. I just wanted to show you a different library that exposes a similar API. And finally I import the accuracy score.
And then, what did I do? First, I defined a little helper function here called pretty confusion matrix or pretty_cm, that calculates the confusion matrix and returns it as a data frame. Notice that I decided to use the exact same signature for the pretty_cm as the default signature for any of the metrics, which is always first the true values, and then the predicted values.
Then the train_eval function is defined here. Let's walk through it little by little. First we fit the model on the training features and training labels. Notice that here I'm passing the values instead of the Pandas data frame. So values extracts just the content of the data frame, without the indices. And the reason I'm doing this is that one of the models, the XGBoost classifier, does not work well with Pandas. And so I wanted to make this function compatible also with the XGBoost classifier, and that's why I'm passing a NumPy array instead of the data frame; the values are exactly the same. Then once the model is trained, I run the predict method, both on the training and the test data, and store the predictions into two arrays.
I can calculate the training accuracy, the test accuracy, and the confusion matrix on the test set with my little helper function pretty_cm. Then I compile a string here with the model name, the training accuracy and the test accuracy. And I do some plotting, like plotting the data, plotting the decision boundary, assigning the string to the title and putting the confusion matrix in a little text box here at the bottom right.
So what this does is when I call the function on the model, what I get is a little nice plot with the decision boundary, the name of the model, the training score, the test score and the confusion matrix all in one go. And now that I have this, I can iterate over many models very easily.
Here is an example. I compile a list of models, for example, the decision tree of depth three, a decision tree of depth six, and then all of the models mentioned earlier. And all we need to do is just loop over these models and call the train_eval function on each of the models. So let's take a look at the results.
The decision tree classifier with depth three, generates exactly the same plot we've seen before. As you can see if we increase the depth of the tree, the tree is able to draw a full boundary around the inner class. And so we are able to separate that almost perfectly, almost to 100% accuracy.
You can see that there is one hit and one miss that are predicted incorrectly in each of the classes. The random forest does an even better job. And this is because it's using 10 different decision trees and it's building a boundary as the average of their boundaries. And similarly, the extra trees classifier, is a combination of random trees and therefore is able to capture perfectly what's going on.
The Gaussian naive Bayes builds a very narrow probability distribution around the center and ends up not being very resilient or able to generalize. The logistic regression is definitely not able to separate the two classes, and that's because the logistic regression can only draw straight hyperplane cuts, and therefore it will never be able to draw a curved boundary in this dataset, and that's why we get such a low score. As you can see, what the logistic regression ends up doing is predicting everybody to be in the zero class, and therefore there is a very high false negative rate. The XGBoost classifier, similarly to the other ensemble models, does a pretty good job. And the support vector classifier is probably the best one. It's very resilient to noise and ends up separating the test set really well. And so is the neural network, which also has learned the perfect boundary and is able to generalize. So this is a good comparison of different models that also offers an intuition on what they do and how they work on a very simple dataset.
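A stripped-down version of this solution might look like the sketch below. It is not the course's actual code: the plotting is left out, XGBoost is omitted to avoid an extra dependency, and the data comes from scikit-learn's make_circles as an assumed stand-in for a dataset with one class enclosed by the other. The names pretty_cm and train_eval follow the walkthrough; the return values are a choice made here.

```python
import pandas as pd
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def pretty_cm(y_true, y_pred, labels=(0, 1)):
    """Confusion matrix as a data frame; same argument order as sklearn metrics."""
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    return pd.DataFrame(
        cm,
        index=[f"true {name}" for name in labels],
        columns=[f"pred {name}" for name in labels],
    )


def train_eval(model, X_train, X_test, y_train, y_test):
    """Fit the model, then score it on both the training and the test set."""
    model.fit(X_train, y_train)
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    train_acc = accuracy_score(y_train, y_pred_train)
    test_acc = accuracy_score(y_test, y_pred_test)
    return train_acc, test_acc, pretty_cm(y_test, y_pred_test)


# Assumed stand-in data: an inner class surrounded by an outer class
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = [
    ("tree depth 3", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("tree depth 6", DecisionTreeClassifier(max_depth=6, random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ("logistic regression", LogisticRegression()),
    ("SVC", SVC()),
]

results = {}
for name, model in models:
    results[name] = train_eval(model, X_train, X_test, y_train, y_test)
    train_acc, test_acc, _ = results[name]
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

On data of this shape the linear model should score far below the models that can draw a closed boundary, which matches the comparison described above.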
Transcript:
In exercise two, we are going to work with a different data set. It's the churn.csv data set. The assignment is to load the data set, then use the churn column as the label and assign it to a variable called Y, assign the other columns to a variable called features, and then select only the numerical columns from those features with the select_dtypes method and assign them to a variable called X. Then split the data into train and test sets with a test size of 30% and random state 42, and adapt the train_eval function that we defined earlier to test and compare different models in this situation. So here is how you solve it. You first load the data set, churn.csv, from the same URL.
And as you can see, this data set contains a lot of columns, and there is one particular column that we're interested in, which is the churn column. That's going to be our label. Whereas of the other columns, we're only focusing on the numerical columns. Okay. So let's take a look with the df.info method. As you can see, a lot of the columns have a data type of object, which is typically what happens when you have a string variable. So we're actually going to exclude those and only focus on the numerical features. Now what you could do is write explicitly the names of the columns that you are going to use, but we're actually going to do something different, which is drop the columns that are not numbers. So let's see how we do it.
First, we define the label, or the target, as where the Churn column is equal to yes. This is going to return a Boolean series, so if we look at Y, Y is a Boolean series of True and False, and that's okay because Booleans are interpreted as zero and one, and so they will work as labels.
We then define the features data frame by dropping the churn column along the column axis. And then we take features and from features we select the numerical data types. This select_dtypes method, as you can see, has both an include and an exclude parameter, where we can give either lists or scalars that describe which types should be included or excluded. In this case, we want all the numbers.
Okay. So we have the X and Y variables defined. Next, we call the train test split function with a test size of 30% and a random state of 42. And finally, we adapt the train_eval function by removing the plot. That's the big difference. We now have lots of features, so we cannot do the two-dimensional plot.
So just remove that plot and print out the results. And so this function will take the features and the labels in the training data set, fit the model on those numbers, and then use the predict method from the model to predict the values, both on the training set and on the test set. We calculate the accuracy score on the training set and on the test set and we print out the training and test accuracy. Once we have this new function, we can just loop over the models we've defined in the previous exercise and see which ones perform well. So let's take a look at what we have here. The first decision tree of depth three does not overfit too much. The two scores are comparable, but the score is actually quite low.
On the other hand, models that are more complex, like the decision tree of depth six, the random forest and the extra trees classifier, all end up overfitting on the training set. As you can see, their scores are very high on the training set, but their score on the test set does not improve significantly over our first decision tree.
Gaussian naive bayes is not doing any better. A logistic regression in this case seems to be performing slightly better than all of the tree based models. It's not only not overfitting, but it also has a test score that is a bit higher than all of the other models. The XGB classifier is comparable with the logistic regression. The support vector classifier performs really poorly, and the neural net has comparable results to the logistic regression and the XGB classifier.
In this situation, I would probably pick the logistic regression as my model of choice because of its additional interpretability that is not offered by the neural net and the XGB classifier for comparable scores, or I would try other models to see if I can beat any of these scores. Probably to improve the score we need to bring back in all the features that are not numerical and see if we can use them in our model.
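The preprocessing steps of this exercise can be sketched on a tiny made-up table. The churn.csv file itself is not loaded here: the column names other than the churn column, the "Yes"/"No" spelling of its values and all the values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for churn.csv (assumed column names and values)
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "tenure": [1, 34, 2, 45, 8, 22, 10, 28, 62, 13],
    "MonthlyCharges": [29.9, 57.0, 53.9, 42.3, 99.7, 89.1, 29.8, 104.8, 56.2, 49.9],
    "Contract": ["monthly", "yearly", "monthly", "yearly", "monthly",
                 "monthly", "monthly", "monthly", "yearly", "monthly"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No"],
})

# Boolean label: True where the customer churned
y = df["Churn"] == "Yes"

# Drop the label, then keep only the numerical columns
features = df.drop("Churn", axis=1)
X = features.select_dtypes(include="number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X.columns.tolist(), len(X_train), len(X_test))
```

An equivalent choice would be `select_dtypes(exclude="object")`, which drops the string columns instead of naming the types to keep.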
Transcript:
In exercise three, we are going to extend the previously defined function to also keep track of the time required to train the model. This is a very simple modification. You will start from the given function and modify it so that it keeps track of the time it takes to run this line. So how are we going to do that?
First we import the time library, and then we just keep track of the time before and after the model.fit call. And then once we've done all our predictions and scores, we return the model, the training accuracy, the test accuracy and the delta_t, so the interval between before and after running the model fit.
Notice that I changed the signature, so there is no more print statement inside the function; the print statement is now outside of the function, so the function returns the values and then I can print them however I want. So when we run this, we can see that some models, for example the neural net, take a lot longer to train than decision trees or a Gaussian naive Bayes.
And this is also a consideration once you're deciding which model you're going to use, as you can see the logistic regression, which has a higher performance is also quite fast, comparably, to train. One last thing I want to show you is the magic command %timeit. This is a Jupyter command that allows you to time what happens in a cell.
So for example, I can take the train_eval_time and run it on model zero, which is the first decision tree. And so we can compare the result of running our function and timing it with Jupyter. So when we run this cell, what happens is this function call is executed a certain number of times.
As you can see here, it was executed a hundred times and the best of three was 6.5 milliseconds, which is definitely compatible with our estimated time, considering that the whole function not only trains the model, but also runs the predictions and the scores. And so it takes a little bit longer than just training the model. This is a very useful command and I encourage you to take a look at it in more detail.
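A minimal sketch of the timing modification follows. The name train_eval_time and the returned values follow the walkthrough; the synthetic data, the list of models and the use of time.perf_counter (rather than whichever clock the course code used) are assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def train_eval_time(model, X_train, X_test, y_train, y_test):
    """Fit the model, timing only the fit call, then score it."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    delta_t = time.perf_counter() - t0  # seconds spent in model.fit

    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return model, train_acc, test_acc, delta_t


# Synthetic data, only to have something to fit
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

timings = {}
for model in [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]:
    _, train_acc, test_acc, delta_t = train_eval_time(
        model, X_train, X_test, y_train, y_test
    )
    name = type(model).__name__
    timings[name] = delta_t
    # Printing happens outside the function, as in the walkthrough
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} fit={delta_t * 1000:.2f} ms")
```

In a Jupyter cell, prefixing the call with `%timeit` gives a repeated measurement of the whole function call instead of a single measurement of the fit.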
Transcript:
The main technique we are going to discuss in the realm of unsupervised learning is the technique of clustering. And we're going to see how to do that with Scikit Learn. So, as we've discussed previously, clustering pertains to the world of unsupervised learning, and it has to do with when you're looking for groups of similar data points.
So as a reminder, this is what was happening when we were given the example of flower data points, and we found that there were two groups, one group at the bottom of constant petal length and another group of increasing petal length. So clustering has a lot of practical applications that range from malware detection and intrusion detection to clustering documents and products to even scientific discoveries of new genes and new diseases, or segmenting customers for more targeted marketing.
So it's a widely popular technique and it's important to know how to use it.
Transcript:
Now, to understand how clustering works, first we need to talk about the concept of distance and similarity. What do I mean by distance and similarity? Well, let's take a look at this data represented in this table. As you can see, there are no labels in this table.
So we don't have predetermined groups that we are trying to predict. And let's say we ask the question, can we form groups? Let's divide these four clients into two groups of similar clients. And let's see what clients you would group together and why. Okay. The first criterion is low and high debt.
That's a very valid criterion. So by current debt, clients one, three, and four, which have high debt, would be in one group, and client two would be in a separate group. Or debt to income ratio. In this case, we would get a similar grouping, where client three, however, would be in a different group with respect to the others because it has a much higher debt to income ratio.
We could use less than one year at the job versus more than one year at the job. All of these are very valid criteria. Below drinking age and above drinking age, that would also be a criterion: there is one person below 21. So basically, we can find as many criteria to group them as we want.
Right? And this is to show that clustering is highly dependent on which features we consider, and also on what we decide it means for two things to be similar, that is, close to one another. You could also do nonlinear operations and set thresholds. For example, if you decided to use drinking age as a threshold, that's like a clear-cut boundary where below 21 it's one class and above 21 it's a different class. So you can use all sorts of criteria, based on different distances, to group your data. Let me give you a couple of other real world examples. If you look at the picture on the bottom right, you see a grid that is a pictorial representation of a city and two points that are your departure and destination on a map. And presented there are four different routes. One is the green route, that's if you can fly. And then there are three routes that involve navigating through the blocks in the streets. So the red route and the yellow route are probably going around the periphery, whereas the blue route is going through the city center.
Now, in the absence of any other information, we would tend to group the three routes, red, blue and yellow, in one class, because they all cover the exact same distance: the same number of blocks up and the same number of blocks to the right. Whereas the straight line that's flying from point A to point B is probably in a different class.
However, if we had, for example, access to traffic information, it could happen that the blue route also ends up being in a different class, because going through the city center maybe involves more stops and more traffic, and therefore the time to destination would be different from that of the other two routes, the red and the yellow, going around the periphery.
So this is, again, to say that depending on the criteria we are using, we may end up with different clusters. And it's always important to be explicit about the choice of distance metric you're using. I just want to tell an anecdote. I was once sitting at a bar and a person struck up a conversation and said: "Hey, so when you fly back to Italy, do you fly East or West?" And I smiled because the mental model of this person was that of flying within the continental United States, whereas to fly to Europe, you need to fly over the North Pole or over Greenland. And therefore you fly North from San Francisco. And sometimes we have hidden assumptions in the distance metric that we use.
And that's the same for machine learning models. So the first thing is we need to be explicit about choosing the appropriate distance metric. And this will depend on the kind of problem. So here are four very common distance metrics that you will find in clustering problems. The first two are numerical distances. One is the intuitive Euclidean distance, which is a vector distance between two data points.
And the second is the cosine similarity, which is kind of a normalized dot product between the two vectors. The Jaccard score is very commonly used when you have categorical or binary features. And it's the ratio of the size of the intersection over the size of the union of the two sets.
And finally, the Levenshtein distance, or string distance, is used when you're trying to measure the distance between two strings of text. So depending on the type of data and on the problem, you may want to use one type of distance measure or a different one.
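To make the four metrics concrete, here is a plain Python sketch of each one. The function names are made up for this example, and the implementations are written for readability rather than speed; libraries such as SciPy and scikit-learn provide optimized versions of most of them.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product normalized by the lengths of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(s, t):
    """Size of the intersection over size of the union of two sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(euclidean((0, 0), (3, 4)))                      # 5.0
print(cosine_similarity((1, 0), (0, 1)))              # 0.0
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))      # 0.5
print(levenshtein("kitten", "sitting"))               # 3
```

Note that this sketch does not handle edge cases such as zero-length vectors for the cosine similarity or two empty sets for the Jaccard score.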
Transcript:
Now let's look into clustering and let's see what clustering algorithms are out there. The most commonly used clustering algorithm is k-means. And by no means is this the best algorithm, but it's definitely the most commonly used. So I'm going to highlight where to use k-means and how to use k-means, and also where not to use k-means.
So what is k-means? First of all, it tells us that data scientists have very poor imagination in terms of letters of the alphabet, because they only know the letter K: k-means and K nearest neighbors. And you'll see, we'll find K many times. But more importantly, it's a clustering algorithm that works really well when your clusters are linearly separable blobs of points that are near one another.
So if you look at the picture on the top, we have two clusters of points that are forming blobs of points that are close-by and k-means works really well in that scenario. The reason is the way k-means works, and we'll see that in detail, is you initialize the number of clusters you're looking for, and then k-means greedily partitions the space into as many parts as there are cluster centers.
So what you can see in the bottom figure are two examples of failures of k-means where, with two clusters, k-means ended up partitioning the space, but not in the correct way. Here, the correct way for the bottom left figure would have been to cluster the inner circle in one cluster and the outer circle in another cluster, and instead k-means simply divides the space into two parts.
Similar thing for the moons dataset. So it's a type of algorithm that works well for some type of data. Let's see how it is working. Let's say we have some data like this where we have two features, representing our data. And then, we want to partition them into a certain number of clusters. Now, let's leave aside the question of how do we determine the perfect number of clusters?
We'll look into that in a second. Okay. Let's say we know that there are four clusters. We just want to find those four clusters. So this is how k-means works. We initialize the centroids in some position. And then we assign every point in the data set to the nearest centroid. Once we have done that, we shift each centroid to the center of gravity of all the data points assigned to it, and we iterate this procedure until we reach stability.
So greedily refers to the fact that every point in the dataset is assigned to one centroid, the nearest one, and the assignment is repeated until the centroids don't move anymore and no point in the data set changes cluster. Now, as you can see, this procedure is prone to error because it depends on how we initialize the positions of the centroids. For example, now I'm initializing the centroids in a different position.
You will see that I end up with a solution that is different from the previous one. In this particular case, I end up with three clusters, and one of the centroids never gets any of the data points because it's initialized too far from everyone. Similarly, if we initialize it in an altogether different position, we will get another different solution.
But I hope by now the procedure is clear, and that's how k-means works. So how do we overcome the problem of sensitivity to the initial condition? The solution to that is to assign the centroids to random positions and possibly repeat the operation quite a few times before we settle on the final initialization.
So, in other words, we repeat the start quite a few times, and then we run it for one long run, so that it settles into a stable condition. So this is the random initialization. We initialize it randomly and then we let it run. And this cures the sensitivity to the starting condition. So that's how k-means works.
And as a summary, it requires the number of clusters, it will greedily partition the space, and it only works for linearly separable clusters.
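The procedure just summarized can be sketched in a few lines of NumPy. This is an illustrative implementation with a single random initialization, not scikit-learn's KMeans; the function name kmeans, its parameters and the two-blob test data are assumptions made for this example.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids on k distinct, randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: stable configuration
        centroids = new_centroids
    return centroids, labels


# Two well-separated blobs, the easy case for k-means
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])

centroids, labels = kmeans(X, k=2)
print(centroids)
print(np.bincount(labels, minlength=2))
```

To reduce the sensitivity to the starting positions, one could call this function with several different seeds and keep the run with the smallest total squared distance from the centroids.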
Transcript:
Now I had left aside the question of how do we know what is the correct number of clusters to initialize k-means with. And this goes into the realm of model evaluation. Now we need to say a couple of things about model evaluation because, since we are in an unsupervised learning world, there is no ground truth.
There is no concept of the correct answer, the correct clustering, and therefore we cannot define a loss function based on the ground truth and the predictions, like we were doing in supervised learning. Okay.
So depending on the situation, we may want to use certain criteria to decide whether or not the clustering worked, and the first criterion is: "is the result we obtained useful?". Remember, we're not trying to teach the model to predict something.
Here, we're using machine learning to discover something that we may want to use later on. So is the model helping us understand the data at a deeper level, in a way that we can then use for further purposes? For example, if we look into the clusters, can we realize what makes these points similar?
Let's make a practical example here. Let's say we've used a clustering algorithm to cluster a population of users of our application. Then the question we would ask is, do the people in each of the clusters have anything in common? This could be demographics, it could be purchase patterns, but can we understand them and build personas out of the clusters and say, okay, these are, for example, the compulsive shoppers, these are the millennials, these are the boomers and whatever. Okay. Or, if we are clustering scientific data, for example, clustering is very often used with images. Can we cluster images and find similarities between those images? Okay. Another valid question to assess whether clustering was useful or not is to ask, can we use the clusters as labels for some sort of supervised learning downstream?
In other words, now that we have the clusters and that we can attribute each point to one and only one cluster, if we were to use the clusters as labels in a supervised learning task, would these labels make any sense to us? So it's always a meaning question that we want to address when we're looking at clustering.
That said, we also want to have some objective criteria to know if the clustering was good. And especially like, what's the best number of clusters.
Transcript:
So I'm going to show you two methods here to assess the perfect number of clusters. The first method is called the elbow method. So let's go back to our k-means example. And let's say it has converged and it has given this configuration of clusters. What we can do is calculate the total square distance from cluster centers.
In other words, we take all of the distances of each point from the respective cluster center, and we add up all of their squares. Now if we repeat this procedure for different values of K, what we should obtain is a curve that looks like this. At the beginning, with one cluster, we basically obtain the sum of the squares of the distance from the center of mass of the data. And that's also going to be the maximum.
But as we add clusters, what's called the inertia will decrease. So the total sum of the squares will decrease because there are more centers, there are more centroids, and therefore each point is more likely to find a closer center. Now, obviously, as we add more and more clusters, at some point there will be diminishing returns and this will not really change the total square distance, the total inertia. And so what people refer to as the elbow method is to graphically look for the elbow of this curve, which here we could pick to be either two or three, which is the point where the slope of the curve changes from steep to kind of flat.
And that region there is what we would call the optimal clustering region.
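As a sketch of the elbow method, the loop below fits scikit-learn's KMeans for several values of K and records the inertia_ attribute, which is the total squared distance of the points from their cluster centers. The blob data and the range of K are assumptions; in practice one would plot the values and look for the bend.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: three blobs of points
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

ks = range(1, 7)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest center

for k, inertia in zip(ks, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

The n_init=10 argument repeats the random initialization ten times for each K and keeps the best run, which relates to the earlier discussion of sensitivity to the starting condition.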
Transcript:
A different approach would be to consider not only how compact the clusters are, so that's the sum of the squares of the distances from the center, but also how well separated the clusters are. These are called cohesion and separation. And we can use these two measures to calculate what's called a silhouette coefficient.
So the silhouette coefficient is a number that we can calculate for each point in the cluster. And this tells us basically how well separated that point is from the points in other clusters. Okay. And we can calculate an average silhouette coefficient, which is really nice. It has the nice property of being high and close to one if the clusters are well separated, and going more towards zero if the clusters are not well separated. So you see here an example of data with two centroids, and the red dashed line is the average silhouette coefficient for this set. Now, if we increase the number of centroids, for example to three, you can see that the average silhouette coefficient goes down.
But if we increase it again to four, now the average silhouette coefficient went up a little. So you can see that the average silhouette coefficient captures this property of whether clusters are well separated from one another, and also quite compact within each of them. So increasing the number makes it go down again.
So the idea here is for different values of K, try to train a K means clustering algorithm or any other clustering and then calculate the silhouette coefficient, the silhouette score for the given partition and pick the number of clusters that has the highest silhouette score.
This is not a perfect method, but it's the most commonly used method, and it's one of the few methods that we can use to score a clustering algorithm without having access to any of the labels.
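As a sketch of how cohesion and separation combine, the usual definition of the silhouette coefficient of a point is (b - a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the mean distance to the points of the nearest other cluster. Its values lie between -1 and 1. The code below computes it by hand for one point and compares it with scikit-learn's silhouette_samples and silhouette_score. The synthetic data and the helper name silhouette_of_point are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data (an assumption for illustration): two blobs
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)


def silhouette_of_point(X, labels, i):
    """Silhouette coefficient of point i, computed directly from the definition."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # the point itself does not count towards cohesion
    a = dists[same].mean()  # cohesion: mean distance within its own cluster
    # separation: mean distance to the nearest other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


per_point = silhouette_samples(X, labels)  # one coefficient per point
average = silhouette_score(X, labels)      # the average over all points
manual_0 = silhouette_of_point(X, labels, 0)

print(f"point 0: by hand {manual_0:.4f}, scikit-learn {per_point[0]:.4f}")
print(f"average silhouette coefficient: {average:.4f}")
```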
Transcript:
Let's take a look at methods beyond k-means. So what do we do in all of those situations where k-means is not a successful method to cluster the data? One very common approach is called Gaussian mixtures. And we can think of that as being a softer version of k-means. So whereas k-means attributes each point to a certain centroid, what Gaussian mixtures do is build Gaussian probability distributions that describe how likely a point is to belong to a certain Gaussian distribution.
And so what ends up happening, it's kind of like a soft labeling. You have a probability that one point belongs to a certain distribution. But that could be non-zero for more than one class and you can have overlaps. This is really interesting in cases, for example, where you have overlapping clusters.
And therefore they would be bundled into one by k-means. But here you could have two distributions that are overlapping with one another. An even more interesting approach is density based clustering or DBSCAN. And I'm going to display how this works with a little simulation that you can find at this address.
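Before moving on to the density-based approach, the soft labeling of Gaussian mixtures described above can be sketched with scikit-learn's GaussianMixture. The two overlapping synthetic blobs and the 0.9 threshold used to flag ambiguous points are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two overlapping blobs (synthetic data, assumed for illustration)
X, _ = make_blobs(
    n_samples=400,
    centers=[[0.0, 0.0], [2.5, 0.0]],
    cluster_std=1.0,
    random_state=0,
)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely component for each point
soft_labels = gmm.predict_proba(X)  # one probability per component, per point

# Points in the overlap region have a non-negligible probability for both components
# (0.9 is an arbitrary threshold, used here only to count such points)
n_ambiguous = int(np.sum(soft_labels.max(axis=1) < 0.9))
print("membership probabilities of the first 3 points:")
print(soft_labels[:3].round(3))
print(f"points with no component above 0.9: {n_ambiguous}")
```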
So this is how density-based clustering works. You need to define two configuration parameters. The first is the size of the radius that you will use to consider whether points belong to the same cluster or not. So let's say we are considering points to be in the same cluster within an epsilon radius of 1.2, for example. And the second one is how many points you want in your surroundings to decide that a point is an inner point of the cluster. So in other words, here, if around one point there is a minimum of four other points within that radius of size epsilon, then that point will be marked as an inner point, and the four points will be marked as the frontier. So this approach grows a component by expanding the frontier, adding the points that end up having four neighbors within the radius of size epsilon. And so it can grow the frontier without having a global assumption on the distance between points.
So, whereas k-means was using Euclidean distance to aggregate points at the global level, here what we have is a local assumption on the points that belong to a cluster, and as you can see, it can grow a component with a very different topology. This is interesting because it can end up correctly clustering data that has very weird topologies, that are not necessarily just globular, well separated clusters like in this case.
Obviously you will still need to tune the hyper-parameters, and therefore you will need to do some sort of search with a silhouette score in this case as well. However, the interesting fact about DBSCAN is that it does not ask you to decide how many clusters there are. Instead, the number of clusters is an output of the algorithm.
So as you can see here, I've initialized it with a very different set of initialization coordinates. We are looking for a much smaller radius, and still four points, and as you can see, the algorithm is now finding a lot of small droplets of nearby points and finding a lot of clusters. And obviously that's why we need to tune the value of epsilon and find the range in which the number of output clusters is stable and not changing.
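The effect of the radius can be sketched with scikit-learn's DBSCAN, where eps is the radius and min_samples is the neighbor count. The two-moons data, the list of radii, and the helper name summarize are assumptions for illustration. Note that in scikit-learn min_samples counts the point itself, so "four other points" corresponds to min_samples=5.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic two-moons data (an assumption for illustration)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)


def summarize(X, eps, min_samples=5):
    """Run DBSCAN and return (number of clusters found, number of outliers)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    # Label -1 marks outliers, so it is not counted as a cluster
    n_clusters = len(set(labels.tolist()) - {-1})
    n_outliers = int(np.sum(labels == -1))
    return n_clusters, n_outliers


results = []
for eps in [0.02, 0.05, 0.1, 0.2, 0.3]:
    n_clusters, n_outliers = summarize(X, eps)
    results.append((eps, n_clusters, n_outliers))
    print(f"eps={eps:<5} clusters={n_clusters:<3} outliers={n_outliers}")
```

The number of clusters is read from the output rather than passed in, and the range of eps over which that number stops changing is the stable region mentioned above.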
Transcript:
Let's take a look at other algorithms that are implemented by Scikit in the clustering submodule. So you can see here quite a few examples of data sets. Starting from the top to the bottom, we see a data set of two concentric circles, then two moons, then three distributions that are probably Gaussian in nature, but they have very different density.
Then three distributions, three clusters that are oblique and not isotropic, not uniformly distributed in all directions. Then three, globular clusters, well separated from one another, and finally a uniform distribution of points where there are no clusters. And let's take a look at what's happening with the different methods.
So here, k-means only works with well separated clusters, but it basically fails at all the other cases. It cannot separate the concentric circles. It cannot separate the moons, nor can it separate any of the others. More importantly, as you can see, the uniformly distributed points still get partitioned in three parts, because it's a greedy algorithm that is not capable of understanding these should not be divided. That's why the silhouette score becomes so important, because you need to assess whether your clustering makes any sense, whether your clusters are well separated.
On the other hand, a clustering algorithm like density-based clustering (DBSCAN) is effective in separating the data in most of the cases. As you can see, it works well in the uniform case. It works well in the globular case. It does not have any assumption on the distribution, so it can do the skewed anisotropic Gaussians. The only case where it fails is the case where the points from the distributions overlap. And therefore here the points in this little side lump of data points are considered together with the broader blue set of points. Clustering algorithms that work well in this scenario are the Gaussian mixtures, obviously, but also agglomerative clustering, Ward, and spectral clustering.
They all end up resolving the different overlapping distributions quite well. So what's the bottom line here? The bottom line is there is no one-size-fits-all solution, and you will have to try different types of clustering algorithms and inspect what the results are in order to decide whether or not clustering was useful.
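Trying several algorithms on the same data can be sketched as follows. This is only an illustration on one synthetic data set (two moons), with parameter values that are assumptions. Because the data is generated, the true groups are known, so the adjusted Rand index can be used to compare each result with them; with real unlabeled data you would inspect the results or use the silhouette score instead.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic two-moons data; y_true is only available because the data is generated
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# Parameter values below are assumptions chosen for this data set
models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "GaussianMixture": GaussianMixture(n_components=2, random_state=0),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)  # all three estimators provide fit_predict
    scores[name] = adjusted_rand_score(y_true, labels)
    print(f"{name:<16} adjusted Rand index = {scores[name]:.3f}")
```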
Transcript:
So, going back to how we actually implement them: well, we go back to our estimators in Scikit, and again, we're going to need a predictor estimator. So these will implement the fit and predict methods. But in this case, when we call model.fit, we will not pass the labels because we don't have labels.
So it's going to be model.fit just on the feature vector X. One last comment: the black dots here are considered outliers, so they don't belong to any cluster.
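A minimal sketch of the calls just described, on synthetic data assumed for illustration. One detail worth knowing: in scikit-learn, KMeans offers fit, predict, and fit_predict, while DBSCAN offers fit and fit_predict but no separate predict method, and it labels outliers with -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic features only (an assumption for illustration); the labels are discarded
X, _ = make_blobs(
    n_samples=100,
    centers=[[0.0, 0.0], [6.0, 6.0]],
    cluster_std=0.5,
    random_state=0,
)

# fit receives only the feature matrix X, no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

# predict assigns new points to the closest learned centroid
new_points = np.array([[0.1, -0.2], [5.8, 6.1]])
new_labels = kmeans.predict(new_points)
print("k-means labels for the new points:", new_labels)

# DBSCAN: add one far-away point and cluster again with fit_predict
X_with_outlier = np.vstack([X, [[100.0, 100.0]]])
db_labels = DBSCAN(eps=1.0, min_samples=4).fit_predict(X_with_outlier)
print("DBSCAN label of the far-away point:", db_labels[-1])
```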
Transcript:
Okay. Let's quickly take a look at the code. We're loading the usual libraries and then importing a dataset called Iris, which is a data set of flowers. Now, the original file contains both the features, here are four columns, as well as the label, the species. However, in this particular case, we don't care at all about the labels, so we will not use the species column.
And as you can see, if we plot these flowers without their labels, they really do look like there are two types of flower. One cluster here and one cluster here. So we drop the species column, initialize k-means, and fit it only on the remaining features. So notice that we're not passing any labels.
This gives us the cluster centers. Now notice that these centers have four coordinates because there are four features, and therefore, even though we're only plotting two of them, the data actually lives in a four-dimensional space. And so the centers also live in a four-dimensional space.
We can plot the data and overlay the centers. And as you can see there is a little bit of error here, where these three points are attributed to this centroid whereas they should pertain to this centroid. But otherwise the k-means algorithm has separated the data well, attributing most of the points to this centroid and a few to the other centroid.
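The steps of this walkthrough can be sketched as below. Two assumptions: the data comes from the copy of Iris bundled with scikit-learn rather than from the course file, and the number of clusters is set to two, which the walkthrough suggests but does not state explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the four feature columns; the species labels are deliberately ignored
X, _ = load_iris(return_X_y=True)

# Number of clusters is an assumption (two groups are visible in the plot)
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)  # no labels are passed

centers = model.cluster_centers_
labels = model.labels_

# One row per cluster, one column per feature
print("shape of the centers:", centers.shape)
print(centers.round(2))
print("points per cluster:", np.bincount(labels))
```

To reproduce the plot, one could scatter two of the four feature columns colored by labels and overlay the same two coordinates of centers.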
Transcript:
In the first exercise, you are asked to try to find the optimal number of clusters using the elbow method. And the suggestion is to iterate K from one to 10 for each of the Ks, fit a K means, and then plot the inertia as a function of K. Now this should be quite easy. So let's see what a possible solution could look like.
We define a range from one to 10. This is a Python range object, and we call it Ks. And then we have a list that is the collector for the results. And then we iterate over Ks, and for each of the Ks, we train a k-means model with a number of clusters equal to K. Now, notice that the parameter here is n_clusters. So we're setting it to be equal to K. What is that n_init=10? That has to do with what we were discussing in the lecture, which is that k-means is quite sensitive to the initialization. So what this will do is run k-means 10 times, each time from a different initialization of the centroids and each time until convergence, and then keep the run with the lowest inertia. So you can set that to basically minimize the influence of the initialization step. Then you fit the model. And then from the model, you extract the inertia, which is the sum of the squared distances of each point from its closest centroid.
Okay. Notice a convention that is used in Scikit: this trailing underscore. It is used in Scikit for attributes that only have a value after the fit method has been run. In other words, when you define a model, before you fit it, the inertia_ attribute will not be defined, because it only exists after the model has been fit on the data.
And it's a calculated property of the training. Okay. So we append the score to our list. And then finally we plot the results as a function of K using matplotlib. Okay. So we create a figure, then plot Ks versus the inertia, and then add labels to the axes and a title. And that's it.
So, in this situation, the elbow is somewhere here, and we could potentially pick two, but also three, as the optimal number of clusters. It would be debatable, based just on this figure, whether you would choose one or the other.
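A possible version of this solution, written as a sketch: it assumes the Iris features are loaded from scikit-learn's bundled copy rather than from the course file, and it prints the inertias instead of plotting them so that it runs without a display.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Iris features only (bundled copy, an assumption); labels are not used
X, _ = load_iris(return_X_y=True)

# Trailing-underscore attributes do not exist before fit has been called
unfitted = KMeans(n_clusters=2, n_init=10)
has_inertia_before_fit = hasattr(unfitted, "inertia_")
print("inertia_ defined before fit:", has_inertia_before_fit)

Ks = range(1, 11)  # K from 1 to 10
inertias = []      # collector for the results

for k in Ks:
    # n_init=10: ten runs from different initializations, the best one is kept
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(Ks, inertias):
    print(f"K={k:<2} inertia={inertia:.1f}")
```

With matplotlib available, plotting list(Ks) against inertias and labeling the axes reproduces the elbow curve.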
Transcript:
Let's take a look at exercise number two. So here you're asked to assess the correct number of clusters using the silhouette score. And it's similar to the previous one, in that we have to iterate over the number of clusters, but instead of using the inertia, we're going to calculate the silhouette score.
Also there are a couple of bonus questions like plotting the score as a function of K, but also plotting the clusters for each value of K. And we could use the helper function. So let's take a look at one possible solution. First you may find the helper function, which is basically taking the plotting code from above and making it slightly more flexible in that if we pass the centers, then it will plot the centers, but if you don't pass the centers, it will just plot the data with the colors from the labels. Okay. This is to accommodate for models that do not have the centers. Like, for example DBSCAN. Then once you have that helper function, what you can do is iterate over K, and for each value of K, create a model, fit the model, calculate the score, append the score, and then do some plots.
So a nice way to keep them all together is to use the subplot API. That will create subplots in a given figure. And so, for example, here I've created nine subplots for different values of K and these subplots are representing the different clusters.
So as you can see, here it is with two, then three and four, et cetera. And the silhouette score as a function of K very clearly indicates that two is probably the best clustering, because it not only has the highest silhouette score, but it also has two very well separated colored groups, colored clusters.
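The search over K with the silhouette score can be sketched as follows, leaving out the plots. It assumes the Iris features come from scikit-learn's bundled copy, and it starts at K=2 because the silhouette score is not defined for a single cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Iris features only (bundled copy, an assumption); labels are not used
X, _ = load_iris(return_X_y=True)

Ks = range(2, 11)  # the silhouette score needs at least two clusters
scores = []

for k in Ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    scores.append(silhouette_score(X, labels))

for k, score in zip(Ks, scores):
    print(f"K={k:<2} silhouette score={score:.3f}")

# Pick the K with the highest average silhouette coefficient
best_k = Ks[scores.index(max(scores))]
print("K with the highest silhouette score:", best_k)
```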
Transcript:
Exercise three, switch the method to DBSCAN and iterate the search, but for different values of Epsilon, it's a very similar procedure. We can reuse most of the code from the previous exercise and therefore define a list of possible values for Epsilon. And then loop over the values for Epsilon and calculate the clustering and then the silhouette score.
Now let's take a look at this. You can see that when Epsilon is very small there are basically only two clusters where most of the points are in the purple one. And here, purple is the -1 label. That means they've been considered not belonging to any clusters. They're outliers. As we increase the size of the radius the clusters increase in size and more points get aggregated together.
And as you can see, we are kind of conquering the outliers into groups that are colored, until we reach a regime, that's for epsilon equal to one and above, where we kind of reach a stable solution of two clusters. Okay. There is one cluster here and one cluster here. And that's it.
So this analysis tells us that the optimal value of epsilon is somewhere in the range 1 to 1.5, but it also tells us that the optimal number of clusters is two. And that's what both methods seem to converge on: k-means with the silhouette score, and DBSCAN with the silhouette score as a function of the epsilon radius.
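A sketch of the same search with DBSCAN, without the plots. The list of epsilon values, the default min_samples, and the use of scikit-learn's bundled copy of Iris are assumptions. Outliers (label -1) are scored here as if they formed a group of their own, which is a simplification, and the score is skipped when fewer than two distinct labels come out.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Iris features only (bundled copy, an assumption); labels are not used
X, _ = load_iris(return_X_y=True)

# Candidate radii (assumed values for illustration)
eps_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.5]
results = []

for eps in eps_values:
    labels = DBSCAN(eps=eps).fit_predict(X)
    distinct = set(labels.tolist())
    n_clusters = len(distinct - {-1})  # -1 marks outliers, not a cluster
    n_outliers = labels.tolist().count(-1)
    # silhouette_score needs between 2 and n_samples - 1 distinct labels
    if 1 < len(distinct) < len(X):
        score = silhouette_score(X, labels)
    else:
        score = None
    results.append((eps, n_clusters, n_outliers, score))

for eps, n_clusters, n_outliers, score in results:
    shown = "n/a" if score is None else f"{score:.3f}"
    print(f"eps={eps:<5} clusters={n_clusters:<2} outliers={n_outliers:<4} silhouette={shown}")
```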
This course introduces machine learning covering the three main techniques used in industry: regression, classification, and clustering.
It is designed to be self-contained, easy to approach, and fast to assimilate.
You will learn:
What machine learning is
Where machine learning is used in industry
How to recognize the technique you should use
How to solve regression problems to predict numerical quantities
How to solve classification problems to predict categorical quantities
How to use clustering to group your data and discover new insights
The course is designed to maximize the learning experience for everyone and includes 50% theory and 50% hands-on practice. It includes labs with hands-on exercises and solutions.
No software installation required. You can run the code on Google CoLab and get started right away.
This course is the fastest way to get up to speed in machine learning and Scikit Learn.
Why Machine Learning?
Machine Learning has taken the world by storm in the last 10 years, revolutionizing every company and empowering many applications we use every day.
Here are some examples of where you can find machine learning today: recommender systems, image recognition, sentiment analysis, price prediction, machine translation, and many more!
There are over 3000 job announcements requiring Scikit Learn in the United States alone, and almost 80000 jobs mentioning machine learning in the US. Machine Learning engineers can easily earn six-figure salaries in major cities, and companies are investing billions of dollars in developing their teams.
Even if you already have a job, understanding how machine learning works will empower you to start new projects and get visibility in your company.
Why Scikit Learn?
It's the best Python library to learn machine learning
Simple, yet powerful API for predictive data analysis
Used in many industries: tech, biology, finance, insurance
Built on standard libraries such as NumPy, SciPy, and Matplotlib