K-Means Clustering Tutorial

Kirill Eremenko
A free video tutorial from Kirill Eremenko
Data Scientist
4.5 instructor rating • 46 courses • 1,899,162 students

Lecture description

What is K-Means? Learn the K-Means Clustering algorithm step by step. Choose the number of clusters in K-Means.

Learn more from the full course

Machine Learning A-Z™: Hands-On Python & R In Data Science

Learn to create Machine Learning Algorithms in Python and R from two Data Science experts. Code templates included.

43:48:25 of on-demand video • Updated September 2021

  • Master Machine Learning on Python & R
  • Have a great intuition for many Machine Learning models
  • Make accurate predictions
  • Perform powerful analyses
  • Make robust Machine Learning models
  • Create strong added value for your business
  • Use Machine Learning for personal purposes
  • Handle specific topics like Reinforcement Learning, NLP and Deep Learning
  • Handle advanced techniques like Dimensionality Reduction
  • Know which Machine Learning model to choose for each type of problem
  • Build an army of powerful Machine Learning models and know how to combine them to solve any problem
Hello, and welcome back to the course on machine learning. In this section we're talking about the K-Means clustering algorithm, and in this tutorial we're going to build the intuition behind K-Means. K-Means is an algorithm that allows you to cluster your data, and as we will see, it's a very convenient tool for discovering categories or groups in your data set that you wouldn't otherwise have thought of yourself. In this tutorial we'll learn how to understand K-Means on an intuitive level, and we'll walk through an example of how it works. So let's dive straight in and make this complex topic simple.

Here we've got a scatter plot. How did this plot come to be? Imagine we have two variables in our data set and we decide to plot them on the x and y axes: there's one variable, there's the other, and this is how our observations are arranged according to those two variables. The question is: can we identify certain groups among these observations, and how would we go about doing it? Is this one group and that the second group? Or are there three groups, or four, or five? How do we identify the number of groups, and how do we identify the groups themselves?

What K-Means does for you is take the complexity out of this decision-making process and let you very easily identify those groups, which are properly called clusters, in your data set. In this case we end up with three clusters. Of course, this is a very simplified example: we only have two dimensions here, so two variables. K-Means can work with multidimensional data, so with three, four, five, or ten dimensions. This example is just for illustration purposes, so that we can visually see what's going on, but in reality there can be any number of variables, ten or a hundred, and K-Means will do that complex calculation for you. It's an algorithm designed to find these clusters for you, and today we're going to find out a bit more about how it works.

So what is the step-by-step process? My favorite part: we get to break it down into such a simple step-by-step process that it will be extremely intuitive and easy to understand.

Step 1 of the K-Means algorithm is to choose the number of clusters K. We'll talk more about how to select the optimal number of clusters further down in this section of the course, but for now let's imagine we've agreed on a number of clusters for a certain challenge, say two, three, or five.

Once you've done that, you proceed to Step 2, which is to select at random K points that will be the centroids of your clusters. These points don't necessarily have to come from your data set: on the scatter plot you could pick any random x and y values, as long as the number of centroids you select equals the number of clusters you decided upon. We'll see this in practice just now.

Then you move on to Step 3, where you assign each data point to the closest centroid, and that immediately forms K clusters.
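To make Steps 1 to 3 concrete, here is a minimal NumPy sketch. It's just an illustration under assumed names (a made-up data array `X`, K = 3), not code from the course:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical data set: 20 observations of 2 variables.
X = rng.random((20, 2))

# Step 1: choose the number of clusters K.
K = 3

# Step 2: select K random centroids. They don't have to be actual
# observations; any points inside the data's bounding box will do.
lo, hi = X.min(axis=0), X.max(axis=0)
centroids = lo + rng.random((K, 2)) * (hi - lo)

# Step 3: assign each data point to its closest centroid
# (Euclidean distance), which forms K clusters right away.
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)  # cluster index for each observation
```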
So those are your starting clusters, and then there's an iterative process to refine them, which we'll see in action. Basically, for every point in your data set you check which of the centroids you identified in Step 2 is the closest, and that's the cluster the data point is assigned to. "Closest" is a somewhat vague term here, because it depends on what kind of distance you're measuring: is it Euclidean distance, or some other kind of distance better suited to your business problem? For the purposes of these tutorials we'll use Euclidean distance, which in simple terms is just the straight-line geometric distance.

Step 4, once you've completed Step 3, is to compute and place the new centroid of each cluster, and we'll see how that's done just now. Once you've formed your clusters, you recalculate the centroids; again, this will be much easier to understand when we look at an example.

Then in Step 5 you reassign each data point to the new closest centroid. Basically you perform Step 3 again, but we'll call it Step 5 here. If any reassignment took place, you repeat Step 4, so it becomes an iterative loop of Steps 4 and 5. If no reassignment took place, you go to FIN: your model is ready.

I know this might seem a bit complex, but we'll go through a visual example next so that we understand it all on a very intuitive level. You can always come back to this slide with the step-by-step rules later, when you're actually performing K-Means clustering or when you want to understand what's happening in the background. This slide will be a great reference point after the visual exercise we're about to do.
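Before we get to the visual exercise, note that Steps 1 through 5 translate almost line for line into code. Here is a minimal sketch, assuming NumPy and a two-dimensional array `X` of observations (not the course's own code template); library routines such as scikit-learn's KMeans implement this same basic loop, with smarter initialization:

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """Plain K-Means: repeat Steps 4 and 5 until no point changes cluster."""
    rng = np.random.default_rng(seed)
    # Step 2: random initial centroids inside the data's bounding box.
    lo, hi = X.min(axis=0), X.max(axis=0)
    centroids = lo + rng.random((K, X.shape[1])) * (hi - lo)
    labels = np.full(len(X), -1)

    for _ in range(max_iters):
        # Steps 3 and 5: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # no reassignment took place: FIN
        labels = new_labels
        # Step 4: move each centroid to the center of mass of its cluster.
        for k in range(K):
            if np.any(labels == k):    # guard against an empty cluster
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Usage on a made-up data set with two visible groups:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centroids = k_means(X, K=2)
```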
So let's move on. Here is our scatter plot, with the observations in our data set plotted against two variables. Right away, the first question: can you visually, quickly identify the final clusters you think we'll end up with? It's pretty tough, isn't it? And that's with just two variables. Imagine how complex this challenge would be with three or five variables; we wouldn't even be able to plot a five-dimensional scatter plot. That's where K-Means clustering comes into play, and that's where this algorithm helps us simplify the process. So let's see how it performs in action.

In this case we're actually going to perform the K-Means clustering algorithm manually, something that would normally be done by a tool of your choice, such as R or Python, just so that we can get very comfortable with the algorithm and see exactly how it works.

So let's go through the steps. Step 1: choose the number of clusters K, and let's say we've somehow identified that the optimal number of clusters is two. Again, we'll discuss how to find the optimal number of clusters further down in this section, but for now let's agree that for this challenge it's two clusters. Then Step 2: select at random K points that will be the centroids of your clusters, not necessarily from your data set. That means they can be actual points in your data set, or just random points on your scatter plot; it doesn't matter where you select them.

So let's say we've selected these two centroids: a blue centroid and a red centroid. Next is Step 3: assign each data point to the closest centroid, and that will form K clusters. Basically, for every data point in our data set we need to identify which of the two centroids, blue or red, is closer. For this data point, blue is obviously closer than red, so it's assigned to the blue cluster. For this data point blue is again closer than red; for this one, red is closer than blue. We could keep going like that for every data point and assign each to one of the two centroids, but we're going to use a quick hack, something we know from geometry. We connect the two centroids with a line, find the center of that line, which is here, and draw a perpendicular line exactly through that center, like that. From high school geometry (or maybe university, and even if you never studied geometry, it's a very straightforward concept), we know that any point on this green line is equidistant from the blue and the red centroids. If I take this point, it's the same distance to the blue centroid as to the red one; if I take that point, again the same distance to both. The whole line is equidistant from blue and red; that's how we constructed it. Based on that, any point on the scatter plot above the green line is closer to blue, and any point below it is closer to red, and that's how we're going to color all the points. It's a very quick way of coloring them. Again, you could absolutely go through every point individually and identify which centroid it's closest to, but for this exercise the quick hack is faster.

So there we go, we've colored the points on our scatter plot. Before we continue, I want to mention something: "closest", even though it seems straightforward, is quite an ambiguous term. When you're visualizing things on a scatter plot, yes, it's pretty clearly the geometric distance, but in mathematics and data science there are lots of different types of distances. The one we're using is Euclidean, and whether you should use Euclidean distance or some other distance defined for your challenge is up to you to decide, and it's something you specify for the algorithm. For our purposes we'll use Euclidean distance for illustration: the straightforward, simple geometric distance between two points. That's just a caveat; if you'd like to research distances further, make sure to look into it, because for some challenges you might want to use non-Euclidean distances that are defined in other specific ways.
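That quick hack rests on a standard geometric fact: every point on the perpendicular bisector of the segment joining the two centroids is equidistant from both. A tiny numeric check, using made-up centroid coordinates, confirms it:

```python
import numpy as np

# Two hypothetical centroids (blue and red).
blue = np.array([1.0, 4.0])
red = np.array([5.0, 2.0])

# The perpendicular bisector passes through the midpoint and is
# perpendicular to the segment joining the centroids.
midpoint = (blue + red) / 2
direction = np.array([-(red - blue)[1], (red - blue)[0]])  # 90-degree rotation

# Any point on that line has the form midpoint + t * direction.
for t in (-2.0, 0.0, 1.5):
    p = midpoint + t * direction
    d_blue = np.linalg.norm(p - blue)
    d_red = np.linalg.norm(p - red)
    assert np.isclose(d_blue, d_red)  # equidistant, as the geometry promises
```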
All right, with that in mind, let's proceed to Step 4: compute and place the new centroid of each cluster. Right now we've got the blue points (excluding the centroid itself) and the red points (excluding the centroid itself), and we need to find where the new centroid for the blue points and the new centroid for the red points should be. The way to think about it: imagine the centroids themselves are weightless, but the other points each have a certain weight, say one kilogram. Then you need to find the center of mass, or center of gravity, of each cluster's points and pinpoint it on your scatter plot. For the blue cluster it'll be somewhere here; for the red cluster, somewhere there. On a two-dimensional scatter plot you can roughly see where it is, or you can compute it: take all the blue points, average their y coordinates, which lands about here, then average their x coordinates, which lands about there, and that gives the new centroid. Do the same for the red cluster. So that's how you get the new centroids for the blue and red clusters, and we simply move our centroids there.

Next you perform Step 5: reassign each data point to the new closest centroid. If any reassignment took place, go back to Step 4; otherwise you may finish the algorithm, meaning that it has converged. So let's see how the data points reassign to the new centroids. If we draw the new equidistant line through the scatter plot, you'll see that three points now sit on the wrong side: this point is closer to the other centroid than the one it's assigned to, and so are these two. So we recolor them. Now we have a new clustering, and some reassignment did take place, so we proceed back to Step 4: compute and place the new centroid of each cluster. Same thing: find the center of mass of each cluster's points and move the centroids to those locations. Then repeat Step 5: reassign each data point to the new closest centroid. We draw the equidistant line between the two centroids again, and as you can see, this point now belongs in the blue cluster rather than the red, so we reassign it.

Since a reassignment took place, we return to Step 4 once more: that's where the new centroids will be located, so we move them there. Repeat Step 5; as you can see, it's an iterative process that keeps going like this until the algorithm converges. There's our equidistant line again; one point needs to be reassigned, and it gets reassigned. Back to Step 4: compute and place the new centroids, move them, and repeat Step 5. This time the equidistant line doesn't make any points change sides: all the points are already in their correct clusters, so no reassignment took place during this step, and we can proceed to completing the algorithm. This means the algorithm has converged.

So there we go: those are our clusters, and the model is ready. Now we can remove the centroids and the equidistant line, and that's our final result. As you can see, that's how the K-Means algorithm works: a very intuitive and straightforward iterative process.
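One detail from Step 4 worth pinning down: for equally weighted points, the center of mass is simply the per-coordinate average, so recomputing a centroid amounts to one mean per cluster. A small sketch with hypothetical blue-cluster points:

```python
import numpy as np

# Hypothetical points currently assigned to the blue cluster.
blue_points = np.array([[1.0, 5.0],
                        [2.0, 4.0],
                        [1.5, 6.0],
                        [2.5, 5.5]])

# Step 4: the new centroid is the center of mass of equal-weight points,
# i.e. the average of the x coordinates and the average of the y coordinates.
new_centroid = blue_points.mean(axis=0)
print(new_centroid)  # new blue centroid: x = 1.75, y = 5.125
```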
And if we compare this to what we had at the start, you can see it wasn't at all obvious how the clusters would be arranged. For instance, you might have thought that this part could be a cluster on its own and that part another, or maybe that the bottom would be one cluster and the top another. But based on the K-Means algorithm, what we got as a result is these two clusters. The points within each cluster aren't necessarily obviously similar to each other at a glance, but this is what the K-Means algorithm is suggesting.

Hopefully you enjoyed this tutorial, and hopefully we've demystified the K-Means algorithm and broken it down into simple terms. I look forward to seeing you in the next tutorial. Until then, enjoy machine learning!