Theory: K-Means Clustering, Elbow Method

Dr. Junaid Qazi, PhD
Data Scientist

Lecture description

This lecture covers k-means clustering, a state-of-the-art unsupervised machine learning algorithm.

Learn more from the full course

Data Science and Machine Learning using Python - A Bootcamp

NumPy, Pandas, Matplotlib, Seaborn, Plotly, Machine Learning, Scikit-Learn, Data Science, Recommender Systems, NLP, Theory, Hands-on

24:52:28 of on-demand video • Updated February 2020

  • Use Python to analyze data, create state-of-the-art visualizations, and apply machine learning algorithms to facilitate decision making.
  • Python for Data Science and Machine Learning
  • NumPy for Numerical Data
  • Pandas for Data Analysis
  • Plotting with Matplotlib
  • Statistical Plots with Seaborn
  • Interactive dynamic visualizations of data using Plotly
  • SciKit-Learn for Machine Learning
  • K-Means Clustering, Logistic Regression, Linear Regression
  • Random Forest and Decision Trees
  • Principal Component Analysis (PCA)
  • Support Vector Machines
  • Recommender Systems
  • Natural Language Processing and Spam Filters
  • and much more!
English [Auto] Hi, and welcome to unsupervised learning. You have done a great job exploring supervised learning methods such as regression and classification in the previous sections. That was great. Now, in this section, we will focus on unsupervised learning, which is a set of statistical tools intended for the setting in which we have only a set of features. So we are only given a set of features measured on the observations. We are not interested in prediction anymore, because we don't have an associated response variable, which is what we used to call the target class. We don't have the target class. We will talk about the k-means clustering algorithm, a very simple and elegant approach for partitioning a dataset into distinct, non-overlapping clusters. So the goal is to discover interesting things about the measurements based on the features. That is unsupervised learning.
Before moving forward to the algorithm, let's have a brief overview of unsupervised learning. Unsupervised learning is often much more challenging. It is more subjective because there is no simple goal for the analysis: we don't have predictions and we don't have a response. Unsupervised learning is often performed as part of an exploratory data analysis. It can be hard to assess the results obtained from unsupervised learning methods, since there is no universally accepted mechanism for performing cross-validation or validating results on an independent dataset. So there is no way to check our work, because we don't know the true answer; the problem is unsupervised.
Techniques for unsupervised learning are of growing importance in a number of fields. One example is cancer research: a researcher might assay gene expression levels in a hundred patients with breast cancer. He or she might then look for subgroups among the breast cancer samples, or among the genes, in order to obtain a better understanding of the disease.
Another very common example is an online shopping site. That site might try to identify groups of shoppers with similar browsing and purchase histories, as well as items that are of particular interest to the shoppers within each group. Then an individual shopper can be preferentially shown the items in which he or she is particularly likely to be interested, based on the purchase histories of similar shoppers. You may have experienced this while shopping on Amazon or several other online shopping websites. Let's think about another common example. A search engine might choose which search results to display to a particular individual based on the click histories of other individuals with similar search patterns. These statistical learning tasks, and many more, can be performed via unsupervised machine learning techniques. Some other very common examples include identifying similar groups, feature-based clustering of customers, clustering of similar documents, and much more. There are tons of examples for unsupervised machine learning.
So we got a brief overview of unsupervised machine learning. Let's move on to the algorithm that we are going to use. This is k-means clustering. The k-means clustering algorithm attempts to group similar observations together in the data. In unsupervised learning, the overall goal is to divide the data into distinct groups such that the observations within each group are similar. So we want to put similar observations together and separate dissimilar ones into different groups. In the very simple example shown here, the algorithm takes unlabeled data, which is our training data, and attempts to group it into four distinct colored clusters, shown on the right side. So K is equal to 4 here, and K equal to 4 means we want to divide this data into four clusters. So the question is: how does that work? How does the algorithm actually assign a data point to a distinct cluster? Let's explore. Consider that we are given the data and we decided to select K equal to 4.
In the very first step, after selecting K equal to 4, each data point, or observation, is randomly assigned to a cluster. In this case all the data points are overlapping. We have four colors for the groups because we have chosen K equal to 4, but there are no distinct clusters so far. In the next step, the centroids for each cluster are computed. The centroids are shown as large colored disks with a black spot in the center. They are computed by taking the mean vector of the points in each cluster. Initially the centroids are almost completely overlapping, because the initial cluster assignment was chosen at random. So now we have the cluster centroids. We go back to the data points, and each observation, or we can say data point, is assigned to the nearest centroid. This is the point where we can see four distinct groups in our data, but we are not done yet. We need to do more computation until the clusters stop changing. The next step is actually step two, which we perform again, and this leads us to new cluster centroids. So we repeat these steps until convergence, and convergence means the clusters stop changing. Finally, after a few iterations, we get similar observations clustered together in our data. In this example we only did three iterations.
So in this case we had K equal to 4. Do you think K equal to 4 is the optimum value? Do you think there could be more clusters than four? Did we choose the right value for K? These are the questions we need to answer. Let's have a look at the same data with K from 2 to 8. They all look good. K equal to 2 looks like a good clustering. K equal to 8, yes, it also looks like a good clustering. So what is the method? Is there any optimum value for K? There is no easy answer for choosing the best or optimum value of K. In order to choose a correct or reasonable value of K, domain knowledge is important. However, the elbow method provides a way to roughly determine the value of K.
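The assign-and-update loop described in this part of the lecture (random initial assignment, centroids as mean vectors, reassignment to the nearest centroid, repeat until the clusters stop changing) can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library; the function names, the `max_iter` and `seed` parameters, the sample data, and the handling of empty clusters are assumptions made for the sketch, not something taken from the lecture.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Mean vector (centroid) of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups and return (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: randomly assign every observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute each centroid as the mean vector of its members.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption (not from the lecture): an empty cluster is
            # re-seeded with a randomly chosen data point.
            centroids.append(mean_point(members) if members else rng.choice(points))
        # Step 3: reassign every observation to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Convergence: the cluster assignments stopped changing.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical sample data: two loose groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

Because the starting assignment is random, different seeds can lead to different final clusters, so in practice the procedure is usually run several times.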
Let's explore how this elbow method helps us find the optimum value of K. Let's do some thinking here before we move on. In a given dataset, increasing the number of clusters will always reduce the distance from the data points to their centroids, to the extreme of reaching zero when K is the same as the number of data points. So what we actually do is compute the sum of squared distances for a range of K values and then plot those K values against the sum of squared distances. In the plot, we notice that the error decreases as K gets larger. This is because when the number of clusters increases, the clusters get smaller, so the distortion is also smaller. In this example, the elbow point is K equal to 5. Up to this value the sum of squared distances decreases sharply, and it can be used as the optimum for K. So the focus is to find the point up to which the sum of squared distances decreases sharply.
So this was all about the theory behind k-means clustering. In this lecture, we got a quick overview of unsupervised learning along with the k-means clustering algorithm. We have discussed the principle behind k-means clustering and how to find the optimum value of K to get the best from this state-of-the-art machine learning algorithm. I hope you enjoyed the lecture. It's time to work with the dataset now. Let's move on to the Jupyter notebook and learn by doing. See you in the next lecture. Good luck.
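As a rough illustration of the elbow method described in this lecture, the sketch below computes the sum of squared distances (SSE) for a range of K values so that they can be compared or plotted. It relies only on the Python standard library. The compact k-means inside it, the names `k_means_sse` and `elbow_curve`, the `seed` and `max_iter` parameters, and the sample data are all assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means_sse(points, k, max_iter=100, seed=0):
    """Run a basic k-means and return the sum of squared distances
    from each point to the centroid of its assigned cluster."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # random initial assignment
    centroids = []
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Centroid = mean vector of the cluster members.
                centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Assumption: re-seed an empty cluster with a random point.
                centroids.append(rng.choice(points))
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # clusters stopped changing
            break
        labels = new_labels
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, k_values, seed=0):
    """Map each K to the SSE obtained by k-means with that K."""
    return {k: k_means_sse(points, k, seed=seed) for k in k_values}


# Hypothetical sample data: three loose groups of 2-D points.
data = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.3),
    (5.0, 5.0), (5.3, 4.8), (4.9, 5.2),
    (9.0, 1.0), (9.2, 1.1), (8.8, 0.7),
]
for k, sse in elbow_curve(data, range(1, 7)).items():
    print(f"K={k}: SSE={sse:.2f}")
```

Plotting K against SSE (for example with Matplotlib) and looking for the bend in the curve is the visual step the lecture describes. With scikit-learn, the same quantity is available as the `inertia_` attribute of a fitted `KMeans` model.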