*A SAS portion will be added by August 20th, 2017* (about 30 minutes' worth)
This course is ideal for those who are interested in data mining and data analysis.
Most data in the world (whether text, audio, visual, etc.) is raw or unlabeled. This is precisely why unsupervised machine learning has become so important. With certain approaches to unsupervised machine learning (like clustering), we can discover patterns or underlying structures in data. This is a major component of exploratory data mining, also known as exploratory data analysis (EDA). EDA is used to form hypotheses, to assess the assumptions behind our statistical inferences, and as a basis for further research. For example, the conclusion of a cluster analysis could lead to a full-scale experiment.
The course starts by covering two of the most important and common non-hierarchical clustering algorithms, K-means and DBSCAN, using Python. Later, I cover hierarchical clustering with the agglomerative method, using the SAS programming language. Quite a few examples are used to aid learning.
With K-means, we start with a 'starter' (or simple) example. We then discuss the 'Completeness Score'. In the next lesson, we discuss how K-means deals with larger variances and different shapes. Then we discuss 'Color Quantization', which is used to decrease the size of an image and/or to see whether an image has any underlying structure. Finally, we take a look at cells of the human body and do some cell segmentation. For DBSCAN, we also begin with a starter example, this time using Blobs. Then I show you how DBSCAN overcomes some of the issues of K-means.
We will also cover (available soon) the agglomerative method using the SAS programming language. The single-linkage and average-linkage criteria will be discussed. There will be a visual example (not using SAS), followed by a dataset analyzed with SAS.
An intro to k-means. We talk about a couple of parameters (arguments) and look at a visual example.
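A starter run might look like the sketch below. It assumes scikit-learn is the Python library in use, and the data, the choice of three clusters, and the parameters shown (`n_clusters`, `n_init`, `random_state`) are illustrative picks rather than the exact ones from the lesson.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data: three well-separated groups (illustrative values).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=1.0,
    random_state=0,
)

# n_clusters: how many groups to look for (the "k" in k-means).
# n_init: number of random restarts; the best run is kept.
# random_state: fixes the randomness so the run is repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one (x, y) centre per cluster
print(labels[:10])              # cluster index assigned to the first 10 points
```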
A completeness score essentially tells us how closely the grouping (cluster) resembles the class (or label). More precisely, it is highest when all members of a given class end up in the same cluster.
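To make the idea concrete, here is a minimal from-scratch sketch of the usual entropy-based definition, 1 - H(cluster | class) / H(cluster). The function name `completeness` is hypothetical; in practice scikit-learn's `completeness_score` computes this quantity for you.

```python
from collections import Counter
from math import log


def completeness(labels_true, labels_pred):
    """Completeness = 1 - H(cluster | class) / H(cluster)."""
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Entropy of the cluster assignment on its own.
    h_cluster = -sum((m / n) * log(m / n) for m in cluster_sizes.values())
    if h_cluster == 0:
        return 1.0  # everything in one cluster: trivially complete

    # Entropy of the cluster assignment once the class is known.
    h_cluster_given_class = -sum(
        (m / n) * log(m / class_sizes[c]) for (c, _), m in joint.items()
    )
    return 1.0 - h_cluster_given_class / h_cluster


# Each class stays together (cluster names differ, which does not matter).
print(completeness([0, 0, 1, 1], [1, 1, 0, 0]))
# Every class is split evenly across both clusters.
print(completeness([0, 0, 1, 1], [0, 1, 0, 1]))
```

The first call scores (essentially) 1 and the second (essentially) 0, which matches the intuition that splitting a class across clusters is what lowers completeness.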
This lesson helps you understand how to think about clustering. Essentially, in clustering, imagine that you have the attributes/dimensions (the different columns of data, such as age and weight), but you don't know the classes (labels). If you pick the wrong k, it makes sense that the completeness score would not be great, since it measures how well the cluster (grouping) resembles the class (label).
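The effect of a poorly chosen k can be sketched as follows, assuming scikit-learn and synthetic data with three known classes (in a real clustering task the true labels would not be available, so this is purely a teaching check). The values of k compared here are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import completeness_score

# Three true classes, far enough apart that k=3 should recover them.
X, y_true = make_blobs(
    n_samples=300,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in (3, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = completeness_score(y_true, labels)
    print(f"k={k}: completeness={scores[k]:.3f}")
```

With k larger than the true number of classes, at least one class has to be split across clusters, so completeness drops. Note that completeness on its own does not penalize a k that is too small, because merging whole classes still keeps each class together.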
K-means is not great at handling clusters with more variance (more spread in the data) or clusters with certain shapes.
K-means has real applications. One common application is reducing the size of an image, often while also trying to see if there is any underlying structure to the data.
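Color quantization can be sketched roughly as below, assuming NumPy and scikit-learn. A random array stands in for a real photo (no image file is loaded), and the palette size `n_colors = 8` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real photo: a random 32x32 RGB image (assumption for this sketch).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)

# Treat every pixel as a point in 3-D colour space (R, G, B).
pixels = image.reshape(-1, 3).astype(float)

n_colors = 8  # illustrative palette size
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the centre of the cluster it belongs to.
palette = kmeans.cluster_centers_.round().astype(np.uint8)
quantized = palette[kmeans.labels_].reshape(image.shape)

print(len(np.unique(pixels, axis=0)), "distinct colours before")
print(len(np.unique(quantized.reshape(-1, 3), axis=0)), "distinct colours after")
```

Because every pixel now takes one of at most `n_colors` values, the image can be stored as a small palette plus one index per pixel, which is where the size reduction comes from.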
Clustering is increasingly used in medical research because algorithms like k-means can help researchers focus on certain portions of cells and bring out structures that may not be visible to the naked eye.
I discuss the two main parameters of DBSCAN and expand on them with a clear example.
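Assuming the two parameters in question are scikit-learn's `eps` (neighbourhood radius) and `min_samples` (how many points, including the point itself, must fall within that radius for it to count as a core point), a tiny sketch of their effect could look like this. The six points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two small groups plus one point far away from everything.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# With min_samples=2, a pair of close points is enough to form a cluster.
loose = DBSCAN(eps=3, min_samples=2).fit(X)
print(loose.labels_)   # [ 0  0  0  1  1 -1]

# With min_samples=3, the pair no longer qualifies and becomes noise (-1).
strict = DBSCAN(eps=3, min_samples=3).fit(X)
print(strict.labels_)  # [ 0  0  0 -1 -1 -1]
```

A label of -1 means DBSCAN treated the point as noise rather than forcing it into a cluster.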
A DBSCAN example with Blobs. It helps illustrate one way that DBSCAN overcomes some of the issues of k-means: good outlier detection.
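The outlier behaviour can be sketched as follows, assuming scikit-learn's `make_blobs`, `DBSCAN`, and `KMeans`. The blob centres, the added outlier at (30, -30), and the `eps`/`min_samples` values are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Three tight blobs...
X, _ = make_blobs(
    n_samples=150,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)
# ...plus one point far away from all of them, appended as the last row.
outlier = np.array([[30.0, -30.0]])
X = np.vstack([X, outlier])

db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("DBSCAN label for the outlier:", db_labels[-1])   # -1 means noise
print("k-means label for the outlier:", km_labels[-1])  # always some cluster index
```

K-means has no notion of noise, so the stray point is always assigned to one of the k clusters and pulls that cluster's centre toward it, whereas DBSCAN can simply flag it.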
DBSCAN works its magic when it comes to non-spherical data. I compare DBSCAN and k-means side by side on oddly shaped data.
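One way to sketch such a side-by-side comparison is with scikit-learn's two-moons dataset, scoring each algorithm against the known labels using the adjusted Rand index. The dataset, the `eps=0.2` setting, and the choice of metric are assumptions made for this illustration, not necessarily what the lesson uses.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clearly two groups, but not spherical ones.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 is a perfect match with the true grouping.
km_ari = adjusted_rand_score(y_true, km_labels)
db_ari = adjusted_rand_score(y_true, db_labels)
print(f"k-means ARI: {km_ari:.2f}")
print(f"DBSCAN  ARI: {db_ari:.2f}")
```

K-means tends to cut the moons with a straight boundary, while DBSCAN follows the dense curved regions, so on data like this DBSCAN should score noticeably higher.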
The success and fun I had with statistics-based courses in university have shaped my current teaching interests. My interest in data and statistics boils down to my passion for finding the objective truth and applying these findings in life and business. Currently, I teach five courses: a Statistics course, a SAS course in English, a SAS course in Portuguese (with subtitles, but English instruction), a SAS SQL course, and a Pandas (Python 3) course.