This course explains how to use scikit-learn for advanced machine learning. If you are aiming to work as a professional data scientist, you need to master scikit-learn!
Some familiarity with statistics and Python programming is expected. You don't need to be an expert, but you should understand what a Gaussian distribution is, be able to write loops and functions in Python, and know the basics of maximum likelihood estimation. The course focuses entirely on the Python implementation; the math behind it is omitted as much as possible.
The objective of this course is to give you a good understanding of scikit-learn (being able to identify which technique to use for a particular problem). If you follow this course, you should be able to handle a machine learning interview quite well, although in that case you will need to study the math in more detail.
We'll start by explaining the machine learning problem, its methodology, and its terminology, and the differences between AI, machine learning (ML), statistics, and data mining. Scikit-learn, being a Python library, benefits from Python's spectacular simplicity and power. We'll explain how to install scikit-learn and its dependencies, then show how we can use Pandas data in scikit-learn and also benefit from SciPy and NumPy. We'll then show how to create synthetic datasets using scikit-learn, tailored specifically for regression, classification, and clustering.
In essence, machine learning can be divided into two big groups: supervised and unsupervised learning. In supervised learning we have a target variable (which can be continuous or categorical) and we want to use certain features to predict it. Scikit-learn provides estimators for both classification and regression problems. We will start with the simplest classifier, Naive Bayes. We will then see some powerful regression techniques that, via a special trick called regularization, yield much better linear estimators. We will then analyze Support Vector Machines, a powerful technique for both regression and classification, and use classification and regression trees to estimate very complex models. We will see how we can combine many simple estimators into structures that are more robust out of sample, called "ensemble" methods: in particular random forests, extremely randomized trees, and boosting methods. These are the methods winning most data science competitions nowadays.
We will see how to use all these techniques on online data, image classification, sales data, and more. We also use real datasets from Kaggle, such as SMS spam data and house prices in the United States, to show what to expect when working with real data.
On the other hand, in unsupervised learning we have a set of features (but no outcome or target variable) and we attempt to learn from the data itself: whether it has outliers, whether it can be split into groups, whether we can remove some of those features, and so on. For example, we will see k-means, the simplest algorithm for grouping observations into clusters, and we will see that sometimes there are better techniques, such as DBSCAN. We will then explain how to use principal components to reduce the dimensionality of a dataset, and we will use some very powerful scikit-learn functions that learn the density of the data and are able to flag outliers.
I try to keep this course as updated as possible, especially since scikit-learn is constantly evolving; for example, neural networks were added in the latest release. I have kept the examples as simple as possible, with as few observations (samples) and features (variables) as I could. In real situations we may use hundreds of features and thousands of samples, and most of the methods presented here scale really well to those scenarios. I don't want this course to focus on very realistic examples, because I think that obscures what we are trying to achieve in each example. Nevertheless, some more complex examples will be added as additional exercises.
How to install NumPy, SciPy, and scikit-learn, and how to make sure everything works
Basic scikit-learn concepts and terminology. How to load external data via Pandas. Some useful standardization functions that scikit-learn provides
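As a quick illustration of one of those standardization helpers, here is a minimal sketch using StandardScaler, which centers each column to mean 0 and scales it to unit variance (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (invented data).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Many estimators (SVMs, PCA, k-means) are sensitive to feature scale, so this step often comes first in a pipeline.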
How we can use scikit-learn to create synthetic data for clustering, regression, and classification problems
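A short sketch of the three generators this lecture covers: make_regression, make_classification, and make_blobs (the sizes and parameters here are illustrative only):

```python
from sklearn.datasets import make_regression, make_classification, make_blobs

# Synthetic data for a regression problem: X and a continuous target y.
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=0.1,
                               random_state=0)

# Synthetic data for a binary classification problem.
X_clf, y_clf = make_classification(n_samples=100, n_features=5,
                                   random_state=0)

# Synthetic data for clustering: three Gaussian blobs.
X_blob, y_blob = make_blobs(n_samples=100, centers=3, random_state=0)

print(X_reg.shape, X_clf.shape, X_blob.shape)
```

Fixing random_state makes the generated data reproducible across runs.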
We briefly review the Bayesian ideas behind Naive Bayes. We then explain when to use the Bernoulli or the Multinomial variant, depending on the assumptions we make about the data
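The distinction can be sketched on a tiny invented example: MultinomialNB works on word counts, while BernoulliNB works on binary presence/absence features (the "documents" and labels below are made up):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Rows: documents; columns: counts of 3 hypothetical words.
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 3, 2]])
y = np.array([0, 0, 1, 1])  # invented labels: 0 = ham, 1 = spam

# Multinomial variant: uses the raw counts.
mnb = MultinomialNB().fit(X_counts, y)

# Bernoulli variant: only cares whether each word appears at all.
bnb = BernoulliNB().fit((X_counts > 0).astype(int), y)

print(mnb.predict([[1, 0, 0]]))  # word 0 is typical of class 0
```

The choice between the two is exactly the modeling assumption the lecture discusses: do counts carry information, or only presence?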
We use a real SMS spam dataset from Kaggle to test the Bernoulli and Multinomial Naive Bayes classifiers. We end up achieving 95% and 96% accuracy under cross-validation (vs. the 86% accuracy we would have obtained by always predicting the majority class, i.e. the proportion of non-spam messages over total messages). Now you know why spammers hate machine learning practitioners!
We introduce SVM in a very simple (linear) context. Even though it is an extremely powerful algorithm, it tends to generate too many support vectors, possibly over-fitting the data. Is there a solution to that? And even though SVM is famous as a classification tool, we will see that it can also be used as a very powerful regression tool
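A minimal sketch of both uses, on synthetic data (the kernel and C values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.svm import SVC, SVR
from sklearn.datasets import make_classification

# Classification: a linear SVM on synthetic data.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_))  # number of support vectors chosen by the solver

# Regression: the same machinery, fitting a noisy sine curve.
rng = np.random.RandomState(0)
Xr = rng.uniform(-3, 3, size=(80, 1))
yr = np.sin(Xr).ravel()
reg = SVR(kernel="rbf", C=10.0).fit(Xr, yr)
print(round(reg.score(Xr, yr), 2))
```

Inspecting `clf.support_` is how we will later see the over-fitting concern: the more support vectors, the more the boundary bends to the training data.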
In the previous lesson we presented SVM and showed that we can't control the number of support vectors directly. An alternative formulation, Nu-SVM (NuSVC/NuSVR in scikit-learn), allows us to do exactly that
The most famous (and simplest) neural network is used a lot to predict categorical outcomes, such as whether an observation belongs to group A or B. It also has a nice property: we can draw conclusions from its parameters. We explore linear_model.LogisticRegression in scikit-learn
We use a logistic regression model to predict whether a person's income exceeds 50K, using US Census data. We show how the L1 and L2 regularization methods work, and we finally present a dataframe containing the coefficient names and values. Being able to assign a meaning to each coefficient (the sign of each logistic coefficient tells us whether the feature increases the probability of observing a 0 or a 1) is certainly a nice feature of logistic regression, and one not shared by many methods
Isotonic regression is a very useful method when the sign of the relationship between two variables is known. It can be implemented easily in scikit-learn. We review isotonic.isotonic_regression in scikit-learn with a price/sales example
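The idea in miniature (with invented numbers standing in for the price/sales example): fit the best non-decreasing curve to noisy data using the IsotonicRegression estimator:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Noisy data with a known increasing trend (invented values).
x = np.arange(10)
y = np.array([1.0, 2.0, 1.5, 3.0, 2.8, 4.0, 4.1, 5.0, 4.9, 6.0])

# We know the relationship's sign, so we constrain the fit to increase.
iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)

# The fitted values never decrease, by construction.
print(np.all(np.diff(y_fit) >= 0))  # True
```

The monotonicity constraint is the whole point: the fit ignores dips in the noise that would contradict the known sign of the relationship.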
We show how to fit a linear regression model via ordinary least squares, LASSO, and ridge. We see how LASSO can reduce the dimensionality of a feature set, and how ridge can estimate well even with a correlated feature set. These regularized models are biased, but they can generate more stable predictions. In the example analyzed here, all models end up with a very similar score, so we can't conclude that one is better than another in terms of prediction. But we show how LASSO can compete really well with ridge and OLS, even under high correlation, while correctly reducing the dimensionality of the problem. We also show how to use the LassoCV and RidgeCV functions, which automatically compute the regularization parameter we need for those methods, even though in this case they don't yield a noticeable improvement.
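A compact sketch of the comparison on synthetic data where only a few features are truly informative (the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV
from sklearn.datasets import make_regression

# 10 features, but only 3 actually drive the target.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# LASSO drives uninformative coefficients to exactly zero,
# which is how it "reduces the dimensionality" of the feature set.
print(np.sum(lasso.coef_ == 0))

# LassoCV chooses the regularization strength by cross-validation.
lasso_cv = LassoCV(cv=5).fit(X, y)
print(lasso_cv.alpha_)
```

Ridge shrinks coefficients toward zero without zeroing them, which is why it handles correlated features gracefully but does not select variables.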
We review the tree functions available in scikit-learn, both for classification and regression
The best performing methods nowadays rely on building smaller models and then averaging over them (or choosing among them). Many of the winning algorithms in Kaggle competitions do exactly this. We describe the two big families of ensemble methods: (A) averaging ensemble methods and (B) boosting ensemble methods
We introduce one of the very best functions in scikit-learn: ensemble.BaggingClassifier. It allows us to plug any estimator into an ensemble family, reducing the variance of our estimator and performing much better in out-of-sample scenarios.
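A minimal sketch on synthetic data: here a decision tree is the plugged-in estimator, but any classifier could take its place (n_estimators is illustrative):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Bag 50 trees, each fitted on a bootstrap sample of the data.
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50, random_state=0)

# Cross-validation gives an honest out-of-sample estimate.
scores = cross_val_score(bag, X, y, cv=5)
print(scores.mean())
```

Averaging over trees grown on different bootstrap samples is what stabilizes the high-variance single tree.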
Because trees are used frequently in an ensemble context, scikit-learn has specific functions for them. We focus on ExtraTreesClassifier, ExtraTreesRegressor, RandomForestClassifier, and RandomForestRegressor
We practice how to encode the simplest image classification problem into the format scikit-learn needs. We see that even with few pixels and few samples, we can predict quite well whether an image is an "I" or a "C" using random forests
Boosting builds simple classifiers sequentially, each one focusing on the mistakes of the previous ones. We focus on AdaBoost, a simple idea with very solid results in image processing, text classification, and general ML.
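In scikit-learn the sequential reweighting is handled internally; a minimal sketch on synthetic data looks like this (n_estimators is illustrative):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)

# 100 weak learners (shallow trees by default), fitted one after
# another; each round upweights the samples the previous ones missed.
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(ada.score(X, y))
```

Contrast this with bagging: there the estimators are independent and averaged, while here each one deliberately depends on its predecessors' errors.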
We show how to use the fantastic GridSearchCV function. It allows us to find the best parameters for any model using cross-validation. We explain how to use it with random forests
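A minimal sketch of the pattern on synthetic data (the grid values are illustrative, not recommendations):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Every combination in the grid is evaluated by cross-validation.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 2))
```

After fitting, `search.best_estimator_` is the refitted model with the winning parameters, ready to use for prediction.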
We use a real dataset containing house prices for the US. We use several features to predict those prices, and we determine some of the parameters using cross-validation. We end up with an ExtraTrees model achieving 82% accuracy.
When we want to visualize the shape of one-dimensional data, histograms are the best tool. But what happens when we want a smoother version? Scikit-learn provides density estimation methods that are ideal for this. In this lecture we see a tricky example of data truncated between 0 and 1, where the density can still be estimated, but only after applying a trick.
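The basic mechanism, before any truncation trick, can be sketched like this (the bandwidth is illustrative):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic one-dimensional data from a standard normal.
rng = np.random.RandomState(0)
data = rng.normal(0, 1, size=(500, 1))

# Fit a Gaussian kernel density estimate: a smooth histogram.
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(data)

# Evaluate the estimated density on a grid of points.
grid = np.linspace(-4, 4, 9).reshape(-1, 1)
log_density = kde.score_samples(grid)  # log-densities
print(np.exp(log_density).round(3))   # peaks near 0, the data's mean
```

Note that `score_samples` returns log-densities; exponentiate to plot the curve itself.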
In ML we typically deal with hundreds (if not thousands) of features, and for many reasons (plotting, modelling, identifying rare observations) we will need to reduce that set. We show how to use scikit-learn to compute PCA and project the data into a low-dimensional space. After that, we plot the projected data, see which features move in similar directions, see which features have high loadings on the principal components, and even identify unusual observations.
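The core fit-and-project step can be sketched on small synthetic data with two deliberately correlated features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
base = rng.normal(size=(200, 1))
# Column 2 is almost a multiple of column 1; column 3 is independent.
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# Project the 3-D data onto its first two principal components.
pca = PCA(n_components=2)
X_low = pca.fit_transform(X)

print(X_low.shape)                    # (200, 2)
print(pca.explained_variance_ratio_) # most variance on component 1
```

The first component absorbs the two correlated columns, which is exactly the redundancy PCA is designed to exploit; `pca.components_` holds the loadings mentioned above.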
When we have M observations that we want to group into L clusters, there is no easier way than K-means. We review how to use it in scikit-learn, and show when it does not perform as expected
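The basic usage is a few lines; here is a sketch on three synthetic blobs (n_clusters and n_init are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, y_true = make_blobs(n_samples=150, centers=3, random_state=0)

# Ask K-means for 3 clusters; n_init restarts guard against bad seeds.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (3, 2): one 2-D center per cluster
print(km.labels_[:10])            # cluster assignment per observation
```

The failure modes discussed in the lecture (elongated or nested clusters) arise because K-means only measures distance to these centers.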
We review the theory behind DBSCAN, one of the most popular clustering algorithms nowadays: how it estimates the density and when it considers a point to be an outlier. We also review some tuning strategies for its parameters
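A sketch of the two parameters in action, on synthetic data with one deliberately planted outlier (the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5,
                  random_state=0)
X = np.vstack([X, [[8.0, 8.0]]])  # one far-away point, a planted outlier

# A point is "core" if at least min_samples neighbors lie within eps;
# points reachable from no core point get the outlier label -1.
db = DBSCAN(eps=0.7, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids, plus -1 for outliers
```

Unlike K-means, we never specify the number of clusters; it emerges from the density, and so does the outlier set.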
K-means and DBSCAN comparison
We use a dataset containing multiple human development indexes to cluster countries into 3 groups. We show that both K-means and PCA + K-means (with one principal component extracted) achieve practically the same results. We finally report the results per cluster and present some insights
Assume you have data containing a certain proportion of outliers (abnormal observations). Is there a robust way of identifying them? Can that be used to flag further abnormal observations?
Assume you have data without outliers, and want to predict whether a new set of observations shares that same structure or consists of outliers (possibly belonging to another distribution). We show how to use the one-class SVM to estimate the data density and classify the new observations.
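A minimal sketch of that novelty-detection setup on synthetic data (the gamma and nu values are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Training data: "normal" observations only, no outliers.
rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(200, 2))

# nu bounds the fraction of training points allowed outside the boundary.
oc = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_train)

# New observations: predict returns +1 for inliers, -1 for outliers.
X_new = np.array([[0.0, 0.0],   # looks like the training data
                  [6.0, 6.0]])  # clearly from somewhere else
print(oc.predict(X_new))
```

This is the key difference from the previous lecture: here the model is trained on clean data and judges novelty afterwards, rather than digging outliers out of a contaminated sample.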
I have worked for 7+ years as a statistical programmer in the industry. I am an expert in programming, statistics, data science, and statistical algorithms, with wide experience in many programming languages. I am a regular contributor to the R community, with 3 published packages, and an expert SAS programmer. I also contribute to scientific statistical journals; my latest publication appeared in the Journal of Statistical Software.