This course explores several data science and machine learning techniques that every data science practitioner should be familiar with. Fundamentally, the course pivots on four axes:
This course explores the fundamental concepts in these big four topics, and provides the student with an overview of the problems that can be solved nowadays.
I focus only on the computational and practical implications of these techniques, and it is assumed that the student is partially familiar with Statistics-ML-Data Science - or is willing to complement the techniques presented here with theoretical material. Python programming experience is absolutely necessary: the only Python topic we explain is how to define classes (as we will use them throughout the course).
The teaching strategy is to briefly explain the theory behind these techniques, show how they work on very simple problems, and finally present the student with some real examples. I believe that these real examples add enormous value for the student, as they help to understand why these techniques are so widely used nowadays (because they solve real problems!)
Some of the examples that we will tackle here are: forecasting the GDP of the United States, forecasting the prices of new houses in London, identifying squares and triangles in pictures, predicting the value of vehicles using online data, detecting spam in SMS data, and many more!
In a nutshell, this course explains how to:
The student needs to be familiar with statistics, Python and some machine learning concepts
Classes are fundamental for writing clean and robust Python code. We review the basics of class creation, constructors, and methods.
A powerful feature of classes is that they can be inherited. This allows us to write classes that inherit methods and data from their parent classes.
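As a minimal sketch of the two ideas above (a constructor plus methods, and a child class inheriting from a parent), consider the following. The class names `Model` and `LinearModel` are hypothetical and chosen only for illustration.

```python
class Model:
    """Parent class: stores a name and exposes a shared method."""

    def __init__(self, name):
        # The constructor runs whenever an instance is created.
        self.name = name

    def describe(self):
        return f"Model: {self.name}"


class LinearModel(Model):
    """Child class: inherits `name` and `describe` from Model."""

    def __init__(self, name, slope, intercept):
        super().__init__(name)  # reuse the parent constructor
        self.slope = slope
        self.intercept = intercept

    def predict(self, x):
        return self.slope * x + self.intercept


line = LinearModel("toy line", slope=2.0, intercept=1.0)
description = line.describe()   # method inherited from Model
prediction = line.predict(3.0)  # method defined in LinearModel
```

The child class only writes what is new (`predict`); everything else comes from the parent.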
Reading a csv via Pandas, and doing some basic data manipulation
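A small sketch of this step, assuming a made-up CSV held in memory (the column names and values are hypothetical; with a real file, a path would be passed to `pd.read_csv` instead):

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so the sketch is self-contained.
csv_text = "name,age,city\nAna,31,Madrid\nBen,45,London\nCho,28,London\n"
df = pd.read_csv(io.StringIO(csv_text))

londoners = df[df["city"] == "London"]                  # filter rows
df["age_next_year"] = df["age"] + 1                     # derive a new column
oldest_first = df.sort_values("age", ascending=False)   # sort rows
```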
Lambda expressions allow us to apply a function to every element of a Pandas dataframe, without writing an explicit loop.
We review the different ways of merging data in Pandas: full, inner, left and right joins.
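One way this might look, using a toy dataframe with hypothetical columns `price` and `qty`:

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 400.0], "qty": [1, 2, 3]})

# Apply a lambda to every element of a single column.
df["price_with_tax"] = df["price"].apply(lambda p: p * 1.2)

# Apply a lambda to every row (axis=1) to combine several columns.
df["total"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)
```

For simple arithmetic like this, vectorized expressions such as `df["price"] * df["qty"]` are usually faster; `apply` is most useful when the logic does not vectorize easily.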
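The four join types can be sketched with `pd.merge` on two toy tables (the data is made up; note that Pandas calls the full join `how="outer"`):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")       # keys in both tables
left_join = pd.merge(left, right, on="key", how="left")    # all keys from left
right_join = pd.merge(left, right, on="key", how="right")  # all keys from right
full = pd.merge(left, right, on="key", how="outer")        # keys from either table
```

Rows without a match on the other side get missing values (NaN) in the columns coming from that side.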
Building aggregate analysis is a fundamental technique for validating our analysis and results, before jumping into the actual machine learning algorithms. For example, we can compare our results versus some aggregated reports that we might get, thus validating that our data is in good shape.
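A minimal sketch of such an aggregate check, assuming a hypothetical sales table; the per-group totals could then be compared against an external report:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "amount": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# One row per region, with several aggregates computed at once.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
```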
Pivoting our dataframes is quite easy in Pandas. We show how to transform a data frame from the long format to the wide format and vice-versa
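The round trip between the two formats might look as follows (the values are invented for illustration and are not real GDP figures):

```python
import pandas as pd

long_df = pd.DataFrame({
    "country": ["US", "US", "UK", "UK"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [1.0, 1.1, 0.5, 0.6],  # made-up numbers
})

# Long -> wide: one row per country, one column per year.
wide_df = long_df.pivot(index="country", columns="year", values="gdp")

# Wide -> long: melt the year columns back into rows.
back_to_long = wide_df.reset_index().melt(
    id_vars="country", var_name="year", value_name="gdp"
)
```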
Let's review what we learnt in this section
We review how to install Matplotlib and we provide a general introduction
Creating line plots
Producing bar plots and stacked bar plots
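A short sketch of a bar plot next to a stacked bar plot, with invented data. The `Agg` backend is an assumption made here so that the code runs without a display; in a notebook it would not be needed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

quarters = ["Q1", "Q2", "Q3", "Q4"]
online = np.array([3, 4, 5, 6])   # made-up values
retail = np.array([2, 2, 3, 1])   # made-up values

fig, (ax_bar, ax_stacked) = plt.subplots(1, 2, figsize=(8, 3))

ax_bar.bar(quarters, online)
ax_bar.set_title("Bar plot")

# Stacking: the second series starts where the first one ends.
ax_stacked.bar(quarters, online, label="online")
ax_stacked.bar(quarters, retail, bottom=online, label="retail")
ax_stacked.set_title("Stacked bar plot")
ax_stacked.legend()
```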
Running a linear regression model in statsmodels. Analyzing the results. Selecting the proper parameters.
Working with dummy variables in statsmodels. Ensuring that the design matrix is full rank and that its eigenvalues look good.
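The rank problem can be sketched with NumPy and Pandas alone. The `fuel` column is hypothetical. Keeping every dummy level together with an intercept makes the columns linearly dependent, and dropping one level fixes it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"fuel": ["gas", "diesel", "electric", "gas", "diesel", "electric"]})

# Intercept + all three dummies: the dummies sum to the intercept column.
full = pd.get_dummies(df["fuel"]).astype(float)
full.insert(0, "const", 1.0)

# Intercept + two dummies (one level dropped): no linear dependency.
reduced = pd.get_dummies(df["fuel"], drop_first=True).astype(float)
reduced.insert(0, "const", 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())        # 4 columns, rank 3
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())  # 3 columns, rank 3

# Eigenvalues of X'X in ascending order: one near zero signals collinearity.
eig_full = np.linalg.eigvalsh(full.to_numpy().T @ full.to_numpy())
```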
Basic ideas behind ARIMA modelling. Why we need stationary series. How we can decompose a stationary series into the sum of AR and MA terms.
Identifying the AR and MA order of the GDP of the United States series by inspecting the ACF and PACF. Ensuring that the series is stationary by using the ADF test.
Building an actual model for the GDP of the US. The essential ARIMA() parameters. Making sure that the residuals are valid, and making the predictions for the next quarters
Forecasting the prices of new houses in London using the techniques that we learnt in the previous lectures.
Fundamental review of some ML concepts. What is ML? How does it compare to traditional statistics? Distinction between supervised and unsupervised problems
Installing scikit-learn, NumPy+MKL and SciPy
We briefly review the Bayesian ideas behind Naive Bayes. We then explain how to choose between the Bernoulli and the multinomial variant depending on the assumptions we make about the data.
We use Bernoulli and Multinomial Naive Bayes classifiers to predict spam in a real SMS dataset from Kaggle. We finally achieve 96% accuracy (in sample), versus the 86% baseline we would have obtained by labelling every message as non-spam (the proportion of non-spam messages among all SMS). This probably gives spammers a good reason to hate machine learning!
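The workflow can be sketched on a handful of invented messages (not the Kaggle dataset). `CountVectorizer` turns text into word counts; `MultinomialNB` models the counts, while `BernoulliNB` only models whether each word is present or absent.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy messages: 1 = spam, 0 = non-spam.
messages = [
    "win a free prize now",
    "free cash claim now",
    "urgent free prize waiting",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me when you arrive",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)

bernoulli = BernoulliNB().fit(counts, labels)      # word present / absent
multinomial = MultinomialNB().fit(counts, labels)  # word counts

in_sample_bernoulli = bernoulli.score(counts, labels)
in_sample_multinomial = multinomial.score(counts, labels)

new_message = vectorizer.transform(["free prize now"])
new_prediction = multinomial.predict(new_message)[0]
```

In-sample accuracy on six messages says nothing about real performance; it only shows the mechanics.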
We introduce SVM within a very simple (linear) context. Even though it is an extremely powerful algorithm, it will tend to generate too many support vectors, possibly over-fitting the data. Is there a solution to that? Even though SVM is famous as a classification tool, we will see how it can be used as a very powerful regression tool
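A small sketch of both uses, assuming simulated data: a linear `SVC` on two well-separated blobs, where we can count the support vectors, and a linear `SVR` on a noise-free line.

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(3)

# Classification: two linearly separable blobs.
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(50, 2)),
    rng.normal(2.0, 0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
n_support = clf.support_vectors_.shape[0]  # points that define the margin
accuracy = clf.score(X, y)

# Regression: the same machinery, with an epsilon-insensitive tube around the fit.
x_line = np.linspace(0.0, 10.0, 100).reshape(-1, 1)
y_line = 3.0 * x_line.ravel() + 1.0
reg = SVR(kernel="linear", C=100.0, epsilon=0.1).fit(x_line, y_line)
r2 = reg.score(x_line, y_line)
```

The parameter `C` controls how strongly margin violations are penalized, which is one of the usual knobs when the model relies on too many support vectors.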
We show how to run a linear regression model via ordinary least squares, LASSO, and Ridge. We see how LASSO can reduce the dimensionality of a feature set, and how Ridge can estimate coefficients from a correlated feature set. Both methods produce biased models, but in exchange they can generate more stable predictions. In the example analyzed here, all models end up with a very similar "score", so we can't conclude that any one is "better" than the others in terms of prediction. But we show how LASSO can generate a model that competes really well with Ridge and OLS, even with high correlation, while correctly reducing the dimensionality of the problem. We also show how to use scikit-learn's LassoCV and RidgeCV estimators, which automatically select the regularization parameter for those methods, even though in this case they don't yield a noticeable improvement.
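A sketch of this comparison on simulated data (not the lecture's dataset): two highly correlated features, three irrelevant ones, and a target that depends only on the first feature. The `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, LinearRegression, Ridge, RidgeCV

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # almost a copy of x1
irrelevant = rng.normal(size=(n, 3))      # unrelated to the target
X = np.column_stack([x1, x2, irrelevant])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# LASSO sets some coefficients exactly to zero, dropping those features.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
scores = {name: m.score(X, y) for name, m in [("ols", ols), ("lasso", lasso), ("ridge", ridge)]}

# Cross-validated choice of the regularization strength.
lasso_cv = LassoCV(cv=5).fit(X, y)
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
```

The selected strengths are available as `lasso_cv.alpha_` and `ridge_cv.alpha_`.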
We review the tree functions available in scikit-learn, both for classification and regression
The best performing methods nowadays rely on building smaller models and then averaging (or choosing one) between them. Many of the winning algorithms in Kaggle competitions do exactly this. We describe the two big families of ensemble methods: (A) - Averaging ensemble methods (B) - Boosting ensemble methods
We introduce one of the very best classes in scikit-learn: ensemble.BaggingClassifier. It allows us to plug any estimator into an ensemble family, reducing the variance of our estimator, and often performing much better in out-of-sample scenarios.
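A minimal sketch of wrapping a decision tree in `BaggingClassifier`, assuming a synthetic dataset from `make_classification`. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single tree, for reference.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 50 trees, each trained on a bootstrap sample; predictions are combined by voting.
bagged = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, random_state=0
).fit(X_train, y_train)

tree_acc = single_tree.score(X_test, y_test)
bagged_acc = bagged.score(X_test, y_test)
```

Comparing `tree_acc` with `bagged_acc` on held-out data is the quickest way to see whether bagging helps for a given problem.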
Because trees are used frequently in an ensemble context, scikit-learn has specific functions to deal with this. We focus on ExtraTreesClassifier, ExtraTreesRegressor and RandomForestClassifier + RandomForestRegressor
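The classifier versions can be sketched as follows on synthetic data; the regressors follow the same interface. The number of trees (100) is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=2)

forest = RandomForestClassifier(n_estimators=100, random_state=2)
extra = ExtraTreesClassifier(n_estimators=100, random_state=2)

# 5-fold cross-validated accuracy for each ensemble.
forest_cv = cross_val_score(forest, X, y, cv=5).mean()
extra_cv = cross_val_score(extra, X, y, cv=5).mean()

# Impurity-based feature importances (they sum to 1).
importances = forest.fit(X, y).feature_importances_
```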
Boosting is a process of generating simple classifiers and then improving them. We focus on AdaBoost, a simple idea with very solid results for image processing, text classification, and general ML.
In ML, we typically deal with hundreds (if not thousands) of features, and for many reasons (plotting, modelling, identifying rare observations) we will need to reduce that set. We show how to use scikit-learn to compute PCA, and later project that same data into a low-dimensional space. After that, we plot that data, understand which features move in similar directions, which features have high loadings on the principal components, and even identify weird observations.
When we have M observations that we want to group into L groups, there is no easier way than K-Means. We review how to use it in scikit-learn, and show when it does not perform as expected.
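A minimal AdaBoost sketch on synthetic data. With default settings, scikit-learn's `AdaBoostClassifier` boosts very shallow decision trees (stumps); the number of rounds used here is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each round fits a weak learner that focuses on previously misclassified points.
boosted = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

train_acc = boosted.score(X_train, y_train)
test_acc = boosted.score(X_test, y_test)
```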
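A sketch of these steps on simulated data: six observed features driven by two hidden factors. The data-generating setup is an assumption made only so that the result is easy to interpret.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 2))

# Features 0-2 follow the first factor, features 3-5 follow the second.
mixing = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
])
X = latent @ mixing + rng.normal(scale=0.1, size=(300, 6))

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to scale
pca = PCA(n_components=2).fit(X_scaled)

projected = pca.transform(X_scaled)                  # data in the 2-D space
loadings = pca.components_                           # one row per component
explained = pca.explained_variance_ratio_.sum()      # share of variance kept
```

Features with large absolute values in the same row of `loadings` move together, and points far from the bulk of `projected` are candidates for unusual observations.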
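A minimal K-Means sketch, assuming three well-separated simulated blobs, which is the setting where the algorithm works best:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])

# 100 points around each center.
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
found_centers = kmeans.cluster_centers_
```

The number of clusters has to be chosen in advance, and the algorithm implicitly favours compact, roughly spherical groups.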
We review the theory behind the best clustering algorithm nowadays. How it estimates the density and when it considers a point to be an outlier. We review some tuning strategies for its parameters
We use a dataset containing information on multiple human development indexes, to cluster the countries into 3 groups. We show that both K-means and PCA+K-Means (with one principal component extracted) achieve practically the same results. We finally report the results per cluster and present some insights
We have multiple recordings per word: "Banana", "Chair", "IceCream", "Hello", "Goodbye". We want to extract some metrics from each file, so we can do machine learning later. The difficult part is that the metrics we need are related to the signal encoded in each audio file. Luckily, we can leverage an existing R package that reads .wav files and outputs many properties of the frequencies present in each file. At the end, we produce 2 csv files (one for training and one for testing) containing 21 features that we can use later for machine learning. The approach presented here can be extended to situations requiring the classification of any sound.
We load the features that we extracted before, both for our training and testing datasets. We evaluate the performance of both AdaBoost and SVM. Both methods reach practically 100% in-sample accuracy, 80% cross-validation accuracy, and 80% out-of-sample accuracy.
I have 7+ years of experience as a statistical programmer in the industry. Expert in programming, statistics, data science, and statistical algorithms. I have wide experience in many programming languages. Regular contributor to the R community, with 3 published packages. I am also an expert SAS programmer. Contributor to scientific statistical journals. Latest publication in the Journal of Statistical Software.
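The three accuracy figures (in-sample, cross-validation, out-of-sample) could be computed along these lines. The data below is a synthetic stand-in that only mirrors the shape of the problem (21 features, 5 classes); it is not the audio dataset, and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted audio features.
X, y = make_classification(
    n_samples=300, n_features=21, n_informative=10, n_classes=5, random_state=3
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3
)

models = {
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=3),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0)),
}

report = {}
for name, model in models.items():
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    model.fit(X_train, y_train)
    report[name] = {
        "in_sample": model.score(X_train, y_train),
        "cross_validation": cv_acc,
        "out_of_sample": model.score(X_test, y_test),
    }
```

A large gap between the in-sample figure and the other two is the usual sign of over-fitting.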