This course explores several modern machine learning and data science techniques in R. As you probably know, R is one of the most widely used tools among data scientists. We showcase a wide array of statistical and machine learning techniques.
Most of the examples presented in this course come from real datasets collected from sources such as Kaggle, the US Census Bureau, etc. All the lectures can be downloaded and come with the corresponding material. The teaching approach is to briefly introduce each technique and then focus on the computational aspect. Mathematical formulas are avoided as much as possible, so as to concentrate on the practical implementations.
This course covers most of what you would need to work as a data scientist, or compete in Kaggle competitions. It is assumed that you already have some exposure to data science / statistics.
Dataframes are a fundamental concept in R. They are internally stored as a list of vectors. They are very powerful for storing heterogeneous data. We explain how to work with them in R.
We explore the different variable types that are available in R.
Reading a CSV file in R via the read.csv() function. Specifying the column names, and interpreting the columns as factors or strings.
Creating customized classes for dates using as.Date().
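In base R, character columns are typically converted to Date objects with as.Date(), where a format string describes the input layout. A minimal sketch with made-up dates:

```r
# Convert character dates into R's Date class
d <- as.Date(c("2021-03-15", "2021-07-01"), format = "%Y-%m-%d")

class(d)                 # "Date"
d[2] - d[1]              # difference in days between the two dates
format(d, "%d/%m/%Y")    # reformat for printing, e.g. "15/03/2021"
```

Once a column is a Date, arithmetic and comparisons work directly on it.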
Working with text in R: substrings, searching for characters, and concatenating text.
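A minimal sketch of these base R string operations, using an illustrative string:

```r
s <- "data science in R"

substr(s, 1, 4)          # extract a substring: "data"
grepl("science", s)      # search for a pattern: TRUE
toupper(s)               # change case
paste(s, "is fun")       # concatenate (default separator is a space)
nchar(s)                 # number of characters: 17
```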
Functions allow us to encapsulate related functionality. We explore how to define and use them in R.
The family of apply() functions can be used to apply a function to several elements at once. We explain how to use these functions in the context of lists, vectors, and data frames.
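A minimal sketch of three members of the family, with illustrative data:

```r
# sapply: apply a function to each element, simplifying to a vector
sapply(1:5, function(x) x^2)     # 1 4 9 16 25

# lapply: same idea, but always returns a list
lapply(c("a", "bb"), nchar)

# apply: work over the margins of a matrix (1 = rows, 2 = columns)
m <- matrix(1:6, nrow = 2)
apply(m, 2, sum)                 # column sums: 3 7 11
```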
Histograms are particularly useful for visualizing the distribution of a random variable. We show how to use the hist() function, and how to analyze the plot produced by this function.
Generating random deviates in R according to several distributions.
Calculating the density and the cumulative distribution function.
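For the normal distribution, these tasks map to the rnorm/dnorm/pnorm/qnorm quartet (the same r/d/p/q naming scheme applies to the other distributions):

```r
set.seed(1)                           # make the draws reproducible
x <- rnorm(1000, mean = 0, sd = 1)    # random deviates from N(0, 1)

dnorm(0)        # density at 0: ~0.3989
pnorm(1.96)     # CDF: P(X <= 1.96) ~ 0.975
qnorm(0.975)    # inverse CDF (quantile): ~1.96
mean(x <= 0)    # empirical CDF at 0, close to pnorm(0) = 0.5
```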
Comparing two distributions. Are they the same? We use the Kolmogorov-Smirnov test for doing this.
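A minimal sketch with simulated data, where one sample is deliberately shifted:

```r
set.seed(42)
a <- rnorm(200)             # sample from N(0, 1)
b <- rnorm(200, mean = 2)   # shifted sample

# Two-sample KS test: tiny p-value means the distributions differ
ks.test(a, b)$p.value

# One-sample version: compare a single sample against a reference CDF
ks.test(a, "pnorm")
```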
sqldf allows us to use SQL syntax directly on R data frames, so we can easily build very complex queries that would otherwise require a lot of cumbersome R notation. We discuss the basic sqldf operations (filtering, ordering, transposing) and how to execute queries on data frames.
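A minimal sketch, assuming the sqldf package is installed (it runs SQLite under the hood by default); the data frame is illustrative:

```r
library(sqldf)

df <- data.frame(city = c("NY", "LA", "NY"), price = c(10, 20, 30))

# Filtering and ordering with plain SQL
ny <- sqldf("SELECT * FROM df WHERE city = 'NY' ORDER BY price DESC")

# Aggregation with GROUP BY
avg <- sqldf("SELECT city, AVG(price) AS avg_price FROM df GROUP BY city")
```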
We use sqldf for more realistic applications, such as merging data from different dataframes.
We review full vs. inner vs. left vs. right joins. We end up using a customised approach to simulate a full join in sqldf, which can also be extended to find observations that appear in one table but not in the other.
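SQLite (sqldf's default backend) has historically lacked FULL OUTER JOIN, and a classic workaround is to UNION two left joins. A sketch with illustrative tables (the exact simulation used in the lecture may differ):

```r
library(sqldf)

a <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

# Simulate a FULL JOIN: left join in both directions, then UNION
full <- sqldf("
  SELECT a.id, a.x, b.y FROM a LEFT JOIN b ON a.id = b.id
  UNION
  SELECT b.id, a.x, b.y FROM b LEFT JOIN a ON a.id = b.id")

# Rows in a that have no match in b (an anti-join)
only_a <- sqldf("SELECT a.* FROM a LEFT JOIN b ON a.id = b.id
                 WHERE b.id IS NULL")
```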
Categorical variables: how to include them in R using as.factor().
Using the lm() function to estimate a model. Interpreting the coefficients, p-values, t-values, and the F statistic.
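A minimal sketch using the built-in mtcars dataset as a stand-in for the course's data:

```r
# Linear model: predict fuel efficiency from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)    # coefficients, t-values, p-values, R^2, F statistic
coef(fit)       # estimated coefficients only
confint(fit)    # 95% confidence intervals for the coefficients
```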
R2 and adjusted R2. ANOVA: selecting factors and looking into the ANOVA F-values.
Comparing nested and non-nested models using ANOVA and the Akaike information criterion (AIC). Predicting new observations using the predict() function. Choosing between different models.
Analyzing the residuals. Detecting structure and heteroscedasticity. Plotting leverage vs. residuals and removing influential observations.
Re-estimating our house model using a log-log model. Differences between log-log, log-linear, and linear-log models. Extracting elasticities.
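In a log-log model the slope is directly an elasticity. A minimal sketch using mtcars as a stand-in for the house data:

```r
# Log-log model: the slope is the elasticity of mpg with respect to wt,
# i.e. the % change in mpg per 1% change in weight
fit_ll <- lm(log(mpg) ~ log(wt), data = mtcars)

coef(fit_ll)["log(wt)"]   # negative: heavier cars use more fuel
```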
We introduce mixed and random effects models. They are used for modelling the covariance between observations sharing a subject variable, such as "zipcode", "person", "animal", etc. Every observation belonging to the same group (subject) receives a random shock with mean 0 and standard deviation sigma, which is estimated by the model.
We apply our previous methodology to our Kings County dataset. The objective is to include the zipcode as a random effect. Every house that belongs to the same zip code will be assumed to be correlated (it receives a common random shock).
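A minimal sketch of a random-intercept model, assuming the lme4 package (one common choice for mixed models; the lecture's exact package is not stated here). The sleepstudy data ships with lme4 and plays the role of the grouped housing data:

```r
library(lme4)

# Random intercept per Subject: observations sharing a Subject
# receive a common random shock with mean 0 and estimated sd
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

summary(fit)      # fixed effect for Days, random-intercept variance
fixef(fit)        # the fixed-effect coefficients
</imports>
```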
We explore two techniques for dealing with outliers in the context of linear regression. We show how to estimate conditional quantiles, and robust regression using the rlm() function. Both approaches yield robust estimates that are not dramatically affected by outliers.
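A minimal sketch of both approaches on mtcars. rlm() comes from MASS; for conditional quantiles I assume the quantreg package (the standard choice, though the lecture's package is not named here):

```r
library(MASS)       # rlm: robust regression via M-estimation
fit_r <- rlm(mpg ~ wt, data = mtcars)
coef(fit_r)         # coefficients that downweight outlying observations

library(quantreg)   # rq: conditional quantile regression (assumed package)
fit_q <- rq(mpg ~ wt, tau = 0.5, data = mtcars)   # median regression
coef(fit_q)
```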
Introduction to logistic regression. How to formulate a model using glm() and how to do the corresponding statistical tests
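A minimal sketch with mtcars (modelling the 0/1 transmission variable) as a stand-in for the course's data:

```r
# Logistic regression: glm() with a binomial family
fit <- glm(am ~ wt, data = mtcars, family = binomial)

summary(fit)                    # coefficients and Wald z-tests
exp(coef(fit))                  # odds ratios
p <- predict(fit, type = "response")   # predicted probabilities in (0, 1)
```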
Analyzing the degrees of freedom. Looking at the coefficients, odds ratios, and the ROC curve. Calculating the area under the curve.
Using the "performance" package. Calculating the area under the curve. Doing ANOVA for GLMs
It is quite hard to interpret the coefficients of a logistic regression model, since it is a nonlinear model. Nevertheless, we can build curves profiling the predicted probabilities as one variable changes.
Poisson regression is a type of GLM (generalized linear model) which is adequate for modelling count data (discrete and highly skewed). The only conceptual problem is that we estimate a single lambda parameter, which controls both the mean and the variance. As we will see, this can be a problem.
We continue with our previous model. In this lecture we profile the coefficients as one variable changes. In particular, we study how many people are expected to be affected as the year changes, under two scenarios.
The fundamental assumption of the Poisson model is that a single parameter (lambda) describes both the mean and the variance. The problem is that sometimes the variability grows faster than the mean as lambda increases. One easy solution is the negative binomial model, which handles this problem very easily (think of the NB model as a generalization of the Poisson model).
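A minimal sketch on simulated overdispersed counts, comparing the two models; glm.nb() lives in the MASS package:

```r
library(MASS)

# Simulate counts whose variance exceeds the mean (overdispersion)
set.seed(7)
x <- runif(500)
y <- rnbinom(500, mu = exp(1 + 2 * x), size = 1.5)

fit_p  <- glm(y ~ x, family = poisson)   # assumes Var = mean
fit_nb <- glm.nb(y ~ x)                  # estimates extra dispersion (theta)

AIC(fit_p, fit_nb)   # the NB model should fit the overdispersed data better
```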
Nonlinear regression is usually used in the context of biology and pharmacokinetics. We show how to fit a model using nonlinear least squares. Specifically, we fit a Michaelis-Menten model for enzyme kinetics.
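A minimal sketch using R's built-in Puromycin enzyme-kinetics data as a stand-in for the lecture's dataset; the starting values are rough guesses, as nls() requires:

```r
# Michaelis-Menten model: rate = Vmax * conc / (Km + conc)
treated <- subset(Puromycin, state == "treated")

fit <- nls(rate ~ Vmax * conc / (Km + conc),
           data  = treated,
           start = list(Vmax = 200, Km = 0.05))  # rough starting guesses

coef(fit)      # estimated Vmax and Km
summary(fit)   # standard errors and t-tests for the parameters
```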
We explore the basic parameters in XGBoost
We continue analyzing the parameters in XGBoost
We use XGBoost for modelling the house prices in Kings County.
XGBoost also comes with its own cross-validation function (xgb.cv) that allows us to compute the cross-validated score for each boosting round. The only problem is that we need a manual approach (this function does not tune the parameters for us).
We use the caret package to train an XGBoost model via cross-validation. This allows us to tune some of the hyper-parameters automatically.
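A minimal sketch with mtcars standing in for the course's data; the grid values are illustrative, and caret's "xgbTree" method requires all seven tuning parameters to appear in the grid:

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold CV
grid <- expand.grid(nrounds = c(50, 100), max_depth = c(2, 4),
                    eta = 0.1, gamma = 0, colsample_bytree = 1,
                    min_child_weight = 1, subsample = 1)

fit <- train(mpg ~ ., data = mtcars, method = "xgbTree",
             trControl = ctrl, tuneGrid = grid)

fit$bestTune   # the hyper-parameter combination selected by CV
```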
We work with a real dataset containing 65 indexes for 188 countries. The idea is to use principal component analysis to extract a set of principal components that explain a reasonable amount of the variability in this dataset. The advantage is that we end up with a much smaller set of features. The principal components can then be used for modelling (since we are using few features, we likely mitigate overfitting).
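A minimal sketch of the technique with the small built-in USArrests dataset standing in for the country indicators; scaling the features first is usually essential:

```r
# PCA on standardized features
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)           # proportion of variance explained per component
head(pca$x[, 1:2])     # scores on the first two principal components,
                       # usable as a compact feature set for modelling
```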
We use the principal components obtained in the previous lecture for regression. In particular, we predict the HDI (Human Development Index) in terms of a set of principal components.
We introduce the caret package, a fundamental R package for doing machine learning. It can be thought of as a layer standing between us and the underlying R packages, one that greatly simplifies the process of training, evaluating, and predicting with machine learning models.
We explain how to leverage some of caret's excellent preprocessing capabilities. In particular, we use it for scaling features and for constructing a matrix of dummy variables from a categorical variable.
We continue with data preprocessing, in this case filtering out variables with insufficient variability. We also show how to find and filter out correlated variables.
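A minimal sketch of both filters using caret on a small simulated data frame built to contain one near-constant column and one near-duplicate column:

```r
library(caret)

set.seed(1)
df <- data.frame(a = rnorm(100),
                 b = c(rep(0, 99), 1))        # almost no variability
df$c <- df$a + rnorm(100, sd = 0.01)          # nearly a copy of a

nearZeroVar(df)                         # index of the near-zero-variance column
findCorrelation(cor(df), cutoff = 0.9)  # index of a highly correlated column
```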
We use a specific R package for extracting several sound features. This is particularly relevant for classifying speech. In later stages of this course, we will run several machine learning models over this dataset.
I have worked for over 7 years as a statistical programmer in the industry. Expert in programming, statistics, data science, and statistical algorithms, with wide experience in many programming languages. Regular contributor to the R community, with 3 published packages. I am also an expert SAS programmer and a contributor to scientific statistical journals; my latest publication appeared in the Journal of Statistical Software.