24h Pro data science in R

Rating: 3.9 (27 reviews)

Practice data science with 24 hours of material using real examples

Created by Francisco Juretig

Last updated 6/2017

English

What you'll learn

Do machine learning in R
Process data for modelling

Course content

16 sections • 75 lectures • 18h 29m total length

Introduction (19:52)
Brief introduction to this course
Setting up R (3:12)
Installing R (and RStudio, in case you want to use it). How to install packages

The data frame (13:15)
Data frames are a fundamental concept in R. They are internally stored as a list of vectors, which makes them very powerful for storing heterogeneous data. We explain how to work with them in R.
Variables (13:03)
We explore the different variable types that are available in R
Reading data (19:56)
Reading a csv file in R via the read.csv function. Specifying the column names, and interpreting the columns as factors or strings
Reading data with dates: Classes for customized dates (13:15)
Creating customized classes for dates using setAs()
Text (8:48)
Working with text in R. Substrings, searching for letters, concatenating text
Functions (11:36)
Functions allow us to encapsulate related functionality. We explore how to use them in R
The apply family of functions (10:04)
The apply() family of functions can be used to apply a function to several elements at once. We explain how to use these functions in the context of lists, vectors, and data frames.
Histograms (12:56)
Histograms are particularly useful for visualizing the distribution of a random variable. We show how to use the hist() function, and how to analyze the plot produced by this function.
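
As a minimal sketch of the data frame, apply() and hist() lectures above (the scores data frame and its columns are made up for illustration and are not taken from the course material):

```r
# Hypothetical data: a small data frame with heterogeneous columns
scores <- data.frame(
  student = c("ana", "ben", "cy"),
  math    = c(90, 70, 80),
  physics = c(85, 65, 95),
  stringsAsFactors = FALSE
)

# A data frame is stored as a list of equal-length vectors (one per column)
is_list <- is.list(scores)

# sapply() loops over the columns and simplifies the result to a named vector
col_means <- sapply(scores[, c("math", "physics")], mean)

# apply() with MARGIN = 1 works row by row on the numeric columns
row_totals <- apply(scores[, c("math", "physics")], 1, sum)

# lapply() always returns a list, here one (min, max) pair per column
col_ranges <- lapply(scores[, c("math", "physics")], range)

# hist() returns the bin counts without drawing when plot = FALSE
h <- hist(c(scores$math, scores$physics),
          breaks = seq(60, 100, by = 10), plot = FALSE)
```

Calling hist() without plot = FALSE draws the histogram; the returned object exposes the same breaks and counts either way.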

sqldf - Part 1 (16:50)
sqldf allows us to use SQL syntax directly on R data frames. Thus, we can easily build very complex queries that would otherwise require a lot of cumbersome R notation. We discuss the basic sqldf operations (filtering, ordering, transposing) and how to execute queries on data frames.
sqldf - Part 2 (17:55)
We use sqldf for more realistic applications, such as merging data from different data frames. We review the full vs inner vs left vs right join. We end up using a customised approach to simulate a full join in sqldf, which can also be extended to find observations that are in one table but not in the other.
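
The listing does not say which customised approach the lecture uses. One common way to simulate a full join, sketched here as an assumption, is a LEFT JOIN combined via UNION ALL with the unmatched rows of the other table. The data frames a and b are hypothetical, and the sqldf package must be installed:

```r
library(sqldf)

# Hypothetical tables sharing an id column
a <- data.frame(id = c(1, 2, 3), x = c("a1", "a2", "a3"), stringsAsFactors = FALSE)
b <- data.frame(id = c(2, 3, 4), y = c("b2", "b3", "b4"), stringsAsFactors = FALSE)

# Simulated full join: all rows of a (matched or not),
# plus the rows of b that have no partner in a
full <- sqldf("
  SELECT a.id AS id, a.x, b.y
    FROM a LEFT JOIN b ON a.id = b.id
  UNION ALL
  SELECT b.id AS id, a.x, b.y
    FROM b LEFT JOIN a ON a.id = b.id
   WHERE a.id IS NULL
  ORDER BY id
")

# The second half on its own is an anti-join: rows in b but not in a
only_in_b <- sqldf("
  SELECT b.* FROM b LEFT JOIN a ON a.id = b.id WHERE a.id IS NULL
")
```

The anti-join query is the extension the lecture description mentions: it lists the observations present in one table and missing from the other.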

Dummy variables (12:23)
Categorical variables and how they are included in R models using as.factor()
The lm() function: Part 1 (19:58)
Using the lm() function to estimate a model. Interpreting the coefficients, p-values and t-values. The F statistic
The lm() function: Part 2 (13:44)
R2 and adjusted R2. ANOVA, selecting factors and looking into the ANOVA F-values
Comparing models (19:56)
Comparing nested and non-nested models using ANOVA and the Akaike information criterion (AIC). Predicting new observations using the predict() function. Choosing between different models
Normality, residuals and transformations (19:30)
Analyzing the residuals. Detecting structure and heteroscedasticity. Plotting the leverage vs residuals, removing the influential observations
Log-log models (12:35)
Re-estimating our house model using a log-log model. Difference between log-log, log-linear, linear-log models. Extracting elasticities.
Linear mixed effects models: Part 1 (17:34)
We introduce mixed and random effects models. They are used for modelling the covariance between observations sharing a subject variable, such as "zipcode", "person", "animal", etc. Every observation belonging to the same group (subject) receives a common random shock with mean 0 and standard deviation sigmax, which is estimated by the model.
Linear mixed effects models: Part 2 (19:16)
We apply our previous methodology to our King County dataset. The objective is to include the zipcode as a random effect. Houses that belong to the same zip code are assumed to be correlated (they receive a common random shock).
Robust regression (19:36)
We explore two techniques for dealing with outliers in the context of linear regression. We show how to estimate conditional quantiles, and how to fit a robust regression using the rlm function. Both approaches yield estimates that are not dramatically affected by outliers.
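
The lm(), model comparison and log-log lectures above can be sketched roughly as follows. The data are simulated here (the variables sqft, grade and price are hypothetical stand-ins, not the course's house dataset), so the numbers only illustrate the workflow:

```r
set.seed(1)

# Simulated house data with a true price elasticity of 0.8 with respect to sqft
n      <- 200
sqft   <- runif(n, 500, 4000)
grade  <- factor(sample(c("low", "mid", "high"), n, replace = TRUE))
price  <- exp(4 + 0.8 * log(sqft) + rnorm(n, sd = 0.2))
houses <- data.frame(price, sqft, grade)

# Log-log model: the slope on log(sqft) is read directly as an elasticity
fit_small  <- lm(log(price) ~ log(sqft), data = houses)
elasticity <- coef(fit_small)[["log(sqft)"]]

# Nested comparison: does adding the factor (dummy variables) help?
fit_big <- lm(log(price) ~ log(sqft) + grade, data = houses)
cmp     <- anova(fit_small, fit_big)   # F test for the extra terms
aic     <- AIC(fit_small, fit_big)     # also usable for non-nested models

# Predict a new observation and move back from the log scale
# (a plain exp() ignores the retransformation bias)
new_price <- exp(predict(fit_small, newdata = data.frame(sqft = 2000)))
```

summary(fit_small) prints the coefficients, t-values, p-values, R2 and the F statistic discussed in the lm() lectures.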

Logistic regression - Part 1 (19:08)
Introduction to logistic regression. How to formulate a model using glm() and how to do the corresponding statistical tests
Logistic regression - Part 2 (19:49)
Analyzing the degrees of freedom. Looking at the coefficients, odds ratio, and ROC curve. Calculating the area under the curve.
Logistic regression - Part 3 (19:52)
Using the "performance" package. Calculating the area under the curve. Doing ANOVA for GLMs
Logistic regression - Part 4 (19:57)
It is quite hard to interpret the coefficients of a logistic regression model, since it is a nonlinear model. Nevertheless, we can build curves profiling the predicted probabilities as one variable changes.
Poisson regression: Part 1 (19:54)
Poisson regression is a type of GLM (generalized linear model) that is suitable for modelling count data (discrete and highly skewed data). The only conceptual problem is that we estimate a single parameter, lambda, which controls both the mean and the variance. As we will see, this might be a problem.
Poisson regression: Part 2 (14:43)
We continue with our previous model. In this lecture we profile the predictions as one variable changes. In particular, we study how many people are expected to be affected as the year changes for two situations.
Poisson regression: Part 3 (3:21)
The fundamental assumption of the Poisson model is that the same parameter (lambda) describes both the mean and the variance. The problem is that sometimes the variability increases more than it should as lambda increases. One easy solution is to use the negative binomial model, which handles this problem easily (think of the NB model as a generalization of the Poisson model).
Nonlinear regression (19:45)
Nonlinear regression is often used in the context of biology and pharmacokinetics. We show how to fit a model using nonlinear least squares. Specifically, we fit a Michaelis-Menten model for enzyme kinetics.
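
A minimal sketch of a Michaelis-Menten fit with nonlinear least squares. The listing does not name the dataset used in the lecture, so this example assumes the Puromycin data that ships with R, and the starting values are rough guesses:

```r
# Puromycin (datasets package): reaction rate vs substrate concentration
treated <- subset(Puromycin, state == "treated")

# Michaelis-Menten model: rate = Vm * conc / (K + conc)
# Vm is the maximum rate, K the concentration at which rate = Vm / 2
fit <- nls(rate ~ Vm * conc / (K + conc),
           data  = treated,
           start = list(Vm = 200, K = 0.05))

est  <- coef(fit)                                        # fitted Vm and K
pred <- predict(fit, newdata = data.frame(conc = 0.5))   # rate at conc = 0.5
```

Unlike lm(), nls() needs starting values; poor guesses can prevent convergence, which is why they are usually read off a plot of the data first.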

How does it work? Relevant parameters - Part 1 (16:24)
We explore the basic parameters in XGBoost
How does it work? Relevant parameters - Part 2 (8:10)
We continue analyzing the parameters in XGBoost
Using XGBoost for regression (14:37)
We use XGBoost for modelling the house prices in King County.
Cross validation in XGBoost: the xgb.cv function (12:36)
XGBoost also comes with its own cross-validation function that allows us to compute the cross-validated score for each sample. The only problem is that we need to use a manual approach (this function does not tune the parameters for us)
GridSearch for XGBoost via the caret package (17:10)
We use the caret package to execute an XGBoost model via cross-validation. This allows us to select some of the hyper-parameters via cross-validation
Using XGBoost for classification (8:49)
We use XGBoost for classifying mushrooms.

Selecting PCA and projecting the data (20:01)
We work with a real dataset containing 65 indexes for 188 countries. The idea is to use principal component analysis (PCA) to extract a set of principal components that explain a reasonable amount of the variability in this dataset. The advantage is that we end up with a much smaller set of features. The components can then be used for modelling (since we are using few features, we are likely mitigating overfitting).
PCA regression (20:01)
We use the principal components that we obtained in the previous lecture for regression. In particular, we predict the HDI (Human Development Index) in terms of a set of principal components.
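
The two PCA lectures can be sketched as follows. The country index data is not reproduced here, so the built-in mtcars data stands in for it, and the 80% variance threshold is an arbitrary choice made for illustration:

```r
# Stand-in features (the course uses 65 country indexes instead)
X <- mtcars[, c("disp", "hp", "wt", "drat", "qsec")]

# PCA on centred and scaled features
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the smallest number of components reaching 80% of the variance
k <- which(cumsum(var_explained) >= 0.80)[1]

# Project the data: pca$x holds the component scores for every row
scores <- as.data.frame(pca$x[, seq_len(k), drop = FALSE])

# PCA regression: regress the target on the retained components
scores$mpg <- mtcars$mpg
fit <- lm(mpg ~ ., data = scores)
```

New observations are projected with predict(pca, newdata = ...) before being passed to predict(fit, ...).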

Introduction (13:03)
We introduce the caret package in R. It is a fundamental package, incredibly important for doing machine learning. It can be thought of as a layer standing between us and the underlying R packages, which greatly simplifies the process of training, evaluating and predicting with machine learning models.
Preprocessing data: Part 1 (16:46)
We explain how to leverage some of the excellent caret preprocessing capabilities. In particular, we use it for scaling features and for constructing a matrix of dummies from a categorical variable.
Preprocessing data: Part 2 (19:38)
We continue doing data preprocessing, in this case filtering out the variables without enough variability. We also show how to find and filter correlated variables.
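
A rough sketch of the preprocessing steps described above, using caret functions that cover each of them. The data frame df is hypothetical, the 0.9 correlation cutoff is an arbitrary choice, and the caret package must be installed:

```r
library(caret)

# Hypothetical data: two correlated features, one unrelated, one constant, one factor
df <- data.frame(
  x1     = c(1, 2, 3, 4, 5, 6),
  x2     = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2),
  x3     = c(5, 1, 4, 2, 6, 3),
  const  = rep(1, 6),
  colour = factor(c("r", "g", "b", "r", "g", "b"))
)
num_cols <- c("x1", "x2", "x3")

# Scaling: estimate centring/scaling parameters, then apply them
pp     <- preProcess(df[, num_cols], method = c("center", "scale"))
scaled <- predict(pp, df[, num_cols])

# Matrix of dummies from a categorical variable
dv      <- dummyVars(~ colour, data = df)
dummies <- predict(dv, newdata = df)

# Variables with (near) zero variability: returns column positions
nzv <- nearZeroVar(df[, c(num_cols, "const")])

# Highly correlated variables: returns positions suggested for removal
high_cor <- findCorrelation(cor(df[, num_cols]), cutoff = 0.9)
```

Because preProcess() stores the estimated parameters, the same transformation can later be applied to a test set with predict(pp, new_data).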

Requirements

Some R programming experience is ideal, but not strictly necessary
Some general knowledge on statistics is mandatory: What is a density function? What are random variables?

Description

This course explores several modern machine learning and data science techniques in R. As you probably know, R is one of the most used tools among data scientists. We showcase a wide array of statistical and machine learning techniques. In particular:

Using R's statistical functions for drawing random numbers, calculating densities, histograms, etc.
Supervised ML problems using the CARET package
Data processing using sqldf, caret, etc.
Unsupervised techniques such as PCA, DBSCAN, K-means
Calling Deep Learning models in Keras (Python) from R
Use the powerful XGBOOST method for both regression and classification
Doing interesting plots, such as geo-heatmaps and interactive plots
Tune hyperparameters for several ML methods using caret
Do linear regression in R, build log-log models, and do ANOVA analysis
Estimate mixed effects models to explicitly model the covariances between observations
Train outlier robust models using robust regression and quantile regression
Identify outliers and novel observations
Estimate ARIMA (time series) models to predict temporal variables

Most of the examples presented in this course come from real datasets collected from the web such as Kaggle, the US Census Bureau, etc. All the lectures can be downloaded and come with the corresponding material. The teaching approach is to briefly introduce each technique, and focus on the computational aspect. The mathematical formulas are avoided as much as possible, so as to concentrate on the practical implementations.

This course covers most of what you would need to work as a data scientist, or compete in Kaggle competitions. It is assumed that you already have some exposure to data science / statistics.

Who this course is for:

Students aiming to do serious data science in R, with some knowledge about statistics

Course content

Basics: 2 lectures • 23min

General R programming: 8 lectures • 1hr 43min

Random numbers, probability and statistics: 3 lectures • 42min

Advanced data processing using sqldf: 2 lectures • 35min

Statistical modelling: Linear regression: 9 lectures • 2hr 35min

Statistical modelling: GLM and Nonlinear regression: 8 lectures • 2hr 16min

XGBOOST: Gradient Boosting: 6 lectures • 1hr 18min

Principal components: 2 lectures • 40min

Machine learning - the CARET package - introduction: 3 lectures • 49min

Sound: 1 lecture • 5min
