24h Pro data science in R
4.0 (10 ratings)
Instead of using a simple lifetime average, Udemy calculates a course's star rating by considering a number of different factors such as the number of ratings, the age of ratings, and the likelihood of fraudulent ratings.
101 students enrolled



Practice data science with 24hs of material using real examples
Created by Francisco Juretig
Last updated 6/2017
Current price: $12 Original price: $50 Discount: 76% off
3 days left at this price!
30-Day Money-Back Guarantee
  • 18.5 hours on-demand video
  • 75 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion

What Will I Learn?
  • Do machine learning in R
  • Process data for modelling
Requirements
  • Some R programming experience is ideal, but not strictly necessary
  • Some general knowledge of statistics is mandatory: What is a density function? What are random variables?

This course explores several modern machine learning and data science techniques in R. As you probably know, R is one of the most widely used tools among data scientists. We showcase a wide array of statistical and machine learning techniques. In particular:

  • Using R's statistical functions for drawing random numbers, calculating densities, building histograms, etc.
  • Solving supervised ML problems using the caret package
  • Processing data using sqldf, caret, etc.
  • Applying unsupervised techniques such as PCA, DBSCAN, and K-means
  • Calling Deep Learning models in Keras (Python) from R
  • Using the powerful XGBoost method for both regression and classification
  • Producing interesting plots, such as geo-heatmaps and interactive plots
  • Tuning hyperparameters for several ML methods using caret
  • Running linear regression in R, building log-log models, and performing ANOVA
  • Estimating mixed effects models to explicitly model the covariances between observations
  • Training outlier-robust models using robust regression and quantile regression
  • Identifying outliers and novel observations
  • Estimating ARIMA (time series) models to predict temporal variables

Most of the examples in this course use real datasets collected from sources such as Kaggle and the US Census Bureau. All lectures can be downloaded and come with the corresponding material. The teaching approach is to briefly introduce each technique and then focus on the computational aspect. Mathematical formulas are avoided as much as possible, so as to concentrate on the practical implementations.

This course covers most of what you would need to work as a data scientist, or compete in Kaggle competitions. It is assumed that you already have some exposure to data science / statistics. 

Who is the target audience?
  • Students aiming to do serious data science in R, with some knowledge about statistics
Curriculum For This Course
75 Lectures
2 Lectures 23:04

Brief introduction to this course

Preview 19:52

Installing R and R packages (and RStudio, in case you want to use it)

Setting up R
General R programming
8 Lectures 01:42:53

Data frames are a fundamental concept in R. They are internally stored as lists of vectors and are very powerful for storing heterogeneous data. We explain how to work with them in R.

The data frame
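A minimal sketch (with made-up toy data) of the data frame basics this lecture covers:

```r
# Build a small data frame: internally, a list of equal-length vectors
df <- data.frame(
  name   = c("Ana", "Bob", "Cleo"),
  age    = c(23, 31, 27),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text as character, not factor
)

str(df)          # one line per column: type and first values
df$age           # extract a column as an ordinary vector
df[df$member, ]  # filter rows with a logical condition
mean(df$age)     # columns behave like any other vector
```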

We explore the different variable types that are available in R


Reading a CSV file in R via the read.csv function. Specifying the column names, and interpreting the columns as factors or strings

Reading data
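A self-contained illustration of read.csv (using the `text=` argument in place of a file, so no file path is needed):

```r
# read.csv can parse an inline string via `text=`; with a file you would
# pass the path as the first argument instead
csv <- "id,city,price
1,Seattle,450000
2,Bellevue,610000"

# colClasses controls how each column is interpreted: here the city stays
# a character string rather than being converted to a factor
d <- read.csv(text = csv,
              colClasses = c("integer", "character", "numeric"))
str(d)
</test>
```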

Creating customized classes for dates using setAs(), so read.csv can parse non-standard date formats directly via colClasses

Reading data with dates: Classes for customized dates
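A sketch of the two standard approaches: as.Date() with a format string, and registering a custom class (here named `dmyDate`, an illustrative name) so colClasses can parse dates while reading:

```r
library(methods)

# Parse a non-standard day/month/year format with as.Date()
as.Date("31/12/2016", format = "%d/%m/%Y")

# Register a custom class so read.csv's colClasses parses dates directly
setClass("dmyDate")
setAs("character", "dmyDate",
      function(from) as.Date(from, format = "%d/%m/%Y"))

d <- read.csv(text = "day,sales\n01/06/2017,100\n02/06/2017,120",
              colClasses = c("dmyDate", "numeric"))
d$day[2] - d$day[1]  # Date columns support arithmetic
```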

Working with text in R. Substrings, searching for letters, concatenating text


Functions allow us to encapsulate similar functionality together. We explore how to use them in R


The family of apply() functions can be used to apply a function to several elements at once. We explain how to use these functions in the context of lists, vectors, and data frames.

The apply family of functions
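A quick tour of the main family members on toy inputs:

```r
x <- list(a = 1:3, b = 4:6)

lapply(x, sum)   # list in, list out
sapply(x, sum)   # simplifies the result to a named vector

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum) # apply over rows (margin 1)
apply(m, 2, max) # apply over columns (margin 2)

# On a data frame, sapply iterates over the columns
df <- data.frame(u = c(1, 2), v = c(10, 20))
sapply(df, mean)
```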

Histograms are particularly useful for visualizing the distribution of a random variable. We show how to use the hist() function, and how to analyze the plot produced by this function. 
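The hist() function both draws the plot and returns the underlying bin structure, which can be inspected directly (a small sketch on simulated data):

```r
set.seed(1)
x <- rnorm(1000)

h <- hist(x, breaks = 30, plot = FALSE)  # compute bins without drawing
h$breaks        # bin edges (breaks = 30 is a suggestion, not a guarantee)
h$counts        # number of observations per bin
sum(h$counts)   # every observation falls in exactly one bin

# hist(x, breaks = 30, freq = FALSE) would draw a density-scaled histogram
```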

Random numbers, probability and statistics
3 Lectures 42:11

Generating random deviates in R according to several distributions

Generating random numbers

Calculating the density and the cumulative distribution function

Density and cumulative distribution function
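R's naming scheme makes these functions easy to remember: `d*` for the density, `p*` for the cumulative distribution function, `q*` for its inverse, and `r*` for random draws. A few standard-normal examples:

```r
dnorm(0)              # density at 0: 1/sqrt(2*pi), about 0.3989
pnorm(1.96)           # P(Z <= 1.96), about 0.975
qnorm(0.975)          # inverse CDF, about 1.96
pnorm(2) - pnorm(-2)  # mass within 2 standard deviations, about 0.954

# The same scheme applies across distributions: dpois/ppois, dexp/pexp, ...
ppois(3, lambda = 2)  # P(X <= 3) for a Poisson(2) count
```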

Comparing two distributions: are they the same? We use the Kolmogorov-Smirnov test for this.

Comparing distributions
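A sketch of both forms of the Kolmogorov-Smirnov test on simulated samples:

```r
set.seed(42)
a <- rnorm(200)            # sample from N(0, 1)
b <- rnorm(200, mean = 1)  # sample from a shifted distribution

# Two-sample KS test: with a mean shift of 1, the p-value is tiny
ks.test(a, b)$p.value

# One-sample version: compare a sample against a theoretical CDF by name
ks.test(a, "pnorm")$p.value
```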
Advanced data processing using sqldf
2 Lectures 34:45

sqldf allows us to use SQL syntax directly on R data frames. Thus, we can easily build very complex queries that would otherwise require a lot of cumbersome R notation. We discuss the basic sqldf operations (filtering, ordering, transposing) and how to execute queries on data frames.

sqldf - Part1

We use sqldf for more realistic applications, such as merging data from different dataframes. 

We review full vs. inner vs. left vs. right joins. We end up using a customised approach to simulate a full join in sqldf, which can also be extended to find observations present in one table but not the other.

sqldf - Part2
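A minimal sketch of the sqldf workflow on toy data frames (assumes the sqldf package is installed via install.packages("sqldf")):

```r
library(sqldf)

orders <- data.frame(id = c(1, 2, 3),
                     customer = c("A", "B", "A"),
                     amount = c(10, 25, 40))
info <- data.frame(customer = c("A", "B"),
                   country = c("US", "DE"))

# Plain SQL over data frames: filter, aggregate, order
sqldf("SELECT customer, SUM(amount) AS total
       FROM orders GROUP BY customer ORDER BY total DESC")

# Joins work the same way; here a left join against a lookup table
sqldf("SELECT o.id, o.amount, i.country
       FROM orders o LEFT JOIN info i ON o.customer = i.customer")
```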
Statistical modelling: Linear regression
9 Lectures 02:34:32

Categorical variables, and how they are included in R using as.factor()

Dummy variables

Using the lm() function to estimate a model. Interpreting the coefficients, p-values, t-values, and the F statistic

The lm() function: Part1

R² and adjusted R². ANOVA, selecting factors, and interpreting the ANOVA F-values

The lm() function: Part2

Comparing nested and non-nested models using ANOVA and the Akaike information criterion (AIC). Predicting new observations using the predict() function. Choosing between different models

Comparing models

Analyzing the residuals. Detecting structure and heteroscedasticity. Plotting leverage vs. residuals and removing influential observations

Normality, residuals and transformations

Re-estimating our house-price model using a log-log model. Differences between log-log, log-linear, and linear-log models. Extracting elasticities.

Log-log models
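A sketch of a log-log fit on simulated data (not the course's house dataset): the data are generated with a constant elasticity of 0.8, and the log-log slope recovers it.

```r
set.seed(7)
x <- runif(200, 1, 100)
# Simulate y = exp(1) * x^0.8 * noise, i.e. a constant-elasticity relation
y <- exp(1 + 0.8 * log(x) + rnorm(200, sd = 0.1))

fit <- lm(log(y) ~ log(x))
coef(fit)[2]  # slope close to 0.8: the elasticity of y with respect to x

# Interpretation: a 1% increase in x is associated with roughly a 0.8%
# increase in y, which is what makes log-log models so convenient
```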

We introduce mixed and random effects models. They are used for modelling the covariance between observations sharing a subject variable, such as "zipcode", "person", "animal", etc. Every observation belonging to the same group (subject) receives a random shock with mean 0 and a standard deviation that is estimated by the model.

Linear mixed effects models: Part1

We apply the previous methodology to our King County dataset. The objective is to include the zip code as a random effect: every house in the same zip code is assumed to be correlated (it receives a common random shock).

Linear mixed effects models: Part2

We explore two techniques for dealing with outliers in the context of linear regression. We show how to estimate conditional quantiles, and how to fit robust regression using the rlm function. Both approaches yield estimates that are not dramatically affected by outliers.

Robust regression
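A sketch of how a single gross outlier distorts ordinary least squares but barely moves a robust fit (MASS ships with standard R installations; quantile regression via quantreg::rq is mentioned as an alternative, assuming that package is installed):

```r
library(MASS)  # provides rlm(), a Huber M-estimator

set.seed(3)
x <- 1:50
y <- 2 * x + rnorm(50)  # true slope is 2
y[50] <- 500            # one gross outlier

coef(lm(y ~ x))["x"]   # OLS slope is dragged upward by the outlier
coef(rlm(y ~ x))["x"]  # robust slope stays near the true value of 2

# Median (quantile) regression is another robust option:
# quantreg::rq(y ~ x, tau = 0.5), if the quantreg package is installed
```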

Robust regression
3 questions
Statistical modelling: GLM and Nonlinear regression
8 Lectures 02:16:29

Introduction to logistic regression. How to formulate a model using glm() and how to do the corresponding statistical tests

Preview 19:08

Analyzing the degrees of freedom. Looking at the coefficients, odds ratio, and ROC curve. Calculating the area under the curve. 

Logistic regression - Part 2

Using the "performance" package. Calculating the area under the curve. Doing ANOVA for GLMs

Logistic regression - Part 3

It is quite hard to interpret the coefficients of a logistic regression model, since it is a nonlinear model. Nevertheless, we can build curves profiling the predicted probabilities as one variable changes.

Logistic regression - Part 4
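The profiling idea can be sketched on R's built-in mtcars data (an illustrative dataset, not the one used in the lectures): fit a logistic model and trace the predicted probability as one variable changes.

```r
# Probability of a manual transmission (am = 1) as a function of car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)$coefficients  # estimates, std. errors, z values, p-values

exp(coef(fit)["wt"])       # odds ratio for a 1-unit (1000 lb) weight increase

# Profile the predicted probabilities across a grid of weights: since the
# model is nonlinear, this curve is more interpretable than raw coefficients
predict(fit, newdata = data.frame(wt = c(2, 3, 4)), type = "response")
```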

Poisson regression is a type of GLM (generalized linear model) suitable for modelling count data (discrete and often highly skewed). The only conceptual problem is that a single lambda parameter controls both the mean and the variance; as we will see, this can be a problem.

Poisson regression: Part1

We continue with our previous model. In this lecture we profile the coefficients as one variable changes. In particular, we study how many people are expected to be affected as the year changes for two situations.

Poisson regression: Part2

The fundamental assumption of the Poisson model is that the same parameter (lambda) describes both the mean and the variance. The problem is that the variability sometimes grows faster than the mean. One easy solution is the negative binomial model, which handles this problem easily (think of the NB model as a generalization of the Poisson model).

Poisson regression: Part3
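A sketch on simulated counts, including the overdispersion check that motivates the negative binomial model:

```r
set.seed(11)
x <- runif(300)
# Simulate counts whose log-mean is 0.5 + 1.2 * x
counts <- rpois(300, lambda = exp(0.5 + 1.2 * x))

fit <- glm(counts ~ x, family = poisson)
coef(fit)  # recovers the intercept (~0.5) and slope (~1.2) on the log scale

# Overdispersion check: residual deviance far above the residual degrees of
# freedom suggests switching to the negative binomial, MASS::glm.nb(counts ~ x)
fit$deviance / fit$df.residual  # near 1 here, since the data are truly Poisson
```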

Nonlinear regression is usually used in the context of biology and pharmacokinetics. We show how to fit a model using nonlinear least squares. Specifically, we fit a Michaelis-Menten model for enzyme kinetics.

Nonlinear regression
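R's built-in Puromycin dataset is the classic example for this: reaction rate versus substrate concentration, fitted with a Michaelis-Menten model via nonlinear least squares.

```r
# rate = Vm * conc / (K + conc): Vm is the asymptotic maximum rate,
# K is the concentration at which the rate reaches half of Vm
treated <- subset(Puromycin, state == "treated")

fit <- nls(rate ~ Vm * conc / (K + conc),
           data = treated,
           start = list(Vm = 200, K = 0.05))  # rough starting values
coef(fit)

predict(fit, newdata = data.frame(conc = 0.5))
```

Unlike lm(), nls() needs starting values; a bad starting point can prevent convergence, so they are usually eyeballed from the data (here, the observed plateau suggests Vm near 200).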
XGBOOST: Gradient Boosting
6 Lectures 01:17:46

We explore the basic parameters in XGBoost

Preview 16:24

We continue analyzing the parameters in XGBoost

How does it work? Relevant parameters - Part2

We use XGBoost for modelling the house prices in King County. 

Using XGBoost for regression

XGBoost also ships with its own cross-validation function, which computes the cross-validated score at each boosting round. The only limitation is that it requires a manual approach (this function does not tune the parameters for us).

Cross validation in XGBOOST: the xgb.cv function

We use the caret package to execute an XGBoost model via cross validation. This allows us to select some of the hyper-parameters via cross validation

Preview 17:10

We use XGBoost for classifying mushrooms. 

Preview 08:49
Principal components
2 Lectures 40:02

We work with a real dataset containing 65 indicators for 188 countries. The idea is to use principal components analysis to extract a set of components that explain a reasonable amount of the variability in this dataset. The advantage is that we end up with a much smaller set of features. The components can then be used for modelling (with fewer features, we are likely mitigating overfitting).

Selecting PCA and projecting the data

We use the principal components obtained in the previous lecture for regression. In particular, we predict the HDI (Human Development Index) from a set of components.

PCA regression
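The workflow can be sketched with prcomp() on R's built-in USArrests data (4 numeric variables for 50 states, standing in for the course's country-indicator dataset):

```r
# scale. = TRUE standardizes the variables, which have very different units
pca <- prcomp(USArrests, scale. = TRUE)

summary(pca)        # proportion of variance explained by each component
pca$rotation[, 1:2] # loadings of the first two components

# Project the data onto the first two components: a 50 x 2 feature matrix
# that can feed a downstream regression, in the spirit of principal
# components regression
scores <- pca$x[, 1:2]
dim(scores)
```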
Machine learning - the CARET package - introduction
3 Lectures 49:27

We introduce the caret package, a fundamental R package for doing machine learning. It can be thought of as a layer standing between us and the underlying R packages, one that greatly simplifies the process of training, evaluating, and predicting with machine learning models.


We explain how to leverage some of caret's excellent preprocessing capabilities. In particular, we use it to scale features and to construct a matrix of dummies from a categorical variable.

Preprocessing data: Part1

We continue with data preprocessing, in this case filtering out variables with insufficient variability. We also show how to find and filter correlated variables.

Preprocessing data: Part2
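caret provides nearZeroVar() and findCorrelation() for these two filters; the underlying ideas can be sketched in base R (a simplified illustration of the same logic, not caret's exact rules):

```r
set.seed(5)
d <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100),
  x3 = rep(1, 100)  # zero variance: carries no information for modelling
)
d$x4 <- d$x1 + rnorm(100, sd = 0.01)  # almost an exact copy of x1

# Drop (near-)zero-variance columns
variances <- sapply(d, var)
d <- d[, variances > 1e-8]

# Flag highly correlated pairs (|r| > 0.9) as candidates for removal
cm <- abs(cor(d))
diag(cm) <- 0
which(cm > 0.9, arr.ind = TRUE)  # x1 and x4 are flagged
```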
1 Lecture 05:23

We use a specific R package for extracting several sound features. This is particularly relevant for classifying speech. In later stages of this course, we will run several machine learning models over this dataset.

Extracting meaningful sound features
About the Instructor
Francisco Juretig
3.8 Average rating
154 Reviews
1,355 Students
8 Courses

I have worked for more than seven years as a statistical programmer in industry. I am an expert in programming, statistics, data science, and statistical algorithms, with wide experience in many programming languages. I am a regular contributor to the R community, with 3 published packages, and an expert SAS programmer. I have contributed to scientific statistical journals; my latest publication appeared in the Journal of Statistical Software.