24h Pro data science in R

Practice data science with 24 hours of material using real examples
4.1 (7 ratings)
76 students enrolled
Created by Francisco Juretig
Last updated 6/2017
English
30-Day Money-Back Guarantee
Includes:
  • 18.5 hours on-demand video
  • 75 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Do machine learning in R
  • Process data for modelling
Requirements
  • Some R programming experience is ideal, but not strictly necessary
  • Some general knowledge of statistics is mandatory: What is a density function? What are random variables?
Description

This course explores several modern machine learning and data science techniques in R. As you probably know, R is one of the most widely used tools among data scientists. We showcase a wide array of statistical and machine learning techniques. In particular:

  • Using R's statistical functions for drawing random numbers, calculating densities, histograms, etc.
  • Supervised ML problems using the caret package
  • Data processing using sqldf, caret, etc.
  • Unsupervised techniques such as PCA, DBSCAN, and K-means
  • Calling deep learning models in Keras (Python) from R
  • Using the powerful XGBoost method for both regression and classification
  • Drawing interesting plots, such as geo-heatmaps and interactive plots
  • Tuning hyperparameters for several ML methods using caret
  • Doing linear regression in R, building log-log models, and doing ANOVA
  • Estimating mixed effects models to explicitly model the covariances between observations
  • Training outlier-robust models using robust regression and quantile regression
  • Identifying outliers and novel observations
  • Estimating ARIMA (time series) models to predict temporal variables

Most of the examples presented in this course come from real datasets collected from the web, from sources such as Kaggle and the US Census Bureau. All the lectures can be downloaded and come with the corresponding material. The teaching approach is to briefly introduce each technique and then focus on the computational aspect. Mathematical formulas are avoided as much as possible, so as to concentrate on the practical implementations.

This course covers most of what you would need to work as a data scientist, or compete in Kaggle competitions. It is assumed that you already have some exposure to data science / statistics. 

Who is the target audience?
  • Students aiming to do serious data science in R, with some knowledge about statistics
Curriculum For This Course
75 Lectures
18:29:50

Basics
2 Lectures 23:04

Brief introduction to this course

Preview 19:52

Installing R. How to install packages (and RStudio, in case you want to use it)

Setting up R
03:12

General R programming
8 Lectures 01:42:53

Data frames are a fundamental concept in R. They are internally stored as a list of vectors, and they are very powerful for storing heterogeneous data. We explain how to work with them in R.

The data frame
13:15
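
For instance, a minimal sketch of building and inspecting a data frame (the column names and values are illustrative):

    # A data frame stores columns of equal length but possibly different types
    df <- data.frame(name   = c("Ana", "Bob", "Cleo"),
                     age    = c(31, 24, 45),
                     member = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)
    str(df)            # inspect the column types
    df$age             # extract one column as a vector
    df[df$age > 30, ]  # filter rows with a logical condition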

We explore the different variable types that are available in R

Variables
13:03

Reading a CSV file in R via the read.csv function: specifying the column names, and interpreting the columns as factors or strings

Reading data
19:56
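
A minimal sketch of what such a call can look like (the file name and column names are illustrative):

    # Read a CSV, naming the columns and controlling their types
    df <- read.csv("houses.csv",
                   header = FALSE,
                   col.names  = c("price", "sqft", "zipcode"),
                   colClasses = c("numeric", "numeric", "character"),
                   stringsAsFactors = FALSE)
    str(df)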

Creating customized classes so date columns are parsed correctly when reading data

Reading data with dates: Classes for customized dates
13:15
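
One standard idiom for this, assuming the lecture follows the usual setAs()/colClasses approach (the file name and date format are illustrative):

    # Define a custom class so read.csv parses the date column directly
    setClass("myDate")
    setAs("character", "myDate",
          function(from) as.Date(from, format = "%d/%m/%Y"))
    df <- read.csv("sales.csv", colClasses = c("myDate", "numeric"))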

Working with text in R. Substrings, searching for letters, concatenating text

Text
08:48
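
A few of the base R text functions this kind of lecture typically covers:

    s <- "data science"
    substr(s, 1, 4)    # extract a substring: "data"
    grepl("sci", s)    # search for a pattern: TRUE
    paste(s, "in R")   # concatenate: "data science in R"
    toupper(s)         # "DATA SCIENCE"
    nchar(s)           # number of characters: 12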

Functions allow us to encapsulate related functionality. We explore how to define and use them in R

Functions
11:36
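
A minimal sketch of a user-defined function (the function itself is illustrative):

    # Encapsulate a small piece of logic, with a default argument
    standardize <- function(x, center = TRUE) {
      if (center) x <- x - mean(x)
      x / sd(x)
    }
    standardize(c(1, 5, 9))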

The family of apply() functions can be used to apply a function to several elements at once. We explain how to use these functions in the context of lists, vectors, and data frames.

The apply family of functions
10:04
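
A minimal sketch of the most common members of the family:

    m <- matrix(1:6, nrow = 2)
    apply(m, 1, sum)       # margin 1: apply over rows
    apply(m, 2, mean)      # margin 2: apply over columns

    lst <- list(a = 1:3, b = 4:10)
    lapply(lst, length)    # returns a list
    sapply(lst, length)    # simplifies the result to a vector

    df <- data.frame(x = 1:3, y = c(2, 4, 6))
    sapply(df, mean)       # a data frame is a list of columns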

Histograms are particularly useful for visualizing the distribution of a random variable. We show how to use the hist() function, and how to analyze the plot produced by this function. 

Histograms
12:56
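
A minimal sketch, using simulated data:

    set.seed(1)
    x <- rnorm(1000, mean = 5, sd = 2)
    hist(x,
         breaks = 30,    # number of bins
         freq = FALSE,   # plot densities instead of counts
         main = "Histogram of x", xlab = "x")
    lines(density(x), lwd = 2)  # overlay a kernel density estimate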

Random numbers, probability and statistics
3 Lectures 42:11

Generating random deviates in R according to several distributions

Generating random numbers
14:59
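
For example, the r* family of functions draws from the standard distributions:

    set.seed(42)                       # make the draws reproducible
    rnorm(5, mean = 0, sd = 1)         # normal deviates
    runif(5, min = 0, max = 1)         # uniform deviates
    rbinom(5, size = 10, prob = 0.3)   # binomial counts
    rpois(5, lambda = 2)               # Poisson counts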

Calculating the density and the cumulative distribution function

Density and cumulative distribution function
17:26
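
The d*, p*, and q* counterparts of the r* functions cover this; for example:

    dnorm(0)       # density of N(0,1) at 0: ~0.399
    pnorm(1.96)    # cumulative probability P(X <= 1.96): ~0.975
    qnorm(0.975)   # the inverse of pnorm: ~1.96
    pbinom(3, size = 10, prob = 0.3)  # CDFs exist for discrete distributions too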

Comparing two distributions. Are they the same? We use the Kolmogorov-Smirnov test for doing this.

Comparing distributions
09:46
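
A minimal sketch with simulated samples:

    set.seed(1)
    x <- rnorm(200)
    y <- rnorm(200, mean = 0.5)

    ks.test(x, y)                          # two-sample KS test: same distribution?
    ks.test(x, "pnorm", mean = 0, sd = 1)  # one-sample test against N(0,1)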

Advanced data processing using sqldf
2 Lectures 34:45

sqldf allows us to use SQL syntax directly on R data frames. Thus, we can easily build very complex queries that would otherwise require a lot of cumbersome R notation. We discuss the basic sqldf operations (filtering, ordering, transposing) and how to execute queries on data frames.

sqldf - Part1
16:50
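
A minimal sketch (the data frame is illustrative):

    library(sqldf)  # install.packages("sqldf") if needed

    df <- data.frame(city  = c("NY", "LA", "NY", "SF"),
                     sales = c(100, 80, 120, 60))

    # Plain SQL over an R data frame: filter, aggregate, order
    sqldf("SELECT city, SUM(sales) AS total
           FROM df
           WHERE sales > 70
           GROUP BY city
           ORDER BY total DESC")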

We use sqldf for more realistic applications, such as merging data from different data frames.

We review full vs inner vs left vs right joins. We end up using a customised approach to simulate a full join in sqldf, which can also be extended to find observations present in one table but not in the other.

sqldf - Part2
17:55
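
A sketch of the join logic (the table names are illustrative; the SQLite backend used by sqldf lacks RIGHT and FULL joins, hence the workaround):

    library(sqldf)

    t1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
    t2 <- data.frame(id = c(2, 3, 4), y = c(10, 20, 30))

    # Left join: keep every row of t1
    sqldf("SELECT t1.id, t1.x, t2.y FROM t1 LEFT JOIN t2 ON t1.id = t2.id")

    # Rows of t2 with no match in t1 (one half of a simulated full join)
    sqldf("SELECT t2.* FROM t2 LEFT JOIN t1 ON t2.id = t1.id
           WHERE t1.id IS NULL")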

Statistical modelling: Linear regression
9 Lectures 02:34:32

Categorical variables, and how they are included in R using as.factor()

Dummy variables
12:23

Using the lm() function to estimate a model. Interpreting the coefficients, p-values and t-values, and the F statistic

The lm() function: Part1
19:58
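
A minimal sketch on a built-in dataset:

    fit <- lm(mpg ~ wt + hp, data = mtcars)
    summary(fit)   # coefficients, t-values, p-values, F statistic, R²
    coef(fit)      # just the estimated coefficients
    confint(fit)   # confidence intervals for the coefficients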

R² and adjusted R². ANOVA, selecting factors, and looking into the ANOVA F-values

The lm() function: Part2
13:44

Comparing nested and non-nested models using ANOVA and the Akaike information criterion (AIC). Predicting new observations using the predict() function. Choosing between different models

Comparing models
19:56
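
A minimal sketch of these comparisons on a built-in dataset:

    m1 <- lm(mpg ~ wt, data = mtcars)
    m2 <- lm(mpg ~ wt + hp, data = mtcars)
    anova(m1, m2)  # F test: does adding hp improve the nested model?
    AIC(m1, m2)    # Akaike criterion; also valid for non-nested models
    predict(m2, newdata = data.frame(wt = 3, hp = 120))  # a new observation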

Analyzing the residuals. Detecting structure and heteroscedasticity. Plotting leverage vs residuals, and removing influential observations

Normality, residuals and transformations
19:30

Re-estimating our house model as a log-log model. Differences between log-log, log-linear, and linear-log models. Extracting elasticities.

Log-log models
12:35
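
A minimal sketch, assuming hypothetical price and sqft columns in a houses data frame:

    # In a log-log model the slope is an elasticity: a 1% change in sqft
    # is associated with a (coefficient)% change in price
    fit <- lm(log(price) ~ log(sqft), data = houses)
    coef(fit)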

We introduce mixed and random effects models. They are used for modelling the covariance between observations sharing a subject variable, such as "zipcode", "person", "animal", etc. Every observation belonging to the same group (subject) receives a random shock with mean 0 and standard deviation sigma, which is estimated by the model.

Linear mixed effects models: Part1
17:34
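
A minimal sketch using the lme4 package (an assumption about the package used; the variable names are illustrative):

    library(lme4)  # install.packages("lme4") if needed

    # Random intercept per zipcode: houses sharing a zipcode get a common shock
    fit <- lmer(price ~ sqft + (1 | zipcode), data = houses)
    summary(fit)   # fixed effects plus the estimated random-effect variance
    ranef(fit)     # the predicted shock for each zipcode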

We apply our previous methodology to our King County dataset. The objective is to include the zipcode as a random effect. Every house that belongs to the same zip code will be assumed to be correlated (it will receive a common random shock).

Linear mixed effects models: Part2
19:16

We explore two techniques for dealing with outliers in the context of linear regression. We show how to estimate conditional quantiles, and how to do robust regression using the rlm function. Both approaches yield quite robust estimates that are not affected dramatically by outliers.

Robust regression
19:36
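
A minimal sketch of both approaches on a built-in dataset:

    library(MASS)      # rlm: robust regression via M-estimation
    library(quantreg)  # rq: quantile regression; install if needed

    fit_rlm <- rlm(stack.loss ~ ., data = stackloss)  # outlier-resistant fit
    summary(fit_rlm)

    fit_med <- rq(stack.loss ~ ., tau = 0.5, data = stackloss)  # median regression
    summary(fit_med)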

Robust regression
3 questions

Statistical modelling: GLM and Nonlinear regression
8 Lectures 02:16:29

Introduction to logistic regression. How to formulate a model using glm() and how to do the corresponding statistical tests

Preview 19:08
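
A minimal sketch on a built-in dataset:

    # Model a binary outcome (transmission type) with two predictors
    fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
    summary(fit)     # coefficients with Wald z tests
    exp(coef(fit))   # odds ratios
    head(predict(fit, type = "response"))  # predicted probabilities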

Analyzing the degrees of freedom. Looking at the coefficients, odds ratios, and the ROC curve. Calculating the area under the curve.

Logistic regression - Part 2
19:49

Using the "performance" package. Calculating the area under the curve. Doing ANOVA for GLMs

Logistic regression - Part 3
19:52

It is quite hard to interpret the coefficients of a logistic regression model, since it is a nonlinear model. Nevertheless, we can build curves profiling the predicted probabilities as one variable changes.

Logistic regression - Part 4
19:57

Poisson regression is a type of GLM (generalized linear model) that is adequate for modelling count data (discrete and highly skewed data). The only conceptual problem is that we estimate a single lambda parameter, which controls the mean and the variance at the same time. As we will see, this can be a problem.

Poisson regression: Part1
19:54
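
A minimal sketch, assuming hypothetical count data with cases, year, and region columns:

    # family = poisson fits a log-linear model for counts
    fit <- glm(cases ~ year + region, data = counts, family = poisson)
    summary(fit)
    exp(coef(fit))  # multiplicative effects on the expected count

    # Quick overdispersion check: residual deviance vs degrees of freedom
    deviance(fit) / df.residual(fit)  # values far above 1 suggest overdispersion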

We continue with our previous model. In this lecture we profile the coefficients as one variable changes. In particular, we study how many people are expected to be affected as the year changes for two situations.

Poisson regression: Part2
14:43

The fundamental assumption of the Poisson model is that the same parameter (lambda) describes both the mean and the variance. The problem is that sometimes the variability increases more than it should as lambda increases. One easy solution is to use the negative binomial model, which controls this problem very easily (think of the NB model as a generalization of the Poisson model).

Poisson regression: Part3
03:21
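
A minimal sketch, reusing the hypothetical count data from above:

    library(MASS)

    # glm.nb adds a dispersion parameter (theta) on top of the Poisson mean
    fit_nb <- glm.nb(cases ~ year + region, data = counts)
    summary(fit_nb)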

Nonlinear regression is usually used in the context of biology and pharmacokinetics. We show how to fit a model using nonlinear least squares. Specifically, we fit a Michaelis-Menten model for enzyme kinetics.

Nonlinear regression
19:45
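
A minimal sketch of that fit on the built-in Puromycin data:

    # Michaelis-Menten kinetics: rate = Vm * conc / (K + conc)
    treated <- subset(Puromycin, state == "treated")
    fit <- nls(rate ~ Vm * conc / (K + conc),
               data = treated,
               start = list(Vm = 200, K = 0.05))  # nls needs starting values
    summary(fit)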

XGBoost: Gradient Boosting
6 Lectures 01:17:46

We explore the basic parameters in XGBoost

Preview 16:24

We continue analyzing the parameters in XGBoost

How does it work? Relevant parameters - Part2
08:10

We use XGBoost for modelling the house prices in King County.

Using XGBoost for regression
14:37

XGBoost also comes with its own cross-validation function that allows us to compute the cross-validated score for each sample. The only problem is that we need to use a manual approach (this function does not tune the parameters for us).

Cross validation in XGBoost: the xgb.cv function
12:36
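
A minimal sketch (assuming a numeric feature matrix X and label vector y; the parameter values are illustrative):

    library(xgboost)

    dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)

    cv <- xgb.cv(params = list(objective = "reg:linear",  # 2017-era objective name
                               max_depth = 4, eta = 0.1),
                 data = dtrain,
                 nrounds = 200,
                 nfold = 5,                  # 5-fold cross-validation
                 early_stopping_rounds = 10,
                 verbose = 0)
    cv$evaluation_log  # per-round train/test scores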

We use the caret package to train an XGBoost model via cross-validation, which allows us to tune some of the hyperparameters.

Preview 17:10
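
A minimal sketch (the data and grid values are illustrative; caret's xgbTree grid requires all seven parameters):

    library(caret)

    ctrl <- trainControl(method = "cv", number = 5)   # 5-fold CV
    grid <- expand.grid(nrounds = c(100, 200), max_depth = c(3, 5),
                        eta = c(0.05, 0.1), gamma = 0,
                        colsample_bytree = 0.8, min_child_weight = 1,
                        subsample = 0.8)
    fit <- train(price ~ ., data = houses, method = "xgbTree",
                 trControl = ctrl, tuneGrid = grid)
    fit$bestTune   # the hyperparameters selected by cross-validation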

We use XGBoost for classifying mushrooms. 

Preview 08:49

Principal components
2 Lectures 40:02

We work with a real dataset containing 65 indexes for 188 countries. The idea is to use principal components analysis to extract a set of components that explain a reasonable amount of the variability in this dataset. The advantage is that we end up with a much smaller set of features. The components can then be used for modelling (since we are using fewer features, we are likely mitigating overfitting).

Selecting PCA and projecting the data
20:01
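
A minimal sketch (assuming the country-index data is held in a numeric matrix X):

    pca <- prcomp(X, center = TRUE, scale. = TRUE)  # scale before PCA
    summary(pca)            # proportion of variance explained per component
    scores <- pca$x[, 1:5]  # project the data onto the first 5 components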

We use the principal components that we obtained in the previous lecture for regression. In particular, we predict the HDI (Human Development Index) in terms of a set of components.

PCA regression
20:01

Machine learning - the caret package - introduction
3 Lectures 49:27

We introduce the caret package in R. It is a fundamental package for doing machine learning in R. It can be thought of as a layer standing between us and the underlying R packages, which greatly simplifies the process of training, evaluating, and predicting with machine learning models.

Introduction
13:03

We explain how to leverage some of caret's excellent preprocessing capabilities. In particular, we use it for scaling features and for constructing a matrix of dummies from a categorical variable.

Preprocessing data: Part1
16:46
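
A minimal sketch of both steps (the data frame is illustrative):

    library(caret)

    # Center and scale the numeric columns
    pp <- preProcess(df, method = c("center", "scale"))
    df_scaled <- predict(pp, df)

    # Expand categorical columns into a matrix of dummy variables
    dmy <- dummyVars(~ ., data = df)
    df_dummies <- predict(dmy, newdata = df)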

We continue with data preprocessing, in this case filtering out variables with insufficient variability. We also show how to find and filter correlated variables.

Preprocessing data: Part2
19:38
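
A minimal sketch (assuming a numeric feature matrix X):

    library(caret)

    nzv <- nearZeroVar(X)                 # near-zero-variance columns
    if (length(nzv) > 0) X <- X[, -nzv]

    high <- findCorrelation(cor(X), cutoff = 0.9)  # highly correlated columns
    if (length(high) > 0) X <- X[, -high]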

Sound
1 Lecture 05:23

We use a specific R package for extracting several sound features. This is particularly relevant for classifying speech. In later stages of this course, we will run several machine learning models over this dataset.

Extracting meaningful sound features
05:23
(6 more sections not shown)
About the Instructor
Francisco Juretig
3.9 Average rating
129 Reviews
1,173 Students
8 Courses

I have worked for 7+ years as a statistical programmer in industry. Expert in programming, statistics, data science, and statistical algorithms. I have wide experience in many programming languages. Regular contributor to the R community, with 3 published packages. I am also an expert SAS programmer. Contributor to scientific statistical journals. Latest publication in the Journal of Statistical Software.