24h Pro data science in R

101 students enrolled

Practice data science with 24 hours of material, using real examples

Current price: $12
Original price: $50
Discount: 76% off

30-Day Money-Back Guarantee

- 18.5 hours on-demand video
- 75 Supplemental Resources
- Full lifetime access
- Access on mobile and TV

- Certificate of Completion

What Will I Learn?

- Do machine learning in R
- Process data for modelling

Requirements

- Some R programming experience is ideal, but not strictly necessary
- Some general knowledge of statistics is required: What is a density function? What are random variables?

Description

This course explores several modern machine learning and data science techniques in R. As you probably know, R is one of the most widely used tools among data scientists. We showcase a wide array of statistical and machine learning techniques. In particular:

- Using R's statistical functions to draw random numbers, calculate densities, plot histograms, etc.
- Solving supervised machine learning problems with the caret package
- Processing data with sqldf, caret, etc.
- Applying unsupervised techniques such as PCA, DBSCAN, and K-means
- Calling deep learning models built in Keras (Python) from R
- Using the powerful XGBoost method for both regression and classification
- Producing interesting plots, such as geo-heatmaps and interactive plots
- Tuning hyperparameters for several ML methods using caret
- Fitting linear regressions in R, building log-log models, and performing ANOVA
- Estimating mixed effects models to explicitly model the covariance between observations
- Training outlier-robust models using robust regression and quantile regression
- Identifying outliers and novel observations
- Estimating ARIMA (time series) models to predict temporal variables

Most of the examples presented in this course come from real datasets collected from the web, from sources such as Kaggle and the US Census Bureau. All the lectures can be downloaded and come with the corresponding material. The teaching approach is to briefly introduce each technique and then focus on the computational aspect. Mathematical formulas are avoided as much as possible, so as to concentrate on the practical implementations.

This course covers most of what you would need to work as a data scientist, or compete in Kaggle competitions. It is assumed that you already have some exposure to data science / statistics.

Who is the target audience?

- Students aiming to do serious data science in R, with some knowledge about statistics

Curriculum For This Course

75 Lectures

18:29:50

Basics
2 Lectures
23:04

Brief introduction to this course

Preview
19:52

Installing R, and how to install packages (and RStudio, in case you want to use it)

Setting up R

03:12


General R programming
8 Lectures
01:42:53

Dataframes are a fundamental concept in R. They are internally stored as a list of vectors. They are very powerful for storing heterogeneous data. We explain how to work with them in R.

The data frame

13:15
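
As a quick illustration of the list-of-vectors idea (the column names below are invented for the example, not from the course):

```r
# A data frame stores heterogeneous columns of equal length
df <- data.frame(
  name   = c("Ana", "Bo", "Cy"),  # character column
  age    = c(23, 31, 27),         # numeric column
  member = c(TRUE, FALSE, TRUE)   # logical column
)

# Internally it behaves like a list of vectors
is.list(df)      # TRUE
length(df)       # 3 (one entry per column)
df$age           # extract one column as a vector
df[df$member, ]  # filter rows with a logical vector
```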

We explore the different variable types that are available in R

Variables

13:03

Reading a CSV file in R via the read.csv function. Specifying the column names, and interpreting columns as factors or strings

Reading data

19:56

Creating customized classes for dates when reading data (e.g. via setAs() and the colClasses argument)

Reading data with dates: Classes for customized dates

13:15

Working with text in R. Substrings, searching for letters, concatenating text

Text

08:48

Functions allow us to encapsulate related functionality. We explore how to define and use them in R

Functions

11:36

The family of apply() functions can be used to apply a function to several elements at once. We explain how to use these functions in the context of lists, vectors, and data frames.

The apply family of functions

10:04
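
A minimal sketch of the most common family members (the data is invented for the example):

```r
scores <- list(math = c(90, 75, 88), art = c(60, 95))

# lapply returns a list; sapply simplifies to a vector when it can
lapply(scores, mean)
means <- sapply(scores, mean)  # named numeric vector of group means

# apply works over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
m <- matrix(1:6, nrow = 2)
apply(m, 2, sum)  # column sums: 3 7 11
```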

Histograms are particularly useful for visualizing the distribution of a random variable. We show how to use the hist() function, and how to analyze the plot produced by this function.

Histograms

12:56
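
For instance, hist() can be called with plot = FALSE to inspect the binning directly (simulated data, not a course dataset):

```r
set.seed(42)
x <- rnorm(1000)  # 1000 standard-normal draws

# plot = FALSE returns the bin counts and midpoints without drawing
h <- hist(x, breaks = 30, plot = FALSE)
sum(h$counts)                # 1000: every observation falls in some bin
h$mids[which.max(h$counts)]  # midpoint of the tallest bar (near 0 here)
```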


Random numbers, probability and statistics
3 Lectures
42:11

Generating random deviates in R according to several distributions

Generating random numbers

14:59
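
A sketch of the standard r* generator functions, with a fixed seed for reproducibility:

```r
set.seed(123)                    # fix the seed so the draws are reproducible
u <- runif(5)                    # uniform on [0, 1]
z <- rnorm(5, mean = 0, sd = 1)  # standard normal
p <- rpois(5, lambda = 3)        # Poisson counts

# the same seed reproduces exactly the same draws
set.seed(123)
stopifnot(identical(u, runif(5)))
```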

Calculating the density and the cumulative distribution function

Density and cumulative distribution function

17:26

Comparing two distributions. Are they the same? We use the Kolmogorov-Smirnov test for doing this.

Comparing distributions

09:46
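
A sketch of the test on simulated data (the samples are invented for the example):

```r
set.seed(1)
a <- rnorm(200)            # standard normal sample
b <- rnorm(200, mean = 1)  # shifted normal sample

# one-sample version: compare a against the theoretical N(0, 1)
p_same <- ks.test(a, "pnorm")$p.value

# two-sample version: a vs the shifted sample; the test should detect the shift
p_diff <- ks.test(a, b)$p.value
p_diff < 0.05  # TRUE: strong evidence the distributions differ
```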


Advanced data processing using sqldf
2 Lectures
34:45

sqldf allows us to use SQL syntax directly on R data frames. Thus, we can easily build very complex queries that would otherwise require a lot of cumbersome R notation. We discuss the basic sqldf operations (filtering, ordering, transposing) and how to execute queries on data frames.

sqldf - Part1

16:50

We use sqldf for more realistic applications, such as merging data from different dataframes.

We review the *Full vs Inner vs Left vs Right join*. We end up using a customised approach to simulate a full join in sqldf, which can also be extended to find observations present in one table but not in the other.

sqldf - Part2

17:55
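
The joins discussed here also have base-R counterparts in merge(), which is handy for cross-checking sqldf results (the tiny tables below are invented):

```r
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

inner <- merge(left, right, by = "id")                # ids 2 and 3 only
leftj <- merge(left, right, by = "id", all.x = TRUE)  # keep every row of left
full  <- merge(left, right, by = "id", all = TRUE)    # ids 1 through 4

# rows present in one table but not the other (an anti-join)
only_left <- left[!(left$id %in% right$id), ]         # just id 1
```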


Statistical modelling: Linear regression
9 Lectures
02:34:32

Categorical variables, and how they are included in R using as.factor()

Dummy variables

12:23

Using the lm() function to estimate a model. Interpreting the coefficients, p-values, t-values, and the F statistic

The lm() function: Part1

19:58
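
A minimal sketch of the workflow on simulated data (the true coefficients are known here, so the estimates can be checked):

```r
set.seed(7)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)  # true intercept 2, true slope 3

fit <- lm(y ~ x)
coef(fit)                  # estimates close to 2 and 3
summary(fit)$coefficients  # estimates, standard errors, t- and p-values
summary(fit)$fstatistic    # overall F statistic
```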

R² and adjusted R². ANOVA, selecting factors, and looking into the ANOVA F-values

The lm() function: Part2

13:44

Comparing nested and non-nested models using ANOVA and the Akaike information criterion (AIC). Predicting new observations using the predict() function. Choosing between different models

Comparing models

19:56
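
A sketch with two nested models on the built-in mtcars data:

```r
fit1 <- lm(mpg ~ wt, data = mtcars)       # smaller (nested) model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)  # adds one predictor

anova(fit1, fit2)  # F test: does adding hp significantly improve the fit?
AIC(fit1, fit2)    # lower AIC preferred; also usable for non-nested models

# predicting a new observation with the chosen model
predict(fit2, newdata = data.frame(wt = 3, hp = 120))
```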

Analyzing the residuals. Detecting structure and heteroscedasticity. Plotting leverage vs residuals, and removing influential observations

Normality, residuals and transformations

19:30

Re-estimating our house model as a log-log model. The difference between log-log, log-linear, and linear-log models. Extracting elasticities.

Log-log models

12:35
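
The elasticity interpretation can be sketched on simulated constant-elasticity data (the true elasticity is 0.7 here by construction):

```r
set.seed(3)
x <- runif(200, 1, 100)
y <- 5 * x^0.7 * exp(rnorm(200, sd = 0.1))  # y = 5 * x^0.7 with noise

loglog <- lm(log(y) ~ log(x))
coef(loglog)[2]  # the slope on log(x) estimates the elasticity (about 0.7)
```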

We introduce mixed and random effects models. They are used for modelling the covariance between observations sharing a subject variable, such as "zipcode", "person", or "animal". Every observation belonging to the same group (subject) receives a random shock with mean 0 and standard deviation sigma, which is estimated by the model.

Linear mixed effects models: Part1

17:34

We apply the previous methodology to our King County dataset. The objective is to include the zip code as a random effect. Houses within the same zip code are assumed to be correlated (they receive a common random shock).

Linear mixed effects models: Part2

19:16
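
A minimal random-intercept sketch with the nlme package (a recommended package bundled with R), using its built-in Orthodont data rather than the course's housing data:

```r
library(nlme)  # recommended package bundled with R

# Orthodont: repeated dental measurements per Subject. A random
# intercept per Subject plays the same role as the per-zipcode
# random shock described above.
fit <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

summary(fit)$tTable  # fixed effects with t tests
VarCorr(fit)         # between-subject vs residual variability
```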

We explore two techniques for dealing with outliers in the context of linear regression: estimating conditional quantiles (quantile regression), and robust regression using the rlm() function. Both approaches yield estimates that are not affected dramatically by outliers.

Robust regression

19:36
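
The effect can be sketched with MASS::rlm() on simulated data with a few injected outliers (MASS is a recommended package bundled with R):

```r
library(MASS)

set.seed(9)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)  # true slope 0.5
y[1:3] <- y[1:3] + 40         # inject three gross outliers

ols <- lm(y ~ x)   # ordinary least squares is pulled toward the outliers
rob <- rlm(y ~ x)  # M-estimation downweights them

c(ols = coef(ols)[["x"]], robust = coef(rob)[["x"]])  # robust slope stays near 0.5
```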

Robust regression

3 questions


Statistical modelling: GLM and Nonlinear regression
8 Lectures
02:16:29

Introduction to logistic regression. How to formulate a model using glm() and how to do the corresponding statistical tests

Preview
19:08
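
A minimal glm() sketch on the built-in mtcars data (predicting transmission type, not a course dataset):

```r
# am = 1 for manual transmission; heavier cars tend to be automatic
fit <- glm(am ~ wt, data = mtcars, family = binomial)

summary(fit)$coefficients  # Wald z statistics and p-values

# fitted probability of a manual transmission for a 2500 lb car
p25 <- predict(fit, newdata = data.frame(wt = 2.5), type = "response")
```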

Analyzing the degrees of freedom. Looking at the coefficients, odds ratio, and ROC curve. Calculating the area under the curve.

Logistic regression - Part 2

19:49

Using the "performance" package. Calculating the area under the curve. Doing ANOVA for GLMs

Logistic regression - Part 3

19:52

It is quite hard to interpret the coefficients of a logistic regression model, since it is a nonlinear model. Nevertheless, we can build curves profiling the predicted probabilities as one variable changes.

Logistic regression - Part 4

19:57

Poisson regression is a type of generalized linear model (GLM) suited to modelling count data (discrete and highly skewed). The only conceptual problem is that a single lambda parameter controls both the mean and the variance; as we will see, this can be a problem.

Poisson regression: Part1

19:54
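
A sketch on simulated count data, where the true log rate ratio is known by construction:

```r
set.seed(11)
year   <- rep(1:10, each = 20)
counts <- rpois(200, lambda = exp(0.5 + 0.1 * year))  # counts grow with year

fit <- glm(counts ~ year, family = poisson)
coef(fit)[["year"]]       # log rate ratio, close to the true 0.1
exp(coef(fit)[["year"]])  # multiplicative effect of one extra year on the mean
```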

We continue with our previous model. In this lecture we profile the predictions as one variable changes. In particular, we study how many people are expected to be affected as the year changes, in two situations.

Poisson regression: Part2

14:43

The fundamental assumption of the Poisson model is that the same parameter (lambda) describes both the mean and the variance. The problem is that sometimes the variability increases more than it should as lambda increases. One easy solution is the negative binomial model, which handles this problem very easily (think of the NB model as a generalization of the Poisson model).

Poisson regression: Part3

03:21

Nonlinear regression is usually used in the context of biology and pharmacokinetics. We show how to fit a model using nonlinear least squares. Specifically, we fit a Michaelis-Menten model for enzyme kinetics.

Nonlinear regression

19:45
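
R ships with this very example: the built-in Puromycin dataset and the self-starting SSmicmen model implement the Michaelis-Menten form rate = Vm * conc / (K + conc):

```r
treated <- subset(Puromycin, state == "treated")

# self-starting model: no manual starting values needed
fit <- nls(rate ~ SSmicmen(conc, Vm, K), data = treated)

coef(fit)  # Vm (maximum rate) and K (half-saturation concentration)
predict(fit, newdata = data.frame(conc = 0.5))
```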


XGBoost: Gradient Boosting
6 Lectures
01:17:46

We explore the basic parameters in XGBoost

Preview
16:24

We continue analyzing the parameters in XGBoost

How does it work? Relevant parameters - Part2

08:10

We use XGBoost for modelling the house prices in King County.

Using XGBoost for regression

14:37

XGBoost also comes with its own cross-validation function, which allows us to compute the cross-validated score for each sample. The only catch is that it requires a manual approach (this function does not tune the parameters for us).

Cross validation in XGBoost: the xgb.cv function

12:36

We use the caret package to train an XGBoost model via cross validation, which allows us to select some of the hyperparameters.

Preview
17:10

We use XGBoost for classifying mushrooms.

Preview
08:49


Principal components
2 Lectures
40:02

We work with a real dataset containing 65 indexes for 188 countries. The idea is to use principal component analysis to extract a set of components that explain a reasonable amount of the variability in this dataset. The advantage is that we end up with a much smaller set of features. These components can then be used for modelling (since we use fewer features, we are likely mitigating overfitting).

Selecting PCA and projecting the data

20:01
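
The same workflow can be sketched with prcomp() on the built-in USArrests data (a small stand-in for the 65-indicator country dataset):

```r
# scale. = TRUE standardizes the variables so none dominates the components
pc <- prcomp(USArrests, scale. = TRUE)

summary(pc)                         # variance explained per component
head(pc$x[, 1:2])                   # data projected on the first two components
cumsum(pc$sdev^2) / sum(pc$sdev^2)  # cumulative proportion of variance
```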

We use the principal components obtained in the previous lecture for regression. In particular, we predict the HDI (Human Development Index) in terms of a set of components.

PCA regression

20:01


Machine learning - the caret package - introduction
3 Lectures
49:27

We introduce the caret package, a fundamental R package for doing machine learning. It can be thought of as a layer standing between us and the underlying R packages, one that greatly simplifies the process of training, evaluating, and predicting with machine learning models.

Introduction

13:03

We explain how to leverage some of caret's excellent preprocessing capabilities. In particular, we use it for scaling features and for constructing a matrix of dummies from a categorical variable.

Preprocessing data: Part1

16:46

We continue with data preprocessing, in this case filtering out variables without enough variability. We also show how to find and filter out correlated variables.

Preprocessing data: Part2

19:38


Sound
1 Lecture
05:23

We use a specific R package for extracting several sound features. This is particularly relevant for classifying speech. In later stages of this course, we will run several machine learning models over this dataset.

Extracting meaningful sound features

05:23

6 More Sections