Linear regression in R for Data Scientists

Learn the most important technique in Analytics with lots of business examples. From basic to advanced.
3.8 (3 ratings)
66 students enrolled
  • Lectures 30
  • Length 7 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 11/2015 English

Course Description

Linear regression is the primary workhorse in statistics and data science. Its high degree of flexibility allows it to model very different problems. We will review the theory and concentrate on R applications using real-world data (R is free statistical software used heavily in industry and academia). We will learn how to build a real model, how to interpret it, and the computational details behind it. The goal is to give the student the computational knowledge necessary to work in industry and do applied research using linear modelling techniques. Some basic knowledge of statistics and R is recommended, but not necessary.

The course complexity increases as it progresses: we review basic R and statistics concepts, then transition into the linear model, explaining the computational, mathematical and R methods available. We then move on to much more advanced models, dealing with multilevel hierarchical models, and finally concentrate on nonlinear regression. We also leverage several of the latest R packages and the latest research.

We focus on typical business situations you will face as a data scientist or statistical analyst, and we cover many of the typical questions you will face when interviewing for a job. The course has lots of code examples, real datasets, quizzes, and video. The video duration is 4 hours, but the user is expected to spend at least 5 extra hours working on the examples, data, and code provided. After completing this course, the user is expected to be fully proficient with these techniques in an industry/business context. All code and data are available on GitHub.

What are the requirements?

  • Ideally some basic statistics and R, though neither is strictly necessary
  • Some previous experience manipulating Excel files

What am I going to get from this course?

  • Model basic and complex real-world problems using linear regression
  • Understand when models are performing poorly and correct them
  • Design complex models for hierarchical data
  • How to properly prepare the data for linear regression
  • When linear regression is not sufficient
  • Understand how to interpret the results and translate them to actionable insights

Who is the target audience?

  • People pursuing a career in Data Science
  • Statisticians needing more practical/computational experience
  • Data modellers
  • People pursuing a career in practical Machine Learning


Section 1: Introduction

Quick intro. Brief overview. What you will learn, and what you should learn before taking this course.


Use the attached link resource for all the code/data used in this course.


A more complete overview of this course.


Advantages of R. Why is it the main statistical software nowadays? What are the advantages and disadvantages?


Basic concepts in R. Installing packages. Vectors. Matrices. Working with dataframes and dates. Basic mathematical operations


Working with read.csv(). How to load csv files. We will review the basic data-processing techniques we will use in this course

Section 2: Linear regression: Ordinary Least Squares

A quick overview of what we are doing when using OLS and the lm() function in R. Projection matrices. Residuals. Geometrical interpretation. Formulas for coefficients.
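
As a rough illustration of the projection view (not taken from the course materials), the sketch below solves the normal equations by hand on simulated data and compares the result with lm(). The variable names and the simulated coefficients are assumptions made for this example.

```r
# Simulated data; n, x1, x2 and the true coefficients are illustrative only.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # normal equations: (X'X) b = X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)      # projection ("hat") matrix
fitted_vals <- H %*% y                     # y projected onto the column space of X
res <- y - fitted_vals                     # residuals, orthogonal to the columns of X

fit <- lm(y ~ x1 + x2)
all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # should agree up to tolerance
max(abs(t(X) %*% res))                                  # should be numerically zero
```

In practice lm() uses a QR decomposition rather than inverting X'X directly, which is more stable numerically; the explicit formula is shown here only to connect the geometry with the output.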


Running our first example in R. Using the lm() function. How to interpret the coefficients, p-values, ANOVA. F statistics


The equivalence between maximum likelihood (ML) and OLS. Why are these estimates equal? We review an example done via the optim() function in R. Minimizing the sum of squares numerically
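
A minimal sketch of this idea, assuming Gaussian errors and simulated data (the function names sse and negloglik, and the starting values, are hypothetical choices for this example, not the course's code): minimizing the sum of squares and maximizing the normal likelihood with optim() should both land close to the lm() coefficients.

```r
set.seed(2)
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 1.5 * x + rnorm(n, sd = 2)

# Sum of squared residuals as a function of b = (intercept, slope)
sse <- function(b) sum((y - b[1] - b[2] * x)^2)

# Negative Gaussian log-likelihood; par = (intercept, slope, log(sigma)).
# Working with log(sigma) keeps the standard deviation positive.
negloglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}

opt_sse <- optim(c(0, 0), sse, method = "BFGS")
opt_ml  <- optim(c(mean(y), 0, log(sd(y))), negloglik, method = "BFGS")
fit     <- lm(y ~ x)

# The three rows should be close to each other (up to optimizer tolerance)
rbind(lm = coef(fit), optim_sse = opt_sse$par, optim_ml = opt_ml$par[1:2])
```

The estimates agree because, for a fixed sigma, the Gaussian log-likelihood is a decreasing function of the sum of squared residuals, so both criteria are optimized by the same coefficients.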

Linear regression assumptions
6 questions

How to interpret p-values in the context of linear regression.


When are our p-values contaminated? How can we avoid this?


A much more complex example of OLS.


How to choose the best model? Why do we want models with few variables? How can we use the stepAIC() function and the AIC() function? What happens when we remove variables from our model?

10 questions

From the datasets folder, open the CO2-Emissions.csv data. This data was obtained from the World Bank. The objective is to predict what the CO2 emissions in India would be. The data is from 1961

Choosing the best model
5 questions

We need our residuals to verify our OLS assumptions. How can we check normality, homoscedasticity, and non-autocorrelation? How to read the qqplot() and some normality tests.


The relationship between leverage, outliers, and influence. How do we use the Cook's distance (Cook's D) statistic? How can we read the last chart that lm() produces?


When do we need models with variables in logs, using log()? Log-log models.

3 questions

Models with lots of variables might end up adjusting not to the true response, but to "noise". We use the DAAG package for cross-validation mean square error.


Using the predict() function in R. The difference between confidence intervals and prediction intervals. The difference between the variances of both predictions
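
To make the distinction concrete, here is a small sketch on simulated data (the data frame d and the new x values are assumptions for illustration). The prediction interval adds the residual variance to the variance of the fitted mean, so it is always wider than the confidence interval.

```r
set.seed(3)
d <- data.frame(x = runif(50, 0, 10))
d$y <- 2 + 0.8 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)
new_x <- data.frame(x = c(2, 5, 8))

# Interval for the mean response vs. interval for a new observation
conf_int <- predict(fit, newdata = new_x, interval = "confidence")
pred_int <- predict(fit, newdata = new_x, interval = "prediction")

# Standard error of the fitted mean and of a new observation
p <- predict(fit, newdata = new_x, se.fit = TRUE)
se_mean <- p$se.fit
se_pred <- sqrt(p$se.fit^2 + p$residual.scale^2)  # adds the residual variance

t_crit <- qt(0.975, df = p$df)
cbind(x = new_x$x,
      half_width_ci = t_crit * se_mean,
      half_width_pi = t_crit * se_pred)
```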


The consequences of multicollinearity. How can we detect it, and what are the options to deal with it? VIFs, and condition indexes
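
As a hedged sketch of how these diagnostics can be computed without extra packages (the simulated predictors and the helper name vif_manual are assumptions for this example), each VIF is 1 / (1 - R²) from regressing one predictor on the others. The condition indexes below follow one common convention, based on the singular values of the standardized predictors.

```r
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1 by construction
x3 <- rnorm(n)
d  <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(d), function(v) {
  others <- setdiff(names(d), v)
  r2 <- summary(lm(reformulate(others, response = v), data = d))$r.squared
  1 / (1 - r2)
})
vif_manual

# Condition indexes: largest singular value divided by each singular value
sv <- svd(scale(as.matrix(d)))$d
cond_index <- max(sv) / sv
cond_index
```

Here x1 and x2 should show large VIFs while x3 stays near 1, which matches how the data were generated.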


The problem of non-constant variance. How can we identify it using the R plots? Using robust sandwich matrices via the sandwich package


Detecting auto-correlation. Using the robust HAC matrix from the sandwich package. The ACF() function

Linear Regression
8 questions

Monte Carlo in Excel. Monte Carlo simulation in R creating our own function. Creating synthetic datasets
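
A minimal sketch of such a simulation function, assuming a simple linear model with normal errors (the function name simulate_slope and its default arguments are hypothetical, not the course's code): it generates many synthetic datasets, fits lm() to each, and collects the estimated slopes.

```r
# Repeatedly simulate y = beta0 + beta1 * x + noise and return the fitted slopes
simulate_slope <- function(n_sims = 500, n = 50, beta0 = 1, beta1 = 2, sigma = 1) {
  replicate(n_sims, {
    x <- rnorm(n)
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
    coef(lm(y ~ x))[["x"]]
  })
}

set.seed(5)
slopes <- simulate_slope()
mean(slopes)   # should be close to the true slope, beta1 = 2
sd(slopes)     # empirical sampling variability of the slope estimate
hist(slopes, main = "Simulated sampling distribution of the slope")
```

Changing n or sigma and rerunning gives a quick feel for how sample size and noise affect the precision of the estimate.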

Monte Carlo
3 questions
Section 3: Linear regression: Mixed Effects Regression

An introduction to mixed models. What are the conceptual differences between mixed models and fixed effects models? Simulating datasets with random effects via Monte Carlo in R.


The possible definitions of random effects: A) the effects we don't care about B) the effects we can treat as coming from an infinite population C) the effects not estimated by least squares D) the unobserved effects that change through time


Simulating random effects for the intercept and the independent variables. The different slopes per group, and what is the interpretation


We create our own function for maximizing the log-likelihood, and we compare this to lmer().

Mixed effects linear regression
7 questions

How to analyse the residuals from an lmer() object? Using the plots that R produces

Random vs Fixed effects
2 questions

Nested effects, crossed effects. The different operators we can use in lmer(). The different ways of defining the random effects


The problem of multiple comparisons. Using the lmerTest package. Checking for significant differences. Comparing different levels of our categorical variables.

Section 4: Robust linear regression

Why do outliers bring problems? What is an outlier? How can we detect them? The rlm() and lmrob() functions

Instructor Biography

I have worked for 7+ years as a statistical programmer in industry. Expert in programming, statistics, data science, and statistical algorithms. I have wide experience in many programming languages. Regular contributor to the R community, with 3 published packages. I am also an expert SAS programmer. Contributor to scientific statistical journals. Latest publication in the Journal of Statistical Software.
