When you buy any of my courses, you also get free coupons for the rest of my courses. Just send me a message after enrolling. Pay for one course, get five!
Linear regression is the primary workhorse of statistics and data science. Its high degree of flexibility allows it to model very different problems. We will review the theory, and we will concentrate on R applications using real-world data (R is free statistical software used heavily in industry and academia). We will learn how to build a real model, how to interpret it, and the computational details behind it. The goal is to give the student the computational knowledge necessary to work in industry, and to do applied research, using linear modelling techniques.

Some basic knowledge of statistics and R is recommended, but not necessary. The course complexity increases as it progresses: we first review basic R and statistics concepts, then transition into the linear model, explaining the computational and mathematical methods available in R. We then move into much more advanced material, dealing with multilevel hierarchical models, and finally concentrate on nonlinear regression. We also leverage several of the latest R packages and the latest research. We focus on typical business situations you will face as a data scientist or statistical analyst, and we cover many of the typical questions you will face when interviewing for a job.

The course has lots of code examples, real datasets, quizzes, and video. The video runs 4 hours, but you should expect to spend at least 5 extra hours working through the examples, data, and code provided. After completing this course, you should be fully proficient with these techniques in an industry/business context. All code and data are available on GitHub.
A quick introduction and brief overview: what you will learn, and what you should know before taking this course.
Use the attached link resource for all the code/data used in this course.
A more complete overview of this course.
Advantages of R. Why is it the main statistical software nowadays? What are its advantages and disadvantages?
Basic concepts in R. Installing packages. Vectors. Matrices. Working with dataframes and dates. Basic mathematical operations
Working with read.csv(): how to load CSV files. We will review the basic data-processing techniques used throughout this course.
A quick overview of what we are doing when using OLS and the lm() function in R: projection matrices, residuals, the geometric interpretation, and the formulas for the coefficients.
Running our first example in R using the lm() function. How to interpret the coefficients, p-values, ANOVA table, and F statistic.
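As a sketch of that workflow (using the built-in mtcars dataset, not the course datasets):

```r
# Fit a linear model of fuel efficiency on weight and horsepower
# using the built-in mtcars dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)  # coefficients, p-values, R-squared, F statistic
anova(fit)    # sequential ANOVA table
```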
The equivalence between maximum likelihood (ML) and OLS estimation. Why are these estimates equal? We work through an example using the optim() function in R, minimizing the sum of squares numerically.
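A minimal sketch of the numerical approach, again on the built-in mtcars data:

```r
# Recover the OLS coefficients for mpg ~ wt by numerically
# minimising the residual sum of squares with optim().
rss <- function(beta) sum((mtcars$mpg - beta[1] - beta[2] * mtcars$wt)^2)

num <- optim(c(0, 0), rss)          # default Nelder-Mead minimisation
ols <- coef(lm(mpg ~ wt, data = mtcars))

round(num$par, 2)      # numerical estimates
round(unname(ols), 2)  # closed-form OLS estimates: essentially identical
```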
How to interpret p-values in the context of linear regression.
When are our p-values contaminated, and how can we avoid this?
A much more complex example of OLS.
How do we choose the best model? Why do we want models with few variables? How can we use the stepAIC() and AIC() functions? What happens when we remove variables from our model?
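A sketch of automatic backward selection with stepAIC() from the MASS package (which ships with standard R installations), on the built-in mtcars data:

```r
library(MASS)

# Start from a model with several predictors and let stepAIC()
# drop variables while the AIC keeps improving.
full <- lm(mpg ~ wt + hp + disp + drat, data = mtcars)
best <- stepAIC(full, direction = "backward", trace = FALSE)

formula(best)           # the retained variables
AIC(best) <= AIC(full)  # dropping variables never worsened the AIC here
```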
From the datasets folder, open the CO2-Emissions.csv data, obtained from the World Bank. The objective is to predict CO2 emissions in India. The data starts in 1961.
We use the residuals to verify our OLS assumptions. How can we check normality, homoscedasticity, and non-autocorrelation? How to read the Q-Q plot and some normality tests.
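A minimal sketch of these diagnostics in base R (on the built-in mtcars data):

```r
# The standard residual diagnostics for an lm() fit.
fit <- lm(mpg ~ wt, data = mtcars)

par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted, normal Q-Q, scale-location, leverage

shapiro.test(residuals(fit))  # one common test of residual normality
```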
The relationship between leverage, outliers, and influence. How do we use Cook's distance? How do we read the last chart that plotting an lm object produces?
When do we need models with variables in logs? Log-log models.
Models with lots of variables may end up fitting not the true response but the noise (overfitting). We use the DAAG package to compute the cross-validated mean squared error.
Using the predict() function in R. The difference between confidence intervals and prediction intervals, and the difference between the variances of the two predictions.
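A sketch of the two interval types, on the built-in mtcars data:

```r
# Confidence vs prediction intervals from predict().
fit <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = 3)

ci <- predict(fit, new, interval = "confidence")  # interval for the mean response
pi <- predict(fit, new, interval = "prediction")  # interval for a new observation

ci
pi  # always wider: it adds the residual variance to the uncertainty
```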
The consequences of multicollinearity. How can we detect it, and what are the options for dealing with it? VIFs and condition indexes.
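As a sketch, a VIF can be computed by hand in base R (the car package's vif() automates this for every predictor):

```r
# Regress one predictor on the others and take 1 / (1 - R^2).
# Here: the VIF of wt in a model that also uses hp and disp.
r2     <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2)
vif_wt  # values above roughly 5-10 are a common warning sign
```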
The problem of non-constant variance. How can we identify it using the R plots? Using robust sandwich covariance matrices via the sandwich package.
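As a sketch of what the sandwich estimator computes, here is a hand-rolled HC0 version in base R; the sandwich package's vcovHC() does this (with refinements) for you:

```r
fit <- lm(mpg ~ wt, data = mtcars)
X   <- model.matrix(fit)
u   <- residuals(fit)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * u)      # X' diag(u^2) X
vc    <- bread %*% meat %*% bread

sqrt(diag(vc))  # heteroscedasticity-robust standard errors
```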
Detecting autocorrelation. Using the robust HAC matrix from the sandwich package. The acf() function.
Monte Carlo in Excel. Monte Carlo simulation in R creating our own function. Creating synthetic datasets
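A minimal sketch of the simulation idea: build a synthetic dataset where the true coefficients are known, then check that lm() recovers them:

```r
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)  # true intercept 2, true slope 3

coef(lm(y ~ x))  # estimates should land close to (2, 3)
```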
An introduction to mixed models. The conceptual differences between mixed-effects models and fixed-effects models. Simulating datasets with random effects via Monte Carlo in R.
Possible definitions of random effects: A) the effects we don't care about; B) the effects we can treat as coming from an infinite population; C) the effects not estimated by least squares; D) the unobserved effects that change through time.
Simulating random effects for the intercept and the independent variables. Different slopes per group, and how to interpret them.
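A sketch of the simplest case, a random intercept per group, simulated in base R; this is the kind of data you would hand to lme4::lmer(y ~ x + (1 | g), data = d):

```r
set.seed(1)
n_groups <- 20
n_per    <- 30
g <- rep(seq_len(n_groups), each = n_per)
u <- rnorm(n_groups, sd = 2)            # one random intercept per group
x <- rnorm(n_groups * n_per)
y <- 1 + 0.5 * x + u[g] + rnorm(length(x))

d <- data.frame(y, x, g = factor(g))
var(tapply(y, g, mean))  # between-group variance, driven by sd = 2 above
```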
We write our own function for maximizing the log-likelihood, and we compare the result to lmer().
How to analyse the residuals from an lmer() object, using the plots that R produces.
Nested effects and crossed effects. The different operators we can use in lmer(), and the different ways of defining the random effects.
The problem of multiple comparisons. Using the lmerTest package. Checking for significant differences, and comparing the different levels of our categorical variables.
Why do outliers cause problems? What is an outlier, and how can we detect them? The rlm() (MASS) and lmrob() (robustbase) functions.
I have 7+ years of experience as a statistical programmer in the industry. Expert in programming, statistics, data science, and statistical algorithms, with wide experience in many programming languages. Regular contributor to the R community, with 3 published packages. I am also an expert SAS programmer, and a contributor to scientific statistical journals; my latest publication is in the Journal of Statistical Software.