Practical Data Science: Reducing High Dimensional Data in R
4.3 (141 ratings)
1,761 students enrolled

In this R course, we'll see how PCA can reduce a data set with 5000+ variables down to 10 variables and barely lose accuracy!
Created by Manuel Amunategui
Last updated 4/2017
English
English [Auto]
This course includes
  • 2.5 hours on-demand video
  • 8 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • Understand various ways of reducing wide data sets
  • Understand Principal Component Analysis (PCA)
  • Control, tune and measure the effects of PCA
  • Use GBM modeling to measure the effectiveness of PCA
  • Reduce dimensionality with classic GBM and GLMNET variable selection
  • Use ensembling techniques to find the most stable variables
Course content
11 lectures 02:24:59
+ High Dimensionality Data
2 lectures 07:38

A quick overview of what this class covers.

Preview 01:03

This is an optional video explaining where to find the binaries for R and RStudio needed to follow this course.

Preview 06:35
+ Principal Component Analysis (PCA)
2 lectures 18:53
Overview of different methods for reducing dimensionality, and of how PCA can drastically reduce the number of features while losing only some accuracy.
Preview 07:52

Overview of the two PCA functions (prcomp and princomp) from the {stats} package included in the base R install.

prcomp & princomp
11:01
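
As a rough illustration of how the two functions compare, the sketch below runs both on R's built-in USArrests data set. The data set and the variable names are assumptions chosen for this example and are not taken from the course material. prcomp works through a singular value decomposition, while princomp uses an eigen decomposition and only accepts data with more rows than variables, which matters for very wide data.

```r
# Both functions ship with the {stats} package, which loads with base R.
# USArrests is a small built-in data set used here purely for illustration.
data(USArrests)

# prcomp: principal components via singular value decomposition (SVD).
pca_svd <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# princomp: principal components via eigen decomposition of the
# correlation matrix (cor = TRUE) or the covariance matrix (cor = FALSE).
pca_eig <- princomp(USArrests, cor = TRUE)

# Proportion of total variance explained by each component.
var_svd <- pca_svd$sdev^2 / sum(pca_svd$sdev^2)
var_eig <- unname(pca_eig$sdev^2 / sum(pca_eig$sdev^2))

print(round(var_svd, 3))
print(round(var_eig, 3))
```

On standardized data the two should report the same proportions of variance, so the choice between them is mostly about numerical method and the shape of the data.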
+ Reducing Dimensionality With PCA and GBM modeling
2 lectures 36:24

Here we will:

  • Download a large data set from the UC Irvine Machine Learning Repository
  • Use GBM (Generalized Boosted Models) to model the raw data and predict an outcome
  • Apply prcomp to the data, model it using different numbers of principal components, and compare accuracy
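
The workflow in the list above can be sketched in base R. This is a minimal sketch under several assumptions: the data is simulated rather than downloaded from the UC Irvine repository, a logistic regression (glm) stands in for GBM so that no extra packages are needed, and the helper accuracy_with_k is a hypothetical name introduced for this example.

```r
set.seed(42)

# Simulated wide-ish data (an assumption for illustration, not the course data):
# 300 rows and 50 numeric features that all share one latent signal.
n <- 300
p <- 50
latent <- rnorm(n)
x <- sapply(seq_len(p), function(j) latent * runif(1, 0.5, 1.5) + rnorm(n))
colnames(x) <- paste0("V", seq_len(p))
y <- as.integer(latent + rnorm(n, sd = 0.5) > 0)

# Hold out a test split.
train_idx <- sample(n, size = 200)
x_train <- x[train_idx, ]
x_test  <- x[-train_idx, ]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Fit PCA on the training rows only, then project both splits.
pca <- prcomp(x_train, center = TRUE, scale. = TRUE)
pcs_train <- predict(pca, newdata = x_train)
pcs_test  <- predict(pca, newdata = x_test)

# Hypothetical helper: test accuracy of a logistic regression
# (a stand-in for GBM) fitted on the first k principal components.
accuracy_with_k <- function(k) {
  train_df <- data.frame(y = y_train, pcs_train[, seq_len(k), drop = FALSE])
  test_df  <- data.frame(pcs_test[, seq_len(k), drop = FALSE])
  fit <- glm(y ~ ., data = train_df, family = binomial())
  pred <- predict(fit, newdata = test_df, type = "response") > 0.5
  mean(pred == (y_test == 1))
}

# Compare accuracy across different numbers of components.
ks <- c(1, 2, 5, 10)
results <- data.frame(components = ks, accuracy = sapply(ks, accuracy_with_k))
print(results)
```

Fitting prcomp on the training split alone and projecting the test split with predict keeps information from the held-out rows out of the components, which makes the accuracy comparison fairer.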
prcomp and GBM
19:31
prcomp and GBM, part 2
16:53
+ Reducing Dimensionality With Variable Selection
3 lectures 48:21

Here we use GBM to reduce the number of variables while preserving feature names (reduction instead of compression). Note: the caret code in the attached PDF was updated to reflect the package's latest object model.

Variable selection using GBM (Generalized Boosted Models)
17:17
Variable selection using GBM, part 2
17:21

The same variable selection approach as before, except using GLMNET (with a few twists). Note: the caret code in the attached PDF was updated to reflect the package's latest object model.

Variable selection using GLMNET
13:43
+ Reducing Dimensionality With Ensemble Modeling
2 lectures 33:44

A brief look at using the Minimum Redundancy Maximum Relevance (mRMRe) package to reduce very wide data sets quickly. Note: the caret code in the attached PDF was updated to reflect the package's latest object model.

Variable selection using Minimum Redundancy Maximum Relevance (mRMRe)
17:46

This lecture is marked 'optional' because fscaret may be difficult for some to install and is fairly complex to troubleshoot, so I cannot help you with issues.

Optional: Variable selection using fscaret
15:58
Requirements
  • Some understanding of, and interest in, the R programming language
Description

In this R course, we'll see how PCA can reduce a data set with 5000+ variables down to 10 variables and barely lose accuracy! We'll look at different ways of measuring PCA's effectiveness and at other ways of reducing wide data sets (those with lots of features/variables). We'll also look at the advantages and disadvantages of the different ways of reducing data.

Who this course is for:
  • Anyone with some understanding of, and interest in, the R programming language
  • Anyone interested in reducing large data sets