Practical Data Science: Reducing High Dimensional Data in R

In this R course, we'll see how PCA can reduce a 5000+ variable data set into 10 variables and barely lose accuracy!
  • Lectures 11
  • Length 2.5 hours
  • Skill Level All Levels
  • Languages English

About This Course

Published 9/2015 English

Course Description

In this R course, we'll see how PCA can reduce a data set with 5000+ variables down to 10 variables while barely losing accuracy. We'll look at different ways of measuring PCA's effectiveness and at other ways of reducing wide data sets (those with many features/variables). We'll also look at the advantages and disadvantages of each approach to reducing data.

What are the requirements?

  • Some understanding and interest in the R programming language

What am I going to get from this course?

  • Understand various ways of reducing wide data sets
  • Understand Principal Component Analysis (PCA)
  • Control, tune and measure the effects of PCA
  • Use GBM modeling to measure the effectiveness of PCA
  • Reduce dimensionality with classic GBM and GLMNET variable selection
  • Use ensembling techniques to find the most stable variables

What is the target audience?

  • Some understanding and interest in the R programming language
  • Interest in reducing large data sets

Curriculum

Section 1: High Dimensionality Data
01:03

Quick overview of what will be covered in this class

06:35

This is an optional video explaining where to find the binaries for R and RStudio needed to follow this course.

Section 2: Principal Component Analysis (PCA)
07:52
Overview of different methods for reducing dimensionality and how PCA can be applied to drastically reduce the number of features while only losing some accuracy
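
As a rough illustration of the idea in this lecture description, the sketch below measures how many principal components are needed to keep 90% of the variance. The data is synthetic (an assumption made for this example, not the course's data set), and the 90% threshold is an arbitrary choice.

```r
# Synthetic wide data: 500 columns driven by only 5 hidden factors plus noise.
# All sizes and the 0.90 threshold are assumptions made for this sketch.
set.seed(42)
n <- 200; p <- 500; k_true <- 5
latent   <- matrix(rnorm(n * k_true), n, k_true)
loadings <- matrix(rnorm(k_true * p), k_true, p)
x <- latent %*% loadings + matrix(rnorm(n * p, sd = 0.3), n, p)

# PCA on centered, scaled columns
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 90%
k90 <- which(cum_var >= 0.90)[1]

# The reduced data set: one row per observation, k90 columns
reduced <- pca$x[, seq_len(k90), drop = FALSE]
dim(reduced)
```

Real data rarely has such a clean low-rank structure, so the number of components needed will usually be larger than in this toy case.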
11:01

Overview of the two PCA functions (prcomp and princomp) from the {stats} package included in the base R install.
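
A minimal side-by-side of the two functions, using the USArrests data set that ships with R (chosen here only for convenience; it is not the course data). It shows the different variance divisors and that princomp refuses data with more variables than observations, which matters for wide data sets.

```r
x <- USArrests            # 50 observations, 4 numeric variables
n <- nrow(x)

pc_svd <- prcomp(x, center = TRUE, scale. = FALSE)  # SVD based, divisor n - 1
pc_eig <- princomp(x, cor = FALSE)                  # eigen based, divisor n

# The component standard deviations differ only by the factor sqrt((n - 1) / n)
ratio <- unname(pc_eig$sdev) / pc_svd$sdev

# The result objects use different names for the same ideas:
#   prcomp:   $rotation (loadings), $x (scores)
#   princomp: $loadings,            $scores
dim(pc_svd$rotation)
dim(pc_eig$loadings)

# Wide data (more columns than rows): prcomp works, princomp raises an error
set.seed(1)
wide <- matrix(rnorm(10 * 50), nrow = 10)
wide_pca <- prcomp(wide)
wide_err <- tryCatch(princomp(wide), error = function(e) conditionMessage(e))
wide_err
```

For data sets with thousands of variables and fewer rows, this restriction alone is a reason to reach for prcomp.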

Section 3: Reducing Dimensionality With PCA and GBM modeling
19:31

Here we will:

  • Download a large data set from the UC Irvine Machine Learning Repository
  • Use GBM (Generalized Boosted Models) to model the raw data and predict an outcome
  • Apply prcomp to the data, model it using different numbers of principal components, and compare accuracy
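
The steps above can be sketched roughly as follows. This is not the course's code: the data is synthetic rather than downloaded from the UCI repository, and a plain logistic regression (glm) stands in for GBM so the example runs with base R only. The helper name accuracy_with_k is made up for this illustration.

```r
# Synthetic stand-in for a wide data set (assumption for this sketch)
set.seed(1)
n <- 400; p <- 200
latent <- matrix(rnorm(n * 3), n, 3)
x <- latent %*% matrix(rnorm(3 * p), 3, p) + matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(2 * latent[, 1] - 2 * latent[, 2]))

# Hold out 100 rows for testing
train <- sample(n, 300)

# Learn the PCA rotation on the training rows only
pca <- prcomp(x[train, ], center = TRUE, scale. = TRUE)

# Project both splits with that same rotation
scores_train <- predict(pca, newdata = x[train, ])
scores_test  <- predict(pca, newdata = x[-train, ])

# Hypothetical helper: fit on the first k components, return test accuracy.
# glm is a stand-in here; the course uses GBM for this step.
accuracy_with_k <- function(k) {
  df_train <- data.frame(y = y[train], scores_train[, 1:k, drop = FALSE])
  df_test  <- data.frame(scores_test[, 1:k, drop = FALSE])
  fit  <- glm(y ~ ., data = df_train, family = binomial())
  prob <- predict(fit, newdata = df_test, type = "response")
  mean((prob > 0.5) == y[-train])
}

ks  <- c(1, 3, 10)
acc <- sapply(ks, accuracy_with_k)
names(acc) <- paste0("k=", ks)
acc
```

Fitting the rotation on the training rows and reusing it on the test rows keeps the accuracy comparison honest, whichever model is used downstream.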
prcomp and GBM, part 2
16:53
Section 4: Reducing Dimensionality With Variable Selection
17:17

Here we use GBM to reduce the number of variables while preserving feature names (reduction instead of compression). Note: the caret code in the attached PDF has been updated to reflect the package's latest object model.
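
A minimal sketch of the idea, assuming the gbm package is installed. It calls gbm directly rather than through caret as the course does, and uses synthetic data in which only x1 and x2 drive the outcome. Keeping the top 5 variables is an arbitrary cutoff chosen for this example.

```r
library(gbm)

# Synthetic data (assumption): 20 predictors, only x1 and x2 matter
set.seed(7)
n <- 500; p <- 20
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("x", seq_len(p))
dat <- cbind(y = rbinom(n, 1, plogis(3 * x$x1 - 3 * x$x2)), x)

# Boosted trees for a 0/1 outcome
fit <- gbm(y ~ ., data = dat, distribution = "bernoulli",
           n.trees = 200, interaction.depth = 2, shrinkage = 0.05,
           verbose = FALSE)

# Relative influence per variable, sorted from most to least influential
imp <- summary(fit, n.trees = 200, plotit = FALSE)

# Keep the 5 most influential variables; the original column names survive
top_vars <- as.character(imp$var[seq_len(5)])
reduced  <- dat[, c("y", top_vars)]
names(reduced)
```

Unlike PCA, the reduced data set still contains the original, interpretable columns.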

Variable selection using GBM, part 2
17:21
13:43

Same approach as before, except using GLMNET (with a few twists). Note: the caret code in the attached PDF has been updated to reflect the package's latest object model.
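
A comparable sketch with glmnet, assuming the package is installed. It shows plain lasso selection with cross-validation on synthetic data; the "twists" mentioned in the lecture are not reproduced here, and choosing lambda.1se is an assumption of this example.

```r
library(glmnet)

# Synthetic data (assumption): 20 predictors, only x1 and x2 matter
set.seed(7)
n <- 500; p <- 20
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- rbinom(n, 1, plogis(3 * x[, "x1"] - 3 * x[, "x2"]))

# Lasso-penalized logistic regression (alpha = 1) with 5-fold cross-validation
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Coefficients at the more conservative lambda; the lasso sets many to zero
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))

# Variables with a nonzero coefficient are the ones the model kept
kept <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
x_reduced <- x[, kept, drop = FALSE]
kept
```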

Section 5: Reducing Dimensionality With Ensemble Modeling
17:46

Brief look at the Minimum Redundancy Maximum Relevance (mRMRe) package to reduce very wide data sets quickly. Note: the caret code in the attached PDF has been updated to reflect the package's latest object model.
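
A small sketch of ensemble feature selection with mRMRe, assuming the package is installed. The data is synthetic, and the solution and feature counts are arbitrary choices for illustration. Counting how often each feature appears across solutions is one simple way to gauge stability and is not necessarily the course's method.

```r
library(mRMRe)

# Synthetic data (assumption): numeric target driven mostly by x1 and x2
set.seed(3)
n <- 200; p <- 10
feats <- as.data.frame(matrix(rnorm(n * p), n, p))
names(feats) <- paste0("x", seq_len(p))
dat <- data.frame(target = 2 * feats$x1 - feats$x2 + rnorm(n), feats)

# mRMRe works on its own data container; columns must be numeric here
dd <- mRMR.data(data = dat)

# Ask for 3 alternative solutions of 3 features each for column 1 (the target)
ens <- mRMR.ensemble(data = dd, target_indices = 1,
                     solution_count = 3, feature_count = 3)

# Feature indices per solution, mapped back to column names
sols   <- solutions(ens)[[1]]
picked <- featureNames(dd)[as.vector(sols)]
picked <- picked[!is.na(picked)]

# How often each feature was selected across the solutions
stability <- sort(table(picked), decreasing = TRUE)
stability
```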

15:58

This lecture is marked 'optional' because the software may be difficult for some to install and is fairly complex to troubleshoot, so I cannot help you with issues.

Instructor Biography

Manuel Amunategui, Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons. People often ask me how one can break into this field, to which I reply: 'join an online competition!'