Automating Data Exploration with R

Build the tools needed to quickly turn data into model-ready data sets
4.7 (31 ratings)
217 students enrolled
  • Lectures 21
  • Length 4 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 4/2016 · English

Course Description

As data scientists and analysts, we face constant, repetitive tasks when approaching new data sets. This class aims to automate many of these tasks so we can get to the actual analysis as quickly as possible. Of course, there will always be exceptions to the rule; some manual work and customization will be required. But overall, a large swath of that work can be automated by building a smart pipeline, and that is what we'll do here. This is especially important in the era of big data, where handling variables by hand isn't always possible.

It is also a great learning strategy to think in terms of a processing pipeline and to understand, design, and build each stage as a separate, independent unit.
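
To make that concrete, here is a minimal sketch of the stage-as-a-unit idea, where every stage is a plain function that takes a data frame and returns one (the function and stage names are illustrative, not the course's actual code):

    # Each stage is an independent unit: data frame in, data frame out.
    drop_empty_cols <- function(df) df[, colSums(!is.na(df)) > 0, drop = FALSE]
    impute_zeros    <- function(df) { df[is.na(df)] <- 0; df }

    # Chain the stages with Reduce so new steps can be slotted in later.
    run_pipeline <- function(df, stages) Reduce(function(d, f) f(d), stages, init = df)

    raw <- data.frame(a = c(1, NA, 3), b = NA)
    run_pipeline(raw, list(drop_empty_cols, impute_zeros))

Because each stage shares the same signature, stages can be developed and tested in isolation and reordered freely.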

What are the requirements?

  • Basic understanding of R programming
  • Some statistical and modeling knowledge

What am I going to get from this course?

  • Build a pipeline to automate the processing of raw data for discovery and modeling
  • Know the main steps to prepare data for modeling
  • Know how to handle the different data types in R
  • Understand data imputation
  • Treat categorical data properly with binarization (making dummy columns)
  • Apply feature engineering to dates, integers and real numbers
  • Apply variable selection, correlation and significance tests
  • Model and measure prepared data using both supervised and unsupervised modeling

What is the target audience?

  • Anyone who needs to process raw data for exploration and modeling in R

Curriculum

Section 1: Introduction
02:12

02:47

Let's briefly talk big picture so we're all on the same page.

02:57

A brief video on where to download base R and RStudio. Skip this if you already have them up and running.

Section 2: Reading Data
16:22

Let's take a look at popular data readers from the base package, readr, and data.table.
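
For a quick taste of the three readers (assuming a local file named mydata.csv, used here purely as a placeholder):

    # Base R: no extra packages, but comparatively slow on large files.
    df1 <- read.csv("mydata.csv", stringsAsFactors = FALSE)

    # readr: faster, and never silently converts strings to factors.
    library(readr)
    df2 <- read_csv("mydata.csv")

    # data.table: fread is typically the fastest and auto-detects separators.
    library(data.table)
    df3 <- fread("mydata.csv")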

Section 3: Data Transformation - Data Scrubbing
19:46

Let's start by looking at dates and how to format them properly.
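
For example, base R's as.Date parses date text once you describe the incoming layout (the sample strings below are illustrative):

    as.Date("04/15/2016", format = "%m/%d/%Y")   # month/day/year text
    as.Date("2016-04-15")                        # ISO dates need no format
    format(Sys.Date(), "%Y-%m-%d")               # render a Date back to text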

17:55

We need to find clever ways of turning text data into numbers. You could choose to ignore any text and just model off the numerical variables, but you would be leaving a lot of intelligence on the table.

18:40

Only in very few cases can you turn text into factors directly and model it. Let's do it the right way and see what it takes to do it properly.
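
One common way to binarize (make dummy columns from) a factor in base R is model.matrix; a minimal sketch with an illustrative toy column:

    df <- data.frame(color = factor(c("red", "blue", "red", "green")))

    # The -1 drops the intercept so every level gets its own 0/1 column.
    dummies <- model.matrix(~ color - 1, data = df)
    head(dummies)   # columns: colorblue, colorgreen, colorred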

12:02

Let's do a pipeline check and upgrade our Binarize_Features function

13:47

Let's look at imputing missing data with 0s or the mean value of the feature.
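
A minimal sketch of both strategies on an illustrative vector:

    x <- c(2, NA, 6, NA, 10)

    # Strategy 1: replace missing values with 0.
    x_zero <- ifelse(is.na(x), 0, x)

    # Strategy 2: replace missing values with the feature's mean.
    x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

    x_zero   # 2  0  6  0 10
    x_mean   # 2  6  6  6 10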

Pipeline Check
08:24
06:28

Here is a look at a cool function from the caret package: nearZeroVar. It can tell you which features have little or no variance (no PDF associated with this video).
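
A small usage sketch (nearZeroVar is the real caret function; the toy data frame is illustrative):

    library(caret)

    df <- data.frame(constant = rep(1, 100),
                     rare     = c(rep(0, 99), 1),
                     normal   = rnorm(100))

    # saveMetrics = TRUE returns the diagnostics rather than just indices.
    nearZeroVar(df, saveMetrics = TRUE)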

Section 4: Feature Engineering
16:19

Let's see what additional numerical data we can pull out of date features.
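
For instance, a single Date column can yield several numeric features (the feature names are illustrative):

    d <- as.Date(c("2016-01-15", "2016-07-04", "2016-12-25"))

    data.frame(year    = as.integer(format(d, "%Y")),
               month   = as.integer(format(d, "%m")),
               day     = as.integer(format(d, "%d")),
               weekday = as.POSIXlt(d)$wday,        # 0 = Sunday
               yearday = as.POSIXlt(d)$yday + 1)    # day of the year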

12:20

Just like we squeezed more intelligence out of dates, here we'll apply the same principles to integers and real numbers.
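
One common trick is binning a numeric feature with cut (quartile breaks are used here purely as an illustration):

    x <- c(3, 18, 27, 41, 56, 74, 89, 95)

    # Bucket the values into quartiles; the bucket becomes a new feature.
    bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
                include.lowest = TRUE, labels = c("q1", "q2", "q3", "q4"))
    table(bins)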

Pipeline Check
06:19
Section 5: Basic Data Exploration
21:52

Let's look at pairwise correlations and how to access the results programmatically.
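
A sketch of pulling high correlations out of the matrix programmatically (the data frame is illustrative):

    df <- data.frame(a = 1:10, b = (1:10) + rnorm(10), c = rnorm(10))

    m <- cor(df)

    # Extract the highly correlated pairs without eyeballing the matrix.
    hits <- which(abs(m) > 0.7 & upper.tri(m), arr.ind = TRUE)
    data.frame(var1 = rownames(m)[hits[, 1]],
               var2 = colnames(m)[hits[, 2]],
               cor  = m[hits])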

04:49

A look at the findCorrelation function from the caret package (no PDF associated with this lecture).
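
A usage sketch of that function (the data frame is illustrative; findCorrelation is the real caret function):

    library(caret)

    df <- data.frame(a = 1:10, b = (1:10) + rnorm(10, sd = 0.1), c = rnorm(10))

    # Returns the column indices caret suggests dropping so that no
    # remaining pairwise correlation exceeds the cutoff.
    drop_idx <- findCorrelation(cor(df), cutoff = 0.75)
    names(df)[drop_idx]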

10:31

Finding outliers in feature sets using the mean and standard deviation.
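
A minimal sketch of that rule of thumb (the three-standard-deviation threshold is a common but arbitrary choice):

    x <- c(rnorm(100), 12)   # one planted outlier

    # Flag anything more than 3 standard deviations from the mean.
    is_outlier <- abs(x - mean(x)) > 3 * sd(x)
    x[is_outlier]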

Section 6: Modeling
15:46

Let's see how our pipeline functions work on the Titanic data set with a random forest model.
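
The general shape of that exercise, assuming a cleaned data frame titanic_df with a factor outcome Survived (placeholder names, not the course's exact code):

    library(randomForest)

    set.seed(1234)
    fit <- randomForest(Survived ~ ., data = titanic_df, ntree = 500)

    print(fit)        # OOB error estimate and confusion matrix
    importance(fit)   # which features the forest leaned on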

13:55

Here we use the caret package, two of our pipeline functions and a GBM model to predict hospital readmissions.
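
The typical caret pattern for such a model looks like this (readmit_df and Readmitted are placeholder names; the gbm package must be installed):

    library(caret)

    # 5-fold cross-validation; caret tunes gbm's key parameters by default.
    ctrl <- trainControl(method = "cv", number = 5)

    set.seed(1234)
    fit <- train(Readmitted ~ ., data = readmit_df,
                 method = "gbm", trControl = ctrl, verbose = FALSE)

    fit$bestTune   # the winning n.trees / interaction.depth / shrinkage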

GBM - 2
14:52
K-means, Unsupervised Modeling
14:11
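
A minimal k-means sketch on scaled numeric data (the iris columns and the choice of 3 clusters are illustrative):

    # Scale first so no single feature dominates the distance measure.
    scaled <- scale(iris[, 1:4])

    set.seed(1234)
    km <- kmeans(scaled, centers = 3, nstart = 25)

    table(km$cluster, iris$Species)   # compare clusters to known labels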

Instructor Biography

Manuel Amunategui, Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons. People often ask me how one can break into this field, to which I reply: 'join an online competition!'
