Automating Data Exploration with R
Build the tools needed to quickly turn data into model-ready data sets
4.7 (50 ratings)
407 students enrolled
Created by Manuel Amunategui
Last updated 11/2016
English
Current price: $19 Original price: $25 Discount: 24% off
30-Day Money-Back Guarantee
Includes:
  • 4 hours on-demand video
  • 15 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Build a pipeline to automate the processing of raw data for discovery and modeling
  • Know the main steps to prepare data for modeling
  • Know how to handle the different data types in R
  • Understand data imputation
  • Treat categorical data properly with binarization (making dummy columns)
  • Apply feature engineering to dates, integers and real numbers
  • Apply variable selection, correlation and significance tests
  • Model and measure prepared data using both supervised and unsupervised modeling
Requirements
  • Basic understanding of R programming
  • Some statistical and modeling knowledge
Description

As data scientists and analysts, we face the same repetitive tasks every time we approach a new data set. This course aims to automate as many of these tasks as possible so we can get to the actual analysis quickly. There will always be exceptions to the rule: some manual work and customization will be required. But overall, a large swath of that work can be automated by building a smart pipeline, and that is what we'll do here. This is especially important in the era of big data, where handling variables by hand isn't always possible.

It is also a great learning strategy to think in terms of a processing pipeline and to understand, design, and build each stage as a separate, independent unit.

Who is the target audience?
  • Anyone who needs to process raw data for exploration and modeling in R
Curriculum For This Course
21 Lectures 04:12:14
Introduction
3 Lectures 07:56

Preview 02:12

Let's briefly talk about the big picture so we're all on the same page.

Big Picture - Data Scrubbing
02:47

A brief video on where to download base R and RStudio. Skip this if you already have them up and running.

Optional - Getting RStudio
02:57
Reading Data
1 Lecture 16:22

Let's take a look at popular data readers from the base package, readr, and data.table.

Preview 16:22
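
For a taste of what this lecture compares, here is a minimal sketch of the three readers; "mydata.csv" is a placeholder file name, not one of the course's data sets.

    # base R reader - simple and universally available
    df_base <- read.csv("mydata.csv", stringsAsFactors = FALSE)

    # readr - faster, guesses column types, returns a tibble
    library(readr)
    df_readr <- read_csv("mydata.csv")

    # data.table's fread - very fast on large files, returns a data.table
    library(data.table)
    df_dt <- fread("mydata.csv")
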
Data Transformation - Data Scrubbing
7 Lectures 01:37:02

Let's start by looking at dates and how to format them properly.

Dates - Reading and Casting Dates
19:46
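
As a preview, casting character dates with as.Date comes down to matching the format string to the raw layout; the sample strings below are illustrative, not from the course's data.

    # ISO dates parse with the default format
    d1 <- as.Date("2016-11-28")

    # US-style slashes need an explicit format string
    d2 <- as.Date("11/28/2016", format = "%m/%d/%Y")

    # day - abbreviated month - year (%b is locale-dependent)
    d3 <- as.Date("28-Nov-2016", format = "%d-%b-%Y")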

We need to find clever ways of turning text data into numbers. You can choose to ignore text and model only the numerical variables, but you would be leaving a lot of intelligence on the table.

Text Data - Ways to Quantify Free-Form Text
17:55
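
A minimal sketch of the idea on a toy character vector: even before modeling the words themselves, simple counts and keyword flags turn free-form text into numbers (the lecture's actual techniques may go further).

    txt <- c("great product, fast shipping", "arrived broken")

    # simple numeric summaries of free-form text
    data.frame(
      char_count      = nchar(txt),
      word_count      = lengths(strsplit(txt, "\\s+")),
      mentions_broken = as.integer(grepl("broken", txt))
    )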

Only in very few cases can you turn text into factors directly and model it. Let's do it the right way and see what it takes to do it properly.

Text Data - Categories
18:40
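
The course builds its own Binarize_Features function for this; as a rough sketch of the underlying idea, base R's model.matrix can produce the dummy columns:

    df <- data.frame(color = factor(c("red", "blue", "red", "green")))

    # one 0/1 column per factor level (the "- 1" drops the intercept)
    dummies <- model.matrix(~ color - 1, data = df)
    cbind(df, dummies)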

Let's do a pipeline check and upgrade our Binarize_Features function.

Text Data - Categories 2 & Pipeline Check
12:02

Let's look at imputing missing data with zeros or the mean value of the feature.

Imputing Data - Dealing with Missing Data
13:47
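
The two strategies covered here look roughly like this on a toy vector (a sketch, not the course's exact code):

    x <- c(1, 2, NA, 4)

    # impute with zero
    x_zero <- ifelse(is.na(x), 0, x)

    # impute with the feature's mean
    x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)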

Pipeline Check
08:24

Here is a look at a handy function from the caret package: nearZeroVar. It can tell you which features have little or no variance (no PDF associated with this video).

Caret Library - nearZeroVar
06:28
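
A quick illustration on the built-in iris data with an artificial constant column appended; saveMetrics = TRUE returns the per-feature variance diagnostics.

    library(caret)

    data(iris)
    iris$constant <- 1                     # a zero-variance column

    # flags features with little or no variance
    nearZeroVar(iris, saveMetrics = TRUE)
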
Feature Engineering
3 Lectures 34:58

Let's see what additional numerical data we can pull out of date features.

Engineering Dates - Getting Additional Features out of Dates
16:19
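
For example, a single Date column can yield several model-ready numeric features (a sketch; the lecture's feature list may differ):

    d <- as.Date(c("2016-01-15", "2016-07-04"))

    data.frame(
      year       = as.integer(format(d, "%Y")),
      month      = as.integer(format(d, "%m")),
      weekday    = weekdays(d),
      days_since = as.numeric(Sys.Date() - d)
    )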

Just as we squeezed more intelligence out of dates, here we'll apply the same principles to integers and real numbers.

Numerical Engineering - Integers and Real Numbers
12:20
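
Two common moves in this vein, sketched on a toy vector: binning a number into coarse categories and compressing a skewed scale.

    x <- c(3, 18, 45, 62, 87)

    # bin into coarse categories
    bins <- cut(x, breaks = c(0, 25, 50, 75, 100),
                labels = c("low", "mid", "high", "top"))

    # compress a skewed scale (log1p handles zeros safely)
    logged <- log1p(x)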

Pipeline Check
06:19
Basic Data Exploration
3 Lectures 37:12

Let's look at pairwise correlations and how to access the results programmatically.

Correlations
21:52
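
Accessing the results programmatically means flattening the correlation matrix into a pair list you can filter; a sketch on the built-in mtcars data:

    m <- cor(mtcars)

    # flatten the matrix into one row per feature pair
    pairs <- data.frame(
      f1 = rownames(m)[row(m)],
      f2 = colnames(m)[col(m)],
      r  = as.vector(m)
    )

    # keep each pair once and filter on strength
    subset(pairs, f1 < f2 & abs(r) > 0.8)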

A look at the findCorrelation function from the caret package (no PDF associated with this lecture).

Caret Library - findCorrelation
04:49
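
In short, findCorrelation takes a correlation matrix and suggests which columns to drop; the 0.75 cutoff below is an arbitrary choice for illustration.

    library(caret)

    m <- cor(mtcars)
    drop_idx <- findCorrelation(m, cutoff = 0.75)
    names(mtcars)[drop_idx]   # candidate columns to remove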

Finding outliers in feature sets using the mean and standard deviation.

Hunting Outliers
10:31
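
The mean-and-standard-deviation approach in its simplest form (a sketch on toy data; the lecture may use a different threshold):

    x <- c(10, 12, 11, 13, 12, 11, 10, 13, 98)

    mu <- mean(x)
    s  <- sd(x)

    # flag points more than 2 standard deviations from the mean
    x[abs(x - mu) > 2 * s]
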
Modeling
4 Lectures 58:44

Let's see how our pipeline functions work on the Titanic data set and a random forest model.

Random Forest - Titanic Data Set
15:46
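
A bare-bones version of the modeling step with the randomForest package; the file path and feature names follow the common Kaggle Titanic layout and are assumptions, and the lecture's version adds the data-preparation pipeline built earlier.

    library(randomForest)

    titanic <- read.csv("titanic.csv")          # placeholder path
    titanic$Survived <- as.factor(titanic$Survived)
    titanic$Sex      <- as.factor(titanic$Sex)

    # small feature set; print(fit) shows OOB error and a confusion matrix
    fit <- randomForest(Survived ~ Pclass + Sex + Age + Fare,
                        data = titanic, na.action = na.omit)
    print(fit)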

Here we use the caret package, two of our pipeline functions, and a GBM model to predict hospital readmissions.

GBM (Generalized Boosted Models)/Caret - Diabetes Data Set - 1
13:55
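
The caret call looks roughly like this; the sketch trains on a two-class subset of iris to stay self-contained, whereas the lecture uses the diabetes readmission data.

    library(caret)

    # two-class subset so gbm runs as a binary classifier
    df <- droplevels(iris[iris$Species != "setosa", ])

    # 5-fold cross-validation; method = "gbm" needs the gbm package
    ctrl <- trainControl(method = "cv", number = 5)
    fit  <- train(Species ~ ., data = df, method = "gbm",
                  trControl = ctrl, verbose = FALSE)
    fit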

GBM - 2
14:52

K-means, Unstructured Modeling
14:11
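
Unsupervised modeling with k-means in its simplest form, sketched on iris (the choice of 3 centers is for illustration):

    set.seed(42)

    # scale features, then cluster; nstart tries several random starts
    km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
    table(km$cluster, iris$Species)   # compare clusters to known species
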
About the Instructor
Manuel Amunategui
4.5 Average rating
324 Reviews
2,638 Students
4 Courses
Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons. People often ask me how one can break into this field, to which I reply: 'Join an online competition!'