Supercharge R with SparkR - Apply your R chops to Big Data!

Extend R with Spark and SparkR - Create clusters on AWS, perform distributed modeling, and access HDFS and S3
4.6 (59 ratings)
361 students enrolled
$19
$30
37% off
Take This Course
  • Lectures 22
  • Length 5.5 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 11/2015 English

Course Description

In this class you will learn how to:

  • use R in a distributed environment
  • create Spark clusters on Amazon's AWS
  • perform distributed modeling using GLM
  • measure distributed regression and classification predictions
  • access data from CSV, JSON, HDFS, and S3

All our examples will be performed on real clusters - no training wheels, single local clusters, or third-party tools. (A quick taste of the SparkR workflow we build up to is sketched below.)
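
This is a minimal, hedged sketch only, using the Spark 2.x SparkR API (the course's 1.6.x lectures use sparkR.init() and sparkRSQL.init() instead); mtcars is simply a stand-in data set, and the snippet assumes a Spark download on the machine running it:

    library(SparkR)

    # Start a SparkR session; on a real cluster, 'master' points at the cluster's master URL
    sparkR.session(master = "local[*]", appName = "sparkr-taste")

    # Promote a local R data frame to a distributed SparkDataFrame
    df <- as.DataFrame(mtcars)

    # Fit a distributed Gaussian GLM and peek at its predictions
    model <- spark.glm(df, mpg ~ wt + hp, family = "gaussian")
    head(predict(model, df))

    sparkR.session.stop()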

Note 1: You will need to know how to SSH into your Amazon AWS instance (I will show how I do it on a Mac, but Windows and Linux aren't covered).

Note 2: There is a minimal cost involved in using Amazon's AWS instances. The biggest machine we will use costs around US $0.05 (5 cents) per hour per machine.

What are the requirements?

  • Some understanding of the R programming language
  • Small cost for Amazon's AWS (biggest machine used costs about US $0.05/hour/machine)

What am I going to get from this course?

  • Use Databricks Community Edition to learn SparkR commands
  • Create Spark clusters on Amazon's AWS (Spark 2.0 video added)
  • Use SparkR in RStudio from an AWS instance
  • Perform distributed GLM modeling with SparkR
  • Measure Gaussian and Binomial models
  • Create SparkR data frames from various sources: web, S3, HDFS

What is the target audience?

  • Anybody with some R experience who wants to learn about big data solutions
  • Anybody interested in SparkR
  • Anybody interested in Spark and cluster computing

What do you get with this course?

Not for you? No problem.
30 day money back guarantee.

Forever yours.
Lifetime access.

Learn on the go.
Desktop, iOS and Android.

Get rewarded.
Certificate of completion.

Curriculum

Section 1: SparkR - What we'll cover in the class
Introduction
Preview
02:00
Section 2: Databricks - Community Edition - SparkR Commands
Databricks Introduction and Basic SparkR Data Exploration
11:23
Databricks Introduction and Basic SparkR Data Exploration Part 2
12:18
Databricks - GLM Modeling - Linear Regression
17:48
Databricks - GLM Modeling - Logistic Regression
19:28
19:54

This is the same video and content as on my blog - for the full source, see http://amunategui.github.io/databricks-spark-bayes/index.html

Databricks - K-Means modeling
13:04
Section 3: Local SparkR - Run Spark on your local machine
Running a local sparkR session in RStudio
18:36
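
As a rough sketch of what this lecture covers (the SPARK_HOME path is an assumption - adjust it to wherever you unpacked Spark; shown with the Spark 2.x API, while the lecture itself uses the 1.6.x sparkR.init() style):

    # Tell R where your Spark download lives and load the SparkR package that ships with it
    Sys.setenv(SPARK_HOME = "/usr/local/spark")   # assumption - your path will differ
    library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

    # Start a local session using all available cores
    sparkR.session(master = "local[*]", appName = "local-sparkr")

    # Sanity check: push R's 'faithful' data set into Spark and pull back the first rows
    df <- as.DataFrame(faithful)
    head(df)

    sparkR.session.stop()
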
Section 4: Getting Started - Setting up AWS Instances and Spark
16:49

Setting up the Spark launching instance on AWS and controlling it via SSH - part 1

10:30

Here is an overview of using PuTTY to connect to your AWS instance. We also transfer the .pem file manually instead of using the scp command as in the original lecture (or WinSCP for Windows users).

15:53

We'll continue setting up our AWS EC2 launching instance and load the latest Spark binaries.

18:41

We're now going to create a Spark cluster with 1 master and 2 worker nodes and do a very brief test run on RStudio Server.

14:51

For those who want to try out Spark 2.0, here is a video on how to set it up, as things have changed a bit from 1.6.x. I wholeheartedly recommend going through the original course material on Spark 1.6.x first. In the coming months I will update this course with Spark 2.0 objects.

Section 5: Basic Modeling with Spark GLM
19:13

We'll follow two simple examples from the SparkR documentation to get a taste of some of the available SparkR functions.
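
A minimal sketch of the flavour of those documentation examples (assuming an active SparkR session and the Spark 2.x API; the people.json path is the sample file bundled with the Spark distribution and is an assumption here):

    # Example 1: promote a local R data frame (R's built-in 'faithful') to Spark
    df <- as.DataFrame(faithful)
    head(df)

    # Example 2: read the sample JSON file that ships with Spark
    people <- read.json(file.path(Sys.getenv("SPARK_HOME"),
                                  "examples/src/main/resources/people.json"))
    printSchema(people)
    head(people)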

13:11

We'll look at using the SparkR GLM model, do some Gaussian modeling, and extract a mean squared error from our results.
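
A hedged sketch of the idea (assuming an active SparkR session; mtcars is only a stand-in for the course's data set):

    df <- as.DataFrame(mtcars)

    # Fit a distributed Gaussian (linear) GLM
    model <- spark.glm(df, mpg ~ wt + hp, family = "gaussian")
    summary(model)

    # predict() adds a 'prediction' column; collect label and prediction locally
    preds <- collect(select(predict(model, df), "mpg", "prediction"))

    # Mean squared error of the fit
    mse <- mean((preds$mpg - preds$prediction)^2)
    mse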

13:19

Now we'll explore binomial modeling using GLM and how to measure the accuracy of the predictions.
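
A minimal sketch of that measurement step (again with mtcars as a stand-in: 'am' is a 0/1 column; with family = "binomial" the prediction comes back as a probability, so we threshold at 0.5):

    df <- as.DataFrame(mtcars)

    # Fit a distributed binomial GLM (logistic regression)
    model <- spark.glm(df, am ~ wt + hp, family = "binomial")

    # Predictions are probabilities; turn them into 0/1 classes and measure accuracy
    preds <- collect(select(predict(model, df), "am", "prediction"))
    predicted_class <- ifelse(preds$prediction > 0.5, 1, 0)
    mean(predicted_class == preds$am)   # proportion of correct predictions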

Section 6: Exploring Big Data, Verbs, Spark SQL & S3
12:16

Here we'll learn how to create an S3 bucket on Amazon Web Services and upload a data set.

07:40

S3cmd is a great tool to automatically break up larger files before uploading them to S3 and to reconstruct them into a single file afterwards.

17:58

Here we'll look at the basic commands to interact with a SparkR data frame.
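
A rough sketch of the kind of verbs covered (Spark 2.x API; 'faithful' stands in for the larger data set used in the lecture):

    df <- as.DataFrame(faithful)

    printSchema(df)                      # column names and types
    count(df)                            # number of rows
    head(select(df, "eruptions"))        # pick columns
    head(filter(df, df$waiting > 70))    # row filter executed by Spark
    head(count(groupBy(df, df$waiting))) # grouped counts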

17:57

We continue to look at commands to query SparkR data frames along with a very brief look at the magrittr package and SparkSQL syntax.
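
A minimal sketch of both styles (Spark 2.x API: createOrReplaceTempView()/sql(); the 1.6.x lectures use registerTempTable() and sql(sqlContext, ...) instead):

    library(magrittr)

    df <- as.DataFrame(faithful)

    # magrittr pipes work because SparkR verbs take the DataFrame as their first argument
    df %>%
      filter("waiting > 70") %>%
      select("eruptions", "waiting") %>%
      head()

    # The same query expressed in Spark SQL
    createOrReplaceTempView(df, "faithful")
    head(sql("SELECT eruptions, waiting FROM faithful WHERE waiting > 70"))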

Section 7: Fly-by HDFS & JSON
Brief look at HDFS and how to access it from RStudio
17:24
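
A hedged sketch of the idea (the HDFS path and file name are assumptions - point them at a file you have actually copied into HDFS, e.g. with hadoop fs -put; Spark 2.x has a built-in "csv" source, while 1.6.x needs the spark-csv package instead):

    # Read a CSV file straight out of HDFS into a SparkDataFrame
    flights <- read.df("hdfs:///user/rstudio/flights.csv",
                       source = "csv", header = "true", inferSchema = "true")

    printSchema(flights)
    count(flights)
    head(flights)
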
Section 8: Extra - PySpark
11:26

A quick look at starting the PySpark shell and running a simple word count query.

Instructor Biography

Manuel Amunategui, Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons - people often ask me how one can break into this field, to which I reply: 'join an online competition!'

Ready to start learning?
Take This Course