Supercharge R with SparkR - Apply your R chops to Big Data!

Extend R with Spark and SparkR - Create clusters on AWS, perform distributed modeling, and access HDFS and S3
4.6 (86 ratings)
540 students enrolled
Last updated 11/2016
  • 5.5 hours on-demand video
  • 12 Supplemental Resources
What Will I Learn?
Use Databricks Community Edition to learn SparkR commands
Create Spark clusters on Amazon's AWS (Spark 2.0 video added)
Use SparkR in RStudio from an AWS instance
Perform distributed GLM modeling with SparkR
Measure Gaussian and Binomial models
Create SparkR data frames from various sources: web, S3, HDFS
  • Some understanding of the R programming language
  • Small cost for using Amazon's AWS (the biggest machine used is around 0.05 US cents/hour/machine)

In this class you will learn:

  • how to use R in a distributed environment
  • how to create Spark clusters on Amazon's AWS
  • how to perform distributed modeling using GLM
  • how to measure distributed regression and classification predictions
  • how to access data from CSV, JSON, HDFS, and S3

All our examples will be performed on real clusters - no training wheels, single local clusters or third-party tools.

Note 1: You will need to know how to SSH to your Amazon AWS instance (I will show how I do it on a Mac, but Windows and Linux aren't covered)

Note 2: There is a minimal cost involved when using Amazon's AWS instances. The biggest machine we will use is around 0.05 US cents/hour/machine.

Who is the target audience?
  • Anybody with some R experience who wants to learn about big data solutions
  • Anybody interested in SparkR
  • Anybody interested in Spark and cluster computing
Curriculum For This Course
SparkR - What we'll cover in the class
Databricks - Community Edition - SparkR Commands
Databricks Introduction and Basic SparkR Data Exploration

Databricks Introduction and Basic SparkR Data Exploration Part 2

Databricks - GLM Modeling - Linear Regression

Databricks - GLM Modeling - Logistic Regression

This is the same video and content as on my blog - for full source see (


Databricks - K-Means modeling
Local SparkR - Run Spark on your local machine
Running a local sparkR session in RStudio
Getting Started - Setting up AWS Instances and Spark
Setting up the Spark launching instance on AWS and controlling it via SSH - part 1


Here is an overview of using PuTTY to connect to your AWS instance. We also transfer the .pem file manually instead of using the scp command as in the original lecture (or WinSCP for Windows users).

Connecting to AWS using PuTTY (for Windows users)

We'll continue setting up our AWS EC2 launching instance and load the latest Spark binaries

Setting up - part 2

We're now going to create a Spark cluster with 1 master and 2 worker nodes and do a very brief test run on RStudio Server.

Launching Spark clusters

For those who want to try out Spark 2.0, here is a video on how to set it up, as things have changed a bit from 1.6.x. I still wholeheartedly recommend going through the original course on Spark 1.6.x first. In the next few months I will update this course with Spark 2.0 objects.

Spark 2.0 - How to start clusters in 2.0
Basic Modeling with Spark GLM
We'll follow two simple examples from the SparkR documentation to give us a taste of some of the available SparkR functions

Starting our clusters and looking at some built-in datasets

We'll look at using the SparkR GLM model, do some Gaussian modeling, and extract a mean squared error from our results

GLM Gaussian modeling
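
As a rough illustration of the metric mentioned in this lecture, here is a minimal sketch of a mean squared error calculation. It is written in plain Python rather than SparkR, the observed and fitted values are made-up numbers, and the function name `mean_squared_error` is a hypothetical helper, not part of any Spark API.

```python
from typing import Sequence


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences between observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical observed values and model predictions (illustrative numbers only).
observed = [3.0, 5.0, 7.5, 10.0]
fitted = [2.5, 5.0, 8.0, 9.0]

mse = mean_squared_error(observed, fitted)
print(mse)
```

A lower value means the predictions sit closer to the observed values; the number is in squared units of the target variable.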

Now we'll explore binomial modeling using GLM and how to measure the accuracy of the predictions

GLM Binomial modeling
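
One common way to measure the accuracy of binomial predictions is to threshold the predicted probabilities and count how often the result matches the true label. The sketch below assumes a 0.5 cutoff and uses made-up labels and probabilities; it is plain Python for illustration, not the SparkR code used in the lecture, and `accuracy` is a hypothetical helper name.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], probabilities: Sequence[float],
             threshold: float = 0.5) -> float:
    """Share of observations whose thresholded prediction matches the true label."""
    if len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    if not labels:
        raise ValueError("need at least one observation")
    # Turn each predicted probability into a 0/1 class using the cutoff.
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(labels, predicted) if y == y_hat)
    return correct / len(labels)


# Hypothetical true labels and predicted probabilities (illustrative only).
true_labels = [1, 0, 1, 1, 0]
predicted_probs = [0.9, 0.2, 0.4, 0.7, 0.6]

acc = accuracy(true_labels, predicted_probs)
print(acc)
```

The 0.5 threshold is an assumption; a different cutoff trades false positives against false negatives.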
Exploring Big Data, Verbs, Spark SQL & S3
Here we'll learn how to create an S3 bucket on Amazon Web Services and upload a data set.

Working with a bigger cluster and an S3 data store

S3cmd is a great tool to automatically break up larger files before uploading them to S3 and reconstruct them into a single file afterwards.

Optional Lecture: S3cmd/S3Express - Uploading large files to S3
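
The split-and-reassemble idea described above can be sketched in a few lines. This is an in-memory illustration in plain Python, not how S3cmd itself is implemented; the chunk size and the `split_into_chunks` helper are assumptions made for the example.

```python
from typing import Iterator


def split_into_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `data`, each at most `chunk_size` bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# Hypothetical 30-byte payload standing in for a large file.
payload = b"0123456789" * 3

# Split into parts of up to 8 bytes; the last part holds the remainder.
parts = list(split_into_chunks(payload, 8))

# Concatenating the parts in order restores the original content.
rebuilt = b"".join(parts)
print(len(parts), rebuilt == payload)
```

The key property is that the parts keep their order, so joining them back together yields a byte-for-byte copy of the original.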

Here we'll look at the basic commands to interact with a SparkR data frame.

Brief look at commands to query SparkR data frames

We continue to look at commands to query SparkR data frames along with a very brief look at the magrittr package and SparkSQL syntax.

Brief look at magrittr & SparkSQL
Fly-by HDFS & JSON
Brief look at HDFS and how to access it from RStudio
Extra - PySpark
Quick look at starting the PySpark shell and running a simple word count query
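
For readers unfamiliar with the word count exercise, here is a minimal single-machine sketch of the idea in plain Python. It does not use PySpark or a cluster; the sample lines and the `word_count` helper are made up for illustration, and the tokenizing rule (lowercase letters and apostrophes) is an assumption.

```python
import re
from collections import Counter
from typing import Iterable


def word_count(lines: Iterable[str]) -> Counter:
    """Count how often each lowercase word appears across all lines."""
    counts: Counter = Counter()
    for line in lines:
        # Split each line into lowercase words, ignoring punctuation and digits.
        words = re.findall(r"[a-z']+", line.lower())
        counts.update(words)
    return counts


# Hypothetical input lines standing in for a text file.
sample_lines = [
    "Spark makes big data simple",
    "big data with Spark and R",
]

counts = word_count(sample_lines)
print(counts["spark"], counts["data"], counts["r"])
```

In a distributed setting the same two steps (split lines into words, then sum counts per word) are spread across the worker nodes.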

About the Instructor
4.5 Average rating
276 Reviews
2,301 Students
4 Courses
Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons - people often ask me how one can break into this field, to which I reply: 'join an online competition!'

