Supercharge R with SparkR - Apply your R chops to Big Data!

Extend R with Spark and SparkR - Create clusters on AWS, perform distributed modeling, and access HDFS and S3
4.6 (73 ratings)
471 students enrolled
Take This Course
  • Lectures 22
  • Length 5.5 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion


About This Course

Published 11/2015 English

Course Description

In this class you will learn:

  • how to use R in a distributed environment
  • how to create Spark clusters on Amazon's AWS
  • how to perform distributed modeling using GLM
  • how to measure distributed regression and classification predictions
  • how to access data from CSV, JSON, HDFS, and S3

All our examples will be performed on real clusters - no training wheels, single-node local clusters, or third-party tools.

Note 1: you will need to know how to SSH into your Amazon AWS instance (I will show how I do it on a Mac; Windows and Linux aren't covered)

Note 2: There is a minimal cost involved when using Amazon's AWS instances. The biggest machine we will use costs around 0.05 US cents/hour/machine.

What are the requirements?

  • Some understanding of the R programming language
  • A small cost for Amazon's AWS instances (biggest machine used @ 0.05 cents/hour/machine)

What am I going to get from this course?

  • Use Databricks Community Edition to learn SparkR commands
  • Create Spark clusters on Amazon's AWS (Spark 2.0 video added)
  • Use SparkR in RStudio from an AWS instance
  • Perform distributed GLM modeling with SparkR
  • Measure Gaussian and Binomial models
  • Create SparkR data frames from various sources: web, S3, HDFS

Who is the target audience?

  • Anybody with some R experience that wants to learn about big data solutions
  • Anybody interested in SparkR
  • Anybody interested in Spark and cluster computing

What do you get with this course?

Not for you? No problem.
30 day money back guarantee.

Forever yours.
Lifetime access.

Learn on the go.
Desktop, iOS and Android.

Get rewarded.
Certificate of completion.


Section 1: SparkR - What we'll cover in the class
Section 2: Databricks - Community Edition - SparkR Commands
Databricks Introduction and Basic SparkR Data Exploration
Databricks Introduction and Basic SparkR Data Exploration Part 2
Databricks - GLM Modeling - Linear Regression
Databricks - GLM Modeling - Logistic Regression

This is the same video and content as on my blog - see the blog post for the full source.

Databricks - K-Means modeling
Section 3: Local SparkR - Run Spark on your local machine
Running a local sparkR session in RStudio
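
As a taste of what this section covers, a local session can be started along these lines (a minimal sketch for the Spark 1.6.x API used in the course; the SPARK_HOME path is an assumption for your own install):

```r
# Assumes Spark 1.6.x is installed locally; adjust SPARK_HOME to your install path
Sys.setenv(SPARK_HOME = "/usr/local/spark")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Start a local SparkR context using all available cores
sc <- sparkR.init(master = "local[*]", appName = "LocalSparkR")
sqlContext <- sparkRSQL.init(sc)

# Quick sanity check: promote a built-in R data frame to a Spark DataFrame
df <- createDataFrame(sqlContext, faithful)
head(df)

sparkR.stop()
```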
Section 4: Getting Started - Setting up AWS Instances and Spark

Setting up the Spark launching instance on AWS and controlling it via SSH - part 1


Here is an overview of using PuTTY to connect to your AWS instance. We also transfer the .pem file manually instead of using the scp command as in the original lecture (or WinSCP for Windows users).


We'll continue setting up our AWS EC2 launching instance and load the latest Spark binaries


We're now going to create a Spark cluster with one master and two worker nodes and do a very brief test run on RStudio Server.


For those who want to try out Spark 2.0, here is a video on how to set it up, as things have changed a bit from 1.6.x. I wholeheartedly recommend going through the original course on Spark 1.6.x first. In the next few months I will update this course with Spark 2.0 objects.

Section 5: Basic Modeling with Spark GLM

We'll follow two simple examples from the SparkR documentation to give us a taste of some of the available SparkR functions.


We'll look at using the SparkR GLM model, do some Gaussian modeling, and extract a mean squared error from our results.
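
The Gaussian workflow looks roughly like this (a sketch for SparkR 1.6.x; it assumes a `sqlContext` from an earlier lecture, and uses the built-in `mtcars` data with illustrative columns):

```r
# Promote a local R data frame to a Spark DataFrame
df <- createDataFrame(sqlContext, mtcars)

# Fit a linear (Gaussian) model: mpg predicted from weight and horsepower
model <- glm(mpg ~ wt + hp, data = df, family = "gaussian")
summary(model)

# Predict on the same data, then collect locally to compute mean squared error
preds <- predict(model, newData = df)
local <- collect(select(preds, "mpg", "prediction"))
mse <- mean((local$mpg - local$prediction)^2)
mse
```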


Now we'll explore binomial modeling using GLM and how to measure the accuracy of the predictions
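
A binomial sketch under the same assumptions (existing `sqlContext`, `mtcars` as stand-in data; `am` is a 0/1 transmission indicator):

```r
df <- createDataFrame(sqlContext, mtcars)

# Model the binary am column from weight and horsepower
model <- glm(am ~ wt + hp, data = df, family = "binomial")

preds <- predict(model, newData = df)
local <- collect(select(preds, "am", "prediction"))

# Threshold at 0.5 and measure accuracy; this works whether the
# prediction column holds a probability or a 0/1 label
accuracy <- mean((local$prediction > 0.5) == (local$am == 1))
accuracy
```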

Section 6: Exploring Big Data, Verbs, Spark SQL & S3

Here we'll learn how to create an S3 bucket on Amazon Web Services and upload a data set.


S3cmd is a great tool to automatically break up larger files before uploading them to S3 and reconstruct them into a single file afterwards.


Here we'll look at the basic commands to interact with a SparkR data frame.


We continue to look at commands to query SparkR data frames, along with a very brief look at the magrittr package and Spark SQL syntax.
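
The two query styles side by side (a sketch for SparkR 1.6.x; assumes a `sqlContext` and uses `mtcars` with illustrative columns):

```r
library(magrittr)

df <- createDataFrame(sqlContext, mtcars)

# Verb style, chained with magrittr's pipe
df %>%
  filter(df$cyl == 6) %>%
  select("mpg", "wt") %>%
  head()

# Same query via Spark SQL on a registered temporary table
registerTempTable(df, "cars")
sql(sqlContext, "SELECT mpg, wt FROM cars WHERE cyl = 6") %>% head()
```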

Section 7: Fly-by HDFS & JSON
Brief look at HDFS and how to access it from RStudio
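
Reading JSON off HDFS from RStudio comes down to pointing `read.df` at an hdfs:// URI (a sketch; the namenode host, port, and file path are illustrative, and a `sqlContext` is assumed):

```r
# Load a JSON file stored on HDFS into a SparkR DataFrame
people <- read.df(sqlContext,
                  "hdfs://namenode:9000/data/people.json",
                  source = "json")
printSchema(people)
head(people)
```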
Section 8: Extra - PySpark

Quick look at starting the PySpark shell and running a simple word count query


Instructor Biography

Manuel Amunategui, Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a Wall Street trading desk for 6 years. On the personal side, I love data science competitions and hackathons - people often ask me how one can break into this field, to which I reply: 'join an online competition!'
