Supercharge R with SparkR - Apply your R chops to Big Data!

Extend R with Spark and SparkR - Create clusters on AWS, perform distributed modeling, and access HDFS and S3
Best Seller
4.0 (143 ratings)
958 students enrolled
Created by Manuel Amunategui
Last updated 11/2016
English
Current price: $10 Original price: $30 Discount: 67% off
30-Day Money-Back Guarantee
Includes:
  • 5.5 hours on-demand video
  • 12 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Use Databricks Community Edition to learn SparkR commands
  • Create Spark clusters on Amazon's AWS (Spark 2.0 video added)
  • Use SparkR in RStudio from an AWS instance
  • Perform distributed GLM modeling with SparkR
  • Measure the performance of Gaussian and Binomial models
  • Create SparkR data frames from various sources: web, S3, HDFS
Requirements
  • Some understanding of the R programming language
  • A small Amazon AWS cost (the biggest machine used costs about US $0.05/hour/machine)
Description

In this class you will learn how to:

  • use R in a distributed environment
  • create Spark clusters on Amazon's AWS
  • perform distributed modeling using GLM
  • measure distributed regression and classification predictions
  • access data from CSV, JSON, HDFS, and S3

All our examples will be performed on real clusters - no training wheels, single local clusters or third-party tools.

Note 1: you will need to know how to SSH into your Amazon AWS instance (I will show how I do it on a Mac, but Windows and Linux are not covered)

Note 2: There is a minimal cost involved when using Amazon's AWS instances. The biggest machine we will use costs around US $0.05/hour/machine.
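To give a quick taste of the SparkR syntax used throughout the course, here is a minimal sketch (assuming Spark 2.x with SPARK_HOME set; the file paths are hypothetical placeholders) of turning CSV and JSON files into SparkR data frames:

  # Minimal sketch, assuming Spark 2.x is installed and SPARK_HOME is set;
  # the file paths below are hypothetical placeholders.
  library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
  sparkR.session(master = "local[*]", appName = "sparkr-demo")

  flights <- read.df("data/flights.csv", source = "csv",
                     header = "true", inferSchema = "true")   # CSV -> SparkR data frame
  events  <- read.json("data/events.json")                    # line-delimited JSON

  printSchema(flights)
  head(events)

  sparkR.session.stop()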

Who is the target audience?
  • Anybody with some R experience who wants to learn about big data solutions
  • Anybody interested in SparkR
  • Anybody interested in Spark and cluster computing
Curriculum For This Course
22 Lectures
05:21:39
SparkR - What we'll cover in the class
1 Lecture 02:00
Databricks - Community Edition - SparkR Commands
6 Lectures 01:33:55
Databricks Introduction and Basic SparkR Data Exploration
11:23

Databricks Introduction and Basic SparkR Data Exploration Part 2
12:18

Databricks - GLM Modeling - Linear Regression
17:48

Databricks - GLM Modeling - Logistic Regression
19:28

This is the same video and content as on my blog - for the full source, see http://amunategui.github.io/databricks-spark-bayes/index.html

Preview 19:54

Databricks - K-Means modeling
13:04
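For reference, here is a minimal sketch of what SparkR k-means looks like, assuming a Spark 2.0+ runtime (where spark.kmeans is available) and using R's built-in iris data:

  # Minimal k-means sketch, assuming a running Spark 2.0+ session
  # (on Databricks the session is created for you).
  df <- as.DataFrame(iris)   # note: SparkR renames Sepal.Length to Sepal_Length

  model <- spark.kmeans(df, ~ Sepal_Length + Sepal_Width, k = 3)

  summary(model)              # cluster centers and sizes
  head(predict(model, df))    # each row gets a "prediction" cluster id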
Local SparkR - Run Spark on your local machine
1 Lecture 18:36
Running a local sparkR session in RStudio
18:36
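As a reference for this lecture, here is a minimal sketch of starting a local SparkR session from RStudio, assuming Spark is unpacked locally and SPARK_HOME points at it (Spark 2.x API; on 1.6 you would use sparkR.init() instead). The install path is a hypothetical placeholder:

  # Minimal local-session sketch (Spark 2.x API); the install path is hypothetical.
  Sys.setenv(SPARK_HOME = "/opt/spark")
  library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

  sparkR.session(master = "local[*]", appName = "local-sparkr",
                 sparkConfig = list(spark.driver.memory = "2g"))

  df <- as.DataFrame(faithful)   # promote a local R data frame to Spark
  head(df)

  sparkR.session.stop()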
Getting Started - Setting up AWS Instances and Spark
5 Lectures 01:16:44

Setting up the Spark launching instance on AWS and controlling it via SSH - part 1

Preview 16:49

Here is an overview of using PuTTY to connect to your AWS instance. We also transfer the .pem file manually instead of using the scp command as in the original lecture (Windows users can use WinSCP instead).

Connecting to AWS using PuTTY (for Windows users)
10:30

We'll continue setting up our AWS EC2 launching instance and load the latest Spark binaries

Setting up - part 2
15:53

We're now going to create a Spark cluster with 1 master and 2 worker nodes and do a very brief test run on RStudio Server.

Launching Spark clusters
18:41
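Once the cluster is up, connecting to it from RStudio Server comes down to pointing the session at the standalone master. A minimal sketch, with a hypothetical master URL (use the spark://host:7077 address your master reports):

  # Minimal sketch of attaching SparkR to the standalone cluster; the master
  # URL below is a hypothetical placeholder.
  library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

  sparkR.session(master = "spark://ec2-00-000-000-00.compute-1.amazonaws.com:7077",
                 appName = "cluster-check",
                 sparkConfig = list(spark.executor.memory = "2g"))

  count(as.DataFrame(faithful))   # quick sanity check that the workers respond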

For those who want to try out Spark 2.0, here is a video on how to set it up, as things have changed a bit since 1.6.x. I do wholeheartedly recommend going through the original course material on Spark 1.6.x first, though. In the next few months I will update this course with Spark 2.0 material.

Spark 2.0 - How to start clusters in 2.0
14:51
Basic Modeling with Spark GLM
3 Lectures 45:43

We'll follow two simple examples from the SparkR documentation to give us a taste of some of the available SparkR functions

Starting our clusters and looking at some built-in datasets
19:13
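A minimal sketch along the lines of those documentation examples, assuming the clusters are up and a Spark 2.x session is running:

  # R's built-in Old Faithful dataset promoted to a SparkR data frame
  df <- as.DataFrame(faithful)

  printSchema(df)   # eruptions: double, waiting: double
  nrow(df)          # 272 rows, now held as a SparkR data frame
  head(df)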

We'll look at using the SparkR GLM model, do some Gaussian modeling, and extract a mean squared error from our results

GLM Gaussian modeling
13:11
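A minimal sketch of the Gaussian case, assuming a running Spark 2.x session and the built-in faithful dataset; the mean squared error is computed locally after collecting the predictions:

  # Gaussian GLM sketch with a mean squared error (Spark 2.x API assumed)
  df <- as.DataFrame(faithful)
  splits <- randomSplit(df, c(0.8, 0.2), seed = 1234)
  train <- splits[[1]]; test <- splits[[2]]

  model <- glm(waiting ~ eruptions, data = train, family = "gaussian")
  summary(model)

  # predict() appends a "prediction" column; collect locally to score
  preds <- collect(select(predict(model, test), "waiting", "prediction"))
  mean((preds$waiting - preds$prediction)^2)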

Now we'll explore binomial modeling using GLM and how to measure the accuracy of the predictions

GLM Binomial modeling
13:19
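A minimal sketch of the binomial case with a simple accuracy measure, assuming a running Spark 2.x session; the churn.csv file and its 0/1 churned label column are hypothetical placeholders, and the prediction column is assumed to hold a fitted probability:

  # Binomial GLM sketch (hypothetical data set and column names)
  df <- read.df("data/churn.csv", source = "csv",
                header = "true", inferSchema = "true")
  splits <- randomSplit(df, c(0.8, 0.2), seed = 42)
  train <- splits[[1]]; test <- splits[[2]]

  model <- glm(churned ~ tenure + monthly_charges, data = train,
               family = "binomial")
  summary(model)

  # prediction assumed to hold a fitted probability; threshold at 0.5
  preds <- collect(select(predict(model, test), "churned", "prediction"))
  mean(preds$churned == ifelse(preds$prediction > 0.5, 1, 0))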
Exploring Big Data, Verbs, Spark SQL & S3
4 Lectures 55:51

Here we'll learn how to create an S3 bucket on Amazon Web Services and upload a data set.

Working with a bigger cluster and an S3 data store
12:16
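Once the bucket exists, reading from it in SparkR is a single call. A minimal sketch, assuming the cluster's Hadoop S3 connector is available and AWS credentials are supplied (for example via an IAM role); the bucket and file names are hypothetical:

  # S3 read sketch; the s3a:// scheme, bucket, and file names are assumptions
  flights <- read.df("s3a://my-sparkr-demo/flights.csv", source = "csv",
                     header = "true", inferSchema = "true")

  nrow(flights)
  head(flights)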

S3cmd is a great tool for automatically breaking up larger files before uploading them to S3 and reconstructing them into a single file afterwards.

Optional Lecture: S3cmd/S3Express - Uploading large files to S3
07:40

Here we'll look at the basic commands to interact with a SparkR data frame.

Brief look at commands to query SparkR data frames
17:58
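A minimal sketch of those basic verbs, assuming a running Spark 2.x session and using the built-in faithful dataset:

  df <- as.DataFrame(faithful)

  head(select(df, "eruptions", "waiting"))           # pick columns
  head(filter(df, df$waiting > 70))                  # keep matching rows
  counts <- summarize(groupBy(df, df$waiting),       # group and aggregate
                      count = n(df$waiting))
  head(arrange(counts, desc(counts$count)))          # sort by the aggregate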

We continue to look at commands to query SparkR data frames along with a very brief look at the magrittr package and SparkSQL syntax.

Brief look at magrittr & SparkSQL
17:57
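A minimal sketch of both styles, assuming a Spark 2.x session (on 1.6, registerTempTable() and sql(sqlContext, ...) play the same roles):

  library(magrittr)
  df <- as.DataFrame(faithful)

  # magrittr pipes read left to right instead of nesting calls
  df %>%
    filter(df$waiting > 70) %>%
    select("eruptions", "waiting") %>%
    head()

  # the same data frame queried with plain SQL
  createOrReplaceTempView(df, "faithful")
  head(sql("SELECT waiting, COUNT(*) AS n FROM faithful GROUP BY waiting ORDER BY n DESC"))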
Fly-by HDFS & JSON
1 Lecture 17:24
Brief look at HDFS and how to access it from RStudio
17:24
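A minimal sketch of pulling files out of HDFS from an RStudio session attached to the cluster; the HDFS paths are hypothetical placeholders:

  flights <- read.df("hdfs:///user/rstudio/flights.csv", source = "csv",
                     header = "true", inferSchema = "true")
  events  <- read.json("hdfs:///user/rstudio/events.json")

  printSchema(flights)
  nrow(events)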
Extra - PySpark
1 Lecture 11:26

Quick look at starting the PySpark shell and running a simple word count query

PySpark
11:26
About the Instructor
Manuel Amunategui
4.3 Average rating
480 Reviews
3,615 Students
4 Courses
Data Scientist & Quantitative Developer

I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons - people often ask me how one can break into this field, to which I reply: 'join an online competition!'