Taming Big Data with Spark Streaming and Scala - Hands On!
Learn to process massive streams of data in real time on a cluster with Spark Streaming.
4.5 (934 ratings)
6,141 students enrolled
Last updated 7/2017
English
Current price: $19 Original price: $100 Discount: 81% off
30-Day Money-Back Guarantee
Includes:
  • 6 hours on-demand video
  • 1 Supplemental Resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Process massive streams of real-time data using Spark Streaming
  • Create Spark applications using the Scala programming language
  • Integrate Spark Streaming with data sources, including Kafka, Flume, and Kinesis
  • Output transformed real-time data to Cassandra or file systems
  • Integrate Spark Streaming with Spark SQL to query streaming data in real time
  • Train machine learning models with streaming data, and use those models for real-time predictions
  • Ingest and transform streams of Apache access log data
  • Receive real-time streams of Twitter feeds
  • Maintain stateful data across a continuous stream of input data
  • Query streaming data across sliding windows of time
Requirements
  • To follow along with the examples, you'll need a personal computer. The course is filmed using Windows 10, but the tools we install are available for Linux and macOS as well.
  • We'll walk through installing the required software in the first lecture: The Scala IDE, Spark, and a JDK.
  • My "Taming Big Data with Apache Spark - Hands On!" course would be a helpful introduction to Spark in general, but it is not required for this course. A quick introduction to Spark is included.
  • The course includes a crash course in the Scala programming language if you're new to it; if you already know Scala, then great.
Description

"Big Data" analysis is a hot and highly valuable skill. Thing is, "big data" never stops flowing! Spark Streaming is a new and quickly developing technology for processing massive data sets as they are created - why wait for some nightly analysis to run when you can constantly update your analysis in real time, all the time? Whether it's clickstream data from a big website, sensor data from a massive "Internet of Things" deployment, financial data, or something else - Spark Streaming is a powerful technology for transforming and analyzing that data right when it is created, all the time.

You'll be learning from an ex-engineer and senior manager from Amazon and IMDb.

This course gets your hands on some real, live Twitter data, simulated streams of Apache access logs, and even data used to train machine learning models! You'll write and run real Spark Streaming jobs right at home on your own PC, and toward the end of the course, we'll show you how to take those jobs to a real Hadoop cluster and run them in a production environment too.

Across over 30 lectures and almost 6 hours of video content, you'll:

  • Get a crash course in the Scala programming language
  • Learn how Apache Spark operates on a cluster
  • Set up discretized streams with Spark Streaming and transform them as data is received
  • Analyze streaming data over sliding windows of time
  • Maintain stateful information across streams of data
  • Connect Spark Streaming with highly scalable sources of data, including Kafka, Flume, and Kinesis
  • Dump streams of data in real time to NoSQL databases such as Cassandra
  • Run SQL queries on streamed data in real time
  • Train machine learning models in real time with streaming data, and use them to make predictions that keep getting better over time
  • Package, deploy, and run self-contained Spark Streaming code on a real Hadoop cluster using Amazon Elastic MapReduce

This course is very hands-on, filled with achievable activities and exercises to reinforce your learning. By the end of this course, you'll be confidently creating Spark Streaming scripts in Scala, and be prepared to tackle massive streams of data in a whole new way. You'll be surprised at how easy Spark Streaming makes it!

Who is the target audience?
  • Students with some prior programming or scripting ability SHOULD take this course.
  • If you're working for a company with "big data" that is being generated continuously, or hope to work for one, this course is for you.
  • Students with no prior software engineering or programming experience should seek an introductory programming course first.
Curriculum For This Course
36 Lectures
05:58:31
Getting Started
2 Lectures 27:48

A brief introduction to the course, and then we'll get your development environment for Spark and Scala all set up on your desktop. A quick test application will confirm Spark is working on your system! Remember - be sure to install Spark 1.6.2 for this course.

Preview 15:20

Get set up with a Twitter developer account, and run your first Spark Streaming application to listen to and print out live Tweets as they happen!

Preview 12:28
A Crash Course in Scala
5 Lectures 53:50

We start our crash course in the Scala programming language by covering some basics of the language: types and variables, printing, and boolean comparisons.

[Activity] Scala Basics: Part 1
11:26
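
If you're curious, here's a taste of what those basics look like - a small sketch you can paste into a Scala worksheet, not the course's exact code:

```scala
val hello: String = "Hello, Scala!"       // immutable value with an explicit type
var counter = 0                           // mutable variable, type inferred as Int
counter += 1
println(hello)                            // basic printing
println(f"Pi is roughly ${math.Pi}%.3f")  // formatted output
val isGreater = 10 > 5                    // Boolean comparison: true
println(s"10 > 5? $isGreater")            // string interpolation
```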

Part 2 of our introduction to the basics of Scala programming, and a simple exercise to get you writing your own Scala code.

[Exercise] Scala Basics: Part 2
09:41

Our Scala crash course continues, illustrating various means of flow control in Scala: for loops, do/while loops, while loops, and more.

[Exercise] Flow Control in Scala
07:18
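
The three loop styles, in miniature (illustrative, not the lecture's exact code):

```scala
for (i <- 1 to 3) println(s"for: $i")  // for loop over a range

var x = 0
while (x < 3) {                        // while loop: test comes first
  println(s"while: $x")
  x += 1
}

var y = 0
do {                                   // do/while: body runs at least once
  println(s"do/while: $y")
  y += 1
} while (y < 3)
```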

Scala is a functional programming language, so understanding how functions work and are treated in Scala is hugely important! This lecture covers the fundamentals and lets you put them into practice.

[Exercise] Functions in Scala
08:47
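
A sketch of the core idea - functions are values you can pass around (the names here are made up):

```scala
def squareIt(x: Int): Int = x * x              // a named function

def transformInt(x: Int, f: Int => Int): Int = // takes a function as a parameter
  f(x)

println(transformInt(3, squareIt))             // pass a named function: 9
println(transformInt(3, n => n * n * n))       // or an anonymous one: 27
```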

We wrap up our Scala crash course with the data structures commonly used in Spark with Scala: tuples, lists, and maps.

[Exercise] Data Structures in Scala
16:38
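
Sketched with made-up data, those three structures look like this:

```scala
val captainStuff = ("Picard", "Enterprise-D")            // a tuple
println(captainStuff._1)                                 // fields are one-based

val shipList = List("Enterprise", "Defiant", "Voyager")  // an immutable list
println(shipList.map(_.length))                          // List(10, 7, 7)

val shipMap = Map("Kirk" -> "Enterprise", "Picard" -> "Enterprise-D")  // a map
println(shipMap("Picard"))                               // look up by key
```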
Spark Streaming Concepts
7 Lectures 48:50

Before you can learn about Spark Streaming, you need to understand how Spark itself works at a high level! This covers the why & how of Apache Spark, of which Spark Streaming is a component.

Introduction to Spark
07:06

The fundamental object of Spark programming is the Resilient Distributed Dataset (RDD), and this is used not just in Spark but also within Spark Streaming scripts. This lecture explains what they are, and what you can do with them.

The Resilient Distributed Dataset (RDD)
10:40

Let's walk through and actually run a simple Spark script that counts the number of occurrences of each word in a book.

[Activity] RDDs in action: a simple word count application
08:17
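
A minimal version of that word count might look like this - a sketch that assumes a local text file named book.txt, not necessarily the lecture's exact code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("WordCount"))
    val counts = sc.textFile("book.txt")       // RDD of lines from the book
      .flatMap(_.split("\\W+"))                // split lines into words
      .map(word => (word.toLowerCase, 1))      // pair each word with a count of 1
      .reduceByKey(_ + _)                      // add up the counts per word
    counts.collect().foreach(println)          // bring results back and print
    sc.stop()
  }
}
```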

We finally have all the prerequisite knowledge to start talking about Spark Streaming itself in more detail! We'll cover how it works, what it's for, and its architecture.

Introduction to Spark Streaming
06:32

Now that we know more, let's go revisit that first Spark Streaming application we ran in lecture 2, and dive into how it really works.

[Activity] Revisiting the PrintTweets application
05:09
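
The rough shape of that application, as a sketch (it assumes your Twitter credentials are already set as twitter4j system properties):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object PrintTweets {
  def main(args: Array[String]) {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[*]").setAppName("PrintTweets"),
      Seconds(1))                                       // one-second batch interval
    val tweets = TwitterUtils.createStream(ssc, None)   // DStream of twitter4j Status
    tweets.map(_.getText).print()                       // print a sample of tweets per batch
    ssc.start()                                         // nothing happens until we start
    ssc.awaitTermination()
  }
}
```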

Windowing allows you to analyze streaming data over a sliding window of time, which lets you do much more than just transform streaming data and store it someplace else. We'll cover the concepts of the batch, window, and slide intervals, and how they work together to let you aggregate streaming data over some period of time.

Windowing: Aggregating data over longer time spans
05:00
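
In code, a windowed aggregation looks something like this sketch, where hashtagCounts is an assumed DStream[(String, Int)] of (hashtag, 1) pairs:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// The inverse function lets Spark subtract data leaving the window instead
// of recomputing everything; this optimization requires checkpointing.
def windowedCounts(hashtagCounts: DStream[(String, Int)]): DStream[(String, Int)] =
  hashtagCounts.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // fold values entering the window
    (a: Int, b: Int) => a - b,   // subtract values sliding out of it
    Seconds(300),                // window interval: look back five minutes
    Seconds(1))                  // slide interval: recompute every second
```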

How can Spark Streaming do so much work continuously in a reliable manner? We'll uncover some of its tricks for reliability, as well as tips for configuring Spark Streaming to be as reliable as possible.

Fault Tolerance in Spark Streaming
06:06
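
One concrete piece of that reliability story is checkpointing. A driver that can recover from failure is typically structured like this sketch (the checkpoint path is illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReliableDriver {
  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[*]").setAppName("Reliable"), Seconds(1))
    ssc.checkpoint("C:/checkpoint/")   // where state and metadata get saved
    // ... set up your DStreams here ...
    ssc
  }

  def main(args: Array[String]) {
    // On a clean start, createContext runs; after a crash, the context and
    // its state are rebuilt from the checkpoint directory instead.
    val ssc = StreamingContext.getOrCreate("C:/checkpoint/", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```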
Spark Streaming Examples with Twitter
3 Lectures 36:35

We'll build on our "print tweets" example to actually store the incoming Tweets to disk, and illustrate how Spark Streaming can handle file output.

[Exercise] Saving Tweets to Disk
13:23
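
File output from a DStream is a single call. Given the tweets stream from the PrintTweets sketch above, something like:

```scala
import org.apache.spark.streaming.dstream.DStream
import twitter4j.Status

// Each batch becomes its own directory of part files, named with the
// prefix, batch timestamp, and suffix (e.g. Tweets-1497100000000.txt).
def saveTweets(tweets: DStream[Status]): Unit =
  tweets.map(_.getText).saveAsTextFiles("Tweets", "txt")
```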

Compute the average length of a Tweet, using windowing in Spark Streaming.

[Exercise] Tracking the Average Tweet Length
08:22

This is a fun one! We'll track the most popular hashtags on Twitter over time, and watch how they change in real time!

Preview 14:50
Spark Streaming Examples with Clickstream / Apache Access Log Data
5 Lectures 55:32

We'll simulate an incoming stream of Apache access logs, and use Spark Streaming to keep track of the most-requested web pages in real time!

Preview 13:27

This example will listen to an Apache access log stream, and raise an alarm if too many errors are returned by the server in real time.

[Exercise] Alarming on Log Errors
11:56
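
A sketch of the idea, assuming statusCodes is a DStream[String] of HTTP status codes parsed from the logs (the threshold values are made up):

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

def watchForErrors(statusCodes: DStream[String]): Unit = {
  statusCodes
    .window(Seconds(300), Seconds(1))          // examine the last five minutes
    .foreachRDD { rdd =>
      val total = rdd.count()
      val errors = rdd.filter(c => c.startsWith("4") || c.startsWith("5")).count()
      if (total > 100 && errors.toDouble / total > 0.5) {
        println(s"ALARM: $errors errors out of $total requests!")
      }
    }
}
```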

We'll integrate Spark Streaming with Spark SQL, allowing us to run SQL queries on data as it is streamed in! Again we will use Apache logs as an example.

[Exercise] Integrating Spark Streaming with Spark SQL
10:18
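
In the Spark 2 API, the pattern looks roughly like this: convert each micro-batch RDD to a DataFrame, register it as a temporary view, and query it. The case class, view name, and query here are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

case class Record(url: String, status: Int)

def queryEachBatch(records: DStream[Record]): Unit = {
  records.foreachRDD { rdd =>
    val spark = SparkSession.builder
      .config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    rdd.toDF().createOrReplaceTempView("requests")   // expose the batch as a table
    spark.sql("SELECT status, count(*) FROM requests GROUP BY status").show()
  }
}
```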

Spark 2.0 introduced experimental support for Structured Streaming, a new Dataset-based API for Spark Streaming that is bound to become increasingly important. Learn how it works.

Intro to Structured Streaming in Spark 2
08:27

As an example, we'll stream Apache access logs in from a directory, and use Structured Streaming to count up status codes over a one-hour moving window.

[Activity] Analyzing Apache Log files with Structured Streaming
11:24
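
In miniature, that might look like the sketch below; the directory, schema, and window sizes are illustrative, and the lecture's actual code may differ:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object StructuredLogs {
  def main(args: Array[String]) {
    val spark = SparkSession.builder
      .master("local[*]").appName("StructuredLogs").getOrCreate()

    // File sources need an explicit schema when streaming.
    val schema = new StructType()
      .add("timestamp", TimestampType)
      .add("status", StringType)

    val logs = spark.readStream.schema(schema).json("logs/")  // watch a directory

    val counts = logs.groupBy(
        window(col("timestamp"), "1 hour", "30 seconds"),     // moving window
        col("status"))
      .count()

    val query = counts.writeStream
      .outputMode("complete")   // re-emit the full aggregation each trigger
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```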
Integrating with Other Systems
5 Lectures 41:09

Apache Kafka is a popular and robust technology for publishing messages across a cluster on a large scale. We'll show how to get Spark Streaming to listen to Kafka topics, and process them in real time.

Preview 12:20
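
With the spark-streaming-kafka module from the Spark 1.6 era, a direct (receiverless) connection looks roughly like this sketch:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Broker address and topic name are illustrative.
def kafkaLines(ssc: StreamingContext): DStream[String] = {
  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val topics = Set("testLogs")
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics
  ).map(_._2)   // each record is (key, message); keep just the message
}
```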

Flume is a popular technology for publishing log information at large scale, especially on a Hadoop cluster. We'll illustrate how to set up both push-based and pull-based Flume configurations with Spark Streaming, and discuss the tradeoffs of each.

Integrating with Apache Flume
08:51
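
The two configurations differ by a single call in the spark-streaming-flume module; the hosts and ports here are illustrative:

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.flume.FlumeUtils

def flumeStreams(ssc: StreamingContext): Unit = {
  // Push: Spark runs a receiver that Flume's Avro sink sends events to.
  val pushed = FlumeUtils.createStream(ssc, "localhost", 9092)
  // Pull: Spark polls a custom Flume sink; data stays buffered in the sink
  // until Spark has stored it, which is the more fault-tolerant option.
  val polled = FlumeUtils.createPollingStream(ssc, "localhost", 9092)
  pushed.count().print()
  polled.count().print()
}
```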

Amazon's Kinesis Streaming service is basically Kafka on AWS. If you're working with an AWS/EC2 cluster, you'll want to know how to integrate Spark Streaming with Kinesis - and that's what this lecture covers.

Integrating with Amazon Kinesis
05:29
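
From the spark-streaming-kinesis-asl module, the receiver setup looks roughly like this sketch:

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// App name, stream name, endpoint, and region are illustrative.
def kinesisStream(ssc: StreamingContext) =
  KinesisUtils.createStream(
    ssc, "MyKinesisApp", "myStreamName",
    "kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST,   // start from the newest records
    Seconds(2),                       // how often to checkpoint our position
    StorageLevel.MEMORY_AND_DISK_2)   // replicate received data for safety
```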

What if you need to integrate Spark Streaming with some proprietary system that does not have an existing connection library? Well, you can always write your own Receiver class. This example shows you how and actually lets you build and run one.

[Activity] Writing Custom Data Receivers
06:55
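
The skeleton of a custom receiver is small: extend Receiver, start a thread in onStart(), and hand records to Spark with store(). This toy version (not the lecture's example) just emits a heartbeat string once per second:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class HeartbeatReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart() {
    new Thread("Heartbeat Receiver") {
      override def run() {
        while (!isStopped()) {
          store("beat")        // hand one record to Spark Streaming
          Thread.sleep(1000)
        }
      }
    }.start()
  }
  def onStop() {}              // our thread exits when isStopped() turns true
}

// Plugged in like any built-in source:
// val beats = ssc.receiverStream(new HeartbeatReceiver())
```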

Cassandra is a popular "NoSQL" database that can be used to provide fast access to massive data sets for real-time applications. Dumping data transformed by Spark Streaming into a Cassandra database can expose that data to your larger, real-time services. We'll show you how, and actually run a simple example.

Integrating with Cassandra
07:34
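
With the DataStax connector, writing a stream out is essentially one call - a sketch, with the assumptions noted in the comments:

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

// Assumes the DataStax spark-cassandra-connector is on the classpath and
// spark.cassandra.connection.host is set in your SparkConf. The keyspace,
// table, and column names are illustrative.
def writeToCassandra(requests: DStream[(String, String, Int)]): Unit =
  requests.saveToCassandra("logtest", "requests", SomeColumns("ip", "url", "status"))
```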
Advanced Spark Streaming Examples
3 Lectures 42:33

Spark has the ability to track arbitrary state across streams of data as they come in, such as web sessions, running totals, etc. This example shows you how it all works, and challenges you to track your own state using our example as a baseline.

[Exercise] Stateful Information in Spark Streams
15:07
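
The core of this capability is updateStateByKey, which folds each key's new values into its running state. A minimal sketch:

```scala
import org.apache.spark.streaming.dstream.DStream

// A running count per key (keys might be session IDs, IPs, and so on).
// This requires a checkpoint directory to be set on the StreamingContext.
def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(state.getOrElse(0) + newValues.sum)

def runningCounts(events: DStream[(String, Int)]): DStream[(String, Int)] =
  events.updateStateByKey(updateCount)
```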

Spark Streaming integrates with some of Spark's MLlib (Machine Learning Library) capabilities. This example builds a real-time K-Means clustering application: unsupervised machine learning that continually gets better as more training data feeds into it.

[Activity] Streaming K-Means Clustering
15:36
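
MLlib's StreamingKMeans follows a train-on-one-stream, predict-on-another pattern; a sketch:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// trainingData and testData would be parsed from your input streams;
// k and the feature count here are illustrative.
def clusterInRealTime(trainingData: DStream[Vector], testData: DStream[Vector]): Unit = {
  val model = new StreamingKMeans()
    .setK(5)                    // number of clusters to find
    .setDecayFactor(1.0)        // 1.0 weights all history equally
    .setRandomCenters(2, 0.0)   // two features, zero initial weight

  model.trainOn(trainingData)   // the model updates with every batch
  model.predictOn(testData).print()
}
```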

Spark Streaming can also feed data in real time to linear regression models that get better over time as more data is fed into them. This example shows linear regression in action with Spark Streaming.

[Activity] Streaming Linear Regression
11:50
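
The regression analogue, sketched with MLlib's StreamingLinearRegressionWithSGD:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.dstream.DStream

// The model starts from zero weights (one feature here) and refines itself
// with every batch of labeled training data.
def regressInRealTime(trainingData: DStream[LabeledPoint],
                      testData: DStream[LabeledPoint]): Unit = {
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(1))

  model.trainOn(trainingData)
  model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
}
```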
Spark Streaming in Production
4 Lectures 47:24

Your production applications won't be run from within the Scala IDE; you'll need to run them from a command line, and potentially on a cluster. The spark-submit command is used for this. We'll show you how to package up your application and run it using spark-submit from a command prompt.

[Activity] Running with spark-submit
10:47

If your Spark Streaming application has external library dependencies that might not be already present on every machine in your cluster, the SBT tool can manage those dependencies for you, and package them into the JAR file you run with spark-submit. We'll show you how it works with a real example.

[Activity] Packaging your code with SBT
10:49
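
A minimal build.sbt in the spirit of this lecture (versions are illustrative):

```scala
// "provided" keeps Spark itself out of the fat JAR, since the cluster
// already supplies it; the Kafka line is an example of a dependency
// that does get bundled.
name := "StreamingExample"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-streaming"       % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.2"
)
// With the sbt-assembly plugin added to the project, running `sbt assembly`
// produces a single JAR you can pass to spark-submit.
```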

We'll run our simple word count example on a real cluster, using Amazon's Elastic MapReduce service! This just shows you what's involved in running a Spark Streaming job on a real cluster as opposed to your desktop; there are a few parameters to spark-submit you need to worry about, and getting your scripts and data in the right place is also something you need to deal with.

Running on a real Hadoop cluster with EMR
13:13

Spark jobs rarely run perfectly, if at all, on the first try - some tuning and debugging is usually required, and arriving at the right scale of your cluster is also necessary. We'll cover some performance tips, and how to troubleshoot what's going on with a Spark Streaming job running on a cluster.

Troubleshooting and Tuning Spark Jobs
12:35
You Made It!
2 Lectures 04:50

Want to learn more about Spark Streaming? Here are a few books and other resources I've found valuable.

Learning More
03:44

Let's stay in touch! Head to my website for discounts on my other courses, and to follow me on social media.

Bonus Lecture: Discounts on my other courses!
01:06
About the Instructor
Sundog Education by Frank Kane
4.5 Average rating
15,245 Reviews
73,374 Students
9 Courses
Training the World in Big Data and Machine Learning

Sundog Education's mission is to make highly valuable career skills in big data, data science, and machine learning accessible to everyone in the world. Our consortium of expert instructors shares our knowledge in these emerging fields with you, at prices anyone can afford. 

Sundog Education is led by Frank Kane and owned by Frank's company, Sundog Software LLC. Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.

Frank Kane
4.5 Average rating
14,853 Reviews
69,690 Students
7 Courses
Founder, Sundog Education

Frank spent 9 years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers, all the time. Frank holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology, and teaching others about big data analysis.