- 6 hours on-demand video
- 2 articles
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion
- Process massive streams of real-time data using Spark Streaming
- Integrate Spark Streaming with data sources, including Kafka, Flume, and Kinesis
- Use Spark 2's Structured Streaming API
- Create Spark applications using the Scala programming language
- Output transformed real-time data to Cassandra or file systems
- Integrate Spark Streaming with Spark SQL to query streaming data in real time
- Train machine learning models with streaming data, and use those models for real-time predictions
- Ingest Apache access log data and transform streams of it
- Receive real-time streams of Twitter feeds
- Maintain stateful data across a continuous stream of input data
- Query streaming data across sliding windows of time
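Two of the skills above — maintaining stateful data across a stream and aggregating it batch by batch — come down to folding each micro-batch into an accumulator. As a plain-Scala sketch of the stateful idea (the names here are illustrative, not the Spark API; in Spark Streaming this role is played by operators like `updateStateByKey`):

```scala
// Plain-Scala sketch of stateful stream processing: each micro-batch of
// (word, count) pairs is folded into state carried over from all previous
// batches, the way updateStateByKey carries state between batch intervals
// in Spark Streaming.
object StatefulSketch {
  type State = Map[String, Int]

  // Merge one batch into the accumulated state.
  def updateState(state: State, batch: Seq[(String, Int)]): State =
    batch.foldLeft(state) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Run a whole sequence of batches through, starting from empty state.
  def runStream(batches: Seq[Seq[(String, Int)]]): State =
    batches.foldLeft(Map.empty[String, Int])(updateState)
}
```

For example, `StatefulSketch.runStream(Seq(Seq("spark" -> 1, "kafka" -> 2), Seq("spark" -> 3)))` leaves `"spark"` at 4 and `"kafka"` at 2, because the second batch updates state built from the first.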
A brief introduction to the course, and then we'll get your development environment for Spark and Scala all set up on your desktop. A quick test application will confirm Spark is working on your system! Remember - be sure to install Spark 2.4.4 for this course, and install Java 8, not Java 9, 10, or 11.
Get set up with a Twitter developer account, and run your first Spark Streaming application to listen to and print out live Tweets as they happen!
Windowing allows you to analyze streaming data over a sliding window of time, which lets you do much more than just transform streaming data and store it someplace else. We'll cover the concepts of the batch, window, and slide intervals, and how they work together to let you aggregate streaming data over some period of time.
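How the three intervals interact can be sketched in plain Scala (this is an illustration of the concept, not the Spark API): with a batch interval of 1 second, a 3-second window, and a 2-second slide, each result aggregates 3 consecutive batches, and a new result is emitted every 2 batches.

```scala
// Plain-Scala sketch of windowed aggregation over a stream of per-batch
// event counts. `window` and `slide` are expressed in numbers of batches.
object WindowSketch {
  def windowedSums(batchCounts: Seq[Int], window: Int, slide: Int): List[Int] =
    batchCounts
      .sliding(window, slide) // overlapping groups: one per slide interval
      .map(_.sum)             // aggregate each window
      .toList
}
```

Calling `windowedSums(counts, window = 3, slide = 2)` mirrors a Spark Streaming job with a 3-second window and a 2-second slide on 1-second batches: consecutive windows overlap by one batch.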
We'll simulate an incoming stream of Apache access logs, and use Spark Streaming to keep track of the most-requested web pages in real time!
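The core of that exercise can be sketched outside Spark: parse the request field out of each Apache access log line, then count and rank pages. In the course this logic runs inside a Spark Streaming transformation; the parsing itself is the same (object and method names here are illustrative).

```scala
// Plain-Scala sketch: extract the requested page from Common Log Format
// lines and rank the most-requested pages.
object TopPagesSketch {
  // Matches the quoted request field, e.g. "GET /index.html HTTP/1.1";
  // group 2 is the URL.
  private val requestPattern = """"(\S+) (\S+) (\S+)"""".r

  def pageOf(logLine: String): Option[String] =
    requestPattern.findFirstMatchIn(logLine).map(_.group(2))

  def topPages(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines.flatMap(pageOf)
      .groupBy(identity)
      .map { case (page, hits) => page -> hits.size }
      .toSeq
      .sortBy(-_._2) // most-requested first
      .take(n)
}
```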
Apache Kafka is a popular and robust technology for publishing messages across a cluster on a large scale. We'll show how to get Spark Streaming to listen to Kafka topics, and process them in real time.
Cassandra is a popular "NoSQL" database that can give real-time applications fast access to massive data sets. Dumping data transformed by Spark Streaming into a Cassandra database can expose that data to your larger, real-time services. We'll show you how, and actually run a simple example.
Your production applications won't be run from within the Scala IDE; you'll need to run them from a command line, and potentially on a cluster. The spark-submit command is used for this. We'll show you how to package up your application and run it using spark-submit from a command prompt.
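A typical invocation looks like the sketch below; the class name and JAR path are placeholders for whatever your own project produces, while `--class` and `--master` are standard spark-submit options.

```shell
# Illustrative spark-submit invocation; the class name and JAR path are
# placeholders for your own project's output.
spark-submit \
  --class com.example.streaming.LogAnalyzerJob \
  --master "local[*]" \
  target/scala-2.12/streaming-job_2.12-1.0.jar
```

On a real cluster you'd swap `local[*]` for the cluster's master (for example `yarn`), which is exactly the transition the later lectures walk through.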
If your Spark Streaming application has external library dependencies that might not be already present on every machine in your cluster, the SBT tool can manage those dependencies for you, and package them into the JAR file you run with spark-submit. We'll show you how it works with a real example.
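As a minimal sketch of what such a build definition can look like — assuming the sbt-assembly plugin is enabled in `project/plugins.sbt`, and with illustrative version numbers:

```scala
// build.sbt sketch (version numbers are illustrative).
name := "streaming-job"
version := "1.0"
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  // Spark itself is "provided": spark-submit supplies it at runtime,
  // so it stays out of the assembled JAR.
  "org.apache.spark" %% "spark-streaming" % "3.0.0" % "provided",
  // An external dependency that must travel inside the packaged JAR.
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0"
)
```

Running `sbt assembly` then produces a single self-contained JAR under `target/` that you can hand to spark-submit.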
We'll run our simple word count example on a real cluster, using Amazon's Elastic MapReduce service! This just shows you what's involved in running a Spark Streaming job on a real cluster as opposed to your desktop; there are a few parameters to spark-submit you need to worry about, and getting your scripts and data in the right place is also something you need to deal with.
Spark jobs rarely run perfectly, if at all, on the first try - some tuning and debugging is usually required, and arriving at the right scale of your cluster is also necessary. We'll cover some performance tips, and how to troubleshoot what's going on with a Spark Streaming job running on a cluster.
- To follow along with the examples, you'll need a personal computer. The course is filmed using Windows 10, but the tools we install are available for Linux and MacOS as well.
- We'll walk through installing the required software in the first lecture: The Scala IDE, Spark, and a JDK.
- My "Taming Big Data with Apache Spark - Hands On!" would be a helpful introduction to Spark in general, but it is not required for this course. A quick introduction to Spark is included.
- The course includes a crash course in the Scala programming language if you're new to it; if you already know Scala, then great.
New! Updated for Spark 3.0.0!
"Big Data" analysis is a hot and highly valuable skill. Thing is, "big data" never stops flowing! Spark Streaming is a new and quickly developing technology for processing massive data sets as they are created - why wait for some nightly analysis to run when you can constantly update your analysis in real time, all the time? Whether it's clickstream data from a big website, sensor data from a massive "Internet of Things" deployment, financial data, or something else - Spark Streaming is a powerful technology for transforming and analyzing that data right when it is created, all the time.
You'll be learning from an ex-engineer and senior manager from Amazon and IMDb.
This course gets you hands-on with real live Twitter data, simulated streams of Apache access logs, and even data used to train machine learning models! You'll write and run real Spark Streaming jobs right at home on your own PC, and toward the end of the course, we'll show you how to take those jobs to a real Hadoop cluster and run them in a production environment too.
Across over 30 lectures and almost 6 hours of video content, you'll:
- Get a crash course in the Scala programming language
- Learn how Apache Spark operates on a cluster
- Set up discretized streams with Spark Streaming and transform them as data is received
- Use structured streaming to stream into DataFrames in real time
- Analyze streaming data over sliding windows of time
- Maintain stateful information across streams of data
- Connect Spark Streaming with highly scalable sources of data, including Kafka, Flume, and Kinesis
- Dump streams of data in real time to NoSQL databases such as Cassandra
- Run SQL queries on streamed data in real time
- Train machine learning models in real time with streaming data, and use them to make predictions that keep getting better over time
- Package, deploy, and run self-contained Spark Streaming code on a real Hadoop cluster using Amazon Elastic MapReduce
This course is very hands-on, filled with achievable activities and exercises to reinforce your learning. By the end of this course, you'll be confidently creating Spark Streaming scripts in Scala, and be prepared to tackle massive streams of data in a whole new way. You'll be surprised at how easy Spark Streaming makes it!
- Students with some prior programming or scripting ability should take this course.
- If you're working for a company with "big data" that is being generated continuously, or hope to work for one, this course is for you.
- Students with no prior software engineering or programming experience should seek an introductory programming course first.