Apache Spark with Scala By Example

Advance your Spark skills and become more valuable, confident, and productive
4.0 (63 ratings)
1,083 students enrolled
$19
$55
65% off
Take This Course
  • Lectures 45
  • Length 3 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

How taking a course works

Discover

Find online courses made by experts from around the world.

Learn

Take your courses with you and learn anywhere, anytime.

Master

Learn and practice real-world skills and achieve your goals.

About This Course

Published 12/2015 English

Course Description

Understanding how to manipulate, deploy and leverage Apache Spark is quickly becoming essential for data engineers, architects, and data scientists.  So, it's time for you to stay ahead of the crowd by learning Spark with Scala from an industry veteran and nice guy. 

This course is designed to give you the core principles needed to understand Apache Spark and build your confidence through hands-on experiences. 

In this course, you’ll be guided through a wide range of core Apache Spark concepts using Scala source code examples; all of which are designed to give you fundamental, working knowledge.  Each section carefully builds upon previous sections, so your learning is reinforced along every step of the way.  

All of the source code is conveniently available for download, so you can run and modify it for yourself.

Here are just a few of the concepts this course will teach you using more than 50 hands-on examples:

  • Learn the fundamentals and run examples of Spark's Resilient Distributed Datasets, Actions and Transformations through Scala
  • Run Spark on your local cluster and also Amazon EC2
  • Pick up troubleshooting tricks for deploying Scala applications to Spark clusters
  • Explore Spark SQL with CSV, JSON and MySQL database (JDBC) data sources
  • Discover Spark Streaming through numerous examples and build a custom application which streams from Slack
  • Hands-on machine learning experiments with Spark MLlib
  • Reinforce your understanding through multiple quizzes and lecture recaps


Check out the free preview videos below!

As an added bonus, this course will teach you about Scala and the Scala ecosystem such as SBT and SBT plugins to make packaging and deploying to Spark easier and more efficient.  

As another added bonus, on top of all the extensive course content, the course offers a private message board so you can ask the instructor questions at any time during your Spark learning journey.

This course will make you more knowledgeable about Apache Spark.  It offers you the chance to build your confidence, productivity and value in your Spark adventures. 

What are the requirements?

  • Prior programming or scripting experience in at least one programming language is preferred, but not required.
  • You are training for a new career or looking to advance your current one
  • You are curious how and when the Apache Spark ecosystem might be beneficial for your operations or product development efforts

What am I going to get from this course?

  • Gain confidence and hands-on knowledge exploring, running and deploying Apache Spark
  • Access to a wide variety of Spark with Scala, Spark SQL, Spark Streaming and Spark MLlib source code examples
  • Create hands-on Spark environments for experimenting with course examples
  • Participate in course discussion boards with instructor and other students
  • Know when and how Spark with Scala, Spark SQL, Spark Streaming and Spark MLlib may be an appropriate solution

What is the target audience?

  • People looking to expand their working knowledge of Apache Spark and Scala
  • Anyone with a desire to learn more about the Spark ecosystem such as Spark SQL, Spark Streaming and Spark MLlib
  • Software developers wanting to expand their skills and abilities for future career growth. Spark with Scala is an in-demand skill set.
  • Anyone who suspects an on-demand Spark course with access to both source code and questions/answers with the instructor is probably more efficient than buying a Spark book or reading blog posts

What do you get with this course?

Not for you? No problem.
30 day money back guarantee.

Forever yours.
Lifetime access.

Learn on the go.
Desktop, iOS and Android.

Get rewarded.
Certificate of completion.

Curriculum

Section 1: Introduction
02:04

Let's show and describe the structure of this Apache Spark with Scala course from a high level. 

  1. What Apache Spark topics will be covered? 
  2. Why is it structured this way? 
  3. What are the course activities and resources? 


After watching this video, you'll know how the sections in this course build upon each other.  So, as we progress through Spark Core and Spark SQL, we know these beginning sections will be relevant when learning Spark Streaming and Spark MLlib. 

01:46

Download, review and run the source code.  Customize the source code and re-run.  The way to build confidence is through doing.  

Participate in the course discussion boards.  Through discussion and collaboration, you'll have the opportunity to teach others and ask questions.  This will strengthen your Spark with Scala skills.

A note for Windows users.

Where and how to download the course source code.

Article

Provides a link to download all source code used in this Apache Spark with Scala course.

Section 2: Introducing the Apache Spark Fundamentals
06:05

Before we jump into Spark with Scala examples, let's present a high-level overview of the key concepts you need to know. These fundamentals will be used throughout the rest of this Spark with Scala course.

Key constructs: Resilient Distributed Datasets (RDDs), Transformations, Actions, Spark Driver programs, SparkContext and how applications deployed to a Spark cluster utilize the parallel nature of Spark.
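
Here's a minimal sketch of how these constructs fit together in a tiny Spark driver program. It is self-contained and runs locally; the app name and numbers are purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkFundamentals {
  def main(args: Array[String]): Unit = {
    // The Spark Driver program creates a SparkContext, the entry point to the cluster.
    val conf = new SparkConf().setAppName("fundamentals").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD built from a local collection; its partitions can be processed in parallel.
    val numbers = sc.parallelize(1 to 100)

    // Transformations are lazy: this only describes a new RDD, nothing runs yet.
    val evens = numbers.filter(_ % 2 == 0)

    // Actions trigger evaluation and return a value to the driver.
    println(evens.count()) // 50

    sc.stop()
  }
}
```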

01:08

We're going to be running many examples in this next section.  I don't expect you to follow every detail.  Rather, I just want you to experience loading external data and running some simple examples of Spark Transformations and Actions. 

06:21

To begin the course, let's run some Spark code with Scala from the shell.

I don't expect you to follow all the details of this code. I just want to get us motivated to continue our Spark learning adventure.

In this example, we'll get a glimpse into Spark core concepts such as Resilient Distributed Datasets, Transformations, Actions and Spark drivers from a Scala perspective. Again, I'll fill in all the details of this Scala code in later lectures.
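
A session along these lines gives a feel for what's coming. This is a sketch; the file name is a placeholder for any text file on your machine.

```scala
// Inside spark-shell a SparkContext is already available as `sc`.
val lines = sc.textFile("README.md")               // load external data into an RDD

val words = lines.flatMap(_.split("\\s+"))         // Transformation: split each line into words
val sparkWords = words.filter(_.contains("Spark")) // Transformation: keep matching words

sparkWords.count()                                 // Action: triggers the actual computation
sparkWords.take(5)                                 // Action: return the first 5 words to the driver
```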

3 questions

Before moving to more advanced examples, we need to ensure the Apache Spark fundamentals are understood. This quiz will ensure the student is ready to proceed.

Section 3: Preparing your Spark environment
00:52

In this section of the Spark with Scala course, we'll set up and verify your Spark with Scala environment. With your own environment in place, you can choose to run the course examples and experiment with the Scala Spark API.

03:22

Walk through all steps required to set up Apache Spark on your machine.

03:03

We need sample data to run Scala examples in the Spark Console. This lecture will prepare the Apache Spark environment for loading data and confirm the Spark console is working.
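
Once the data is in place, a two-line check in the Spark console is enough to confirm everything works. The path below is an assumption; point it wherever you saved the sample data.

```scala
val rdd = sc.textFile("baby_names.csv")
rdd.count()   // getting a line count back means the environment and data are ready
```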

Article

Reference links used in this section of the Spark with Scala course

Section 4: Deeper Dive into Spark Actions and Transformations
02:13

There are two kinds of Spark functions: Transformations and Actions. Transformations transform an existing RDD into a new, different one. Actions are functions used against RDDs to produce a value.

In this section of the Apache Spark with Scala course, we'll go over a variety of Spark Transformation and Action functions.

This should build your confidence and understanding of how you can apply these functions to your use cases. It will also lay more of the foundation for us to build upon in your journey of learning Apache Spark with Scala.

07:48

What are Spark Transformations? Let's review common Spark Transformation functions through Scala code examples.

We're going to break Apache Spark transformations into groups. In this video, we'll cover some common Spark transformations which produce RDDs. These include map, flatMap, filter, etc.

We're going to use a CSV dataset of baby names in New York. As we progress through transformations and actions in this Apache Spark with Scala course, we'll determine more and more results for this sample data set.

So, let's begin with some commonly used Spark transformations.
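
As a preview, here's roughly what these first transformations look like against the baby names file. The sketch assumes a header row starting with "Year" and comma-separated columns; treat the column indexes as placeholders and adjust to the actual file.

```scala
val babyNames = sc.textFile("baby_names.csv")

// filter: keep only the lines we want (here, drop the header row)
val rows = babyNames.filter(line => !line.startsWith("Year"))

// map: exactly one output element per input element
val firstNames = rows.map(_.split(",")(1))

// flatMap: zero or more output elements per input element
val allFields = rows.flatMap(_.split(","))
```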

01:49

In part 2 of Spark Transformations, we'll discover Spark transformations used when we need to combine, compare and contrast elements in two RDDs. This is something we often have to do when working with datasets. Spark helps compare RDDs through transformation functions such as union, intersection and distinct.
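
A quick sketch of the idea with two small RDDs (note that element order in the results is not guaranteed):

```scala
val one = sc.parallelize(Seq("a", "b", "c"))
val two = sc.parallelize(Seq("b", "c", "d", "d"))

one.union(two).collect()        // Array(a, b, c, b, c, d, d) -- duplicates are kept
one.intersection(two).collect() // Array(b, c)
two.distinct().collect()        // Array(b, c, d)
one.subtract(two).collect()     // Array(a) -- elements in `one` but not in `two`
```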

06:30

In part 3 of our focus on Spark Transformation functions, we're going to work with the "key" functions, including groupByKey, reduceByKey, aggregateByKey and sortByKey.

All of these transformations work with key/value pair RDDs, so we will cover the creation of PairRDDs as well.

We'll continue to use the baby_names.csv file used in Part 1 and Part 2 of Spark Transformations.
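
To set expectations, here's a sketch of the "key" functions against (county, count) pairs. It assumes the column layout described earlier, so the indexes are placeholders.

```scala
val rows = sc.textFile("baby_names.csv").filter(!_.startsWith("Year"))

// Mapping to 2-element tuples gives us a PairRDD of (county, count).
val byCounty = rows.map(_.split(",")).map(c => (c(2), c(4).toInt))

byCounty.reduceByKey(_ + _).take(5)                    // total count per county
byCounty.groupByKey().mapValues(_.size).take(5)        // rows per county (heavier shuffle)
byCounty.aggregateByKey(0)(math.max, math.max).take(5) // max single count per county
byCounty.sortByKey().take(5)                           // pairs ordered by county name
```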

3 questions

Test and confirm your knowledge of Spark Transformations.

06:13

Run and review common Spark actions. You have already seen many Spark action examples before this lecture, so we will go quickly to review.

Spark Actions produce values back to the Spark Driver program. Also, recall that Action functions called against an RDD cause a previously lazy RDD to be evaluated. So, in the real world when working with large datasets, we need to be careful when triggering RDDs to be evaluated through Spark actions.

This video shows commonly used Spark Actions.
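
For reference, a handful of the most common actions on a small RDD (expected results shown in the comments; the output path is hypothetical):

```scala
val rdd = sc.parallelize(1 to 10)

rdd.count()               // 10 -- number of elements
rdd.first()               // 1  -- first element
rdd.take(3)               // Array(1, 2, 3)
rdd.reduce(_ + _)         // 55 -- combine all elements
rdd.collect()             // Array(1..10) -- pulls ALL data to the driver; use with care
rdd.saveAsTextFile("out") // writes each partition to a file under the "out" directory
```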

2 questions

Test and confirm knowledge of Spark Actions.

Article

Links to conveniently download the Spark source code examples presented in this section of the course.  Also, links to the latest programming guides for Spark Transformations and Actions are included. 

Section 5: Utilizing Clusters with Apache Spark
03:37

Clusters allow Spark to process huge volumes of data by distributing the workload across multiple nodes. This is also referred to as "running in parallel" or "horizontal scaling".

A cluster manager is required to run Spark on a cluster. Spark supports three cluster managers: Apache Hadoop YARN, Apache Mesos and an internal cluster manager distributed with Spark called Standalone.
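
In code, the cluster manager choice shows up as the master URL on SparkConf (or the --master flag of spark-submit). The host names and ports below are placeholders, and "yarn-client" reflects Spark 1.x syntax.

```scala
import org.apache.spark.SparkConf

val local      = new SparkConf().setMaster("local[*]")                 // no cluster; all local cores
val standalone = new SparkConf().setMaster("spark://master-host:7077") // Spark Standalone
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn-client")              // Hadoop YARN
```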

04:28

Let's run a Spark Standalone cluster within your environment. We'll start a Spark Master and one Spark worker. We'll introduce the Spark UI web console.

06:34

Setup, compile and package a Scala Spark program using `sbt`.  `sbt` is short for "simple build tool" and is most often used in Scala-based projects. 


This is an easy example to ensure you're ready for the more advanced builds and cluster deployments later in this Apache Spark with Scala course.  
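
A minimal build.sbt for a Spark 1.x application looks something like this; the project name and versions are illustrative, and the Scala version should match your Spark build.

```scala
name := "spark-example"

version := "1.0"

scalaVersion := "2.10.6"

// "provided" keeps Spark itself out of the packaged jar; the cluster supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```

Running `sbt package` then produces a jar under target/ that is ready to hand to spark-submit.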

05:54

Let's configure an Apache Spark cluster running on two instances of Amazon EC2.

02:54
Before the EC2 cluster is ready to use from a locally running shell, we need to open port 7077.
02:52

Review key takeaways from this section on Spark running in a cluster and deploying a Scala based Spark program to the cluster.

4 questions

Reinforce the key takeaways from the Cluster section of the course.

Article

Convenient link to download all source code used in this section

Section 6: Spark SQL
03:20

Spark SQL background, key concepts and high-level examples of CSV, JSON and MySQL (JDBC) data sources. This lecture lays the groundwork for the next lectures in this course section. It provides overview examples and common patterns of Spark SQL from a Scala perspective.

06:05

Spark SQL uses a type of Resilient Distributed Dataset called DataFrames, which are composed of Row objects accompanied by a schema. The schema describes the data types of each column. A DataFrame may be considered similar to a table in a traditional relational database.

Methodology

We’re going to use the baby names dataset and the spark-csv package available from Spark Packages to make our lives easier. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames.” This library is compatible with Spark 1.3 and above.
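
The load itself is short. This sketch assumes the shell was started with spark-csv on the classpath (for example, spark-shell --packages com.databricks:spark-csv_2.10:1.4.0; the version is illustrative).

```scala
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line of the file is a header row
  .option("inferSchema", "true") // derive column types from the data
  .load("baby_names.csv")

df.printSchema()
df.registerTempTable("names")
sqlContext.sql("SELECT * FROM names LIMIT 5").show()
```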

08:56

Let's load a JSON input source to Spark SQL’s SQLContext. This Spark SQL JSON with Scala portion of the course has two parts. The first part shows examples of JSON input sources with a specific structure. The second part warns you of something you might not expect when using Spark SQL with JSON data source.

Methodology

We are going to use two JSON inputs. We’ll start with a simple, trivial example and then move to an analysis of a more realistic JSON example.
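
For the simple case, the read is one line. Note that Spark SQL's JSON source expects one JSON object per line of the file; the file and column names below are placeholders.

```scala
val people = sqlContext.read.json("people.json")

people.printSchema()   // the schema is inferred from the JSON structure
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```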


05:38

Now that we have Spark SQL experience with CSV and JSON, connecting to and using a MySQL database will be easy. So, let’s cover how to use Spark SQL with Scala and a MySQL database input data source.

Overview

We’re going to load data into a database. Then, we’re going to fire up spark-shell with a command line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.
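
The shape of the JDBC read is sketched below. All connection details are placeholders, and the shell needs the MySQL JDBC driver on its classpath (for example, via spark-shell --jars with the connector jar).

```scala
val df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb") // placeholder host and database
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "baby_names")                   // placeholder table
  .option("user", "root")
  .option("password", "secret")
  .load()

df.registerTempTable("baby_names")
sqlContext.sql("SELECT COUNT(*) FROM baby_names").show()
```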

07:43

Earlier in the course, we performed a simple deploy to an Apache Spark Cluster.  Let's build upon the simple example and deploy our Spark SQL code examples.    

Deploying the Spark SQL examples introduces a new challenge.  How do we deploy when our application uses third-party libraries such as CSV parsers and JDBC drivers?

1 page

Links to download Spark SQL code examples and videos on setting up MySQL

Section 7: Spark Streaming
03:08

Spark Streaming introduction, key concepts and our approach for learning Apache Spark Streaming through examples and by building our own streaming application.

00:52

Present an overview of the lessons contained in this Spark Streaming section.  Some of you may be able to skip the first two examples and move to a more complex Spark Streaming custom application.  

02:53

To ensure your environment is ready for more complex Spark Streaming examples, let's run through a trivial example.  This is a word count example which streams from the netcat utility found on Linux and Mac.  Windows users can check https://nmap.org/ncat/ for a version which may be used to run this example.
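
The example is essentially Spark's network word count; a sketch of it looks like this (run nc -lk 9999 in another terminal first, then type some lines into it):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(5))     // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)  // stream from netcat
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()            // start receiving and processing
ssc.awaitTermination() // run until stopped
```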

07:07

Let's continue to take one step at a time as we are learning Spark Streaming.  In this example, we will build and deploy a Spark Streaming application to a Spark cluster.

05:44

This video demonstrates our custom Spark Streaming application and how you can configure Slack to stream your own channel content.  

I think it's important to show you a running example of a Spark Streaming application.

10:11

Spark Streaming example code review.  Answers the questions -- how do I write my own custom receiver and how did the Slack Spark Streaming example work?
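
To give a feel for the answer ahead of the video, a custom receiver boils down to extending Receiver and calling store(). This skeleton is generic; the Slack-specific connection logic is omitted.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SimpleReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Receive on a separate thread so onStart() returns immediately.
    new Thread("Simple Receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("hello at " + System.currentTimeMillis) // hand one record to Spark
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  def onStop(): Unit = {
    // The isStopped() check above ends the loop; close any opened connections here.
  }
}

// Usage from a StreamingContext: val stream = ssc.receiverStream(new SimpleReceiver)
```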

01:24

Our Spark Streaming with Slack program contains third-party libraries.  As we've seen previously in the course, we can use the sbt-assembly plugin to make "fat jars" for Spark Driver programs using third-party libraries.
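
Wiring in sbt-assembly is a one-line addition to project/plugins.sbt (the version shown is illustrative); running `sbt assembly` then builds the fat jar.

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```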

But, what happens when things do not deploy according to plan?  

In this video, we'll cover three advanced issues when deploying to a Spark Cluster and how to address them.  

1) What happens if your Spark Driver program is compiled to Scala 2.11, but you are deploying to Spark compiled to Scala 2.10?

2) What happens if your 3rd party library conflicts with your Spark Cluster? 

3) What to do if your Spark Cluster uses a jar which is older and incompatible with a jar needed by your driver program?


02:54

We continue working through the three Spark cluster deployment issues described in the previous lecture.

Article

A list of resources used in this Spark Streaming section of the Apache Spark with Scala course.

Section 8: Spark Machine Learning
05:05

Machine Learning is an exciting and growing topic of interest these days.  Let's start this section on Spark MLlib with a background on Machine Learning.  

Afterwards, we'll have a foundation of machine learning concepts when we run demos and review source code in later videos in this Spark MLlib section.


03:30

In this video, let's run a demo of a custom Spark MLlib based program so we have some context when reviewing the source code later in the course.

In this demo, we'll train our machine learning model.  Then, we'll use the trained model to make predictions on an incoming data stream.  

That's right, we're going to make machine learning based predictions on data arriving from a Spark Streaming source.  

Should be fun :)
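
As a rough sketch of the idea, train a model up front and then apply it inside the stream. KMeans stands in here for whatever model the demo actually uses; the file name, host, port and parameters are all placeholders.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 1. Train on historical data: one comma-separated feature vector per line.
val training = sc.textFile("training_data.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val model = KMeans.train(training, 2, 20)   // k = 2 clusters, 20 iterations

// 2. Predict on live data, micro-batch by micro-batch.
val ssc = new StreamingContext(sc, Seconds(5))
val predictions = ssc.socketTextStream("localhost", 9999)
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .map(v => (v, model.predict(v)))          // (features, predicted cluster)
predictions.print()

ssc.start()
ssc.awaitTermination()
```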

11:03

Review of Spark MLlib based source code from the demo of the near real-time machine learning prediction model.  The model used a Spark Streaming data source which will also be analyzed.  


The code has tons of comments in it to help.  Also, the source code is available for students to download from the course repository.

03:09

Up to now, we've seen a machine learning demo of near real-time prediction on streaming data and we've reviewed the custom demo code.   

So, now let's cover aspects of machine learning specific to Spark MLlib.

Article

A suggested list of free resources for machine learning and Spark MLlib.

Section 9: Conclusion and Suggested Next Steps
00:58

Conclusion of version 2 of the Apache Spark with Scala course.  We review the content of version 2 of this course, suggest next steps and ask for ideas for version 3 of the Apache Spark with Scala course.


Version 1 major release:  End of January 2016

- Spark Core and Clustering


Version 1.1, 1.2, 1.3 minor releases: February, March 2016

- Section introductions

- Add more resources to each section

- Spark SQL section


Version 2 major release: May 2016

- Spark Streaming

- Spark machine learning with Spark MLlib



00:54

Bonus lecture with access to free Spark learning resources, course coupons, tutorials and free software development, data engineering and data science books.


Instructor Biography

Todd McGrath, Data Engineer, Software Developer, Mentor

Todd has an extensive and proven track record in software development leadership and building solutions for the world's largest brands and Silicon Valley startups.

His courses are taught using the same skills used in his consulting and mentoring projects.  Todd believes the only way to gain confidence and become productive is to be hands-on through examples.  Each new subject should build upon previous examples or presentations, so each step is also a way to re-emphasize a prior topic.

To learn more about Todd, visit his LinkedIn profile. 

Ready to start learning?
Take This Course