Apache Spark with Scala By Example

Advance your Spark skills and become more valuable, confident, and productive
3.5 (70 ratings)
1,109 students enrolled
  • Lectures 45
  • Length 3 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion


About This Course

Published 12/2015 English

Course Description

Understanding how to manipulate, deploy and leverage Apache Spark is quickly becoming essential for data engineers, architects, and data scientists.  So, it's time for you to stay ahead of the crowd by learning Spark with Scala from an industry veteran and nice guy. 

This course is designed to give you the core principles needed to understand Apache Spark and build your confidence through hands-on experiences. 

In this course, you’ll be guided through a wide range of core Apache Spark concepts using Scala source code examples, all of which are designed to give you fundamental, working knowledge.  Each section carefully builds upon previous sections, so your learning is reinforced every step of the way.  

All of the source code is conveniently available for download, so you can run and modify it yourself.  

Here are just a few of the concepts this course will teach you using more than 50 hands-on examples: 

  • Learn the fundamentals and run examples of Spark's Resilient Distributed Datasets, Actions and Transformations through Scala
  • Run Spark on your local cluster and also Amazon EC2
  • Troubleshooting tricks when deploying Scala applications to Spark clusters
  • Explore Spark SQL with CSV, JSON and MySQL database (JDBC) data sources
  • Discover Spark Streaming through numerous examples and build a custom application which streams from Slack
  • Hands-on machine learning experiments with Spark MLlib
  • Reinforce your understanding through multiple quizzes and lecture recaps

Check out the free preview videos below!

As an added bonus, this course will teach you about Scala and the Scala ecosystem such as SBT and SBT plugins to make packaging and deploying to Spark easier and more efficient.  

As another added bonus, on top of all the extensive course content, the course offers a private message board so you can ask the instructor questions at any time during your Spark learning journey.

This course will make you more knowledgeable about Apache Spark.  It offers you the chance to build your confidence, productivity and value in your Spark adventures. 

What are the requirements?

  • Prior programming or scripting experience in at least one programming language is preferred, but not required.
  • An interest in training for a new career or advancing your current one
  • Curiosity about how and when the Apache Spark ecosystem might be beneficial for your operations or product development efforts

What am I going to get from this course?

  • Gain confidence and hands-on knowledge exploring, running and deploying Apache Spark
  • Access to a wide variety of Spark with Scala, Spark SQL, Spark Streaming and Spark MLlib source code examples
  • Create hands-on Spark environments for experimenting with course examples
  • Participate in course discussion boards with instructor and other students
  • Know when and how Spark with Scala, Spark SQL, Spark Streaming and Spark MLlib may be an appropriate solution

Who is the target audience?

  • People looking to expand their working knowledge of Apache Spark and Scala
  • People who want to learn more about the Spark ecosystem such as Spark SQL, Spark Streaming and Spark MLlib
  • Software developers wanting to expand their skills and abilities for future career growth. Spark with Scala is an in-demand skill set.
  • Anyone who suspects an on-demand Spark course with access to both source code and questions and answers with the instructor is probably more efficient than buying a Spark book or reading blog posts



Section 1: Introduction

Let's show and describe the structure of this Apache Spark with Scala course from a high level. 

  1. What Apache Spark topics will be covered? 
  2. Why is it structured this way? 
  3. What are the course activities and resources? 

After watching this video, you'll know how the sections in this course build upon one another.  So, as we progress through Spark Core and Spark SQL, we know these beginning sections will be relevant when learning Spark Streaming and Spark MLlib. 


Download, review and run the source code.  Customize the source code and re-run.  The way to build confidence is through doing.  

Participate in the course discussion boards.  Through discussion and collaboration, you'll have the opportunity to teach others and ask questions.  This will strengthen your Spark with Scala skills.

A note for Windows users.

Where and how to download the course source code.


Provides link to download all source code used in this Apache Spark with Scala course.

Section 2: Introducing the Apache Spark Fundamentals

Before we jump into Spark with Scala examples, let's present a high-level overview of the key concepts you need to know. These fundamentals will be used throughout the rest of this Spark with Scala course.

Key constructs: Resilient Distributed Datasets (RDDs), Transformations, Actions, Spark Driver programs, SparkContext and how applications deployed to a Spark cluster utilize the parallel nature of Spark.


We're going to be running many examples in this next section.  I don't expect you to follow every detail.  Rather, I just want you to experience loading external data and running some simple examples of Spark Transformations and Actions. 


To begin the course, let's run some Spark code with Scala from the shell.

I don't expect you to follow all the details of this code. I just want to get us motivated to continue our Spark learning adventure.

In this example, we'll get a glimpse into Spark core concepts such as Resilient Distributed Datasets, Transformations, Actions and Spark drivers from a Scala perspective. Again, I'll fill in all the details of this Scala code in later lectures.

3 questions

Before moving to more advanced examples, we need to ensure the Apache Spark fundamentals are understood. This quiz will ensure the student is ready to proceed.

Section 3: Preparing your Spark environment

In this section of the Spark with Scala course, we'll set up and verify your Spark with Scala environment. With your own environment in place, you can choose to run the course examples and experiment with the Scala Spark API.


Walk through all steps required to set up Apache Spark on your machine.


We need sample data to run Scala examples in the Spark Console. This lecture will prepare the Apache Spark environment for loading data and confirm the Spark console is working.


Reference links used in this section of the Spark with Scala course

Section 4: Deeper Dive into Spark Actions and Transformations

There are two kinds of Spark functions: Transformations and Actions. Transformations transform an existing RDD into a new, different one. Actions are functions used against RDDs to produce a value.

In this section of the Apache Spark with Scala course, we'll go over a variety of Spark Transformation and Action functions.

This should build your confidence and understanding of how you can apply these functions to your use cases. It will also create more foundation for us to build upon in your journey of learning Apache Spark with Scala.
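
The distinction between Transformations and Actions can be sketched in a few lines of Scala. This is a minimal illustration rather than code from the course: the app name, the local master setting and the in-memory numbers are all assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode and a made-up app name. In spark-shell, `sc` already
// exists and getOrCreate simply returns it.
val conf = new SparkConf().setAppName("transformations-vs-actions").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical in-memory data standing in for a real dataset.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformation: describes a new RDD derived from an existing one.
val doubled = numbers.map(_ * 2)

// Actions: run the computation and return plain values to the driver.
val total = doubled.reduce(_ + _)   // 30
val howMany = doubled.count()       // 5
```

Note that `map` hands back another RDD, while `reduce` and `count` hand back an `Int` and a `Long`.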


What are Spark Transformations? Let's review common Spark Transformation functions through Scala code examples.

We're going to break Apache Spark transformations into groups. In this video, we'll cover some common Spark transformations which produce RDDs, such as map, flatMap and filter.

We're going to use a CSV dataset of baby names in New York. As we progress through transformations and actions in this Apache Spark with Scala course, we'll determine more and more results for this sample data set.

So, let's begin with some commonly used Spark transformations.
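
As a rough sketch of what map, flatMap and filter do, the following uses a few hypothetical CSV-style lines. The column layout (year, name, county, sex, count) and the values are assumptions for illustration and may not match the course's baby names file.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell users can rely on the existing `sc`.
val conf = new SparkConf().setAppName("basic-transformations").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical rows; the real dataset's columns may differ.
val lines = sc.parallelize(Seq(
  "2013,GAVIN,ST LAWRENCE,M,9",
  "2013,LEVI,ST LAWRENCE,M,9",
  "2012,EMMA,KINGS,F,52"))

// map: one output element per input element (here, an array of fields).
val rows = lines.map(_.split(","))

// flatMap: each input element may yield many output elements (one per field).
val fields = lines.flatMap(_.split(","))

// filter: keep only the elements that satisfy a predicate.
val from2013 = rows.filter(row => row(0) == "2013")
```

With three input lines of five fields each, `rows` holds 3 elements, `fields` holds 15, and `from2013` holds 2.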


In part 2 of Spark Transformations, we'll discover Spark transformations used when we need to combine, compare and contrast elements in two RDDs. This is something we often have to do when working with datasets. Spark helps compare RDDs through transformation functions such as union, intersection and distinct.
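
A minimal sketch of these set-style transformations, using two small hypothetical RDDs of integers rather than the course data:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell users can rely on the existing `sc`.
val conf = new SparkConf().setAppName("set-transformations").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val left = sc.parallelize(Seq(1, 2, 3, 4))
val right = sc.parallelize(Seq(3, 4, 5))

// union keeps duplicates: 4 + 3 = 7 elements.
val combined = left.union(right)

// intersection keeps only elements present in both RDDs.
val common = left.intersection(right)

// distinct removes the duplicates that union left behind.
val unique = combined.distinct()
```

Element order after `intersection` and `distinct` is not guaranteed, so sort the collected results before comparing them.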


In part 3 of our focus on Spark Transformation functions, we're going to work with the "key" functions including groupByKey, reduceByKey, aggregateByKey and sortByKey.

All these transformations work with key/value pair RDDs, so we will cover the creation of PairRDDs as well.

We'll continue to use the baby_names.csv file used in Part 1 and Part 2 of Spark Transformations.
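
The by-key transformations can be sketched on a small pair RDD. The (name, count) tuples below are hypothetical stand-ins, not rows taken from baby_names.csv.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell users can rely on the existing `sc`.
// On Spark 1.2 and earlier, also import org.apache.spark.SparkContext._
val conf = new SparkConf().setAppName("by-key-transformations").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// An RDD of tuples is a pair RDD: the first element is the key.
val pairs = sc.parallelize(Seq(("EMMA", 52), ("LEVI", 9), ("EMMA", 10), ("GAVIN", 9)))

// reduceByKey: merge the values of each key with a function.
val totals = pairs.reduceByKey(_ + _)

// groupByKey: collect all values of each key into an Iterable.
val grouped = pairs.groupByKey()

// aggregateByKey: build a (sum, occurrences) pair per key from a zero value.
val sumAndCount = pairs.aggregateByKey((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))

// sortByKey: order the pairs by key (ascending by default).
val sortedTotals = totals.sortByKey()
```

When the goal is a per-key sum or count, `reduceByKey` and `aggregateByKey` combine values within each partition before shuffling, which is usually cheaper than `groupByKey` followed by a manual reduction.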

3 questions

Test and confirm your knowledge of Spark Transformations.


Run and review common Spark actions. You have already seen many Spark action examples before this lecture, so we will go quickly to review.

Spark Actions produce values back to the Spark Driver program. Also, recall that Action functions called against an RDD cause a previously lazy RDD to be evaluated. So, in the real world when working with large datasets, we need to be careful when triggering RDDs to be evaluated through Spark actions.

This video shows commonly used Spark Actions.
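
The point about lazy evaluation can be illustrated with a short sketch. The range of numbers is a hypothetical dataset chosen only to make the results easy to check.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; spark-shell users can rely on the existing `sc`.
val conf = new SparkConf().setAppName("lazy-actions").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(1 to 1000000)

// Transformation only: nothing is computed when this line runs.
val evens = numbers.filter(_ % 2 == 0)

// take(n) is an action that returns just the first n elements to the driver.
val firstThree = evens.take(3)   // Array(2, 4, 6)

// count() is an action that evaluates the whole RDD but returns a single Long.
val howMany = evens.count()      // 500000
```

An action such as `collect()` would instead bring every element back to the driver, which is why `take` or `count` is often the safer choice when exploring a large dataset.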

2 questions

Test and confirm knowledge of Spark Actions.


Links to conveniently download the Spark source code examples presented in this section of the course.  Also, links to the latest programming guides for Spark Transformations and Actions are included. 

Section 5: Utilizing Clusters with Apache Spark

Clusters allow Spark to process huge volumes of data by distributing the workload across multiple nodes. This is also referred to as "running in parallel" or "horizontal scaling".

A cluster manager is required to run Spark on a cluster. Spark supports three cluster managers: Apache Hadoop YARN, Apache Mesos and a built-in cluster manager distributed with Spark called Standalone.


Let's run a Spark Standalone cluster within your environment. We'll start a Spark Master and one Spark worker. We'll introduce the Spark UI web console.


Set up, compile and package a Scala Spark program using `sbt`.  `sbt` is short for "simple build tool" and is most often used in Scala-based projects. 

This is an easy example to ensure you're ready for more advanced builds and cluster deploys later in this Apache Spark with Scala course.  
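
For orientation, a minimal `build.sbt` for a Spark program might look like the sketch below. The project name and the Scala and Spark version numbers are assumptions chosen for illustration, not the versions used in the course; match them to your own cluster.

```scala
// build.sbt (hypothetical project name and versions)
name := "spark-sample"

version := "1.0"

// Should match the Scala version your Spark distribution was built with.
scalaVersion := "2.10.6"

// %% appends the Scala binary version to the artifact name (spark-core_2.10).
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
```

Running `sbt package` from the project directory compiles the sources and writes a jar under `target/`.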


Let's configure an Apache Spark cluster running on two instances of Amazon EC2.

Before the EC2 cluster is ready to use from a locally running shell, we need to open port 7077.

Review key takeaways from this section on Spark running in a cluster and deploying a Scala based Spark program to the cluster.

4 questions

To reinforce the key takeaways from the Cluster section of the course


Convenient link to download all source code used in this section

Section 6: Spark SQL

Spark SQL background, key concepts and high-level examples of CSV, JSON and MySQL (JDBC) data sources. This lecture lays the groundwork for the next lectures in this course section. It provides overview examples and common patterns of Spark SQL from a Scala perspective.


Spark SQL is built around DataFrames, distributed collections of Row objects accompanied by a schema. The schema describes the data type of each column. A DataFrame may be considered similar to a table in a traditional relational database.
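
To make the Row-plus-schema idea concrete, here is a small sketch that builds a DataFrame by hand. The column names and values are hypothetical, and it uses the `SQLContext` entry point of the Spark 1.x era (later Spark versions favour `SparkSession`).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: local mode; spark-shell users can rely on the existing `sc`.
val conf = new SparkConf().setAppName("dataframe-schema").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = new SQLContext(sc)

// Hypothetical rows: a name and a total.
val rows = sc.parallelize(Seq(Row("EMMA", 52), Row("LEVI", 9)))

// The schema names each column and states its data type.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("total", IntegerType, nullable = false)))

val babyNames = sqlContext.createDataFrame(rows, schema)
babyNames.printSchema()

// Columns can now be referenced by name, much like in a SQL table.
val popular = babyNames.filter(babyNames("total") > 10).select("name")
```

Here `popular` contains the single row for EMMA, since only that total exceeds 10.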


We’re going to use the baby names dataset and the spark-csv package available from Spark Packages to make our lives easier. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames”. This library is compatible with Spark 1.3 and above.


Let's load a JSON input source to Spark SQL’s SQLContext. This Spark SQL JSON with Scala portion of the course has two parts. The first part shows examples of JSON input sources with a specific structure. The second part warns you of something you might not expect when using Spark SQL with a JSON data source.


We are going to use two JSON inputs. We’ll start with a simple, trivial example and then move to an analysis of a more realistic JSON example.


Now that we have Spark SQL experience with CSV and JSON, connecting to and using a MySQL database will be easy. So, let’s cover how to use Spark SQL with Scala and a MySQL database input data source.


We’re going to load data into a database. Then, we’re going to fire up spark-shell with a command line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.


Earlier in the course, we performed a simple deploy to an Apache Spark Cluster.  Let's build upon the simple example and deploy our Spark SQL code examples.    

Deploying the Spark SQL examples introduces a new challenge.  How do we deploy when our application uses 3rd party libraries such as CSV parsing and JDBC drivers?

1 page

Links to download Spark SQL code examples and videos on setting up MySQL

Section 7: Spark Streaming

Spark Streaming introduction, key concepts and our approach for learning Apache Spark Streaming through examples and building our own streaming application.


Present an overview of the lessons contained in this Spark Streaming section.  Some of you may be able to skip the first two examples and move to a more complex Spark Streaming custom application.  


To ensure your environment is ready for more complex Spark Streaming examples, let's run through a trivial example.  This is a word count example which streams from the netcat utility found on Linux and Mac.  For Windows users, check https://nmap.org/ncat/ which may be used to run this example.
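
A streaming word count of this kind can be sketched as follows. This is not the course's code: the host, port, batch interval and the helper names `wordPairs` and `run` are assumptions made for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical helper: split a line into (word, 1) pairs, skipping empty tokens.
def wordPairs(line: String): Seq[(String, Int)] =
  line.split(" ").filter(_.nonEmpty).map(word => (word, 1)).toSeq

// Hypothetical entry point: blocks until the streaming job is stopped.
def run(): Unit = {
  // At least two local threads: one receives data, the other processes it.
  val conf = new SparkConf().setAppName("netcat-word-count").setMaster("local[2]")
  // In spark-shell, use new StreamingContext(sc, Seconds(5)) instead.
  val ssc = new StreamingContext(conf, Seconds(5))

  // Assumed source: text lines arriving on localhost:9999.
  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(line => wordPairs(line)).reduceByKey(_ + _)

  // Print the first few counts of every 5-second batch.
  counts.print()

  ssc.start()
  ssc.awaitTermination()
}
```

To try it, start a listener in another terminal (for example `nc -lk 9999`), call `run()`, and type some words into the netcat session.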


Let's continue to take one step at a time as we are learning Spark Streaming.  In this example, we will build and deploy a Spark Streaming application to a Spark cluster.


This video demonstrates our custom Spark Streaming application and how you can configure Slack to stream your own channel content.  

I think it's important to show you running example of Spark Streaming application a 


Spark Streaming example code review.  Answers the questions -- how do I write my own custom receiver and how did the Slack Spark Streaming example work?


Our Spark Streaming with Slack program contains 3rd party libraries.  As we've seen previously in the course, we can use the sbt-assembly plugin to make "fat jars" for Spark Driver programs using 3rd party libraries.

But, what happens when things do not deploy according to plan?  

In this video, we'll cover three advanced issues when deploying to a Spark Cluster and how to address them.  

1) What happens if your Spark Driver program is compiled to Scala 2.11, but you are deploying to Spark compiled to Scala 2.10?

2) What happens if your 3rd party library conflicts with your Spark Cluster? 

3) What to do if your Spark Cluster uses a jar which is older and incompatible with a jar needed by your driver program?




A list of resources used in this Spark Streaming section of the Apache Spark course tutorials 

Section 8: Spark Machine Learning

Machine Learning is an exciting and growing topic of interest these days.  Let's start this section on Spark MLlib with a background on Machine Learning.  

Afterwards, we'll have a foundation of machine learning concepts when we run demos and review source code in later videos in this Spark MLlib section.


In this video, let's run a demo of a custom Spark MLlib based program so we have some context when reviewing the source code later in the course.

In this demo, we'll train our machine learning model.  Then, we'll use the trained model to make predictions on an incoming data stream.  

That's right, we're going to make machine learning based predictions on data arriving from a Spark Streaming source.  

Should be fun :)


Review of Spark MLlib based source code from the demo of the near real-time machine learning prediction model.  The model used a Spark Streaming data source which will also be analyzed.  

The code has tons of comments in it to help.  Also, the source code is available for students to download from the course repository.


Up to now, we've seen a machine learning demo of near real-time prediction of stream data and we've reviewed the custom demo code.   

So, now let's cover aspects of machine learning specific to Spark MLlib.


A suggested list of free resources for machine learning and Spark MLlib.

Section 9: Conclusion and Suggested Next Steps

Conclusion of version 2 of the Apache Spark with Scala course.  We review the content of version 2 of this course, suggested next steps and ask for ideas for version 3 of the Apache Spark with Scala course.

Version 1 major release:  End of January 2015

- Spark Core and Clustering

Version 1.1, 1.2, 1.3 minor releases: February, March 2016

- Section introductions

- Add more resources to each section

- Spark SQL section

Version 2 major release: May 2016

- Spark Streaming

- Spark machine learning with Spark MLlib


Bonus lecture with access to free Spark learning resources, course coupons, tutorials and free software development, data engineering and data science books.


Instructor Biography

Todd McGrath, Data Engineer, Software Developer, Mentor

Todd has an extensive and proven track record in software development leadership and building solutions for the world's largest brands and Silicon Valley startups.

His courses are taught using the same skills used in his consulting and mentoring projects.  Todd believes the only way to gain confidence and become productive is to be hands-on through examples.  Each new subject should build upon previous examples or presentations, so each step is also a way to reemphasize a prior topic.

To learn more about Todd, visit his LinkedIn profile. 
