Apache Spark with Python - Learn by Doing
4.1 (91 ratings)
Instead of using a simple lifetime average, Udemy calculates a course's star rating by considering a number of different factors such as the number of ratings, the age of ratings, and the likelihood of fraudulent ratings.
748 students enrolled
50 Python source code examples and multiple deployment scenarios
Created by Todd McGrath
Last updated 2/2016
Current price: $10 Original price: $40 Discount: 75% off
30-Day Money-Back Guarantee
  • 2 hours on-demand video
  • 3 Articles
  • 1 Supplemental Resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Have confidence using Spark from Python
  • Understand Spark core concepts and processing options
  • Run Spark and Python on your own computer
  • Set up Spark on a new Amazon EC2 cluster
  • Deploy Python Programs to a Spark Cluster
  • Know what tools to use for Spark Administration
  • Certificate of completion
  • 30-day money-back guarantee
Requirements
  • Need a computer to run examples
  • Familiar with Python. Expertise not required, just basic understanding.

Would you like to advance your career, and would learning Apache Spark help?

There's no doubt Apache Spark is an in-demand skillset with higher pay. This course will help you get there.

This course prepares you for job interviews and technical conversations. At the end of this course, you can update your resume or CV with a variety of Apache Spark experiences.

Or maybe you need to learn Apache Spark quickly for a current or upcoming project?

How can this course help?

You will become confident and productive with Apache Spark after taking this course. You need to be confident and productive in Apache Spark to be more valuable.

Now, I'm not going to pretend here. You are going to need to put in work. This course puts you in a position to focus on the work you will need to complete.

This course uses Python, which is a fun, dynamic programming language perfect for both beginners and industry veterans.

At the end of this course, you will have a rock-solid foundation to accelerate your career and growth in the exciting world of Apache Spark.

Why choose this course?

Let's be honest. You can find Apache Spark learning material online for free. Using these free resources is okay for people with extra time.

This course saves you time and effort. It is organized in a step-by-step approach in which each lesson builds upon the previous ones. PowerPoint presentations are minimal.

The intended audience of this course is people who need to learn Spark in a focused, organized fashion. If you want a more academic approach to learning Spark, with 4-5 or more hours of lectures covering the same material found here, there are other courses on Udemy which may be better for you.

This Apache Spark with Python course emphasizes running source code examples.

All source code examples are available for download, so you can execute, experiment and customize for your environment after or during the course.

This Apache Spark with Python course covers over 50 hands-on examples. We run them locally first and then deploy them on cloud computing services such as Amazon EC2.

The following will be covered and more:

  • What makes Spark a power tool of Big Data and Data Science?
  • Learn the fundamentals of Spark including Resilient Distributed Datasets, Spark Actions and Transformations
  • Run Spark in a Cluster in your local environment and on Amazon EC2
  • Deploy Python applications to a Spark Cluster
  • Explore Spark SQL with CSV, JSON and MySQL (JDBC) data sources
  • Convenient links to download all source code
  • Reinforce your understanding through multiple quizzes and lecture recap

Who is the target audience?
  • People looking for career growth and new opportunities
  • People curious if Spark with Python could be a good solution for their technical challenges
  • People who do not want to evolve or learn new ways to do things should NOT take this course
Curriculum For This Course
31 Lectures
Course Overview
1 Lecture 03:52

Introducing the course objectives, benefits, instructor, and overall methodology for you to learn and become confident with Apache Spark with Python.

Preview 03:52
Apache Spark and Python Foundational Building Blocks
2 Lectures 14:56

I don't expect you to follow all the details of this video. I want to give you a big picture through source code examples of using Apache Spark and Python to analyze data.

We're going to start by running some code examples of Python against the Spark API through a Spark Driver program called PySpark.

I don't expect you to follow along with all these Python examples yet. I'll fill in the blanks later in the course.

We'll talk about what a "Spark Driver" program means later.

From PySpark, we're going to analyze some Uber data. Uber is a company which has disrupted the worldwide taxi industry. We're going to use Uber data for New York City from NYC's Taxi & Limousine Commission.

We can use Python and Spark to analyze this data. For example, we can determine the total number of Uber trips, the most popular Uber bases in NYC, etc.

In this example, we'll give a glimpse into Spark core concepts such as Resilient Distributed Datasets, Transformations, Actions and Spark drivers. In addition, we'll see code examples of how to use Python with Spark.

Again, I don't expect you to follow all the details here. It's intended as a high-level overview to begin.

Preview 08:57

Now that we've seen Spark with Python examples, let's continue by considering the key Spark concepts you need to know. These concepts will be used throughout the rest of this Spark with Python data science course.

We need to describe Resilient Distributed Datasets, Transformations, Actions, Spark Drivers and applications deployed to clusters in more detail.

This builds the foundations for later sections of the Spark with Python Data Science Power Tools course.

Apache Spark Fundamentals - The Essentials You Need to Know

Let's confirm our understanding of the foundations of Apache Spark. Just three questions to confirm the goals of this section of the Apache Spark course.

[Milestone] Key Concepts Quiz
3 questions
Prepare Your Environment
8 Lectures 15:17

Now that we've seen an example of data analysis with Python using Spark, let's configure your environment. With your own environment, you'll be able to run code from this Spark with Python course as well as experiment on your own.

If you already have Spark downloaded and installed, you can skip the next lecture. For Python setup, we're going to use a particular flavor of Python. So even if you don't end up using the version of Python used in this course, the recommendation is to still view the Python videos.

Let me know in the course comments if you have any questions. It should be a straightforward process. It just takes a while to download.

Setting Up Your Environment

If you are preparing your Apache Spark and Python environment on a Windows machine, please watch this video. It highlights two areas you need to know before proceeding to the following lectures.

For Windows Operating System Users Only

This video shows how and where to download and install Apache Spark used in this course. You are free to watch this course without installing Spark, but if you want to experiment with your own environment, you should download and install Spark on your own machine.

Download and Install Spark

Walk through installing the Python version used in this Spark with Python course. We're going to use the Anaconda Python version which provides us convenient access to many 3rd party Python libraries used in data science such as charts and graphs, math, etc.

Download and Install Python

At this point, let's confirm your Spark environment is running and we're able to interact with the Spark Python API.

To accomplish this, start up the pyspark Spark driver program. This is just a short video to show how to confirm your Spark with Python environment.

[Milestone] Setup Checkpoint

ipython notebook is not a requirement for this course. But, it may help if you decide to copy-and-paste from the provided source code examples.

This video goes through ipython setup. Also, see the private course discussions on how people with a variety of setups have succeeded in configuring Apache Spark with ipython notebook.

Check ipython notebook Setup (optional)

We're going to use a few different sample data files for this Apache Spark with Python course. This video shows how and where to download them.

Links for both files are also provided at the end of this section.

Sample Data Access

Hyperlinks to download Spark, Python and command reference

[Milestone] Setup References and Download Links
1 page
Apache Spark Transformations and Actions
6 Lectures 36:58

In essence, there are two kinds of Spark functions: Transformations and Actions. Transformations transform an existing RDD into a new, different one. Actions are functions used against RDDs to produce a value.

In this section of the Apache Spark with Python Data Science course, we'll go over a wide variety of Spark Transformation and Action functions.

This should build your confidence and understanding of how you can apply these functions to your use cases. It will also lay more of the foundation for us to build upon in your journey of learning Apache Spark with Python.

Spark Transformations and Actions Overview

We're going to break Apache Spark transformations into groups. In this video, we'll cover some common Spark transformations which produce RDDs. These include map, flatMap, filter, etc.

We're going to use a CSV dataset of baby names in New York. As we progress through transformations and actions in this Apache Spark with Python course, we'll determine more and more results for this sample data set.

So, let's begin with some commonly used Spark transformations.
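
As a rough illustration of what these three transformations do, here is a plain-Python sketch that mimics their semantics on an ordinary list, so it runs without a Spark installation. The helper names (rdd_map, rdd_flat_map, rdd_filter), the sample rows and the column layout are all made up for illustration and are not taken from the course files. In PySpark the corresponding calls would be rdd.map(f), rdd.flatMap(f) and rdd.filter(f).

```python
# Plain-Python stand-ins for three Spark transformations.
# These helpers are hypothetical: they only mimic the semantics of
# rdd.map, rdd.flatMap and rdd.filter on a local list.

def rdd_map(func, items):
    """One output element per input element."""
    return [func(item) for item in items]


def rdd_flat_map(func, items):
    """Each input element yields zero or more outputs, flattened together."""
    return [out for item in items for out in func(item)]


def rdd_filter(predicate, items):
    """Keep only the elements for which the predicate is true."""
    return [item for item in items if predicate(item)]


# Made-up rows in an assumed "year,name,county,sex,count" layout.
lines = [
    "2013,GAVIN,ST LAWRENCE,M,9",
    "2013,LEVI,ST LAWRENCE,M,9",
    "2012,MIA,KINGS,F,120",
]

rows = rdd_map(lambda line: line.split(","), lines)          # a list of lists
fields = rdd_flat_map(lambda line: line.split(","), lines)   # one flat list
mia_rows = rdd_filter(lambda row: row[1] == "MIA", rows)     # matching rows only

print(len(rows), len(fields), mia_rows)
```

The difference between map and flatMap is the one to watch: map keeps one list per input line, while flatMap merges every split field into a single flat collection.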

Spark Transformations Part 1

In part 2 of Spark Transformations, we'll discover Spark transformations used when we need to combine, compare and contrast elements in two RDDs. This is something we often have to do when working with datasets. Spark helps compare RDDs through transformation functions such as union, intersection, distinct, etc.
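
The following plain-Python sketch mimics how these three transformations treat duplicates. It is an emulation on local lists under assumed sample data, not PySpark code; the helper names are hypothetical. In Spark, union keeps duplicates, while distinct and intersection return each element once. Spark also makes no promise about element order, whereas this sketch happens to keep first-seen order.

```python
# Hypothetical local stand-ins for rdd.union, rdd.distinct and
# rdd.intersection, written to show how duplicates are handled.

def rdd_union(left, right):
    """Concatenate both collections; duplicates are kept."""
    return list(left) + list(right)


def rdd_distinct(items):
    """Drop duplicates (first-seen order is kept here for readability)."""
    return list(dict.fromkeys(items))


def rdd_intersection(left, right):
    """Elements present in both collections, each reported once."""
    right_set = set(right)
    return [item for item in rdd_distinct(left) if item in right_set]


# Made-up name lists for two years.
names_2012 = ["MIA", "EMMA", "MIA", "EMMA"]
names_2013 = ["EMMA", "LEVI", "EMMA"]

combined = rdd_union(names_2012, names_2013)
unique_names = rdd_distinct(combined)
in_both_years = rdd_intersection(names_2012, names_2013)

print(len(combined), unique_names, in_both_years)
```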

Spark Transformations Part 2

In part 3 of our focus on Spark Transformation functions, we're going to work with the "key" functions, including groupByKey, reduceByKey, aggregateByKey and sortByKey.

All these transformations work with key/value pair RDDs, so we will cover the creation of PairRDDs as well.

We'll continue to use the baby_names.csv file used in Part 1 and Part 2 of Spark Transformations.
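
To make the "key" functions more concrete, here is a plain-Python emulation of three of them on a local list of (key, value) tuples. The helper names and the sample pairs are hypothetical, and the sketch ignores partitioning and shuffling, which is where the real functions differ most in cost. In PySpark the corresponding calls would be pairs.reduceByKey(f), pairs.groupByKey() and pairs.sortByKey().

```python
from collections import defaultdict

# Hypothetical local stand-ins for three pair-RDD transformations.

def reduce_by_key(func, pairs):
    """Merge all values that share a key using a two-argument function."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return list(totals.items())


def group_by_key(pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def sort_by_key(pairs):
    """Order the pairs by key, ascending."""
    return sorted(pairs, key=lambda pair: pair[0])


# Made-up (name, count) pairs, as might be built from baby name rows.
name_counts = [("MIA", 120), ("LEVI", 9), ("MIA", 30), ("GAVIN", 9)]

totals = reduce_by_key(lambda a, b: a + b, name_counts)
grouped = group_by_key(name_counts)
ordered = sort_by_key(totals)

print(ordered)
```

For simple sums like this, reduceByKey is usually preferred over groupByKey in Spark, because values are combined within each partition before any data moves across the network.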

Spark Transformations Part 3

Let's confirm our understanding of Spark Transformations at this point.

[Milestone] Transformations Quiz
3 questions

Spark Actions return values to the Spark Driver program. Also, recall that calling an Action function against an RDD causes a previously lazy RDD to be evaluated. So, in the real world, when working with large datasets, we need to be careful when triggering RDD evaluation through Spark Actions.
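
Lazy evaluation can be seen in miniature with a Python generator. This is only an analogy under made-up data, not Spark code: building the generator plays the role of a transformation (nothing runs yet), and consuming it with list() plays the role of an action such as collect().

```python
# Analogy only: a Python generator stands in for a lazy RDD.

evaluated = []  # records which elements were actually processed


def traced_upper(name):
    """Upper-case a name and remember that the work was done."""
    evaluated.append(name)
    return name.upper()


names = ["mia", "levi", "gavin"]

# "Transformation": defining the pipeline does no work yet.
pipeline = (traced_upper(name) for name in names)
work_before_action = len(evaluated)

# "Action": asking for the results forces every element to be processed.
result = list(pipeline)
work_after_action = len(evaluated)

print(work_before_action, work_after_action, result)
```

On a large dataset the equivalent of that list() call is where all the cluster work happens, and an action like collect() also pulls every result back into the driver's memory.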

This video shows commonly used Spark Actions.

Spark Actions

Let's confirm our understanding of Spark Actions.

[Milestone] Spark Actions Quiz
2 questions

Provides links to download the source code (ipython notebook) used in this section of the course.

[Milestone] Download Resources and Source Code Access
Apache Spark Clusters
7 Lectures 22:16

Clusters allow Spark to process huge volumes of data by distributing the workload across multiple nodes. This is also referred to as "running in parallel" or "horizontal scaling".

A cluster manager is required to run Spark on a cluster. Spark supports three cluster managers: Apache Hadoop YARN, Apache Mesos, and a built-in cluster manager distributed with Spark called Standalone.

Let's cover the key concepts of this Spark Clustering section of the course.

Spark on a Cluster Introduction

Let's run a Spark Standalone cluster within your environment. We'll start a Spark Master and one Spark worker. We'll quickly go over the Spark UI web console. We'll return to the Spark UI console in a later lecture after we deploy a couple of Python programs to it.

Run Standalone Cluster

Now that we have a Spark cluster running, how do we use it? In this lecture of the Spark with Python course, we'll deploy a couple of Python programs. We'll start with a simple example and then progress to more complicated examples which include utilizing spark-packages and Spark SQL.

Deploy commands include:

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 examples/src/main/python/pi.py

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 examples/src/main/python/wordcount.py baby_names.csv
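
The shape of these commands can be sketched as a small Python helper that assembles the argument list. The helper name spark_submit_command and the host name in MASTER are hypothetical. The point is the ordering: spark-submit options come first, then the Python file, then any arguments meant for the program itself.

```python
# A hypothetical helper that assembles a spark-submit argument list.
# It only builds the list; it does not run anything.

def spark_submit_command(master, script, script_args=(), packages=None):
    command = ["bin/spark-submit", "--master", master]
    if packages:
        # --packages takes a comma-separated list of Maven coordinates.
        command += ["--packages", ",".join(packages)]
    # Options go before the script; everything after the script is
    # passed through to the Python program.
    command.append(script)
    command.extend(script_args)
    return command


# Placeholder host name: use the spark:// URL shown in your own
# Spark Master web UI.
MASTER = "spark://my-host.local:7077"

wordcount = spark_submit_command(
    MASTER,
    "examples/src/main/python/wordcount.py",
    ["baby_names.csv"],
)

print(" ".join(wordcount))
```

A list in this form could be handed to subprocess.run from the Spark installation directory, though typing the command in a terminal works just as well.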

Deploy Python Programs to the Cluster

Let's review a Python program which utilizes examples we've already seen in this Spark with Python course. It's a program which analyzes New York City Uber data using Spark SQL. The video will show the program in the Sublime Text editor, but you can use any editor you wish.

When deploying our driver program, we need to do things differently than we have while working with pyspark. For example, we need to obtain a SparkContext and SQLContext ourselves. We also need specific Python imports.

bin/spark-submit --master spark://todd-mcgraths-macbook-pro.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 uberstats.py Uber-Jan-Feb-FOIL.csv
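
A minimal sketch of what such a driver program might look like follows. It is not the course's actual uberstats.py. It assumes a Spark 1.4 to 1.6 installation, the spark-csv package supplied through --packages as in the command above, and the column names dispatching_base_number and trips in the Uber CSV file. The file name uberstats_sketch.py and the app name are made up.

```python
import sys


def main(argv):
    if len(argv) < 2:
        print("Usage: spark-submit [options] uberstats_sketch.py <csv-file>")
        return 1

    # Imported inside main() so the usage check above still works on a
    # machine without Spark installed.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # The pyspark shell creates these two objects for us; a deployed
    # driver program has to create them itself.
    sc = SparkContext(appName="UberStatsSketch")
    sqlContext = SQLContext(sc)

    # Spark 1.x style CSV load through the spark-csv package.
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")
          .load(argv[1]))
    df.registerTempTable("uber")

    # Column names below are assumptions about the CSV header.
    totals = sqlContext.sql(
        "SELECT dispatching_base_number, SUM(trips) AS total_trips "
        "FROM uber GROUP BY dispatching_base_number "
        "ORDER BY total_trips DESC")
    totals.show()

    sc.stop()
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

The master URL is deliberately left out of the code so that the same file can run locally or on a cluster depending on the --master option given to spark-submit.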

[Milestone] Write and Deploy Python Program to the Spark Cluster

The Spark UI was briefly introduced in a previous lecture. Let's return to it now that we have an available worker in the cluster and we have deployed some Python programs.

The Spark UI is the tool for Spark Cluster diagnostics, so we'll review the key attributes of the tool.

Spark Cluster Administrative Diagnostics - The Spark UI

Apache Spark can be run on a cluster of two or more Amazon EC2 instances. In this video, let's go over how to create a Spark cluster on EC2. We'll cover both the Spark side of the setup and how to configure Amazon EC2 authentication and authorization using the AWS Management Console.

Create an Amazon EC2 Based Cluster Part 1

We'll continue setting up the Spark cluster on EC2, with special attention to how we can use ipython notebook against our Spark cluster running in EC2.

Before the EC2 cluster is ready to use from ipython notebook, we need to open port 7077.

[Milestone] Create an Amazon EC2 Based Cluster Part 2

Let's confirm our understanding of Spark Clustering.

[Milestone] Spark Cluster Quiz
3 questions
Spark SQL
5 Lectures 20:47

Spark SQL is perfect for those coming from a SQL background. It allows us to use SQL against a variety of datasets including CSV, JSON and JDBC databases. The Spark code for working with these datasets looks the same!

In this section of the Spark with Python course, we're going to discuss a certain kind of RDD used with Spark SQL. Then, we're going to cover Spark SQL through input data source examples such as CSV, JSON and a MySQL database.

Preview 03:13

Spark SQL uses DataFrames, which are distributed collections of Row objects accompanied by a schema and are built on top of Resilient Distributed Datasets. The schema describes the data type of each column. A DataFrame may be considered similar to a table in a traditional relational database.


We’re going to use the Uber dataset, ipython notebook and the spark-csv package available from Spark Packages to make our lives easier. The spark-csv package is described as a “library for parsing and querying CSV data with Apache Spark, for Spark SQL and DataFrames”. This library is compatible with Spark 1.3 and above.

Spark SQL with New York City Uber Trips CSV Source

Let's load a JSON input source to Spark SQL’s SQLContext. This Spark SQL JSON with Python portion of the course has two parts. The first part shows examples of JSON input sources with a specific structure. The second part warns you of something you might not expect when using Spark SQL with a JSON data source.


We are going to use two JSON inputs. We’ll start with a simple, trivial example and then move to an analysis of historical World Cup player data.

The World Cup Player source data may be downloaded from the github repo for this course or https://raw.githubusercontent.com/jokecamp/FootballData/master/World%20Cups/all-world-cup-players.json
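
One structural requirement worth knowing about, which may or may not be the surprise the lecture has in mind, is that Spark 1.x reads JSON input line by line and expects each line to hold one complete JSON object. A pretty-printed file containing a single large array does not fit that shape. The sketch below uses only the Python standard library to convert an array of records into one-object-per-line text. The function name to_json_lines and the sample records are made up for illustration.

```python
import json


def to_json_lines(json_array_text):
    """Turn a JSON array of objects into one JSON object per line."""
    records = json.loads(json_array_text)
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Made-up sample in the multi-line, pretty-printed style.
sample = """
[
  {"Team": "Brazil", "Year": 1958},
  {"Team": "Italy", "Year": 1934}
]
"""

converted = to_json_lines(sample)
print(converted)
```

Text in this form can be saved to a file and loaded with sqlContext.read.json in Spark 1.4 and later.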

Spark SQL with Historical World Cup Player Statistics JSON Source

Now that we have Spark SQL experience with CSV and JSON, connecting and using a MySQL database will be easy. So, let’s cover how to use Spark SQL with Python and a MySQL database input data source.


We’re going to load some NYC Uber data into a database. Then, we’re going to fire up pyspark with a command line argument to specify the JDBC driver needed to connect to the JDBC data source. We’ll make sure we can authenticate and then start running some queries.

Spark SQL with MySQL (JDBC) source

All the source code used in the Spark SQL section of the Spark with Python course is available from the course github repo.

[Milestone] Spark SQL Resources and Download Source Code
Conclusion and Free Bonus Lecture
2 Lectures 01:10

Thanks for taking the course! I hope you enjoyed the course and you are feeling comfortable and confident using Python with Apache Spark. If you have any questions or suggestions on how to improve this course, just let me know in the course discussions forum.

Apache Spark with Python Course Conclusion and Looking Ahead

Be sure to visit http://www.supergloo.com for discount coupons to my other Udemy courses and links to my Spark related books. Also, you'll have a chance to sign up for my mailing list where I send announcements of new courses, books, Spark tutorials, etc.

Come check it out!

Bonus Lecture: Access to Free Books and Course Discounts
About the Instructor
Todd McGrath
4.1 Average rating
181 Reviews
1,777 Students
2 Courses
Data Engineer, Software Developer, Mentor

Todd has an extensive and proven track record in software development leadership and building solutions for the world's largest brands and Silicon Valley startups.

His courses are taught using the same skills used in his consulting and mentoring projects. Todd believes the only way to gain confidence and become productive is to be hands-on through examples. Each new subject should build upon previous examples or presentations, so each step is also a way to reemphasize a prior topic.

To learn more about Todd, visit his LinkedIn profile.