Learn Apache Spark in Python

Processing a million word text corpus using Pyspark and Window SQL
New
0.0 (0 ratings)
4 students enrolled
Last updated 6/2020
English
English [Auto]
This course includes
  • 4 hours on-demand video
  • 9 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • You will learn the physical components of a Spark cluster and the Spark computing framework.
  • You will build your own local standalone cluster.
  • You will write Spark code.
  • You will learn how to run Spark jobs.
  • You will create Spark tables and query them using SQL.
  • You will learn a process for creating successful Spark applications.
  • You will profile a Spark application.
  • You will tune a Spark application.
  • You will learn 30+ Spark commands.
  • You will use Spark SQL window functions.
Requirements
  • Be familiar with the Python programming language.
  • Be comfortable at the Unix command line.
  • Have a data orientation and enjoy combing through huge piles of data.
  • Have a laptop and internet connection.
Description

In this course you'll learn the physical components of a Spark cluster, and the Spark computing framework. You’ll build your own local standalone install of Pyspark. You’ll write Spark code. You’ll learn how to run Spark jobs in a variety of ways. You’ll create Spark tables and query them using SQL. You will learn details of Spark internals. You will learn a process for creating successful Spark applications.
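
As a first taste of what that looks like in practice, here is a minimal sketch of a PySpark session that loads a text file, registers it as a table, and queries it with SQL. The file path and table name are illustrative only, not taken from the course materials:

    from pyspark.sql import SparkSession

    # Start a local SparkSession, the entry point to Spark 2.x.
    spark = (SparkSession.builder
             .appName("intro-sketch")
             .master("local[*]")
             .getOrCreate())

    # Load a plain-text file; each line becomes one row in a column named "value".
    lines = spark.read.text("data/sample.txt")   # illustrative path

    # Register the dataframe as a temporary table and query it with SQL.
    lines.createOrReplaceTempView("lines")
    spark.sql("SELECT COUNT(*) AS line_count FROM lines").show()

    spark.stop()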


What am I going to get from this course?

  • Build Spark applications.

  • Tune Spark applications.

  • Profile Spark applications.

  • Tackle 3 coding projects.

  • Learn 30+ Spark commands.

  • Master Spark SQL window functions.

  • Step through 900 lines of Spark code.

  • Apply pro tips and best practices tested in production.


Why should you take this course?


Because your time is valuable and you are data driven. Lost opportunity is a major contributor to future regret. Apply these core underlying principles to your own life and build your own dreams.

Technology moves fast. Keeping up using only free materials from scattered sources is penny wise and pound foolish. Using only your own judgment to sift through the abundance of materials before you have learned what is relevant, and what is not, is backwards. You will be better able to judge the relevance of new content to your mission once you have a good foundation. You will be able to more rapidly consume -- and more importantly : synthesize -- new materials. Learning in a way that allows you to synthesize what you have learned is more effective than simply devouring all material indiscriminately.

Have you ever learned a new spoken language? Language instructors will tell you : it is better to learn an aspect of the language deeply and be able to apply it immediately. This will get your conversational skills up to speed the most rapidly. Then, your ability to utilize what you have learned, and with confidence, starts to have a compounding effect. Learning a programming language is analogous. There is a lot to learn and too little time to exhaustively cover the abundance of material that is available.

There are core principles that, when leveraged effectively, accelerate you rapidly through the ramp-up phase.

The best learning regimen provides new skills and insights, but also solidifies your present knowledge base. Your ability to leverage your current level of knowledge and build on it through learning is akin to building on your principal capital, namely, your knowledge base and skill set.

With the proper approach, learning has a compounding effect on knowledge. Once you have mastery of what you have learned in both breadth and depth you are positioned to put your knowledge to work for your mission, and convert it into a currency you can spend to improve your life and the lives of those around you.

New learners are entering this field every day. Learners that make effective use of their time quickly surpass learners who do not.

Be mindful of the most important resource you wield : your attention. Spend it wisely. Learners who know the same things as all the others don't stand out from the crowd. They may do well enough in a rising tide that lifts all boats. The doers who have an edge will win a slot on the fastest ship in the finest fleet and do more than just ride the tide. They will accomplish what was previously unimagined. Investing in yourself is wise; being overly frugal about how you spend your precious time is foolish. Your attention is limited. Most importantly, it is not free. With every day that you put off your future you miss out on the compounding effect that future growth has on present capability. Meanwhile the gap between you and other learners who understand these lessons widens.

A small edge today blossoms into a bigger one next week, and a gaping lead by next month.


Use your time and attention wisely.


Don't scrimp on your biggest investment :


Yourself.





Cut through the clutter

If you want to ramp up quickly on Apache Spark, then these courses are for you.

There is an abundance of training material available for learning Spark.

Then why should you use this course?

Because your time is valuable. Wading through tens of thousands of resources to find the ones appropriate to your level of analysis, while filtering out those that are outdated, takes time and judgment. Time spent ramping up on the fundamentals has an opportunity cost that is far higher than the price of this course.

I've collected a comprehensive yet succinct lesson plan containing material gleaned from hundreds of sources, including pro tips from Spark contributors, insights from Spark consultants, many conference presentations, and first-hand experience applying this powerful tool to big data applications in production serving millions of users. If you want to cut through the clutter and ramp up quickly, then this course will help you accomplish that.



What you will learn in this course

We cover a range of topics, from concepts to architecture to managing your development environment, as well as several use cases.

This course is not only for developers, but also for managers and technical leads who don’t necessarily write code yet want to be able to perform code reviews or analyze the performance of a running application.



Prerequisites

All you will need is a standard development-grade laptop and an internet connection.

You should be comfortable at the Unix command line.

Although this course is taught using the Python programming language, it is also suitable for users who plan to primarily utilize R and SQL.



What you will do in this course

  • Step through over 900 lines of code in 9 original exercises utilizing over 30 Spark commands

  • Learn best practices that are tested in production applications

  • Utilize over 30 Spark commands, including :

    • read, select, count, join, where, groupBy, show, drop_duplicates, distinct, limit, subtract, withColumn, sort, split, explode, lower, length, alias, UDFs, explain, sql, cache, persist, unpersist, catalog, listTables, isCached, createOrReplaceTempView, monotonically_increasing_id, cacheTable, clearCache, and others (several of these appear in the short sketch below).
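
As a preview, here is a minimal sketch that chains several of these commands to find the most frequent long words in a text file. The file path is illustrative; the course exercises use their own datasets:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode, length, lower, split

    spark = SparkSession.builder.appName("commands-preview").master("local[*]").getOrCreate()

    # read, select, split, explode, lower, alias, where, length
    words = (spark.read.text("data/corpus.txt")
             .select(explode(split(lower(col("value")), r"\s+")).alias("word"))
             .where(length(col("word")) > 3))

    # distinct, count
    print(words.distinct().count())

    # groupBy, sort, limit, show
    words.groupBy("word").count().sort(col("count").desc()).limit(10).show()

    spark.stop()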



Applications you can tackle with Spark

  • Discover statistically improbable phrases in text

  • Perform anomaly detection on log data

  • Apply topic modeling to text

  • Build a recommender

  • Do trend analysis



What you can do with Spark

  • Develop on your laptop and migrate to a cluster later without changing your code.

  • Spin up a cluster to use for a short period of time and spin it down when finished.

  • Run applications in a shared long-running cluster environment that autoscales up and down in size as workloads demand.

  • Work within a dynamically typed or a statically typed language, including Python, Scala, Java, R, and SQL.

  • Access data stores including AWS S3, HDFS, Hive, HBase, Cassandra, or any Hadoop data source, as well as Kafka, Redshift, among others.

  • Develop and deploy using the same language and framework.

  • Develop and test directly on a cluster.

  • Deploy applications programmatically.

  • Manage clusters programmatically.



What you'll be able to do upon completing this course

Tackle nontrivial applications, including

  • Approximate K Nearest Neighbors

  • Alternating Least Squares

  • K-means clustering

  • Streaming

You'll also have a solid foundation for data engineering applications.


Given the flexibility of Spark you are only limited by the data available to you and your imagination.



Who this course is for:
  • Data Scientists
  • Data Engineers
  • Quantitative Analysts
  • Engineering Managers
  • Data Analysts
  • Business Intelligence Dashboard Developers
  • Machine Learning Developers
  • SQL Developers
Course content
9 lectures 04:06:12
+ Course Overview and objectives
1 lecture 09:19

Lecture Overview

This lecture provides an overview of the course. It introduces Apache Spark, explains how Spark is used and what kinds of problems Spark is good for, and provides background on the instructor.


All code used in the video lectures is provided for download and is attached to this section. Another set of code, updated to use Spark version 2.2, is provided in Section 9; that code should also be compatible with the latest version of Spark.



Course Overview

In this course you'll learn the physical components of a Spark cluster, and the Spark computing framework. You’ll build your own local standalone cluster. You’ll write Spark code. You’ll learn how to run Spark jobs in a variety of ways. You’ll create Spark tables and query them using SQL. You will learn a process for creating successful Spark applications.



What am I going to get from this course?

  • Install and configure Spark.

  • Run Spark in several ways.

  • Build Spark applications.

  • Profile Spark applications.

  • Tune Spark applications.

  • Learn 30+ Spark commands.

  • Use Spark SQL window functions.

  • Step through 900 lines of Spark code.

  • Apply pro tips and best practices tested in production.



Prerequisites and Target Audience

What will students need to know or do before starting this course?

  • Have familiarity with Python.

  • Be comfortable at the Unix command line.

  • Have a data orientation and enjoy combing through huge piles of data.

  • Have a laptop, software development environment, and internet connection.



Who should take this course?

This course is for all data professionals, including:

  • Data Scientists

  • Data Engineers

  • Quantitative Analysts

  • Engineering Managers

  • Data Analysts

  • Dashboard Developers

  • Machine Learning Developers

  • R and SQL Developers


About the Instructor

  • Instructor background and experience

Preview 09:19
+ The Spark Computing Framework
1 lecture 29:18

This lecture explains the following aspects of the Spark Computing Framework in detail:


  • Components of Spark Physical Cluster

  • Components of Spark Software Architecture

  • Execution Modes



This lecture also demonstrates the following ways of running Spark:


  • Running the pyspark shell

  • Running Spark in the python shell

  • Running Spark in an ipython shell

  • Using the Spark session object and the Spark context

  • Running "Hello World" in Spark

  • Creating an RDD and inspecting its contents.



This lecture also demonstrates how to install Spark.


NOTE: To install pyspark on any Unix system, first try the following:


  • $ pip install pyspark


The pip install approach uses the pip package management system. It is the recommended installation method and works for most configurations. Once you install Spark in this way, you may skip ahead, unless you are interested in how to download Spark and build it from source.
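
One quick way to confirm that the pip install worked is to start a session and run a tiny job. Something along these lines should print the Spark version and a small collected RDD:

    import pyspark
    from pyspark.sql import SparkSession

    print(pyspark.__version__)

    spark = SparkSession.builder.appName("install-check").master("local[*]").getOrCreate()
    sc = spark.sparkContext          # the SparkContext behind the session

    # "Hello World": create an RDD from a Python list and inspect its contents.
    rdd = sc.parallelize(["hello", "world"])
    print(rdd.collect())

    spark.stop()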

Students who may want to build from source include Data Engineers, System Administrators, DevOps engineers, and the few students for whom the package-manager install does not work. Downloading the source is also helpful for accessing the application code examples that come with the Spark package.


Lecture Timestamps

  • [0:00 - 6:40] : Spark Computing Framework, Execution Modes

  • [6:40 - 20:25] : Installing Spark.  Building Spark from source.

  • [20:25 - 29:19] : Running Spark


The Spark Computing Framework
29:18
+ The Spark UI
1 lecture 28:45

This lecture covers the following topics:

  • Running Spark from the command-line

  • Debugging within an IDE

  • Running in a notebook

  • Using Spark UI to inspect a running application

  • Understanding lazy execution

  • How code creates Spark objects, such as driver, executor, job, stage, task

  • How to profile memory and data movement


This lecture also demonstrates how to use the Spark UI to observe the internals of an application while it is running using the Jobs, Stages, and Executors tabs.
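
If you want something of your own to poke at while following along, a sketch like the one below creates a dataframe, runs an action to generate a job, and then pauses so the application stays alive for inspection. By default the UI of a running local application is served at http://localhost:4040; the numbers used here are arbitrary:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ui-demo").master("local[*]").getOrCreate()

    # A transformation (range + filter) followed by an action (count) creates a job
    # whose stages and tasks show up in the Jobs and Stages tabs.
    df = spark.range(0, 10000000).filter("id % 7 = 0")
    print(df.count())

    # Keep the application alive so the Spark UI remains available for inspection.
    time.sleep(300)
    spark.stop()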



Perform every step that is demonstrated in the video for yourself!  Learning by actively engaging with the code and the tools is the best way to internalize these concepts!


Lecture timestamps:

  • [0:00 - 4:31] : Execution Modes

  • [4:31 - 6:11] : Running Spark from the command-line

  • [6:11 - 9:41] : Debugging within an IDE

  • [9:41 - 17:30] : Running ALS in Spark and inspecting in the Spark UI

  • [17:30 - 21:36] : Creating and running a simple Spark application in various ways

  • [21:36 - 28:45] : Creating a dataframe and inspecting using the Spark UI

The Spark UI
28:45
+ Running Spark
1 lecture 23:00

This lecture covers the following topics:

  • Spark Cluster Components

  • Execution Modes

  • Driver Program

  • Job, Stage, Task

  • Transformation vs Action

  • Wide Transformation vs Narrow Transformation

  • Relationship between Shuffle Operations and Stage Boundary

  • Execution Plan

  • Relationship between Dataset, RDD, and Dataframe

  • Shared Variables
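
Two of the topics above fit in a very small sketch: transformations are lazy and only execute when an action is called, and broadcast variables and accumulators are the two kinds of shared variables. The data here is made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shared-vars").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
    doubled = rdd.map(lambda x: x * 2)        # transformation: nothing runs yet (lazy)
    print(doubled.collect())                  # action: triggers a job

    lookup = sc.broadcast({1: "odd", 2: "even"})   # read-only value shipped to executors
    evens = sc.accumulator(0)                      # counter that tasks can only add to
    rdd.foreach(lambda x: evens.add(1) if x % 2 == 0 else None)
    print(lookup.value[2], evens.value)

    spark.stop()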


We also review Spark Cluster Components and the Spark Standalone Cluster Architecture.


We also see how to use the Spark UI to do the following:

  • inspect resource usage

  • observe the programmatic creation of jobs, stages, and tasks

  • profile memory and data movement


Lecture Timestamps:

  • [0:00 - 2:14] : Spark Cluster Components

  • [2:14 - 3:22] : Spark Execution Modes

  • [3:22 - 8:26] : Driver Program, Spark Job

  • [8:26 - 9:26] : Spark Task, Stage, Shuffle

  • [9:26 - 13:11] : Parallelized Collection

  • [13:11 - 18:36] : Partitions, Transformations, Actions

  • [18:36 - 20:54] : Shuffle, Stage Boundary, Execution Plan

  • [20:54 - 21:37] : Spark UI Thread Dump

  • [21:37 - 23:00] : Broadcast Variables, Accumulators



Running Spark
23:00
+ Let's practice using RDDs !
1 lecture 28:13

Exercise 1 : RDDs


In this lecture you will:

  • Download a dataset

  • Run an example application on a dataset

  • Extract statistics from the run performed in the previous step

  • Modify the application

  • Perform data cleansing

  • Extract additional statistics
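
The exercise walks through these steps against the course dataset. As a rough, self-contained sketch of the same kind of RDD work, with an illustrative file path standing in for the real download, a simple word count with basic cleansing looks like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-exercise-sketch").master("local[*]").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("data/dataset.txt")                      # illustrative path
    words = (lines.flatMap(lambda line: line.lower().split())    # tokenize
                  .map(lambda w: w.strip(".,;:!?\"'"))           # simple cleansing
                  .filter(lambda w: w != ""))                    # drop empty tokens

    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print("total words:", words.count())
    print("top 10:", counts.takeOrdered(10, key=lambda kv: -kv[1]))

    spark.stop()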


Lecture Timestamps:

[0:00 -

[ - 28:14] :


Preview 28:13
+ Using Dataframes
1 lecture 40:08

This lecture covers the following:

  • A walkthrough of a solution to Exercise 2, introduced in the previous lecture

  • How to achieve the following results in a dataframe:

    • select certain fields

    • filter data

    • group data

    • pretty print a dataframe

    • get the number of distinct values

    • sort by a specified column

    • limit the number of rows in a result

  • Joining two dataframes

  • Subtracting a dataframe from another dataframe

  • Adding a column based on an existing column

  • Creating a UDF

  • Adding a column using a UDF

  • Creating a UDF that operates on two columns

  • Using the following dataframe operations:

    • select, join, where, groupBy, show, drop_duplicates, limit, subtract, withColumn

  • Introduces Exercise 3:

    • Split a text into chapters

    • Compare the word frequencies of each chapter
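
To give a flavor of these dataframe operations before you watch the walkthrough, here is a minimal, self-contained sketch using made-up in-memory data rather than the exercise dataset:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, length

    spark = SparkSession.builder.appName("dataframe-ops").master("local[*]").getOrCreate()

    words = spark.createDataFrame([("whale",), ("sea",), ("whale",), ("ship",)], ["word"])
    stopwords = spark.createDataFrame([("sea",)], ["word"])

    unique = words.drop_duplicates()                  # deduplicate rows
    kept = unique.subtract(stopwords)                 # subtract one dataframe from another

    (words.join(kept, "word")                         # join two dataframes
          .groupBy("word").count()                    # group data
          .where(col("count") >= 1)                   # filter data
          .withColumn("length", length(col("word")))  # add a column based on an existing one
          .sort(col("count").desc())                  # sort by a specified column
          .limit(10)                                  # limit the number of rows
          .show())                                    # pretty print

    spark.stop()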




What you will learn :

  • Recap of the RDD-based approach

  • Loading a dataframe from text file

  • Using the select, alias, explode, lower, col, and length operations

  • Counting uniques using drop_duplicates and distinct

  • Aggregations using the groupBy operation

  • Introducing the GroupedData object

  • Set operations - Joins - Set intersection - Set subtraction

  • Filtering using where

  • Inspecting a sample of a result set using the show action

  • Transforming a column using a UDF

  • Transforming a column using a UDF within a select operation

  • Adding a new column using withColumn

  • Adding a column containing a fixed literal value using a literal argument to a UDF

  • How to create a UDF that operates on two columns
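
Because UDFs come up repeatedly in this lecture, here is a minimal sketch of a one-column UDF, a UDF called with a fixed literal argument, and a UDF that operates on two columns. The column names and data are made up:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, lit, udf
    from pyspark.sql.types import IntegerType, StringType

    spark = SparkSession.builder.appName("udf-sketch").master("local[*]").getOrCreate()
    df = spark.createDataFrame([("whale", 5, 2), ("sea", 3, 1)], ["word", "count", "bonus"])

    shout = udf(lambda s: s.upper(), StringType())               # one-column UDF
    tag = udf(lambda s, label: s + "/" + label, StringType())    # UDF taking a literal argument
    total = udf(lambda a, b: a + b, IntegerType())               # UDF over two columns

    (df.withColumn("loud", shout(col("word")))
       .withColumn("tagged", tag(col("word"), lit("noun")))
       .withColumn("adjusted", total(col("count"), col("bonus")))
       .show())

    spark.stop()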



Lecture timestamps:

  • [0:00 - 3:48] : Demonstrating the solution at the command line

  • [3:48 - 8:17] : Demonstrating the solution in an IDE

  • [8:17 - 25:49] : Detailed code review

  • [25:49 - 29:19] : UDFs

  • [29:19 - 38:56] : Exercise 2 : Bonus Round

  • [38:56 - 40:08] : Introduction to Exercise 3



Exercise 2 : Using Dataframes
40:08
+ Caching and Memory Storage Levels
1 lecture 44:38


The following topics are covered:

  • Caching and Logging

    • caching vs persist

    • removing objects from cache using unpersist

    • command line demonstration of caching

    • demonstrating an important quirk of logging when using the DEBUG log level

  • How to size a dataset in the Spark UI

    • Spark UI Storage tab

    • creating the object, caching it, pausing the application

    • inspecting the object size in the Spark UI

  • Storage levels, serialization, and cache eviction policy

    • memory storage levels

    • serialization

    • cache eviction policy

  • Tuning Cache and Best Practices for Caching and Logging

    • a systematic way to tune cache

    • best practices for caching

    • when to cache

    • when not to cache

    • implications of running Spark actions within a debug log statement



Lecture timestamps:

  • [00:00 - 8:52] : Caching and Logging

  • [8:52 - 15:14] : How to size a dataset in the Spark UI

  • [15:14 - 20:45] : Storage levels, serialization, and cache eviction policy

  • [20:45 - 44:38] Tuning Cache and Best Practices for Caching and Logging


Caching and Memory Storage Levels
44:38
+ Spark SQL and Window Functions
1 lecture 42:00

Moving Window N-Tuple Analysis using Window SQL Functions

  • Demonstrates 3-tuple, 4-tuple, 5-tuple, and 6-tuple analyses

  • Analyzes a 6 MiB text of 1 million words using moving n-tuple windows via Window SQL functions
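
To make the idea concrete, here is a minimal sketch of a moving 3-tuple window over a word column using the lead() window function. The toy data stands in for the million-word corpus used in the lecture:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, concat_ws, lead, monotonically_increasing_id
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("ntuple-sketch").master("local[*]").getOrCreate()

    words = (spark.createDataFrame(
                 [("call",), ("me",), ("ishmael",), ("some",), ("years",)], ["word"])
             .withColumn("id", monotonically_increasing_id()))

    # Order the words and look ahead 1 and 2 positions to form 3-tuples.
    # With no partitionBy, Spark warns that all data moves to one partition;
    # that is acceptable for a toy example.
    w = Window.orderBy("id")
    triples = (words.withColumn("w2", lead("word", 1).over(w))
                    .withColumn("w3", lead("word", 2).over(w))
                    .where(col("w3").isNotNull())
                    .select(concat_ws(" ", "word", "w2", "w3").alias("triple")))

    triples.groupBy("triple").count().sort(col("count").desc()).show(truncate=False)

    spark.stop()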


Covers the following topics :

  • Introduction to Spark SQL

    • Examples of traditional SQL queries

    • Examples of window function SQL queries

    • Code demonstrated at the command line and in IDE

    • Spark Tables

  • Spark Catalog and Execution Plans

    • Registering a dataframe as a Spark table

    • Caching a Spark table

    • Inspecting the Spark catalog

    • Querying a Spark table using SQL

    • Examining the execution plan of a dataframe

    • Examining the execution plan of a query

    • A defensive programming technique for debugging lazy evaluations

  • Window Function SQL

    • Using dot notation vs SQL queries for dataframe operations

    • Example of an operation that is easier using dot notation

    • Window functions in action

    • Identifying word sequences of a specified length

    • Creating a moving window feature set

    • Finding most frequent word sequences of a specified length

    • Observations gleaned from our dataset using windowed sequence analysis

  • Pro Tips

    • Window functions

    • UDFs

    • Debugging

    • Tuning

  • Project ideas
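
For reference, the catalog and execution-plan steps listed above boil down to a few calls; the table name here is a throwaway used only for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("catalog-sketch").master("local[*]").getOrCreate()

    df = spark.range(0, 1000).withColumnRenamed("id", "n")
    df.createOrReplaceTempView("numbers")            # register a dataframe as a Spark table
    spark.catalog.cacheTable("numbers")              # cache the table

    print(spark.catalog.listTables())                # inspect the Spark catalog
    print(spark.catalog.isCached("numbers"))

    result = spark.sql("SELECT COUNT(*) AS c FROM numbers WHERE n % 2 = 0")
    result.explain()                                 # execution plan of a query
    df.explain(True)                                 # extended execution plan of a dataframe

    result.show()
    spark.stop()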


Lecture timestamps:

  • [00:00 - 13:56] : Introduction to Spark SQL

  • [13:56 - 21:33] : Spark Catalog, Execution Plans

  • [21:33 - 31:00] : Window Function SQL

  • [31:00 - 42:00] : Pro Tips, Project ideas

Spark SQL and Window Functions
42:00
+ Additional Source Code Downloads
1 lecture 00:51

Code for Spark Version 2.2




This course was originally designed using Spark version 2.1.2. However, the code presented here also works in later versions, including 2.2, 2.3, and 2.4.


The code should also work as-is with later releases: the concepts taught were already fairly mature as of version 2.2, and features introduced in later Spark versions should not affect the code presented here.


For your convenience, versions of the code examples are provided using Spark version 2.2.1 and Python 3. The key changes include the following:


  1. A code module spark_2_2_1.py is provided in the Downloads for running Spark 2.2.1.

  2. See https://github.com/minrk/findspark for how findspark resolves some cases where Pyspark isn't on sys.path by default.

  3. Note also the use of an OS environment setting to explicitly instruct the code to use Python 3. This resolves issues in environments where more than one version of Python is installed.


Steps 2 and 3 above were included only for the convenience of students using uncommon configurations. They should not be necessary for most students, and in particular they should not be needed when the latest version is installed with pip.
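
For students who do need steps 2 and 3, the pattern looks roughly like this at the top of a script, before pyspark is imported; the application name is arbitrary:

    import os

    # Step 3: pin the Python interpreter explicitly when several versions are installed.
    os.environ["PYSPARK_PYTHON"] = "python3"

    # Step 2: let findspark locate the Spark installation and put pyspark on sys.path.
    import findspark
    findspark.init()

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("spark_2_2_1").master("local[*]").getOrCreate()
    print(spark.version)
    spark.stop()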


Code for Spark Version 2.2 and compatible with later versions
00:51