Apache Spark 3 for Data Engineering & Analytics with Python

Name: Apache Spark 3 for Data Engineering & Analytics with Python
Rating: 4.6 (2138 reviews)

Learn how to use Python and PySpark 3.0.1 for Data Engineering / Analytics (Databricks) - Beginner to Ninja

Highest Rated

Created by David Charles Academy

Last updated 5/2022

English

What you'll learn

Learn the Spark Architecture
Learn Spark Execution Concepts
Learn Spark Transformations and Actions using the Structured API
Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API
Learn how to set up your own local PySpark Environment
Learn how to interpret the Spark Web UI
Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution
Learn the RDD (Resilient Distributed Datasets) API (Crash Course)
Learn the Spark DataFrame API (Structured APIs)
Learn Spark SQL
Learn Spark on Databricks
Learn to Visualize (Graphs and Dashboards) Data on Databricks

Course content

5 sections • 89 lectures • 8h 39m total length

Introduction 4:43
The Spark Architecture 3:39
The Spark Unified Stack 3:36
Explore the Apache Spark unified stack built on Spark Core, delivering batch and real-time streaming, SQL, machine learning, and graph processing with data frames.
Windows - Download Java 2:29
Windows - Install Java 1:39
Windows - Set up Java environment variables 4:48
Windows - Download Python Installer 1:11
Windows - Install Python 2:21
Windows - Set up PATH variable for Python 5:10
Windows - Install Spark for Python 3:38
Windows - PySpark Test Program 6:05
Hadoop Installation 5:26
Install Microsoft Build Tools 2:35
Mac OS - Java Installation 3:45
Mac OS - Python Installation 4:16
Mac OS - PySpark Installation 7:15
Install PySpark on Mac OS using pip and verify the installation. Set environment variables in .zshrc to reference Python 3 and the PySpark drivers, then source the profile to apply the changes.
Mac OS - Testing the Spark Installation 5:06
Install Jupyter Notebooks 9:18
The Spark Web UI 11:18
Learn to use the Spark Web UI to monitor a local Spark job, exploring stages and executors, and applying map transformations to square numbers in a sample program.
Section Summary 2:23
Install a Spark environment with Java, Python 3.9, and Jupyter Notebook, resolve the C++ build tools needed to run the notebook, and use the Spark UI to monitor Spark jobs and RDD libraries.

Section Introduction 1:24
Learn Spark foundations and build an app to compute the most orders per region and country, while exploring Spark transformations, actions, and the directed acyclic graph in the Spark Web UI.
Spark Application and Session 7:54
Spark Transformations and Actions Part 1 9:10
Spark Transformations and Actions Part 2 3:32
DAG Visualisation 5:43
Explore the Spark Web UI and DAG visualisation to see how stages and jobs execute in a lazy transformation-to-action workflow, with Python and Java APIs and data frames.

Introduction to RDDs 4:51
Data Preparation 6:38
Distinct and Filter Transformations 8:16
Remove duplicates with the distinct transformation on an RDD, verify the reduced count, and then filter words with a lambda to keep those that start with s.
Map and Flat Map Transformations 7:15
Explore map and flatMap transformations in Spark, applying complex operations while preserving record counts; create squared number pairs and flatten words into letters.
SortByKey Transformations 6:19
Learn how to sort by key using Spark's sortByKey transformation on an RDD of country-ranking tuples, including descending sort with map and collect.
RDD Actions 8:21
Explore the reduce action in RDDs by aggregating values with a lambda to a single result, summing lists, finding the longest word, and deriving max and min values.
Challenge - Convert Fahrenheit to Centigrade 9:04
Challenge - XYZ Research 2:18
Use Spark 3 to tackle the XYZ Research challenge with the attached data, applying union and subtract transformations to count initiated projects and year-one and year-two completions.
XYZ Research 0:47
Challenge - XYZ Research Part 1 6:08
Challenge - XYZ Research Part 2 4:30
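
The reduce action described under RDD Actions can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the function names (add, longer_word, summarize) and the sample lists are hypothetical. The binary functions are checked locally with functools.reduce; an RDD's reduce action accepts the same kind of two-argument function.

```python
from functools import reduce


def add(a, b):
    """Binary sum, suitable for reduce because it is associative and commutative."""
    return a + b


def longer_word(a, b):
    """Return the longer of two words; ties are broken alphabetically so the
    result does not depend on the order in which pairs are combined."""
    return max(a, b, key=lambda w: (len(w), w))


def summarize(rdd_numbers, rdd_words):
    """Apply several reduce actions to an RDD of numbers and an RDD of words."""
    return {
        "sum": rdd_numbers.reduce(add),
        "max": rdd_numbers.reduce(lambda a, b: a if a > b else b),
        "min": rdd_numbers.reduce(lambda a, b: a if a < b else b),
        "longest": rdd_words.reduce(longer_word),
    }


# Hypothetical sample data, checked locally without a Spark cluster.
numbers = [2, 7, 1, 9, 4]
words = ["spark", "rdd", "transformation", "action"]
local_sum = reduce(add, numbers)
local_longest = reduce(longer_word, words)
```

With a SparkContext named sc available, summarize(sc.parallelize(numbers), sc.parallelize(words)) would run the same functions as distributed reduce actions.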

Structured APIs Introduction 6:23
Preparing the Project Folder 5:18
PySpark DataFrame, Schema and DataTypes 8:49
DataFrame Reader and Writer 9:26
Challenge Part 1 - Brief 2:41
Challenge Part 1 0:22
Challenge Part 1 - Data Preparation 8:52
Import libraries, create a Spark session, and define a sales schema; load CSV data from the data folder with a header, then print the schema and show the first 10 records.
Working with Structured Operations 3:07
Managing Performance Errors 4:50
Manage performance errors in Spark by stopping and restarting sessions, cleaning up temporary space, and restarting notebooks to recover from block manager errors and file not found exceptions.
Reading a JSON File 10:16
Columns and Expressions 8:35
Filter and Where Conditions 6:33
Learn to filter data with the filter and where functions to select salaries of 3000 or less, combine conditions with and, and use contains to find Land of the Lost.
Distinct Drop Duplicates Order By 7:13
Rows and Union 7:41
Learn how to create individual row items, assemble them into a data frame, and merge data frames using Spark's union transformation for tasks in data engineering and analytics with Python.
Adding, Renaming and Dropping Columns 6:22
Working with Missing or Bad Data 8:06
Working with User Defined Functions 8:12
Challenge Part 2 - Brief 5:10
Learn to clean sales data with Spark: remove bad records, extract city and state from the address, convert data types, add year and month, and write output partitioned by report and month.
Challenge Part 2 1:28
Challenge Part 2 - Remove Null Row and Bad Records 8:02
Challenge Part 2 - Get the City and State 8:20
Extract the city and state from the purchase address using Spark's split function to create new city and state columns in the sales dataframe.
Challenge Part 2 - Rearrange the Schema 9:14
Challenge Part 2 - Write Partitioned DataFrame to Parquet 5:59
Aggregations 2:28
Aggregations - Setting up Flight Summary Data 6:11
Load the flight summary data into a Spark data frame, infer the schema, and count route usage (origin to destination airports) to set up aggregations.
Aggregations - Count and Count Distinct 6:02
Aggregations - Min Max Sum SumDistinct AVG 6:38
Aggregations with Grouping 7:50
Group by origin airport, apply count and max aggregations, and order by results to reveal airport counts; extend to group by state and city with California filters.
Challenge Part 3 - Brief 1:59
Challenge Part 3 0:23
Challenge Part 3 - Prepare 2019 Data 6:03
Challenge Part 3 - Q1 Get the Best Sales Month 10:39
Challenge Part 3 - Q2 Get the City that sold the most products 4:57
Challenge Part 3 - Q3 When to advertise 10:23
Challenge Part 3 - Q4 Products Bought Together 9:29

Introduction to Databricks 4:27
Spark SQL Introduction 3:48
Register Account on Databricks 3:05
Discover how to register for the Databricks community edition, complete sign-up, verify your email, and land on the data platform's front page for future Spark data engineering lessons.
Create a Databricks Cluster 3:53
Learn to create a Databricks cluster to run Spark code, starting with a single-node community edition, Python 3 support, and understanding idle termination and availability zone settings.
Creating our First 2 Databricks Notebooks 5:27
Reading CSV Files into DataFrame 8:56
Load sales data files into a Spark dataframe by defining a schema and a data path, then read CSV files with headers. Attach the cluster and verify the initial records.
Creating a Database and Table 7:48
Inserting Records into a Table 9:26
Insert records from a sales raw data frame into the sales table using Spark SQL and a temporary view. Verify compatibility with describe and understand the insert process with DML.
Exposing Bad Records 5:38
Figuring out how to remove bad records 4:29
Extract the City and State 8:43
Inserting Records to Final Sales Table 14:56
Insert transformed sales records with a Spark SQL statement, convert order date to timestamp, and extract year and month for final analytics and Parquet storage.
What was the best month in sales? 9:14
Get the City that sold the most products 2:55
Get the right time to advertise 4:44
Explore the best ad timing by extracting the hour from order dates, measuring purchasing power per unique order id, and visualizing peak moments with a line chart.
Get the most products sold together 9:42
Identify the most common product pairs sold together in New York using Apache Spark 3's data frame API, grouping by order IDs and visualizing results with a pie chart.
Create a Dashboard 3:22
Summary 2:21
Dive into core data fundamentals with DDL-based database and table creation, SQL-like select statements, and DML operations, then explore clusters, notebooks, and dashboards.

Requirements

A basic laptop PC running Windows or Mac OS with at least 6-8 GB of RAM
Basic programming knowledge

Description

The key objectives of this course are as follows:

Learn the Spark Architecture
Learn Spark Execution Concepts
Learn Spark Transformations and Actions using the Structured API
Learn Spark Transformations and Actions using the RDD (Resilient Distributed Datasets) API
Learn how to set up your own local PySpark Environment
Learn how to interpret the Spark Web UI
Learn how to interpret DAG (Directed Acyclic Graph) for Spark Execution
Learn the RDD (Resilient Distributed Datasets) API (Crash Course)
- RDD Transformations
- RDD Actions
Learn the Spark DataFrame API (Structured APIs)
- Create Schemas and Assign DataTypes
- Read and Write Data using the DataFrame Reader and Writer
- Read Semi-Structured Data such as JSON
- Create and Add New Data Columns to the DataFrame using Expressions
- Filter the DataFrame using the "Filter" and "Where" Transformations
- Ensure that the DataFrame has unique rows
- Detect and Drop Duplicates
- Augment the DataFrame by Adding New Rows
- Combine 2 or More DataFrames
- Order the DataFrame by Specific Columns
- Rename and Drop Columns from the DataFrame
- Clean the DataFrame by Detecting and Removing Missing or Bad Data
- Create User-Defined Spark Functions
- Read from and Write to Parquet Files
- Partition the DataFrame and Write to Parquet File
- Aggregate the DataFrame using Spark SQL functions (count, countDistinct, Max, Min, Sum, SumDistinct, AVG)
- Perform Aggregations with Grouping
Learn Spark SQL and Databricks
- Create a Databricks Account
- Create a Databricks Cluster
- Create Databricks SQL and Python Notebooks
- Learn Databricks shortcuts
- Create Databases and Tables using Spark SQL
- Use DML, DQL, and DDL with Spark SQL
- Use Spark SQL Functions
- Learn the differences between Managed and Unmanaged Tables
- Read CSV Files from the Databricks File System
- Learn to write Complex SQL
- Create Visualisations with Databricks
- Create a Databricks Dashboard

The Python Spark projects that we are going to do together:

Sales Data

Create a Spark Session
Read a CSV file into a Spark Dataframe
Learn to Infer a Schema
Select data from the Spark Dataframe
Produce analytics that shows the topmost sales orders per Region and Country

Convert Fahrenheit to Degrees Centigrade

Create a Spark Session
Read and Parallelize data using the Spark Context into an RDD
Create a Function to Convert Fahrenheit to Degrees Centigrade
Use the Map Function to convert data contained within an RDD
Filter temperatures greater than or equal to 13 degrees Celsius
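
The steps of this project can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the course's solution: the function names (fahrenheit_to_celsius, warm_readings, main) and the sample readings are hypothetical, since the course data is not shown in this listing.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit reading to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def warm_readings(sc, temps_f, threshold_c=13):
    """Parallelize the readings into an RDD, convert each one with map,
    and keep only those at or above threshold_c degrees Celsius."""
    rdd = sc.parallelize(temps_f)
    return (
        rdd.map(fahrenheit_to_celsius)
        .filter(lambda temp_c: temp_c >= threshold_c)
        .collect()
    )


def main():
    # Imported inside main so the conversion helper can be used and checked
    # on a machine without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("FahrenheitToCelsius")
        .getOrCreate()
    )
    # Hypothetical sample readings in Fahrenheit.
    print(warm_readings(spark.sparkContext, [50.0, 59.0, 41.0, 68.0]))
    spark.stop()
```

Calling main() on a machine with a local PySpark installation runs the whole pipeline; the threshold of 13 degrees comes from the project brief above.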

XYZ Research

Create a set of RDDs that hold Research Data
Use the union transformation to combine RDDs
Learn to use the subtract transformation to remove values from an RDD
Use the RDD API to answer the following questions
- How many research projects were initiated in the first three years?
- How many projects were completed in the first year?
- How many projects were completed in the first two years?

Sales Analytics

Create the Sales Analytics DataFrame from a set of CSV Files
Prepare the DataFrame by applying a Structure
Remove bad records from the DataFrame (Cleaning)
Generate New Columns from the DataFrame
Write a Partitioned DataFrame to a Parquet Directory
Answer the following questions and create visualizations using Seaborn and Matplotlib
- What was the best month in sales?
- What city sold the most products?
- What time should the business display advertisements to maximize the likelihood of customers buying products?
- What products are often sold together in the state "NY"?
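
The "Generate New Columns" step could look roughly like the sketch below, which derives city and state columns with Spark's split function, as the lecture summary for Challenge Part 2 describes. Everything specific here is an assumption: the column names ("Purchase Address", "City", "State"), the address layout "street, city, ST zip", and the helper names are hypothetical and not taken from the course data. The plain-Python helper mirrors the same parsing logic so it can be checked without Spark.

```python
def city_and_state(address):
    """Parse an address assumed to look like 'street, city, ST zip'
    and return (city, state)."""
    parts = [part.strip() for part in address.split(",")]
    city = parts[1]
    state = parts[2].split(" ")[0]
    return city, state


def add_city_state(df, address_col="Purchase Address"):
    """Return df with hypothetical 'City' and 'State' columns derived
    from address_col using Spark's split function."""
    # Imported here so the plain-Python helper works without PySpark installed.
    from pyspark.sql.functions import col, split, trim

    parts = split(col(address_col), ",")
    return (
        df.withColumn("City", trim(parts.getItem(1)))
        .withColumn("State", split(trim(parts.getItem(2)), " ").getItem(0))
    )
```

With these columns in place, a filter on the state value "NY" would select the rows needed for the last question above.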

Technology Spec

Python
Jupyter Notebook
Jupyter Lab
PySpark (Spark with Python)
Pandas
Matplotlib
Seaborn
Databricks
SQL

Who this course is for:

Python Developers who wish to learn how to use the language for Data Engineering and Analytics with PySpark
Aspiring Data Engineering and Analytics Professionals
Data Scientists / Analysts who wish to learn an analytical processing strategy that can be deployed over a big data cluster
Data Managers who want to gain a deeper understanding of managing data over a cluster

Course content

Introduction to Spark and Installation • 20 lectures • 1hr 31min

Spark Execution Concepts • 5 lectures • 28min

RDD Crash Course • 11 lectures • 1hr 4min

Structured API - Spark DataFrame • 35 lectures • 3hr 44min

Introduction to Spark SQL and Databricks • 18 lectures • 1hr 53min
