Master Apache Spark using Spark SQL and PySpark 3

Name: Master Apache Spark using Spark SQL and PySpark 3
Rating: 4.1 (2636 reviews)

Master Apache Spark using Spark SQL as well as PySpark with Python 3, with complimentary lab access

Created by Durga Viswanatha Raju Gadiraju

Last updated 5/2024

English

What you'll learn

Set up a Single Node Hadoop and Spark cluster using Docker locally or on AWS Cloud9
Review ITVersity Labs (exclusively for ITVersity Lab Customers)
All the HDFS Commands that are relevant to validate files and folders in HDFS.
Quick recap of Python which is relevant to learn Spark
Ability to use Spark SQL to solve the problems using SQL style syntax.
Pyspark Dataframe APIs to solve the problems using Dataframe style APIs.
Relevance of Spark Metastore to convert Dataframes into Temporary Views so that one can process data in Dataframes using Spark SQL.
Apache Spark Application Development Life Cycle
Apache Spark Application Execution Life Cycle and Spark UI
Setup SSH Proxy to access Spark Application logs
Deployment Modes of Spark Applications (Cluster and Client)
Passing Application Properties Files and External Dependencies while running Spark Applications

Course content

25 sections • 346 lectures • 32h 11m total length

Introduction to Spark SQL and PySpark 3 using Python 3 • 1:35
Curriculum for Spark SQL and Pyspark 3 using Python 3 • 2:02
Explore Spark SQL and PySpark with Python 3, use SQL style syntax and dataframe APIs, and learn to convert dataframes to temporary views, deploy and troubleshoot Spark applications.
Purchasing the Spark SQL and PySpark using Python 3 Course • 2:09
Introduction to Udemy Course Landing Page • 2:00
Overview of Udemy Course or Video Player • 7:16
Adding Notes to Course Lectures • 5:03
Using Course Sidebar to move between lectures • 2:54
Overview of Support to ITVersity courses on Udemy • 4:04
Best Practices to get ITVersity Support using Udemy • 9:54
Resources for Spark SQL and Pyspark 3 using Python 3 • 3:57
Material for Spark SQL and PySpark 3 using Python 3 • 2:30
Become Part of ITVersity Data Engineering Community • 2:04
Rate and Leave Feedback - Spark SQL and PySpark 3 using Python 3 • 3:51
Udemy for Business Customers - Important Information about labs for practice • 2:20

Setup Development Environment using VS Code Remote Development Extension Pack • 4:41
Review Data Sets Provided as part of Gateway Nodes of Hadoop and Spark Cluster • 4:15
Validate HDFS on Multi Node Hadoop and Spark Cluster from Gateway Node • 7:08
Validate Hive on Hadoop and Spark Multinode Cluster • 9:29
Review Hadoop HDFS and YARN Property Files on Hadoop and Spark Cluster • 5:35
Review Hadoop HDFS and YARN Property Files using Visual Studio Code Editor • 3:02
Review Hive Property Files on Multinode Hadoop and Spark Cluster • 2:03
Review Spark 2 Property Files and Important Properties • 4:28
Validate Spark Shell CLI using Spark 2 • 5:51
Launch and validate Spark shell with Scala on Spark 2, read JSON data from HDFS into a dataframe, print the schema, preview 20 records, and count 68,083 rows.
Validate Pyspark CLI using Spark 2 • 3:37
Validate Spark SQL CLI using Spark 2 • 11:00
Review Spark 3 Property Files and Important Properties • 3:19
Validate Spark Shell CLI using Spark 3 • 3:49
Validate Spark shell with Spark 3 by running Spark SQL queries, creating a database, and setting spark.sql.warehouse.dir to fix permissions.
Validate Pyspark CLI using Spark 3 • 7:41
Validate Spark SQL CLI using Spark 3 • 6:07

Prerequisites for Single Node Hadoop and Spark Cluster on Windows • 3:40
Overview of Windows System Configuration • 2:54
Setup Ubuntu on Windows 11 using wsl • 5:21
Setup and Validate Ubuntu VM on Windows using wsl • 3:21
Install Docker Desktop on Windows 11 using wsl2 • 6:06
Overview of Docker Desktop on Windows 11 • 2:41
Validate Docker Commands using Windows Powershell as well as wsl Ubuntu • 2:24
Setup Visual Studio Code IDE on Windows • 4:10
Install Visual Studio Code Extension for Remote Development • 3:49
Clone GitHub Repository for Pyspark Course using Visual Studio Code • 4:33
Launching Terminal using Visual Studio Code and WSL • 3:33
Review Docker Compose File to setup Hadoop and Spark Lab • 4:28
Start Hadoop and Spark Lab along with Jupyter Lab on Windows 11 • 7:17
Review the resource utilization of Windows for Hadoop and Spark Lab • 4:51
Review Docker Desktop for Hadoop and Spark Lab using Docker • 5:06
Overview of Docker Compose Commands to manage Hadoop and Spark Lab • 6:45
Validate Hadoop and Spark setup using Docker on Windows • 8:32

Getting Started with AWS Cloud9 • 2:50
Creating AWS Cloud9 Environment • 4:36
Warming up with AWS Cloud9 IDE • 3:19
Review Operating System Details on AWS Cloud9 • 2:31
Overview of EC2 Instance related to AWS Cloud9 • 1:31
Opening ports for AWS Cloud9 Instance • 3:12
Associating Elastic IPs to AWS Cloud9 Instance • 4:13
Increase EBS Volume Size of AWS Cloud9 Instance • 3:15
Setup Docker Compose on AWS Cloud9 Instance • 2:26
Clone GitHub Repository on AWS Cloud9 for the Course Material • 2:39
Review Docker Compose File to setup Hadoop and Spark Lab • 4:28
Start Hadoop and Spark Lab along with Jupyter Lab on Windows 11 • 7:17
Overview of Docker Compose Commands to manage Hadoop and Spark Lab • 6:45
Validate Hadoop and Spark setup using Docker • 8:32

Introduction • 2:16
Review of Setup Steps for Spark Environment • 8:39
Using ITVersity labs • 6:32
Join ITVersity labs to access a hosted multi-node big data cluster with Spark, Hive, and Kafka, and learn to run Spark 1.6.3 or Spark 2.x on YARN.
Apache Spark Official Documentation (Very Important) • 7:20
Quick Review of Spark APIs • 12:30
Spark Modules • 5:01
Spark Data Structures - RDDs and Data Frames • 14:49
Develop Simple Application • 14:26
Apache Spark - Framework • 22:19
Create Data Frames from Text Files • 16:18
Create Data Frames from Hive Tables • 5:49

Getting Started - Overview • 2:01
Overview of Spark Documentation • 2:29
Launching and using Spark SQL CLI • 4:08
Overview of Spark SQL Properties • 8:51
Running OS Commands using Spark SQL • 3:19
Understanding Spark Metastore Warehouse Directory • 4:12
Managing Spark Metastore Databases using Spark SQL • 10:01
Managing Spark Metastore Tables using Spark SQL • 3:21
Retrieve Metadata of Spark Metastore Tables using Spark SQL Describe Command • 2:19
Role of Spark Metastore or Hive Metastore • 5:01
Exercise - Getting Started with Spark SQL • 8:57

Basic Transformations using Spark SQL - Introduction • 3:19
Spark SQL - Overview • 6:41
Define Problem Statement • 3:19
Prepare Spark Metastore Tables for Basic Transformations using Spark SQL • 5:06
Projecting Data using Spark SQL Select Clause • 4:00
Filtering Data using Spark SQL Where Clause • 10:20
Joining Tables using Spark SQL - Inner • 7:29
Joining Tables using Spark SQL - Outer • 7:22
Aggregating Data using Group By in Spark SQL • 11:06
Sorting Data using Order By in Spark SQL • 4:49
Conclusion - Final Solution for the problem statement using Spark SQL • 4:21

Introduction to Basic DDL and DML in Spark SQL • 2:47
Create Spark Metastore Tables using Spark SQL Create Statement • 10:33
Overview of Data Types used in Spark Metastore Tables • 9:51
Adding Comments to Spark Metastore Tables using Spark SQL • 2:02
Loading Data from Local File System Into Tables using Spark SQL Load Statement • 4:17
Loading Data from HDFS Folders Into Tables using Spark SQL Load Statement • 6:09
Difference between Load with Append and Overwrite using Spark SQL Load Statement • 2:40
Creating External Spark Metastore Tables using Spark SQL • 3:06
Difference between Managed and External Spark Metastore Tables • 4:39
Overview of File Formats used in Spark Metastore Tables • 8:01
Drop Spark Metastore Tables and Databases using Spark SQL • 4:17
Truncating Spark Metastore Tables • 2:17
Exercise - Managed Spark Metastore Tables • 7:10

Requirements

Basic programming skills using any programming language
A self-supported lab (instructions provided) or the ITVersity lab, at additional cost, as the practice environment.
A 64-bit operating system; the minimum memory required depends on the environment you are using:
4 GB RAM if you have access to a suitable cluster, or 16 GB RAM to set up the environment using Docker

Description

DISCLAIMER

This course requires you to download the following software:
Docker
Visual Studio Code
If you are a Udemy Business user, please check with your employer before downloading software

As part of this course, you will learn all the key skills to build Data Engineering Pipelines using Spark SQL and Spark Data Frame APIs with Python as the programming language. This course used to be a CCA 175 Spark and Hadoop Developer course for certification exam preparation. As of 10/31/2021, the exam has been sunset, and we have renamed the course to Apache Spark 2 and Apache Spark 3 using Python 3, as it covers industry-relevant topics beyond the scope of the certification.

About Data Engineering

Data Engineering is the processing of data according to downstream needs. As part of Data Engineering, we need to build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering; conventionally, they are known as ETL Development, Data Warehouse Development, etc. Apache Spark has evolved into a leading technology for Data Engineering at scale.

I have prepared this course for anyone who would like to transition into a Data Engineer role using Pyspark (Python + Spark). I am a Data Engineering Solution Architect with proven experience in designing solutions using Apache Spark.

Let us go through the details about what you will be learning in this course. Keep in mind that the course is created with a lot of hands-on tasks which will give you enough practice using the right tools. Also, there are tons of tasks and exercises to evaluate yourself. We will provide details about Resources or Environments to learn Spark SQL and PySpark 3 using Python 3 as well as Reference Material on GitHub to practice Spark SQL and PySpark 3 using Python 3. Keep in mind that you can either use the cluster at your workplace or set up the environment using provided instructions or use ITVersity Lab to take this course.

Setup of Single Node Big Data Cluster

Many of you would like to transition to Big Data from Conventional Technologies such as Mainframes, Oracle PL/SQL, etc., and you might not have access to Big Data Clusters. It is very important for you to set up the environment in the right manner. Don't worry if you do not have a cluster handy; we will guide you through the setup with support via Udemy Q&A.

Set up Ubuntu-based AWS Cloud9 Instance with the right configuration
Ensure Docker is set up
Set up Jupyter Lab and other key components
Set up and Validate Hadoop, Hive, YARN, and Spark

Are you feeling a bit overwhelmed about setting up the environment? Don't worry! We will provide complimentary lab access for up to 2 months. Here are the details.

Training uses an interactive environment. You will get 2 weeks of lab access to begin with. If you like the environment and acknowledge it by providing a 5* rating and feedback, the lab access will be extended by an additional 6 weeks (2 months in total). Feel free to send an email to support@itversity.com to get complimentary lab access. Also, if your employer provides a multi-node environment, we will help you set up the material for practice as part of a live session. On top of Q&A support, we also provide the required support via live sessions.

A quick recap of Python

This course requires a decent knowledge of Python. To make sure you understand Spark from a Data Engineering perspective, we added a module to quickly warm up with Python. If you are not familiar with Python, then we suggest you go through our other course Data Engineering Essentials - Python, SQL, and Spark.

Master required Hadoop Skills to build Data Engineering Applications

As part of this section, you will primarily focus on HDFS commands so that you can copy files into HDFS. The data copied into HDFS will be used as part of building data engineering pipelines using Spark and Hadoop with Python as the Programming Language.

Overview of HDFS Commands
Copy files into HDFS using the put or copyFromLocal command.
Review whether the files are copied properly to HDFS using HDFS commands.
Get the size of the files using HDFS commands such as du, df, etc.
Some fundamental concepts related to HDFS such as block size, replication factor, etc.
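
As a rough sketch of the flow listed above, the commands can be assembled in Python before being run from a gateway node. The local and HDFS paths are hypothetical placeholders and the helper names are our own, not part of the course material. The block only builds and prints the commands, so it runs without a Hadoop installation.

```python
import shlex


def hdfs_dfs(*args):
    """Return an `hdfs dfs` command as an argument list."""
    return ["hdfs", "dfs", *args]


def copy_and_validate_commands(local_path, hdfs_dir):
    """Commands to copy a local path into HDFS and then check the result."""
    return [
        hdfs_dfs("-mkdir", "-p", hdfs_dir),      # create the target folder
        hdfs_dfs("-put", local_path, hdfs_dir),  # copyFromLocal works similarly
        hdfs_dfs("-ls", "-R", hdfs_dir),         # confirm the files are present
        hdfs_dfs("-du", "-s", "-h", hdfs_dir),   # total size of the copied data
        hdfs_dfs("-df", "-h", hdfs_dir),         # capacity and usage of the file system
    ]


# Hypothetical paths, used only for illustration.
for cmd in copy_and_validate_commands("data/retail_db", "/user/demo/retail_db"):
    print(shlex.join(cmd))
```

On a node where the hdfs CLI is available, each list can be passed to subprocess.run(cmd, check=True) to execute it.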

Data Engineering using Spark SQL

Let us deep-dive into Spark SQL to understand how it can be used to build Data Engineering Pipelines. Spark SQL gives us the ability to leverage the distributed computing capabilities of Spark coupled with easy-to-use, developer-friendly SQL-style syntax.

Getting Started with Spark SQL
Basic Transformations using Spark SQL
Managing Tables - Basic DDL and DML in Spark SQL
Managing Tables - DML and Create Partitioned Tables using Spark SQL
Overview of Spark SQL Functions to manipulate strings, dates, null values, etc
Windowing Functions using Spark SQL for ranking, advanced aggregations, etc.

Data Engineering using Spark Data Frame APIs

Spark Data Frame APIs are an alternative way of building Data Engineering applications at scale leveraging distributed computing capabilities of Apache Spark. Data Engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL to build Data Engineering applications.

Data Processing Overview using Spark or Pyspark Data Frame APIs.
Projecting or Selecting data from Spark Data Frames, renaming columns, providing aliases, dropping columns from Data Frames, etc using Pyspark Data Frame APIs.
Processing Column Data using Spark or Pyspark Data Frame APIs - You will be learning functions to manipulate strings, dates, null values, etc.
Basic Transformations on Spark Data Frames using Pyspark Data Frame APIs such as Filtering, Aggregations, and Sorting using functions such as filter/where, groupBy with agg, sort or orderBy, etc.
Joining Data Sets on Spark Data Frames using Pyspark Data Frame APIs such as join. You will learn inner joins, outer joins, etc using the right examples.
Windowing Functions on Spark Data Frames using Pyspark Data Frame APIs to perform advanced Aggregations, Ranking, and Analytic Functions
Spark Metastore Databases and Tables and integration between Spark SQL and Data Frame APIs
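
To illustrate the integration between Spark SQL and Data Frame APIs mentioned in the last item, here is a minimal sketch that registers a Data Frame as a temporary view and computes the same aggregate both ways. The sample rows, column names, and the view name orders_v are made up for illustration and are not taken from the course data sets. run_demo assumes pyspark is installed and imports it only when called.

```python
# Illustrative rows only; not the course's retail data set.
ORDERS = [
    (1, "2013-07-25", "CLOSED", 299.98),
    (2, "2013-07-25", "COMPLETE", 199.99),
    (3, "2013-07-26", "COMPLETE", 129.99),
    (4, "2013-07-26", "COMPLETE", 49.98),
]
COLUMNS = ["order_id", "order_date", "order_status", "order_total"]


def revenue_query(view_name):
    """SQL-style version: revenue per day for completed orders."""
    return f"""
        SELECT order_date, round(sum(order_total), 2) AS revenue
        FROM {view_name}
        WHERE order_status = 'COMPLETE'
        GROUP BY order_date
        ORDER BY order_date
    """


def run_demo():
    """Run the same aggregation with Spark SQL and with Data Frame APIs."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, round as round_, sum as sum_

    spark = (
        SparkSession.builder.master("local[1]").appName("sql-vs-api").getOrCreate()
    )
    orders = spark.createDataFrame(ORDERS, COLUMNS)

    # Expose the Data Frame to Spark SQL as a temporary view.
    orders.createOrReplaceTempView("orders_v")
    via_sql = spark.sql(revenue_query("orders_v")).collect()

    # Equivalent pipeline using filter, groupBy with agg, and orderBy.
    via_api = (
        orders.filter(col("order_status") == "COMPLETE")
        .groupBy("order_date")
        .agg(round_(sum_("order_total"), 2).alias("revenue"))
        .orderBy("order_date")
        .collect()
    )
    spark.stop()
    return [tuple(r) for r in via_sql], [tuple(r) for r in via_api]
```

Calling run_demo() in an environment with Spark available returns both result lists, which makes it easy to compare the SQL-style and Data Frame-style approaches side by side.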

Apache Spark Application Development and Deployment Life Cycle

Once you go through the content related to Spark using a Jupyter-based environment, we will also walk you through how Spark applications are typically developed using Python, deployed, and reviewed.

Set up a Python Virtual Environment and Project for Spark Application Development using PyCharm
Understand the complete Spark Application Development Lifecycle using PyCharm and Python
Build a zip file for the Spark Application, copy it to the environment where it is supposed to run, and run it.
Understand how to review the Spark Application Execution Life Cycle.
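
The build, copy, and run steps above usually end in a spark-submit call. The sketch below assembles one for either deployment mode, optionally passing a zip of dependencies and a properties file. The file names (app.zip, app.properties, app.py) are hypothetical, and the helper function is our own illustration rather than the course's build script.

```python
import shlex


def spark_submit_command(app, master="yarn", deploy_mode="client",
                         py_files=None, properties_file=None, packages=None):
    """Build a spark-submit argument list for a PySpark application."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if py_files:
        cmd += ["--py-files", py_files]               # zipped application modules
    if properties_file:
        cmd += ["--properties-file", properties_file]  # Spark properties to load
    if packages:
        cmd += ["--packages", packages]               # external Maven dependencies
    cmd.append(app)                                   # driver program goes last
    return cmd


# Hypothetical file names, used only for illustration.
print(shlex.join(spark_submit_command(
    "app.py",
    deploy_mode="cluster",
    py_files="app.zip",
    properties_file="app.properties",
)))
```

In client mode the driver runs on the machine that submits the job, while in cluster mode it runs inside the cluster, so the driver logs have to be reviewed through the cluster's UI or log tooling.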

All the demos are given on our state-of-the-art Big Data cluster. You can avail of one-month complimentary lab access by reaching out to support@itversity.com with a Udemy receipt.

Who this course is for:

Any IT aspirant/professional willing to learn Data Engineering using Apache Spark
Python Developers who want to learn Spark to add the key skill to be a Data Engineer
Scala-based Data Engineers who would like to learn Spark using Python as the Programming Language

Course content

Introduction about Spark SQL and PySpark 3 using Python 3 • 14 lectures • 52min

Using ITVersity Labs for hands-on practice (for ITVersity Lab Customers only) • 15 lectures • 1hr 22min

Setup Hadoop and Spark Single Node Cluster on Windows 11 using Docker • 17 lectures • 1hr 20min

Setup Hadoop and Spark Single Node Cluster on AWS Cloud9 using Docker • 14 lectures • 58min

Python Fundamentals • 7 lectures • 1hr 27min

Overview of Hadoop HDFS Commands • 13 lectures • 1hr 22min

Apache Spark 2.x - Data processing - Getting Started • 11 lectures • 1hr 56min

Apache Spark using SQL - Getting Started • 11 lectures • 55min

Apache Spark using SQL - Basic Transformations using Spark SQL • 11 lectures • 1hr 8min

Apache Spark using SQL - Basic DDL and DML • 13 lectures • 1hr 8min
