Udemy
PySpark - Zero to Superhero
Rating: 4.4 out of 5 (6 ratings)
30 students

PySpark and Spark SQL
Created by Ganesh Kudale
Last updated 2/2026
English

What you'll learn

  • Basics of PySpark
  • Reading PySpark Data Frame and various methods of creating PySpark Data Frames
  • Processing using PySpark Data Frames and Spark SQL - Deep Dive
  • Write Transformed Results from Data Frame to Expected Location

Course content

3 sections • 63 lectures • 16h 26m total length
  • Creating the raw data frame (13:40)

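    What is a data frame?

    A data frame is one of Spark's higher-level APIs.

    Apache Spark components -
    1. Spark core APIs (RDDs)
    2. Higher-level APIs (data frames and Spark SQL)
    3. GraphX
    4. MLlib

    Data frames are not persistent. They live only for the duration of
    the Spark session unless the results are written out to storage.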


    1. Create a data frame using a Python list

    # Data list
    customer_data = [
        ("Ganesh",30,"Data Engineering"),
        ("Akshay",27,"Data Engineering"),
        ("Snehal",35,"Scrum Master"),
        ("Rahul",43,"Cricketer"),
        ("Rohit",32,"Cricketer"),
        ("Priyesh",31,"IT Manager")
    ]

    # General syntax: pass the data and a schema
    df = spark.createDataFrame(data,schema)

    # Alternative: create the data frame first, then name the columns with toDF
    df = spark.createDataFrame(data).toDF("col1","col2","col3")

  • Defining the Schema (12:25)

    There are 2 ways to define the schema -
    1. Schema DDL
    2. StructType

    Schema DDL - "col1 col1_datatype, col2 col2_datatype"

    StructType (general syntax) -

    struct_schema = StructType([
        StructField("col1",col1_datatype()),
        StructField("col2",col2_datatype())
    ])

    # StructType example

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    struct_schema = StructType([
        StructField("cust_name",StringType()),
        StructField("cust_age",IntegerType()),
        StructField("cust_prof",StringType())
    ])

  • Assignment 1.1
  • Reading the data frame from a file stored at a storage location (15:34)

    File path: /FileStore/emp_data.csv

    1. Infer the schema

    emp_df = spark.read\
    .format("csv")\
    .option("inferSchema",True)\
    .option("header",True)\
    .load("/FileStore/emp_data.csv")

    2. Infer the schema from a sample of the data

    emp_df1 = spark.read\
    .format("csv")\
    .option("inferSchema",True)\
    .option("samplingRatio",0.1)\
    .option("header",True)\
    .load("/FileStore/emp_data.csv")

    3. Enforce the schema with StructType

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, LongType

    emp_schema = StructType([
        StructField("emp_id",IntegerType()),
        StructField("emp_name",StringType()),
        StructField("emp_salary",LongType())
    ])

    emp_df3 = spark.read\
    .format("csv")\
    .schema(emp_schema)\
    .option("header",True)\
    .load("/FileStore/emp_data.csv")

  • Quiz 1.1
  • Different ways of creating the data frame (12:01)

    1. When we have a Python list

    df = spark.createDataFrame(data,schema)

    2. When the file is stored at a storage location

    df = spark.read\
    .format("file_format")\
    .schema(dataframe_schema)\
    .option("header",True)\
    .load("file_path")

    3. Creating the data frame from a table with spark.read.table

    df3 = spark.read.table("employee")

    4. Creating the data frame from a table with spark.table

    df4 = spark.table("employee")

    5. Creating the data frame from a table with a Spark SQL query

    df5 = spark.sql("SELECT * FROM employee")

    6. Using the range function

    # range(end): ids from 0 up to, but not including, 6
    df6 = spark.range(6)

    # range(start,end): ids from 12 up to, but not including, 35
    df7 = spark.range(12,35)

    # range(start,end,step): ids from 2 up to, but not including, 56, in steps of 4
    df8 = spark.range(2,56,4)

  • Transformations and Action in Apache Spark (15:46)

    Actions and Transformations in PySpark -

    Transformations are lazy: they are not executed until an action is called.

    Actions are not lazy: they trigger execution.

    Transformations in PySpark
    filter
    withColumn
    withColumnRenamed
    select
    selectExpr

    Actions in PySpark
    display (available in Databricks notebooks)
    show
    count
    take
    write



    Higher-level APIs are interconvertible -

    Data frame to Spark table -

    createOrReplaceTempView - Creates a temporary view (Spark table) that
    can be accessed only within the current Spark session. If a view with
    the same name already exists, it does not throw an error; it replaces
    the existing view.

    createTempView - Creates a temporary view that can be accessed only
    within the current Spark session. If a view with the same name already
    exists, it throws an error.

    createOrReplaceGlobalTempView - Creates a global temporary view that can
    be accessed across multiple Spark sessions of the same application. It
    is queried through the global_temp database. If a view with the same
    name already exists, it does not throw an error; it replaces the
    existing view.

    createGlobalTempView - Creates a global temporary view that can be
    accessed across multiple Spark sessions of the same application. If a
    view with the same name already exists, it throws an error.

    How to convert a Spark table to a data frame -

    emp_df = spark.read.table("employee1")
    emp_df = spark.table("employee1")
    emp_df2 = spark.sql("SELECT * FROM employee1")

  • Data Frame Read Modes (13:00)

    File path: /FileStore/emp_data-1.csv

    1. Permissive - If it encounters a data type mismatch,
    it sets that cell to NULL. It is the default read mode.
    2. Failfast - If it encounters a data type mismatch,
    it throws an error.
    3. DropMalformed - If it encounters a data type mismatch,
    it drops that row.

    # emp_schema is the StructType defined in the earlier lecture

    # Permissive mode, set explicitly
    emp_permissive_df = spark.read\
    .format("csv")\
    .schema(emp_schema)\
    .option("header",True)\
    .option("mode","permissive")\
    .load("/FileStore/emp_data-1.csv")

    # No mode option: permissive is the default
    emp_permissive1_df = spark.read\
    .format("csv")\
    .schema(emp_schema)\
    .option("header",True)\
    .load("/FileStore/emp_data-1.csv")

    # Failfast mode
    emp_failfast_df = spark.read\
    .format("csv")\
    .schema(emp_schema)\
    .option("header",True)\
    .option("mode","failfast")\
    .load("/FileStore/emp_data-1.csv")

    # DropMalformed mode
    emp_dropmalformed_df = spark.read\
    .format("csv")\
    .schema(emp_schema)\
    .option("header",True)\
    .option("mode","dropmalformed")\
    .load("/FileStore/emp_data-1.csv")

  • Reading Single Line JSON file as PySpark Dataframe (11:18)

    from pyspark.sql.types import *
    from pyspark.sql import functions as F

    bikes_schema = StructType([
        StructField("model",StringType()),
        StructField("mpg",DoubleType()),
        StructField("cyl",DoubleType()),
        StructField("disp",DoubleType()),
        StructField("hp",DoubleType()),
        StructField("drat",DoubleType()),
        StructField("wt",DoubleType()),
        StructField("qsec",DoubleType()),
        StructField("vs",DoubleType()),
        StructField("am",DoubleType()),
        StructField("gear",DoubleType()),
        StructField("carb",DoubleType())
    ])

    bikes_raw_df = spark.read.format("json").schema(bikes_schema).load("/Volumes/demo/default/landing/singleline_json.json")

    bikes_raw_df.display()

    # Same schema, except that "hp" is declared as "horse_power".
    # A schema field whose name does not match a key in the JSON records
    # is expected to come back as NULL.
    bikes_schema1 = StructType([
        StructField("model",StringType()),
        StructField("mpg",DoubleType()),
        StructField("cyl",DoubleType()),
        StructField("disp",DoubleType()),
        StructField("horse_power",DoubleType()),
        StructField("drat",DoubleType()),
        StructField("wt",DoubleType()),
        StructField("qsec",DoubleType()),
        StructField("vs",DoubleType()),
        StructField("am",DoubleType()),
        StructField("gear",DoubleType()),
        StructField("carb",DoubleType())
    ])

    bikes_raw_df1 = spark.read.format("json").schema(bikes_schema1).load("/Volumes/demo/default/landing/singleline_json.json")

    bikes_raw_df1.display()

    # Shorthand: spark.read.json with the schema passed as an argument
    bikes_raw_df2 = spark.read.json(path="/Volumes/demo/default/landing/singleline_json.json",schema=bikes_schema)

    bikes_raw_df2.display()


  • Reading Multi Line JSON file as PySpark Dataframe (8:31)

    from pyspark.sql.types import *
    from pyspark.sql import functions as F

    bikes_schema = StructType([
        StructField("model",StringType()),
        StructField("mpg",DoubleType()),
        StructField("cyl",DoubleType()),
        StructField("disp",DoubleType()),
        StructField("hp",DoubleType()),
        StructField("drat",DoubleType()),
        StructField("wt",DoubleType()),
        StructField("qsec",DoubleType()),
        StructField("vs",DoubleType()),
        StructField("am",DoubleType()),
        StructField("gear",DoubleType()),
        StructField("carb",DoubleType())
    ])

    # A record spans several lines, so the multiLine option must be enabled
    multiline_json_df = spark.read.format("json").option("multiLine",True).schema(bikes_schema).load("/Volumes/demo/default/landing/multiline_json.json")

    multiline_json_df.display()

    # Shorthand: spark.read.json with multiLine passed as an argument
    multiline_json_df1 = spark.read.json(path="/Volumes/demo/default/landing/multiline_json.json",schema=bikes_schema,multiLine=True)

    multiline_json_df1.display()


  • Reading parquet file as pyspark dataframe (9:03)

    Parquet is a column-based (columnar) file format.

    Example: if a file has 200 columns and you want to read only 30 of
    them, Spark can read just those 30 columns instead of the whole file.

    parquet_df = spark.read.format("parquet").load("/Volumes/demo/default/landing/titanic.parquet")

    # Shorthand
    parquet_df1 = spark.read.parquet("/Volumes/demo/default/landing/titanic.parquet")


  • Section 1 - Knowledge Check

Requirements

  • Basics of Python Programming and Basics of SQL

Description

Course Description:

This hands-on course is designed for aspiring and experienced data engineers who want to master PySpark, the Python API for the Apache Spark distributed computing framework. Led by Ganesh Kudale, a seasoned data engineer, the series walks learners through real-world scenarios, from foundational concepts to advanced transformations, with a strong focus on production-grade pipeline development.


What You'll Learn:

PySpark Essentials: RDDs, Data Frames, and Spark SQL

Data Ingestion & ETL: Reading from CSV, JSON, Parquet

Transformations & Actions: Filtering, joins, aggregations, and window functions


Who Should Enroll:

  • Data engineers working with big data platforms

  • Developers transitioning from SQL to PySpark

  • Professionals building scalable pipelines in Big Data

  • Anyone preparing for Spark-related interviews

Curriculum -

Session 1 - Creating the raw data frame

Session 2 - Defining the Schema in PySpark

Session 3 - Reading the data frame from file stored at storage location

Session 4 - Different ways of creating the data frame

Session 5 - Transformations and Action in Apache Spark

Session 6 - Data Frame Read Modes

Session 7 - PySpark withColumn Transformation

Session 8 - PySpark datatype conversions

Session 9 - withColumn in PySpark VS spark SQL

Session 10 - PySpark select transformation

Session 11 - PySpark selectExpr Transformation

Session 12 - Performance difference between select, selectExpr and withColumn transformations

Session 13 - Renaming the column in PySpark data frame and using Spark SQL

Session 14 - Performance efficient approach for renaming columns in PySpark data frame

Session 15 - Filtering data in PySpark

Session 16 - Efficient ways to filter the data in PySpark

Session 17 - Sorting in PySpark Single Column

Session 18 - Sorting in PySpark - Multiple Columns

Session 19 - Sorting in Spark SQL

Session 20 - Performance difference between sort and orderBy in PySpark

Session 21 - Aggregations in PySpark

Session 22 - Simple Aggregations in PySpark - Count, Average, Max, Min

Session 23 - Introduction to Grouping aggregations in PySpark

Session 24 - Grouping aggregations in PySpark - Continuation

Session 25 - Grouping aggregations in PySpark - Continuation 1

Session 26 - Grouping Aggregations on Multiple Columns in PySpark

Session 27 - Grouping Aggregations on Multiple Columns in PySpark Continuation

Session 28 - Running multiple grouping aggregations together

Session 29 - Windowing Aggregations in PySpark - Row_Number

Session 30 - Windowing Aggregations in PySpark - Rank

Session 31 - Windowing Aggregations in PySpark - Dense Rank

Session 32 - Remove duplicates using PySpark window functions

Session 33 - Top scorer students in each subject using PySpark window functions

Session 34 - PySpark Window Function Lead Data Frame

Session 35 - PySpark Window Function Lead Spark SQL

Session 36 - PySpark Window Function - LAG

Session 37 - CASE WHEN in PySpark - One when Condition

Session 38 - CASE WHEN in PySpark - Multiple when Conditions and Multiple Conditions within when

Session 39 - WHEN Otherwise in PySpark - One when Condition

Session 40 - WHEN Otherwise in PySpark - Multiple when Conditions

Session 41 - Working With dates in PySpark - Python List

Session 42 - Working With dates in PySpark - Storage Location

Session 43 - Adding created timestamp and created date to the newly added data in PySpark

Session 44 - Joins in PySpark - Theory

Session 45 - Inner Join in PySpark - Joining over one Column

Session 46 - Inner Join in PySpark - Joining over one Column - NULL values in joining Columns

Session 47 - Inner Join in PySpark - Joining over multiple Columns

Session 48 - Left Outer Join in PySpark - Joining over one Column

Session 49 - Left Outer Join in PySpark - Joining over one Column - NULL values in joining Columns

Session 50 - Left Outer Join in PySpark - Joining over multiple Columns

Session 51 - Right Outer Join in PySpark - Joining over one Column

Session 52 - Right Outer Join in PySpark - Joining over one Column - NULL values in joining Columns

Session 53 - Right Outer Join in PySpark - Joining over multiple Columns

Session 54 - Full Outer Join in PySpark - Joining over one Column

Session 55 - Full Outer Join in PySpark - Joining over one Column - NULL values in joining Columns

Session 56 - Full Outer Join in PySpark - Joining over multiple Columns

Session 57 - Left Semi Join in PySpark

Session 58 - Left Anti Join in PySpark

Session 59 - Reading Single Line JSON file as PySpark Data frame

Session 60 - Reading Multi Line JSON file as PySpark Data frame

Session 61 - Reading parquet file as PySpark data frame

Session 62 - Data Frame writer API and data frame writer Modes

Who this course is for:

  • Beginner and Experienced PySpark and Spark SQL Developers