PySpark & AWS: Master Big Data With PySpark and AWS

Name: PySpark & AWS: Master Big Data With PySpark and AWS
Rating: 4.4 (3187 reviews)

Mastering AWS & PySpark: Spark, PySpark, AWS, Spark Ecosystem, Hadoop, and Spark Applications [AWS, Hadoop, Pyspark]

Created by AI Sciences, AI Sciences Team

Last updated 12/2025

English

What you'll learn

● The introduction and importance of Big Data.
● Practical explanation and live coding with PySpark.
● Spark applications
● Spark EcoSystem
● Spark Architecture
● Hadoop EcoSystem
● Hadoop Architecture
● PySpark RDDs
● PySpark RDD transformations
● PySpark RDD actions
● PySpark DataFrames
● PySpark DataFrames transformations
● PySpark DataFrames actions
● Collaborative filtering in PySpark
● Spark Streaming
● ETL Pipeline
● CDC and Ongoing Replication

Course content

9 sections • 190 lectures • 18h 40m total length

Why Big Data3:23
Explore why big data matters and learn to analyze it with PySpark on AWS, gaining hands-on skills for batch and real-time analytics and in-demand careers.
Applications of PySpark3:03
Explore PySpark applications across streaming data, machine learning with MLlib, batch analysis, and ETL, plus full load and replication, to drive real-time insights and scalable data pipelines.
Introduction to Instructor0:36
Meet your instructor, Muhammad Ahmad, a cloud and big data engineer with experience in Python, PySpark, AWS cloud, data mining, data orchestration, and teaching.
Introduction to Course1:39
Master data analysis with PySpark from basics to advanced, covering RDDs, DataFrames, Spark SQL, transformations and actions, plus Databricks and AWS integration for a CDC pipeline.
Projects Overview3:16
Explore practical big data analytics with PySpark and AWS through hands-on projects: student and employee data analysis, collaborative filtering with MLlib, spark streaming, and ETL and replication workflows.
Request for Your Honest Review1:18
Explore the Udemy review system, preview upcoming topics and concepts you'll work on in the real world, and rate honestly if this course meets five-star standards or request updates.
Links for the Course's Materials and Codes0:09
Practice Test # 01

Links for the Course's Materials and Codes0:09
Why Spark3:53
Explore Spark’s speed and distributed processing, enabling real-time analytics, caching, fault tolerance, and libraries like Spark SQL, machine learning (MLlib), and graph processing (GraphX) across Python, Scala, Java, and R.
Hadoop EcoSystem4:40
Explore the Hadoop ecosystem, including HDFS for distributed storage, YARN as the operating system, and MapReduce; see how Spark builds on this foundation for faster processing.
Spark Architecture and EcoSystem7:58
Discover Spark architecture and ecosystem, including the driver node, cluster manager, and workers, and explore core APIs with libraries like Spark SQL, Spark Streaming, MLlib, and GraphX.
DataBricks SignUp4:10
Sign up for Databricks online or offline, then verify your email and sign in. Explore notebooks, data import, machine learning, and transformations within a refreshed workspace interface.
Create DataBricks Notebook5:28
Learn how to spin up a Databricks cluster, create a notebook, and attach the notebook to the cluster to run your first hello world and test the setup.
Download Spark and Dependencies4:22
Set up PySpark offline on Windows or Mac by downloading Java, Python, Spark, and Winutils. Follow platform-specific steps and download latest versions, including Hadoop version 3.x package.
Java Setup on Windows3:19
Install and configure Java on Windows by downloading the packages, granting permissions, and setting the JAVA_HOME and PATH variables to complete the setup.
Windows Setup Python Spark Hadoop5:48
Configure a Windows-based PySpark environment by installing Python, extracting Spark, and setting Hadoop paths and environment variables (SPARK_HOME and PATH) for offline use.
Running Spark on Windows3:15
Install and verify Spark on Windows by launching spark-shell, confirming Spark 3.5.1 works, and testing a PySpark session that prints Hello, after installing Hadoop, Java, and Python.
Java Download on MAC1:43
Download and install the Java JDK on Mac, using Java 11 for Spark compatibility, then download the macOS installer and complete Oracle sign-in if prompted.
Installing JDK on MAC0:53
Open the downloaded JDK package to launch the installation wizard, follow prompts to continue and install, enter your Mac password if prompted, then confirm JDK is installed and proceed.
Setting Java Home on MAC2:53
Set up JAVA_HOME on Mac by locating the JDK path, editing or creating the bash profile, exporting JAVA_HOME for version 11, and reloading with source to verify.
Java check on MAC1:09
Verify Java installation on Mac by running java -version and javac -version in a new terminal. Confirm Java 11 is installed, troubleshoot if needed, then proceed to spark installation.
Installing Python on MAC1:04
Install Python on macOS to complete the PySpark setup, download Python 3.9.6 from the macOS link, run the installer, and prepare the PySpark path in the next video.
Setup Spark on MAC4:09
Download and extract Apache Spark, place the extracted folder under documents/dev, then set SPARK_HOME, update bash profile, source it, and verify PySpark with a simple Python test.
Which of the following statements is true?
Which of the following is not a part of spark ecosystem?
Practice Test # 02

Links for the Course's Materials and Codes0:09
Spark RDDs8:28
Learn how Spark RDDs, which are immutable distributed datasets, power parallel processing by transforming data and triggering actions, with lazy evaluation and distribution across nodes.
Creating Spark RDD9:39
Learn to read a text file with PySpark, configure Spark config and context, create an RDD, and trigger collect to realize lazy evaluation on Databricks.
Running Spark Code Locally10:06
Export PySpark code from Databricks and run it on your local machine using spark-submit, adjusting file paths and environment variables to resolve Python version issues.
RDD stands for:
RDD is created by using:
RDD Map (Lambda)10:59
Apply the Spark RDD map function (lambda) to transform each element of an RDD, such as splitting lines into tokens or concatenating strings. Trigger with collect on the text RDD.
RDD Map (Simple Function)9:28
Learn how to replace lambda with a concrete function in an RDD map to split strings, convert to integers, and apply custom transformations in Spark using Python.
Quiz (Map)1:14
Read a text file from Databricks storage, then use a PySpark map to compute and return the length of each word as a list.
Solution 1 (Map)6:28
Explore solving a quiz with PySpark on Databricks by reading a text file into an RDD, mapping to word lengths, and returning a new RDD of lengths.
Solution 2 (Map)3:52
Learn to replicate the prior quiz output inside a lambda using map in PySpark, splitting strings, applying list comprehension, and collecting results, with notes on readability and lambda use.
RDD FlatMap10:03
Explore flatMap as an extension of map in PySpark, acting as a mapper to flatten an RDD's nested outputs into a single list and compare with map.
RDD Filter7:52
Explore how RDD filter removes elements to produce a new RDD using lambda expressions or defined functions, with true/false conditions and practical examples.
Quiz (Filter)1:28
Execute a quiz that reads a file from dbfs or Databricks storage, filters out words starting with a or c in an RDD, and flattens to a single list.
Solution (Filter)16:09
Demonstrates filtering an RDD in PySpark by reading a file, using flatMap to split words, and applying a filter with a lambda to remove words starting with a or c.
RDD Distinct6:14
Apply the distinct transformation to an RDD to drop duplicates and produce a new RDD with unique elements. Explore how distinct works after flatMap and before collect in PySpark.
RDD GroupByKey16:52
Learn how to apply the group by key transformation on RDDs, create key-value pairs with map and flatMap, and aggregate values into grouped lists.
RDD ReduceByKey13:36
Explore reduce by key in Spark RDDs, aggregating values by key with a lambda to produce a single value per key, contrasting with group by key that yields value lists.
Quiz (Word Count)0:53
Learn to perform a word count in PySpark by reading a file into an RDD and returning each word as a key and its count as a value.
Solution (Word Count)14:58
Learn how to compute word counts with PySpark RDDs by reading a file, creating an RDD, and applying flatMap, map, and reduceByKey, with filtering for empties and noting multiple approaches.
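The word count steps named above (flatMap, filter for empties, map, reduceByKey) can be sketched roughly as follows. This is a minimal illustration rather than the course's own solution: `sc` is assumed to be an existing SparkContext, and `path` and the helper names are hypothetical.

```python
def split_words(line):
    # Splitting on a single space can yield empty strings (for example
    # when a line contains two consecutive spaces).
    return line.split(" ")


def is_not_empty(word):
    return word != ""


def to_pair(word):
    # Each word becomes a (key, value) pair with an initial count of 1.
    return (word, 1)


def add(a, b):
    return a + b


def word_count(sc, path):
    """Return an RDD of (word, count) pairs for the text file at `path`.

    `sc` is an existing SparkContext; `path` is a hypothetical location.
    Nothing is computed until an action such as collect() is called.
    """
    lines = sc.textFile(path)
    return (
        lines.flatMap(split_words)   # one element per word
        .filter(is_not_empty)        # drop empty tokens
        .map(to_pair)                # (word, 1)
        .reduceByKey(add)            # sum the counts per word
    )
```

Calling `word_count(sc, path).collect()` would trigger the computation and return the pairs to the driver.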
RDD (Count and CountByValue)7:02
Explore spark rdd actions and transformations, focusing on count and count by value, which trigger computation and return element counts and per-value frequencies.
RDD (saveAsTextFile)15:20
Learn how Spark writes an RDD to text files using saveAsTextFile, with a default of two partitions, actions triggering processing, and per-partition output files in a folder.
RDD (Partition)17:56
Master how repartition and coalesce adjust rdd partitions, creating new rdds and optimizing parallelism; learn when repartition increases partitions and coalesce decreases them, plus dbfs file read‑write behavior.
Finding Average-1 14:54
Compute the average movie rating by reading a CSV file into Spark, mapping to key value pairs, and applying reduce by key to aggregate totals and averages.
Finding Average-2 7:00
Compute movie ratings by building RDD transformations with map and reduceByKey, accumulate total rating and count, and derive and verify the average rating across the dataset.
Quiz (Average)1:19
Read a file containing months, city codes, and monthly ratings, then write code to calculate the average score for each month.
Solution (Average)11:15
Create an rdd from the input csv, map each record to month and (rating, 1), then reduce by key to compute monthly sums and counts and derive the average.
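A rough sketch of the average-by-key pattern described above, mapping each record to month and (rating, 1) and then reducing by key. The column order (month, city code, rating), the absence of a header row, and all function names are assumptions made for illustration; `sc` is an existing SparkContext.

```python
def parse_record(line):
    # Assumed layout: month,city_code,rating with no header row.
    month, _city, rating = line.split(",")
    return (month, (float(rating), 1))


def combine(a, b):
    # Add running totals and running counts element-wise.
    return (a[0] + b[0], a[1] + b[1])


def average(total_and_count):
    total, count = total_and_count
    return total / count


def monthly_average(sc, path):
    """Return an RDD of (month, average rating) pairs.

    `path` is a hypothetical CSV location.
    """
    return (
        sc.textFile(path)
        .map(parse_record)      # (month, (rating, 1))
        .reduceByKey(combine)   # (month, (sum, count))
        .mapValues(average)     # (month, sum / count)
    )
```

Carrying the count alongside the sum is what lets a single reduceByKey pass produce everything needed for the average.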
Finding Min and Max10:08
Learn to compute per-movie minimum and maximum ratings in PySpark by converting lines to (movie, rating) pairs, casting to int, and applying reduceByKey.
Quiz (Min and Max)0:48
Apply PySpark to read an input file into an RDD and implement MapReduce-style transformations to compute the minimum and maximum ratings for each city.
Solution (Min and Max)6:04
Learn to compute min and max ratings by city with PySpark using an RDD workflow: read the csv, map to city and rating, then reduce by key.
Project Overview2:16
Implement a PySpark mini project that reads a student CSV into an RDD and computes counts, gender-based marks, 50+ pass/fail, per-course enrollment, marks, averages, and min/max by course and gender.
Total Students3:30
Upload the student data csv, set up Spark configurations and context, read the csv, strip the header using first and a filter, then count the records to show 1000 students.
Total Marks by Male and Female Student6:42
Learn to compute total marks by gender in PySpark using RDDs: map to create gender and marks pairs, cast to int, and reduceByKey to sum by male and female.
Total Passed and Failed Students4:39
Learn to compute total passed and failed students in PySpark using filters and counts on an RDD, based on a 50 threshold.
Total Enrollments per Course4:56
Learn how to compute total enrollments per course using map-reduce in PySpark, creating key-value pairs, applying reduce by key, and summing counts.
Total Marks per Course3:03
Compute total marks per course using a PySpark RDD workflow, mapping course names to marks and applying reduceByKey to sum totals per course.
Average marks per Course12:35
Learn to compute the average marks per course by aggregating total scores and enrollments with reduce by key, then map values to extract sums and divide to get the average.
Finding Minimum and Maximum marks3:41
Compute minimum and maximum marks per course by mapping course names to marks and reducing by key with max or min. Collect results to reveal per-course statistics with PySpark.
Average Age of Male and Female Students5:38
Compute the average age by gender using Spark RDD transformations, map and reduce by key, then map values to show female and male averages.

Links for the Course's Materials and Codes0:09
Introduction to Spark DFs8:08
Explore how Spark data frames wrap RDDs, provide schemas and named columns, enable parallel, SQL-like analysis from structured, unstructured, external data sources, and existing RDDs.
Creating Spark DFs10:25
Create a Spark session with SparkSession.builder and getOrCreate, read a CSV into a Spark data frame, and set header to true.
DF stands for:
DF is created by using:
Spark Infer Schema7:38
Learn how Spark infers a dataframe schema from a file, revealing data types and header use. Explore options to configure schema inference and delimiters for CSV and TSV files.
Spark Provide Schema8:19
Create and apply a custom Spark schema with StructType and StructField, mapping csv columns to explicit data types to control data frame reading.
Create DF from Rdd8:11
Learn to build data frames from an RDD in PySpark by extracting headers, mapping records, and applying explicit schemas or inferring schemas for scalable big data workflows on AWS.
Rectifying the Error5:07
Select DF Columns11:40
Learn to select multiple columns from a Spark data frame using df.select with column names, attribute-style access on the data frame, or the col function, then create new data frames containing only the selected columns.
Spark DF withColumn19:36
Learn to use withColumn in a spark data frame to manipulate column values, cast types, create new columns, and add lit literals, with examples like updating marks and adding country.
Spark DF withColumnRenamed and Alias6:03
Learn to rename spark df columns with withColumnRenamed and alias, noting lazy evaluation requires assigning the transformed df back, and aliasing during select to rename outputs.
Spark DF Filter rows15:55
Learn to filter Spark data frame rows using df.filter with column expressions, apply single and multiple conditions, and use isin, startswith, endswith, contains, and like for precise matching.
Quiz (select, withColumn, filter)1:17
Read student data from csv into a dataframe, add total marks column set to 120, compute average, filter oop above 80% and cloud above 60%, print names and marks.
Solution (select, withColumn, filter)10:09
Read student data from a CSV, add total marks column of 120 with withColumn, compute average, filter for OOP above 80% and cloud above 60%, then select names and marks.
Spark DF (Count, Distinct, Duplicate)10:46
Master counting rows, identifying distinct rows, and dropping duplicates in a Spark data frame, with examples on counting after filters and using drop duplicates by selected columns.
Quiz (Distinct, Duplicate)0:35
Solve a quick quiz by reading Student Data.csv into a data frame and displaying unique rows for age, gender, and course to distinguish between drop duplicates and distinct.
Solution (Distinct, Duplicate)5:09
Learn to read data from the student data.csv file into a Spark data frame and extract unique age, gender, and course combinations using distinct or drop duplicates in PySpark.
Spark DF (sort, orderBy)6:15
Learn to sort a Spark data frame with sort or orderBy, arranging rows by single or multiple columns in ascending or descending order, using df notation and integer data.
Quiz (sort, orderBy)1:45
Read the office data CSV file into a dataframe, sort by bonus ascending, then sort by age and salary in descending and ascending orders, and preview the results.
Solution (sort, orderBy)9:05
Explore sorting dataframes in PySpark with orderBy and sort, applying ascending and descending orders on bonus, age, and salary, while creating and displaying new transformed dataframes.
Spark DF (Group By)12:21
Group by in a Spark DataFrame creates groups based on a selected column, then apply aggregations such as sum, count, max, min, and mean to each group.
Spark DF (Group By - Multiple Columns and Aggregations)10:28
Explore multi-column grouping in PySpark DataFrames with groupBy and agg, applying sum, mean, max, min, and count by course and gender for deeper analytics.
Spark DF (Group By -Visualization)13:16
Learn how Spark df group by operates under the hood by visualizing department and state aggregations, including counts, sums of salary, and multi-column groupings.
Spark DF (Group By - Filtering)10:58
Learn how to filter with group by in Spark dataframes, before and after grouping, using where as a filter, aliases, and column notation to refine aggregates.
Quiz (Group By)0:42
Read csv into PySpark data frame and perform group-by analyses to compute enrollments per course, gender counts, and marks by gender, plus min, max, and average marks by age group.
Solution (Group By)7:41
Use PySpark group by course, gender, and age to derive enrollment counts and statistics like sum, min, max, and average marks.
Quiz (Word Count)0:44
Read the word data text file into a data frame, then calculate and display the count of each word as a simple word count quiz, noting there is no header.
Solution (Word Count)4:29
Read data from the word data text file into a PySpark data frame, group by the text column, and compute word counts to display the frequency of each word.
Spark DF (UDFs)8:24
Create and apply user defined functions (UDFs) in Spark DataFrames to compute total salary by summing salary and bonus, with proper return types and column mappings.
Quiz (UDFs)1:20
Complete a UDF-based quiz that reads the office data CSV file into a dataframe and adds an increment column using NY and CA salary and bonus rules.
Solution (UDFs)7:59
Learn to build and apply a PySpark UDF to compute state-based salary increments (NY 10%, CA 12%) plus bonuses, using withColumn on data frames.
Solution (Cache and Persist)7:20
Cache data and persist it in memory to speed Spark data frame workflows by caching after transformations and reusing results across actions.
Spark DF (DF to RDD)7:14
Learn how to convert data frames to RDDs and access the underlying RDDs, then decide when RDDs or data frames are best for grouping and aggregations.
Spark DF (Spark SQL)6:07
Learn how Spark SQL lets you create data frames, register them as temporary views or tables, and run SQL queries for filtering, aggregations, and data exploration.
Spark DF (Write DF)10:35
Learn how to write a Spark DataFrame to CSV, control headers and schemas, and select write modes like overwrite, append, ignore, or error, while understanding partitions.
Project Overview2:01
Read the office data csv into a dataframe in PySpark workflow, analyze employees by department and state, compute salaries and bonuses, apply age-based raises, and save the results.
Project (Count and Select)4:01
Learn to load a csv into a spark data frame, count employees, and derive unique department counts and names using group by and drop duplicates in PySpark.
Project (Group By)4:16
Group data by department to count employees in each department using PySpark. Extend to group by state, then by department within each state to show state-department employee counts.
Project (Group By, Aggregations and Order By)4:54
Group employees by department, calculate minimum and maximum salaries with aggregations, and sort results by these values in ascending order using spark data frames.
Project (Filtering)8:10
Filter a PySpark data frame to show New York finance employees whose bonuses exceed the NY state average bonus, using group by, average bonus, and conditional filtering.
Project (UDF and WithColumn)6:02
Create and register a user-defined function to increment salary by 500 for employees whose age is greater than 45, then apply it with withColumn to update the salary column.
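The rule described above (add 500 to the salary when age is greater than 45) could be wrapped in a UDF along these lines. This is a sketch under assumptions: the column names `age` and `salary` and their integer types are not confirmed by the course description, and the function names are hypothetical.

```python
def raised_salary(age, salary):
    """Return salary + 500 when age is greater than 45, else salary unchanged."""
    if age is None or salary is None:
        # Leave rows with missing values untouched.
        return salary
    return salary + 500 if age > 45 else salary


def apply_raise(df):
    """Overwrite the (assumed) integer `salary` column of `df` with the raised value."""
    # Imported lazily so raised_salary can be exercised without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    raise_udf = udf(raised_salary, IntegerType())
    return df.withColumn("salary", raise_udf(col("age"), col("salary")))
```

Because data frames are immutable, `apply_raise(df)` returns a new data frame and the result has to be assigned to a variable. For a rule this simple, the built-in `pyspark.sql.functions.when` would avoid the overhead of a Python UDF.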
Project (Write)3:07
Filter employees aged over 45 in a data frame and write the result to CSV in the output_45 folder, creating multiple files based on partitions and validating via DBFS.

Links for the Course's Materials and Codes0:09
Collaborative filtering2:21
Explore collaborative filtering with Spark using dataframes and RDDs, and see how recommender systems predict user preferences to suggest products, shows, and content.
Utility Matrix3:54
Learn how a utility matrix drives recommender systems by filling missing ratings for users and movies, using averages and user similarities to predict top picks.
Explicit and Implicit Ratings4:06
Explore explicit and implicit ratings in recommender systems, detailing how explicit ratings are provided and why implicit signals—time spent on episodes and genre exploration—are harder to translate.
Expected Results3:00
Explore collaborative filtering to generate a user–movie rating matrix and personalized recommendations, inferring unseen ratings from existing data and showing how user tastes guide future movie suggestions.
Dataset6:28
Upload the movie and ratings dataset to DBFS, read them with Spark, inspect schemas, and prepare for collaborative filtering in a Databricks notebook.
Joining Dataframes6:33
Join rating and movies dataframes on movie ID to enrich ratings with titles and genres. The video shows reading CSVs and performing a left join to produce the combined data.
Train and Test Data6:17
Split the ratings data frame into train and test sets using random split with 0.8 for training and 0.2 for testing to train a recommender system and evaluate its performance.
ALS model5:47
Learn to build an alternating least squares (ALS) model in PySpark by specifying the user, item, and rating columns, nonnegative=True, implicitPrefs=False, and coldStartStrategy="drop".
Hyperparameter tuning and cross validation8:14
Build a param grid with 16 ALS model configurations in PySpark. Evaluate with a regression evaluator using RMSE and choose the best through five-fold cross-validation.
Best model and evaluate predictions4:03
Train the model with the cross validator on the training data, select the best model from the 16 options, then test predictions on the test data and compute rmse.
Recommendations10:33
Learn to build a PySpark ALS-based recommender on Databricks, generate top five movie recommendations per user, and flatten results with explode for per-user insights.

Links for the Course's Materials and Codes0:09
Introduction to Spark Streaming4:36
Compare batch analysis and streaming analysis, using spark streaming to read streaming input, transform with dataframes or RDDs, and output to any format or storage.
Spark Streaming with RDD4:15
Create a streaming context with SparkConf and SparkContext for RDDs. Read data from a streaming input directory at the time interval and process new files as they arrive.
Spark streaming is used to:
Spark Streaming Context5:00
Discover how to read data in spark streaming by specifying an input directory, creating an rdd from text file stream, and starting the streaming context.
Spark Streaming Reading Data5:09
Learn to read data with Spark streaming by configuring a streaming context, ingesting file-based input from a directory, printing streamed words, and managing the context lifecycle with getOrCreate.
Spark Streaming Cluster Restart3:50
Restart the Spark Streaming cluster to resolve DAG manipulation errors after starting streaming in Databricks. Replicate the local setup and run the same code to stream data reliably.
Spark Streaming RDD Transformations7:31
Explore spark streaming with rdd transformations by building a word count using map and reduceByKey on streaming data, and note rdd limitations that dataframes later address.
Which statement is true about SparkContext and StreamingContext:
Spark Streaming DF8:13
Explore spark streaming with data frames by building a read stream from a directory, processing and printing results to the console in complete mode.
Spark Streaming Display5:05
Explore visualizing streaming data with spark streaming and dataframes in databricks. See how starting, stopping, and file ingestion affect word count and aggregated results across files.
Spark Streaming DF Aggregations5:26
Master spark streaming by performing dataframe aggregations with group by to count words from streaming data, and observe counts update as new files land in dbfs.

Links for the Course's Materials and Codes0:09
Introduction to ETL4:49
Learn how to build an ETL pipeline with PySpark, extracting data from diverse sources, applying optional transformations, and loading into multiple destinations using Spark's flexible APIs.
We can perform ETL using PySpark:
ETL stands for:
ETL pipeline Flow2:10
Explore a simple ETL pipeline using PySpark on Databricks to read a CSV from DBFS, apply transformations, and load data into a Postgres database on AWS RDS.
Data set2:25
Prepare the ETL pipeline by uploading a text data set (the word data text file) to DBFS, then use data frames to perform a word count and load the result into a database.
Extracting Data3:11
Extract data from a text file in dbfs using a Spark session and spark.read, create a data frame, and preview it with show as the first ETL step.
Transforming Data14:06
Transform data in an etl pipeline by converting lines into words with split and explode, then count word occurrences using group by and count on a spark data frame.
Loading data (Creating RDS-I)8:58
Finish the ETL pipeline by loading transformed data into an AWS RDS Postgres database, covering creating a free-tier instance, root credentials, VPC settings, and public access configuration.
Load data (Creating RDS-II)2:40
Configure an AWS RDS Postgres database by specifying a database name, selecting a db parameter group, enabling monitoring and performance insights, and reviewing maintenance options before creating the database.
RDS Networking5:21
Learn to configure and secure an AWS RDS Postgres instance by examining endpoints, VPC, subnets, security groups, and inbound rules for port 5432 and connectivity.
Downloading Postgres1:06
Install postgres on your local machine, install pgadmin to connect to AWS RDS, and follow Windows 64-bit installer steps to download and install version 13.3 or the latest.
Installing Postgres1:44
Install Postgres on Windows by following the setup wizard, configuring pgAdmin, and setting the admin password. Complete the installation by confirming the default port and launching pgAdmin.
Connect to RDS through PgAdmin2:26
Connect to an AWS RDS Postgres instance using pgAdmin by configuring a new server with the RDS endpoint and port, then run SQL queries on the Postgres database.
Loading Data15:31
Create a schema and load a word count data frame into an AWS RDS Postgres database using PySpark's JDBC options, then validate the data with queries.
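Loading a data frame into Postgres over JDBC, as described above, might look roughly like this. The host, database, table, and credential values are placeholders supplied by the caller, the function names are hypothetical, and the sketch assumes the PostgreSQL JDBC driver is available on the cluster.

```python
def postgres_jdbc_url(host, port, database):
    # Standard PostgreSQL JDBC URL format.
    return f"jdbc:postgresql://{host}:{port}/{database}"


def write_to_postgres(df, host, database, table, user, password, port=5432):
    """Append the rows of `df` to `table` in a Postgres database.

    All connection details are supplied by the caller; none are taken
    from the course material.
    """
    url = postgres_jdbc_url(host, port, database)
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    # mode="append" adds rows to the table; "overwrite" would replace it.
    df.write.jdbc(url=url, table=table, mode="append", properties=properties)
```

The `host` argument would be the RDS endpoint, and the instance's security group has to allow inbound traffic on the chosen port for the write to succeed.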

Links for the Course's Materials and Codes0:09
Introduction to Project1:48
Implement CDC to replicate all changes from the database inside RDS into HDFS storage, building a data lake and a pipeline that captures inserts, deletes, and updates.
Project Architecture15:33
Design a full load and change data capture pipeline using RDS MySQL, S3, DMS, Lambda, and Glue PySpark to reflect changes into the final destination.
In this project we are going to implement:
The cloud service DMS will be used to:
Creating RDS MySql instance9:17
Instantiate an AWS RDS MySQL under the free tier, configure credentials, storage, and backups, and create a custom MySQL 8.0 parameter group with binlog_format set to row for DMS.
Creating S3 Bucket3:23
Create an S3 bucket to serve as the destination for DMS in a PySpark CDC pipeline, using a unique bucket name and public access settings suitable for a POC.
Creating DMS Source Endpoint5:28
Create a DMS source endpoint for an RDS MySQL instance, configure connection details, test the endpoint, and adjust the endpoint name for clarity.
Creating DMS Destination Endpoint5:26
Create a DMS destination endpoint for an S3 bucket, configure an IAM role with Amazon S3 full access, and test the endpoint labeled CDC S3 PySpark.
Creating DMS Instance2:32
Create an AWS DMS replication instance to move data from MySQL to an S3 bucket, using a micro instance in the default VPC with the default subnet and security settings.
MySql WorkBench1:06
Download MySQL Workbench from the MySQL community downloads page on Windows, download the latest version, no login required, and simply click next to complete the installation.
Connecting with RDS and Dumping Data5:53
Connect to the RDS instance using MySQL Workbench, test the connection, then run a SQL dump that creates a schema, a table with a primary key, and inserts for CDC.
Querying RDS1:48
DMS Full Load8:20
Create a DMS task to migrate existing MySQL data to S3, then replicate ongoing changes using the specified endpoints and replication instance.
DMS Replication Ongoing5:54
Explore how AWS DMS executes a full load from MySQL to S3 and manages ongoing replication, capturing inserts, updates, and deletions before triggering a PySpark pipeline.
Stopping Instances1:36
Stop replication and DF instances to avoid costs, run the PySpark job to fetch data from S3 and store it in another bucket, then stop the RDS without a snapshot.
Glue Job (Full Load)8:19
Define and execute an AWS Glue PySpark job to process a full load and CDC updates, reading CSV data, applying headers, and writing final output.
Glue Job (Change Capture)3:40
Learn how to build a Glue CDC pipeline by performing a full load, processing change data with an updated data frame and UDF, and writing results to S3.
Glue Job (CDC)15:17
Master change data capture with a glue job using PySpark on AWS. Learn to apply insertions, updates, and deletions to a final data frame and load it back.
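The merge logic behind applying insertions, updates, and deletions can be illustrated in plain Python. This is a conceptual sketch, not the Glue job itself: it assumes each change record carries an operation code of "I", "U", or "D" and a primary key, and the function name and record shape are hypothetical. In a Glue job the same rules would be applied to data frames.

```python
def apply_changes(final_rows, changes):
    """Apply CDC records to a full-load snapshot keyed by primary key.

    final_rows: dict mapping primary key -> row tuple (the current final data).
    changes: iterable of (op, key, row) where op is "I", "U" or "D".
    Returns a new dict; the input snapshot is not modified.
    """
    result = dict(final_rows)
    for op, key, row in changes:
        if op in ("I", "U"):
            # Inserts add a row; updates replace the row with the same key.
            result[key] = row
        elif op == "D":
            # Deletes remove the row if it is present.
            result.pop(key, None)
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return result
```

Changes are applied in the order given, so a later update to the same key wins over an earlier one.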
Creating Lambda Function and Adding Trigger6:36
Create a Python lambda function triggered by S3 object creation to pass the file name to a PySpark Glue job for full load or change-data processing and write results back.
Checking Trigger5:12
Test the lambda trigger by deploying code, running a test, and validating S3 uploads activate the function with CloudWatch logs.
Getting S3 file name in Lambda4:18
Learn to extract the triggering S3 bucket and file name from a lambda event, print the event data, deploy updates, and verify in CloudWatch.
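Extracting the bucket and file name from an S3 trigger event could be sketched as below. The handler reads only the first record of the event, which is a simplifying assumption, and the return shape is chosen for illustration.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    # An S3 notification delivers one or more records; this sketch
    # only looks at the first one.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])
    # print output ends up in the function's CloudWatch log stream.
    print(f"bucket={bucket} key={key}")
    return {"bucket": bucket, "key": key}
```

Decoding the key matters for file names that contain spaces or special characters, which would otherwise not match the object's real name.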
Creating Glue Job5:24
Create a Glue job to run PySpark on AWS, configure an IAM role with S3 and CloudWatch access, and supply your own script.
Adding Invoke for Glue Job4:39
Invoke a PySpark job on AWS Glue from a lambda function using boto3, starting the job with S3 bucket and file name arguments and handling the response.
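Starting a Glue job from Lambda with the bucket and file name as arguments might look like the following sketch. The argument names `--bucket` and `--key` and the function name are hypothetical; the client is passed in so that, inside a Lambda function, it would be created with `boto3.client("glue")`.

```python
def start_glue_job(glue_client, job_name, bucket, key):
    """Start a Glue job run and return its run id.

    glue_client: a boto3 Glue client, e.g. boto3.client("glue").
    job_name: name of an existing Glue job (supplied by the caller).
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        # Glue job arguments are passed as a string-to-string mapping;
        # the names used here are illustrative.
        Arguments={"--bucket": bucket, "--key": key},
    )
    return response["JobRunId"]
```

Passing the client in as a parameter keeps the function easy to exercise with a stand-in object before deploying it.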
Testing Invoke4:49
Upload a dump file to the S3 bucket, deploy the Lambda, and invoke the Glue PySpark job with the bucket and file name. Monitor CloudWatch and Glue logs to verify.
Writing Glue Shell Job5:42
Transfer spark code from Databricks into a Glue shell job, create a Spark session, and read from and write to S3 buckets while handling full load and incremental updates.
Full Load Pipeline6:32
Spin up the RDS and DMS, perform the full load from MySQL into S3, then trigger Lambda and AWS Glue to process and replicate changes via CDC.
Change Data Capture Pipeline7:02
Execute a change data capture pipeline with PySpark and AWS, coordinating MySQL updates and deletions via Glue and Lambda, merging into an S3 final output for CDC validation.

Links for the Course's Materials and Codes0:09
Fundamentals of AWS for Chatbots: Lex Bot Overview4:23
Discover the Amazon Lex bot architecture and how to define intents, utterances, request data, slots, and fulfillment to build and deploy chatbots with voice and text.
Fundamentals of AWS for Chatbots: Benefits of Amazon Lex2:53
Explore the benefits of using Amazon Lex for chatbots on AWS cloud, including seamless deployment and scaling, built-in AWS integrations, and cost-effective, simple bot development with versatile input technologies.
Fundamentals of AWS for Chatbots: Framework of Lex4:03
Explore Amazon Lex fundamentals in a bank chatbot, where intents are understood, prompts guide replies, and AWS Lambda retrieves data to provide a balance.
Chatbot Development with AWS Lex and AWS Lambda: Module Overview 6:37
Learn chatbot development with Amazon Lex, including intent classification, Lambda integration, Twilio SMS, website integration, and response cards, through theory and hands-on tutorials.
Chatbot Development with AWS Lex and AWS Lambda: Chatbot Steps 5:11
Explore how to develop a chatbot with Amazon Lex, wiring it to AWS Lambda, Twilio, and DynamoDB storage, and follow steps from bot configuration to build and test.
Chatbot Development with AWS Lex and AWS Lambda: AWS Lambda Steps 2:55
Explore the five-step process to connect AWS Lambda with Amazon Lex, including creating, updating, and testing a basic Lambda function to enable fulfillment and validation for chatbot development.
Chatbot Development with AWS Lex and AWS Lambda: Twilio and Website 4:54
Explore how to integrate AWS Lex chatbots with Lambda, Twilio, and websites, covering account setup, chatbot integration, and live demos.
Chatbot Development with AWS Lex and AWS Lambda: Response Cards 1:23
Set up response cards for an Amazon Lex chatbot and upgrade slots. Build the chatbot and finish with a live demo.
Chatbot Development with AWS Lex and AWS Lambda: Start Developing Chatbot 8:50
Develop an Amazon Lex chatbot by creating a bot (blank, example, or transcript), naming it, configuring IAM permissions, language, and idle session timeout, then cover intents and utterances.
Chatbot Development with AWS Lex and AWS Lambda: Intent Utterance and Slot 4:52
Explore how intents, utterances, and slots drive a pizza-ordering chatbot with Amazon Lex, using natural language processing to identify user needs and collect slot values like size and crust.
Chatbot Development with AWS Lex and AWS Lambda: Making Utterances 3:54
Set up the book hotel intent in AWS Lex with utterances like book a hotel, and enable slots for variables such as two nights and New York.
Chatbot Development with AWS Lex and AWS Lambda: Generic Utterance with Slots 5:55
Learn to design a hotel booking chatbot with Amazon Lex by defining slots for nights and location, creating utterances, marking slots as required, and adding prompts to collect missing details.
Chatbot Development with AWS Lex and AWS Lambda: Adding Custom Slots 4:43
Add custom slots to a hotel booking bot with AWS Lex and Lambda, creating a room type slot type and prompts for the Book Hotel intent.
Chatbot Development with AWS Lex and AWS Lambda: Build and Test 7:32
Build and test a hotel booking bot using AWS Lex and Lambda, configure utterances and slots, add initial, confirmation, and fulfillment messages, and test interactions with the built-in test tool.
Chatbot Development with AWS Lex and AWS Lambda: Visual Builder 4:31
Explore building a hotel booking chatbot with AWS Lex Visual Builder, detailing conversation flow, slots, intents, and fallback handling, and integrating Lambda.
Chatbot Development with AWS Lex and AWS Lambda: Lambda Introduction 2:51
Connect your chatbot to a backend with AWS Lambda, a serverless platform that runs code on demand and accepts code as a zip archive or a container image.
Chatbot Development with AWS Lex and AWS Lambda: Interconnection 9:23
Learn to build a hotel chatbot by creating a Python Lambda function, connecting it to an AWS Lex bot, and testing the end-to-end integration.
Chatbot Development with AWS Lex and AWS Lambda: Starting Lambda Code 6:53
Build a Lex-based chatbot with AWS Lambda by implementing slot validation for location and other slots, handling session state and intents, and processing invocation in Python.
Chatbot Development with AWS Lex and AWS Lambda: Session state Dialog Hook and Dialog Action 7:24
Explore session state and dialog actions in Amazon Lex V2, including dialog and fulfillment code hooks, intents, slots, validation results, and elicit slot behavior.
Chatbot Development with AWS Lex and AWS Lambda: Completing Lambda Function 1:33
Complete a Lex chatbot Lambda function using a fulfillment code hook, closing the session after fulfilling slots and returning a plain text confirmation like 'Thanks. I have placed your reservation.'
Chatbot Development with AWS Lex and AWS Lambda: Testing our Chatbot 4:48
Test and validate a hotel booking chatbot using AWS Lex and AWS Lambda, guiding city and date inputs, slot filling and fulfillment, with live logs and deployment steps.
Chatbot Development with AWS Lex and AWS Lambda: Chatbot Deployment on WhatsApp with Twilio 13:41
Configure a hotel chatbot for WhatsApp using AWS Lex and Twilio. Build and test via sandbox, buy numbers, create versions and aliases, and enable two-way messaging through channel integration.
Chatbot Development with AWS Lex and AWS Lambda: Integration with Boto 8:20
Deploy a hotel chatbot using boto3 in Python on Google Colab with the AWS Lex V2 runtime. Configure region, credentials, bot ID and alias, language, and session ID for recognize_text responses.
Chatbot Development with AWS Lex and AWS Lambda: Responses with Boto 6:02
Build a hotel booking chatbot with AWS Lex and Lambda, managing session state, intents, and elicit slots using boto3 to guide users from city to reservation.
Chatbot Development with AWS Lex and AWS Lambda: Chatbot on Website 8:45
Deploy a website chatbot with AWS Lex via Kommunicate.io, configuring Lex V2, region, and alias, then test, monitor conversations, and embed the HTML snippet.
Chatbot Development with AWS Lex and AWS Lambda: Response Cards for User Experience 3:03
Improve bot user experience by using AWS Lex response cards with Lambda to present selectable options for hotels or flowers, reducing wrong entries and guiding user flows.
Chatbot Development with AWS Lex and AWS Lambda: Complete Chatbot with Response Cards 10:10
Explore building a complete hotel booking chatbot with AWS Lex and Lambda, using slot types and prompts, response cards, and interactive previews to elicit city, room type, and dates.
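
To make the Lex and Lambda lectures above more concrete, here is a minimal Python sketch of a Lex V2 code hook for a hotel booking intent. It is an illustration under stated assumptions, not the course's own code: the slot name Location and the VALID_CITIES allow-list are hypothetical, while the event and response fields (sessionState, dialogAction, invocationSource, messages) follow the Lex V2 Lambda format.

```python
VALID_CITIES = {"new york", "london", "paris"}  # hypothetical allow-list


def slot_value(slots, name):
    """Return the interpreted value of a slot, or None if it is unfilled."""
    slot = slots.get(name)
    if not slot or not slot.get("value"):
        return None
    return slot["value"].get("interpretedValue")


def elicit_slot(intent, slot_name, message):
    """Ask the user again for one specific slot."""
    return {
        "sessionState": {
            "dialogAction": {"type": "ElicitSlot", "slotToElicit": slot_name},
            "intent": intent,
        },
        "messages": [{"contentType": "PlainText", "content": message}],
    }


def delegate(intent):
    """Let Lex decide the next step based on the bot configuration."""
    return {"sessionState": {"dialogAction": {"type": "Delegate"}, "intent": intent}}


def close(intent, message):
    """Mark the intent as fulfilled and end the conversation."""
    fulfilled = dict(intent, state="Fulfilled")
    return {
        "sessionState": {"dialogAction": {"type": "Close"}, "intent": fulfilled},
        "messages": [{"contentType": "PlainText", "content": message}],
    }


def lambda_handler(event, context):
    intent = event["sessionState"]["intent"]
    slots = intent.get("slots") or {}
    if event.get("invocationSource") == "DialogCodeHook":
        # Validation pass: reject cities outside the (hypothetical) allow-list.
        city = slot_value(slots, "Location")
        if city is not None and city.lower() not in VALID_CITIES:
            return elicit_slot(intent, "Location", "Sorry, we do not serve that city. Which city?")
        return delegate(intent)
    # Fulfillment pass: all slots are filled, so confirm and close.
    return close(intent, "Thanks. I have placed your reservation.")
```

The same handler serves both hooks: during slot filling it validates and either re-prompts or delegates back to Lex, and at fulfillment it closes the session with a plain text confirmation.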

Requirements

● Prior knowledge of Python.
● An elementary understanding of programming.
● A willingness to learn and practice.

Description

Comprehensive Course Description:

Python and Apache Spark are two of the most widely used tools in Big Data analytics, and PySpark, the Python API for Apache Spark, lets you use them together. In this course, you’ll start right from the basics and proceed to the advanced levels of data analysis. From cleaning data to building features and implementing machine learning (ML) models, you’ll learn how to execute end-to-end workflows using PySpark.

Throughout the course, you’ll use PySpark to perform data analysis. You’ll explore Spark RDDs, DataFrames, and some Spark SQL queries, along with the transformations and actions that can be performed on data using RDDs and DataFrames. You’ll also explore the Spark and Hadoop ecosystems and their underlying architectures, and you’ll use the Databricks environment to run your Spark scripts.

Finally, you’ll get a taste of Spark on the AWS cloud. You’ll see how to leverage AWS storage, database, and compute services, and how Spark can communicate with different AWS services to fetch the data it needs.

How Is This Course Different?

In this Learning by Doing course, every theoretical explanation is followed by practical implementation.

The course ‘PySpark & AWS: Master Big Data With PySpark and AWS’ is crafted to reflect the most in-demand workplace skills. This course will help you understand all the essential concepts and methodologies with regard to PySpark. The course is:

• Easy to understand.

• Expressive.

• Exhaustive.

• Practical with live coding.

• Rich with the state of the art and latest knowledge of this field.

Because this course covers the fundamentals in detail, it should help you progress quickly and build on what you learn. Each concept ends with homework, tasks, activities, or quizzes, along with solutions, so you can assess and reinforce what you have learned so far. Most of these activities are coding-based, as the aim is to get you up and running with implementations.

High-quality video content, in-depth course material, assessment questions, detailed course notes, and informative handouts are some of the perks of this course. You can approach our friendly team with any course-related queries, and we assure you of a fast response.

The course tutorials are divided into 140+ brief videos. You’ll learn the concepts and methodologies of PySpark and AWS along with a lot of practical implementation. The total runtime of the HD videos is around 16 hours.

Why Should You Learn PySpark and AWS?

PySpark is the Python API for Apache Spark: it lets you write distributed data processing jobs in Python.

PySpark is worth learning because of the huge demand for Spark professionals and the high salaries they command. The usage of PySpark in Big Data processing is increasing at a rapid pace compared to other Big Data tools.

AWS, launched in 2006, is the fastest-growing public cloud. The right time to cash in on cloud computing skills—AWS skills, to be precise—is now.

Course Content:

The all-inclusive course consists of the following topics:

1. Introduction:

a. Why Big Data?

b. Applications of PySpark

c. Introduction to the Instructor

d. Introduction to the Course

e. Projects Overview

2. Introduction to Hadoop, Spark EcoSystems, and Architectures:

a. Hadoop EcoSystem

b. Spark EcoSystem

c. Hadoop Architecture

d. Spark Architecture

e. PySpark Databricks setup

f. PySpark local setup

3. Spark RDDs:

a. Introduction to PySpark RDDs

b. Understanding underlying Partitions

c. RDD transformations

d. RDD actions

e. Creating Spark RDD

f. Running Spark Code Locally

g. RDD Map (Lambda)

h. RDD Map (Simple Function)

i. RDD FlatMap

j. RDD Filter

k. RDD Distinct

l. RDD GroupByKey

m. RDD ReduceByKey

n. RDD (Count and CountByValue)

o. RDD (saveAsTextFile)

p. RDD (Partition)

q. Finding Average

r. Finding Min and Max

s. Mini project on student data set analysis

t. Total Marks by Male and Female Student

u. Total Passed and Failed Students

v. Total Enrollments per Course

w. Total Marks per Course

x. Average marks per Course

y. Finding Minimum and Maximum marks

z. Average Age of Male and Female Students

4. Spark DFs:

a. Introduction to PySpark DFs

b. Understanding underlying RDDs

c. DFs transformations

d. DFs actions

e. Creating Spark DFs

f. Spark Infer Schema

g. Spark Provide Schema

h. Create DF from RDD

i. Select DF Columns

j. Spark DF with Column

k. Spark DF with Column Renamed and Alias

l. Spark DF Filter rows

m. Spark DF (Count, Distinct, Duplicate)

n. Spark DF (sort, order By)

o. Spark DF (Group By)

p. Spark DF (UDFs)

q. Spark DF (DF to RDD)

r. Spark DF (Spark SQL)

s. Spark DF (Write DF)

t. Mini project on Employees data set analysis

u. Project Overview

v. Project (Count and Select)

w. Project (Group By)

x. Project (Group By, Aggregations, and Order By)

y. Project (Filtering)

z. Project (UDF and With Column)

aa. Project (Write)

5. Collaborative filtering:

a. Understanding collaborative filtering

b. Developing recommendation system using ALS model

c. Utility Matrix

d. Explicit and Implicit Ratings

e. Expected Results

f. Dataset

g. Joining Dataframes

h. Train and Test Data

i. ALS model

j. Hyperparameter tuning and cross-validation

k. Best model and evaluate predictions

l. Recommendations

6. Spark Streaming:

a. Understanding the difference between batch and streaming analysis.

b. Hands-on with spark streaming through word count example

c. Spark Streaming with RDD

d. Spark Streaming Context

e. Spark Streaming Reading Data

f. Spark Streaming Cluster Restart

g. Spark Streaming RDD Transformations

h. Spark Streaming DF

i. Spark Streaming Display

j. Spark Streaming DF Aggregations

7. ETL Pipeline

a. Understanding the ETL

b. ETL pipeline Flow

c. Data set

d. Extracting Data

e. Transforming Data

f. Loading data (Creating RDS)

g. Load data (Creating RDS)

h. RDS Networking

i. Downloading Postgres

j. Installing Postgres

k. Connect to RDS through PgAdmin

l. Loading Data

8. Project – Change Data Capture / Replication On Going

a. Introduction to Project

b. Project Architecture

c. Creating RDS MySQL Instance

d. Creating S3 Bucket

e. Creating DMS Source Endpoint

f. Creating DMS Destination Endpoint

g. Creating DMS Instance

h. MySQL Workbench

i. Connecting with RDS and Dumping Data

j. Querying RDS

k. DMS Full Load

l. DMS Replication Ongoing

m. Stopping Instances

n. Glue Job (Full Load)

o. Glue Job (Change Capture)

p. Glue Job (CDC)

q. Creating Lambda Function and Adding Trigger

r. Checking Trigger

s. Getting S3 file name in Lambda

t. Creating Glue Job

u. Adding Invoke for Glue Job

v. Testing Invoke

w. Writing Glue Shell Job

x. Full Load Pipeline

y. Change Data Capture Pipeline

After the successful completion of this course, you will be able to:

● Relate the concepts and practicals of Spark and AWS with real-world problems

● Implement any project that requires PySpark knowledge from scratch

● Know the theory and practical aspects of PySpark and AWS

Who this course is for:

● People who are beginners and know absolutely nothing about PySpark and AWS

● People who want to develop intelligent solutions

● People who want to learn PySpark and AWS

● People who love to learn the theoretical concepts first before implementing them using Python

● People who want to learn PySpark along with its implementation in realistic projects

● Big Data Scientists

● Big Data Engineers

Enroll in this comprehensive PySpark and AWS course now to master the essential skills in Big Data analytics, data processing, and cloud computing.

Whether you're a beginner or looking to expand your knowledge, this course offers a hands-on learning experience with practical projects. Don't miss this opportunity to advance your career and tackle real-world challenges in the world of data analytics and cloud computing. Join us today and start your journey towards becoming a Big Data expert with PySpark and AWS!

List of keywords:

Big Data analytics
Data analysis
Data cleaning
Machine learning (ML)
Spark RDDs
Dataframes
Spark SQL queries
Spark ecosystem
Hadoop
Databricks
AWS cloud
Spark scripts
AWS services
PySpark and AWS collaboration
PySpark tutorial
PySpark hands-on
PySpark projects
Spark architecture
Hadoop ecosystem
PySpark Databricks setup
Spark local setup
Spark RDD transformations
Spark RDD actions
Spark DF transformations
Spark DF actions
Spark Infer Schema
Spark Provide Schema
Spark DF Filter rows
Spark DF (Count, Distinct, Duplicate)
Spark DF (sort, order By)
Spark DF (Group By)
Spark DF (UDFs)
Spark DF (Spark SQL)
Collaborative filtering
Recommendation system
ALS model
Spark Streaming
ETL pipeline
Change Data Capture (CDC)
Replication
AWS Glue Job
Lambda Function
RDS
S3 Bucket
MySQL Instance
Data Migration Service (DMS)
PgAdmin
Spark Shell Job
Full Load Pipeline
Change Data Capture Pipeline

Course content

Introduction: 7 lectures • 13min

01-Introduction to Hadoop, Spark EcoSystems and Architectures: 16 lectures • 55min

Spark RDDs: 37 lectures • 4hr 47min

Spark DFs: 41 lectures • 4hr 48min

Collaborative filtering: 12 lectures • 1hr 1min

Spark Streaming: 10 lectures • 49min

ETL Pipeline: 13 lectures • 1hr 5min

Project - Change Data Capture / Replication On Going: 26 lectures • 2hr 26min

Chatbots Development with Amazon Lex: 28 lectures • 2hr 36min
