
Explore why big data matters and learn to analyze it with PySpark on AWS, gaining hands-on skills for batch and real-time analytics and in-demand careers.
Explore PySpark applications across streaming data, machine learning with MLlib, batch analysis, and ETL, plus full load and replication, to drive real-time insights and scalable data pipelines.
Meet your instructor, Muhammad Ahmad, a cloud and big data engineer with experience in Python, PySpark, AWS cloud, data mining, data orchestration, and teaching.
Master data analysis with PySpark from basics to advanced, covering RDDs, DataFrames, Spark SQL, transformations and actions, plus Databricks and AWS integration for a CDC pipeline.
Explore practical big data analytics with PySpark and AWS through hands-on projects: student and employee data analysis, collaborative filtering with MLlib, Spark Streaming, and ETL and replication workflows.
Explore the Udemy review system, preview upcoming topics and concepts you'll work on in the real world, and rate honestly if this course meets five-star standards or request updates.
Explore Spark's speed and distributed processing, enabling real-time analytics, caching, fault tolerance, and libraries like Spark SQL, MLlib, and GraphX across Python, Scala, Java, and R.
Explore the Hadoop ecosystem, including HDFS for distributed storage, YARN for resource management, and MapReduce for processing; see how Spark builds on this foundation for faster processing.
Discover the Spark architecture and ecosystem, including the driver node, cluster manager, and workers, and explore core APIs with libraries like Spark SQL, Spark Streaming, MLlib, and GraphX.
Sign up for Databricks online or offline, then verify your email and sign in. Explore notebooks, data import, machine learning, and transformations within a refreshed workspace interface.
Learn how to spin up a Databricks cluster, create a notebook, and attach the notebook to the cluster to run your first hello world and test the setup.
Set up PySpark offline on Windows or Mac by downloading Java, Python, Spark, and Winutils. Follow platform-specific steps and download latest versions, including Hadoop version 3.x package.
Install and configure Java on Windows by downloading the packages, granting permissions, and setting the JAVA_HOME and PATH variables to complete the setup.
Configure a Windows-based PySpark environment by installing Python, extracting Spark, and setting Hadoop paths and environment variables (SPARK_HOME and PATH) for offline use.
Install and verify Spark on Windows by launching spark-shell, confirming Spark 3.5.1 works, and testing a PySpark session that prints Hello, after installing Hadoop, Java, and Python.
Download and install the Java JDK on Mac, using Java 11 for Spark compatibility, then download the macOS installer and complete Oracle sign-in if prompted.
Open the downloaded JDK package to launch the installation wizard, follow prompts to continue and install, enter your Mac password if prompted, then confirm JDK is installed and proceed.
Set up JAVA_HOME on Mac by locating the JDK path, editing or creating the bash profile, exporting JAVA_HOME for version 11, and reloading it with source to verify.
Verify the Java installation on Mac by running java -version and javac -version in a new terminal. Confirm Java 11 is installed, troubleshoot if needed, then proceed to the Spark installation.
Install Python on macOS to complete the PySpark setup, download Python 3.9.6 from the macOS link, run the installer, and prepare the PySpark path in the next video.
Download and extract Apache Spark, place the extracted folder under documents/dev, then set SPARK_HOME, update the bash profile, source it, and verify PySpark with a simple Python test.
Learn how Spark RDDs, immutable distributed datasets, power parallel processing: transformations are evaluated lazily, actions trigger the computation, and the data is distributed across nodes.
Learn to read a text file with PySpark, configure SparkConf and SparkContext, create an RDD, and call collect to trigger the lazily evaluated computation on Databricks.
Export PySpark code from Databricks and run it on your local machine using spark-submit, adjusting file paths and environment variables to resolve Python version issues.
Apply the Spark RDD map function with a lambda to transform each element and produce a new RDD, such as splitting lines into tokens or concatenating strings. Trigger the computation with collect on the text RDD.
Learn how to replace lambda with a concrete function in an RDD map to split strings, convert to integers, and apply custom transformations in Spark using Python.
Read a text file from Databricks storage, then use a PySpark map to compute and return the length of each word as a list.
Explore solving a quiz with PySpark on Databricks by reading a text file into an RDD, mapping to word lengths, and returning a new RDD of lengths.
Learn to replicate the prior quiz output inside a lambda using map in PySpark, splitting strings, applying list comprehension, and collecting results, with notes on readability and lambda use.
Explore flatMap as an extension of map in PySpark, acting as a mapper to flatten an RDD's nested outputs into a single list and compare with map.
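The difference between map and flatMap described above can be modeled in plain Python. This is only a sketch of what each transformation produces, not the PySpark API itself; the sample lines are made up for illustration.

```python
from itertools import chain

# Hypothetical input: two lines of text, as an RDD of strings would hold them
lines = ["hello big data", "hello spark"]

# map: exactly one output element per input element, so the result stays nested
mapped = [line.split(" ") for line in lines]

# flatMap: apply the same function, then flatten the per-line lists into one list
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

With map, the two input lines yield two lists of words; with flatMap, the same split yields one flat list of five words.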
Explore how RDD filter removes elements to produce a new RDD using lambda expressions or defined functions, with true/false conditions and practical examples.
Execute a quiz that reads a file from DBFS (Databricks storage), filters out words starting with a or c in an RDD, and flattens the result into a single list.
Filter an RDD in PySpark by reading a file, using flatMap to split words, and applying a filter with a lambda to remove words starting with a or c.
Apply the distinct transformation to an RDD to drop duplicates and produce a new RDD with unique elements. Explore how distinct works after flatMap and before collect in PySpark.
Learn how to apply the groupByKey transformation on RDDs, create key-value pairs with map and flatMap, and aggregate values into grouped lists.
Explore reduceByKey in Spark RDDs, aggregating values by key with a lambda to produce a single value per key, in contrast to groupByKey, which yields a list of values per key.
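The contrast between the two key-based transformations can be sketched in plain Python. The helpers `group_by_key` and `reduce_by_key` below are hypothetical stand-ins that mimic the results of the RDD methods on a small in-memory list; they say nothing about how Spark shuffles or distributes the work.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical key-value pairs, as an RDD of tuples would hold them
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(pairs):
    # Collect every value that shares a key into a list
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_by_key(pairs, func):
    # Fold the values of each key down to a single value with func
    return {key: reduce(func, values) for key, values in group_by_key(pairs).items()}

grouped = group_by_key(pairs)
reduced = reduce_by_key(pairs, lambda x, y: x + y)

print(grouped)
print(reduced)
```

Grouping keeps all values per key, while reducing returns one aggregated value per key.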
Learn to perform a word count in PySpark by reading a file into an RDD and returning each word as a key and its count as a value.
Learn how to compute word counts with PySpark RDDs by reading a file, creating an RDD, and applying flatMap, map, and reduceByKey, with filtering for empties and noting multiple approaches.
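The word count steps named above (flatMap, filtering empties, map, reduceByKey) can be modeled in plain Python. This is a sketch of what each step computes, not PySpark code; the function name `word_count` and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict

def word_count(lines):
    # flatMap step: split every line and flatten into one list of words
    words = [word for line in lines for word in line.split(" ")]
    # filter step: drop the empty strings that repeated spaces leave behind
    words = [word for word in words if word != ""]
    # map step: turn each word into a (word, 1) pair
    pairs = [(word, 1) for word in words]
    # reduceByKey step: add up the values that share a key
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical input; note the double space in the second line
sample_lines = ["spark makes big data simple", "big data  with spark"]
result = word_count(sample_lines)
print(result)
```

The double space in the second sample line produces an empty token, which is why the filter step is included.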
Explore Spark RDD actions and transformations, focusing on the count and countByValue actions, which trigger computation and return element counts and per-value frequencies.
Learn how Spark writes an RDD to text files using saveAsTextFile, with a default of two partitions, the action triggering processing, and one output file per partition in a folder.
Master how repartition and coalesce adjust RDD partitions, creating new RDDs and optimizing parallelism; learn when to use repartition to increase partitions and coalesce to decrease them, plus DBFS file read-write behavior.
Compute the average movie rating by reading a CSV file into Spark, mapping to key-value pairs, and applying reduceByKey to aggregate totals and averages.
Compute movie ratings by building RDD transformations with map and reduceByKey, accumulate total rating and count, and derive and verify the average rating across the dataset.
Read a file containing months, city codes, and monthly ratings, then write code to calculate the average score for each month.
Create an RDD from the input CSV, map each record to month and (rating, 1), then apply reduceByKey to compute monthly sums and counts and derive the average.
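The (rating, 1) trick described above carries the running sum and the running count together, so one reduction per key is enough to compute an average. A plain-Python sketch of that logic follows; the helper name `average_by_key` and the sample records are hypothetical, and the code models the computation rather than calling PySpark.

```python
def average_by_key(records):
    # map step: emit (month, (rating, 1)) so sums and counts travel together
    pairs = [(month, (rating, 1)) for month, rating in records]
    # reduce-by-key step: add ratings and counts component-wise per month
    totals = {}
    for month, (rating, count) in pairs:
        running_sum, running_count = totals.get(month, (0, 0))
        totals[month] = (running_sum + rating, running_count + count)
    # final map step: divide each sum by its count
    return {month: s / c for month, (s, c) in totals.items()}

# Hypothetical (month, rating) records after parsing the CSV lines
records = [("JAN", 4), ("JAN", 2), ("FEB", 5), ("FEB", 3), ("FEB", 1)]
averages = average_by_key(records)
print(averages)
```

The same pattern applies to any per-key average, such as average marks per course.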
Learn to compute per-movie minimum and maximum ratings in PySpark by converting lines to (movie, rating) pairs, casting to int, and applying reduceByKey.
Apply PySpark to read an input file into an RDD and implement MapReduce-style transformations to compute the minimum and maximum ratings for each city.
Learn to compute min and max ratings by city with PySpark using an RDD workflow: read the CSV, map to city and rating, then apply reduceByKey.
Implement a PySpark mini project that reads a student CSV into an RDD and computes counts, gender-based marks, 50+ pass/fail, per-course enrollment, marks, averages, and min/max by course and gender.
Upload the student data CSV, set up the Spark configuration and context, read the CSV, strip the header using first and a filter, then count the records to show 1000 students.
Learn to compute total marks by gender in PySpark using RDDs: map to create gender and marks pairs, cast to int, and reduceByKey to sum by male and female.
Learn to compute total passed and failed students in PySpark using filters and counts on an RDD, based on a 50 threshold.
Learn how to compute total enrollments per course using map-reduce in PySpark, creating key-value pairs, applying reduceByKey, and summing counts.
Compute total marks per course using a PySpark RDD workflow, mapping course names to marks and applying reduceByKey to sum totals per course.
Learn to compute the average marks per course by aggregating total scores and enrollments with reduceByKey, then map the values to extract sums and divide to get the average.
Compute minimum and maximum marks per course by mapping course names to marks and reducing by key with max or min. Collect results to reveal per-course statistics with PySpark.
Compute the average age by gender using Spark RDD transformations, map and reduceByKey, then map the values to show female and male averages.
Explore how Spark DataFrames wrap RDDs, provide schemas and named columns, and enable parallel, SQL-like analysis of data built from structured and unstructured files, external data sources, and existing RDDs.
Create a SparkSession with SparkSession.builder and getOrCreate, read a CSV into a Spark DataFrame, and set header to true.
Learn how Spark infers a dataframe schema from a file, revealing data types and header use. Explore options to configure schema inference and delimiters for CSV and TSV files.
Create and apply a custom Spark schema with StructType and StructField, mapping csv columns to explicit data types to control data frame reading.
Learn to build DataFrames from RDDs in PySpark by extracting headers, mapping records, and applying explicit schemas or inferring schemas for scalable big data workflows on AWS.
Learn to select multiple columns from a Spark DataFrame using df.select with column names, df.column attribute notation, and col, then create new DataFrames from the selected columns.
Learn to use withColumn in a Spark DataFrame to manipulate column values, cast types, create new columns, and add literal values with lit, with examples like updating marks and adding country.
Learn to rename Spark DataFrame columns with withColumnRenamed and alias, noting that DataFrames are immutable so the transformed DataFrame must be assigned back, and using alias during select to rename outputs.
Learn to filter Spark DataFrame rows using df.filter and column expressions, apply single and multiple conditions, and use isin, startswith, endswith, contains, and like for precise matching.
Read student data from a CSV into a DataFrame, add a total marks column set to 120, compute the average, filter for OOP above 80% and cloud above 60%, and print names and marks.
Read student data from a CSV, add a total marks column of 120 with withColumn, compute the average, filter for OOP above 80% and cloud above 60%, then select names and marks.
Master counting rows, identifying distinct rows, and dropping duplicates in a Spark DataFrame, with examples on counting after filters and using dropDuplicates on selected columns.
Solve a quick quiz by reading the student data CSV into a DataFrame and displaying unique rows for age, gender, and course to distinguish between dropDuplicates and distinct.
Learn to read the student data CSV file into a Spark DataFrame and extract unique age, gender, and course combinations using distinct or dropDuplicates in PySpark.
Learn to sort a Spark data frame with sort or orderBy, arranging rows by single or multiple columns in ascending or descending order, using df notation and integer data.
Read the office data CSV into a DataFrame, sort by bonus ascending, then sort by age and salary in descending and ascending order respectively, and preview the results.
Explore sorting dataframes in PySpark with orderBy and sort, applying ascending and descending orders on bonus, age, and salary, while creating and displaying new transformed dataframes.
Group by in a Spark DataFrame creates groups based on a selected column, then apply aggregations such as sum, count, max, min, and mean to each group.
Explore multi-column grouping in PySpark DataFrames with groupBy and agg, applying sum, mean, max, min, and count by course and gender for deeper analytics.
Learn how Spark DataFrame groupBy operates under the hood by visualizing department and state aggregations, including counts, sums of salary, and multi-column groupings.
Learn how to filter with group by in Spark dataframes, before and after grouping, using where as a filter, aliases, and column notation to refine aggregates.
Read a CSV into a PySpark DataFrame and perform group-by analyses to compute enrollments per course, gender counts, and marks by gender, plus min, max, and average marks by age group.
Use PySpark group by course, gender, and age to derive enrollment counts and statistics like sum, min, max, and average marks.
Read the word data text file into a DataFrame, then calculate and display the count of each word as a simple word count quiz, noting that the file has no header.
Read data from the word data text file into a PySpark DataFrame, group by the text column, and compute word counts to display the frequency of each word.
Create and apply user defined functions (UDFs) in Spark DataFrames to compute total salary by summing salary and bonus, with proper return types and column mappings.
Complete a UDF-based quiz that reads the office data CSV into a DataFrame and adds an increment column using NY and CA salary and bonus rules.
Learn to build and apply a PySpark UDF to compute state-based salary increments (NY 10%, CA 12%) plus bonuses, using withColumn on data frames.
Cache data and persist it in memory to speed Spark data frame workflows by caching after transformations and reusing results across actions.
Learn how to convert data frames to RDDs and access the underlying RDDs, then decide when RDDs or data frames are best for grouping and aggregations.
Learn how Spark SQL lets you create data frames, register them as temporary views or tables, and run SQL queries for filtering, aggregations, and data exploration.
Learn how to write a Spark DataFrame to CSV, control headers and schemas, and select write modes like overwrite, append, ignore, or error, while understanding partitions.
Read the office data CSV into a DataFrame in a PySpark workflow, analyze employees by department and state, compute salaries and bonuses, apply age-based raises, and save the results.
Learn to load a CSV into a Spark DataFrame, count employees, and derive unique department counts and names using groupBy and dropDuplicates in PySpark.
Group data by department to count employees in each department using PySpark. Extend to group by state, then by department within each state to show state-department employee counts.
Group employees by department, calculate minimum and maximum salaries with aggregations, and sort results by these values in ascending order using Spark DataFrames.
Filter a PySpark data frame to show New York finance employees whose bonuses exceed the NY state average bonus, using group by, average bonus, and conditional filtering.
Create and register a user-defined function to increment salary by 500 for employees whose age is greater than 45, then apply it with withColumn to update the salary column.
Filter employees aged over 45 in a DataFrame and write the result to CSV in the output_45 folder, creating multiple files based on partitions and validating via DBFS.
Explore collaborative filtering with Spark using dataframes and RDDs, and see how recommender systems predict user preferences to suggest products, shows, and content.
Learn how a utility matrix drives recommender systems by filling missing ratings for users and movies, using averages and user similarities to predict top picks.
Explore explicit and implicit ratings in recommender systems, detailing how explicit ratings are provided and why implicit signals—time spent on episodes and genre exploration—are harder to translate.
Explore collaborative filtering to generate a user–movie rating matrix and personalized recommendations, inferring unseen ratings from existing data and showing how user tastes guide future movie suggestions.
Upload the movie and ratings dataset to DBFS, read them with Spark, inspect schemas, and prepare for collaborative filtering in a Databricks notebook.
Join rating and movies dataframes on movie ID to enrich ratings with titles and genres. The video shows reading CSVs and performing a left join to produce the combined data.
Split the ratings data frame into train and test sets using random split with 0.8 for training and 0.2 for testing to train a recommender system and evaluate its performance.
Learn to build an alternating least squares (ALS) model in PySpark by specifying the user, item, and rating columns, nonnegative=True, implicitPrefs=False, and coldStartStrategy set to drop.
Build a param grid with 16 ALS model configurations in PySpark. Evaluate with a regression evaluator using RMSE and choose the best through five-fold cross-validation.
Train the model with the cross validator on the training data, select the best model from the 16 options, then test predictions on the test data and compute RMSE.
Learn to build a PySpark ALS-based recommender on Databricks, generate top five movie recommendations per user, and flatten results with explode for per-user insights.
Compare batch analysis and streaming analysis, using Spark Streaming to read streaming input, transform with DataFrames or RDDs, and output to any format or storage.
Create a StreamingContext with SparkConf and SparkContext for RDDs. Read data from a streaming input directory at a fixed batch interval and process new files as they arrive.
Discover how to read data in Spark Streaming by specifying an input directory, creating a stream with textFileStream, and starting the streaming context.
Learn to read data with Spark Streaming by configuring a streaming context, ingesting file-based input from a directory, printing streamed words, and managing the context lifecycle with getOrCreate.
Restart the Spark Streaming cluster to resolve DAG manipulation errors after starting streaming in Databricks. Replicate the local setup and run the same code to stream data reliably.
Explore Spark Streaming with RDD transformations by building a word count using map and reduceByKey on streaming data, and note the RDD limitations that DataFrames later address.
Explore Spark Streaming with DataFrames by building a read stream from a directory, processing it, and printing results to the console in complete mode.
Explore visualizing streaming data with Spark Streaming and DataFrames in Databricks. See how starting, stopping, and file ingestion affect word count and aggregated results across files.
Master Spark Streaming by performing DataFrame aggregations with groupBy to count words from streaming data, and observe counts update as new files land in DBFS.
Learn how to build an ETL pipeline with PySpark, extracting data from diverse sources, applying optional transformations, and loading into multiple destinations using Spark's flexible APIs.
Explore a simple ETL pipeline using PySpark on Databricks to read a CSV from DBFS, apply transformations, and load data into a Postgres database on AWS RDS.
Prepare the ETL pipeline by uploading the word data text file to DBFS, then use DataFrames to perform a word count and load the result into a database.
Extract data from a text file in DBFS using a Spark session and spark.read, create a DataFrame, and preview it with show as the first ETL step.
Transform data in an ETL pipeline by converting lines into words with split and explode, then count word occurrences using groupBy and count on a Spark DataFrame.
Finish the ETL pipeline by loading transformed data into an AWS RDS Postgres database, covering creating a free-tier instance, root credentials, VPC settings, and public access configuration.
Configure an AWS RDS Postgres database by specifying a database name, selecting a db parameter group, enabling monitoring and performance insights, and reviewing maintenance options before creating the database.
Learn to configure and secure an AWS RDS Postgres instance by examining endpoints, VPC, subnets, security groups, and inbound rules for port 5432 and connectivity.
Install Postgres on your local machine along with pgAdmin to connect to AWS RDS, following the Windows 64-bit installer steps to download and install version 13.3 or the latest.
Install Postgres on Windows by following the setup wizard, configuring pgAdmin, and setting the admin password. Complete the installation by confirming the default port and launching pgAdmin.
Connect to an AWS RDS Postgres instance using pgAdmin by configuring a new server with the RDS endpoint and port, then run SQL queries on the Postgres database.
Create a schema and load a word count data frame into an AWS RDS Postgres database using PySpark's JDBC options, then validate the data with queries.
Implement CDC to replicate all changes from the database inside RDS into HDFS storage, building a data lake and a pipeline that captures inserts, deletes, and updates.
Design a full load and change data capture pipeline using RDS MySQL, S3, DMS, Lambda, and Glue PySpark to reflect changes into the final destination.
Launch an AWS RDS MySQL instance under the free tier, configure credentials, storage, and backups, and create a custom MySQL 8.0 parameter group with binlog_format set to ROW for DMS.
Create an S3 bucket to serve as the destination for DMS in a PySpark CDC pipeline, using a unique bucket name and public access settings suitable for a POC.
Create a DMS source endpoint for an RDS MySQL instance, configure connection details, test the endpoint, and adjust the endpoint name for clarity.
Create a DMS destination endpoint for an S3 bucket, configure an IAM role with Amazon S3 full access, and test the endpoint labeled CDC S3 PySpark.
Create an AWS DMS replication instance to move data from MySQL to an S3 bucket, using a micro instance in the default VPC with the default subnet and security settings.
Download MySQL Workbench for Windows from the MySQL community downloads page, choosing the latest version (no login required), and click next through the installer to complete the installation.
Connect to the RDS instance using MySQL Workbench, test the connection, then run a SQL dump that creates a schema, a table with a primary key, and inserts for CDC.
Create a DMS task to migrate existing MySQL data to S3, then replicate ongoing changes using the specified endpoints and replication instance.
Explore how AWS DMS executes a full load from MySQL to S3 and manages ongoing replication, capturing inserts, updates, and deletions before triggering a PySpark pipeline.
Stop the replication and DMS instances to avoid costs, run the PySpark job to fetch data from S3 and store it in another bucket, then stop the RDS instance without a snapshot.
Define and execute an AWS Glue PySpark job to process a full load and CDC updates, reading CSV data, applying headers, and writing the final output.
Learn how to build a Glue CDC pipeline by performing a full load, processing change data with an updated data frame and UDF, and writing results to S3.
Master change data capture with a Glue job using PySpark on AWS. Learn to apply insertions, updates, and deletions to a final DataFrame and load it back.
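The merge step described above, applying insertions, updates, and deletions to the final data set, can be modeled in plain Python. This sketch assumes each change record carries an operation code of I, U, or D (the convention AWS DMS uses in its change files) and that rows have a primary key column named `id`; the function name `apply_changes` and the sample rows are hypothetical, and a Glue job would express the same logic with DataFrame operations.

```python
def apply_changes(final_rows, changes, key="id"):
    # Index the current snapshot by primary key for direct lookups
    merged = {row[key]: dict(row) for row in final_rows}
    for op, row in changes:
        if op == "I":
            # Insert: add the new row
            merged[row[key]] = dict(row)
        elif op == "U":
            # Update: overwrite the changed columns of the existing row
            merged.setdefault(row[key], {}).update(row)
        elif op == "D":
            # Delete: drop the row if it is present
            merged.pop(row[key], None)
    return [merged[k] for k in sorted(merged)]

# Hypothetical full-load snapshot and a batch of change records
full_load = [
    {"id": 1, "name": "Ali", "city": "Lahore"},
    {"id": 2, "name": "Sara", "city": "Karachi"},
]
changes = [
    ("U", {"id": 1, "name": "Ali", "city": "Islamabad"}),
    ("D", {"id": 2, "name": "Sara", "city": "Karachi"}),
    ("I", {"id": 3, "name": "Omar", "city": "Multan"}),
]
final_rows = apply_changes(full_load, changes)
print(final_rows)
```

Changes are applied in order, so a later update to the same key overrides an earlier one.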
Create a Python Lambda function triggered by S3 object creation to pass the file name to a PySpark Glue job for full load or change-data processing and write results back.
Test the Lambda trigger by deploying code, running a test, and validating in CloudWatch logs that S3 uploads activate the function.
Learn to extract the triggering S3 bucket and file name from a Lambda event, print the event data, deploy updates, and verify in CloudWatch.
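Extracting the bucket and file name from the event, as described above, might look like the following minimal sketch. It assumes the standard S3 event notification layout (a `Records` list with `s3.bucket.name` and `s3.object.key`); the bucket name and object key in the sample event are made up, and the return value is only for illustration.

```python
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # Print the raw event so it shows up in the function's logs
    print(event)
    # S3 notifications deliver one or more records; take the first one
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded (spaces become "+"), so decode them
    file_name = unquote_plus(record["s3"]["object"]["key"])
    print(f"bucket={bucket} file={file_name}")
    return {"bucket": bucket, "file_name": file_name}

# Hypothetical test event, trimmed to the fields the handler reads
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-cdc-bucket"},
                "object": {"key": "exports/full+load.csv"},
            }
        }
    ]
}
result = lambda_handler(sample_event, None)
```

Decoding the key matters when file names contain spaces or special characters, since the encoded form would not match the object in the bucket.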
Create a Glue job to run PySpark on AWS, configure an IAM role with S3 and CloudWatch access, and supply your own script.
Invoke a PySpark job on AWS Glue from a Lambda function using boto3, starting the job with S3 bucket and file name arguments and handling the response.
Upload a dump file to the S3 bucket, deploy the Lambda, and invoke the Glue PySpark job with the bucket and file name. Monitor CloudWatch and Glue logs to verify.
Transfer Spark code from Databricks into a Glue shell job, create a Spark session, and read from and write to S3 buckets while handling full load and incremental updates.
Spin up the RDS and DMS, perform the full load from MySQL into S3, then trigger Lambda and AWS Glue to process and replicate changes via CDC.
Execute a change data capture pipeline with PySpark and AWS, coordinating MySQL updates and deletions via Glue and Lambda, merging into an S3 final output for CDC validation.
Discover the Amazon Lex bot architecture and how to define intents, utterances, request data, slots, and fulfillment to build and deploy chatbots with voice and text.
Explore the benefits of using Amazon Lex for chatbots on AWS cloud, including seamless deployment and scaling, built-in AWS integrations, and cost-effective, simple bot development with versatile input technologies.
Explore Amazon Lex fundamentals in a bank chatbot, where intents are understood, prompts guide replies, and AWS Lambda retrieves data to provide a balance.
Learn hands-on chatbot development with Amazon Lex, including intent classification, Lambda integration, Twilio SMS, website integration, and response cards, through theory and hands-on tutorials.
Explore how to develop a chatbot with Amazon Lex, wiring it to AWS Lambda, Twilio, and DynamoDB storage, and follow steps from bot configuration to build and test.
Explore the five-step process to connect AWS Lambda with Amazon Lex, including creating, updating, and testing a basic Lambda function to enable fulfillment and validation for chatbot development.
Explore how to integrate AWS Lex chatbots with Lambda, Twilio, and websites, covering account setup, chatbot integration, and live demos.
Set up response cards for an Amazon Lex chatbot and upgrade slots. Build the chatbot and finish with a live demo.
Develop an Amazon Lex chatbot by creating a bot (blank, example, or transcript), naming it, configuring IAM permissions, language, and idle session timeout, then cover intents and utterances.
Explore how intents, utterances, and slots drive a pizza-ordering chatbot with Amazon Lex, using natural language processing to identify user needs and collect slot values like size and crust.
Set up the book hotel intent in AWS Lex with utterances like book a hotel, and enable slots for variables such as two nights and New York.
Learn to design a hotel booking chatbot with Amazon Lex by defining slots for nights and location, creating utterances, marking slots as required, and adding prompts to collect missing details.
Add custom slots to a hotel booking bot with AWS Lex and Lambda, creating a room type slot type and prompts for the Book Hotel intent.
Build and test a hotel booking bot using AWS Lex and Lambda, configure utterances and slots, add initial, confirmation, and fulfillment messages, and test interactions with the built-in test tool.
Explore building a hotel booking chatbot with AWS Lex Visual Builder, detailing conversation flow, slots, intents, and fallback handling, and integrating Lambda.
Connect your chatbot to a backend with AWS Lambda, a serverless platform that runs code on demand and supports deploying code as a zip archive or container image.
Learn to build a hotel chatbot by creating a Python Lambda function, connecting it to an AWS Lex bot, and testing the end-to-end integration.
Build a Lex-based chatbot with AWS Lambda by implementing slot validation for location and other slots, handling session state and intents, and processing invocation in Python.
Explore session state and dialog actions in Amazon Lex v2, including dialog and fulfillment code hooks, intents, slots, validation results, and elicit slot behavior.
Complete the Lambda function for a Lex chatbot using a fulfillment code hook, closing the session after fulfilling slots and returning a plain text confirmation like 'Thanks. I have placed your reservation.'
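A fulfillment handler of the kind described above could be sketched as follows. It assumes the Lex v2 Lambda event and response layout (`invocationSource`, `sessionState`, `dialogAction`, `messages`); the helper name `close_response`, the intent name BookHotel, and the sample event are hypothetical, and slot validation is left out.

```python
def close_response(event, message):
    # Copy the intent from the incoming event and mark it as fulfilled
    intent = dict(event["sessionState"]["intent"])
    intent["state"] = "Fulfilled"
    return {
        "sessionState": {
            # Close ends the conversation for this intent
            "dialogAction": {"type": "Close"},
            "intent": intent,
        },
        "messages": [{"contentType": "PlainText", "content": message}],
    }

def lambda_handler(event, context):
    if event["invocationSource"] == "FulfillmentCodeHook":
        return close_response(event, "Thanks. I have placed your reservation.")
    # For other invocations, hand control back to Lex to pick the next step
    return {
        "sessionState": {
            "dialogAction": {"type": "Delegate"},
            "intent": event["sessionState"]["intent"],
        }
    }

# Hypothetical fulfillment event, trimmed to the fields the handler reads
sample_event = {
    "invocationSource": "FulfillmentCodeHook",
    "sessionState": {
        "intent": {"name": "BookHotel", "slots": {}, "state": "ReadyForFulfillment"}
    },
}
response = lambda_handler(sample_event, None)
print(response)
```

Returning the intent with its state set to Fulfilled alongside a Close dialog action tells Lex the conversation for that intent is complete.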
Test and validate a hotel booking chatbot using AWS Lex and AWS Lambda, guiding city and date inputs, slot filling and fulfillment, with live logs and deployment steps.
Configure a hotel chatbot for WhatsApp using AWS Lex and Twilio. Build and test via sandbox, buy numbers, create versions and aliases, and enable two-way messaging through channel integration.
Deploy a hotel chatbot using boto3 in Python on Google Colab with the AWS Lex v2 runtime. Configure region, credentials, bot ID and alias, language, and session ID for recognize_text responses.
Build a hotel booking chatbot with AWS Lex and Lambda, managing session state, intents, and elicit slots using boto3 to guide users from city to reservation.
Deploy a website chatbot with AWS Lex via Kommunicate.io, configuring Lex v2, region, and alias, then test, monitor conversations, and embed the HTML snippet.
Improve bot user experience by using AWS Lex response cards with Lambda to present selectable options for hotels or flowers, reducing wrong entries and guiding user flows.
Explore building a complete hotel booking chatbot with AWS Lex and Lambda, using slot types and prompts, response cards, and interactive previews to elicit city, room type, and dates.
Comprehensive Course Description:
Python and Apache Spark are two of the hottest buzzwords in the Big Data analytics industry, and PySpark brings them together as the Python API for Apache Spark. In this course, you’ll start right from the basics and proceed to the advanced levels of data analysis. From cleaning data to building features and implementing machine learning (ML) models, you’ll learn how to execute end-to-end workflows using PySpark.
Throughout the course, you’ll use PySpark to perform data analysis. You’ll explore Spark RDDs, DataFrames, and some Spark SQL queries, along with the transformations and actions that can be performed on data using RDDs and DataFrames. You’ll also explore the Spark and Hadoop ecosystems and their underlying architecture, and you’ll use and explore the Databricks environment for running Spark scripts.
Finally, you’ll get a taste of Spark with the AWS cloud. You’ll see how to leverage AWS storage, database, and compute services, and how Spark can communicate with different AWS services to get the data it needs.
How Is This Course Different?
In this Learning by Doing course, every theoretical explanation is followed by practical implementation.
The course ‘PySpark & AWS: Master Big Data With PySpark and AWS’ is crafted to reflect the most in-demand workplace skills. This course will help you understand all the essential concepts and methodologies with regard to PySpark. The course is:
• Easy to understand.
• Expressive.
• Exhaustive.
• Practical with live coding.
• Rich with state-of-the-art, up-to-date knowledge of the field.
As this course is a detailed compilation of all the basics, it will motivate you to make quick progress and go well beyond what you have learned. At the end of each concept, you will be assigned homework, tasks, activities, or quizzes along with solutions. These evaluate and reinforce your learning based on the concepts and methods covered so far. Most of these activities will be coding-based, as the aim is to get you up and running with implementations.
High-quality video content, in-depth course material, evaluating questions, detailed course notes, and informative handouts are some of the perks of this course. You can approach our friendly team in case of any course-related queries, and we assure you of a fast response.
The course tutorials are divided into 140+ brief videos. You’ll learn the concepts and methodologies of PySpark and AWS along with a lot of practical implementation. The total runtime of the HD videos is around 16 hours.
Why Should You Learn PySpark and AWS?
PySpark is the Python library that lets you drive Apache Spark from Python code.
PySpark is worth learning because of the huge demand for Spark professionals and the high salaries they command. The usage of PySpark in Big Data processing is increasing at a rapid pace compared to other Big Data tools.
AWS, launched in 2006, is the fastest-growing public cloud. The right time to cash in on cloud computing skills—AWS skills, to be precise—is now.
Course Content:
The all-inclusive course consists of the following topics:
1. Introduction:
a. Why Big Data?
b. Applications of PySpark
c. Introduction to the Instructor
d. Introduction to the Course
e. Projects Overview
2. Introduction to Hadoop, Spark Ecosystems, and Architectures:
a. Hadoop Ecosystem
b. Spark Ecosystem
c. Hadoop Architecture
d. Spark Architecture
e. PySpark Databricks setup
f. PySpark local setup
3. Spark RDDs:
a. Introduction to PySpark RDDs
b. Understanding underlying Partitions
c. RDD transformations
d. RDD actions
e. Creating Spark RDD
f. Running Spark Code Locally
g. RDD Map (Lambda)
h. RDD Map (Simple Function)
i. RDD FlatMap
j. RDD Filter
k. RDD Distinct
l. RDD GroupByKey
m. RDD ReduceByKey
n. RDD (Count and CountByValue)
o. RDD (saveAsTextFile)
p. RDD (Partition)
q. Finding Average
r. Finding Min and Max
s. Mini project on student data set analysis
t. Total Marks by Male and Female Students
u. Total Passed and Failed Students
v. Total Enrollments per Course
w. Total Marks per Course
x. Average marks per Course
y. Finding Minimum and Maximum marks
z. Average Age of Male and Female Students
4. Spark DFs:
a. Introduction to PySpark DFs
b. Understanding underlying RDDs
c. DF transformations
d. DF actions
e. Creating Spark DFs
f. Spark Infer Schema
g. Spark Provide Schema
h. Create DF from RDD
i. Select DF Columns
j. Spark DF with Column
k. Spark DF with Column Renamed and Alias
l. Spark DF Filter rows
m. Spark DF (Count, Distinct, Duplicate)
n. Spark DF (sort, order By)
o. Spark DF (Group By)
p. Spark DF (UDFs)
q. Spark DF (DF to RDD)
r. Spark DF (Spark SQL)
s. Spark DF (Write DF)
t. Mini project on Employees data set analysis
u. Project Overview
v. Project (Count and Select)
w. Project (Group By)
x. Project (Group By, Aggregations, and Order By)
y. Project (Filtering)
z. Project (UDF and With Column)
aa. Project (Write)
5. Collaborative filtering:
a. Understanding collaborative filtering
b. Developing a recommendation system using the ALS model
c. Utility Matrix
d. Explicit and Implicit Ratings
e. Expected Results
f. Dataset
g. Joining DataFrames
h. Train and Test Data
i. ALS model
j. Hyperparameter tuning and cross-validation
k. Best model and evaluate predictions
l. Recommendations
6. Spark Streaming:
a. Understanding the difference between batch and streaming analysis
b. Hands-on with Spark Streaming through a word count example
c. Spark Streaming with RDD
d. Spark Streaming Context
e. Spark Streaming Reading Data
f. Spark Streaming Cluster Restart
g. Spark Streaming RDD Transformations
h. Spark Streaming DF
i. Spark Streaming Display
j. Spark Streaming DF Aggregations
7. ETL Pipeline:
a. Understanding the ETL
b. ETL pipeline Flow
c. Data set
d. Extracting Data
e. Transforming Data
f. Loading data (Creating RDS)
g. Load data (Creating RDS)
h. RDS Networking
i. Downloading Postgres
j. Installing Postgres
k. Connect to RDS through pgAdmin
l. Loading Data
8. Project – Change Data Capture / Ongoing Replication:
a. Introduction to Project
b. Project Architecture
c. Creating RDS MySQL Instance
d. Creating S3 Bucket
e. Creating DMS Source Endpoint
f. Creating DMS Destination Endpoint
g. Creating DMS Instance
h. MySQL Workbench
i. Connecting with RDS and Dumping Data
j. Querying RDS
k. DMS Full Load
l. DMS Replication Ongoing
m. Stopping Instances
n. Glue Job (Full Load)
o. Glue Job (Change Capture)
p. Glue Job (CDC)
q. Creating Lambda Function and Adding Trigger
r. Checking Trigger
s. Getting S3 file name in Lambda
t. Creating Glue Job
u. Adding Invoke for Glue Job
v. Testing Invoke
w. Writing Glue Shell Job
x. Full Load Pipeline
y. Change Data Capture Pipeline
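The full load and change data capture pipelines in the final project can be pictured with a small plain-Python sketch. It assumes each change record carries an operation flag of `I`, `U`, or `D` (insert, update, delete), which is the convention AWS DMS uses in its CDC output; the function `apply_changes`, the table layout, and the sample rows are hypothetical and are not taken from the course material.

```python
def apply_changes(snapshot, changes):
    """Apply ordered change records to a full-load snapshot keyed by primary key.

    snapshot: dict mapping primary key -> row (the result of the full load)
    changes:  iterable of (op, key, row) tuples, where op is "I", "U", or "D"
    """
    table = dict(snapshot)  # copy so the original full load is left untouched
    for op, key, row in changes:
        if op in ("I", "U"):
            table[key] = row      # insert a new row or overwrite an existing one
        elif op == "D":
            table.pop(key, None)  # delete; ignore keys that are already gone
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


# Hypothetical full load: the table as it stood when replication began.
full_load = {
    1: {"name": "Ali", "city": "Lahore"},
    2: {"name": "Sara", "city": "Karachi"},
}

# Hypothetical change records captured afterwards, in commit order.
changes = [
    ("U", 1, {"name": "Ali", "city": "Islamabad"}),
    ("I", 3, {"name": "Omar", "city": "Multan"}),
    ("D", 2, None),
]

current = apply_changes(full_load, changes)
print(current)
```

A Spark or Glue job would apply the same logic to DataFrames rather than dictionaries, but the order-sensitive insert, update, and delete handling is the core of any CDC merge.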
After the successful completion of this course, you will be able to:
● Relate the concepts and practicals of Spark and AWS with real-world problems
● Implement, from scratch, any project that requires PySpark knowledge
● Know the theory and practical aspects of PySpark and AWS
Who this course is for:
● People who are beginners and know absolutely nothing about PySpark and AWS
● People who want to develop intelligent solutions
● People who want to learn PySpark and AWS
● People who love to learn the theoretical concepts first before implementing them using Python
● People who want to learn PySpark along with its implementation in realistic projects
● Big Data Scientists
● Big Data Engineers
Enroll in this comprehensive PySpark and AWS course now to master the essential skills in Big Data analytics, data processing, and cloud computing.
Whether you're a beginner or looking to expand your knowledge, this course offers a hands-on learning experience with practical projects. Don't miss this opportunity to advance your career and tackle real-world challenges in the world of data analytics and cloud computing. Join us today and start your journey towards becoming a Big Data expert with PySpark and AWS!
List of keywords:
Big Data analytics
Data analysis
Data cleaning
Machine learning (ML)
Spark RDDs
DataFrames
Spark SQL queries
Spark ecosystem
Hadoop
Databricks
AWS cloud
Spark scripts
AWS services
PySpark and AWS collaboration
PySpark tutorial
PySpark hands-on
PySpark projects
Spark architecture
Hadoop ecosystem
PySpark Databricks setup
Spark local setup
Spark RDD transformations
Spark RDD actions
Spark DF transformations
Spark DF actions
Spark Infer Schema
Spark Provide Schema
Spark DF Filter rows
Spark DF (Count, Distinct, Duplicate)
Spark DF (sort, order By)
Spark DF (Group By)
Spark DF (UDFs)
Spark DF (Spark SQL)
Collaborative filtering
Recommendation system
ALS model
Spark Streaming
ETL pipeline
Change Data Capture (CDC)
Replication
AWS Glue Job
Lambda Function
RDS
S3 Bucket
MySQL Instance
Database Migration Service (DMS)
pgAdmin
Glue Shell Job
Full Load Pipeline
Change Data Capture Pipeline