AWS & Apache Spark: The Complete Big Data Analytics Course

Rating: 4.2 (293 reviews)

Master the cloud. Build scalable data pipelines and train Machine Learning models using Spark, EMR, and SparkSQL.

Created by Skillbox, LLC

Last updated 1/2026

English

What you'll learn

Start a project using Apache Spark
Understand how Spark SQL lets you work with structured data
Install and run Apache Spark on a desktop computer or on a cluster
Gain hands-on experience setting up Spark clusters on the AWS cloud services platform
Understand how to control a cloud instance on AWS using SSH or PuTTY
Understand how to access data in CSV and JSON formats and from HDFS and S3 storage

Course content

6 sections • 21 lectures • 2h 20m total length

Introduction • 4:12
Explore how Apache Spark on AWS accelerates big data processing with in-memory computation. Build clusters from scratch on AWS, see how Spark outpaces Hadoop MapReduce, and work in Java, Scala, Python, SQL, or R.
Mastering Big Data: The Advantages of Integrating Apache Spark and AWS • 2:14

Creating an AWS Instance • 9:39
Set up a Spark-ready AWS environment by creating a VPC and launching a basic EC2 instance. Select a t2.micro instance type, enable a public IP, and stop the instance when you are done to avoid charges.
Connecting to AWS Instance with SSH • 6:18
Sign in to the AWS console, start the EC2 instance, and connect via SSH using the .pem key file after setting its permissions. Note that the instance's DNS name changes after a restart.
Connecting to AWS Instance with PuTTY • 8:37
Connect to your AWS instance using PuTTY on Windows (or SSH on Linux and Mac): generate a PPK file with PuTTYgen and configure a saved Spark session.
Spark Clusters • 9:01
Log in to the AWS console and launch an Amazon Linux EC2 instance with a public IP, then install Spark 2.0 and clone branch 2.0 of the AMPLab spark-ec2 scripts.
Spark Clusters in depth • 9:55
Launch Spark clusters on AWS using the spark-ec2 scripts to create a master and a slave node. SSH into the master to access the Spark master page and start a Spark session.
Learn How to Terminate Your Clusters • 0:58
Terminate your clusters at the end of each chapter to avoid surprise costs. Restart them when you resume, and allow about fifteen minutes for them to start up.
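
The SSH connection steps described in the lectures above can be sketched in Python. This is a rough illustration, not the course's own material: the function name `build_ssh_command` and the host name are hypothetical, and `ec2-user` is assumed as the login user because the lectures use Amazon Linux.

```python
import os
import stat
import tempfile


def build_ssh_command(pem_path, public_dns, user="ec2-user"):
    """Restrict the key file to owner read-only and return the ssh command."""
    # ssh refuses private keys that other users can read, so set mode 0400.
    os.chmod(pem_path, stat.S_IRUSR)
    return ["ssh", "-i", pem_path, f"{user}@{public_dns}"]


# Demo with a throwaway file standing in for a real .pem key.
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as key_file:
    demo_key = key_file.name

# Hypothetical host name; the real one changes after every instance restart.
command = build_ssh_command(demo_key, "ec2-host.example.com")
print(" ".join(command))
os.remove(demo_key)
```

Because the public DNS name changes after a restart, the host argument has to be looked up again in the AWS console each time the instance is started.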

Data Basics • 8:33
Set up Spark clusters on AWS, launch and connect to the master, and run SparkR with RStudio to create and manipulate Spark DataFrames using the faithful dataset.
Modeling with Gaussian Generalized Linear Models • 11:19
Explore Gaussian generalized linear models in Spark using the iris and diabetes data, fitting linear and binomial models and evaluating them with RMSE and model summaries.
Binomial Generalized Linear Models • 9:33
Apply binomial generalized linear models to the Titanic dataset in Spark: convert text to dummy features, split the data, and evaluate survival predictions with a 0.5 cutoff, achieving about 82.07% accuracy.
Naive Bayes and K-Means Modeling • 9:14
Conclude the chapter with an introduction to naive Bayes and K-means in Spark ML, illustrating supervised and unsupervised modeling with the Titanic and iris data, and see how to destroy the cluster on EC2.
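
The two evaluation measures named in the modeling lectures above, accuracy at a 0.5 cutoff and RMSE, can be sketched in plain Python. This is only an illustration of the arithmetic, not the course's SparkR code; the function names and the sample values are hypothetical.

```python
import math


def accuracy_at_cutoff(probabilities, labels, cutoff=0.5):
    """Fraction of correct predictions when probability >= cutoff means class 1."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    predictions = [1 if p >= cutoff else 0 for p in probabilities]
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return correct / len(labels)


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up survival probabilities and true outcomes (1 = survived).
probs = [0.9, 0.2, 0.6, 0.4]
survived = [1, 0, 0, 0]
print(accuracy_at_cutoff(probs, survived))  # 0.75

# Made-up regression predictions and targets.
print(rmse([2.0, 4.0], [1.0, 5.0]))  # 1.0
```

Moving the cutoff away from 0.5 trades false positives against false negatives, so accuracy at one cutoff is only one view of a binomial model.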

Bigger Data and AWS S3 • 7:27
Explore big data workflows with S3, Spark, SQL, and dplyr by uploading large datasets to S3 and running scalable analytics on a Spark cluster.
Accessing S3 Spark Dataframes • 4:57
Handle large data frames with a dplyr-like syntax, read from S3 into a Spark DataFrame, and ingest CSV data directly for scalable analytics.
SparkR Dataframe Operations • 11:01
Explore SparkR DataFrames with dplyr-like commands such as select, filter, and summarize, and learn group by, aggregate, mean, and piping with the magrittr R package.
Intro to SparkSQL • 5:16
Use Spark SQL to query Spark DataFrames directly through a temporary view such as diabetes_spark_table. Learn to use select, distinct, and count with proper quoting to analyze data across a cluster.

Intro to HDFS • 10:59
Explore HDFS, the Hadoop Distributed File System, which stores data in blocks across commodity machines for scalable, redundant, high-bandwidth access and easy integration with Spark.
Databricks Community Edition • 8:19
Explore Databricks Community Edition to manage Spark clusters with interactive notebooks, six gigabytes of free space, and multi-language support for Python, Scala, SQL, and R.
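
The block storage idea from the HDFS lecture above can be illustrated with a little arithmetic. The sketch below assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults but are configurable per cluster; the function name and the file size are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumed default block size
REPLICATION = 3      # assumed default replication factor


def hdfs_block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Estimate how a file of the given size is split into replicated blocks."""
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block only holds the remainder; it is not padded to full size.
    last_block_mb = file_size_mb - (blocks - 1) * block_size_mb
    return {
        "blocks": blocks,
        "last_block_mb": last_block_mb,
        "block_replicas": blocks * replication,
        "raw_storage_mb": file_size_mb * replication,
    }


# A hypothetical 1000 MB file: 7 full blocks plus one 104 MB block.
print(hdfs_block_layout(1000))
```

Each replica of a block is placed on a different machine, which is what gives the redundancy and parallel read bandwidth the lecture describes.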

Requirements

A PC or Mac
Basic understanding and functional knowledge of Apache Spark and big data

Description

Data is the new oil. But it’s useless if you can't refine it.

Processing gigabytes on your laptop is easy. But what happens when you need to process terabytes or petabytes? You need the Cloud, and you need a distributed computing engine. You need AWS and Apache Spark.

Welcome to the comprehensive guide to modern Big Data. This course is designed to bridge the gap between Data Engineering (setting up clusters, managing storage) and Data Science (analyzing data, training models).

Why this course? Most courses teach Spark in isolation on a local machine. We take you to the real world. You will learn how to provision real clusters on the AWS Cloud, effectively becoming a "Cloud Data Specialist."

What will you master?

The AWS Ecosystem: We start from zero. You will learn to navigate the AWS console, understand IAM security, and master S3 (Simple Storage Service) for storing massive datasets.
Cluster Management: Stop struggling with local installations. Learn to spin up EC2 instances and fully managed EMR (Elastic MapReduce) clusters to handle heavy workloads.
SparkSQL & DataFrames: Move beyond old-school RDDs. Master the modern DataFrame API to query structured data just like you would with SQL—but faster and at scale.
Machine Learning at Scale: This is where the magic happens. We dive into Spark MLlib to build predictive models on big data. You will implement:
- Classification: Naive Bayes and Binomial GLMs.
- Regression: Gaussian Generalized Linear Models (GLM).
- Clustering: K-Means to group similar data points automatically.
SparkR & Analytics: Leverage the power of R syntax within the Spark engine for advanced statistical analysis.

Who is this course for? This course is perfect for Data Scientists who want to scale their models from their laptop to the cloud, and Software Engineers who want to break into the lucrative field of Big Data.

Don't let the data overwhelm you. Master the tools to control it. Enroll today and start building scalable Big Data solutions on the world's leading cloud platform.

Who this course is for:

Software engineers
Application developers
Data scientists
Big data architects

Course content

Welcome • 2 lectures • 6min
Creating Clusters • 6 lectures • 44min
Data and Modeling Basics • 4 lectures • 39min
Data Sources and Data Manipulation • 4 lectures • 29min
Various • 2 lectures • 19min
Course Summary • 3 lectures • 3min

Requirements

Description

Who this course is for: