Master Apache Spark - Hands On!

Rating: 4.6 (1343 reviews)

Learn how to slice and dice data using the next generation big data platform - Apache Spark!

Created by Job Ready Programmer

Last updated 9/2025

English

English [Auto], Korean [Auto]

What you'll learn

Utilize the most powerful big data batch and stream processing engine to solve big data problems
Master the new Spark Java Datasets API to slice and dice big data in an efficient manner
Build, deploy and run Spark jobs on the cloud and benchmark performance on various hardware configurations
Optimize Spark clusters to work on big data efficiently and understand performance tuning
Transform structured and semi-structured data using Spark SQL, Dataframes and Datasets
Implement popular Machine Learning algorithms in Spark such as Linear Regression, Logistic Regression, and K-Means Clustering

Course content

7 sections • 33 lectures • 6h 58m total length

Why Spark • 7:58
Explore Apache Spark, a general-purpose distributed big data processing platform, and see how a word count scales across a cluster, running faster than Hadoop thanks to in-memory processing.
Spark High Level Components • 4:04
Explore Spark's five components (Core, SQL, Streaming, MLlib, and the graph library) and learn how a consistent data frame and data set API enables fast, scalable distributed processing across diverse data sources.
Creating a Spark Maven Project • 9:16
Set up Java 8 and the Eclipse IDE, create a Spark Maven project with a pom.xml that includes Spark Core, SQL, and MLlib, and refresh dependencies from the GitHub code.
Dedicated TA Support • 1:39
Get dedicated TA support and a responsive Q&A, and use Stackoverflow.com for fast, crowd-sourced answers to accelerate learning and strengthen your knowledge in Master Apache Spark - Hands On!
Join our Online Community (Discord) • 0:48
Import Source Code into Eclipse • 5:51
Import the Spark with Java source from the GitHub repository or zip file into Eclipse as a Maven project, then compile and run.
First Spark Application • 21:24
Build a Spark application that reads a CSV with headers into a data frame, applies transformations (full name via concat and lit, filter, order), and writes to PostgreSQL via JDBC.
Spark Standalone Cluster Architecture • 11:46
Launch a Spark cluster with a master and workers, submit your jar to the master, orchestrate via the driver, and run tasks on partitions to ingest, transform, and load data.
Apache Spark Introduction

Ingesting CSV and JSON Files • 21:42
Ingest CSV and JSON files with Spark by configuring a Spark session, parsing semicolon-delimited multi-line CSVs with quotes, and parsing JSON lines and multi-line JSON.
How to reduce logging in the console • 0:27
Real World Dataframes Example • 21:13
Combine real-world parks data from JSON and CSV into a single Spark data frame, extracting nested fields, building a new schema, and performing a union of heterogeneous datasets.
Union Dataframes and Other Set Transformations • 29:15
Build and clean data frames from JSON and CSV, filtering for parks, renaming and adding columns, and unioning by name to combine the Philadelphia and Durham datasets.
Converting Between Datasets and Dataframes • 14:31
Explore the difference between a data set and a data frame, and learn to use encoders with POJOs to convert between data sets and data frames for Spark SQL and Tungsten optimizations.

Map and Reduce Transformation Functions • 16:35
Learn how map and reduce transformations work in Spark, transforming each item and reducing many inputs to one, with serializable mappers and POJOs.
Using Datasets with User Defined POJOs • 19:14
Learn how to map CSV data into a POJO, create a dataset with a mapper, and convert back to a data frame to leverage Spark's Tungsten optimizations.
Using Datasets with Unstructured Textual Data • 18:16
Explore processing unstructured text with Spark by ingesting Shakespeare text into a dataframe. Use flatMap to split lines into words and then group by and count to reveal word frequencies.
Joining Dataframes and Using Various Filter Transformations • 23:29
Master the data frames API with join and filter transformations. Learn to load CSV files into data frames, join on GPA, and use select and where clauses to shape output.
Aggregation Transformations + Join Assignment • 14:05
Join the customers, purchases, and products data frames in Spark to form a unified dataset. Explore aggregation techniques with group by and agg to compute counts, sums, and max values.
More on Transformations, Actions and the DAG • 17:17
Explore transformations and actions in Spark, learn how a directed acyclic graph (DAG) builds lazy execution plans for data frames, and see how the Catalyst optimizer streamlines joins and aggregations.
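
The ideas in this section can be tried without a cluster. The following is a minimal plain-Java sketch, not Spark code: it uses java.util.stream as a stand-in for the Dataset API, and the class name WordCountSketch, its method names, and the sample text are hypothetical. flatMap splits lines into words, grouping and counting plays the role of group by and count, and reduce collapses many values into one. Like Spark transformations, the intermediate stream operations are lazy and only run when a terminal operation (the analogue of a Spark action) is invoked.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy only; none of these names come from the course or from Spark.
public class WordCountSketch {

    // flatMap: one line becomes many words; then group by word and count.
    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // map + reduce: transform each item, then collapse many values into one.
    public static int totalLength(List<String> words) {
        return words.stream().map(String::length).reduce(0, Integer::sum);
    }

    // Laziness: reports how often the mapper ran before and after the terminal operation.
    public static int[] mapperCallsBeforeAndAfter(List<String> words) {
        AtomicInteger calls = new AtomicInteger();
        Stream<String> pipeline = words.stream().map(w -> {
            calls.incrementAndGet();
            return w.toUpperCase();
        });
        int before = calls.get();               // the pipeline is only described so far
        pipeline.collect(Collectors.toList());  // the terminal operation triggers the work
        return new int[] {before, calls.get()};
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");
        System.out.println(countWords(lines));
        System.out.println(totalLength(Arrays.asList("to", "be")));
        System.out.println(Arrays.toString(
                mapperCallsBeforeAndAfter(Arrays.asList("a", "b", "c"))));
    }
}
```

In Spark the same shape applies at cluster scale: transformations only extend the execution plan, and an action such as a count or a write makes the plan run.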

Using Spark to Analyze Reddit Comments • 26:39
Use Apache Spark to analyze Reddit comments by extracting words from JSON data and counting word frequencies, with stopword filtering, on AWS EMR with S3.
Running the Reddit Spark Application on an EMR Cluster • 20:11
Set up AWS EMR with Spark, upload data to S3, deploy the Reddit Spark application jar, run it via spark-submit on a master-slave cluster, and compare performance across configurations.
Instructions for Configuring a Spark Stand-alone Cluster • 1:37

Streaming Network Socket Example • 21:40
Learn how Spark Streaming uses the data frame API for incremental micro-batch processing of streaming data with latency near 100 ms, using socket sources and various output modes.
Stock Market Files Streaming Example • 6:23
Watch a directory for incoming stock data files (CSV, JSON, Parquet), define a date and price schema, stream and aggregate by date, and print complete results to the console.
Using Kafka with Spark Streaming • 14:19
Connect Spark Streaming to a Kafka cluster, subscribe to topics, and consume messages; build a local end-to-end example that counts words and prints results.

Machine Learning Resources • 0:48
Overview of Linear Regression • 6:28
Learn how simple linear regression models the relationship between advertising spend and sales, using residuals and least-squares to predict outcomes, with practical Spark MLlib code.
Spark Java Linear Regression Example • 23:04
Learn to build a Spark Java linear regression model by preparing a label and features data frame, using a vector assembler, fitting the model, and evaluating with predictions.
Overview of Logistic Regression • 2:19
Spark Java Logistic Regression (Classification Algorithm) • 16:03
Demonstrates logistic regression on a binary cryotherapy dataset using Spark ML, converting sex with a string indexer, assembling features, and a 70/30 train-test split.
Overview of K-Means Clustering • 7:46
Learn how k-means clustering partitions observations into k clusters with centroids moved to the cluster average, iterating to minimize the within-cluster sum of squared errors in unsupervised learning.
Spark Java K-Means Clustering Example • 10:52
Master Apache Spark with hands-on k-means clustering on wholesale customers data using Spark MLlib, exploring unsupervised learning, vector assembler, and cluster evaluation.
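
The k-means loop described in this section (assign each observation to its nearest centroid, move each centroid to the average of its cluster, repeat, and measure the within-cluster sum of squared errors) can be sketched in plain Java. This is a one-dimensional illustration under stated assumptions, not the Spark MLlib implementation: the class KMeansSketch, its method names, the fixed iteration count, and the sample points are all hypothetical.

```java
import java.util.Arrays;

// Plain-Java illustration of the k-means idea; not the Spark MLlib API.
public class KMeansSketch {

    // Index of the centroid closest to the point (ties go to the lower index).
    public static int nearest(double point, double[] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs a fixed number of assign/update iterations and returns the final centroids.
    public static double[] fit(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {                      // assignment step
                int c = nearest(p, centroids);
                sum[c] += p;
                count[c]++;
            }
            for (int c = 0; c < centroids.length; c++) {   // update step
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];      // empty clusters keep their centroid
                }
            }
        }
        return centroids;
    }

    // Within-cluster sum of squared errors for the given centroids.
    public static double sumOfSquaredErrors(double[] points, double[] centroids) {
        double total = 0.0;
        for (double p : points) {
            double d = p - centroids[nearest(p, centroids)];
            total += d * d;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        double[] centroids = fit(points, new double[] {1.0, 2.0}, 10);
        System.out.println(Arrays.toString(centroids));
        System.out.println(sumOfSquaredErrors(points, centroids));
    }
}
```

With these sample points the centroids settle at 2.0 and 11.0, and the sum of squared errors is 4.0. A real run would work on multi-dimensional feature vectors and stop on convergence rather than after a fixed number of iterations.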

Requirements

Some basic Java programming experience is required. A crash course on Java 8 lambdas is included.
You will need a personal computer with an internet connection.
The software needed for this course is completely free, and I'll walk you through the steps to install it on your computer.

Description

Welcome to Apache Spark Mastery – Hands-On Big Data Processing!

Are you a Java developer or data engineer eager to harness the power of big data?

Do you want to design scalable data processing pipelines using one of today’s most powerful platforms?

Have you been challenged by real-time data streams or the complexities of performance tuning in distributed systems?

If you answered yes, then you’re in the right place.

What Makes This Course Stand Out?

Hands-On Experience: Build over 15 real-world Spark applications that tackle actual data challenges.
Comprehensive Curriculum: Dive deep into Spark’s Java Datasets API, Spark SQL, Dataframes, and Streaming to transform and analyze data efficiently.
Cloud Deployment & Performance Tuning: Learn how to deploy Spark jobs on the cloud, benchmark performance, and optimize clusters for maximum efficiency.
Industry-Relevant Projects: Work with diverse data sources—from text and CSV to JSON—and analyze large-scale datasets like millions of Reddit comments.

Why This Course Is Essential:

Apache Spark is the next generation batch and stream processing engine. It has been reported to run up to 100 times faster than Hadoop MapReduce for in-memory workloads, and it makes distributed big data applications much easier to develop. Its demand has skyrocketed in recent years, and having this technology on your resume is truly a game changer. Over 3000 companies are using Spark in production right now and the list is growing very quickly! Some of the big names include Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, and Amazon, as well as most of the big world banks and financial institutions!

You'll develop over 15 practical Spark Java applications that crunch through real-world data, slicing and dicing it with several data transformation techniques. This course is especially important for people who would like to be hired as a Java developer or data engineer, because Spark is a hugely sought-after skill. We'll even go over how to set up a live cluster and configure Spark jobs to run on the cloud. You'll also learn about the practical implications of performance tuning and scaling out a cluster to work with big data, so you'll be learning a ton in this course.

Topics Covered in the Apache Spark Course

In this course, you'll learn everything you need to know about using Apache Spark in your organization while using its latest and greatest Java Datasets API. Below are some of the things you'll learn:

How to develop Spark Java Applications using Spark SQL Dataframes
Understand how the Spark Standalone cluster works behind the scenes
How to use various transformations to slice and dice your data in Spark Java
How to marshal/unmarshal Java domain objects (POJOs) while working with Spark Datasets
Master joins, filters, aggregations and ingest data of various sizes and file formats (txt, CSV, JSON, etc.)
Analyze over 18 million real-world comments on Reddit to find the most trending words used
Develop programs using Spark Streaming for streaming stock market index files
Stream network sockets and messages queued on a Kafka cluster
Learn how to develop the most popular machine learning algorithms using Spark MLlib
Covers the most popular algorithms: Linear Regression, Logistic Regression and K-Means Clustering

KEY BENEFITS OF APACHE SPARK MASTERY

Mastering Apache Spark positions you at the forefront of big data technology. With this expertise, you’ll be able to design efficient, scalable data processing pipelines that are in high demand across industries. Spark’s widespread adoption by over 3000 companies—including Oracle, Cisco, and Amazon—underscores its value in today's competitive tech landscape. This course will not only boost your technical skills but also enhance your resume, opening doors to exciting career opportunities in data engineering and data science.

KEY TAKEAWAY

By the end of this course, you’ll have the practical skills and in-depth knowledge to harness Apache Spark for building high-performance, scalable data solutions. Whether you’re looking to boost your career or transform how your organization handles big data, Apache Spark Mastery is your gateway to success.

This course has a 30 day money back guarantee. You will have access to all of the code used in this course.

Ready to transform your big data capabilities? Enroll now and start mastering Apache Spark today!

Who this course is for:

Anyone who is a Java developer and wants to add this seriously marketable technology to their resume
Anyone who wants to get into the data science field
Anyone who is interested in the world of big data
Anyone who wants to implement machine learning algorithms in Spark

Course content

Introduction • 8 lectures • 1hr 3min

Spark Java Dataset API Basics • 5 lectures • 1hr 27min

Diving Deeper with Datasets, Dataframes, Transformations and the DAG • 6 lectures • 1hr 49min

Running Spark Jobs on the Cloud • 3 lectures • 48min

Spark Streaming Applications • 3 lectures • 42min

Machine Learning with Spark MLlib • 7 lectures • 1hr 7min

Course Extras! • 1 lecture • 2min