Master Apache Spark (Scala) for Data Engineers

Name: Master Apache Spark (Scala) for Data Engineers
Rating: 4.5 (122 reviews)

Intensive course to learn Apache Spark with lots of hands-on practice to excel in data engineering

Created by Navdeep Kaur

Last updated 5/2024

English

What you'll learn

Students will learn Spark architecture and internals, working with RDDs and DataFrames, using an IDE, and running Spark on an EMR cluster.

Course content

11 sections • 61 lectures • 4h 44m total length

Big Data Introduction (5:21)
Understanding Big Data Ecosystem (10:27)

IntelliJ Setup (2:24)
Project Setup (3:43)
Set up a Scala Spark project from IntelliJ, choose sbt, align Scala and Spark versions, and add Spark dependencies (core, sql, hive) to build a runnable Spark application.
Writing first Spark program on IDE (6:06)
Understanding Spark configuration (7:00)
Adding Actions/Transformations (7:55)
Read a multiline json file into a Spark dataframe, filter out adventure movies, uppercase titles, group by release year with collect_list, and write results back to json.
Understanding Execution Plan (7:43)
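
The DataFrame exercise described under "Adding Actions/Transformations" could look roughly like the sketch below. It is not the course's own code: the column names (`title`, `genre`, `year`), the genre value `"Adventure"`, and both file paths are assumptions made for illustration, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, upper}

object MoviesByYear {
  // Transformation kept separate from I/O so it can be tried on any DataFrame.
  // Assumed columns: title (string), genre (string), year (numeric).
  def titlesByYear(movies: DataFrame): DataFrame =
    movies
      .filter(col("genre") =!= "Adventure")          // drop adventure movies
      .withColumn("title", upper(col("title")))      // uppercase the titles
      .groupBy("year")                               // one row per release year
      .agg(collect_list("title").as("titles"))       // gather that year's titles

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MoviesByYear")
      .master("local[*]")                            // local mode, for running from an IDE
      .getOrCreate()

    // multiLine lets Spark parse JSON records that span several lines.
    // Both paths are hypothetical placeholders.
    val movies = spark.read.option("multiLine", "true").json("data/movies.json")
    titlesByYear(movies).write.mode("overwrite").json("output/movies_by_year")

    spark.stop()
  }
}
```

Nothing runs until the `write` action is called; the filter, `withColumn`, and `groupBy` steps are lazy transformations that only build up the execution plan.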

Map Transformation (3:59)
Environment update (1:02)
Cloudera VM Setup (1:30)
FlatMap (1:54)
Filter/Intersection (4:00)
Union/Distinct (2:23)
GroupByKey (3:31)
GroupByKey in Spark groups data by a key, triggering a shuffle and a new stage. Map lines to key-value pairs, then apply groupByKey to aggregate by month.
ReduceByKey (6:43)
SortByKey (5:34)
Sort data in Spark using the sortByKey and sortBy transformations, demonstrating ascending and descending orders. Convert numeric fields to integers so they sort numerically rather than as strings, and sort by specific values such as scores.
map_partition (5:48)
Master how Spark partitions a distributed data set to enable parallel processing, and use mapPartitions and mapPartitionsWithIndex to run per-partition logic, such as creating a database connection per partition.
Coalesce/Repartition (3:34)
Learn to adjust Spark partitions with coalesce and repartition: coalesce decreases the partition count without a full shuffle, while repartition can increase it but always performs a full shuffle.
Joins (3:00)
Explore the join transformation in Apache Spark, aligning records by a key such as the name, and learn how inner, left outer, and right outer joins work.
Spark Actions (5:42)
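
Several of the RDD lectures listed above (GroupByKey, ReduceByKey, SortByKey, map_partition, Coalesce/Repartition, Joins) can be tried together in one small local program. The following is a minimal sketch rather than the course's material: the `month,amount` input lines, the `regions` data set, and all names are invented for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddOpsSketch {
  // Parse hypothetical "month,amount" lines and total the amounts per month.
  // reduceByKey pre-aggregates inside each partition before the shuffle.
  def monthlyTotals(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .map { line =>
        val Array(month, amount) = line.split(",")
        (month, amount.toInt)                 // convert to Int so sums and sorts are numeric
      }
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddOpsSketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("jan,10", "feb,5", "jan,7", "mar,3"), numSlices = 2)
    val totals = monthlyTotals(lines)

    // groupByKey gives the same totals but ships every value across the shuffle.
    val viaGroup = lines
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toInt))
      .groupByKey()
      .mapValues(_.sum)

    // sortByKey sorts on the key, so swap to (total, month) to order by total.
    val byTotalDesc = totals.map(_.swap).sortByKey(ascending = false)

    // Per-partition logic: count the rows held by each partition.
    val rowsPerPartition = lines.mapPartitionsWithIndex { (index, rows) =>
      Iterator((index, rows.size))
    }

    // coalesce lowers the partition count without a full shuffle;
    // repartition performs a full shuffle and can raise it.
    val fewer = lines.coalesce(1)
    val more = lines.repartition(4)

    // Joins align records on the key; months missing from `regions` survive
    // only in the left outer join, with None on the right-hand side.
    val regions = sc.parallelize(Seq(("jan", "north"), ("feb", "south")))
    val inner = totals.join(regions)
    val leftOuter = totals.leftOuterJoin(regions)

    // Actions trigger the actual computation.
    println(viaGroup.collect().toList)
    println(byTotalDesc.collect().toList)
    println(rowsPerPartition.collect().toList)
    println(s"${fewer.getNumPartitions} ${more.getNumPartitions}")
    println(inner.collect().toList)
    println(leftOuter.collect().toList)

    spark.stop()
  }
}
```

Everything before the `println` calls is a transformation and stays lazy; `collect` is the action that makes Spark build stages and run them.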

Requirements

Basic knowledge of the Scala language.

Description

This course covers Apache Spark 3.x from the basics to advanced concepts in an efficient and concise manner. It is useful for beginners as well as for those who already know Apache Spark. It covers Spark internals, Datasets, execution plans, the IntelliJ IDE, and EMR clusters in depth, with plenty of hands-on practice.

This course is designed for data engineers and architects who want to design and develop big data engineering projects using Apache Spark. It does not require any prior knowledge of Apache Spark or Hadoop. Spark architecture and fundamental concepts are explained in detail to help you grasp the content of this course. This course uses the Scala programming language, which is the language Spark itself is written in.

This course covers:

Intro to Big data ecosystem
Spark internals in detail
Understanding Spark drivers and executors
Understanding the execution plan in detail
Setting up an environment locally and on Google Cloud
Working with Spark DataFrames
Working with the IntelliJ IDE
Running Spark on EMR cluster (AWS Cloud)
Advanced Dataframe examples
Working with RDD
RDD examples

By the end of this course, you'll be prepared for Spark interview questions and able to run code that analyzes gigabytes of data in Apache Spark in a matter of minutes.

Who this course is for:

Software Engineers who want to learn Apache Spark

Course content

Introduction to Big Data (Optional): 2 lectures • 16min

Spark with Yarn & HDFS: 5 lectures • 21min

Local Setup: 1 lecture • 7min

Spark Internals: 8 lectures • 32min

Google Cloud Dataproc Cluster: 4 lectures • 14min

Working with Dataframes: 11 lectures • 50min

Using IntelliJ IDE: 6 lectures • 35min

Advanced Dataframe: 2 lectures • 7min

Running Spark on EMR (AWS Cloud): 5 lectures • 27min

Working with RDDs: 13 lectures • 49min
