Spark SQL and Spark 3 using Scala Hands-On with Labs

Name: Spark SQL and Spark 3 using Scala Hands-On with Labs
Rating: 4.4 (3124 reviews)

A comprehensive course on Spark SQL as well as Data Frame APIs using Scala, with complimentary lab access

Created by Durga Viswanatha Raju Gadiraju

Last updated 2/2023

English

What you'll learn

All the HDFS commands that are relevant for validating files and folders in HDFS.
Enough Scala to work on Data Engineering projects using Scala as the programming language.
Spark Data Frame APIs to solve problems using Data Frame style APIs.
Basic Transformations such as Projection, Filtering, Total as well as Aggregations by Keys using Spark Dataframe APIs
Inner as well as outer joins using Spark Data Frame APIs
Ability to use Spark SQL to solve problems using SQL-style syntax.
Basic Transformations such as Projection, Filtering, Total as well as Aggregations by Keys using Spark SQL
Inner as well as outer joins using Spark SQL
Basic DDL to create and manage tables using Spark SQL
Basic DML or CRUD Operations using Spark SQL
Create and Manage Partitioned Tables using Spark SQL
Manipulating Data using Spark SQL Functions
Advanced Analytical or Windowing Functions to perform aggregations and ranking using Spark SQL

Course content

19 sections • 232 lectures • 24h 12m total length

Introduction
CCA 175 Spark and Hadoop Developer - Curriculum (8:01)

Setting up Environment using AWS Cloud9
Getting Started with Cloud9 (2:50)
Creating Cloud9 Environment (4:36)
Set up a Cloud9 environment via the Cloud9 console, selecting Ubuntu Server 18.04 and a t3.medium instance, to practice data engineering with Postgres and Jupyter.
Warming up with Cloud9 IDE (3:19)
Overview of EC2 related to Cloud9 (1:31)
Opening ports for Cloud9 Instance (3:12)
Learn how to open ports on a Cloud9 instance to expose a web app, using the DNS alias, IP, port 80, security group inbound rules, and an Elastic IP.
Associating Elastic IPs to Cloud9 Instance (4:13)
Learn to allocate and associate an Elastic IP address with a Cloud9 instance to provide a stable DNS alias, handle changing IP addresses, and maintain access after reboots.
Increase EBS Volume Size of Cloud9 Instance (3:15)
Increase Cloud9 storage from 10 GB to 32 GB by modifying the EBS volume in the EC2 console, rebooting the instance, and verifying with df -h.
Setup Jupyter Lab on Cloud9 (7:06)
[Commands] Setup Jupyter Lab on Cloud9 (0:15)

Setup Hadoop on Single Node Cluster
Introduction to Single Node Hadoop Cluster (2:45)
Setup Prerequisites (4:55)
[Commands] - Setup Prerequisites (0:11)
Setup Passwordless Login (3:36)
Set up passwordless login on a single-node Hadoop cluster by generating SSH keys, copying the public key to authorized_keys, and validating passwordless access to start Hadoop components.
[Commands] - Setup Passwordless Login (0:27)
Download and Install Hadoop (4:14)
[Commands] - Download and Install Hadoop (0:33)
Configure Hadoop HDFS (6:50)
[Commands] - Configure Hadoop HDFS (0:44)
Start and Validate HDFS (5:59)
Start and validate HDFS on a single-node cluster by configuring the path and passwordless login, launching all DFS components, and validating with DFS commands and file operations.
[Commands] - Start and Validate HDFS (0:27)
Configure Hadoop YARN (1:19)
[Commands] - Configure Hadoop YARN (0:18)
Start and Validate YARN (2:03)
[Commands] - Start and Validate YARN (0:15)
Managing Single Node Hadoop (3:49)
[Commands] - Managing Single Node Hadoop (0:44)

Setup Hive and Spark on Single Node Cluster
Setup Data Sets for Practice (5:02)
Set up a single-node Hadoop cluster, start DFS and YARN, and validate the services. Clone the retail_db repo from GitHub, copy the data to HDFS, and verify readiness.
[Commands] - Setup Data Sets for Practice (0:28)
Download and Install Hive (2:17)
[Commands] - Download and Install Hive (0:47)
Setup Database for Hive Metastore (7:13)
[Commands] - Setup Database for Hive Metastore (1:19)
Configure and Setup Hive Metastore (7:10)
[Commands] - Configure and Setup Hive Metastore (0:38)
Launch and Validate Hive (4:57)
[Commands] - Launch and Validate Hive (0:18)
Scripts to Manage Single Node Cluster (4:40)
[Commands] - Scripts to Manage Single Node Cluster (0:32)
Download and Install Spark 2 (3:28)
Download and install Spark 2.4.7 on a single-node Hadoop 2.7 cluster, extract the tarball, create a spark2 soft link, and run Spark via Python, Scala, or SQL for learning.
[Commands] - Download and Install Spark 2 (0:43)
Configure Spark 2 (9:02)
[Commands] - Configure Spark 2 (0:54)
Validate Spark 2 using CLIs (8:23)
[Commands] - Validate Spark 2 using CLIs (0:42)
Validate Jupyter Lab Setup (9:45)
[Commands] - Validate Jupyter Lab Setup (0:36)
Integrate Spark 2 with Jupyter Lab (6:08)
Learn to integrate Spark 2 with Jupyter Lab by creating a new kernel embedded with Spark, configuring YARN, and validating Spark and Hive databases on a single-node cluster.
[Commands] - Integrate Spark 2 with Jupyter Lab (0:35)
Download and Install Spark 3 (1:55)
Download Spark 3.1.1 for a single-node Hadoop and Spark cluster, unzip, move the folder, and create a symlink. Validate the installation by configuring and running basic scripts.
[Commands] - Download and Install Spark 3 (0:43)
Configure Spark 3 (6:46)
[Commands] - Configure Spark 3 (0:56)
Validate Spark 3 using CLIs (7:54)
Validate Spark 3 using CLIs by launching a single-node cluster, running the Scala, Python, and SQL interfaces, and verifying access to retail_db.orders via the Hive metastore and Spark queries.
[Commands] - Validate Spark 3 using CLIs (0:42)
Integrate Spark 3 with Jupyter Lab (4:19)
[Commands] - Integrate Spark 3 with Jupyter Lab (0:36)

Scala Fundamentals
Introduction and Setting up of Scala (10:51)
Setup Scala on Windows (7:23)
Basic Programming Constructs (18:53)
Functions (18:35)
Master Scala functions in Spark contexts by defining with def, using return types, and creating higher-order and anonymous functions to sum ranges, squares, cubes, and multiples.
Object Oriented Concepts - Classes (17:42)
Object Oriented Concepts - Objects (13:02)
Object Oriented Concepts - Case Classes (11:14)
Discover case classes, a boilerplate-reducing construct that is immutable by default and auto-generates toString, equals, hashCode, copy, and Product-related utilities, along with a companion object.
Collections - Seq, Set and Map (8:56)
Explore Scala collections: sequence, set, and map, with hands-on focus on list, set, and map APIs, iteration via foreach, and common traits like Traversable and Iterable for Spark integration.
Basic Map Reduce Operations (14:08)
Learn basic MapReduce in Spark with Scala, using map, filter, and reduce to sum squares of even numbers, and compare filter-first versus square-first strategies.
Setting up Data Sets for Basic I/O Operations (4:23)
Set up datasets for basic I/O by cloning or downloading from GitHub, locate the data in the lab or local PC, then read files and perform MapReduce operations.
Basic I/O Operations and using Scala Collections APIs (16:23)
Read data from files using the Source API, convert to in-memory collections, and apply map, filter, and reduce to compute order subtotals and total revenue.
Tuples (4:56)
Development Cycle - Create Program File (7:24)
Development Cycle - Compile source code to jar using SBT (9:32)
Development Cycle - Setup SBT on Windows (2:48)
Install sbt on Windows, download the appropriate version for Scala and Spark projects, and finish setup via command prompt. Launch sbt to verify first-time downloads complete.
Development Cycle - Compile changes and run jar with arguments (4:21)
Development Cycle - Setup IntelliJ with Scala (12:07)
Development Cycle - Develop Scala application using SBT in IntelliJ (10:50)
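
As a rough illustration of what the Functions and Basic Map Reduce Operations lectures describe, the sketch below defines a higher-order function that sums a transformed range, then computes the sum of squares of even numbers in both orders (filter first versus square first). It uses plain Scala collections only. The object and method names (ScalaWarmup, sumOf, and so on) are made up for this example and are not taken from the course material.

```scala
object ScalaWarmup {
  // Higher-order function: applies f to every integer in [lb, ub] and sums the results.
  def sumOf(f: Int => Int, lb: Int, ub: Int): Int =
    (lb to ub).map(f).sum

  // Filter first: keep the even numbers, then square them and add them up.
  // Note: reduce throws on an empty collection, so this assumes at least one even number.
  def sumSquaresFilterFirst(nums: Seq[Int]): Int =
    nums.filter(_ % 2 == 0).map(n => n * n).reduce(_ + _)

  // Square first: square everything, then keep the even squares.
  // A square is even exactly when the original number is even, so both strategies agree.
  def sumSquaresSquareFirst(nums: Seq[Int]): Int =
    nums.map(n => n * n).filter(_ % 2 == 0).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(sumOf(n => n * n, 1, 5))         // sum of squares of 1 to 5
    println(sumOf(n => n * n * n, 1, 3))     // sum of cubes of 1 to 3
    println(sumSquaresFilterFirst(1 to 10))  // filter-first strategy
    println(sumSquaresSquareFirst(1 to 10))  // square-first strategy
  }
}
```

Both strategies return the same total, but filtering first squares only the even numbers, so it does less work on large inputs.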

Overview of Hadoop HDFS Commands
Getting help or usage of HDFS Commands (3:51)
Listing HDFS Files (3:46)
Learn to list HDFS files with the dfs -ls command and options like -h for human-readable sizes, -R for recursive listing, and sort by name, time, or size.
Managing HDFS Directories (12:47)
Learn to create HDFS directories, set up user space under /user, assign ownership to the login user, and adjust group ownership with dfs -mkdir, -chown, and -chgrp, including recursive options.
Copying files from local to HDFS (12:16)
Copy files from local to HDFS using -copyFromLocal (or -put); learn to create folders, preserve metadata, handle existing files with -f, and understand the NameNode and block distribution.
Copying files from HDFS to local (5:18)
Getting File Metadata (7:12)
Explore how to obtain file metadata in the Hadoop distributed file system using DFS commands, revealing files, blocks, and locations and explaining replication factor, block IDs, and data node mappings.
Previewing Data in HDFS File (5:40)
HDFS Block Size (4:50)
HDFS Replication Factor (7:06)
Getting HDFS Storage Usage (2:11)
Using HDFS Stat Commands (1:31)
HDFS File Permissions (8:53)
Overriding Properties (6:17)
Learn how to read and override HDFS properties at runtime by inspecting core-site.xml and hdfs-site.xml. Use -D or --conf to set replication factor and block size for copied files.

Apache Spark 2 using Scala - Data Processing - Overview
Introduction for the module (2:18)
Starting Spark Context using spark-shell (10:14)
Overview of Spark read APIs (18:16)
Learn how the Spark read APIs load data into data frames from csv, json, and text formats, using format, load, options, and schema to control headers, delimiters, and schema inference.
Previewing Schema and Data using Spark APIs (4:31)
Preview schema and data in Spark data frames using printSchema, show, and describe, with practical hands-on labs on Spark SQL and Spark 3 using Scala.
Overview of Spark Data Frame APIs (7:41)
Discover Spark Data Frame APIs to read data into data frames, apply standard and low-level transformations, filter, aggregate by group, sort, and project fields with select, drop, and withColumn.
Overview of Functions to Manipulate Data in Spark Data Frames (18:15)
Overview of Spark Write APIs (16:43)
Explore how to write Spark data frames to multiple formats such as csv, json, and parquet using write APIs, options, compression, mode, and validation on a multi-node cluster.
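
To give a feel for the reader options mentioned in the Overview of Spark read APIs lecture, here is a minimal sketch that assumes Spark 3.x is on the classpath and runs in local mode. To stay self-contained it parses CSV lines held in memory instead of loading a file. The object name, the sample rows, and the order_id and order_status column names are assumptions made for illustration, not the course's data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadApiSketch {
  // Parses CSV lines that are already in memory. The same options apply when
  // reading from a path with spark.read.format("csv").options(...).load(path).
  def readOrders(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.read
      .option("header", "true")       // first line holds the column names
      .option("sep", ",")             // field delimiter
      .option("inferSchema", "true")  // let Spark guess column types from the data
      .csv(lines.toDS())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReadApiSketch")
      .getOrCreate()

    // Hypothetical sample rows, shaped loosely like an orders data set.
    val orders = readOrders(spark, Seq(
      "order_id,order_status",
      "1,CLOSED",
      "2,PENDING_PAYMENT",
      "3,COMPLETE"
    ))

    orders.printSchema() // preview the inferred schema
    orders.show()        // preview the rows
    spark.stop()
  }
}
```

Schema inference costs an extra pass over the data, so for large files an explicit schema passed through the reader's schema method is usually the cheaper choice.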

Apache Spark 2 using Scala - Processing Column Data using Pre-defined Functions
Introduction to Pre-defined Functions (5:51)
Creating Spark Session Object in Notebook (1:55)
Launch a Spark session in a notebook, create a Spark object with the Spark session builder, and import implicits for shorthand column references to build dummy data frames and explore SQL functions.
Create Dummy Data Frames for Practice (8:06)
Categories of Functions on Spark Data Frame Columns (2:16)
Using Spark Special Functions - col (13:50)
Using Spark Special Functions - lit (4:44)
Manipulating String Columns using Spark Functions - Case Conversion and Length (6:44)
Manipulating String Columns using Spark Functions - substring (13:16)
Manipulating String Columns using Spark Functions - split (9:04)
Manipulating String Columns using Spark Functions - Concatenating Strings (3:37)
Manipulating String Columns using Spark Functions - Padding Strings (11:10)
Manipulating String Columns using Spark Functions - Trimming unwanted characters (5:23)
Date and Time Functions in Spark - Overview (4:14)
Date and Time Functions in Spark - Date Arithmetic (9:53)
Master date and time arithmetic in Spark using date_add, date_sub, add_months, and months_between. Learn to use current_date and current_timestamp with data frame examples and understand end-of-month behavior.
Date and Time Functions in Spark - Using trunc and date_trunc (7:34)
Explore how to use Spark date_trunc and trunc to generate week-to-date, month-to-date, and year-to-date reports, and derive beginnings of date or time from timestamps with practical examples.
Date and Time Functions in Spark - Using date_format and other functions (15:33)
Date and Time Functions in Spark - dealing with unix timestamp (8:13)
Pre-defined Functions in Spark - Conclusion (4:23)
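
The date lectures above name date_add, add_months, trunc, and date_format. The following is a small sketch of how they might be combined on a dummy data frame, assuming Spark 3.x on the classpath and a local master. The object name, column names, and sample dates are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{add_months, col, date_add, date_format, to_date, trunc}

object DateFunctionsSketch {
  // Derives a few date columns from strings in yyyy-MM-dd form.
  def withDateColumns(spark: SparkSession, dates: Seq[String]): DataFrame = {
    import spark.implicits._
    dates.toDF("raw")
      .withColumn("d", to_date(col("raw"), "yyyy-MM-dd"))      // string to date
      .withColumn("plus_10_days", date_add(col("d"), 10))      // date arithmetic in days
      .withColumn("plus_1_month", add_months(col("d"), 1))     // clamps to the last day of a shorter month
      .withColumn("month_start", trunc(col("d"), "MM"))        // first day of the month
      .withColumn("day_name", date_format(col("d"), "EEEE"))   // full day-of-week name
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DateFunctionsSketch")
      .getOrCreate()

    // Hypothetical sample dates; the first one exercises the end-of-month case.
    withDateColumns(spark, Seq("2019-01-31", "2019-02-14")).show(false)
    spark.stop()
  }
}
```

For 2019-01-31, adding one month lands on 2019-02-28 because February has no 31st day, which is the end-of-month behavior the Date Arithmetic lecture refers to.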

Apache Spark 2 using Scala - Basic Transformations using Data Frames
Introduction to Basic Transformations using Data Frame APIs (2:51)
Explore basic transformations, including filtering, aggregation, and sorting, using Data Frame APIs on airlines data to compute daily totals of departure and arrival delays for one month.
Starting Spark Context (3:13)
Launch a Spark context by configuring a Spark session or Spark shell with local or yarn master settings, then explore filtering, aggregations, and sorting.
Overview of Filtering using Spark Data Frame APIs (5:24)
Explore filtering with Spark Data Frame APIs on the airlines data, using SQL-style and data frame style conditions with operators such as in, between, and like.
Filtering Data from Spark Data Frames - Reading Data and Understanding Schema (2:30)
Filtering Data from Spark Data Frames - Task 1 - Equal Operator (8:19)
Filtering Data from Spark Data Frames - Task 2 - Comparison Operators (3:41)
Filter Spark data frames to count flights with departure delay over 60 minutes, using SQL-style and API-style methods, and preview results in the airlines data frame (count = 40,104).
Filtering Data from Spark Data Frames - Task 3 - Boolean AND (5:22)
Filtering Data from Spark Data Frames - Task 4 - IN Operator (5:43)
Filtering Data from Spark Data Frames - Task 5 - Between and Like (9:09)
Filtering Data from Spark Data Frames - Task 6 - Using functions in Filter (9:48)
Count flights departing late on Sundays from the January 2008 data using Spark data frames in Scala. Use date_format and to_date to extract the day of week and filter late departures.
Overview of Aggregations using Spark Data Frame APIs (8:41)
Overview of Sorting using Spark Data Frame APIs (2:52)
Solution - Get Delayed Counts using Spark Data Frame APIs - Part 1 (6:47)
Solution - Get Delayed Counts using Spark Data Frame APIs - Part 2 (5:22)
Solution - Getting Delayed Counts By Date using Spark Data Frame APIs (16:28)
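
The filtering tasks above can be sketched roughly as follows, assuming Spark 3.x on the classpath and a local master. The tiny in-memory data set and the column names Year, Month, DayofMonth, and DepDelay are assumptions for illustration only. The course's airlines data and its actual schema are not shown in this outline, so this is not the instructor's solution.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, date_format, lpad, to_date}

object DelayFilterSketch {
  // Hypothetical stand-in for the airlines data: (Year, Month, DayofMonth, DepDelay in minutes).
  def sampleFlights(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (2008, 1, 6, 75),   // Sunday, late
      (2008, 1, 6, -3),   // Sunday, early
      (2008, 1, 7, 90),   // Monday, late
      (2008, 1, 13, 10),  // Sunday, late
      (2008, 1, 21, 120)  // Monday, late
    ).toDF("Year", "Month", "DayofMonth", "DepDelay")
  }

  // API style: the condition is built from Column expressions.
  def delayedApiStyle(flights: DataFrame, minutes: Int): Long =
    flights.filter(col("DepDelay") > minutes).count()

  // SQL style: the same condition written as a string expression.
  def delayedSqlStyle(flights: DataFrame, minutes: Int): Long =
    flights.filter(s"DepDelay > $minutes").count()

  // Late departures on Sundays: assemble a yyyyMMdd string, parse it with to_date,
  // then compare the day name produced by date_format.
  def lateOnSundays(flights: DataFrame): Long = {
    val flightDate = to_date(
      concat(
        col("Year").cast("string"),
        lpad(col("Month").cast("string"), 2, "0"),
        lpad(col("DayofMonth").cast("string"), 2, "0")
      ),
      "yyyyMMdd"
    )
    flights
      .filter((col("DepDelay") > 0) && (date_format(flightDate, "EEEE") === "Sunday"))
      .count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("DelayFilterSketch")
      .getOrCreate()
    val flights = sampleFlights(spark)
    println(delayedApiStyle(flights, 60))
    println(delayedSqlStyle(flights, 60))
    println(lateOnSundays(flights))
    spark.stop()
  }
}
```

The API-style and SQL-style filters are interchangeable here. The string form is convenient for ad hoc conditions, while Column expressions are checked by the Scala compiler and compose more easily in code.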

Requirements

Basic programming skills
Self-supported lab (instructions provided) or ITVersity lab at additional cost for an appropriate environment.
A 64-bit operating system, with minimum memory depending on the environment you are using:
4 GB RAM with access to proper clusters, or 16 GB RAM with virtual machines such as Cloudera QuickStart VM

Description

As part of this course, you will learn all the key skills to build Data Engineering Pipelines using Spark SQL and Spark Data Frame APIs with Scala as the programming language. This course used to be a CCA 175 Spark and Hadoop Developer course for preparation for the certification exam. The exam was retired on 10/31/2021, and we have renamed the course to Spark SQL and Spark 3 using Scala, as it covers industry-relevant topics beyond the scope of the certification.

About Data Engineering

Data Engineering is the processing of data according to downstream needs. As part of Data Engineering, we need to build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering. Conventionally, they are known as ETL Development, Data Warehouse Development, etc. Apache Spark has evolved into a leading technology for Data Engineering at scale.

I have prepared this course for anyone who would like to transition into a Data Engineer role using Spark (Scala). I am a Data Engineering Solution Architect with proven experience in designing solutions using Apache Spark.

Let us go through the details about what you will be learning in this course. Keep in mind that the course is created with a lot of hands-on tasks which will give you enough practice using the right tools. Also, there are tons of tasks and exercises to evaluate yourself.

Setup of Single Node Big Data Cluster

Many of you would like to transition to Big Data from conventional technologies such as Mainframes, Oracle PL/SQL, etc., and you might not have access to Big Data clusters. It is very important for you to set up the environment in the right manner. Don't worry if you do not have a cluster handy; we will guide you through support via Udemy Q&A.

Setup Ubuntu-based AWS Cloud9 Instance with the right configuration
Ensure Docker is set up
Setup Jupyter Lab and other key components
Setup and Validate Hadoop, Hive, YARN, and Spark

Are you feeling a bit overwhelmed about setting up the environment? Don't worry! We will provide complimentary lab access for up to 2 months. Here are the details.

Training uses an interactive environment. You will get 2 weeks of lab access to begin with. If you like the environment and acknowledge it by providing a 5* rating and feedback, the lab access will be extended by an additional 6 weeks (2 months in total). Feel free to send an email to support@itversity.com to get complimentary lab access. Also, if your employer provides a multi-node environment, we will help you set up the material for practice as part of a live session. On top of Q&A support, we also provide the required support via live sessions.

A quick recap of Scala

This course requires a decent knowledge of Scala. To make sure you understand Spark from a Data Engineering perspective, we added a module to quickly warm up with Scala. If you are not familiar with Scala, we suggest you go through relevant courses on Scala as a programming language.

Data Engineering using Spark SQL

Let us deep-dive into Spark SQL to understand how it can be used to build Data Engineering Pipelines. Spark SQL gives us the ability to leverage the distributed computing capabilities of Spark, coupled with easy-to-use, developer-friendly SQL-style syntax.

Getting Started with Spark SQL
Basic Transformations using Spark SQL
Managing Spark Metastore Tables - Basic DDL and DML
Managing Spark Metastore Tables - DML and Partitioning
Overview of Spark SQL Functions
Windowing Functions using Spark SQL

Data Engineering using Spark Data Frame APIs

Spark Data Frame APIs are an alternative way of building Data Engineering applications at scale leveraging distributed computing capabilities of Spark. Data Engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL to build Data Engineering applications.

Data Processing Overview using Spark Data Frame APIs leveraging Scala as Programming Language
Processing Column Data using Spark Data Frame APIs leveraging Scala as Programming Language
Basic Transformations using Spark Data Frame APIs leveraging Scala as Programming Language - Filtering, Aggregations, and Sorting
Joining Data Sets using Spark Data Frame APIs leveraging Scala as Programming Language

All the demos are given on our state-of-the-art Big Data cluster. You can avail of one-month complimentary lab access by reaching out to support@itversity.com with a Udemy receipt.

Who this course is for:

Any IT aspirant/professional willing to learn Data Engineering using Apache Spark
Python Developers who want to learn Spark using Scala as an additional skill on the way to becoming a Data Engineer
Java or Scala Developers who want to learn Spark using Scala to add Data Engineering skills to their profile

Course content

Introduction: 1 lecture • 8min
Setting up Environment using AWS Cloud9: 9 lectures • 30min
Setting up Environment - Overview of GCP and Provision Ubuntu VM: 8 lectures • 42min
Setup Hadoop on Single Node Cluster: 17 lectures • 39min
Setup Hive and Spark on Single Node Cluster: 30 lectures • 1hr 39min
Scala Fundamentals: 18 lectures • 3hr 13min
Overview of Hadoop HDFS Commands: 13 lectures • 1hr 22min
Apache Spark 2 using Scala - Data Processing - Overview: 7 lectures • 1hr 18min
Apache Spark 2 using Scala - Processing Column Data using Pre-defined Functions: 18 lectures • 2hr 16min
Apache Spark 2 using Scala - Basic Transformations using Data Frames: 15 lectures • 1hr 36min