[LEGACY–SUPPORT END] Spark SQL & Hadoop (For Data Science)

Name: [LEGACY–SUPPORT END] Spark SQL & Hadoop (For Data Science)
Rating: 4.4 (96 reviews)

Learn HDFS commands, Hadoop, Spark SQL, SQL Queries, ETL & Data Analysis| Spark Hadoop Cluster VM | Fully Solved Qs

Created by Matthew Barr

Last updated 9/2024

English

English [Auto]

What you'll learn

Students will get hands-on experience working in a Spark Hadoop environment that’s free and downloadable as part of this course.
Students will have opportunities to solve Data Engineering and Data Analysis problems using Spark on a Hadoop cluster in the sandbox environment that comes as part of this course.
Issuing HDFS commands.
Converting a set of data values in a given format stored in HDFS into new data values or a new data format and writing them into HDFS.
Loading data from HDFS for use in Spark applications & writing the results back into HDFS using Spark.
Reading and writing files in a variety of file formats.
Performing standard extract, transform, load (ETL) processes on data using the Spark API.
Using metastore tables as an input source or an output sink for Spark applications.
Applying the fundamentals of querying datasets in Spark.
Filtering data using Spark.
Writing queries that calculate aggregate statistics.
Joining disparate datasets using Spark.
Producing ranked or sorted data.

Course content

13 sections • 85 lectures • 5h 41m total length

Introduction • 3:13
Learn to solve big data problems using Apache Spark and Hadoop in a fully configured cluster, with hands-on data ingestion, data cleansing, and production-ready applications.
The Udemy Environment • 2:23
Explore the Udemy environment and its user interface to optimize learning, including reviews, video speed and quality, and downloading course materials.

Section Introduction • 1:07
Explore the high-level theory behind big data and the need for distributed storage and processing. Introduce Hadoop, its components, the Hadoop Distributed File System (HDFS), and YARN.
Big Data • 7:17
Understand big data through the data explosion, velocity and volume, from internet usage and IoT to social media, and learn how unstructured, structured, and semi-structured data differ.
Distributed Storage & Processing • 6:33
Explore distributed systems for big data storage and processing via computer clusters and distributed file systems, comparing vertical and horizontal scaling with commodity hardware and tools like Hadoop and Spark.
Introduction to Hadoop • 5:06
Explore how Hadoop enables reliable, scalable distributed computing with open source software. Analyze features like scalability, resilience, and data locality, along with components such as the distributed file system, the cluster resource manager, and MapReduce.
Introduction to Spark • 4:40
Apache Spark is a unified analytics engine for large-scale data processing, delivering speed with the built-in libraries Spark SQL, Spark Streaming, MLlib, and GraphX.
Spark Applications • 3:38
Define Spark applications as programs that use the Spark APIs with a driver and executors. Explain the master process and the Spark session for cluster connection and the DataFrame APIs.
Spark's Interactive Shell • 4:07
Explore Spark's interactive shell, including the Spark shell interfaces for Scala and Python, to write and test code in an interactive Spark session with Hive support.
Distributed Processing on a Hadoop Cluster using Spark • 1:51
Deploy Spark on a Hadoop cluster and use HDFS as a data source. Show how YARN acts as the resource scheduler and how Spark executors run in containers on worker nodes.

Section Introduction • 0:48
Set up the Verulam Blue VM with a Spark Hadoop cluster, install Oracle VM VirtualBox, boot the VM, then access the Spark shell or a Zeppelin notebook for practice questions.
Install Oracle VM VirtualBox • 1:58
Install Oracle VM VirtualBox to load the Verulam Blue VM, enabling Spark on a Hadoop cluster from your desktop, with a Hive Metastore server and an Apache Zeppelin notebook.
The Verulam Blue VM - Zipped Files for Downloading • 0:10
Loading the Verulam Blue VM • 2:52
Load the Verulam Blue VM in Oracle VM VirtualBox by decompressing zip files, importing the VM image, and starting the VM; download files one at a time to avoid corruption.
Booting up the VM • 1:23
Open Oracle VM VirtualBox and boot the Verulam Blue VM, then work with Spark, Hadoop, Hive, and the Zeppelin notebook, and learn to spin up a cluster.
Spin Up Cluster • 1:13
Spin up the Hadoop cluster by clicking the desktop spin-up icon on the Verulam Blue VM, then verify that seven processes, including the NameNode and ResourceManager, are running.
spark-shell • 1:34
Launch spark-shell to start a Spark session, use the spark variable to run code, view databases, and exit by typing :quit.
Run Zeppelin Notebook • 1:34
Launch the Apache Zeppelin notebook, open a new notebook, and run code. Use show databases to list databases, then refresh until the readiness indicator appears, and shut down the process.
Problems & practice test questions • 1:30
Access end-of-section problems and the practice test by launching Spark Show from the desktop, then use the spark-shell to review questions once the Verulam Blue VM cluster is up.

Interacting with HDFS • 4:05
Learn how to interact with the Hadoop Distributed File System (HDFS) using the Java API, NFS gateway, web UI, and the file system shell to browse, upload, and manage data.
The File System Shell (FS Shell) • 2:40
Learn to use the file system shell to interact with HDFS through the command line, perform basic filesystem operations, and script tasks for efficient data management in a Hadoop cluster.
Commands and operations -help • 1:34
Learn how to use the hdfs dfs command with the -help option to enumerate available options and read detailed information about a specific option, such as put, along with its arguments.
Commands and operations -ls • 5:04
Learn to list directory contents with the hdfs dfs -ls option, view root and user directories, and understand the permissions, ownership, and basic file attributes shown.
Commands and operations -find • 3:36
Use the hdfs dfs -find command to locate paths of files and subdirectories. Employ -name with wildcards to search by partial names and view help with -help find.
Commands and operations -mkdir • 3:02
Discover how to create directories in HDFS using the mkdir command, including single and multiple directories, and nested directories with the -p option, and verify with ls.
Commands and operations -put • 4:38
Copy files and directories from the local file system to HDFS using the put command, overwrite with -f, copy directories, and copy only contents with wildcards, with verification.
Commands and operations -cp -mv • 4:59
Learn to copy and move files between HDFS locations using the cp and mv options of the hdfs dfs command, specifying source and destination paths, and how to rename via move.
Commands and operations -cat -tail -text • 5:29
Discover how to view Hadoop file contents using hdfs dfs: cat for small files, head and tail for partial views, and text for compressed files.
Commands and operations -rmdir -rm • 5:40
Learn to delete files and directories using rm and rmdir, including removing empty directories and recursively deleting non-empty folders, and using wildcards to target multiple files.
Commands and operations -get • 2:48
Learn how to copy files and directories from HDFS to the local file system using the hdfs dfs -get command, including both directories and single files with practical examples.
Health warning • 0:58
Exercise caution when running Hadoop file system shell commands to avoid accidental data deletion, respect cluster permissions, and recover by downloading raw VM files and reloading the overlay.
HDFS Basic File Management - Problems & Solutions • 0:03
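
As a rough illustration of the FS shell options listed in this section, the sketch below assembles a few hdfs dfs command lines in Python and prints them. It only builds the strings and does not run anything, so no Hadoop installation is assumed. The helper name hdfs_dfs and every path and file name are hypothetical and not taken from the course materials.

```python
import shlex


def hdfs_dfs(option, *args):
    """Build the argument list for one `hdfs dfs` call (nothing is executed)."""
    return ["hdfs", "dfs", f"-{option}", *args]


# Hypothetical paths and file names, used only for illustration.
commands = [
    hdfs_dfs("help", "put"),                              # detailed help for one option
    hdfs_dfs("ls", "/user"),                              # list a directory
    hdfs_dfs("find", "/user", "-name", "*.csv"),          # search by partial name
    hdfs_dfs("mkdir", "-p", "/user/demo/in/raw"),         # create nested directories
    hdfs_dfs("put", "-f", "sales.csv", "/user/demo/in"),  # upload, overwriting if present
    hdfs_dfs("cat", "/user/demo/in/sales.csv"),           # print a small file
    hdfs_dfs("get", "/user/demo/in/sales.csv", "."),      # copy back to the local disk
    hdfs_dfs("rm", "-r", "/user/demo/in/raw"),            # recursive delete: use with care
]

for cmd in commands:
    # shlex.join quotes the wildcard so the local shell would not expand it.
    print(shlex.join(cmd))
```

Printing the commands before running them by hand is one cautious way to follow the health warning above, since a mistyped rm -r is hard to undo.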

Section Introduction • 0:45
Explore the three key data structures: Spark DataFrames, Hive tables, and temporary views, and note that DataFrames sit on top of RDDs and are a special case of Datasets.
DataFrames • 6:43
Explore Spark DataFrames, distributed structures with named columns and defined data types for large-scale data across clusters. Understand schema, partitions, immutability, and lazy transformations and actions.
Tables • 4:15
Explore database concepts, tables, and data warehouses with Apache Hive on Hadoop. Learn about the Hive metastore, table metadata, managed vs external tables, and schema-on-read for large datasets.
Temp Views • 2:11
Understand temporary views in Spark as virtual structures that are not stored in the Hive metastore and disappear when the session ends. Create them from DataFrames or Hive tables to query with SQL.

Section Introduction • 0:33
Explore creating three key data structures (DataFrames, Hive tables, and temporary views) and use Spark SQL to query them.
Querying Data Structures using SQL via Spark SQL • 7:13
Explore querying structured data with Spark SQL, writing SQL queries over Spark DataFrames and tables, and using the Spark session to register temp views or save data as tables.
Creating DataFrames with Spark SQL • 4:02
Create DataFrames with Spark SQL by reading tables or files via the DataFrameReader interface and its table method. Load from various sources in formats like CSV and JSON.
Creating Databases & Tables with Spark SQL • 7:08
Learn to create databases and tables with Spark SQL, including create database if not exists and saving DataFrames as tables in the Hive metastore.
Creating Temporary Views with Spark SQL • 1:57
Create temporary views in Spark SQL from DataFrames or tables, register them for SQL queries, and drop them without affecting the underlying data.

Section Introduction • 0:26
Explore basic Spark SQL operations on DataFrames and tables, including column and row manipulations and basic SQL queries.
Operations on DataFrame columns • 13:14
Learn how to manipulate DataFrame columns with the Spark SQL and DataFrame APIs, covering referencing columns in multiple ways, selecting subsets, renaming, adding, and dropping columns, plus handling reserved names.
Operations on DataFrame rows • 5:51
Explore basic DataFrame operations with the Scala API in a Zeppelin notebook, including limit, show(3), count, union, distinct, dropDuplicates, and sample.
Basic SQL queries for Tables • 9:23
Explore essential SQL queries for tables, including describe table and describe function, select, joins, renaming with as, cast, backticks, limit, distinct, and drop table, illustrated in a Zeppelin notebook.

Section Introduction • 1:15
Explore data engineering with Spark SQL and Hadoop by performing ETL tasks, loading data from HDFS, reading files in various formats, transforming data, and writing results back to HDFS.
The ETL Process • 3:38
Explore the ETL process: extract, transform, and load data from data lakes and warehouses, clean and transform it in a Spark DataFrame, and load it for analysis.
The Extract Phase of an ETL process • 5:14
Explore the extraction phase of a Spark ETL process by loading files into a DataFrame with the read API. Identify formats like text, CSV, JSON, Parquet, and Avro.
The Extract Phase - Loading CSV and Text files • 7:56
Learn how to load CSV and text files into Spark DataFrames, using schema inference or an explicit schema, header and separator options, and the text versus CSV readers.
The Extract Phase - Loading JSON and Parquet files • 4:01
Load JSON and Parquet data into Spark DataFrames: let schemas be inferred automatically, view the schema, reorder columns with select expressions, and use the multiLine and mergeSchema options.
The Extract Phase - Loading Avro and ORC files • 1:12
Learn to load Avro and ORC files into a DataFrame with the Spark session's DataFrameReader. Compare standard and alternative loading methods and note the limited data source options.
The Transform Phase of an ETL process • 2:29
Navigate the transform phase of ETL, covering deduplication, format revision, splitting, encoding, deriving new values, null handling, and using alias, string, numeric, and datetime functions on HDFS data.
The Transform Phase - String Transformations • 9:08
Learn string transformations in Spark SQL for data science, including replace with regular expressions, label encoding, splitting, substring, length, trim, capitalization, and joining column values.
The Transform Phase - Numerical Transformations • 4:50
Explore numerical transformations in Spark SQL using math functions such as round, absolute value, power, and square root, applied to salary data within a Spark session workflow.
The Transform Phase - Date & Time Transformations • 7:27
Learn how to perform date and time transformations in Spark, including converting strings to dates, reformatting dates, extracting day, month, and year, obtaining the current date, and computing differences between dates.
The Transform Phase - Data Type Transformations • 5:40
Explore Spark data types and data type transformations, examine and cast DataFrame columns, and understand integers, floats, decimals, strings, booleans, and timestamps using printSchema and cast.
The Transform Phase - Transformations of Nulls • 4:39
Transform nulls in Spark DataFrames by dropping rows with null values or filling them with replacements, using the drop and fill methods with configurable how, thresh, and subset parameters.
The Load Phase of an ETL process • 1:37
Explore the load phase of ETL, writing results back to storage as persistent tables or exporting DataFrames to file formats like CSV, text, JSON, Parquet, Avro, and ORC.
The Load Phase - Saving DataFrame data to Files I • 8:22
Write DataFrame data to files using the Spark session and the DataFrameWriter; choose formats such as CSV, JSON, or Avro, and configure options such as schema inference, header, and compression.
The Load Phase - Saving DataFrame data to Files II • 7:34
Learn how to save DataFrame data using Spark's partitionBy, coalesce, and partition control to manage output files and directories, enabling reading of only the relevant partitions.
The Load Phase - Saving DataFrame data to Tables • 2:46
Save a DataFrame as a persistent table using Spark SQL, choosing formats, setting a custom path, and applying save modes while understanding table drop behavior and the Hive metastore.
Data Engineering - Solutions to Problems • 0:03
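
The three ETL phases described in this section can be sketched in miniature. The example below is a plain-Python stand-in, not the Spark API: it uses only the standard library so that it runs without a cluster, and the sample data, column names, and function names are all hypothetical. In the course the same steps would be carried out with Spark's reader, DataFrame transformations, and writer against files on HDFS.

```python
import csv
import io
import json
from datetime import date

# Hypothetical raw input; in the course this would be a file stored on HDFS.
RAW_CSV = """name,joined,salary
  alice ,2021-03-01,52000.456
BOB,2020-11-15,
"""


def extract(text):
    """Extract: parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, default_salary=0.0):
    """Transform: trim and capitalise names, derive the joining year,
    fill missing salaries with a default, and round to two decimals."""
    out = []
    for row in rows:
        salary = row["salary"].strip()
        out.append({
            "name": row["name"].strip().capitalize(),
            "year_joined": date.fromisoformat(row["joined"]).year,
            "salary": round(float(salary), 2) if salary else default_salary,
        })
    return out


def load(rows):
    """Load: serialise the records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r) for r in rows)


result = transform(extract(RAW_CSV))
print(load(result))
```

Keeping each phase in its own function mirrors the extract, transform, load split used in the lectures and makes each step easy to check on its own.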

Section Introduction • 1:11
Master data analysis with Spark by querying loaded data to generate simple reports, filter data, and compute aggregate statistics with group by and joins across datasets.
Metastore Tables as Input Sources or Output Sinks • 3:30
Use metastore tables as input sources or output sinks in Spark applications, reading with the DataFrameReader and its table method, or with SQL, and save results to Hive tables.
Querying datasets in Spark • 6:32
Learn the fundamentals of querying with SQL in Spark, building select and from clauses, applying where, group by, having, and order by, with results returned as DataFrames.
Math Functions in SQL • 3:04
Explore SQL math functions in Spark SQL, including round, abs, power, and square root, to transform column values such as calculating monthly salaries and naming result columns.
Filtering • 10:24
Explore how to filter data with where clauses and boolean expressions in Spark SQL. Apply limit, distinct, between, and like to retrieve targeted records in DataFrames.
Sorting & Ranking • 7:32
Learn to sort and rank results in Spark SQL using order by, ascending and descending, and the ranking functions rank and dense rank, with partition by for regional top salaries.
Aggregation • 6:30
Explore aggregation in Spark SQL by applying common aggregate functions such as count, sum, average, min, max, and distinct to compute insights across table columns and groups.
Grouping • 7:33
Group data using the group by clause, apply where filters, and use having to refine groups. Learn aggregation and DataFrame approaches in Spark SQL.
Multi Table Queries • 2:38
Master multi-table queries by applying the join clause for side-by-side results and the union operator to stack results from multiple selects, using common keys and compatible columns.
Multi Table Queries - Joins • 6:36
Learn how to perform multi-table and DataFrame joins using on conditions and join clauses, with aliases, and handle duplicate column names by dropping or renaming.
Multi Table Queries - Types of Joins • 8:53
Master multi-table queries with Spark DataFrames by applying inner, left outer, right outer, full outer, left semi, and left anti joins on left and right tables, including select statements.
Multi Table Queries - Unions • 5:28
Master multi-table queries with the union operator to combine rows from top and bottom tables in the school DB, including union all and union distinct, aliasing, and casting for compatibility.
Data Analysis - Solutions to Problems • 0:03
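
The query patterns covered in this section (filtering, grouping with having, ranking with partition by, joins, and unions) can be tried out in a few lines. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for a Spark session so that it runs without a cluster, and the employees and regions tables and their contents are made up. It assumes an SQLite build with window function support (3.25 or later). Spark SQL accepts LEFT ANTI JOIN directly, which SQLite lacks, so the anti join is emulated here with NOT EXISTS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample tables, used only for illustration.
conn.executescript("""
    CREATE TABLE employees (name TEXT, region TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'North', 5000), ('Ben', 'North', 5000), ('Cy', 'North', 2000),
        ('Dee', 'South', 6000), ('Eli', 'South', 3000);
    CREATE TABLE regions (region TEXT, manager TEXT);
    INSERT INTO regions VALUES ('North', 'Noor'), ('East', 'Emil');
""")

# WHERE filters rows before grouping; HAVING filters the groups afterwards.
grouped = conn.execute("""
    SELECT region, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary >= 4000
    GROUP BY region
    HAVING COUNT(*) > 1
""").fetchall()

# RANK leaves a gap after ties; DENSE_RANK does not.
ranked = conn.execute("""
    SELECT name, region, salary,
           RANK()       OVER (PARTITION BY region ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY salary DESC) AS dense_rnk
    FROM employees
    ORDER BY region, salary DESC, name
""").fetchall()

# Inner join keeps only matching rows; left outer keeps every left row.
inner = conn.execute("""
    SELECT e.name, r.manager FROM employees AS e
    JOIN regions AS r ON e.region = r.region
    ORDER BY e.name
""").fetchall()
left_outer = conn.execute("""
    SELECT e.name, r.manager FROM employees AS e
    LEFT OUTER JOIN regions AS r ON e.region = r.region
    ORDER BY e.name
""").fetchall()

# Left anti join emulated with NOT EXISTS: left rows with no match on the right.
anti = conn.execute("""
    SELECT e.name FROM employees AS e
    WHERE NOT EXISTS (SELECT 1 FROM regions AS r WHERE r.region = e.region)
    ORDER BY e.name
""").fetchall()

# UNION ALL keeps duplicates; UNION removes them.
union_all = conn.execute(
    "SELECT region FROM employees UNION ALL SELECT region FROM regions"
).fetchall()
union_distinct = conn.execute(
    "SELECT region FROM employees UNION SELECT region FROM regions ORDER BY region"
).fetchall()

for label, rows in [("grouped", grouped), ("ranked", ranked), ("anti", anti)]:
    print(label, rows)
```

In the ranked result, the two North employees tied on salary both get rank 1, and the next North row gets rank 3 but dense rank 2, which is the difference between the two functions.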

Requirements

This course has been designed for individuals who are new to Hadoop and Spark, so it does not assume any prior knowledge of Hadoop or Spark theory.
A basic knowledge of SQL queries is helpful, but students with no prior knowledge of SQL are given enough of an introduction to SQL queries to hit the ground running.
The Verulam Blue VM, which comes as part of this course, has a Spark Hadoop environment and requires a PC or laptop with a minimum of 8 GB RAM and 20 GB of free space (instructions on how to download and run the VM are provided).

Description

*Important Notice*

This course has been retired and is no longer receiving support. Originally designed to help students pass the now-retired Cloudera Certification exams, the material remains useful for those wanting to practice their skills on Spark and Hadoop clusters. However, its primary focus was certification preparation, which many students successfully completed.

Apache Spark is currently one of the most popular systems for processing big data.

Apache Hadoop continues to be used by many organizations that look to store data locally on premises. Hadoop allows these organisations to efficiently store big datasets ranging in size from gigabytes to petabytes.

As the number of vacancies for data science, big data analysis and data engineering roles continues to grow, so too will the demand for individuals who possess knowledge of Spark and Hadoop technologies to fill these vacancies.

This course has been designed specifically for data scientists, big data analysts and data engineers looking to leverage the power of Hadoop and Apache Spark to make sense of big data.

This course will help those individuals that are looking to interactively analyse big data or to begin writing production applications to prepare data for further analysis using Spark SQL in a Hadoop environment.

The course is also well suited for university students and recent graduates that are keen to gain exposure to Spark & Hadoop or anyone who simply wants to apply their SQL skills in a big data environment using Spark-SQL.

This course has been designed to be concise and to provide students with a necessary and sufficient amount of theory, enough for them to be able to use Hadoop & Spark without getting bogged down in too much theory about older low-level APIs such as RDDs.

By solving the questions contained in this course, students will begin to develop the skills and confidence needed to handle real-world scenarios that come their way in a production environment.

(a) There are just under 30 problems in this course. These cover HDFS commands, basic data engineering tasks and data analysis.

(b) Fully worked out solutions to all the problems.

(c) Also included is the Verulam Blue virtual machine, an environment with a Spark Hadoop cluster already installed, so that you can practice working on the problems.

The VM contains a Spark Hadoop environment which allows students to read and write data to and from the Hadoop file system as well as to store tables in the Hive metastore.
All the datasets students will need for the problems are already loaded onto HDFS, so there is no need for students to do any extra work.
The VM also has Apache Zeppelin installed. This is a notebook specific to Spark and is similar to Python’s Jupyter notebook.

This course will allow students to get hands-on experience working in a Spark Hadoop environment as they practice:

Converting a set of data values in a given format stored in HDFS into new data values or a new data format and writing them into HDFS.
Loading data from HDFS for use in Spark applications & writing the results back into HDFS using Spark.
Reading and writing files in a variety of file formats.
Performing standard extract, transform, load (ETL) processes on data using the Spark API.
Using metastore tables as an input source or an output sink for Spark applications.
Applying the fundamentals of querying datasets in Spark.
Filtering data using Spark.
Writing queries that calculate aggregate statistics.
Joining disparate datasets using Spark.
Producing ranked or sorted data.

Who this course is for:

This course has been designed specifically for data scientists, big data analysts and data engineers looking to leverage the power of Hadoop and Apache Spark to make sense of big data.
This course is also well suited for university students and recent graduates who are keen to land a job with a company that's looking to fill big data-related positions, or anyone who simply wants to apply their SQL skills in a big data environment using Spark-SQL.
Software engineers & developers who are looking to break into the Data Engineering field will also find this course helpful.

Course content

Introduction • 2 lectures • 6min

Introduction to Hadoop & Spark • 8 lectures • 34min

Our Working Environment • 9 lectures • 13min

HDFS Basic File Management • 13 lectures • 45min

Data Structures • 4 lectures • 14min

Spark SQL & Creating Data Structures • 5 lectures • 21min

Basic Operations on Data Structures • 4 lectures • 29min

Data Engineering • 17 lectures • 1hr 18min

Data Analysis • 13 lectures • 1hr 10min

End of Course Test Solutions • 1 lecture • 1min
