Learn Big Data: The Hadoop Ecosystem Masterclass

Master the Hadoop ecosystem using HDFS, MapReduce, Yarn, Pig, Hive, Kafka, HBase, Spark, Knox, Ranger, Ambari, Zookeeper

Created by Edward Viaene

Last updated 6/2025

English

What you'll learn

Process Big Data using batch processing
Process Big Data using realtime processing
Be familiar with the technologies in the Hadoop Stack
Be able to install and configure the Hortonworks Data Platform (HDP)

Course content

17 sections • 99 lectures • 6h 25m total length

Course Introduction (3:01)
Course introduction, lecture overview, course objectives
Course Guide (1:15)
This document provides a guide to doing the demos in this course

What is Big Data (2:16)
The 3 (or 4) V's of Big Data explained
Examples of Big Data (3:29)
What is Big Data? Some examples of companies using Big Data, like Spotify, Amazon, Google, and Tesla
What is Data Science (2:19)
What can we do with Big Data? Data Science explained.
What is Hadoop (4:13)
How to build a Big Data System? What is Hadoop?
Hadoop Distributions (3:17)
Hadoop Distributions: a comparison between Apache Hadoop, Hortonworks Data Platform, Cloudera, and MapR
What is Big Data Quiz

Hadoop Installation - Ambari 3 (2025 Update) (25:50)
In 2025, Ambari 3 was released, making it easier to provision a Hadoop cluster. This demo shows how to provision a Hadoop cluster using Docker and Ambari 3, replacing the HDP Vagrant/Sandbox install.
Hadoop Installation (4:40)
How to install Hadoop? You can install Hadoop using Vagrant with VirtualBox / VMware, or in the cloud using AWS. Hortonworks also provides a Sandbox.
Demo: Hortonworks Sandbox (not available anymore for download) (4:21)
This is a demo of how to install and use the Hortonworks Sandbox. It is an alternative to the full installation using Ambari if your machine doesn't have a lot of memory available. You can also use both in conjunction.
Demo: Hadoop Installation (Hortonworks - old Ambari version) - Part 1 (4:58)
A walkthrough of how to install the Hortonworks Data Platform (HDP) on your laptop or desktop
Demo: Hadoop Installation (Hortonworks - old Ambari version) - Part 2 (6:38)
A walkthrough of how to install the Hortonworks Data Platform (HDP) on your laptop or desktop (Part II)
Introduction to HDFS (3:28)
An introduction to HDFS, the Hadoop Distributed File System
DataNode Communications (1:15)
Communications between the DataNode and the NameNode explained
Demo: HDFS - Part 1 (5:45)
An introduction to HDFS using hadoop fs -put. I also show how a file gets divided into blocks and where those blocks are stored.
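As a rough illustration of how a file gets divided into blocks, the Python sketch below computes block boundaries for a given file size. It assumes the 128 MB default block size of recent Hadoop releases (configurable through dfs.blocksize). The function name split_into_blocks is hypothetical, and the sketch ignores replication and the placement of blocks on DataNodes.

```python
# Assumption: 128 MB block size (the default in recent Hadoop releases).
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block.

    Hypothetical helper for illustration only. Every block is full-sized
    except possibly the last one, which holds the remainder of the file.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    mb = 1024 * 1024
    for offset, length in split_into_blocks(300 * mb):
        print(f"block at offset {offset // mb} MB, length {length // mb} MB")
```

Under these assumptions, a 300 MB file ends up in three blocks: two full blocks and one smaller final block.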
Demo: HDFS - Part 2 - Using Ambari (4:59)
An introduction to downloading, uploading, and listing files. This time I'm using the Ambari HDFS Viewer and the NameNode UI. I also show what configuration changes are necessary to make this work.
MapReduce WordCount Example (4:17)
MapReduce WordCount explained step by step
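The WordCount steps can be sketched in plain Python without a cluster. This is not Hadoop's Java API. It only mimics the three phases (map, shuffle, reduce) in a single process, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```

On a real cluster the map and reduce functions run in parallel on different nodes, and the shuffle moves data over the network between them.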
Demo: MapReduce WordCount (7:05)
A demo of MapReduce WordCount on our HDP cluster
Lines that span blocks (2:29)
In HDFS, files are divided into blocks and stored on the DataNodes. In this lecture we're going to see what happens when we read lines from files that potentially span multiple blocks.
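One common way to handle lines that cross a block boundary is to assign each line to the split in which its first byte lies: a reader skips the partial line at the start of its split and reads past the end of its split to finish its last line. The Python sketch below illustrates that idea on an in-memory byte string. It is a simplification under that assumption, not a copy of Hadoop's record reader, and the function name read_split is hypothetical.

```python
def read_split(data, start, end):
    """Return the lines whose first byte lies in the range [start, end).

    Simplified illustration: `data` stands in for the whole file, and
    reading beyond `end` stands in for fetching bytes from the next block.
    """
    pos = start
    if start != 0 and data[start - 1:start] != b"\n":
        # We landed in the middle of a line that belongs to the previous
        # split, so skip ahead to the start of the next line.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # The slice may run past `end`: the last line is read to completion.
        lines.append(data[pos:newline].decode())
        pos = newline + 1
    return lines


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    block_size = 10  # Tiny block size so that lines cross boundaries.
    for start in range(0, len(data), block_size):
        print(start, read_split(data, start, start + block_size))
```

With this rule every line is read exactly once, even though "bravo" and "charlie" both straddle a boundary in the example.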
Introduction to Yarn (4:20)
Introducing Yarn, and concepts like the ResourceManager, the Scheduler, the ApplicationsManager, the NodeManager, and the ApplicationMaster. I explain how an application is executed and the consequences when a node crashes.
Demo: Yarn and ResourceManager UI (5:45)
A demo of an application executed using yarn jar. I provide an overview of Ambari Yarn metrics and the ResourceManager UI.
Ambari API and Blueprints (3:35)
Ambari also exposes a REST API. Commands can be sent directly to this API. Ambari also lets you do unattended installs using Ambari Blueprints.
Demo: Ambari API and Blueprints (8:38)
A demo showing you the Ambari API and how to work with blueprints
ETL Processing in Hadoop (1:50)
An introduction to ETL processing in Hadoop. MapReduce, Pig, and Spark are suitable for batch processing. Hive is more suitable for data exploration.
Introduction Quiz

Introduction to Pig (2:36)
An introduction to Pig and Pig Latin.
Demo: Part 1 - Pig Installation (2:08)
This demo shows how to install Pig and Tez using Ambari on the Hortonworks Data Platform
Demo: Part 2 - Pig Commands (6:21)
In this demo I will show you basic Pig commands to load, dump, and store data. I'll also show you an example of how to filter data.
Demo: Part 3 - More Pig Commands (4:02)
More Pig commands in this final part of the Pig demo. I'll go over commands like GROUP BY, FOREACH ... GENERATE, and COUNT()

Introduction to Apache Spark (3:42)
An introduction to Apache Spark. This lecture explains the differences between running spark-submit in local mode, yarn-cluster mode, and yarn-client mode.
Spark WordCount (2:36)
An introduction to WordCount in Spark using Python (pyspark)
Demo: Spark installation and WordCount (4:36)
Spark installation using Ambari and a demo of the Spark WordCount using the pyspark shell.
RDDs (3:52)
This lecture gives an introduction to Resilient Distributed Datasets (RDDs). This abstraction allows you to do transformations and actions in Spark. I give an example using filter RDDs, and explain how shuffle RDDs impact disk and network IO.
Demo: RDD Transformations and Actions (6:02)
A demo of RDD transformations and actions in Spark
Overview of RDD Transformations and Actions (3:36)
An overview of the most common RDD actions and transformations
Spark MLlib (1:58)
An overview of what Spark MLlib (Machine Learning Library) can do. I explain a Recommendation Engine example and a Clustering example (K-Means / DBSCAN)

Introduction to Hive (2:47)
An introduction to SQL on Hadoop using Hive, enabling data warehouse capabilities. This lecture provides an architecture overview and an overview of the Hive CLI and beeline using JDBC.
Hive Queries (4:29)
An overview of Hive Queries: creating tables, creating databases, inserting data, and selecting data. This lecture also shows where the Hive data is stored in HDFS.
Demo: Hive Installation and Hive Queries (7:33)
A demo that shows the installation of HiveServer2 and the clients. Afterwards I show you a few example queries using a JDBC beeline connection.
Hive Partitioning, Buckets, UDFs, and SerDes (4:32)
Hive can't be optimized using indexes. This lecture explains how queries in Hive should be optimized using partitions and buckets. It also covers User Defined Functions (UDFs) and Serialization / Deserialization (SerDes).
The Stinger Initiative (2:42)
The Stinger initiative brings optimizations to Hive. Query times have dropped significantly over the years. This lecture explains the details.
Hive in Spark (1:43)
You can also use Hive in Spark using the Spark SQLContext.

Introduction to Kafka (1:42)
An introduction to Kafka and its terminology, like Producers, Consumers, Topics, and Partitions.
Kafka Topics (4:10)
An explanation of Kafka Topics covering Leader partitions, Follower partitions, and how writes are sent to the partitions. Also covers Consumer groups to show the difference between the publish-subscribe (pubsub) mechanism and queuing.
Kafka Messages and Log Compaction (4:04)
Kafka guarantees at-least-once message delivery, but can also be configured for at-most-once. Log Compaction is a technique that Kafka provides to maintain a full dataset in the commit log. This lecture shows an example of a customer dataset fully kept in Kafka and explains the Log Tail, Cleaner Point, and Log Head and how they impact consumers.
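The core idea of log compaction, keeping at least the latest value for every key, can be sketched in a few lines of Python. This is a simplified model and not Kafka's implementation: it compacts the whole log at once, whereas Kafka only compacts the older part of the log and keeps delete markers around for a configurable time. The function name compact and the sample customer records are made up for illustration.

```python
def compact(log):
    """Keep only the most recent record for each key.

    `log` is a list of (key, value) pairs in write order; the list index
    plays the role of the offset. A value of None is treated as a delete
    marker (tombstone), so that key disappears from the compacted log.
    Surviving records keep their original offsets.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # Later writes overwrite earlier ones.

    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    return sorted(survivors, key=lambda record: record[0])


if __name__ == "__main__":
    # Hypothetical customer dataset keyed by customer id.
    log = [
        ("c1", "Alice"),
        ("c2", "Bob"),
        ("c1", "Alice B."),
        ("c3", "Carol"),
        ("c2", None),  # Delete customer c2.
    ]
    for record in compact(log):
        print(record)
```

A consumer that reads the compacted log from the beginning still ends up with the current state of every customer, which is what makes it possible to keep a full dataset in Kafka.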
Kafka Use Cases and Usage (2:47)
A few example use cases of Kafka
Demo: Kafka Installation and Usage (6:31)
The installation of Kafka on the Hortonworks Data Platform and a demo of a producer-consumer example.

Introduction to Storm (2:49)
This lecture provides an introduction to Storm, a realtime computing system. The architecture overview explains components like Nimbus, Zookeeper, and the Supervisor.
A Storm Topology (4:14)
This lecture explains what Storm topologies are. I talk about streams, tuples, spouts, and bolts.
Demo: Storm installation and Example Topology (9:33)
A demo of a Storm Topology ingesting data from Kafka and doing computation on the data.
Storm Message Processing and Reliability (4:00)
Message delivery explained: at-most-once delivery, at-least-once delivery, and exactly-once delivery. This lecture also explains Storm's reliability API (anchoring and acking) and the performance impact of acking.
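To give a feel for why acking has a cost but stays cheap in memory, the Python sketch below models the XOR bookkeeping that Storm's acker tasks are documented to use: every tuple id is XORed into a running value once when the tuple is created and once when it is acked, so the value returns to zero when the whole tuple tree has been processed. The class and method names are hypothetical, and the sketch leaves out timeouts, failures, and the random 64-bit ids used in practice.

```python
class Acker:
    """Toy model of XOR-based tuple tree tracking (illustration only)."""

    def __init__(self):
        self.pending = {}  # root id -> XOR of all outstanding tuple ids

    def track(self, root_id, tuple_id):
        """Register the tuple emitted by the spout for this root."""
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, acked_id, anchored_ids=()):
        """Ack one tuple and register any new tuples anchored to it.

        Returns True when the whole tree for root_id is fully processed.
        """
        value = self.pending[root_id] ^ acked_id
        for new_id in anchored_ids:
            value ^= new_id
        if value == 0:
            del self.pending[root_id]
            return True
        self.pending[root_id] = value
        return False


if __name__ == "__main__":
    acker = Acker()
    acker.track(root_id=1, tuple_id=0xA1)
    # A bolt processes 0xA1, emits two anchored tuples, then acks its input.
    print(acker.ack(1, 0xA1, anchored_ids=[0xB2, 0xC3]))
    print(acker.ack(1, 0xB2))
    print(acker.ack(1, 0xC3))
```

The tree counts as complete only after the last anchored tuple is acked. Until then the spout tuple stays pending and would eventually be replayed, which is where at-least-once delivery comes from.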
Trident (2:42)
An introduction to the Trident API, an alternative interface for Storm that supports exactly-once processing of messages.

Introduction to Spark Streaming (1:57)
Spark Streaming is an alternative to Storm that has gained a lot of popularity in the last few years. It allows you to reuse the code you wrote for batch processing and use it for stream processing.
Spark Streaming Architecture (1:32)
Spark Streaming generates DStreams, micro-batches of RDDs. This lecture explains the Spark Streaming architecture.
Spark Receivers and WordCount Streaming Example (3:28)
This lecture explains possible receivers, like Kafka. It also shows a WordCount streaming example, where data is ingested from Kafka and processed using WordCount in Spark Streaming.
Demo: Spark Streaming with Kafka (3:57)
This demo shows the Kafka Spark Streaming example.
Spark Streaming State and Checkpointing (2:09)
In the previous lecture we did a WordCount using Spark Streaming, but our example was stateless. In this lecture I'm adding state, using updateStateByKey to keep state and checkpointing to save the data to HDFS.
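The difference between a stateless and a stateful WordCount can be sketched without Spark. The Python code below imitates the shape of the update function that updateStateByKey expects (the new values for a key plus its previous state) and applies it to a sequence of micro-batches. The process_batch driver is a hypothetical stand-in for what Spark Streaming does internally, and the sketch does not cover checkpointing.

```python
def update_count(new_values, running_count):
    """Update function: add this batch's values to the previous count.

    running_count is None the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)


def process_batch(state, batch_words):
    """Apply update_count to every key, given one micro-batch of words.

    Hypothetical driver for illustration. Keys that already have state are
    updated even when the batch contains no new values for them.
    """
    grouped = {}
    for word in batch_words:
        grouped.setdefault(word, []).append(1)

    new_state = {}
    for key in set(state) | set(grouped):
        new_state[key] = update_count(grouped.get(key, []), state.get(key))
    return new_state


if __name__ == "__main__":
    state = {}
    for batch in (["a", "b", "a"], ["b", "c"]):
        state = process_batch(state, batch)
        print(sorted(state.items()))
```

A stateless version would forget the counts after every batch. Carrying the state dictionary forward is what turns the per-batch count into a global count.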
Demo: Stateful Spark Streaming (3:24)
A demo of a stateful Spark Streaming application. It performs a global WordCount over a Kafka topic and checkpoints to HDFS.
More Spark Streaming Features (1:08)
More Spark Streaming features, like windowing and streaming algorithms

Requirements

You will need to have a background in IT. The course is aimed at Software Engineers, System Administrators, and DBAs who want to learn about Big Data
Knowing any programming language will enhance your course experience
The course contains demos you can try out on your own machine. To run the Hadoop cluster on your own machine, you will need to run a virtual server. 8 GB or more RAM is recommended.

Description

Important update: As of March 2025, Ambari 3 has been released, allowing easy installs again using public Hadoop repositories. The installation demo in this course has been updated to Ambari 3. The install video is free to watch as a preview. To install old HDP (Hortonworks Data Platform) releases, you need a subscription. The Ambari 3 demo is a great alternative to having an HDP subscription.

In this course you will learn Big Data using the Hadoop Ecosystem. Why Hadoop? It is one of the most sought-after skills in the IT industry. The average salary in the US is $112,000 per year, up to an average of $160,000 in San Francisco (source: Indeed).

The course is aimed at Software Engineers, Database Administrators, and System Administrators who want to learn about Big Data. Other IT professionals can also take this course, but might have to do some extra research to understand some of the concepts.

You will learn how to use the most popular software in the Big Data industry at the moment, using batch processing as well as realtime processing. This course will give you enough background to be able to talk about real problems and solutions with experts in the industry. Adding these technologies to your LinkedIn profile can attract recruiters and help you land interviews at the most prestigious companies in the world.

The course is very practical, with more than 6 hours of lectures. You will want to try out everything yourself, which adds multiple hours of learning. If you get stuck with the technology while trying, support is available: I will answer your messages on the message boards, and we have a Facebook group where you can post questions.

Who this course is for:

This course is for anyone who wants to know how Big Data works and what technologies are involved
The main focus is on the Hadoop ecosystem. We don't cover any technologies not on the Hortonworks Data Platform Stack
The course compares MapR, Cloudera, and Hortonworks, but we only use the Hortonworks Data Platform (HDP) in the demos

Course content

Introduction: 2 lectures • 4min

What is Big Data and Hadoop: 5 lectures • 16min

Introduction to Hadoop: 17 lectures • 1hr 40min

Pig: 4 lectures • 15min

Apache Spark: 7 lectures • 26min

Hive: 6 lectures • 24min

Real Time Processing: 1 lecture • 3min

Kafka: 5 lectures • 19min

Storm: 5 lectures • 23min

Spark Streaming: 7 lectures • 18min
