The Ultimate Hands-On Hadoop - Tame your Big Data!
4.5 (20,625 ratings)
Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.
110,389 students enrolled

Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies.
Last updated 5/2020
English, Portuguese [Auto], Spanish [Auto]
This course includes
  • 14.5 hours on-demand video
  • 5 articles
  • 2 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • Design distributed systems that manage "big data" using Hadoop and related technologies.
  • Use HDFS and MapReduce for storing and analyzing data at scale.
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
  • Analyze relational data using Hive and MySQL
  • Analyze non-relational data using HBase, Cassandra, and MongoDB
  • Query data interactively with Drill, Phoenix, and Presto
  • Choose an appropriate data storage technology for your application
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie.
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
  • Consume streaming data using Spark Streaming, Flink, and Storm
Requirements
  • You will need access to a PC running 64-bit Windows, MacOS, or Linux with an Internet connection and at least 8GB of *free* (not total) RAM, if you want to participate in the hands-on activities and exercises. If your PC does not meet these requirements, you can still follow along in the course without doing hands-on activities.
  • Some activities will require some prior programming experience, preferably in Python or Scala.
  • A basic familiarity with the Linux command line will be very helpful.

The world of Hadoop and "Big Data" can be intimidating - hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this Hadoop tutorial, you'll not only understand what those systems are and how they fit together - but you'll go hands-on and learn how to use them to solve real business problems!

Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We'll go way beyond Hadoop itself, and dive into all sorts of distributed systems you may need to integrate with.

  • Install and work with a real Hadoop installation right on your desktop with Hortonworks (now part of Cloudera) and the Ambari UI

  • Manage big data on a cluster with HDFS and MapReduce

  • Write programs to analyze data on Hadoop with Pig and Spark

  • Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto

  • Design real-world systems using the Hadoop ecosystem

  • Learn how your cluster is managed with YARN, Mesos, Zookeeper, Oozie, Zeppelin, and Hue

  • Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm

Understanding Hadoop is a highly valuable skill for anyone working at companies with large amounts of data.

Almost every large company you might want to work at uses Hadoop in some way, including Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo! And it's not just technology companies that need Hadoop; even the New York Times uses Hadoop for processing images.

This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It's filled with hands-on activities and exercises, so you get some real experience in using Hadoop - it's not just theory.

You'll find a range of activities in this course for people at every level. If you're a project manager who just wants to learn the buzzwords, there are web UIs for many of the activities in the course that require no programming knowledge. If you're comfortable with command lines, we'll show you how to work with them too. And if you're a programmer, I'll challenge you with writing real scripts on a Hadoop system using Scala, Pig Latin, and Python.

You'll walk away from this course with a real, deep understanding of Hadoop and its associated distributed systems, and the ability to apply them to real-world problems. Plus, a valuable completion certificate is waiting for you at the end!

Please note the focus of this course is on application development, not Hadoop administration, although you will pick up some administration skills along the way.

Knowing how to wrangle "big data" is an incredibly valuable skill for today's top tech employers. Don't be left behind - enroll now!

  • "The Ultimate Hands-On Hadoop... was a crucial discovery for me. I supplemented your course with a bunch of literature and conferences until I managed to land an interview. I can proudly say that I landed a job as a Big Data Engineer around a year after I started your course. Thanks so much for all the great content you have generated and the crystal clear explanations. " - Aldo Serrano

  • "I honestly wouldn’t be where I am now without this course. Frank makes the complex simple by helping you through the process every step of the way. Highly recommended and worth your time, especially the Spark environment. This course helped me achieve a far greater understanding of the environment and its capabilities." - Tyler Buck

Who this course is for:
  • Software engineers and programmers who want to understand the larger Hadoop ecosystem, and use it to store, analyze, and vend "big data" at scale.
  • Project, program, or product managers who want to understand the lingo and high-level architecture of Hadoop.
  • Data analysts and database administrators who are curious about Hadoop and how it relates to their work.
  • System architects who need to understand the components available in the Hadoop ecosystem, and how they fit together.
Course content
101 lectures 14:38:04
+ Learn all the buzzwords! And install the Hortonworks Data Platform Sandbox.
7 lectures 50:24

How to ask questions, tune the video playback, enable captions, and leave reviews.

Preview 02:10
Tips for Using This Course
If you have trouble downloading Hortonworks Data Platform...

After a quick intro, we'll dive right in and install Hortonworks Sandbox in a virtual machine right on your own PC. This is the quickest way to get up and running with Hadoop so you can start learning and experimenting with it. We'll then download some real movie ratings data, and use Hive to analyze it!

Preview 19:03

The activities in this course use the Hortonworks Data Platform (HDP). But Hortonworks merged with Cloudera, and they're working on a new thing called CDP. Don't worry... here's why.

Preview 03:01

What's Hadoop for? What problems does it solve? Where did it come from? We'll learn Hadoop's backstory in this lecture.

Preview 07:44

We'll take a quick tour of all the technologies we'll cover in this course, and how they all fit together. You'll come out of this lecture knowing all the buzzwords!

Overview of the Hadoop Ecosystem
+ Using Hadoop's Core: HDFS and MapReduce
11 lectures 01:34:29

Learn how Hadoop's Distributed Filesystem allows you to store massive data sets across a cluster of commodity computers, in a reliable and scalable manner.

HDFS: What it is, and how it works
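To make the scale concrete: HDFS splits files into large blocks (128 MB by default in recent Hadoop versions) and stores several replicas of each block (3 by default). A quick back-of-the-envelope sketch of that arithmetic, just for illustration:

```python
import math

BLOCK_SIZE = 128 * 1024**2   # default HDFS block size in bytes (Hadoop 2.x+)
REPLICATION = 3              # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (num_blocks, total_block_replicas, raw_bytes_stored)
    for a file of the given size under the default settings."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, blocks * REPLICATION, file_size_bytes * REPLICATION

# A 1 GB file becomes 8 blocks, held as 24 block replicas across the cluster.
blocks, replicas, raw = hdfs_footprint(1024**3)
print(blocks, replicas)  # 8 24
```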

Before we can analyze movie ratings data from GroupLens using Hadoop, we need to load it into HDFS. You don't need to mess with command lines or programming to use HDFS. We'll start by importing some real movie ratings data into HDFS just using a web-based UI provided by Ambari.

Preview 06:20

Developers might be more comfortable interacting with HDFS via the command line interface. We'll import the same data, this time from a terminal prompt.

[Activity] Install the MovieLens dataset into HDFS using the command line

Learn how mappers and reducers provide a clever way to analyze massive distributed datasets quickly and reliably.

MapReduce: What it is, and how it works

Learn what makes MapReduce so powerful, by horizontally scaling across a cluster of computers.

How MapReduce distributes processing

Let's look at a very simple example of MapReduce - counting how many of each rating type exists in our movie ratings data.

MapReduce example: Break down movie ratings by rating score

Some versions of the Hortonworks sandbox will have trouble installing the pip and mrjob packages in the next lecture. Here's what to do.

Troubleshooting tips: installing pip and mrjob

The quickest and easiest way to get started with MapReduce is by using Python's MRJob package, which uses Hadoop's streaming feature to let you write MapReduce code in Python instead of Java. Let's get set up.

[Activity] Installing Python, MRJob, and nano
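The ratings-breakdown job maps each input line to a (rating, 1) pair and sums the counts per key. Here is a plain-Python sketch of what that mapper/reducer pair does, including the shuffle-and-sort step Hadoop performs between the phases (the sample lines imitate MovieLens' tab-separated u.data format; this is an illustration of the concept, not the course's actual MRJob script):

```python
from itertools import groupby

def mapper(line):
    # u.data format: userID, movieID, rating, timestamp (tab-separated)
    user, movie, rating, timestamp = line.split('\t')
    yield rating, 1

def reducer(key, values):
    yield key, sum(values)

def map_reduce(lines):
    # Shuffle/sort: group mapper output by key, as Hadoop does between phases.
    mapped = sorted(kv for line in lines for kv in mapper(line))
    return dict(result
                for key, group in groupby(mapped, key=lambda kv: kv[0])
                for result in reducer(key, (v for _, v in group)))

sample = ["196\t242\t3\t881250949",
          "186\t302\t3\t891717742",
          "22\t377\t1\t878887116"]
print(map_reduce(sample))  # {'1': 1, '3': 2}
```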

We'll study our code for building a breakdown of movie ratings, and actually run it on your system!

Preview 07:36

As a challenge, see if you can write your own MapReduce script that sorts movies by how many ratings they received. I'll give you some hints, set you off, and then review my solution to the problem.

[Exercise] Rank movies by their popularity

Let's see how I solved the challenge from the previous lecture - we'll change our script to count movies instead of ratings, and then review and run my solution for sorting by rating count.

[Activity] Check your results against mine!
+ Programming Hadoop with Pig
7 lectures 56:08

Ambari is Hortonworks' web-based UI (similar to Hue, used by Cloudera). We can use it as an easy way to experiment with Pig, so let's take a closer look at it before moving ahead.

Introducing Ambari

An overview of what Pig is used for, who it's for, and how it works.

Introducing Pig

We'll use Pig to script a chain of queries on MovieLens to solve a more complex problem.

Example: Find the oldest movie with a 5-star rating using Pig

Let's actually run our example from the previous lecture on your Hadoop sandbox, and find some good, old movies!

Preview 09:40

We covered most of the basics of Pig in our example, but let's look at what else Pig Latin can do.

More Pig Latin

I'll give you some pointers, and challenge you to write your own Pig script that finds the most popular really bad movie!

[Exercise] Find the most-rated one-star movie

Let's look at my code for finding the most popular bad movies, and you can compare my results to yours.

Pig Challenge: Compare Your Results to Mine!
+ Programming Hadoop with Spark
8 lectures 01:14:07

What's so special about Spark? Learn how its efficiency and versatility make Apache Spark one of the hottest Hadoop-related technologies right now, and how it achieves this under the hood.

Why Spark?

The core building block of Spark is the RDD; learn how they are used and the functions available on them.

The Resilient Distributed Dataset (RDD)

As an example, let's write a Spark script to find the movie with the lowest average rating. We'll start by doing it just with RDD's.

[Activity] Find the movie with the lowest average rating - with RDD's
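The RDD approach in this activity boils down to mapping each row to (movie, (rating, 1)), reducing by key with element-wise sums, and then taking the minimum average. Here is that aggregation in plain Python (the movie names and ratings are made-up sample data, not the actual MovieLens script):

```python
def lowest_rated(rows):
    """rows: iterable of (movie_id, rating) pairs. Mirrors the RDD pattern:
    map to (movie, (rating, 1)), reduceByKey with element-wise sums,
    then compute averages and take the minimum."""
    totals = {}
    for movie, rating in rows:
        total, count = totals.get(movie, (0.0, 0))
        totals[movie] = (total + rating, count + 1)   # the reduceByKey step
    averages = {m: total / count for m, (total, count) in totals.items()}
    return min(averages.items(), key=lambda kv: kv[1])

rows = [("Movie A", 5), ("Movie A", 4), ("Movie B", 1), ("Movie B", 2)]
print(lowest_rated(rows))  # ('Movie B', 1.5)
```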

Spark 2.0 placed a new emphasis on Datasets and SparkSQL. Learn how Datasets can make your Spark scripts even faster and easier to write.

Preview 06:28

Let's revisit the previous problem of finding the lowest-rated movies, but this time using DataFrames.

[Activity] Find the movie with the lowest average rating - with DataFrames

As an example of the more complicated things Spark is capable of, we'll use Spark's machine learning library to produce movie recommendations using the ALS algorithm.

Preview 12:16

As a very simple exercise, we'll build upon our earlier activity to filter the results by movies with a given number of ratings.

[Exercise] Filter the lowest-rated movies by number of ratings

We'll review my solution to the previous exercise, and run the resulting scripts.

[Activity] Check your results against mine!
+ Using relational data stores with Hadoop
9 lectures 01:03:03

An introduction to Apache Hive and how it enables relational queries on HDFS-hosted data.

What is Hive?

We'll import the MovieLens data set into Hive using the Ambari UI, and run a simple query to find the most popular movies.

[Activity] Use Hive to find the most popular movie
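The HiveQL for this kind of question is essentially a GROUP BY with an ORDER BY on the count. As a rough illustration of the query's shape using standard SQL via Python's built-in sqlite3 (the sample rows are invented; the real activity runs HiveQL against the MovieLens data on your sandbox):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INT, movie_id INT, rating INT)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)",
                 [(1, 50, 5), (2, 50, 4), (3, 50, 3), (1, 181, 4), (2, 181, 5)])

# The same shape as the "most popular movie" query in the activity:
row = conn.execute("""
    SELECT movie_id, COUNT(*) AS rating_count
    FROM ratings
    GROUP BY movie_id
    ORDER BY rating_count DESC
    LIMIT 1
""").fetchone()
print(row)  # (50, 3): movie 50 has the most ratings
```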

Learn how Hive works under the hood of your Hadoop cluster, to efficiently query your data across a cluster using SQL commands. Well, technically it's HiveQL, but it will definitely seem familiar.

Preview 09:10

As a challenge, use this same Hive database to find the best-rated movie.

[Exercise] Use Hive to find the movie with the highest average rating

Compare your solution to mine for the exercise of finding the highest-rated movies using Hive.

Compare your solution to mine.

A quick overview of MySQL and how it might fit into your Hadoop-based work.

Integrating MySQL with Hadoop

Let's import the MovieLens data set into MySQL, and run a query to view the most popular movies just to see that it's working.

[Activity] Install MySQL and import our movie data

Learn how Sqoop works as a way to transfer data from an existing RDBMS like MySQL into Hadoop.

[Activity] Use Sqoop to import data from MySQL to HDFS/Hive

Sqoop can also work the other way - let's build a new table with Hive and export it back into MySQL.

[Activity] Use Sqoop to export data from Hadoop to MySQL
+ Using non-relational data stores with Hadoop
13 lectures 02:28:32

Learn why "NoSQL" databases are important for efficiently and scalably vending your data.

Why NoSQL?

HBase is a NoSQL columnar data store that sits on top of Hadoop. Learn what it's for and how it works.

What is HBase?

We'll import our movie ratings into HBase through a RESTful service interface, using a Python script running on our desktop to both populate and query the table.

[Activity] Import movie ratings into HBase
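HBase's REST interface exchanges JSON in which row keys, column names, and cell values are all base64-encoded. A small standard-library sketch of building such a put-request body (the row key and column here are made up for illustration; actually sending it would require your sandbox's REST endpoint URL):

```python
import base64
import json

def b64(text):
    """HBase's REST interface base64-encodes keys, columns, and values."""
    return base64.b64encode(text.encode()).decode()

def hbase_put_body(row_key, column, value):
    """Build the JSON body for a single-cell put via the HBase REST API."""
    return json.dumps({"Row": [{"key": b64(row_key),
                                "Cell": [{"column": b64(column),
                                          "$": b64(value)}]}]})

body = hbase_put_body("user1", "rating:50", "5")
decoded = json.loads(body)
print(base64.b64decode(decoded["Row"][0]["key"]).decode())  # user1
```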

We'll see how HBase can integrate with Pig to store big data into HBase in a distributed manner.

[Activity] Use HBase with Pig to import data at scale.

Cassandra is a popular NoSQL database that is appropriate for vending data at massive scale outside of Hadoop.

Cassandra overview

In the next lecture, we'll install Cassandra into your sandbox. It's a complicated process, and a lot can go wrong. Really, if you're not pretty comfortable with Linux, you might want to just watch the exercises that involve Cassandra instead of running them yourself.

One common issue is ending up in a state where your RPM database (which keeps track of what packages you have installed on your system) becomes corrupt. You'll experience this as seeing an error message like this:

rpmts_HdrFromFdno – error: rpmdbNextIterator – Header V3 RSA/SHA1 Signature, key ID BAD

If you encounter this, "yum" will no longer work at all. But, there is a way to fix it.

Just enter the following commands (you can paste them into PuTTY by right-clicking in the terminal window after copying them, and be sure you've already run "su root" so they run as the root user):

cd ~


rpm2cpio | cpio -idmv

cp ./lib64/libfreeblpriv3.* /lib64

Now, yum should work again. Note that if you do a big "yum update" and the ssl library is updated, you may lose your connection via PuTTY. If you're disconnected, wait a couple of minutes to allow yum to finish what it's doing, issue an ACPI Shutdown command to the virtual machine (via the Machine menu), restart the sandbox, and connect again.

If you have trouble installing Cassandra...

Cassandra isn't a part of Hortonworks, so we'll need to install it ourselves.

[Activity] Installing Cassandra

We'll modify our HBase example to write results into a Cassandra database instead, and look at the results.

[Activity] Write Spark output into Cassandra

MongoDB is a popular alternative to Cassandra. Learn what's different about it.

MongoDB overview

We'll install MongoDB on our virtual machine using Ambari. Then, we'll study and run a script to load up a Spark DataFrame of user data, store it into MongoDB, and query MongoDB to get users under 20 years old.

[Activity] Install MongoDB, and integrate Spark with MongoDB

We'll query our movie user data using MongoDB's command line interface, and set up an index on it.

[Activity] Using the MongoDB shell

With so many options for choosing a database, how do you decide? We'll look at the requirements of given problems, such as consistency, latency, and scalability, and how that can inform your decision.

Preview 15:59

In the previous lecture, I challenged you to choose a database for a stock trading application. Let's talk about my own thought process in this decision, and see if we reached the same conclusion.

[Exercise] Choose a database for a given problem
+ Querying your Data Interactively
9 lectures 01:21:54

What is Drill and what problems does it solve?

Overview of Drill

We'll install Drill so we can play with it, after installing a Hive and MongoDB database to work with.

[Activity] Setting up Drill

We'll use Drill to execute a query that spans data on MongoDB and Hive at the same time!

Preview 07:07

What is Phoenix for? How does it work?

Overview of Phoenix

We'll get our hands dirty with Phoenix and use it to query our HBase database.

[Activity] Install Phoenix and query HBase with it

We'll use Phoenix with Pig to store and load MovieLens users data, and accelerate queries on it.

[Activity] Integrate Phoenix with Pig

What is Presto, and how does it differ from Drill and Phoenix?

Overview of Presto

We'll install Presto, and issue some queries on Hive through it.

[Activity] Install Presto, and query Hive with it.

We'll configure Presto to also talk to our Cassandra database that we set up earlier, and do a JOIN query that spans both data in Cassandra and Hive!

Preview 09:01
+ Managing your Cluster
14 lectures 01:59:27

Learn how YARN works in more depth as it controls and allocates the resources of your Hadoop cluster.

Preview 10:01

Like Spark, Tez also uses Directed Acyclic Graphs to optimize tasks on your cluster. Learn how it works, and how it's different.

Tez explained

As an example of the power of Tez, we'll execute a Hive query with and without it.

[Activity] Use Hive on Tez and measure the performance benefit

Mesos is an alternative cluster manager to Hadoop YARN. Learn how it differs, who uses Mesos, and why.

Mesos explained

ZooKeeper is a deceptively simple service for reliably maintaining state across your cluster, such as which servers are in service. Learn how it works, and what systems depend on ZooKeeper for reliable operation.

ZooKeeper explained

Let's use ZooKeeper's command line interface to explore how it works.

[Activity] Simulating a failing master with ZooKeeper

Oozie allows you to set up complex workflows on your cluster using multiple technologies, and schedule them. Let's look at some examples of how it works.

Oozie explained
Important setup step for Oozie on HDP 2.6.5!

As a hands-on example, we'll use Oozie to import movie data into HDFS from MySQL using Sqoop, then analyze that data using Hive.

[Activity] Set up a simple Oozie workflow

Apache Zeppelin provides a notebook-based environment for importing, transforming, and analyzing your data.

Zeppelin overview

We'll set up a Zeppelin notebook to load movie ratings and titles into Spark dataframes, and interactively query and visualize them.

[Activity] Use Zeppelin to analyze movie ratings, part 1

We'll set up a Zeppelin notebook to load movie ratings and titles into Spark dataframes, and interactively query and visualize them.

[Activity] Use Zeppelin to analyze movie ratings, part 2

Apache Hue is a popular alternative to Ambari views, especially on Cloudera platforms. Let's see what it offers and how it's different.

Hue overview

Let's talk about Chukwa and Ganglia, just so you know what they are.

Other technologies worth mentioning
+ Feeding Data to your Cluster
6 lectures 54:47

Learn how Kafka provides a scalable, reliable means for collecting data across a cluster of computers and broadcasting it for further processing.

Kafka explained
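Conceptually, a Kafka topic partition is an append-only log, and each consumer group simply tracks an offset into that log. Here is a toy standard-library sketch of that model (not a real Kafka client; class and method names are invented for illustration):

```python
class Topic:
    """A toy single-partition Kafka-style topic: an append-only log
    with per-consumer-group offsets."""
    def __init__(self):
        self.log = []       # the partition's append-only message log
        self.offsets = {}   # consumer group -> next offset to read

    def produce(self, message):
        self.log.append(message)

    def consume(self, group):
        # Deliver everything from the group's committed offset onward.
        start = self.offsets.get(group, 0)
        messages = self.log[start:]
        self.offsets[group] = len(self.log)   # "commit" the new offset
        return messages

topic = Topic()
topic.produce("log line 1")
topic.produce("log line 2")
print(topic.consume("spark-job"))   # ['log line 1', 'log line 2']
topic.produce("log line 3")
print(topic.consume("spark-job"))   # ['log line 3'] - only the new message
```

Because each group keeps its own offset, a second consumer group starting fresh would replay the whole log from the beginning.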

We'll get Kafka running, and set it up to publish and consume some data from a new topic.

[Activity] Setting up Kafka, and publishing some data.

We'll simulate a web server by monitoring an Apache log file using a Kafka connector, and watch Kafka pick up new lines in it.

[Activity] Publishing web logs with Kafka

Flume is another way to publish logs from a cluster. Learn about sinks and Flume's architecture, and how it differs from Kafka.

Flume explained

As a simple way to get started with Flume, we'll connect a source listening to a telnet connection to a sink that just logs information received.

[Activity] Set up Flume and publish logs with it.

As something closer to a real-world example, we'll configure Flume to monitor a directory on our local filesystem for new files, and publish their data into HDFS, organized by the time the data was received.

Preview 09:12
+ Analyzing Streams of Data
8 lectures 01:16:28

Spark Streaming allows you to write "continuous applications" that process micro-batches of information in real time. Learn how it works, and about DStreams, windowing, and the new Structured Streaming API.

Spark Streaming: Introduction
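The windowing idea behind these lectures can be sketched without Spark at all: keep recent event timestamps, discard those that have aged out, and count what remains inside the sliding window. A toy illustration of the concept (Spark Streaming's actual API is quite different; this class is invented purely to show the idea):

```python
from collections import deque

class WindowedCounter:
    """Toy sliding-window count, in the spirit of windowed operations
    in Spark Streaming: retain event timestamps, drop those older
    than the window, and count the rest."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # timestamps, in arrival order

    def add(self, timestamp):
        self.events.append(timestamp)

    def count(self, now):
        # Evict events that fell out of the window [now - window, now].
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

w = WindowedCounter(window_seconds=30)
for t in (0, 10, 20, 40):
    w.add(t)
print(w.count(now=45))   # 2 - only the events at t=20 and t=40 remain
```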

We'll write and run a Spark Streaming application that analyzes web logs as they are streamed in from Flume.

[Activity] Analyze web logs published with Flume using Spark Streaming

As a challenge, extend the previous activity to look for status codes in the web log and aggregate how often different status codes appear. Also, let's fiddle with the slide interval.

[Exercise] Monitor Flume-published logs for errors in real time

Let's review my solution to the previous exercise, and run it.

Exercise solution: Aggregating HTTP access codes with Spark Streaming

Storm is an alternative to Spark Streaming. Learn how it differs, and why it's considered a true streaming solution.

Apache Storm: Introduction

We'll walk through, and run, the word count topology sample included with Storm.

[Activity] Count words with Storm

Apache Flink is an up-and-coming alternative to Storm that offers a higher-level API. Let's talk about what sets it apart.

Preview 06:53

Let's install Flink and run a simple example with it.

[Activity] Counting words with Flink