- 14.5 hours on-demand video
- 5 articles
- 2 downloadable resources
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion
- Design distributed systems that manage "big data" using Hadoop and related technologies
- Use HDFS and MapReduce for storing and analyzing data at scale
- Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways
- Analyze relational data using Hive and MySQL
- Analyze non-relational data using HBase, Cassandra, and MongoDB
- Query data interactively with Drill, Phoenix, and Presto
- Choose an appropriate data storage technology for your application
- Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie
- Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
- Consume streaming data using Spark Streaming, Flink, and Storm
- You will need access to a PC running 64-bit Windows, macOS, or Linux with an Internet connection and at least 8GB of *free* (not total) RAM, if you want to participate in the hands-on activities and exercises. If your PC does not meet these requirements, you can still follow along in the course without doing the hands-on activities.
- Some activities will require some prior programming experience, preferably in Python or Scala.
- A basic familiarity with the Linux command line will be very helpful.
The world of Hadoop and "Big Data" can be intimidating - hundreds of different technologies with cryptic names form the Hadoop ecosystem. With this Hadoop tutorial, you'll not only understand what those systems are and how they fit together - but you'll go hands-on and learn how to use them to solve real business problems!
Learn and master the most popular big data technologies in this comprehensive course, taught by a former engineer and senior manager from Amazon and IMDb. We'll go way beyond Hadoop itself, and dive into all sorts of distributed systems you may need to integrate with.
- Install and work with a real Hadoop installation right on your desktop with Hortonworks (now part of Cloudera) and the Ambari UI
- Manage big data on a cluster with HDFS and MapReduce
- Write programs to analyze data on Hadoop with Pig and Spark
- Store and query your data with Sqoop, Hive, MySQL, HBase, Cassandra, MongoDB, Drill, Phoenix, and Presto
- Design real-world systems using the Hadoop ecosystem
- Learn how your cluster is managed with YARN, Mesos, Zookeeper, Oozie, Zeppelin, and Hue
- Handle streaming data in real time with Kafka, Flume, Spark Streaming, Flink, and Storm
Understanding Hadoop is a highly valuable skill for anyone working at companies with large amounts of data.
Almost every large company you might want to work at uses Hadoop in some way, including Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo! And it's not just technology companies that need Hadoop; even the New York Times uses Hadoop for processing images.
This course is comprehensive, covering over 25 different technologies in over 14 hours of video lectures. It's filled with hands-on activities and exercises, so you get some real experience in using Hadoop - it's not just theory.
You'll find a range of activities in this course for people at every level. If you're a project manager who just wants to learn the buzzwords, there are web UIs for many of the activities in the course that require no programming knowledge. If you're comfortable with command lines, we'll show you how to work with them too. And if you're a programmer, I'll challenge you with writing real scripts on a Hadoop system using Scala, Pig Latin, and Python.
You'll walk away from this course with a real, deep understanding of Hadoop and its associated distributed systems, and you'll be able to apply that knowledge to real-world problems. Plus, a valuable completion certificate is waiting for you at the end!
Please note that the focus of this course is on application development, not Hadoop administration, although you will pick up some administration skills along the way.
Knowing how to wrangle "big data" is an incredibly valuable skill for today's top tech employers. Don't be left behind - enroll now!
"The Ultimate Hands-On Hadoop... was a crucial discovery for me. I supplemented your course with a bunch of literature and conferences until I managed to land an interview. I can proudly say that I landed a job as a Big Data Engineer around a year after I started your course. Thanks so much for all the great content you have generated and the crystal clear explanations." - Aldo Serrano
"I honestly wouldn’t be where I am now without this course. Frank makes the complex simple by helping you through the process every step of the way. Highly recommended and worth your time, especially the Spark environment. This course helped me achieve a far greater understanding of the environment and its capabilities." - Tyler Buck
- Software engineers and programmers who want to understand the larger Hadoop ecosystem, and use it to store, analyze, and vend "big data" at scale.
- Project, program, or product managers who want to understand the lingo and high-level architecture of Hadoop.
- Data analysts and database administrators who are curious about Hadoop and how it relates to their work.
- System architects who need to understand the components available in the Hadoop ecosystem, and how they fit together.
How to ask questions, tune the video playback, enable captions, and leave reviews.
After a quick intro, we'll dive right in and install Hortonworks Sandbox in a virtual machine right on your own PC. This is the quickest way to get up and running with Hadoop so you can start learning and experimenting with it. We'll then download some real movie ratings data, and use Hive to analyze it!
The activities in this course use the Hortonworks Data Platform (HDP). But Hortonworks has since merged with Cloudera, which is working on a successor platform called CDP. Don't worry... here's why.
What's Hadoop for? What problems does it solve? Where did it come from? We'll learn Hadoop's backstory in this lecture.
Before we can analyze movie ratings data from GroupLens using Hadoop, we need to load it into HDFS. You don't need to mess with command lines or programming to use HDFS. We'll start by importing some real movie ratings data into HDFS just using a web-based UI provided by Ambari.
We'll study our code for building a breakdown of movie ratings, and actually run it on your system!
Let's actually run our example from the previous lecture on your Hadoop sandbox, and find some good, old movies!
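The course's actual script may differ, but the MapReduce pattern behind a ratings breakdown can be sketched in plain Python: a mapper emits (rating, 1) pairs, a shuffle groups them by key, and a reducer sums each group. (The sample lines below just mimic the MovieLens `u.data` layout; on a real cluster these stages run in parallel across the data.)

```python
from collections import defaultdict

# Toy input: one (user_id, movie_id, rating) record per line,
# mimicking the tab-separated MovieLens u.data layout.
lines = [
    "196\t242\t3",
    "186\t302\t3",
    "22\t377\t1",
    "244\t51\t2",
    "166\t346\t1",
]

def mapper(line):
    """Emit a (rating, 1) pair for each input record."""
    user_id, movie_id, rating = line.split("\t")
    yield rating, 1

def shuffle(pairs):
    """Group mapper output by key, as Hadoop does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Sum the counts for one rating value."""
    return key, sum(values)

pairs = [pair for line in lines for pair in mapper(line)]
breakdown = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(breakdown)  # how many times each star rating occurs
```

On Hadoop, the same mapper and reducer logic would be handed to the framework (for example via Hadoop Streaming), which takes care of the shuffle for you.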
Spark 2.0 placed a new emphasis on Datasets and SparkSQL. Learn how Datasets can make your Spark scripts even faster and easier to write.
As an example of the more complicated things Spark is capable of, we'll use Spark's machine learning library to produce movie recommendations using the ALS algorithm.
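Spark's machine learning library handles this at scale; purely to illustrate the idea behind ALS (not MLlib's API), here is a tiny numpy sketch that alternately solves a regularized least-squares problem for user factors and movie factors. The ratings matrix, factor count, and regularization value are made up, and for simplicity every entry is treated as observed, whereas real ALS masks missing ratings.

```python
import numpy as np

np.random.seed(0)

# Toy ratings matrix: rows = users, columns = movies.
R = np.array([[5.0, 3.0, 1.0],
              [4.0, 2.0, 1.0],
              [1.0, 1.0, 5.0]])

k, reg = 2, 0.1                      # latent factors, regularization strength
U = np.random.rand(R.shape[0], k)    # user factors
M = np.random.rand(R.shape[1], k)    # movie factors

def solve(fixed, ratings):
    """Least-squares update for one factor matrix, the other held fixed."""
    A = fixed.T @ fixed + reg * np.eye(k)
    return np.linalg.solve(A, fixed.T @ ratings).T

errors = []
for _ in range(10):
    U = solve(M, R.T)                # update user factors, movies fixed
    M = solve(U, R)                  # update movie factors, users fixed
    errors.append(np.linalg.norm(R - U @ M.T))

# The reconstruction error shrinks as the two factor matrices converge;
# predicted ratings are the entries of U @ M.T.
print(errors[0], errors[-1])
```

The "alternating" in ALS is exactly this loop: each half-step is a simple linear solve, which is what lets Spark distribute the computation across a cluster.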
Learn how Hive works under the hood of your Hadoop cluster, to efficiently query your data across a cluster using SQL commands. Well, technically it's HiveQL, but it will definitely seem familiar.
In the next lecture, we'll install Cassandra into your sandbox. It's a complicated process, and a lot can go wrong. Really, if you're not pretty comfortable with Linux, you might want to just watch the exercises that involve Cassandra instead of running them yourself.
One common issue is ending up in a state where your RPM database (which keeps track of what packages you have installed on your system) becomes corrupt. You'll experience this as seeing an error message like this:
rpmts_HdrFromFdno – error: rpmdbNextIterator – Header V3 RSA/SHA1 Signature, key ID BAD
If you encounter this, "yum" will no longer work at all. But, there is a way to fix it.
Just enter the following commands. (You can paste them into the PuTTY terminal window by right-clicking after copying them, and be sure you've already run "su root" so the commands execute as the root user.)
rpm2cpio http://mirror.centos.org/centos/6/os/x86_64/Packages/nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64.rpm | cpio -idmv
cp ./lib64/libfreeblpriv3.* /lib64
Now, yum should work again. Note that if you do a big "yum update" and the ssl library is updated, you may lose your connection via PuTTY. If you're disconnected, wait a couple of minutes to allow yum to finish what it's doing, issue an ACPI Shutdown command to the virtual machine (via the Machine menu), restart the sandbox, and connect again.
With so many options for choosing a database, how do you decide? We'll look at the requirements of given problems, such as consistency, latency, and scalability, and how that can inform your decision.
We'll use Drill to execute a query that spans data on MongoDB and Hive at the same time!
We'll configure Presto to also talk to our Cassandra database that we set up earlier, and do a JOIN query that spans both data in Cassandra and Hive!
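For flavor, a federated query of this kind might look like the sketch below. The catalog, schema, and table names here are hypothetical; in Presto they depend entirely on how your Cassandra and Hive connectors are configured, and tables are addressed as catalog.schema.table.

```sql
-- Hypothetical catalogs "cassandra" and "hive", each backed by a
-- Presto connector; one query spans both data stores.
SELECT u.occupation, COUNT(r.rating) AS ratings_given
FROM cassandra.movielens.users u
JOIN hive.default.ratings r
  ON u.user_id = r.user_id
GROUP BY u.occupation
ORDER BY ratings_given DESC;
```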
Learn how YARN works in more depth as it controls and allocates the resources of your Hadoop cluster.
As something closer to a real-world example, we'll configure Flume to monitor a directory on our local filesystem for new files, and publish their data into HDFS, organized by the time the data was received.
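A minimal Flume agent along those lines might look like the following config fragment. The agent name, spool directory, and HDFS path are placeholders to adapt to your own sandbox; the structure is the standard source -> channel -> sink wiring.

```
# Spooling-directory source -> memory channel -> HDFS sink.
a1.sources = src1
a1.channels = ch1
a1.sinks = sink1

# Watch a local directory for newly arriving files.
a1.sources.src1.type = spooldir
a1.sources.src1.spoolDir = /home/maria_dev/spool
a1.sources.src1.channels = ch1

# Write events into HDFS, bucketed by the time they were received.
a1.sinks.sink1.type = hdfs
a1.sinks.sink1.hdfs.path = /user/flume/events/%y-%m-%d/%H%M
a1.sinks.sink1.hdfs.useLocalTimeStamp = true
a1.sinks.sink1.channel = ch1

a1.channels.ch1.type = memory
```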
Apache Flink is an up-and-coming alternative to Storm that offers a higher-level API. Let's talk about what sets it apart.