We will be manipulating the HDFS file system throughout these courses, but why are enterprises interested in HDFS in the first place?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
HDFS is part of the Apache Hadoop Core project.
Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
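To make that concrete, here is a minimal sketch (assuming a running cluster, the hdfs command on the PATH and a user with HDFS superuser rights) of asking the NameNode for its view of the cluster; the report lists live and dead DataNodes and any under-replicated blocks that HDFS is re-replicating on its own.

    import subprocess

    # Ask the NameNode for its cluster report (requires HDFS superuser privileges).
    report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                            capture_output=True, text=True, check=True)

    # Surface the fault-detection headlines: DataNode liveness and block health.
    for line in report.stdout.splitlines():
        if line.startswith(("Live datanodes", "Dead datanodes", "Under replicated blocks")):
            print(line)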
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that
it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
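As a rough illustration (the file path is a made-up placeholder), hdfs fsck can show which DataNodes hold each block of a file; this is exactly the placement information a locality-aware framework uses to schedule work near the data.

    import subprocess

    # Print every block of the file and the DataNodes it lives on.
    # "/user/demo/weblogs.txt" is a hypothetical example path.
    subprocess.run(["hdfs", "fsck", "/user/demo/weblogs.txt",
                    "-files", "-blocks", "-locations"], check=True)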
Here is a curriculum outlining the current state of my Cloudera courses.
My Hadoop courses are based on Vagrant, so that you can practice on, and destroy, a virtual environment before applying the installation to your physical servers.
For those with little or no knowledge of the Hadoop ecosystem:
Udemy course : Big Data Intro for IT Administrators, Devs and Consultants
I would first practice with Vagrant, so that you can carve out a
virtual environment on your local desktop. You don't want to corrupt
your physical servers if you do not understand the steps or if you make a
mistake.
Udemy course : Real World Vagrant For Distributed Computing
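As a small sketch of that build-and-destroy workflow, the Vagrant CLI can be driven from Python; "hadoop-sandbox" is a hypothetical directory containing your Vagrantfile.

    import subprocess

    def vagrant(*args, cwd="hadoop-sandbox"):
        # Thin wrapper around the Vagrant CLI; "hadoop-sandbox" is a hypothetical
        # project directory holding the Vagrantfile for your practice cluster.
        subprocess.run(["vagrant", *args], cwd=cwd, check=True)

    vagrant("up")             # build the throw-away virtual machines
    vagrant("status")         # confirm the boxes are running
    vagrant("destroy", "-f")  # tear everything down once you are done practising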
I would then, on the virtual servers, deploy Cloudera Manager plus its
agents. The agents sit on all the slave nodes, ready to deploy and run
your Hadoop services.
Udemy course : Real World Vagrant - Automate a Cloudera Manager Build
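For orientation only, a hedged sketch of what that install boils down to on a CentOS/RHEL box, assuming the Cloudera Manager 5 yum repository is already configured and using a made-up master hostname (cm-master.example.com):

    import subprocess

    def run(cmd):
        print("$ " + cmd)
        subprocess.run(cmd, shell=True, check=True)

    # On the master node: install and start the Cloudera Manager server
    # (an embedded or external database also has to be set up, not shown here).
    run("sudo yum -y install cloudera-manager-daemons cloudera-manager-server")
    run("sudo service cloudera-scm-server start")

    # On every slave node: install the agent and point it at the (hypothetical) master.
    run("sudo yum -y install cloudera-manager-daemons cloudera-manager-agent")
    run('sudo sed -i "s/^server_host=.*/server_host=cm-master.example.com/" /etc/cloudera-scm-agent/config.ini')
    run("sudo service cloudera-scm-agent start")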
Then deploy the Hadoop services across your cluster (via the Cloudera
Manager instance installed in the previous step). We look at the logic
behind the placement of master and slave services.
Udemy course : Real World Hadoop - Deploying Hadoop with Cloudera Manager
If you want to play around with HDFS commands (hands-on distributed file manipulation):
Udemy course : Real World Hadoop - Hands on Enterprise Distributed Storage.
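To give a flavour of the hands-on work, here is a small sketch of a round trip of distributed file manipulation; the paths are made-up examples and the commands go through the standard hdfs dfs shell.

    import subprocess

    def hdfs(*args):
        # Thin wrapper around "hdfs dfs" so each step is explicit.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/user/demo")              # create a directory in HDFS
    hdfs("-put", "localfile.txt", "/user/demo/")    # copy a local file into HDFS
    hdfs("-ls", "/user/demo")                       # list the directory
    hdfs("-cat", "/user/demo/localfile.txt")        # read the file back
    hdfs("-rm", "-r", "-skipTrash", "/user/demo")   # clean up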
You can also automate the deployment of the Hadoop services via
Python (using the Cloudera Manager Python API). This is an advanced
step, though, so make sure you understand how to deploy the Hadoop
services manually first.
Udemy course : Real World Hadoop - Automating Hadoop install with Python!
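As a taste of that automation, here is a minimal sketch using the cm_api client (pip install cm-api); the host, credentials and API version are placeholder assumptions.

    from cm_api.api_client import ApiResource

    # Connect to a (hypothetical) Cloudera Manager server.
    api = ApiResource("cm-master.example.com", username="admin",
                      password="admin", version=12)

    # Walk every cluster CM knows about and print its services and their health.
    for cluster in api.get_all_clusters():
        print(cluster.name)
        for service in cluster.get_all_services():
            print("  %s -> %s / %s" % (service.name, service.serviceState,
                                       service.healthSummary))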
There is also the upgrade step. Once you have a running cluster, how
do you upgrade to a newer release (both for Cloudera Manager and
the Hadoop services)?
Udemy course : Real World Hadoop - Upgrade Cloudera and Hadoop hands on
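Purely as orientation, the package side of a Cloudera Manager upgrade looks roughly like the sketch below (assuming a package-based CM install on CentOS/RHEL with the yum repo already pointed at the newer release); the Hadoop services themselves are then upgraded through the Cloudera Manager upgrade wizard and parcels.

    import subprocess

    def run(cmd):
        print("$ " + cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Stop Cloudera Manager before touching its packages.
    run("sudo service cloudera-scm-server stop")
    run("sudo service cloudera-scm-agent stop")

    # Pull in the newer Cloudera Manager packages from the updated repo.
    run("sudo yum -y upgrade cloudera-manager-server cloudera-manager-daemons cloudera-manager-agent")

    # Restart; the CM web UI then walks you through upgrading the agents and
    # the Hadoop services (CDH parcels).
    run("sudo service cloudera-scm-server start")
    run("sudo service cloudera-scm-agent start")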