Hands-On with Hadoop 2: 3-in-1

Name: Hands-On with Hadoop 2: 3-in-1
Rating: 4.2 (6 reviews)

Run your own Hadoop clusters on your own machine or in the cloud

Created by Packt Publishing

Last updated 11/2020

English

What you'll learn

Understand the Hadoop 2.x Architecture
Create Map-reduce jobs
Plan, install and configure core Hadoop services on a Cluster
Validate the Cluster using HDFS, Map Reduce and Spark
Understand Cluster Life-Cycle and Performance tuning of a Hadoop Cluster
Hands-on solutions to your perplexing, real-world big data problems

Course content

3 sections • 96 lectures • 10h 56m total length

The Course Overview 3:43
This video gives an overview of the entire course.
Installing Hadoop in Local 22:50
In this video, we’ll learn how to install Hadoop on our local system.
Prerequisites to the Hadoop installation
Hadoop Installation
Testing our Hadoop installation
Bring Process to Data 4:43
The important part of selecting the Hadoop framework for your own solution is to understand why it is a good fit for your application.
Understand the Hadoop way of execution
Fit your own application and see if it will benefit from it
NameNode Versus DataNode 4:15
Understand the difference between the two kinds of nodes in HDFS: DataNode and NameNode
Understand data resiliency
Distributed data source
Map and Reduce Operations 7:59
The new term Map-Reduce… what does it mean and how does it solve a problem?
Understand what a map-reduce operation is
Dig deep with different parts of map-reduce in the Hadoop world
Devise a way to implement your own problem with a map-reduce operation
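As a rough illustration of the map-reduce idea covered in this lecture, here is a minimal single-process sketch in Python. It is not Hadoop code and not taken from the course: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the shuffle step that Hadoop performs across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done in memory here)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each call to `map_phase` depends only on its own line, which is what would let a framework run the map step on many machines at once. Only the reduce step needs all values for a key gathered in one place.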
Order of Execution and Parallel Thinking 4:39
When jumping to parallel programming from serial programming, it is always hard to plan the computation.
Figure out the pitfalls of parallel data
See how the Hadoop parallel process works
Start parallel thinking
Formatting a HDFS 6:38
Prepare your HDD for use with HDFS
Have the HDFS
Fit your own application and see if it will benefit from it
Formatting a HDFS 4:34
Copy data to/from HDFS.
Copy data to HDFS
Copy data from HDFS
Some Helpful Commands to Communicate with the HDFS 3:35
Using the HDFS commands in the shell.
Find the basic difference between generic shell command and HDFS command
Get used to the different ways of writing the commands
HDFS Protocol and Using It in Applications 11:11
How do we access HDFS files from a Java program?
Connect to the file system using the HDFS protocol
Fetch data from HDFS
Put data in HDFS
Hadoop Jobs Versus Tasks 4:47
What are Hadoop jobs and tasks?
Understand how tasks are communicated
Understand how jobs are run
The Hadoop UI for Task Progress 4:06
How to see the process flow and progress of a Hadoop job.
Open the Hadoop UI
Assess the memory and process
Running a Couple of Example Jobs 10:08
Run Hadoop jobs.
Run Hadoop jobs directly
Run Hadoop jobs on YARN
Analyze the Work Flow/Data Flow/Process Flow 7:26
In this video, we are going to look at how map and reduce get executed
Get to know about the data flow
Discover how map gets executed
Discover how reduce gets executed
Introduction to the Movie Dataset 4:04
Data Transformation and Storing to HDFS 17:55
Prepare the data to be fit for our algorithm
Split the data to be transformed
Transform the data using Hadoop
Merge the data using a basic Java application
Devise a Simple Algorithm for Recommendation 4:06
Devise a simple algorithm for recommendation
Create an algorithm to prepare data for recommendation by genre
Understand the data format for that output
Implement the Algorithm in Hadoop Map-Reduce Way and Analyze Performance 10:39
Implement the map-reduce for the transformation of the movie -> genre context
Create a map-reduce job for this problem
Take different splits for different performance assessment

The Course Overview 3:26
This video will give you an overview of the course.
Navigation of GitBash 10:35
In this video, we will understand the use of GitBash and how it helps in streamlining the code.
Learn to manage code or configuration consistently across the cluster
Have a centralized configuration tool
Get to know about code sanity and accountability
Navigation of Vagrant 9:10
The aim of this video is to learn how to manage virtual machines, especially when you have a large cluster
Learn to control virtual machines in an efficient manner
Study how to interface to virtual machines
Get to know how easy it is to maintain virtual machines
Navigation of VirtualBox 10:58
The aim of this video is to understand the virtualization tool you use in your environment, as you may not have physical nodes on which to practice with a Hadoop cluster.
Learn to use virtualization tools to ease node deployments and simulate the production setup
Run Linux machines efficiently
Get to know how easy it is to use and control virtual machines
Planning a Single Node Setup 14:49
The aim of this video is to build on our understanding so that we can scale to large clusters.
Learn to set up a single-node cluster
Follow steps and streamline the machines for future use
Get to know how the single-node cluster functions and is used
Install Apache Hadoop 14:43
The aim of this video is to install Hadoop, understand its components and the role they play in the Hadoop ecosystem
Baseline the requirements and stack needed
Install software and other prerequisites
Verify by running the basic Hadoop commands
Apache Hadoop Overview 4:31
The aim of this video is to explore the Hadoop ecosystem and look at various tools and frameworks
Understand the Hadoop ecosystem
Understand the role of each component
Be able to correlate them with traditional technologies
Hadoop Distributed File System (HDFS) 11:06
The aim of this video is to learn how Hadoop stores data and its differences from the traditional file system
Get to know about HDFS filesystem layout
Study the various differences from other filesystems
Learn to identify the master and slave components
YARN Overview 11:22
The aim of this video is to understand the YARN framework and the problems it solves
Learn how Hadoop YARN addresses issues related to resource management
Understand it better as a scheduler
Study how to conceptualize the generic scheduler
MapReduce 9:55
The aim of this video is to learn what MapReduce is, how it evolved, and how its simplicity helps it handle large-scale data
Understand the working and use cases of Hadoop MapReduce
Get to know how the splits work
Learn to run the example programs
Planning Hadoop Services Placement 9:36
The aim of this video is to study how to plan the layout of various Hadoop services to improve availability and performance, and to achieve a balanced distribution of compute and memory across the cluster.
Distribute the services across nodes
Get to know how to ensure that the failure of a node does not cause service disruption
Decide which components you need
Planning ZooKeeper Placement 11:02
The aim of this video is to demonstrate why coordination, locking, and dynamic configuration are critical in a distributed system. Also, learn how to address split-brain scenarios.
Learn to identify the nodes where ZooKeeper must be installed
Get to know how to account for best practices for I/O improvements
Planning HDFS Service Placement 10:47
The aim of this video is to show how important it is to address failures and make sure that the services keep running even when a few components or pieces of hardware fail
Identify the master and slave nodes
Work on placement of DataNodes for data locality
Verify by copying files to HDFS
Planning YARN 4:42
The aim of this video is to address failures and make sure that the services keep running even when a few components or pieces of hardware fail
Identify the master and slave nodes
Placement of NodeManagers with DataNodes
Verify by running an example job using YARN
Planning Spark Services 8:56
The aim of this video is to understand the use cases for Spark and how to execute it across the cluster. It is important to know that the Spark history server is the only daemon in YARN mode
How to stage the jars and libraries across the cluster
Localizing the jars efficiently
Verify by running a Spark job
HDFS Concepts 13:37
The aim of this video is to show how HDFS, as a filesystem, stores data by splitting it into blocks of a specific size. The blocks are replicated for redundancy and performance
How are files split?
How does replication work?
Verify the concepts by copying data to the cluster.
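The block-splitting and replication arithmetic described in this lecture can be sketched in a few lines of Python. This is an illustrative calculation only, not an HDFS API. The function names are hypothetical, and the 128 MB block size and replication factor of 3 are assumed values (common Hadoop 2.x defaults) that a real cluster may override.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    The last block holds only the remaining bytes; it is not padded.
    block_size defaults to an assumed 128 MB.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        blocks.append(remainder)
    return blocks


def raw_storage(file_size, replication=3):
    """Total bytes stored across the cluster when every block is replicated.

    replication defaults to an assumed factor of 3.
    """
    return file_size * replication


if __name__ == "__main__":
    size = 300 * MB
    print([b // MB for b in split_into_blocks(size)])  # block sizes in MB
    print(raw_storage(size) // MB)                     # raw MB consumed
```

Under these assumptions a 300 MB file occupies two full blocks plus one 44 MB block, and the cluster stores 900 MB of raw data for it.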
HDFS Data Movement 7:37
The aim of this video is to show that an important use of HDFS as a filesystem is its ability to copy data from the local filesystem to HDFS and vice versa.
How to copy data across the filesystems
Understand the direction of data movement
Verify by moving data across the filesystems
HDFS Admin Commands 7:41
The aim of this video is to understand what an administrator must do day to day to maintain the health of the cluster and keep the users happy
Where are the blocks for a file?
What is the health of the cluster?
Execute commands to validate our learnings
MapReduce Jobs 7:17
The aim of this video is to show how a MapReduce job works and the stages it executes, when a reducer runs, and how we size reducers.
How many Mappers are launched?
How do the reducers write output?
Validate by running jobs
Spark Jobs 12:11
The aim of this video is to show how Spark jobs work and where the libraries are pulled from.
Which mode does it run in, cluster or client?
How does it execute?
Validate by running jobs
Start/Stop Services 7:09
The aim of this video is to show how we can start/stop an individual service in a cluster or restart all services across the cluster.
Get to know why it is cumbersome to manage services independently in large clusters
Learn to separate out service management of each component
Get to know how to control the services in the cluster to get a feel of things
Manage Cluster Using Ambari 4:53
The aim of this video is to learn to manage services using the Ambari web UI.
Learn how to control services at the node level
Practically study how to control services at the cluster level
Explore the Ambari console to get familiar
Hadoop Upgrade 10:11
The aim of this video is to show how to maintain stability and roll out the latest patches, as it is important to keep versions updated to the latest stable releases.
Understand the criticality of data and impact of its loss
Plan the release cycles and upgrades
Execute the best practices to have a smooth transition to a new version
Scaling Cluster – Part 1 9:27
The aim of this video is to show why it is important to scale the cluster according to its needs as it grows, and how to gracefully remove bad hardware or replace nodes using Apache Hadoop.
Learn to identify the bad nodes and decommission them
Add more nodes to increase capacity
Verify that there are no missing or corrupt blocks
Scaling Cluster – Part 2 6:52
The aim of this video is to show why it is important to scale the cluster according to its needs as it grows, and how to gracefully remove bad hardware or replace nodes using HDP.
Learn to identify the bad nodes and decommission them
Add more nodes to increase capacity
Verify that there are no missing or corrupt blocks
HDFS Masters 6:14
The aim of this video is to understand the roles and types of HDFS masters. What if they fail? Can we recover them or fail over?
Types of masters and their roles
Why do we have multiple masters?
Make sure you understand each master’s role
HA Configuration 18:48
The aim of this video is to set up NameNode HA using the QJM. It is important to understand its role, usage and the steps to be performed
What do we need to set up HA? Does the cluster meet the prerequisites for HA?
Verify that ZooKeeper is running and the journal nodes are up. Understand the criticality of the NameNode’s metadata
Verify by starting all services and doing a failover.
YARN Masters 6:19
The aim of this video is to set up HA for YARN using HDP and understand how much easier it is compared to setting things up manually, as we did for HDFS HA
What is the RM’s identifier in ZooKeeper?
Identify the nodes for HA
Verify the failover by manually stopping a service on the primary
Linux ACLs 10:27
The aim of this video is to examine the additional permission controls a user can have. Can users in the same group have different access rights on the native Linux filesystem?
What are the default permissions and ACLs?
Do we have the necessary modules and filesystem support for ACLs?
Verify by executing the examples
HDFS ACLs Security – Part 1 3:22
HDFS ACLs Security – Part 2 4:36
The aim of this video is to learn how a user is identified and whether Hadoop is secure.
How can we impersonate a user?
Understand the various security options we have
Verify by executing the examples
Hadoop Users and Groups 3:46
The aim of this video is to study how many users in an organization can access a Hadoop cluster. Do we add them manually or use a centralized user management system?
Understand the role of Users and Groups in a cluster
Have we considered OpenLDAP or AD for user management?
Verify by creating a user and writing data to the cluster
NameNode UI 5:21
The aim of this video is to learn how to determine the state of the cluster. Is it healthy, or are there issues? What is the total capacity of the cluster, and how many nodes does it have?
How to quickly look at the cluster state and the DataNodes
Do we have any free space in the cluster?
Use the NameNode UI to see the state of the cluster
Apache Hadoop Auditing 5:29
The aim of this video is to study how to determine, in a multi-tenant cluster with many users, who accessed a file or data. Were they authorized to execute or read a file?
How to enable auditing to track user behavior
What commands a user executed and which service they used
Explore the audit logs and see what is captured
Hadoop Metrics 6:44
The aim of this video is to find out whether the cluster is optimally used or whether we need to add more resources. How are the jobs performing? Do they need optimization?
What metrics are emitted by each daemon/service
How to read the metrics for each service and understand them
Verify that you understand the configured resources using metrics
Hadoop Logs and Monitoring 6:45
The aim of this video is to show why it is a good habit to log and monitor for proactive resolution. Each service has its own logging mechanism and verbosity and gives information about its state.
Is the service up and running? What do the logs say?
Are we capturing the logs and monitoring them for issues?
Familiarize yourself with the logs of a service and the stages a service goes through
Hadoop Troubleshooting – Part 1 7:10
The aim of this video is to show how, despite having all the checks and best practices in place, things go wrong. How quickly we can identify and resolve the problem is a key factor.
What are the common problems for HDFS?
Is the service up and running? If not, why?
Understand each of the scenarios discussed to get a better grasp
Hadoop Troubleshooting – Part 2 9:15
The aim of this video is to study cases where the cluster is up and running and all services are healthy, yet jobs are failing.
What kinds of jobs are failing?
Can we isolate it quickly, or is it cluster-wide?
Understand each of the scenarios discussed to get a better grasp

The Course Overview 2:52
This video gives an overview of the entire course.
Hadoop Distributed File System (HDFS) 6:59
In this video, we will see what HDFS is.
What Hadoop is
What the Hadoop Distributed File System is
Explain HDFS architecture
Distributed Compute Capability YARN 4:46
In this video, we will learn about YARN.
What YARN is
How it is used with Spark
Apache Hive for ETL and SQL Like 7:23
In this video, we will see what Hive is.
When to use Hive
How Hive uses HDFS
What a Metastore is
Message Queuing and Data Ingestion Kafka 3:50
In this video, we will see what pub-sub is.
What a topic is
How Kafka topics scale
What a topic offset is
NoSQL Datastores – Hadoop HBase, Accumulo 5:32
In this video, we will see some column-oriented database concepts.
Explain HBase architecture
Explain HBase data structure
Explain Accumulo architecture
Machine Learning – Spark and Spark MLlib 6:41
In this video, we will see Spark architecture.
Explain RDD
Explain partitioning
Explain Spark MLlib
Stream Processing – Spark Streaming 4:41
In this video, we will explain Spark Streaming architecture.
Explain micro-batches
See the difference between latency and throughput
Explain failure recovery and checkpointing
Processing Payment Data from an Event Stream 4:50
In this video, we will process payment data.
Create a DStream provider
Explain the stream of payments
Advanced Aggregations Using Streaming API – PaymentAnalyzer 4:28
In this video, we will implement real-time logic on a stream of events.
Implement PaymentAnalyzer
Save results to a sink
Test the final result
Storing Time Series Data in HBase 6:58
In this video, we will save data to HBase.
Implement an HBase connector
Save data into HBase
Detecting BOT Traffic Using Spark Streaming 6:08
In this video, we will implement a streaming job that filters out bots.
Write a DStream provider for PageView
Filter bots
Test logic
Make Web Log Data Queryable – Hive Sink 6:48
In this video, we will implement an HDFS sink that saves data into HDFS.
Create an Avro structure for the data
Generate classes
Save results to HDFS using an Avro schema
Investigating Customers Data in Hive 4:19
In this video, we will investigate the data of customers in Hive.
Create a table in Hive
Fill the table with data
Query the customer’s data
Trending Supply Chain – Finding Top Seller Item in a Streaming Way 8:01
In this video, we will use a streaming approach to find the top-selling item.
Create a streaming job that analyzes transactions
Write tests
Find top sellers
Enriching Top Sellers with Additional Information 5:17
In this video, we will enrich transactions with additional information.
Add product information per item_id
Analyzing Customer Churn (Quantitative) Using DataFrame Queries 5:36
In this video, we will perform a quantitative analysis of customer churn.
Find out what a churn analysis is
Explain quantitative analysis
Write and test in Spark
Analyzing Customer Churn (Amounts) Using DataFrame Queries 4:56
In this video, we will analyze customer churn based on transaction amounts.
Calculate churn based on transaction amounts
Take into consideration customers that are actually buying products and spending money
Compared to the previous approach, this gives information about actual revenue (increase or decrease)
Storing Low Granularity Structured Sensor Data in HBase 8:41
In this video, we will take a look at stream processing of sensor data.
Create an HBase connector
Write a streaming job that loads sensor data
Save sensor data to HBase
Consuming Sensor Data Stored in HBase – Scan and Count 3:51
In this video, we will insert data into HBase from a Spark Streaming job.
Use HBase shell
Count inserted sensor data
Scan inserted sensor data
Building Summaries on Data Streaming from Devices 6:35
In this video, we will calculate statistics from sensors.
Fetch data from HBase to Apache Spark (batch)
Calculate statistics
Store statistics about sensors in HBase
Introducing Spark GraphX – How to Represent a Graph? 2:13
In this video, we will see how to represent a graph.
What a graph is
What an edge is
What a vertex is
Perform Graph Operations Using GraphX 3:56
In this video, we will perform operations on a graph using GraphX.
Use the GraphX API to experiment with a graph
Explain operations on edges
Explain operations on vertices
Counting Degree of Vertices 3:20
In this video, we will count degrees of vertices.
Count degrees of vertices
Count the in-degree of a vertex
Count the out-degree of a vertex
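To make the degree terminology concrete, here is a small plain-Python sketch that counts in-degree, out-degree, and total degree from a list of directed edges. It deliberately does not use the GraphX API taught in the lecture. The function name `degrees` and the `(source, destination)` tuple format are assumptions chosen for illustration.

```python
from collections import Counter


def degrees(edges):
    """Compute in-, out-, and total degree for each vertex.

    edges is a list of (source, destination) pairs describing a directed
    graph; this edge format is an assumption made for the example.
    """
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    vertices = set(out_deg) | set(in_deg)
    return {
        v: {"in": in_deg[v], "out": out_deg[v], "total": in_deg[v] + out_deg[v]}
        for v in vertices
    }


if __name__ == "__main__":
    # Vertex 1 points to 2 and 3; vertex 2 points to 3.
    for vertex, counts in sorted(degrees([(1, 2), (1, 3), (2, 3)]).items()):
        print(vertex, counts)
```

A vertex that only receives edges has an out-degree of zero, and one that only emits edges has an in-degree of zero; the total degree is the sum of the two.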
Neighborhood Aggregations – Collecting Neighbors 3:45
In this video, we will calculate an average over a neighborhood.
Use neighborhood aggregations from the GraphX API
Structural Operators – Connected Components 2:09
In this video, we will see what connected components are.
Implement in Spark GraphX
Page Rank Using Spark GraphX 4:59
In this video, we will find page rank using Spark GraphX.
What page rank is
Look at the input data
Calculate page rank in Spark GraphX
Anomaly Detection 2:16
In this video, we will see what an anomaly is and how to detect it.
Explain fraud detection
Explain clustering
Analyzing Web Logs for Suspicious Activity and Loading into Spark 2:11
The aim of this video is to analyze web logs for suspicious activity and load data into Spark.
Use the input data set for finding anomalies
Load data to Spark
Parse data
Implementing Clustering – Choosing Number of Clusters 3:59
In this video, we will implement clustering in Spark.
Choose number of clusters
Tweak clustering model
Detecting Anomalies in Network Traffic 4:11
In this video, we will detect anomalies in network traffic.
Use clustered network traffic to find anomaly
Find an outlier by comparing its distance to each of the clusters
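One common way to turn clusters into an anomaly detector is to flag points that lie far from every cluster center. The sketch below shows that idea in plain Python. It is not the course's Spark implementation: the function names, the Euclidean distance metric, and the fixed `threshold` are all assumptions made for illustration, and the cluster centers are taken as given rather than learned.

```python
import math


def nearest_center_distance(point, centers):
    """Return the Euclidean distance from point to its closest center."""
    return min(math.dist(point, center) for center in centers)


def find_outliers(points, centers, threshold):
    """Return the points whose nearest cluster center is beyond threshold.

    threshold is a hypothetical cut-off; in practice it would be tuned,
    for example from the distribution of distances in normal traffic.
    """
    return [p for p in points if nearest_center_distance(p, centers) > threshold]


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 10.0)]
    points = [(0.0, 1.0), (9.0, 10.0), (5.0, 5.0)]
    print(find_outliers(points, centers, threshold=2.0))
```

Here the point midway between the two centers is the only one farther than the threshold from both, so it is the only one reported.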
Analyzing Post for an Author 3:23
In this video, we will analyze posts for an author.
Create a project using Spark MLlib
Analyze input data
Prepare input data to make it ready for ML models
Extracting Information from Unstructured Text 1:01
In this video, we will extract information from unstructured text.
Find out how to show text as a vector
Transform text into a vector of numbers
Extracting Information Via Spark DataFrame 3:36
In this video, we will get to know the algorithms for transforming text into a vector of numbers.
Explain Bag-of-Words
Explain Word2Vec
Explain Skip-Gram
Sentiment Analysis of Posts Using Logistic Regression 3:36
In this video, we will see what supervised and unsupervised ML are.
What logistic regression is
Explain a simple logistic regression example
Implement logistic regression model in Apache Spark
Finding an Author of a Post 2:23
In this video, we will find an author of a post.
Find out what cross validation is
How to split training and test data in a proper way
Implement cross validation in Apache Spark
Downloading and Setting Cloudera Sandbox 3:49
In this video, we will download and set up the Cloudera Sandbox.
Download and set up Cloudera Sandbox
Start VirtualBox with Cloudera
Look at the tools that are available
Finding What Products Users Want to Buy Using Cloudera Sandbox Toolkit 11:52
In this video, we will find out what products the users want to buy.
Importing data to Hadoop
Using Hive as a SQL-like interface
Query data using Hue and using Impala as execution engine
Using Movies History to Suggest Interesting Content 2:33
In this video, we will use movies to suggest interesting content to the viewer.
Look at the movie data source that will be used to train the model
Build collaborative filtering in Apache Spark
Use the Alternating Least Squares (ALS) algorithm
Testing and Experimenting with Recommendation Engine 7:59
In this video, we will test and experiment with the recommendation engine.
Get recommended movies for a given user
Test the recommendation model
Validate

Requirements

Good knowledge of Java

Description

Hadoop is one of the most popular, reliable, and scalable distributed computing and storage frameworks for Big Data solutions. It comprises components designed to enable tasks on a distributed scale, across multiple servers and thousands of machines.

This comprehensive 3-in-1 training course gives you a strong foundation by exploring the Hadoop ecosystem with real-world examples. You’ll discover the process to set up an HDFS cluster, along with formatting and data transfer between your local storage and the Hadoop filesystem. You’ll also get hands-on solutions to 10 real-world use cases using Hadoop.

Contents and Overview

This training program includes 3 complete courses, carefully chosen to give you the most comprehensive training possible.

The first course, Getting Started with Hadoop 2.x, opens with an introduction to the world of Hadoop, where you will learn about nodes, data sets, and operations such as map and reduce. The second section deals with HDFS, Hadoop's file system used to store data. Further on, you’ll discover the differences between jobs and tasks, and get to know the Hadoop UI. After this, we turn our attention to storing data in HDFS and data transformations. Lastly, we will learn how to implement an algorithm the Hadoop map-reduce way and analyze the overall performance.

The second course, Hadoop Administration and Cluster Management, starts by installing Apache Hadoop on a cluster and configuring the required services. Learn various cluster operations like validations, and expanding and shrinking Hadoop services. You will then move on to gain a better understanding of administrative tasks like planning your cluster, monitoring, logging, security, troubleshooting and best practices. Techniques to keep your Hadoop clusters highly available and reliable are also covered in this course.

The third course, Solving 10 Hadoop'able Problems, covers the core parts of the Hadoop ecosystem, helping to give a broad understanding and get you up and running fast. Next, it describes a number of common problems that Hadoop is able to solve, presented as case-study projects. The course is broken down into sections by project, each serving as a specific use case for solving big data problems.

By the end of this Learning Path, you’ll be able to plan, deploy, manage, monitor, and performance-tune your Hadoop cluster with Apache Hadoop.

About the Author

A K M Zahiduzzaman is a software engineer with NewsCred Dhaka. He is a software developer and technology enthusiast. He was a Ruby on Rails developer but now works with NodeJS, AngularJS, and Python. He is also working with a much wider vision as a technology company. The next goal is introducing SOA within the current applications to scale development via microservices. Zahiduzzaman has a lot of experience with Spark and is passionate about it. He is also a guitarist and plays in a band. He was also a speaker at an international event in Dhaka. He is very enthusiastic and loves to share his knowledge.

Gurmukh Singh is a technology professional with 14+ years of industry experience in infrastructure design, distributed systems, performance optimization, and networks. He has worked in the big data domain for the last 5 years and provides consultancy and training on various technologies. He has worked with companies such as HP, JP Morgan, and Yahoo and has authored the book Monitoring Hadoop.

Tomasz Lelek is a Software Engineer and Co-Founder of InitLearn. He mostly does programming in Java and Scala. He dedicates his time and efforts to getting better at everything. He is currently delving into big data technologies. Tomasz is very passionate about everything associated with software development. He has been a speaker at a few conferences in Poland (Confitura and JDD) and at the Krakow Scala User Group. He has also conducted a live coding session at Geecon Conference. He was also a speaker at an international event in Dhaka. He is very enthusiastic and loves to share his knowledge.

Who this course is for:

This course is perfect for budding data scientists and data analysts who have a firm understanding of Java and want to get started with Hadoop.

Course content

Getting Started with Hadoop 2.x • 18 lectures • 2hr 17min

Hadoop Administration and Cluster Management • 38 lectures • 5hr 27min

Solving 10 Hadoop'able Problems • 40 lectures • 3hr 12min
