CCA131 Cloudera CDH 5 & 6 Hadoop Administrator Master Course
- 15 hours on-demand video
- 3 articles
- 61 downloadable resources
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion
- Become a successful Hadoop Cloudera Administrator
- Start working in a Hadoop Cloudera production environment
- Install, configure, manage, secure, test and troubleshoot a Hadoop Cloudera cluster
- Manage and secure a production-grade Hadoop Cloudera cluster
- Knowledge of Linux, cloud basics and system administration is an added advantage
- An AWS account registration. This course will guide you through the setup of a production-grade Hadoop Cloudera cluster in the AWS cloud.
- Any prior system setup experience is an added advantage and will make the learning experience more enjoyable
- A basic understanding of IT administration or development activities
This course is designed for everyone, from professionals with zero experience to already skilled professionals who want to enhance their learning. The hands-on sessions cover the end-to-end setup of a Cloudera cluster. We will be using AWS EC2 instances to deploy the cluster.
COURSE UPDATED PERIODICALLY SINCE LAUNCH (Cloudera 6)
What students are saying:
5 stars, "Very clear and adept in delivering the content. Learnt a lot. He covers the material 360 degrees and keeps the students in minds." - Sasidhar Thiriveedhi
5 stars, "This course is an absolute paradigm shift for me. This is really an amazing course, and you shouldn't miss if you are a novice/intermediate level in Cloudera Administration." - Santhosh G
5 stars, "Great work by the instructor... highly recommended..." - Phani Raj
5 stars, "It is really excellent course. A lot of learning materials." - Shaiukh Noor
5 stars, "This course is help me a lot for my certification preparation. thank you!" - Muhammad Faridh Ronianto
The course is targeted at Software Engineers, System Analysts, Database Administrators, DevOps Engineers and System Administrators who want to learn about the Big Data ecosystem with Cloudera. Other IT professionals can also take this course, but might have to do some extra work to understand some of the advanced concepts.
Cloudera being the market leader in the big data space, Hadoop Cloudera administration brings huge job opportunities in the Cloudera and big data domain. The course covers all the skills required for the CCA131 certification:
Install - Demonstrating the installation of Cloudera Manager, the Cloudera Distribution of Hadoop (CDH) and Hadoop ecosystem components
Configure - Basic to advanced configurations to set up Cloudera Manager, Namenode High Availability (HA) and Resource Manager High Availability (HA)
Manage - Creating and maintaining day-to-day activities and operations in a Cloudera cluster: cluster balancing, alert setup, rack topology management, commissioning and decommissioning hosts, YARN resource management with the FIFO, Fair and Capacity schedulers, and dynamic resource manager configurations
Secure - Enabling the relevant services and configuration to add security that meets the organisation's goals with best practice: configure extended Access Control Lists (ACLs), configure Sentry, Hue authorization and authentication with LDAP, and HDFS encrypted zones
Test - Access file system commands via HttpFS, create and restore snapshots of HDFS directories, get/set extended ACLs for a file or directory, benchmark the cluster
Troubleshoot - The ability to find the cause of any problem, resolve it and optimize inefficient execution; identify and filter out warnings, predict problems and apply the right solution; configure dynamic resource pools for better optimized use of the cluster; find scalability bottlenecks and size the cluster
Planning - Sizing the cluster and identifying the dependencies, hardware and software requirements
Getting a real distributed environment with many machines at enterprise quality would be very costly. Thanks to the cloud, any user can create a distributed environment with very minimal expenditure and pay only for what they use. AWS is very much technology neutral, and other cloud providers such as Microsoft Azure, IBM Bluemix and Google Compute Cloud work in a similar way.
+++++Content Added on Request++++
Dec ++ Cloudera 6 Overview and Quick Install
Nov ++ HDFS Redaction
Nov ++ Memory management - Heap Calculation for Roles and Namenode
Nov ++ IO Compression
Nov ++ Charts and Dashboard
Oct ++ File copy, distcp
Oct ++ Command files added for all the sections
Sep ++ Kafka Service Administration
Sep ++ Spark Service Administration
Aug ++ Cluster Benchmarking
- Those who are taking the CCA Hadoop Cloudera Administrator exam (CCA131)
- Anyone who wants to become a Cloudera Hadoop Administrator
- Professionals switching from the Mainframe / Testing / Analytics domains to the Cloudera Hadoop administration domain
- Data Scientists / Technical Architects / Software Developers / Testing and Mainframe professionals
- Hadoop Developers and Hadoop Cloudera Administrators who want to work in a production-like environment
- Anyone who needs to create a production-like environment for test or production purposes
iSMAC - IoT, Social, Mobility, Analytics and Cloud - will be the five pillars of the next business innovation. Learn how these five technologies are driving the transformation from e-business to digital business, and where big data and Cloudera fit in.
- Introduction to iSMAC
- Place for Big data in iSMAC world
- Answering Where, What and How?
- Types of analytics - Descriptive, Diagnostic, Predictive and Prescriptive
- Spectrum of analytics - What happened? What is happening? What will happen? What should happen?
Almost every big data project falls under the Lambda Architecture. The Lambda Architecture consists of three layers: the batch layer, the serving layer and the speed layer. The Hadoop ecosystem provides many components for these layers.
- Introduction to Lambda Architecture
- Three layers of Lambda Architecture
- Why is the Lambda Architecture important in the big data ecosystem?
- Various components serving different layers
- Use cases where the Lambda Architecture is used
Machine learning and deep learning play a significant role in the current big data space. You will understand and get answers to the following questions.
How does machine learning answer the question "what will happen"?
What is a model?
How is a model generated from training data?
How is a trained model used or applied?
Why are machine learning and deep learning important for processing unstructured data?
Amazon Web Services (AWS) provides access to technology resources on a pay-as-you-go model: you pay only for the resources you use. It provides a wide spectrum of resources, from compute, storage, database and network to analytics, security, machine learning and many more.
For this course we will understand what AWS is and how it can facilitate setting up a Cloudera cluster similar to a production environment.
AWS being the market leader in cloud, learning Cloudera administration with AWS is an added advantage.
How to sign up for a new account with Amazon Web Services (AWS), how services get charged, and how to track the billing.
- Signup new account with Amazon Web Services (AWS)
- AWS Free Tier
- Sample billing calculation for EC2 Instance and EBS Volume
- Verify billing information on the billing page
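As a rough sketch of how such a bill adds up, the calculation below multiplies hours by an hourly rate and adds the provisioned EBS volume. The rates are assumptions for illustration; check the AWS pricing pages for current values.

```shell
# Sample monthly bill for one EC2 instance plus one EBS volume.
HOURS=100            # hours the instance runs this month
RATE_CENTS=20        # assumed On-Demand price in cents per hour
EBS_GB=100           # provisioned EBS volume size
EBS_CENTS_PER_GB=10  # assumed gp2 price in cents per GB-month

ec2_cost=$((HOURS * RATE_CENTS))
# EBS is billed on the provisioned size, even while the instance is stopped.
ebs_cost=$((EBS_GB * EBS_CENTS_PER_GB))
total=$((ec2_cost + ebs_cost))
echo "EC2: \$$((ec2_cost / 100)), EBS: \$$((ebs_cost / 100)), total: \$$((total / 100))"
# prints: EC2: $20, EBS: $10, total: $30
```

Note that the EBS line item keeps accruing until the volume is deleted, which is why the course stresses verifying terminated resources on the billing page.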
AWS resources are located in multiple physical locations across the globe: regions correspond to geographic areas, and each region contains multiple availability zones. We will learn about zones and regions.
- Advantages of regions and zones
- How to achieve high availability with regions and zones
- Why regions and zones matter for a big data cluster
AWS Elastic Compute Cloud (EC2) provides secure, resizable compute capacity in the cloud. It gives the user complete control to choose, scale up and scale down the infrastructure based on need. Launching and terminating EC2 instances involves a few mandatory steps.
- Identify the instance type
- Selecting the network details
- Attaching EBS volume
- Setting up security configurations, opening the IP and port
- Tagging the system
- Attach New or existing key to login to the system
- Monitoring the health of the system
- Terminate the instance
- Verifying whether dependent services got terminated properly
- Which components are chargeable
Learn about key-based login using the PuTTY client. As administrators we may need to log in to each and every machine to do OS-level configuration. This can be done by logging into the system using the PuTTY client.
- Private Public key pair
- PuTTY client and the SSH protocol
- PuTTYgen to convert the PEM (Privacy Enhanced Mail) file into a PPK file
- Login as root to AWS EC2 Instance using private key
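For reference, the equivalent login from an OpenSSH client (Linux/macOS) looks like this; the key file name and hostname below are placeholders for your own downloaded key and instance address:

```shell
# The private key must not be readable by others, or ssh refuses to use it.
chmod 400 my-key.pem
ssh -i my-key.pem root@ec2-203-0-113-25.compute-1.amazonaws.com
```

On Windows, PuTTYgen converts the same my-key.pem into my-key.ppk, which PuTTY then loads under Connection > SSH > Auth.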
Amazon Machine Images (AMIs) facilitate installing any software or framework very quickly. The entire state of an Amazon Elastic Compute Cloud (EC2) instance can be stored as an AMI.
How to create Amazon Machine Image (AMI)
How Elastic Block Store (EBS) volumes get stored as snapshots
How to create more instances with AMI
How the security group will get applied against the instance
Disable the Linux firewall and SELinux in the AMI
Covered with CentOS 6
AWS provides four types of instances.
- On-Demand Instances
- Spot Instances
- Reserved Instances
- Dedicated Hosts
Out of these, On-Demand is the production-grade option, giving very high availability.
Spot Instances are spare compute capacity in AWS, made available at a huge discount compared to On-Demand. The price is decided by a bidding process: the highest bidder gets the instance, but Spot Instances can be interrupted by EC2 if the bid price falls below the market price. Apart from this, all other features such as fault tolerance and reliability are the same as for On-Demand Instances.
Since we are going to learn and not run a production cluster, we can use Spot Instances and get almost 10x the capacity for the same budget, a saving of 80 to 90% compared to On-Demand.
- How to find the current pricing
- Choosing the right instance type for Cloudera
- Checking the memory/processor/network capacity of instances
- Creating a bid for Spot Instances
- Features available in Spot Instances
Amazon Relational Database Service (RDS) is a pay-as-you-go database service. Users can start a MySQL, Aurora, PostgreSQL, MariaDB, Oracle or Microsoft SQL Server database of any capacity and run it for any duration with high availability.
- Introduction to RDS
- Start MySQL RDS
- Connect to MySQL RDS from EC2 Instance
- Connect to MySQL RDS from HeidiSQL client
- Working with RDS security configuration
- Introduction to RDS snapshot and High Availability
- Delete RDS
- Pricing calculation for RDS
Hadoop has HDFS (Hadoop Distributed File System). HDFS provides highly scalable and highly available storage within Hadoop. HDFS uses a master-worker architecture, which provides horizontal scalability.
- Introduction to HDFS
- How is the master-worker architecture implemented in HDFS?
- How is High Availability (HA) provided with master and worker?
- How is scalability provided with master and worker?
- How does data get distributed across HDFS?
YARN (Yet Another Resource Negotiator) provides a distributed processing environment with a master-worker architecture within Hadoop.
How is the master-worker architecture implemented within YARN?
How is High Availability implemented with master and worker?
How is scalability implemented with master and worker?
How does the Resource Manager work?
Introduction to schedulers and their purpose.
Cloudera services and the management roles of Cloudera use a database to store meta information. Some services, such as Sqoop, need a SQL database to demonstrate their functionality. We will be using MySQL or MariaDB for the same.
- Installing MySQL database
- Prepare OS firewall to use MySQL database
- When to use MariaDB
- Securing MySQL/MariaDB Installation
- Need for MySQL Driver and its installation procedure
- Creating sample database/tables/users
- Using UI client like HeidiSQL
AWS Amazon Machine Images (AMIs) are great for using a customized image of any system. Starting any number of EC2 instances with a predefined configuration is very easy if we have an AMI that matches our needs.
- Selecting the base AMI with CentOS
- Why CentOS for learning?
- Choose the type of instance with right memory and processing capacity
- Start and log in to the EC2 instance with PuTTY
- Configure Linux firewall, volume, swappiness, iptables and prepare the instance to install Cloudera manager
- Prepare AMI with predefined configurations for Cloudera installation
The Cloudera Distribution of Hadoop (CDH) provides a quick-install option where Cloudera Manager and the cluster can be installed with minimal configuration.
Setting up AWS EC2 for Cloudera Manager
Installing Cloudera Manager
Adding hosts to be managed
Selecting the required parcels
Learn about the various steps for installing CDH. In total there are six steps, and these six steps can be completed via three different installation paths. Each path has its own advantages and disadvantages.
- Three different paths of installation
- Overview on Path A, B and C
- Differences in these three paths
- Six steps in each path of installation
Cloudera Manager Server is the admin console for Cloudera Administrator.
- Overview on Cloudera Manager
- Exploring the options available in Cloudera Manager
- Overview on Agents
- Overview on Cloudera Management Service
- Overview on Cloudera Manager Database
- Overview on Cloudera Repository and Parcels
- Managing Services
- Managing multiple clusters and their services
Set up the Cloudera RPM repository locally on an httpd server. This helps the organisation pin the required version and make the installation available locally.
- Installing httpd server locally in AWS EC2 instance
- Configure firewall for httpd
- Making the Cloudera Manager repository available locally
- Making the Cloudera Distribution of Hadoop (CDH) parcels available locally
- Verify the repository and parcel accessibility
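A minimal sketch of such a local repository on CentOS 6 might look like the following; the package names and directory layout are typical but should be verified against the Cloudera documentation for your version.

```shell
# Install and start the web server, then build a yum repository from
# the downloaded Cloudera Manager RPMs.
yum install -y httpd createrepo
service httpd start
mkdir -p /var/www/html/cloudera-repo
# Copy the cloudera-manager RPMs and the CDH .parcel + .sha files here, then:
createrepo /var/www/html/cloudera-repo

# Verify accessibility from another host (repo-host is a placeholder):
curl http://repo-host/cloudera-repo/repodata/repomd.xml
```

Hosts are then pointed at this URL in their yum .repo file and in the Cloudera Manager parcel repository settings, so no host needs internet access during installation.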
Cloudera installation can be done via three different paths. Here we will learn about Path B, where every installation step is done manually.
- Verify repository for Cloudera installation
- Setup Database/users/tables for Cloudera manager
- Configure firewall, Transparent Huge Pages, Defragmentation, Caching, ntp to work with Cloudera setup
- Creating AMI for future use
Cloudera Manager is the heart of Cloudera and can manage any number of clusters and the services within them. As part of the Path B installation, Cloudera Manager, the Management Service agents and the JDK will be configured.
- Installing Cloudera Manager
- Installing Management Service Agents
- Configuring JDK
- Prepare and populate MySQL database for Cloudera Manager
- Updating configuration file to use the correct port and JDK
- Starting the server and agent
- Verifying the installation with web UI
- Introduction to the various versions of CDH installation
The Cloudera Distribution of Hadoop (CDH) can be installed on multiple hosts by selecting various services and their corresponding roles. As part of the Path B CDH installation, the server, agents, database and roles will be configured manually.
- Make AWS security configurations to facilitate Cloudera Manager to manage hosts
- Installing Agents in individual hosts
- Installing CDH parcels with JDK
- Configuring agent to send heart beat signal to Cloudera Manager
- Starting Agent in all hosts
- Configuring auto start of agents on restart of hosts
- Installing CDH Parcels in all hosts
- Create/Configure required database for report manager, Hue, Hive and Oozie
- Installing core Hadoop
- Selecting roles for various services
- Verifying installation
With Cloudera Manager we can add any number of clusters, and add and manage multiple services within each cluster.
- Adding a new cluster
- Introduction to roles and service in a cluster
- Adding various services to a cluster
- Adding various roles to a service
- Removing service from a cluster
- Removing a cluster from Cloudera Manager
HDFS (Hadoop Distributed File System) works on a Write Once Read Many (WORM) methodology. We can use the HDFS client to add, read and delete files and folders. We can also use additional functionality such as setting permissions and changing ownership.
- Add/Move/Delete files and folders
- Changing replication factor
- Finding a file's blocks, their locations and rack details
- Getting a file from HDFS to local file system
- File report on space, quota utilization
- Uses of -touchz
- Testing whether the entity is a file or folder
- Capturing the return value from shell commands
- Changing permissions and ownership
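A few of the client operations above can be sketched as follows; all paths, file names and the user/group are examples, and these commands assume a running cluster.

```shell
hdfs dfs -mkdir -p /user/admin/data              # create a folder
hdfs dfs -put sales.csv /user/admin/data         # upload from the local file system
hdfs dfs -setrep -w 2 /user/admin/data/sales.csv # change the replication factor
hdfs dfs -get /user/admin/data/sales.csv copy.csv  # fetch back to local
hdfs dfs -touchz /user/admin/data/_SUCCESS       # create a zero-byte marker file
hdfs dfs -test -d /user/admin/data; echo $?      # return value: 0 if a directory
hdfs dfs -chown admin:hadoop /user/admin/data    # change ownership
hdfs dfs -count -q /user/admin/data              # quota and space utilization report
```

The `-test`/`$?` pair is the pattern for capturing return values in shell scripts, e.g. to branch on whether a path exists before a scheduled job runs.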
High Availability (HA) can be provided in HDFS by increasing the replication factor of blocks on the Datanodes and introducing a Standby Namenode for the master. The fail-over controller and ZooKeeper provide automatic fail-over.
- Concepts of HA in Datanode
- Concept of HA in Namenode
- Introduction to Standby Namenode
- Functionality of Fail-over controller and Zookeeper
- Introduction and need for Journal nodes
Enabling HA for HDFS in Cloudera is as simple as adding the required roles and enabling it.
- Adding nameservice to existing HDFS
- Selecting systems for journal nodes
- Setting the standby Namenode and fail-over controller
- Configuring zookeeper to coordinate the fail-over
The reliability of High Availability can be tested by manually making a Namenode fail and checking whether the fail-over controller triggers and promotes the Standby Namenode to active.
Stopping active Namenode
Verifying automatic fail-over controller detection
Purpose of zookeeper in automatic fail-over
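The manual fail-over test above can be sketched with the haadmin utility; nn1 and nn2 are example Namenode IDs from the nameservice configuration, so yours may differ.

```shell
hdfs haadmin -getServiceState nn1   # reports "active" or "standby"
hdfs haadmin -getServiceState nn2
# Stop the active Namenode role (e.g. from Cloudera Manager), then verify
# that the fail-over controller promoted the standby:
hdfs haadmin -getServiceState nn2
```

If automatic fail-over is enabled, the ZKFC does the promotion; manual `-failover` transitions are refused in that mode.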
HDFS stores the actual data as blocks on the Datanodes. Due to commissioning or decommissioning of nodes in the cluster, the distribution of blocks across the Datanodes may become uneven. This creates uneven processing load and I/O during reads and processing. The HDFS balancer helps balance the blocks evenly across the Datanodes.
- Need for HDFS balancer
- Details on HDFS balancer role
- Adding HDFS balancer Role
- Configuration and setup of threshold
- Running the balancer
- Verify balanced blocks
- Removing balancer role
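Running the balancer from the command line can be sketched as below; the threshold value is a judgment call (10% is the default), and on a Cloudera cluster the same is usually triggered from the balancer role's Actions menu.

```shell
# Rebalance until every Datanode's utilization is within 5% of the cluster average
hdfs balancer -threshold 5
```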
To apply a software or hardware fix, we have to take a few systems away for maintenance. With Cloudera we can put a host, an entire cluster, a service or a role into maintenance mode. When a host, service or role is in maintenance mode, alerts are suppressed.
- Need for maintenance mode
- Taking cluster to maintenance mode and bringing cluster away from maintenance mode
- Taking Datanode to maintenance mode
- Impact of maintenance mode with replication factor
- Exiting from maintenance mode
Within HDFS, the number of files and the space used within any folder can be controlled by the administrator. This helps the administrator efficiently control users' utilization of Namenode and Datanode resources.
- Understanding Name and Space Quota
- Setting Name and Space Quota
- Details on how the Space Quota works
- Internal working of space allocation while adding a file, and its impact with the Space Quota
- Removing Name and Space Quota
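The space-quota accounting can be illustrated with a small calculation: the raw usage charged against a space quota is the file size multiplied by the replication factor (the quota itself would be set with `hdfs dfsadmin -setSpaceQuota`). The numbers below are examples, not defaults from any particular cluster.

```shell
FILE_MB=200       # logical file size (example)
REPLICATION=3     # HDFS default replication factor
# Raw space charged against the space quota = size * replication
raw_mb=$((FILE_MB * REPLICATION))
echo "A ${FILE_MB} MB file consumes ${raw_mb} MB of space quota"
# prints: A 200 MB file consumes 600 MB of space quota
```

This is why a 1 GB space quota cannot hold 1 GB of user data at replication 3, and why writes can fail early: while a block is being written, HDFS reserves a full block per replica against the quota.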
The HDFS Canary test is a regular health check done by executing a few client operations to monitor the health of HDFS.
- Purpose of Canary test
- Which operations get executed as part of the Canary test
- Implications on allowing Canary test to run
- Disabling Canary test and reason for disabling it
HDFS can be made aware of the arrangement of nodes across availability zones or racks, and can place blocks so that the availability of blocks and files is increased.
- Understanding Rack awareness
- Achieving rack awareness in Cloud
- Verifying Rack awareness
- Configuring rack awareness script
- Implementing rack awareness
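A rack awareness script simply maps each host name or IP passed to it onto a rack path. The sketch below uses an assumed IP-to-rack mapping; in Cloudera Manager, racks are more commonly assigned per host in the UI instead of via a script.

```shell
# Minimal rack topology resolver: one rack path per argument.
resolve_rack() {
  case "$1" in
    10.0.1.*) echo "/rack1" ;;          # assumed subnet for rack 1
    10.0.2.*) echo "/rack2" ;;          # assumed subnet for rack 2
    *)        echo "/default-rack" ;;   # fallback for unknown hosts
  esac
}

resolve_rack 10.0.1.15    # prints /rack1
resolve_rack 192.168.0.1  # prints /default-rack
```

In a vanilla Hadoop setup the script path is configured via the net.topology.script.file.name property; HDFS then keeps replicas of each block on at least two racks.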
Within HDFS, the Namenode stores meta information and the Datanodes store the blocks. The meta information is kept in RAM, and changes to the metadata are recorded in two different types of files: Edits and FSImage. The process of writing into the Edits and merging the Edits with the FSImage lets us recover the metadata in case it is lost.
- Functions and Structure of Edits and FSImage
- Drawback of playing back huge Edits
- Concept of saving namespace
- Concept of Checkpoint
- Types of trigger to update the FSImage from Edits
- Format of Editlog files
On various triggers, the transactions in the Edits get merged with the FSImage. A number of edit files are retained as part of the rolling process.
- Reading process of Edits and FSImage on start of Namenode
- Checkpoint process
- Details on segment in Edit logs
- Role played by Standby Namenode and Secondary Namenode during checkpoint process
Content within Edits and FSImage can be seen using OIV and OEV utility.
- Using Offline Image Viewer OIV
- Using Offline Edit Viewer OEV
- Process of how transactions gets updated in Edit logs
- Purpose and use of seen_txid
- Purpose of the block pool ID, cluster ID, namespace ID and layout version
- Generating output from OIV and OEV in various formats
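Typical invocations look like the following; the file names under the Namenode's current directory are examples, since actual transaction IDs will differ on every cluster.

```shell
# Dump an FSImage and an edit segment to readable XML.
hdfs oiv -p XML -i fsimage_0000000000000012345 -o fsimage.xml
hdfs oev -p XML -i edits_0000000000000012346-0000000000000012360 -o edits.xml

# seen_txid records the last transaction ID the Namenode has seen
# (the directory below assumes dfs.namenode.name.dir=/dfs/nn).
cat /dfs/nn/current/seen_txid
```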
HDFS Edits get rolled under various conditions. The transactions are rolled and then saved into the FSImage.
- Understanding the process of HDFS Roll edits
- Triggering condition for edits roll
- Understanding checkpoint check period
- Various scenarios when the edit logs roll will happen
- Manually rolling the edits
All edit transactions are played back to create the effective transaction list within the FSImage. This reduces the time to load the metadata into RAM.
- Loading of metadata on start of Namenode
- Reason to save namespace
- Introduction to HDFS Safemode
- Verifying edit transactions on save namespace
- HDFS roll edits, safemode, save namespace scenarios with dfsadmin command
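The dfsadmin scenarios above can be sketched as the following sequence, run as the HDFS superuser on a live cluster:

```shell
hdfs dfsadmin -rollEdits         # close the current edit segment, start a new one
hdfs dfsadmin -safemode enter    # metadata becomes read-only
hdfs dfsadmin -saveNamespace     # merge the edits into a fresh FSImage on disk
hdfs dfsadmin -safemode leave
```

saveNamespace refuses to run outside safemode, which is why the enter/leave pair brackets it.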
HDFS allows taking a point-in-time copy of the entire file system or of a specific folder. It does this in a very innovative way: no additional copies of blocks are stored. Snapshots are very helpful for recovering the state of HDFS.
- Understanding Snapshot process in HDFS
- Enabling Snapshot for a folder
- Internal working of Snapshot
- Restoring Snapshot
- Disabling Snapshot
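The snapshot lifecycle can be sketched as below; the folder and snapshot names are examples, and -allowSnapshot requires superuser privileges.

```shell
hdfs dfsadmin -allowSnapshot /user/project          # enable snapshots on a folder
hdfs dfs -createSnapshot /user/project before-upgrade
hdfs dfs -ls /user/project/.snapshot                # snapshots live here, read-only

# Restore an accidentally deleted file by copying it back out of the snapshot:
hdfs dfs -cp /user/project/.snapshot/before-upgrade/data.csv /user/project/

hdfs dfs -deleteSnapshot /user/project before-upgrade
hdfs dfsadmin -disallowSnapshot /user/project       # fails while snapshots remain
```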
The HDFS snapshot process can be automated by creating a policy. The policy can run at a regular interval, like a cron job.
- Creating HDFS Snapshot policy
- Setting the frequency of policy execution
- Setting up alerts on Snapshot execution
- Verifying Snapshot policy execution
- Disabling Snapshot policy
Access to the cluster needs to be protected while still providing end users all the access they require. An edge node, or gateway, achieves this. The edge node also handles the client traffic, and can be sized and scaled to handle many simultaneous, concurrent users accessing the cluster.
- Purpose and need of Edge Node or Gateway
- How security of the cluster can be increased with Edgenode
- Horizontally scaling Edgenode to handle load
- Configuring a Gateway
- Access HDFS from Gateway
- Setting security configuration between Gateway and Cluster to restrict the direct access to cluster by clients
HDFS offers the option of interacting through a REST API over the HTTP protocol. Using the REST API has advantages in terms of abstraction, security and gatewaying: it acts as a single point of entry to the system.
- Working of WebHDFS
- Setting up WebHDFS
- Interacting with HDFS using WebHDFS
- GET, PUT, POST, DELETE Operations
- Create directory/files with WebHDFS
- Difference between WebHDFS and httpFS
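A few WebHDFS calls can be sketched with curl; the hostname and paths are placeholders, and CDH 5 serves WebHDFS on the Namenode web port (50070 by default, without security enabled).

```shell
curl "http://namenode:50070/webhdfs/v1/user/admin?op=LISTSTATUS"
curl -X PUT "http://namenode:50070/webhdfs/v1/user/admin/newdir?op=MKDIRS"

# File creation is a two-step redirect: the Namenode answers with the
# URL of a Datanode, to which the data is then PUT.
curl -i -X PUT "http://namenode:50070/webhdfs/v1/user/admin/f.txt?op=CREATE"

curl -X DELETE "http://namenode:50070/webhdfs/v1/user/admin/f.txt?op=DELETE"
```

The redirect step is the key operational difference from httpFS, where the gateway proxies the data itself and clients never need network access to the Datanodes.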
httpFS is an additional role provided by Cloudera that works on top of WebHDFS. httpFS acts as a gateway or proxy on a single system. This provides complete control and security over clients' interaction with HDFS over the web.
- Architecture of httpFS
- Advantage and disadvantage of httpFS
- Details on httpFS role
- Adding/removing httpFS role
- Create file/directory with httpFS
- Difference between httpFS and WebHDFS
HDFS provides a file check utility (fsck) with which the status of the file system can be verified in terms of under-, over- and mis-replicated blocks, along with block, file, location and rack details of files and blocks.
- Getting File Check Utility Report
- Identifying missing/corrupted blocks
- When and how full block report gets generated
- Identifying block location and details
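The reports above can be sketched as follows (paths are examples):

```shell
hdfs fsck /                                    # overall health summary
hdfs fsck / -list-corruptfileblocks            # only files with corrupted blocks
hdfs fsck /user/data -files -blocks -locations -racks   # per-block placement detail
```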
The transactions stored in the Edits and FSImage may get corrupted due to the loss of part of an edit log or a few segments. This will prevent the Namenode from starting. If we have a recent backup we can use it; otherwise we can use an option provided by the Namenode to recover the Edits and FSImage.
- How Namenode recovery works
- Simulating the scenario of corrupting HDFS edits
- Making the Namenode fail to start
- Recover the Edits
- Start the Namenode successfully
- Understand the process of rectifying corrupted edits segment
HDFS supports scaling the Namenode horizontally, a process called federation. Namenodes are added and each Namenode handles a namespace. Federation supports the scalability of the Namenode.
- Architecture of Federation
- Enabling Federation
- Verifying the functionality of Federation
- Verifying the cluster ID, block pool ID and namespace ID
- Verifying VERSION file in Namenode and Datanode
- Creating Namespace
- Disabling Federation
Hadoop cluster capacity may need to be increased or decreased for various reasons. Hadoop supports processes called commissioning and decommissioning to add or remove hosts without impacting the running services.
- Understand the need of removing/adding hosts to cluster
- Behavior of cluster during and after commissioning and decommissioning
- Processes of commissioning and de-commissioning
- Verifying the state of hosts
Every service within Cloudera is controlled by configuration files. The same configuration files are required to connect to the service from any client. Cloudera offers downloads of the client configurations for any or all services.
- Purpose of client configuration files
- Downloading client configuration files
- Using client configuration files
- Deploying changed client configuration across cluster and its clients
Within a Cloudera cluster we may have to add different types of hosts playing different roles. While adding a host, we need to choose all the required roles. To reduce the complexity, we can define a template so that applying it to a host automatically selects the required roles.
- Understanding Host Templates
- Purpose and use of Host Templates
- Applying a host template while commissioning a host, or to existing managed hosts