
iSMAC (IoT, Social, Mobility, Analytics and Cloud) names the five pillars of the next wave of business innovation. Learn how these five technologies drive the transformation from e-business to digital business, and where Big Data and Cloudera fit in.
Almost every Big Data project falls under the Lambda Architecture, which consists of three layers: the batch layer, the serving layer and the speed layer. The Hadoop ecosystem provides many components for each of these layers.
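As a rough illustration of how the three Lambda Architecture layers relate, here is a toy sketch in Python. The event data and function names are hypothetical and are not tied to any Hadoop component. Real systems would use distributed storage and stream processing in place of these in-memory lists.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute counts from the full, immutable event history."""
    return Counter(e["page"] for e in master_events)

def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(e["page"] for e in recent_events)

def serve(page, batch, speed):
    """Serving side: answer a query by merging the batch and real-time views."""
    return batch[page] + speed[page]

# Made-up page-view events, for illustration only.
history = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}]
print(serve("home", batch_view(history), speed_view(recent)))  # 3
```

The batch view is accurate but stale, the speed view is fresh but partial, and a query combines the two.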
Machine learning and deep learning play a significant role in the current Big Data space. You will get answers to the following questions:
How does machine learning answer the question "what"?
What is a model?
How is a model generated from training data?
How is a trained model used or applied?
Why are machine learning and deep learning important for processing unstructured data?
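To make the "train a model, then apply it" idea concrete, here is a minimal sketch in plain Python. It fits a straight line by ordinary least squares. The data points are made up, and the dictionary used as the "model" is just one possible representation chosen for illustration.

```python
def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The learned parameters are the "model".
    return {"slope": slope, "intercept": intercept}

def predict(model, x):
    """Apply the trained model to an input it has not seen before."""
    return model["slope"] * x + model["intercept"]

# Made-up training data: inputs and the values observed for them.
model = train([1, 2, 3, 4], [10, 20, 30, 40])
print(predict(model, 5))  # 50.0
```

Training turns data into parameters. Applying the model uses only those parameters, not the original data.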
Amazon Web Services (AWS) provides access to technology resources on a pay-as-you-go model: you pay only for the resources you use. It offers a wide spectrum of resources, including compute, storage, database, network, analytics, security, machine learning and many more.
In this course we will understand what AWS is and how it helps to set up a Cloudera cluster similar to a production environment.
With AWS being the market leader in cloud, learning Cloudera administration on AWS is an added advantage.
How to sign up for a new Amazon Web Services (AWS) account, how services are charged and how to track billing.
AWS resources are located in multiple physical locations across the globe. Each Region is a geographic area, and each Region contains multiple Availability Zones. We will learn about Regions and Zones.
AWS Elastic Compute Cloud (EC2) provides secure, resizable compute capacity in the cloud. It gives the user complete control over choosing, scaling up and scaling down the infrastructure based on need. Launching and terminating EC2 instances involves a few mandatory steps.
Objective
Learn about Simple Storage Service (S3): what a bucket is and how it can be used in the Big Data space.
Learn about key-based login using the PuTTY client. As administrators we may need to log in to each machine to do OS-level configuration, and this can be done with the PuTTY client.
An Amazon Machine Image (AMI) makes installing any software or framework quick. An entire Amazon Elastic Compute Cloud (EC2) instance can be stored as an AMI.
Objective:
How to create an Amazon Machine Image (AMI)
How an Elastic Block Store (EBS) volume gets stored as a snapshot
How to create more instances from an AMI
How the security group gets applied to the instance
Disable the Linux firewall and SELinux in the AMI
Covered with CentOS 6
AWS provides four types of instances.
Of these, On-Demand is production grade and gives very high availability.
Spot Instances are spare compute capacity in AWS made available at a huge discount compared to On-Demand. The price is decided by a bidding process in which the highest bidder gets the instance. Spot Instances can, however, be interrupted by EC2 if the bid price falls below the market price. Apart from this, all other features such as fault tolerance and reliability are the same as for On-Demand Instances.
Since we are going to learn rather than run a production cluster, we can use Spot Instances and get roughly 5X to 10X the capacity for the same budget. That corresponds to a saving of 80 to 90% compared to On-Demand.
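The link between the discount and the extra capacity is simple arithmetic: a fixed budget buys 1 / (1 - discount) times as many instance-hours. A small sketch follows. The function name is illustrative, and actual Spot prices vary over time.

```python
def capacity_multiplier(discount):
    """How many times more instance-hours a fixed budget buys at a discount.

    discount is a fraction, e.g. 0.9 for a 90% saving versus On-Demand.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 / (1 - discount)

for d in (0.8, 0.9):
    print(f"{d:.0%} discount -> {capacity_multiplier(d):.1f}x capacity")
```

An 80% saving gives about 5X the capacity, and a 90% saving gives about 10X.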
Objective:
Relational Database Service (RDS) is a pay-as-you-go database service. Users can start a MySQL, Aurora, PostgreSQL, MariaDB, Oracle or Microsoft SQL Server database of any capacity and run it for any duration with High Availability.
Hadoop has HDFS (Hadoop Distributed File System), which provides highly scalable and highly available storage within Hadoop. HDFS works with a master-worker architecture that provides horizontal scalability.
YARN (Yet Another Resource Negotiator) provides a distributed processing environment with a master-worker architecture within Hadoop.
How is the master-worker architecture implemented within YARN?
How is high availability implemented with master and worker?
How is high scalability implemented with master and worker?
How does the Resource Manager work?
Introduction to the scheduler and its purpose.
Cloudera services and Cloudera management roles use a database to store meta information. A few services, such as Sqoop, need an SQL database to demonstrate their functionality. We will be using MySQL or MariaDB for this.
An Amazon Machine Image (AMI) is a great way to keep a customized image of any system. Starting any number of EC2 instances with a predefined configuration is very easy once we have an AMI that fits our need.
Cloudera's Distribution including Apache Hadoop (CDH) provides a quick install option where Cloudera Manager and the cluster can be installed with minimal configuration.
Setting up AWS EC2 for Cloudera Manager
Installing Cloudera Manager
Adding hosts to be managed
Selecting the required parcels
Installing CDH
Learn about the various steps for installing CDH. In total there are six steps, and these six steps can be carried out along three different paths. Each path has its own advantages and disadvantages.
Cloudera Manager Server is the admin console for the Cloudera administrator.
Cloudera has its own binary distribution format, called parcels, which contains the required programs. Parcels are very similar to Linux packages.
Set up the Cloudera RPM repository locally on an httpd server. This helps the organisation pin the required version and make the installation available locally.
Cloudera installation can be done along three different paths. Here we will learn about Path B, where every installation step is done manually.
Cloudera Manager is the heart of Cloudera and can manage any number of clusters and the services within them. As part of the Path B installation, Cloudera Manager will be configured together with the Management Service, Agent and JDK.
Cloudera's Distribution including Apache Hadoop (CDH) can be installed on multiple hosts by selecting various services and their corresponding roles. As part of the Path B CDH installation, the Server, Agents, database and roles will be configured manually.
With Cloudera Manager we can add any number of clusters and add or manage multiple services within each cluster.
HDFS (Hadoop Distributed File System) works on the Write Once Read Many (WORM) principle. We can use the HDFS client to add, read and delete files and folders, and to perform additional operations such as setting permissions and changing ownership.
Hadoop supports moving deleted files to Trash, which acts like a recycle bin.
High Availability (HA) can be provided in HDFS by increasing the replication factor of blocks on the Datanodes and by introducing a Standby Namenode for the master. The failover controller and ZooKeeper provide automatic failover.
Setting up HDFS HA in Cloudera is as simple as adding the required roles and enabling HA.
The reliability of High Availability can be tested by manually making the active Namenode fail and checking whether the failover controller triggers and promotes the Standby Namenode to active.
Stopping active Namenode
Verifying automatic fail-over controller detection
Purpose of ZooKeeper in automatic fail-over
HDFS High Availability is a must for a production environment, but in a test scenario we can disable it to free up resources. We will see how to disable HDFS HA.
HDFS stores the actual data as blocks on the Datanodes. When nodes are commissioned into or decommissioned from the cluster, the distribution of blocks across Datanodes may become uneven. This creates uneven processing load and I/O during reads and processing. The HDFS balancer redistributes the blocks evenly across the Datanodes.
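The balancer's notion of "even" can be sketched as follows: a Datanode counts as balanced when its utilization is within a threshold (commonly 10 percentage points by default) of the cluster average. This simplified Python sketch assumes all Datanodes have equal capacity, so the cluster average is just the mean of the per-node percentages. The host names and numbers are made up.

```python
def classify_datanodes(used_pct, threshold=10.0):
    """Label each Datanode relative to the cluster-average utilization.

    used_pct maps a (hypothetical) host name to its disk utilization in percent.
    """
    average = sum(used_pct.values()) / len(used_pct)
    labels = {}
    for host, pct in used_pct.items():
        if pct > average + threshold:
            labels[host] = "over-utilized"    # blocks should move away from here
        elif pct < average - threshold:
            labels[host] = "under-utilized"   # blocks should move to here
        else:
            labels[host] = "balanced"
    return average, labels

average, labels = classify_datanodes({"dn1": 80.0, "dn2": 50.0, "dn3": 20.0})
print(average, labels)
```

Here dn1 would be a source of blocks and dn3 a target, while dn2 is left alone.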
To apply a software or hardware fix, we have to take a few systems out for maintenance. With Cloudera we can put a host, a service, a role or an entire cluster into maintenance mode. While a host, service or role is in maintenance mode, its alerts are suppressed.
Within HDFS, an administrator can limit the number of files and the amount of space used within any folder. This helps the administrator control how users consume Namenode and Datanode resources.
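A rough sketch of how the two kinds of quota behave is given below. It assumes a flat directory with no subdirectories and illustrative parameter names. The name quota limits the number of names, with the directory counting against its own quota, while the space quota limits bytes including all replicas.

```python
def check_quota(file_sizes, replication, name_quota, space_quota):
    """Return a list of quota violations for a single flat directory.

    file_sizes: sizes in bytes of the files directly under the directory.
    """
    names_used = len(file_sizes) + 1              # +1 for the directory itself
    space_used = sum(file_sizes) * replication    # every replica consumes quota
    violations = []
    if names_used > name_quota:
        violations.append("name quota exceeded")
    if space_used > space_quota:
        violations.append("space quota exceeded")
    return violations

# Two files of 100 and 200 bytes with replication factor 3 use 900 bytes.
print(check_quota([100, 200], 3, name_quota=3, space_quota=800))
```

Because replicas count, a space quota should be sized as data volume times replication factor.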
The HDFS canary test is a regular health check that executes a few client operations to monitor the health of HDFS.
HDFS can be made aware of how nodes are arranged across availability zones or racks, so that it can place blocks in a way that increases the availability of blocks and files.
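As an illustration of rack-aware placement, the sketch below follows the commonly described default for three replicas: the first on the writer's node, the second and third on two different nodes of another rack. The topology and host names are hypothetical. The sketch assumes the writer is itself a Datanode and that at least one other rack has two or more nodes.

```python
import random

def place_replicas(topology, writer, rng=random):
    """Pick three Datanodes for a block, spreading them over two racks.

    topology maps a rack name to a list of hosts; writer is the client's host.
    """
    local_rack = next(r for r, hosts in topology.items() if writer in hosts)
    remote_racks = [r for r in topology
                    if r != local_rack and len(topology[r]) >= 2]
    remote_rack = rng.choice(remote_racks)
    # Two distinct nodes on the same remote rack.
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(topology, "dn1")
print(replicas)
```

With this layout the block survives the loss of either rack, while only one copy crosses the rack boundary during the write.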
Within HDFS, the Namenode stores meta information and the Datanodes store the blocks. Meta information is held in RAM, and changes to the metadata are recorded in two types of files: edits and FSImage. Writing to edits and merging edits into FSImage lets us recover the metadata if it is ever lost.
On various triggers, the transactions in edits get merged into FSImage. A number of edits files is retained as part of the rolls.
The content of edits and FSImage can be inspected with the Offline Edits Viewer (OEV) and Offline Image Viewer (OIV) utilities, respectively.
HDFS edits get rolled under various conditions. The rolled transactions are then saved into FSImage.
All edits transactions are played back to create the effective transaction list within FSImage. This reduces the time needed to load the metadata into RAM.
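The playback of edits onto FSImage can be pictured with a small sketch. The in-memory structures and the operation names ("create", "delete", "rename") are simplifications chosen for illustration and are not the real edit-log opcodes or file formats.

```python
def checkpoint(fsimage, edits):
    """Replay edit-log transactions onto a namespace image; return a new image.

    fsimage: dict mapping path -> metadata.
    edits: ordered list of (op, path, argument) tuples.
    """
    image = dict(fsimage)  # leave the old image untouched
    for op, path, arg in edits:
        if op == "create":
            image[path] = arg
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[arg] = image.pop(path)  # arg is the new path
    return image

old_image = {"/data/a.txt": {"replication": 3}}
edits = [
    ("create", "/data/b.txt", {"replication": 3}),
    ("rename", "/data/a.txt", "/data/c.txt"),
    ("delete", "/data/b.txt", None),
]
new_image = checkpoint(old_image, edits)
print(sorted(new_image))  # ['/data/c.txt']
```

Three transactions collapse into a single entry in the new image, which is why loading a fresh image is faster than replaying a long edit log at startup.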
HDFS allows taking a point-in-time copy of the entire file system or of a specific folder. It uses an innovative approach in which no additional copy of the blocks is stored. Snapshots are very helpful for recovering the state of HDFS.
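Why a snapshot needs no extra block copies can be shown with a toy model in which a snapshot only records which blocks each file refers to. This is a deliberately simplified sketch with hypothetical class and method names. HDFS itself tracks changes relative to the snapshot rather than copying the whole mapping.

```python
class Namespace:
    """Toy namespace where each file is a tuple of block IDs."""

    def __init__(self):
        self.files = {}       # path -> tuple of block IDs
        self.snapshots = {}   # snapshot name -> saved path-to-blocks mapping

    def create_snapshot(self, name):
        # Only the path-to-block mapping is recorded, never the block data.
        self.snapshots[name] = dict(self.files)

    def delete(self, path):
        del self.files[path]

    def restore(self, name, path):
        self.files[path] = self.snapshots[name][path]

    def referenced_blocks(self):
        """Blocks that must be kept: live files plus anything a snapshot holds."""
        views = [self.files, *self.snapshots.values()]
        return {b for view in views for blocks in view.values() for b in blocks}

ns = Namespace()
ns.files["/data/a.txt"] = ("blk_1", "blk_2")
ns.create_snapshot("s1")
ns.delete("/data/a.txt")
print(ns.referenced_blocks())  # the blocks are still held by snapshot s1
ns.restore("s1", "/data/a.txt")
```

Deleting the file removes it from the live namespace, but its blocks stay on disk as long as a snapshot still refers to them.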
The HDFS snapshot process can be automated by creating a policy. A policy can take snapshots at regular intervals, like a cron job.
Access to the cluster needs to be protected while still giving end users all the access they require. An edge node, or gateway, achieves this. The edge node also handles client traffic and load, and can be sized to serve many concurrent users accessing the cluster.
HDFS offers a REST API over the HTTP protocol, known as WebHDFS. Using the REST API has advantages in terms of abstraction, security and gateway access, since it acts as a single point of entry to the system.
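A WebHDFS request is an ordinary HTTP call whose URL has the form /webhdfs/v1/<path>?op=<OPERATION>. The sketch below only builds such URLs and does not contact a cluster. The host name and port are placeholders, and the helper name is hypothetical.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL such as .../webhdfs/v1/<path>?op=LISTSTATUS."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# Placeholder host and port; substitute the values of your own Namenode.
print(webhdfs_url("namenode.example.com", 9870, "/user/alice", "LISTSTATUS"))
print(webhdfs_url("namenode.example.com", 9870, "/tmp/x", "OPEN", offset=0))
```

The resulting URL can be passed to any HTTP client. On a secured cluster, authentication parameters or headers would be needed in addition.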
HttpFS is an additional role provided by Cloudera that works on top of WebHDFS. HttpFS acts as a gateway or proxy on a single system, which gives complete control and security over client interaction with HDFS over the web.
HDFS provides a file system check utility (fsck) that reports the status of the file system in terms of under-, over- and mis-replicated blocks, along with block, file, location and rack details.
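The replication categories in such a report can be illustrated with a short sketch that compares the number of live replicas of each block with the target replication factor. The block IDs and host names are made up. Mis-replicated blocks, which violate rack placement rather than replica count, are left out for brevity.

```python
def classify_blocks(blocks, target_replication=3):
    """Group block IDs by replication state.

    blocks maps a block ID to the list of Datanodes holding a replica.
    """
    report = {"missing": [], "under": [], "healthy": [], "over": []}
    for block_id, replicas in blocks.items():
        live = len(set(replicas))
        if live == 0:
            report["missing"].append(block_id)
        elif live < target_replication:
            report["under"].append(block_id)
        elif live > target_replication:
            report["over"].append(block_id)
        else:
            report["healthy"].append(block_id)
    return report

blocks = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1"],
    "blk_3": [],
    "blk_4": ["dn1", "dn2", "dn3", "dn4"],
}
print(classify_blocks(blocks))
```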
The transactions stored in edits and FSImage may get corrupted, for example through the loss of part of the edit log or of a few segments. This prevents the Namenode from starting. If we have a recent backup we can use it; otherwise we can use the recovery option provided by the Namenode to recover edits and FSImage.
HDFS supports scaling the Namenode horizontally through a process called Federation. Namenodes are added, and each Namenode handles its own namespace. Federation thus supports scalability of the Namenode.
Every user within HDFS has a dedicated home directory, which is the default directory for adding or reading files.
Hadoop cluster capacity may need to be increased or decreased for various reasons. Hadoop supports commissioning and decommissioning to add or remove hosts without impacting running services.
Every service within Cloudera is controlled by configuration files. The same configuration files are required to connect to the service from any client. Cloudera lets us download the client configuration for any or all services.
Within a cluster in Cloudera we may have to add different types of hosts playing different roles. While adding a host, we need to choose all the required roles. To reduce this complexity we can define a host template, so that applying the template to a host automatically selects the required roles.
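Conceptually, a host template is a named list of roles that is applied to a host in one step. The sketch below models that idea with plain dictionaries. The template names and role lists are examples only and do not reflect the Cloudera Manager API.

```python
# Hypothetical templates: names and role lists are examples only.
HOST_TEMPLATES = {
    "worker": ["DataNode", "NodeManager"],
    "master": ["NameNode", "ResourceManager"],
    "edge": ["Gateway", "HttpFS"],
}

def apply_template(assignments, host, template):
    """Assign every role in the named template to the host, without duplicates."""
    roles = assignments.setdefault(host, [])
    for role in HOST_TEMPLATES[template]:
        if role not in roles:
            roles.append(role)
    return assignments

assignments = {}
for new_host in ("host1", "host2"):
    apply_template(assignments, new_host, "worker")
print(assignments)
```

Adding ten more workers then means applying one template ten times instead of picking roles by hand for each host.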
This course is designed for everyone from professionals with zero experience to already skilled professionals who want to enhance their learning. The hands-on sessions cover the end-to-end setup of a Cloudera cluster. We will be using AWS EC2 instances to deploy the cluster.
COURSE UPDATED PERIODICALLY SINCE LAUNCH (Cloudera 6)
What students are saying:
5 stars, "Very clear and adept in delivering the content. Learnt a lot. He covers the material 360 degrees and keeps the students in minds."
5 stars, "This course is an absolute paradigm shift for me. This is really an amazing course, and you shouldn't miss if you are a novice/intermediate level in Cloudera Administration."
5 stars, "Great work by the instructor... highly recommended..."
5 stars, "It is really excellent course. A lot of learning materials."
5 stars, "This course is help me a lot for my certification preparation. thank you!"
The course is targeted at Software Engineers, System Analysts, Database Administrators, DevOps Engineers and System Administrators who want to learn about the Big Data ecosystem with Cloudera. Other IT professionals can also take this course, but might have to do some extra work to understand some of the advanced concepts.
With Cloudera being the market leader in the Big Data space, Hadoop Cloudera administration brings huge job opportunities in the Cloudera and Big Data domain. The course covers all the skills required for the CCA131 certification, as follows:
Install - Demonstration and installation of Cloudera Manager, Cloudera's Distribution including Apache Hadoop (CDH) and Hadoop ecosystem components
Configure - Basic to advanced configurations to set up Cloudera Manager, Namenode High Availability (HA), Resource Manager High Availability (HA)
Manage - Create and maintain day-to-day activities and operations in a Cloudera cluster, such as cluster balancing, alert setup, rack topology management, commissioning and decommissioning hosts, YARN resource management with FIFO, Fair and Capacity Schedulers, and dynamic Resource Manager configurations
Secure - Enabling relevant services and configuration to add security that meets the organisation's goals with best practice. Configure extended Access Control Lists (ACL), Sentry, Hue authorization and authentication with LDAP, and HDFS encryption zones
Test - Access file system commands via HttpFS, create and restore snapshots for an HDFS directory, get/set extended ACLs for a file or directory, benchmark the cluster
Troubleshoot - Ability to find the cause of any problem, resolve it, and optimize inefficient execution. Identify and filter out warnings, predict problems and apply the right solution. Configure dynamic resource pools for better optimized use of the cluster. Find scalability bottlenecks and size the cluster.
Planning - Sizing and identifying the dependencies, hardware and software requirements.
Getting a real distributed environment with many machines at enterprise quality would be very costly. Thanks to the cloud, any user can create a distributed environment with minimal expenditure and pay only for what they use. AWS is very much technology neutral, and other cloud providers such as Microsoft Azure, IBM Bluemix, Google Compute Cloud, etc. work in a similar way.
Content Added on Request
Dec Cloudera 6 Overview and Quick Install
Nov HDFS Redaction
Nov Memory management - Heap Calculation for Roles and Namenode
Nov IO Compression
Nov Charts and Dashboard
Oct File copy, distcp
Oct Command files added for all the sections
Sep Kafka Service Administration
Sep Spark Service Administration
Aug Cluster Benchmarking