
Get an overview of the CCA 159 exam for data analysts, detailing eight to twelve scenarios, 120 minutes, and the data preparation, structuring, and analysis skills with Sqoop, Hive, and Impala.
Learn to download and install Oracle VirtualBox on Mac, create Linux virtual machines, mount ISO images, configure storage and memory, and use Vagrant or vendor images to automate multi-VM setups.
Explore the Cloudera QuickStart VM setup, including HDFS, YARN, Hive, Spark, and Sqoop, with Cloudera Manager and Hue interfaces for hands-on exam preparation.
Learn Hive basics by launching Hive, creating an external table against data in HDFS, and using Beeline for queries on a Cloudera Quickstart VM.
Sign up for the labs to access an 11-node multi-node cluster with Ambari, enabling affordable hands-on big data learning and preparation for Cloudera or Hortonworks certification exams.
Access hdfs from the cluster gateway node using hadoop fs commands to explore public datasets. Learn to set up passwordless ssh login and use command line tools for efficient access.
Review yarn and MapReduce two property files in Hadoop conf, focusing on resource manager address and memory settings for map and reduce tasks. Use web ui and job history.
Review MapReduce and yarn job logs via the resource manager and job history server UI, using the tracking URL to inspect task attempts and troubleshoot failures with syslog.
Explore how to interpret MapReduce job counters to benchmark performance, examining map tasks, reduce tasks, combiner effects, file system counters, and final logs from the job history.
Learn how to ingest data into Hdfs using Sqoop for batch load and Flume or Kafka for streaming, preparing data for processing with MapReduce or Spark.
Learn to list files in hdfs using the hadoop fs ls command, with options -t, -s, -R, -h, -C, and -r to sort, display readable sizes, and recursively traverse directories.
Create and manage directories in HDFS using Hadoop fs commands like mkdir, rm, and rmdir, including recursive options -p and -R to build or remove directory structures.
Copy files and directories from local file system into HDFS using Hadoop fs put or copy from local, creating retail_db in user training and handling existing targets carefully.
Explore hive history and dot hive rc to override session properties, then launch hive and use up arrow or ctrl-r to navigate commands and set hive execution engine to tez.
Learn to retrieve complete Hive table metadata using describe formatted, compare with describe extended, and interpret columns, data types, comments, database, owner, location, and storage details.
Learn how to load data from the local file system into a Hive table using load data local, with a row format and fields terminated by comma for correct querying.
Understand the difference between external and managed tables in hive, including how drop table operations affect data and metadata and how a table location determines whether data remains.
Insert data into the bucket table from the stage table after enabling hive bucketing, then verify eight buckets and data distribution with describe formatted and data previews.
Learn to clean strings with ltrim, rtrim, and trim, and pad values using lpad and rpad in Hive. Build correctly formatted dates from year, month, and day, and validate results.
Discover how to get current date and current timestamp in Hive using operators, not functions, with default date-time formatting for filtering in where clauses.
Learn how to use the date_format function in Hive to extract year, month, day, and time components from dates, timestamps, or strings, and format outputs for queries.
Learn how to filter data in Hive with the where clause, using equal, not equal, greater than, and numeric versus string comparisons on orders and order items.
Explore the like operator in the where clause for partial date matching on orders, filtering by 2014% and the 07 pattern to target July, with group by for counts.
As part of Sqoop, Hive, and Impala for Data Analysts (Formerly CCA 159), you will learn key skills such as Sqoop, Hive, and Impala.
This comprehensive course covers all aspects of the certification with real-world examples and data sets.
Overview of Big Data ecosystem
Overview Of Distributions and Management Tools
Properties and Properties Files - General Guidelines
Hadoop Distributed File System
YARN and Map Reduce2
Submitting Map ReduceJob
Determining Number of Mappers and Reducers
Understanding YARN and Map Reduce Configuration Properties
Review and Override Job Properties
Reviewing Map Reduce Job Logs
Map Reduce Job Counters
Overview of Hive
Databases and Query Engines
Overview of Data Ingestion in Big Data
Data Processing using Spark
HDFS Commands to manage files
Introduction to HDFS for Certification Exams
Overview of HDFS and PropertiesFiles
Overview of Hadoop CLI
Listing Files in HDFS
User Spaces or Home Directories in HDFS
Creating Directories in HDFS
Copying Files and Directories into HDFS
File and Directory Permissions Overview
Getting Files and Directories from HDFS
Previewing Text Files in HDFS
Copying or Moving Files and Directories within HDFS
Understanding Size of File System and Files
Overview of Block Size and ReplicationFactor
Getting File Metadata using hdfs fsck
Resources and Exercises
Getting Started with Hive
Overview of Hive Language Manual
Launching and using Hive CLI
Overview of Hive Properties
Hive CLI History and hiverc
Running HDFS Commands in Hive CLI
Understanding Warehouse Directory
Creating and Using Hive Databases
Creating and Describing Hive Tables
Retrieve Matadata of Tables using DESCRIBE
Role of Hive Metastore Database
Overview of beeline
Running Hive Commands and Queries using beeline
Creating Tables in Hive using Hive QL
Creating Tables in Hive - orders
Overview of Basic Data Types in Hive
Adding Comments to Columns and Tables
Loading Data into Hive Tables from Local File System
Loading Data into Hive Tables from HDFS
Loading Data - Overwrite vs Append
Creating External tables in Hive
Specifying Location for Hive Tables
Difference between Managed Table and External Table
Default Delimiters in Hive Tables using Text File
Overview of File Formats in Hive
Differences between Hive and RDBMS
Truncate and Drop tables in Hive
Resources and Exercises
Loading/Inserting data into Hive tables using Hive QL
Introduction to Partitioning and Bucketing
Creating Tables using Orc Format - order_items
Inserting Data into Tables using Stage Tables
Load vs. Insert in Hive
Creating Partitioned Tables in Hive
Adding Partitions to Tables in Hive
Loading into Partitions in Hive Tables
Inserting Data Into Partitions in Hive Tables
Insert Using Dynamic Partition Mode
Creating Bucketed Tables in Hive
Inserting Data into Bucketed Tables
Bucketing with Sorting
Overview of ACID Transactions
Create Tables for Transactions
Inserting Individual Records into Hive Tables
Update and Delete Data in Hive Tables
Overview of functions in Hive
Overview of Functions
Validating Functions
String Manipulation - Case Conversion and Length
String Manipulation - substr and split
String Manipulation - Trimming and Padding Functions
String Manipulation - Reverse and Concatenating Multiple Strings
Date Manipulation - Current Date and Timestamp
Date Manipulation - Date Arithmetic
Date Manipulation - trunc
Date Manipulation - Using date format
Date Manipulation - Extract Functions
Date Manipulation - Dealing with Unix Timestamp
Overview of Numeric Functions
Data Type Conversion Using Cast
Handling Null Values
Query Example - Get Word Count
Writing Basic Queries in Hive
Overview of SQL or Hive QL
Execution Life Cycle of Hive Query
Reviewing Logs of Hive Queries
Projecting Data using Select and Overview of From
Derive Conditional Values using CASE and WHEN
Projecting Distinct Values
Filtering Data using Where Clause
Boolean Operations in Where Clause
Boolean OR vs IN Operator
Filtering Data using LIKE Operator
Performing Basic Aggregations using Aggregate Functions
Performing Aggregations using GROUP BY
Filtering Aggregated Data Using HAVING
Global Sorting using ORDER BY
Overview of DISTRIBUTE BY
Sorting Data within Groups using SORT BY
Using CLUSTERED BY
Joining Data Sets and Set Operations in Hive
Overview of Nested Sub Queries
Nested Sub Queries - Using IN Operator
Nested Sub Queries - Using EXISTS Operator
Overview of Joins in Hive
Performing Inner Joins using Hive
Performing Outer Joins using Hive
Performing Full Outer Joins using Hive
Map Side Join and Reduce Side Join in Hive
Joining in Hive using Legacy Syntax
Cross Joins in Hive
Overview of Set Operations in Hive
Perform Set Union between two Hive Query Results
Set Operations - Intersect and Minus Not Supported
Windowing or Analytics Functions in Hive
Prepare HR Database in Hive with Employees Table
Overview of Analytics or Windowing Functions in Hive
Performing Aggregations using Hive Queries
Create Tables to Get Daily Revenue using CTAS in Hive
Getting Lead and Lag using Windowing Functions in Hive
Getting First and Last Values using Windowing Functions in Hive
Applying Rank using Windowing Functions in Hive
Applying Dense Rank using Windowing Functions in Hive
Applying Row Number using Windowing Functions in Hive
Difference Between rank, dense_rank, and row_number in Hive
Understanding the order of execution of Hive Queries
Overview of Nested Sub Queries in Hive
Filtering Data on Top of Window Functions in Hive
Getting Top 5 Products by Revenue for Each Day using Windowing Functions in Hive - Recap
Running Queries using Impala
Introduction to Impala
Role of Impala Daemons
Impala State Store and Catalog Server
Overview of Impala Shell
Relationship between Hive and Impala
Overview of Creating Databases and Tables using Impala
Loading and Inserting Data into Tables using Impala
Running Queries using Impala Shell
Reviewing Logs of Impala Queries
Synching Hive and Impala - Using Invalidate Metadata
Running Scripts using Impala Shell
Assignment - Using NYSE Data
Assignment - Solution
Getting Started with Sqoop
Introduction to Sqoop
Validate Source Database - MySQL
Review JDBC Jar to Connect to MySQL
Getting Help using Sqoop CLI
Overview of Sqoop User Guide
Validate Sqoop and MySQL Integration using Sqoop List Databases
Listing Tables in Database using Sqoop
Run Queries in MySQL using Sqoop Eval
Understanding Logs in Sqoop
Redirecting Sqoop Job Logs into Log Files
Importing data from MySQL to HDFS using Sqoop Import
Overview of Sqoop Import Command
Import Orders using target-dir
Import Order Items using warehouse-dir
Managing HDFS Directories
Sqoop Import Execution Flow
Reviewing Logs of Sqoop Import
Sqoop Import Specifying Number of Mappers
Review the Output Files generated by Sqoop Import
Sqoop Import Supported File Formats
Validating avro files using Avro Tools
Sqoop Import Using Compression
Apache Sqoop - Importing Data into HDFS - Customizing
Introduction to customizing Sqoop Import
Sqoop Import by Specifying Columns
Sqoop import Using Boundary Query
Sqoop import while filtering Unnecessary Data
Sqoop Import Using Split By to distribute import using non default column
Getting Query Results using Sqoop eval
Dealing with tables with Composite Keys while using Sqoop Import
Dealing with tables with Non Numeric Key Fields while using Sqoop Import
Dealing with tables with No Key Fields while using Sqoop Import
Using autoreset-to-one-mapper to use only one mapper while importing data using Sqoop from tables with no key fields
Default Delimiters used by Sqoop Import for Text File Format
Specifying Delimiters for Sqoop Import using Text File Format
Dealing with Null Values using Sqoop Import
Import Mulitple Tables from source database using Sqoop Import
Importing data from MySQL to Hive Tables using Sqoop Import
Quick Overview of Hive
Create Hive Database for Sqoop Import
Create Empty Hive Table for Sqoop Import
Import Data into Hive Table from source database table using Sqoop Import
Managing Hive Tables while importing data using Sqoop Import using Overwrite
Managing Hive Tables while importing data using Sqoop Import - Errors Out If Table Already Exists
Understanding Execution Flow of Sqoop Import into Hive tables
Review Files generated by Sqoop Import in Hive Tables
Sqoop Delimiters vs Hive Delimiters
Different File Formats supported by Sqoop Import while importing into Hive Tables
Sqoop Import all Tables into Hive from source database
Exporting Data from HDFS/Hive to MySQL using Sqoop Export
Introduction to Sqoop Export
Prepare Data for Sqoop Export
Create Table in MySQL for Sqoop Export
Perform Simple Sqoop Export from HDFS to MySQL table
Understanding Execution Flow of Sqoop Export
Specifying Number of Mappers for Sqoop Export
Troubleshooting the Issues related to Sqoop Export
Merging or Upserting Data using Sqoop Export - Overview
Quick Overview of MySQL - Upsert using Sqoop Export
Update Data using Update Key using Sqoop Export
Merging Data using allowInsert in Sqoop Export
Specifying Columns using Sqoop Export
Specifying Delimiters using Sqoop Export
Using Stage Table for Sqoop Export
Submitting Sqoop Jobs and Incremental Sqoop Imports
Introduction to Sqoop Jobs
Adding Password File for Sqoop Jobs
Creating Sqoop Job
Run Sqoop Job
Overview of Incremental Loads using Sqoop
Incremental Sqoop Import - Using Where
Incremental Sqoop Import - Using Append Mode
Incremental Sqoop Import - Create Table
Incremental Sqoop Import - Create Sqoop Job
Incremental Sqoop Import - Execute Job
Incremental Sqoop Import - Add Additional Data
Incremental Sqoop Import - Rerun Job
Incremental Sqoop Import - Using Last Modified
Here are the objectives for this course.
Provide Structure to the Data
Use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.
Create tables using a variety of data types, delimiters, and file formats
Create new tables using existing tables to define the schema
Improve query performance by creating partitioned tables in the metastore
Alter tables to modify the existing schema
Create views in order to simplify queries
Data Analysis
Use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.
Prepare reports using SELECT commands including unions and subqueries
Calculate aggregate statistics, such as sums and averages, during a query
Create queries against multiple data sources by using join commands
Transform the output format of queries by using built-in functions
Perform queries across a group of rows using windowing functions
Exercises will be provided to have enough practice to get better at Sqoop as well as writing queries using Hive and Impala.
All the demos are given on our state-of-the-art Big Data cluster. If you do not have multi-node cluster, you can sign up for our labs and practice on our multi-node cluster. You will be able to practice Sqoop and Hive on the cluster.