Sqoop, Hive and Impala for Data Analysts (Formerly CCA 159)

Rating: 4.2 (716 reviews)

Hands-on Sqoop, Hive and Impala for Data Analysts

Created by Durga Viswanatha Raju Gadiraju

Last updated 3/2023

English

What you'll learn

Overview of Big Data ecosystem tools such as Hadoop HDFS, YARN, Map Reduce, Sqoop, Hive, etc.
Overview of HDFS commands such as put or copyFromLocal, get or copyToLocal, cat, etc., along with concepts such as block size and replication factor
Managing Tables in Hive Metastore using DDL Commands
Load or Insert data into Hive Metastore Tables using commands such as LOAD and INSERT
Overview of Functions in Hive to manipulate strings, dates, etc
Writing Basic Hive QL Queries using WHERE, JOIN, GROUP BY, etc
Analytical or Windowing Functions in Hive
Overview of Impala and understanding similarities and differences between Hive and Impala
Getting Started with Sqoop by reviewing official documentation and also exploring commands such as Sqoop eval
Importing Data from RDBMS tables into HDFS using Sqoop Import
Importing Data from RDBMS tables into Hive tables using Sqoop Import
Exporting Data from Hive or HDFS to RDBMS tables using Sqoop Export
Incremental Imports using Sqoop Import into HDFS or Hive Tables

Course content

19 sections • 233 lectures • 20h 31m total length

CCA 159 Certification Exam - Overview (4:16)
Get an overview of the CCA 159 exam for data analysts, detailing eight to twelve scenarios, 120 minutes, and the data preparation, structuring, and analysis skills with Sqoop, Hive, and Impala.
Tools for preparation (2:32)
Getting Details about the Exam (6:40)
Signing up for the Exam (2:26)

Download and Install Virtual Box (7:35)
Learn to download and install Oracle VirtualBox on Mac, create Linux virtual machines, mount ISO images, configure storage and memory, and use Vagrant or vendor images to automate multi-VM setups.
Setup Cloudera QuickStart VM (4:57)
Overview of Cloudera QuickStart VM (7:19)
Explore the Cloudera QuickStart VM setup, including HDFS, YARN, Hive, Spark, and Sqoop, with Cloudera Manager and Hue interfaces for hands-on exam preparation.
Overview of MySQL Databases (6:22)
Setup NYSE Database in MySQL (12:07)
Overview of HDFS and Setup Datasets (10:51)
Overview of Hive and Create External Table (14:23)
Learn Hive basics by launching Hive, creating an external table against data in HDFS, and using Beeline for queries on a Cloudera QuickStart VM.
Validate Sqoop (9:00)

Signing up for the labs (2:33)
Sign up for the labs to access an 11-node cluster with Ambari, enabling affordable hands-on big data learning and preparation for Cloudera or Hortonworks certification exams.
Connecting to the gateway node of the cluster (3:38)
Overview of HDFS in the cluster (2:33)
Access HDFS from the cluster gateway node using hadoop fs commands to explore public datasets. Learn to set up passwordless SSH login and use command line tools for efficient access.
Using Hive in the cluster (5:04)
Understanding MySQL in the cluster (2:51)
Running Sqoop Commands in the cluster (3:09)

Overview of Distributions and Management Tools such as Ambari (5:57)
Properties and Properties Files of Big Data Tools - General Guidelines (8:51)
Hadoop Distributed File System - Quick Overview (3:54)
Distributed Computing using YARN and Map Reduce 2 - Quick Overview (6:06)
Submitting Map Reduce Job in YARN Framework (4:06)
Determining Number of Mappers and Reducers (4:28)
Understanding YARN and Map Reduce Configuration Properties (5:58)
Review the YARN and MapReduce 2 property files in the Hadoop conf directory, focusing on the ResourceManager address and memory settings for map and reduce tasks. Use the web UI and job history.
Reviewing and Overriding Map Reduce Job Run Time Properties (8:32)
Reviewing Map Reduce Job Logs - using Resource Manager and Job History Server UI (6:34)
Review MapReduce and YARN job logs via the Resource Manager and Job History Server UI, using the tracking URL to inspect task attempts and troubleshoot failures with syslog.
Map Reduce Job Counters (5:11)
Explore how to interpret MapReduce job counters to benchmark performance, examining map tasks, reduce tasks, combiner effects, file system counters, and final logs from the job history.
Overview of Hive (3:40)
Databases in Big Data and Query Engines (2:34)
Overview of Data Ingestion Tools in Big Data (4:25)
Learn how to ingest data into HDFS using Sqoop for batch loads and Flume or Kafka for streaming, preparing data for processing with MapReduce or Spark.

Introduction to HDFS for Certification Exams (1:47)
Overview of HDFS and Properties Files (10:18)
Overview of "hadoop fs" or "hdfs dfs" command (5:16)
Listing Files in HDFS (6:42)
Learn to list files in HDFS using the hadoop fs -ls command, with options -t, -S, -R, -h, -C, and -r to sort, display readable sizes, and recursively traverse directories.
User Spaces or Home Directories in HDFS (4:47)
Creating Directory in HDFS (5:27)
Create and manage directories in HDFS using hadoop fs commands such as mkdir, rm, and rmdir, including recursive options -p and -R to build or remove directory structures.
Copying Files and Directories into HDFS (8:02)
Copy files and directories from the local file system into HDFS using hadoop fs put or copyFromLocal, creating retail_db in the training user's home directory and handling existing targets carefully.
File and Directory Permissions Overview (4:16)
Getting Files and Directories from HDFS (4:37)
Previewing Text Files in HDFS - cat and tail (3:46)
Copying or Moving Files from one HDFS location to other HDFS location (5:41)
Understanding Size of the File System and Data Sets - using df and du (5:01)
Overview of Block Size and Replication Factor (5:41)
Getting metadata of files using "hdfs fsck" (4:55)
Resources and Exercises (3:57)
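
The block size and replication factor topics in this section come down to simple arithmetic. Here is a minimal Python sketch, assuming a 128 MB block size and a replication factor of 3 (common Hadoop defaults; your cluster may be configured differently). The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop tool.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster)."""
    # A file is split into fixed-size blocks; the last block holds only the remainder.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # Each block is stored `replication` times. A partial last block only
    # occupies its actual size, so raw usage is file size times replication.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(300 * MB)
print(blocks, raw // MB)  # 3 blocks and 900 MB of raw storage
```

On a real cluster, `hdfs fsck` reports the actual block count and replication for a file, which can be compared against this estimate.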

Overview of Hive Language Manual (4:44)
Launching and Using Hive CLI (6:28)
Overview of Hive Properties - SET and .hiverc (7:28)
Hive CLI History and .hiverc (4:55)
Explore Hive CLI history and .hiverc to override session properties, then launch Hive, use the up arrow or Ctrl-R to navigate commands, and set the Hive execution engine to Tez.
Running HDFS Commands using Hive CLI (2:28)
Understanding Warehouse Directory (3:56)
Creating Database in Hive and Switching to the Database (3:42)
Creating First Table in Hive and Listing the Tables (6:52)
Retrieve metadata of Hive Tables using DESCRIBE (EXTENDED and FORMATTED) (3:45)
Learn to retrieve complete Hive table metadata using DESCRIBE FORMATTED, compare with DESCRIBE EXTENDED, and interpret columns, data types, comments, database, owner, location, and storage details.
Overview of beeline - Alternative to Hive CLI (4:41)
Role of Hive Metastore (5:04)
Running Hive Queries using Beeline (4:50)

Create tables in Hive - orders (9:35)
Overview of Data Types in Hive (4:40)
Adding Comments to Columns and Tables (3:05)
Loading Data into Hive Tables from Local File System (9:29)
Learn how to load data from the local file system into a Hive table using LOAD DATA LOCAL, with a row format and fields terminated by comma for correct querying.
Loading Data into Hive Tables from HDFS Location (5:51)
Loading Data into Hive Tables - Overwrite vs. Append (3:26)
Creating External Tables in Hive (4:40)
Specifying Location for Hive Tables (6:44)
Managed Tables vs. External Tables (3:51)
Understand the difference between external and managed tables in Hive, including how drop table operations affect data and metadata and how a table location determines whether data remains.
Default Delimiters in Hive Tables using Text File Format (7:32)
Overview of File Formats - STORED AS Clause (2:47)
Differences between Hive and RDBMS (4:43)
Truncating and Dropping tables in Hive (7:59)
Resources and Exercises (4:11)

Introduction to Partitioning and Bucketing in Hive (2:35)
Creating Tables using orc File Format - order_items (6:42)
Inserting Data into order_items using stage table (5:04)
Can we use LOAD Command to get data into order_items with orc file format? (9:28)
Creating Partitioned Tables in Hive - orders_part with order_month as key (5:51)
Adding Partitions to Tables in Hive (4:44)
Loading into Partitions in Hive Tables (8:40)
Inserting Data into Partitions in Hive Tables (4:20)
Inserting data into Partitioned Tables - Using dynamic partition mode (6:56)
Creating Bucketed Tables - orders_buck and order_items_buck (3:33)
Inserting Data Into Bucketed Tables (4:35)
Insert data into the bucketed table from the stage table after enabling Hive bucketing, then verify eight buckets and data distribution with DESCRIBE FORMATTED and data previews.
Bucketing with Sorting (4:13)
Overview of ACID Transactions in Hive (4:19)
Create Tables for ACID Transactions (8:04)
Inserting individual records into Hive Tables (7:56)
Updating and Deleting data in Hive Bucketed Tables (7:50)
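
Bucketing assigns each row to a bucket by hashing the bucketing column and taking the remainder against the number of buckets. The Python sketch below illustrates the idea with eight buckets, as in the lecture summary above. It assumes an integer key whose hash is the value itself; the hash Hive actually applies depends on the column type and Hive version, and the function names here are hypothetical.

```python
from collections import defaultdict

def bucket_for(key, num_buckets=8):
    # Assumption: for an integer key the hash is the value itself;
    # the mask keeps it non-negative before taking the modulus.
    return (key & 0x7FFFFFFF) % num_buckets

def distribute(order_ids, num_buckets=8):
    """Group keys by the bucket they would land in."""
    buckets = defaultdict(list)
    for order_id in order_ids:
        buckets[bucket_for(order_id, num_buckets)].append(order_id)
    return dict(buckets)

# Sixteen consecutive ids spread evenly, two per bucket.
for bucket, ids in sorted(distribute(range(1, 17)).items()):
    print(bucket, ids)
```

Because the bucket depends only on the key, rows with the same key always land in the same bucket, which is what makes bucketed joins and sampling possible.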

Overview of Functions (3:05)
Validating Functions (3:23)
String Manipulation - Case Conversion and Length (4:19)
String Manipulation - substr and split (7:41)
String Manipulation - trimming and padding Functions (6:27)
Learn to clean strings with ltrim, rtrim, and trim, and pad values using lpad and rpad in Hive. Build correctly formatted dates from year, month, and day, and validate results.
String Manipulation - Reverse and Concatenating multiple strings (6:32)
Date Manipulation - Getting Current Date and Timestamp (2:17)
Discover how to get current date and current timestamp in Hive using operators, not functions, with default date-time formatting for filtering in where clauses.
Date Manipulation - Date Arithmetic (5:00)
Date Manipulation - trunc (2:41)
Date Manipulation - Extracting information using date_format (5:13)
Learn how to use the date_format function in Hive to extract year, month, day, and time components from dates, timestamps, or strings, and format outputs for queries.
Date Manipulation - Extracting information using year, month, day etc (2:38)
Date Manipulation - Dealing with Unix Timestamp (4:38)
Overview of Numeric Functions (7:21)
Type Cast Functions for Data Type Conversion (3:47)
Handling null values using nvl (1:52)
Query Example - Get Word Count (5:45)
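
The padding lecture above mentions building correctly formatted dates from year, month, and day. The following Python sketch mimics the documented behaviour of `lpad` (pad on the left up to a target length, truncate if the input is already longer) to show why padding matters. It is an illustration only, not Hive code; `build_date` is a hypothetical helper, and rejecting an empty pad string is a simplification of this sketch.

```python
def lpad(value, length, pad):
    """Left-pad `value` with `pad` up to `length`; truncate if it is already longer."""
    if not pad:
        raise ValueError("pad must be non-empty in this sketch")
    if len(value) >= length:
        return value[:length]
    missing = length - len(value)
    # Repeat the pad string and cut it to exactly the number of missing characters.
    return (pad * missing)[:missing] + value

def build_date(year, month, day):
    """Assemble a yyyy-MM-dd string with zero-padded month and day."""
    return "-".join([str(year), lpad(str(month), 2, "0"), lpad(str(day), 2, "0")])

print(build_date(2014, 7, 5))  # 2014-07-05
```

Without the padding, the result would be `2014-7-5`, which does not sort or compare correctly as a string against values such as `2014-07-05`.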

Overview of SQL (5:40)
Hive Query - Execution Life Cycle (4:38)
Reviewing Logs for Hive Queries (5:09)
Projecting Data using SELECT and Overview of FROM Clause (4:18)
Using CASE and WHEN as part of SELECT Clause (3:29)
Projecting DISTINCT Values (4:45)
Filtering Data using WHERE Clause (4:11)
Learn how to filter data in Hive with the WHERE clause, using equal, not equal, greater than, and numeric versus string comparisons on orders and order items.
Boolean Operations such as OR and AND using multiple fields (7:03)
Boolean OR vs. IN (3:50)
Filtering data using LIKE Operator (3:31)
Explore the LIKE operator in the WHERE clause for partial date matching on orders, filtering by 2014% and the 07 pattern to target July, with GROUP BY for counts.
Basic Aggregations using Aggregate Functions (3:41)
Performing basic aggregations such as SUM, MIN, MAX etc using GROUP BY (8:05)
Filtering post aggregation using HAVING (3:34)
Global Sorting using ORDER BY (4:28)
Overview of DISTRIBUTE BY (5:22)
Sorting Data within groups using DISTRIBUTE BY and SORT BY (12:06)
Overview of CLUSTER BY (4:55)
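
The LIKE lecture above filters order dates with patterns such as `2014%`. As a rough illustration of how LIKE wildcards behave, this Python sketch translates a pattern into a regular expression, where `%` matches any run of characters and `_` matches exactly one. The sample dates are made up for the example, and escape characters are not handled.

```python
import re

def like(value, pattern):
    """Return True if `value` matches a SQL LIKE `pattern` (% = any run, _ = one char)."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    # fullmatch mirrors LIKE, which must match the whole value, not a substring.
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

# Hypothetical order dates in a yyyy-MM-dd HH:mm:ss.S layout.
order_dates = ["2013-07-25 00:00:00.0", "2014-01-15 00:00:00.0", "2014-07-04 00:00:00.0"]

print([d for d in order_dates if like(d, "2014%")])     # orders placed in 2014
print([d for d in order_dates if like(d, "%-07-%")])    # orders placed in July of any year
print([d for d in order_dates if like(d, "2014-07%")])  # orders placed in July 2014
```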

Requirements

A 64-bit computer with at least 8 GB RAM is highly desired
Access to a multi-node cluster or our ITVersity Labs (paid subscription required)
Set up Cloudera QuickStart VM on high-end laptops (16 GB RAM and quad core) - instructions provided but not supported
Basic computer skills
Ability to write basic SQL queries and use a Linux-based environment

Description

As part of Sqoop, Hive, and Impala for Data Analysts (Formerly CCA 159), you will learn key skills such as Sqoop, Hive, and Impala.

This comprehensive course covers all aspects of the certification with real-world examples and data sets.

Overview of Big Data ecosystem

Overview of Distributions and Management Tools
Properties and Properties Files - General Guidelines
Hadoop Distributed File System
YARN and Map Reduce 2
Submitting Map Reduce Job
Determining Number of Mappers and Reducers
Understanding YARN and Map Reduce Configuration Properties
Review and Override Job Properties
Reviewing Map Reduce Job Logs
Map Reduce Job Counters
Overview of Hive
Databases and Query Engines
Overview of Data Ingestion in Big Data
Data Processing using Spark

HDFS Commands to manage files

Introduction to HDFS for Certification Exams
Overview of HDFS and Properties Files
Overview of Hadoop CLI
Listing Files in HDFS
User Spaces or Home Directories in HDFS
Creating Directories in HDFS
Copying Files and Directories into HDFS
File and Directory Permissions Overview
Getting Files and Directories from HDFS
Previewing Text Files in HDFS
Copying or Moving Files and Directories within HDFS
Understanding Size of File System and Files
Overview of Block Size and Replication Factor
Getting File Metadata using hdfs fsck
Resources and Exercises

Getting Started with Hive

Overview of Hive Language Manual
Launching and using Hive CLI
Overview of Hive Properties
Hive CLI History and hiverc
Running HDFS Commands in Hive CLI
Understanding Warehouse Directory
Creating and Using Hive Databases
Creating and Describing Hive Tables
Retrieve Metadata of Tables using DESCRIBE
Role of Hive Metastore Database
Overview of beeline
Running Hive Commands and Queries using beeline

Creating Tables in Hive using Hive QL

Creating Tables in Hive - orders
Overview of Basic Data Types in Hive
Adding Comments to Columns and Tables
Loading Data into Hive Tables from Local File System
Loading Data into Hive Tables from HDFS
Loading Data - Overwrite vs Append
Creating External tables in Hive
Specifying Location for Hive Tables
Difference between Managed Table and External Table
Default Delimiters in Hive Tables using Text File
Overview of File Formats in Hive
Differences between Hive and RDBMS
Truncate and Drop tables in Hive
Resources and Exercises

Loading/Inserting data into Hive tables using Hive QL

Introduction to Partitioning and Bucketing
Creating Tables using Orc Format - order_items
Inserting Data into Tables using Stage Tables
Load vs. Insert in Hive
Creating Partitioned Tables in Hive
Adding Partitions to Tables in Hive
Loading into Partitions in Hive Tables
Inserting Data Into Partitions in Hive Tables
Insert Using Dynamic Partition Mode
Creating Bucketed Tables in Hive
Inserting Data into Bucketed Tables
Bucketing with Sorting
Overview of ACID Transactions
Create Tables for Transactions
Inserting Individual Records into Hive Tables
Update and Delete Data in Hive Tables

Overview of functions in Hive

Overview of Functions
Validating Functions
String Manipulation - Case Conversion and Length
String Manipulation - substr and split
String Manipulation - Trimming and Padding Functions
String Manipulation - Reverse and Concatenating Multiple Strings
Date Manipulation - Current Date and Timestamp
Date Manipulation - Date Arithmetic
Date Manipulation - trunc
Date Manipulation - Using date format
Date Manipulation - Extract Functions
Date Manipulation - Dealing with Unix Timestamp
Overview of Numeric Functions
Data Type Conversion Using Cast
Handling Null Values
Query Example - Get Word Count

Writing Basic Queries in Hive

Overview of SQL or Hive QL
Execution Life Cycle of Hive Query
Reviewing Logs of Hive Queries
Projecting Data using Select and Overview of From
Derive Conditional Values using CASE and WHEN
Projecting Distinct Values
Filtering Data using Where Clause
Boolean Operations in Where Clause
Boolean OR vs IN Operator
Filtering Data using LIKE Operator
Performing Basic Aggregations using Aggregate Functions
Performing Aggregations using GROUP BY
Filtering Aggregated Data Using HAVING
Global Sorting using ORDER BY
Overview of DISTRIBUTE BY
Sorting Data within Groups using SORT BY
Using CLUSTER BY

Joining Data Sets and Set Operations in Hive

Overview of Nested Sub Queries
Nested Sub Queries - Using IN Operator
Nested Sub Queries - Using EXISTS Operator
Overview of Joins in Hive
Performing Inner Joins using Hive
Performing Outer Joins using Hive
Performing Full Outer Joins using Hive
Map Side Join and Reduce Side Join in Hive
Joining in Hive using Legacy Syntax
Cross Joins in Hive
Overview of Set Operations in Hive
Perform Set Union between two Hive Query Results
Set Operations - Intersect and Minus Not Supported

Windowing or Analytics Functions in Hive

Prepare HR Database in Hive with Employees Table
Overview of Analytics or Windowing Functions in Hive
Performing Aggregations using Hive Queries
Create Tables to Get Daily Revenue using CTAS in Hive
Getting Lead and Lag using Windowing Functions in Hive
Getting First and Last Values using Windowing Functions in Hive
Applying Rank using Windowing Functions in Hive
Applying Dense Rank using Windowing Functions in Hive
Applying Row Number using Windowing Functions in Hive
Difference Between rank, dense_rank, and row_number in Hive
Understanding the order of execution of Hive Queries
Overview of Nested Sub Queries in Hive
Filtering Data on Top of Window Functions in Hive
Getting Top 5 Products by Revenue for Each Day using Windowing Functions in Hive - Recap
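
The difference between rank, dense_rank, and row_number only shows up when there are ties. The Python sketch below reproduces the three numbering schemes over a single group ordered by value descending. It is an illustration of the semantics rather than Hive code; the function name and sample revenues are made up.

```python
def rank_functions(values):
    """Return (value, rank, dense_rank, row_number) tuples, ordered by value descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number  # rank jumps to the row position, leaving gaps after ties
            dense += 1         # dense_rank advances by exactly one per distinct value
            previous = value
        result.append((value, rank, dense, row_number))
    return result

for row in rank_functions([200, 300, 100, 200]):
    print(row)
# (300, 1, 1, 1)
# (200, 2, 2, 2)
# (200, 2, 2, 3)
# (100, 4, 3, 4)
```

The tied rows share rank 2 under both rank and dense_rank, but the next row is numbered 4 by rank and 3 by dense_rank, while row_number never repeats.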

Running Queries using Impala

Introduction to Impala
Role of Impala Daemons
Impala State Store and Catalog Server
Overview of Impala Shell
Relationship between Hive and Impala
Overview of Creating Databases and Tables using Impala
Loading and Inserting Data into Tables using Impala
Running Queries using Impala Shell
Reviewing Logs of Impala Queries
Synching Hive and Impala - Using Invalidate Metadata
Running Scripts using Impala Shell
Assignment - Using NYSE Data
Assignment - Solution

Getting Started with Sqoop

Introduction to Sqoop
Validate Source Database - MySQL
Review JDBC Jar to Connect to MySQL
Getting Help using Sqoop CLI
Overview of Sqoop User Guide
Validate Sqoop and MySQL Integration using Sqoop List Databases
Listing Tables in Database using Sqoop
Run Queries in MySQL using Sqoop Eval
Understanding Logs in Sqoop
Redirecting Sqoop Job Logs into Log Files

Importing data from MySQL to HDFS using Sqoop Import

Overview of Sqoop Import Command
Import Orders using target-dir
Import Order Items using warehouse-dir
Managing HDFS Directories
Sqoop Import Execution Flow
Reviewing Logs of Sqoop Import
Sqoop Import Specifying Number of Mappers
Review the Output Files generated by Sqoop Import
Sqoop Import Supported File Formats
Validating avro files using Avro Tools
Sqoop Import Using Compression

Apache Sqoop - Importing Data into HDFS - Customizing

Introduction to customizing Sqoop Import
Sqoop Import by Specifying Columns
Sqoop import Using Boundary Query
Sqoop import while filtering Unnecessary Data
Sqoop Import Using Split By to distribute import using non default column
Getting Query Results using Sqoop eval
Dealing with tables with Composite Keys while using Sqoop Import
Dealing with tables with Non Numeric Key Fields while using Sqoop Import
Dealing with tables with No Key Fields while using Sqoop Import
Using autoreset-to-one-mapper to use only one mapper while importing data using Sqoop from tables with no key fields
Default Delimiters used by Sqoop Import for Text File Format
Specifying Delimiters for Sqoop Import using Text File Format
Dealing with Null Values using Sqoop Import
Import Multiple Tables from source database using Sqoop Import
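
Several topics above (boundary query, split-by, number of mappers) relate to how Sqoop divides an import among parallel tasks: it finds the minimum and maximum of the split column and gives each mapper a slice of that range. The Python sketch below shows the idea with an even split across four mappers, which is Sqoop's default mapper count. The function name is hypothetical, and Sqoop's own splitter may place the boundaries slightly differently.

```python
def split_ranges(lo, hi, num_mappers=4):
    """Divide the inclusive range [lo, hi] into contiguous inclusive slices, one per mapper."""
    total = hi - lo + 1
    if total <= 0:
        return []
    n = min(num_mappers, total)     # never create more slices than there are values
    base, extra = divmod(total, n)  # the first `extra` slices get one additional value
    ranges = []
    start = lo
    for i in range(n):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Boundary values 1 and 100 stand in for the result of a MIN/MAX boundary query.
print(split_ranges(1, 100))  # [(1, 25), (26, 50), (51, 75), (76, 100)]
```

This also suggests why a skewed or sparse split column leads to uneven mapper workloads: the ranges are equal in width, not in row count.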

Importing data from MySQL to Hive Tables using Sqoop Import

Quick Overview of Hive
Create Hive Database for Sqoop Import
Create Empty Hive Table for Sqoop Import
Import Data into Hive Table from source database table using Sqoop Import
Managing Hive Tables while importing data using Sqoop Import using Overwrite
Managing Hive Tables while importing data using Sqoop Import - Errors Out If Table Already Exists
Understanding Execution Flow of Sqoop Import into Hive tables
Review Files generated by Sqoop Import in Hive Tables
Sqoop Delimiters vs Hive Delimiters
Different File Formats supported by Sqoop Import while importing into Hive Tables
Sqoop Import all Tables into Hive from source database

Exporting Data from HDFS/Hive to MySQL using Sqoop Export

Introduction to Sqoop Export
Prepare Data for Sqoop Export
Create Table in MySQL for Sqoop Export
Perform Simple Sqoop Export from HDFS to MySQL table
Understanding Execution Flow of Sqoop Export
Specifying Number of Mappers for Sqoop Export
Troubleshooting the Issues related to Sqoop Export
Merging or Upserting Data using Sqoop Export - Overview
Quick Overview of MySQL - Upsert using Sqoop Export
Update Data using Update Key using Sqoop Export
Merging Data using allowInsert in Sqoop Export
Specifying Columns using Sqoop Export
Specifying Delimiters using Sqoop Export
Using Stage Table for Sqoop Export
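
The update-key and allowInsert topics above correspond to two export behaviours: updating only rows that already exist in the target table, or updating existing rows and inserting the rest (an upsert). This Python sketch models the target table as a dictionary to show the difference. It is a conceptual illustration, not how Sqoop is implemented; the function, column names, and rows are made up.

```python
def export_rows(target, rows, update_key, mode="updateonly"):
    """Apply exported `rows` to `target`, a dict keyed by the update-key value."""
    if mode not in ("updateonly", "allowinsert"):
        raise ValueError(f"unknown mode: {mode}")
    for row in rows:
        key = row[update_key]
        if key in target:
            target[key] = {**target[key], **row}  # existing row: update its columns
        elif mode == "allowinsert":
            target[key] = dict(row)               # new row: insert only when allowed
        # In updateonly mode, rows without a match in the target are skipped.
    return target

exported = [
    {"order_date": "2014-07-01", "revenue": 150.0},  # already in the target
    {"order_date": "2014-07-02", "revenue": 80.0},   # not yet in the target
]

table = {"2014-07-01": {"order_date": "2014-07-01", "revenue": 100.0}}
print(export_rows(dict(table), exported, "order_date", "updateonly"))
print(export_rows(dict(table), exported, "order_date", "allowinsert"))
```

In the first call only the existing date is updated; in the second the new date is added as well.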

Submitting Sqoop Jobs and Incremental Sqoop Imports

Introduction to Sqoop Jobs
Adding Password File for Sqoop Jobs
Creating Sqoop Job
Run Sqoop Job
Overview of Incremental Loads using Sqoop
Incremental Sqoop Import - Using Where
Incremental Sqoop Import - Using Append Mode
Incremental Sqoop Import - Create Table
Incremental Sqoop Import - Create Sqoop Job
Incremental Sqoop Import - Execute Job
Incremental Sqoop Import - Add Additional Data
Incremental Sqoop Import - Rerun Job
Incremental Sqoop Import - Using Last Modified
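
The incremental import topics above follow one pattern: remember the highest value of a check column from the previous run, import only rows above it, then store the new high-water mark. The Python sketch below simulates append mode over two runs. It is an illustration of the bookkeeping, not Sqoop code; the function name and sample rows are made up.

```python
def incremental_append(source_rows, check_column, last_value):
    """Return (rows newer than last_value, updated last_value)."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    # If nothing new arrived, the stored value stays where it was.
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last

source = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]

# First run: start from 0, so every row is imported.
imported, last_value = incremental_append(source, "order_id", 0)
print(len(imported), last_value)  # 3 3

# More rows arrive in the source table before the job is rerun.
source += [{"order_id": 4}, {"order_id": 5}]

# Second run: only rows above the stored last value are imported.
imported, last_value = incremental_append(source, "order_id", last_value)
print([row["order_id"] for row in imported], last_value)  # [4, 5] 5
```

A saved Sqoop job keeps track of the last value between runs, which is why the sections above create a job before rerunning the import.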

Here are the objectives for this course.

Provide Structure to the Data

Use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.

Create tables using a variety of data types, delimiters, and file formats
Create new tables using existing tables to define the schema
Improve query performance by creating partitioned tables in the metastore
Alter tables to modify the existing schema
Create views in order to simplify queries

Data Analysis

Use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.

Prepare reports using SELECT commands including unions and subqueries
Calculate aggregate statistics, such as sums and averages, during a query
Create queries against multiple data sources by using join commands
Transform the output format of queries by using built-in functions
Perform queries across a group of rows using windowing functions

Exercises are provided so that you get enough practice with Sqoop and with writing queries using Hive and Impala.

All the demos are given on our state-of-the-art Big Data cluster. If you do not have a multi-node cluster, you can sign up for our labs and practice Sqoop and Hive on our multi-node cluster.

Who this course is for:

Any Big Data professional or aspirant who wants to learn about databases and query interfaces in Big Data
Any Business Intelligence professional who wants to understand data analysis in the Big Data ecosystem
Any IT professional who wants to prepare for the CCA 159 Data Analyst exam

Course content

Introduction: 4 lectures • 16min
Using Cloudera QuickStart VM: 8 lectures • 1hr 13min
Using ITVersity labs: 6 lectures • 20min
Overview of Big Data eco system: 13 lectures • 1hr 10min
Overview of HDFS Commands: 15 lectures • 1hr 20min
Apache Hive - Getting Started: 12 lectures • 59min
Apache Hive - Managing Tables in Hive: 14 lectures • 1hr 19min
Apache Hive - Managing Tables in Hive - Partitioning and Bucketing: 16 lectures • 1hr 35min
Apache Hive - Overview of Functions: 16 lectures • 1hr 13min
Apache Hive - Writing Basic Queries: 17 lectures • 1hr 29min
