From 0 to 1: Hive for Processing Big Data

Name: From 0 to 1: Hive for Processing Big Data
Rating: 4.5 (1042 reviews)

End-to-End Hive: HQL, Partitioning, Bucketing, UDFs, Windowing, Optimization, Map Joins, Indexes

Created by Loony Corn

Last updated 1/2018

English

What you'll learn

Write complex analytical queries on data in Hive and uncover insights
Leverage partitioning and bucketing to optimize queries in Hive
Customize Hive with user-defined functions in Java and Python
Understand what goes on under the hood of Hive with HDFS and MapReduce

Course content

19 sections • 87 lectures • 15h 15m total length

You, Us & This Course (2:02)
We start with an introduction. What is the course about? What will you know at the end of the course?

Hive: An Open-Source Data Warehouse (12:59)
Data warehousing systems - which have become the rage with the rise of 'Big Data' - are quite different from traditional transaction processing systems. Hive is a prototypical data warehousing system.
Hive and Hadoop (9:19)
Hive is built atop Hadoop, and can even be characterized as the SQL skin atop Hadoop MapReduce.
Hive vs Traditional Relational DBMS (13:52)
Hive tries really hard - and mostly succeeds - at pretending to be a relational DBMS, but under the hood it's quite different. Understand how, and understand schema-on-read.
HiveQL and SQL (7:20)
Now that we understand the differences between Hive and a traditional RDBMS, the differences between HiveQL and SQL will seem a lot less annoying and arbitrary.
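
To make schema-on-read a little more concrete, here is a minimal Python sketch of the idea rather than anything Hive itself runs. The file contents, the `SCHEMA` list and the `read_with_schema` helper are all hypothetical names made up for illustration. The point is that the write step never validates anything; the schema is only applied when rows are read, and fields that are missing or malformed come back as `None`, standing in for the NULL a reader would typically see.

```python
# Hypothetical raw file contents: written as-is, with no schema validation.
RAW_LINES = ["1,alice,34", "2,bob,not_a_number", "3,carol"]

# Hypothetical table schema, applied only when the data is read.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(line, schema):
    """Parse one raw line against the schema, using None for bad or missing fields."""
    fields = line.split(",")
    row = {}
    for position, (column, cast) in enumerate(schema):
        try:
            row[column] = cast(fields[position])
        except (IndexError, ValueError):
            row[column] = None  # stands in for NULL
    return row


rows = [read_with_schema(line, SCHEMA) for line in RAW_LINES]
for row in rows:
    print(row)
```

A traditional RDBMS would instead reject the second and third lines at write time, which is the schema-on-write behaviour the lecture contrasts with.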

Hadoop Install Modes (8:32)
Before we install Hive, we need to install Hadoop. Hadoop has 3 different install modes - Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.
Hadoop Install Step 1: Standalone Mode (15:46)
How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.
Hadoop Install Step 2: Pseudo-Distributed Mode (11:44)
Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running!
Hive Install (12:05)
If you are all set with Hadoop, let's go ahead and install Hive.
Code-Along: Getting Started (6:24)
Let's run a few basic queries on Hive. Head on over to the SQL primer section at the end of the course if you have no previous experience with SQL.

Primitive Datatypes (17:07)
Let's cycle through primitive datatypes in Hive.
Collections: Arrays and Maps (9:28)
Hive has some really cool datatypes - collections that make it feel like there is a real programming language under the hood. Oh, and btw - there is!
Structs and Unions (5:57)
Structs and unions are yet another bit of Hive that seems more at home in a programming language.
Create Table (13:15)
Let's get into the nitty-gritty - starting with creating tables. Remember schema-on-read?
Insert Into Table (12:04)
Inserting into tables has a few quirks in Hive because, after all, all writes are just data dumps that know nothing about the schema.
Insert Into Table 2 (6:51)
More on inserts - remember that no schema checking happens during database writes!
Alter Table (7:22)
Alter table works in Hive - understand how.
HDFS (9:25)
Hive data is stored as files on HDFS, the distributed file system that is an integral part of Hadoop. Understanding the physical layout of Hive tables will make many advanced concepts - bucketing and partitioning - far clearer.
HDFS CLI - Interacting with HDFS (10:58)
Learn how to interact with HDFS. This comes in handy if you want to understand what's going on under the hood of your Hive queries.
Code-Along: Create Table (9:54)
Let's create a few tables and see how to insert data. We'll see external tables as well and what happens under the hood in HDFS with each of these activities.
Code-Along: Hive CLI (3:06)
Hive CLI allows you to run scripts and execute queries directly from the command line rather than the hive shell.

Three Types of Hive Functions (6:45)
Hive has a whole bunch of useful functions available out-of-the-box. This is an introduction to the 3 types of functions available: standard, aggregate and table-generating functions.
The Case-When Statement, the Size Function, the Cast Function (10:09)
The case-when statement is very useful to populate columns by evaluating conditions. Size() and Cast() are other useful built-in functions.
The Explode Function (13:06)
explode() is a very interesting table-generating function which expands an array to produce a row for every element in the array.
Code-Along: Hive Built-in Functions (4:28)
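
The row-per-element behaviour of explode() described above can be sketched in plain Python. This is only an illustration of the idea, not Hive's implementation; the `explode` helper, its `out_col` parameter and the sample `employees` data are hypothetical. The sketch keeps the other columns alongside each element, which is roughly what combining explode() with a lateral view gives you in a query.

```python
def explode(rows, array_col, out_col):
    """Yield one output row per element of row[array_col], keeping the other columns."""
    for row in rows:
        for element in row[array_col]:
            out = {k: v for k, v in row.items() if k != array_col}
            out[out_col] = element
            yield out


# Hypothetical input: each employee row carries an array of subordinates.
employees = [
    {"name": "ann", "subordinates": ["bob", "cy"]},
    {"name": "dee", "subordinates": []},
]

exploded = list(explode(employees, "subordinates", "subordinate"))
for row in exploded:
    print(row)
```

Note that in this sketch a row with an empty array contributes no output rows at all.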

Quirky Sub-Queries (7:13)
Sub-queries in Hive are rather quirky. For instance, union is fine, but intersect is not.
More on Subqueries: Exists and In (15:13)
Sub-queries have a few rather arcane rules - no equality signs, and some rather specific rules on exists and in.
Inserting via Subqueries (5:23)
It is possible to insert data into a table using subqueries - just don't try to specify any schema information!
Code-Along: Use Subqueries to Work with Collection Datatypes (5:56)
Create a Hive table with collection data types like arrays and structs, including an address struct and a subordinates array, then insert and access them via struct and array functions.
Views (12:18)
Views are an awesome bit of functionality in Hive - use them. Oh, btw, views are non-materialized, if that means anything to you. If not - never mind!

Indices (6:40)
Indices are just a lot less important in Hive than they are in SQL. Understand why, and also how they can be used.
Partitioning Introduced (6:36)
Partitioning in Hive is conceptually similar to indexing in a traditional DBMS - a way to quickly look up rows with specific values in a particular column.
The Rationale for Partitioning (6:16)
Let's understand the why of partitioning.
How Tables are Partitioned (9:52)
Partitioning needs to be specified at the time of table creation - understand the syntax.
Using Partitioned Tables (5:27)
Once a table has been partitioned appropriately, using it is not a lot of work.
Dynamic Partitioning: Inserting Data into Partitioned Tables (12:44)
Inserting data into partitioned tables can be a bit tedious - understand how dynamic partitioning can help!
Code-Along: Partitioning (4:03)
Let's see partitioning in action!
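
As a rough mental model of why partitioning speeds up lookups, the Python sketch below mimics the physical layout: each distinct value of the partition column gets its own `column=value` directory, so a query that filters on that column only has to read one directory. The table path `/warehouse/sales`, the helper names and the sample rows are assumptions for illustration only. Deriving the directory from each row's value, as done here, is also the basic idea behind dynamic partitioning.

```python
from collections import defaultdict

TABLE_DIR = "/warehouse/sales"  # hypothetical table location


def write_partitioned(rows, partition_col, table_dir=TABLE_DIR):
    """Group rows into 'directories' named <table_dir>/<col>=<value>."""
    layout = defaultdict(list)
    for row in rows:
        # The partition value lives in the directory name, not in the data rows.
        data = {k: v for k, v in row.items() if k != partition_col}
        layout[f"{table_dir}/{partition_col}={row[partition_col]}"].append(data)
    return dict(layout)


def read_partition(layout, partition_col, value, table_dir=TABLE_DIR):
    """Read only the directory for one partition value, skipping all others."""
    return layout.get(f"{table_dir}/{partition_col}={value}", [])


sales = [
    {"country": "US", "amount": 10},
    {"country": "IN", "amount": 7},
    {"country": "US", "amount": 3},
]
layout = write_partitioned(sales, "country")
us_rows = read_partition(layout, "country", "US")
print(sorted(layout))
print(us_rows)
```

One directory per distinct value is also why a high-cardinality partition column can produce a very large number of partitions.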

Introducing Bucketing (11:56)
Bucketing is conceptually quite close to partitioning - and indeed to indexing in a traditional RDBMS - but with a key difference.
The Advantages of Bucketing (4:54)
Bucketing has an important advantage over partitioning - the metastore is unlikely to be taken down by it.
How Tables are Bucketed (12:36)
Bucketing needs to be specified at the time of table creation - understand how.
Using Bucketed Tables (7:22)
Once a table has been bucketed, using it is not that difficult.
Sampling (11:13)
Sampling is a very handy technique in a data warehouse, and bucketing helps power this functionality.
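
A minimal Python sketch of bucketing and bucket-based sampling follows. It assumes an integer key and uses `key % num_buckets` as the bucket function; the real hash Hive applies depends on the column type, so treat that as a simplification. The helper names and sample rows are hypothetical. Because the number of buckets is fixed when the table is created, the number of files stays bounded no matter how many distinct key values arrive.

```python
def bucket_index(key, num_buckets):
    """Assign an integer key to a bucket (simplified stand-in for hash-mod-N)."""
    return key % num_buckets


def write_bucketed(rows, key_col, num_buckets):
    """Distribute rows across a fixed number of buckets by their key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_index(row[key_col], num_buckets)].append(row)
    return buckets


def sample_bucket(buckets, x):
    """Return bucket x (1-based), i.e. read one bucket out of len(buckets)."""
    return buckets[x - 1]


users = [{"user_id": i} for i in range(10)]
buckets = write_bucketed(users, "user_id", 4)
sample = sample_bucket(buckets, 1)
print([len(b) for b in buckets])
print(sample)
```

Reading a single bucket gives a sample of roughly one quarter of the rows here without scanning the other three buckets.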

Windowing Introduced (12:59)
Windowing functions start to get at the real number-crunching power of Hive. In effect, they help tack on a new column to a query result - and that column contains the results of aggregate functions on a window of rows.
Windowing - A Simple Example: Cumulative Sum (9:39)
Let's use windowing to set up a running total, aka a cumulative sum, for revenues in a sales table.
Windowing - A More Involved Example: Partitioning (11:54)
Let's now make that running sum reset each day - combining the power of windowing and the power of partitioning.
Windowing - Special Aggregation Functions (15:07)
Row number, rank, lead and lag - Hive places really nifty windowing functions at your disposal.
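
The two running-total examples above can be sketched in Python to show what the extra windowed column contains. This is an illustration of the concept, not the course's own code; the `running_total` helper, the column names and the sample `sales` rows are hypothetical, and the sketch assumes the ordering column has no ties. Without a partition column the total accumulates over the whole table; with one, it restarts for each partition value (here, each day).

```python
def running_total(rows, value_col, order_col, partition_col=None):
    """Add a 'running_total' column: cumulative sum of value_col in order_col order.

    If partition_col is given, the sum restarts for each distinct partition value.
    """
    totals = {}
    result = []
    for row in sorted(rows, key=lambda r: r[order_col]):
        key = row[partition_col] if partition_col is not None else None
        totals[key] = totals.get(key, 0) + row[value_col]
        result.append({**row, "running_total": totals[key]})
    return result


sales = [
    {"day": "mon", "ts": 1, "revenue": 10},
    {"day": "mon", "ts": 2, "revenue": 5},
    {"day": "tue", "ts": 3, "revenue": 7},
    {"day": "tue", "ts": 4, "revenue": 3},
]

overall = [r["running_total"] for r in running_total(sales, "revenue", "ts")]
per_day = [r["running_total"] for r in running_total(sales, "revenue", "ts", "day")]
print(overall)
print(per_day)
```

Every input row still appears in the output; the window only adds a column, which is the contrast with a plain group-by aggregate.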

Requirements

Hive requires knowledge of SQL. If you don't know SQL, please head to the SQL primer at the end of the course first.
You'll need to know Java if you are interested in the sections on custom user-defined functions.
No other prerequisites: The course covers everything you need to install Hive and run queries!

Description

Prerequisites: Hive requires knowledge of SQL. The course includes an SQL primer at the end. Please do that first if you don't know SQL. You'll need to know Java if you want to follow the sections on custom functions.

Taught by a 4-person team including 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience in working with large-scale data.

Hive is like a new friend with an old face (SQL). This course is an end-to-end, practical guide to using Hive for Big Data processing.

Let's parse that.

A new friend with an old face: Hive helps you leverage the power of distributed computing and Hadoop for analytical processing. Its interface is like an old friend: the very SQL-like HiveQL. This course will fill in all the gaps between SQL and what you need to use Hive.

End-to-End: The course is an end-to-end guide for using Hive: whether you are an analyst who wants to process data or an engineer who needs to build custom functionality or optimize performance - everything you'll need is right here. New to SQL? No need to look elsewhere. The course has a primer on all the basic SQL constructs.

Practical: Everything is taught using real-life examples, working queries and code.

What's Covered:

Analytical Processing: Joins, Subqueries, Views, Table Generating Functions, Explode, Lateral View, Windowing and more

Tuning Hive for better functionality: Partitioning, Bucketing, Join Optimizations, Map Side Joins, Indexes, Writing custom User Defined Functions in Java (UDF, UDAF, GenericUDF, GenericUDTF), Custom functions in Python, Implementation of MapReduce for Select, Group by and Join

For SQL Newbies: SQL In Great Depth

Who this course is for:

Yep! Analysts who want to write complex analytical queries on large scale data
Yep! Engineers who want to know more about managing Hive as their data warehousing solution


Course content

You, Us & This Course: 1 lecture • 2min

Introducing Hive: 4 lectures • 44min

Hadoop and Hive Install: 5 lectures • 55min

Hadoop and HDFS Overview: 2 lectures • 18min

Hive Basics: 11 lectures • 1hr 45min

Built-in Functions: 4 lectures • 34min

Sub-Queries: 5 lectures • 46min

Partitioning: 7 lectures • 52min

Bucketing: 5 lectures • 48min

Windowing: 4 lectures • 50min
