Pig For Wrangling Big Data

Name: Pig For Wrangling Big Data
Rating: 4.0 (103 reviews)

Extract, Transform and Load data using Pig to harness the power of Hadoop

Created by Loony Corn

Last updated 11/2016

English

What you'll learn

Work with unstructured data to extract information, transform it and store it in a usable form
Write intermediate level Pig scripts to munge data
Optimize Pig operations which work on large data sets

Course content

9 sections • 35 lectures • 5h 24m total length

You, This Course and Us • 1:46

Pig and the Hadoop ecosystem • 9:37
Many technologies work in the Hadoop eco-system. Where does Pig fit in, and why would we use it?
Install and set up • 8:50
Let's walk through the installation for Pig. Please download and use both the attached resources.
How does Pig compare with Hive? • 10:15
How does Pig stack up against Hive? When would we use Pig vs Hive?
Pig Latin as a data flow language • 6:17
Pig Latin is a data flow language: data "flows" through the script and is transformed by operations until it can be stored in a data warehouse.
Pig with HBase • 5:18
Pig can work with HBase as well, though this course focuses on Hadoop and HDFS.
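
To make the "data flow" idea concrete, here is a minimal sketch in plain Python (not Pig Latin) of a load, filter, transform, store chain. The function names, the sample rows and the comma-separated layout are all assumptions made for illustration; a real Pig script would express the same steps with its own operators.

```python
import csv
import io

# Hypothetical sample input: user, HTTP status, bytes sent.
RAW = "alice,200,1532\nbob,404,0\ncarol,200,87\n"


def load(text):
    """Yield one tuple of string fields per comma-separated line."""
    for row in csv.reader(io.StringIO(text)):
        yield tuple(row)


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return (row for row in rows if predicate(row))


def foreach(rows, fn):
    """Apply a per-row transformation and yield the new rows."""
    return (fn(row) for row in rows)


def store(rows):
    """Serialize the rows back to comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def run_flow(raw=RAW):
    rows = load(raw)                                       # load
    ok = filter_rows(rows, lambda r: r[1] == "200")        # filter
    slim = foreach(ok, lambda r: (r[0], int(r[2])))        # project and cast
    return store(slim)                                     # store


if __name__ == "__main__":
    print(run_flow(), end="")
```

Each step consumes the previous step's output lazily, so nothing is computed until `store` pulls rows through the chain.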

Operating modes, running a Pig script, the Grunt shell • 9:52
Pig can run in local mode or on a cluster. See how to run Pig scripts and get introduced to the Grunt shell.
Loading data and creating our first relation • 8:45
Getting started: loading data into a relation.
Scalar data types • 9:55
The basic data types which exist in Pig.
Complex data types - The Tuple, Bag and Map • 13:45
Pig has more complex ways of grouping and storing data. Learn how to use these.
Partial schema specification for relations • 10:00
Pig consumes any kind of data and does not require a well-defined schema. You may not know the schema at all, or know it only partially.
Displaying and storing relations - The dump and store commands • 3:54
Display results on screen or store them to a file.

Selecting fields from a relation • 10:22
Choose certain fields and drop the ones you're not interested in.
Built-in functions • 5:08
Built-in functions in Pig are super powerful. They give you a wide variety of ways to munge data without having to write much code!
Evaluation functions • 10:31
These transform data by aggregating, summarizing and calculating new information from the fields available.
Using the distinct, limit and order by keywords • 5:04
Generate unique records, limit the number of records or order records based on a field.
Filtering records based on a predicate • 11:01
Choose only the records you're interested in by specifying conditions.

Group by and aggregate transformations • 12:12
This is where Pig gets interesting. Grouping data and aggregations can be performed on unstructured data as well.
Combining datasets using Join • 16:19
Bring together multiple relations into one unit by using Join.
Concatenating datasets using Union • 4:32
Concatenating data sets works even when the schemas are not the same or there are additional columns!
Generating multiple records by flattening complex fields • 5:24
Nested or grouped data in a single record can be flattened and converted to multiple records.
Using Co-Group, Semi-Join and Sampling records • 9:26
Pig allows grouping across relations using co-group. Co-groups can be used to find semi-joins of two relations. Finally, sample a portion of a data set using the sample command.
The nested Foreach command • 13:47
The nested foreach is mind-bending but one of the most powerful operations on grouped data.
Debug Pig scripts using Explain and Illustrate • 12:55
These commands help give you an idea of what exactly happens in a Pig script when you execute it.
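
The co-group and semi-join idea from this section can be sketched in plain Python (again, not Pig Latin). A co-group collects, for every key, one bag of matching rows per relation; a semi-join then keeps the rows of the first relation whose key has a non-empty bag in the second. The helper names and the `USERS` and `ORDERS` sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample relations, each a list of tuples.
USERS = [(1, "ann"), (2, "bo"), (3, "cy")]
ORDERS = [(1, "book"), (1, "pen"), (3, "ink")]


def cogroup(left, right, left_key, right_key):
    """Map each key to a pair of bags: (rows from left, rows from right)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[left_key(row)][0].append(row)
    for row in right:
        groups[right_key(row)][1].append(row)
    return dict(groups)


def semi_join(left, right, left_key, right_key):
    """Rows of `left` that have at least one match in `right`, each kept once."""
    result = []
    grouped = cogroup(left, right, left_key, right_key)
    for left_bag, right_bag in grouped.values():
        if right_bag:                 # the key also occurs in the right relation
            result.extend(left_bag)   # flatten the left bag back into rows
    return result


if __name__ == "__main__":
    first = lambda row: row[0]
    print(cogroup(USERS, ORDERS, first, first)[2])   # key 2 has no orders
    print(semi_join(USERS, ORDERS, first, first))
```

Unlike a regular join, the semi-join returns user 1 only once even though that user has two orders, because the right-hand bag is only tested for emptiness and never expanded.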

Parallelize operations using the Parallel keyword • 8:02
Pig scripts work with huge chunks of data, so every optimization helps!
Join Optimizations: Multiple relations join, large and small relation join • 10:34
Optimize your joins when more than 2 relations are involved. When one table is much larger than another, use the fragment-replicate algorithm to execute joins more efficiently.
Join Optimizations: Skew join and sort-merge join • 8:51
If your data is skewed such that some keys have a huge amount of data associated with them, you can speed up joins using the skew join. If your data set is sorted, use the sort-merge join.
Common sense optimizations • 5:24
Here are some tricks to process data faster: small optimizations which can make a big difference.
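
As a rough illustration of why the fragment-replicate join helps when one relation is small, here is a plain-Python sketch of the underlying idea: hold the small relation in memory as a lookup table and stream the large relation past it once, so the large side never has to be shuffled or sorted. This is a single-process model under assumed sample data (`CLICKS`, `PAGES`), not how Pig itself is implemented.

```python
from collections import defaultdict

# Hypothetical sample relations.
CLICKS = [("u1", "p1"), ("u2", "p2"), ("u3", "p1"), ("u4", "p9")]  # large side
PAGES = [("p1", "home"), ("p2", "cart")]                          # small side


def replicated_join(large_rows, small_rows, large_key, small_key):
    """Inner join that keeps only the small relation in memory."""
    # Build the in-memory lookup from the small relation once.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[small_key(row)].append(row)

    # Stream the large relation; each row needs only a dictionary probe.
    for row in large_rows:
        for match in lookup.get(large_key(row), ()):
            yield row + match


if __name__ == "__main__":
    joined = replicated_join(CLICKS, PAGES, lambda r: r[1], lambda r: r[0])
    for row in joined:
        print(row)
```

The trade-off is memory: the approach only works while the small relation fits in memory on every worker.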

Hadoop Install Modes • 8:32
Hadoop has 3 different install modes - Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.
Hadoop Standalone mode Install • 15:46
How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.
Hadoop Pseudo-Distributed mode Install • 11:44
Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running!

[For Linux/Mac OS Shell Newbies] Path and other Environment Variables • 8:25
If you are unfamiliar with software that requires working in a shell/command line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software.
Setup a Virtual Linux Instance (For Windows users) • 15:58
Hadoop is primarily built for Linux/Unix systems. If you are on Windows, you can set up a Linux Virtual Machine on your computer and use that for the install.

Requirements

A basic understanding of SQL and working with data
A basic understanding of the Hadoop eco-system and MapReduce tasks

Description

Prerequisites: Working with Pig requires some basic knowledge of the SQL query language and a brief understanding of the Hadoop eco-system and MapReduce.

Taught by a team which includes 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience in working with large-scale data processing jobs.

Pig is aptly named: it is omnivorous, will consume any data that you throw at it, and brings home the bacon!

Let's parse that

omnivorous: Pig works with unstructured data. It has many operations which are very SQL-like but Pig can perform these operations on data sets which have no fixed schema. Pig is great at wrestling data into a form which is clean and can be stored in a data warehouse for reporting and analysis.

bring home the bacon: Pig allows you to transform data in a way that makes it structured, predictable and useful, ready for consumption.

What's Covered:

Pig Basics: Scalar and Complex data types (Bags, Maps, Tuples), basic transformations such as Filter, Foreach, Load, Dump, Store, Distinct, Limit, Order by and other built-in functions.

Advanced Data Transformations and Optimizations: The mind-bending Nested Foreach, Joins and their optimizations using "parallel", "merge", "replicated" and other keywords, Co-groups and Semi-joins, and debugging using the Explain and Illustrate commands.

Real-world example: Clean up server logs using Pig

Who this course is for:

Yep! Analysts who want to wrangle large, unstructured data into shape
Yep! Engineers who want to parse and extract useful information from large datasets

Course content

You, This Course and Us • 1 lecture • 2min

Where does Pig fit in? • 5 lectures • 40min

Pig Basics • 6 lectures • 56min

Pig Operations And Data Transformations • 5 lectures • 42min

Advanced Data Transformations • 7 lectures • 1hr 15min

Optimizing Data Transformations • 4 lectures • 33min

A real-world example • 2 lectures • 17min

Installing Hadoop in a Local Environment • 3 lectures • 36min

Appendix • 2 lectures • 24min