Pig For Wrangling Big Data
Extract, Transform and Load data using Pig to harness the power of Hadoop
3.8 (28 ratings)
1,848 students enrolled
Created by Loony Corn
Last updated 11/2016
Current price: $10 Original price: $50 Discount: 80% off
30-Day Money-Back Guarantee
  • 5.5 hours on-demand video
  • 65 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Work with unstructured data to extract information, transform it and store it in a usable form
  • Write intermediate level Pig scripts to munge data
  • Optimize Pig operations which work on large data sets
Requirements
  • A basic understanding of SQL and working with data
  • A basic understanding of the Hadoop eco-system and MapReduce tasks

Prerequisites: Working with Pig requires some basic knowledge of the SQL query language and a brief understanding of the Hadoop eco-system and MapReduce.

Taught by a team which includes 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience working with large-scale data processing jobs.

Pig is aptly named: it is omnivorous, it will consume any data that you throw at it, and it will bring home the bacon!

Let's parse that:

Omnivorous: Pig works with unstructured data. It has many operations which are very SQL-like, but Pig can perform them on data sets which have no fixed schema. Pig is great at wrestling data into a form which is clean and can be stored in a data warehouse for reporting and analysis.
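For instance, a Pig script can load a file without declaring any schema and refer to fields purely by position (a sketch; the file path and field positions are made up):

```pig
-- No schema declared: Pig happily consumes the raw, comma-separated data
raw = LOAD 'input/orders.txt' USING PigStorage(',');

-- With no schema, fields are referenced positionally as $0, $1, $2, ...
ids = FOREACH raw GENERATE $0, $2;

DUMP ids;
```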

Bring home the bacon: Pig allows you to transform data so that it is structured, predictable and useful, ready for consumption.

What's Covered: 

Pig Basics: scalar and complex data types (Bags, Maps, Tuples); basic operations such as Load, Dump and Store; transformations such as Filter, Foreach, Distinct, Limit and Order By; and other built-in functions.
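A minimal script touching several of these basics might look like the following (the file paths, field names and values are illustrative, not from the course):

```pig
-- Load with a declared schema of scalar types
orders = LOAD 'input/orders.csv' USING PigStorage(',')
         AS (id:int, product:chararray, price:double);

-- Keep only the expensive orders
pricey = FILTER orders BY price > 100.0;

-- Project a subset of fields and derive a new column
summary = FOREACH pricey GENERATE product, price * 0.9 AS discounted;

-- Sort the result and write it out
sorted = ORDER summary BY discounted DESC;
STORE sorted INTO 'output/pricey_orders';
```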

Advanced Data Transformations and Optimizations: the mind-bending Nested Foreach; Joins and their optimizations using the "parallel", "merge", "replicated" and other keywords; Co-groups and Semi-joins; and debugging using the Explain and Illustrate commands.

Real-world example: Clean up server logs using Pig

Using discussion forums

Please use the discussion forums on this course to engage with other students and to help each other out. Unfortunately, much as we would like to, it is not possible for us at Loonycorn to respond to individual questions from students :-(

We're super small and self-funded with only 2 people developing technical video content. Our mission is to make high-quality courses available at super low prices.

The only way to keep our prices this low is to *NOT offer additional technical support over email or in-person*. The truth is, direct support is hugely expensive and just does not scale.

We understand that this is not ideal and that a lot of students might benefit from this additional support. Hiring resources for additional support would make our offering much more expensive, thus defeating our original purpose.

It is a hard trade-off.

Thank you for your patience and understanding!

Who is the target audience?
  • Yep! Analysts who want to wrangle large, unstructured data into shape
  • Yep! Engineers who want to parse and extract useful information from large datasets
Compare to Other Apache Pig Courses
Curriculum For This Course
35 Lectures
You, This Course and Us
1 Lecture 01:46
Where does Pig fit in?
5 Lectures 40:17

There are many technologies that work in the Hadoop eco-system. Where does Pig fit in and why would we use Pig?

Preview 09:37

Let's walk through the installation for Pig. Please download and use both the attached resources. 

Install and set up

How does Pig stack up against Hive? When would we use Pig vs Hive?

How does Pig compare with Hive?

Pig is a data flow language where data "flows" through the script and is transformed by operations until it can be stored in a data warehouse.

Pig Latin as a data flow language

Pig can work with HBase as well, though this course focuses on Hadoop and HDFS.

Pig with HBase
Pig Basics
6 Lectures 56:11

Pig can run in local mode or on a cluster. See how to run Pig scripts and get introduced to the Grunt shell.

Preview 09:52

Getting started, loading data into a relation.

Loading data and creating our first relation

The basic data types which exist in Pig.

Preview 09:55

Pig has more complex ways of grouping and storing data. Learn how to use these.

Complex data types - The Tuple, Bag and Map
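A sketch of how the three complex types can appear in a LOAD schema (the file and field names here are hypothetical):

```pig
-- A tuple nests fields, a bag holds a collection of tuples, a map holds key-value pairs
users = LOAD 'input/users.txt' AS (
    name:chararray,
    location:tuple(city:chararray, country:chararray),
    follows:bag{t:(friend:chararray)},
    props:map[chararray]
);

-- Dot notation reaches into tuples; # dereferences a map key
cities = FOREACH users GENERATE name, location.city, props#'age';
```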

Pig consumes any kind of data; it does not require a well-defined schema. You may not know the schema at all, or know it only partially.

Partial schema specification for relations
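For example, a schema can name fields without typing them; untyped fields default to bytearray and can be cast only where a real type is needed (paths and names are illustrative):

```pig
-- Names only, no types: every field is a bytearray by default
logs = LOAD 'input/logs.txt' AS (ip, time, status);

-- Cast just the field that needs a concrete type
errors = FILTER logs BY (int)status >= 500;
```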

Display results on screen or store them in a file.

Displaying and storing relations - The dump and store commands
Pig Operations And Data Transformations
5 Lectures 42:06

Choose certain fields and drop the ones you're not interested in.

Preview 10:22

Built-in functions in Pig are super powerful. They give you a wide variety of ways to munge data without having to write much code!

Built-in functions

These transform data by aggregating, summarizing and calculating new information from the fields available.

Evaluation functions

Generate unique records, limit the number of records or order records based on a field.

Using the distinct, limit and order by keywords
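The three keywords compose naturally in a pipeline; a sketch with made-up data:

```pig
visits = LOAD 'input/visits.csv' USING PigStorage(',')
         AS (user:chararray, url:chararray, hits:int);

-- Drop duplicate records
uniq = DISTINCT visits;

-- Order on a field, then cap the number of records returned
ranked = ORDER uniq BY hits DESC;
top10 = LIMIT ranked 10;

DUMP top10;
```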

Choose only the records you're interested in by specifying conditions.

Filtering records based on a predicate
Advanced Data Transformations
7 Lectures 01:14:35

This is where Pig gets interesting. Grouping data and aggregations can be performed on unstructured data as well.

Group by and aggregate transformations
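A sketch of grouping followed by aggregation (the relation and field names are invented for illustration):

```pig
sales = LOAD 'input/sales.csv' USING PigStorage(',')
        AS (store:chararray, amount:double);

-- GROUP produces one record per key, with a bag named after the grouped relation
by_store = GROUP sales BY store;

-- Aggregate over each group's bag
totals = FOREACH by_store GENERATE
    group AS store,
    COUNT(sales) AS num_sales,
    SUM(sales.amount) AS total;
```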

Bring together multiple relations into one unit by using Join.

Combining datasets using Join

Concatenating data sets works even when the schemas are not the same or there are additional columns!

Preview 04:32
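One way Pig handles mismatched schemas is the ONSCHEMA modifier on union, which matches columns by name and pads missing ones with nulls (a sketch with hypothetical files):

```pig
a = LOAD 'input/2015.csv' USING PigStorage(',') AS (id:int, name:chararray);
b = LOAD 'input/2016.csv' USING PigStorage(',') AS (id:int, name:chararray, email:chararray);

-- Records from 'a' get a null email column in the combined relation
both = UNION ONSCHEMA a, b;
```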

Nested or group data in a single record can be flattened and converted to multiple records.

Generating multiple records by flattening complex fields
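For instance, flattening a bag turns one record holding N elements into N records (illustrative schema):

```pig
posts = LOAD 'input/posts.txt' AS (author:chararray, tags:bag{t:(tag:chararray)});

-- FLATTEN emits one output record per element of the bag
tagged = FOREACH posts GENERATE author, FLATTEN(tags) AS tag;
```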

Pig allows grouping across relations using co-group. Co-groups can be used to find semi-joins of two relations. And finally, you can sample a portion of a data set using the sample command.

Using Co-Group, Semi-Join and Sampling records
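A sketch of all three ideas together (relations and fields are made up):

```pig
users = LOAD 'input/users.csv' USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'input/orders.csv' USING PigStorage(',') AS (uid:int, total:double);

-- COGROUP keeps each relation's matching records in its own bag per key
grouped = COGROUP users BY id, orders BY uid;

-- Semi-join: keep users that have at least one order
buyers = FILTER grouped BY NOT IsEmpty(orders);
buyer_names = FOREACH buyers GENERATE FLATTEN(users.name);

-- Take a ~10% random sample of a relation
some_users = SAMPLE users 0.1;
```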

The nested foreach is mind-bending but one of the most powerful operations on grouped data.

The nested Foreach command
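To give a flavor of it, a nested foreach runs a small pipeline inside each group's bag (a sketch, not the course's script):

```pig
visits = LOAD 'input/visits.csv' USING PigStorage(',')
         AS (user:chararray, url:chararray);
by_user = GROUP visits BY user;

-- The block between { } runs once per group, over that group's bag
top_urls = FOREACH by_user {
    urls = visits.url;
    uniq = DISTINCT urls;
    GENERATE group AS user, COUNT(uniq) AS distinct_urls;
};
```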

These commands help give you an idea of what exactly happens in a Pig script when you execute it.

Debug Pig scripts using Explain and Illustrate
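Both commands take a relation name; a sketch of how they are invoked:

```pig
visits = LOAD 'input/visits.csv' USING PigStorage(',')
         AS (user:chararray, url:chararray);
by_user = GROUP visits BY user;
counts = FOREACH by_user GENERATE group, COUNT(visits);

-- Print the logical, physical and MapReduce plans without running the job
EXPLAIN counts;

-- Run the pipeline on a small sample and show the data at each step
ILLUSTRATE counts;
```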
Optimizing Data Transformations
4 Lectures 32:51

Pig scripts work with huge chunks of data, all optimizations help!

Preview 08:02

Optimize your joins when more than 2 relations are involved, or, when one table is much larger than another, use the fragment-replicate algorithm to execute joins more efficiently.

Join Optimizations: Multiple relations join, large and small relation join
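The fragment-replicate join is requested with the 'replicated' keyword; the small relation, listed last, is loaded into memory on every mapper so no reduce phase is needed (a sketch with invented relations):

```pig
big = LOAD 'input/clicks.csv' USING PigStorage(',') AS (uid:int, url:chararray);
small = LOAD 'input/users.csv' USING PigStorage(',') AS (uid:int, name:chararray);

-- The relation on the right must be small enough to fit in memory
joined = JOIN big BY uid, small BY uid USING 'replicated';
```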

If your data is skewed, so that some keys have huge amounts of data associated with them, you can speed up joins using the skew join. If your data set is sorted, use the sort-merge join.

Join Optimizations: Skew join and sort-merge join
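Both variants are requested with a keyword on the join (a sketch; for the merge join, assume the input files are already sorted by the join key on disk):

```pig
clicks = LOAD 'input/clicks.csv' USING PigStorage(',') AS (uid:int, url:chararray);
users = LOAD 'input/users.csv' USING PigStorage(',') AS (uid:int, name:chararray);

-- Skew join: hot keys are split across reducers instead of overloading one
j1 = JOIN clicks BY uid, users BY uid USING 'skewed';

-- Merge join: valid only when both inputs arrive pre-sorted on the join key
a = LOAD 'input/sorted_clicks.csv' USING PigStorage(',') AS (uid:int, url:chararray);
b = LOAD 'input/sorted_users.csv' USING PigStorage(',') AS (uid:int, name:chararray);
j2 = JOIN a BY uid, b BY uid USING 'merge';
```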

Here are some tricks to process data faster. Small optimizations which can make a big difference.

Common sense optimizations
A real-world example
2 Lectures 16:42

A real world example of how you would use Pig to process server logs to extract, transform and store information in a data warehouse.

Parsing server logs
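The course's actual script may differ, but the general shape of log parsing in Pig is to load whole lines and pull fields out with a regular expression (the log format and pattern below are a simplified assumption):

```pig
-- TextLoader reads each line as a single chararray field
raw = LOAD 'input/access.log' USING TextLoader() AS (line:chararray);

-- Extract ip, timestamp, method, url and status from an Apache-style line
fields = FOREACH raw GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line,
        '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(\\S+) (\\S+)[^"]*" (\\d+)')
) AS (ip:chararray, time:chararray, method:chararray,
      url:chararray, status:chararray);

-- Keep only server errors and store them for reporting
errors = FILTER fields BY (int)status >= 500;
STORE errors INTO 'output/server_errors';
```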

Analyze every step in the Pig script and understand how the transformations work.

Summarizing error logs
Installing Hadoop in a Local Environment
3 Lectures 36:02

Hadoop has 3 different install modes - Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.

Hadoop Install Modes

How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video. 

Hadoop Standalone mode Install

Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running! 

Hadoop Pseudo-Distributed mode Install
2 Lectures 24:23

If you are unfamiliar with software that requires working with a shell/command-line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software.

[For Linux/Mac OS Shell Newbies] Path and other Environment Variables

Hadoop is built for Linux/Unix systems. If you are on Windows, you can set up a Linux virtual machine on your computer and use it for the install.

Setup a Virtual Linux Instance (For Windows users)
About the Instructor
Loony Corn
4.3 Average rating
5,481 Reviews
42,676 Students
75 Courses
An ex-Google, Stanford and Flipkart team

Loonycorn is us, Janani Ravi and Vitthal Srinivasan. Between us, we have studied at Stanford, been admitted to IIM Ahmedabad and spent years working in tech in the Bay Area, New York, Singapore and Bangalore.

Janani: 7 years at Google (New York, Singapore); Studied at Stanford; also worked at Flipkart and Microsoft

Vitthal: Also Google (Singapore) and studied at Stanford; Flipkart, Credit Suisse and INSEAD too

We think we might have hit upon a neat way of teaching complicated tech courses in a funny, practical, engaging way, which is why we are so excited to be here on Udemy!

We hope you will try our offerings, and think you'll like them :-)