
There are many technologies that work in the Hadoop eco-system. Where does Pig fit in and why would we use Pig?
Let's walk through the installation for Pig. Please download and use both the attached resources.
How does Pig stack up against Hive? When would we use Pig vs Hive?
Pig is a data flow language where data "flows" through the script and is transformed by operations till it can be stored in a data warehouse.
Pig can work with HBase as well, though this course focuses on Hadoop and HDFS.
Pig can run in local mode or on a cluster. See how to run Pig scripts and get introduced to the Grunt shell.
Getting started, loading data into a relation.
The basic data types which exist in Pig.
Pig has more complex ways of grouping and storing data. Learn how to use these.
Pig consumes any kind of data; it does not require a well-defined schema. You may not know the schema at all or know it only partially.
Display results on screen or store them in a file.
Choose certain fields and drop the ones you're not interested in.
Built-in functions in Pig are super powerful. They give you a wide variety of ways to munge data without having to write much code!
These transform data by aggregating, summarizing and calculating new information from the fields available.
Generate unique records, limit the number of records or order records based on a field.
Choose only the records you're interested in by specifying conditions.
This is where Pig gets interesting. Grouping data and aggregations can be performed on unstructured data as well.
Bring together multiple relations into one unit by using Join.
Concatenating data sets works even when the schemas are not the same or there are additional columns!
Nested or grouped data in a single record can be flattened and converted to multiple records.
Pig allows grouping across relations using co-group. Co-groups can be used to find semi-joins of two relations. And finally, sample a portion of the data in a data set using the sample command.
The nested foreach is mind-bending but one of the most powerful operations on grouped data.
These commands help give you an idea of what exactly happens in a Pig script when you execute it.
Pig scripts work with huge chunks of data, so every optimization helps!
Optimize your joins when more than 2 relations are involved. When one table is much larger than another, use the fragment-replicate algorithm to execute joins more efficiently.
If your data is skewed such that some keys have a huge amount of data associated with them, you can speed up joins using the skewed join. If your data set is sorted, use the sort-merge join.
Here are some tricks to process data faster. Small optimizations which can make a big difference.
A real world example of how you would use Pig to process server logs to extract, transform and store information in a data warehouse.
Analyze every step in the Pig script and understand how the transformations work.
Hadoop has 3 different install modes: Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.
How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.
Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running!
If you are unfamiliar with software that requires working with a shell/command line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software.
Hadoop is primarily built for Linux/Unix systems. If you are on Windows, you can set up a Linux Virtual Machine on your computer and use that for the install.
Prerequisites: Working with Pig requires some basic knowledge of the SQL query language and a brief understanding of the Hadoop ecosystem and MapReduce.
Taught by a team which includes 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience in working with large-scale data processing jobs.
Pig is aptly named: it is omnivorous, will consume any data that you throw at it, and brings home the bacon!
Let's parse that:
omnivorous: Pig works with unstructured data. It has many operations which are very SQL-like but Pig can perform these operations on data sets which have no fixed schema. Pig is great at wrestling data into a form which is clean and can be stored in a data warehouse for reporting and analysis.
bring home the bacon: Pig allows you to transform data in a way that makes it structured, predictable and useful, ready for consumption.
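To make the "omnivorous" idea a little more concrete, here is a minimal Python sketch (an analogue only, not Pig Latin) that takes ragged, comma-separated lines with no declared schema and turns them into predictable records. The sample lines, the FIELDS tuple and the to_record helper are all hypothetical, invented for this illustration.

```python
# Hypothetical raw input: lines with differing numbers of fields,
# similar to data that arrives without a well-defined schema.
RAW_LINES = [
    "alice,34,london",
    "bob,,paris,extra-field",
    "carol,29",
]

# Assumed target schema for this sketch only.
FIELDS = ("name", "age", "city")


def to_record(line):
    """Map one ragged line onto FIELDS, filling gaps with None."""
    parts = line.split(",")
    # Pad short lines with empty strings, then drop any surplus fields.
    parts = (parts + [""] * len(FIELDS))[: len(FIELDS)]
    record = {key: (value or None) for key, value in zip(FIELDS, parts)}
    # Cast age to an integer when it is present.
    if record["age"] is not None:
        record["age"] = int(record["age"])
    return record


def clean(lines):
    """Return one structured record per input line."""
    return [to_record(line) for line in lines]


if __name__ == "__main__":
    for rec in clean(RAW_LINES):
        print(rec)
```

Missing values become None and surplus fields are dropped, so every record ends up with the same three keys regardless of how messy the input line was.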
What's Covered:
Pig Basics: Scalar and Complex data types (Bags, Maps, Tuples), basic transformations such as Filter, Foreach, Load, Dump, Store, Distinct, Limit, Order by and other built-in functions.
Advanced Data Transformations and Optimizations: The mind-bending Nested Foreach, Joins and their optimizations using "parallel", "merge", "replicated" and other keywords, Co-groups and Semi-joins, and debugging using the Explain and Illustrate commands.
Real-world example: Clean up server logs using Pig
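For a feel of what an extract, transform and aggregate flow over server logs looks like, here is a rough Python sketch. It is not Pig Latin and not the course's actual script; the log format, sample lines and function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical log lines in a simplified access-log format.
SAMPLE_LOGS = [
    '10.0.0.1 - - [01/Jan/2016:10:00:00] "GET /index.html" 200',
    '10.0.0.2 - - [01/Jan/2016:10:00:05] "GET /missing" 404',
    'corrupted line without structure',
    '10.0.0.1 - - [01/Jan/2016:10:00:09] "POST /login" 200',
]

# Assumed line layout: ip, two ignored fields, [time], "METHOD path", status.
LINE_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})$'
)


def extract(lines):
    """Extract step: parse each line and drop lines that do not match."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            yield match.groupdict()


def transform(records):
    """Transform step: keep selected fields and cast the status code."""
    for rec in records:
        yield {"ip": rec["ip"], "path": rec["path"], "status": int(rec["status"])}


def count_by_status(records):
    """Aggregate step: group records by status code and count them."""
    return Counter(rec["status"] for rec in records)


if __name__ == "__main__":
    rows = list(transform(extract(SAMPLE_LOGS)))
    print(count_by_status(rows))
```

Each function hands its output to the next, which loosely mirrors how data flows through successive relations in a Pig script: load, filter out bad rows, project the fields of interest, then group and count.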