Taught by a 4-person team including 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience working with Java and with billions of rows of data.
This course is a zoom-in, zoom-out, hands-on workout involving Hadoop, MapReduce and the art of thinking parallel.
Let’s parse that.
Zoom-in, Zoom-Out: This course is both broad and deep. It covers the individual components of Hadoop in great detail, and also gives you a higher level picture of how they interact with each other.
Hands-on workout involving Hadoop, MapReduce: This course will get you hands-on with Hadoop very early on. You'll learn how to set up your own cluster using both VMs and the Cloud. All the major features of MapReduce are covered - including advanced topics like Total Sort and Secondary Sort.
The art of thinking parallel: MapReduce completely changed the way people thought about processing Big Data. Breaking down any problem into parallelizable units is an art. The examples in this course will train you to "think parallel".
Lots of cool stuff ..
.. and of course all the basics:
Using discussion forums
Please use the discussion forums in this course to engage with other students and to help each other out. Unfortunately, much as we would like to, it is not possible for us at Loonycorn to respond to individual questions from students :-(
We're super small and self-funded with only 2-3 people developing technical video content. Our mission is to make high-quality courses available at super low prices.
The only way to keep our prices this low is to *NOT offer additional technical support over email or in person*. The truth is, direct support is hugely expensive and just does not scale.
We understand that this is not ideal and that a lot of students might benefit from this additional support. Hiring resources for additional support would make our offering much more expensive, thus defeating our original purpose.
It is a hard trade-off.
Thank you for your patience and understanding!
Big data may be a cliched term, but what does it really mean? Where does this data come from and why is it big?
Distributed computing makes processing very fast - but why? Let's take a simple example and see why distributed computing is so powerful.
What exactly is Hadoop? Its origins and its logical components explained.
HDFS, based on GFS (the Google File System), is the storage layer within Hadoop. It stores files in blocks of 128 MB.
MapReduce is the framework which allows developers to write massively parallel programs without worrying about the underlying details of distributed computing. The developer simply implements the map() and reduce() functions in order to crunch large input sets of data.
YARN is responsible for managing resources in the Hadoop cluster. It was introduced in Hadoop 2.0.
Hadoop has 3 different install modes - Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.
How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.
Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running!
In the world of MapReduce, every problem can be thought of in terms of key-value pairs. Map transforms each key-value pair in a meaningful way, the framework sorts and merges them, and reduce combines key-value pairs in a meaningful way.
If you're learning MapReduce for the very first time - it's best to visualize what exactly it does before you get down into the little details.
What really goes on with a single record as it flows through the map and then reduce phase?
Counting the number of times a word occurs in input text is the Hello World of MapReduce. This was the very first example given in Jeff Dean and Sanjay Ghemawat's original paper on MapReduce.
Nothing is real unless it is in code. Setting up our very first Mapper.
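A minimal sketch of what such a mapper might look like - class and variable names here are illustrative, not the course's exact code:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper sketch: input key = byte offset of the line, input value = the line itself.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word on the line.
        for (String token : value.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```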
Nothing is real unless it is in code. Setting up our very first Reducer.
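A matching reducer sketch, again with illustrative names:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer sketch: receives a word and all the 1s emitted for it, and sums them.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```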
Nothing is real unless it is in code. Setting up our very first MapReduce Job.
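A rough sketch of a driver that wires the mapper and reducer above together; input and output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```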
Learn how to use HDFS's command line interface and add data to HDFS to run your jobs on.
Run your very first MapReduce Job. We'll also explore the Web interface for YARN and HDFS and see how to track your jobs.
The reduce phase can be optimized by combining the output of the map phase at the map node itself, allowing the reducers to work on data that has already been "partially reduced".
Using a Combiner should not change the output of the MapReduce job, which means not every Reducer can work as a combine function.
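As a sketch, in a driver like the word-count one above, enabling the combiner is a single call. It is safe here only because summing is associative and commutative; a reducer that computed, say, an average could not be reused this way:

```java
// In the driver: reuse the reducer as a combiner so partial sums happen on each map node.
job.setCombinerClass(WordCountReducer.class);
```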
The number of mapper processes depends on the number of input splits of your data. It's not really in your control. What you, as a developer, do control is the number of reducers.
In order to have more than one Reducer work on your map data, you need partitions. Visualize how partitioning and the shuffle and sort work.
The Hadoop Streaming API uses standard input and output to communicate with mapper and reducer functions written in any language. Understand how Hadoop interacts with mappers and reducers in other languages.
It's not real till it's in code. Implement the word count MapReduce example in Python using the Streaming API.
Let's understand HDFS and its data replication strategy in some detail.
The name node provides an index of which file is stored where in the data nodes. If the name node is lost, the mapping of where the files are is lost, which means that even though the data is still present in the data nodes, we'll have no idea how to access it!
Hadoop backs up the name node using two strategies: backing up a snapshot of the filesystem metadata along with the edit log, and setting up a secondary name node.
The Resource Manager assigns resources to processes based on the policies and constraints of the cluster, while the Node Manager manages memory and other resources for a single node. These two form the basic components of YARN.
What happens under the hood when you submit a job to YARN? The Resource Manager, containers, the Application Master and the Node Manager all work together to run your MapReduce job.
The Resource Manager acts as a pure scheduler and allows plugging in different policies to schedule jobs. Understand how the FIFO scheduler, the Capacity scheduler and the Fair scheduler work.
The user has a lot of leeway in configuring how the scheduler works. Let's study some of the options we can specify in the various config files.
The main class in your MapReduce job needs some special setup before it can accept command-line arguments.
The library classes and interfaces which allow parsing command line arguments. Learn what they are and how to use them.
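The classes in question are Tool, ToolRunner, Configured and GenericOptionsParser. A minimal sketch of the usual pattern (the driver class name is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// ToolRunner runs the command line through GenericOptionsParser, so standard
// options such as -D mapreduce.job.reduces=4 are applied to the Configuration
// before run() ever sees the remaining application-specific arguments.
public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already contains any -D overrides
        // ... build and submit the Job here, as in the word-count driver ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}
```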
The Job object allows you to plug in your own classes to control inputs, outputs and many intermediate steps in the MapReduce.
Between the Map phase and the Reduce phase lie a number of intermediate steps performed by the Hadoop framework. Partitioning, Sorting and Grouping are 3 specific operations, and each of these can be customized to fit your problem statement.
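A sketch of the relevant hooks on the Job object; every My* class below is a placeholder for whatever you plug in, while TextInputFormat and TextOutputFormat are Hadoop's own classes for text files:

```java
job.setInputFormatClass(TextInputFormat.class);             // how input files are read and split
job.setPartitionerClass(MyPartitioner.class);               // which reducer each key is sent to
job.setSortComparatorClass(MySortComparator.class);         // how keys are ordered before reduce
job.setGroupingComparatorClass(MyGroupingComparator.class); // which keys end up in the same reduce() call
job.setOutputFormatClass(TextOutputFormat.class);           // how the results are written out
```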
The inverted index, which provides a mapping from every word to the pages on which that word occurs, is at the heart of every search engine. This is one of the original use cases for MapReduce.
It's not real unless it's in code: generate the inverted index using a MapReduce job.
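A rough sketch of what such a job might look like; the tokenization and class names are illustrative rather than the course's exact solution. The map phase emits (word, document name) pairs and the reduce phase collects the documents for each word:

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text word = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the name of the input file as the "page" identifier.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            docId.set(fileName);
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, docId);
                }
            }
        }
    }

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Collect the distinct documents in which this word appears.
            Set<String> docs = new TreeSet<>();
            for (Text doc : values) {
                docs.add(doc.toString());
            }
            context.write(key, new Text(String.join(", ", docs)));
        }
    }
}
```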
Understand why we need the Writable and WritableComparable interfaces and why the keys in the Mapper output implement these interfaces.
A Bigram is a pair of adjacent words. Use a special data type to represent a Bigram; it needs to be a WritableComparable so that it can be serialized across the network and sorted and merged by Hadoop.
Use the Bigram data type in your MapReduce to produce a count of all Bigrams in the input text file.
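A sketch of what such a Bigram type might look like; write() and readFields() handle serialization across the network, compareTo() is what Hadoop's sort and merge use, and hashCode() is what the default HashPartitioner relies on. Names and details are illustrative:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class Bigram implements WritableComparable<Bigram> {
    private final Text first = new Text();
    private final Text second = new Text();

    public Bigram() {}                           // Hadoop needs a no-arg constructor

    public void set(String f, String s) {
        first.set(f);
        second.set(s);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int compareTo(Bigram other) {
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Bigram)) return false;
        Bigram b = (Bigram) o;
        return first.equals(b.first) && second.equals(b.second);
    }

    @Override
    public String toString() {
        return first + " " + second;
    }
}
```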
Follow these instructions to set up your Hadoop project.
No code is complete without unit tests. The MRUnit framework uses JUnit to test MapReduce jobs. Write test cases for the Bigram count code.
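A sketch of MRUnit-style tests. For brevity it exercises the word-count mapper and reducer sketched earlier rather than the Bigram code, and assumes MRUnit and JUnit are on the classpath:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void mapperEmitsOnePerWord() throws IOException {
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("big data big"))
                .withOutput(new Text("big"), new IntWritable(1))
                .withOutput(new Text("data"), new IntWritable(1))
                .withOutput(new Text("big"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws IOException {
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("big"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("big"), new IntWritable(2))
                .runTest();
    }
}
```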
The Input Format specifies the kind of input data that feeds into the MapReduce job. FileInputFormat is the base class for all file-based inputs.
The most common kinds of files are text files and binary files, and Hadoop has built-in library classes to represent both of these.
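As a sketch, the input format is picked in the driver; TextInputFormat is the default for plain text, and SequenceFileInputFormat reads Hadoop's binary key-value SequenceFiles:

```java
// For line-oriented text files (this is also the default if nothing is set):
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
// For Hadoop's binary SequenceFiles:
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);
```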
What if you want to partition on something other than key hashes? Custom partitioners allow you to partition on whatever metric you choose - you just need to write a bit of code.
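A sketch of a custom partitioner; the "first letter of the key" rule and all the names are purely illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys to reducers by their first letter instead of by hash code.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (s.isEmpty()) {
            return 0;
        }
        return Character.toLowerCase(s.charAt(0)) % numPartitions;
    }
}

// In the driver:
//   job.setNumReduceTasks(4);                               // the reducer count is under your control
//   job.setPartitionerClass(FirstLetterPartitioner.class);
```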
Total Order Partitioning is a mind-bending concept in Hadoop. It allows you to locally sort data such that it ends up in globally sorted order. Sounds confusing? It is a hard concept to wrap one's head around, but the results are pretty amazing!
Input sampling samples the input data to produce a key-to-partition mapping. The Total Order Partitioner uses this mapping to partition the data in such a manner that locally sorting the data results in a globally sorted result.
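A sketch of how the pieces fit together in the driver; the partition-file path and the sampler parameters are illustrative, and this assumes the map output keys are Text:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortSetup {
    public static void configure(Job job) throws Exception {
        job.setNumReduceTasks(4);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // The sampler writes its chosen split points to this file.
        Path partitionFile = new Path("/tmp/partitions.lst");
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

        // Sample roughly 10% of records, up to 10,000 of them, from at most 10 splits.
        InputSampler.Sampler<Text, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
    }
}
```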
The Hadoop Sort/Merge operation sorts the output keys of the mapper. Here is a neat trick to sort the values for each key as well.
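The recipe in outline; every class name below is a placeholder for your own implementation. The value to be sorted is folded into a composite key, partitioning and grouping look only at the natural key, and sorting looks at the whole composite key:

```java
// 1. Emit a composite key from the mapper, e.g. (naturalKey, value).
// 2. Partition and group on the natural key only, but sort on the full composite key.
job.setPartitionerClass(NaturalKeyPartitioner.class);                // hash only the natural key
job.setSortComparatorClass(CompositeKeyComparator.class);            // order by (natural key, value)
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // one reduce() call per natural key
```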
Loonycorn is us, Janani Ravi and Vitthal Srinivasan. Between us, we have studied at Stanford, been admitted to IIM Ahmedabad and have spent years working in tech, in the Bay Area, New York, Singapore and Bangalore.
Janani: 7 years at Google (New York, Singapore); Studied at Stanford; also worked at Flipkart and Microsoft
Vitthal: Also Google (Singapore) and studied at Stanford; Flipkart, Credit Suisse and INSEAD too
We think we might have hit upon a neat way of teaching complicated tech courses in a funny, practical, engaging way, which is why we are so excited to be here on Udemy!
We hope you will try our offerings, and think you'll like them :-)