Hadoop,Big Data and Map Reduce Course
2.3 (5 ratings)
Instead of using a simple lifetime average, Udemy calculates a course's star rating by considering a number of different factors such as the number of ratings, the age of ratings, and the likelihood of fraudulent ratings.
72 students enrolled

Big Data: Taming Big Data with MapReduce and Hadoop
Created by Deepika Khanna
Last updated 1/2016
Current price: $10 Original price: $20 Discount: 50% off
30-Day Money-Back Guarantee
  • 1 hour on-demand video
  • 2 Supplemental Resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Understand Big Data, Hadoop and MapReduce
  • Learn about the Hadoop ecosystem
  • Understand the Hadoop technologies Hive, Pig and Spark
  • Understand how Mappers and Reducers work
Requirements
  • No programming skills needed to take this course.

Big Data is a hot and valuable skill to learn. As the use of phones, Facebook and other social media platforms has increased, storing the large amount of data being generated, and keeping up with the speed at which it is generated, has become a challenge. In this course we will learn how Hadoop addresses these challenges. Hadoop developers are in great demand, and it's a great career choice.

Learn the basic concepts of Hadoop and Big Data, including how mappers and reducers work. The course is planned for non-programmers, so you will understand the concepts of Hadoop in no time.

Who is the target audience?
  • This course is for students who want to know about Big Data
  • After this course, students will have a very good understanding of how mappers and reducers work
  • No programming skills needed to take this course.
Curriculum For This Course
21 Lectures 53:48
What is "Big Data"? The dimensions of Big Data. Scaling problems.
21 Lectures 51:48

In this tutorial we will talk about data: why companies need to store all of their data, and why storing data is important for giving users a better experience, which ultimately generates more sales. For example, if you go to amazon.com and look for a particular product, Amazon saves that information; the next time you log in, it shows you the same product or recommended products based on your previous purchases. This gives the user a better experience and also helps companies like Amazon get more sales.

Preview 02:02

Introduce yourself
1 page

Big Data is a term used to describe a large amount of data. Not all data is considered Big Data, so we will see on what basis we decide what counts as Big Data.

Preview 00:59

Assignment 1
1 page

In this tutorial we will talk about the three V's associated with Big Data:

  1. Volume
  2. Variety
  3. Velocity

  • Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
  • Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
  • Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
Preview 01:47

Hadoop has two main components: HDFS and MapReduce.

HDFS (Hadoop Distributed File System): HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.

Map Reduce: MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
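The student example above can be sketched in plain Python. This is only an illustration of the programming model running in a single process, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_first_names`) and the sample student list are made up for this sketch.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, one 'queue' per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    # Sort by key so the result is deterministic.
    return dict(sorted(queues.items()))


def reduce_phase(queues):
    """Reduce: summarise each queue by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}


def count_first_names(students):
    """Run map, shuffle and reduce in sequence on one machine."""
    return reduce_phase(shuffle(map_phase(students)))


# Hypothetical sample data.
students = ["Ann Lee", "Bob Ray", "Ann Kim", "Cy Fox", "Bob Poe", "Ann Orr"]
name_counts = count_first_names(students)
print(name_counts)  # {'Ann': 3, 'Bob': 2, 'Cy': 1}
```

On a real cluster the map and reduce steps would run in parallel on many machines, and the framework would perform the shuffle between them.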

Hadoop 2 core parts

Hadoop Ecosystem:

The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce.

The Hadoop ecosystem includes other tools to address particular needs.

Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is a coordination service for distributed applications, and Oozie is a workflow scheduling system.

Apache hadoop ecosystem components

Hadoop File System was developed using distributed file system design.

Features of HDFS:

  • It is suitable for the distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.

NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

In Hadoop 1.x the NameNode is a Single Point of Failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Hadoop 2.x added an optional High Availability configuration with a standby NameNode.

How to reduce the impact of NameNode problems:

  • List more than one name node directory in the configuration, so that multiple copies of the file system meta-data will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the meta-data.

Datanode: A DataNode stores data in the HDFS. A functional filesystem has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for filesystem operations.
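The split of responsibilities between NameNode and DataNodes can be illustrated with a toy in-memory model. Everything here is an assumption made for illustration: the class names, the round-robin replica placement, the 4-byte block size and the `write_file`/`read_file` helpers are hypothetical and do not correspond to HDFS's real classes or placement policy.

```python
class DataNode:
    """Stores block contents only; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Keeps metadata only: which blocks form a file and where replicas live."""

    def __init__(self, node_names, replication=2):
        self.node_names = node_names
        self.replication = replication
        self.files = {}      # path -> [block_id, ...]
        self.locations = {}  # block_id -> [DataNode name, ...]
        self._next_block = 0

    def allocate(self, path, num_blocks):
        """Choose replica locations for each new block (simple round-robin)."""
        plan = []
        for _ in range(num_blocks):
            block_id = self._next_block
            self._next_block += 1
            replicas = [
                self.node_names[(block_id + i) % len(self.node_names)]
                for i in range(self.replication)
            ]
            self.locations[block_id] = replicas
            plan.append((block_id, replicas))
        self.files[path] = [block_id for block_id, _ in plan]
        return plan

    def locate(self, path):
        """Return (block_id, replica names) for every block of a file."""
        return [(b, self.locations[b]) for b in self.files[path]]


def write_file(namenode, datanodes, path, data, block_size=4):
    """Client side: ask the NameNode where to write, then send data to DataNodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for chunk, (block_id, names) in zip(chunks, namenode.allocate(path, len(chunks))):
        for name in names:
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, path):
    """Client side: look up block locations, then read any surviving replica."""
    parts = []
    for block_id, names in namenode.locate(path):
        for name in names:
            if block_id in datanodes[name].blocks:
                parts.append(datanodes[name].blocks[block_id])
                break
    return b"".join(parts)


datanodes = {n: DataNode(n) for n in ["dn1", "dn2", "dn3"]}
namenode = NameNode(list(datanodes))
write_file(namenode, datanodes, "/demo.txt", b"hello hadoop")
before_failure = read_file(namenode, datanodes, "/demo.txt")

datanodes["dn2"].blocks.clear()  # simulate losing one DataNode's disk
after_failure = read_file(namenode, datanodes, "/demo.txt")
print(before_failure, after_failure)
```

Because every block has two replicas in this sketch, the file remains readable after one DataNode loses its data, while the NameNode itself never holds any file contents.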

Taming big data with mapreduce and hadoop

Problem with traditional system

Address the problem

Namenode and datanode in hadoop

MapReduce: a programming paradigm that scales across thousands of servers in a Hadoop cluster


Mapper and Reducers part 1

Mapper and Reducers part 2

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.

  1. Client applications submit jobs to the JobTracker.
  2. The JobTracker talks to the NameNode to determine the location of the data
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.
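Step 5 of the list above (heartbeat monitoring and rescheduling) can be sketched as follows. This is a simplified illustration, not Hadoop's implementation: the `JobTrackerSketch` class, the 10-second timeout and the "move to the first live tracker" rule are all assumptions, and timestamps are passed in explicitly so the example is deterministic.

```python
class JobTrackerSketch:
    """Hypothetical, minimal model of heartbeat-based failure detection."""

    def __init__(self, heartbeat_timeout=10.0):
        self.heartbeat_timeout = heartbeat_timeout  # assumed value, in seconds
        self.last_heartbeat = {}  # tracker name -> time of last heartbeat
        self.assignments = {}     # task name -> tracker name

    def heartbeat(self, tracker, now):
        """Record that a TaskTracker reported in at time `now`."""
        self.last_heartbeat[tracker] = now

    def assign(self, task, tracker):
        """Record that a task was handed to a TaskTracker."""
        self.assignments[task] = tracker

    def live_trackers(self, now):
        """Trackers whose last heartbeat is within the timeout."""
        return sorted(
            tracker
            for tracker, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )

    def reschedule_failed(self, now):
        """Move tasks off trackers that stopped sending heartbeats."""
        live = self.live_trackers(now)
        moved = {}
        for task, tracker in sorted(self.assignments.items()):
            if tracker not in live and live:
                # Simplification: always pick the first live tracker.
                self.assignments[task] = live[0]
                moved[task] = live[0]
        return moved


jt = JobTrackerSketch()
jt.heartbeat("tt1", now=0.0)
jt.heartbeat("tt2", now=0.0)
jt.assign("map-0", "tt2")
jt.assign("map-1", "tt1")

jt.heartbeat("tt1", now=12.0)  # tt2 stays silent
moved = jt.reschedule_failed(now=15.0)
print(moved)  # {'map-0': 'tt1'}
```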

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if there is none, it looks for an empty slot on a machine in the same rack.
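The slot lookup described above can be sketched as a small selection function. The `TaskTracker` dataclass, the host and rack names, and the final fallback to any tracker with a free slot are assumptions for illustration; the text only specifies the node-local and rack-local preferences.

```python
from dataclasses import dataclass


@dataclass
class TaskTracker:
    host: str
    rack: str
    free_slots: int


def pick_tracker(trackers, data_host, data_rack):
    """Prefer a node-local slot, then a rack-local slot, then any free slot."""
    candidates = [t for t in trackers if t.free_slots > 0]
    for tracker in candidates:
        if tracker.host == data_host:
            return tracker  # same server as the DataNode holding the data
    for tracker in candidates:
        if tracker.rack == data_rack:
            return tracker  # different server, same rack
    # Assumed fallback: any tracker with a free slot, or None if all are full.
    return candidates[0] if candidates else None


# Hypothetical cluster layout.
trackers = [
    TaskTracker("node-a", "rack1", free_slots=0),
    TaskTracker("node-b", "rack1", free_slots=1),
    TaskTracker("node-c", "rack2", free_slots=2),
]

rack_local = pick_tracker(trackers, data_host="node-a", data_rack="rack1")
node_local = pick_tracker(trackers, data_host="node-c", data_rack="rack2")
print(rack_local.host, node_local.host)  # node-b node-c
```

Here node-a holds the data but has no free slot, so the task goes to node-b in the same rack, which keeps the data transfer inside one rack.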

Job Tracker, Task Tracker, Name Node and Data Nodes
About the Instructor
Deepika Khanna
4.3 Average rating
1,867 Reviews
19,375 Students
16 Courses
Java, J2EE, Salesforce & Android Developer, Teacher

I am a Java/J2EE and Salesforce developer and have been writing and working with software for the past 5 years. I currently live in Dallas, TX.

If your goal is to become one of these:

Android Developer

JAVA/J2EE Developer

Salesforce Developer

Then check out my courses. I have close to 10,000 students in and out of Udemy. My passion is helping people around the world and guiding them into the world of programming.

I am an Oracle-certified Java/J2EE developer. I love coffee, music, exercise, coding and technology. See you in my course :)