Hadoop, Big Data and MapReduce Course

Big Data: Taming Big Data with MapReduce and Hadoop
  • Lectures 21
  • Length 1 hour
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 12/2015 English

Course Description

Big Data is a popular and valuable skill to learn. As the use of phones, Facebook and other social media platforms has increased, we face challenges in storing the large amount of data produced and in keeping up with the speed at which it is generated. We will learn how Hadoop addresses these challenges. Hadoop developers are in great demand, and it is a great career choice.

Learn the basic concepts of Hadoop and Big Data, including what mappers and reducers are and how they work. The course is planned for non-programmers, so you will be able to pick up the concepts of Hadoop quickly.

What are the requirements?

  • No programming skills needed to take this course.

What am I going to get from this course?

  • Understand Big Data, Hadoop and MapReduce
  • Learn about the Hadoop ecosystem
  • Understand the Hadoop technologies Hive, Pig and Spark
  • Understand how mappers and reducers work

Who is the target audience?

  • This course is for students who want to learn about Big Data.
  • After this course, students will have a very good understanding of how mappers and reducers work.
  • No programming skills are needed to take this course.


Section 1: What is "Big Data"? The dimensions of Big Data. Scaling problems.

In this tutorial we will talk about data: why companies need to store all of their data, and why storing it is essential to giving users a better experience, which ultimately generates more sales. For example, if you go to amazon.com and look for a particular product, Amazon saves that information, and the next time you log in it shows you the same product or recommended products based on your previous purchases. This gives the user a better experience and also helps companies like Amazon get more sales.

Introduce yourself
1 page

Big Data is a term used to describe large amounts of data. Not all data is considered Big Data, so we will see on what basis we decide what counts as Big Data.

Assignment 1
1 page
Challenges with Big Data

In this tutorial we will talk about the three V's associated with Big Data.

  1. Volume
  2. Variety
  3. Velocity

  • Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
  • Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.
  • Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
First V: Volume of data
Variety of data: Data is available in different formats
Data in different formats
Why do we need to store data? How will it be helpful for our customers?

Hadoop has two main components: HDFS and MapReduce.

HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.

MapReduce: MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
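The student-name example above can be sketched in plain Python. This is only an in-memory illustration of the map, shuffle and reduce steps, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample student list are made up for this sketch.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit one (first_name, 1) pair per student record."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle(pairs):
    """Group values by key, the step a framework runs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: summarise each group, here by counting its members."""
    return {name: sum(ones) for name, ones in groups.items()}


# Hypothetical input data, one "queue" per first name after the shuffle.
students = ["Ann Lee", "Bob Ray", "Ann Kim", "Cy Poe", "Bob Fox", "Ann Orr"]
name_counts = reduce_phase(shuffle(map_phase(students)))
print(name_counts)
```

On a real cluster the map and reduce functions run on many machines at once and the framework performs the grouping; here everything runs in a single process to keep the idea visible.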


Hadoop Ecosystem:

The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce.

The Hadoop ecosystem includes other tools to address particular needs.

Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for federating services and Oozie is a scheduling system.


The Hadoop Distributed File System was developed using a distributed file system design.

Features of HDFS:

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command-line interface for interacting with HDFS.

NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move or delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
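As a rough illustration of this division of labour, the toy class below keeps only metadata (which blocks make up a file and which DataNodes hold each block) and answers lookups with a list of DataNode names. The class `ToyNameNode`, its methods, the block ids and the node names are all hypothetical; the real NameNode is a Java service with a far richer protocol.

```python
class ToyNameNode:
    """Holds metadata only: file -> blocks, and block -> DataNodes with a replica."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def add_file(self, path, blocks):
        """Record a file given as a list of (block_id, [datanode, ...]) pairs."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, datanodes in blocks:
            self.block_locations[block_id] = list(datanodes)

    def locate(self, path):
        """Return [(block_id, [datanodes])]; the client then reads from DataNodes."""
        if path not in self.file_blocks:
            raise FileNotFoundError(path)
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


name_node = ToyNameNode()
name_node.add_file(
    "/logs/day1.txt",
    [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])],
)
locations = name_node.locate("/logs/day1.txt")
print(locations)
```

Note that the toy NameNode never touches file contents, which mirrors the point above that the NameNode does not store the data itself.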

In Hadoop 1.x, the NameNode is a single point of failure for the HDFS cluster, and HDFS is not a high-availability system: when the NameNode goes down, the file system goes offline. (Hadoop 2.x and later add an optional high-availability mode with a standby NameNode.)

How can NameNode problems be mitigated?

  • List more than one NameNode directory in the configuration, so that multiple copies of the file system metadata will be stored. As long as the directories are on separate disks, a single disk failure will not corrupt the metadata.

DataNode: A DataNode stores data in HDFS. A functional file system has more than one DataNode, with data replicated across them.

On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for file system operations.

Problem with traditional system
Address the problem
NameNode and DataNode in Hadoop
MapReduce: a programming paradigm that scales to thousands of servers in a Hadoop cluster
Mappers and Reducers part 1
Mappers and Reducers part 2

The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.

  1. Client applications submit jobs to the JobTracker.
  2. The JobTracker talks to the NameNode to determine the location of the data.
  3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
  4. The JobTracker submits the work to the chosen TaskTracker nodes.
  5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
  6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker then decides what to do: it may resubmit the task elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
  7. When the work is completed, the JobTracker updates its status.
  8. Client applications can poll the JobTracker for information.
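Step 5 in the list above (heartbeat monitoring and rescheduling) can be sketched as follows. This is a simplified model, not JobTracker code: the function names, the timeout value, the tracker names and the round-robin reassignment policy are all assumptions made for illustration.

```python
from itertools import cycle


def find_failed_trackers(last_heartbeat, now, timeout_seconds):
    """Return trackers whose last heartbeat is older than the timeout."""
    return sorted(
        tracker
        for tracker, seen in last_heartbeat.items()
        if now - seen > timeout_seconds
    )


def reschedule(assignments, failed, healthy):
    """Move tasks off failed trackers onto healthy ones, round-robin (assumed policy)."""
    if not healthy:
        raise RuntimeError("no healthy TaskTracker available")
    targets = cycle(healthy)
    return {
        task: (next(targets) if tracker in failed else tracker)
        for task, tracker in assignments.items()
    }


# Hypothetical cluster state: timestamps are seconds on an arbitrary clock.
last_heartbeat = {"tt1": 100.0, "tt2": 40.0, "tt3": 95.0}
assignments = {"map-0": "tt1", "map-1": "tt2", "reduce-0": "tt2"}

failed = find_failed_trackers(last_heartbeat, now=105.0, timeout_seconds=30.0)
healthy = [t for t in last_heartbeat if t not in failed]
new_assignments = reschedule(assignments, failed, healthy)
print(failed, new_assignments)
```

Here `tt2` has been silent for 65 seconds, so its two tasks are handed to the remaining trackers while the task on `tt1` stays where it is.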

A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.

Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if there is none, it looks for an empty slot on a machine in the same rack.
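The slot search described above can be sketched as a small function. The name `pick_tracker`, the host and rack names, and the final fallback to any free host are assumptions for illustration; the text only describes the same-server and same-rack steps.

```python
def pick_tracker(free_slots, data_hosts, rack_of):
    """Choose a host for a task, preferring data locality.

    free_slots: dict of host -> number of empty task slots
    data_hosts: set of hosts whose DataNode holds the task's input data
    rack_of:    dict of host -> rack name
    """
    candidates = [host for host, slots in free_slots.items() if slots > 0]

    # 1. An empty slot on a server that already holds the data.
    for host in candidates:
        if host in data_hosts:
            return host, "data-local"

    # 2. Otherwise, an empty slot on a machine in the same rack as the data.
    data_racks = {rack_of[host] for host in data_hosts}
    for host in candidates:
        if rack_of[host] in data_racks:
            return host, "rack-local"

    # 3. Assumed fallback: any host with a free slot, or nothing at all.
    if candidates:
        return candidates[0], "off-rack"
    return None, "no free slot"


# Hypothetical three-node cluster across two racks.
free_slots = {"n1": 0, "n2": 2, "n3": 1}
rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}

choice_a = pick_tracker(free_slots, {"n1"}, rack_of)
choice_b = pick_tracker(free_slots, {"n3"}, rack_of)
print(choice_a, choice_b)
```

In the first call the data lives on `n1`, which has no free slot, so the task goes to `n2` in the same rack; in the second call `n3` holds the data and has a free slot, so the task runs next to its data.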


Instructor Biography

Deepika Khanna, Java, J2EE, Salesforce & Android Developer, Teacher

I am a Java/J2EE and Salesforce developer and have been writing and working with software for the past 5 years. I currently live in Dallas, TX.

If your goal is to become one of these:

Android Developer

Java/J2EE Developer

Salesforce Developer

Then check out my courses. I have close to 10,000 students in and out of Udemy. My passion is helping people around the world and guiding them into the world of programming.

I am an Oracle-certified Java and J2EE developer. I love coffee, music, exercise, coding and technology. See you in my course :)
