Big Data is an in-demand and valuable skill to learn. As the use of phones, Facebook, and other social media platforms has increased, we face challenges both in storing this large amount of data and in keeping up with the speed at which it is generated. We will learn how Hadoop addresses these challenges. Hadoop developers are in great demand, and it's a great career choice.
Learn the basic concepts of mappers and reducers, how they work, and the fundamentals of Hadoop and Big Data. The course is planned for non-programmers, and you will understand the concepts of Hadoop in no time.
In this tutorial we will talk about data: why companies need to store so much of it, and why stored data is key to giving users a better experience, which ultimately generates more sales. For example, if you go to amazon.com and look for a particular product, Amazon saves that information, and the next time you log in it shows you the same product or recommended products based on your previous purchases. This gives the user a better experience and also helps companies like Amazon make more sales.
Big Data is a term used to describe large amounts of data. Not all data is considered big data, so we will see on what basis we decide what counts as Big Data.
In this tutorial we will talk about the three V's associated with Big Data: volume, velocity, and variety.
Hadoop has two main components: HDFS and MapReduce.
HDFS (Hadoop Distributed File System): HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
MapReduce: MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
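The student example above can be sketched in a few lines of plain Python. This is only an illustration of the map, group, and reduce steps running in a single process, not Hadoop's actual API; the function names and the sample student list are made up for this sketch.

```python
from collections import defaultdict


def map_phase(students):
    """Map: emit a (first_name, 1) pair for every student record."""
    for full_name in students:
        first_name = full_name.split()[0]
        yield first_name, 1


def shuffle_phase(pairs):
    """Group the emitted values by key: one 'queue' per first name."""
    queues = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return queues


def reduce_phase(queues):
    """Reduce: summarize each queue by counting its entries."""
    return {name: sum(values) for name, values in sorted(queues.items())}


def name_frequencies(students):
    """Run the three steps in order and return the name frequencies."""
    return reduce_phase(shuffle_phase(map_phase(students)))


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    students = ["Ann Lee", "Bob Ray", "Ann Kim", "Cara Diaz", "Bob Stone", "Ann Roy"]
    print(name_frequencies(students))  # {'Ann': 3, 'Bob': 2, 'Cara': 1}
```

On a real cluster the map and reduce steps run on many machines at once and the framework does the grouping between them; here everything runs locally so the flow is easy to follow.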
The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce.
The Hadoop ecosystem includes other tools to address particular needs.
Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for coordinating services and Oozie is a scheduling system.
The Hadoop file system was developed using a distributed file system design.
Features of HDFS:
NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
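To make the idea concrete, here is a toy sketch of a NameNode that holds only metadata and answers "where does this file live?" requests. It is an assumption-laden illustration, not HDFS code: the class `ToyNameNode`, its methods, the block IDs, and the host names are all hypothetical.

```python
class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        # path -> list of (block_id, [DataNode host names])
        self.block_map = {}

    def add_file(self, path, blocks):
        """Record the block locations for a new file (no file data is stored)."""
        self.block_map[path] = list(blocks)

    def locate(self, path):
        """Return the DataNodes holding each block of the file."""
        if path not in self.block_map:
            raise FileNotFoundError(path)
        return self.block_map[path]

    def delete(self, path):
        """Forget a file's metadata."""
        self.block_map.pop(path, None)


if __name__ == "__main__":
    namenode = ToyNameNode()
    # Hypothetical file split into two blocks, each stored on two DataNodes.
    namenode.add_file("/logs/day1.txt", [
        ("blk_1", ["datanode-a", "datanode-b"]),
        ("blk_2", ["datanode-b", "datanode-c"]),
    ])
    for block_id, hosts in namenode.locate("/logs/day1.txt"):
        print(block_id, "->", hosts)
```

The client would then read each block directly from one of the listed DataNodes; the NameNode itself never serves the file contents.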
In Hadoop 1.x, the NameNode is a Single Point of Failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Hadoop 2 and later can be configured for High Availability with a standby NameNode.
How to resolve NameNode problems?
DataNode: A DataNode stores data in HDFS. A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for filesystem operations.
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots, which indicate the number of tasks it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data; if there is none, it looks for an empty slot on a machine in the same rack.
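The slot search described above can be sketched as a small Python function. This is a simplified illustration, not the JobTracker's real scheduling code: the function `pick_tracker`, its parameters, and the host and rack names are hypothetical, and the final "any free slot" fallback is an assumption added so the sketch always has an answer.

```python
def pick_tracker(block_hosts, free_slots, rack_of):
    """Choose a TaskTracker host for a task, preferring data locality.

    block_hosts: set of hosts whose DataNode stores the task's input data
    free_slots:  dict of host -> number of empty task slots
    rack_of:     dict of host -> rack name
    Returns (host, locality), or (None, None) when no slot is free.
    """
    candidates = [host for host, slots in free_slots.items() if slots > 0]

    # 1. Prefer an empty slot on the same server that holds the data.
    for host in candidates:
        if host in block_hosts:
            return host, "node-local"

    # 2. Otherwise, look for an empty slot in the same rack as the data.
    data_racks = {rack_of.get(host) for host in block_hosts}
    for host in candidates:
        if rack_of.get(host) in data_racks:
            return host, "rack-local"

    # 3. Assumption for this sketch: fall back to any host with a free slot.
    if candidates:
        return candidates[0], "off-rack"
    return None, None


if __name__ == "__main__":
    # Hypothetical three-node cluster spread over two racks.
    rack_of = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}
    free_slots = {"node1": 0, "node2": 1, "node3": 2}
    print(pick_tracker({"node1"}, free_slots, rack_of))  # ('node2', 'rack-local')
```

In the usage example the data sits on node1, which has no empty slot, so the task goes to node2 in the same rack rather than to node3 in another rack.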
I am a Java/J2EE and Salesforce developer and have been writing and working with software for the past 5 years. I currently live in Dallas, TX.
If your goal is to become one of these:
Then check out my courses. I have close to 10,000 students on and off Udemy. My passion is helping people around the world and guiding them into the world of programming.
I am an Oracle-certified Java and J2EE developer. I love coffee, music, exercise, coding, and technology. See you in my course :)