
Discover what big data is and its four Vs (velocity, volume, variety, and veracity), and identify the three data types: structured, semi-structured, and unstructured, with attention to data quality.
Learn facts about big data generated by social media and mobile devices, noting billions of users, millions of posts and videos per minute, and data from sensors and geolocation.
Explore how enterprise Hadoop users leverage managed platforms like Amazon Elastic MapReduce and IBM BigInsights to run real-time analytics on large data sets, integrating Cloudera and Cassandra.
Explore the three Hadoop cluster modes: standalone (local) mode, which runs on the local file system; pseudo-distributed mode, used for development; and fully distributed mode, used for production deployment.
Explore the Hadoop ecosystem, including HDFS, MapReduce, and Pig Latin for distributed data processing. Learn distributed workflow and job coordination with ZooKeeper and import/export between RDBMS and Hadoop.
Explore the Hadoop cluster architecture, detailing master and slave roles, the name node, secondary name node, data nodes, and how job and task trackers coordinate data access and metadata.
Discover why Hadoop drives big data analytics across industries, enables scalable, cost-effective data processing, and opens diverse career paths in big data technologies.
Explore Hadoop distributions and their compatibilities, including Cloudera CDH, Greenplum, and Hortonworks, along with supported operating systems such as open source variants and others.
Learn how HDFS stores data in 64 MB blocks (the Hadoop 1.x default; Hadoop 2.x defaults to 128 MB), distributes blocks across nodes, and uses replication (default factor of three) for fault tolerance under name node management.
Explore the HDFS architecture with the name node as metadata master, data nodes for blocks, replication for fault tolerance, and the secondary name node (checkpoint node) managing fsimage.
Learn HDFS read-write operations in a distributed file system, including file creation, block placement, replication, and reading data from data nodes via the name node.
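As a rough illustration of the block and replication arithmetic described above, here is a minimal Python sketch. The function name hdfs_footprint and the 200 MB file are made up for this example; the 64 MB block size and the replication factor of three are the values quoted in the lecture summary.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted in the lecture summary
REPLICATION = 3      # default replication factor quoted in the lecture summary


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file.

    hdfs_footprint is a hypothetical helper for illustration, not a Hadoop API.
    """
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different data nodes.
    replicas = blocks * replication
    # A partial last block only occupies the space of the data it holds,
    # so raw usage is file size times replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(200)
print(blocks, replicas, raw_mb)
```

With these assumptions, a 200 MB file occupies four blocks (three full and one partial), stored as twelve block replicas across the cluster.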
Install Apache Hadoop via the Cloudera distribution CDH 5.2 on VMware Player, after downloading the VM image and ensuring adequate RAM and free disk space.
Demonstrate a MapReduce word count example in Hadoop by tokenizing text, mapping words to ones, shuffling and sorting by key, and reducing to final counts.
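The four steps named above (tokenize, map, shuffle and sort, reduce) can be imitated in plain Python to make the data flow concrete. This is a single-process sketch, not Hadoop code; the function names and the three sample lines are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Tokenize the line and emit a (word, 1) pair for every token.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    # Group all values by key, then order the groups by key,
    # which is what the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    # Sum the ones collected for a single word.
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))


counts = word_count(["deer bear river", "car car river", "deer car bear"])
print(counts)
```

In a real job the map and reduce functions run as separate tasks on different nodes; here they run in sequence only to show what each stage receives and emits.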
Learn how input splits act as logical divisions of data laid over the physical HDFS blocks, and how split size dictates the number of map tasks: large splits create fewer maps, small splits create more.
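The relationship between split size and map count can be sketched with a little arithmetic. The split-size rule below is modeled on the max(min, min(max, block)) rule used by Hadoop's FileInputFormat; the function names, the 1024 MB file, and the min/max settings are assumptions for illustration, and the sketch ignores the small tolerance Hadoop allows on the final split.

```python
import math


def compute_split_size(block_size, min_size, max_size):
    # The split size follows the block size unless the configured
    # minimum or maximum pushes it up or down.
    return max(min_size, min(max_size, block_size))


def num_map_tasks(file_size, split_size):
    # Simplified: one map task per split, with the last split possibly partial.
    return math.ceil(file_size / split_size)


FILE_MB = 1024   # hypothetical input file
BLOCK_MB = 64    # block size used elsewhere in this course outline

default_split = compute_split_size(BLOCK_MB, min_size=1, max_size=10**9)
large_split = compute_split_size(BLOCK_MB, min_size=128, max_size=10**9)
small_split = compute_split_size(BLOCK_MB, min_size=1, max_size=32)

for split in (default_split, large_split, small_split):
    print(split, "MB split ->", num_map_tasks(FILE_MB, split), "map tasks")
```

Under these assumptions the same 1024 MB file yields 16 map tasks with 64 MB splits, 8 with 128 MB splits, and 32 with 32 MB splits.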
Explore the MapReduce architecture in Hadoop, detailing two-stage processing with map and reduce tasks, data splits, partitioning, shuffling, and local storage of intermediate results.
Explain how the combiner processes map output before reduce, reducing intermediate data and transfer time, and note advantages like data reduction and drawbacks such as lack of guaranteed combiner execution.
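To see why a combiner cuts intermediate data, and why the job must still be correct when it does not run, consider this small Python simulation. It is a sketch under simple assumptions: each string stands for one mapper's input split, and the function names are invented for the example.

```python
from collections import Counter


def map_phase(split):
    # Emit (word, 1) for every token in one mapper's split.
    return [(word, 1) for word in split.split()]


def combine(pairs):
    # Local aggregation over ONE mapper's output, before anything is
    # sent across the network to the reducers.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_all(pairs):
    # Final aggregation; it must give the same answer whether or not
    # the combiner ran, because the framework does not guarantee it will.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


splits = ["car car car river", "car river river deer"]

without_combiner = [pair for s in splits for pair in map_phase(s)]
with_combiner = [pair for s in splits for pair in combine(map_phase(s))]

print(len(without_combiner), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")
```

Summation works as a combiner because it is associative and commutative; an operation such as a plain average would need to be restructured (for example, into sum and count) before it could be combined safely.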
Explore how the partitioner routes map outputs to reducers via key-based hashing, ensuring all values for a key reach the same reducer. Learn how to balance load with custom partitioners.
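Key-based hash partitioning can be sketched in a few lines of Python. This is not Hadoop's partitioner class: zlib.crc32 stands in for the key's hash code so the result is stable across runs, and custom_partition is a hypothetical load-balancing rule that reserves one reducer for a single known hot key.

```python
import zlib


def hash_partition(key, num_reducers):
    # Same key -> same hash -> same reducer, so all values for a key
    # end up together. crc32 is a stand-in for the key's hash code.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def custom_partition(key, num_reducers, hot_key):
    # Hypothetical custom rule (requires at least two reducers):
    # reducer 0 handles only the hot key; the rest are hashed over
    # the remaining reducers.
    if key == hot_key:
        return 0
    return 1 + zlib.crc32(key.encode("utf-8")) % (num_reducers - 1)


NUM_REDUCERS = 4
keys = ["the", "car", "river", "deer", "the", "car"]

default_routes = [(k, hash_partition(k, NUM_REDUCERS)) for k in keys]
custom_routes = [(k, custom_partition(k, NUM_REDUCERS, hot_key="the")) for k in keys]

print(default_routes)
print(custom_routes)
```

Both functions depend only on the key, which is what guarantees that repeated occurrences of a key are routed to the same reducer; the custom version simply changes which reducer that is.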
Download CDH 5.2 and extract the file, then run it in a virtual machine to start CDH 5.2 and learn how to use Tolleson.
Explore essential HDFS commands for beginners, including leaving safe mode, using the file system utility to check health, create directories, and list contents while understanding replication factor and directory status.
Explore essential HDFS commands for beginners, including copy from local, listing directories recursively, and displaying file contents with cat, enabling efficient data management in Hadoop.
Learn to perform core HDFS commands for copying, moving, and deleting files across directories, creating directories, and verifying results using practical syntax examples.
Hadoop is an open-source framework that lets you store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
This basic course provides a quick introduction to Big Data, the MapReduce algorithm, and the Hadoop Distributed File System.
This course has been prepared for professionals who want to learn the basics of Big Data analytics using the Hadoop framework and become Hadoop developers. Software professionals, analytics professionals, and ETL developers are the key beneficiaries of this course.
Before you start this course, we assume that you have prior exposure to Core Java, database concepts, and any flavor of the Linux operating system.