The Hadoop ecosystem

Loony Corn
A free video tutorial from Loony Corn
An ex-Google, Stanford and Flipkart team
4.2 instructor rating • 73 courses • 129,179 students

Lecture description

HBase is a database built for the Hadoop ecosystem. Before we get there, let's get a quick understanding of the Hadoop ecosystem.

Learn more from the full course

Learn by Example: HBase - The Hadoop Database

25 solved examples to get you up to speed with HBase

04:56:17 of on-demand video • Updated January 2018

  • Set up a database for your application using HBase
  • Integrate HBase with MapReduce for data processing tasks
  • Create tables, insert, read, and delete data from HBase (a minimal client sketch follows this list)
  • Get an all-round understanding of HBase and its role in the Hadoop ecosystem
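
As a taste of what those HBase operations look like in practice, here is a minimal, hypothetical sketch (not course code) using the standard HBase 1.x Java client. The table name "notes" and column family "info" are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseBasics {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {

            // Create a table "notes" with one column family "info".
            TableName name = TableName.valueOf("notes");
            try (Admin admin = conn.getAdmin()) {
                if (!admin.tableExists(name)) {
                    HTableDescriptor desc = new HTableDescriptor(name);
                    desc.addFamily(new HColumnDescriptor("info"));
                    admin.createTable(desc);
                }
            }

            try (Table table = conn.getTable(name)) {
                // Insert: write one cell at (row1, info:title).
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"),
                        Bytes.toBytes("hello"));
                table.put(put);

                // Read the cell back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("title"))));

                // Delete the row.
                table.delete(new Delete(Bytes.toBytes("row1")));
            }
        }
    }
}
```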
English [Auto]

HBase is a distributed database which is part of the Hadoop ecosystem. To understand HBase and why it exists, it's really necessary to understand Hadoop itself, and what it is in the design of Hadoop that makes it necessary to have a separate database architecture. In this video we'll be introducing Hadoop and its different components. If you're already pretty familiar with Hadoop and what its components are, feel free to skip ahead to the next lecture, where we'll talk about Hadoop's limitations and why we need a separate database.

Hadoop is basically a distributed computing framework that was developed, and is still maintained, by the Apache Software Foundation. Hadoop is written in Java, and this is pretty significant because all the data processing tasks that you might want to run on Hadoop will have to be written in Java as well. If you think about any distributed computing system, it definitely needs a file system which manages how data is stored. It also needs an engine or framework that can process data across multiple servers. That engine acts as the interface to the programmer, and in the back end it manages how the data is processed across the multiple nodes in the cluster. In Hadoop, the file system is the Hadoop Distributed File System, or HDFS, and the framework that processes data across multiple nodes is MapReduce. Let's spend a little bit of time understanding each of these components.

First, let's start with HDFS. HDFS stands for Hadoop Distributed File System. In any computing system, the file system manages the storage of data: it knows how a file has been stored on disk, at what location on the disk that file has been stored, and how to retrieve it when the user asks for it. This is exactly what HDFS is built for. If you consider the entire Hadoop cluster as one single system, then HDFS knows how to store files across that single system, how to retrieve data, and where the data is located at any point in time.

Let's say we have a bunch of nodes. One of these nodes acts as the master, and that is called the NameNode. The NameNode manages the overall file system: it knows the location of all the files across the different DataNodes. It basically stores metadata, so it knows what the directory structure of your file system is. The NameNode also contains all the metadata linked to the files: the last time a file was modified, the read/write permissions attached to the file, and so on. All the other nodes in the cluster are DataNodes, and it's on these DataNodes that the data is physically stored. As we already mentioned, the NameNode only contains metadata; the actual data of the files is physically stored on the DataNodes.

Let's take an example of a file and see how it would be stored on HDFS. Say we have a large text file. In HDFS, this big text file would first be broken up into blocks, each with a size of 128 MB. Let's not go here into why each block has to be of size 128 MB; this just has to do with certain performance considerations for Hadoop. Once the file is broken up into blocks of size 128 MB, these blocks are stored across the different DataNodes: some blocks are stored on DataNode 1, some on DataNode 2, some on DataNode 3, and so on. The NameNode stores the metadata relating to the file: what the file's location in the directory structure is, what its read/write permissions are, and so on.
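To make this concrete, here is a minimal sketch (not from the lecture) that uses Hadoop's Java FileSystem API to ask for exactly the information the NameNode keeps: file length, block size, modification time, permissions, and which DataNodes hold each block. The path /data/big.txt is a placeholder for any file already on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // A file assumed to already exist on HDFS; the path is a placeholder.
        Path file = new Path("/data/big.txt");
        FileStatus status = fs.getFileStatus(file);

        // Metadata held by the NameNode: size, block size, modification time, permissions.
        System.out.println("Length:      " + status.getLen());
        System.out.println("Block size:  " + status.getBlockSize()); // typically 128 MB
        System.out.println("Modified at: " + status.getModificationTime());
        System.out.println("Permissions: " + status.getPermission());

        // Block locations: which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```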
In addition to that metadata, the NameNode also stores the block locations for each of the files in the cluster, so it knows which block belonging to which file is stored on which DataNode. So if you wanted to read a file, you would basically use the metadata on the NameNode, and then you would reconstruct the file from the individual blocks that are present across the different DataNodes. You need to be able to fetch all the blocks of a file in order to reconstruct it.

There's one point of concern here: what if one of the blocks stored on these nodes gets corrupted, or one of the DataNodes itself completely crashes? How do we then recover the data of the file? This is one of the things that HDFS solves in order to abstract the programmer from the challenges of distributed computing, and it does this by allowing the programmer to define something called a replication factor. Using this replication factor, HDFS creates replicas of each of the blocks. So let's say the replication factor is 3: then each block is replicated three times, and each of the replicas is stored in a different location. Then, even if one DataNode goes down, the data that was on it can still be retrieved.

So that was all about HDFS. The second important component of Hadoop is MapReduce, which is what we need to process the data that is stored across these multiple nodes. MapReduce uses the idea that there is a way to parallelize most data processing tasks if you express them in a particular way: you need to express them so that your processing task has two phases. The first phase is called the map phase. Here, all the individual nodes work in parallel, each processing the blocks that are stored on its own disks. So let's say we have a file, and a couple of blocks of that file are stored on DataNode 1; then those blocks are processed on DataNode 1 itself, and results are generated. These results from each of the nodes are then transferred over to one single node, where they are combined to form the final result. The second phase, where the results are copied over to another node and combined, is called the reduce phase.

Using this idea, almost any data processing task can be expressed as a chain of map and reduce operations. All the programmer needs to do is specify what logic should be implemented in the map phase, and then the logic to combine the results in the reduce phase. The rest is taken care of by Hadoop: any complexity in copying your code over to all the different nodes, making sure that they run their tasks, making sure that all the results are combined in one node, making sure that if one node goes down another comes and takes its place. All of that is taken care of by Hadoop.

So now we understand the basic components of Hadoop and how they work. Let's go ahead and see why these are not sufficient for database management, and why we need HBase.
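The lecture describes the map and reduce phases in the abstract; the classic worked example is word count, essentially the one from the standard Hadoop MapReduce tutorial. Below is a minimal sketch of it: the mapper runs in parallel against locally stored blocks and emits a (word, 1) pair for every word, and the reducer combines those partial results into final counts. Hadoop handles shipping the code to the nodes and the intermediate results to the reducer.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on each node, against the blocks stored locally.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for every word in the block
            }
        }
    }

    // Reduce phase: per-node results are shipped to one place and combined here.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // total count for this word
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```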