Hadoop Overview and History

A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro
4.6 instructor rating • 23 courses • 504,873 students

Lecture description

What's Hadoop for? What problems does it solve? Where did it come from? We'll learn Hadoop's backstory in this lecture.

Learn more from the full course

The Ultimate Hands-On Hadoop: Tame your Big Data!

Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies.

14:40:15 of on-demand video • Updated July 2021

  • Design distributed systems that manage "big data" using Hadoop and related technologies.
  • Use HDFS and MapReduce for storing and analyzing data at scale.
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
  • Analyze relational data using Hive and MySQL.
  • Analyze non-relational data using HBase, Cassandra, and MongoDB.
  • Query data interactively with Drill, Phoenix, and Presto.
  • Choose an appropriate data storage technology for your application.
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie.
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume.
  • Consume streaming data using Spark Streaming, Flink, and Storm.
So let's talk a little bit about the basics of Hadoop: where it came from, what it's for, what it's all about. Let's dive right in and talk about the Hadoop ecosystem. The mascot is this yellow elephant here, and we'll talk about why in a minute. Again, my name is Frank Kane, and I'll be your instructor as we make sense of this wild world of Hadoop.

So what is Hadoop, anyway? Well, the definition from Hortonworks, which is one of the major vendors of Hadoop platforms, is "an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware." Wow, that's a mouthful. Let's break it down a little bit.

Open source means it's free. That's a good thing. A software platform is a bunch of software that runs on a cluster of computers. So as opposed to just running on a single PC, Hadoop is meant to run on an entire cluster of PCs sitting in a rack in some data center somewhere. It leverages the power of multiple computers to actually handle big data.

Distributed storage is one of the main things Hadoop provides. The idea is that you don't want to be limited by a single hard drive. If you're dealing with big data, you might be getting terabytes of information every day, or even more. Where are you going to put all that? Well, the nice thing about distributed storage is that you can just keep adding more and more computers to your cluster, and their hard drives become part of your data storage. Hadoop gives you a way of viewing all of the data distributed across all of the hard drives in your cluster as one single file system. And not only that, it's also very redundant. So if one of those computers happens to burst into flames and melt into a gooey puddle of silicon, Hadoop can handle that, because it keeps backup copies of all your data elsewhere on your cluster and can automatically recover. That makes your data very resilient and very reliable.

The other thing it gives you is distributed processing. Not only can Hadoop store vast amounts of data across an entire cluster of computers, it can distribute the processing of that data as well. Whether you need to transform all that data into some other format, move it to some other system, or aggregate it in some way, Hadoop provides the means for doing all of that in a parallel manner, so it can set all the CPU cores on your entire cluster chugging away on the problem in parallel. That way you can get through all that data very, very quickly. You don't have to sit there with a single CPU chugging away on a petabyte of information; Hadoop can divide and conquer that problem across all the PCs in a cluster.

And of course, this is all meant for very large data sets that can't be managed by a single PC, running on computer clusters built from commodity hardware. By commodity hardware we don't mean cheap; we just mean readily available stuff you can rent from, say, Amazon Web Services, Google, or any of the other vendors out there that sell cloud services. You can just rent off-the-shelf hardware, throw it into a Hadoop cluster, and go for it, adding as many computers as you need to handle the amount of data you have.
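To make "distributed processing" a little more concrete, here's a minimal word-count sketch in the style of Hadoop Streaming, which lets you write the map and reduce steps as plain scripts that read standard input and write standard output. Hadoop would run many copies of the mapper in parallel, one per chunk of input, sort the intermediate pairs by key, and feed them to the reducers. The file name and invocation here are just illustrative; we'll cover MapReduce properly later in the course.

```python
#!/usr/bin/env python3
# wordcount.py -- a minimal word count in the style of Hadoop Streaming.
# Run with "map" or "reduce" as the argument. On a real cluster, Hadoop
# launches many mapper copies in parallel (one per input split), sorts
# the intermediate lines by key, and feeds them to the reducers.
import sys

def mapper():
    # Each mapper sees only its own chunk of the input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")  # emit a (word, 1) pair per word

def reducer():
    # Input arrives sorted by key, so identical words are consecutive;
    # we just sum the counts for each run of identical keys.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You can simulate the whole pipeline on a single machine with a shell pipe like `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`; on a cluster, Hadoop handles the splitting, sorting, and parallelism for you.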
What's the history of Hadoop? Let's talk about that briefly, because Hadoop wasn't the first solution to this problem. Quite honestly, Google is the grandfather of all this stuff. Back in 2003 and 2004, Google published a couple of papers. One was about the Google File System, GFS, and that was basically the foundation of the ideas behind how Hadoop does its distributed storage. The other was about MapReduce, Google's approach to distributed processing. So GFS is what inspired Hadoop's distributed data storage, and MapReduce is what inspired Hadoop's distributed processing. In fact, they didn't even bother changing the name; it's still called MapReduce.

Hadoop itself was developed originally at Yahoo. They were building something called Nutch, an open source web search engine, at the time. They saw these papers published by Google and said, "Hey Google, thanks for telling us all your trade secrets, we'll take that from you. Thank you very much." Primarily a couple of guys, Doug Cutting and Tom White, started putting Hadoop together in 2006. Of course there was a bigger team than that, but they're the guys who get the credit. The story goes that Hadoop was actually the name of Doug Cutting's kid's toy elephant, and apparently this is the actual yellow stuffed elephant named Hadoop that the project was named after. So now you know why Hadoop's mascot is a yellow elephant. "Hadoop" doesn't really mean anything, and unfortunately you'll find that a lot of the names of technologies in the Hadoop ecosystem don't really mean anything either, which can make it hard to keep track of what does what. But that's why we're here; we're going to help you through all that. And hey, the rest is history, as they say. Ever since 2006, Hadoop has continued to evolve, and not only Hadoop itself: the ecosystem surrounding Hadoop has continued to grow, with more and more applications and systems built around it. We'll talk about that shortly.

Why Hadoop? Let's talk about that. Again, data is just too darn big these days. Maybe you have DNA information, sensor data, web logs, trading information from the stock market, who knows. Whatever it is, in this day and age, and at the scale of today's companies, one PC isn't going to cut it anymore. You can't just go back to the old days of having a giant Oracle database that you keep throwing bigger CPUs, more CPU cores, or bigger RAID arrays at. You come to a limit where that just can't scale any further. And not only that, you're dealing with things like disk seek times. Even if you did have some massive petabyte hard drive, you'd still have to wait for that disk head to move around. So there are real advantages to using a cluster of computers with many, many hard drives that can all be seeking in parallel with independent disk heads.

The other thing is that you don't want a single point of failure. That giant Oracle database, what happens when it goes down? Well, bad things. You have to deal with replication and all that nonsense yourself. Hadoop handles that problem for you: it keeps track of what's available on your cluster and fails over automatically to backup copies if need be. Processing times are also better. If you have an entire cluster, you have that many more CPUs at your disposal, and Hadoop offers a means of parallel processing that can take advantage of all of them.
So remember, horizontal scaling, like we do with a Hadoop cluster, is linear: if you need to handle more data or process it faster, you just add more computers to your cluster, and it gets that much faster. That's not the case with vertical scaling, where you start to hit hard limits, like those disk seek times, and linear scaling just isn't possible.

Now, Hadoop was originally made just for batch processing, and the idea was: if you need to run some sort of analysis and you can wait a few minutes for the answer, Hadoop might be for you. And let's be honest, if you've ever dealt with a giant database on an Oracle instance in a data warehouse, those queries don't come back instantly either; they can take several minutes as well. So we're not really doing any worse than we were back in the Oracle days. But there are things that have been built on top of Hadoop that do make it appropriate for interactive queries as well, and we'll talk about those too. There are also systems that can expose the data from Hadoop in ways that can be consumed by web applications, or whatever you want, at very high transaction rates. So it's not just for batch processing anymore.

So that's the backstory of Hadoop: what it's all about, what it's used for, why it exists, and why it's so powerful. Up next, let's talk about the actual technologies that make up the Hadoop ecosystem, so you can talk about them and know just enough to be dangerous. So, on to the next lecture.