Installing Hadoop [Step by Step]

A free video tutorial from Sundog Education by Frank Kane
Join over 800K students learning ML, AI, AWS, and Data Eng.
Instructor rating: 4.6 out of 5
37 courses
859,233 students

Lecture description

After a quick intro, we'll dive right in and install Hortonworks Sandbox in a virtual machine right on your own PC. This is the quickest way to get up and running with Hadoop so you can start learning and experimenting with it. We'll then download some real movie ratings data, and use Hive to analyze it!

Learn more from the full course

The Ultimate Hands-On Hadoop: Tame your Big Data!

Data Engineering and Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more!

14:40:37 of on-demand video • Updated November 2023

Design distributed systems that manage "big data" using Hadoop and related data engineering technologies.
Use HDFS and MapReduce for storing and analyzing data at scale.
Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
Analyze relational data using Hive and MySQL
Analyze non-relational data using HBase, Cassandra, and MongoDB
Query data interactively with Drill, Phoenix, and Presto
Choose an appropriate data storage technology for your application
Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie.
Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
Consume streaming data using Spark Streaming, Flink, and Storm
Welcome to my course, The Ultimate Hands-On Hadoop: Tame Your Big Data. My name is Frank Kane, and I'm your instructor. I spent almost a decade as a senior engineer and a senior manager at Amazon.com and IMDb.com, where I helped create systems such as product recommendations, "people who bought this also bought," and much more. Hadoop is a powerful tool for transforming and analyzing massive data sets on a cluster of computers, but it consists of hundreds of distinct technologies, and understanding how they fit together can be intimidating. At the end of this course, you'll understand the major components of the Hadoop ecosystem and how to fit them together, with lots of hands-on activities and exercises you can run right on your own PC at home. We'll install a virtual machine on your desktop that lets us get our hands dirty with over 25 technologies related to Hadoop, and get you comfortable with each one. We'll import and export data into both relational and non-relational data stores, analyze them with SQL-like queries using Hadoop, write real working programs with MapReduce, Pig, and Spark, and design some real-world systems using what we've learned. I designed this course for a wide range of people. The activities in this course range from using web-based UIs, to using command line interfaces, to writing real programs in Python and Scala. So whether you want a high-level understanding of Hadoop or you want to dive deep into programming it, you'll find what you need here. If you work in a company that has big data or want to work at one, it's hard to imagine a better use of your time than learning about Hadoop. Let me show you how it all works. I'll see you in the course.
So let's go ahead and install a Hadoop stack on your PC. It's not that hard. Start by opening up a web browser. The first thing we need to do is install a virtual machine on your desktop.
Basically, we're going to run a little Linux environment on your own PC or Mac or whatever the heck it is you're running. So to do that, open up your favorite web browser and search for VirtualBox, and that will take you to virtualbox.org, which is free software for running a virtual machine right on your PC. There should be a big friendly download button. The version may be newer than 5.1, depending on when you're watching this. And you can see it's available for Windows, OS X, Linux, and Solaris. So go ahead and install the version that makes sense for you. For me, that's going to be Windows, so I'm just going to go ahead and download that. It comes down pretty quickly, about 100 megs, and then we'll install it. It's a standard Windows installer. Just accept the defaults and you'll be fine.
All right. I stopped things through the magic of video editing, because it does take a few minutes for that to install. If you're following along with me, and I hope you are, go ahead and pause this video until you get to this screen here, and be sure to watch out for little pop-up dialogs that might have come up during the installation and might be hidden behind this window. So keep an eye on your taskbar down there for anything that you might need to dismiss. Now, we're not actually going to start it yet, so I'm going to uncheck that and click Finish. We now have VirtualBox in place, which will let us run a virtual Linux box on our PC.
Next, we need to install the Hortonworks Data Platform sandbox so you can actually run a Hadoop environment on your own PC. To do that, we're going to go to cloudera.com. Cloudera bought Hortonworks a few years ago, so now they're hosting that file. They don't seem to like paying for your download, though, so it's a little bit hard to find.
Anyway, if you go to cloudera.com, probably my best advice, because they keep moving it around, is to hit the search icon and type in HDP sandbox, and that should lead you to it. Let's go to the downloads link for it. It's a very large file, so it will take some time to download. If you see something like an access denied message here, that happens sometimes. Just try it again. And if you just can't get it to work, refer to the text lecture before this lecture for some tips on how to download that file directly. Otherwise, just wait for it to come down and move on. And remember, too, that you can take this course without following along hands-on. It's okay to just watch me do these exercises without doing them yourself. So if you find yourself in a situation where you just can't download that file, because your internet connection won't support it or your PC just won't run it, that's okay. Just watch the videos. You'll still learn that way if you need to.
Okay. Time has passed, and we've downloaded 11 GB of Hortonworks Sandbox goodness. All we have to do now is open up that OVA file that we downloaded. Hope you had enough room on your hard drive for that, although 11 gigs isn't that big these days. We're just going to go ahead and click the Import button here to import that image into VirtualBox so we can fire it up whenever we want to. This will take a little bit of time to copy over. After all, it's 11 GB. So again, we'll hit pause here and wait until that's done.
All right, we've imported our Hortonworks Docker sandbox. What is this thing that we just downloaded, anyway? Well, basically it's a pre-installed Hadoop environment that also has a bunch of associated technologies installed as well. A lot of the stuff that we're going to go through in this course is already set up for you by virtue of installing the Hortonworks sandbox. So all we have to do now is click on it and hit Start.
And that will basically be the same thing as pushing the power button on a PC that has CentOS installed. So we're just going to sit here and wait for that to boot up. It will just do its thing automatically. It does take a little bit of time to get up and running, so let's let that go ahead and do its thing.
While we're waiting, let's go and download some data to play with. I used to work at IMDb.com, so I'm kind of a movie nut. Let's download some movie data. If we go to your browser and go to grouplens.org, this will take you to some free data that contains real movie ratings by real people that we can play with. So just click on Datasets. And since we're only running on our own PC here, we can't really deal with big data per se, so let's find a smaller data set. How about the MovieLens 100K dataset? It's older. It's from 1998, so you're not going to see any new movies in there. But for our purposes it'll do the job. So let's go ahead and click on ml-100k.zip. It's only five megabytes. Open that up, and we'll decompress it. And just remember where that is, wherever you downloaded it.
So we have an ml-100k folder. This contains a bunch of files that contain movie ratings data. The u.data file is what contains the actual ratings data. If we open that up in WordPad or Notepad or something, you can take a look at the format here. Basically, these are four columns of information. The first column is a user ID, the second column is a movie ID, and then there's a rating on a scale of 1 to 5 and a timestamp. So for example, this first line is saying user ID 196 rated movie number 242 three stars at this timestamp. And if we need to look up what those different movie IDs mean, u.item comes to the rescue. This is where all the metadata for the movies lives. So that tells us, for example, movie ID 1 is Toy Story from 1995, movie ID 2 is GoldenEye, and so on and so forth. Those are the two files that we're going to be playing with.
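To make those two file layouts concrete, here is a minimal Python sketch of how one line from each file could be split into its columns. This is only an illustration, not part of the lecture: the function names are made up, the sample lines are typed in by hand (the movie line is abbreviated to its first few fields), and the timestamp value is just a stand-in.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split one tab-delimited ratings line into its four columns."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


def parse_movie(line: str) -> tuple[int, str]:
    """Return (movie ID, title) from one pipe-delimited movie metadata line."""
    fields = line.rstrip("\n").split("|")
    return int(fields[0]), fields[1]


# Hand-typed sample lines for illustration only (hypothetical timestamp).
sample_rating = parse_rating("196\t242\t3\t881250949\n")
sample_movie = parse_movie("1|Toy Story (1995)|01-Jan-1995\n")

print(sample_rating)
print(sample_movie)
```

The only real difference between the two parsers is the delimiter, which is the same choice the importer asks about later on.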
Just as a little experiment here, let's see if our virtual box has booted up yet. Almost there. All right, we are in business. Check it out. So we have a little CentOS instance running here that actually has Hadoop up and running. To use it, well, we could just log into this from PuTTY or some other SSH client. But it also has a browser interface we can use, which uses something called Ambari to let us visualize what's going on in our little mini Hadoop instance and do some interesting stuff on it. So let's go ahead and do what it says. Just go to our browser and go to the address it shows, ending in colon 8888. Just like that. Okay. And you should see a screen like this. So let's go ahead and launch our dashboard. Why not? Remember to disable pop-up blocking if necessary. And you'll come up to this screen here. It does have a bunch of handy-dandy tutorials, and if you want to go through those, by all means feel free to. The login is going to be maria_dev, and the password is also maria_dev. And that brings us into Ambari, which lets us visualize what's going on in our little Hadoop cluster here.
Now, the beautiful thing is that with Hadoop, you don't really have to worry about whether it's running on one system, like we're doing right now, or on multiple nodes of an entire cluster of computers. All that is generally abstracted away for you by Hadoop. So as a programmer, a developer, or an analyst, what you do to develop under Hadoop is not going to be a whole lot different if you're running on one machine versus an entire cluster of machines. The main difference is how much data you can actually handle, which is why we're sticking with a small data set for now. So with that, let's do something fun with that MovieLens data that we just downloaded.
So what we're going to do is import that movie ratings data that we downloaded into our Hadoop cluster, and we're going to use something called Hive to let us execute SQL queries on top of that data, even though it's not really being stored in a relational database. It's being stored in what could be a distributed manner across an entire cluster. So let's take a quick peek again at that data that we downloaded. As a reminder, this is the ml-100k folder that we downloaded earlier. The u.data file represents all of our ratings data. And remember, that is tab-delimited data that represents a user ID, movie ID, rating, and timestamp. The other relevant file is u.item, which contains a mapping of movie IDs to movie names and a bunch of other metadata, specifically the release date, IMDb link, and information about the movie genres associated with each title. The key point here is that this file is delimited by pipe characters, those vertical bars.
So let's import this data into Hadoop. We can do that right from the Ambari sandbox view here. We're going to go over here to this little grid icon and select Hive View, because we're going to use the Hive tool to play around with this data. We'll talk about Hive in a lot more depth soon. But for now I just want to get some results out of this thing that we installed and give you a little bit of a return on all of your work here. So let's click on Upload Table once you get here, and follow along with me, please. You are going to feel better about all this if you can actually follow along and get some success here. Now, it says CSV file, comma-separated values. Even though this data is really separated by tabs and pipes, you can still use the CSV importer, because you can configure it to use any delimiter you want. It doesn't have to be a comma. So let's start with the ratings data file. We know that's delimited by tab characters.
So I'm going to select ASCII value 9, the tab character, there. And I'm going to import it from my local file system. So again, navigate to where you downloaded and extracted the ml-100k folder, and we're going to select the u.data file. Open that up. We need to give this table a name. We're going to save it into a table named, I don't know, how about ratings? That's creative. And we'll name the columns that we care about. Remember, this is the user ID, the movie ID, the rating itself on a 1 to 5 scale, and the rating time. So again, give your columns some names: they're user_id, movie_id, rating, and rating_time. If you need to catch up, don't hesitate to hit pause before I move on. And when you're done, hit Upload Table. It will create a table within Hive that's basically a view of this data that's sitting on our Hadoop instance. And we're done.
So let's go ahead and import that movie names data as well. Remember, that one is separated by pipe characters. So I'm going to change that tab delimiter to the pipe, which is character 124 in the ASCII table. And we will select our u.item file this time. Open that up, and we'll call this table, I don't know, movie_names. We really only care about the first two columns for what we're doing, and those are going to be movie_id and name. All right. We'll just leave everything else at the defaults for now, just for the sake of time. Go ahead and upload that as well. Again, it might take a minute or two. And that's ready as well.
So we have this data imported onto our Hadoop instance. Let's go ahead and run some queries on it. Let's find the most popular movie out of the entire data set. Now remember, this is a snapshot of the movie ratings data from 1998. So let's play a little game. What do you think the most popular movie of all time was as of 1998? We'll see if your guess was correct. Let's write a little standard SQL query.
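If you're curious what those two delimiter settings amount to, here is a small Python sketch of the same idea using the standard csv module. It is not what the Hive View importer runs internally, just an illustration: the helper name is made up, the sample text is hand-typed, and only the column names come from the steps above.

```python
import csv
import io

TAB = chr(9)     # ASCII 9, the delimiter chosen for the ratings file
PIPE = chr(124)  # ASCII 124, the delimiter chosen for the movie names file

RATING_COLUMNS = ["user_id", "movie_id", "rating", "rating_time"]
MOVIE_COLUMNS = ["movie_id", "name"]  # only the first two fields are named


def read_delimited(text, delimiter, columns):
    """Parse delimited text into dicts, keeping the first len(columns) fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    # zip() stops at the shorter sequence, so extra fields are dropped.
    return [dict(zip(columns, row)) for row in reader if row]


# Hand-typed sample data for illustration only.
ratings = read_delimited("196\t242\t3\t881250949\n", TAB, RATING_COLUMNS)
movies = read_delimited("1|Toy Story (1995)|01-Jan-1995\n", PIPE, MOVIE_COLUMNS)

print(ratings)
print(movies)
```

The point is the same one made above: a "CSV" reader is really a delimited-text reader, and the comma is only its default.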
And again, this is kind of cool because we're not really on a relational database here. This is just a tool called Hive that lets us interact with our data on Hadoop as if it were a relational database, although it really isn't. So let's do SELECT. Well, we need a little bit of a cheat sheet here to remember what we called everything. If you open up the default database here on the left and refresh it with this little icon, we can see we have our new tables, movie_names and ratings. And we can expand those to have a little cheat sheet of what we named everything. So what we're going to do is, from the ratings table, SELECT movie_id, COUNT(movie_id). Follow along carefully here. You've got to make sure you don't have any typos. And we're going to alias that as the rating count, because that count for each movie is what we're going to sort by later in this query. Then we will say FROM the table ratings, and we're going to GROUP BY movie_id. So this is basically going to group together all of the ratings for a given movie ID and count them all up. Right. And one last thing we need to do is ORDER BY the rating count, and we're going to do that in descending order, like that.
So what this query is going to do is count up how many times each movie was rated in our data, and sort it in reverse order by the number of times it showed up. This should give us a sorted list from the most popular to the least popular movie in our data set. Let's go ahead and hit Execute. It will go chug away, and under the hood, it's doing an awful lot of stuff. It's going to try to figure out how to turn this query into what's called a MapReduce job on my Hadoop instance. And if this were happening on a cluster, it would be figuring out the optimal way to divide this data up, send it across your entire cluster, and process it all in parallel.
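Written out, the query described above looks roughly like the one in this sketch. To keep it runnable without a Hadoop sandbox, it uses Python's built-in sqlite3 module as a stand-in for Hive, so it only shows the shape of the SQL, not how Hive executes it. The handful of ratings rows are made up, and the alias spelling rating_count is an assumption, since the lecture only says it aloud.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (user_id INT, movie_id INT, rating INT, rating_time INT)"
)

# Made-up rows for illustration; the real table holds 100,000 ratings.
rows = [
    (1, 50, 5, 0), (2, 50, 4, 0), (3, 50, 5, 0),
    (1, 2, 3, 0), (2, 2, 4, 0),
    (3, 1, 4, 0),
]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)

# Count the ratings per movie and sort from most rated to least rated.
QUERY = """
SELECT movie_id, COUNT(movie_id) AS rating_count
FROM ratings
GROUP BY movie_id
ORDER BY rating_count DESC
"""

popularity = conn.execute(QUERY).fetchall()
print(popularity)
```

With these made-up rows, movie 50 comes out on top simply because it has the most rows, which is the same definition of "popular" the lecture is using.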
So it looks like a SQL query, but it's doing it in a much more interesting and much more scalable way. We already have our answer here. The top movie is movie ID 50, with 583 ratings. And before we go look up what movie ID 50 is, I want to show you real quick a little bit of eye candy. A neat little thing in the dashboard here in the Hive view is this visualization option, which makes it very easy to visualize the results of one of these queries from Hive. So for example, we can look at the movie ID to rating count distribution just by doing that. You can see this little scatter plot here that shows the rating count for every movie ID, and just by glancing at it, I can see that there is, in fact, a relationship between movie IDs and popularity. For whatever reason, the more popular movies tend to have lower IDs. Hmm, interesting, isn't it? Maybe that's just because they were assigned in order, and older movies have been rated more because they've been around longer. I don't know. We could dig into that if we wanted to.
Let's go back to our SQL view and find out what movie ID 50 is. What is the most popular movie of all time? Quick query for that. Let's see, what did we call it? SELECT name FROM movie_names WHERE movie_id = 50. And that should give us the answer to our little quiz here. The most popular movie as of 1998, in this data, was, drum roll please, Star Wars. Did you guess correctly? Hope you did. If you didn't, hey, it's okay, I won't tell anyone.
And there you have it. You've installed Hadoop on your PC in a virtual machine, imported data into it, and queried it using Hive, all in one lecture. So hey, I hope you're happy about the progress you've made so far. It's pretty exciting. In the next lecture we'll go into more depth about what's actually going on here. What's Hadoop really all about? What are the pieces of it, and how do they all fit together?
So stick with me to the next lecture and we'll learn more about what actually went on here.
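For reference, the title lookup from this lecture can be sketched the same way. As before, this uses Python's built-in sqlite3 module as a stand-in for Hive, the two rows are typed in by hand, and the helper function name is made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_names (movie_id INT, name TEXT)")

# Two hand-typed rows for illustration only.
conn.executemany(
    "INSERT INTO movie_names VALUES (?, ?)",
    [(1, "Toy Story (1995)"), (50, "Star Wars (1977)")],
)


def movie_name(connection, movie_id):
    """Look up one movie title by ID, or return None if the ID is unknown."""
    row = connection.execute(
        "SELECT name FROM movie_names WHERE movie_id = ?", (movie_id,)
    ).fetchone()
    return row[0] if row else None


top_movie = movie_name(conn, 50)
print(top_movie)
```

The query inside the helper is the same SELECT used in the lecture, with the ID passed in as a parameter instead of typed into the SQL text.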