Installing Hadoop [Step by Step]

Sundog Education by Frank Kane
A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro
4.6 instructor rating • 28 courses • 540,178 students

Lecture description

After a quick intro, we'll dive right in and install Hortonworks Sandbox in a virtual machine right on your own PC. This is the quickest way to get up and running with Hadoop so you can start learning and experimenting with it. We'll then download some real movie ratings data, and use Hive to analyze it!

Learn more from the full course

The Ultimate Hands-On Hadoop: Tame your Big Data!

Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies.

14:41:35 of on-demand video • Updated November 2021

  • Design distributed systems that manage "big data" using Hadoop and related technologies.
  • Use HDFS and MapReduce for storing and analyzing data at scale.
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
  • Analyze relational data using Hive and MySQL.
  • Analyze non-relational data using HBase, Cassandra, and MongoDB.
  • Query data interactively with Drill, Phoenix, and Presto.
  • Choose an appropriate data storage technology for your application.
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie.
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume.
  • Consume streaming data using Spark Streaming, Flink, and Storm.
Welcome to my course, The Ultimate Hands-On Hadoop: Tame Your Big Data. My name is Frank Kane, and I'm your instructor. I spent almost a decade as a senior engineer and senior manager at Amazon.com and IMDb.com, where I helped create systems such as product recommendations, "people who bought this also bought," and much more. Hadoop is a powerful tool for transforming and analyzing massive data sets on a cluster of computers, but it consists of hundreds of distinct technologies, and understanding how they fit together can be intimidating. By the end of this course, you'll understand the major components of the Hadoop ecosystem and learn how they fit together, with lots of hands-on activities and exercises you can run right on your own PC at home. We'll install a virtual machine on your desktop that lets us get our hands dirty with over 25 technologies related to Hadoop and get you comfortable with each one. We'll import and export data into both relational and non-relational data stores, analyze them with simple Hive queries on Hadoop, write real working programs with MapReduce, Pig, and Spark, and design some real-world systems using what we've learned. I designed this course for a wide range of people; the activities range from using web-based UIs to using command-line interfaces to writing real programs in Python and Scala. So whether you want a high-level understanding of Hadoop or you want to dive deep into programming it, you'll find what you need here. If you work at a company that has big data, or want to work at one, it's hard to imagine a better use of your time than learning about Hadoop. Let me show you how it all works. I'll see you in the course.

So let's go ahead and install a Hadoop stack on your PC. It's not that hard. Start by opening up a web browser, because the first thing we need to do is install a virtual machine on your desktop. Basically, we're going to run a little Linux environment on your own PC or Mac or whatever the heck it is you're running. To do that, open up your favorite web browser and search for VirtualBox, and that will take you to virtualbox.org, which is free software for running a virtual machine right on your PC. There should be a big friendly download button; the version may be newer than 5.1, depending on when you're watching this. You can see it's available for Windows, OS X, Linux, and Solaris, so go ahead and install the version that makes sense for you. For me, that's going to be Windows, so I'm just going to go ahead and download that. It comes down pretty quickly, about 100 megs, and then we'll install it. It's a standard Windows installer; just accept the defaults and you'll be fine.

All right, I'm skipping ahead through the magic of video editing, because it does take a few minutes for that to install. If you're following along with me, and I hope you are, go ahead and pause this video until you get to this screen here, and be sure to watch out for little pop-up dialogs during the installation that might be hidden behind this window. Keep an eye on your taskbar down there for anything you might need to dismiss. Now, we're not actually going to start it yet, so I'm going to uncheck that and click Finish. We now have VirtualBox in place, which will let us run a virtual Linux box on our PC. Next, we need to install the Hortonworks Data Platform sandbox so you can actually run a Hadoop environment on your own PC here.
And to do that, we're going to go to cloudera.com. Cloudera actually bought Hortonworks a few years ago, so now they're the ones hosting that file. They don't really like paying for you to download it, it seems, so it's a little bit hard to find. If you go to cloudera.com, my best advice, because they keep moving it around, is to hit the search icon and type in "HDP sandbox," and that should lead you to it. Let's go to the downloads link for it, and we'll click on Hortonworks Sandbox. It says you need a subscription for the software, but you don't, at least not yet; the sandbox environment is kind of a trial thing to let you learn how to use it, so there's no need for you to pay for that. Just click on "download now" under Hortonworks HDP, and from here you want to select VirtualBox. You do need to provide some personal information, unfortunately, to download it; just say that you're doing it for an online training course. Hey, I think they actually had me in mind there. Type in your information; you don't have to opt in to be contacted by them if you don't want to, so leave those checkboxes blank if you like. You do need to give them a name, though, so go ahead and fill that in. Once you've continued beyond that, accept their policies and submit it.

So we're almost there now. I don't want you to install the latest version of the sandbox, 3.0. The reason is that it consumes a lot of system memory and a lot of resources on your computer, and most people don't have enough of a box to actually run it reliably. Instead, I want you to stick with one of these older versions. You're going to have the best experience, most likely, with 2.6.5; that's going to have the most up-to-date package repositories and things like that. But if you have trouble running it, you can try 2.5 instead. That will consume even fewer resources, but you're more likely to run into trouble with out-of-date packages and things like that. So I do recommend starting with 2.6.5; just go ahead and click on that. It's a very large file, so it will take some time to download. If you see something like an "access denied" message here, that happens sometimes; just try it again. And if you just can't get it to work, refer to the text lecture before this one for some tips on how to download that file directly. Otherwise, just wait for it to come down and move on. Remember, too, that you can take this course without following along hands-on. It's OK to just watch me do these exercises without doing them yourself. So if you find yourself in a situation where you just can't download that file for whatever reason, because your Internet connection won't support it or your PC just won't run it, that's OK. Just watch the videos; you'll still learn that way if you need to.

OK, time has passed, and we've downloaded 11 gigabytes of Hortonworks sandbox goodness. All we have to do now is open up that OVA file that we downloaded. I hope you had enough room on your hard drive for that; eleven gigs isn't that big these days. We're just going to go ahead and click the Import button here to import that image into VirtualBox so we can fire it up whenever we want to. This will take a little bit of time to copy over; after all, it's 11 gigabytes. So again, we'll hit pause here and wait until that's done.
All right, we've imported our Hortonworks sandbox. What is this thing we just downloaded, anyway? Well, basically it's a preinstalled Hadoop environment that also has a bunch of associated technologies installed as well. A lot of the stuff we're going to go through in this course is already set up for you by virtue of installing the Hortonworks sandbox. So all we have to do now is click on it and hit Start, and that will basically be the same thing as pushing the power button on a PC that has Hadoop installed. It's going to sit there and boot up; it will just do its thing automatically. It does take a little bit of time to get up and running.

So while we're waiting, let's go and download some data to play with. I used to work at IMDb.com, so I'm kind of a movie nut; let's download some movie data. If you go to your browser and go to grouplens.org, that will take you to some free data that contains real movie ratings by real people that we can play with. Just click on "datasets." And since we're only running on our own PC here, we can't really deal with big data per se, so let's find a smaller data set. How about the MovieLens 100K data set? It's older, from 1998, so you're not going to see any new movies in there, but for our purposes it'll do the job. So let's go ahead and click on ml-100k.zip. It's only five megabytes. Open that up, decompress it, and just remember where you downloaded it to.

So we have an ml-100k folder; this contains a bunch of files that hold movie ratings data, and the u.data file is what contains the actual ratings. If we open that up in WordPad or Notepad or something, you can take a look at the format here. Basically, these are four columns of information: the first column is a user ID, the second column is a movie ID, and then a rating on a scale of one to five, and a timestamp. So, for example, this first line is saying user ID 196 rated movie number 242 three stars at this timestamp. And if you need to look up what those different movie IDs mean, u.item comes to the rescue. This is where all the metadata for the movies lives. That tells us, for example, that movie ID one is Toy Story from 1995, movie two is GoldenEye, and so on and so forth. Those are the two files that we're going to be playing with, just as a little experiment here.

Let's see if our virtual box has booted up yet. Almost there. All right, we are in business. Check it out: we have a little CentOS instance running here that actually has Hadoop up and running. To actually use it, well, we could just log into this from, you know, PuTTY or some SSH client, but it actually has a browser interface we can use, based on something called Ambari, to let us visualize what's going on in our little mini Hadoop instance and actually do some interesting stuff with it. So let's go ahead and do that. It says to just go to our browser and go to 127.0.0.1:8888, so let's do just that. OK, and you should see a screen like this. So let's go ahead and launch our dashboard, why not? Remember to disable pop-up blockers if necessary. You'll come to this screen here; it does have a bunch of handy-dandy tutorials, and if you want to go through those, by all means feel free. The login is going to be maria_dev, and the password is also maria_dev.
And that brings us to Ambari, which lets us visualize what's going on in our little Hadoop cluster here. Now, the beautiful thing is that with Hadoop, you don't really have to worry about whether it's running on one system, like we're doing right now, or on multiple nodes of an entire cluster of computers; all of that is generally abstracted away for you by Hadoop. So as a programmer or a developer or an analyst, what you do to develop under Hadoop is not going to be a whole lot different whether you're running on one machine or on an entire cluster of machines. The main difference is how much data you can actually handle, which is why we're sticking with a small data set for now.

So with that, let's do something fun with that MovieLens data we just downloaded. What we're going to do is import that movie ratings data into our Hadoop cluster, and we're going to use something called Hive to let us execute SQL queries on top of that data, even though it's not really being stored in a relational database; it's being stored in what could be a distributed manner across the entire cluster. So let's take a quick peek again at the data we downloaded. As a reminder, the ml-100k folder we downloaded earlier contains the u.data file, which represents all of our ratings data. Remember, that is tab-delimited data that represents a user ID, movie ID, rating, and timestamp. The other relevant file is u.item, which contains a mapping of movie IDs to movie names and a bunch of other metadata, specifically the release date, IMDb link, and information about the movie genres associated with each title. The key point here is that this file is delimited by pipe characters, those vertical bars.

So let's import this data into Hadoop, and we can actually do that right from the Ambari sandbox view here. We're going to go over to this little grid icon and select Hive View, because we're going to use the Hive tool to play around with this data. We'll talk about Hive in a lot more depth soon, but for now I just want to get some results out of this thing we installed and give you a little bit of a return on all of your work here. So click on "Upload Table" once you get here, and follow along with me, please; you're gonna feel better about all this if you can actually follow along and get some success here. Now, it says CSV file, comma-separated values, and even though this data is really separated by tabs and pipes, you can still use the CSV importer, because you can configure it to use any delimiter you want; it doesn't have to be a comma. Let's start with the ratings data file. We know that's delimited by tab characters, so I'm going to select ASCII value 9, the tab character, there, and I'm going to import it from my local file system. So again, navigate to where you downloaded and extracted the ml-100k folder, and select the u.data file. Open that up. We need to give this table a name; we're going to save it into a table named, I don't know, how about "ratings"? That's creative. And we'll name the columns we care about. Remember, this is the user ID, the movie ID, the rating itself on a one-to-five scale, and the rating time. So give your columns some names there: user_id, movie_id, rating, and rating_time. If you need to catch up, don't hesitate to hit pause before I move on. And when you're done, hit "Upload Table."
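If you're curious what that upload step amounts to behind the scenes, it's roughly equivalent to creating a tab-delimited Hive table and loading the file into it. This is just a sketch of the idea, not the exact statements the Hive View generates, and the file path shown is a hypothetical example, not a real location on the sandbox:

    -- Sketch of what the "Upload Table" step does for u.data (not the Hive View's exact DDL)
    CREATE TABLE ratings (
      user_id     INT,
      movie_id    INT,
      rating      INT,
      rating_time INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'   -- ASCII 9, the tab character we chose in the importer
    STORED AS TEXTFILE;

    -- Hypothetical path; the view stages the uploaded file for you
    LOAD DATA LOCAL INPATH '/tmp/u.data' INTO TABLE ratings;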
And it will actually create a table within Hive that's basically a view of this data sitting on our Hadoop instance. And we're done, so let's go ahead and import that movie names data as well. Remember, that one's separated by pipe characters, so I'm going to change that tab delimiter to the pipe, which is character 124 in the ASCII table. We'll select our u.item file this time. Open that up, and we'll call this table, I don't know, "movie_names". We really only care about the first two columns for what we're doing, and those are going to be movie_id and name. All right, we'll just leave everything else at the defaults for now, for the sake of time. Go ahead and upload that as well; again, it might take a minute or two. And that's ready as well.

So now we have this data imported onto our Hadoop instance; let's go ahead and run some queries on it. Let's find the most popular movie out of the entire data set. Now, remember, this is a snapshot of movie ratings data from 1998, so let's play a little game: what do you think the most popular movie of all time was as of 1998? We'll see if your guess was correct. Let's write a little standard SQL query. And again, this is kind of cool, because we're not really on a relational database here; this is just a tool called Hive that lets us interact with our data on Hadoop as if it were a relational database, although it really isn't. So let's do "select"... well, we need a bit of a cheat sheet here to remember what we called everything. If you open up the default database here on the left and refresh it with this little icon, we can see we have our new tables, movie_names and ratings, and we can expand those to get a little cheat sheet of what we named everything. So what we're going to do is, from the ratings table, select movie_id, comma, count of movie_id; follow along carefully here, you've got to make sure you don't have any typos. We're going to say "as rating_count", because that count for each movie is going to be something we sort by later in this query. And we'll say "from ratings", and we're going to group by movie_id, so this is basically going to group together all of the ratings for a given movie ID and count them all up. Right. One last thing we need to do is order it by rating_count, and we're going to do that in descending order, like that. So what this query is going to do is count up how many times each movie was rated in our data and sort it in reverse order by the number of times it showed up. This should give us a sorted list from the most popular to the least popular movie in our data set (the full query is written out below if you want to copy it). Let's go ahead and hit Execute.

And it will go chug away. Under the hood, it's doing an awful lot of stuff: it's figuring out how to optimize this query into what's called a MapReduce job on my Hadoop instance. And if this were happening on a cluster, it would be figuring out the optimal way to divide this data up, send it across your entire cluster, and process it all in parallel. So it looks like a SQL query, but it's doing its work in a much more interesting, much more scalable way. We already have our answer here: the top movie is movie ID 50, with 583 ratings. And before we go look up what movie ID 50 is, let me show you real quick a little bit of eye candy, a neat little thing on the dashboard here.
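Before we get to that, here's the query we just built, written out in one place in case you want to paste it into the query editor (the formatting is mine; the statement itself is exactly what we typed above):

    -- Count how many times each movie was rated, most-rated first
    SELECT movie_id, COUNT(movie_id) AS rating_count
    FROM ratings
    GROUP BY movie_id
    ORDER BY rating_count DESC;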
In the Hive View there's this visualization option, and that makes it very easy to visualize the results of one of these queries. So, for example, we can look at the movie ID versus rating count distribution just by doing that, and you can see this little scatterplot here that shows the rating count for every movie ID. Just by glancing at it, I can see that there is, in fact, a relationship between movie IDs and popularity; for whatever reason, the more popular movies tend to have lower IDs. Interesting, isn't it? Maybe that's just because they were assigned in order, and older movies have been rated more because they've been around longer. I don't know. We could dig into that if we wanted to.

Let's go back to our query view and find out what movie ID 50 is: what is the most popular movie of all time? A quick query for that: select... let's see what we called it in the movie names table... select name from movie_names where movie_id equals 50 (that one's written out below, too). And that should give us the answer to our little quiz. The most popular movie as of 1998 in this data was, drumroll please, Star Wars. Did you guess correctly? I hope you did; if you didn't, hey, it's OK, I won't tell anyone.

And there you have it. You've actually installed Hadoop on your PC in a virtual machine, imported data into it, and queried it using Hive, all in one lecture. So, hey, I hope you're happy about the progress we've made so far; it's pretty exciting. In the next lecture we'll go into more depth about what's actually going on here. What's Hadoop really all about? What are the pieces of it, and how do they all fit together? So stick with me to the next lecture, and we'll learn more about what actually went on here.
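And in case you want to copy that last lookup query as well, here it is written out; movie ID 50 is the one our ratings query surfaced, so your result should match if you loaded the same ml-100k data:

    -- Look up the title of the most-rated movie in the ml-100k data
    SELECT name
    FROM movie_names
    WHERE movie_id = 50;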