Installing the MovieLens Dataset

A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro
4.6 instructor rating • 28 courses • 542,469 students

Lecture description

Before we can analyze movie ratings data from GroupLens using Hadoop, we need to load it into HDFS. You don't need to mess with command lines or programming to use HDFS. We'll start by importing some real movie ratings data into HDFS just using a web-based UI provided by Ambari.

Learn more from the full course

The Ultimate Hands-On Hadoop: Tame your Big Data!

Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies.

14:41:35 of on-demand video • Updated November 2021

  • Design distributed systems that manage "big data" using Hadoop and related technologies.
  • Use HDFS and MapReduce for storing and analyzing data at scale.
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
  • Analyze relational data using Hive and MySQL.
  • Analyze non-relational data using HBase, Cassandra, and MongoDB.
  • Query data interactively with Drill, Phoenix, and Presto.
  • Choose an appropriate data storage technology for your application.
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie.
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume.
  • Consume streaming data using Spark Streaming, Flink, and Storm.
All right, so let's mess around with HDFS for real. What we're going to do is upload some of the movie ratings we downloaded earlier from the MovieLens dataset into the HDFS file system on our little virtual Hadoop box. So let's open up VirtualBox and start our Hortonworks sandbox. It'll take a minute or two to boot up, so go ahead and pause here and resume once you see the screen that says everything is ready to go.

After a few minutes you should see a screen that looks more or less like this. I'm going to show you a little bit of a shortcut: even though the screen says to connect to 127.0.0.1:8888, we're going to go straight to the Ambari view by going to port 8080 instead. So bring up your favorite web browser and type in 127.0.0.1:8080, just like that, and you should see a screen like this. Again, we'll log in as maria_dev, with a password of maria_dev as well.

Here we are in the Ambari view, which is our way of looking across all the stuff installed on our little Hadoop box. We talked about all of these technologies in our earlier overview lecture, HBase, Pig, Sqoop, ZooKeeper, and the rest, so they might make a little more sense now. For now, though, we're here for HDFS, so you can click on that and get some information about the file system itself. You can see that we're running a name node and a data node, both obviously on the same virtual machine; on a real cluster those would of course be on two different computers. So we'll be talking to our name node and our data node in a virtual sense while we play with HDFS.

Messing around with HDFS from a UI is very simple. All you have to do is click on the little grid icon and go to the Files View, and this shows you, in graphical form, the entire HDFS file system running on our little Hadoop cluster. Well, our cluster of one virtual machine, such as it is, but it would work the same way on a real cluster. We can, for example, click on user and then on maria_dev to see maria_dev's home directory. Let's create a folder for our MovieLens data: click New Folder and call it ml-100k, just like that. Remember, that's the one-hundred-thousand-ratings dataset from the MovieLens website. What just happened is that we talked to the name node and said, hey, we're going to create a new directory, and the name node said, OK, here you go.

Now we can go into that folder and hit the Upload button to upload files from our hard drive; we can drag them in, or click and navigate to them. So navigate to wherever you downloaded and extracted the ml-100k dataset and upload, for example, the u.data file. What happened there is what we talked about before: the client asked the name node to create a file and asked where to put it; the name node said, put it on this data node, and since there's only one data node, that's where it went. If we had more than one machine, though, HDFS would start replicating that data across multiple data nodes, and each would acknowledge back up the chain, and finally back to the name node, that the data had been stored successfully. Let's also upload the u.item file the same way, and see what else we can do.
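By the way, the files view is just a front end over an HTTP interface to HDFS called WebHDFS, so the same upload can be scripted. Here is a minimal sketch in Python using the requests library; the port (50070, the HDP 2.x name node default), the user name, and whether the data-node redirect resolves from your host machine are all assumptions, not something this lecture sets up:

    import requests

    # Assumed WebHDFS endpoint of the sandbox's name node. Port 50070 is
    # the HDP 2.x default; newer Hadoop 3 distributions use 9870 instead.
    BASE = "http://127.0.0.1:50070/webhdfs/v1"
    USER = "maria_dev"

    # MKDIRS only talks to the name node -- no file data is involved.
    requests.put(BASE + "/user/maria_dev/ml-100k",
                 params={"op": "MKDIRS", "user.name": USER})

    # CREATE is the two-step dance described above: the name node answers
    # with a 307 redirect naming a data node, and the second PUT streams
    # the bytes to that data node. (On the sandbox the redirect may use an
    # internal hostname, which may need an entry in your hosts file.)
    resp = requests.put(BASE + "/user/maria_dev/ml-100k/u.data",
                        params={"op": "CREATE", "user.name": USER,
                                "overwrite": "true"},
                        allow_redirects=False)
    with open("u.data", "rb") as f:
        requests.put(resp.headers["Location"], data=f)

Note how the script mirrors the name node / data node conversation described above: the directory creation never touches a data node, while the file upload is redirected to one.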
Back in the files view, you can do pretty much anything from the UI that you can do from the command line. I can, for example, click on u.data and click Open and actually see what's in it. Here we have a little view of the contents of the data file: each row is a user ID, the ID of the movie that user rated, a rating from 1 to 5, and a timestamp. You can scroll down and look at all 100,000 ratings if you want to. You can also rename the file; let's say we want to call it ratings.data. Basically, everything here works the way you'd expect it to.

There are a couple of cool features, though. You can download a file to your local hard drive if you want to, so let's do that and open it up. There it is: we got our data back from HDFS. So what we've done is upload data from our hard drive into an HDFS cluster, a cluster of one, and then download it back to our hard drive and get the same file back. It all works. You can also do something kind of cool: select more than one file using the Shift key and concatenate them together. That downloads a single file containing the contents of both files. If you open that resulting file, you'll see something like this: it starts off with all 100,000 movie ratings and ends with the u.item file, which maps movie IDs to movie names, genre information, and whatnot.

Now let's clean up after ourselves. I still have those two files selected, so I'll click Delete to get rid of them. Then I'll go up a level, back to maria_dev's home directory, and select ml-100k. Notice I'm clicking over here, as opposed to on the name itself, so instead of navigating into the folder I'm just selecting its row; then I can hit Delete, and Delete Permanently, just to make sure we've cleaned up properly.

So there you have it: we've been playing around with HDFS through a web interface, using an HTTP interface to HDFS that lets us view and manipulate all the files in our HDFS file system. It's really, really easy; hardly even worth talking about. So in our next lecture, let's go a little deeper and do this from an actual command prompt connected to our Hadoop cluster.

If you do need to take a break right now, I just want to mention one thing real quick: if you go back to your virtual machine, you can shut it down cleanly by going to Machine and then ACPI Shutdown. That's how to close down this session, and this virtual machine, cleanly before you start it up again, so always do that rather than just closing it forcibly. OK, next lecture: let's do this the hard way, from the command line interface.
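A footnote for the curious: each operation we just performed in the files view, opening, renaming, and deleting, also maps onto a WebHDFS REST call. A rough Python sketch under the same assumptions as the upload example above (default port 50070, requests installed, redirects resolvable from the host):

    import requests

    BASE = "http://127.0.0.1:50070/webhdfs/v1"  # assumed WebHDFS endpoint
    USER = "maria_dev"
    HOME = "/user/maria_dev"

    # OPEN redirects to a data node that serves the bytes; requests
    # follows the redirect automatically.
    r = requests.get(BASE + HOME + "/ml-100k/u.data",
                     params={"op": "OPEN", "user.name": USER})
    print(r.text[:200])   # user id, movie id, rating (1-5), timestamp

    # RENAME u.data to ratings.data, like the rename we did in the UI.
    requests.put(BASE + HOME + "/ml-100k/u.data",
                 params={"op": "RENAME", "user.name": USER,
                         "destination": HOME + "/ml-100k/ratings.data"})

    # DELETE the whole ml-100k folder recursively: the same cleanup as
    # selecting the row and choosing delete permanently.
    requests.delete(BASE + HOME + "/ml-100k",
                    params={"op": "DELETE", "user.name": USER,
                            "recursive": "true"})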