How Hive works

Sundog Education by Frank Kane
A free video tutorial from Sundog Education by Frank Kane
Founder, Sundog Education. Machine Learning Pro
4.6 instructor rating • 28 courses • 539,822 students

Lecture description

Learn how Hive works under the hood of your Hadoop cluster to efficiently query your data across the cluster using SQL commands. Well, technically it's HiveQL, but it will definitely seem familiar.

Learn more from the full course

The Ultimate Hands-On Hadoop: Tame your Big Data!

Hadoop tutorial with MapReduce, HDFS, Spark, Flink, Hive, HBase, MongoDB, Cassandra, Kafka + more! Over 25 technologies.

14:41:35 of on-demand video • Updated November 2021

  • Design distributed systems that manage "big data" using Hadoop and related technologies
  • Use HDFS and MapReduce for storing and analyzing data at scale
  • Use Pig and Spark to create scripts to process data on a Hadoop cluster in more complex ways.
  • Analyze relational data using Hive and MySQL
  • Analyze non-relational data using HBase, Cassandra, and MongoDB
  • Query data interactively with Drill, Phoenix, and Presto
  • Choose an appropriate data storage technology for your application
  • Understand how Hadoop clusters are managed by YARN, Tez, Mesos, Zookeeper, Zeppelin, Hue, and Oozie
  • Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
  • Consume streaming data using Spark Streaming, Flink, and Storm
Let's go into a little more depth on Hive, some of its more advanced features, and some of the nuances of how it actually works. We're not going to do a lot of hands-on examples here. I could easily do a whole course on Hive, but I want you to know about this stuff so you have it in the back of your head and can look up how to do these things if you need to in the future.

One of the basic concepts of Hive is something called schema on read, and this is what separates it from a more traditional database. A traditional relational database uses schema on write: you define the schema of your database before you load the data into it, and that schema is enforced at the time the data is written to disk. Hive flips that on its head. It takes unstructured data and applies a schema to it as it is being read instead. Your data is still stored as plain tab-delimited text files, or whatever format it happens to be in, with no actual structure and no information about what kind of data it represents. But Hive maintains what's called a metastore, which holds the schema information associated with that unstructured data, and that's what tells Hive how to interpret those raw text files. Hence the name: schema on read. There's also a separate component called HCatalog that can expose that schema to other services as well.

Here's an example. We kind of cheated in our previous example, where we used the Ambari UI to import our data through a graphical user interface. If you want to do it the hard way, it would look something like this. You enter a CREATE TABLE ratings command that lists the field names and their types (they all happen to be integers in this case), and then you tell Hive how that data is actually structured on disk: the row format is delimited, the fields are terminated by tabs, and it's stored as a text file. That is defining the schema that will be applied when the data is read, as opposed to when the data was written. The data was already written in an unstructured, tab-delimited format; this CREATE TABLE command just creates a schema in the metastore that gets applied to that data as it's read into Hive. Then, to load the data from the command-line interface, we would do something like LOAD DATA LOCAL INPATH, with a path that points to /ml-100k/u.data under your home directory, and OVERWRITE INTO TABLE ratings. That's what's happening under the hood when we do that import: we're defining a schema on read that gets applied to data that was already written in an unstructured format, and the data stays in that format. Hive isn't going off and creating some structured relational database under the hood; that would be pretty inefficient when we're talking about big data. It just takes the existing data you already have on your cluster and imparts a schema to it when it's read in.
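Putting that together, the two commands would look roughly like this. This is a sketch rather than the exact script from the lecture: the column names are placeholders I've chosen (the lecture only says the four fields are all integers), and ${env:HOME} is one way to refer to your home directory from the Hive CLI when variable substitution is enabled.

    -- Define a schema in the metastore for the raw, tab-delimited ratings file.
    -- Column names are assumed for illustration.
    CREATE TABLE ratings (
      userID INT,
      movieID INT,
      rating INT,
      ratingTime INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Load the raw file into the managed table; OVERWRITE replaces any
    -- existing contents of the table.
    LOAD DATA LOCAL INPATH '${env:HOME}/ml-100k/u.data'
    OVERWRITE INTO TABLE ratings;

Because ratings is a managed table here, dropping it later would remove both the schema in the metastore and the data Hive took in.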
So where does that data actually live? Well, this can get a little complicated and a little confusing. When you do a LOAD DATA command from the CLI, the command-line interface, that's actually going to move data from the distributed file system into Hive. Basically, with LOAD DATA we're saying we want Hive to manage this data from now on: it takes ownership of it and moves it into Hive itself. The reason is that if you have a massive dataset sitting on a distributed file system, odds are you don't really want another copy of it. So Hive just moves that data to where Hive expects it to be, and Hive owns it from then on. Again, it's not reformatting it into some fancy structured format; all it keeps is the schema information in its metastore, and it's just moving the raw data from one place to another. In contrast, LOAD DATA LOCAL actually does a copy: if you're loading data from a local file system into Hive, LOAD DATA LOCAL will copy it in and leave your original data untouched. But that assumes you're loading it from a local file system and not a distributed file system, so the implicit assumption is that we're not dealing with big data, because it fits on your local file system, and making a copy is no big deal.

Further confusing matters is the idea of managed versus external tables. The examples above are managed tables, where Hive ultimately takes ownership of the data: the copy of the data, or the data that was moved into Hive, is owned by Hive, so if you issue a DROP TABLE command, that data is gone. But sometimes you want to share that data with other systems outside of Hive, and that's where external tables come in. All it really means is that when you create your table, you say CREATE EXTERNAL TABLE, give the format it's in again, and give it an explicit LOCATION of where that data lives. CREATE EXTERNAL TABLE with LOCATION says, "I'm going to use Hive on this data, but Hive doesn't really own it." If I drop that table, if I drop ratings, Hive drops the metadata but keeps the underlying data intact; it doesn't mess with it, it just leaves it there. So an external table really just means that Hive doesn't take ownership: when you drop the table the data is retained, and Hive never moves the data around. If you need to share your data with other systems, that's an easy way to do it, and that's really the only difference between managed and external tables. Remember, Hive isn't really messing with the raw data itself, so it's more a question of who owns the data and who's responsible for deleting it when you're done.

Another important topic with Hive is partitioning. If you have a massive dataset and your queries tend to hit a specific partition of that data, partitioning in Hive can be a very important optimization. For example, let's say you have a customers table that consists of names and addresses, and it's partitioned by country. If you say PARTITIONED BY (country STRING), Hive treats country as just another column in your queries, but it's not really a column under the hood; each country value becomes part of a subdirectory under where Hive stores the table's data. So for a table named customers, the actual path might be /customers/country=CA/. Each partition shows up as one of those key=value expressions in the path, and you can chain them together if you have more than one partition, maybe /country=CA/province=Alberta or something like that. This is very powerful if you're going to be doing queries that are specific to a given partition. If my usual usage is to only query for customers in a specific country, this makes a lot of sense: it saves a lot of time, because we don't have to scan the entire customers table, only the files that are specific to the country I'm interested in.
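To make those ideas concrete, here is a hedged sketch of an external table and a partitioned table. The table names, column names, and the LOCATION path are illustrative assumptions, not values given in the lecture.

    -- External table: Hive applies a schema to data it does not own.
    -- Dropping this table removes only the metadata; the files under
    -- LOCATION (a made-up HDFS directory) are left untouched.
    CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
      userID INT,
      movieID INT,
      rating INT,
      ratingTime INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION '/data/ml-100k/';

    -- Partitioned table: country behaves like a column in queries, but on
    -- disk each value becomes a subdirectory such as /customers/country=CA/.
    CREATE TABLE customers (
      name STRING,
      address STRING)
    PARTITIONED BY (country STRING);

    -- A query that filters on the partition column only scans the files under
    -- the matching country=... subdirectory instead of the whole table.
    SELECT name, address FROM customers
    WHERE country = 'CA';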
Another quick note here. Another interesting thing about Hive is that you're not restricted to the usual data types of integers, strings, bytes, and so on. You can also store more structured things. One example is a struct: an address structure that contains a street string, a city string, a state string, and a zip integer. So you can store structured data within Hive tables; the flat, denormalized nature of Hive makes that possible. You can store maps as well. Kind of a neat trick.

There are many ways of using Hive. We've looked at using it through Ambari, and if you're on Cloudera you might be using Hue instead. But there's also a Hive prompt: if you just type hive on your cluster, you get a prompt and can type those commands in one line at a time, interactively, just like you would with a database. You can also save your queries off to a file; typing hive -f followed by the path to your query file will execute the commands in that file. And through Ambari, you might have noticed there was a Save As command we could have used to save that script to a file, so if you wanted to rerun it from the command line, you could do it that way. Hive also includes a JDBC and ODBC server, so you can connect to it just like you would to any other database from whatever systems or tools you may have. And it exposes a Thrift service, a small service that Hive can run, which lets you talk to it from things like web sites or web clients. Remember, however, that Hive is not suitable for online transaction processing, so it's not really meant to be hit with tons of queries all at once from a web site or something like that; that's where you'd use something like HBase instead. But the Thrift service does provide another way of using Hive, and more often it's used to keep Hive running on one instance of your cluster instead of every instance, with the Thrift service acting as the interface between the rest of your cluster and the Hive service. You can also manage Hive through Oozie, which we'll talk about later in the course; Oozie is a management tool for your cluster as a whole that lets you schedule more complicated jobs across your Hadoop cluster.

That's Hive in a nutshell. So up next, let's actually practice what you've learned, and I'll give you a little challenge exercise to get your hands dirty with Hive.
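For reference, the struct and map column types mentioned above would be declared something like this. The customers_nested table name and the properties map column are hypothetical; only the address struct's fields come from the lecture.

    -- A struct column groups related fields together within a single row.
    CREATE TABLE customers_nested (
      name STRING,
      address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>,
      -- A map column of arbitrary key/value pairs (illustrative).
      properties MAP<STRING, STRING>
    );

    -- Struct fields are read with dot notation; map values are read by key.
    SELECT name, address.city, properties['tier']
    FROM customers_nested;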