Sqoop Introduction

Navdeep Kaur
A free video tutorial from Navdeep Kaur
Technical Trainer
4.2 instructor rating • 7 courses • 15,446 students

Learn more from the full course

Master Big Data - Apache Spark/Hadoop/Sqoop/Hive/Flume

In-depth course on Big Data - Apache Spark, Hadoop, Sqoop, Flume & Apache Hive, Big Data cluster setup

08:22:17 of on-demand video • Updated August 2020

  • Hadoop Distributed File System and commands. Lifecycle of a Sqoop command. Sqoop import command to migrate data from MySQL to HDFS. Sqoop import command to migrate data from MySQL to Hive. Working with various file formats, compressions, file delimiters, where clauses and queries while importing the data. Understand split-by and boundary queries. Use incremental mode to migrate the data from MySQL to HDFS. Using Sqoop export, migrate data from HDFS to MySQL. Using Sqoop export, migrate data from Hive to MySQL. Understand Flume architecture. Using Flume, ingest data from Twitter and save to HDFS. Using Flume, ingest data from netcat and save to HDFS. Using Flume, ingest data from exec and show on console. Flume interceptors.
Welcome, everyone. Let's start with the first lecture: what is Apache Sqoop? Sqoop is a tool that is used to migrate data from relational databases to the Hadoop ecosystem and vice versa. Sqoop is based on the Hadoop MapReduce paradigm, and it leverages the parallelism and fault-tolerance capabilities of MapReduce. When you run a Sqoop command, in the background it gets converted into a Java program in the form of mappers and reducers that run in parallel on top of your cluster manager, that is YARN in the case of Hadoop. We will take a deep dive into that in our next lecture. Sqoop can import data from relational databases such as MySQL, PostgreSQL or Netezza into the Hadoop Distributed File System, into Hive tables and into HBase, and it can also export data back to a relational database using the Sqoop export command. This block diagram explains that Sqoop provides to-and-from data ingestion capabilities between the relational database and the Hadoop ecosystem. There are various built-in connectors provided by Sqoop to work with all the famous relational databases such as MySQL, Netezza, Oracle Database, PostgreSQL, SQL Server and the cloud-based ones. In this training I will be using a MySQL database to demonstrate the Sqoop commands.

Now let's take a deep dive into the Sqoop internals. As I have already discussed, we are going to use a MySQL database for demonstration purposes. Don't worry about the installation as of now, because in the next lecture we will do the installation using Google Cloud so that you can practice and follow along with the course. First of all, let me show you what I have in my MySQL database. I will say mysql -u, the username is root, -p, the password is admin123, for my MySQL database. Let me press enter, and I am inside my MySQL database. Let me switch to the retail_db database that I will be using for the demonstration. I will say use retail_db and then press enter, and you can see my database has been changed now. Let me show you all the tables that are under my retail_db database, so I will say show tables. This is the list of all the tables under retail_db. What I am going to do for this demonstration is import this customers table into my HDFS. Let me show you what is inside my customers table. I will select star from customers and show you the first five records. You can see it has customer id, first name, last name, email, etc. — all the information about my customers. And let me show you how many records there are, so we can verify that we are able to import the correct number of records. I will say select count(*) from customers. So there is a total of 12,435 records in this customers table. Now let me go and run the Sqoop command to ingest the data from this customers table into my HDFS directory.

Now let's write our first Sqoop command. We have to say sqoop import and then we have to provide the connection string using the --connect keyword. The connection string is used to connect to the relational database, and it may differ based on which relational database you are using.
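Before the Sqoop command, here is a minimal sketch of the MySQL inspection session described above, assuming the retail_db database and customers table from the course setup; your host, user and password may differ.

    # Log in to MySQL (root/admin123 is the demo setup used in this course)
    mysql -u root -p

    -- Inside the mysql shell: switch to the demo database and inspect the table
    USE retail_db;
    SHOW TABLES;
    SELECT * FROM customers LIMIT 5;     -- peek at the first five records
    SELECT COUNT(*) FROM customers;      -- returns 12435 in this demo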
For example, in the case of MySQL the connection string would be one thing, and in the case of Postgres the connection string would be different, and so on. In the case of MySQL, the connection string would be something like jdbc: — it means we are using the JDBC connector — then I have to say mysql, colon, double forward slash, and then I have to use the hostname. My MySQL database is running on localhost, but if your MySQL database is running on some other hostname, then you have to provide that hostname, followed by colon 3306, and then finally a slash and retail_db, that is the database that you want to connect to. In the next statement you have to say --username and then the username to connect to my MySQL database, which is root, and then --password; the password is admin123. And at the last I am going to provide the table name using the --table keyword; the table from which I want to import my data is the customers table. So this is the very basic Sqoop import command. It is saying: import the data from this MySQL database that is running on localhost on port number 3306; I want to connect to the retail_db database using the username root and the password admin123, and I want to import the data from the customers table.

Let me run this command now. This command will take a little time to complete because, as we discussed, in the background it is creating the Java code to run a MapReduce job in parallel on the cluster. We will take a deep dive into these logs to understand the whole Sqoop command. As of now it has completed, and you can see it has retrieved 12,435 records. We have already seen that in our table there was a total of 12,435 records, and it has successfully retrieved all of them from that table. Now, if you want to check the data in your HDFS location — we did not provide any output directory here, right? So it must have created some directory by default to put the records into. It must have created a directory in your home folder in HDFS, using the same name as the table name. Let me show you that. I will say hdfs dfs -ls and then /user; my home directory is navdeep, and that is where the Sqoop command puts its output in HDFS. Let me see what is there. So, can you see, a new directory has been created under this HDFS directory with the same name as the table name. Let's go under this directory and check what data has been imported here. I'll say hdfs dfs -ls on that customers directory. There are four part files — part files zero, one, two and three — that were created. Let me show you the content of one of the part files, just to verify that it consists of the data of my customers table. I will say hdfs dfs -cat and then the path of my part file, and let me press enter here. So this is the data of my customers table: the customer id, customer first name, last name and then the other details, and all this data is separated by a comma. It means that our Sqoop import command ran fine and it has imported the data of my entire customers table into four files in HDFS, and these files are nothing but, as we have already seen, the four part files.
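Putting the pieces together, here is a sketch of the basic import command and the HDFS checks walked through above, assuming MySQL on localhost:3306, the retail_db database, a home directory of /user/navdeep, and the default part-m-0000x file naming; adjust host, credentials and paths to your own environment.

    # Basic Sqoop import: pull the whole customers table into HDFS
    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username root \
      --password admin123 \
      --table customers

    # Verify the output: Sqoop creates a directory named after the table in your HDFS home
    hdfs dfs -ls /user/navdeep/customers

    # Inspect one of the part files (comma-separated rows by default)
    hdfs dfs -cat /user/navdeep/customers/part-m-00000 | head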
Now let's take a deep dive into the Sqoop logs, because you will be able to understand much more about the Sqoop flow just by looking into the logs themselves. As I have already discussed, when we run a Sqoop command, it generates MapReduce programs in the background, which are nothing but Java classes. So there is not a single process running; there are multiple processes, or you can say Java tasks, running in parallel to pull the data from your customers table and ingest that data into your HDFS location. Let me take you through the logs one by one. This is the first line; it is saying "beginning code generation". At this point it has started the code generation using the codegen tool. You don't need to know everything about that, but you should know that there is some code that is actually generated in the background. After that, it executes the second statement, the SQL statement select * from customers limit 1. Why is it selecting only one record from the customers table? Because it needs to understand the meta information about the table: what the column types are, how many columns there are, and whether those columns are nullable or not. Using that one record and that information, it creates the Java code. After creating the Java code, it has written a jar file inside this location; if you go to that location, you will actually be able to see the customers.jar file there. After writing the jar file, it submits the jar file to your YARN cluster manager. Here it actually starts the import of your customers records — it is saying "beginning import of customers". Your MapReduce job is submitted to the cluster, and that is why it is connecting to the resource manager at this location. If you don't know about the YARN cluster manager or MapReduce, don't worry; just think of the Sqoop job, for now, as a Java program running on some cluster manager, with the tasks running in parallel. In the next few statements it is running a boundary query; we will discuss in detail what boundary queries are, so let's skip that for now. Here it is saying it has submitted a job. If you want to see a lot more information about this job, we can actually go to this URL. Since I'm using a Google Cloud Dataproc cluster, it is sometimes not feasible to open this URL directly; if I jump to it, it will say something like "the server IP address could not be found". So if you want to access this YARN application, we have to go through another route: we have to go to our cluster in the Google Cloud console. Don't worry, I will tell you the details about this in our next lecture when we do the installation. Just for reference, we go to the web interfaces, click on the YARN resource manager, and it redirects us to this particular link. Here I am able to see my application that was submitted — this is the same application that we saw in our Sqoop log. It is a MapReduce type of application; the name is customers.jar, and this is my username. If you want to see a lot more detail about this application, you can go to this link. It will tell you that this application has succeeded.
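The video inspects the job through the resource manager web UI; as an alternative when that URL is not reachable, here is a hedged sketch using the standard YARN command-line tools from a cluster node. The application id below is only a placeholder for the id printed in your Sqoop log.

    # List applications known to the YARN resource manager
    yarn application -list -appStates ALL

    # Fetch the aggregated logs for the Sqoop MapReduce job
    # (replace the placeholder with the id from your own Sqoop output)
    yarn logs -applicationId application_1234567890123_0001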
It has taken 16 seconds to succeed, and it also shows you a lot more detail. If you want to see the mappers and reducers that were created and run for this particular Sqoop job, you can go to this history URL. Inside that, it shows you that there are four mappers and zero reducers created for this Sqoop job. If you want to see more information about the mappers, you can click on the mappers. These are the four tasks that were created, and these four tasks are nothing but the Java code that was produced by the Sqoop codegen tool. Let me go to one of the tasks, and let me go to the counters. As you can see, it gives me a lot more information about the file system counters and MapReduce counters, like how many bytes were read, how many bytes were written, and how many input records there were. As you can see, there were 3,109 records. Out of 12,435 records, since we got four mappers, every mapper will have processed roughly one fourth of the records; that's why it comes out to about 3,109. So this mapper has processed 3,109 records out of the 12,435 records that were present in my customers table. It also gives you a lot more information, such as the CPU time spent, the memory usage and more. This is very important information: if you want to debug and fine-tune your application, you can use it. Here in the logs as well, you will be able to see all of that file system counter information, and also how many tasks were launched — launched map tasks: four — along with the total number of input records and output records for this MapReduce job. So that was the brief on the Sqoop flow: Sqoop generates Java code, that Java code is written against the MapReduce framework, and the MapReduce job is submitted to the YARN cluster manager, where the tasks run in parallel to import the data from the database into the HDFS file system. That is the brief Sqoop introduction; let me see you in the next lecture.
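The four map tasks come from Sqoop's default parallelism of four mappers. As a closing sketch, and assuming the same connection details as above, the --num-mappers flag (shorthand -m) controls how many parallel map tasks Sqoop uses to split the import.

    # Same import, but ask Sqoop for 8 parallel map tasks instead of the default 4
    sqoop import \
      --connect jdbc:mysql://localhost:3306/retail_db \
      --username root \
      --password admin123 \
      --table customers \
      --num-mappers 8

More mappers means more parallel connections to the source database, so the right number depends on how much load your MySQL server can take as well as on the cluster capacity.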