Sqoop Introduction

Navdeep Kaur
A free video tutorial from Navdeep Kaur
Technical Trainer
4.2 instructor rating • 7 courses • 13,487 students

Learn more from the full course

Master Big Data - Apache Spark/Hadoop/Sqoop/Hive/Flume

In-depth course on Big Data - Apache Spark, Hadoop, Sqoop, Flume & Apache Hive, Big Data Cluster setup

06:55:34 of on-demand video • Updated June 2020

  • Hadoop Distributed File System and commands. Lifecycle of a Sqoop command. Sqoop import command to migrate data from MySQL to HDFS. Sqoop import command to migrate data from MySQL to Hive. Working with various file formats, compressions, file delimiters, WHERE clauses and queries while importing the data. Understand split-by and boundary queries. Use incremental mode to migrate data from MySQL to HDFS. Using Sqoop export, migrate data from HDFS to MySQL. Using Sqoop export, migrate data from Hive to MySQL. Understand Flume architecture. Using Flume, ingest data from Twitter and save to HDFS. Using Flume, ingest data from netcat and save to HDFS. Using Flume, ingest data from exec and show on console. Flume interceptors.
In this video we will learn about what Apache Sqoop is, and then we will do lots of examples on Sqoop import. So let's get started. Apache Sqoop is a tool that allows us to extract data from a structured data store like MySQL, PostgreSQL, Oracle, or any other relational database into Hadoop, so that we can do some further processing or analytics on top of that data. Sqoop provides various connectors to connect to MySQL and the other relational databases. We will look into the exercises now, as they will make the Sqoop import and export commands much clearer.

So let's start with our Sqoop import command. As we discussed earlier, the Sqoop import command is used to migrate data from a relational database to HDFS. In my case, I have a MySQL server, and inside it a retail_db database with a customers table; I want to import all the data from the customers table to HDFS. First of all, let me connect to this MySQL database and show you that table. I start the mysql client; the username is root, the password is cloudera, and the hostname is localhost, so now I am on the MySQL prompt. I switch to the retail_db database and list all the tables under it, and you can see all the tables under the retail_db database. What I want to do is copy all the customers from MySQL to HDFS, so let me describe that table. It contains all the information about the customers, like the id, first name, last name, city, state, and so on.

What we are going to do is write a simple Sqoop import command that will import the data from this table into files on HDFS. Let me open a new terminal and start writing the Sqoop command; to save some time I'll be copy-pasting the commands. This command says sqoop import, then --connect with the connection string, which says connect to the MySQL database using the JDBC protocol; the MySQL database is running on localhost, and inside it connect to the retail_db database. The username to connect is root, the password is cloudera, and the table that I want to fetch the data from is the customers table. Let me run this Sqoop command.

Our Sqoop command is now complete, and it says it has retrieved 12,435 records. Let's see how many records there are in the customers table; let me run SELECT COUNT(*) FROM customers. You can see there are 12,435 records, and the Sqoop command has also retrieved 12,435 records from the customers table.

Let's take a closer look at these logs, because they will help us understand how Sqoop works internally. To give you a brief idea of how Sqoop works: behind the scenes, when we issue a Sqoop import, it creates Java classes, and those Java classes are used to import the data from MySQL to HDFS. Sqoop also provides parallel processing: it is not only one Java thread that runs to import the data; by default there are four tasks running simultaneously to import it. Moreover, it provides fault tolerance, so if one of the tasks fails, that task is retried. Under the hood it is actually using MapReduce, but you don't need to understand MapReduce here; you can think of the tasks as the generated Java classes. So let's see what is actually happening. Looking at the logs, the first thing it says is that it is beginning code generation.
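(For reference, here is a minimal sketch of the commands described so far, assuming the Cloudera QuickStart setup used in the video: MySQL on localhost with user root, password cloudera, and the retail_db database. The exact JDBC connection string shown here is an assumption based on the standard format.)

    # Inspect the source table in MySQL (enter the password cloudera when prompted)
    mysql -u root -p -h localhost
    mysql> USE retail_db;
    mysql> SHOW TABLES;
    mysql> DESCRIBE customers;
    mysql> SELECT COUNT(*) FROM customers;    -- reports 12435 in the video

    # Basic Sqoop import: copy the customers table from MySQL into HDFS
    sqoop import \
      --connect jdbc:mysql://localhost/retail_db \
      --username root \
      --password cloudera \
      --table customers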
So from here the Java classes are generated. The first query that it executes is SELECT * FROM customers LIMIT 1, so it selects only one row from the customers table to get the meta information about the data: the column names, the column data types, everything about the data, because that will be used internally by the generated Java classes. Moving on, as you can see, the next thing is writing the jar file; after reading all the metadata it creates a jar file, and then it says it is beginning the import of customers. Once it has submitted this jar file, it begins executing the code that will import the customers. There are certain warnings that you will see when you run this command; you can entirely ignore those warnings.

Let me go further. The next thing is, as I said, there are four processes running in parallel, so the table data is divided into four parts that will be processed in parallel by those four tasks. That is why the log says the number of splits is four. If we have 12,435 records, those records are divided into four parts, and each part is given to one of the tasks to process its records. If you want to see more information about it, you can go to this URL and check what is happening. This is the job that was created; it got executed in about a minute, and there were four map tasks, which were actually running the generated Java classes. If I click on this, it shows me all four map tasks. If I go to one of the tasks and click on counters, you can see the map input records counter is 3,109, so it has processed 3,109 out of the 12,435 records. Similarly, the other map tasks have each processed a similar share of the records. So let me close this window now.

Now we know the records are transferred from MySQL to HDFS, but where does that data go? By default, all the data is written to a default HDFS directory, and its location is /user/cloudera followed by the table name. Let me go to that directory: I clear the screen and run hdfs dfs -ls on /user/cloudera and then the table name, which is customers. As you can see, four part files were generated; each map task, or you can say each thread, has generated its own part file. That is why, with four map tasks, there are four part files.

We can change the number of mappers, and with our next Sqoop command we will see whether it changes the number of part files or not. Here what I am doing is specifying the number of mappers as two, so instead of four, two mappers will be running. First let me delete this location, otherwise it will give me an error that this location already exists; we cannot overwrite data in the same location. Now that this directory is deleted, let me run the Sqoop command again, giving it two mappers this time. Our Sqoop command is complete. Let me clear this screen, go back to the same location, and see how many part files were generated this time. This time only two part files were generated, because we explicitly specified the number of mappers as two. So that was the end of this lecture. In the next lecture we will see how we can define a destination directory, so stay tuned.
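(Again for reference, a rough sketch of the HDFS checks and the two-mapper re-run described above. The target path /user/cloudera/customers follows Sqoop's default of the user's home directory plus the table name, as stated in the video; -m is Sqoop's flag for the number of mappers, equivalent to --num-mappers.)

    # List the default import location: one part file per mapper (four by default)
    hdfs dfs -ls /user/cloudera/customers

    # Remove the existing directory first, since Sqoop will not overwrite it
    hdfs dfs -rm -r /user/cloudera/customers

    # Re-run the import with two mappers instead of the default four
    sqoop import \
      --connect jdbc:mysql://localhost/retail_db \
      --username root \
      --password cloudera \
      --table customers \
      -m 2

    # Verify: now only two part files should appear
    hdfs dfs -ls /user/cloudera/customers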