Programming Hadoop with Apache Pig

4.0 (2 ratings) · 378 students enrolled
  • Lectures 17
  • Length 1.5 hours
  • Skill Level Intermediate Level
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion


About This Course

Published 12/2015 English

Course Description

Apache Pig is an open-source tool for analyzing large datasets. It converts Pig Latin scripts into Hadoop MapReduce jobs.

This course covers Pig basics. The tutorials include Pig fundamentals, Hortonworks Sandbox installation, and a number of important examples.

If you want to develop Hadoop MapReduce jobs easily, you should take this course.
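As a quick illustration of the idea (a hypothetical word-count sketch, not part of the course; the input path is an assumption), a few lines of Pig Latin stand in for an entire MapReduce program:

```pig
-- Hypothetical input: a text file with one line per record.
LINES = LOAD '/user/data/sampletext' AS (line: chararray);
-- TOKENIZE splits each line into a bag of words;
-- FLATTEN yields one output tuple per word.
WORDS = FOREACH LINES GENERATE FLATTEN(TOKENIZE(line)) AS word;
GROUPED = GROUP WORDS BY word;
COUNTS = FOREACH GROUPED GENERATE group AS word, COUNT(WORDS);
DUMP COUNTS;
```

Pig compiles each of these statements into one or more MapReduce jobs behind the scenes.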

What are the requirements?

  • Hadoop Basic terms

What am I going to get from this course?

  • Learn Pig basics
  • Programming Hadoop with Apache Pig

What is the target audience?

  • Big Data Developers
  • Hadoop Developers

What do you get with this course?

Not for you? No problem.
30 day money back guarantee.

Forever yours.
Lifetime access.

Learn on the go.
Desktop, iOS and Android.

Get rewarded.
Certificate of completion.

Curriculum

Section 1: Pig Basics and Installation of Horton Sandbox
Introduction
Preview
03:04
Horton Installation
Preview
05:17
03:45

Sample data

-------------------

1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d25,EN
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d26,BE
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d27,TR
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d28,US


Pig script

-----------------------------


PC_INFO = LOAD '/user/data/sampledata1' USING PigStorage(',') AS
(
    date_time: long,
    id_computer: chararray,
    country: chararray
);

DUMP PC_INFO;

Section 2: Data Types and Operators
05:08

log file

--------

11,1144121498080,133.5f,122123.45,US,010001
12,2144121498080,133.6f,122123.46,TR,111100
13,3144121498080,133.7f,122123.47,US,110001
14,4144121498080,133.8f,122123.48,EN,111100
15,5144121498080,133.9f,122123.49,RU,011001
16,6144121498080,133.0f,122123.40,US,110100


pig script

---------------

TYPED_DATA = LOAD '/user/data/sampledata2' USING PigStorage(',') AS
(
    intTypedData: int,
    longTypedData: long,
    floatTypedData: float,
    doubleTypedData: double,
    chararrayTypedData: chararray,
    bytearrayTypedData: bytearray
);

DUMP TYPED_DATA;

Operators
04:39
04:05

Example Data

Distinct


1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d25,EN
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d27,TR
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US


Pig


DATA = LOAD '/user/data/sampledata3' USING PigStorage(',') AS
(
    timestamp: long,
    computerID: chararray,
    countryCode: chararray
);

DUMP DATA;

DISTINCT_DATA = DISTINCT DATA;

DUMP DISTINCT_DATA;




Filtering tuples with a given condition

DATA = LOAD '/user/data/sampledata3' USING PigStorage(',') AS
(
    timestamp: long,
    computerID: chararray,
    countryCode: chararray
);

DUMP DATA;

FILTERED_DATA = FILTER DATA BY countryCode == 'US';

DUMP FILTERED_DATA;

08:40

log file 1

1,US,5,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=car
2,US,25,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=watch
3,TR,20,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=table
4,EN,10,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=car+games
5,PL,16,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=car+games+download
6,US,24,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=home
7,US,36,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=travel
8,EN,48,https://www.google.com/?gfe_rd=cr&ei=djNDVsWFOsqo8wf70JP4Bw&gws_rd=cr#q=car


code 1


DATA = LOAD '/user/data/sampledata4' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray
);

DATA_GROUPED = GROUP DATA BY countryCode;

DUMP DATA_GROUPED;


code 2

DATA = LOAD '/user/data/sampledata4' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray
);

DATA_GROUPED = GROUP DATA BY countryCode;

RESULT = FOREACH DATA_GROUPED GENERATE
    group,
    AVG(DATA.durationTime);

DUMP RESULT;


log file 2


1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d25,EN
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d27,TR
1441214980800,2664c6bb-9261-42bf-a5c4-534436f50d24,US


code 3



DATA = LOAD '/user/data/sampledata3' USING PigStorage(',') AS
(
    timestamp: long,
    computerID: chararray,
    countryCode: chararray
);

DATA_GROUPED = GROUP DATA BY countryCode;

RESULT = FOREACH DATA_GROUPED GENERATE
    group AS countryCode,
    COUNT(DATA);

DUMP RESULT;


code 4


DATA = LOAD '/user/data/sampledata3' USING PigStorage(',') AS
(
    timestamp: long,
    computerID: chararray,
    countryCode: chararray
);

DATA_GROUPED = GROUP DATA ALL;

RESULT = FOREACH DATA_GROUPED {
    S = FILTER DATA BY countryCode == 'US';
    GENERATE COUNT(S);
}

DUMP RESULT;







Section 3: Functions
04:23

log file

1,US,5,https://www.google.com.tr/#q=apache+pig,apache,pig
2,US,25,https://www.google.com.tr/#q=apache+hive,apache,hive
3,TR,20,https://www.google.com.tr/#q=apache+hadoop,apache,hadoop
4,EN,10,https://www.google.com.tr/#q=apache+oozie,apache,oozie
5,PL,16,https://www.google.com.tr/#q=apache%20flume,apache,flume
6,US,24,https://www.google.com.tr/#q=apache+spark,apache,spark
7,US,36,https://www.google.com.tr/#q=apache+kafka,apache,kafka
8,EN,48,https://www.google.com.tr/#q=apache+storm,apache storm in hadoop ,storm


avg pig


DATA = LOAD '/user/data/samplelogfile' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray,
    keyword1: chararray,
    keyword2: chararray
);

GROUPED_DATA = GROUP DATA BY countryCode;

RESULT = FOREACH GROUPED_DATA GENERATE
    group,
    AVG(DATA.durationTime);

DUMP RESULT;


concat pig


DATA = LOAD '/user/data/samplelogfile' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray,
    keyword1: chararray,
    keyword2: chararray
);

RESULT = FOREACH DATA GENERATE
    url,
    CONCAT(keyword1, keyword2) AS combinedKeywords;

DUMP RESULT;


04:07

log file

1,US,5,https://www.google.com.tr/#q=apache+pig,apache,pig
2,US,25,https://www.google.com.tr/#q=apache+hive,apache,hive
3,TR,20,https://www.google.com.tr/#q=apache+hadoop,apache,hadoop
4,EN,10,https://www.google.com.tr/#q=apache+oozie,apache,oozie
5,PL,16,https://www.google.com.tr/#q=apache%20flume,apache,flume
6,US,24,https://www.google.com.tr/#q=apache+spark,,spark
7,US,36,https://www.google.com.tr/#q=apache+kafka,,kafka
8,EN,48,https://www.google.com.tr/#q=apache+storm,apache storm in hadoop ,storm


max min functions


DATA = LOAD '/user/data/samplelogfile' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray,
    keyword1: chararray,
    keyword2: chararray
);

GROUPED_DATA = GROUP DATA BY countryCode;

RESULT = FOREACH GROUPED_DATA GENERATE
    group,
    MAX(DATA.durationTime) AS maxDurationTime,
    MIN(DATA.durationTime) AS minDurationTime;

DUMP RESULT;

size pig


DATA = LOAD '/user/data/samplelogfile' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray,
    keyword1: chararray,
    keyword2: chararray
);

RESULT = FOREACH DATA GENERATE
    SIZE(keyword1) AS numberOfCharacters;

DUMP RESULT;

04:16

log file


1,US,5,https://www.google.com.tr/#q=apache+pig,apache,pig
2,US,25,https://www.google.com.tr/#q=apache+hive,apache,hive
3,TR,20,https://www.google.com.tr/#q=apache+hadoop,apache,hadoop
4,EN,10,https://www.google.com.tr/#q=apache+oozie,apache,oozie
5,PL,16,https://www.google.com.tr/#q=apache%20flume,apache,flume
6,US,24,https://www.google.com.tr/#q=apache+spark,,spark
7,US,36,https://www.google.com.tr/#q=apache+kafka,,kafka
8,EN,48,https://www.google.com.tr/#q=apache+storm,apache storm in hadoop ,storm


sum pig


DATA = LOAD '/user/data/samplelogfile' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray,
    keyword1: chararray,
    keyword2: chararray
);

GROUPED_DATA = GROUP DATA BY countryCode;

RESULT = FOREACH GROUPED_DATA GENERATE
    group,
    SUM(DATA.durationTime) AS totalDurationTime;

DUMP RESULT;


tokenize pig


DATA = LOAD '/user/data/samplelogfile' USING PigStorage(',') AS
(
    id: int,
    countryCode: chararray,
    durationTime: int,
    url: chararray,
    keyword1: chararray,
    keyword2: chararray
);

FILTERED_DATA = FILTER DATA BY id == 8;

RESULT = FOREACH FILTERED_DATA GENERATE
    TOKENIZE(keyword1);

DUMP RESULT;
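TOKENIZE returns a bag of word tuples, so the result above is a single nested bag per row. As a hedged follow-up sketch (FLATTEN is not shown in the lecture), unnesting the bag produces one output tuple per word:

```pig
-- FLATTEN unnests the bag produced by TOKENIZE,
-- yielding one output tuple per word of keyword1.
WORDS = FOREACH FILTERED_DATA GENERATE FLATTEN(TOKENIZE(keyword1));
DUMP WORDS;
```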



Section 4: JOIN and UNION Operators
03:54

Employee log

John,27,1
David,30,2
Peter,29,3


Department log

1,Sales
2,Marketing
3,Engineering


Pig code

EMPLOYEE = LOAD '/user/data/employeelog' USING PigStorage(',') AS
(
    name: chararray,
    age: int,
    dept_id: int
);

DEPARTMENT = LOAD '/user/data/departmentlog' USING PigStorage(',') AS
(
    dept_id: int,
    dept_name: chararray
);

JOINED_DATA = JOIN EMPLOYEE BY dept_id, DEPARTMENT BY dept_id;

DUMP JOINED_DATA;


foreach code

FINAL = FOREACH JOINED_DATA GENERATE
    EMPLOYEE::name,
    EMPLOYEE::age,
    DEPARTMENT::dept_name;

DUMP FINAL;

05:05

Customers log

1,John
2,David
3,Peter

Orders log

100,1,2014-01-29 23:56:57.700
200,4,2015-02-28 01:56:57.700
300,3,2013-03-29 23:56:57.700

Pig Script

CUSTOMER = LOAD '/user/data/customerlog' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray
);

ORDERS = LOAD '/user/data/orderlog' USING PigStorage(',') AS
(
    orderId: int,
    customerId: int,
    orderDate: chararray
);

JOINED_DATA = JOIN CUSTOMER BY customerId LEFT OUTER, ORDERS BY customerId;

DUMP JOINED_DATA;


right join code

CUSTOMER = LOAD '/user/data/customerlog' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray
);

ORDERS = LOAD '/user/data/orderlog' USING PigStorage(',') AS
(
    orderId: int,
    customerId: int,
    orderDate: chararray
);

JOINED_DATA = JOIN CUSTOMER BY customerId RIGHT OUTER, ORDERS BY customerId;

DUMP JOINED_DATA;


03:22

Customers log


1,John
2,David
3,Peter


Orders log

100,1,2014-01-29 23:56:57.700
200,4,2015-02-28 01:56:57.700
300,3,2013-03-29 23:56:57.700



Pig Script

CUSTOMER = LOAD '/user/data/customerlog' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray
);

ORDERS = LOAD '/user/data/orderlog' USING PigStorage(',') AS
(
    orderId: int,
    customerId: int,
    orderDate: chararray
);

JOINED_DATA = JOIN CUSTOMER BY customerId FULL OUTER, ORDERS BY customerId;

DUMP JOINED_DATA;

03:41

Example

Customer log

1,Sam
2,John


Employee log

1,John
2,David
3,Peter


Pig Script


CUSTOMER = LOAD '/user/data/customerlog' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray
);

EMPLOYEE = LOAD '/user/data/employeelog' USING PigStorage(',') AS
(
    employeeId: int,
    employeeName: chararray
);

MERGED_DATA = UNION CUSTOMER, EMPLOYEE;

DUMP MERGED_DATA;
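Note that UNION expects schema-compatible relations and, unlike SQL's UNION, does not remove duplicate tuples. As a hedged follow-up sketch (not part of the lecture), the DISTINCT operator from Section 2 gives set semantics when that is wanted:

```pig
-- Remove any duplicate tuples left over after UNION.
UNIQUE_DATA = DISTINCT MERGED_DATA;
DUMP UNIQUE_DATA;
```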

Section 5: Pig Commands and UDF
04:28

Example log



1,John,US
2,Sarah,TR
3,John,US
4,David,EN
5,Peter,PL


Pig Script


CUSTOMER = LOAD '/user/data/customerlog' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray,
    country: chararray
);

RESULT = FOREACH CUSTOMER GENERATE
    name,
    country;

STORE RESULT INTO '/user/data/customeroutput' USING PigStorage(',');


Storing to a MySQL Database

%declare PARAM_DB_URL 'jdbc:mysql://192.168.1.1:3306/my_db'
%declare PARAM_DB_USERNAME 'username1'
%declare PARAM_DB_PASSWORD 'password1'

REGISTER /user/lib/mysql-connector-java-5.1.21.jar;

CUSTOMER = LOAD '/user/data/customerlog' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray,
    country: chararray
);

RESULT = FOREACH CUSTOMER GENERATE
    name,
    country;

STORE RESULT INTO 'my_table' USING org.apache.pig.piggybank.storage.DBStorage(
    'com.mysql.jdbc.Driver', '$PARAM_DB_URL', '$PARAM_DB_USERNAME', '$PARAM_DB_PASSWORD',
    'insert into my_table (name,country) values (?,?)');


03:45

Example log

1,John,US
2,Sarah,TR
3,John,US
4,David,EN
5,Peter,PL





Pig Script


CUSTOMER = LOAD '/user/data/customer' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray,
    country: chararray
);

RESULT = FOREACH CUSTOMER GENERATE
    name,
    country;

fs -mkdir /user/data/temp
fs -cp /user/data/customer /user/data/temp
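The fs shortcut passes its arguments through to the Hadoop filesystem shell, so other HDFS commands work the same way. A small sketch (the paths reuse the ones above; these particular commands are illustrative, not from the lecture):

```pig
-- List the directory created above, then delete it recursively.
-- (Older Hadoop shells spell the recursive delete as -rmr.)
fs -ls /user/data/temp
fs -rm -r /user/data/temp
```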

05:33

Log file

1,John,US

2,Sarah,TR

3,John,US

4,David,EN

5,Peter,PL



Step 2 - Add hadoop and pig based dependencies

<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>0.20.2</version>
</dependency>


java code

package com.test;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Uppercase extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}


pig code


REGISTER com.test.jar;

-- Define function for use.
DEFINE Uppercase com.test.Uppercase();

CUSTOMER = LOAD '/user/data/customer' USING PigStorage(',') AS
(
    customerId: int,
    name: chararray,
    country: chararray
);

RESULT = FOREACH CUSTOMER GENERATE
    Uppercase(name);

DUMP RESULT;




Instructor Biography

I was born in Ankara. I graduated from the computer engineering department of Istanbul University. I have been building projects with Java and Java web technologies for 7 years. I am also a Java & Android trainer at a private education company. In my free time, I write about Java, Big Data, and Android on my personal blog.
