storing pig output into Hive table in a single instance

hadoop hive apache-pig

12,271

Solution 1

Ok. Create a external hive table with a schema layout somewhere in HDFS directory. Lets say

create external table emp_records(id int,
                                  name String,
                                  city String)
                                  row formatted delimited 
                                  fields terminated by '|'
                                  location '/user/cloudera/outputfiles/usecase1';

Just create a table like above and no need to load any file into that directory.

Now write a Pig script that we read data for some input directory and then when you store the output of that Pig script use as below

A =  LOAD 'inputfile.txt' USING PigStorage(',') AS(id:int,name:chararray,city:chararray);
B = FILTER A by id > = 678933;
C = FOREACH B GENERATE id,name,city;
STORE C INTO '/user/cloudera/outputfiles/usecase1' USING PigStorage('|');

Ensure that destination location and delimiter and schema layout of final FOREACH statement in you Pigscript matches with Hive DDL schema.

Solution 2

There are two approaches explained below with 'Employee' table example to store pig output into hive table. (Prerequisite is that hive table should be already created)

A =  LOAD 'EMPLOYEE.txt' USING PigStorage(',') AS(EMP_NUM:int,EMP_NAME:chararray,EMP_PHONE:int);

Approach 1: Using Hcatalog

// dump pig result to Hive using Hcatalog 
store A into 'Empdb.employee' using org.apache.hive.hcatalog.pig.HCatStorer();

(or)

Approach 2: Using HDFS physical location

// dump pig result to external hive warehouse location
STORE A INTO 'hdfs://<<nmhost>>:<<port>>/user/hive/warehouse/Empdb/employee/' USING PigStorage(',')

;

12,271

Author by

Kirthika

Updated on June 13, 2022

Comments

Kirthika almost 2 years

I would like to insert the pig output into Hive tables(tables in Hive is already created with the exact schema).Just need to insert the output values into table. I dont want to the usual method, wherein I first store into a file, then read that file from Hive and then insert into tables. I need to reduce that extra hop which is done.

Is it possible. If so please tell me how this can be done ?

Thanks