How can I add row numbers for rows in PIG or HIVE?
Solution 1
Facebook posted a number of hive UDFs including NumberRows. Depending on your hive version (I believe 0.8) you may need to add an attribute to the class (stateful=true).
Solution 2
In Hive:
Query
select str_id,row_number() over() from tabledata;
Output
3D64B18BC842 1
BAECEFA8EFB6 2
346B13E4E240 3
6D8A9D0249B4 4
9FD024AA52BA 5
Solution 3
Pig 0.11 introduced a RANK operator that can be used for this purpose.
Solution 4
For folks wondering about Pig, I found the best way (currently) is to write your own UDF. I wanted to add row numbers for tuples in a bag. This is the code for that:
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;
public class RowCounter extends EvalFunc<DataBag> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
public DataBag exec(Tuple input) throws IOException {
try {
DataBag output = mBagFactory.newDefaultBag();
DataBag bg = (DataBag)input.get(0);
Iterator it = bg.iterator();
Integer count = new Integer(1);
while(it.hasNext())
{ Tuple t = (Tuple)it.next();
t.append(count);
output.add(t);
count = count + 1;
}
return output;
} catch (ExecException ee) {
// error handling goes here
throw ee;
}
}
public Schema outputSchema(Schema input) {
try{
Schema bagSchema = new Schema();
bagSchema.add(new Schema.FieldSchema(null, DataType.BAG));
return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
bagSchema, DataType.BAG));
}catch (Exception e){
return null;
}
}
}
This code is for reference only. Might not be error-proof.
Solution 5
In Hive:
select
str_id, ROW_NUMBER() OVER() as row_num
from myTable;
Breakinen
Updated on June 23, 2020Comments
-
Breakinen almost 4 years
I have a problem when adding row numbers using Apache Pig. The problem is that I have a STR_ID column and I want to add a ROW_NUM column for the data in STR_ID, which is the row number of the STR_ID.
For example, here is the input:
STR_ID ------------ 3D64B18BC842 BAECEFA8EFB6 346B13E4E240 6D8A9D0249B4 9FD024AA52BA
How do I get the output like:
STR_ID | ROW_NUM ---------------------------- 3D64B18BC842 | 1 BAECEFA8EFB6 | 2 346B13E4E240 | 3 6D8A9D0249B4 | 4 9FD024AA52BA | 5
Answers using Pig or Hive are acceptable. Thank you.