Pig: FLATTEN keyword
- Sometimes you have data in a bag or a tuple and you want to remove that level of nesting.
- When you want to rearrange your data on the fly and group by a particular field, you need a way to pull those entries out of the bag.
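The second point can be sketched as follows. This is only an illustration: the file name orders.txt, the relation names, and the fields customer, product and qty are hypothetical and do not come from the question.

```pig
-- Hypothetical tab-separated input 'orders.txt', one line per customer:
--   alice   {(apple,3),(pear,1)}
--   bob     {(apple,2)}
orders = LOAD 'orders.txt'
         AS (customer:chararray,
             items:bag{t:tuple(product:chararray, qty:int)});

-- FLATTEN on a bag emits one output row per tuple in the bag:
--   (alice,apple,3) (alice,pear,1) (bob,apple,2)
flat = FOREACH orders GENERATE customer, FLATTEN(items) AS (product, qty);

-- The un-nested field is now a top-level column, so it can be a grouping key.
by_product = GROUP flat BY product;
totals     = FOREACH by_product GENERATE group AS product, SUM(flat.qty) AS total_qty;

DUMP totals;
```

Before the FLATTEN, product exists only inside each customer's bag, so there is no top-level column to pass to GROUP ... BY.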
As per Pig documentation:
The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and result is different for each type of structure.
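To make the tuple/bag difference concrete, here is a minimal sketch. The file data.txt and the fields id, pair and tags are assumptions made up for the illustration.

```pig
-- Hypothetical tab-separated input 'data.txt' with one tuple field and one bag field:
--   1   (x,y)   {(p),(q)}
data = LOAD 'data.txt'
       AS (id:int,
           pair:tuple(a:chararray, b:chararray),
           tags:bag{t:tuple(tag:chararray)});

-- Tuple: the tuple's fields are promoted to top-level columns.
-- The number of rows does not change: (1,x,y)
flat_tuple = FOREACH data GENERATE id, FLATTEN(pair);
DESCRIBE flat_tuple;
-- flat_tuple: {id: int,pair::a: chararray,pair::b: chararray}

-- Bag: one output row is produced per tuple in the bag,
-- so the number of rows can change: (1,p) (1,q)
flat_bag = FOREACH data GENERATE id, FLATTEN(tags);
```

For a tuple, FLATTEN mostly saves you the dereference (pair.a). For a bag, it changes the row count, which a plain dereference cannot do. Note that a row whose bag is empty produces no output rows at all when the bag is flattened.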
For more details, check this link; it explains the usage of FLATTEN clearly, with examples.
user182944
Updated on June 04, 2022
Comments
- user182944 almost 2 years
I am a little confused about the use of the FLATTEN keyword in Pig. Consider the dataset below:
tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}
Without using FLATTEN, I can access a field (say, firstname) like this:
display_firstname = FOREACH tuple_record GENERATE details.firstname;
Now, using the FLATTEN keyword:
flatten_record = FOREACH tuple_record GENERATE FLATTEN(details);
DESCRIBE gives me this:
flatten_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}
Hence I can access the fields directly, without dereferencing, like this:
display_record = FOREACH flatten_record GENERATE firstname;
My questions about the FLATTEN keyword are:
1) Which of the two ways (with or without FLATTEN) is the more optimized way of achieving the same output?
2) Are there any special scenarios where the desired output can't be achieved without using the FLATTEN keyword?
I am totally confused; please clarify its use and the scenarios in which I should use it.