Pig: FLATTEN keyword

  1. Sometimes you have data in a bag or a tuple and you want to remove that level of nesting.
  2. When you want to regroup your data on the fly by a particular field, you need a way to pull those entries out of the bag.
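The second case might look like the following rough sketch. The file name `orders.txt`, the relation names, and the field names are all hypothetical, and the input is assumed to be tab-separated with the bag written in Pig's text form.

```pig
-- Hypothetical input, one line per user, e.g.:
--   alice	{(apple),(pear)}
orders = LOAD 'orders.txt' AS (user:chararray, items:bag{t:(item:chararray)});

-- FLATTEN on a bag emits one output row per tuple inside the bag,
-- so the sample line becomes (alice,apple) and (alice,pear)
per_item = FOREACH orders GENERATE user, FLATTEN(items) AS item;

-- The formerly nested field is now a top-level column and can be a group key
by_item = GROUP per_item BY item;
counts  = FOREACH by_item GENERATE group AS item, COUNT(per_item) AS n;
DUMP counts;
```

Without the FLATTEN step, `item` stays inside the bag and cannot be used directly as the key in `GROUP ... BY`.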

As per Pig documentation:

The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and result is different for each type of structure.
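To illustrate the difference between the two structures, here is a small sketch. The file `data.txt` and the names `r`, `t`, `b`, `x`, `y`, and `v` are made up for the example.

```pig
-- Hypothetical input with one tuple field and one bag field, e.g.:
--   (1,2)	{(a),(b)}
r = LOAD 'data.txt' AS (t:tuple(x:int, y:int), b:bag{e:(v:chararray)});

-- Tuple: the fields are promoted to the top level; the row count is unchanged
flat_t = FOREACH r GENERATE FLATTEN(t);

-- Bag: one output row per inner tuple, so the sample line yields (1,a) and (1,b);
-- a row whose bag is empty produces no output rows at all
flat_b = FOREACH r GENERATE t.x, FLATTEN(b);

DESCRIBE flat_t;   -- fields appear as t::x and t::y
DESCRIBE flat_b;
```

So flattening a tuple only changes the schema, while flattening a bag also changes how many rows you get.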

For more details, check this link; it explains the usage of FLATTEN clearly, with examples.

user182944
Author by

user182944

Updated on June 04, 2022

Comments

  • user182944
    user182944 almost 2 years

    I am a little confused about the use of the FLATTEN keyword in Pig.

    Consider the following schema:

    tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}
    

    Without using the FLATTEN I can access a field (suppose firstname) like this:

    display_firstname = FOREACH tuple_record GENERATE details.firstname;
    

    Now, using the FLATTEN keyword:

    flatten_record = FOREACH tuple_record GENERATE FLATTEN(details);
    

    DESCRIBE gives me this:

    flatten_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}
    

    Hence I can access the fields directly, without dereferencing, like this:

    display_record = FOREACH flatten_record GENERATE firstname;
    

    My questions about the FLATTEN keyword are:

    1) Which of the two ways (i.e. with or without FLATTEN) is the more optimized way of achieving the same output?

    2) Are there any special scenarios where the desired output can't be achieved without the FLATTEN keyword?

    I am totally confused; please clarify its use and in which scenarios I should use it.