How can I use the map datatype in Apache Pig?


Solution 1

Currently, Pig maps need the key to be a chararray (string) constant that you supply, not a variable that contains a string. So in map#key, the key has to be a constant string (e.g. map#'keyvalue').

The typical use case is to load a complex data structure in which one element is a set of key-value pairs; later, in a FOREACH statement, you can refer to a particular value by the key you are interested in.

http://pig.apache.org/docs/r0.9.1/basic.html#map-schema

Solution 2

Great question! I personally do not like maps in Pig. They have a place in traditional programming languages like Java, C#, etc., where it's really handy and fast to look up a key in a map. Maps in Pig, on the other hand, have very limited features.

As you rightly pointed out, one cannot look up a variable key in a Pig map. The key needs to be a constant, e.g. myMap#'keyFoo' is allowed but myMap#$SOME_VARIABLE is not.

If you think about it, you do not need a map in Pig. One usually loads the data from some source, transforms it, joins it with some other dataset, filters it, transforms it again, and so on. JOIN actually does a good job of looking up variable keys in the data. For example, say data1 has 2 columns A and B, and data2 has 3 columns X, Y, Z. If you join data1 BY A with data2 BY Z, JOIN does the work of a map (from a traditional language) that maps the value of column Z to the value of column B (via column A). So data1 essentially represents a map A -> B.
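
The JOIN-as-lookup idea can be sketched roughly as follows. The file names, the comma delimiter, and the chararray types are assumptions made for illustration; only the column names A, B, X, Y, Z come from the description above.

```pig
-- Hypothetical inputs: data1.csv holds (A, B) pairs, data2.csv holds (X, Y, Z) rows.
data1 = LOAD 'data1.csv' USING PigStorage(',') AS (A:chararray, B:chararray);
data2 = LOAD 'data2.csv' USING PigStorage(',') AS (X:chararray, Y:chararray, Z:chararray);

-- Treat data1 as the "map" A -> B and look up each Z value in it.
joined = JOIN data2 BY Z, data1 BY A;

-- Keep the original data2 columns plus the looked-up B value.
looked_up = FOREACH joined GENERATE data2::X, data2::Y, data2::Z, data1::B;
DUMP looked_up;
```

Note that an inner JOIN drops data2 rows whose Z has no match in data1; a LEFT OUTER join would keep them with a null B.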

So why do we need Map in Pig?

Hadoop data are usually dumps of data sources produced by programs written in traditional languages. If the original data sources contain maps, the HDFS data will contain corresponding maps.

How can one handle the Map data?

There are really 2 use cases:

  1. Map keys are constants. For example, HttpRequest header data contains time, server, and clientIp as the keys in a map. To get the value for a particular key, one can access it with a constant key, e.g. header#'clientIp'.

  2. Map keys are variables. In these cases, you would most probably want to JOIN the map keys with some other data set. I usually convert the map to a bag using a MapToBag UDF, which converts the map data into a bag of 2-field tuples (key, value). Once the map data is converted to a bag of tuples, it's easy to join it with other data sets.
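
The first use case (constant keys) might look like the sketch below. The file name headers.txt and its one-map-literal-per-line layout are assumptions; the key names time, server, and clientIp are the ones mentioned above. The typed map[chararray] schema assumes a Pig version that supports typed maps.

```pig
-- Hypothetical input: each line of headers.txt is one map literal, e.g.
-- [time#1325376000,server#web01,clientIp#10.0.0.1]
requests = LOAD 'headers.txt' AS (header:map[chararray]);

-- Look up values with constant keys; a missing key yields null.
ips = FOREACH requests GENERATE header#'clientIp' AS client_ip, header#'server' AS server;
DUMP ips;
```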

I hope this helps.

Solution 3

In Pig version 0.10.0 there is a new function available called "TOMAP" (http://pig.apache.org/docs/r0.10.0/func.html#tomap) that converts its odd (chararray) parameters to keys and even parameters to values. Unfortunately I haven't found it to be that useful, though, since I typically deal with arbitrary dicts of varying lengths and keys.

I would find a TOMAP function that took a tuple as a single argument, instead of a variable number of parameters, to be much more useful.

This isn't a complete solution to your problem, but the availability of TOMAP gives you some more options for constructing a real solution.
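
As a minimal sketch, assuming a Pig version that provides TOMAP, the two-column map.csv from the question could be turned into one single-entry map per row like this. It does not build one large map out of all the rows.

```pig
-- map.csv is the two-column file from the question (1,2 / 3,4 / 5,6).
m = LOAD 'map.csv' USING PigStorage(',') AS (key:chararray, val:chararray);

-- TOMAP takes alternating key/value arguments; each key must be a chararray.
b = FOREACH m GENERATE TOMAP(key, val) AS M;
DUMP b;
```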

Solution 4

1) If you want to load map data, it should look like "[programming#SQL,rdbms#Oracle]"

2) If you want to load tuple data, it should look like "(first_name_1234,middle_initial_1234,last_name_1234)"

3) If you want to load bag data, it should look like "{(project_4567_1),(project_4567_2),(project_4567_3)}"

My file pigtest.csv looks like this:

1234|[email protected]|(first_name_1234,middle_initial_1234,last_name_1234)|{(project_1234_1),(project_1234_2),(project_1234_3)}|[programming#SQL,rdbms#Oracle]
4567|[email protected]|(first_name_4567,middle_initial_4567,last_name_4567)|{(project_4567_1),(project_4567_2),(project_4567_3)}|[programming#Java,OS#Linux]

My schema:

a = LOAD 'pigtest.csv' using PigStorage('|') AS (employee_id:int, email:chararray, name:tuple(first_name:chararray, middle_name:chararray, last_name:chararray), project_list:bag{project: tuple(project_name:chararray)}, skills:map[chararray]) ;

b = FOREACH a GENERATE employee_id, email, name.first_name, project_list, skills#'programming' ;

dump b;

Author by

Admin

Updated on July 09, 2022

Comments

  • Admin
    Admin almost 2 years

    I'd like to use Apache Pig to build a large key -> value mapping, look things up in the map, and iterate over the keys. However, there does not even seem to be syntax for doing these things; I've checked the manual, wiki, sample code, Elephant book, Google, and even tried parsing the parser source. Every single example loads map literals from a file... and then never uses them. How can you use Pig's maps?

    First, there doesn't seem to be a way to load a 2-column CSV file into a map directly. If I have a simple map.csv:

    1,2
    3,4
    5,6
    

    And I try to load it as a map:

    m = load 'map.csv' using PigStorage(',') as (M: []);
    dump m;
    

    I get three empty tuples:

    ()
    ()
    ()
    

    So I try to load tuples and then generate the map:

    m = load 'map.csv' using PigStorage(',') as (key:chararray, val:chararray);
    b = foreach m generate [key#val];
    ERROR 1000: Error during parsing. Encountered " "[" "[ "" at line 1, column 24.
    ...
    

    Many variations on the syntax also fail (e.g., generate [$0#$1]).

    OK, so I munge my map into Pig's map literal format as map.pig:

    [1#2]
    [3#4]
    [5#6]
    

    And load it up:

    m = load 'map.pig' as (M: []);
    

    Now let's load up some keys and try lookups:

    k = load 'keys.csv' as (key);
    dump k;
    3
    5
    1
    
    c = foreach k generate m#key;  /* Or m[key], or... what? */
    ERROR 1000: Error during parsing.  Invalid alias: m in {M: map[ ]}
    

    Hrm, OK, maybe since there are two relations involved, we need a join:

    c = join k by key, m by /* ...um, what? */ $0;
    dump c;
    ERROR 1068: Using Map as key not supported.
    c = join k by key, m by m#key;
    dump c;
    Error 1000: Error during parsing. Invalid alias: m in {M: map[ ]}
    

    Fail. How do I refer to the key (or value) of a map? The map schema syntax doesn't seem to let you even name the key and value (the mailing list says there's no way to assign types).

    Finally, I'd just like to be able to find all the keys in my map:

    d = foreach m generate ...oh, forget it.
    

    Is Pig's map type half-baked? What am I missing?