Hive Explode / Lateral View multiple arrays

51,524

Solution 1

You can use the numeric_range and array_index UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. There is an informative blog posting describing in detail over at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/

Using those UDFs, the query would be something like

select cookie,
   array_index( product_id_arr, n ) as product_id,
   array_index( catalog_id_arr, n ) as catalog_id,
   array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;

Solution 2

I found a very good solution to this problem without using any UDF, posexplode is a very good solution :

SELECT COOKIE ,
ePRODUCT_ID,
eCAT_ID,
eQTY
FROM TABLE 
LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_IDAS seqp, ePRODUCT_ID
LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported
WHERE seqp = seqc AND seqc = seqq;

Solution 3

You can do this by using posexplode, which will provide an integer between 0 and n to indicate the position in the array for each element in the array. Then use this integer - call it pos (for position) to get the matching values in other arrays, using block notation, like this:

select 
  cookie, 
  n.pos as position, 
  n.prd_id as product_id,
  cat_id[pos] as catalog_id,
  qty[pos] as qty
from table
lateral view posexplode(product_id_arr) n as pos, prd_id;

This avoids the using imported UDF's as well as joining various arrays together (this has much better performance).

Solution 4

If you are using Spark 2.4 in pyspark, use arrays_zip with posexplode:

df = (df
    .withColumn('zipped', arrays_zip('col1', 'col2'))
    .select('id', posexplode('zipped')))
Share:
51,524

Related videos on Youtube

user2726995
Author by

user2726995

Updated on June 26, 2020

Comments

  • user2726995
    user2726995 almost 4 years

    I have a hive table with the following schema:

    COOKIE  | PRODUCT_ID | CAT_ID |    QTY    
    1234123   [1,2,3]    [r,t,null]  [2,1,null]
    

    How can I normalize the arrays so I get the following result

    COOKIE  | PRODUCT_ID | CAT_ID |    QTY
    
    1234123   [1]          [r]         [2]
    
    1234123   [2]          [t]         [1] 
    
    1234123   [3]          null       null 
    

    I have tried the following:

    select concat_ws('|',visid_high,visid_low) as cookie
    ,pid
    ,catid 
    ,qty
    from table
    lateral view explode(productid) ptable as pid
    lateral view explode(catalogId) ptable2 as catid 
    lateral view explode(qty) ptable3 as qty
    

    however the result comes out as a Cartesian product.

  • Davos
    Davos about 7 years
    Hey this works! I guess with the caveat that your arrays need to have the same length, and if they don't it's going to truncate them to the length of the shortest one. I'm not sure on the performance of this either, especially with more and more lateral views.
  • Ahmed Abdellatif
    Ahmed Abdellatif about 7 years
    Yeah, it will take the shortest one because of "WHERE seqp = seqc AND seqc = seqq", for sure the performances will be affected.
  • E B
    E B almost 7 years
    @Jerome .. would this be able to work if the array is of different sizes.. ?
  • Jerome Banks
    Jerome Banks almost 7 years
    I'm not sure different array sizes would make sense. Then you would have to check if n was larger than the current array. Something like . SELECT cookie, IF( n >= size(array1), array_index( array1, n), null), IF(n > =size(array2), array_index(array2,n ) .....
  • runr
    runr almost 4 years
    This deserves more votes! While the alternative answers with multiple posexplodes may be fine with smaller tables, this is the way to go with larger tables and larger number of variables (to be passed to posexplode).