Hive Explode / Lateral View multiple arrays

hive explode hiveql

51,524

Solution 1

You can use the numeric_range and array_index UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. There is an informative blog posting describing in detail over at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/

Using those UDFs, the query would be something like

select cookie,
   array_index( product_id_arr, n ) as product_id,
   array_index( catalog_id_arr, n ) as catalog_id,
   array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;

Solution 2

I found a very good solution to this problem without using any UDF, posexplode is a very good solution :

SELECT COOKIE ,
ePRODUCT_ID,
eCAT_ID,
eQTY
FROM TABLE 
LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_IDAS seqp, ePRODUCT_ID
LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported
WHERE seqp = seqc AND seqc = seqq;

Solution 3

You can do this by using posexplode, which will provide an integer between 0 and n to indicate the position in the array for each element in the array. Then use this integer - call it pos (for position) to get the matching values in other arrays, using block notation, like this:

select 
  cookie, 
  n.pos as position, 
  n.prd_id as product_id,
  cat_id[pos] as catalog_id,
  qty[pos] as qty
from table
lateral view posexplode(product_id_arr) n as pos, prd_id;

This avoids the using imported UDF's as well as joining various arrays together (this has much better performance).

Solution 4

If you are using Spark 2.4 in pyspark, use arrays_zip with posexplode:

df = (df
    .withColumn('zipped', arrays_zip('col1', 'col2'))
    .select('id', posexplode('zipped')))

View more solutions

51,524

user2726995

Updated on June 26, 2020

Comments

user2726995 almost 4 years

I have a hive table with the following schema:

COOKIE  | PRODUCT_ID | CAT_ID |    QTY    
1234123   [1,2,3]    [r,t,null]  [2,1,null]

How can I normalize the arrays so I get the following result

COOKIE  | PRODUCT_ID | CAT_ID |    QTY

1234123   [1]          [r]         [2]

1234123   [2]          [t]         [1] 

1234123   [3]          null       null

I have tried the following:

select concat_ws('|',visid_high,visid_low) as cookie
,pid
,catid 
,qty
from table
lateral view explode(productid) ptable as pid
lateral view explode(catalogId) ptable2 as catid 
lateral view explode(qty) ptable3 as qty

however the result comes out as a Cartesian product.

Davos about 7 years

Hey this works! I guess with the caveat that your arrays need to have the same length, and if they don't it's going to truncate them to the length of the shortest one. I'm not sure on the performance of this either, especially with more and more lateral views.
Ahmed Abdellatif about 7 years

Yeah, it will take the shortest one because of "WHERE seqp = seqc AND seqc = seqq", for sure the performances will be affected.
E B almost 7 years

@Jerome .. would this be able to work if the array is of different sizes.. ?
Jerome Banks almost 7 years

I'm not sure different array sizes would make sense. Then you would have to check if n was larger than the current array. Something like . SELECT cookie, IF( n >= size(array1), array_index( array1, n), null), IF(n > =size(array2), array_index(array2,n ) .....
runr almost 4 years

This deserves more votes! While the alternative answers with multiple posexplodes may be fine with smaller tables, this is the way to go with larger tables and larger number of variables (to be passed to posexplode).