Hive Explode / Lateral View multiple arrays
Solution 1
You can use the numeric_range
and array_index
UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. There is an informative blog posting describing in detail over at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/
Using those UDFs, the query would be something like
select cookie,
array_index( product_id_arr, n ) as product_id,
array_index( catalog_id_arr, n ) as catalog_id,
array_index( qty_id_arr, n ) as qty
from table
lateral view numeric_range( size( product_id_arr )) n1 as n;
Solution 2
I found a very good solution to this problem without using any UDF, posexplode is a very good solution :
SELECT COOKIE , ePRODUCT_ID, eCAT_ID, eQTY FROM TABLE LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_IDAS seqp, ePRODUCT_ID LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID LATERAL VIEW posexplode(QTY) eQTY AS seqq, eDateReported WHERE seqp = seqc AND seqc = seqq;
Solution 3
You can do this by using posexplode, which will provide an integer between 0 and n to indicate the position in the array for each element in the array. Then use this integer - call it pos (for position) to get the matching values in other arrays, using block notation, like this:
select
cookie,
n.pos as position,
n.prd_id as product_id,
cat_id[pos] as catalog_id,
qty[pos] as qty
from table
lateral view posexplode(product_id_arr) n as pos, prd_id;
This avoids the using imported UDF's as well as joining various arrays together (this has much better performance).
Solution 4
If you are using Spark 2.4 in pyspark, use arrays_zip
with posexplode
:
df = (df
.withColumn('zipped', arrays_zip('col1', 'col2'))
.select('id', posexplode('zipped')))
Related videos on Youtube
user2726995
Updated on June 26, 2020Comments
-
user2726995 almost 4 years
I have a hive table with the following schema:
COOKIE | PRODUCT_ID | CAT_ID | QTY 1234123 [1,2,3] [r,t,null] [2,1,null]
How can I normalize the arrays so I get the following result
COOKIE | PRODUCT_ID | CAT_ID | QTY 1234123 [1] [r] [2] 1234123 [2] [t] [1] 1234123 [3] null null
I have tried the following:
select concat_ws('|',visid_high,visid_low) as cookie ,pid ,catid ,qty from table lateral view explode(productid) ptable as pid lateral view explode(catalogId) ptable2 as catid lateral view explode(qty) ptable3 as qty
however the result comes out as a Cartesian product.
-
Davos about 7 yearsHey this works! I guess with the caveat that your arrays need to have the same length, and if they don't it's going to truncate them to the length of the shortest one. I'm not sure on the performance of this either, especially with more and more lateral views.
-
Ahmed Abdellatif about 7 yearsYeah, it will take the shortest one because of "WHERE seqp = seqc AND seqc = seqq", for sure the performances will be affected.
-
E B almost 7 years@Jerome .. would this be able to work if the array is of different sizes.. ?
-
Jerome Banks almost 7 yearsI'm not sure different array sizes would make sense. Then you would have to check if n was larger than the current array. Something like . SELECT cookie, IF( n >= size(array1), array_index( array1, n), null), IF(n > =size(array2), array_index(array2,n ) .....
-
runr almost 4 yearsThis deserves more votes! While the alternative answers with multiple posexplodes may be fine with smaller tables, this is the way to go with larger tables and larger number of variables (to be passed to posexplode).