How to query and iterate over array of structures in Athena (Presto)?
Solution 1
UNNEST
is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's Lambda functions is more straight forward:
SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1
This solution uses FILTER
on the profile.primaryApplicant.incomes
array to get only those with an incomeType
of SALARY
, and then CARDINALITY
to extract the length of that result.
Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.
Solution 2
You can combine filter
with cardinality
to filter array elements having incomeType = 'SALARY'
more than once.
This can be further improve so that intermediate array is not materialized by using reduce
(see examples in the docs; I'm not quoting them here, since they do not directly answer your question).
Related videos on Youtube
Comments
-
tea almost 2 years
I have a S3 bucket with 500,000+
json
records, eg.{ "userId": "00000000001", "profile": { "created": 1539469486, "userId": "00000000001", "primaryApplicant": { "totalSavings": 65000, "incomes": [ { "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" }, { "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" } ] } } }
I created a new table in Athena
CREATE EXTERNAL TABLE profiles ( userId string, profile struct< created:int, userId:string, primaryApplicant:struct< totalSavings:int, incomes:array<struct<amount:int,incomeType:string,frequency:string>>, > > ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true') LOCATION 's3://profile-data'
I am interested in the
incomeTypes
, eg."SALARY"
,"PENSIONS"
,"OTHER"
, etc.. and ran this query changingjsonData.incometype
each time:SELECT jsonData FROM "sampledb"."profiles" CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData) WHERE jsonData.incometype='SALARY'
This worked fine with
CROSS JOIN UNNEST
which flattened the incomes array so that the data example above would span across 2 rows. The only idiosyncratic thing was thatCROSS JOIN UNNEST
made all the field names lowercase, eg. a row looked like this:{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}
Now I have been asked how many users have two or more
"SALARY"
entries, eg."incomes": [ { "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" }, { "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" } ],
I'm not sure how to go about this.
How do I query the array of structures to look for duplicate
incomeTypes
of"SALARY"
?Do I have to iterate over the array?
What should the result look like?
-
tea about 5 yearsThanks for explaining and offering a great solution @Theo. This was a very fast and effective.
-
tea about 5 yearsThanks @PiotrFindeisen. Theo had similar thoughts.