How to get array/bag of elements from Hive group by operator?

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2)
FROM sample_table
GROUP BY F1

Unfortunately, it also removes duplicate elements, which I imagine isn't your desired behavior. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
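
On Hive 0.13 and later, the built-in collect_list aggregate keeps duplicates, so a custom UDAF may not be needed. A minimal sketch, assuming the sample_table from the question (the f2_values alias is only illustrative):

```sql
-- Assumes Hive 0.13+ and the two-column sample_table (F1, F2) from the question.
-- collect_list gathers every F2 value in each group, duplicates included.
-- The order of elements inside each array is not guaranteed.
SELECT F1, collect_list(F2) AS f2_values  -- f2_values is a hypothetical alias
FROM sample_table
GROUP BY F1;
```

On older Hive versions collect_list is not available, so collect_set or a custom UDAF remain the options.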


Author: Anuroop

Updated on March 16, 2020

Comments

  • Anuroop
    Anuroop about 4 years

    I want to group by a given field and collect the other field's values for each group. Below is an example of what I am trying to achieve:

    Imagine a table named 'sample_table' with two columns as below:

    F1  F2
    001 111
    001 222
    001 123
    002 222
    002 333
    003 555
    

    I want to write a Hive query that gives the output below:

    001 [111, 222, 123]
    002 [222, 333]
    003 [555]
    

    In Pig, this is easily achieved with something like:

    grouped_relation = GROUP sample_table BY F1;
    

    Can somebody please suggest a simple way to do this in Hive? The only approach I can think of is writing a user-defined function (UDF), but that may be very time-consuming.

  • Alex Woolford over 9 years
    Hive 0.13 added a collect_list function, which keeps duplicates.