Hive getting top n records in group by query

48,687

Solution 1

You can do it with a rank() UDF described here: http://ragrawal.wordpress.com/2011/11/18/extract-top-n-records-in-each-group-in-hadoophive/

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(user-id) as rank, clicks
    FROM mytable
    DISTRIBUTE BY page-id, user-id
    SORT BY page-id, user-id, clicks desc
) a 
WHERE rank < 5
ORDER BY page-id, rank

Solution 2

As of Hive 0.11, you can do this using Hive's built in rank() function and using simpler semantics using Hive's built-in Analytics and Windowing functions. Sadly, I couldn't find as many examples with these as I would have liked, but they are really, really useful. Using those, both rank() and WhereWithRankCond are built in, so you can just do:

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM my table
) ranked_mytable
WHERE ranked_mytable.rank < 5
ORDER BY page-id, rank

No UDF required, and only one subquery! Also, all of the rank logic is localized.

You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.

Solution 3

Revised answer, fixing the bug as mentioned by @Himanshu Gahlot

SELECT page-id, user-id, clicks
FROM (
    SELECT page-id, user-id, rank(page-id) as rank, clicks FROM (
        SELECT page-id, user-id, clicks FROM mytable
        DISTRIBUTE BY page-id
        SORT BY page-id, clicks desc
) a ) b
WHERE rank < 5
ORDER BY page-id, rank

Note that the rank() UDAF is applied to the page-id column, whose new value is used to reset or increase the rank counter (e.g. reset counter for each page-id partition)

Solution 4

Let us say your data looks like following :

page-id   user-id   clicks
page1     user1     10
page1     user2     10
page1     user3     9
page1     user4     8
page1     user5     7
page1     user6     7
page1     user7     6
page1     user8     5
page2     user1     20
page2     user2     19
page2     user3     18

Below Query will give you :

SELECT page-id, user-id, clicks, rank
FROM (
    SELECT page-id, user-id, rank() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5

Result :

page-id   user-id   clicks  rank
page1     user1     10      1
page1     user2     10      1 
page1     user3     9       3 
page1     user4     8       4
page1     user5     7       5 
page1     user6     7       5 
page2     user1     20      1
page2     user2     19      2  
page2     user3     18      3

So, for page1 you are getting 6 users, as users with same number of clicks are ranked same.

But, if you are looking for exactly 5 users, and pick randomly in case multiple users fall in same rank. You can use the below query

SELECT page-id, user-id, clicks, rank
FROM (
    SELECT page-id, user-id, row_number() 
           over (PARTITION BY page-id ORDER BY clicks DESC) as rank, clicks 
    FROM your_table
) ranked_table
WHERE ranked_table.rank <= 5

Result :

page-id   user-id   clicks  rank
page1     user1     10      1
page1     user2     10      2 
page1     user3     9       3 
page1     user4     8       4
page1     user5     7       5 
page2     user1     20      1
page2     user2     19      2  
page2     user3     18      3

Solution 5

You can use each_top_k function of hivemall for an efficient top-k computation on Apache Hive.

select
  page-id, 
  user-id,
  clicks
from (
  select
    each_top_k(5, page-id, clicks, page-id, user-id)
      as (rank, clicks, page-id, user-id)
  from (
    select
      page-id, user-id, clicks
    from
      mytable
    DISTRIBUTE BY page-id SORT BY page-id
  ) t1
) t2
order by page-id ASC, clicks DESC

The each_top_k UDTF is very fast when compared to other methods running top-k queries (e.g., distributed by/rank) in Hive because it does not hold the whole ranking for the intermediate result.

Share:
48,687

Related videos on Youtube

TopCoder
Author by

TopCoder

Updated on July 13, 2022

Comments

  • TopCoder
    TopCoder almost 2 years

    I have following table in hive

    user-id, user-name, user-address,clicks,impressions,page-id,page-name

    I need to find out top 5 users[user-id,user-name,user-address] by clicks for each page [page-id,page-name]

    I understand that we need to first group by [page-id,page-name] and within each group I want to orderby [clicks,impressions] desc and then emit only top 5 users[user-id, user-name, user-address] for each page but I am finding it difficult to construct the query.

    How can we do this using HIve UDF ?

  • arsenal
    arsenal almost 12 years
    Hi Maxime, Sorry for bothering you like this. I am also having the similar problem. I have posted on the SO, but haven't got any good response as I am working with Hive and HiveQL is new to me. http://stackoverflow.com/questions/11405446/find-10-latest-r‌​ecord-for-each-buyer‌​-id-for-yesterdays-d‌​ate. It will be of great help to me.
  • Himanshu Gahlot
    Himanshu Gahlot about 11 years
    I just put in hours making this work but it didn't work. The bug is that you are first ranking and then doing DISTRIBUTE BY and SORT BY. Instead you should apply rank in an outer query and use DISTRIBUTE BY and SORT BY in an inner query. For example, SELECT page-id, user-id, clicks FROM (SELECT page-id, user-id, rank(user-id) as rank, clicks FROM (SELECT * FROM mytable DISTRIBUTE BY page-id, user-id SORT BY page-id, user-id, clicks DESC)a )b WHERE rank < 5 ORDER BY page-id, rank;
  • Hai-Anh Trinh
    Hai-Anh Trinh about 11 years
    Confirmed that @HimanshuGahlot is correct. The answer has a BUG! You must use rank() in an outer query and use DISTRIBUTE/SORT BY in the inner query!
  • minhas23
    minhas23 over 8 years
    Cool.. It saved my search :)
  • John Jiang
    John Jiang over 5 years
    The last order by rank doesn't work because rank was not selected at the outermost level.
  • Dharman
    Dharman over 4 years
    Please don't add the same answer to multiple questions. Answer the best one and flag the rest as duplicates. See Is it acceptable to add a duplicate answer to several questions?
  • jasie
    jasie over 4 years
    Please use code formatting for your query (and user upper case etc)