How do I select TOP 5 PERCENT from each group?

22,526

Solution 1

You could use a CTE (Common Table Expression) paired with the NTILE windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).

;WITH SlicedData AS
(
   SELECT Category, Name, COUNT(Name) Total,
            NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS  'NTile'
   FROM #TEMP
   GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1

This basically groups your data by Category,Name, orders by something else (not sure if COUNT(Name) is really the thing you want here), and then slices it up into 20 pieces, each representing 5% of your data partition. The slice with NTile = 1 is the top 5% slice - just ignore that when selecting from the CTE.

See:

for more info

Solution 2

select Category,name,CountTotal,RankSeq,(50*CountTotal)/100 from (
select Category,name,COUNT(*)
over (partition by Category,name ) as CountTotal,
ROW_NUMBER()
over (partition by Category,name order by Category) RankSeq from #TEMP
--group by Category,Name 
) temp
where RankSeq <= ((50*CountTotal)/100)
order by Category,Name,RankSeq

Output:

Category    name     CountTotal RankSeq     50*CountTotal)/100
A           Adam     4          1           2
A           Adam     4          2           2
A           John     6          1           3
A           John     6          2           3
A           John     6          3           3
A           Lisa     2          1           1
B           Lily     5          1           2
B           Lily     5          2           2
B           Ross     3          1           1
B           Tom      4          1           2
B           Tom      4          2           2

I hope this helps :)

Share:
22,526
Legend
Author by

Legend

Just a simple guy :)

Updated on July 19, 2022

Comments

  • Legend
    Legend almost 2 years

    I have a sample table like this:

    CREATE TABLE #TEMP(Category VARCHAR(100), Name VARCHAR(100))
    
    INSERT INTO #TEMP VALUES('A', 'John')
    INSERT INTO #TEMP VALUES('A', 'John')
    INSERT INTO #TEMP VALUES('A', 'John')
    INSERT INTO #TEMP VALUES('A', 'John')
    INSERT INTO #TEMP VALUES('A', 'John')
    INSERT INTO #TEMP VALUES('A', 'John')
    INSERT INTO #TEMP VALUES('A', 'Adam')
    INSERT INTO #TEMP VALUES('A', 'Adam')
    INSERT INTO #TEMP VALUES('A', 'Adam')
    INSERT INTO #TEMP VALUES('A', 'Adam')
    INSERT INTO #TEMP VALUES('A', 'Lisa')
    INSERT INTO #TEMP VALUES('A', 'Lisa')
    INSERT INTO #TEMP VALUES('A', 'Bucky')
    INSERT INTO #TEMP VALUES('B', 'Lily')
    INSERT INTO #TEMP VALUES('B', 'Lily')
    INSERT INTO #TEMP VALUES('B', 'Lily')
    INSERT INTO #TEMP VALUES('B', 'Lily')
    INSERT INTO #TEMP VALUES('B', 'Lily')
    INSERT INTO #TEMP VALUES('B', 'Tom')
    INSERT INTO #TEMP VALUES('B', 'Tom')
    INSERT INTO #TEMP VALUES('B', 'Tom')
    INSERT INTO #TEMP VALUES('B', 'Tom')
    INSERT INTO #TEMP VALUES('B', 'Ross')
    INSERT INTO #TEMP VALUES('B', 'Ross')
    INSERT INTO #TEMP VALUES('B', 'Ross')
    
    SELECT Category, Name, COUNT(Name) Total
    FROM #TEMP
    GROUP BY Category, Name
    ORDER BY Category, Total DESC
    
    DROP TABLE #TEMP
    

    Gives me the following:

    A   John    6
    A   Adam    4
    A   Lisa    2
    A   Bucky   1
    B   Lily    5
    B   Tom     4
    B   Ross    3
    

    Now, how do I select the TOP 5 PERCENT records from each category assuming each category has more than 100 records (did not show in sample table here)? For instance, in my actual table, it should remove the John record from A and Lily record from B as appropriate (again, I did not show the full table here) to get:

    A   Adam    4
    A   Lisa    2
    A   Bucky   1
    B   Tom     4
    B   Ross    3
    

    I have been trying to use CTEs and PARTITION BY clauses but cannot seem to achieve what I want. It removes the TOP 5 PERCENT from the overall result but not from each category. Any suggestions?

    • Kieren Johnstone
      Kieren Johnstone over 12 years
      May help in a small way - If you have a count for a group, remember that 5 percent would be "row_num <= (5 * count) / 100"
    • Legend
      Legend over 12 years
      @KierenJohnstone: +1 Thank you. I know I might have to use CROSS APPLY or something similar but still having some trouble. Will update if I figure it out.
    • ZygD
      ZygD over 12 years
      What is the desired output then please? Remove top 5 percent percent is very little compared to a count of 6. One row (A, John) is 16%.
    • ypercubeᵀᴹ
      ypercubeᵀᴹ over 12 years
      See this similar question: stackoverflow.com/questions/4373451/…
    • Legend
      Legend over 12 years
      @gbn: Sorry! I could not add 100 records to the sample table. I updated my question to have the assumption that I have more than 100 records in each category. I put the output I am expecting but of course this won't work on the sample table.
    • Martin Smith
      Martin Smith over 12 years
      @Legend - Still not clear to me what you want. Please give desired results and explain how they are arrived at. Not sure at what point you want the TOP 5% applied.
    • Legend
      Legend over 12 years
      @MartinSmith: Updated my question with an example output. I want the TOP 5 PERCENT to be applied after calculating the counts.
    • Tim Rogers
      Tim Rogers over 12 years
      I'm not sure if I understand you right, but it's going to be very hard to remove some rows but not others if they don't have unique fields. I.e. let's suppose there are 100 A, John records; you can only remove all or none of them, not the top 5%.
  • Legend
    Legend over 12 years
    This serves my purpose. Thanks a million. I fixed your post for some missing parts in the query to make it run out of the box.
  • Tim Rogers
    Tim Rogers over 12 years
    @Legend I thought you wanted to remove records, not just select them?
  • Legend
    Legend over 12 years
    @TimRogers: Sure. I just made an exclusion list using this query for the names that I wanted to remove. I will try to fix my question.