t-sql GROUP BY with COUNT, and then include MAX from the COUNT

28,612

Solution 1

Mechanically there are a few ways to do this. You could use temp tables/table variable. Another way is with nested queries and/or a CTE as @Aaron_Bertrand showed. A third way is to use WINDOWED FUNCTIONS such as...

SELECT    CarName,
          COUNT(*) as theCount,
          MAX(Count(*)) OVER(PARTITION BY 'foo') as MaxPerGroup
FROM      dbo.tbl_Cars
GROUP BY CarName

A DISFAVORED (read depricated) fourth way is using the COMPUTE keyword as such...

SELECT   CarID, CarName, Count(*)
FROM     dbo.tbl_Cars
GROUP BY CarID, CarName 
COMPUTE MAX(Count(*))   

The COMPUTE keyword generates totals that appear as additional summary columns at the end of the result set (see this). In the query above you will actually see two record sets.

Fastest

Now, the next issue is what's the "best/fastest/easiest." I immediately think of an indexed view. As @Aaron gently reminded me, indexed views have all sorts of restrictions. The above, strategy, however, allows you to create an indexed view on the SELECT...FROM..GROUP BY. Then selecting from the indexed view apply the WINDOWED FUNCTION clause.

Without knowing more, however, about your design it is going to be difficult for anyone tell you what's best. You will get lighting fast queries from an indexed view. That performance comes at a price, though. The price is maintenance costs. If the underlying table is the target of a large amount of insert/update/delete operations the maintenance of the indexed view will bog down performance in other areas.

If you share a bit more about your use case and data access patterns people will be able to share more insight.


MICRO PERFORMANCE TEST

So I generated a little data script and looked at sql profiler numbers for the CTE performance vs windowed functions. This is a micro-test, so try some real numbers in your system under real load.

Data generation:

Create table Cars ( CarID int identity (1,1) primary key, 
                    CarName varchar(20), 
                    value int)
GO
insert into Cars (CarName, value)
values  ('Buick', 100),
        ('Ford', 10),
        ('Buick', 300),     
        ('Buick', 100),
        ('Pontiac', 300),       
        ('Bmw', 100),
        ('Mecedes', 300),       
        ('Chevy', 300),     
        ('Buick', 100),
        ('Ford', 200);
GO 1000

This script generates 10,000 rows. I then ran each of the four following queries multiple times :

--just group by
select  CarName,COUNT(*) countThis
FROM    Cars
GROUP BY CarName        

--group by with compute (BAD BAD DEVELOPER!)
select  CarName,COUNT(*) countThis
FROM    Cars
GROUP BY CarName        
COMPUTE  MAX(Count(*));

-- windowed aggregates...
SELECT  CarName,
        COUNT(*) as theCount,
        MAX(Count(*)) OVER(PARTITION BY 'foo') as MaxInAnyGroup
FROM Cars
GROUP BY CarName        

--CTE version
;WITH x AS (
  SELECT   CarName,
           COUNT(*) AS Total
  FROM     Cars
  GROUP BY CarName
)
SELECT x.CarName, x.Total, x2.[Max Total]
FROM x CROSS JOIN (
  SELECT [Max Total] = MAX(Total) FROM x
) AS x2;

After running the above queries, I created an indexed view on the "just group by" query above. Then I ran a query on the indexed view that performed a MAX(Count(*)) OVER(PARTITION BY 'foo'.

AVERAGE RESULTS

Query                      CPU       Reads     Duration   
--------------------------------------------------------
Group By                   15        31        7 ms  
Group & Compute            15        31        7 ms
Windowed Functions         14        56        8 ms 
Common Table Exp.          16        62       15 ms
Windowed on Indexed View    0        24        0 ms

Obviously this is a micro-benchmark and only mildly instructive, so take it for what it's worth.

Solution 2

Here's one way:

;WITH x AS
(
  SELECT   CarID
         , CarName
         , COUNT(*) AS Total
  FROM     dbo.tbl_Cars
  GROUP BY CarID, CarName
)
SELECT x.CarID, x.CarName, x.Total, x2.[Max Total]
FROM x CROSS JOIN
(
  SELECT [Max Total] = MAX(Total) FROM x
) AS x2;
Share:
28,612
mg1075
Author by

mg1075

I spend a fair amount of time querying, reviewing, and visualizing data for projects large and small; once in a while, I write about it, too.

Updated on April 12, 2020

Comments

  • mg1075
    mg1075 about 4 years

    Suppose you had a table of "Cars" with hundreds of thousands of rows, and you wanted to do a GROUP BY:

    SELECT   CarID
             , CarName
             , COUNT(*) AS Total
    FROM     dbo.tbl_Cars
    GROUP BY CarID
             , CarName
    

    The grouping leaves you with a result akin to:

    CarID       CarName    Total
    1872        Olds       202,121   
    547841      BMW        175,298
    9877        Ford        10,241
    

    All fine and well. My question, though, is what is the best way to get the Total and the MAX Total into one table, in terms of performance and clean coding, so you have a result like:

    CarID       CarName    Total      Max Total
    1872        Olds       202,121    202,121
    547841      BMW        175,298    202,121
    9877        Ford        10,241    202,121 
    

    One approach would be to put the GROUP result into a temp table, and then get the MAX from the temp table into a local variable. But I'm wondering what the best way to do this would be.


    UPDATE

    The Common Table Expression seems the most elegant to write, yet similar to @EBarr, my limited testing indicates a significantly slower performance. So I won't be going with the CTE.

    As the link @EBarr has for the COMPUTE option indicates the feature is deprecated, that doesn't seem the best route, either.

    The option of a local variable for the MAX value and the use of a temp table will likely be the route I go down, as I'm not aware of performance issues with it.

    A bit more detail about my use case: it could probably end up being a series of other SO questions. But suffice to say that I'm loading a large subset of data into a temp table (so a subset of tbl_Cars is going into #tbl_Cars, and even #tbl_Cars may be further filtered and have aggregations performed on it), because I have to perform multiple filtering and aggregation queries on it within a single stored proc that returns multiple result sets.


    UPDATE 2

    @EBarr's use of a windowed function is nice and short. Note to self: if using a RIGHT JOIN to an outer reference table, the COUNT() function should select a column from tbl_Cars, not '*'.

    SELECT       M.MachineID
                 , M.MachineType
                 , COUNT(C.CarID) AS Total
                 , MAX(COUNT(C.CarID)) OVER() as MaxTotal
    FROM         dbo.tbl_Cars C
    RIGHT JOIN   dbo.tbl_Machines M
          ON     C.CarID = M.CarID
    GROUP BY     M.MachineID
                 , M.MachineType
    

    In terms of speed, it seems fine, but at what point do you have to be worried about the number of reads?

  • Adir D
    Adir D over 12 years
    You can't use MAX in an indexed view (I've been asking for it for 5 years - connect.microsoft.com/SQLServer/feedback/details/267516/…). Also theFieldBeingSearchedForMax isn't in the table, it's part of the output (it's the highest count).
  • mg1075
    mg1075 over 12 years
    - sorry I failed to add the GROUP BY in the first query; my bad.
  • Adir D
    Adir D over 12 years
    Unfortunately COMPUTE is deprecated and will be removed in the next version of SQL Server, so it isn't really valid to demonstrate its use or compare its performance.
  • EBarr
    EBarr over 12 years
    @AaronBertrand - thanks for keeping me honest. I hadn't used compute in so long that I didn't even know it was depricated. Please tell me windowed functions haven't been yanked too ;-)
  • Adir D
    Adir D over 12 years
    No, they are making all kinds of advancements to windowing functions - this is one reason they're deprecating the COMPUTE / ROLLUP nonsense. :-)