Partition Function COUNT() OVER possible using DISTINCT

sql sql-server tsql sql-server-2008-r2 sql-server-2014

197,707

Solution 1

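There is a very simple solution using dense_rank(). Within each partition, a value's ascending dense rank plus its descending dense rank is always one more than the number of distinct values, so subtracting 1 gives the distinct count: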

dense_rank() over (partition by [Mth] order by [UserAccountKey]) 
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc) 
- 1

This will give you exactly what you were asking for: The number of distinct UserAccountKeys within each month.
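As a quick sanity check, here is a minimal sketch on made-up data. The inline VALUES list and its numbers are assumptions for illustration only, not the asker's table.

```sql
-- Hypothetical sample data: month 1 has keys 10, 10, 20; month 2 has key 30 twice.
SELECT  [Mth],
        [UserAccountKey],
          dense_rank() over (partition by [Mth] order by [UserAccountKey])
        + dense_rank() over (partition by [Mth] order by [UserAccountKey] desc)
        - 1 AS NumUsers
FROM    (VALUES (1, 10), (1, 10), (1, 20), (2, 30), (2, 30)) AS t([Mth], [UserAccountKey]);
```

Every row of month 1 should come back with NumUsers = 2 and every row of month 2 with NumUsers = 1. For key 10 in month 1, for instance, the ascending rank is 1 and the descending rank is 2, so 1 + 2 - 1 = 2.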

Solution 2

Necromancing:

It's relatively simple to emulate a COUNT DISTINCT over PARTITION BY with MAX via DENSE_RANK:

;WITH baseTable AS
(
    SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
    SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr 
    FROM baseTable
)
SELECT
     RM
    ,ADR

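    -- running count within RM (ORDER BY inside an aggregate window needs SQL Server 2012+)
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1 
    -- number of non-NULL ADR rows per RM
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2 
    -- Not supported
    --,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
    -- the highest dense rank in the partition equals the number of distinct ADR values
    ,MAX(CTE.dr) OVER (PARTITION BY CTE.RM) AS cntDistEmu 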
FROM CTE

Note:
This assumes the fields in question are non-nullable.
DENSE_RANK ranks NULL as a value of its own, so for any partition that contains one or more NULL entries you need to subtract 1.
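One way to apply that NULL correction per partition is sketched below, using the MAX(CASE ...) term suggested in the comments further down. The sample rows are invented for illustration.

```sql
-- Hypothetical data: RM1 has ADR1, ADR1 and a NULL (1 distinct non-NULL value); RM2 has ADR1, ADR2.
;WITH baseTable AS
(
    SELECT RM, ADR
    FROM (VALUES ('RM1', 'ADR1'), ('RM1', 'ADR1'), ('RM1', NULL),
                 ('RM2', 'ADR1'), ('RM2', 'ADR2')) AS v(RM, ADR)
)
,CTE AS
(
    SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr
    FROM baseTable
)
SELECT
     RM
    ,ADR
    -- subtract 1 only in partitions where a NULL took up one of the ranks
    ,MAX(dr) OVER (PARTITION BY RM)
     - MAX(CASE WHEN ADR IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY RM) AS cntDistEmu
FROM CTE
```

With this data cntDistEmu should be 1 for the RM1 rows (ranks 1 and 2, minus 1 for the NULL) and 2 for the RM2 rows.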

Solution 3

I use a solution that is similar to that of David above, but with an additional twist if some rows should be excluded from the count. This assumes that [UserAccountKey] is never null.

-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
) 
+ dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1

An SQL Fiddle with an extended example can be found here.
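For a self-contained check of the expression above, a small sketch with invented rows follows. The values and the inline VALUES list are assumptions, not data from the question.

```sql
-- Hypothetical data: in month 1 only keys 10 and 30 are flagged 'Y'; month 2 has key 40 twice.
SELECT  [Mth],
        [UserAccountKey],
        [Include],
          dense_rank() over (
            partition by [Mth]
            order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
          )
        + dense_rank() over (
            partition by [Mth]
            order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
          )
        - max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
        - 1 AS NumUsers
FROM    (VALUES (1, 10, 'Y'), (1, 10, 'Y'), (1, 20, 'N'), (1, 30, 'Y'),
                (2, 40, 'Y'), (2, 40, 'Y')) AS t([Mth], [UserAccountKey], [Include]);
```

Month 1 should report NumUsers = 2 on every row, including the excluded one: the two ranks always add up to 4 there, and both the NULL correction and the final 1 are subtracted. Month 2 should report 1.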

Solution 4

I think the only way of doing this in SQL-Server 2008R2 is to use a correlated subquery, or an outer apply:

SELECT  datekey,
        COALESCE(RunningTotal, 0) AS RunningTotal,
        COALESCE(RunningCount, 0) AS RunningCount,
        COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM    document
        OUTER APPLY
        (   SELECT  SUM(Amount) AS RunningTotal,
                    COUNT(1) AS RunningCount,
                    COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
            FROM    Document d2
            WHERE   d2.DateKey <= document.DateKey
        ) rt;

This can be done in SQL-Server 2012 using the syntax you have suggested:

SELECT  datekey,
        SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM    document

However, DISTINCT is still not allowed inside a windowed aggregate, so if DISTINCT is required or upgrading isn't an option, I think OUTER APPLY is your best option.
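The same OUTER APPLY idea can also be pointed at the per-month distinct count from the question. This is only a sketch: the CTE name src and its rows are hypothetical stand-ins for the real table.

```sql
-- Hypothetical stand-in for the real table: month 1 has keys 10, 10, 20; month 2 has key 30 twice.
;WITH src AS
(
    SELECT [Mth], [UserAccountKey]
    FROM (VALUES (1, 10), (1, 10), (1, 20), (2, 30), (2, 30)) AS v([Mth], [UserAccountKey])
)
SELECT  s.[Mth],
        s.[UserAccountKey],
        c.NumUsers
FROM    src AS s
        OUTER APPLY
        (   -- an ordinary aggregate, so DISTINCT is allowed here
            SELECT  COUNT(DISTINCT s2.[UserAccountKey]) AS NumUsers
            FROM    src AS s2
            WHERE   s2.[Mth] = s.[Mth]
        ) AS c;
```

Each month 1 row should get NumUsers = 2 and each month 2 row NumUsers = 1. Unlike the dense_rank() approach, COUNT(DISTINCT ...) ignores NULLs without any extra correction.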

Solution 5

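There is a solution in simple SQL (in T-SQL, user is a reserved keyword and has to be written as [user]). A running distinct count written like this is not allowed:

SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users

Each user only needs to be counted once, at their first appearance, so reduce the table to one row per user (keeping the earliest time) and take a plain running count, which gives one row per user rather than one per original row:

SELECT time, COUNT(*) OVER(ORDER BY time) AS users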
FROM (
    SELECT user, MIN(time) AS time
    FROM users
    GROUP BY user
) t
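A T-SQL flavoured sketch of the query above, on invented data, might look like this. It assumes SQL Server 2012 or later because of the ORDER BY inside COUNT(*) OVER, and it brackets [user] and [time] because user is a reserved keyword in T-SQL.

```sql
-- Hypothetical log: user a appears at times 1 and 3, b at time 2, c at time 4.
;WITH users AS
(
    SELECT [time], [user]
    FROM (VALUES (1, 'a'), (2, 'b'), (3, 'a'), (4, 'c')) AS v([time], [user])
)
SELECT [time], COUNT(*) OVER(ORDER BY [time]) AS users
FROM (
    -- one row per user, at the time that user was first seen
    SELECT [user], MIN([time]) AS [time]
    FROM users
    GROUP BY [user]
) t
```

This should return three rows: 1 user at time 1, 2 at time 2 and 3 at time 4. Time 3 does not appear because no new user shows up then.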



whytheq


Updated on February 03, 2022

Comments

whytheq over 2 years
I'm trying to write the following in order to get a running total of distinct NumUsers, like so:
```
NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])
```
Management studio doesn't seem too happy about this. The error disappears when I remove the DISTINCT keyword, but then it won't be a distinct count.

DISTINCT does not appear to be possible within the partition functions. How do I go about finding the distinct count? Do I use a more traditional method such as a correlated subquery?

Looking into this a bit further, maybe these OVER functions work differently to Oracle in the way that they cannot be used in SQL-Server to calculate running totals.

I've added a live example here on SQLfiddle where I attempt to use a partition function to calculate a running total.
- Damien_The_Unbeliever almost 12 years
  
  COUNT with ORDER BY instead of PARTITION BY is ill-defined in 2008. I'm surprised it's letting you have it at all. Per the documentation, you're not allowed an ORDER BY for an aggregate function.
- whytheq almost 12 years
  
  yep - think I'm getting confused with some oracle functionality; these running totals and running counts will be a little more involved
whytheq almost 12 years

cool thank you. I found this SO answer which features the OUTER APPLY option which I will attempt. Have you seen the looping UPDATE approach in that answer ... it's pretty far out & apparently fast. Life will be easier in 2012 - is that a straight Oracle copy?
bf2020 about 10 years

One thing to be careful about with dense_rank() is that it will count NULLs whereas COUNT(field) OVER does not. I can't employ it in my solution because of this but I still think it's quite clever.
whytheq almost 8 years

But I'm looking for a running total of distinct useraccountkeys over the months of each year: not sure how this answers that?
Vladimir Baranov almost 7 years

@bf2020, if there can be NULL values in UserAccountKey, then you need to add this term: -MAX(CASE WHEN UserAccountKey IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY Mth). The idea is taken from the answer by LarsRönnbäck below. Essentially, if UserAccountKey has NULL values, you need to subtract an extra 1 from the result, because DENSE_RANK counts NULLs.
Vladimir Baranov almost 7 years

Your idea can be used to make the original formula (without complexities of [Include] that you are talking about in your answer) with dense_rank() work when UserAccountKey can be NULL. Add this term to the formula: -MAX(CASE WHEN UserAccountKey IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY Mth).
K4M over 3 years

Here is a discussion of using this dense_rank solution when the window function has a frame. SQL Server does not allow dense_rank to be used with a window frame: stackoverflow.com/questions/63527035/…