T-SQL: Calculating the Nth Percentile Value from a Column

Solution 1

If you want to get exactly the 90th percentile value, excluding NULLs, I would suggest doing the calculation directly. The following version calculates the row number and number of rows, and selects the appropriate value:

select max(case when rownum*1.0/numrows <= 0.9 then colA end) as percentile_90th
from (select colA,
             row_number() over (order by colA) as rownum,  -- 1-based rank by value
             count(*) over () as numrows                   -- total number of non-NULL rows
      from t
      where colA is not null
     ) ranked;

I put the condition in the SELECT clause rather than the WHERE clause, so you can easily get the 50th percentile, the 17th, or any other percentile by changing the threshold or adding more columns.
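As a sketch of that idea, the query below computes three percentiles in a single pass by repeating the CASE expression with different thresholds. The inline `sample_data` CTE and the column aliases are assumptions made for illustration; the values are the ones from the question.

```sql
-- Hypothetical inline sample mirroring the question's data (including the NULLs).
WITH sample_data AS (
  SELECT ColA
  FROM (VALUES (NULL), (100), (200), (300), (NULL), (400),
               (500), (600), (700), (800), (900), (1000)) v(ColA)
)
SELECT
  -- Each column keeps the largest value whose rank fraction is within the threshold.
  MAX(CASE WHEN rownum * 1.0 / numrows <= 0.5 THEN ColA END) AS percentile_50th,
  MAX(CASE WHEN rownum * 1.0 / numrows <= 0.8 THEN ColA END) AS percentile_80th,
  MAX(CASE WHEN rownum * 1.0 / numrows <= 0.9 THEN ColA END) AS percentile_90th
FROM (
  SELECT ColA,
         ROW_NUMBER() OVER (ORDER BY ColA) AS rownum,  -- 1-based rank by value
         COUNT(*) OVER () AS numrows                   -- 10 non-NULL rows here
  FROM sample_data
  WHERE ColA IS NOT NULL
) ranked;
```

With the ten non-NULL values above, the thresholds keep ranks 1 to 5, 1 to 8, and 1 to 9 respectively, so the three columns should come out as 500, 800, and 900.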

Solution 2

WITH
  percentiles AS
(
  SELECT
    NTILE(100) OVER (ORDER BY ColA) AS percentile,
    *
  FROM
    data
  WHERE
    ColA IS NOT NULL  -- NULLs sort first and would otherwise occupy the lowest tiles
)
SELECT
  *
FROM
  percentiles
WHERE
  percentile = 90


Note: If the data has fewer than 100 observations, not all percentiles will have a value. Equally, if it has more than 100 observations, some percentiles will contain more than one value.
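For a small data set like the one in the question, one possible workaround (an assumption on my part, not part of the answer above) is to use fewer buckets than 100 and collapse the chosen bucket with MAX so that a single value comes back. The `sample_data` CTE, the `decile` alias, and the choice of 10 buckets are illustrative only.

```sql
-- Hypothetical inline sample mirroring the question's data (including the NULLs).
WITH sample_data AS (
  SELECT ColA
  FROM (VALUES (NULL), (100), (200), (300), (NULL), (400),
               (500), (600), (700), (800), (900), (1000)) v(ColA)
),
buckets AS (
  SELECT ColA,
         NTILE(10) OVER (ORDER BY ColA) AS decile  -- 10 rows, so one row per bucket
  FROM sample_data
  WHERE ColA IS NOT NULL                           -- keep NULLs out of the buckets
)
SELECT MAX(ColA) AS percentile_90th                -- collapse the bucket to one value
FROM buckets
WHERE decile = 9;
```

With ten non-NULL rows each bucket holds exactly one row, so bucket 9 contains 900. With NTILE(100) on the same data only tiles 1 through 10 would be populated, and a filter on tile 90 would return no rows.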

Solution 3

Starting with SQL Server 2012, the PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions are available. They are (so far) only available as window functions, not as aggregate functions, so you have to remove the redundant rows that result from the missing grouping, e.g. by using DISTINCT or TOP 1:

WITH t AS (
  SELECT *
  FROM (
    VALUES(NULL),(100),(200),(300),
      (NULL),(400),(500),(600),(700),
      (800),(900),(1000)
  ) t(ColA)
)
SELECT DISTINCT percentile_disc(0.9) WITHIN GROUP (ORDER BY ColA) OVER()
FROM t
;
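The TOP 1 variant and the difference between the two functions can be sketched as follows. The column aliases are assumptions for illustration; the data is the same inline sample as above.

```sql
WITH t AS (
  SELECT *
  FROM (
    VALUES (NULL), (100), (200), (300),
      (NULL), (400), (500), (600), (700),
      (800), (900), (1000)
  ) t(ColA)
)
-- Every row carries the same window result, so TOP (1) is enough to get one row.
SELECT TOP (1)
  PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY ColA) OVER () AS disc_90th,  -- an actual value from the column
  PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY ColA) OVER () AS cont_90th   -- may interpolate between two values
FROM t;
```

Both functions ignore NULLs. PERCENTILE_DISC returns a value that exists in the column (900 for this data), while PERCENTILE_CONT interpolates at row position 1 + 0.9 * (10 - 1) = 9.1, which falls between 900 and 1000 and gives roughly 910. For the behaviour asked about in the question, PERCENTILE_DISC is the closer match.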

I have blogged about percentiles in more detail here.

Author: jbeldock

Updated on July 06, 2022

Comments

  • jbeldock almost 2 years

    I have a column of data, some of which are NULL values, from which I wish to extract the single 90th percentile value:

    ColA
    -----
    NULL
    100
    200
    300
    NULL
    400
    500
    600
    700
    800
    900
    1000
    

    For the above, I am looking for a technique which returns the value 900 when searching for the 90th percentile, 800 for the 80th percentile, etc. An analogous function would be AVG(ColA) which returns 550 for the above data, or MIN(ColA) which returns 100, etc.

    Any suggestions?