Explain the quantile() function in R

33,495

Solution 1

You're understandably confused. That documentation is terrible. I had to go back to the paper its based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to get an understanding. Let's start with the first problem.

where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.

The first part comes straight from the paper, but what the documentation writers omitted was that j = int(pn+m). This means Q[i](p) only depends on the two order statistics closest to being p fraction of the way through the (sorted) observations. (For those, like me, who are unfamiliar with the term, the "order statistics" of a series of observations is the sorted series.)

Also, that last sentence is just wrong. It should read

Here gamma depends on the fractional part of np+m, g = np+m-j

As for m that's straightforward. m depends on which of the 9 algorithms was chosen. So just like Q[i] is the quantile function, m should be considered m[i]. For algorithms 1 and 2, m is 0, for 3, m is -1/2, and for the others, that's in the next part.

For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.

This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as pk, which helps. Especially since in the expression for m, the p is the original p, and the m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (pk, x[k]) are interpolated to get the solution (p, Q[i](p)). Each algorithm only differs in the algorithm for the pk.

As for the last bit, R is just stating what S uses.

The original paper gives a list of 6 "desirable properties for a sample quantile" function, and states a preference for #8 which satisfies all by 1. #5 satisfies all of them, but they don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stat geeks like myself would consider the quantiles and is what's described in wikipedia.

BTW, in response to dreeves answer, Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (Here's Mathworld's Quantile page, which states Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)

Solution 2

There are various ways of computing quantiles when you give it a vector, and don't have a known CDF.

Consider the question of what to do when your observations don't fall on quantiles exactly.

The "types" are just determining how to do that. So, the methods say, "use a linear interpolation between the k-th order statistic and p(k)".

So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)" etc. Each of these methods have different properties that are better suited for one problem or another.

The \alpha's and \beta's are just ways to parameterize the functions p. In one case, they're 1 and 1. In another case, they're 3/8 and -1/4. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.

See what happens with the different types when you put in vectors like 1:5 and 1:6.

(also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation).

Solution 3

I believe the R help documentation is clear after the revisions noted in @RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.

To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I anticipated in R so I actually just grabbed it from a github repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that some, but didn't do the same for each option.

You can see how the "type 7" method interpolates, step by step, both in the code and also I added a few lines to print some important values as it goes.

quantile.default <-function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE
         , type = 7, ...){
    if(is.factor(x)) { #worry about non-numeric data
        if(!is.ordered(x) || ! type %in% c(1L, 3L))
            stop("factors are not allowed")
        lx <- levels(x)
    } else lx <- NULL
    if (na.rm){
        x <- x[!is.na(x)]
    } else if (anyNA(x)){
        stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
        }
    eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
    if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
        stop("'probs' outside [0,1]")

    #####################################
    # here is where terms really used in default type==7 situation get defined

    n <- length(x) #how many observations are in sample?

    if(na.p <- any(!p.ok)) { # set aside NA & NaN
        o.pr <- probs
        probs <- probs[p.ok]
        probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
    }

    np <- length(probs) #how many quantiles are you computing?

    if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
        if(type == 7) { # be completely back-compatible

            index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
            lo <- floor(index)  #this is the observed order statistic just below each quantile
            hi <- ceiling(index) #above
            x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort, 
            #and it only guarantees that sorting is "right" at these order statistics, important for large vectors 
            #ties are not broken and tied elements just stay in their original order
            qs <- x[lo] #the values associated with the "floor" order statistics
            i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value

            #this is the difference between the order statistic and the available ranks, i think
            h <- (index - lo)[i] # > 0  by construction 
            ##      qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
            ##      qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
            qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: assemble the estimated quantile by removing h*low and adding back in h*high. 
            # h is the arithmetic difference between the desired order statistic amd the available ranks
            #interpolation only occurs if the desired order statistic is not observed, e.g. .5 quantile is the actual observed median if n is odd. 
            # This means having a more extreme 99th observation doesn't matter when computing the .75 quantile


            ###################################
            # print all of these things

            cat("floor pos=", c(lo))
            cat("\nceiling pos=", c(hi))
            cat("\nfloor values= ", c(x[lo]))
            cat( "\nwhich floors not targets? ", c(i))
            cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
            cat( "\nadjustment values= ", c(h))
            cat("\nquantile estimates:")

    }else if (type <= 3){## Types 1, 2 and 3 are discontinuous sample qs.
                nppm <- if (type == 3){ n * probs - .5 # n * probs + m; m = -0.5
                } else {n * probs} # m = 0

                j <- floor(nppm)
                h <- switch(type,
                            (nppm > j),     # type 1
                            ((nppm > j) + 1)/2, # type 2
                            (nppm != j) | ((j %% 2L) == 1L)) # type 3

                } else{
                ## Types 4 through 9 are continuous sample qs.
                switch(type - 3,
                       {a <- 0; b <- 1},    # type 4
                       a <- b <- 0.5,   # type 5
                       a <- b <- 0,     # type 6
                       a <- b <- 1,     # type 7 (unused here)
                       a <- b <- 1 / 3, # type 8
                       a <- b <- 3 / 8) # type 9
                ## need to watch for rounding errors here
                fuzz <- 4 * .Machine$double.eps
                nppm <- a + probs * (n + 1 - a - b) # n*probs + m
                j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
                h <- nppm - j

                if(any(sml <- abs(h) < fuzz)) h[sml] <- 0

            x <- sort(x, partial =
                          unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
            )
            x <- c(x[1L], x[1L], x, x[n], x[n])
            ## h can be zero or one (types 1 to 3), and infinities matter
            ####        qs <- (1 - h) * x[j + 2] + h * x[j + 3]
            ## also h*x might be invalid ... e.g. Dates and ordered factors
            qs <- x[j+2L]
            qs[h == 1] <- x[j+3L][h == 1]
            other <- (0 < h) & (h < 1)
            if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]

            } 
    } else {
        qs <- rep(NA_real_, np)}

    if(is.character(lx)){
        qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
    if(names && np > 0L) {
        names(qs) <- format_perc(probs)
    }
    if(na.p) { # do this more elegantly (?!)
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    } else qs
}

####################

# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)

#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)

# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=F)
quantile.default(y,probs=probs, names=F) #only difference is .975 quantile since that is driven by highest 2 observations
quantile.default(z,probs=probs, names=F) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but 
#.75 quantile is stil 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.

#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))
Share:
33,495
Sakie
Author by

Sakie

Gregg Lind is a professional programmer living in Minneapolis, Minnesota, USA. I work for Mozilla on Test Pilot. Before that: Renesys, U of MN. Areas of interest: NoSQL, Literate programming, math phobia, gender gap in the hard sciences, information visualization, photography, snooty food, fixed-gear bicycles, Minneapolis, wind-power, dark beer, numbers, math and other kinky topics. likes: long walks on the beach, financially-secure men, flowers, data visualization, robot conspiracies, bicycles built for two. dislikes: eggplants, 2nd ring suburbs. Originally from Massachusetts, he stayed in the midwest after earning an undergrad degree in anthropology and biology from Grinnell. After a few years of shovelbumming, he found himself in Milwaukee, with weak prospects. After conning his way into a statistics job, and realizing that the people asking him for advice knew even less than he did about numbers, he decided to do the normal thing, and actually return to school, earning his M.S. from the U of Minnesota's School of Public Health's Division of Biostatistics in 2005. From there, it's been up, up, up, including a stint at the U of M Epidemiology Department where he worked on statistical genetics and statistical simulation projects.

Updated on July 05, 2022

Comments

  • Sakie
    Sakie almost 2 years

    I've been mystified by the R quantile function all day.

    I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it is confusing to me.

    From the docs:

    Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],

    I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x [j+1], based on some mysterious constant gamma

    where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/ n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.

    So, how calculate j? m?

    For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

    p(k) = (k - alpha) / (n - alpha - beta + 1), where α and β are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.

    Now I'm really lost. p, which was a constant before, is now apparently a function.

    So for Type 7 quantiles, the default...

    Type 7

    p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.

    Anyone want to help me out? In particular I'm confused by the notation of p being a function and a constant, what the heck m is, and now to calculate j for some particular p.

    I hope that based on the answers here, we can submit some revised documentation that better explains what is going on here.

    quantile.R source code or type: quantile.default

  • Sakie
    Sakie over 14 years
    Thank you for actually answering my question :) That was a serious amount of detective work.
  • Anuj Singh
    Anuj Singh over 14 years
    No problem. I'm trying to write a quantile function for Python/Numpy for our group, which lead me to this question. When I eventually found the answer, I figured I'd share.
  • Rob Hyndman
    Rob Hyndman over 14 years
    I wrote the quantile() function and the associated help file and submitted it to the R core team in August 2004 (replacing the previous versions). I've just checked and all of these errors were caused by my help file being changed after I submitted it. (I am responsible for the use of p and p[k] though.) I'd never noticed it as I assumed my file would be left untouched. I'll see if I can get the help file fixed for R 2.10.0.
  • Rob Hyndman
    Rob Hyndman over 14 years
    @AFoglia. I've put a proposed new help file at robjhyndman.com/quantile.html. Comments before I submit to Rcore?
  • Anuj Singh
    Anuj Singh over 14 years
    The new one is much better. I have a minor suggestion to add the definitions of gamma for methods one through three, although that might not be necessary for the statistical-knowledgable R audience. Otherwise, it looks great.
  • Rob Hyndman
    Rob Hyndman over 14 years
    Just to complete this discussion, the new help file is now part of base Rv2.10.0.
  • Sakie
    Sakie about 12 years
    Every time I look at help(quantile) in R, I am pleased with how this turned out!
  • Lampard
    Lampard over 4 years
    The detail explain of quantile.default() source code is very useful, lovely thanks