Fastest way to find *the index* of the second (third...) highest/lowest value in vector or column

23,259

Solution 1

library Rfast has implemented the nth element function with return index option.

UPDATE (28/FEB/21) package kit offers a faster implementation (topn) as shown in the simulations below.

x <- runif(1e+6)

n <- 2

which_nth_highest_richie <- function(x, n)
{
  for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}

which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
} 

microbenchmark::microbenchmark(
        topn = kit::topn(x, n,decreasing = T)[n],
        Rfast = Rfast::nth(x,n,descending = T,index.return = T),
        order = order(x, decreasing = TRUE)[n],
        richie = which_nth_highest_richie(x,n),
        joris = which_nth_highest_joris(x,n))

Unit: milliseconds
          expr    min     lq      mean     median       uq      max   neval 
          topn  3.741101 3.7917  4.517201  4.060752  5.108901 7.403901 100
         Rfast  15.8121 16.7586  20.64204  17.73010  20.7083  47.6832  100
         order 110.5416 113.4774 120.45807 116.84005 121.2291 164.5618 100
        richie  22.7846 24.1552  39.35303  27.10075  42.0132  179.289  100
         joris 131.7838 140.4611 158.20704 156.61610 165.1735 243.9258 100

Topn is the clear winner in finding the index of the 2nd biggest value in 1 million numbers.

Futher, simulations where run to estimate running times of finding the nth biggest number for varying n. Variable x was repopulated for each n but it's size was always 1 million numbers.

Running time of finding the index of the nth biggest element between 1 million numbers.

As shown topn is the best option for finding the nth biggest element and it's index, given that n is not too big. In the plot we can observe that topn becomes slower than Rfast's nth for bigger n. It is worthy to note that topn has not been implemented for n > 1000 and will throw an error in such cases.

Solution 2

One possible route is to use the index.return argument to sort. I'm not sure if this is fastest though.

set.seed(21)
x <- rnorm(10)
ind <- 2
sapply(sort(x, index.return=TRUE), `[`, length(x)-ind+1)
#        x       ix 
# 1.746222 3.000000

Solution 3

EDIT 2 :

As Joshua pointed out, none of the given solutions actually performs correct when you have a tie on the maxima, so :

X <- c(11:19,19)

n <- length(unique(X))
which(X == sort(unique(X),partial=n-1)[n-1])

fastest way of doing it correctly then. I deleted the order way, as that one doesn't work and is a lot slower, so not a good answer according to OP.

To point to the issue we ran into :

> X <- c(11:19,19)    
> n <- length(X)
> which(X == sort(X,partial=n-1)[n-1])
[1]  9 10 #which is the indices of the double maximum 19

> n <- length(unique(X))
> which(X == sort(unique(X),partial=n-1)[n-1])
[1] 8 # which is the correct index of 18

The timings of the valid solutions :

> x <- runif(1000000)

> ind <- 2

> n <- length(unique(x))

> system.time(which(x == sort(unique(x),partial=n-ind+1)[n-ind+1]))
   user  system elapsed 
   0.11    0.00    0.11 

> system.time(sapply(sort(unique(x), index.return=TRUE), `[`, n-ind+1))
   user  system elapsed 
   0.69    0.00    0.69 

Solution 4

Method: Set all max values to -Inf, then find the indices of the max. No sorting required.

X <- runif(1e7)
system.time(
{
  X[X == max(X)] <- -Inf
  which(X == max(X))
})

Works with ties and is very fast.

If you can guarantee no ties, then an even faster version is

system.time(
{
  X[which.max(X)] <- -Inf
  which.max(X)
})

EDIT: As Joris mentioned, this method doesn't scale that well for finding third, fourth, etc., highest values.

which_nth_highest_richie <- function(x, n)
{
  for(i in seq_len(n - 1L)) x[x == max(x)] <- -Inf
  which(x == max(x))
}

which_nth_highest_joris <- function(x, n)
{
  ux <- unique(x)
  nux <- length(ux)
  which(x == sort(ux, partial = nux - n + 1)[nux - n + 1])
}

Using x <- runif(1e7) and n = 2, Richie wins

system.time(which_nth_highest_richie(x, 2))   #about half a second
system.time(which_nth_highest_joris(x, 2))    #about 2 seconds

For n = 100, Joris wins

system.time(which_nth_highest_richie(x, 100)) #about 20 seconds, ouch! 
system.time(which_nth_highest_joris(x, 100))  #still about 2 seconds

The balance point, where they take the same length of time, is about n = 10.

Solution 5

No ties which() is probably your friend here. Combine the output from the sort() solution with which() to find the index that matches the output from the sort() step.

> set.seed(1)
> x <- sample(1000, 250)
> sort(x,partial=n-1)[n-1]
[1] 992
> which(x == sort(x,partial=n-1)[n-1])
[1] 145

Ties handling The solution above doesn't work properly (and wasn't intended to) if there are ties and the ties are the values that are the ith largest or larger values. We need to take the unique values of the vector before sorting those values and then the above solution works:

> set.seed(1)
> x <- sample(1000, 1000, replace = TRUE)
> length(unique(x))
[1] 639
> n <- length(x)
> i <- which(x == sort(x,partial=n-1)[n-1])
> sum(x > x[i])
[1] 0
> x.uni <- unique(x)
> n.uni <- length(x.uni)
> i <- which(x == sort(x.uni, partial = n.uni-1)[n.uni-1])
> sum(x > x[i])
[1] 2
> tail(sort(x))
[1]  994  996  997  997 1000 1000

order() is also very useful here:

> head(ord <- order(x, decreasing = TRUE))
[1] 220 145 209 202 211 163

So the solution here is ord[2] for the index of the 2nd highest/largest element of x.

Some timings:

> set.seed(1)
> X <- sample(1e7, 1e7)
> system.time({n <- length(X); which(X == sort(X, partial = n-1)[n-1])})
   user  system elapsed 
  0.319   0.058   0.378 
> system.time({ord <- order(X, decreasing = TRUE); ord[2]})
   user  system elapsed 
 14.578   0.084  14.708 
> system.time({order(X, decreasing = TRUE)[2]})
   user  system elapsed 
 14.647   0.084  14.779

But as the linked post was getting at and the timings above show, order() is much slower, but both provide the same results:

> all.equal(which(X == sort(X, partial = n-1)[n-1]), 
+           order(X, decreasing = TRUE)[2])
[1] TRUE

And for the ties-handling version:

foo <- function(x, i) {
    X <- unique(x)
    N <- length(X)
    i <- i-1
    which(x == sort(X, partial = N-i)[N-i])
}

> system.time(foo(X, 2))
   user  system elapsed 
  1.249   0.176   1.454

So the extra steps slow this solution down a bit, but it is still very competitive with order().

Share:
23,259

Related videos on Youtube

user189035
Author by

user189035

Updated on July 05, 2022

Comments

  • user189035
    user189035 almost 2 years

    Fastest way to find the index of the second (third...) highest/lowest value in vector or column ?

    i.e. what

    sort(x,partial=n-1)[n-1]
    

    is to

    max()
    

    but for

    which.max()
    

    Best,

    Fastest way to find second (third...) highest/lowest value in vector or column

  • Joris Meys
    Joris Meys about 13 years
    why not just use order if you only need the indices?
  • Joshua Ulrich
    Joshua Ulrich about 13 years
    @Joris: You're certainly correct if you only need the index. I guess I assumed they needed both the value and the index.
  • Joshua Ulrich
    Joshua Ulrich about 13 years
    Speaking of ties, you may get an inaccurate answer if there are ties before the index you want. If there are 2 elements with the max value and you want the second highest value, none of the currently proposed solutions will give you that. They will all return the second highest sorted value, which is tied with the max value. You would need to unique your vector first.
  • Gavin Simpson
    Gavin Simpson about 13 years
    Are you sure this works? I'm might be loosing the plot, but consider my example data. Your solution doesn't give the right answer. Remember order() gives the permutation of the vector to get the vector in increasing/decreasing order. It works here because x is already sorted but if it isn't. order(x, decreasing = TRUE)[i] should be sufficient for the ith largest. Or am I so wrong it is embarrassing?
  • Joris Meys
    Joris Meys about 13 years
    and @Joshua : also very right. Goes for all solutions actually, and I showed it in my solution as well...
  • Gavin Simpson
    Gavin Simpson about 13 years
    @Joris, note that the original Q was also concerned with doing this fast. For large vectors, sorting is slow and order needs to sort the whole vector to work. This is the reason for the OP linking to the previous Q.
  • Gavin Simpson
    Gavin Simpson about 13 years
    To be fair, the OP never mentioned the ties handling, but you are quite - I knew when posting the solution it would fail with ties and was planning to show a ties-handling version although it does slow things down a bit.
  • Joris Meys
    Joris Meys about 13 years
    +1 Clearly the fastest for the second-highest value. Yet, it's not really extendible to the third, fourth,... highest value.
  • Suresh_Patel
    Suresh_Patel about 3 years
    could you please update your post to take into account kit::topn. I gave you the solution in the link provided by the OP. Rfast is NOT the fastest solution.
  • Suresh_Patel
    Suresh_Patel about 3 years
    Thanks. I think it would be interesting to add topn with hasna=FALSE. I tested Rfast::nth and it does not handle NA very well or at least the way order and topn do.

Related