Map each list value to its corresponding percentile
Solution 1
I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "percent of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance Wikipedia).
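To make the three definitions above concrete, here is a minimal pure-Python sketch applied to the sample list from the question. The helper names (strict_fraction, weak_fraction, excluding_equal_fraction) are made up for this illustration and are not library functions.

```python
def strict_fraction(data, value):
    """Proportion of points strictly less than value."""
    return sum(d < value for d in data) / len(data)

def weak_fraction(data, value):
    """Proportion of points less than or equal to value."""
    return sum(d <= value for d in data) / len(data)

def excluding_equal_fraction(data, value):
    """Points strictly less than value, as a share of the points not equal to it.

    Hypothetical helper for the asker's apparent definition; it divides by
    zero if every point equals value.
    """
    others = sum(d != value for d in data)
    return sum(d < value for d in data) / others

data = [1, 2, 3, 4, 17]
strict = [strict_fraction(data, v) for v in data]
weak = [weak_fraction(data, v) for v in data]
asker = [excluding_equal_fraction(data, v) for v in data]
print(strict)  # [0.0, 0.2, 0.4, 0.6, 0.8]
print(weak)    # [0.2, 0.4, 0.6, 0.8, 1.0]
print(asker)   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Only the third variant reproduces the output requested in the question, which is why it looks unusual next to the first two.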
With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:
>>> from scipy import stats
>>> x = [1, 1, 2, 2, 17]
>>> [stats.percentileofscore(x, a, 'rank') for a in x]
[30.0, 30.0, 70.0, 70.0, 100.0]
>>> [stats.percentileofscore(x, a, 'weak') for a in x]
[40.0, 40.0, 80.0, 80.0, 100.0]
>>> [stats.percentileofscore(x, a, 'strict') for a in x]
[0.0, 0.0, 40.0, 40.0, 80.0]
>>> [stats.percentileofscore(x, a, 'mean') for a in x]
[20.0, 20.0, 60.0, 60.0, 90.0]
(I used a dataset containing ties to illustrate what happens in such cases.)
The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (i.e., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts proportion of points strictly less than the given point. The "mean" method is the average of the latter two.
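The four kinds described above can all be written in terms of two counts: how many points are strictly below a score, and how many are at or below it. The sketch below is my own reading of those descriptions in plain Python, not code taken from SciPy; percentile_kinds is a hypothetical helper, and the "rank" formula assumes the score actually occurs in the data.

```python
def percentile_kinds(data, score):
    """Return the four percentile kinds for a score that occurs in data."""
    n = len(data)
    left = sum(d < score for d in data)    # points strictly below score
    right = sum(d <= score for d in data)  # points at or below score
    return {
        # The tied group covers ranks left+1 .. right; use their average.
        "rank": 100.0 * (left + 1 + right) / 2 / n,
        "weak": 100.0 * right / n,
        "strict": 100.0 * left / n,
        # Average of the weak and strict values.
        "mean": 100.0 * (left + right) / 2 / n,
    }

x = [1, 1, 2, 2, 17]
for kind in ("rank", "weak", "strict", "mean"):
    print(kind, [percentile_kinds(x, a)[kind] for a in x])
# rank [30.0, 30.0, 70.0, 70.0, 100.0]
# weak [40.0, 40.0, 80.0, 80.0, 100.0]
# strict [0.0, 0.0, 40.0, 40.0, 80.0]
# mean [20.0, 20.0, 60.0, 60.0, 90.0]
```

Counting like this costs O(n) per score, so it is meant to show the definitions rather than to be fast.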
As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:
>>> from scipy import stats
>>> stats.rankdata(x, "average")/len(x)
array([ 0.3, 0.3, 0.7, 0.7, 1. ])
>>> stats.rankdata(x, 'max')/len(x)
array([ 0.4, 0.4, 0.8, 0.8, 1. ])
>>> (stats.rankdata(x, 'min')-1)/len(x)
array([ 0. , 0. , 0.4, 0.4, 0.8])
In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)
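For readers without SciPy at hand, the same idea (rank everything once, then divide by the number of points) can be sketched in plain Python. The helper tied_ranks is hypothetical: it only mimics the "min", "max" and "average" tie handling described above and is not SciPy's implementation. The sketch also shows the omitted "mean" variant, computed directly from the min and max ranks.

```python
def tied_ranks(data, method="average"):
    """1-based ranks in which tied values share one rank, using a single sort."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    ranks = [0.0] * len(data)
    start = 0
    while start < len(order):
        # Extend end to cover the run of equal values beginning at start.
        end = start
        while end + 1 < len(order) and data[order[end + 1]] == data[order[start]]:
            end += 1
        low, high = start + 1, end + 1  # ranks covered by this tied group
        shared = {"min": low, "max": high, "average": (low + high) / 2}[method]
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks

x = [1, 1, 2, 2, 17]
n = len(x)
min_ranks = tied_ranks(x, "min")
max_ranks = tied_ranks(x, "max")
weak = [r / n for r in max_ranks]
strict = [(r - 1) / n for r in min_ranks]
# "mean" is the average of weak and strict, written in terms of the ranks.
mean = [(hi + lo - 1) / (2 * n) for hi, lo in zip(max_ranks, min_ranks)]
print(weak)    # [0.4, 0.4, 0.8, 0.8, 1.0]
print(strict)  # [0.0, 0.0, 0.4, 0.4, 0.8]
print(mean)    # [0.2, 0.2, 0.6, 0.6, 0.9]
```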
I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:
In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
1000 loops, best of 3: 414 µs per loop
In [12]: %timeit list_to_percentiles(x)
100000 loops, best of 3: 11.1 µs per loop
In [13]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 39.3 µs per loop
With a large dataset, however, the performance advantage of numpy takes effect and using rankdata is about 10 times faster than Kevin's list_to_percentiles:
In [18]: x = np.random.randint(0, 10000, 1000)
In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
1 loops, best of 3: 437 ms per loop
In [20]: %timeit list_to_percentiles(x)
100 loops, best of 3: 1.08 ms per loop
In [21]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 102 µs per loop
This advantage will only become more pronounced on larger and larger datasets.
Solution 2
I think you want scipy.stats.percentileofscore
Example:
from scipy.stats import percentileofscore
percentileofscore([1, 2, 3, 4], 3)
75.0
percentiles = [percentileofscore(data, i) for i in data]
Solution 3
In terms of complexity, I think reptilicus's answer is not optimal. It takes O(n^2) time.
Here is a solution that takes O(n log n) time.
def list_to_percentiles(numbers):
    # Pair each value with its original index, then sort by value.
    pairs = list(zip(numbers, range(len(numbers))))
    pairs.sort(key=lambda p: p[0])
    result = [0 for i in range(len(numbers))]
    for rank in range(len(numbers)):
        original_index = pairs[rank][1]
        # Scale the 0-based rank so that the largest value maps to 100.
        result[original_index] = rank * 100.0 / (len(numbers)-1)
    return result
I'm not sure, but I think this is the optimal time complexity you can get. The rough reason is that knowing all of the percentiles is essentially equivalent to knowing the sorted list, and you can't do better than O(n log n) for sorting.
EDIT: Depending on your definition of "percentile" this may not always give the correct result. See BrenBarn's answer for more explanation and for a better solution which makes use of scipy/numpy.
Solution 4
Pure numpy version of Kevin's solution
As Kevin said, the optimal solution works in O(n log(n)) time. Here is a fast version of his code in numpy, which takes almost the same time as stats.rankdata:
percentiles = numpy.argsort(numpy.argsort(array)) * 100. / (len(array) - 1)
PS. This is one of my favourite tricks in numpy.
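Why the double argsort yields ranks: the first argsort lists the indices in sorted order, and the argsort of that list tells each original position where it landed, which is its 0-based rank. The sketch below shows this in plain Python so it runs without numpy; the argsort function here is a hypothetical stand-in for numpy.argsort, not the real thing.

```python
def argsort(values):
    """Indices that would sort values (plain-Python stand-in for numpy.argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])

data = [23, 4, 1, 43, 6]
order = argsort(data)   # [2, 1, 4, 0, 3]: index of the smallest, next smallest, ...
ranks = argsort(order)  # [3, 1, 0, 4, 2]: 0-based rank of each original element
percentiles = [r * 100.0 / (len(data) - 1) for r in ranks]
print(percentiles)      # [75.0, 25.0, 0.0, 100.0, 50.0]
```

As with Kevin's function, tied values receive different ranks here, and a single-element input divides by zero.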
Solution 5
I tried SciPy's percentileofscore, but it turned out to be very slow for one of my tasks, so I simply implemented it this way. It can be modified if a weak ranking is needed.
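import numpy as np

def assign_pct(X):
    mp = {}  # maps each distinct value to its 0-based rank among distinct values
    X_tmp = np.sort(X)
    pct = []
    cnt = 0
    for v in X_tmp:
        if v in mp:
            continue
        else:
            mp[v] = cnt
            cnt+=1
    for v in X:
        # cnt now holds the number of distinct values
        pct.append(mp[v]/cnt)
    return pct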
Calling the function
assign_pct([23,4,1,43,1,6])
Output of function
[0.6, 0.2, 0.0, 0.8, 0.0, 0.4]
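One possible way to get the weak ranking mentioned above is sketched below with the standard-library bisect module, rather than by modifying assign_pct directly. The name weak_percentiles is hypothetical. Unlike assign_pct, which ranks distinct values, this counts observations, so repeated values push later percentiles up.

```python
from bisect import bisect_right

def weak_percentiles(values):
    """Fraction of observations less than or equal to each value."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right returns how many sorted entries are <= v.
    return [bisect_right(ordered, v) / n for v in values]

print(weak_percentiles([1, 1, 2, 2, 17]))  # [0.4, 0.4, 0.8, 0.8, 1.0]
```

The sort costs O(n log n) and each lookup O(log n), so the whole list is handled without a quadratic loop.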
Jubbles
Updated on July 12, 2022
Comments
-
Jubbles almost 2 years
I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.
For example, fn([1,2,3,4,17]) returns [0.0, 0.25, 0.50, 0.75, 1.00].
Can anyone please either:
- Help me correct my code below? OR
- Offer a better alternative than my code for mapping values in a list to their corresponding percentiles?
My current code:
def median(mylist):
    length = len(mylist)
    if not length % 2:
        return (mylist[length / 2] + mylist[length / 2 - 1]) / 2.0
    return mylist[length / 2]

###############################################################################
# PERCENTILE FUNCTION
###############################################################################

def percentile(x):
    """
    Find the correspoding percentile of each value relative to a list of values.
    where x is the list of values
    Input list should already be sorted!
    """
    # sort the input list
    # list_sorted = x.sort()

    # count the number of elements in the list
    list_elementCount = len(x)

    #obtain set of values from list
    listFromSetFromList = list(set(x))

    # count the number of unique elements in the list
    list_uniqueElementCount = len(set(x))

    # define extreme quantiles
    percentileZero = min(x)
    percentileHundred = max(x)

    # define median quantile
    mdn = median(x)

    # create empty list to hold percentiles
    x_percentile = [0.00] * list_elementCount

    # initialize unique count
    uCount = 0

    for i in range(list_elementCount):
        if x[i] == percentileZero:
            x_percentile[i] = 0.00
        elif x[i] == percentileHundred:
            x_percentile[i] = 1.00
        elif x[i] == mdn:
            x_percentile[i] = 0.50
        else:
            subList_elementCount = 0
            for j in range(i):
                if x[j] < x[i]:
                    subList_elementCount = subList_elementCount + 1
            x_percentile[i] = float(subList_elementCount / list_elementCount)
            #x_percentile[i] = float(len(x[x > listFromSetFromList[uCount]]) / list_elementCount)
            if i == 0:
                continue
            else:
                if x[i] == x[i-1]:
                    continue
                else:
                    uCount = uCount + 1
    return x_percentile
Currently, if I submit percentile([1,2,3,4,17]), the list [0.0, 0.0, 0.5, 0.0, 1.0] is returned.
-
Jubbles over 11 years
Close, but not quite. If I try percentileList([1,2,3,4,4,5,5]) the list [0.0, 0.17, 0.33, 0.5, 0.67, 0.83, 0.99] is returned, where I'd like [0.0, 0.17, 0.33, 0.50, 0.50, 1.00, 1.00] returned.
-
Jubbles over 11 years
Close, but this has the same problem as Aladdin's first attempt above.
-
Mahmoud Aladdin over 11 years
Well, I'd like to know more about what you want to do. Repeated numbers should have the same percentile, but should their percentile still be affected by how many times they are repeated?
-
Karl Knechtel over 11 years
Specifically, [percentileofscore(original_list, score) for score in original_list].
-
Jubbles over 11 years
Yes, while multiple observations of the same value should all have the same percentile, each observation still adds to the count of observations that are strictly less than observations with greater values. Percentiles are not quite as straightforward as some people initially think.
-
Jubbles over 11 years
@user1443118 and @Karl Knechtel: That does it. Specific to my preferences, [percentileofscore(data, i, 'weak') for i in data] is what I'm looking for. Very Pythonic too.
-
senderle over 11 years
@Jubbles, indeed they are not. I'll admit to being a bit confused by the example you give above. Having the lowest value be 0.0 and the highest value be 100.0 seems inconsistent.
-
Rob Bednark over 11 years
Thanks @Aladdin, I like this solution for my problem. Note that it would be nice to generalize it for empty lists and lists with one element (which results in a ZeroDivisionError exception).
-
Kevin H. Lin over 9 years
I think this solution is O(n^2), which is not optimal.
-
Kevin H. Lin over 9 years
After I posted this answer, someone decided to serially downvote all of my SO posts. Not cool...
-
Jubbles about 9 years
Thanks! You are very right that the answer using list comprehension with scipy.stats.percentileofscore is "not optimal." I timed both approaches with timeit and your function is great.
-
Jubbles about 9 years
The advantages that you illustrate above have been confirmed.
-
Kevin H. Lin over 8 years
Nice. If you look at the implementation of scipy.stats.rankdata (github.com/scipy/scipy/blob/v0.16.0/scipy/stats/…) you'll see that it makes use of argsort(). Their algorithm is essentially the same as mine, and the difference is entirely accounted for by the difference between Python lists and numpy arrays.
-
Robert Yi over 4 years
This is not optimal, as duplicate values get ranked differently, as a result of the sort.
-
Patrick about 3 years
For Python 3, add list around the zip and remove the x from xrange.
-
12944qwerty almost 3 years
Can you unindent the def assign_pct(x) part of the code?
-
Admin almost 3 years
Corrected the indentation.