Python Pandas - how is the 25th percentile calculated by the describe function


Solution 1

The pandas documentation for quantile describes the computation by reference to numpy.percentile:

Return value at the given quantile, a la numpy.percentile.

The numpy.percentile documentation, in turn, shows that the interpolation method defaults to linear:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j

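For your specific case, the sorted data has n = 4 values, so the 25th percentile sits at the fractional index (n - 1) * 0.25 = 0.75. That lies between i = 4 (index 0) and j = 6 (index 1), with fraction = 3/4:

res_25 = 4 + (6-4)*(3/4) = 5.5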

For the 75th percentile, the fractional index is (4 - 1) * 0.75 = 2.25, so i = 8, j = 10 and fraction = 1/4:

res_75 = 8 + (10-8)*(1/4) = 8.5
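The two calculations above can be reproduced with a few lines of plain Python. This is only a sketch of the linear rule as quoted from the NumPy documentation, not NumPy's actual implementation; the helper name linear_percentile is made up for illustration. As a cross-check it uses statistics.quantiles from the standard library, whose "inclusive" method interpolates over the same (n - 1) index range.

```python
import math
import statistics

def linear_percentile(values, q):
    """Percentile q (0-100) by linear interpolation between neighbours.

    Hypothetical helper for illustration; it mirrors the quoted rule
    i + (j - i) * fraction, not NumPy's internal code.
    """
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0      # fractional index into sorted data
    lower = math.floor(pos)                # index of i
    upper = min(lower + 1, len(data) - 1)  # index of j, clamped at the end
    fraction = pos - lower                 # fractional part of the index
    return data[lower] + (data[upper] - data[lower]) * fraction

data = [4, 6, 8, 10]
print(linear_percentile(data, 25))  # 5.5
print(linear_percentile(data, 75))  # 8.5

# Standard-library cross-check: quartiles with the "inclusive" method.
print(statistics.quantiles(data, n=4, method="inclusive"))  # [5.5, 7.0, 8.5]
```

For the data in the question, both approaches agree with the 25%, 50% and 75% rows printed by describe.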

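If you set the interpolation method to "midpoint", the 25th percentile becomes (4 + 6) / 2 = 5, which matches the averaging rule from your second attempt. Note that describe itself does not take an interpolation argument; call quantile with interpolation="midpoint" instead.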
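To see the effect of the interpolation setting, the following sketch calls Series.quantile on the data from the question, once with the default and once with "midpoint". The variable names are illustrative, and the printed values assume a reasonably recent pandas.

```python
import pandas as pd

s = pd.Series([4, 6, 8, 10])

# Default behaviour: interpolation="linear", the same values describe() reports.
linear = s.quantile([0.25, 0.75])

# Midpoint: the average of the two neighbouring values.
midpoint = s.quantile([0.25, 0.75], interpolation="midpoint")

print(linear.tolist())    # [5.5, 8.5]
print(midpoint.tolist())  # [5.0, 9.0]
```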


Solution 2

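For this dataset I think it's easier to see the calculation as min + (max - min) * percentile. Because the values 4, 6, 8, 10 are evenly spaced, it gives the same result as this function described in NumPy:

linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j

res_25 = 4 + (10 - 4) * 0.25 = 5.5
res_75 = 4 + (10 - 4) * 0.75 = 8.5

This shortcut only holds when the sorted values are evenly spaced. For other data, interpolate between the two neighbouring values i and j instead of between min and max.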
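A quick way to check when the min/max shortcut agrees with linear interpolation is to run both on an evenly spaced and an unevenly spaced dataset. The sketch below uses plain Python; the helper names linear_percentile and range_shortcut are made up for illustration, and the second dataset is the one raised in the comments.

```python
import math

def linear_percentile(values, q):
    """Linear interpolation between the two neighbouring sorted values."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lower = math.floor(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)

def range_shortcut(values, q):
    """min + (max - min) * percentile, the shortcut from this answer."""
    return min(values) + (max(values) - min(values)) * q / 100.0

even = [4, 6, 8, 10]                      # evenly spaced: both rules agree
uneven = [2.3, 2.7, 3.5, 3.6, 4.2, 4.5]   # unevenly spaced: they differ

for name, data in (("even", even), ("uneven", uneven)):
    for q in (25, 75):
        print(name, q, linear_percentile(data, q), range_shortcut(data, q))
```

On the uneven data the linear rule gives roughly 2.9 and 4.05, while the shortcut gives roughly 2.85 and 3.95, so the two are not interchangeable in general.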
Author by

Gublooo

Updated on June 09, 2022

Comments

  • Gublooo
    Gublooo almost 2 years

    For a given dataset in a data frame, when I apply the describe function, I get the basic stats which include min, max, 25%, 50% etc.

    For example:

    import pandas as pd
    data_1 = pd.DataFrame({'One': [4, 6, 8, 10]}, columns=['One'])
    data_1.describe()
    

    The output is:

            One
    count   4.000000
    mean    7.000000
    std     2.581989
    min     4.000000
    25%     5.500000
    50%     7.000000
    75%     8.500000
    max     10.000000
    

    My question is: What is the mathematical formula to calculate the 25%?

    1) Based on what I know, it is:

    formula = percentile * n (n is number of values)
    

    In this case:

    25/100 * 4 = 1
    

    So the first position is number 4 but according to the describe function it is 5.5.

    2) Another example says - if you get a whole number then take the average of 4 and 6 - which would be 5 - still does not match 5.5 given by describe.

    3) Another tutorial says - you take the difference between the 2 numbers - multiply by 25% and add to the lower number:

    25/100 * (6-4) = 1/4*2 = 0.5
    

    Adding that to the lower number: 4 + 0.5 = 4.5

    Still not getting 5.5.

    Can someone please clarify?

  • M. K. Hunter
    M. K. Hunter about 6 years
    But why would you use 10 instead of 6?
  • Amar Prakash Pandey
    Amar Prakash Pandey over 5 years
    As Orli said, it's min+(max-min)*percentile. So, it's 4+(10-4)*percentile
  • pgalilea
    pgalilea about 3 years
    I tried to understand, but the results with this formula and the method percentile are different:

    my_data = [2.3, 2.7, 3.5, 3.6, 4.2, 4.5]
    print(np.percentile(my_data, 25))
    print(np.percentile(my_data, 75))