How can I sort a boxplot in pandas by the median values?

Solution 1

You can use the approach from the answer to "How to sort a boxplot by the median values in pandas", but first you need to group your data and build a new DataFrame with one column per group:

import pandas as pd
import random
import matplotlib.pyplot as plt

n = 100
# this is probably a strange way to generate random data; please feel free to correct it
df = pd.DataFrame({"X": [random.choice(["A","B","C"]) for i in range(n)], 
                   "Y": [random.choice(["a","b","c"]) for i in range(n)],
                   "Z": [random.gauss(0,1) for i in range(n)]})
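# split the rows into one group per (X, Y) combination
grouped = df.groupby(["X", "Y"])

# build a new frame with one column of Z values per group
df2 = pd.DataFrame({col:vals['Z'] for col,vals in grouped})

# order the columns by their median, largest first
meds = df2.median()
meds.sort_values(ascending=False, inplace=True)
df2 = df2[meds.index]
df2.boxplot()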

plt.show()
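To see what the reshaping step does, here is a minimal sketch on a tiny hand-made frame. The values are made up for illustration and are not part of the answer. Each (X, Y) group becomes one column, and the columns are then reordered by their medians:

```python
import pandas as pd

# Tiny hand-made data (illustrative values only), so the medians are easy to check.
df = pd.DataFrame({
    "X": ["A", "A", "A", "B", "B", "B", "A", "A", "A"],
    "Y": ["a", "a", "a", "a", "a", "a", "b", "b", "b"],
    "Z": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 4.0, 5.0, 6.0],
})

# One column per (X, Y) group. Rows keep their original index, so cells that
# do not belong to a group are NaN; median() skips them by default.
df2 = pd.DataFrame({key: vals["Z"] for key, vals in df.groupby(["X", "Y"])})

# Median of each column, largest first.
meds = df2.median().sort_values(ascending=False)

# Reorder the columns to match; df2.boxplot() would draw the boxes in this order.
df2 = df2[meds.index]
ordered_groups = list(df2.columns)
```

Here the group medians are 2 for (A, a), 5 for (A, b) and 20 for (B, a), so the columns end up in the order (B, a), (A, b), (A, a).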

Solution 2

This is similar to Alvaro Fuentes' answer, wrapped in a function so it is easier to reuse:

import pandas as pd

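def boxplot_sorted(df, by, column):
    # one column of `column` values per group
    df2 = pd.DataFrame({col:vals[column] for col, vals in df.groupby(by)})
    # sort the groups by median (ascending; pass ascending=False for descending)
    meds = df2.median().sort_values()
    df2[meds.index].boxplot(rot=90)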

boxplot_sorted(df, by=["X", "Y"], column="Z")
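Note that the function above sorts in ascending order (the default of `sort_values`), while the question asks for descending. One possible variant is sketched below. The helper name `sorted_groups` and the `ascending` and `rot` parameters are my own additions for illustration, not part of the original answer:

```python
import pandas as pd

def sorted_groups(df, by, column, ascending=False):
    """Return a frame with one column per group, ordered by the group medians."""
    # One column of `column` values per group (NaN where a row is not in the group).
    df2 = pd.DataFrame({key: vals[column] for key, vals in df.groupby(by)})
    # Median per column, sorted in the requested direction.
    meds = df2.median().sort_values(ascending=ascending)
    return df2[meds.index]

def boxplot_sorted(df, by, column, ascending=False, rot=90):
    """Draw one box per group, ordered by median, and return what boxplot returns."""
    return sorted_groups(df, by, column, ascending=ascending).boxplot(rot=rot)
```

With the data from the question, `boxplot_sorted(df, by=["X", "Y"], column="Z")` should then draw the boxes from the largest median to the smallest.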

Solution 3

To answer the question in the title, without addressing the extra detail of plotting all combinations of two categorical variables:

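import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

n = 100
df = pd.DataFrame({"Category": [np.random.choice(["A","B","C","D"]) for i in range(n)],
                   "Variable": [np.random.normal(0, 10) for i in range(n)]})

# median of Variable per Category, sorted ascending
grouped = df.loc[:,['Category', 'Variable']] \
    .groupby(['Category']) \
    .median() \
    .sort_values(by='Variable')

# pass the sorted category labels as the box order
sns.boxplot(x=df.Category, y=df.Variable, order=grouped.index)
plt.show()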

I've added this solution because it is hard to reduce the accepted answer to a single variable, and I'm sure people are looking for a way to do that. I myself came to this question multiple times looking for such an answer.
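If you want to use the same seaborn approach for combinations of two categorical variables, as in the original question, one option is to join them into a single label column first. This is only a sketch: the helper names `add_combined_label` and `median_order`, the `Group` column name and the `-` separator are all hypothetical choices made here.

```python
import pandas as pd

def add_combined_label(df, cols, sep="-", name="Group"):
    """Return a copy of df with a new column joining the values of `cols`."""
    out = df.copy()
    # Convert each value to str and join row by row, e.g. "A" and "a" -> "A-a".
    out[name] = out[cols].astype(str).agg(sep.join, axis=1)
    return out

def median_order(df, label, value, ascending=False):
    """Return the labels in `label`, sorted by the median of `value`."""
    meds = df.groupby(label)[value].median()
    return meds.sort_values(ascending=ascending).index
```

Assuming the X/Y/Z frame from the question, `labelled = add_combined_label(df, ["X", "Y"])` followed by `sns.boxplot(data=labelled, x="Group", y="Z", order=median_order(labelled, "Group", "Z"))` should give one box per combination, sorted by median in descending order.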

Author by

Fred S

Updated on July 30, 2021

Comments

  • Fred S
    Fred S almost 3 years

    I want to draw a boxplot of column Z in dataframe df by the categories X and Y. How can I sort the boxplot by the median, in descending order?

    import pandas as pd
    import random
    n = 100
    # this is probably a strange way to generate random data; please feel free to correct it
    df = pd.DataFrame({"X": [random.choice(["A","B","C"]) for i in range(n)], 
                       "Y": [random.choice(["a","b","c"]) for i in range(n)],
                       "Z": [random.gauss(0,1) for i in range(n)]})
    df.boxplot(column="Z", by=["X", "Y"])
    

    Note that this question is very similar, but they use a different data structure. I'm relatively new to pandas (and have only done some tutorials on python in general), so I couldn't figure out how to make my data work with the answer posted there. This may well be more of a reshaping than a plotting question. Maybe there is a solution using groupby?

  • Stephen McAteer
    Stephen McAteer about 7 years
    I had to change: meds.sort(ascending=False) to meds.sort_values(ascending=False, inplace=True) to make this work (Pandas 0.20.1, Python 3.6.1, Windows 8).
  • Alvaro Fuentes
    Alvaro Fuentes about 7 years
    @StephenMcAteer Thanks for the tip. I'm not using the latest versions of Pandas so please feel free to edit the answer and add your version of the answer for future users.
  • rococo
    rococo over 5 years
    Is there any way to have a backup sort for when medians are the same? For example, if two medians are the same then sort by one of the quartiles.
  • Christian Karcher
    Christian Karcher almost 4 years
    There are a few inconsistencies with your minimal example (a missing ' after the first 'Category, and switching from "X" and "Z" in the declaration to "Category" and "Variable" during grouping and plotting). But the overall idea behind it was useful for my seaborn-powered application.
  • rocksNwaves
    rocksNwaves almost 4 years
    @ChristianKarcher Thanks for pointing those things out. That's what I get for not copying and pasting.
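On the question above about a backup sort for tied medians: one possible approach, which is not something proposed in the thread, is to compute both the median and a quartile per group and sort on both columns. The function name and the `median` and `q3` column labels below are hypothetical names introduced for this sketch:

```python
import pandas as pd

def order_by_median_then_q3(df, by, column, ascending=False):
    """Return group labels sorted by median, using the 75th percentile to break ties."""
    grouped = df.groupby(by)[column]
    stats = pd.DataFrame({
        "median": grouped.median(),
        "q3": grouped.quantile(0.75),  # upper quartile, default linear interpolation
    })
    # Sort on the median first; q3 only matters when two medians are equal.
    return stats.sort_values(["median", "q3"], ascending=ascending).index
```

For a single grouping column, the returned index can be passed as `order=` to `sns.boxplot`, as in Solution 3.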