Naturally sorting Pandas DataFrame
Solution 1
If you want to sort the df, just sort the index or the data and assign directly to the index of the df rather than trying to pass the df as an arg as that yields an empty list:
In [7]:
df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')
Note that df.index = natsorted(df.index)
also works
if you pass the df as an arg it yields an empty list, in this case because the df is empty (has no columns), otherwise it will return the columns sorted which is not what you want:
In [10]:
natsorted(df)
Out[10]:
[]
EDIT
If you want to sort the index so that the data is reordered along with the index then use reindex
:
In [13]:
df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
0
0hr 0
128hr 1
72hr 2
48hr 3
96hr 4
In [14]:
df = df*2
df
Out[14]:
0
0hr 0
128hr 2
72hr 4
48hr 6
96hr 8
In [15]:
df.reindex(index=natsorted(df.index))
Out[15]:
0
0hr 0
48hr 6
72hr 4
96hr 8
128hr 2
Note that you have to assign the result of reindex
to either a new df or to itself, it does not accept the inplace
param.
Solution 2
Now that pandas
has support for key
in both sort_values
and sort_index
you should now refer to this other answer and send all upvotes there as it is now the correct answer.
I will leave my answer here for people stuck on old pandas
versions, or as a historical curiosity.
The accepted answer answers the question being asked. I'd like to also add how to use natsort
on columns in a DataFrame
, since that will be the next question asked.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted, index_natsorted, order_by_index
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df
Out[4]:
a b
0hr a5 b1
128hr a1 b1
72hr a10 b2
48hr a2 b2
96hr a12 b1
As the accepted answer shows, sorting by the index is fairly straightforward:
In [5]: df.reindex(index=natsorted(df.index))
Out[5]:
a b
0hr a5 b1
48hr a2 b2
72hr a10 b2
96hr a12 b1
128hr a1 b1
If you want to sort on a column in the same manner, you need to sort the index by the order that the desired column was reordered. natsort
provides the convenience functions index_natsorted
and order_by_index
to do just that.
In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip
(or itertools.izip
on Python2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc...
In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]:
a b
0hr a5 b1
96hr a12 b1
128hr a1 b1
48hr a2 b2
72hr a10 b2
Here is an alternate method using Categorical
objects that I have been told by the pandas
devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex
which will allow this method to be used on an index.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df.a = df.a.astype('category')
In [5]: df.a.cat.reorder_categories(natsorted(df.a), inplace=True, ordered=True)
In [6]: df.b = df.b.astype('category')
In [8]: df.b.cat.reorder_categories(natsorted(set(df.b)), inplace=True, ordered=True)
In [9]: df.sort('a')
Out[9]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [10]: df.sort('b')
Out[10]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
In [11]: df.sort(['b', 'a'])
Out[11]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
The Categorical
object lets you define a sorting order for the DataFrame
to use. The elements given when calling reorder_categories
must be unique, hence the call to set
for column "b".
I leave it to the user to decide if this is better than the reindex
method or not, since it requires you to sort the column data independently before sorting within the DataFrame
(although I imagine that second sort is rather efficient).
Full disclosure, I am the natsort
author.
Solution 3
Using sort_values
for pandas >= 1.1.0
With the new key
argument in DataFrame.sort_values
, since pandas 1.1.0
, we can directly sort a column without setting it as an index using natsort.natsort_keygen
:
df = pd.DataFrame({
"time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
"value": [10, 20, 30, 40, 50]
})
time value
0 0hr 10
1 128hr 20
2 72hr 30
3 48hr 40
4 96hr 50
from natsort import natsort_keygen
df.sort_values(
by="time",
key=natsort_keygen()
)
time value
0 0hr 10
3 48hr 40
2 72hr 30
4 96hr 50
1 128hr 20
Related videos on Youtube
Comments
-
agf1997 almost 2 years
I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can resort the indices naturally?
from natsort import natsorted import pandas as pd # An unsorted list of strings a = ['0hr', '128hr', '72hr', '48hr', '96hr'] # Sorted incorrectly b = sorted(a) # Naturally Sorted c = natsorted(a) # Use a as the index for a DataFrame df = pd.DataFrame(index=a) # Sorted Incorrectly df2 = df.sort() # Natsort doesn't seem to work df3 = natsorted(df) print(a) print(b) print(c) print(df.index) print(df2.index) print(df3.index)
-
agf1997 about 9 years@sethMMorton I supposed I would expect
df3.index
to be the same asc
while sorting the data to keep it inline with its index values -
SethMMorton about 9 yearsIt would be nice if
pd.sort
had akey
option, but it does not. This answer provides a workaround that would let you pass a key generated fromnatsort_keygen
. -
SethMMorton about 9 yearsI just made an official request to the
pandas
devs to addkey
to thesort
methods here: github.com/pydata/pandas/issues/9855 -
SethMMorton about 9 yearsMy above issue was a dupe, the active issue is github.com/pydata/pandas/issues/3942
-
SethMMorton over 3 yearsNow that
pandas
has akey
argument tosort_values
, stackoverflow.com/a/63890954/1399279 should now be the accepted answer.
-
-
SethMMorton about 9 yearsHi,
natsort
developer here.natsort
currently does not have any explicit support for handling entire dataframe objects. What would your expected output be for passing a dataframe object? -
agf1997 about 9 yearsI believe this misses the point. I realize I can naturally sort the a and used that as the index, but my actual code messes up the sorting of the data frame index because of the manipulations I perform on the dataframe. I need to resort the index and associated data while it is in the dataframe.
-
EdChum about 9 yearsSo what is it you're asking here, you want to natsort the index after data manipulations? You can use
reindex
and callnatsorted
on the indexdf.reindex(index=natsorted(df.index))
-
SethMMorton about 9 years@EdChum Yes, it sounds exactly like that is what they want. I think ultimately that is the correct answer.
-
agf1997 about 9 years@EdChum This appears to work but only if I put it in a few dataframe
df2 = df.reindex(index=natsorted(df.index))
performing the operation in place as you show doesn't seem to work. -
SethMMorton about 9 years@agf1997 Use
inplace=True
-
SethMMorton about 9 years@EdChum BTW, on my machine
natsorted(df)
returns the columns of the dataframe sorted, which is the same behavior ofsorted(df)
. -
EdChum about 9 years@SethMMorton sorry
reindex
is one of the few functions that does not accept paraminplace
, so yes you have to assign it to itself -
SethMMorton about 9 years@EdChum That seems like an oversight... sort of like omitting
key
fromsort
! -
EdChum about 9 years@SethMMorton ah that output was based on an empty dataframe, the iterable returned from a df is the columns and yes I get the same result when my df has columns
-
EdChum about 9 years@SethMMorton I'm not a pandas dev so I don't know why it's not in there
-
agf1997 about 9 years@EdChum it's not
sorted(df)
I'm trying to replicate but df.sort(). I believe your reindex trick is a good work around to no key value available fordf.sort()
-
EdChum about 9 years@agf1997 I think Seth was interested as to why I got an empty list, I was just confirming that the behaviour is consistent with
sorted(df)
so unrelated to your question -
SethMMorton over 3 yearsThis proposed solution is a bit of a "maximum effort" solution - would not
key=natsort_keygen()
be less effort? -
Erfan almost 3 yearsAgree, updated my answer accordingly. Thanks for the heads up and beautiful package you wrote :) @SethMMorton
-
Agostino over 2 yearsIf I try to sort 2 columns of different type, like this
df.sort_values(['Title', 'Copies'], ascending=[False, True], key=natsort_keygen())
, I get this errorValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
. I got the dataframe frompd.read_csv
, providing the column names and types. Any idea how to fix it?