Pandas every nth row
Solution 1
I'd use iloc, which takes a row/column slice, both based on integer position and following normal Python slice syntax. If you want every 5th row:
df.iloc[::5, :]
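The same slice syntax also handles a start offset or counting from the end. A minimal sketch, assuming a hypothetical 12-row frame with a single column n (the frame and variable names are illustrative, not from the original answer):

```python
import pandas as pd

# Hypothetical 12-row frame used only for illustration.
df = pd.DataFrame({"n": range(12)})

every_5th = df.iloc[::5]        # positions 0, 5, 10
from_second = df.iloc[1::5]     # positions 1, 6, 11 (skips the 0th row)
fifth_tenth = df.iloc[4::5]     # positions 4, 9 (the 5th, 10th, ... rows)
from_the_back = df.iloc[::-5]   # positions 11, 6, 1, in reverse order
```

Because iloc works on positions, none of this depends on what the index labels are.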
Solution 2
Though @chrisb's accepted answer does answer the question, I would like to add to it the following.
A simple method I use to get every nth row or drop every nth row is the following:
df1 = df[df.index % 3 != 0] # Excludes every 3rd row starting from 0
df2 = df[df.index % 3 == 0] # Selects every 3rd row starting from 0
This arithmetic-based sampling can also express more complex row selections.
This assumes, of course, that you have an index of ordered, consecutive integers starting at 0.
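As a rough illustration of a more complex selection, the sketch below keeps several rows out of each block. It assumes a hypothetical frame with the default 0..n-1 index; np.isin is a swapped-in helper, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the default 0..n-1 index, for illustration only.
df = pd.DataFrame({"n": range(10)})

# Keep the first two rows out of every block of five.
first_two_of_five = df[np.isin(df.index % 5, [0, 1])]

# Keep every 3rd row, starting from the row labelled 1.
offset_thirds = df[df.index % 3 == 1]
```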
Solution 3
There is an even simpler alternative to the accepted answer: slice the DataFrame directly, which invokes df.__getitem__.
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For example, to get every 2nd row, you can do
df[::2]
a b c
0 x x x
2 x x x
4 x x x
There's also GroupBy.first/GroupBy.head, where you group on the index:
df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')
df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)
a b c
0 x x x
1 x x x
2 x x x
The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do
# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()
a b c
0 x x x
1 x x x
2 x x x
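One difference between the two is worth checking on your own data: GroupBy.first returns the first non-null value of each column in a group, while head(1) returns the first row as it is. A minimal sketch, assuming a hypothetical frame with a missing value in its first row:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in the first row.
df = pd.DataFrame({"a": [np.nan, 1.0, 2.0, 3.0]})
key = np.arange(len(df)) // 2          # group labels 0, 0, 1, 1

via_first = df.groupby(key).first()    # first non-null value per group
via_head = df.groupby(key).head(1)     # first row per group, NaN included
```

Here via_first skips the NaN in group 0 and is indexed by the group labels, whereas via_head keeps rows 0 and 2 with their original index.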
Solution 4
Adding reset_index() to metastableB's answer means you only need to assume that the rows are ordered and consecutive.
df1 = df[df.reset_index().index % 3 != 0] # Excludes every 3rd row starting from 0
df2 = df[df.reset_index().index % 3 == 0] # Selects every 3rd row starting from 0
df.reset_index().index will create an index that starts at 0 and increments by 1, allowing you to use the modulo easily.
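An equivalent way to get a positional counter without building a second frame is np.arange(len(df)). This is a sketch of that alternative, assuming a hypothetical frame whose index is neither zero-based nor consecutive:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is neither consecutive nor zero-based.
df = pd.DataFrame({"n": range(6)}, index=[10, 20, 30, 40, 50, 60])

position = np.arange(len(df))   # 0, 1, 2, ... regardless of the index
df1 = df[position % 3 != 0]     # drops the rows at positions 0 and 3
df2 = df[position % 3 == 0]     # keeps the rows at positions 0 and 3
```

The original index labels are preserved on the result, so the same mask can be rebuilt and applied again after filtering.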
Solution 5
I had a similar requirement, but I wanted the n'th item in a particular group. This is how I solved it.
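groups = data.groupby(['group_key'], group_keys=False) # group_keys=False keeps the original row index on the result
selection = groups['index_col'].apply(lambda x: x % 3 == 0) # Boolean mask; assumes index_col counts rows within each group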
subset = data[selection]
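If the data has no ready-made per-group counter like index_col, one can be derived on the fly. A minimal sketch with hypothetical data, using GroupBy.cumcount (a swapped-in technique, not this answer's own code):

```python
import pandas as pd

# Hypothetical data; 'group_key' matches the column name used above.
data = pd.DataFrame({
    "group_key": ["a", "a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

# cumcount() numbers the rows within each group: 0, 1, 2, ...
position_in_group = data.groupby("group_key").cumcount()

# Keep every 3rd row of each group, starting from the group's first row.
subset = data[position_in_group % 3 == 0]
```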
mikael
Updated on June 25, 2021
Comments
-
mikael almost 3 years
DataFrame.resample() works only with timeseries data. I cannot find a way of getting every nth row from non-timeseries data. What is the best method?
-
Little Bobby Tables over 7 years
For those who might want, for example, every fifth row but starting at the 2nd row, it would be df.iloc[1::5, :].
-
Constantine almost 6 years
This is not a good answer because it makes three assumptions, which are frequently not met: (1) the index is numeric, (2) the index starts at zero, (3) the index values are consecutive. The last one is especially important, since you can't use your suggested method more than once without resetting the index.
-
metastableB almost 6 years
I take your point. Will edit the answer to make the assumptions more explicit.
-
joctee over 5 years
You can omit the column part: df.iloc[::5]
-
Readler almost 5 years
@Constantine Still, wouldn't that be faster than the other solution, since you can simply add an index?
-
FabioSpaghetti over 4 years
@chrisb How do I specify the starting row? For example, every 5th row, starting from the second row?
-
ppwater over 3 years
While this code may answer the question, including an explanation of how or why this solves the problem would really help to improve the quality of your post. Remember that you are answering the question for readers in the future, not just the person asking now. Please edit your answer to add explanations and give an indication of what limitations and assumptions apply.
-
JohnAndrews about 3 years
How do you include it from the back?
-
Raksha almost 3 years
How do you make it not include the 0th row?
-
topher217 almost 3 years
What is this slicing syntax called and where can I read more about it?
-
David Parks over 2 years
This is standard Python slicing. See stackoverflow.com/questions/509211/understanding-slice-notation
-
banderlog013 over 2 years
For every 3rd row it will be the unintuitive df.iloc[2::3]
-
Lodinn about 2 years
@banderlog013 No, that's intuitive: just df.iloc[::3] would suffice. What you want ("intuitively") is for the first row in the selection not to be the first row in the dataframe. It's not hard to see that for any given N ("give me every Nth row starting with the naturally-counted Nth row") the indexing is df.iloc[(N-1)::N]. This behavior is rarely needed, however...