Pandas drop rows vs filter


Solution 1

The recommended solution is the most efficient one, which in this case is the first.

df = df[df['A'] >= 0]

In the second solution

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

you are repeating the slicing process. But let's break it into pieces to understand why.

When you write

df['A'] >= 0

you are creating a mask: a Boolean Series with one entry for each index label of df, whose value is either True or False according to a condition (in this case, whether the value of column 'A' at that index is greater than or equal to 0).
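To make the mask concrete, here is a minimal sketch. The sample values are made up for illustration; only the column name 'A' comes from the question.

```python
import pandas as pd

# Hypothetical sample data; the column name 'A' follows the question.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5]})

# Comparing a column to a scalar yields a Boolean Series aligned with df's index.
mask = df["A"] >= 0

print(mask.tolist())  # [True, False, True, False, True]
print(mask.dtype)     # bool
```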

When you write

df[df['A'] >= 0]

you are accessing the rows for which your mask (df['A'] >= 0) is True. This is an indexing method supported by pandas (Boolean indexing) that lets you select rows by passing a Boolean Series. It returns a new DataFrame containing only the entries for which the Series was True.
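As a small sketch of what that selection gives back, again on made-up sample data: the kept rows retain their original index labels.

```python
import pandas as pd

# Hypothetical sample data; the column name 'A' follows the question.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5]})

# Passing the Boolean Series to df[...] keeps only the rows where it is True.
kept = df[df["A"] >= 0]

print(kept["A"].tolist())   # [3, 0, 5]
print(kept.index.tolist())  # [0, 2, 4]  (original labels are preserved)
```

If you want a fresh 0..n-1 index afterwards, kept.reset_index(drop=True) is one way to get it.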

Finally, when you write this

selRows = df[df['A'] < 0].index
df = df.drop(selRows, axis=0)

you are repeating the process because

df[df['A'] < 0]

is already slicing your DataFrame (in this case, for the rows you want to drop). You then take those indices, go back to the original DataFrame, and explicitly drop them. There is no need for this: you already sliced the DataFrame in the first step.
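A quick way to convince yourself that both approaches give the same result is to compare them directly. This is a sketch on made-up data; column 'B' is added only so the frame has more than one column, and it assumes a unique index and no missing values in 'A'.

```python
import pandas as pd

# Hypothetical sample data with a unique index and no NaN in 'A'.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5], "B": list("vwxyz")})

# Approach 1: keep the rows that satisfy the condition.
filtered = df[df["A"] >= 0]

# Approach 2: find the rows that violate it, then drop them by label.
sel_rows = df[df["A"] < 0].index
dropped = df.drop(sel_rows, axis=0)

print(filtered.equals(dropped))  # True
```

The two can differ outside those assumptions. A NaN in 'A' fails both comparisons, so the filter removes that row while the drop keeps it. And because drop works by label, a duplicated index label removes every row that shares it.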

Solution 2

df = df[df['A'] >= 0]

is indeed the faster solution. Just be aware that pandas treats the result as derived from the original data frame rather than as an independent one, even though Boolean indexing copies the selected rows. This can lead you into trouble, for example when you later assign to it, as pandas may give you the SettingWithCopyWarning.

The simple fix of course is what Wen-Ben recommended:

df = df[df['A'] >= 0].copy()
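A minimal sketch of that fix on made-up data, assuming you want to modify the filtered frame afterwards: with an explicit .copy(), the assignment affects only the filtered frame and the original stays as it was.

```python
import pandas as pd

# Hypothetical sample data; the column name 'A' follows the question.
df = pd.DataFrame({"A": [3, -1, 0, -7, 5]})

# Take an explicit, independent copy of the filtered rows.
kept = df[df["A"] >= 0].copy()

# Modifying the copy does not touch the original frame.
kept["A"] = kept["A"] * 10

print(kept["A"].tolist())  # [30, 0, 50]
print(df["A"].tolist())    # [3, -1, 0, -7, 5]
```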

Solution 3

Your question is like this: "I have two identical cakes, but one has icing. Which has more calories?"

The second solution is doing the same thing, but twice. A filtering step is enough; there's no need to filter and then redundantly call a function that does the exact same thing the filtering operation from the previous step did.

To clarify: regardless of the operation, you are still doing the same thing: generating a boolean mask, and then subsequently indexing.
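If you want to measure the difference yourself, a rough timing sketch could look like the following. The data size, the random values, and the helper names by_filter and by_drop are all made up for illustration, and the absolute numbers will vary with your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random integers in [-100, 100).
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.integers(-100, 100, size=100_000)})


def by_filter():
    # One mask, one indexing step.
    return df[df["A"] >= 0]


def by_drop():
    # One mask and one indexing step to find the labels, then a drop by label.
    sel_rows = df[df["A"] < 0].index
    return df.drop(sel_rows, axis=0)


t_filter = timeit.timeit(by_filter, number=20)
t_drop = timeit.timeit(by_drop, number=20)
print(f"filter: {t_filter:.4f}s  drop: {t_drop:.4f}s")
```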

Author by

ojon

Updated on July 25, 2022

Comments

  • ojon
    ojon almost 2 years

    I have a pandas dataframe and want to get rid of rows in which the column 'A' is negative. I know 2 ways to do this:

    df = df[df['A'] >= 0]
    

    or

    selRows = df[df['A'] < 0].index
    df = df.drop(selRows, axis=0)
    

    What is the recommended solution? Why?