Why use pandas.assign rather than simply initializing a new column?

python pandas

14,899

Solution 1

The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.

In particular, DataFrame.assign returns a new object containing a copy of the original data with the requested changes applied; the original frame remains unchanged.

In your particular case:

>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})

Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:

>>> new_df = df.assign(A=1)

If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] assignment does not.
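To make the distinction concrete, here is a minimal sketch that reuses the frame from above. The names new_df and unchanged are only illustrative, and the imports are assumed since the snippets above omit them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# assign builds a new frame in which A is everywhere 1
new_df = df.assign(A=1)

# df still holds its original values in column A
unchanged = df["A"].tolist() == list(range(1, 11))
print(unchanged)  # True

# Item assignment, by contrast, overwrites column A in df itself
df["A"] = 1
print(df["A"].tolist() == new_df["A"].tolist())  # True
```

After the last line the original values of A are gone from df, which is exactly the trade-off described above.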

Solution 2

The premise of assign is that it returns:

A new DataFrame with the new columns in addition to all the existing columns.

Nor are you supposed to change the original DataFrame in place from within assign:

The callable must not change input DataFrame (though pandas doesn't check it).

On the other hand, df['ln_A'] = np.log(df['A']) modifies df in place.
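As a rough illustration, the sketch below runs both assign forms from the question next to the in-place version. The names via_callable and via_series are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})

# Callable form: the function receives the frame and returns the new column
via_callable = df.assign(ln_A=lambda x: np.log(x["A"]))

# Precomputed form: pass the values directly
via_series = df.assign(ln_A=np.log(df["A"]))

# Both forms build the same new frame, and df itself has no ln_A column yet
same_result = via_callable.equals(via_series)
df_untouched = "ln_A" not in df.columns
print(same_result, df_untouched)

# Item assignment adds the column to df itself
df["ln_A"] = np.log(df["A"])
print(df.equals(via_callable))
```

Only the last assignment changes df; the two assign calls leave it as it was.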

So is there a reason I should stop using my old method in favour of df.assign?

I think you can try df.assign, but if you are doing memory-intensive work, it is better to stick with what you did before, or to use operations with inplace=True.


Author by

sacuL

Updated on July 02, 2022

Comments

sacuL almost 2 years
I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by by just initializing a new column 'on the fly'. Is there a reason why assign is better?

For instance (based on the example in the pandas documentation), to create a new column in a dataframe, I could just do this:
```
import numpy as np
from pandas import DataFrame
df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df['ln_A'] = np.log(df['A'])
```
but the pandas.DataFrame.assign documentation recommends doing this:
```
df.assign(ln_A=lambda x: np.log(x.A))
# or
newcol = np.log(df['A'])
df.assign(ln_A=newcol)
```
Both methods produce the same dataframe. In fact, the first method (my 'on the fly' method) is significantly faster (0.20225788200332318 seconds for 1000 iterations) than the .assign method (0.3526602769998135 seconds for 1000 iterations).

So is there a reason I should stop using my old method in favour of df.assign?
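
A timing comparison like the one quoted above could be set up along these lines. This is only a sketch: the question does not show how the numbers were measured, so the use of timeit, the helper names (make_frame, add_in_place, add_with_assign) and the separate frame per method are all assumptions, and the results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame():
    # Same shape as the frame in the question
    return pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})


# One frame per method, so the in-place writes do not affect the assign runs
df_item = make_frame()
df_assign = make_frame()


def add_in_place():
    # Overwrites ln_A on every run after the first
    df_item["ln_A"] = np.log(df_item["A"])


def add_with_assign():
    # Builds a new frame on every run and leaves df_assign alone
    return df_assign.assign(ln_A=lambda x: np.log(x["A"]))


n = 1000
t_in_place = timeit.timeit(add_in_place, number=n)
t_assign = timeit.timeit(add_with_assign, number=n)
print(f"item assignment: {t_in_place:.4f} s for {n} runs")
print(f"assign:          {t_assign:.4f} s for {n} runs")
```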