Using pandas .append within for loop

Solution 1

You need to assign the result of append back to the variable data. Unlike the append method on a Python list, the pandas append does not happen in place; it returns a new DataFrame.

import pandas as pd
import numpy as np

data = pd.DataFrame([])

for i in np.arange(0, 4):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)

print(data.head())

   A    B
0  0  1.0
1  1  NaN
2  2  3.0
3  3  NaN
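
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the code above only runs on older versions. A minimal sketch of the same fix on current pandas, assuming pd.concat as the replacement (the variable name row is only for illustration). The point is unchanged: pd.concat also returns a new object, so the result has to be assigned back to data.

```python
import pandas as pd

data = pd.DataFrame()

for i in range(4):
    if i % 2 == 0:
        row = pd.DataFrame({'A': i, 'B': i + 1}, index=[0])
    else:
        row = pd.DataFrame({'A': i}, index=[0])
    # pd.concat does not modify `data` in place; reassign the result
    data = pd.concat([data, row], ignore_index=True)

print(data)
```

This still copies the accumulated frame on every iteration, so it shares the performance cost of the original append loop.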

NOTE: This answer aims to answer the question as it was posed. It is not, however, the optimal strategy for combining large numbers of dataframes. For a more efficient solution, have a look at Alexander's answer below.

Solution 2

Every time you call append, Pandas returns a copy of the original dataframe plus your new row. This is called quadratic copy, and it is an O(N^2) operation that will quickly become very slow (especially since you have lots of data).
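
To see where the O(N^2) comes from, here is a rough cost model, assuming one row is appended per call and every call copies all rows already in the frame. The function name rows_copied is hypothetical, and the count ignores constant factors such as index and column handling.

```python
def rows_copied(n_appends):
    """Total existing rows copied if every append copies the whole frame."""
    # The k-th append (k = 0, 1, 2, ...) copies the k rows already present,
    # so the total is 0 + 1 + ... + (n - 1) = n * (n - 1) / 2.
    return sum(range(n_appends))

for n in (100, 1_000, 10_000):
    print(n, rows_copied(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

Multiplying the number of rows by 10 multiplies the copying work by roughly 100.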

In your case, I would recommend using lists, appending to them, and then calling the dataframe constructor.

a_list = []
b_list = []
for data in my_data:
    a, b = process_data(data)
    a_list.append(a)
    b_list.append(b)
df = pd.DataFrame({'A': a_list, 'B': b_list})
del a_list, b_list
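
In the snippet above, my_data and process_data are placeholders for your own data and processing step. A self-contained sketch of the same idea, applied to the dummy example from the question and assuming you prefer to collect one dict per row rather than one list per column (the name rows is only for illustration):

```python
import pandas as pd

rows = []
for i in range(4):
    row = {'A': i}
    if i % 2 == 0:
        row['B'] = i + 1
    # Appending to a Python list is in place and cheap
    rows.append(row)

# Build the DataFrame once; keys missing from a dict become NaN
df = pd.DataFrame(rows)
print(df)
```

Collecting dicts is convenient when rows do not all have the same columns, since the constructor fills the gaps with NaN.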

Timings

%%timeit
data = pd.DataFrame([])
for i in np.arange(0, 10000):
    if i % 2 == 0:
        data = data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
    else:
        data = data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
1 loops, best of 3: 6.8 s per loop

%%timeit
a_list = []
b_list = []
for i in np.arange(0, 10000):
    if i % 2 == 0:
        a_list.append(i)
        b_list.append(i + 1)
    else:
        a_list.append(i)
        b_list.append(None)
data = pd.DataFrame({'A': a_list, 'B': b_list})
100 loops, best of 3: 8.54 ms per loop

Solution 3

You can build your dataframe without a loop:

n = 4
data = pd.DataFrame({'A': np.arange(n)})
data['B'] = np.nan
data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1
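
The same column can also be filled in a single expression. This is a sketch of an alternative, assuming np.where is acceptable in place of the .loc assignment; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

n = 4
data = pd.DataFrame({'A': np.arange(n)})
# Where A is even take A + 1, otherwise NaN
data['B'] = np.where(data['A'] % 2 == 0, data['A'] + 1, np.nan)
print(data)
```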

For:

n = 10000

This is a bit faster:

%%timeit
data = pd.DataFrame({'A': np.arange(n)})
data['B'] = np.nan
data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1

100 loops, best of 3: 3.3 ms per loop

vs.

%%timeit
a_list = []
b_list = []
for i in np.arange(n):
    if i % 2 == 0:
        a_list.append(i)
        b_list.append(i + 1)
    else:
        a_list.append(i)
        b_list.append(None)
data1 = pd.DataFrame({'A': a_list, 'B': b_list})

100 loops, best of 3: 12.4 ms per loop
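
The %%timeit cells above only work in IPython or Jupyter. A sketch for reproducing the comparison in a plain script, assuming the standard-library timeit module; the function names build_vectorized and build_from_lists are hypothetical, and the numbers will differ from those quoted above depending on machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

n = 10000

def build_vectorized():
    # Solution 3: no Python-level loop
    data = pd.DataFrame({'A': np.arange(n)})
    data['B'] = np.nan
    data.loc[data['A'] % 2 == 0, 'B'] = data['A'] + 1
    return data

def build_from_lists():
    # Solution 2: collect values in lists, construct the frame once
    a_list, b_list = [], []
    for i in range(n):
        a_list.append(i)
        b_list.append(i + 1 if i % 2 == 0 else None)
    return pd.DataFrame({'A': a_list, 'B': b_list})

for build in (build_vectorized, build_from_lists):
    # Average over 10 calls to smooth out noise
    seconds = timeit.timeit(build, number=10) / 10
    print(f"{build.__name__}: {seconds * 1e3:.2f} ms per call")
```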
Author: calpyte

Updated on July 09, 2022

Comments

  • calpyte
    calpyte almost 2 years

    I am appending rows to a pandas DataFrame within a for loop, but at the end the dataframe is always empty. I don't want to add the rows to an array and then call the DataFrame constructor, because my actual for loop handles lots of data. I also tried pd.concat without success. Could anyone highlight what I am missing to make the append statement work? Here's a dummy example:

    import pandas as pd
    import numpy as np
    
    data = pd.DataFrame([])
    
    for i in np.arange(0, 4):
        if i % 2 == 0:
            data.append(pd.DataFrame({'A': i, 'B': i + 1}, index=[0]), ignore_index=True)
        else:
            data.append(pd.DataFrame({'A': i}, index=[0]), ignore_index=True)
    
    print data.head()
    
    Empty DataFrame
    Columns: []
    Index: []
    [Finished in 0.676s]
    
  • calpyte
    calpyte about 8 years
    Alright, so saving it to an array and then calling the DataFrame is actually faster then. Thanks!
  • calpyte
    calpyte about 8 years
    Thanks that works! Kinda silly that I didn't think of that.
  • Dan Fiorino
    Dan Fiorino over 4 years
    The question is not about efficiently creating the dummy dataframe example.