Iteration over the rows of a Pandas DataFrame as dictionaries

Solution 1

You can try:

for k, row in df.iterrows():
    myfunc(**row)

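Here k is the row's index label and row is a pandas Series rather than a true dict. A Series behaves like a mapping keyed by column name, so it can be unpacked with ** and any column can be accessed with: row["my_column_name"]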
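As a minimal sketch of this loop, assuming Python 3 and a recent pandas (not the Python 2.7 / pandas 0.14.1 setup from the question), the example below unpacks each (index, row) pair before expanding the row into keyword arguments. The function name area and the sample frame are illustrative only. Passing the pair itself to ** is what produces the "must be a mapping, not tuple" error.

```python
import pandas as pd


def area(**kwargs):
    # Hypothetical stand-in for the function or constructor taking **kwargs
    return kwargs.get("length", 0) * kwargs.get("width", 0)


df = pd.DataFrame({"length": [1, 2, 3], "width": [10, 20, 30]})

areas = []
# iterrows() yields (index_label, row) pairs; unpack the pair first,
# then expand the row Series into keyword arguments.
for idx, row in df.iterrows():
    areas.append(area(**row))

print(areas)
```

Note that iterrows() builds a new Series for every row, so values from columns of different dtypes may be upcast to a common type.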

Solution 2

One clean option is to convert the DataFrame to a list of dictionaries, one per row:

for row_dict in df.to_dict(orient="records"):
    print(row_dict['column_name'])
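The same idea can be sketched with a class constructor, which is closer to the use case in the question. This assumes Python 3 and a recent pandas where to_dict accepts orient="records"; the Rectangle class is a hypothetical example, not something from the original post.

```python
import pandas as pd


class Rectangle:
    # Hypothetical class standing in for the constructor from the question
    def __init__(self, **kwargs):
        self.length = kwargs.get("length", 0)
        self.width = kwargs.get("width", 0)

    def area(self):
        return self.length * self.width


df = pd.DataFrame({"length": [1, 2, 3], "width": [10, 20, 30]})

# One plain dict per row, keyed by column name
records = df.to_dict(orient="records")
rectangles = [Rectangle(**rec) for rec in records]
areas = [r.area() for r in rectangles]
```

Because to_dict builds the whole list up front, every row dictionary is held in memory at once, which may matter for very large frames.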

Solution 3

Calling a separate function for every row is inefficient, because the calculation is then performed row by row. It is more efficient to compute a new Series with vectorized operations and then iterate over that Series:

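import pandas as pd

df = pd.DataFrame({'length':[1,2,3,'test'], 'width':[10, 20, 30,'hello']})

# Coerce non-numeric entries to NaN instead of raising
df2 = df.apply(pd.to_numeric, errors='coerce')

error_str = 'Error : length and width should be int or float'
# Rows containing NaN print the error message in place of a product
print(*(df2['length'] * df2['width']).fillna(error_str), sep='\n')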

10.0
40.0
90.0
Error : length and width should be int or float
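
A possible variation, sketched here under the assumption that keeping the result numeric is preferable to mixing floats and strings in one Series, is to leave invalid rows as NaN and track them with a boolean mask. The names numeric, area and valid are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"length": [1, 2, 3, "test"], "width": [10, 20, 30, "hello"]})

# Non-numeric entries become NaN rather than raising an exception
numeric = df.apply(pd.to_numeric, errors="coerce")

# Vectorized product over whole columns; NaN propagates to invalid rows
area = numeric["length"] * numeric["width"]

# Boolean mask marking the rows where both inputs were numeric
valid = area.notna()

print(area[valid])
print("invalid rows:", list(area.index[~valid]))
```

The error message can then be produced only at the point where the results are reported, while the Series itself stays float-typed.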

Author: Matina G

Updated on July 09, 2022

Comments

  • Matina G almost 2 years

    I need to iterate over a pandas DataFrame in order to pass each row as the arguments of a function (actually, a class constructor) that takes **kwargs. This means that each row should behave as a dictionary whose keys are the column names and whose values are that row's entries.

    This works, but it performs very badly:

    import pandas as pd
    
    
    def myfunc(**kwargs):
        try:
            area = kwargs.get('length', 0) * kwargs.get('width', 0)
            return area
        except TypeError:
            return 'Error : length and width should be int or float'
    
    
    df = pd.DataFrame({'length':[1,2,3], 'width':[10, 20, 30]})
    
    for i in range(len(df)):
        print myfunc(**df.iloc[i])
    

    Any suggestions on how to make this faster? I have tried iterating with df.iterrows(), but I get the following error:

    TypeError: myfunc() argument after ** must be a mapping, not tuple

    I have also tried df.itertuples() and df.values, but either I am missing something, or it means that I have to convert each tuple / np.array to a pd.Series or dict, which will also be slow. My constraint is that the script has to work with Python 2.7 and pandas 0.14.1.
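
For the itertuples route mentioned above, one possible sketch pairs each row tuple with the column names to build a plain dict, which avoids creating a Series per row. It is written for Python 3 and a recent pandas; whether it behaves the same on pandas 0.14.1 is an assumption that has not been checked here.

```python
import pandas as pd


def myfunc(**kwargs):
    # Same calculation as in the question, returning the area
    return kwargs.get("length", 0) * kwargs.get("width", 0)


df = pd.DataFrame({"length": [1, 2, 3], "width": [10, 20, 30]})

columns = list(df.columns)
results = []
# index=False drops the index so each tuple lines up with the columns
for values in df.itertuples(index=False):
    row_dict = dict(zip(columns, values))
    results.append(myfunc(**row_dict))

print(results)
```

Zipping against df.columns keeps the original column names as keys, including names that are not valid Python identifiers.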