Pandas to_sql fails on duplicate primary key
Solution 1
There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation, to insert only the rows that were not already in the database (the dataframe is named df):
from sqlalchemy.exc import IntegrityError

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
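A minimal, self-contained sketch of the same row-by-row "skip duplicates" idea, using the stdlib sqlite3 driver so it runs without pandas or SQLAlchemy (the trades table and its rows are made up for the demo):

```python
import sqlite3

# Hypothetical table with a primary key and one pre-existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO trades VALUES (1, 10)")

rows = [(1, 99), (2, 20), (3, 30)]  # id=1 collides with the existing row
inserted = 0
for row in rows:
    try:
        conn.execute("INSERT INTO trades VALUES (?, ?)", row)
        inserted += 1
    except sqlite3.IntegrityError:
        pass  # duplicate primary key: skip this row

count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
```

Only the two non-colliding rows get inserted; the duplicate is silently skipped, which is exactly what the try/except around to_sql achieves one row at a time.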
Solution 2
Please note that if_exists='append' relates only to the existence of the table, i.e. what to do when the table does not exist. It has nothing to do with the content of the table.
see the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
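To make the distinction concrete, here is a small runnable sketch (assuming pandas and an in-memory SQLite database; the table name t is made up) showing that if_exists only controls what happens to the table itself, never to duplicate rows:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": [1, 2]})

# Table does not exist yet: 'append' creates it and inserts 2 rows.
df.to_sql("t", conn, index=False, if_exists="append")
# Table exists: 'append' inserts the same 2 rows again, duplicates and all.
df.to_sql("t", conn, index=False, if_exists="append")
after_append = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Table exists: 'replace' drops it, recreates it, and inserts 2 rows.
df.to_sql("t", conn, index=False, if_exists="replace")
after_replace = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

After the two appends the table holds 4 rows, duplicates included, which is why if_exists alone cannot solve the duplicate-key problem.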
Solution 3
You can do this with the method parameter of to_sql:
from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)
df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)
For older versions of sqlalchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted))
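To see what this builds without connecting to a real MySQL server, the statement can be compiled to SQL against the MySQL dialect. The two-column trades table below is hypothetical, and the dict(...) form is used since it works on both old and new sqlalchemy versions:

```python
from sqlalchemy import table, column
from sqlalchemy.dialects import mysql
from sqlalchemy.dialects.mysql import insert

# Hypothetical two-column table, used only to show the compiled SQL.
trades = table("trades", column("id"), column("qty"))

insert_stmt = insert(trades).values(id=1, qty=10)
# insert_stmt.inserted maps each column to its VALUES() expression.
stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted))
sql = str(stmt.compile(dialect=mysql.dialect()))
```

The compiled statement is a plain INSERT followed by an ON DUPLICATE KEY UPDATE clause that overwrites every column with the incoming value, i.e. an upsert rather than an "insert ignore".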
Solution 4
Pandas currently has no option for it, but here is the GitHub issue. If you need this feature too, just upvote it.
Solution 5
The for-loop method above slows things down significantly. There's a method parameter you can pass to pandas.DataFrame.to_sql to customize the insert query:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and do nothing if there's a conflict with the primary key "unique_code". Change the insert dialect for your db.
def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column

    columns = [column(c) for c in keys]

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)
    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])
    conn.execute(do_nothing_stmt)
df.to_sql('mytable', con=sql_engine, if_exists='append', method=insert_do_nothing_on_conflicts)
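The same method-callback idea works for SQLite, where INSERT OR IGNORE plays the role of Postgres's ON CONFLICT DO NOTHING. A runnable sketch, assuming pandas and an in-memory database (the trades table and its unique_code primary key are made up for the demo):

```python
import sqlite3
import pandas as pd

def insert_or_ignore(sqltable, conn, keys, data_iter):
    # Build "INSERT OR IGNORE INTO <table> (cols...) VALUES (?, ...)"
    placeholders = ", ".join("?" for _ in keys)
    sql = "INSERT OR IGNORE INTO {} ({}) VALUES ({})".format(
        sqltable.name, ", ".join(keys), placeholders)
    conn.executemany(sql, list(data_iter))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (unique_code INTEGER PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO trades VALUES (1, 10)")  # pre-existing row
conn.commit()

df = pd.DataFrame({"unique_code": [1, 2, 3], "qty": [99, 20, 30]})
df.to_sql("trades", conn, index=False, if_exists="append", method=insert_or_ignore)

count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
kept = conn.execute("SELECT qty FROM trades WHERE unique_code = 1").fetchone()[0]
```

Only the two new rows land in the table, and the pre-existing row keeps its original qty, i.e. conflicts are ignored rather than updated.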
Comments
-
ryantuck almost 2 years
I'd like to append to an existing table, using the pandas df.to_sql() function. I set if_exists='append', but my table has primary keys. I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error. Is this possible with pandas, or do I need to write an explicit query?
-
maxymoo almost 9 years possible duplicate of Appending Pandas dataframe to sqlite table by primary key
-
theStud54 over 7 years Don't forget to add if_exists='append' as a parameter
miro almost 7 years This solves the problem... but it slows down the query VEEEEEERY MUCH
-
Halee over 4 years For those using sqlalchemy, this is what worked for me: adding this import: from sqlalchemy import exc, and changing the exception to this: except exc.IntegrityError as e:. Like @miro said, it does slow down the process by a lot.
KIC over 3 years And in the meantime there is pangres pypi.org/project/pangres
-
DirtyBit over 3 years What if there are columns like created_at and updated_at in the table that are auto-filled? This approach doesn't work then!
Jayen over 2 years pandas.pydata.org/pandas-docs/stable/whatsnew/… pandas.DataFrame.to_sql() has gained the method argument to control the SQL insertion clause. See the insertion method section in the documentation. (GH8953)
-
Huy Tran over 2 years Got an error: ValueError: update parameter must be a non-empty dictionary
-
Jayen over 2 years @HuyTran I'm not sure why you would get that. Does the db table exist already? Do your dataframe's columns match the table's columns?
-
Jayen over 2 years @HuyTran what version of pandas are you using?
-
Huy Tran over 2 years Hi @Jayen, pandas=1.2.1, sqlalchemy=1.3.22. I found the error to be in the pandas table.table and the insert dialect. It seemed the ValueError referred to insert() requiring a table object instead of a string.
-
Jayen over 2 years @HuyTran if you have some different code, can you edit my answer to clarify? I recently tried this on sqlalchemy 1.3.22 but that version's on_duplicate_key_update doesn't accept a ColumnCollection and I had to create a dict.
Grimlock about 2 years @Jayen Can you please explain your answer? For example, how does insert_stmt.inserted behave? I intend to use your function, but want slightly different behavior. This function seems to be causing an issue like this: dba.stackexchange.com/questions/60295/…
Jayen about 2 years @Grimlock see the "tip" on docs.sqlalchemy.org/en/14/dialects/… . TBH I don't think this should affect the auto-increment, but I don't really know.