Pandas to_sql fails on duplicate primary key
Solution 1
There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation, to insert only the rows that were not already in the database (the dataframe is named df):
from sqlalchemy.exc import IntegrityError

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
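A minimal, self-contained sketch of the same row-by-row "skip duplicates" idea, using the stdlib sqlite3 driver so it runs without pandas or SQLAlchemy (the trades table and its rows are made up for the demo):

```python
import sqlite3

# Hypothetical table with a primary key and one pre-existing row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO trades VALUES (1, 10)")

rows = [(1, 99), (2, 20), (3, 30)]  # id=1 collides with the existing row
inserted = 0
for row in rows:
    try:
        conn.execute("INSERT INTO trades VALUES (?, ?)", row)
        inserted += 1
    except sqlite3.IntegrityError:
        pass  # duplicate primary key: skip this row

count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
```

Only the two non-colliding rows get inserted; the duplicate is silently skipped, which is exactly what the try/except around to_sql achieves one row at a time.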
Solution 2
Please note that if_exists='append' relates only to the existence of the table, i.e. what to do when the table does not exist. It has nothing to do with the content of the table.
see the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
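To make the distinction concrete, here is a small runnable sketch (assuming pandas and an in-memory SQLite database; the table name t is made up) showing that if_exists only controls what happens to the table itself, never to duplicate rows:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"id": [1, 2]})

# Table does not exist yet: 'append' creates it and inserts 2 rows.
df.to_sql("t", conn, index=False, if_exists="append")
# Table exists: 'append' inserts the same 2 rows again, duplicates and all.
df.to_sql("t", conn, index=False, if_exists="append")
after_append = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Table exists: 'replace' drops it, recreates it, and inserts 2 rows.
df.to_sql("t", conn, index=False, if_exists="replace")
after_replace = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

After the two appends the table holds 4 rows, duplicates included, which is why if_exists alone cannot solve the duplicate-key problem.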
Solution 3
You can do this with the method parameter of to_sql:
from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)
df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)
For older versions of sqlalchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted))
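To see what this builds without connecting to a real MySQL server, the statement can be compiled to SQL against the MySQL dialect. The two-column trades table below is hypothetical, and the dict(...) form is used since it works on both old and new sqlalchemy versions:

```python
from sqlalchemy import table, column
from sqlalchemy.dialects import mysql
from sqlalchemy.dialects.mysql import insert

# Hypothetical two-column table, used only to show the compiled SQL.
trades = table("trades", column("id"), column("qty"))

insert_stmt = insert(trades).values(id=1, qty=10)
# insert_stmt.inserted maps each column to its VALUES() expression.
stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted))
sql = str(stmt.compile(dialect=mysql.dialect()))
```

The compiled statement is a plain INSERT followed by an ON DUPLICATE KEY UPDATE clause that overwrites every column with the incoming value, i.e. an upsert rather than an "insert ignore".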
Solution 4
Pandas currently has no option for it, but here is the GitHub issue. If you need this feature too, just upvote it.
Solution 5
The for-loop method above slows things down significantly. There's a method parameter you can pass to pandas.DataFrame.to_sql to customize the insert query:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and do nothing if there's a conflict with the primary key "unique_code". Change the insert dialect for your db.
def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column

    columns = [column(c) for c in keys]

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)
    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])
    conn.execute(do_nothing_stmt)
df.to_sql('mytable', con=sql_engine, if_exists='append', method=insert_do_nothing_on_conflicts)
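The same method-callback idea works for SQLite, where INSERT OR IGNORE plays the role of Postgres's ON CONFLICT DO NOTHING. A runnable sketch, assuming pandas and an in-memory database (the trades table and its unique_code primary key are made up for the demo):

```python
import sqlite3
import pandas as pd

def insert_or_ignore(sqltable, conn, keys, data_iter):
    # Build "INSERT OR IGNORE INTO <table> (cols...) VALUES (?, ...)"
    placeholders = ", ".join("?" for _ in keys)
    sql = "INSERT OR IGNORE INTO {} ({}) VALUES ({})".format(
        sqltable.name, ", ".join(keys), placeholders)
    conn.executemany(sql, list(data_iter))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (unique_code INTEGER PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO trades VALUES (1, 10)")  # pre-existing row
conn.commit()

df = pd.DataFrame({"unique_code": [1, 2, 3], "qty": [99, 20, 30]})
df.to_sql("trades", conn, index=False, if_exists="append", method=insert_or_ignore)

count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
kept = conn.execute("SELECT qty FROM trades WHERE unique_code = 1").fetchone()[0]
```

Only the two new rows land in the table, and the pre-existing row keeps its original qty, i.e. conflicts are ignored rather than updated.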
Comments
-
ryantuck almost 2 years
I'd like to append to an existing table, using the pandas df.to_sql() function. I set if_exists='append', but my table has primary keys. I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error. Is this possible with pandas, or do I need to write an explicit query?
-
maxymoo almost 9 years possible duplicate of Appending Pandas dataframe to sqlite table by primary key
-
theStud54 over 7 years Don't forget to add if_exists='append' as a parameter
miro almost 7 years This solves the problem... but it slows down the query VEEEEEERY MUCH
-
Halee over 4 years For those using sqlalchemy, this is what worked for me: adding this import: from sqlalchemy import exc, and changing the exception to this: except exc.IntegrityError as e:. Like @miro said, it does slow down the process by a lot.
KIC over 3 years And in the meantime there is pangres pypi.org/project/pangres
-
DirtyBit over 3 years What if there are columns like created_at and updated_at in the table that are auto-filled? This approach doesn't work then!
Jayen over 2 years pandas.pydata.org/pandas-docs/stable/whatsnew/… pandas.DataFrame.to_sql() has gained the method argument to control the SQL insertion clause. See the insertion method section in the documentation. (GH8953)
-
Huy Tran over 2 years Got an error: ValueError: update parameter must be a non-empty dictionary
-
Jayen over 2 years @HuyTran I'm not sure why you would get that. Does the db table exist already? Do your dataframe's columns match the table's columns?
-
Jayen over 2 years @HuyTran what version of pandas are you using?
-
Huy Tran over 2 years Hi @Jayen, pandas=1.2.1, sqlalchemy=1.3.22. I found the error to be in the pandas table.table and the insert dialect. It seemed the ValueError referred to insert() requiring a table object instead of a string.
-
Jayen over 2 years @HuyTran if you have some different code, can you edit my answer to clarify? I recently tried this on sqlalchemy 1.3.22 but that version's on_duplicate_key_update doesn't accept a ColumnCollection and I had to create a dict.
Grimlock about 2 years @Jayen Can you please explain your answer? For example, how does insert_stmt.inserted behave? I intend to use your function, but want slightly different behavior. This function seems to be causing an issue like this: dba.stackexchange.com/questions/60295/…
Jayen about 2 years @Grimlock see the "tip" on docs.sqlalchemy.org/en/14/dialects/… . TBH I don't think this should affect the auto-increment, but I don't really know.