What's the most efficient way to insert thousands of records into a table (MySQL, Python, Django)


Solution 1

You can write the rows to a file in the format "field1","field2",... and then use LOAD DATA INFILE to load them:

# Assumes the field values contain no embedded quotes or commas.
lines = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
with open('data.txt', 'w') as f:
    f.write(lines)

Then execute this:

LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
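
If you are issuing that statement from Python rather than the mysql client, a minimal sketch through Django's raw cursor might look like the following (the table name and the column names are guessed from the question's model fields, and LOCAL assumes the code is not running on the database server itself):

from django.db import connection

# Sketch: run the LOAD DATA statement through Django's raw cursor.
# The server and driver must both allow local_infile for LOCAL to work.
cursor = connection.cursor()
cursor.execute(
    "LOAD DATA LOCAL INFILE 'data.txt' INTO TABLE my_table "
    "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
    "(string, number1, number2)"
)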


Solution 2

For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you can convert the data into the format it expects, that will probably be the fastest way to get it into the table.
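
For example, Python's csv module will produce that quoted, comma-separated format and handle any embedded quotes for you (a sketch in modern Python 3; the filename matches the one used in Solution 1, and input_list is the list of tuples from the question):

import csv

# Sketch: write the tuples as quoted CSV so LOAD DATA INFILE can read them
# with FIELDS TERMINATED BY ',' ENCLOSED BY '"'.
with open('data.txt', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(input_list)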

Solution 3

If you don't use LOAD DATA INFILE as some of the other solutions suggest, two things you can do to speed up your inserts are:

  1. Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
  2. Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB); see the sketch below
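
A rough sketch of both points using the MySQLdb driver directly (the connection details are placeholders; MySQLdb binds parameters on the client side rather than using true server-side prepared statements, but its executemany() typically collapses an INSERT like this into one multi-row statement, which avoids re-parsing per row):

import MySQLdb

# Placeholder connection details (assumption).
conn = MySQLdb.connect(host='localhost', user='user', passwd='pass', db='db2')
cursor = conn.cursor()

sql = "INSERT INTO my_table (string, number1, number2) VALUES (%s, %s, %s)"
try:
    # One parameterized statement, many rows, one transaction.
    cursor.executemany(sql, input_list)
    conn.commit()
except Exception:
    conn.rollback()
    raise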

Solution 4

If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much much faster than lots of individual INSERT statements.
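
Through Django's raw cursor that might look roughly like this; the table name myapp_dataelement is only a guess at the model's default table, the column names are guessed from the model fields, and the batch size is arbitrary (keep each statement below max_allowed_packet, as discussed in the comments):

from django.db import connection, transaction

BATCH_SIZE = 1000  # arbitrary; size batches to stay under max_allowed_packet

@transaction.commit_on_success
def save_data_elements(input_list):
    cursor = connection.cursor()
    for start in range(0, len(input_list), BATCH_SIZE):
        chunk = input_list[start:start + BATCH_SIZE]
        # Build one INSERT with a VALUES clause per row in the chunk.
        values_sql = ", ".join(["(%s, %s, %s)"] * len(chunk))
        params = [value for row in chunk for value in row]
        cursor.execute(
            "INSERT INTO myapp_dataelement (string, number1, number2) "
            "VALUES " + values_sql,
            params,
        )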

Solution 5

Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.
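
If an existing table is still on MyISAM, converting it is a single statement (a sketch; the table name follows the earlier example, and the ALTER itself rebuilds the table, so run it during a quiet period):

from django.db import connection

# Sketch: convert an existing table to InnoDB (table name assumed).
cursor = connection.cursor()
cursor.execute("ALTER TABLE my_table ENGINE=InnoDB")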

Comments

  • Roee Adler
    Roee Adler almost 2 years

    I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.

    Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.

    [("hello", 3, 4), ("cat", 5, 3), ...]
    

    I need to insert all these tuples into the table (assume I have verified that none of these strings already appear in the database). For clarification, I'm using InnoDB, and I have an auto-incrementing primary key for this table; the string is not the PK.

    My code currently iterates through this list; for each tuple it creates a Django model object with the appropriate values and calls ".save()", something like so:

    @transaction.commit_on_success
    def save_data_elements(input_list):
        for (s, i1, i2) in input_list:
            entry = DataElement(string=s, number1=i1, number2=i2)
            entry.save()
    

    This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.

    For example, I could generate SQL statements, each containing an INSERT command for 100 tuples ("hard-coded" into the SQL), and execute them, but I don't know if that would improve anything.

    Do you have any suggestion to optimize such a process?

    Thanks

  • user1066101
    user1066101 about 15 years
    The only potential issue is overriding the save() method. If you do this, you'll have to think twice about your design.
  • Tom Leys
    Tom Leys about 15 years
    More likely the problem is that the server is so busy processing this input it cannot handle any other requests.
  • Roee Adler
    Roee Adler about 15 years
    @S.Lott: what do you mean by "overriding the save()"? Do you mean whether I override the .save() method in the model class so that there are pre/post-processing tasks taking place when saving through the code, which would be lost with "load data infile"? If so - that's not the case, I'm not overriding .save(). Otherwise please elaborate... Thanks
  • Roee Adler
    Roee Adler about 15 years
    @Sean: Thanks, by "prepared statements" do you mean SQL code with many %s elements that I just "fill" by providing the list of strings/numbers? Also, please take a look at my code (in the body of the question) - if I understand correctly I'm already using a single transaction with the @transaction.commit_on_success decorator (I'm using InnoDB)
  • Roee Adler
    Roee Adler about 15 years
    @staticsan: Do you think there's any "practical" limitation to such a statement? i.e. can I send the database a single INSERT query with 10k lines of text?
  • Roee Adler
    Roee Adler about 15 years
    Thanks, I added a clarification that I'm using InnoDB
  • Deva
    Deva about 15 years
    The only real limitation is the size of the network buffer. The default value of this was 1 MB for many years, but many people raised it to the maximum of 16 MB. More recent versions of MySQL can support even larger packet sizes.
  • Deva
    Deva about 15 years
    It would need to be LOAD DATA LOCAL INFILE unless the code is running on the database server.
  • Sean McSomething
    Sean McSomething about 15 years
    I'm not really sure what's going on behind the scenes with Django - I'm just coming from a generic background of using MySQL so I don't know what that's doing regarding transactions. As for prepared statements - it looks like that's an implementation detail of your DataElement objects. A prepared statement would be : stmt = Prepare(sqlStatement); stmt.execute(var1, var2..) rather than db.execute(sqlStatement, var1, var2...) - it's like compiling regular expressions rather than parsing them every time.
  • pufferfish
    pufferfish almost 15 years
    Also, disable indexes before loading, and then enable them afterwards (it will take a while to rebuild the index). I haven't checked whether it helps with Django inserts too.
  • Will
    Will almost 11 years
    It's more about the packet size than the number of records. As you build your insert buffer, don't add any more rows if doing so would push the buffer over the max MySQL packet size. I'd do some benchmarking and see where the benefit starts leveling off. You can also ask your MySQL server for its maximum packet size (see the sketch below): mysql> SELECT @@max_allowed_packet\G returns @@max_allowed_packet: 33554432
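
Building on that last comment, a small sketch for reading the server's limit from Python so batch sizes can be chosen to stay under it (the 80% headroom factor is an arbitrary choice):

from django.db import connection

# Sketch: read the server's packet limit and leave some headroom
# when sizing each multi-row INSERT statement.
cursor = connection.cursor()
cursor.execute("SELECT @@max_allowed_packet")
max_packet_bytes = cursor.fetchone()[0]
safe_statement_bytes = int(max_packet_bytes * 0.8)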