Fastest way to read huge MySQL table in python

Solution 1

Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.

Without knowing much about pandas chunking, I think you would have to do the chunking manually (how depends on the data)... Don't use LIMIT / OFFSET, though - the performance would be terrible.

This might or might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a timeseries, or if there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show the different cases.

Example 1

import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    try:
        # let the driver quote/escape the value instead of formatting it into the string
        query = "SELECT * FROM example_table WHERE col_x = %s"
        return pd.read_sql(query, connection, params=(y,))
    finally:
        connection.close()

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

data = p.map(worker, list(col_x_categories))
# assuming there is a reasonable number of categories in an indexed col_x

p.close()
results = pd.concat(data)

Example 2

import datetime
import multiprocessing

import pandas as pd
import MySQLdb

def worker(a, b):
    # where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    try:
        # let the driver quote/escape the timestamps
        query = "SELECT * FROM example_table WHERE x >= %s AND x < %s"
        return pd.read_sql(query, connection, params=(a, b))
    finally:
        connection.close()

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary and will depend on knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)

date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)
# starmap unpacks each (start, end) pair into the two worker arguments

p.close()
results = pd.concat(data)

There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.

Solution 2

You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe).

pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.

Usage is basically the same as pymysql:

import MySQLdb

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')

Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
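
If you are feeding the results into pandas anyway, one way to plug mysqlclient in is through an SQLAlchemy engine, which pd.read_sql accepts directly. This is just a sketch with placeholder credentials and an assumed example_table, mirroring the question's chunked read:

import pandas as pd
from sqlalchemy import create_engine

# "mysql+mysqldb" tells SQLAlchemy to use the mysqlclient (MySQLdb) driver;
# the credentials and table name below are placeholders
engine = create_engine("mysql+mysqldb://user:password@host/database")

chunks = []
for chunk in pd.read_sql("SELECT * FROM example_table", engine, chunksize=100000):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)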

Solution 3

For those using Windows and having trouble installing MySQLdb: I use this approach to fetch data from a huge table.

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

offset = 0   # start at 0, otherwise the first row is skipped
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(offset, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()

    if not rows:       # break the loop when no more rows
        print("Done!")
        break

    for row in rows:   # do something with the results
        print(row)

    offset += limit
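
One caveat with this pattern (also raised in the comments below): MySQL has to scan and discard all the rows before the OFFSET, so each successive page gets slower on a huge table. A keyset-style variant avoids that. This is only a sketch and assumes super_table has an indexed auto-increment primary key id that is the first selected column:

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

last_id = 0   # assumes an auto-increment primary key `id`
limit = 1000

while True:
    # seek on the indexed id instead of using OFFSET, so every page stays cheap
    cursor.execute(
        "SELECT * FROM super_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, limit),
    )
    rows = cursor.fetchall()

    if not rows:       # no more rows
        print("Done!")
        break

    for row in rows:   # do something with the results
        print(row)

    last_id = rows[-1][0]   # id is assumed to be the first column in each row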

Comments

  • UgoL
    UgoL over 1 year

    I was trying to read a very large MySQL table made of several million rows. I used the pandas library with chunks. See the code below:

    import pandas as pd
    import numpy as np
    import pymysql.cursors
    
    connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    
    try:
        with connection.cursor() as cursor:
            query = "SELECT * FROM example_table;"
    
            chunks=[]
    
            for chunk in pd.read_sql(query, connection, chunksize = 1000):
                chunks.append(chunk)
            #print(len(chunks))    
            result = pd.concat(chunks, ignore_index=True)
            #print(type(result))
            #print(result)
    
    finally:
        print("Done!")
    
        connection.close()
    

    Actually the execution time is acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example 1 million rows), the execution time increases dramatically.

    Maybe there is a better/faster way to select the data from a relational database within Python?

  • UgoL
    UgoL over 5 years
    Thanks @djmac for this suggestion. I'll give it a try. What can I take into consideration to speed up the reading process in general? Maybe this chunked solution is not well optimized?
  • djmac
    djmac over 5 years
    Honestly, I think it mainly depends on the database structure (and the data). If you are really doing a SELECT * on a whole table, maybe a different data store (i.e. not MySQL) could be better. Otherwise, making sure you have the right index(es) can make a big difference (the EXPLAIN statement can help you out here and provide guidance on good tweaks).
  • djmac
    djmac over 5 years
    Okay. I am not so familiar with the internals of pandas - but the MySQL-python connector (on which mysqlclient is based) is about 1000% faster than pymysql (according to this: wiki.openstack.org/wiki/PyMySQL_evaluation).
  • djmac
    djmac over 5 years
    Another option would be to consider parallelisation (e.g. using the multiprocessing module and a pool of workers). For example, you could have multiple workers selecting smaller chunks and stitching them together with pd.concat. You could even shard (e.g. vertically partition) the database and get even better performance - though if the bottleneck is pandas, that might not be as good a bang for your buck.
  • UgoL
    UgoL over 5 years
    Thank you for the last suggestions... Can you show me a little example of a multiprocessing application based on my current experimental code?
  • djmac
    djmac over 5 years
    Sure - will put it as another answer
  • UgoL
    UgoL over 5 years
    I don't think I understand exactly how these examples work... let's take the first one. You map and call the function worker for each item of the list col_x_categories. But where do we define y and col_x_categories? And I suppose that this map operation is parallelized across n processes.
  • han058
    han058 almost 4 years
    It will be slow later on because of LIMIT: with a large offset, a full table scan occurs. A JOIN is a better solution, like this: "SELECT id, xValue, yValue, zValue, createdAt FROM (SELECT id FROM Position LIMIT 2000000, 1000) q JOIN Position p ON p.id = q.id"
  • SteveS
    SteveS about 2 years
    Is it doable with Vaex?