Fastest way to read huge MySQL table in python
Solution 1
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking, I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a time series, or there is some kind of appropriate indexed column to use), it might make sense. I've put in two examples below to show different cases.
Example 1
import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = %s"
    return pd.read_sql(query, connection, params=(y,))

p = multiprocessing.Pool(processes=10)  # or however many processes you want to allocate
# assuming there is a reasonable number of categories in an indexed col_x
data = p.map(worker, [y for y in col_x_categories])
p.close()
results = pd.concat(data)
Example 2
import multiprocessing

import pandas as pd
import MySQLdb

def worker(a, b):
    # a and b are timestamps bounding one chunk of rows
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= %s AND x < %s"
    # convert pandas Timestamps to plain datetimes so MySQLdb can escape them
    return pd.read_sql(query, connection, params=(a.to_pydatetime(), b.to_pydatetime()))

p = multiprocessing.Pool(processes=10)  # or however many processes you want to allocate
# the date range is arbitrary here and depends on knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)  # starmap unpacks each (start, end) pair
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
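One caveat worth adding to the examples above: with the spawn start method (the default on Windows, and on macOS since Python 3.8), the Pool has to be created under a main-module guard, and worker must stay at module level so the child processes can import it. A minimal sketch reusing the names from Example 1 (col_x_categories is still assumed to be defined elsewhere):

if __name__ == "__main__":
    # Pool creation and everything that consumes its results live under the guard
    p = multiprocessing.Pool(processes=10)
    data = p.map(worker, list(col_x_categories))
    p.close()
    p.join()
    results = pd.concat(data)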
Solution 2
You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe). pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries. Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
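For example, the chunked read from the question only needs the connection swapped out - a minimal, untested sketch using the same placeholder credentials and chunk size:

import pandas as pd
import MySQLdb

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
try:
    query = "SELECT * FROM example_table;"
    chunks = []
    # pandas streams the result in 1000-row chunks, as in the original code
    for chunk in pd.read_sql(query, connection, chunksize=1000):
        chunks.append(chunk)
    result = pd.concat(chunks, ignore_index=True)
finally:
    connection.close()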
Solution 3
For those using Windows and having trouble installing MySQLdb: I'm using this approach to fetch data from a huge table.
import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0          # offset of the first row to fetch
limit = 1000   # rows per batch

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not rows:       # break the loop when no more rows
        print("Done!")
        break
    for row in rows:   # do something with results
        print(row)
    i += limit
UgoL
Updated on July 29, 2022
Comments
-
UgoL over 1 year
I was trying to read a very huge MySQL table made of several million rows. I have used the Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')
try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
            #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
Actually the execution time is acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example 1 million rows), the execution time increases dramatically.
Is there perhaps a better/faster way to select the data from a relational database within Python?
-
UgoL over 5 years
Thanks @djmac for this suggestion. I'll give it a try. What should I take into consideration to speed up the reading process in general? Maybe this chunking solution is not so well optimized?
-
djmac over 5 years
Honestly, I think it mainly depends on the database structure (and the data). If you are really doing a SELECT * on a whole table, maybe a different data store (i.e. not MySQL) would be better. Otherwise, making sure you have the right index(es) can make a big difference (the EXPLAIN statement can help you out here and provide guidance on good tweaks).
-
djmac over 5 years
Okay. I am not so familiar with the internals of pandas - but the MySQL-python connector (on which mysqlclient is based) is about 1000% faster than pymysql (according to this wiki.openstack.org/wiki/PyMySQL_evaluation).
-
djmac over 5 years
Another option would be to consider parallelisation (e.g. use the multiprocessing module and a pool of workers). For example, you could have multiple workers selecting smaller chunks and stitching them together with pd.concat. You could even shard (e.g. vertically partition) the database and get even better performance - though if the bottleneck is pandas, that might not be as good bang for your buck.
-
UgoL over 5 years
Thank you for the last suggestions... Can you show me a little example of a multiprocessing application considering my actual experimental code?
-
djmac over 5 years
Sure - will put it as another answer.
-
UgoL over 5 years
I don't think I understand exactly how these examples work... let's take the first one. You map and call the worker function for each item of the list col_x_categories. But where do we define y and col_x_categories? And I suppose this map operation is parallelized across n processes?
-
han058 almost 4 years
It will get slow later on because of the LIMIT: with a LIMIT at a large offset, the server still has to scan past all the skipped rows. A JOIN against a subquery that pages over only the ids is a better solution, like this (a fuller sketch follows after these comments): "SELECT id, xValue, yValue, zValue, createdAt FROM ( SELECT id FROM Position LIMIT 2000000, 1000 ) q JOIN Position p ON p.id = q.id"
-
SteveS about 2 years
Is it doable with Vaex?
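For what it's worth, here is a rough, untested sketch of how the deferred-join pagination from han058's comment could replace the plain LIMIT in Solution 3's loop. The Position table, its columns, and the indexed primary key id are taken from that comment; cursor is the mysql.connector cursor from Solution 3:

i = 0
limit = 1000

while True:
    # the inner query pages over ids only (cheap with the primary-key index);
    # the join then fetches the full rows just for that window of ids
    sql = ("SELECT p.id, p.xValue, p.yValue, p.zValue, p.createdAt "
           "FROM (SELECT id FROM Position LIMIT {}, {}) q "
           "JOIN Position p ON p.id = q.id").format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:  # do something with results
        print(row)
    i += limit

In practice an ORDER BY id inside the subquery would also make the paging deterministic.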