pandas read_sql is unusually slow
pandas.read_sql can be slow when loading a large result set. In that case, you can give our tool ConnectorX a try (pip install -U connectorx). We provide the same read_sql functionality and aim to improve performance in both speed and memory usage. In your example, you can switch to it like this:
import time

import connectorx as cx

start_time = time.time()
q_crash = 'SELECT <query string> FROM table1'
q_vehicle = 'SELECT <query string> FROM table2'
q_person = 'SELECT <query string> FROM table3'
db_url = "mysql://user:password@host:port/dbasename"

# read_sql takes the connection string first, then the query
crash = cx.read_sql(db_url, q_crash)
print('Read_sql time for table 1: {:.1f}'.format(time.time() - start_time))
vehicle = cx.read_sql(db_url, q_vehicle)
print('Read_sql time for table 2: {:.1f}'.format(time.time() - start_time))
person = cx.read_sql(db_url, q_person)
print('Read_sql time for table 3: {:.1f}'.format(time.time() - start_time))
Furthermore, you can leverage multiple cores on your client machine by specifying a partition column (partition_on) and a partition number (partition_num); ConnectorX will then split the original query and fetch the result of each split in parallel. You can find some examples of how to do it here, and a sketch follows below.
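As a minimal sketch of what a partitioned read looks like, assuming table1 has a numeric column named crash_id (a hypothetical name for illustration; use an integer column from your own schema that appears in the query's result):

import connectorx as cx

db_url = "mysql://user:password@host:port/dbasename"
q_crash = 'SELECT <query string> FROM table1'

# Split the query on an integer column and fetch the pieces in parallel.
crash = cx.read_sql(
    db_url,
    q_crash,
    partition_on="crash_id",  # assumed numeric column to split the value range on
    partition_num=4,          # number of splits, i.e. degree of parallelism
)

Under the hood, ConnectorX queries the min and max of the partition column, splits that value range into partition_num sub-queries, and fetches them concurrently before assembling a single dataframe.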
This is our benchmark result using 4 threads on MySQL, fetching 60M rows x 16 columns of data (benchmark chart not reproduced here).
Comments
-
ale19, almost 2 years ago
I'm trying to read several columns from three different MySQL tables into three different dataframes.
It doesn't take long to read from the database, but actually putting them into a dataframe is fairly slow.
import time

import pandas as pd
from sqlalchemy import create_engine

start_time = time.time()
print('Reading data from database...')

q_crash = 'SELECT <query string> FROM table1'
q_vehicle = 'SELECT <query string> FROM table2'
q_person = 'SELECT <query string> FROM table3'

engine = create_engine('mysql+pymysql://user:password@host:port/dbasename')
print('Database time: {:.1f}'.format(time.time() - start_time))

crash = pd.read_sql_query(q_crash, engine)
print('Read_sql time for table 1: {:.1f}'.format(time.time() - start_time))
vehicle = pd.read_sql_query(q_vehicle, engine)
print('Read_sql time for table 2: {:.1f}'.format(time.time() - start_time))
person = pd.read_sql_query(q_person, engine)
print('Read_sql time for table 3: {:.1f}'.format(time.time() - start_time))
Output:
Reading data from database...
Database time: 0.0
Read_sql time for table 1: 13.4
Read_sql time for table 2: 30.9
Read_sql time for table 3: 49.4
Is this normal? The tables are quite large: table 3 is over 601,000 rows. But pandas has handled larger datasets without a hitch whenever I use read_csv.
-
Dave Kielpinski, about 4 years ago
This doesn't actually answer the question.
-
James Robinson, almost 4 years ago
I upvoted this comment, but I feel bad about it. The answer may well be the right one; trying to be helpful on SO can be hard. Many people use the tool they know over the right tool for the job, and it can be hard to be helpful: "I asked for help on how to dig a trench with a spoon, I didn't ask about using a 'tractor' or whatever that is."