pandas read_sql is unusually slow

pandas.read_sql can be slow when loading a large result set. In that case, you can give our tool ConnectorX a try (pip install -U connectorx). It provides the same read_sql functionality and aims to improve performance in both speed and memory usage.

In your example you can switch to it like this:

import time

import connectorx as cx

q_crash = 'SELECT <query string> FROM table1'
q_vehicle = 'SELECT <query string> FROM table2'
q_person = 'SELECT <query string> FROM table3'
db_url = "mysql://user:password@host:port/dbasename"

# Note: cx.read_sql takes the connection string first, then the query.
start_time = time.time()
crash = cx.read_sql(db_url, q_crash)
print('Read_sql time for table 1: {:.1f}'.format(time.time() - start_time))

start_time = time.time()  # reset so each table is timed on its own
vehicle = cx.read_sql(db_url, q_vehicle)
print('Read_sql time for table 2: {:.1f}'.format(time.time() - start_time))

start_time = time.time()
person = cx.read_sql(db_url, q_person)
print('Read_sql time for table 3: {:.1f}'.format(time.time() - start_time))

Furthermore, you can leverage multiple cores on your client machine by specifying a partition column (partition_on) and a partition number (partition_num); ConnectorX then splits the original query and fetches each partition's result in parallel. The ConnectorX documentation has examples of how to set this up, and a minimal sketch follows below.
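
For example, a partitioned read of the first table might look like this sketch. The integer column ID passed to partition_on is an assumption for illustration; substitute a numeric (ideally indexed) column from your own table:

import connectorx as cx

db_url = "mysql://user:password@host:port/dbasename"

# Hypothetical partition column: assumes table1 has an integer column ID.
# ConnectorX splits the query into 4 range subqueries over ID and fetches
# them in parallel, one per thread.
crash = cx.read_sql(db_url, 'SELECT <query string> FROM table1',
                    partition_on='ID', partition_num=4)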

This is our benchmark result using 4 threads on MySQL fetching 60M rows x 16 columns of data:

[Benchmark figures: MySQL query time and MySQL memory usage]

Comments

  • ale19, almost 2 years

    I'm trying to read several columns from three different MySQL tables into three different dataframes.

    It doesn't take long to read from the database, but actually putting them into a dataframe is fairly slow.

    import time

    import pandas as pd
    from sqlalchemy import create_engine

    start_time = time.time()
    print('Reading data from database...')

    q_crash = 'SELECT <query string> FROM table1'
    q_vehicle = 'SELECT <query string> FROM table2'
    q_person = 'SELECT <query string> FROM table3'
    engine = create_engine('mysql+pymysql://user:password@host:port/dbasename')

    # create_engine is lazy: no connection or query has run yet, so this is ~0
    print('Database time: {:.1f}'.format(time.time() - start_time))

    # Note: the timings below are cumulative from start_time
    crash = pd.read_sql_query(q_crash, engine)
    print('Read_sql time for table 1: {:.1f}'.format(time.time() - start_time))
    vehicle = pd.read_sql_query(q_vehicle, engine)
    print('Read_sql time for table 2: {:.1f}'.format(time.time() - start_time))
    person = pd.read_sql_query(q_person, engine)
    print('Read_sql time for table 3: {:.1f}'.format(time.time() - start_time))
    

    Output:

    Reading data from database...
    Database time: 0.0
    Read_sql time for table 1: 13.4
    Read_sql time for table 2: 30.9
    Read_sql time for table 3: 49.4
    

    Is this normal? The tables are quite large: table 3 is over 601,000 rows. But pandas has handled larger datasets without a hitch whenever I use read_csv. (A chunked-read workaround is sketched after these comments.)

  • Dave Kielpinski, about 4 years
    doesn't actually answer the question
  • James Robinson, almost 4 years
    I upvoted this comment, but I feel bad about it. The answer may well be the right one; trying to be helpful on SO can be hard. Many people use the tool they know over the right tool for the job, and it can be hard to be helpful: "I asked for help on how to dig a trench with a spoon, I didn't ask about using a 'tractor' or whatever that is."
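
On the question above: much of the time goes into pandas materializing the entire result set in memory at once. If switching libraries is not an option, pandas' own chunksize parameter for read_sql_query streams the result in pieces instead. This is a minimal sketch assuming the same connection string and queries as the question; the 50,000-row chunk size is an arbitrary value to tune. It mainly bounds memory rather than total runtime, but it can make large reads tractable:

import time

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@host:port/dbasename')

start_time = time.time()
# Stream the largest table in 50,000-row chunks instead of one big fetch;
# read_sql_query returns an iterator of DataFrames when chunksize is set.
chunks = pd.read_sql_query('SELECT <query string> FROM table3', engine,
                           chunksize=50_000)
person = pd.concat(chunks, ignore_index=True)
print('Chunked read time for table 3: {:.1f}'.format(time.time() - start_time))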