Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?


Solution 1

Here is the SQLAlchemy Core version of your MySQLdb script, which runs in four seconds, compared to three for raw MySQLdb:

from sqlalchemy import Integer, Column, create_engine, MetaData, Table
import datetime

metadata = MetaData()

foo = Table(
    'foo', metadata,
    Column('id', Integer, primary_key=True),
    Column('a', Integer(), nullable=False),
    Column('b', Integer(), nullable=False),
    Column('c', Integer(), nullable=False),
)


class Foo(object):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)
start = datetime.datetime.now()

with engine.connect() as conn:
    foos = [
        Foo(row['a'], row['b'], row['c'])
        for row in
        conn.execute(foo.select().limit(1000000)).fetchall()
    ]


print "total time: ", datetime.datetime.now() - start

runtime:

total time:  0:00:04.706010

Here is a script that uses the ORM to load full objects; by using yield_per() to avoid building one fixed list of all 1M objects at once, this runs in 13 seconds with SQLAlchemy master (18 seconds with rel 0.9):

import time
from sqlalchemy import Integer, Column, create_engine, Table
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Foo(Base):
    __table__ = Table(
        'foo', Base.metadata,
        Column('id', Integer, primary_key=True),
        Column('a', Integer(), nullable=False),
        Column('b', Integer(), nullable=False),
        Column('c', Integer(), nullable=False),
    )


engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)

sess = Session(engine)

now = time.time()

# avoid using all() so that we don't have the overhead of building
# a large list of full objects in memory
for obj in sess.query(Foo).yield_per(100).limit(1000000):
    pass

print("Total time: %d" % (time.time() - now))

We can then split the difference between these two approaches, and load just individual columns with the ORM:

for obj in sess.query(Foo.id, Foo.a, Foo.b, Foo.c).yield_per(100).limit(1000000):
    pass

The above again runs in 4 seconds.

SQLAlchemy Core is the more apt comparison to a raw MySQLdb cursor. If you use the ORM but query for individual columns, it is about four seconds in the most recent versions.

At the ORM level, the speed issue is that creating objects in Python is slow, and the SQLAlchemy ORM applies a large amount of bookkeeping to these objects as it fetches them, which is necessary for it to fulfill its usage contract: unit of work, identity map, eager loading, collections, etc.
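For example, the identity map guarantees that two queries for the same primary key hand back the very same Python object; a minimal sketch, assuming the Foo mapping and session from the script above:

# the first get() emits a SELECT; the second is served from the
# identity map without a new query
a = sess.query(Foo).get(1)
b = sess.query(Foo).get(1)
assert a is b  # one object, tracked once by the Session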

To speed up the query dramatically, fetch individual columns instead of full objects; see the techniques described at http://docs.sqlalchemy.org/en/latest/faq/performance.html#result-fetching-slowness-orm.
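One such technique is the Bundle construct, which packages selected columns into lightweight named tuples instead of full instances; a sketch, reusing the mapping above:

from sqlalchemy.orm import Bundle

# columns come back as lightweight tuples, skipping the identity-map
# and unit-of-work bookkeeping done for full Foo instances
cols = Bundle('foo', Foo.a, Foo.b, Foo.c)
for row in sess.query(cols).yield_per(100).limit(1000000):
    row.foo.a  # attribute-style access without ORM instance overhead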

For your comparison to PeeWee: PW is a much simpler system with far fewer features; for one, it does nothing with identity maps. Even PeeWee, about as simple an ORM as is feasible, still takes 15 seconds, which is evidence that cPython is really, really slow compared to the raw MySQLdb fetch, which runs in straight C.

For the comparison to Java: the Java VM is vastly faster than cPython. Hibernate is ridiculously complicated, yet the JVM is extremely fast thanks to its JIT, and even all that complexity ends up running faster. If you want to compare Python to Java, use PyPy.

Solution 2

SQLAlchemy is complicated. It has to deal with converting types to Python which the underlying database does not support natively, tables with inheritance, JOINs, caching the objects, maintaining consistency, translated rows, partial results, and whatnot. Check out sqlalchemy/orm/loading.py:instance_processor -- it's insane.

The solution would be to piece together and compile Python code to process the results of a specific query, like Jinja2 does for templates. So far, nobody has done this work, possibly because the common case is a couple of rows (where this kind of optimization would be pessimal) and people who need to process bulk data do that by hand, like you did.
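To make the idea concrete, here is a minimal, hypothetical sketch of such a compiled row processor; the function name and shape are illustrative inventions, not an existing library API:

# build a processor function specialized to one query's column count,
# the way Jinja2 compiles a template once and then reuses it per render
def compile_processor(cls, n_columns):
    args = ", ".join("row[%d]" % i for i in range(n_columns))
    src = "def process(row):\n    return cls(%s)\n" % args
    namespace = {"cls": cls}
    exec(src, namespace)         # compile the specialized function once
    return namespace["process"]  # ...then call it for every row

# usage: process = compile_processor(Foo, 3)
#        foos = [process(row) for row in cursor]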

Solution 3

This is not an answer to my question, but it may help the general public with speed issues on large data sets. I found that selecting a million records typically takes about 3 seconds, but JOINs can slow the process down. In my case, where approximately 150k Foos have a 1-many relation to 1M Bars, selecting both via a JOIN is slow because each Foo is returned approximately 6.5 times. I found that selecting both tables separately and joining them using dicts in Python is approximately 3 times faster than SQLAlchemy (approx. 25 sec) and 2 times faster than 'bare' Python code using JOINs (approx. 17 sec); the code below took 8 sec in my use case. Selecting 1M records without relations, like the Bar example, took 3 seconds. I used this code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb  # supplies the connection/cursor passed to fetch()
import inspect
from operator import itemgetter, attrgetter

# fetch all objects of class Class, where the selected fields are taken
# from the arguments of the __init__ constructor (not flexible, but
# fairly simple ;)); returns a dict mapping the id field(s) to objects
def fetch(Class, cursor, tablename, ids=["id"], where=None):
    arguments = inspect.getargspec(Class.__init__).args
    del arguments[0]  # drop 'self'
    fields = ", ".join("`%s`.`%s`" % (tablename, column) for column in arguments)
    sql = "SELECT " + fields + " FROM `" + tablename + "`"
    if where is not None:
        sql += " WHERE " + where
    sql += ";"
    getId = itemgetter(*[arguments.index(x) for x in ids])
    elements = dict()

    cursor.execute(sql)
    for record in cursor:
        elements[getId(record)] = Class(*record)
    return elements

# attach the objects in dict2 to dict1, given a 1-many relation between both
def merge(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1:
        setattr(dict1[d], fieldname, list())
    for d in dict2:
        dd = dict2[d]
        getattr(dict1[idExtractor(dd)], fieldname).append(dd)

# attach dict2 objects to dict1 objects, given a 1-1 relation
def attach(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1:
        dd = dict1[d]
        setattr(dd, fieldname, dict2[idExtractor(dd)])
It helped me speed up my querying; however, I am more than happy to hear from the experts about possible improvements to this approach.


Comments

  • Herbert, over 4 years ago

    I noticed that SQLAlchemy was slow fetching (and ORMing) some data, which was rather fast to fetch using bare bone SQL. First off, I created a database with a million records:

    mysql> use foo
    mysql> describe Foo;
    +-------+---------+------+-----+---------+-------+
    | Field | Type    | Null | Key | Default | Extra |
    +-------+---------+------+-----+---------+-------+
    | id    | int(11) | NO   | PRI | NULL    |       |
    | A     | int(11) | NO   |     | NULL    |       |
    | B     | int(11) | NO   |     | NULL    |       |
    | C     | int(11) | NO   |     | NULL    |       |
    +-------+---------+------+-----+---------+-------+
    mysql> SELECT COUNT(*) FROM Foo;
    +----------+
    | COUNT(*) |
    +----------+
    |  1000000 |
    +----------+
    mysql> 
    

    As a crude test, querying all Foo's takes approximately 2 seconds:

    herbert@dev0 ~ $ date; echo 'use foo; select * from Foo;' | mysql -uroot -pxxx > /dev/null; date
    zo apr 20 18:48:49 CEST 2014
    zo apr 20 18:48:51 CEST 2014
    

    If I do this in Python using MySQLdb, it takes approximately 3 seconds, including the construction of the Foo objects:

    herbert@dev0 ~ $ python BareORM.py 
    query execution time:  0:00:02.198986
    total time:  0:00:03.403084
    

    Which is the output of:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    
    import MySQLdb
    import sys
    import datetime
    
    class Foo:
        def __init__(self, a, b, c):
            self.a = a; self.b = b; self.c = c
    
    con = None
    try:
        start = datetime.datetime.now()
        con = MySQLdb.connect('localhost', 'root', 'xxx', 'foo')
        cur = con.cursor()
    
        cur.execute("""SELECT * FROM Foo LIMIT 1000000""")
        print "query execution time: ", datetime.datetime.now() - start
        foos = []
        for elem in cur:
            foos.append(Foo(elem[1], elem[2], elem[3]))
        con.commit()
    
    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])
        sys.exit(1)
    
    finally:
        # con starts as None, so this check is safe even if connect() fails
        if con: con.close()
        print "total time: ", datetime.datetime.now() - start
    

    However, when I used SQLAlchemy to reduce the boilerplate code, it needed approximately 25 seconds to do the same job:

    herbert@dev0 ~ $ python AlchemyORM.py 
    total time:  0:00:24.649279
    

    Using this code:

    import sqlalchemy
    import datetime
    import MySQLdb
    
    from sqlalchemy import Column, Integer, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker, relationship, backref
    
    Base = declarative_base()
    
    class Foo(Base):
        __tablename__ = 'Foo'
        id = Column(Integer, primary_key=True)
        A  = Column(Integer, nullable=False)  # plain Integer: the generic type takes no 'unsigned' argument
        B  = Column(Integer, nullable=False)
        C  = Column(Integer, nullable=False)
    
    engine  = create_engine('mysql+mysqldb://root:xxx@localhost/foo')
    Session = sessionmaker(bind=engine)
    session = Session()
    start = datetime.datetime.now()
    foos  = session.query(Foo).limit(1000000).all()
    print "total time: ", datetime.datetime.now()-start
    

    Why does SQLAlchemy operate ~10x slower than the bare SQL solution, assuming that SQLAlchemy should do approximately the same thing? Can I speed things up somehow?

    This is a minimal working example of a more complicated query, which joins several tables using eager loading. I was considering just doing simple queries on a single table, and then using dictionaries to create id->object maps and collate one-to-N relations. But before doing so, I want to be sure that SQLAlchemy is unable to perform better, because writing your own ORM is a bad idea from a software design point of view. Imho, a 2x slowdown would be acceptable (maybe).

    If you know about other (faster) python-SQL ORM's, or maybe BigTable-alike solutions (that already are the ORM), feel free to mention them as a comment.

    EDIT: Also tried this with Peewee, which resulted in ~15 s.

    from peewee import *
    import datetime
    
    database = MySQLDatabase("foo", host="localhost", port=3306, user="root", passwd="xxx")
    
    class Foo(Model):
        id = IntegerField()
        A  = IntegerField()
        B  = IntegerField()
        C  = IntegerField()
    
        class Meta:
            db_table = 'Foo'
            database = database
    
    start = datetime.datetime.now()
    foos = Foo.select()
    cnt = 0
    for i in foos: cnt += 1
    print "total time: ", datetime.datetime.now() - start
    

    EDIT: As a response to Matthias I tried to do the same thing in Java with Hibernate, the result is approximately 8 to 10 seconds, not exactly fast, but a lot faster than 25 seconds. The code, starting with some classes and ending with some configuration:

    package herbert.hibernateorm;
    
    import java.util.List;
    
    import org.hibernate.Session; 
    import org.hibernate.Transaction;
    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;
    
    public class App {
       public static void main(String[] args) throws Exception {
          SessionFactory factory = new Configuration().configure().buildSessionFactory();
          Session session = factory.openSession();
          Transaction tx = session.beginTransaction();
          long start = System.currentTimeMillis();
          List foos = session.createQuery("FROM Foo").list(); 
          System.out.println(foos.size());
          System.out.printf("total time: %d\n", System.currentTimeMillis() - start);
          session.close();
       }
    }
    
    package herbert.hibernateorm;
    
    public class Foo {
        private int id, a, b, c;
        public Foo() {}
        public Foo(int A, int B, int C) { this.a=A; this.b=B; this.c=C; }
    
        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public int getA() { return a; }
        public void setA(int a) { this.a = a; }
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
        public int getC() { return c; }
        public void setC(int c) { this.c = c; }
    }
    

    The configuration (hibernate.cfg.xml and hibernate.hbm.xml respectively)

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE hibernate-configuration PUBLIC "-//Hibernate/Hibernate Configuration DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">
    <hibernate-configuration>
      <session-factory>
        <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
        <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
        <property name="hibernate.connection.url">jdbc:mysql://localhost:3306/foo?zeroDateTimeBehavior=convertToNull</property>
        <property name="hibernate.connection.username">root</property>
        <property name="hibernate.connection.password">xxx</property>
        <mapping resource="hibernate.hbm.xml"/>
      </session-factory>
    </hibernate-configuration>
    
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
    <hibernate-mapping>
        <class name="herbert.hibernateorm.Foo" table="Foo" catalog="foo">
            <id name="id" type="int">
                <column name="id" />
                <generator class="assigned" />
            </id>
            <property name="a" type="int">
                <column name="A" not-null="true" />
            </property>
            <property name="b" type="int">
                <column name="B" not-null="true" />
            </property>
            <property name="c" type="int">
                <column name="C" not-null="true" />
            </property>
        </class>
    </hibernate-mapping>
    

    And finally the pom file to run it all in maven:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>herbert</groupId>
        <artifactId>hibernateORM</artifactId>
        <version>1.0-SNAPSHOT</version>
        <packaging>jar</packaging>
    
        <name>hibernateORM</name>
        <url>http://maven.apache.org</url>
        <repositories>
            <repository>
                <id>unknown-jars-temp-repo</id>
                <name>A temporary repository created by NetBeans for libraries and jars it could not identify. Please replace the dependencies in this repository with correct ones and delete this repository.</name>
                <url>file:${project.basedir}/lib</url>
            </repository>
        </repositories>
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>3.8.1</version>
                <scope>test</scope>
            </dependency>
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
                <version>5.1.21</version>
            </dependency>
            <dependency>
                <groupId>org.hibernate</groupId>
                <artifactId>hibernate-core</artifactId>
                <version>4.0.1.Final</version>
            </dependency>
            <dependency>
                <groupId>org.hibernate</groupId>
                <artifactId>hibernate-entitymanager</artifactId>
                <version>4.0.1.Final</version>
            </dependency>
            <dependency>
                <groupId>org.hibernate.common</groupId>
                <artifactId>hibernate-commons-annotations</artifactId>
                <version>4.0.1.Final</version>
            </dependency>   
            <dependency>
                <groupId>nz.ac.waikato.cms.weka</groupId>
                <artifactId>weka-dev</artifactId>
                <version>3.7.10</version>
            </dependency>
            <dependency>
                <groupId>commons-configuration</groupId>
                <artifactId>commons-configuration</artifactId>
                <version>1.9</version>
            </dependency>
            <dependency>
                <groupId>commons-net</groupId>
                <artifactId>commons-net</artifactId>
                <version>3.1</version>
                <classifier>examples</classifier>
            </dependency>
            <dependency>
                <groupId>com.google.code.gson</groupId>
                <artifactId>gson</artifactId>
                <version>2.2.2</version>
            </dependency>
            <dependency>
                <groupId>maven</groupId>
                <artifactId>maven-jetty-plugin</artifactId>
                <version>1.1</version>
                <type>plugin</type>
            </dependency>
            <dependency>
                <groupId>commons-io</groupId>
                <artifactId>commons-io</artifactId>
                <version>2.4</version>
            </dependency>
            <dependency>
                    <groupId>com.kenai.nbpwr</groupId>
                    <artifactId>org-slf4j-jdk14</artifactId>
                    <version>1.6.1-201106101300</version>
                    <type>nbm</type>
            </dependency>
    
        </dependencies>
    </project>
    
  • Herbert, about 10 years ago
    My bare SQL example also needs to convert the INT(11) values to Python integers; I assume MySQLdb does that. Furthermore, the time SQLAlchemy needs to figure out inheritance and joins should not scale linearly with the number of results (100k takes ~2 sec where 1M takes ~24 sec). Personally, I am not willing to pay speed for inheritance. As you can see, I also tried my example in Java, which takes ~8-10 s. If I may summarize, your answer is: "No, SQLAlchemy is not fast; if you want speed, you need to create the object/SQL code yourself."
  • Herbert, about 10 years ago
    I like your feedback, but it is not very satisfying when SQLAlchemy claims a "Mature, High Performing Architecture" (sqlalchemy.org/features.html); imho it is not high performance if it is 10x slower than bare SQL plus simple OO. I guess the design decisions not to create schemas and not to synthesize Python code are fair, but given the inherent performance decrease, I think it is unfair to call SQLAlchemy "high performance". Nevertheless, I would be very pleased if someone contradicts me! ;)
  • coleifer, about 10 years ago
    Ditto for peewee with regards to type conversions, relevant code: github.com/coleifer/peewee/blob/master/peewee.py#L1523-L1658
  • Matthias Urlichs, about 10 years ago
    The question is, do you need the speed with SQLAlchemy's ORM badly enough to (co-)fund developing a Python code synthesizer for it? Up to now, apparently nobody has.
  • Herbert, about 10 years ago
    My point is that SQLAlchemy is not "high performance", and this speed trade-off should be mentioned more prominently on the features page. Not doing so tricks people into writing object models in SQLAlchemy when it may not be the best solution at all. I tried looking into writing a small framework that would compile to C++, but that seems only marginally faster than just using Python's dictionaries. My approach: join all wanted tables, create an itemgetter for all fields and primary keys, create a dict PK->object for each join, then 'relate' the objects using these dicts. I'm not yet sure how fast this will be.
  • Herbert, almost 10 years ago
    Thank you for your per-field suggestion and explanations! However, my general problem is that 1M records is not a lot; in fact, it is a very small amount considering that we are talking about approximately 4 * (11 bits) * 1M = 5.5 megabytes. Loading that amount of data into memory is normally really fast; consider, for example, imread() in Matlab on an image of that size, which takes 0.058425 seconds. Your point, however, was that the 'logic' consumes the time, yet my BareORM example shows that simple logic can also be fast.
  • Herbert, almost 10 years ago
    I think SQLAlchemy is not a "Mature, High Performing Architecture" (as they state in their feature list), as it fails to be fast in my simple example, especially since the logic could easily be implemented in straight C(++) and still be part of a Python API. I also think SQL fails, as it returns [(A,B)] (a list of tuples) instead of [(A,[B])] whenever you do a one-to-many join, but I have not found an ORM that compensates for this within reasonable time.
  • Herbert, almost 10 years ago
    But fair enough: you do answer the question of why SQLAlchemy is slow; unfortunately, I still think it is not reasonable that it is ;) Moreover, my answer shows a small Python implementation of a different approach, where the one-to-many relations can be merged efficiently using two fetches and a merge on the result. I just don't get why there is no database/ORM combination specialized in efficiently loading lists of objects and then joining them efficiently, both language-specific and API-wise.
  • zzzeek, almost 10 years ago
    Your comments aren't reasonable because you are expecting Python to perform as fast as straight C code. Comparing it to things like Matlab, Java, etc. is completely unfair, but again this has little to do with SQLAlchemy. SQLAlchemy is very, very fast. It's just that users tend to be unaware of just how much functionality is being delivered, and confuse an ORM result set with that of a raw database cursor. They are quite different, and SQLAlchemy offers many options for controlling the mixture of "raw" vs. "automated".
  • zzzeek, almost 10 years ago
    For the "two fetches, then merge on result" pattern, SQLAlchemy offers this: see docs.sqlalchemy.org/en/latest/orm/loading.html. SQLAlchemy is the only ORM in the Python world I'm familiar with that offers these kinds of patterns, largely because we are the only ORM that stores collections persistently instead of loading on access every time.
  • zzzeek, almost 10 years ago
    If you are looking for an ORM that is in Python yet uses C to a huge degree, see the Storm ORM. It doesn't offer any of the eager-loading features of SQLAlchemy, but from an implementation perspective it is as fast as you're going to get: it's as minimalist as PeeWee, but its entire persistence engine provides a C-based backend. Run your benchmarks against that, and that's the best you're going to do. Or just use PyPy.
  • Herbert, almost 10 years ago
    First: creating objects in Python may be slow, but it is not the bottleneck, as it only adds 1 second in my BareORM.py example. Second: your first suggestion is only slightly better than the BareORM.py example, as it creates a dict instead of a list, allowing indexing by field name instead of by position. I guess you lose all the benefits of an ORM if you use it like this. Your second suggestion simply states that it is fast as long as you don't actually create the appropriate data structure in memory, which is exactly what an ORM should add on top of the default MySQL API.
  • Herbert, almost 10 years ago
    The main problem with your solution is that you try to reenact what the BareORM.py example does using SQLAlchemy, which is not what I want. I want ORM functionality, and I want it to be fast. I think the only bottleneck that should exist when loading data from a database is the transport from disk to memory. Everything else, especially when dealing with only 1M records, should imho be a detail, time-wise. Maybe SQL is an inefficient way to store objects, or maybe Python is inefficient, but imho these may not be excuses to accept such speed losses.
  • zzzeek, almost 10 years ago
    so your statement is that, the need to apply a single directive "yield_per(100)", which doubles the speed of the query, is unreasonable. I look forward to your Python ORM implementation that solves these issues in a superior way! Good luck to you.
  • Herbert, almost 10 years ago
    That is not at all what I said. If I am not mistaken, your first implementation is fast because you retrieve records and not objects, avoiding ORM functionality. Your second implementation avoids actual object creation; if you replace the "pass" by appending the obj to a list, will it still be comparably fast?
  • Herbert, almost 10 years ago
    Also, I do think adding yield_per(100) is a bit strange; how should I determine the optimal or a reasonable value instead of 100? And why is doing 100 things 10k times faster than doing 1M things once?