Idiomatic way to do list/dict in Cython?

20,806

Solution 1

Cython now has template support, and comes with declarations for some of the STL containers.

See http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#standard-library

Here's the example they give:

from libcpp.vector cimport vector

cdef vector[int] vect
cdef int i
for i in range(10):
    vect.push_back(i)
for i in range(10):
    print vect[i]

Solution 2

Doing similar operations in Python as in C++ can often be slower. list and dict are actually implemented very well, but you gain a lot of overhead using Python objects, which are more abstract than C++ objects and require a lot more lookup at runtime.

Incidentally, std::vector is implemented in a pretty similar way to list. std::map, though, is actually implemented in a way that many operations are slower than dict as its size gets large. For suitably large examples of each, dict overcomes the constant factor by which it's slower than std::map and will actually do operations like lookup, insertion, etc. faster.

If you want to use std::map and std::vector, nothing is stopping you. You'll have to wrap them yourself if you want to expose them to Python. Do not be shocked if this wrapping consumes all or much of the time you were hoping to save. I am not aware of any tools that make this automatic for you.

There are C API calls for controlling the creation of objects with some detail. You can say "Make a list with at least this many elements", but this doesn't improve the overall complexity of your list creation-and-filling operation. It certainly doesn't change much later as you try to change your list.

My general advice is

  • If you want a fixed-size array (you talk about specifying the size of a list), you may actually want something like a numpy array.

  • I doubt you are going to get any speedup you want out of using std::vector over list for a general replacement in your code. If you want to use it behind the scenes, it may give you a satisfying size and space improvement (I of course don't know without measuring, nor do you. ;) ).

  • dict actually does its job really well. I definitely wouldn't try introducing a new general-purpose type for use in Python based on std::map, which has worse algorithmic complexity in time for many important operations and—in at least some implementations—leaves some optimisations to the user that dict already has.

    If I did want something that worked a little more like std::map, I'd probably use a database. This is generally what I do if stuff I want to store in a dict (or for that matter, stuff I store in a list) gets too big for me to feel comfortable storing in memory. Python has sqlite3 in the stdlib and drivers for all other major databases available.

Solution 3

C++ is fast not just because of the static declarations of the vector and the elements that go into it, but crucially because using templates/generics one specifies that the vector will only contain elements of a certain type, e.g. vector with tuples of three elements. Cython can't do this last thing and it sounds nontrivial -- it would have to be enforced at compile time, somehow (typechecking at runtime is what Python already does). So right now when you pop something off a list in Cython there is no way of knowing in advance what type it is , and putting it in a typed variable only adds a typecheck, not speed. This means that there is no way of bypassing the Python interpreter in this regard, and it seems to me it's the most crucial shortcoming of Cython for non-numerical tasks.

The manual way of solving this is to subclass the python list/dict (or perhaps std::vector) with a cdef class for a specific type of element or key-value combination. This would amount to the same thing as the code that templates are generating. As long as you use the resulting class in Cython code it should provide an improvement.

Using databases or arrays just solves a different problem, because this is about putting arbitrary objects (but with a specific type, and preferably a cdef class) in containers.

And std::map shouldn't be compared to dict; std::map maintains keys in sorted order because it is a balanced tree, dict solves a different problem. A better comparison would be dict and Google's hashtable.

Solution 4

You can take a look at the standard array module for Python if this is appropriate for your Cython setting. I'm not sure since I have never used Cython.

Share:
20,806
ramanujan
Author by

ramanujan

Stuff I like: git, python, numpy, sage, matplotlib, R, GNU command line tools, cython, haskell, emacs/elisp, vtk/rgl.

Updated on May 31, 2020

Comments

  • ramanujan
    ramanujan almost 4 years

    My problem: I've found that processing large data sets with raw C++ using the STL map and vector can often be considerably faster (and with lower memory footprint) than using Cython.

    I figure that part of this speed penalty is due to using Python lists and dicts, and that there might be some tricks to use less encumbered data structures in Cython. For example, this page (http://wiki.cython.org/tutorials/numpy) shows how to make numpy arrays very fast in Cython by predefining the size and types of the ND array.

    Question: Is there any way to do something similar with lists/dicts, e.g. by stating roughly how many elements or (key,value) pairs you expect to have in them? That is, is there an idiomatic way to convert lists/dicts to (fast) data structures in Cython?

    If not I guess I'll just have to write it in C++ and wrap in a Cython import.

  • ramanujan
    ramanujan over 14 years
    I guess what I'm wondering then is whether one can write a similar C extension which mimics the list/dict functionality and syntax, with the crucial prerequisite that the types of list elements and dict (key,value) pairs are statically specified, using the STL vector & map under the hood. Shouldn't be that hard (famous last words...)
  • Mike Graham
    Mike Graham about 14 years
    But it is possible to get dict faster than std::map! ;)
  • jfs
    jfs over 12 years
    libcpp (part of Cython) already has wrapper for map. It is easy to add others e.g., for unordered_map (see hash-based solution from my answer)
  • Andrew
    Andrew almost 9 years
    Do you have any hard numbers for the break-even between dict and std::map?
  • Phillip
    Phillip over 7 years
    @MikeGraham I'm having a memory problem that I was wondering if something like sqlite3 may address. The code example in the post isn't what I'm currently running (I'm now using Cython), but I have the same performance issue. You may be able to provide some insight into the problem. If applicable, feel free to add anything in answer or comment format.
  • Phillip
    Phillip over 7 years
    I'm having a memory problem that you may have some insight into. The code example in the post isn't what I'm currently running (I'm now using Cython), but I have the same performance issue. If applicable, feel free to add anything in answer or comment format.
  • kb1000
    kb1000 about 5 years
    @jfs there's no need for a wrapper anymore, look at libcpp.unordered_map
  • Eli Korvigo
    Eli Korvigo about 5 years
    Builtin list and dict are implemented in C and are exposed to Cython at a pretty low level, hence there is barely any interpreter involved when it comes to these structures themselves. The speed penalty really comes from dynamic dispatch on values stored inside them.