Fast string to integer conversion in Python
Solution 1
The following minimal C extension already improves heavily on the builtin, converting over three times as many strings per second (650kcps vs 214kcps):
static PyObject *fastint_int(PyObject *self, PyObject *args) {
    char *s;
    unsigned r = 0;
    if (!PyArg_ParseTuple(args, "s", &s))
        return NULL;
    for (r = 0; *s; r = r * 10 + *s++ - '0');
    return Py_BuildValue("i", r);
}
This obviously does not cater for integers of arbitrary length and various other special cases, but that's no problem in our scenario.
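For comparison, the same digit-accumulation loop can be sketched in pure Python. It is of course far slower than the C version, which is the whole point of the extension, but it makes the algorithm explicit:

```python
def fast_atoi(s):
    """Convert a string of ASCII digits to an unsigned int,
    mirroring the C loop r = r * 10 + (c - '0').
    No validation, no sign handling -- same caveats as the extension."""
    r = 0
    for c in s:
        r = r * 10 + (ord(c) - 48)  # 48 == ord('0')
    return r
```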
Solution 2
You will gain some speed by ensuring only "local" variables are used in your tightest loops. The int function is a global, so looking it up is more expensive than looking up a local.
Do you really need all billion numbers in memory at all times? Consider using iterators to give you only a few values at a time. A billion numbers will take considerable storage; appending them to a list one at a time is going to require several large reallocations.
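A minimal sketch of that lazy approach (the filename is a placeholder): a generator yields one integer per line instead of materializing a billion-element list.

```python
def iter_ints(path):
    # Yields one int at a time; only a single line of the file
    # is ever held in memory.
    with open(path) as f:
        for line in f:
            yield int(line)
```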
Get your looping out of Python entirely if possible. The map function here can be your friend. I'm not sure how your data is stored. If it is a single number per line, you could reduce the code to
values = map(int, open("numberfile.txt"))
If there are multiple values per line that are white space separated, dig into the itertools to keep the looping code out of Python. This version has the added benefit of creating a number iterator, so you can spool only one or several numbers out of the file at a time, instead of one billion in one shot.
import itertools
numfile = open("numberfile.txt")
valIter = itertools.imap(int, itertools.chain.from_iterable(itertools.imap(str.split, numfile)))
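In current Python 3, itertools.imap is gone and the built-in map is itself lazy, so the equivalent pipeline (a sketch) becomes:

```python
from itertools import chain

def int_stream(f):
    # map() is lazy in Python 3; chain.from_iterable flattens the
    # per-line lists of whitespace-separated tokens into one stream.
    return map(int, chain.from_iterable(map(str.split, f)))
```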
Solution 3
I might suggest that for raw speed, Python isn't the right tool for this task. A hand-coded C implementation will beat Python easily.
Solution 4
Agree with Greg; Python, as an interpreted language, is generally slow. You could try speeding up execution with the Psyco JIT compiler, or coding the app in a lower-level language such as C/C++.
Solution 5
As others have said you could code up your own C module to do the parsing/conversion for you. Then you could simply import that and call on it. You might be able to use Pyrex or its Cython derivative to generate your C from your Python (by adding a few type constraining hints to the Python).
You can read more about Cython and see if that will help.
Another question that comes to mind, though: what are you going to be doing with these billion integers? Is it possible that you might load them as strings, search for them as strings, and perform a lazy conversion only as necessary? Or could you parallelize the conversion and the other computations using the threading or multiprocessing modules and Queues? (Have one or more threads/processes performing the conversion and feeding a Queue from which your processing engine fetches them.) In other words, would a producer/consumer design alleviate the problem?
earl
Updated on June 04, 2022
Comments
-
earl almost 2 years
A simple problem, really: you have one billion (1e+9) unsigned 32-bit integers stored as decimal ASCII strings in a TSV (tab-separated values) file. Conversion using int() is horribly slow compared to other tools working on the same dataset. Why? And more importantly: how to make it faster?
Therefore the question: what is the fastest way possible to convert a string to an integer, in Python?
What I'm really thinking about is some semi-hidden Python functionality that could be (ab)used for this purpose, not unlike Guido's use of array.array in his "Optimization Anecdote".
Sample data (with tabs expanded to spaces):
38262904 "pfv" 2002-11-15T00:37:20+00:00
12311231 "tnealzref" 2008-01-21T20:46:51+00:00
26783384 "hayb" 2004-02-14T20:43:45+00:00
812874 "qevzasdfvnp" 2005-01-11T00:29:46+00:00
22312733 "bdumtddyasb" 2009-01-17T20:41:04+00:00
The time it takes reading the data is irrelevant here, processing the data is the bottleneck.
Microbenchmarks
All of the following are interpreted languages. The host machine is running 64-bit Linux.
Python 2.6.2 with IPython 0.9.1, ~214k conversions per second (100%):
In [1]: strings = map(str, range(int(1e7)))
In [2]: %timeit map(int, strings);
10 loops, best of 3: 4.68 s per loop
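Outside IPython, essentially the same measurement can be reproduced with the standard timeit module (a sketch, scaled down to a smaller n so it finishes quickly):

```python
import timeit

n = 10**5  # scaled down from the 1e7 used above
strings = [str(i) for i in range(n)]

# Time 3 full passes of converting every string back to an int.
t = timeit.timeit(lambda: list(map(int, strings)), number=3)
print("%.0fk conversions/sec" % (3 * n / t / 1000))
```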
REBOL 3.0 Version 2.100.76.4.2, ~231kcps (108%):
>> strings: array n: to-integer 1e7
>> repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [map str strings [to integer! str]]
== 0:00:04.328675
REBOL 2.7.6.4.2 (15-Mar-2008), ~523kcps (261%):
As John noted in the comments, this version does not build a list of converted integers, so the speed ratio given is relative to Python's 4.99s runtime of:
for str in strings: int(str)
>> delta-time: func [c /local t] [t: now/time/precise do c now/time/precise - t]
>> strings: array n: to-integer 1e7
>> repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [foreach str strings [to integer! str]]
== 0:00:01.913193
KDB+ 2.6t 2009.04.15, ~2016kcps (944%):
q)strings:string til "i"$1e7
q)\t "I"$strings
496
-
earl over 14 years
I totally agree, but that's not really the point of my question. I added a paragraph of what I'm looking for. A custom Python extension would be an option, though.
-
Yuval Adam over 14 years
-1 on the interpreted ==> slow corollary. A C implementation will be faster in this case, but your generalization is simply wrong.
-
ramosg over 14 years
An interpreted language must be translated into machine code at execution time, and that is simply slower than executing compiled object code. I still don't understand your downvote. Please explain why you think "my generalization" is wrong.
-
Yuval Adam over 14 years
Interpreted languages can make optimizations on the bytecode during runtime, sometimes leading to better performance than native machine code. Look it up, it has been discussed to death.
-
ramosg over 14 years
Well, I suppose 90% of the cases isn't enough to generalize, so it's edited.
-
hughdbrown over 14 years
Moving as much as possible out of the inner loop and running this on 1e7 iterations takes 27 seconds using psyco.full(). So it would take something resembling 45 minutes on my machine to do 1e9. I'm tempted to believe that C/C++/C# would be faster, though I have not benchmarked them.
-
jfs over 14 years
Is there any reason not to use the C standard library's functions, e.g., strtoul()? -
Kaerber over 10 years
Actually, the current state of affairs is that 90% of the code is interpreted; technically, both Java and .NET code is interpreted on the bytecode level. That is why a whole industry of JIT compilers sprung up, giving an unexpected benefit of both targeting this specific platform in optimizations and having full runtime information about the behaviour of the code right at the compiler's fingertips. So yeah, the old adage that statically compiled languages are faster than interpreted ones is no longer true.