Simple Python Challenge: Fastest Bitwise XOR on Data Buffers

20,932

Solution 1

First Try

Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower since the code needs to be loaded from the disk and cached, subsequent invocations are faster:

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa,bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    numpy.bitwise_xor(numpy.frombuffer(aa,dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
  xmm1 = _mm_loadu_si128(pa); // must use unaligned access 
  xmm2 = _mm_load_si128(pb); // numpy will align at 16 byte boundaries
  _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
  ++pa;
  ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers = ['"emmintrin.h"'])
    return b.tostring()

Second Try

Taking into account the comments, I revisited the code to find out if the copying could be avoided. Turns out I read the documentation of the string object wrong, so here goes my second try:

support = """
#define ALIGNMENT 16
static void memxor(const char* in1, const char* in2, char* out, ssize_t n) {
    const char* end = in1 + n;
    while (in1 < end) {
       *out = *in1 ^ *in2;
       ++in1; 
       ++in2;
       ++out;
    }
}
"""

code2 = """
PyObject* res = PyString_FromStringAndSize(NULL, real_size);

const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT;
const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT;

memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head);

const __m128i* pa = (__m128i*)((char*)a + head);
const __m128i* pend = (__m128i*)((char*)a + real_size - tail);
const __m128i* pb = (__m128i*)((char*)b + head);
__m128i xmm1, xmm2;
__m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head);
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa);
    xmm2 = _mm_loadu_si128(pb);
    _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
    ++pc;
}
memxor((const char*)pa, (const char*)pb, (char*)pc, tail);
return_val = res;
Py_DECREF(res);
"""

def inline_xor_nocopy(aa, bb):
    real_size = len(aa)
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.frombuffer(bb, dtype=numpy.uint64)
    return weave.inline(code2, ["a", "b", "real_size"], 
                        headers = ['"emmintrin.h"'], 
                        support_code = support)

The difference is that the string is allocated inside the C code. It's impossible to have it aligned at a 16-byte-boundary as required by the SSE2 instructions, therefore the unaligned memory regions at the beginning and the end are copied using byte-wise access.

The input data is handed in using numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not aligned at 16 byte, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.

Instead of using _mm_store_si128, we use _mm_stream_si128, which will make sure that any writes are streamed to main memory as soon as possible---this way, the output array does not use up valuable cache lines.

Timings

As for the timings, the slow_xor entry in the first edit referred to my improved version (inline bitwise xor, uint64), I removed that confusion. slow_xor refers to the code from the original questions. All timings are done for 1000 runs.

  • slow_xor: 1.85s (1x)
  • faster_slow_xor: 1.25s (1.48x)
  • inline_xor: 0.95s (1.95x)
  • inline_xor_nocopy: 0.32s (5.78x)

The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.

Solution 2

Performance comparison: numpy vs. Cython vs. C vs. Fortran vs. Boost.Python (pyublas)

| function               | time, usec | ratio | type         |
|------------------------+------------+-------+--------------|
| slow_xor               |       2020 |   1.0 | numpy        |
| xorf_int16             |       1570 |   1.3 | fortran      |
| xorf_int32             |       1530 |   1.3 | fortran      |
| xorf_int64             |       1420 |   1.4 | fortran      |
| faster_slow_xor        |       1360 |   1.5 | numpy        |
| inline_xor             |       1280 |   1.6 | C            |
| cython_xor             |       1290 |   1.6 | cython       |
| xorcpp_inplace (int32) |        440 |   4.6 | pyublas      |
| cython_xor_vectorised  |        325 |   6.2 | cython       |
| inline_xor_nocopy      |        172 |  11.7 | C            |
| xorcpp                 |        144 |  14.0 | boost.python |
| xorcpp_inplace         |        122 |  16.6 | boost.python |
#+TBLFM: $3=@2$2/$2;%.1f

To reproduce results, download http://gist.github.com/353005 and type make (to install dependencies, type: sudo apt-get install build-essential python-numpy python-scipy cython gfortran, dependencies for Boost.Python, pyublas are not included due to they require manual intervention to work)

Where:

And xor_$type_sig() are:

! xorf.f90.template
subroutine xor_$type_sig(a, b, n, out)
  implicit none
  integer, intent(in)             :: n
  $type, intent(in), dimension(n) :: a
  $type, intent(in), dimension(n) :: b
  $type, intent(out), dimension(n) :: out

  integer i
  forall(i=1:n) out(i) = ieor(a(i), b(i))

end subroutine xor_$type_sig

It is used from Python as follows:

import xorf # extension module generated from xorf.f90.template
import numpy as np

def xor_strings(a, b, type_sig='int64'):
    assert len(a) == len(b)
    a = np.frombuffer(a, dtype=np.dtype(type_sig))
    b = np.frombuffer(b, dtype=np.dtype(type_sig))
    return getattr(xorf, 'xor_'+type_sig)(a, b).tostring()

xorcpp_inplace() (Boost.Python, pyublas):

xor.cpp:

#include <inttypes.h>
#include <algorithm>
#include <boost/lambda/lambda.hpp>
#include <boost/python.hpp>
#include <pyublas/numpy.hpp>

namespace { 
  namespace py = boost::python;

  template<class InputIterator, class InputIterator2, class OutputIterator>
  void
  xor_(InputIterator first, InputIterator last, 
       InputIterator2 first2, OutputIterator result) {
    // `result` migth `first` but not any of the input iterators
    namespace ll = boost::lambda;
    (void)std::transform(first, last, first2, result, ll::_1 ^ ll::_2);
  }

  template<class T>
  py::str 
  xorcpp_str_inplace(const py::str& a, py::str& b) {
    const size_t alignment = std::max(sizeof(T), 16ul);
    const size_t n         = py::len(b);
    const char* ai         = py::extract<const char*>(a);
    char* bi         = py::extract<char*>(b);
    char* end        = bi + n;

    if (n < 2*alignment) 
      xor_(bi, end, ai, bi);
    else {
      assert(n >= 2*alignment);

      // applying Marek's algorithm to align
      const ptrdiff_t head = (alignment - ((size_t)bi % alignment))% alignment;
      const ptrdiff_t tail = (size_t) end % alignment;
      xor_(bi, bi + head, ai, bi);
      xor_((const T*)(bi + head), (const T*)(end - tail), 
           (const T*)(ai + head),
           (T*)(bi + head));
      if (tail > 0) xor_(end - tail, end, ai + (n - tail), end - tail);
    }
    return b;
  }

  template<class Int>
  pyublas::numpy_vector<Int> 
  xorcpp_pyublas_inplace(pyublas::numpy_vector<Int> a, 
                         pyublas::numpy_vector<Int> b) {
    xor_(b.begin(), b.end(), a.begin(), b.begin());
    return b;
  }
}

BOOST_PYTHON_MODULE(xorcpp)
{
  py::def("xorcpp_inplace", xorcpp_str_inplace<int64_t>);     // for strings
  py::def("xorcpp_inplace", xorcpp_pyublas_inplace<int32_t>); // for numpy
}

It is used from Python as follows:

import os
import xorcpp

a = os.urandom(2**20)
b = os.urandom(2**20)
c = xorcpp.xorcpp_inplace(a, b) # it calls xorcpp_str_inplace()

Solution 3

Here are my results for cython

slow_xor   0.456888198853
faster_xor 0.400228977203
cython_xor 0.232881069183
cython_xor_vectorised 0.171468019485

Vectorising in cython shaves about 25% off the for loop on my computer, However more than half the time is spent building the python string (the return statement) - I don't think the extra copy can be avoided (legally) as the array may contain null bytes.

The illegal way would be to pass in a Python string and mutate it in place and would double the speed of the function.

xor.py

from time import time
from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64
import pyximport; pyximport.install()
import xor_

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

def faster_xor(aa,bb):
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

aa=urandom(2**20)
bb=urandom(2**20)

def test_it():
    t=time()
    for x in xrange(100):
        slow_xor(aa,bb)
    print "slow_xor  ",time()-t
    t=time()
    for x in xrange(100):
        faster_xor(aa,bb)
    print "faster_xor",time()-t
    t=time()
    for x in xrange(100):
        xor_.cython_xor(aa,bb)
    print "cython_xor",time()-t
    t=time()
    for x in xrange(100):
        xor_.cython_xor_vectorised(aa,bb)
    print "cython_xor_vectorised",time()-t

if __name__=="__main__":
    test_it()

xor_.pyx

cdef char c[1048576]
def cython_xor(char *a,char *b):
    cdef int i
    for i in range(1048576):
        c[i]=a[i]^b[i]
    return c[:1048576]

def cython_xor_vectorised(char *a,char *b):
    cdef int i
    for i in range(131094):
        (<unsigned long long *>c)[i]=(<unsigned long long *>a)[i]^(<unsigned long long *>b)[i]
    return c[:1048576]

Solution 4

An easy speedup is to use a larger 'chunk':

def faster_xor(aa,bb):
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

with uint64 also imported from numpy of course. I timeit this at 4 milliseconds, vs 6 milliseconds for the byte version.

Solution 5

Your problem isn't the speed of NumPy's xOr method, but rather with all of the buffering/data type conversions. Personally I suspect that the point of this post may have really been to brag about Python, because what you are doing here is processing THREE GIGABYTES of data in timeframes on par with non-interpreted languages, which are inherently faster.

The below code shows that even on my humble computer Python can xOr "aa" (1MB) and "bb" (1MB) into "c" (1MB) one thousand times (total 3GB) in under two seconds. Seriously, how much more improvement do you want? Especially from an interpreted language! 80% of the time was spent calling "frombuffer" and "tostring". The actual xOr-ing is completed in the other 20% of the time. At 3GB in 2 seconds, you would be hard-pressed to improve upon that substantially even just using memcpy in c.

In case this was a real question, and not just covert bragging about Python, the answer is to code so as to minimize the number, amount and frequency of your type conversions such as "frombuffer" and "tostring". The actual xOr'ing is lightning fast already.

from os import urandom
from numpy import frombuffer,bitwise_xor,byte,uint64

def slow_xor(aa,bb):
    a=frombuffer(aa,dtype=byte)
    b=frombuffer(bb,dtype=byte)
    c=bitwise_xor(a,b)
    r=c.tostring()
    return r

bb=urandom(2**20)
aa=urandom(2**20)

def test_it():
    for x in xrange(1000):
    slow_xor(aa,bb)

def test_it2():
    a=frombuffer(aa,dtype=uint64)
    b=frombuffer(bb,dtype=uint64)
    for x in xrange(1000):
        c=bitwise_xor(a,b);
    r=c.tostring()    

test_it()
print 'Slow Complete.'
#6 seconds
test_it2()
print 'Fast Complete.'
#under 2 seconds

Anyway, the "test_it2" above accomplishes exactly the same amount of xOr-ing as "test_it" does, but in 1/5 the time. 5x speed improvement should qualify as "substantial", no?

Share:
20,932
user213060
Author by

user213060

Updated on July 09, 2022

Comments

  • user213060
    user213060 almost 2 years

    Challenge:

    Perform a bitwise XOR on two equal sized buffers. The buffers will be required to be the python str type since this is traditionally the type for data buffers in python. Return the resultant value as a str. Do this as fast as possible.

    The inputs are two 1 megabyte (2**20 byte) strings.

    The challenge is to substantially beat my inefficient algorithm using python or existing third party python modules (relaxed rules: or create your own module.) Marginal increases are useless.

    from os import urandom
    from numpy import frombuffer,bitwise_xor,byte
    
    def slow_xor(aa,bb):
        a=frombuffer(aa,dtype=byte)
        b=frombuffer(bb,dtype=byte)
        c=bitwise_xor(a,b)
        r=c.tostring()
        return r
    
    aa=urandom(2**20)
    bb=urandom(2**20)
    
    def test_it():
        for x in xrange(1000):
            slow_xor(aa,bb)
    
  • user213060
    user213060 over 14 years
    Marginal improvement though, especially for smaller buffers.
  • Tor Valamo
    Tor Valamo over 14 years
    It compiles to machine code. It always gets converted to c code before compiling.
  • mloskot
    mloskot over 14 years
    +1 for good suggestion. I timed it and it's much faster than the original, see my comment to Ira Baxter's answer below.
  • John La Rooy
    John La Rooy over 14 years
    This requires the buffer length to be a multiple of 8, but the challenge is for 2**20, so no special handling of other cases is required
  • Torsten Marek
    Torsten Marek over 14 years
    Thanks! It's probably possible to speed it up a little bit by using prefetching (_mm_prefetch intrinsic), but I wasn't able to produce any spectacular results with it.
  • SoapBox
    SoapBox over 14 years
    I've found that in cases of linear walks through an array prefetch really doesn't help. Intel processors are already smart about prefetching for linear walks.
  • user213060
    user213060 over 14 years
    Is there a simple way to eliminate the frombuffer and tostring calls? Those seem to be the largest bottleneck now. Promising approach though.
  • Torsten Marek
    Torsten Marek over 14 years
    That code really bothered me, so I gave it another try ;-) Doesn't really have much to do with Python anymore, though :-(
  • just somebody
    just somebody over 14 years
    isn't it perhaps because test_it2 runs c.tostring once while test_it a thousand times?
  • user213060
    user213060 over 14 years
    Somewhere between Cython and the C compiler, there is a failure to vectorize into SIMD instructions. Shame. Good demonstration of a very simple optimization though. Also good for being the first, at the time of posting, to remove the expensive buffer type conversion operations.
  • Rudiger
    Rudiger over 14 years
    I just did XOR over an array in C, and my implementation did 1000 runs in 0.03s... An order of magnitude faster.
  • John La Rooy
    John La Rooy over 14 years
    @user213060, I'm sure there would be a decent speedup by casting to 64 or 128bit types for the xor. I don't know enough cython to do that though.
  • jfs
    jfs about 14 years
    Ratio slow_xor/inline_xor_nocopy is 11 (2020 usec vs. 172 usec per iteration). Ratio slow_xor/inline_xor is 1.6; slow_xor/faster_slow_xor is 1.5 (Python 2.6.4 x86_64 GNU/Linux)
  • jfs
    jfs about 14 years
    Ratio slow_xor/cython_xor_vectorised is 6.2 (2020 usec vs. 325 usec for 2**20 size). Ratio slow_xor/cython_xor is 1.6 (Python 2.6.4 x86_64 GNU/Linux)
  • jfs
    jfs about 14 years
    I've posted results of comparison of all presented approaches stackoverflow.com/questions/2119761/…
  • jfs
    jfs about 14 years
    I've posted results of comparison of all presented approaches stackoverflow.com/questions/2119761/…