C++ performance challenge: integer to std::string conversion

c++ performance string integer

47,152

Solution 1

#include <string>

const char digit_pairs[201] = {
  "00010203040506070809"
  "10111213141516171819"
  "20212223242526272829"
  "30313233343536373839"
  "40414243444546474849"
  "50515253545556575859"
  "60616263646566676869"
  "70717273747576777879"
  "80818283848586878889"
  "90919293949596979899"
};


std::string& itostr(int n, std::string& s)
{
    if(n==0)
    {
        s="0";
        return s;
    }

    int sign = -(n<0);
    unsigned int val = (n^sign)-sign;

    int size;
    if(val>=10000)
    {
        if(val>=10000000)
        {
            if(val>=1000000000)
                size=10;
            else if(val>=100000000)
                size=9;
            else 
                size=8;
        }
        else
        {
            if(val>=1000000)
                size=7;
            else if(val>=100000)
                size=6;
            else
                size=5;
        }
    }
    else 
    {
        if(val>=100)
        {
            if(val>=1000)
                size=4;
            else
                size=3;
        }
        else
        {
            if(val>=10)
                size=2;
            else
                size=1;
        }
    }
    size -= sign;
    s.resize(size);
    char* c = &s[0];
    if(sign)
        *c='-';

    c += size-1;
    while(val>=100)
    {
       int pos = val % 100;
       val /= 100;
       *(short*)(c-1)=*(short*)(digit_pairs+2*pos); 
       c-=2;
    }
    while(val>0)
    {
        *c--='0' + (val % 10);
        val /= 10;
    }
    return s;
}

std::string& itostr(unsigned val, std::string& s)
{
    if(val==0)
    {
        s="0";
        return s;
    }

    int size;
    if(val>=10000)
    {
        if(val>=10000000)
        {
            if(val>=1000000000)
                size=10;
            else if(val>=100000000)
                size=9;
            else 
                size=8;
        }
        else
        {
            if(val>=1000000)
                size=7;
            else if(val>=100000)
                size=6;
            else
                size=5;
        }
    }
    else 
    {
        if(val>=100)
        {
            if(val>=1000)
                size=4;
            else
                size=3;
        }
        else
        {
            if(val>=10)
                size=2;
            else
                size=1;
        }
    }

    s.resize(size);
    char* c = &s[size-1];
    while(val>=100)
    {
       int pos = val % 100;
       val /= 100;
       *(short*)(c-1)=*(short*)(digit_pairs+2*pos); 
       c-=2;
    }
    while(val>0)
    {
        *c--='0' + (val % 10);
        val /= 10;
    }
    return s;
}

This will blow up on systems that disallow unaligned memory accesses (in which case, the first unaligned assignment via *(short*) would cause a segfault), but should work very nicely otherwise.

One important thing to do is to minimize the use of std::string. (Ironic, I know.) In Visual Studio, for example, most calls to methods of std::string are not inlined, even if you specify /Ob2 in compiler options. So even something as trivial as a call to std::string::clear(), which you might expect to be very fast, can take 100 clockticks when linking CRT as a static library, and as much as 300 clockticks when linking as a DLL.

For the same reason, returning by reference is better because it avoids an assignment, a constructor and a destructor.

Solution 2

Ah, awesome challenge by the way... I've had a lot of fun with this.

I have two algorithms to submit (code is at the bottom if you feel like skipping to it). In my comparisons I require that the function return a string and that it can handle int and unsigned int. Comparing things that don't construct a string to those that do doesn't really make sense.

The first one is a fun implementation that doesn't use any precomputed lookup tables or explicit division/modulo. This one is competitive with the others with gcc and with all but Timo's on msvc (for a good reason that I explain below). The second algorithm is my actual submission for highest performance. In my tests it beats all the others on both gcc and msvc.

I think I know why some of the results on MSVC are very good. std::string has two relevant constructors std::string(char* str, size_t n)
and
std::string(ForwardIterator b, ForwardIterator e)
gcc does the same thing for both of them... that is it uses the second to implement the first. The first constructor can be implemented significantly more efficiently than that and MSVC does so. The side benefit of this is that in some cases (like my fast code and Timo's code) the string constructor can be inlined. In fact, just switching between these constructors in MSVC is almost a 2x difference for my code.

My performance testing results:

Code Sources:

- Voigt
- Timo
- ergosys
- user434507
- user-voigt-timo
- hopman-fun
- hopman-fast

gcc 4.4.5 -O2 on Ubuntu 10.10 64-bit, Core i5

hopman_fun: 124.688  MB/sec --- 8.020 s
hopman_fast: 137.552  MB/sec --- 7.270 s
voigt: 120.192  MB/sec --- 8.320 s
user_voigt_timo: 97.9432  MB/sec --- 10.210 s
timo: 120.482  MB/sec --- 8.300 s
user: 97.7517  MB/sec --- 10.230 s
ergosys: 101.42  MB/sec --- 9.860 s

MSVC 2010 64-bit /Ox on Windows 7 64-bit, Core i5

hopman_fun: 127  MB/sec --- 7.874 s
hopman_fast: 259  MB/sec --- 3.861 s
voigt: 221.435  MB/sec --- 4.516 s
user_voigt_timo: 195.695  MB/sec --- 5.110 s
timo: 253.165  MB/sec --- 3.950 s
user: 212.63  MB/sec --- 4.703 s
ergosys: 78.0518  MB/sec --- 12.812 s

Here are some results and a testing/timing framework on ideone
http://ideone.com/XZRqp
Note that ideone is a 32-bit environment. Both of my algorithms suffer from that, but hopman_fast is at least still competetive.

Note that for those the two or so that don't construct a string I added the following function template:

template <typename T>
std::string itostr(T t) {
    std::string ret;
    itostr(t, ret);
    return ret;
}

Now for my code...first the fun one:

    // hopman_fun

template <typename T> 
T reduce2(T v) {
    T k = ((v * 410) >> 12) & 0x000F000F000F000Full;
    return (((v - k * 10) << 8) + k);
}

template <typename T>
T reduce4(T v) {
    T k = ((v * 10486) >> 20) & 0xFF000000FFull;
    return reduce2(((v - k * 100) << 16) + (k));
}

typedef unsigned long long ull;
inline ull reduce8(ull v) {
    ull k = ((v * 3518437209u) >> 45);
    return reduce4(((v - k * 10000) << 32) + (k));
}

template <typename T>
std::string itostr(T o) {
    union {
        char str[16];
        unsigned short u2[8];
        unsigned u4[4];
        unsigned long long u8[2];
    };

    unsigned v = o < 0 ? ~o + 1 : o;

    u8[0] = (ull(v) * 3518437209u) >> 45;
    u8[0] = (u8[0] * 28147497672ull);
    u8[1] = v - u2[3] * 100000000;

    u8[1] = reduce8(u8[1]);
    char* f;
    if (u2[3]) {
        u2[3] = reduce2(u2[3]);
        f = str + 6;
    } else {
        unsigned short* k = u4[2] ? u2 + 4 : u2 + 6;
        f = *k ? (char*)k : (char*)(k + 1);
    }
    if (!*f) f++;

    u4[1] |= 0x30303030;
    u4[2] |= 0x30303030;
    u4[3] |= 0x30303030;
    if (o < 0) *--f = '-';
    return std::string(f, (str + 16) - f);
}

And then the fast one:

    // hopman_fast

struct itostr_helper {
    static unsigned out[10000];

    itostr_helper() {
        for (int i = 0; i < 10000; i++) {
            unsigned v = i;
            char * o = (char*)(out + i);
            o[3] = v % 10 + '0';
            o[2] = (v % 100) / 10 + '0';
            o[1] = (v % 1000) / 100 + '0';
            o[0] = (v % 10000) / 1000;
            if (o[0]) o[0] |= 0x30;
            else if (o[1] != '0') o[0] |= 0x20;
            else if (o[2] != '0') o[0] |= 0x10;
            else o[0] |= 0x00;
        }
    }
};
unsigned itostr_helper::out[10000];

itostr_helper hlp_init;

template <typename T>
std::string itostr(T o) {
    typedef itostr_helper hlp;

    unsigned blocks[3], *b = blocks + 2;
    blocks[0] = o < 0 ? ~o + 1 : o;
    blocks[2] = blocks[0] % 10000; blocks[0] /= 10000;
    blocks[2] = hlp::out[blocks[2]];

    if (blocks[0]) {
        blocks[1] = blocks[0] % 10000; blocks[0] /= 10000;
        blocks[1] = hlp::out[blocks[1]];
        blocks[2] |= 0x30303030;
        b--;
    }

    if (blocks[0]) {
        blocks[0] = hlp::out[blocks[0] % 10000];
        blocks[1] |= 0x30303030;
        b--;
    }

    char* f = ((char*)b);
    f += 3 - (*f >> 4);

    char* str = (char*)blocks;
    if (o < 0) *--f = '-';
    return std::string(f, (str + 12) - f);
}

Solution 3

Benchmark data for the code provided in the question:

On ideone (gcc 4.3.4):

stringstreams: 4.4 MB/s
sprintf: 25.0 MB/s
mine (Ben Voigt): 55.8 MB/s
Timo: 58.5 MB/s
user434507: 199 MB/s
user434507's Ben-Timo-507 hybrid: 263 MB/s

Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 32-bit:

cl /Ox /EHsc

stringstreams: 3.39 MB/s, 3.67 MB/s
sprintf: 16.8 MB/s, 16.2 MB/s
mine: 194 MB/s, 207 MB/s (with PGO enabled: 250 MB/s)

Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 64-bit:

cl /Ox /EHsc

stringstreams: 4.42 MB/s, 4.92 MB/s
sprintf: 21.0 MB/s, 20.8 MB/s
mine: 238 MB/s, 228 MB/s

Core i7, Windows 7 64-bit, 8 GB RAM, cygwin gcc 4.3.4:

g++ -O3

stringstreams: 2.19 MB/s, 2.17 MB/s
sprintf: 13.1 MB/s, 13.4 MB/s
mine: 30.0 MB/s, 30.2 MB/s

edit: I was gonna add my own answer, but the question was was closed so I'm adding it here. :) I wrote my own algorithm and managed to get a decent improvement over Ben's code, though I only tested it in MSVC 2010. I also made a benchmark of all the implementations presented so far, using the same testing setup that was in Ben's original code. -- Timo

Intel Q9450, Win XP 32bit, MSVC 2010

cl /O2 /EHsc

stringstream: 2.87 MB/s
sprintf: 16.1 MB/s
Ben: 202 MB/s
Ben (unsigned buffer): 82.0 MB/s
ergosys (updated version): 64.2 MB/s
user434507: 172 MB/s
Timo: 241 MB/s

const char digit_pairs[201] = {
  "00010203040506070809"
  "10111213141516171819"
  "20212223242526272829"
  "30313233343536373839"
  "40414243444546474849"
  "50515253545556575859"
  "60616263646566676869"
  "70717273747576777879"
  "80818283848586878889"
  "90919293949596979899"
};

static const int BUFFER_SIZE = 11;

std::string itostr(int val)
{
  char buf[BUFFER_SIZE];
  char *it = &buf[BUFFER_SIZE-2];

  if(val>=0) {
    int div = val/100;
    while(div) {
      memcpy(it,&digit_pairs[2*(val-div*100)],2);
      val = div;
      it-=2;
      div = val/100;
    }
    memcpy(it,&digit_pairs[2*val],2);
    if(val<10)
      it++;
  } else {
    int div = val/100;
    while(div) {
      memcpy(it,&digit_pairs[-2*(val-div*100)],2);
      val = div;
      it-=2;
      div = val/100;
    }
    memcpy(it,&digit_pairs[-2*val],2);
    if(val<=-10)
      it--;
    *it = '-';
  }

  return std::string(it,&buf[BUFFER_SIZE]-it);
}

std::string itostr(unsigned int val)
{
  char buf[BUFFER_SIZE];
  char *it = (char*)&buf[BUFFER_SIZE-2];

  int div = val/100;
  while(div) {
    memcpy(it,&digit_pairs[2*(val-div*100)],2);
    val = div;
    it-=2;
    div = val/100;
  }
  memcpy(it,&digit_pairs[2*val],2);
  if(val<10)
    it++;

  return std::string((char*)it,(char*)&buf[BUFFER_SIZE]-(char*)it);
}

Solution 4

While the info we get here for the algorithms is pretty nice, I think the question is "broken", and I'll explain why I think this:

The question asks to take the performance of int->std::string conversion, and this may be of interest when comparing a commonly available method, such as different stringstream implementations or boost::lexical_cast. It does not, however, make sense when asking for new code, a specialized algorithm, to do this. The reason is that int2string will always involve heap allocation from std::string and if we are trying to squeeze the last out of our conversion algorithm, I do not think it makes sense to mix these measurements up with the heap allocations done by std::string. If I want performant conversion I will always use a fixed size buffer and certainly never allocate anything on the heap!

To sum up, I think the timings should be split:

First, fastest (int -> fixed buffer) conversion.
Second, timing of (fixed buffer -> std::string) copy.
Third, checking how the std::string allocation can directly be used as buffer, to save the copying.

These aspects should not be mixed up in one timing, IMHO.

Solution 5

Updated user2985907's answer... modp_ufast...

Integer To String Test (Type 1)
[modp_ufast]Numbers: 240000000  Total:   657777786      Time:  1.1633sec        Rate:206308473.0686nums/sec
[sprintf] Numbers: 240000000    Total:   657777786      Time: 24.3629sec        Rate:  9851045.8556nums/sec
[karma]   Numbers: 240000000    Total:   657777786      Time:  5.2389sec        Rate: 45810870.7171nums/sec
[strtk]   Numbers: 240000000    Total:   657777786      Time:  3.3126sec        Rate: 72450283.7492nums/sec
[so   ]   Numbers: 240000000    Total:   657777786      Time:  3.0828sec        Rate: 77852152.8820nums/sec
[timo ]   Numbers: 240000000    Total:   657777786      Time:  4.7349sec        Rate: 50687912.9889nums/sec
[voigt]   Numbers: 240000000    Total:   657777786      Time:  5.1689sec        Rate: 46431985.1142nums/sec
[hopman]  Numbers: 240000000    Total:   657777786      Time:  4.6169sec        Rate: 51982554.6497nums/sec
Press any key to continue . . .

Integer To String Test(Type 2)
[modp_ufast]Numbers: 240000000  Total:   660000000      Time:  0.5072sec        Rate:473162716.4618nums/sec
[sprintf] Numbers: 240000000    Total:   660000000      Time: 22.3483sec        Rate: 10739062.9383nums/sec
[karma]   Numbers: 240000000    Total:   660000000      Time:  4.2471sec        Rate: 56509024.3035nums/sec
[strtk]   Numbers: 240000000    Total:   660000000      Time:  2.1683sec        Rate:110683636.7123nums/sec
[so   ]   Numbers: 240000000    Total:   660000000      Time:  2.7133sec        Rate: 88454602.1423nums/sec
[timo ]   Numbers: 240000000    Total:   660000000      Time:  2.8030sec        Rate: 85623453.3872nums/sec
[voigt]   Numbers: 240000000    Total:   660000000      Time:  3.4019sec        Rate: 70549286.7776nums/sec
[hopman]  Numbers: 240000000    Total:   660000000      Time:  2.7849sec        Rate: 86178023.8743nums/sec
Press any key to continue . . .

Integer To String Test (type 3)
[modp_ufast]Numbers: 240000000  Total:   505625000      Time:  1.6482sec        Rate:145610315.7819nums/sec
[sprintf] Numbers: 240000000    Total:   505625000      Time: 20.7064sec        Rate: 11590618.6109nums/sec
[karma]   Numbers: 240000000    Total:   505625000      Time:  4.3036sec        Rate: 55767734.3570nums/sec
[strtk]   Numbers: 240000000    Total:   505625000      Time:  2.9297sec        Rate: 81919227.9275nums/sec
[so   ]   Numbers: 240000000    Total:   505625000      Time:  3.0278sec        Rate: 79266003.8158nums/sec
[timo ]   Numbers: 240000000    Total:   505625000      Time:  4.0631sec        Rate: 59068204.3266nums/sec
[voigt]   Numbers: 240000000    Total:   505625000      Time:  4.5616sec        Rate: 52613393.0285nums/sec
[hopman]  Numbers: 240000000    Total:   505625000      Time:  4.1248sec        Rate: 58184194.4569nums/sec
Press any key to continue . . .

int ufast_utoa10(unsigned int value, char* str)
{
#define JOIN(N) N "0", N "1", N "2", N "3", N "4", N "5", N "6", N "7", N "8", N "9"
#define JOIN2(N) JOIN(N "0"), JOIN(N "1"), JOIN(N "2"), JOIN(N "3"), JOIN(N "4"), \
                 JOIN(N "5"), JOIN(N "6"), JOIN(N "7"), JOIN(N "8"), JOIN(N "9")
#define JOIN3(N) JOIN2(N "0"), JOIN2(N "1"), JOIN2(N "2"), JOIN2(N "3"), JOIN2(N "4"), \
                 JOIN2(N "5"), JOIN2(N "6"), JOIN2(N "7"), JOIN2(N "8"), JOIN2(N "9")
#define JOIN4    JOIN3("0"), JOIN3("1"), JOIN3("2"), JOIN3("3"), JOIN3("4"), \
                 JOIN3("5"), JOIN3("6"), JOIN3("7"), JOIN3("8"), JOIN3("9")
#define JOIN5(N) JOIN(N), JOIN(N "1"), JOIN(N "2"), JOIN(N "3"), JOIN(N "4"), \
                 JOIN(N "5"), JOIN(N "6"), JOIN(N "7"), JOIN(N "8"), JOIN(N "9")
#define JOIN6    JOIN5(), JOIN5("1"), JOIN5("2"), JOIN5("3"), JOIN5("4"), \
                 JOIN5("5"), JOIN5("6"), JOIN5("7"), JOIN5("8"), JOIN5("9")
#define F(N)     ((N) >= 100 ? 3 : (N) >= 10 ? 2 : 1)
#define F10(N)   F(N),F(N+1),F(N+2),F(N+3),F(N+4),F(N+5),F(N+6),F(N+7),F(N+8),F(N+9)
#define F100(N)  F10(N),F10(N+10),F10(N+20),F10(N+30),F10(N+40),\
                 F10(N+50),F10(N+60),F10(N+70),F10(N+80),F10(N+90)
  static const short offsets[] = { F100(0), F100(100), F100(200), F100(300), F100(400),
                                  F100(500), F100(600), F100(700), F100(800), F100(900)};
  static const char table1[][4] = { JOIN("") }; 
  static const char table2[][4] = { JOIN2("") }; 
  static const char table3[][4] = { JOIN3("") };
  static const char table4[][5] = { JOIN4 }; 
  static const char table5[][4] = { JOIN6 };
#undef JOIN
#undef JOIN2
#undef JOIN3
#undef JOIN4
  char *wstr;
  int remains[2];
  unsigned int v2;
  if (value >= 100000000) {
    v2 = value / 10000;
    remains[0] = value - v2 * 10000;
    value = v2;
    v2 = value / 10000;
    remains[1] = value - v2 * 10000;
    value = v2;
    wstr = str;
    if (value >= 1000) {
      *(__int32 *) wstr = *(__int32 *) table4[value];
      wstr += 4;
    } else {
      *(__int32 *) wstr = *(__int32 *) table5[value];
      wstr += offsets[value];
    }
    *(__int32 *) wstr = *(__int32 *) table4[remains[1]];
    wstr += 4;
    *(__int32 *) wstr = *(__int32 *) table4[remains[0]];
    wstr += 4;
    *wstr = 0;
    return (wstr - str);
  }
  else if (value >= 10000) {
    v2 = value / 10000;
    remains[0] = value - v2 * 10000;
    value = v2;
    wstr = str;
    if (value >= 1000) {
      *(__int32 *) wstr = *(__int32 *) table4[value];
      wstr += 4;
      *(__int32 *) wstr = *(__int32 *) table4[remains[0]];
      wstr += 4;
      *wstr = 0;
      return 8;
    } else {
      *(__int32 *) wstr = *(__int32 *) table5[value];
      wstr += offsets[value];
      *(__int32 *) wstr = *(__int32 *) table4[remains[0]];
      wstr += 4;
      *wstr = 0;
      return (wstr - str);
    }
  }
  else {
    if (value >= 1000) {
      *(__int32 *) str = *(__int32 *) table4[value];
      str += 4;
      *str = 0;
      return 4;
    } else if (value >= 100) {
      *(__int32 *) str = *(__int32 *) table3[value];
      return 3;
    } else if (value >= 10) {
      *(__int16 *) str = *(__int16 *) table2[value];
      str += 2;
      *str = 0;
      return 2;
    } else {
      *(__int16 *) str = *(__int16 *) table1[value];
      return 1;
    }
  }
}

int ufast_itoa10(int value, char* str) {
  if (value < 0) { *(str++) = '-'; 
    return ufast_utoa10(-value, str) + 1; 
  }
  else return ufast_utoa10(value, str);
}


    void ufast_test() {

   print_mode("[modp_ufast]");

   std::string s;
   s.reserve(32);
   std::size_t total_length = 0;
   strtk::util::timer t;
   t.start();

   char buf[128];
   int len;
   for (int i = (-max_i2s / 2); i < (max_i2s / 2); ++i)
   {
      #ifdef enable_test_type01
      s.resize(ufast_itoa10(((i & 1) ? i : -i), const_cast<char*>(s.c_str())));
      total_length += s.size();
      #endif

      #ifdef enable_test_type02
      s.resize(ufast_itoa10(max_i2s + i, const_cast<char*>(s.c_str())));
      total_length += s.size();
      #endif

      #ifdef enable_test_type03
      s.resize(ufast_itoa10(randval[(max_i2s + i) & 1023], const_cast<char*>(s.c_str())));
      total_length += s.size();
      #endif
   }
   t.stop();
   printf("Numbers:%10lu\tTotal:%12lu\tTime:%8.4fsec\tRate:%14.4fnums/sec\n",
          static_cast<unsigned long>(3 * max_i2s),
          static_cast<unsigned long>(total_length),
          t.time(),
          (3.0 * max_i2s) / t.time());
}

View more solutions

47,152

Author by

Ben Voigt

Since Stack Exchange has effectively turned my profile page into a political platform without my foreknowledge or consent, I feel obligated to share my own views: An oligarchy of five justices setting policy for the entire USA is not the government the Constitution guarantees, and it's not the one we want. Not everything President Gerald Ford said was very smart or worth repeating, but this is A government big enough to give you everything you want is a government big enough to take from you everything you have. The USA was designed as a Constitutional Republic, with checks-and-balances between the branches of the federal government, and also division of power between federal state and local governments. (And these are explicitly required by the plain language of the Constitution) All government officials need to stay within the authority granted to them by the Constitution, especially when that means they can't do whatever they want.

Updated on June 24, 2021

Comments

Ben Voigt almost 3 years
Can anyone beat the performance of my integer to std::string code, linked below?

There are already several questions that explain how to convert an integer into a std::string in C++, such as this one, but none of the solutions provided are efficient.

Here is compile-ready code for some common methods to compete against:
- The "C++ way", using stringstream: http://ideone.com/jh3Sa
- sprintf, which SO-ers usually recommend to the performance conscious: http://ideone.com/82kwR
Contrary to popular belief, boost::lexical_cast has its own implementation (white paper) and does not use stringstream and numeric insertion operators. I'd really like to see its performance compared, because this other question suggests that it's miserable.

And my own contribution, which is competitive on desktop computers, and demonstrates an approach that runs at full speed on embedded systems as well, unlike algorithms dependent on integer modulo:
- Ben's algorithms: http://ideone.com/SsEUW
If you want to use that code, I'll make it available under a simplified BSD license (commercial use allowed, attribution required). Just ask.

Finally, the function ltoa is non-standard but widely available.
- ltoa version, for anyone who has a compiler that provides it (ideone doesn't): http://ideone.com/T5Wim
I'll post my performance measurements as an answer shortly.

Rules for algorithms
- Provide code for a conversion of at least 32-bit signed and unsigned integers into decimal.
- Produce output as a std::string.
- No tricks that are incompatible with threading and signals (for example, static buffers).
- You may assume an ASCII character set.
- Make sure to test your code on INT_MIN on a two's complement machine where the absolute value is not representable.
- Ideally, the output should be character-for-character identical with the canonical C++ version using stringstream, http://ideone.com/jh3Sa, but anything that is clearly understandable as the correct number is ok, too.
- NEW: Although you can use whatever compiler and optimizer options (except completely disabled) you want for the comparison, the code needs to also compile and give correct results under at least VC++ 2010 and g++.
Hoped-for Discussion

Besides better algorithms, I'd also like to get some benchmarks on several different platforms and compilers (let's use MB/s throughput as our standard unit of measure). I believe that the code for my algorithm (I know the sprintf benchmark takes some shortcuts -- now fixed) is well-defined behavior by the standard, at least under the ASCII assumption, but if you see any undefined behavior or inputs for which the output is invalid, please point that out.

Conclusions:

Different algorithms perform for g++ and VC2010, likely due to the different implementations of std::string on each. VC2010 clearly does a better job with NRVO, getting rid of return-by-value helped only on gcc.

Code was found that outperforms sprintf by an order of magnitude. ostringstream falls behind by a factor of 50 and more.

The winner of the challenge is user434507 who produces code that runs 350% of the speed of my own on gcc. Further entries are closed due to the whims of the SO community.

The current (final?) speed champions are:
- For gcc: user434507, at 8 times faster than sprintf: http://ideone.com/0uhhX
- For Visual C++: Timo, at 15 times faster than sprintf: http://ideone.com/VpKO3
Ben Voigt over 13 years

Thanks for your attempt. On ideone (ideone.com/BCp5r), it scores 18.5 MB/s, about half the speed of sprintf. And with VC++ 2010, it gets about 50 MB/s, about twice the speed of sprintf.
Eugene Smith over 13 years

MB/s is a strange metric, especially seeing how you don't remove trailing whitespaces from the string in your implementations. My updated code runs faster than your implementation with x64 VC++ 2005 on Core i7 920 (16.2M ops/s vs. 14.8M ops/s), _ltoa does 8.5M ops/s and sprintf() does 3.85M ops/s.
Ben Voigt over 13 years

Your code doesn't properly resize the string, mine does (see lines 81, 198, and 290). I took some shortcuts in the sprintf implementation, I already mentioned that in my question, but I believe the code-to-beat gives exactly the same result as stringstream.
Ben Voigt over 13 years

I've fixed the sprintf wrapper as well, to avoid confusion.
Ben Voigt over 13 years

BTW, your improved version (ideone.com/GLAbS) gets 41.7 MB/s on ideone, and right around 120 MB/s on VC++ 2010 32-bit.
vpalmu over 13 years

Remaining fix -- try using the old math function div. On some architectures calling it is faster.
Ben Voigt over 13 years

With unsigned variant: ideone.com/pswq9. It seems that changing the buffer type from char to unsigned produces a similar speed improvement in my code, at least on gcc/ideone ideone.com/uthKK. I'll test on VS tomorrow.
Behrouz.M over 13 years

thanks for these infos , please explain about gcc speed ! it is very low :(
Ben Voigt over 13 years

@Behrouz: Indeed. I'm not exactly sure why gcc is so slow, whether it is gcc's version of std::string or poor optimization of the arithmetic code. I shall make another version that doesn't convert to std::string at the end and see whether gcc fares any better.
Eugene Smith over 13 years

Ben: what happens if you declare 'val' as int? the only reason for it to be __int64 is to avoid the integer overflow for n=INT_MIN, which is a case that can be handled explicitly. I'd imagine that manipulating a __int64 is costly in 32-bit.
Eugene Smith over 13 years

Joshua: not on Intel. VC2005 compiler is smart enough to replace the division by 10 with integer multiplication, which is THE fastest way to do things here.
Eugene Smith over 13 years

OK, after some more hacking in 32-bit VC2005, I got to the point where two calls to string::resize() (one in the beginning with resize(16) and one in the end with resize(size+sign)) take more time than the actual conversion. It seems that memmove() is being called. This is bizarre...
Ben Voigt over 13 years

The new version is fantastic under gcc. I had to make a couple small modifications to make it comply with the rules (caller-provided result string instead of global), but get your same great performance. ideone.com/0uhhX I'll test on VC++ 2010 later and let you know how it performs, but you win as soon as you update your answer to the new version of the code.
Ben Voigt over 13 years

BTW, while I was writing the question, I saw memmove appearing in the call tree under VC2010 as well, tracked it down to silly logic in std::string::erase when erasing from the tail, and submitted a patch to Microsoft. Hopefully the next Visual Studio service pack will resize std::string much faster.
Ben Voigt over 13 years

@Timo: That's very cool. I didn't really expect the change to an unsigned buffer to help with VC++, which was already quite fast, so it was only applicable to gcc and now user434507 has provided a far better version there.
Eugene Smith over 13 years

But wait, there's more. Incorporating some ideas of yours and timo gets us here: ideone.com/UiWKI
Ben Voigt over 13 years

@user434507: Seems safer to just use memcpy like Timo did... it's an intrinsic on most compilers, so the unaligned copy will get used on architectures that allow it. At any rate, there was no speed decrease on ideone: ideone.com/VLUco
Ben Voigt over 13 years

std::string::clear in Visual Studio is probably affected by the std::string::erase problems, since erase(0, npos) is a reasonable way to implement clear and avoid code duplication.
Chris Hopman over 13 years

For those that are interested in how hopman-fun works but don't feel like puzzling it out, I created a commented version at ideone.com/rnDxk
Ben Voigt over 13 years

<quote>int2string will always involve heap allocation from std::string</quote> Not with the small-string optimization, which is present in most current implementations of the Standard Library.
Timo over 13 years

I don't understand how the first one works even with the comments. :D The fast one is really nice, though it has its price in memory usage. But I guess 40kB is still acceptable. I actually modified my own code to also use 4 character groups, and got similar speed. ideone.com/KbTFe
fuz almost 12 years

I have posted this question with a different algorithm.
Ben Voigt over 11 years

It's because this challenge is about speed, not the fewest lines of code.
Ben Voigt over 10 years

You never put it into the string. Also I don't know why your results for everyone else's code are so low, your CPU isn't slow.
Ben Voigt over 10 years

In the end, though, the "output as std::string" requirement was placed there just to make things fair and consistent for all submissions. Algorithms that are faster to make std::string results will also be faster to fill a preallocated buffer.
Martin Ba over 10 years

@Ben - good comments. Esp. the sm.str.opt. is something I'll have to remember in the future when judging std.string performance.
Denis Zaikin almost 10 years

modp_ufast has an error, it returns 10 instead of 1000000, 19 instead of 1090000 and etc, till 11000000.
atlaste about 9 years

PS: And for the people that want to use this in my solution: (1) it's much much slower and (2) because div works on signed integers - which breaks abs(INT32_MIN).
Ben Voigt about 9 years

Interestingly, I recently gave a copy of Hacker's Delight to a coworker. Any particular sections? Of course, note that modulo and div, although both returned from a single divide instruction, will not be obtained that way, because division by a constant is implemented much faster using hardware multiply than divide.
atlaste about 9 years

@BenVoigt actually if you run 'disassemble' on VS2013 you get exactly the code you would expect after reading H's delight. The chapter you're looking for is chapter 10.
Ben Voigt about 9 years

Yes, that's the implementation using hardware multiply I was referring to.
atlaste about 9 years

@BenVoigt Yes of course, that was what I meant. Both modulo and multiply (by constant) use the same magic number, shift (arith and normal). My assumption here was that the compiler is able to figure out it is emitting the same instructions multiple times and optimize that - and since all operations can be vectorized, it might figure that out as well (let's call that a bonus :-). My point with H's delight was that if you know how these operations are compiled (integer multiply, shift), you can make these assumptions.
Waldemar over 8 years

I've tested it from 0x80000000 to 0x7FFFFFFF and already at -999999999 you get invalid values (I've stopped after a few mismatches). Mismatch found: Generated: -9999999990 Reference: -999999999 Mismatch found: Generated: -9999999980 Reference: -999999998 Mismatch found: Generated: -9999999970 Reference: -999999997
Waldemar over 8 years

Modified ufast returns invalid values (stopped after a few errors). Mismatch found: Generated: -99 Reference: -9099999 Mismatch found: Generated: -99 Reference: -9099998 Mismatch found: Generated: -99 Reference: -9099997
pbn over 7 years

Would it be difficult to modify it to work with uint64_t? I moved this code to C and replaced 'T' with int type and it works, but it doesn't work for uint64_t and I don't have a clue how to customize it.
luizfls about 4 years

For the record, the answer above is the "user434507" algorithm.
user1593842 almost 4 years

I think you should add a version that does not convert to std::string. By changing just one line of code the function runs in half the time on my machine, using GCC. And by removing the std::string people would be able to use this function inside C programs.
vitaut almost 4 years

There is a more portable version with benchmarks available here: github.com/fmtlib/format-benchmark/blob/master/src/u2985907.‌h