C++ performance challenge: integer to std::string conversion
Solution 1
#include <string>
const char digit_pairs[201] = {
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
};
std::string& itostr(int n, std::string& s)
{
if(n==0)
{
s="0";
return s;
}
int sign = -(n<0);
unsigned int val = (n^sign)-sign;
int size;
if(val>=10000)
{
if(val>=10000000)
{
if(val>=1000000000)
size=10;
else if(val>=100000000)
size=9;
else
size=8;
}
else
{
if(val>=1000000)
size=7;
else if(val>=100000)
size=6;
else
size=5;
}
}
else
{
if(val>=100)
{
if(val>=1000)
size=4;
else
size=3;
}
else
{
if(val>=10)
size=2;
else
size=1;
}
}
size -= sign;
s.resize(size);
char* c = &s[0];
if(sign)
*c='-';
c += size-1;
while(val>=100)
{
int pos = val % 100;
val /= 100;
*(short*)(c-1)=*(short*)(digit_pairs+2*pos);
c-=2;
}
while(val>0)
{
*c--='0' + (val % 10);
val /= 10;
}
return s;
}
std::string& itostr(unsigned val, std::string& s)
{
if(val==0)
{
s="0";
return s;
}
int size;
if(val>=10000)
{
if(val>=10000000)
{
if(val>=1000000000)
size=10;
else if(val>=100000000)
size=9;
else
size=8;
}
else
{
if(val>=1000000)
size=7;
else if(val>=100000)
size=6;
else
size=5;
}
}
else
{
if(val>=100)
{
if(val>=1000)
size=4;
else
size=3;
}
else
{
if(val>=10)
size=2;
else
size=1;
}
}
s.resize(size);
char* c = &s[size-1];
while(val>=100)
{
int pos = val % 100;
val /= 100;
*(short*)(c-1)=*(short*)(digit_pairs+2*pos);
c-=2;
}
while(val>0)
{
*c--='0' + (val % 10);
val /= 10;
}
return s;
}
This will blow up on systems that disallow unaligned memory accesses (in which case, the first unaligned assignment via *(short*)
would cause a segfault), but should work very nicely otherwise.
One important thing to do is to minimize the use of std::string
. (Ironic, I know.) In Visual Studio, for example, most calls to methods of std::string are not inlined, even if you specify /Ob2 in compiler options. So even something as trivial as a call to std::string::clear()
, which you might expect to be very fast, can take 100 clockticks when linking CRT as a static library, and as much as 300 clockticks when linking as a DLL.
For the same reason, returning by reference is better because it avoids an assignment, a constructor and a destructor.
Solution 2
Ah, awesome challenge by the way... I've had a lot of fun with this.
I have two algorithms to submit (code is at the bottom if you feel like skipping to it). In my comparisons I require that the function return a string and that it can handle int and unsigned int. Comparing things that don't construct a string to those that do doesn't really make sense.
The first one is a fun implementation that doesn't use any precomputed lookup tables or explicit division/modulo. This one is competitive with the others with gcc and with all but Timo's on msvc (for a good reason that I explain below). The second algorithm is my actual submission for highest performance. In my tests it beats all the others on both gcc and msvc.
I think I know why some of the results on MSVC are very good. std::string has two relevant constructors
std::string(char* str, size_t n)
and
std::string(ForwardIterator b, ForwardIterator e)
gcc does the same thing for both of them... that is it uses the second to implement the first. The first constructor can be implemented significantly more efficiently than that and MSVC does so. The side benefit of this is that in some cases (like my fast code and Timo's code) the string constructor can be inlined. In fact, just switching between these constructors in MSVC is almost a 2x difference for my code.
My performance testing results:
Code Sources:
- Voigt
- Timo
- ergosys
- user434507
- user-voigt-timo
- hopman-fun
- hopman-fast
gcc 4.4.5 -O2 on Ubuntu 10.10 64-bit, Core i5
hopman_fun: 124.688 MB/sec --- 8.020 s hopman_fast: 137.552 MB/sec --- 7.270 s voigt: 120.192 MB/sec --- 8.320 s user_voigt_timo: 97.9432 MB/sec --- 10.210 s timo: 120.482 MB/sec --- 8.300 s user: 97.7517 MB/sec --- 10.230 s ergosys: 101.42 MB/sec --- 9.860 s
MSVC 2010 64-bit /Ox on Windows 7 64-bit, Core i5
hopman_fun: 127 MB/sec --- 7.874 s hopman_fast: 259 MB/sec --- 3.861 s voigt: 221.435 MB/sec --- 4.516 s user_voigt_timo: 195.695 MB/sec --- 5.110 s timo: 253.165 MB/sec --- 3.950 s user: 212.63 MB/sec --- 4.703 s ergosys: 78.0518 MB/sec --- 12.812 s
Here are some results and a testing/timing framework on ideone
http://ideone.com/XZRqp
Note that ideone is a 32-bit environment. Both of my algorithms suffer from that, but hopman_fast is at least still competetive.
Note that for those the two or so that don't construct a string I added the following function template:
template <typename T>
std::string itostr(T t) {
std::string ret;
itostr(t, ret);
return ret;
}
Now for my code...first the fun one:
// hopman_fun
template <typename T>
T reduce2(T v) {
T k = ((v * 410) >> 12) & 0x000F000F000F000Full;
return (((v - k * 10) << 8) + k);
}
template <typename T>
T reduce4(T v) {
T k = ((v * 10486) >> 20) & 0xFF000000FFull;
return reduce2(((v - k * 100) << 16) + (k));
}
typedef unsigned long long ull;
inline ull reduce8(ull v) {
ull k = ((v * 3518437209u) >> 45);
return reduce4(((v - k * 10000) << 32) + (k));
}
template <typename T>
std::string itostr(T o) {
union {
char str[16];
unsigned short u2[8];
unsigned u4[4];
unsigned long long u8[2];
};
unsigned v = o < 0 ? ~o + 1 : o;
u8[0] = (ull(v) * 3518437209u) >> 45;
u8[0] = (u8[0] * 28147497672ull);
u8[1] = v - u2[3] * 100000000;
u8[1] = reduce8(u8[1]);
char* f;
if (u2[3]) {
u2[3] = reduce2(u2[3]);
f = str + 6;
} else {
unsigned short* k = u4[2] ? u2 + 4 : u2 + 6;
f = *k ? (char*)k : (char*)(k + 1);
}
if (!*f) f++;
u4[1] |= 0x30303030;
u4[2] |= 0x30303030;
u4[3] |= 0x30303030;
if (o < 0) *--f = '-';
return std::string(f, (str + 16) - f);
}
And then the fast one:
// hopman_fast
struct itostr_helper {
static unsigned out[10000];
itostr_helper() {
for (int i = 0; i < 10000; i++) {
unsigned v = i;
char * o = (char*)(out + i);
o[3] = v % 10 + '0';
o[2] = (v % 100) / 10 + '0';
o[1] = (v % 1000) / 100 + '0';
o[0] = (v % 10000) / 1000;
if (o[0]) o[0] |= 0x30;
else if (o[1] != '0') o[0] |= 0x20;
else if (o[2] != '0') o[0] |= 0x10;
else o[0] |= 0x00;
}
}
};
unsigned itostr_helper::out[10000];
itostr_helper hlp_init;
template <typename T>
std::string itostr(T o) {
typedef itostr_helper hlp;
unsigned blocks[3], *b = blocks + 2;
blocks[0] = o < 0 ? ~o + 1 : o;
blocks[2] = blocks[0] % 10000; blocks[0] /= 10000;
blocks[2] = hlp::out[blocks[2]];
if (blocks[0]) {
blocks[1] = blocks[0] % 10000; blocks[0] /= 10000;
blocks[1] = hlp::out[blocks[1]];
blocks[2] |= 0x30303030;
b--;
}
if (blocks[0]) {
blocks[0] = hlp::out[blocks[0] % 10000];
blocks[1] |= 0x30303030;
b--;
}
char* f = ((char*)b);
f += 3 - (*f >> 4);
char* str = (char*)blocks;
if (o < 0) *--f = '-';
return std::string(f, (str + 12) - f);
}
Solution 3
Benchmark data for the code provided in the question:
On ideone (gcc 4.3.4):
- stringstreams: 4.4 MB/s
- sprintf: 25.0 MB/s
- mine (Ben Voigt): 55.8 MB/s
- Timo: 58.5 MB/s
- user434507: 199 MB/s
- user434507's Ben-Timo-507 hybrid: 263 MB/s
Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 32-bit:
cl /Ox /EHsc
- stringstreams: 3.39 MB/s, 3.67 MB/s
- sprintf: 16.8 MB/s, 16.2 MB/s
- mine: 194 MB/s, 207 MB/s (with PGO enabled: 250 MB/s)
Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 64-bit:
cl /Ox /EHsc
- stringstreams: 4.42 MB/s, 4.92 MB/s
- sprintf: 21.0 MB/s, 20.8 MB/s
- mine: 238 MB/s, 228 MB/s
Core i7, Windows 7 64-bit, 8 GB RAM, cygwin gcc 4.3.4:
g++ -O3
- stringstreams: 2.19 MB/s, 2.17 MB/s
- sprintf: 13.1 MB/s, 13.4 MB/s
- mine: 30.0 MB/s, 30.2 MB/s
edit: I was gonna add my own answer, but the question was was closed so I'm adding it here. :) I wrote my own algorithm and managed to get a decent improvement over Ben's code, though I only tested it in MSVC 2010. I also made a benchmark of all the implementations presented so far, using the same testing setup that was in Ben's original code. -- Timo
Intel Q9450, Win XP 32bit, MSVC 2010
cl /O2 /EHsc
- stringstream: 2.87 MB/s
- sprintf: 16.1 MB/s
- Ben: 202 MB/s
- Ben (unsigned buffer): 82.0 MB/s
- ergosys (updated version): 64.2 MB/s
- user434507: 172 MB/s
- Timo: 241 MB/s
-
const char digit_pairs[201] = {
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
};
static const int BUFFER_SIZE = 11;
std::string itostr(int val)
{
char buf[BUFFER_SIZE];
char *it = &buf[BUFFER_SIZE-2];
if(val>=0) {
int div = val/100;
while(div) {
memcpy(it,&digit_pairs[2*(val-div*100)],2);
val = div;
it-=2;
div = val/100;
}
memcpy(it,&digit_pairs[2*val],2);
if(val<10)
it++;
} else {
int div = val/100;
while(div) {
memcpy(it,&digit_pairs[-2*(val-div*100)],2);
val = div;
it-=2;
div = val/100;
}
memcpy(it,&digit_pairs[-2*val],2);
if(val<=-10)
it--;
*it = '-';
}
return std::string(it,&buf[BUFFER_SIZE]-it);
}
std::string itostr(unsigned int val)
{
char buf[BUFFER_SIZE];
char *it = (char*)&buf[BUFFER_SIZE-2];
int div = val/100;
while(div) {
memcpy(it,&digit_pairs[2*(val-div*100)],2);
val = div;
it-=2;
div = val/100;
}
memcpy(it,&digit_pairs[2*val],2);
if(val<10)
it++;
return std::string((char*)it,(char*)&buf[BUFFER_SIZE]-(char*)it);
}
Solution 4
While the info we get here for the algorithms is pretty nice, I think the question is "broken", and I'll explain why I think this:
The question asks to take the performance of int
->std::string
conversion, and this may be of interest when comparing a commonly available method, such as different stringstream implementations or boost::lexical_cast. It does not, however, make sense when asking for new code, a specialized algorithm, to do this. The reason is that int2string will always involve heap allocation from std::string and if we are trying to squeeze the last out of our conversion algorithm, I do not think it makes sense to mix these measurements up with the heap allocations done by std::string. If I want performant conversion I will always use a fixed size buffer and certainly never allocate anything on the heap!
To sum up, I think the timings should be split:
- First, fastest (int -> fixed buffer) conversion.
- Second, timing of (fixed buffer -> std::string) copy.
- Third, checking how the std::string allocation can directly be used as buffer, to save the copying.
These aspects should not be mixed up in one timing, IMHO.
Solution 5
Updated user2985907's answer... modp_ufast...
Integer To String Test (Type 1)
[modp_ufast]Numbers: 240000000 Total: 657777786 Time: 1.1633sec Rate:206308473.0686nums/sec
[sprintf] Numbers: 240000000 Total: 657777786 Time: 24.3629sec Rate: 9851045.8556nums/sec
[karma] Numbers: 240000000 Total: 657777786 Time: 5.2389sec Rate: 45810870.7171nums/sec
[strtk] Numbers: 240000000 Total: 657777786 Time: 3.3126sec Rate: 72450283.7492nums/sec
[so ] Numbers: 240000000 Total: 657777786 Time: 3.0828sec Rate: 77852152.8820nums/sec
[timo ] Numbers: 240000000 Total: 657777786 Time: 4.7349sec Rate: 50687912.9889nums/sec
[voigt] Numbers: 240000000 Total: 657777786 Time: 5.1689sec Rate: 46431985.1142nums/sec
[hopman] Numbers: 240000000 Total: 657777786 Time: 4.6169sec Rate: 51982554.6497nums/sec
Press any key to continue . . .
Integer To String Test(Type 2)
[modp_ufast]Numbers: 240000000 Total: 660000000 Time: 0.5072sec Rate:473162716.4618nums/sec
[sprintf] Numbers: 240000000 Total: 660000000 Time: 22.3483sec Rate: 10739062.9383nums/sec
[karma] Numbers: 240000000 Total: 660000000 Time: 4.2471sec Rate: 56509024.3035nums/sec
[strtk] Numbers: 240000000 Total: 660000000 Time: 2.1683sec Rate:110683636.7123nums/sec
[so ] Numbers: 240000000 Total: 660000000 Time: 2.7133sec Rate: 88454602.1423nums/sec
[timo ] Numbers: 240000000 Total: 660000000 Time: 2.8030sec Rate: 85623453.3872nums/sec
[voigt] Numbers: 240000000 Total: 660000000 Time: 3.4019sec Rate: 70549286.7776nums/sec
[hopman] Numbers: 240000000 Total: 660000000 Time: 2.7849sec Rate: 86178023.8743nums/sec
Press any key to continue . . .
Integer To String Test (type 3)
[modp_ufast]Numbers: 240000000 Total: 505625000 Time: 1.6482sec Rate:145610315.7819nums/sec
[sprintf] Numbers: 240000000 Total: 505625000 Time: 20.7064sec Rate: 11590618.6109nums/sec
[karma] Numbers: 240000000 Total: 505625000 Time: 4.3036sec Rate: 55767734.3570nums/sec
[strtk] Numbers: 240000000 Total: 505625000 Time: 2.9297sec Rate: 81919227.9275nums/sec
[so ] Numbers: 240000000 Total: 505625000 Time: 3.0278sec Rate: 79266003.8158nums/sec
[timo ] Numbers: 240000000 Total: 505625000 Time: 4.0631sec Rate: 59068204.3266nums/sec
[voigt] Numbers: 240000000 Total: 505625000 Time: 4.5616sec Rate: 52613393.0285nums/sec
[hopman] Numbers: 240000000 Total: 505625000 Time: 4.1248sec Rate: 58184194.4569nums/sec
Press any key to continue . . .
int ufast_utoa10(unsigned int value, char* str)
{
#define JOIN(N) N "0", N "1", N "2", N "3", N "4", N "5", N "6", N "7", N "8", N "9"
#define JOIN2(N) JOIN(N "0"), JOIN(N "1"), JOIN(N "2"), JOIN(N "3"), JOIN(N "4"), \
JOIN(N "5"), JOIN(N "6"), JOIN(N "7"), JOIN(N "8"), JOIN(N "9")
#define JOIN3(N) JOIN2(N "0"), JOIN2(N "1"), JOIN2(N "2"), JOIN2(N "3"), JOIN2(N "4"), \
JOIN2(N "5"), JOIN2(N "6"), JOIN2(N "7"), JOIN2(N "8"), JOIN2(N "9")
#define JOIN4 JOIN3("0"), JOIN3("1"), JOIN3("2"), JOIN3("3"), JOIN3("4"), \
JOIN3("5"), JOIN3("6"), JOIN3("7"), JOIN3("8"), JOIN3("9")
#define JOIN5(N) JOIN(N), JOIN(N "1"), JOIN(N "2"), JOIN(N "3"), JOIN(N "4"), \
JOIN(N "5"), JOIN(N "6"), JOIN(N "7"), JOIN(N "8"), JOIN(N "9")
#define JOIN6 JOIN5(), JOIN5("1"), JOIN5("2"), JOIN5("3"), JOIN5("4"), \
JOIN5("5"), JOIN5("6"), JOIN5("7"), JOIN5("8"), JOIN5("9")
#define F(N) ((N) >= 100 ? 3 : (N) >= 10 ? 2 : 1)
#define F10(N) F(N),F(N+1),F(N+2),F(N+3),F(N+4),F(N+5),F(N+6),F(N+7),F(N+8),F(N+9)
#define F100(N) F10(N),F10(N+10),F10(N+20),F10(N+30),F10(N+40),\
F10(N+50),F10(N+60),F10(N+70),F10(N+80),F10(N+90)
static const short offsets[] = { F100(0), F100(100), F100(200), F100(300), F100(400),
F100(500), F100(600), F100(700), F100(800), F100(900)};
static const char table1[][4] = { JOIN("") };
static const char table2[][4] = { JOIN2("") };
static const char table3[][4] = { JOIN3("") };
static const char table4[][5] = { JOIN4 };
static const char table5[][4] = { JOIN6 };
#undef JOIN
#undef JOIN2
#undef JOIN3
#undef JOIN4
char *wstr;
int remains[2];
unsigned int v2;
if (value >= 100000000) {
v2 = value / 10000;
remains[0] = value - v2 * 10000;
value = v2;
v2 = value / 10000;
remains[1] = value - v2 * 10000;
value = v2;
wstr = str;
if (value >= 1000) {
*(__int32 *) wstr = *(__int32 *) table4[value];
wstr += 4;
} else {
*(__int32 *) wstr = *(__int32 *) table5[value];
wstr += offsets[value];
}
*(__int32 *) wstr = *(__int32 *) table4[remains[1]];
wstr += 4;
*(__int32 *) wstr = *(__int32 *) table4[remains[0]];
wstr += 4;
*wstr = 0;
return (wstr - str);
}
else if (value >= 10000) {
v2 = value / 10000;
remains[0] = value - v2 * 10000;
value = v2;
wstr = str;
if (value >= 1000) {
*(__int32 *) wstr = *(__int32 *) table4[value];
wstr += 4;
*(__int32 *) wstr = *(__int32 *) table4[remains[0]];
wstr += 4;
*wstr = 0;
return 8;
} else {
*(__int32 *) wstr = *(__int32 *) table5[value];
wstr += offsets[value];
*(__int32 *) wstr = *(__int32 *) table4[remains[0]];
wstr += 4;
*wstr = 0;
return (wstr - str);
}
}
else {
if (value >= 1000) {
*(__int32 *) str = *(__int32 *) table4[value];
str += 4;
*str = 0;
return 4;
} else if (value >= 100) {
*(__int32 *) str = *(__int32 *) table3[value];
return 3;
} else if (value >= 10) {
*(__int16 *) str = *(__int16 *) table2[value];
str += 2;
*str = 0;
return 2;
} else {
*(__int16 *) str = *(__int16 *) table1[value];
return 1;
}
}
}
int ufast_itoa10(int value, char* str) {
if (value < 0) { *(str++) = '-';
return ufast_utoa10(-value, str) + 1;
}
else return ufast_utoa10(value, str);
}
void ufast_test() {
print_mode("[modp_ufast]");
std::string s;
s.reserve(32);
std::size_t total_length = 0;
strtk::util::timer t;
t.start();
char buf[128];
int len;
for (int i = (-max_i2s / 2); i < (max_i2s / 2); ++i)
{
#ifdef enable_test_type01
s.resize(ufast_itoa10(((i & 1) ? i : -i), const_cast<char*>(s.c_str())));
total_length += s.size();
#endif
#ifdef enable_test_type02
s.resize(ufast_itoa10(max_i2s + i, const_cast<char*>(s.c_str())));
total_length += s.size();
#endif
#ifdef enable_test_type03
s.resize(ufast_itoa10(randval[(max_i2s + i) & 1023], const_cast<char*>(s.c_str())));
total_length += s.size();
#endif
}
t.stop();
printf("Numbers:%10lu\tTotal:%12lu\tTime:%8.4fsec\tRate:%14.4fnums/sec\n",
static_cast<unsigned long>(3 * max_i2s),
static_cast<unsigned long>(total_length),
t.time(),
(3.0 * max_i2s) / t.time());
}
Ben Voigt
Since Stack Exchange has effectively turned my profile page into a political platform without my foreknowledge or consent, I feel obligated to share my own views: An oligarchy of five justices setting policy for the entire USA is not the government the Constitution guarantees, and it's not the one we want. Not everything President Gerald Ford said was very smart or worth repeating, but this is A government big enough to give you everything you want is a government big enough to take from you everything you have. The USA was designed as a Constitutional Republic, with checks-and-balances between the branches of the federal government, and also division of power between federal state and local governments. (And these are explicitly required by the plain language of the Constitution) All government officials need to stay within the authority granted to them by the Constitution, especially when that means they can't do whatever they want.
Updated on June 24, 2021Comments
-
Ben Voigt almost 3 years
Can anyone beat the performance of my integer to std::string code, linked below?
There are already several questions that explain how to convert an integer into a
std::string
in C++, such as this one, but none of the solutions provided are efficient.Here is compile-ready code for some common methods to compete against:
- The "C++ way", using stringstream: http://ideone.com/jh3Sa
- sprintf, which SO-ers usually recommend to the performance conscious: http://ideone.com/82kwR
Contrary to popular belief,
boost::lexical_cast
has its own implementation (white paper) and does not usestringstream
and numeric insertion operators. I'd really like to see its performance compared, because this other question suggests that it's miserable.And my own contribution, which is competitive on desktop computers, and demonstrates an approach that runs at full speed on embedded systems as well, unlike algorithms dependent on integer modulo:
- Ben's algorithms: http://ideone.com/SsEUW
If you want to use that code, I'll make it available under a simplified BSD license (commercial use allowed, attribution required). Just ask.
Finally, the function
ltoa
is non-standard but widely available.- ltoa version, for anyone who has a compiler that provides it (ideone doesn't): http://ideone.com/T5Wim
I'll post my performance measurements as an answer shortly.
Rules for algorithms
- Provide code for a conversion of at least 32-bit signed and unsigned integers into decimal.
- Produce output as a
std::string
. - No tricks that are incompatible with threading and signals (for example, static buffers).
- You may assume an ASCII character set.
- Make sure to test your code on
INT_MIN
on a two's complement machine where the absolute value is not representable. - Ideally, the output should be character-for-character identical with the canonical C++ version using
stringstream
, http://ideone.com/jh3Sa, but anything that is clearly understandable as the correct number is ok, too. - NEW: Although you can use whatever compiler and optimizer options (except completely disabled) you want for the comparison, the code needs to also compile and give correct results under at least VC++ 2010 and g++.
Hoped-for Discussion
Besides better algorithms, I'd also like to get some benchmarks on several different platforms and compilers (let's use MB/s throughput as our standard unit of measure). I believe that the code for my algorithm (I know the
sprintf
benchmark takes some shortcuts -- now fixed) is well-defined behavior by the standard, at least under the ASCII assumption, but if you see any undefined behavior or inputs for which the output is invalid, please point that out.Conclusions:
Different algorithms perform for g++ and VC2010, likely due to the different implementations of
std::string
on each. VC2010 clearly does a better job with NRVO, getting rid of return-by-value helped only on gcc.Code was found that outperforms
sprintf
by an order of magnitude.ostringstream
falls behind by a factor of 50 and more.The winner of the challenge is user434507 who produces code that runs 350% of the speed of my own on gcc. Further entries are closed due to the whims of the SO community.
The current (final?) speed champions are:
- For gcc: user434507, at 8 times faster than
sprintf
: http://ideone.com/0uhhX - For Visual C++: Timo, at 15 times faster than
sprintf
: http://ideone.com/VpKO3
-
Ben Voigt over 13 yearsThanks for your attempt. On ideone (ideone.com/BCp5r), it scores 18.5 MB/s, about half the speed of
sprintf
. And with VC++ 2010, it gets about 50 MB/s, about twice the speed of sprintf. -
Eugene Smith over 13 yearsMB/s is a strange metric, especially seeing how you don't remove trailing whitespaces from the string in your implementations. My updated code runs faster than your implementation with x64 VC++ 2005 on Core i7 920 (16.2M ops/s vs. 14.8M ops/s), _ltoa does 8.5M ops/s and sprintf() does 3.85M ops/s.
-
Ben Voigt over 13 yearsYour code doesn't properly resize the string, mine does (see lines 81, 198, and 290). I took some shortcuts in the
sprintf
implementation, I already mentioned that in my question, but I believe the code-to-beat gives exactly the same result as stringstream. -
Ben Voigt over 13 yearsI've fixed the
sprintf
wrapper as well, to avoid confusion. -
Ben Voigt over 13 yearsBTW, your improved version (ideone.com/GLAbS) gets 41.7 MB/s on ideone, and right around 120 MB/s on VC++ 2010 32-bit.
-
vpalmu over 13 yearsRemaining fix -- try using the old math function div. On some architectures calling it is faster.
-
Ben Voigt over 13 yearsWith unsigned variant: ideone.com/pswq9. It seems that changing the buffer type from
char
tounsigned
produces a similar speed improvement in my code, at least on gcc/ideone ideone.com/uthKK. I'll test on VS tomorrow. -
Behrouz.M over 13 yearsthanks for these infos , please explain about gcc speed ! it is very low :(
-
Ben Voigt over 13 years@Behrouz: Indeed. I'm not exactly sure why gcc is so slow, whether it is gcc's version of
std::string
or poor optimization of the arithmetic code. I shall make another version that doesn't convert tostd::string
at the end and see whether gcc fares any better. -
Eugene Smith over 13 yearsBen: what happens if you declare 'val' as int? the only reason for it to be __int64 is to avoid the integer overflow for n=INT_MIN, which is a case that can be handled explicitly. I'd imagine that manipulating a __int64 is costly in 32-bit.
-
Eugene Smith over 13 yearsJoshua: not on Intel. VC2005 compiler is smart enough to replace the division by 10 with integer multiplication, which is THE fastest way to do things here.
-
Eugene Smith over 13 yearsOK, after some more hacking in 32-bit VC2005, I got to the point where two calls to string::resize() (one in the beginning with resize(16) and one in the end with resize(size+sign)) take more time than the actual conversion. It seems that memmove() is being called. This is bizarre...
-
Ben Voigt over 13 yearsThe new version is fantastic under gcc. I had to make a couple small modifications to make it comply with the rules (caller-provided result string instead of global), but get your same great performance. ideone.com/0uhhX I'll test on VC++ 2010 later and let you know how it performs, but you win as soon as you update your answer to the new version of the code.
-
Ben Voigt over 13 yearsBTW, while I was writing the question, I saw
memmove
appearing in the call tree under VC2010 as well, tracked it down to silly logic instd::string::erase
when erasing from the tail, and submitted a patch to Microsoft. Hopefully the next Visual Studio service pack will resizestd::string
much faster. -
Ben Voigt over 13 years@Timo: That's very cool. I didn't really expect the change to an unsigned buffer to help with VC++, which was already quite fast, so it was only applicable to gcc and now user434507 has provided a far better version there.
-
Eugene Smith over 13 yearsBut wait, there's more. Incorporating some ideas of yours and timo gets us here: ideone.com/UiWKI
-
Ben Voigt over 13 years@user434507: Seems safer to just use
memcpy
like Timo did... it's an intrinsic on most compilers, so the unaligned copy will get used on architectures that allow it. At any rate, there was no speed decrease on ideone: ideone.com/VLUco -
Ben Voigt over 13 years
std::string::clear
in Visual Studio is probably affected by thestd::string::erase
problems, sinceerase(0, npos)
is a reasonable way to implement clear and avoid code duplication. -
Chris Hopman over 13 yearsFor those that are interested in how hopman-fun works but don't feel like puzzling it out, I created a commented version at ideone.com/rnDxk
-
Ben Voigt over 13 years<quote>int2string will always involve heap allocation from std::string</quote> Not with the small-string optimization, which is present in most current implementations of the Standard Library.
-
Timo over 13 yearsI don't understand how the first one works even with the comments. :D The fast one is really nice, though it has its price in memory usage. But I guess 40kB is still acceptable. I actually modified my own code to also use 4 character groups, and got similar speed. ideone.com/KbTFe
-
fuz almost 12 yearsI have posted this question with a different algorithm.
-
Ben Voigt over 11 yearsIt's because this challenge is about speed, not the fewest lines of code.
-
Ben Voigt over 10 yearsYou never put it into the string. Also I don't know why your results for everyone else's code are so low, your CPU isn't slow.
-
Ben Voigt over 10 yearsIn the end, though, the "output as
std::string
" requirement was placed there just to make things fair and consistent for all submissions. Algorithms that are faster to makestd::string
results will also be faster to fill a preallocated buffer. -
Martin Ba over 10 years@Ben - good comments. Esp. the sm.str.opt. is something I'll have to remember in the future when judging std.string performance.
-
Denis Zaikin almost 10 yearsmodp_ufast has an error, it returns 10 instead of 1000000, 19 instead of 1090000 and etc, till 11000000.
-
atlaste about 9 yearsPS: And for the people that want to use this in my solution: (1) it's much much slower and (2) because div works on signed integers - which breaks abs(INT32_MIN).
-
Ben Voigt about 9 yearsInterestingly, I recently gave a copy of Hacker's Delight to a coworker. Any particular sections? Of course, note that modulo and div, although both returned from a single divide instruction, will not be obtained that way, because division by a constant is implemented much faster using hardware multiply than divide.
-
atlaste about 9 years@BenVoigt actually if you run 'disassemble' on VS2013 you get exactly the code you would expect after reading H's delight. The chapter you're looking for is chapter 10.
-
Ben Voigt about 9 yearsYes, that's the implementation using hardware multiply I was referring to.
-
atlaste about 9 years@BenVoigt Yes of course, that was what I meant. Both modulo and multiply (by constant) use the same magic number, shift (arith and normal). My assumption here was that the compiler is able to figure out it is emitting the same instructions multiple times and optimize that - and since all operations can be vectorized, it might figure that out as well (let's call that a bonus :-). My point with H's delight was that if you know how these operations are compiled (integer multiply, shift), you can make these assumptions.
-
Waldemar over 8 yearsI've tested it from 0x80000000 to 0x7FFFFFFF and already at -999999999 you get invalid values (I've stopped after a few mismatches).
Mismatch found: Generated: -9999999990 Reference: -999999999 Mismatch found: Generated: -9999999980 Reference: -999999998 Mismatch found: Generated: -9999999970 Reference: -999999997
-
Waldemar over 8 yearsModified ufast returns invalid values (stopped after a few errors).
Mismatch found: Generated: -99 Reference: -9099999 Mismatch found: Generated: -99 Reference: -9099998 Mismatch found: Generated: -99 Reference: -9099997
-
pbn over 7 yearsWould it be difficult to modify it to work with uint64_t? I moved this code to C and replaced 'T' with int type and it works, but it doesn't work for uint64_t and I don't have a clue how to customize it.
-
luizfls about 4 yearsFor the record, the answer above is the "user434507" algorithm.
-
user1593842 almost 4 yearsI think you should add a version that does not convert to std::string. By changing just one line of code the function runs in half the time on my machine, using GCC. And by removing the std::string people would be able to use this function inside C programs.
-
vitaut almost 4 yearsThere is a more portable version with benchmarks available here: github.com/fmtlib/format-benchmark/blob/master/src/u2985907.h