Efficient 128-bit addition using carry flag
Solution 1
Actually gcc will use the carry automatically if you write your code carefully...
Current GCC can optimize hiWord += (loWord < loAdd);
into add
/adc
(x86's add-with-carry). This optimization was introduced in GCC5.3.
- With separate
uint64_t
chunks in 64-bit mode: https://godbolt.org/z/S2kGRz. - And the same thing in 32-bit mode with
uint32_t
chunks: https://godbolt.org/z/9FC9vc
(editor's note: Of course the hard part is writing a correct full-adder with carry in and carry out; that's hard in C and GCC doesn't know how to optimize any that I've seen.)
Also related: https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html can give you carry-out from unsigned, or signed-overflow detection.
Older GCC, like GCC4.5, will branch or setc
on the carry-out from an add, instead of using adc
, and only used adc
(add-with-carry) on the flag-result from an add
if you used __int128
. (Or uint64_t
on a 32-bit target). See Is there a 128 bit integer in gcc? - only on 64-bit targets, supported since GCC4.1.
I compiled this code with gcc -O2 -Wall -Werror -S
:
void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
if (loWord < loAdd) ++hiWord; // test_and_add_carry
hiWord += hiAdd;
}
void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
loWord += loAdd;
hiWord += hiAdd;
hiWord += (loWord < loAdd); // test_and_add_carry
}
This is the assembly for increment128_1:
.cfi_startproc
movabsq $-8801131483544218438, %rax
addq (%rsi), %rax
movabsq $-8801131483544218439, %rdx
cmpq %rdx, %rax
movq %rax, (%rsi)
ja .L5
movq (%rdi), %rax
addq $1, %rax
.L3:
movabsq $6794178679361, %rdx
addq %rdx, %rax
movq %rax, (%rdi)
ret
...and this is the assembly for increment128_2:
movabsq $-8801131483544218438, %rax
addq %rax, (%rsi)
movabsq $6794178679361, %rax
addq (%rdi), %rax
movabsq $-8801131483544218439, %rdx
movq %rax, (%rdi)
cmpq %rdx, (%rsi)
setbe %dl
movzbl %dl, %edx
leaq (%rdx,%rax), %rax
movq %rax, (%rdi)
ret
Note the lack of conditional branches in the second version.
[edit]
Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:
struct my_uint128_t {
unsigned long hi;
unsigned long lo;
};
my_uint128_t increment128_3(my_uint128_t x)
{
const unsigned long hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
x.lo += loAdd;
x.hi += hiAdd + (x.lo < loAdd);
return x;
}
Assembly:
.cfi_startproc
movabsq $-8801131483544218438, %rdx
movabsq $-8801131483544218439, %rax
movabsq $6794178679362, %rcx
addq %rsi, %rdx
cmpq %rdx, %rax
sbbq %rax, %rax
addq %rcx, %rax
addq %rdi, %rax
ret
This is actually the tightest code of the three.
...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).
[edit 2]
And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?
typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));
my_uint128_t increment128_4(my_uint128_t x)
{
const my_uint128_t hiAdd=0x0000062DE49B5241;
const unsigned long loAdd=0x85DC198BCDD714BA;
return x + (hiAdd << 64) + loAdd;
}
The assembly for this one is about as good as it gets:
.cfi_startproc
movabsq $-8801131483544218438, %rax
movabsq $6794178679361, %rdx
pushq %rbx
.cfi_def_cfa_offset 16
addq %rdi, %rax
adcq %rsi, %rdx
popq %rbx
.cfi_offset 3, -16
.cfi_def_cfa_offset 8
ret
(Not sure where the push/pop of ebx
came from, but this is still not bad.)
All of these are with GCC 4.5.2, by the way.
Solution 2
The best answer, of course, is to use the built-in __int128_t
support.
Alternatively, use an inline asm. I prefer to use the named-argument form:
__asm("add %[src_lo], %[dst_lo]\n"
"adc %[src_hi], %[dst_hi]"
: [dst_lo] "+&r" (loWord), [dst_hi] "+r" (hiWord)
: [src_lo] "erm" (loAdd), [src_hi] "erm" (hiAdd)
: );
loWord
is flagged as an early clobber operand, because it's written before some of the other operands are read. This avoids wrong code for hiAdd = loWord
, because it will stop gcc from using the same register to hold both. It does stop the compiler from using the same register for the loAdd = loWord
case, though, where it is safe.
As that early-clobber question points out, inline asm is really easy to get wrong (in hard-to-debug ways which only cause trouble up after some change to the code it's inlined into).
x86 and x86-64 inline asm is assumed to clobber the flags, so an explicit "cc" clobber isn't needed.
Randall Meyers
Updated on June 06, 2022Comments
-
Randall Meyers about 2 years
I'm using a 128 bit integer counter in the very inner loops of my C++ code. (Irrelevant background: The actual application is evaluating finite difference equations on a regular grid, which involves repetitively incrementing large integers, and even 64 bits isn't enough precision because small rounding accumulates enough to affect the answers.)
I've represented the integer as two 64 bit unsigned longs. I now need to increment those values by a 128 bit constant. This isn't hard, but you have to manually catch the carry from the low word to the high word.
I have working code something like this:
inline void increment128(unsigned long &hiWord, unsigned long &loWord) { const unsigned long hiAdd=0x0000062DE49B5241; const unsigned long loAdd=0x85DC198BCDD714BA; loWord += loAdd; if (loWord < loAdd) ++hiWord; // test_and_add_carry hiWord += hiAdd; }
This is tight and simple code. It works.
Unfortunately this is about 20% of my runtime. The killer line is that loWord test. If I remove it, I obviously get the wrong answers but the runtime overhead drops from 20% to 4%! So that carry test is especially expensive!
My question: Does C++ expose the hardware carry flags, even as an extension to GCC? It seems like the additions could be done without the test-and-add-carry line above if the actual compiled instructions used an add using last carry instruction for the hiWord addition. Is there a way to rewrite the test-and-add-carry line to get the compiler to use the intrinsic opcode?
-
gordy almost 13 years+1, the very last one has the adc instruction you're looking for
-
Nemo almost 13 years@Gordy: Indeed. I am a little surprised I could not coax GCC into generating it otherwise... I could have sworn I had seen it do so at some point (and not with the "TI" mode type).
-
Randall Meyers almost 13 yearsThis is a terrific analysis! I'll be marking this as the correct answer, but bruce brings up a good point in his comment: why not just use an ASM include? I made a second answer below just to document it, even though this answer here by Nemo will be the best reference.
-
Nemo almost 13 years@Randall: The asm solution works, but it is specific to x86_64. The 128-bit GCC integers will work on any platform that has GCC, and the other suggestions will work anywhere you have a C++ compiler. You might think you can ignore such concerns, but years of experience suggest you will live to regret it if you do :-). Only go the full asm route if you simply cannot get adequate performance any other way; and if you do use it, be sure to limit it to a single module if possible.
-
Stephen Canon almost 13 years@Nemo: does gcc support
__int128_t
on 32-bit platforms now? it didn't at some point in the past. -
Nemo almost 13 years@Stephen: Sorry, but I do not know. I do not even have a 32-bit system anymore... Very possible you are right.
-
Z boson over 9 yearsShould you use an early clobber flag on the outputs (at least for hiWord)? For 256-bit addition (one add, three adc) I get the wrong result without using early clobber flags.
-
Stephen Canon over 9 yearsProbably, but you'd need it on loWord, I think.
-
Z boson over 9 years@StephenCanon, GCC's
__int128
requires 64-bit. It's easy to check by compiling for 32-bit (e.g. with-m32
). MSVC's_umul128
also requires 64-bit. -
Stephen Canon over 9 years@Zboson: it's easy to check if you use (or have) an up-to-date GCC.
-
Benoît almost 9 yearsUsing add + adc is definitely the way to go. It's unfortunate that C's abstraction hides the fact that it's easy to do a 128bit addition.
-
Peter Cordes over 8 yearsThe constraints on the source operands (
loAdd
andhiAdd
) could beemr
. (e
means 32bit signed constant. 64bit add/adc can take sign-extended 32bit immediate operands). There should also be a clobber oncc
, since this modifies the flags. In this case, broadening the constraints doesn't really matter, sincemov r64, imm64
is the best way to set things up anyway. But see goo.gl/MmdKgo for a case where an early-clobber onloWord
will generate different code (but worse, with a uselessmov
). I still haven't figured out if there's a need at all. -
Peter Cordes over 8 yearsupdate: hiAdd is read after loWord is written, so with
hiAdd = loWord
, you get wrong code without an early clobber (code which would add the new value of loWord, not the old value). -
Marc Glisse over 8 years@PeterCordes Isn't the cc clobber implicit for inline asm on x86? (which is actually a pain in cases where you don't clobber it...)
-
Peter Cordes over 8 years@MarcGlisse: Oh, I didn't realize that! Yes, it's that way for sure on x86-32 (i386), and it looks like that discussion about what to do for x86-64 picked the sane choice of working the same as i386. Thanks.
-
Tomilov Anatoliy about 8 yearsHow to generalize it to wider long integers? (maybe should be there a loop?)
-
Peter Cordes over 4 yearsre: previous comments: it's easy to check on godbolt.org which has up-to-date gcc, clang, ICC, and MSVC for a variety of architectures, including x86-64 (and you can use
-m32
). I'm pretty sure more recent GCC will recognize thehi += lo<x
pattern and useadc
on the carry-out fromadd
. And clang definitely can, even for carry out from__int128
: Fastest way to sum dot product of vector of unsigned 64 bit integers using 192/256 bit integer?