How to do unsigned saturating addition in C?
Solution 1
You probably want portable C code here, which your compiler will turn into proper ARM assembly. ARM has conditional moves, and they can be conditional on overflow. The algorithm then becomes: add, then conditionally set the destination to unsigned(-1) if overflow was detected.
uint16_t add16(uint16_t a, uint16_t b)
{
    uint16_t c = a + b;
    if (c < a)  /* Can only happen due to overflow */
        c = -1;
    return c;
}
Note that this differs from the other algorithms in that it corrects overflow, instead of relying on another calculation to detect overflow.
x86-64 clang 3.7 -O3 output for the 32-bit version: significantly better than any other answer:
add edi, esi
mov eax, -1
cmovae eax, edi
ret
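The clang output above is for the 32-bit variant of the same pattern, which (as a sketch, with an illustrative name) looks like:

```c
#include <stdint.h>

/* 32-bit variant of the same idea: a wrapped sum is smaller than
 * either input, and assigning -1 converts to UINT32_MAX. */
uint32_t add32(uint32_t a, uint32_t b)
{
    uint32_t c = a + b;
    if (c < a)  /* Can only happen due to overflow */
        c = -1;
    return c;
}
```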
ARMv7: gcc 4.8 -O3 -mcpu=cortex-a15 -fverbose-asm output for the 32-bit version:
adds r0, r0, r1 @ c, a, b
it cs
movcs r0, #-1 @ conditional-move
bx lr
16-bit: still doesn't use ARM's unsigned-saturating add instruction (UQADD16):
add r1, r1, r0 @ tmp114, a
movw r3, #65535 @ tmp116,
uxth r1, r1 @ c, tmp114
cmp r0, r1 @ a, c
ite ls @
movls r0, r1 @,, c
movhi r0, r3 @,, tmp116
bx lr @
Solution 2
In plain C:
uint16_t sadd16(uint16_t a, uint16_t b) {
    return (a > 0xFFFF - b) ? 0xFFFF : a + b;
}

uint32_t sadd32(uint32_t a, uint32_t b) {
    return (a > 0xFFFFFFFF - b) ? 0xFFFFFFFF : a + b;
}
which is almost ready to be macro-ized and directly conveys the meaning.
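For instance, it could be macro-ized like this (a sketch; the SADD macro name and wrappers are illustrative, not from the original answer):

```c
#include <stdint.h>

/* Illustrative macro-ization: T is the operand type, MAX its all-ones
 * value. The overflow test is exactly the one used above. */
#define SADD(T, MAX, a, b) ((T)(((a) > (MAX) - (b)) ? (MAX) : (a) + (b)))

static uint16_t sadd16(uint16_t a, uint16_t b) { return SADD(uint16_t, 0xFFFFu, a, b); }
static uint32_t sadd32(uint32_t a, uint32_t b) { return SADD(uint32_t, 0xFFFFFFFFu, a, b); }
```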
Solution 3
In IA32 without conditional jumps:
uint32_t sadd32(uint32_t a, uint32_t b)
{
#if defined IA32
    /* MSVC-style inline assembly; the result is left in eax,
       which serves as the return value. */
    __asm
    {
        mov   eax, a
        xor   edx, edx
        add   eax, b
        setnc dl          ; dl = 1 if there was no carry (no overflow)
        dec   edx         ; edx = 0 if no overflow, 0xFFFFFFFF if overflow
        or    eax, edx    ; saturate to all-ones on overflow
    }
#elif defined ARM
    // ARM code
#else
    // non-IA32/ARM way, copy from above
#endif
}
Solution 4
In ARM you may already have saturated arithmetic built in. The ARMv5 DSP extensions can saturate registers to any bit length. Also, on ARM saturation is usually cheap because you can execute most instructions conditionally.
ARMv6 even has saturated addition, subtraction and all the other stuff for 32 bits and packed numbers.
On the x86 you get saturated arithmetic either via MMX or SSE.
All this needs assembler, so it's not what you've asked for.
There are C tricks to do saturated arithmetic as well. This little code does saturated addition on four bytes of a dword. It's based on the idea of calculating 32 half adders in parallel, i.e. adding the numbers without carry propagation.
This is done first. Then the carries are calculated and added, and each lane is replaced with a mask if its addition would overflow.
uint32_t SatAddUnsigned8(uint32_t x, uint32_t y)
{
    uint32_t signmask = 0x80808080;
    uint32_t t0 = (y ^ x) & signmask;
    uint32_t t1 = (y & x) & signmask;
    x &= ~signmask;
    y &= ~signmask;
    x += y;
    t1 |= t0 & x;
    t1 = (t1 << 1) - (t1 >> 7);
    return (x ^ t0) | t1;
}
You can get the same for 16 bits (or any kind of bit-field) by changing the signmask constant and the shifts at the bottom like this:
uint32_t SatAddUnsigned16(uint32_t x, uint32_t y)
{
    uint32_t signmask = 0x80008000;
    uint32_t t0 = (y ^ x) & signmask;
    uint32_t t1 = (y & x) & signmask;
    x &= ~signmask;
    y &= ~signmask;
    x += y;
    t1 |= t0 & x;
    t1 = (t1 << 1) - (t1 >> 15);
    return (x ^ t0) | t1;
}
uint32_t SatAddUnsigned32 (uint32_t x, uint32_t y)
{
    uint32_t signmask = 0x80000000;
    uint32_t t0 = (y ^ x) & signmask;
    uint32_t t1 = (y & x) & signmask;
    x &= ~signmask;
    y &= ~signmask;
    x += y;
    t1 |= t0 & x;
    t1 = (t1 << 1) - (t1 >> 31);
    return (x ^ t0) | t1;
}
The code above does the same for 16- and 32-bit values.
If you don't need the feature that the functions add and saturate multiple values in parallel, just mask out the bits you need. On ARM you also want to change the signmask constant because ARM can't load all possible 32-bit constants in a single cycle.
Edit: The parallel versions are most likely slower than the straightforward methods, but they are faster if you have to saturate more than one value at a time.
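To illustrate the lane-parallel behaviour, here is a self-contained usage sketch: the byte version is repeated so the snippet compiles on its own, with comments explaining each step; the example lane values were worked out by hand.

```c
#include <stdint.h>

/* Byte-lane saturating add, repeated from above so this sketch is
 * self-contained. */
static uint32_t SatAddUnsigned8(uint32_t x, uint32_t y)
{
    uint32_t signmask = 0x80808080;
    uint32_t t0 = (y ^ x) & signmask;  /* lanes where exactly one sign bit is set */
    uint32_t t1 = (y & x) & signmask;  /* lanes where both sign bits are set: certain overflow */
    x &= ~signmask;
    y &= ~signmask;
    x += y;                            /* half-adder sums; no carries cross lane boundaries */
    t1 |= t0 & x;                      /* overflow also when one sign bit plus a carry into bit 7 */
    t1 = (t1 << 1) - (t1 >> 7);        /* turn each overflow bit into an all-ones byte */
    return (x ^ t0) | t1;              /* restore sign bits, then saturate overflowed lanes */
}
```

For example, adding 0x01FF7F00 and 0x0101017F lane-by-lane gives 0x02, 0xFF (saturated), 0x80, and 0x7F, i.e. 0x02FF807F.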
Solution 5
Zero-branch solution:
uint32_t sadd32(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a + b;
    return -(s >> 32) | (uint32_t)s;
}
A good compiler will optimize this to avoid doing any actual 64-bit arithmetic (s>>32 will merely be the carry flag, and -(s>>32) is the result of sbb %eax,%eax).
In x86 asm (AT&T syntax, a and b in eax and ebx, result in eax):
    add %eax,%ebx
    sbb %eax,%eax
    or  %ebx,%eax
8- and 16-bit versions should be obvious. Signed version might require a bit more work.
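As a sketch of the 16-bit version (illustrative name, not from the original answer): do the addition in 32 bits, so the carry-out lands in bit 16, and broadcast it into a mask:

```c
#include <stdint.h>

/* 16-bit analogue of the widening trick: the sum is computed in
 * 32 bits, so the carry-out is bit 16; negating it produces an
 * all-ones mask that saturates the result. */
uint16_t sadd16_widened(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;
    return (uint16_t)(-(s >> 16) | s);
}
```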
Frank Szczerba
I build and scale software systems to manage planet-scale computing infrastructure as well as the high-performing global teams who develop and support them, with a focus on reliability, efficiency, and low human support costs. Before joining Google SRE, I was an embedded systems, Linux kernel, and device driver consultant and iPhone app developer with an extensive background in C, C++, Perl, and bash on Intel, MIPS, ARM, and MSP430 (no, no Perl on the MSP430...).
Updated on December 19, 2021
Comments
-
Frank Szczerba over 2 years
What is the best (cleanest, most efficient) way to write saturating addition in C?
The function or macro should add two unsigned inputs (need both 16- and 32-bit versions) and return all-bits-one (0xFFFF or 0xFFFFFFFF) if the sum overflows.
Target is x86 and ARM using gcc (4.1.2) and Visual Studio (for simulation only, so a fallback implementation is OK there).
-
Frank Szczerba over 15 years: Your answer looks like what I thought we should be doing, but like you said I'm not really sure which is better, which is why I figured I'd open it up to voting here.
-
Frank Szczerba over 15 years: I think the answer for ARM is similar (and even more efficient with conditional ops), but I'm hoping someone knows a pattern that will trick GCC into generating something close to this.
-
leo7r over 15 years: They both seem correct, therefore efficiency should decide. An extra comparison isn't obviously slower (or faster) than oversizing the addition. Do some efficiency tests for both solutions on both architectures and pick the faster one.
-
Nils Pipenbrinck over 15 years: @Frank, which GCC version are you using? (gcc --version). The newer versions do such tricks.
-
Ishbir over 15 years: The question asked for C, but anyway, nice code. Is eax returned by default as the function result?
-
Steve Jessop over 15 years: If the question wanted portability, it shouldn't have specified x86 and ARM ;-)
-
Frank Szczerba over 15 years: Is checking the sum against both inputs necessary? The limit case is (uint16_t)(0xffff + 1) which is both < 1 and < 0xffff, so it seems the second check can be avoided.
-
leo7r over 15 years: You are right, the lost overflow bit is worth MAXINT+1, so the result of overflown addition is equal to a+b-(MAXINT+1), which is both less than a and less than b.
-
MSalters over 15 years: ARM has "C-everything". Not just jump and move. But it doesn't have support for 32-bit constants. So you'd want a conditional mov 0, followed by a conditional sub 1.
-
Arafangion almost 14 years: That function is still portable - once the elif and else cases are filled in. Portable code doesn't mean that you can't optimise for particular platforms.
-
Arafangion almost 14 years: lower-case function macros? Evil!
-
Exectron almost 14 years: Nice. A nitpick: if I saw the name sadd16 in some code, my first assumption would be that the s stands for signed.
-
yingted almost 13 years: @Craig McQueen: Except for the fact that there would not be much reason to make it a function.
-
Marc Gravell over 12 years: A proposed edit by YumeYao (which I've not pushed through, as it changes the nature of the answer): The 3 instructions (xor reg,reg; setnc reg; dec reg;) can be replaced with one more efficient instruction (sbb reg,reg).
-
Joseph Garvin about 12 years: Is using the SSE instructions still faster in cases where you're only ever operating on one variable at a time?
-
Joseph Garvin about 12 years: @Anonymous: Why not? Signed overflow/underflow isn't defined.
-
yingted about 12 years: @JosephGarvin: Signed addition of unsigned ints?
-
Joseph Garvin about 12 years: @Anonymous: Craig is speaking from the standpoint of reading code where there is a call to sadd16/32. You won't see the signature unless you find and open the header.
-
yingted about 12 years: @JosephGarvin: uint32_t x=5,y=6,z=sadd32(x,y); ? The types are visible.
-
Joseph Garvin about 12 years: @Anonymous: Often when you're reading a diff (e.g. for code review, or if you're sending patches to an open source mailing list) the call is far enough from the declaration of the parameters that you can't tell.
-
Cole Tobin almost 11 years: @JosephGarvin: except that most IDEs allow you to hover over a call (or something) to see the declaration.
-
Cole Tobin almost 11 years: Would it make sense to inline this? __attribute__((always_inline)) for GCC; __forceinline for MSVC.
-
Cole Tobin almost 11 years: Two things: the __asm keyword is compiler-dependent. The standard doesn't specify a keyword for inline assembly. So this is not portable in the sense that it is compiler-dependent. For example, the Intel C++ compiler is Windows only, so if you wrote portable code utilizing Intel C++ features, it wouldn't be portable. Another thing: inline assembly prevents compiler inlining. So this optimization doesn't really help if there is still the function call overhead...
-
Cole Tobin almost 11 years: Why use ~((uint32_t)0)? You're already including <limits.h> to get the uint32_t declaration, so why not just use UINT32_MAX?
-
Dietrich Epp almost 11 years: @ColeJohnson: Use inline instead. Don't use __attribute__((always_inline)), which is usually a way of second-guessing the optimizer. The inline keyword does what you actually want and is portable.
-
Cole Tobin almost 11 years: @Dietrich: Why not force inline it besides portability? Also with plain inline, you aren't portable. MSVC requires __inline...
-
Dietrich Epp almost 11 years: @ColeJohnson: You can #define inline __inline easily enough, if you want to support compilers with poor support for C, and as long as you're aware of the semantic differences. There is really no reason to use always_inline here, and a good reason not to: it might interfere with debugging. For the same reason, C programmers ditched declaring register variables long ago: there's no point, except in perhaps very specific circumstances.
-
Cole Tobin almost 11 years: @DietrichEpp: Fair enough. I'm not gonna sit here and be given a lecture on something I already know. However, a smart compiler would not inline functions even if forced to when it's in debug mode. An example is MSVC. If you tell it to compile for debug mode, it won't inline (even forced) functions.
-
Dietrich Epp almost 11 years: @ColeJohnson: Sorry if my answer seemed like a lecture. But this isn't about how "smart" a compiler is, but just how it behaves, and GCC inlines an always_inline even when compiling with debugging symbols.
-
Cole Tobin almost 11 years: @Dietrich: That's stupid. I guess I never noticed because I work in MSVC, then port to GCC when done.
-
Alexandros over 10 years: Just a minor suggestion: the 0xFF.. constants should be changed to the equivalent UINTN_MAX constants (or (uintN_t)-1). That way, it will only take a single search & replace to write the sadd8 or sadd64 functions. (And it doesn't require you to count the number of Fs in 0xFFFFFFFFFFFFFFFF ;)
-
WiSaGaN about 9 years: "...:...?..." would be a conditional move instead of a "jump".
-
Alexandre Pereira Nunes almost 9 years: This produces nice code in gcc 5.1 when targeting armv4t, just 4 branchless instructions (two of them conditional).
-
Peter Cordes about 8 years: sbb $0, %ebx subtracts 1 or not. This gives the wrong answer if the add overflowed more than 1. What does work (as suggested by others) is using sbb same,same to produce a 0 or -1 mask, and OR the add result with that. However, that has a longer critical-path latency than add %edi, %esi / mov $-1, %eax / cmovnc %esi, %edi. (sbb and cmov have the same latency on all CPUs: 2 on Intel pre-Broadwell, and 1 otherwise.)
-
Peter Cordes about 8 years: You can write overflow&31, and it will still compile without a wasted and ecx, 31, because gcc and clang know how the shift instruction works (the ISA defines it to work that way, on every CPU since the 286; see the Intel insn ref manual linked from the x86 tag wiki). On targets where the shift works differently, they will emit the necessary instructions to make it work. Of course, this still relies on right-shift of a signed integer using an arithmetic shift, which the C standard doesn't guarantee.
Peter Cordes about 8 years: This generates optimal code on x86 with clang (mov eax,-1 / add / cmovnc), and about the same with gcc, unlike all the other answers. It's the only one that gets gcc to use the flags result from the add, instead of doing another test afterwards (except for DGentry's answer, but gcc doesn't realize both tests are the same). So one could say it's the only one where gcc "understands" what's going on. Even inline asm can't do better on x86: the compiler knows what's going on with yours, so it knows it's associative, and can choose which reg to destroy.
Peter Cordes about 8 years: You'd hope a compiler would spot that, but they don't. clang/gcc/icc all do a crap job on everything except MSalters's answer. Yours compiles to lea eax, [rdi+rsi] / mov edx, edi / mov ecx, esi / add rdx, rcx / shr rdx, 32 / neg edx / or eax, edx
Peter Cordes about 8 years: I didn't see an unsigned saturation instruction for 32-bit integers, only for packed16 (UQADD16) and packed8. There's a 32-bit add with signed saturation, though. Also, unfortunately this C code compiles to horrible code for the 32-bit case: all the overhead of doing it SWAR style, but for only one value. It unfortunately doesn't optimize away. See my comment on MSalters's answer: the godbolt link includes your version.
Peter Cordes about 8 years: This kinda sucks: first because it's MSVC inline-asm, so inputs / outputs have to go through memory. (Or if this no-return-statement with a value in eax works, then the function itself can't inline. The inputs have to go through memory regardless.) Second, because cmov is better: shorter critical path because mov eax, -1 is off the critical path, unlike sbb.
-
Peter Cordes about 8 years: This produces significantly worse code for x86 and ARM than MSalters's answer. Have a look at it for asm output (including a godbolt link comparing this with it).
-
rici about 6 years: @PeterCordes: Care to comment on the behaviour of more recent clang/gcc versions? Since clang 3.9 and gcc 6.1, the 16-bit version gets quite a lot bulkier. I convinced clang to produce the same code as you show by disabling likely, but gcc seems more insistent. The 32-bit versions work as expected (again, disabling likely for clang) but I need a 16-bit saturating add.
-
Peter Cordes about 6 years: @rici: For unsigned 16-bit, if the compiler already has values zero-extended in registers, it might be optimal to do a 32-bit addition and just check sum & (1UL<<16) for carry-out. Compilers don't do an optimal job with this (by any means), but clang 6.0's branchy version is interesting if the normal case is no overflow. godbolt.org/g/qrpPze. (It should use lea to copy-and-add, though.) If partial-register stalls for 16-bit regs don't exist (like on Haswell), clang's branchy version of this answer looks ok, too, but gcc's has a silly test (missed optimization should be reported).
Peter Cordes about 6 years: These might end up different when inlining; branch layout would very likely be different when it's not just a stand-alone function.
-
rici about 6 years: @peter: my actual use case is comparing z < clamped_subtract(h, 4) where z is a size_t and h is a uint16_t. The existing code is z + 4 < h, but that of course fails if the addition overflows (hugely unlikely, but it's a glitch and I'd like to fix it). It's not in a critical path so I'm not too concerned but I was looking to see if there was something better than two comparisons.
Peter Cordes almost 5 years: This also uses the undefined operation of a+b overflowing! Signed overflow is UB in C and C++.
-
Peter Cordes almost 5 years: jno checks for signed overflow. jnc would check for unsigned wraparound like this Q wants, which would match with mov eax, -1 (or your short form with a false dependency; or eax, -1). But if you're going to introduce a data dependency on the add, defeating the benefit for branch-prediction + speculative execution, you might use sbb edx,edx / or eax, edx to broadcast CF to all bits and OR that in. But CMOVC would be more efficient, only 1 or 2 uops on the critical path instead of 2 or 3.
-
Peter Cordes almost 5 years: ARM can create small negative numbers with mvn (mov-NOT) with an immediate. Assemblers know how to use this for you, e.g. adds r0, r1 (add and set flags) / movCS r0, #-1 (mvn 0 = -1 if Carry Set). xD, MSalters's own answer posted later shows that compilers already do exactly that. And also emit this for x86, so you don't have to. And in a way that can inline and constant-propagate.
Peter Cordes almost 5 years: @JosephGarvin: yes, it can be, if you needed saturating 16-bit or 8-bit add or subtract. Or bit-reverse (with SSSE3 pshufb for a per-nibble parallel lookup table). Or with SSE4.1, min or max on 32-bit integers (or abs) with a single instruction. Or 64-bit integer math in 32-bit code. But there's overhead in getting numbers between XMM and integer registers, so use with care.
-
Alexis Wilke almost 2 years: Pretty good, although I think you further need to determine whether T is signed or unsigned, because the test against max() as it is won't work properly when signed integers are used.