How to do unsigned saturating addition in C?


Solution 1

You probably want portable C code here, which your compiler will turn into proper ARM assembly. ARM has conditional moves, and these can be conditional on overflow. The algorithm then becomes: add, and conditionally set the destination to all-ones ((unsigned)-1) if overflow was detected.

#include <stdint.h>

uint16_t add16(uint16_t a, uint16_t b)
{
  uint16_t c = a + b;
  if (c < a)  /* Can only happen due to overflow */
    c = -1;   /* converts to 0xFFFF, i.e. all-ones */
  return c;
}

Note that this differs from the other algorithms in that it corrects overflow, instead of relying on another calculation to detect overflow.
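
As a quick sanity check, here is a minimal test harness (my own sketch, not part of the answer, assuming the add16 above is in scope):

#include <assert.h>
#include <stdint.h>

int main(void)
{
  assert(add16(1, 2) == 3);                /* no overflow: plain sum */
  assert(add16(0xFFFF, 1) == 0xFFFF);      /* overflow: clamps to all-ones */
  assert(add16(0x8000, 0x8000) == 0xFFFF); /* wraps to 0, then corrected */
  return 0;
}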

x86-64 clang 3.7 -O3 output for the 32-bit version (adds32): significantly better than any other answer:

add     edi, esi
mov     eax, -1
cmovae  eax, edi
ret

ARMv7: gcc 4.8 -O3 -mcpu=cortex-a15 -fverbose-asm output for adds32:

adds    r0, r0, r1      @ c, a, b
it      cs
movcs   r0, #-1         @ conditional-move
bx      lr

16-bit: still doesn't use ARM's unsigned-saturating packed add instruction (UQADD16):

add     r1, r1, r0        @ tmp114, a
movw    r3, #65535      @ tmp116,
uxth    r1, r1  @ c, tmp114
cmp     r0, r1    @ a, c
ite     ls        @
movls   r0, r1        @,, c
movhi   r0, r3        @,, tmp116
bx      lr  @

Solution 2

In plain C:

#include <stdint.h>

uint16_t sadd16(uint16_t a, uint16_t b) {
  return (a > 0xFFFF - b) ? 0xFFFF : a + b;
}

uint32_t sadd32(uint32_t a, uint32_t b) {
  return (a > 0xFFFFFFFF - b) ? 0xFFFFFFFF : a + b;
}

which can almost be macro-ized and directly conveys the meaning (see the sketch below).
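
For illustration, here is one way that macro-ization could look (SADDN is a hypothetical name, not from the answer):

#include <stdint.h>

/* Clamp a+b to max; assumes unsigned operands and an unsigned max */
#define SADDN(a, b, max)  (((a) > (max) - (b)) ? (max) : (a) + (b))

uint16_t sadd16(uint16_t a, uint16_t b) { return SADDN(a, b, 0xFFFFu); }
uint32_t sadd32(uint32_t a, uint32_t b) { return SADDN(a, b, 0xFFFFFFFFu); }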

Solution 3

In IA32 without conditional jumps:

#include <stdint.h>

uint32_t sadd32(uint32_t a, uint32_t b)
{
#if defined IA32
  __asm
  {
    mov eax,a      // eax = a
    xor edx,edx    // edx = 0
    add eax,b      // eax = a + b, carry flag set on overflow
    setnc dl       // dl = 1 if no carry, else 0
    dec edx        // edx = 0 if no carry, else -1 (all-ones mask)
    or eax,edx     // force the result to all-ones on overflow
  }
  /* MSVC returns the value left in eax */
#elif defined ARM
  // ARM code
#else
  // non-IA32/ARM way, copy from above:
  return (a > 0xFFFFFFFFu - b) ? 0xFFFFFFFFu : a + b;
#endif
}

Solution 4

On ARM you may already have saturating arithmetic built in. The ARMv5 DSP extensions can saturate registers to any bit length. Also, on ARM saturation is usually cheap because you can execute most instructions conditionally.

ARMv6 even has saturated addition, subtraction, and all the other stuff for 32-bit and packed numbers.
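
For example, UQADD16 (two unsigned saturating 16-bit adds at once) can be reached from C with GCC extended inline asm; this is a sketch assuming an ARMv6 (or later) target, and the function name is mine:

#include <stdint.h>

/* Unsigned saturating add of two packed halfwords, ARMv6 UQADD16 */
uint32_t uqadd16_pair(uint32_t a, uint32_t b)
{
  uint32_t r;
  __asm__("uqadd16 %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
  return r;
}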

On the x86 you get saturated arithmetic either via MMX or SSE.
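
For instance, SSE2's _mm_adds_epu16 (the PADDUSW instruction) does unsigned saturating 16-bit adds; a sketch for a single scalar value (the wrapper name is mine, and the XMM round-trip usually costs more than the scalar versions above):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

uint16_t sadd16_sse2(uint16_t a, uint16_t b)
{
  __m128i va = _mm_cvtsi32_si128(a);   /* zero-extend into an XMM register */
  __m128i vb = _mm_cvtsi32_si128(b);
  __m128i vs = _mm_adds_epu16(va, vb); /* per-lane unsigned saturating add */
  return (uint16_t)_mm_cvtsi128_si32(vs);
}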

All this needs assembler or intrinsics, so it's not what you've asked for.

There are C tricks to do saturated arithmetic as well. This little code does saturated addition on the four bytes of a dword. It's based on the idea of calculating 32 half-adders in parallel, i.e. adding the numbers without letting carries propagate across lanes.

This is done first. Then the carries are calculated, added in, and each lane is replaced with an all-ones mask if its addition would overflow.

#include <stdint.h>

uint32_t SatAddUnsigned8(uint32_t x, uint32_t y)
{
  uint32_t signmask = 0x80808080;   /* the top bit of each byte lane */
  uint32_t t0 = (y ^ x) & signmask; /* lanes where exactly one top bit is set */
  uint32_t t1 = (y & x) & signmask; /* lanes where both top bits are set: guaranteed carry */
  x &= ~signmask;                   /* clear the top bits so lanes can't interfere */
  y &= ~signmask;
  x += y;                           /* add the low 7 bits of all lanes at once */
  t1 |= t0 & x;                     /* carry out of a lane's top bit */
  t1 = (t1 << 1) - (t1 >> 7);       /* expand each carry bit into an all-ones byte mask */
  return (x ^ t0) | t1;             /* restore the top bits, saturate overflowed lanes */
}
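
For example, each byte lane saturates independently (my own check, with arbitrary values):

#include <assert.h>

int main(void)
{
  /* byte lanes: FF+01 saturates to FF, 01+01 = 02, FE+01 saturates to FF, 02+01 = 03 */
  assert(SatAddUnsigned8(0xFF01FE02u, 0x01010101u) == 0xFF02FF03u);
  return 0;
}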

You can get the same for 16 bits (or any kind of bit-field) by changing the signmask constant and the shifts at the bottom like this:

uint32_t SatAddUnsigned16(uint32_t x, uint32_t y) 
{
  uint32_t signmask = 0x80008000;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 15);
  return (x ^ t0) | t1;
}

uint32_t SatAddUnsigned32 (uint32_t x, uint32_t y)
{
  uint32_t signmask = 0x80000000;
  uint32_t t0 = (y ^ x) & signmask;
  uint32_t t1 = (y & x) & signmask;
  x &= ~signmask;
  y &= ~signmask;
  x += y;
  t1 |= t0 & x;
  t1 = (t1 << 1) - (t1 >> 31);
  return (x ^ t0) | t1;
}

The code above does the same for 16- and 32-bit values.

If you don't need the feature that the functions add and saturate multiple values in parallel, just mask out the bits you need, as sketched below. On ARM you also want to change the signmask constant, because ARM can't load all possible 32-bit constants in a single instruction.
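
A minimal sketch of that masking for a single byte (the wrapper name is mine): zero-extended inputs leave the upper lanes empty, so the low byte of the result is already the saturated sum.

#include <stdint.h>

uint8_t sadd8(uint8_t a, uint8_t b)
{
  return (uint8_t)SatAddUnsigned8(a, b); /* only the low lane is populated */
}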

Edit: The parallel versions are most likely slower than the straightforward methods, but they are faster if you have to saturate more than one value at a time.

Solution 5

Zero-branch solution:

#include <stdint.h>

uint32_t sadd32(uint32_t a, uint32_t b)
{
    uint64_t s = (uint64_t)a + b;
    return -(s >> 32) | (uint32_t)s;
}

A good compiler will optimize this to avoid doing any actual 64-bit arithmetic (s>>32 will merely be the carry flag, and -(s>>32) is the result of sbb %eax,%eax).

In x86 asm (AT&T syntax, a and b in eax and ebx, result in eax):

add %eax,%ebx
sbb %eax,%eax
or %ebx,%eax

8- and 16-bit versions should be obvious. Signed version might require a bit more work.
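
Spelled out, the 16-bit version would follow the same pattern (a sketch, not from the answer):

#include <stdint.h>

uint16_t sadd16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;      /* the sum can't wrap in 32 bits */
    return (uint16_t)(-(s >> 16) | s); /* -(carry) is an all-ones mask on overflow */
}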


Comments

  • Frank Szczerba
    Frank Szczerba over 2 years

    What is the best (cleanest, most efficient) way to write saturating addition in C?

    The function or macro should add two unsigned inputs (need both 16- and 32-bit versions) and return all-bits-one (0xFFFF or 0xFFFFFFFF) if the sum overflows.

    Target is x86 and ARM using gcc (4.1.2) and Visual Studio (for simulation only, so a fallback implementation is OK there).

  • Frank Szczerba
    Frank Szczerba over 15 years
    Your answer looks like what I thought we should be doing, but like you said I'm not really sure which is better, which is why I figured I'd open it up to voting here.
  • Frank Szczerba
    Frank Szczerba over 15 years
    I think the answer for ARM is similar (and even more efficient with conditional ops), but I'm hoping someone knows a pattern that will trick GCC into generating something close to this.
  • leo7r
    leo7r over 15 years
    They both seem correct, therefore efficiency should decide. An extra comparison isn't obviously slower (or faster) than oversizing the addition. Do some efficiency tests for both solutions on both architectures and pick the faster one.
  • Nils Pipenbrinck
    Nils Pipenbrinck over 15 years
@Frank, which GCC version are you using? (gcc --version). The newer versions do such tricks.
  • Ishbir
    Ishbir over 15 years
    The question asked for C, but anyway, nice code. Is eax returned by default as the function result?
  • Steve Jessop
    Steve Jessop over 15 years
    If the question wanted portability, it shouldn't have specified x86 and ARM ;-)
  • Frank Szczerba
    Frank Szczerba over 15 years
    Is checking the sum against both inputs necessary? The limit case is (uint16_t)(0xffff + 1) which is both < 1 and < 0xffff, so it seems the second check can be avoided.
  • leo7r
    leo7r over 15 years
    You are right, the lost overflow bit is worth MAXINT+1, so the result of overflown addition is equal to a+b-(MAXINT+1), which is both less than a and less than b.
  • MSalters
    MSalters over 15 years
    ARM has "C-everything". Not just jump and move. But it doesn't have support for 32 bits constants. So you'd want a conditional mov 0, followed by a conditional sub 1
  • Arafangion
    Arafangion almost 14 years
    That function is still portable - once the elif and else cases are filled in. Portable code doesn't mean that you can't optimise for particular platforms.
  • Arafangion
    Arafangion almost 14 years
    lower-case function macros? Evil!
  • Exectron
    Exectron almost 14 years
    Nice. A nitpick--if I saw the name sadd16 in some code, my first assumption would be that the s stands for signed.
  • yingted
    yingted almost 13 years
    @Craig McQueen Except for the fact that there would not be much reason to make it a function.
  • Marc Gravell
    Marc Gravell over 12 years
    A proposed edit by YumeYao (which I've not pushed through, as it changes the nature of the answer): The 3 instructions (xor reg,reg; setne reg; dec reg;) can be replaced with one more efficient instruction (sbb reg,reg).
  • Joseph Garvin
    Joseph Garvin about 12 years
    Is using the SSE instructions still faster in cases where you're only ever operating on one variable at a time?
  • Joseph Garvin
    Joseph Garvin about 12 years
    @Anonymous: Why not? Signed overflow/underflow isn't defined.
  • yingted
    yingted about 12 years
    @JosephGarvin Signed addition of unsigned ints?
  • Joseph Garvin
    Joseph Garvin about 12 years
@Anonymous: Craig is speaking from the standpoint of reading code where there is a call to sadd16/32. You won't see the signature unless you find and open the header.
  • yingted
    yingted about 12 years
    @JosephGarvin uint32_t x=5,y=6,z=sadd32(x,y);? The types are visible.
  • Joseph Garvin
    Joseph Garvin about 12 years
    @Anonymous: Often when you're reading a diff (e.g. for code review, or if you're sending patches to an open source mailing list) the call is far enough from the declaration of the parameters that you can't tell.
  • Cole Tobin
    Cole Tobin almost 11 years
@JosephGarvin except that most IDEs allow you to hover over a call (or something) to see the declaration.
  • Cole Tobin
    Cole Tobin almost 11 years
Would it make sense to inline this? __attribute__((always_inline)) for GCC; __forceinline for MSVC
  • Cole Tobin
    Cole Tobin almost 11 years
Two things: the __asm keyword is compiler-dependent. The standard doesn't specify a keyword for inline assembly. So this is not portable in the sense that it is compiler-dependent. For example, the Intel C++ compiler is Windows only, so if you wrote portable code utilizing Intel C++ features, it wouldn't be portable. Another thing: inline assembly prevents compiler inlining. So this optimization doesn't really help if there is still the function call overhead...
  • Cole Tobin
    Cole Tobin almost 11 years
Why use ~((uint32_t)0)? You're already including <stdint.h> to get the uint32_t declaration, so why not just use UINT32_MAX?
  • Dietrich Epp
    Dietrich Epp almost 11 years
    @ColeJohnson: Use inline instead. Don't use __attribute__((always_inline)), which is usually a way of second-guessing the optimizer. The inline keyword does what you actually want and is portable.
  • Cole Tobin
    Cole Tobin almost 11 years
    @Dietrich Why not force inline it besides portability? Also with plain inline, you aren't portable. MSVC requires __inline...
  • Dietrich Epp
    Dietrich Epp almost 11 years
    @ColeJohnson: You can #define inline __inline easily enough, if you want to support compilers with poor support for C, and as long as you're aware of the semantic differences. There is really no reason to use always_inline here, and a good reason not to: it might interfere with debugging. For the same reason, C programmers have ditched declaring register variables long ago: there's no point, except in perhaps very specific circumstances.
  • Cole Tobin
    Cole Tobin almost 11 years
@DietrichEpp Fair enough. I'm not gonna sit here and be given a lecture on something I already know. However, a smart compiler would not inline functions even if forced to when it's in debug mode. An example is MSVC. If you tell it to compile in debug mode, it won't inline (even forced) functions.
  • Dietrich Epp
    Dietrich Epp almost 11 years
    @ColeJohnson: Sorry if my answer seemed like a lecture. But this isn't about how "smart" a compiler is, but just how it behaves, and GCC inlines an always_inline even when compiling with debugging symbols.
  • Cole Tobin
    Cole Tobin almost 11 years
    @Dietrich That's stupid. I guess I never noticed because I work in MSVC, then port to GCC when done.
  • Alexandros
    Alexandros over 10 years
    Just a minor suggestion: The 0xFF.. constants should be changed to the equivalent UINTN_MAX constants (or (uintN_t) -1). That way, it will only take a single search & replace to write the sadd8 or sadd64 functions. (And it doesn't require you to count the number of Fs in 0xFFFFFFFFFFFFFFFF ;)
  • WiSaGaN
    WiSaGaN about 9 years
    "...:...?..." would be a conditional move instead of "jump".
  • Alexandre Pereira Nunes
    Alexandre Pereira Nunes almost 9 years
This produces nice code in gcc 5.1 when targeting armv4t, just 4 branchless instructions (two of them conditional).
  • Peter Cordes
    Peter Cordes about 8 years
sbb $0, %ebx subtracts 1 or not. This gives the wrong answer if the add overflowed more than 1. What does work (as suggested by others) is using sbb same,same to produce a 0 or -1 mask, and OR the add result with that. However, that has a longer critical-path latency than add %edi, %esi / mov $-1, %eax / cmovnc %esi, %edi. (sbb and cmov have the same latency on all CPUs: 2 on Intel pre-Broadwell, and 1 otherwise.)
  • Peter Cordes
    Peter Cordes about 8 years
You can write overflow&31, and it will still compile without a wasted and ecx, 31, because gcc and clang know how the shift instruction works (the ISA defines it to work that way on every CPU since the 286; see the Intel insn ref manual linked from the x86 tag wiki). On targets where the shift works differently, they will emit the necessary instructions to make it work. Of course, this still relies on right-shift of a signed integer using an arithmetic shift, which the C standard doesn't guarantee.
  • Peter Cordes
    Peter Cordes about 8 years
    This generates optimal code on x86 with clang (mov eax,-1 / add / cmovnc), and about the same with gcc, unlike all the other answers. It's the only one that gets gcc to use the flags result from the add, instead of doing another test afterwards (except for DGentry's answer, but gcc doesn't realize both tests are the same). So one could say it's the only one where gcc "understands" what's going on. Even inline asm can't do better on x86: the compiler knows what's going on with yours, so it knows it's associative, and can choose which reg to destroy.
  • Peter Cordes
    Peter Cordes about 8 years
    You'd hope a compiler would spot that, but they don't. clang/gcc/icc all do a crap job on everything except MSalter's answer. Yours compiles to lea eax, [rdi+rsi]/ mov edx, edi / mov ecx, esi / add rdx, rcx / shr rdx, 32 / neg edx / or eax, edx
  • Peter Cordes
    Peter Cordes about 8 years
I didn't see an unsigned saturation instruction for 32-bit integers, only for packed 16-bit (UQADD16) and packed 8-bit. There's a 32-bit add with signed saturation, though. Also, unfortunately this C code compiles to horrible code for the 32-bit case: all the overhead of doing it SWAR style, but for only one value. It unfortunately doesn't optimize away. See my comment on MSalters's answer: the godbolt link includes your version.
  • Peter Cordes
    Peter Cordes about 8 years
    This kinda sucks: first because it's MSVC inline-asm, so inputs / outputs have to go through memory. (Or if this no-return-statement with a value in eax works, then the function itself can't inline. The inputs have to go through memory regardless). Second, because cmov is better: shorter critical path because mov eax, -1 is off the critical path, unlike sbb.
  • Peter Cordes
    Peter Cordes about 8 years
    This produces significantly worse code for x86 and ARM than MSalter's answer. Have a look at it for asm output (including a godbolt link comparing this with it.)
  • rici
    rici about 6 years
    @PeterCordes: Care to comment on the behaviour of more recent clang/gcc versions? Since clang 3.9 and gcc 6.1, the 16-bit version gets quite a lot bulkier. I convinced clang to produce the same code as you show by disabling likely but gcc seems more insistent. The 32-bit versions work as expected (again, disabling likely for clang) but I need a 16-bit saturating add.
  • Peter Cordes
    Peter Cordes about 6 years
    @rici: For unsigned 16-bit, if the compiler already has values zero-extended in registers, it might be optimal to do a 32-bit addition and just check sum & (1UL<<16) for carry-out. Compilers don't do an optimal job with this (by any means), but clang6.0's branchy version is interesting if the normal case is no overflow. godbolt.org/g/qrpPze. (It should use lea to copy-and-add, though.) If partial-register stalls for 16-bit regs don't exist (like on Haswell), clang's branchy version of this answer looks ok, too, but gcc's has a silly test (missed optimization should be reported).
  • Peter Cordes
    Peter Cordes about 6 years
    These might end up different when inlining; branch layout would very likely be different when it's not just a stand-alone function.
  • rici
    rici about 6 years
@peter: my actual use case is comparing z < clamped_subtract(h, 4) where z is a size_t and h is a uint16_t. The existing code is z + 4 < h, but that of course fails if the addition overflows (hugely unlikely, but it's a glitch and I'd like to fix it). It's not in a critical path so I'm not too concerned, but I was looking to see if there was something better than two comparisons.
  • Peter Cordes
    Peter Cordes almost 5 years
    This also uses the undefined operation of a+b overflowing! Signed overflow is UB in C and C++.
  • Peter Cordes
    Peter Cordes almost 5 years
    jno checks for signed overflow. jnc would check for unsigned wraparound like this Q wants, which would match with mov eax, -1 (or your short form with a false dependency; or eax, -1). But if you're going to introduce a data dependency on the add, defeating the benefit for branch-prediction + speculative execution, you might use sbb edx,edx / or eax, edx to broadcast CF to all bits and OR that in. But CMOVC would be more efficient, only 1 or 2 uops on the critical path instead of 2 or 3.
  • Peter Cordes
    Peter Cordes almost 5 years
ARM can create small negative numbers with mvn (mov-NOT) with an immediate. Assemblers know how to use this for you, e.g. adds r0, r0, r1 (add and set flags) / movcs r0, #-1 (assembled as mvn with 0, i.e. -1, if Carry Set). xD, MSalters's own answer posted later shows that compilers already do exactly that. And they also emit this for x86, so you don't have to. And in a way that can inline and constant-propagate.
  • Peter Cordes
    Peter Cordes almost 5 years
    @JosephGarvin: yes, it can be, if you needed saturating 16-bit or 8-bit add or subtract. Or bit-reverse (with SSSE3 pshufb for a per-nibble parallel lookup table). Or with SSE4.1, min or max on 32-bit integers (or abs) with a single instruction. Or 64-bit integer math in 32-bit code. But there's overhead in getting numbers between XMM and integer registers, so use with care.
  • Alexis Wilke
    Alexis Wilke almost 2 years
    Pretty good, although I think you further need to determine whether T is signed or unsigned because the test against max() as it is won't work properly when signed integers are used.