Is it better to avoid using the mod operator when possible?


Solution 1

My general advice is as follows. Use whichever version you think is easier on the eye, and then profile your entire system. Only optimize those parts of the code that the profiler flags up as bottlenecks. I'll bet my bottom dollar that the modulo operator isn't going to be among them.

As far as the specific example goes, only benchmarking can tell which is faster on your specific architecture using your specific compiler. You are potentially replacing modulo with branching, and it's anything but obvious which would be faster.

Solution 2

Some simple measurement:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int test = atoi(argv[1]);
    int divisor = atoi(argv[2]);
    int iterations = atoi(argv[3]);

    int a = 0;

    if (test == 0) {
        for (int i = 0; i < iterations; i++)
            a = (a + 1) % divisor;
    } else if (test == 1) {
        for (int i = 0; i < iterations; i++)
            a = a + 1 == divisor ? 0 : a + 1;
    }

    printf("%d\n", a);
}

Compiling with either gcc or clang with -O3, and running time ./a.out 0 42 1000000000 (modulo version) or time ./a.out 1 42 1000000000 (comparison version) results in

  • 6.25 seconds user runtime for the modulo version,
  • 1.03 seconds for the comparison version.

(using gcc 5.2.1 or clang 3.6.2; Intel Core i5-4690K @ 3.50GHz; 64-bit Linux)

This means that it is probably a good idea to use the comparison version.

Solution 3

Well, have a look at two ways to get the next value of a "modulo 3" cyclic counter.

int next1(int n) {
    return (n + 1) % 3;
}

int next2(int n) {
    return n == 2 ? 0 : n + 1;
}

I compiled both with gcc at -O3 (for the common x64 architecture), adding -S to get the assembly code.

The code for the first function does some seemingly inexplicable magic (*) to avoid a division, using a multiplication instead:

addl    $1, %edi
movl    $1431655766, %edx
movl    %edi, %eax
imull   %edx
movl    %edi, %eax
sarl    $31, %eax
subl    %eax, %edx
leal    (%rdx,%rdx,2), %eax
subl    %eax, %edi
movl    %edi, %eax
ret

It is much longer (and, I bet, slower) than the code for the second function:

leal    1(%rdi), %eax
cmpl    $2, %edi
movl    $0, %edx
cmove   %edx, %eax
ret

So it is not always true that "the (modern) compiler does a better job than you anyway".

Interestingly, the same experiment with 4 instead of 3 leads to AND-masking for the first function

addl    $1, %edi
movl    %edi, %edx
sarl    $31, %edx
shrl    $30, %edx
leal    (%rdi,%rdx), %eax
andl    $3, %eax
subl    %edx, %eax
ret

but it is still, by a large margin, inferior to the second version.

Being more explicit about the proper way to do things

int next3(int n) {
    return (n + 1) & 3;
}

yields much better results:

leal    1(%rdi), %eax
andl    $3, %eax
ret

(*) Well, not that complicated: multiplication by the reciprocal. Compute the integer constant K ≈ (2^N)/3 (suitably rounded), for some large enough value of N. Now, when you want the value of X/3, instead of dividing by 3, compute X*K and shift the result N positions to the right.

Solution 4

Here is an additional benchmark. Note that I also added a branchless version:

#include <iostream>
#include <array>
#include <algorithm>
#include <random>
#include <chrono>
using namespace std::chrono;

constexpr size_t iter = 1e8;

int main() {
  std::minstd_rand rnd_engine{1234};
  std::uniform_int_distribution<int> dist {-1000, 1000};
  auto gen = [&]() { return dist(rnd_engine); };

  std::array<int, 10> a;
  std::generate( a.begin(), a.end(), gen);

  for (size_t size = 2; size < 10; size++) {
    std::cout << "Modulus size = " << size << '\n';
  
    {
      std::cout << "operator%  ";
      long sum = 0;
      size_t x = 0;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x];
        x = (x + 1) % size;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }
  
    {
      std::cout << "ternary    ";
      long sum = 0;
      size_t x = 0;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x];
        x = ((x + 1) == size) ? 0 : x + 1;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }
    
    {
      std::cout << "branchless ";
      long sum = 0;
      size_t x = 1;
      auto start = high_resolution_clock::now();
      for (size_t i = 0; i < iter; ++i) {
        sum += a[x-1];
        x = ( x != size ) * x + 1;
      }
      auto stop = high_resolution_clock::now();
      std::cout << duration_cast<microseconds>(stop - start).count()*0.001
                << "ms\t(sum = " << sum << ")\n";
    }

  }
  return 0;
}

And here is the output on my i7-4870HQ

$ g++ -Ofast test.cpp && ./a.out
Modulus size = 2
operator%  904.249ms    (sum = -4200000000)
ternary    137.04ms     (sum = -4200000000)
branchless 169.182ms    (sum = -4200000000)
Modulus size = 3
operator%  914.911ms    (sum = -31533333963)
ternary    113.384ms    (sum = -31533333963)
branchless 167.614ms    (sum = -31533333963)
Modulus size = 4
operator%  877.3ms      (sum = -36250000000)
ternary    97.265ms     (sum = -36250000000)
branchless 167.215ms    (sum = -36250000000)
Modulus size = 5
operator%  891.295ms    (sum = -30700000000)
ternary    88.562ms     (sum = -30700000000)
branchless 167.087ms    (sum = -30700000000)
Modulus size = 6
operator%  903.644ms    (sum = -39683333196)
ternary    83.433ms     (sum = -39683333196)
branchless 167.778ms    (sum = -39683333196)
Modulus size = 7
operator%  908.096ms    (sum = -34585713678)
ternary    79.703ms     (sum = -34585713678)
branchless 166.849ms    (sum = -34585713678)
Modulus size = 8
operator%  869ms        (sum = -39212500000)
ternary    76.972ms     (sum = -39212500000)
branchless 167.29ms     (sum = -39212500000)
Modulus size = 9
operator%  875.003ms    (sum = -36500000580)
ternary    75.011ms     (sum = -36500000580)
branchless 172.356ms    (sum = -36500000580)

In this particular case the ternary operator looks far superior, even more so once the branch predictor ramps up. Note, however, that this is a very particular case: if we were incrementing the index by a non-constant value, using the more general operator% would remain straightforward, while the other two methods could become very intricate.

I would like to stress this very much underrated comment:

if len is a compile-time constant a recent GCC compiler (with -O2) is usually doing clever things, often avoiding the modulus machine instruction of the target processor. (Basile Starynkevitch)

For instance, by removing the loop over the size variable and declaring it as const size_t size = 4; I get:

g++ -Ofast test.cpp && ./a.out
Modulus size = 4
operator%  62.103ms     (sum = -36250000000)
ternary    93.674ms     (sum = -36250000000)
branchless 166.774ms    (sum = -36250000000)

Conclusions

The execution time of the branchless version is pretty stable across the various scenarios. The ternary version is consistently better than the branchless one over the considered cases, especially once the branch predictor kicks in. Finally, operator%, while more general and significantly slower, has a chance of being optimized into the fastest version, as happens for particular const values of the right-hand side.

Of course this is completely platform-dependent; who knows how this would turn out on an Arduino :)

Author: limp_chimp

I started teaching myself programming and computer science a few years after college. I got my MA in computer science from the University of Chicago in 2013, and have been working as a software developer since.

Updated on July 08, 2022

Comments

  • limp_chimp
    limp_chimp almost 2 years

    I assume that calculating the modulus of a number is a somewhat expensive operation, at least compared to simple arithmetic tests (such as seeing if a number exceeds the length of an array). If this is indeed the case, is it more efficient to replace, for example, the following code:

    res = array[(i + 1) % len];
    

    with the following? :

    res = array[(i + 1 == len) ? 0 : i + 1];
    

    The first one is easier on the eyes, but I wonder if the second might be more efficient. If so, might I expect an optimizing compiler to replace the first snippet with the second when a compiled language is used?

    Of course, this "optimization" (if it is indeed an optimization) doesn't work in all cases (in this case, it only works if i+1 is never more than len).

  • Basile Starynkevitch
    Basile Starynkevitch about 11 years
    On recent machines integer arithmetic is nearly free; what matters much more is cache misses, which are far more costly. An L1 cache miss stalls the processor for hundreds of cycles, during which it could do dozens of divisions or moduli; so the eventual cost of the modulus is noise.
  • NPE
    NPE about 11 years
    @BasileStarynkevitch: Well, cache behaviour is going to be identical between the two code snippets. What's going to matter is whether or not version #2 uses branching and, if it does, how good a job the branch predictor is going to do.
  • Chris Desjardins
    Chris Desjardins about 11 years
    Just because there is a single instruction for an operation doesn't mean it occurs in a single clock cycle.
  • Alex
    Alex about 11 years
    @ChrisDesjardins Agreed, but % with a power-of-2 second operand can be implemented as a bit mask.
  • starblue
    starblue about 11 years
    @Basile Starynkevitch I've seen a factor of about 300 between modulo vs accessing a big table on a laptop. (Adding a test for divisibility by 17 squared to avoid the array access was still beneficial.)
  • autistic
    autistic about 11 years
    @NPE It might be worthwhile to add to this answer that the C language itself doesn't have speed; That's a quality of the implementation (eg. "your specific architecture"). In addition to being dependent upon hardware, "the speed of the modulo operator" is dependent upon the quality of the compiler building code for the hardware; A poor one might use the assembly equivalent of int foo = 54321; int bar = foo / 10000; foo -= bar * 10000; to obtain the modulo, while a good quality compiler might even optimise that code.
  • c.fogelklou
    c.fogelklou about 11 years
    Sorry had to downvote. I have worked with a lot of architectures (but not x86) and have yet to work with one that accomplishes mod/div in one instruction. And I have seen apps where mod is one of the top 10 CPU consuming function calls because of all of the circular buffering - each "sample" copy followed by a % buffersize. In my case I try to avoid mod if I can - usually by asserting that input buffer sizes are divisible by 2 so the compiler can optimize away the mod.
  • user1209304
    user1209304 over 7 years
    On more realistic data (for example, if the number were random) the difference would not be that big.
  • Bigminimus
    Bigminimus almost 7 years
    The comparison version is only faster because the result of the if statement is the same every time, so the branch predictor gets it right every time. If you randomized the input, the comparison version might even be worse than mod
  • Baris Demiray
    Baris Demiray over 6 years
    @Bigminimus Hmm but the result of the if clause is the same for both tests all the time?
  • M.kazem Akhgary
    M.kazem Akhgary over 6 years
    @c.fogelklou Good point. Branch prediction works well for ring buffers across iterations. One may assume branching is more expensive than modulo and so miss the opportunity to use it.
  • Ray
    Ray over 5 years
    The comparison in the second version might make it inferior to the first version; if it doesn't regularly predict the correct branch, that's going to screw up the pipeline. Still, +1 for demonstrating that modern compilers do not always magically find the best possible machine code.
  • Michel Billaud
    Michel Billaud over 5 years
    @Ray As far as I understand, conditional moves have been added to the instruction set (Pentium Pro), so no branch prediction takes place at all: "The CMOVcc instructions are useful for optimizing small IF constructions. They also help eliminate branching overhead for IF statements and the possibility of branch mispredictions by the processor." Pentium-Pro Family Developers Manual, vol 2, p 6.14. phatcode.net/res/231/files/24269101.pdf
  • Ray
    Ray over 5 years
    Michel Billaud: Looks like you're right. I saw the cmpl and completely overlooked the lack of a jump.
  • Peter Cordes
    Peter Cordes about 5 years
    div is the slowest integer ALU operation by far: something like 35 to 90 cycle latency on Skylake for div r64, vs. 3 cycle latency for imul r64, r64. Related: "C++ code for testing the Collatz conjecture faster than hand-written assembly - why?" shows just how slow div is, especially vs. a shift for a power of 2.
  • Doug Currie
    Doug Currie about 5 years
    I wonder why the cond version's time is not the same in both cases.
  • J. Doe
    J. Doe about 5 years
    I think that, because res is not volatile, gcc can do many optimizations that are less effective as the size is increasing. When I add 'volatile' to res the times for the conditional are always around 2 sec. For modulo around 2 sec when power of 2 and not stable (above 4 sec, increasing with the size) otherwise.
  • J. Doe
    J. Doe about 5 years
    I also noticed that in the case of non-volatile res, for 1024 size the conditional runs faster, in 1 sec, so I guess it's about 'good' and 'bad' sizes for the optimizations (or branch predictors?).
  • Garret Gang
    Garret Gang over 4 years
    He is referencing the ?: operator; your code, depending on the size of the divisor, is only guessing wrong 1 out of 100, 400, etc.
  • pat
    pat about 4 years
    The % 4 code is more complex because you are doing signed arithmetic. According to C99, the sign of the modulus must match the sign of the dividend, so it is not just a straight bitwise AND. Change the type to unsigned int, and you will get the same result as the & 3 code.
  • CaptainCodeman
    CaptainCodeman over 3 years
    @c.fogelklou the compiler cannot optimize away a mod if the operand is divisible by 2. Did you mean a power of 2?
  • c.fogelklou
    c.fogelklou over 3 years
    @CaptainCodeman, yes, I meant power of 2. Good catch 7 years later :)
  • mtraceur
    mtraceur over 3 years
    -1 because the answer strongly suggests judging by code size, which is an okay heuristic but a mistake when it comes to optimizations like the one in this question. Not all instructions are equal. Even on a RISC architecture some operations might take longer than others, and on a pipelined CPU the time spent executing a mispredicted branch (which is longer than the branch itself, but continues after the branch until the result of the branching condition is found deeper in the pipeline) might be longer than the time spent by the unconditional code with more instructions.