Fast divisibility tests (by 2,3,4,5,.., 16)?
Solution 1
It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.
In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. for 64-bit the respective ratios are 4:92 (23x) and 5:77 (14.4x).
The "L" columns denote latency. "T" columns denote throughput. This has to do with the processor's ability to handle multiple instructions in parallell. Sandybridge can issue one 32-bit multiplication every other cycle or one 64-bit every cycle. For K10 the corresponding throughput is reversed. For divisions the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.
Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.
It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.
Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.
Solution 2
In every case (including divisible by 2):
if (number % n == 0) do();
Anding with a mask of low order bits is just obfuscation, and with a modern compiler will not be any faster than writing the code in a readable fashion.
If you have to test all of the cases, you might improve performance by putting some of the cases in the if
for another: there's no point it testing for divisibility by 4 if divisibility by 2 has already failed, for example.
Solution 3
As @James mentioned, let the compiler simplify it for you. If n
is a constant, any descent compiler is able to recognize the pattern and change it to a more efficient equivalent.
For example, the code
#include <stdio.h>
int main() {
size_t x;
scanf("%u\n", &x);
__asm__ volatile ("nop;nop;nop;nop;nop;");
const char* volatile foo = (x%3 == 0) ? "yes" : "no";
__asm__ volatile ("nop;nop;nop;nop;nop;");
printf("%s\n", foo);
return 0;
}
compiled with g++-4.5 -O3, the relevant part of x%3 == 0
will become
mov rcx,QWORD PTR [rbp-0x8] # rbp-0x8 = &x
mov rdx,0xaaaaaaaaaaaaaaab
mov rax,rcx
mul rdx
lea rax,"yes"
shr rdx,1
lea rdx,[rdx+rdx*2]
cmp rcx,rdx
lea rdx,"no"
cmovne rax,rdx
mov QWORD PTR [rbp-0x10],rax
which, translated back to C code, means
(hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
// equivalatent to: x % 3 == 0 ? "yes" : "no"
no division involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3
)
Edit:
- The magic constant 0xaaaaaaaaaaaaaaab can be computed in http://www.hackersdelight.org/magic.htm
- For divisors of the form 2n - 1, check http://graphics.stanford.edu/~seander/bithacks.html#ModulusDivision
Solution 4
A bit tongue in cheek, but assuming you get the rest of the answers:
Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
Solution 5
Assume number
is unsigned
(32-bits). Then the following are very fast ways to compute divisibility up to 16. (I haven't measured but the assembly code indicates so.)
bool divisible_by_2 = number % 2 == 0;
bool divisible_by_3 = number * 2863311531u <= 1431655765u;
bool divisible_by_4 = number % 4 == 0;
bool divisible_by_5 = number * 3435973837u <= 858993459u;
bool divisible_by_6 = divisible_by_2 && divisible_by_3;
bool divisible_by_7 = number * 3067833783u <= 613566756u;
bool divisible_by_8 = number % 8 == 0;
bool divisible_by_9 = number * 954437177u <= 477218588u;
bool divisible_by_10 = divisible_by_2 && divisible_by_5;
bool divisible_by_11 = number * 3123612579u <= 390451572u;
bool divisible_by_12 = divisible_by_3 && divisible_by_4;
bool divisible_by_13 = number * 3303820997u <= 330382099u;
bool divisible_by_14 = divisible_by_2 && divisible_by_7;
bool divisible_by_15 = number * 4008636143u <= 286331153u;
bool divisible_by_16 = number % 16 == 0;
Regarding divisibility by d
the following rules hold:
- When
d
is a power of 2:
As pointed out by James Kanze, you can use is_divisible_by_d = (number % d == 0)
. Compilers are clever enough to implement this as (number & (d - 1)) == 0
which is very efficient but obfuscated.
However, when d
is not a power of 2 it looks like the obfuscations shown above are more efficient than what current compilers do. (More on that later).
- When
d
is odd:
The technique takes the form is_divisible_by_d = number * a <= b
where a
and b
are cleverly obtained constants. Notice that all we need is 1 multiplication and 1 comparison:
- When
d
is even but not a power of 2:
Then, write d = p * q
where p
is a power of 2 and q
is odd and use the "tongue in cheek" suggested by unpythonic, that is, is_divisible_by_d = is_divisible_by_p && is_divisible_by_q
. Again, only 1 multiplication (in the calculation of is_divisible_by_q
) is performed.
Many compilers (I've tested clang 5.0.0, gcc 7.3, icc 18 and msvc 19 using godbolt) replace number % d == 0
by (number / d) * d == number
. They use a clever technique (see references in Olof Forshell's answer) to replace the division by a multiplication and a bit shift. They end up doing 2 multiplications. In contrast the techniques above perform only 1 multiplication.
Update 01-Oct-2018
Looks like the algorithm above is coming to GCC soon (already in trunk):
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853
The GCC's implementation seems even more efficient. Indeed, the implementation above has three parts: 1) divisibility by the divisor's even part; 2) divisibility by the divisor's odd part; 3) &&
to connect the results of the two previous steps. By using an assembler instruction which is not efficiently available in standard C++ (ror
), GCC wraps up the three parts into a single one which is very similar to that of divisibility by the odd part. Great stuff! Having this implementation available, it's better (for both clarity and performance) to fall back to %
all times.
Update 05-May-2020
My articles on the subject have been published:
Quick Modular Calculations (Part 1), Overload Journal 154, December 2019, pages 11-15.
Quick Modular Calculations (Part 2), Overload Journal 155, February 2020, pages 14-17.
Quick Modular Calculations (Part 3), Overload Journal 156, April 2020, pages 10-13.
Comments
-
psihodelia almost 2 years
What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how to calculate very fast that a number is divisible by 2,3,4,5,... up to 16?
WARNING: given code is EXAMPLE only. Every line is independent! Just obvious solution using modulo operation is slow on many processors, which don't have DIV hardware (like many ARMs). Some compilers are also cannot make such optimizations (say, if divisor is a function's argument or is dependent on something).
Divisible_by_1 = do(); Divisible_by_2 = if (!(number & 1)) do(); Divisible_by_3 = ? Divisible_by_4 = ? Divisible_by_5 = ? Divisible_by_6 = ? Divisible_by_7 = ? Divisible_by_8 = ? Divisible_by_9 = ? Divisible_by_10 = ? Divisible_by_11 = ? Divisible_by_12 = ? Divisible_by_13 = ? Divisible_by_14 = ? Divisible_by_15 = ? Divisible_by_16 = if(!number & 0x0000000F) do();
and special cases:
Divisible_by_2k = if(number & (tk-1)) do(); //tk=2**k=(2*2*2*...) k times
-
psihodelia almost 13 yearsIt's not forbidden to change the base :). However, It must be fast!
-
psihodelia almost 13 years@Ali Veli: and it's very slow, because % requires division operation, which is very slow on many CPUs
-
psihodelia almost 13 yearsYour solution is very slow, because you implicitly use division operation !
-
Armen Tsirunyan almost 13 years@psihodelia: The thing is that if you do change the base, it's already gonna be slower than just doing
%
-
starblue almost 13 yearsThe compiler should optimize to a bit test for powers of two.
-
kennytm almost 13 years@psihodelia: Have you actually tried to check the compiler's generated assembly?
-
psihodelia almost 13 yearsI am not interested in sequential tests! Given code is EXAMPLE only! I am looking for methods of fast divisibility tests
-
James Kanze almost 13 years@psihodelia Then there's not much you can do to improve on
number % n == 0
. -
James Kanze almost 13 years@psihodelia My solution generates exactly the same machine code as yours, at least with g++ (and this is without optimization). From experience, trying to beat the compiler at this sort of thing is a loosing proposition: the compiler knows more about the subtleties of your machine than you do, and will do a better job at finding the optimal machine instructions. Formulating the expression for something other than what you really want will inhibit the compiler in this, and sometimes result in worse code.
-
psihodelia almost 13 years@James: please check what your compiler generates if you increment n in a loop
-
James Kanze almost 13 years@psihodelia The base, in this case, is determined by the underlying hardware.
-
James Kanze almost 13 years@psihodelia If n is a variable, it will generate a division. Obviously, since it can't know what value to optimize for. On the other hand, I just wrote a function
template<int n> bool isDivisibleBy( int number )
, and instantiated it for all values between 2 and 16, and the compiler didn't generated a single division. (VC++ optimizes out the division for powers of 2, but not for other values.) -
psihodelia almost 13 yearsI am interesting in this tricks compiler does. Not every compiler is the same.
-
kennytm almost 13 years@psihodelia: At least both gcc and clang does the same. See update.
-
MSalters almost 13 yearsYou only need the LCM of the odd primes: 15015. And there are only 5 primes, so the LUT doesn't need more than 5 bits. 75075 bits total.
-
Mooing Duck over 12 yearswwhen you check 15, this algorithm says its divisible by 2, 3, and 4, but not 5. This method wont work.
-
Mooing Duck over 12 yearsgood idea, and good that you mention that this code is probably slower than modulo.
-
FrankH. almost 12 yearsActually ... that specific chapter of Hacker's Delight is available online: hackersdelight.org/divcMore.pdf
-
starblue almost 12 years@FrankH. Good find, but from the text it seems to be an addition of more material on this topic.
-
Admin over 9 yearsAnding with a mask of low order bits is much faster, as I have demonstrated on a similar question RE: divisible by two. IIRC it runs in about 65% of the time of modulo. stackoverflow.com/a/16369720/1899861
-
Admin over 9 yearsIt's been awhile, but IIRC division on shorter integer types on x86 gets faster and faster. EG: an int_8 division is 9X faster than an int_32 division. Not even a little bit like proportional to the size, is it? Weird, but true.
-
James Kanze over 9 years@RocketRoy If this is the case, I'd suggest you change your compiler, or at least complain about it. Compilers have been changing multiplication, division and modulo into comparable shift, add and subtract operations for at least 20 years.
-
Peter Cordes about 6 years@RocketRoy: On recent x86 microarchitectures like Sandybridge or Haswell with powerful high-radix dividers, integer division is only slightly faster for int8_t than int32_t. But
int64_t
is 2x to 3x slower thanint32_t
: On Haswell, latencies foridiv r8
: 23-26. Foridiv r32
: 22-29 cycles, and foridiv r64
: 39-103. (Worst-case throughput is also better for smaller registers). Even going back to Pentium II, there was only a 2x latency / 3x throughput difference between 8-bit vs. 32-bit. AMD Ryzen has 13-16 cycleidiv r8
, and 14-30 cycleidiv r32
(same best case, 2x worst case) -
Peter Cordes about 6 yearsIf you have a number as a string, the first adding digits can be very fast. (e.g. a few instructions on x86 with SSE2
psadbw
to sum up to 16 digits). But doing it repeatedly down to a single digit requires modulo by 10 to break the binary integer into decimal digits, so you might as well just let the compiler use a magic-constant multiply to check for divisibility by 3 in the first place. But if your number is larger than a single register (e.g. int64_t on a 32-bit machine), and you already have a decimal string representation, this could be a win. -
Peter Cordes about 6 yearsgcc doesn't use the multiplicative-inverse trick for integers wider than a register, where it would take 4 multiplies and some
adc
to produce the high half of the full result. Instead it passes the constant to a libgcc division function that uses regulardiv
instructions. -
Peter Cordes about 6 yearsSee also Why does GCC use multiplication by a strange number in implementing integer division? for the details on how / why it works. Gcc will do it for you, except for
int64_t
on a 32-bit machine. (Or in general, with integers wider than a single register). -
Peter Cordes about 6 yearsTesting
n%8 == 7
isn't the same thing asn%7 == 0
. If it was, optimizing compilers would use a simple bitwise AND when compilingn%7 == 0
. -
Peter Cordes about 6 yearsDo this single-multiplication versions really work for all valid integer inputs? Put these in a loop for
number
from0..UINT_MAX
and check the boolean results of your method againstnumber % 13 == 0
. e.g. like this for a related division hack which fails for large numbers. (If this does work for the full range, though, it's cool and compilers should be using it! Only needing the low half of a multiply result is a nice bonus on top of only needing one multiply) -
Cassio Neri about 6 years@PeterCordes It does. The reference mathematically proves it (kudos to Chris Lomont). In addition, before posting I've done the test that you suggested. Compilers definitely should be using. Notice that the constants above are for 32 bits unsigned integers . The same reference gives the constants for 64 bits unsigned integers and it explains how the constants are obtained.
-
Peter Cordes about 6 yearsThat's awesome. Have you opened missed-optimization bug reports on gcc and clang?
-
Cassio Neri about 6 years@PeterCordes Good idea. Will do.
-
Cassio Neri about 6 yearsThere was already a bug report for gcc.
-
Cassio Neri about 6 yearsAnd one for clang as well.
-
Peter Cordes over 5 yearsIt is possible to write a rotate in ISO C++ in a way that will compile to a hardware rotate instruction with good compilers. Best practices for circular shift (rotate) operations in C++. Anyway, very cool trick, thanks for writing up this answer with links to compiler bug reports.
-
Cassio Neri over 5 years@PeterCordes Indeed. I confess I've assumed compilers would not be that clever! I've updated the answer to strikethrough my previous claim.
-
Gumby The Green almost 5 yearsThe GCC bug is "RESOLVED FIXED" but how will we know when the fix has been been deployed?
-
Gumby The Green almost 5 yearsFor me, the multiplication trick with odd #s only helps with 64-bit and 128-bit integers. With the latter, it's way faster than the mod operator. E.g.,
n * 226854911280625642308916404954512140971 <= 113427455640312821154458202477256070485
is 17x faster thann % 3 == 0
where all three integers are 128-bit (unsigned). -
Cassio Neri almost 5 years@GumbyTheGreen The implementation is in gcc 9.1. See here. Play with compiler versions and notice the difference in implementations (8.3 uses the "traditional" algorithm). Unfortunately, there are outstanding issues. (See my comment at the bottom of the bug report.)