Fastest way to get the integer part of sqrt(n)?
Solution 1
I would try the Fast Inverse Square Root trick.
It's a way to get a very good approximation of 1/sqrt(n) without any branch, based on some bit-twiddling, so it is not portable (notably between 32-bit and 64-bit platforms).
Once you have it, you just need to invert the result and take the integer part.
There might be faster tricks, of course, since this one is a bit of a roundabout route.
EDIT: let's do it!
First a little helper:
// benchmark.h
#include <sys/time.h>

template <typename Func>
double benchmark(Func f, size_t iterations)
{
    f(); // warm-up call
    timeval a, b;
    gettimeofday(&a, 0);
    for (; iterations --> 0;)
    {
        f();
    }
    gettimeofday(&b, 0);
    return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
           (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}
Then the main body:
#include <iostream>
#include <cmath>
#include <cassert>  // assert
#include <cstdlib>  // atoi
#include "benchmark.h"

class Sqrt
{
public:
    Sqrt(int n): _number(n) {}

    int operator()() const
    {
        double d = _number;
        return static_cast<int>(std::sqrt(d) + 0.5);
    }

private:
    int _number;
};

// http://www.codecodex.com/wiki/Calculate_an_integer_square_root
class IntSqrt
{
public:
    IntSqrt(int n): _number(n) {}

    int operator()() const
    {
        int remainder = _number;
        if (remainder < 0) { return 0; }

        int place = 1 << (sizeof(int)*8 - 2);
        while (place > remainder) { place /= 4; }

        int root = 0;
        while (place)
        {
            if (remainder >= root + place)
            {
                remainder -= root + place;
                root += place*2;
            }
            root /= 2;
            place /= 4;
        }
        return root;
    }

private:
    int _number;
};

// http://en.wikipedia.org/wiki/Fast_inverse_square_root
class FastSqrt
{
public:
    FastSqrt(int n): _number(n) {}

    int operator()() const
    {
        float number = _number;

        float x2 = number * 0.5F;
        float y = number;
        long i = *(long*)&y;  // type-punning through pointers; breaks strict aliasing (see comments)
        //i = (long)0x5fe6ec85e7de30da - (i >> 1);
        i = 0x5f3759df - (i >> 1);
        y = *(float*)&i;
        y = y * (1.5F - (x2*y*y));
        y = y * (1.5F - (x2*y*y)); // let's be precise

        return static_cast<int>(1/y + 0.5f);
    }

private:
    int _number;
};

int main(int argc, char* argv[])
{
    if (argc != 3) {
        std::cerr << "Usage: %prog integer iterations\n";
        return 1;
    }

    int n = atoi(argv[1]);
    int it = atoi(argv[2]);

    assert(Sqrt(n)() == IntSqrt(n)() &&
           Sqrt(n)() == FastSqrt(n)() && "Different Roots!");
    std::cout << "sqrt(" << n << ") = " << Sqrt(n)() << "\n";

    double time = benchmark(Sqrt(n), it);
    double intTime = benchmark(IntSqrt(n), it);
    double fastTime = benchmark(FastSqrt(n), it);

    std::cout << "Number iterations: " << it << "\n"
                 "Sqrt computation : " << time << "\n"
                 "Int computation  : " << intTime << "\n"
                 "Fast computation : " << fastTime << "\n";

    return 0;
}
And the results:
sqrt(82) = 9
Number iterations: 4096
Sqrt computation : 56
Int computation : 217
Fast computation : 119
// Note: had to tweak the program here, as IntSqrt returned -1 :/
sqrt(2147483647) = 46341 // real answer sqrt(2 147 483 647) = 46 340.95
Number iterations: 4096
Sqrt computation : 57
Int computation : 313
Fast computation : 119
Where, as expected, the Fast computation performs much better than the Int computation.
Oh, and by the way, sqrt is faster :)
Solution 2
Edit: this answer is foolish; use (int) sqrt(i) instead. After profiling with proper settings (-march=native -m64 -O3), (int) sqrt(i) was a lot faster than the code below.
Alright, a somewhat old question, but the "fastest" answer has not been given yet. The fastest (I think) is the Binary Square Root algorithm, explained fully in this Embedded.com article.
It basically comes down to this:
unsigned short isqrt(unsigned long a) {
    unsigned long rem = 0;
    int root = 0;
    int i;
    for (i = 0; i < 16; i++) {
        root <<= 1;
        rem <<= 2;
        rem += a >> 30; // assumes a 32-bit unsigned long
        a <<= 2;
        if (root < rem) {
            root++;
            rem -= root;
            root++;
        }
    }
    return (unsigned short)(root >> 1);
}
On my machine (Q6600, Ubuntu 10.10) I profiled by taking the square root of the numbers 1-100000000. Using isqrt(i) took 2750 ms. Using (unsigned short) sqrt((float) i) took 3600 ms. This was done using g++ -O3. Using the -ffast-math compile option the times were 2100 ms and 3100 ms respectively. Note this is without using even a single line of assembler, so it could probably still be much faster.
The above code works for both C and C++ and with minor syntax changes also for Java.
What works even better for a limited range is a binary search. On my machine this blows the version above out of the water by a factor of 4. Sadly it's very limited in range:
#include <stdint.h>
const uint16_t squares[] = {
0, 1, 4, 9,
16, 25, 36, 49,
64, 81, 100, 121,
144, 169, 196, 225,
256, 289, 324, 361,
400, 441, 484, 529,
576, 625, 676, 729,
784, 841, 900, 961,
1024, 1089, 1156, 1225,
1296, 1369, 1444, 1521,
1600, 1681, 1764, 1849,
1936, 2025, 2116, 2209,
2304, 2401, 2500, 2601,
2704, 2809, 2916, 3025,
3136, 3249, 3364, 3481,
3600, 3721, 3844, 3969,
4096, 4225, 4356, 4489,
4624, 4761, 4900, 5041,
5184, 5329, 5476, 5625,
5776, 5929, 6084, 6241,
6400, 6561, 6724, 6889,
7056, 7225, 7396, 7569,
7744, 7921, 8100, 8281,
8464, 8649, 8836, 9025,
9216, 9409, 9604, 9801,
10000, 10201, 10404, 10609,
10816, 11025, 11236, 11449,
11664, 11881, 12100, 12321,
12544, 12769, 12996, 13225,
13456, 13689, 13924, 14161,
14400, 14641, 14884, 15129,
15376, 15625, 15876, 16129,
16384, 16641, 16900, 17161,
17424, 17689, 17956, 18225,
18496, 18769, 19044, 19321,
19600, 19881, 20164, 20449,
20736, 21025, 21316, 21609,
21904, 22201, 22500, 22801,
23104, 23409, 23716, 24025,
24336, 24649, 24964, 25281,
25600, 25921, 26244, 26569,
26896, 27225, 27556, 27889,
28224, 28561, 28900, 29241,
29584, 29929, 30276, 30625,
30976, 31329, 31684, 32041,
32400, 32761, 33124, 33489,
33856, 34225, 34596, 34969,
35344, 35721, 36100, 36481,
36864, 37249, 37636, 38025,
38416, 38809, 39204, 39601,
40000, 40401, 40804, 41209,
41616, 42025, 42436, 42849,
43264, 43681, 44100, 44521,
44944, 45369, 45796, 46225,
46656, 47089, 47524, 47961,
48400, 48841, 49284, 49729,
50176, 50625, 51076, 51529,
51984, 52441, 52900, 53361,
53824, 54289, 54756, 55225,
55696, 56169, 56644, 57121,
57600, 58081, 58564, 59049,
59536, 60025, 60516, 61009,
61504, 62001, 62500, 63001,
63504, 64009, 64516, 65025
};
inline int isqrt(uint16_t x) {
const uint16_t *p = squares;
if (p[128] <= x) p += 128;
if (p[ 64] <= x) p += 64;
if (p[ 32] <= x) p += 32;
if (p[ 16] <= x) p += 16;
if (p[ 8] <= x) p += 8;
if (p[ 4] <= x) p += 4;
if (p[ 2] <= x) p += 2;
if (p[ 1] <= x) p += 1;
return p - squares;
}
A 32 bit version can be downloaded here: https://gist.github.com/3481770
Solution 3
If you don't mind an approximation, how about this integer sqrt function I cobbled together?
int sqrti(int x)
{
    union { float f; int x; } v;

    // convert to float
    v.f = (float)x;

    // fast approx sqrt
    //  assumes float is in IEEE 754 single precision format
    //  assumes int is 32 bits
    //  b = exponent bias
    //  m = number of mantissa bits
    v.x -= 1 << 23; // subtract 2^m
    v.x >>= 1;      // divide by 2
    v.x += 1 << 29; // add ((b + 1) / 2) * 2^m

    // convert back to int
    return (int)v.f;
}
It uses the algorithm described in this Wikipedia article. On my machine it's almost twice as fast as sqrt :)
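As a comment below points out, the raw bit-shift approximation misses the exact floor for some inputs (e.g. 0, 3, 8, 48). If exactness matters, one option is to refine the guess afterwards. The following is my own hedged sketch, not part of the original answer (sqrti_exact is a made-up name): the same bit trick, followed by two Newton steps and a small correction loop.

```c
#include <assert.h>

/* Sketch only: same IEEE 754 bit trick as above, then two Newton steps
   and a correction loop, so the result is the exact floor of sqrt(x)
   for any non-negative 32-bit int. Assumes IEEE 754 floats and 32-bit
   int, like the original. */
int sqrti_exact(int x)
{
    if (x <= 0) return 0;

    union { float f; int i; } v;
    v.f = (float)x;
    v.i -= 1 << 23; // subtract 2^m (m = 23 mantissa bits)
    v.i >>= 1;      // halve the exponent
    v.i += 1 << 29; // re-add ((b + 1) / 2) * 2^m (bias b = 127)

    int r = (int)v.f;    // rough root; can be off by a few
    r = (r + x / r) / 2; // Newton step (r >= 1 here, so the division is safe)
    r = (r + x / r) / 2; // one more step helps for large x
    while ((long long)r * r > x) r--;              // correct down
    while ((long long)(r + 1) * (r + 1) <= x) r++; // correct up
    return r;
}
```

The correction loops only run a handful of iterations because the Newton steps land within one or two of the true root; the long long casts avoid signed overflow when squaring near INT_MAX.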
Solution 4
While I suspect you can find plenty of options by searching for "fast integer square root", here are some potentially-new ideas that might work well (each independent, or maybe you can combine them):
- Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it. The resulting index in the array is the square root.
- Convert the number to floating point and break it into mantissa and exponent. Halve the exponent and multiply the mantissa by some magic factor (your job to find it). This should be able to give you a very close approximation. Include a final step to adjust it if it's not exact (or use it as a starting point for the binary search above).
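The second idea could be sketched roughly as follows (this is my own illustration, not the answer's code; the function name and the particular linear fit are invented). frexp splits n into mantissa and exponent, the exponent is halved, and a chord fit of sqrt over [0.25, 1) plays the role of the "magic factor"; Newton steps plus a correction loop then make the result exact:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical sketch of the mantissa/exponent idea. The linear fit
   0.5 + (2/3)*(m - 0.25) is a chord approximation of sqrt on [0.25, 1),
   good to about 6%; the fix-up steps below make the floor exact. */
int isqrt_frexp(int n)
{
    if (n <= 0) return 0;

    int exp;
    double m = frexp((double)n, &exp); // n = m * 2^exp, m in [0.5, 1)
    if (exp & 1) { m *= 0.5; exp++; }  // make the exponent even; m in [0.25, 1)

    double s = 0.5 + (2.0 / 3.0) * (m - 0.25); // approximate sqrt(m)
    int r = (int)ldexp(s, exp / 2);            // approximate sqrt(n); r >= 1

    r = (r + n / r) / 2;                       // two Newton steps
    r = (r + n / r) / 2;
    while ((long long)r * r > n) r--;              // adjust to the exact floor
    while ((long long)(r + 1) * (r + 1) <= n) r++;
    return r;
}
```

In a real implementation you would tune the fit (or use a small lookup table indexed by mantissa bits) so the adjustment step stays cheap.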
Solution 5
This is so short that it almost always gets inlined:
static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clear xmm0 for cvtsi2ss
        "cvtsi2ss %1, %%xmm0\n\t"   // convert num to float, put it in xmm0
        "sqrtss %%xmm0, %%xmm0\n\t" // square root of xmm0
        "cvttss2si %%xmm0, %0"      // truncate float to int
        : "=r" (i) : "r" (num) : "%xmm0"); // i: result, num: input, xmm0: scratch register
    return i;
}
Why clear xmm0 first? From the documentation of cvtsi2ss:
The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged.
GCC Intrinsic version (runs only on GCC):
#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v4sf xmm0 = {0, 0, 0, 0};
    xmm0 = __builtin_ia32_cvtsi2ss(xmm0, num);
    xmm0 = __builtin_ia32_sqrtss(xmm0);
    return __builtin_ia32_cvttss2si(xmm0);
}
Intel Intrinsic version (tested on GCC, Clang, ICC):
#include <xmmintrin.h>
int sqrtn2(int num) {
    register __m128 xmm0 = _mm_setzero_ps();
    xmm0 = _mm_cvt_si2ss(xmm0, num);
    xmm0 = _mm_sqrt_ss(xmm0);
    return _mm_cvtt_ss2si(xmm0);
}
All of the versions above require only SSE 1 (not even SSE 2).
Note: this is exactly how GCC calculates (int) sqrt((float) num) with -Ofast. If you want higher accuracy for large num, then you can calculate (int) sqrt((double) num) (as noted by Gumby The Green in the comments):
static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"
        "cvtsi2sd %1, %%xmm0\n\t"
        "sqrtsd %%xmm0, %%xmm0\n\t"
        "cvttsd2si %%xmm0, %0"
        : "=r" (i) : "r" (num) : "%xmm0");
    return i;
}
or
#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v2df xmm0 = {0, 0};
    xmm0 = __builtin_ia32_cvtsi2sd(xmm0, num);
    xmm0 = __builtin_ia32_sqrtsd(xmm0);
    return __builtin_ia32_cvttsd2si(xmm0);
}
Nawaz
Updated on July 05, 2022

Comments
-
Nawaz almost 2 years
As we know, if n is not a perfect square, then sqrt(n) would not be an integer. Since I need only the integer part, I feel that calling sqrt(n) wouldn't be that fast, as it takes time to calculate the fractional part also.
So my question is: can we get only the integer part of sqrt(n) without calculating the actual value of sqrt(n)? The algorithm should be faster than sqrt(n) (defined in <math.h> or <cmath>).
If possible, you can write the code in an asm block also.
Piva about 13 years
I liked your (1), but I don't think you need to precalculate the perfect squares. Of course it depends on how many times he is going to be calling that, but precalculating it for a few calls would be a lot more expensive than the binary search by itself.
-
R.. GitHub STOP HELPING ICE about 13 years
If you want to square the 'index' at every step of your binary search, be my guest. It will be slooooooow. That's why I suggested precalculating them. Note that I said static const. There is no cost to computing it because it happened before your program was compiled. And even if you support the full range of 32-bit integers, your table will only be 256kb.
Elemental about 13 years
I have used strategy 1 in a high performance context and it worked beautifully. I further enhanced the performance of the search using the knowledge that the next sqrt to be taken was likely to be close to the previous one (the context was graphical), and it made a staggering performance difference.
-
Nawaz about 13 years
@R..: I don't think (1) will be faster than sqrt; binary searching on a list of 999999 integers would most likely be slower than sqrt!
Saeed Amiri about 13 years
This is for floating point, but Nawaz just needs integer values.
-
Tony Delroy about 13 years
@Nawaz: given you apparently care enough to ask the question, how about benchmarking it before condemning it? Much will depend on your exact hardware...
-
Matthieu M. about 13 years
@Saeed: an integer can be trivially converted to a float (back and forth) and I am curious about the applicability of this method. It certainly is the only branchless method without a pre-computed table that I could think of. After that... I guess we could benchmark :) ?
-
Saeed Amiri about 13 years
Yes it can, but I think methods like the one in the article I referenced are faster (because they only care about the integer part), but yes, we should benchmark these approaches.
-
MSalters about 13 years
@R..: squaring the index takes a single integer operation. There are plenty of integer pipelines. Accessing memory is way more expensive because that's a shared resource. Besides, you'd first have to calculate the address of square[i], which like i*i is a single integer operation. So even if accessing square[i] were free, it still wouldn't be faster.
Matthieu M. about 13 years
@Saeed: done; as expected the Fast Inverse Trick performs better, being branchless pays off I guess.
-
Matthieu M. about 13 years
@Saeed: normally you've got all the includes. I wonder what -ffast-math would give for sqrt.
Nawaz about 13 years
@Matthieu M: good attempt, +1... but as I had expected, sqrt would still be faster. In fact, sqrt is awesomely faster!!
Matthieu M. about 13 years
@Nawaz: actually, I had good enough precision with a single iteration of Newton's method in the Fast case, which gave only 86, so I don't find sqrt that fast; I would have hoped a hardware accelerated version would perform much better :/
Gabe about 13 years
I'd love to see this benchmarked! I'd also love to see what a "branchless binary search" would look like.
-
R.. GitHub STOP HELPING ICE about 13 years
Multiplication is much slower than addition, at least in latency. And in the binary search, latency is what will matter, since you can't do the next step without the result of the previous one (well, you could try to start squaring both possible next attempts, and then throw away the one you don't need).
-
R.. GitHub STOP HELPING ICE about 13 years
As for what "branchless binary search" means, it means simply advancing the left index of the binary search by an integer expression which evaluates to 0 or len/2 based on the difference of the value being tested and the value being sought, for example by making a mask out of the high bit. It can also be done using cmov-type instructions.
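To make that concrete, here is a toy illustration (my own example; the table size and names are invented, this is not R..'s exact code): a branchless search over the squares of 0..15, where the sign bit of an unsigned difference replaces the comparison branch.

```c
#include <stdint.h>

/* Branchless binary search for the floor square root of x in [0, 255].
   At each step the left edge advances by 0 or `step`; the 0/1 flag
   comes from bit 31 of (table - x - 1), so there is no if/else. */
static const uint32_t sq16[16] = {
    0, 1, 4, 9, 16, 25, 36, 49,
    64, 81, 100, 121, 144, 169, 196, 225
};

int isqrt8(uint32_t x)
{
    uint32_t lo = 0;
    for (uint32_t step = 8; step; step >>= 1) {
        // bit 31 of sq16[lo + step] - x - 1 is 1 exactly when sq16[lo + step] <= x
        uint32_t ge = (sq16[lo + step] - x - 1) >> 31;
        lo += ge * step; // advance by step or not, without branching
    }
    return (int)lo;
}
```

The same pattern scales to the 256-entry table from Solution 2 by starting with step = 128.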
R.. GitHub STOP HELPING ICE about 13 years
@Nawaz: You don't search the whole list. You can come up with some obvious initial bounds.
-
orlp over 11 years
@Nawaz and R.: I implemented solution 1 in C for uint16_t's, but it seems to get outperformed by my solution and even (int) sqrt on gcc -O3. You can look at it here: gist.github.com/3481295. Maybe you can improve my implementation?
R.. GitHub STOP HELPING ICE over 11 years
Each of those < operators will compile to a conditional branch (or at least a conditional move) because the compiler can't assume anything about the range of values. You can, so you could (for example) use unsigned expressions and just use bit 31 of the difference to get a 0/1 result based on which is greater.
orlp over 11 years
@Nawaz and R: I actually implemented it much better now. It's not branchless, but it blows everything else so far out of the water: gist.github.com/3481607
-
R.. GitHub STOP HELPING ICE over 11 years
Here's my version: for (s=squares, i=128; i; i=i>>1) s+=-((unsigned)(s[i]-x-1)>>31) & i; return s-squares; Note that you can fix your table to remove the unwanted -1 from the code. There's no need to unroll the loop by hand; gcc -O3 will do it for you.
R.. GitHub STOP HELPING ICE over 11 years
Try my branchless variant and see if it's faster.
-
orlp over 11 years
@R.: nope, it's slower by around a factor of 3.
-
R.. GitHub STOP HELPING ICE over 11 years
Is your compiler perhaps using cmov for your version?
R.. GitHub STOP HELPING ICE over 11 years
You might also try this simplified version with (implementation-defined) signed right-shift: for (s=squares, i=128; i; i=i>>1) s += s[i]-x-1>>31 & i; return s-squares;
-
orlp over 11 years
@R.: nope, it doesn't use cmov. Also, hand unrolling the loop actually is faster by around 20%. Here is the asm output for both versions (note that I made the 32 bit version): gist.github.com/3481749 The full 32 bit version can be downloaded here: gist.github.com/3481770
Ricbit over 11 years
The array is too large; a cache miss may be more expensive than the sqrt calculation.
-
Fredrik Johansson almost 11 years
I tried that one on iPhone hardware, but it seemed to be slow because of the 'b = N / a' operation.
-
rubenvb over 10 years
You should move the "this answer is bullshit" edit to the top of your answer. I wasted time reading bullshit ;-)
-
Viktor Dahl about 9 years
Technically this breaks the strict aliasing rule. It doesn't seem to cause a problem under recent gcc (4.9), but the compliant way of doing it would be union { float f; int32_t x; } v; v.f = (float) x; v.x -= ... return (int)((float)v.x);
Todd Lehman almost 9 years
@AndrewTomazos: Unfortunately, your function fails to return the correct answer for N ∈ { 0, 3, 8, 48, 63, 120, 143, ... }.
-
Peter Cordes over 8 years
This is only useful if integer division is much more efficient than FP sqrt. Or maybe if you need to handle integers that are too large to have exact float representations. Modern x86 hardware (Intel Haswell) can convert to float and back, and do a double-precision FP sqrt, in about 25 clock cycles latency. A single 32-bit integer division has 22 to 29 cycles latency. Integer Newton iterations don't do well from a throughput perspective, either.
-
rsaxvc almost 7 years
Two-stage tables might perform well here as well.
-
David Wohlferd almost 6 years
Do you really need to zero xmm0 before you copy a value into it? Also, inlining can disable optimization for surrounding code. What about using builtins (ie __builtin_ia32_cvtsi2ss, __builtin_ia32_sqrtss, __builtin_ia32_cvtss2si) from here? When it comes to using inline asm, less is more.
Gumby The Green about 5 years
The Intel Intrinsic version doesn't round down; it rounds to the nearest int. To fix this, add a t to its last line: return _mm_cvtt_ss2si(xmm0);. These are 5-6x faster than sqrt() on my machine, but wrong answers start appearing when num >= 16,785,407 due to rounding errors on the float. To fix this in the GCC Intrinsic version, change the first line to __v2df xmm0 = {0, 0}; and replace each ss with sd (warning: cuts the speed in half). I don't see a _mm_cvt_si2sd() in Intel's Intrinsics Guide for some reason.
metablaster over 4 years
Not bad, it helped me learn something, but how do you people measure speed? All the sqrt(int) functions I found on the web are slower than std::sqrt according to my measurements. Your function is 27% slower and 3% less accurate than std, by the way.
-
Jan Schultke over 3 years
double precision isn't good enough for 64-bit square roots and float precision isn't good enough for 32-bit square roots. Try anything beyond 2^31 and you will see inconsistent rounding with floats. Or try anything beyond 2^63 with doubles.
Jan Schultke over 3 years
Are you sure that cvtsi2ss %1 is actually safe? Assuming %1 is edi, then there could be some bits in the upper part of rdi. Maybe you wouldn't notice it in your own tests, but it could happen in principle.
MCCCS over 3 years
@JanSchultke Good catch, I'll initialize i.
-
Bram about 2 years
Thanks for sharing. This is very useful on μcontrollers that do not have a floating point unit.