Fastest way to get the integer part of sqrt(n)?

Solution 1

I would try the Fast Inverse Square Root trick.

It's a way to get a very good approximation of 1/sqrt(n) without any branches. It is based on bit-twiddling, so it isn't portable (notably between 32-bit and 64-bit platforms).

Once you have it, you just need to invert the result and take the integer part.

There might be faster tricks, of course, since this one is a bit roundabout.

EDIT: let's do it!

First a little helper:

// benchmark.h
#include <sys/time.h>

template <typename Func>
double benchmark(Func f, size_t iterations)
{
  f();

  timeval a, b;
  gettimeofday(&a, 0);
  for (; iterations --> 0;)
  {
    f();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
         (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}
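The helper above relies on POSIX gettimeofday; as a portability note, the same helper can be sketched with C++11 <chrono> (the name benchmark_chrono is mine, not part of the original answer):

```cpp
// benchmark_chrono.h -- portable variant of the helper above (C++11)
#include <chrono>
#include <cstddef>

template <typename Func>
double benchmark_chrono(Func f, std::size_t iterations)
{
  f(); // warm-up call, as in the original

  auto start = std::chrono::steady_clock::now();
  for (; iterations --> 0;)
  {
    f();
  }
  auto stop = std::chrono::steady_clock::now();
  // report microseconds, matching the gettimeofday version
  return std::chrono::duration<double, std::micro>(stop - start).count();
}
```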

Then the main body:

#include <cassert>  // assert
#include <cmath>
#include <cstdint>  // std::int32_t for the bit-level trick
#include <cstdlib>  // atoi
#include <iostream>

#include "benchmark.h"

class Sqrt
{
public:
  Sqrt(int n): _number(n) {}

  int operator()() const
  {
    double d = _number;
    return static_cast<int>(std::sqrt(d) + 0.5);
  }

private:
  int _number;
};

// http://www.codecodex.com/wiki/Calculate_an_integer_square_root
class IntSqrt
{
public:
  IntSqrt(int n): _number(n) {}

  int operator()() const 
  {
    int remainder = _number;
    if (remainder < 0) { return 0; }

    int place = 1 << (sizeof(int)*8 - 2); // highest power of four that fits

    while (place > remainder) { place /= 4; }

    int root = 0;
    while (place)
    {
      if (remainder >= root + place)
      {
        remainder -= root + place;
        root += place*2;
      }
      root /= 2;
      place /= 4;
    }
    return root;
  }

private:
  int _number;
};

// http://en.wikipedia.org/wiki/Fast_inverse_square_root
class FastSqrt
{
public:
  FastSqrt(int n): _number(n) {}

  int operator()() const
  {
    float number = _number;

    float x2 = number * 0.5F;
    float y = number;
    std::int32_t i = *(std::int32_t*)&y; // the original used long, which is 64 bits on LP64; the trick needs a 32-bit type
    //i = (long)0x5fe6ec85e7de30da - (i >> 1);
    i = 0x5f3759df - (i >> 1);
    y = *(float*)&i;

    y = y * (1.5F - (x2*y*y));
    y = y * (1.5F - (x2*y*y)); // let's be precise

    return static_cast<int>(1/y + 0.5f);
  }

private:
  int _number;
};


int main(int argc, char* argv[])
{
  if (argc != 3) {
    std::cerr << "Usage: " << argv[0] << " integer iterations\n";
    return 1;
  }

  int n = atoi(argv[1]);
  int it = atoi(argv[2]);

  assert(Sqrt(n)() == IntSqrt(n)() &&
          Sqrt(n)() == FastSqrt(n)() && "Different Roots!");
  std::cout << "sqrt(" << n << ") = " << Sqrt(n)() << "\n";

  double time = benchmark(Sqrt(n), it);
  double intTime = benchmark(IntSqrt(n), it);
  double fastTime = benchmark(FastSqrt(n), it);

  std::cout << "Number iterations: " << it << "\n"
               "Sqrt computation : " << time << "\n"
               "Int computation  : " << intTime << "\n"
               "Fast computation : " << fastTime << "\n";

  return 0;
}

And the results:

sqrt(82) = 9
Number iterations: 4096
Sqrt computation : 56
Int computation  : 217
Fast computation : 119

// Note: I had to tweak the program here, as IntSqrt returns -1 :/
sqrt(2147483647) = 46341 // real answer sqrt(2 147 483 647) = 46 340.95
Number iterations: 4096
Sqrt computation : 57
Int computation  : 313
Fast computation : 119

As expected, the Fast computation performs much better than the Int computation.

Oh, and by the way, sqrt is faster :)

Solution 2

Edit: this answer is foolish; just use (int) sqrt(i).

After profiling with proper settings (-march=native -m64 -O3), (int) sqrt(i) was a lot faster than everything below.


Alright, a bit of an old question, but the "fastest" answer has not been given yet. The fastest (I think) is the binary square root algorithm, explained fully in this Embedded.com article.

It basically comes down to this:

// assumes unsigned long is 32 bits; on LP64 platforms use uint32_t instead
unsigned short isqrt(unsigned long a) {
    unsigned long rem = 0;
    int root = 0;
    int i;

    for (i = 0; i < 16; i++) {
        root <<= 1;
        rem <<= 2;
        rem += a >> 30;
        a <<= 2;

        if (root < rem) {
            root++;
            rem -= root;
            root++;
        }
    }

    return (unsigned short) (root >> 1);
}

On my machine (Q6600, Ubuntu 10.10) I profiled this by taking the square root of the numbers 1-100000000. Using isqrt(i) took 2750 ms. Using (unsigned short) sqrt((float) i) took 3600 ms. This was done using g++ -O3. With the -ffast-math compile option the times were 2100 ms and 3100 ms respectively. Note this is without using even a single line of assembler, so it could probably still be much faster.
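The exact timing harness wasn't shown above; here is a self-contained sketch of that kind of measurement, written against uint32_t so the 32-bit assumption in the algorithm is explicit (function names are mine):

```cpp
#include <cstdint>

// The binary square root from above, with uint32_t instead of unsigned long
// (on LP64 platforms unsigned long is 64 bits, which breaks the bit shuffling).
static unsigned short isqrt32(std::uint32_t a) {
    std::uint32_t rem = 0;
    unsigned root = 0;
    for (int i = 0; i < 16; i++) {
        root <<= 1;
        rem = (rem << 2) + (a >> 30); // pull down the next two bits of a
        a <<= 2;
        if (root < rem) {             // can the root afford another set bit?
            root++;
            rem -= root;
            root++;
        }
    }
    return static_cast<unsigned short>(root >> 1);
}

// Sum the roots of 1..upto so the compiler cannot optimize the calls away;
// time a call such as sum_roots(100000000) externally.
unsigned long long sum_roots(std::uint32_t upto) {
    unsigned long long acc = 0;
    for (std::uint32_t i = 1; i <= upto; i++)
        acc += isqrt32(i);
    return acc;
}
```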

The above code works for both C and C++ and with minor syntax changes also for Java.

What works even better for a limited range is a binary search. On my machine this blows the version above out of the water by a factor 4. Sadly it's very limited in range:

#include <stdint.h>

const uint16_t squares[] = {
    0, 1, 4, 9,
    16, 25, 36, 49,
    64, 81, 100, 121,
    144, 169, 196, 225,
    256, 289, 324, 361,
    400, 441, 484, 529,
    576, 625, 676, 729,
    784, 841, 900, 961,
    1024, 1089, 1156, 1225,
    1296, 1369, 1444, 1521,
    1600, 1681, 1764, 1849,
    1936, 2025, 2116, 2209,
    2304, 2401, 2500, 2601,
    2704, 2809, 2916, 3025,
    3136, 3249, 3364, 3481,
    3600, 3721, 3844, 3969,
    4096, 4225, 4356, 4489,
    4624, 4761, 4900, 5041,
    5184, 5329, 5476, 5625,
    5776, 5929, 6084, 6241,
    6400, 6561, 6724, 6889,
    7056, 7225, 7396, 7569,
    7744, 7921, 8100, 8281,
    8464, 8649, 8836, 9025,
    9216, 9409, 9604, 9801,
    10000, 10201, 10404, 10609,
    10816, 11025, 11236, 11449,
    11664, 11881, 12100, 12321,
    12544, 12769, 12996, 13225,
    13456, 13689, 13924, 14161,
    14400, 14641, 14884, 15129,
    15376, 15625, 15876, 16129,
    16384, 16641, 16900, 17161,
    17424, 17689, 17956, 18225,
    18496, 18769, 19044, 19321,
    19600, 19881, 20164, 20449,
    20736, 21025, 21316, 21609,
    21904, 22201, 22500, 22801,
    23104, 23409, 23716, 24025,
    24336, 24649, 24964, 25281,
    25600, 25921, 26244, 26569,
    26896, 27225, 27556, 27889,
    28224, 28561, 28900, 29241,
    29584, 29929, 30276, 30625,
    30976, 31329, 31684, 32041,
    32400, 32761, 33124, 33489,
    33856, 34225, 34596, 34969,
    35344, 35721, 36100, 36481,
    36864, 37249, 37636, 38025,
    38416, 38809, 39204, 39601,
    40000, 40401, 40804, 41209,
    41616, 42025, 42436, 42849,
    43264, 43681, 44100, 44521,
    44944, 45369, 45796, 46225,
    46656, 47089, 47524, 47961,
    48400, 48841, 49284, 49729,
    50176, 50625, 51076, 51529,
    51984, 52441, 52900, 53361,
    53824, 54289, 54756, 55225,
    55696, 56169, 56644, 57121,
    57600, 58081, 58564, 59049,
    59536, 60025, 60516, 61009,
    61504, 62001, 62500, 63001,
    63504, 64009, 64516, 65025
};

inline int isqrt(uint16_t x) {
    const uint16_t *p = squares;

    if (p[128] <= x) p += 128;
    if (p[ 64] <= x) p +=  64;
    if (p[ 32] <= x) p +=  32;
    if (p[ 16] <= x) p +=  16;
    if (p[  8] <= x) p +=   8;
    if (p[  4] <= x) p +=   4;
    if (p[  2] <= x) p +=   2;
    if (p[  1] <= x) p +=   1;

    return p - squares;
}
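Rather than retyping the whole table, the same descent can be sketched with the table built at startup (my own variant, not the author's exact code):

```cpp
#include <cstdint>

static std::uint16_t squares16[256]; // squares16[i] == i*i, same contents as the table above

namespace {
struct TableInit {
    TableInit() { for (int i = 0; i < 256; i++) squares16[i] = static_cast<std::uint16_t>(i * i); }
} tableInit; // fills the table once, before main() runs
}

inline int isqrt16(std::uint16_t x) {
    const std::uint16_t *p = squares16;
    // Each test halves the remaining range; after 8 steps, p - squares16
    // is the largest i with i*i <= x, i.e. floor(sqrt(x)).
    if (p[128] <= x) p += 128;
    if (p[ 64] <= x) p +=  64;
    if (p[ 32] <= x) p +=  32;
    if (p[ 16] <= x) p +=  16;
    if (p[  8] <= x) p +=   8;
    if (p[  4] <= x) p +=   4;
    if (p[  2] <= x) p +=   2;
    if (p[  1] <= x) p +=   1;
    return static_cast<int>(p - squares16);
}
```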

A 32 bit version can be downloaded here: https://gist.github.com/3481770

Solution 3

If you don't mind an approximation, how about this integer sqrt function I cobbled together?

int sqrti(int x)
{
    union { float f; int x; } v; 

    // convert to float
    v.f = (float)x;

    // fast aprox sqrt
    //  assumes float is in IEEE 754 single precision format 
    //  assumes int is 32 bits
    //  b = exponent bias
    //  m = number of mantissa bits
    v.x  -= 1 << 23; // subtract 2^m 
    v.x >>= 1;       // divide by 2
    v.x  += 1 << 29; // add ((b + 1) / 2) * 2^m

    // convert to int
    return (int)v.f;
}

It uses the algorithm described in this Wikipedia article. On my machine it's almost twice as fast as sqrt :)
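Since the bit trick only approximates the root, here is a hedged sketch of a fix-up pass that walks such an estimate to the exact floor(sqrt(x)) (the function name is mine; the loops run longer the worse the initial guess is):

```cpp
// Nudge an approximate root r to the exact floor(sqrt(x)). Assumes x >= 0.
// The 1LL casts avoid overflow when squaring values near 46341 (sqrt of INT_MAX).
int fixup_isqrt(int x, int r)
{
    if (r < 0) r = 0;
    while (r > 0 && 1LL * r * r > x) --r;     // estimate too high: step down
    while (1LL * (r + 1) * (r + 1) <= x) ++r; // estimate too low: step up
    return r;
}
```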

Solution 4

While I suspect you can find plenty of options by searching for "fast integer square root", here are some potentially new ideas that might work well (each works independently, or you can combine them):

  1. Make a static const array of all the perfect squares in the domain you want to support, and perform a fast branchless binary search on it. The resulting index in the array is the square root.
  2. Convert the number to floating point and break it into mantissa and exponent. Halve the exponent and multiply the mantissa by some magic factor (your job to find it). This should be able to give you a very close approximation. Include a final step to adjust it if it's not exact (or use it as a starting point for the binary search above).
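Idea 2 can be sketched with frexp/ldexp; the "magic factor" here is the classic linear fit 0.41731 + 0.59016*m for sqrt(m) on [0.5, 1), followed by one Newton step and an exact adjustment (names and constants are my choices, not the author's):

```cpp
#include <cmath>

// Split x into mantissa and exponent, halve the exponent, approximate the
// mantissa's root linearly, then polish. Returns floor(sqrt(x)) for x >= 0.
int approx_isqrt(int x)
{
    if (x <= 0) return 0;
    int e;
    double m = std::frexp(static_cast<double>(x), &e); // x = m * 2^e, 0.5 <= m < 1
    double r = 0.41731 + 0.59016 * m;                  // linear fit for sqrt(m)
    if (e & 1) { r *= 0.70710678118654752; ++e; }      // odd exponent: fold in 1/sqrt(2)
    r = std::ldexp(r, e / 2);                          // scale by 2^(e/2)
    r = 0.5 * (r + x / r);                             // one Newton step
    // final integer adjustment so the result is exact
    int n = static_cast<int>(r);
    while (n > 0 && 1LL * n * n > x) --n;
    while (1LL * (n + 1) * (n + 1) <= x) ++n;
    return n;
}
```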

Solution 5

This is so short that it almost always gets inlined:

static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"   // clean xmm0 for cvtsi2ss
        "cvtsi2ss %1, %%xmm0\n\t"   // convert num to float, put it to xmm0
        "sqrtss %%xmm0, %%xmm0\n\t" // square root xmm0
        "cvttss2si %%xmm0, %0"      // float to int
        :"=r"(i):"r"(num):"%xmm0"); // i: result, num: input, xmm0: scratch register
    return i;
}

Why clear xmm0 first? From the documentation of cvtsi2ss:

The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged.

GCC Intrinsic version (runs only on GCC):

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v4sf xmm0 = {0, 0, 0, 0};
    xmm0 = __builtin_ia32_cvtsi2ss(xmm0, num);
    xmm0 = __builtin_ia32_sqrtss(xmm0);
    return __builtin_ia32_cvttss2si(xmm0);
}

Intel Intrinsic version (tested on GCC, Clang, ICC):

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __m128 xmm0 = _mm_setzero_ps();
    xmm0 = _mm_cvt_si2ss(xmm0, num);
    xmm0 = _mm_sqrt_ss(xmm0);
    return _mm_cvtt_ss2si(xmm0);
}

All of the above require only SSE1 (not even SSE2).

Note: this is exactly how GCC calculates (int) sqrt((float) num) with -Ofast. If you want correct results for larger num, you can calculate (int) sqrt((double) num) instead (as noted by Gumby The Green in the comments):

static inline int sqrtn(int num) {
    int i = 0;
    __asm__ (
        "pxor %%xmm0, %%xmm0\n\t"
        "cvtsi2sd %1, %%xmm0\n\t"
        "sqrtsd %%xmm0, %%xmm0\n\t"
        "cvttsd2si %%xmm0, %0"
        :"=r"(i):"r"(num):"%xmm0");
    return i;
}

or

#include <xmmintrin.h>
int sqrtn2(int num) {
    register __v2df xmm0 = {0, 0};
    xmm0 = __builtin_ia32_cvtsi2sd(xmm0, num);
    xmm0 = __builtin_ia32_sqrtsd(xmm0);
    return __builtin_ia32_cvttsd2si(xmm0);
}
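To see why the double path matters, the claim can be checked directly: a 32-bit int beyond about 2^24 is no longer exactly representable as a float, so the single-precision route can land on the wrong integer (a small illustration, not part of the original answer):

```cpp
#include <cmath>

// Float route: inputs above ~2^24 may round during the conversion, e.g.
// 16785407 becomes 16785408.0f, whose root can truncate to 4097 instead of 4096.
int isqrt_float(int num)  { return static_cast<int>(std::sqrt(static_cast<float>(num))); }

// Double route: every 32-bit int is exactly representable as a double,
// so the truncated root is always floor(sqrt(num)).
int isqrt_double(int num) { return static_cast<int>(std::sqrt(static_cast<double>(num))); }
```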
Author: Nawaz

Updated on July 05, 2022

Comments

  • Nawaz
    Nawaz almost 2 years

    As we know if n is not a perfect square, then sqrt(n) would not be an integer. Since I need only the integer part, I feel that calling sqrt(n) wouldn't be that fast, as it takes time to calculate the fractional part also.

    So my question is,

    Can we get only the integer part of sqrt(n) without calculating the actual value of sqrt(n)? And can such an algorithm be faster than the sqrt(n) defined in <math.h> or <cmath>?

    If possible, you can write the code in an asm block as well.

  • Piva
    Piva about 13 years
    I liked your (1), but I don't think you need to precalculate the perfect squares. Of course it depends on how many times he is going to call it, but precalculating for just a few calls would be a lot more expensive than the binary search itself.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 13 years
    If you want to square the 'index' at every step of your binary search, be my guest. It will be slooooooow. That's why I suggested precalculating them. Note that I said static const. There is no cost to computing it because it happened before your program was compiled. And even if you support the full range of 32-bit integers, your table will only be 256kb.
  • Elemental
    Elemental about 13 years
    I have used strategy 1 in a high performance context and it worked beautifully. I further enhanced the performance of the search using the knowledge that the next sqrt to be taken was likely to be close to the previous one (the context was graphical), and it made a staggering performance difference.
  • Nawaz
    Nawaz about 13 years
    @R.. : I don't think (1) will be faster than sqrt; binary searching a list of 999999 integers would most likely be slower than sqrt!
  • Saeed Amiri
    Saeed Amiri about 13 years
    This is for floating point but nawaz just needs integer values.
  • Tony Delroy
    Tony Delroy about 13 years
    @Nawaz: given you apparently care enough to ask the question, how about benchmarking it before condemning it. Much will depend on your exact hardware....
  • Matthieu M.
    Matthieu M. about 13 years
    @Saeed: an integer can be trivially converted to a float (back and forth) and I am curious about the applicability of this method. It certainly is the only branchless method without a pre-computed table that I could think of. After that... I guess we could benchmark :) ?
  • Saeed Amiri
    Saeed Amiri about 13 years
    Yes it can, but I think methods like the one in the article I referenced are faster (because they only care about the integer part), but yes, we should benchmark these approaches.
  • MSalters
    MSalters about 13 years
    @R.. : squaring the index takes a single integer operation. There are plenty of integer pipelines. Accessing memory is way more expensive because that's a shared resource. Besides, you'd first have to calculate the address square[i] which like i*i is a single integer operation. So even if accessing square[i] would be free, it still wouldn't be faster.
  • Matthieu M.
    Matthieu M. about 13 years
    @Saeed: done, as expected the Fast Inverse Trick performs better, being branchless pays off I guess
  • Matthieu M.
    Matthieu M. about 13 years
    @Saeed: normally you've got all the includes. I wonder what -ffast-math would give for sqrt.
  • Nawaz
    Nawaz about 13 years
    @Matthieu M : good attempt. +1.. but as I had expected sqrt would still be faster. In fact, sqrt is awesomely faster!!
  • Matthieu M.
    Matthieu M. about 13 years
    @Nawaz: actually, I had a good enough precision with a single iteration of Newton's method in the Fast case which gave only 86, so I don't find sqrt that fast, I would have hoped a hardware accelerated version would perform much better :/
  • Gabe
    Gabe about 13 years
    I'd love to see this benchmarked! I'd also love to see what a "branchless binary search" would look like.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 13 years
    Multiplication is much slower than addition, at least in latency. And in the binary search, latency is what will matter since you can't do the next step without the result of the previous (well, you could try to start squaring both possible next attempts, and then throw away one you don't need).
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 13 years
    As for what "branchless binary search" means, it means simply advancing the left index of the binary search by an integer expression which evaluates to 0 or len/2 based on the difference of the value being tested and the value being sought, for example by making a mask out of the high bit. It can also be done using cmov-type instructions.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE about 13 years
    @Nawaz: You don't search the whole list. You can come up with some obvious initial bounds.
  • orlp
    orlp over 11 years
    @Nawaz and R.: I implemented solution 1 in C for uint16_t's, but it seems to get outperformed by my solution and even (int) sqrt on gcc -O3. You can look at it here: gist.github.com/3481295 . Maybe you can improve my implementation?
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 11 years
    Each of those < operators will compile to a conditional branch (or at least conditional move) because the compiler can't assume anything about the range of values. You can, so you could (for example) use unsigned expressions and just use bit 31 of the difference to get a 0/1 result based on which is greater.
  • orlp
    orlp over 11 years
    @Nawaz and R: I actually implemented it much better now. It's not branchless but it blows everything else so far out of the water: gist.github.com/3481607
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 11 years
    Here's my version: for (s=squares, i=128; i; i=i>>1) s+=-((unsigned)(s[i]-x-1)>>31) & i; return s-squares; Note that you can fix your table to remove the unwanted -1 from the code. There's no need to unroll the loop by hand; gcc -O3 will do it for you.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 11 years
    Try my branchless variant and see if it's faster.
  • orlp
    orlp over 11 years
    @R.: nope, it's slower by around a factor 3.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 11 years
    Is your compiler perhaps using cmov for your version?
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 11 years
    You might also try this simplified version with (implementation-defined) signed right-shift: for (s=squares, i=128; i; i=i>>1) s += s[i]-x-1>>31 & i; return s-squares;
  • orlp
    orlp over 11 years
    @R.: nope it doesn't use cmov. Also, hand unrolling the loop actually is faster by around 20%. Here is the asm output for both versions (note that I made the 32 bit version): gist.github.com/3481749 The full 32 bit version can be downloaded here: gist.github.com/3481770
  • Ricbit
    Ricbit over 11 years
    The array is too large, a cache miss may be more expensive than the sqrt calculation.
  • Fredrik Johansson
    Fredrik Johansson almost 11 years
    I tried that one on iPhone hardware, but it seemed to be slow because of the 'b = N / a' operation.
  • rubenvb
    rubenvb over 10 years
    You should move the "this answer is bullshit" edit to the top of your answer. I wasted time reading bullshit ;-)
  • Viktor Dahl
    Viktor Dahl about 9 years
    Technically this breaks the strict aliasing rule. It doesn't seem to cause a problem under recent gcc (4.9), but the compliant way of doing it would be union { float f; int32_t x } v; v.f = (float) x; v.x -= ... return (int)((float)v.x);.
  • Todd Lehman
    Todd Lehman almost 9 years
    @AndrewTomazos — Unfortunately, your function fails to return the correct answer for N ∈ { 0, 3, 8, 48, 63, 120, 143, ... }.
  • Peter Cordes
    Peter Cordes over 8 years
    This is only useful if integer division is much more efficient than FP sqrt. Or maybe if you need to handle integers that are too large to have exact float representations. Modern x86 hardware (Intel Haswell) can convert to float and back, and do a double-precision FP sqrt, in about 25 clock cycles latency. A single 32bit integer division has 22 to 29 cycles latency. Integer Newton iterations don't do well from a throughput perspective, either.
  • rsaxvc
    rsaxvc almost 7 years
    Two-stage tables might perform well here as well.
  • David Wohlferd
    David Wohlferd almost 6 years
    Do you really need to zero xmm0 before you copy a value into it? Also, inlining can disable optimization for surrounding code. What about using builtins (ie __builtin_ia32_cvtsi2ss, __builtin_ia32_sqrtss __builtin_ia32_cvtss2si) from here? When in comes to using inline asm, less is more.
  • Gumby The Green
    Gumby The Green about 5 years
    The Intel Intrinsic version doesn't round down - it rounds to the nearest int. To fix this, add a t to its last line: return _mm_cvtt_ss2si(xmm0);. These are 5-6x faster than sqrt() on my machine) but wrong answers start appearing when num >= 16,785,407 due to rounding errors on the float. To fix this in the GCC Intrinsic version, change the first line to __v2df xmm0 = {0, 0}; and replace each ss with sd (warning: cuts the speed in half). I don't see a _mm_cvt_si2sd() in Intel's Intrinsics Guide for some reason.
  • metablaster
    metablaster over 4 years
    Not bad, it helped me learn something, but how do you people measure speed? All the sqrt(int) functions I found on the web are slower than std::sqrt according to my measurements. Your function is 27% slower and 3% less accurate than std, by the way.
  • Jan Schultke
    Jan Schultke over 3 years
    double precision isn't good enough for 64-bit square roots and float precision isn't good enough for 32-bit square roots. Try anything beyond 2^31 and you will see inconsistent rounding with floats. Or try anything beyond 2^63 with doubles.
  • Jan Schultke
    Jan Schultke over 3 years
    Are you sure that cvtsi2ss %1 is actually safe? Assuming %1 is edi, there could be stray bits in the upper part of rdi. Maybe you wouldn't notice it in your own tests, but it could happen in principle.
  • MCCCS
    MCCCS over 3 years
    @J.Schultke Good catch, I'll initialize i.
  • Bram
    Bram about 2 years
    Thanks for sharing. This is very useful on μcontrollers that do not have a floating point unit.