How to speed up floating-point to integer number conversion?

61

Solution 1

Most of the other answers here just try to eliminate loop overhead.

Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.

Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU

Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.

Solution 2

inline int float2int( double d )
{
   union Cast
   {
      double d;
      long l;
    };
   volatile Cast c;
   // adding 2^52 + 2^51 pushes the rounded integer value into the low
   // 32 bits of the double's bit pattern (assumes IEEE-754 doubles and
   // a little-endian target)
   c.d = d + 6755399441055744.0;
   return c.l;
}

// this is the same thing but it's
// not always optimizer safe
inline int float2int( double d )
{
   d += 6755399441055744.0;
   return reinterpret_cast<int&>(d);
}

for(int i = 0; i < HUGE_NUMBER; i++)
     int_array[i] = float2int(float_array[i]);

The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function rounds the input to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).
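
A minimal sanity check of the union version above (just a sketch; it assumes IEEE-754 doubles and a little-endian target, and that float2int from the first snippet is in scope):

#include <cstdio>

int main()
{
   printf("%d\n", float2int(2.3));        // 2  (rounds to nearest)
   printf("%d\n", float2int(2.7));        // 3
   printf("%d\n", float2int(-2.7));       // -3
   printf("%d\n", float2int(1000000.25)); // 1000000
   return 0;
}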

Solution 3

I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customers have SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This allows the compiler to use the SSE scalar conversion instructions, which are twice as fast as even the magic-number technique.
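
For the scalar case, a minimal sketch (the function name is just illustrative): compiled with cl /O2 /arch:SSE2, the plain cast below becomes a single cvttss2si instruction instead of a call to _ftol.

// with /arch:SSE2 the compiler emits cvttss2si here, which truncates
// toward zero exactly as the C++ cast requires
inline int cast_to_int(float f)
{
   return static_cast<int>(f);
}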

Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.

Solution 4

There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
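
For what it's worth, here is a rough sketch of driving FISTTP directly from GCC/Clang inline assembly (a sketch under the assumption of an x86 target with an SSE3-capable CPU; the helper name is made up):

// FISTTP stores st(0) to memory as a truncated integer and pops the
// x87 stack, ignoring the current rounding mode -- which is exactly
// the C/C++ cast semantics.
static inline int fisttp_trunc(double d)
{
   int result;
   // "t" places d in st(0); the instruction pops it, hence the "st" clobber
   __asm__ ("fisttpl %0" : "=m"(result) : "t"(d) : "st");
   return result;
}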

Solution 5

The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq, which converts four packed floats to four packed int32s in a single instruction. You don't need assembly to do this; MSVC exposes the relevant instructions as compiler intrinsics -- _mm_cvtps_epi32() (cvtps2dq) uses the current rounding mode (round to nearest by default), and _mm_cvttps_epi32() (cvttps2dq) truncates like a C-style cast. (If your source data is doubles rather than floats, cvtpd2dq / _mm_cvtpd_epi32() converts two at a time into the low half of an SSE register; do it twice and shuffle the two results together to get four int32s.)

If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software-pipeline a little and process sixteen floats per loop iteration, for example (using the SSE2 intrinsics from <emmintrin.h>):

for(int i = 0; i < HUGE_NUMBER; i += 16)
{
   // equivalent to int_array[i] = (int)float_array[i], sixteen elements per iteration
   __m128 a = _mm_load_ps(float_array + i + 0);   // aligned load of 4 floats
   __m128 b = _mm_load_ps(float_array + i + 4);
   __m128 c = _mm_load_ps(float_array + i + 8);
   __m128 d = _mm_load_ps(float_array + i + 12);
   __m128i ia = _mm_cvttps_epi32(a);              // cvttps2dq: 4 floats -> 4 int32s, truncating
   __m128i ib = _mm_cvttps_epi32(b);
   __m128i ic = _mm_cvttps_epi32(c);
   __m128i id = _mm_cvttps_epi32(d);
   _mm_store_si128((__m128i*)(int_array + i + 0),  ia);   // aligned store of 4 ints
   _mm_store_si128((__m128i*)(int_array + i + 4),  ib);
   _mm_store_si128((__m128i*)(int_array + i + 8),  ic);
   _mm_store_si128((__m128i*)(int_array + i + 12), id);
}

The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0, you will get a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could schedule its way around this problem, but in practice it doesn't.)

Failing this SSE juju, you can supply the /QIfist option to MSVC, which causes it to issue a single fist opcode instead of a call to _ftol; this means it simply uses whichever rounding mode happens to be set in the CPU, without ensuring it matches ANSI C's required truncation semantics. The Microsoft docs say /QIfist is deprecated because their floating-point code is fast now, but a disassembler will show you that this is unjustifiably optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a latent SSE op, and thus unnecessarily slow.

I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
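
For the PPC case, a minimal sketch of the VMX/AltiVec route (assuming a PPC target built with AltiVec enabled; vec_cts is the intrinsic for the vctsxs instruction, and the function name is made up):

#include <altivec.h>

// converts four floats to four saturated, truncated int32s;
// both pointers must be 16-byte aligned
void float4_to_int4(const float *in, int *out)
{
   vector float vf = vec_ld(0, in);        // aligned load of 4 floats
   vector signed int vi = vec_cts(vf, 0);  // vctsxs with scale factor 0
   vec_st(vi, 0, out);                     // aligned store of 4 ints
}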

Comments

  • Nils Pipenbrinck
    Nils Pipenbrinck over 15 years
    SSE (or, if you're cross-platform, AltiVec or NEON) will give you roughly the same speed as a memcpy. If bulk conversion is a problem, the two-liner in assembly or intrinsic-based C is well worth the work.
  • Max Lybbert
    Max Lybbert over 15 years
    I don't think this does what you're expecting. The value in l will be the same bit pattern as in d, but it won't be anything similar to the same number: 6.054 != -9620726 (my machine, 32 bit little-endian).
  • P Daddy
    P Daddy over 15 years
    @Max: The expectation is a 32-bit "long" type (and, of course, an IEEE-754 double). Given these, this works, although I doubt it could be any faster than "movsd xmm0, mmword ptr [d]; cvttsd2si eax, xmm0; mov dword ptr [i], eax" (which is what my compiler generates for the straight cast).
  • deft_code
    deft_code over 15 years
    I learned this trick from the Lua source code. There are some places where it doesn't work, but I've never found one. It works fine on my core2duo and my old pentium.
  • Drew Dormann
    Drew Dormann over 15 years
    That Visual Studio setting exists because reordering floating-point math can produce slightly different results, even if it shouldn't mathematically, such as "a * (b + c)" vs "a*b + a*c".
  • Jay Conrod
    Jay Conrod over 15 years
    This seems like it would have a very low chance of working if the machine isn't exactly what you expect. Also I can't believe it would be better than other methods mentioned here.
  • Martin York
    Martin York over 15 years
    That's scary. I assume this is only valid for a particular type of floating-point number (IEEE-754?). I think you should make this explicit in your answer (unless it is true everywhere), and also note that C++ does not specify a particular floating-point standard, so you should verify before use.
  • akauppi
    akauppi about 15 years
    I think FISTTP will automatically speed up the crippled '(int)float_val' casts on recompilation if SSE3 support is enabled ('-msse3' for gcc) and the CPU is SSE3-capable. While the 'fix' is tied to the SSE3 feature set, it is actually an x87-side feature.
  • akauppi
    akauppi about 15 years
    It's not a simple instruction. See: software.intel.com/en-us/articles/… (hi, Norman! Wouldn't have thought of you.... ;)
  • akauppi
    akauppi about 15 years
    See software.intel.com/en-us/articles/… for how many instructions it takes to convert float->int (on X87).
  • Serge
    Serge over 12 years
    Thanks for the info, but I've just run a quick test on mac with gcc-4.2 (with -O3) but it seems that lrint yields the same time as plain cast does.
  • luispedro
    luispedro over 12 years
    I think gcc 4.2 might be too old. I know, from experience, that 4.1 did not yet do the inline (it used a function call).
  • Kuba hasn't forgotten Monica
    Kuba hasn't forgotten Monica about 11 years
    On things that are not big iron and are not GPUs, IEEE-754 is where things are :)
  • Ben Voigt
    Ben Voigt over 8 years
    Note that using a union for type-punning like this is legal in C11 but undefined behavior in all versions of C++
  • nspo
    nspo about 7 years
    Truncating by using 6755399441055743.5 does not seem to work in this case: cpp.sh/2dw45
  • chux - Reinstate Monica
    chux - Reinstate Monica over 3 years
    "if you want truncation instead use 6755399441055743.5" fails on two accounts: 6755399441055743.5 is not exactly representable in IEEE 754 -- it rounds to the same double as 6755399441055744.0 -- and conceptually it does not truncate in the right direction for negative numbers.
  • nyanpasu64
    nyanpasu64 over 3 years
    Long is 64-bit on Linux x86-64, but 32-bit on Win64 and x86-32. Is it meant to be 32-bit or 64-bit or does it not matter? Could this answer be rewritten using int32_t or int64_t?