Casting float to int (bitwise) in C

c floating-point bit-manipulation bitwise-operators

43,814

Solution 1

C has the union type for viewing the same bytes as more than one type:

typedef union {
  int i;
  float f;
} u;
u u1;
u1.f = 45.6789f;
/* u1.i now holds the float's bit pattern read as an int (assumes sizeof(int) == sizeof(float)) */
printf("%d\n", u1.i);
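
A minimal sketch of the same idea with a fixed-width integer, so the size assumption is checked at compile time. The names float_bits and bits_of_float are hypothetical, and the code assumes a C11 compiler and a 32-bit float.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

/* Assumption: float is 32 bits wide on this platform. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Hypothetical union type: both members share the same bytes. */
typedef union {
    uint32_t u;
    float f;
} float_bits;

/* Hypothetical helper: return the bit pattern of a float. */
uint32_t bits_of_float(float value)
{
    float_bits fb;
    fb.f = value;   /* write the float member ... */
    return fb.u;    /* ... then read the same bytes back as an integer */
}
```

On a platform with IEEE 754 single-precision floats, bits_of_float(1.0f) returns 0x3f800000.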

Solution 2

&x gives the address of x, so it has type float*.

(int*)&x casts that pointer to a pointer to int, i.e. to an int*.

*(int*)&x dereferences that pointer to yield an int value. It won't do what you expect on machines where int and float have different sizes.

There could also be endianness issues.

This technique was used in the fast inverse square root algorithm.
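
For illustration, here is a rough sketch of the fast inverse square root idea. It deliberately swaps the pointer cast for memcpy, which copies the same bits without relying on the cast. The function name fast_rsqrt is hypothetical, and the code assumes a 32-bit IEEE 754 float and a positive, normal input.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide on this platform. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Approximate 1/sqrt(x) for positive, normal x (hypothetical helper). */
float fast_rsqrt(float x)
{
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);        /* reinterpret the float's bits as an integer */
    i = 0x5f3759dfu - (i >> 1);      /* magic-constant initial guess */
    memcpy(&y, &i, sizeof y);        /* reinterpret the integer's bits as a float */

    return y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson refinement */
}
```

The result is only an approximation, so compare it against a tolerance rather than for equality.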

Solution 3

(Somebody should double-check this answer, especially border cases and the rounding of negative values. Also, I wrote it for round-to-nearest. To reproduce C’s conversion, this should be changed to round-toward-zero.)

Essentially, the process is:

Separate the 32 bits into one sign bit (s), eight exponent bits (e), and 23 significand bits (f). We will treat these as two's-complement integers.

If e is 255, the floating-point object is either infinity (if f is zero) or a NaN (otherwise). In this case, the conversion cannot be performed, and an error should be reported.

Otherwise, if e is not zero, add 2²³ to f. (If e is not zero, the significand implicitly has a 1 bit at its front. Since f holds 23 bits, that bit has the value 2²³, and adding it makes the bit explicit in f.)

Subtract 127 from e. (This converts the exponent from its biased/encoded form to the actual exponent. If we were doing a general conversion to any value, we would have to handle the special case when e is zero: Subtract 126 instead of 127. But, since we are only converting to an integer result, we can neglect this case, as long as the integer result is zero for these tiny input numbers.)

If s is 0 (the sign is positive) and e is 31 or more, then the value overflows a signed 32-bit integer (it is 2³¹ or larger). The conversion cannot be performed, and an error should be reported.

If s is 1 (the sign is negative) and e is more than 31, then the value overflows a signed 32-bit integer (it is less than or equal to -2³²). If s is 1, e is 31, and f is greater than 2²³ (any of the original significand bits were set), then the value overflows a signed 32-bit integer (it is less than -2³¹; if the original f were zero, it would be exactly -2³¹, which does not overflow). In any of these cases, the conversion cannot be performed, and an error should be reported.

Now we have an s, an e, and an f for a value which does not overflow, so we can prepare the final value.

If s is 1, set f to -f.

The exponent value is for a significand between 1 (inclusive) and 2 (exclusive), but our significand starts with a bit at 2²³. So we have to adjust for that. If e is 23, our significand is correct, and we are done, so return f as the result. If e is greater than 23 or less than 23, we have to shift the significand appropriately. Also, if we are going to shift f right, we may have to round it, to get a result rounded to the nearest integer.

If e is greater than 23, shift f left e-23 bits. Return f as the result.

If e is less than -1, the floating-point number is between -½ and ½, exclusive. Return 0 as the result.

Otherwise, we will shift f right 23-e bits. However, we will first save the bits we need for rounding. Set r to the result of casting f to an unsigned 32-bit integer and shifting it left by 32-(23-e) bits (equivalently, left by 9+e bits). This takes the bits that will be shifted out of f (below) and “left adjusts” them in the 32 bits, so we have a fixed position where they start.

Shift f right 23-e bits.

If r is less than 2³¹, do nothing (this is rounding down; the shift truncated bits). If r is greater than 2³¹, add one to f (this is rounding up). If r equals 2³¹, add the low bit of f to f. (If f is odd, add one to f. Of the two equally near values, this rounds to the even value.) Return f.
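
The steps above could be sketched in C roughly as follows. This is an untested-in-production illustration, not a definitive implementation. The function name float_bits_to_int32 and the bool/out-parameter error reporting are assumptions. The sketch places the implicit leading bit at 2²³, since f holds 23 stored bits. It also rounds the magnitude and applies the sign last instead of negating f before shifting; round-to-nearest-even is symmetric about zero, so the result should be the same, and it avoids shifting a negative value in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert IEEE 754 binary32 bits to int32_t using integer operations only,
   rounding to nearest with ties to even (not C's round-toward-zero).
   Returns false for NaN, infinity, and values that do not fit. */
bool float_bits_to_int32(uint32_t bits, int32_t *out)
{
    uint32_t s = bits >> 31;                      /* sign bit */
    int32_t  e = (int32_t)((bits >> 23) & 0xFFu); /* biased exponent */
    uint32_t f = bits & 0x7FFFFFu;                /* 23 stored significand bits */

    if (e == 255) return false;                   /* infinity or NaN */
    if (e != 0) f |= UINT32_C(1) << 23;           /* make the implicit bit explicit */
    e -= 127;                                     /* remove the bias */

    if (e >= 31) {
        /* Only exactly -2^31 still fits in a signed 32-bit integer. */
        if (s == 1 && e == 31 && f == (UINT32_C(1) << 23)) {
            *out = INT32_MIN;
            return true;
        }
        return false;
    }
    if (e < -1) { *out = 0; return true; }        /* magnitude below 1/2 */

    uint32_t mag;                                 /* magnitude of the result */
    if (e >= 23) {
        mag = f << (e - 23);                      /* no fraction bits to discard */
    } else {
        int shift = 23 - e;                       /* 1..24 bits are discarded */
        uint32_t r = f << (32 - shift);           /* discarded bits, left-adjusted */
        mag = f >> shift;
        if (r > 0x80000000u) mag += 1;            /* above half: round up */
        else if (r == 0x80000000u) mag += mag & 1u; /* exactly half: round to even */
    }
    *out = s ? -(int32_t)mag : (int32_t)mag;
    return true;
}
```

For example, the bit pattern 0x40200000 (2.5) converts to 2 and 0x40600000 (3.5) converts to 4, because ties go to the even neighbor.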

Solution 4

// With the proviso that your compiler implementation uses
// the same number of bytes for an int as for a float:
// example float
float f = 1.234f;
// get the address of the float, cast it to a pointer to int, dereference
int i = *((int *)&f);
// get the address of the int, cast it to a pointer to float, dereference
float g = *((float *)&i);
// %x expects an unsigned int, so cast i before printing
printf("%f %f %08x\n", f, g, (unsigned)i);

Solution 5

float x = 43.133f;
int y;

assert (sizeof x == sizeof y);
memcpy (&y, &x, sizeof x);
...
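
A slightly fuller sketch of the memcpy approach, wrapped in two helpers so the copy works in both directions. The names float_to_bits and bits_to_float are hypothetical, and the size check is done at compile time with C11 static_assert instead of a run-time assert.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide on this platform. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Copy the bytes of a float into an integer of the same size. */
uint32_t float_to_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Copy the bytes of an integer back into a float. */
float bits_to_float(uint32_t bits)
{
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because only bytes are copied, bits_to_float(float_to_bits(x)) gives back the same bit pattern as x.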


Anonymous

Updated on July 09, 2022

Comments

Anonymous almost 2 years
Given the 32 bits that represent an IEEE 754 floating-point number, how can the number be converted to an integer, using integer or bit operations on the representation (rather than using a machine instruction or compiler operation to convert)?

I have the following function but it fails in some cases:

Input: int x (contains 32 bit single precision number in IEEE 754 format)
```
  if(x == 0) return x;

  unsigned int signBit = 0;
  unsigned int absX = (unsigned int)x;
  if (x < 0)
  {
      signBit = 0x80000000u;
      absX = (unsigned int)-x;
  }

  unsigned int exponent = 158;
  while ((absX & 0x80000000) == 0)
  {
      exponent--;
      absX <<= 1;
  }

  unsigned int mantissa = absX >> 8;

  unsigned int result = signBit | (exponent << 23) | (mantissa & 0x7fffff);
  printf("\nfor x: %x, result: %x",x,result);
  return result;
```
- Basile Starynkevitch over 11 years
  
  This doesn't cast a float into an int. It just copies their machine representation bitwise, without e.g. converting 2.03e1 to 20 [by rounding] as the (int)2.03e1 cast will.
- Ry- over 11 years
  
  You want to do it bitwise? Well, that's how you do it bitwise - it just reinterprets the bytes. No steps, really.
- Anonymous over 11 years
  
  But 0x7eff8965 = 1325268755 (after casting). If you use the HEX in IEEE 754 Calc, you get 1.6983327e+38 and HEX to decimal gives: 2130676069 - none of them give the correct result of 1325268755.
- Paul Hankin over 11 years
  
  This code has undefined behavior in C. See section 6.5 in the standard.
- Eric Postpischil over 11 years
  
  Is your question this: Given the 32 bits that represent a float x, how can the conversion (int) x be implemented, using integer/bit operations on the representation (rather than using a machine instruction to convert floating-point to integer)?
- Anonymous over 11 years
  
  @EricPostpischil - Yes! exactly.
- Jonathan Leffler over 11 years
  
  There's another related question by Anon: Negate Floating Number in C, also about bitwise manipulation of IEEE 754 values. There was a second related question in the last 24 hours or so: How to manually (bitwise) perform (float)x.
- Admin over 11 years
  
  Indeed, stackoverflow.com/questions/12336314/… Having the same problem, it doesn't want to round correctly... very frustrating
- Anonymous over 11 years
  
  @Silver - yes! I still have to work on float_times_four. That is time consuming too!
- Admin over 11 years
  
  For float_times_four, you want to separate it into a bunch of cases (is NaN, is zero, is infinity, is normal, is denormalized (that last one was the part that took me a while))
- Anonymous over 11 years
  
  I have already done that. If less than 0x0071FFFF then just return uf*4, else just add to mantissa. But I am not sure what to do when both exponent and mantissa have to be changed. Also, are you converting the number and doing multiplication, or just manipulating bits in the IEEE form?
- Eric Postpischil over 11 years
  
  I asked whether you were trying to convert a float (given its representation) to an int, and you answered yes. But your code looks like you are trying to convert an int to a float. Which is it? (The latter is addressed here.)
- Gabe over 11 years
  
  BTW, the code you posted converts a 32-bit signed int to its 32-bit IEEE 754 single-precision with rounding toward zero. I know because I wrote it yesterday.
- phuclv almost 5 years
  
  duplicates: How to manually (bitwise) perform (float)x?, Converting Int to Float or Float to Int using Bitwise operations (software floating point), How to convert an unsigned int to a float?
Anonymous over 11 years

So you are saying that the code just gets the location of x and prints it out? In that case, the value would change on each run.
Basile Starynkevitch over 11 years

No, it gives the integer contained at the location of the float, so when sizeof(int) == sizeof(float) it gives the int with the same machine bit representation as your x; nothing is printed unless you call a printing routine like printf (which is not in your question).
Anonymous over 11 years

Ok, so it gives the value stored at the location in memory and casts it to an int type. How can I do this without casting?
Anonymous over 11 years

memcpy did not work. int x (contains 32 bit float) is the input, then int result; memcpy(&result, &x, 4) does not work. (4 is ok as it will only run on 32bit machines)
wildplasser over 11 years

Maybe your assert (or your sizeof) is broken? BTW: Oops, I should have used x instead of f. BRB.
Anonymous over 11 years

Thanks for the explanation. I wrote the function but it fails in some cases.
chux - Reinstate Monica over 10 years

"Subtract 127 from e." happens when e > 0. Else "Subtract 126 from 0."
Eric Postpischil over 10 years

@chux: Yes, one would need to adjust when converting a floating-point encoding to a number in general. This question asks about the special case of converting a floating-point encoding to an integer. In that case, we can neglect proper handling of tiny values, since they will produce zero in the end.
Gauthier about 9 years

Meaningful use: receiving two int16_ts on a bus, that actually represent a float32. Reinterpret the two int16_t as a float.
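
A rough sketch of that use case, assuming the first word carries the upper 16 bits of the float's encoding and the second the lower 16 bits (the actual word order depends on the bus protocol). The name float_from_words is hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide on this platform. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Assumption: `high` carries bits 31..16 and `low` carries bits 15..0. */
float float_from_words(int16_t high, int16_t low)
{
    /* Go through uint16_t first so negative words do not sign-extend. */
    uint32_t bits = ((uint32_t)(uint16_t)high << 16) | (uint16_t)low;
    float value;
    memcpy(&value, &bits, sizeof value);   /* reinterpret the bits as a float */
    return value;
}
```

Assembling the 32-bit value with shifts keeps the result independent of the host's byte order; only the word order on the bus has to be known.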
Koopakiller over 8 years

Please add some information about how your code works.
TLW almost 8 years

This is undefined behavior in every C standard I know of.
user694733 about 7 years

@TLW Type punning through a union is not UB since C99. This is explicitly mentioned in, for example, N1256 6.5.2.3 footnote 82.
nitronoid over 6 years

Your second example will work if you use a rvalue reference, replace (int&) with (int&&). This is required as the expression returns an rvalue reference which lvalue references cannot bind to. I assume you could also use (const int &) to bind to both.
Björn Lindqvist over 6 years

@BasileStarynkevitch: what would the problem with endianness be? If you are just looking to pick out the bits of a float, I don't think it would matter if ints are stored big- or little-endian.
Matthieu Brucher over 5 years

This doesn't add anything to the previous answers.
Dino Dini over 5 years

I do not agree. It's a nice self contained example, Mr. Brucher with a reputation of 6,923
Mark Walsh over 5 years

Endianness would be a problem if you were converting a float to an unsigned int, where you are using the bits as flags and the sending function/program/device can only send floats.