How to normalize a mantissa


Solution 1

A floating point number is normalized when we force the integer part of its mantissa to be exactly 1 and allow its fraction part to be whatever we like.

For example, if we were to take the number 13.25, which is 1101.01 in binary, 1101 would be the integer part and 01 would be the fraction part.

I could represent 13.25 as 1101.01*(2^0), but this isn't normalized because the integer part is not 1. However, we are allowed to shift the mantissa to the right one digit if we increase the exponent by 1:

  1101.01*(2^0)
= 110.101*(2^1)
= 11.0101*(2^2)
= 1.10101*(2^3)

This representation 1.10101*(2^3) is the normalized form of 13.25.
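In code, the shifting above amounts to locating the most significant set bit: if an integer value whose leading 1 sits at bit position p represents value * 2^exp, then the normalized form is 1.fraction * 2^(exp + p). A minimal sketch in C (the function and variable names are mine, not from the question):

```c
#include <assert.h>
#include <stdint.h>

/* Normalize value * 2^exp so it reads 1.fff... * 2^e.
 * Each one-place left move of the binary point adds 1 to the exponent,
 * exactly as in 1101.01 * 2^0 == 1.10101 * 2^3. */
static int normalized_exponent(uint32_t value, int exp)
{
    int msb = 0;
    for (int i = 31; i >= 0; i--) {      /* find the leading 1 bit */
        if (value & (1u << i)) { msb = i; break; }
    }
    return exp + msb;                    /* binary point moved msb places */
}
```

For 13.25 = 53 * 2^-2 (53 is 110101 in binary, with the point two places from the right), the leading 1 is at bit 5, so the normalized exponent is -2 + 5 = 3, matching 1.10101 * 2^3.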


That said, we know that normalized floating point numbers will always come in the form 1.fffffff * (2^exp)

For efficiency's sake, we don't bother storing the 1 integer part in the binary representation itself, we just pretend it's there. So if we were to give your custom-made float type 5 bits for the mantissa, we would know the bits 10100 would actually stand for 1.10100.
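Dropping the hidden 1 and fitting what remains into the mantissa field can be sketched like this (one possible approach, not required by any standard; mbits is the chosen mantissa width, and extra bits are truncated rather than rounded):

```c
#include <assert.h>
#include <stdint.h>

/* Given value with its leading 1 at bit position msb, drop that hidden
 * bit and left-align the remaining fraction bits in an mbits-wide field. */
static uint32_t stored_mantissa(uint32_t value, int msb, int mbits)
{
    uint32_t frac = value & ((1u << msb) - 1);   /* bits below the leading 1 */
    if (mbits >= msb)
        return frac << (mbits - msb);            /* pad with trailing zeros */
    return frac >> (msb - mbits);                /* truncate extra bits */
}
```

For 13.25 (binary 110101, leading 1 at bit 5) with a 5-bit mantissa this yields 10101; with the standard 23-bit mantissa it yields 10101 followed by 18 zeros.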

For example, with the standard 23-bit mantissa, the 13.25 example above would store 10101 followed by 18 zeros; the implied leading 1 completes 1.10101.


As for the exponent bias, let's take a look at the standard 32-bit float format, which is broken into 3 parts: 1 sign bit, 8 exponent bits, and 23 mantissa bits:

s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm

The exponents 00000000 and 11111111 have special purposes (like representing Inf and NaN), so with 8 exponent bits, we could represent 254 different exponents, say 2^1 to 2^254, for example. But what if we want to represent 2^-3? How do we get negative exponents?

The format fixes this problem by automatically subtracting 127 from the exponent. Therefore:

  • 0000 0001 would be 1 -127 = -126
  • 0010 1101 would be 45 -127 = -82
  • 0111 1111 would be 127-127 = 0
  • 1000 1000 would be 136-127 = 9

This changes the exponent range from 2^1 ... 2^254 to 2^-126 ... 2^+127 so we can represent negative exponents.
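The bias arithmetic is just an addition when encoding and a subtraction when decoding; a sketch:

```c
#include <assert.h>

/* binary32 stores an actual exponent e as e + 127 (excess-127 notation). */
static unsigned encode_exponent(int e)      { return (unsigned)(e + 127); }
static int      decode_exponent(unsigned s) { return (int)s - 127; }
```

Decoding the stored patterns from the bullet list above (1, 45, 127, 136) gives back -126, -82, 0, and 9.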

Solution 2

Tommy -- chux and eigenchris, along with the others, have provided excellent answers, but if I am reading your comments correctly, you still seem to be struggling with the nuts and bolts of "how would I take this info and use it to create a custom float representation where the user specifies the number of bits for the exponent?" Don't feel bad, it is as clear as mud the first dozen times you go through it. I think I can take a stab at clearing it up.

You are familiar with the IEEE754-Single-Precision-Floating-Point representation of:

IEEE-754 Single Precision Floating Point Representation of (13.25)

  0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 |- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|
 |s|      exp      |                  mantissa                   |

That is the 1-bit sign bit, the 8-bit biased exponent (in excess-127 notation), and the remaining 23-bit mantissa.

When you allow the user to choose the number of bits in the exponent, you are going to have to rework the exponent notation to work with the new user-chosen limit.

What will that change?

  • Will it change the sign-bit handling -- No.

  • Will it change the mantissa handling -- No (you will still convert the mantissa/significand to "hidden bit" format).

So the only thing you need to focus on is exponent handling.

How would you approach this? Recall that the current 8-bit exponent is in what is called excess-127 notation (where 127, the largest value representable in 7 bits, allows any needed exponent to be expressed within the current 8-bit limit). If your user chooses 6 bits as the exponent size, then what? You will have to provide a similar fixed number to ensure your new excess-## notation works within the user limit.

Take a 6-bit user limit: a natural choice for the bias is then 31 (the largest value that can be represented in 5 bits). To that you apply the same logic (taking the 13.25 example above). The binary representation for the number is 1101.01; you move the binary point 3 positions to the left to get 1.10101, which gives you an exponent of 3.

In your 6-bit exponent case you would add 3 + 31 to obtain your excess-31 notation for the exponent: 100010. Then put the mantissa in "hidden bit" format (i.e. drop the leading 1 from 1.10101), resulting in your new custom Tommy Precision Representation:

IEEE-754 Tommy Precision Floating Point Representation of (13.25)

  0 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 |- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -|
 |s|    exp    |                    mantissa                     |

With 1-bit sign-bit, 6-bit biased exponent (in 6-bit excess-31 notation), and the remaining 25-bit mantissa.

The same rules apply to reversing the process to get your floating point number back from the above notation (just using 31 instead of 127 to back the bias out of the exponent).
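The whole packing step for a positive number can be sketched for an arbitrary exponent width. This is a hypothetical helper, not Tommy's actual code: it handles positive normal numbers only, truncates the mantissa, and ignores Inf/NaN and subnormals:

```c
#include <assert.h>
#include <stdint.h>

/* Pack value * 2^exp (value > 0) into 1 sign bit, ebits exponent bits
 * (excess-(2^(ebits-1) - 1) notation), and 31 - ebits mantissa bits. */
static uint32_t pack_custom(uint32_t value, int exp, int ebits)
{
    int mbits = 31 - ebits;
    int bias  = (1 << (ebits - 1)) - 1;           /* 31 when ebits == 6 */
    int msb   = 31;
    while (!(value & (1u << msb))) msb--;         /* leading 1 position */
    uint32_t frac = value & ((1u << msb) - 1);    /* drop the hidden bit */
    uint32_t mant = (mbits >= msb) ? frac << (mbits - msb)
                                   : frac >> (msb - mbits);
    uint32_t e = (uint32_t)(exp + msb + bias);    /* biased exponent */
    return (e << mbits) | mant;                   /* sign bit stays 0 */
}
```

pack_custom(53, -2, 6) encodes 13.25 in the 1-6-25 layout shown above (0 100010 10101 followed by zeros), and pack_custom(53, -2, 8) reproduces the standard binary32 pattern for 13.25.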

Hopefully this helps in some way. I don't see much else you can do if you are truly going to allow for a user-selected exponent size. Remember, the IEEE-754 standard wasn't something that was guessed at and a lot of good reasoning and trade-offs went into arriving at the 1-8-23 sign-exponent-mantissa layout. However, I think your exercise does a great job at requiring you to firmly understand the standard.

One thing not addressed in this discussion is the effect this change would have on the range of numbers that can be represented in this custom precision floating point representation. I haven't looked at it closely, but the primary limitation would seem to be a reduction in the MAX/MIN that can be represented.

Solution 3

"Normalization process" converts the inputs into a select range.

binary32 expects the significand (not mantissa) to be in the range 1.0 <= s < 2.0 unless the number has a minimum exponent.

Example:
value = 12, exp = 4 is the same as
value = 12/(2*2*2), exp = 4 + 3
value = 1.5, exp = 7

Since the significand always has a leading digit of 1 (unless the number has a minimum exponent), there is no need to store it. And rather than storing the exponent as 7, a bias of 127 is added to it.

value = 1.5 decimal --> 1.1000...000 binary --> 1000...000 stored binary (23 fraction bits; the leading 1 is dropped)
exp = 7 --> bias exp 7 + 127 --> 134 decimal --> 10000110 binary

The binary pattern stored is the concatenation of the sign, the biased exponent, and the significand with its leading 1 bit implied:

0 10000110 1000...000 (1 + 8 + 23 = 32 bits)
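These steps can be checked against the compiler's own float encoding. A sketch for positive inputs (memcpy is the portable way to view a float's bit pattern):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the binary32 bit pattern for value * 2^exp (value > 0) by hand. */
static uint32_t to_binary32(uint32_t value, int exp)
{
    int msb = 31;
    while (!(value & (1u << msb))) msb--;          /* leading 1 position */
    uint32_t frac = value & ((1u << msb) - 1);     /* hidden bit dropped */
    uint32_t mant = (23 >= msb) ? frac << (23 - msb)
                                : frac >> (msb - 23);
    return ((uint32_t)(exp + msb + 127) << 23) | mant;
}
```

For value = 12, exp = 4 this yields the pattern 0 10000110 1000...000 above, and copying those bits into a float gives 192.0, i.e. 12 * 2^4.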

When the biased exponent is 0 (its minimum value), the implied bit becomes 0, so very small (subnormal) numbers and 0.0 can be stored.

When the biased exponent is 255 (its maximum value), the data stored no longer represents finite numbers but "infinity" and "Not-a-Numbers" (NaN).
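Decoding reverses these steps: undo the bias and reattach the implied bit (0 when the stored exponent is 0, 1 otherwise). A sketch using ldexp from math.h (decode32 is my name for it; the exponent-255 Inf/NaN cases are left out):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Recover the value from a binary32 bit pattern (finite numbers only). */
static double decode32(uint32_t bits)
{
    unsigned sign = bits >> 31;
    int      e    = (int)((bits >> 23) & 0xFFu);   /* stored exponent */
    uint32_t m    = bits & 0x7FFFFFu;              /* 23 fraction bits */
    double frac   = m / 8388608.0;                 /* m / 2^23 */
    double v = (e == 0) ? ldexp(frac, -126)        /* implied bit is 0 */
                        : ldexp(1.0 + frac, e - 127); /* implied bit is 1 */
    return sign ? -v : v;
}
```

decode32(0x43400000) gives 192.0, recovering the value = 12, exp = 4 example.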


Author: Tommy K (updated on July 16, 2022)

Comments

  • Tommy K
    Tommy K almost 2 years

    I'm trying to convert an int into a custom float, in which the user specifies the number of bits reserved for the exp and mantissa, but I don't understand how the conversion works. My function takes in an int value and an int exp to represent the number (value * 2^exp), i.e. value = 12, exp = 4 returns 192, but I don't understand the process I need to do to change these. I've been looking at this for days and playing with IEEE converter web apps, but I just don't understand what the normalization process is. I see that it's "move the binary point and adjust the exponent", but I have no idea what this means; can anyone give me an example to go off of? Also, I don't understand what the exponent bias is. The only info I have is that you just add a number to your exponent, but I don't understand why. I've been searching Google for an example I can understand but this just isn't making sense to me.

  • Tommy K
    Tommy K about 9 years
    Can you give a more concrete example of how this is done in code? I understand that 3.1416 in binary would be 11.00100100001111..., so I need to normalize it to 1.100100100001111... x 2^1. I get the abstract part, but I don't understand how to actually implement this.
  • kashyap
    kashyap about 3 years
    @eigenchris, can you please tell me how to identify the integer part?
  • eigenchris
    eigenchris about 3 years
    @kashyap the integer part is the part to the left of the decimal point (also called the radix point).