Rounding Floating Point Numbers after addition (guard, sticky, and round bits)

floating-point ieee-754


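Single precision means the mantissa field holds 23 bits, plus a hidden one. (The format is 32 bits wide regardless of the machine architecture.) The leading 1 of a normalized number is implicit, so it is not stored: only the 23 bits after the binary point go into the mantissa.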

Next, determine the G and R bits, i.e. the guard and round bits.

The guard bit is the first of the two bits just past bit 0 (the LSB) of the mantissa, which makes it the first bit that will be cut off.

The round bit is the second bit past bit 0 of the mantissa. Here the guard bit is 1 (the 24th fraction bit of the sum) and the round bit is 0, since no further bits are present.

The sticky bit is the OR of all bits to the right of the round bit. It is also 0 here because there are no ones to the right of the round bit. Therefore GRS = 100.
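To make the bit positions concrete, here is a minimal Python sketch that splits the fraction bits of the sum into the 23 kept bits plus G, R and S. The helper name `split_grs` and the string representation are just assumptions for illustration, not anything from a standard library.

```python
def split_grs(fraction_bits: str, keep: int = 23):
    """Split fraction bits into (kept bits, guard, round, sticky)."""
    extra = fraction_bits[keep:]                      # bits that do not fit
    guard = int(extra[0]) if len(extra) > 0 else 0    # first bit past the LSB
    round_bit = int(extra[1]) if len(extra) > 1 else 0  # second bit past the LSB
    sticky = int("1" in extra[2:])                    # OR of everything after that
    return fraction_bits[:keep], guard, round_bit, sticky

# Fraction bits of 1.1100,0000,0000,0000,0000,0011 (24 bits after the point).
sum_fraction = "1100" + "0000" * 4 + "0011"
kept, g, r, s = split_grs(sum_fraction)
print(kept, g, r, s)
```

For the sum in the question this gives G = 1, R = 0, S = 0, with the kept mantissa ending in 001.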

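GRS = 100 means the discarded part is exactly half of the LSB, i.e. a tie. Depending on the book or processor being used, a tie is normally resolved by rounding to the nearest even value (the IEEE 754 default). In this case, since the LSB (least significant bit) of the kept mantissa is 1 (odd), the number is rounded up, giving 1100,0000,0000,0000,0000,010 for the mantissa.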
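The rounding decision itself can be sketched in a few lines of Python. This assumes round-to-nearest, ties-to-even, and the function name `round_nearest_even` is made up for the example. It also ignores the case where rounding up carries out of the mantissa (all ones), which would need a re-normalization step.

```python
def round_nearest_even(kept: int, guard: int, round_bit: int, sticky: int) -> int:
    """Round a kept mantissa (as an integer) using G, R, S; ties go to even."""
    if guard == 0:
        return kept               # below halfway: keep as is (round down)
    if round_bit or sticky:
        return kept + 1           # above halfway: round up
    return kept + (kept & 1)      # exact tie: round up only if the LSB is odd

kept = int("1100" + "0000" * 4 + "001", 2)   # the 23 kept fraction bits
rounded = round_nearest_even(kept, 1, 0, 0)  # GRS = 100
print(format(rounded, "023b"))
```

With GRS = 100 and an odd LSB, the tie rounds up and the mantissa ends in 010 instead of 001.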


Author by

audiFanatic

Updated on July 09, 2022

Comments

audiFanatic almost 2 years

I haven't been able to find a good explanation of this anywhere on the web yet, so I'm hoping somebody here can explain it for me.

I want to add two binary numbers by hand:

1.001₂ * 2²
1.010,0000,0000,0000,0000,0011₂ * 2¹

I can add them no problem. I get the following result after de-normalizing the first number, adding the two, and re-normalizing the sum.

1.1100,0000,0000,0000,0000,0011₂ * 2²
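As a sanity check on that sum, here is a small Python sketch using exact rational arithmetic, so no rounding happens along the way. The helper `from_binary` is a hypothetical name introduced only for this check.

```python
from fractions import Fraction

def from_binary(bits: str, exponent: int) -> Fraction:
    """Exact value of a binary string like '1.001' times 2**exponent."""
    whole, _, frac = bits.replace(",", "").partition(".")
    value = Fraction(int(whole + frac, 2), 2 ** len(frac))
    return value * Fraction(2) ** exponent

a = from_binary("1.001", 2)
b = from_binary("1.010,0000,0000,0000,0000,0011", 1)
expected = from_binary("1.1100,0000,0000,0000,0000,0011", 2)
print(a + b == expected)
```

The comparison should come out true, and the result has 24 bits after the binary point, one more than single precision can store.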

The issue is that this number will not fit into single-precision IEEE 754 format without truncating or rounding off one bit. My assignment asks that we put this number into single-precision IEEE 754 format (which, again, is normally no problem; I can do that easily). It asks us to do so first with guard, round, and sticky bits and then repeat without these bits. However, I'm not exactly sure how these bits help with rounding. I would assume that, without guard, round, and sticky bits, I would just truncate the last bit.
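One way to see the difference is to do the add both ways on integer significands. The Python sketch below is only an illustration under assumed conditions (three extra low-order bits, round-to-nearest-even), not a model of any particular hardware, and all variable names are made up for the example.

```python
# Significands with the hidden 1 and 23 fraction bits, stored as integers.
a_sig = int("1001" + "0" * 20, 2)              # 1.001 * 2^2
b_sig = int("1010" + "0000" * 4 + "0011", 2)   # 1.010,...,0011 * 2^1
shift = 1                                      # exponent difference

# Without extra bits: the bit shifted out of b is lost before the add.
plain = a_sig + (b_sig >> shift)

# With three extra low-order bits (G, R, S) carried through the add.
# No sticky OR is needed here because nothing is shifted past the R position.
wide = (a_sig << 3) + ((b_sig << 3) >> shift)
kept, grs = wide >> 3, wide & 0b111
if grs > 0b100 or (grs == 0b100 and kept & 1):
    kept += 1                                  # round to nearest, ties to even

mask = (1 << 23) - 1                           # drop the hidden bit
print(format(plain & mask, "023b"))            # mantissa without extra bits
print(format(kept & mask, "023b"))             # mantissa with G, R, S
```

Without the extra bits the result is the same as truncating the exact sum (mantissa ending in 001). With them, the tie is detected and the mantissa rounds up to end in 010.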