Rutgers University, Santosh Nagarakatte

Floating point
• Integers typically written in ordinary decimal form
    • E.g., 1, 10, 100, 1000, 10000, 12456897, etc.
• But, can also be written in scientific notation
    • E.g., 1 x 10^4, 1.2456897 x 10^7
• What about binary numbers?
    • Works the same way
    • 0b100 = 0b1 x 2^2
• Scientific notation gives a natural way for thinking about floating point numbers
    • 0.25 = 2.5 x 10^-1 = 0b1 x 2^-2
• How to represent in computers? Why not use fixed point numbers?

Numerical Values
Three different cases (k = number of bits in the exponent field):
• Normalized values
    • exponent field ≠ 0 and exponent field ≠ 2^k - 1 (all 1's)
    • exponent = binary value - Bias
        • Bias = 2^(k-1) - 1 (e.g., 127 for float)
    • Value of the number = 1.(mantissa field)
    • Ex: (sign: 0, exp: 1, mantissa: 1) would give 0b1.1 x 2^-126
• Denormalized values
    • exponent field = 0
    • exponent = 1 - Bias (e.g., -126 for float)
    • Value of the number = 0.(mantissa field) (no leading 1)
    • Ex: (sign: 0, exp: 0, mantissa: 10) would give 0b0.10 x 2^-126
• Special values (exponent field all 1's): represent +∞, -∞, and NaN
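
The three cases above can be sketched in C for the 32-bit float layout (k = 8, Bias = 127). This is only an illustration, assuming float is the IEEE single precision format; the names float_fields, split_float, pow2 and fields_to_value are hypothetical helpers, not part of any library.

```c
#include <assert.h>
#include <math.h>    /* only for the NAN and INFINITY macros */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three raw fields of a 32-bit float. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exp_field; /* 8 bits */
    uint32_t mantissa;  /* 23 bits */
} float_fields;

/* Copy the bits of f into an integer and cut out the three fields. */
float_fields split_float(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumes a 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    float_fields r = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return r;
}

/* 2^e by repeated doubling or halving; exact in double for the range used here. */
static double pow2(int e) {
    double p = 1.0;
    for (; e > 0; e--) p *= 2.0;
    for (; e < 0; e++) p /= 2.0;
    return p;
}

/* Rebuild the numerical value from the fields, one branch per case. */
double fields_to_value(float_fields ff) {
    const int bias = 127;                     /* 2^(8-1) - 1 */
    double sign = ff.sign ? -1.0 : 1.0;
    double frac = ff.mantissa / 8388608.0;    /* mantissa field / 2^23 */

    if (ff.exp_field == 0xFFu)                /* special values */
        return ff.mantissa ? NAN : sign * INFINITY;
    if (ff.exp_field == 0)                    /* denormalized: 0.(mantissa) */
        return sign * frac * pow2(1 - bias);
    /* normalized: 1.(mantissa) */
    return sign * (1.0 + frac) * pow2((int)ff.exp_field - bias);
}
```

With a mantissa field whose leading bit is 1 (0x400000), fields_to_value gives 1.5 x 2^-126 for exp field 1 and 0.5 x 2^-126 for exp field 0, matching the two examples on the slide.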

Decimal to IEEE Floating Point
• 5.625
• In binary 101.101 → 1.01101 x 2^2
• Exponent has value 2
    • add 127 (the bias) to get 129
• Exponent field is 10000001
• Mantissa is 01101 (padded with zeros to 23 bits)
• Sign bit is 0
• 0 10000001 01101000000000000000000
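
The conversion steps above can be replayed in C and compared with the bits the machine stores. A minimal sketch, assuming float is the 32-bit IEEE format; float_bits and encode_by_hand are hypothetical names, and encode_by_hand only handles positive normalized values that fit in 24 significant bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a float, read through memcpy. */
uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumes a 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Hand-encode x following the slide: normalize, add the bias, keep the
   fraction bits. Assumes 2^-126 <= x < 2^128 and no rounding is needed. */
uint32_t encode_by_hand(double x) {
    int exponent = 0;
    while (x >= 2.0) { x /= 2.0; exponent++; }   /* 101.101 -> 1.01101 x 2^2 */
    while (x < 1.0)  { x *= 2.0; exponent--; }

    uint32_t sign = 0;                                      /* positive number */
    uint32_t exp_field = (uint32_t)(exponent + 127);        /* 2 + 127 = 129 */
    uint32_t mantissa = (uint32_t)((x - 1.0) * 8388608.0);  /* fraction x 2^23 */

    return (sign << 31) | (exp_field << 23) | mantissa;
}
```

For 5.625 both functions give 0x40B40000, which is the bit string 0 10000001 01101000000000000000000 written in hexadecimal.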

Floating point in C
• 32 bits single precision (type float)
    • 1 bit for sign, 8 bits for exponent, 23 bits for mantissa
        • Sign bit: 1 = negative numbers, 0 = positive numbers
        • Exponent is power of 2
    • Have 2 zeros (+0 and -0)
    • Range is approximately -10^38 to 10^38
• 64 bits double precision (type double)
    • 1 bit for sign, 11 bits for exponent, 52 bits for mantissa
    • Majority of new bits for mantissa → higher precision
    • Range is approximately -10^308 to +10^308
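
These properties can be checked from C with the standard header float.h. The sketch below assumes float and double are the IEEE 32-bit and 64-bit formats; sign_bit, zeros_compare_equal and print_limits are hypothetical helper names. Note that FLT_MANT_DIG and DBL_MANT_DIG count the implicit leading 1, so they report 24 and 53 rather than 23 and 52.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sign bit of a float, taken from the raw bit pattern. */
int sign_bit(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumes a 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return (int)(bits >> 31);
}

/* The two zeros compare equal even though their sign bits differ. */
int zeros_compare_equal(void) {
    return 0.0f == -0.0f;
}

/* Print the significand width and the largest finite value of each type. */
void print_limits(void) {
    printf("float : %d significand bits, max %g\n", FLT_MANT_DIG, FLT_MAX);
    printf("double: %d significand bits, max %g\n", DBL_MANT_DIG, DBL_MAX);
}
```

Calling sign_bit(0.0f) and sign_bit(-0.0f) shows the two zeros differ only in the sign bit, and print_limits shows maxima on the order of 10^38 and 10^308.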

Floating Point Operations
• Most real numbers have no exact floating point representation
• Mantissa is only 23 bits in the 32-bit representation
• Least significant bits may be dropped
• Floating point operations are not associative
• (3.14 + 1e10) - 1e10 != 3.14 + (1e10 - 1e10). Why?
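
The question above can be tried out in C. This is a small sketch using type float so that the 23-bit mantissa is the limiting factor; left_grouping and right_grouping are hypothetical names, and volatile is used only to force every intermediate result to be rounded to float.

```c
/* (3.14 + 1e10) - 1e10, with each step rounded to float */
float left_grouping(void) {
    volatile float small = 3.14f, big = 1e10f;
    volatile float sum = small + big;   /* 3.14 is below the precision of 1e10 */
    return sum - big;
}

/* 3.14 + (1e10 - 1e10), with each step rounded to float */
float right_grouping(void) {
    volatile float small = 3.14f, big = 1e10f;
    volatile float diff = big - big;    /* exactly 0 */
    return small + diff;
}
```

Near 1e10 adjacent float values are 1024 apart, so adding 3.14 leaves 1e10 unchanged: left_grouping returns 0 while right_grouping returns 3.14f. With type double the spacing is much finer, but the left grouping still does not return exactly 3.14.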

iClicker Pop Quiz 1