© Mark Redekopp, All rights reserved
EE 357 Unit 3
IEEE 754 Floating Point Representation
Floating Point Arithmetic
Floating Point
• Used to represent very small numbers (fractions) and very large numbers
  – Avogadro's Number: +6.0247 * 10^23
  – Planck's Constant: +6.6254 * 10^-27
  – Note: 32- or 64-bit integers can't represent this range
• Floating point representation is used in HLLs like C by declaring variables as float or double
Fixed Point
• Unsigned and 2's complement fall under a category of representations called "Fixed Point"
• The radix point is assumed to be in a fixed location for all numbers
  – Integers: 10011101. (binary point to right of LSB)
    • For 32 bits, unsigned range is 0 to ~4 billion
  – Fractions: .10011101 (binary point to left of MSB)
    • Range [0 to 1)
• Main point: By fixing the radix point, we limit the range of numbers that can be represented
  – Floating point allows the radix point to be in a different location for each value
Floating Point Representation
• Similar to scientific notation used with decimal numbers
  – ±D.DDD * 10^±exp
• Floating point representation uses the following form
  – ±b.bbbb * 2^±exp
  – 3 fields: sign, exponent, fraction (also called mantissa or significand)

  [ S | Exp. | Fraction ]   (S = overall sign of the number)
Normalized FP Numbers
• Decimal example
  – +0.754*10^15 is not correct scientific notation
  – Must have exactly one significant digit before the decimal point: +7.54*10^14
• In binary the only significant digit is '1'
• Thus the normalized FP format is: ±1.bbbbbb * 2^±exp
• FP numbers will always be normalized before being stored in memory or a register
  – The leading 1. is actually not stored but assumed, since we always store normalized numbers
  – If HW calculates a result of 0.001101*2^5, it must normalize it to 1.101000*2^2 before storing
IEEE Floating Point Formats
• Single Precision (32-bit format)
  – 1 sign bit (0 = positive / 1 = negative)
  – 8 exponent bits (Excess-127 representation)
  – 23 fraction (significand or mantissa) bits
  – Equivalent decimal range: 7 digits x 10^±38
• Double Precision (64-bit format)
  – 1 sign bit (0 = positive / 1 = negative)
  – 11 exponent bits (Excess-1023 representation)
  – 52 fraction (significand or mantissa) bits
  – Equivalent decimal range: 16 digits x 10^±308

  Single: [ S (1) | Exp. (8)  | Fraction (23) ]
  Double: [ S (1) | Exp. (11) | Fraction (52) ]
Exponent Representation
• The exponent includes its own sign (+/-)
• Rather than using the 2's complement system, single precision uses Excess-127 while double precision uses Excess-1023
  – This representation allows FP numbers to be easily compared
• Let E' = stored exponent code and E = true exponent value
• For single precision: E' = E + 127
  – 2^1 => E = 1, E' = 128_10 = 10000000_2
• For double precision: E' = E + 1023
  – 2^-2 => E = -2, E' = 1021_10 = 01111111101_2
Spring '08 | MAYEDA | IEEE 754-2008, Mark Redekopp