Chapter 6: Floating-Point Arithmetic
EEC 70, Fall 2010
Professor Wilken

About Floating Point Arithmetic
The basic operations on floating point numbers are:
• Add, Subtract, Multiply, Divide
• Transcendental operations (sine, cosine, log, exp, etc.) are synthesized from these

Floating Point Addition
Like addition using base-10 scientific representation:
• Align decimal points
• Add
• Normalize the result
Example:
  9.998 x 10^2
+ 5.0   x 10^-1
Align (make smaller # same exp. as larger, why?):
  9.998 x 10^2
+ 0.005 x 10^2
Floating Point Addition (cont.)
Add:
  9.998 x 10^2
+ 0.005 x 10^2
 10.003 x 10^2
Normalize (integer part must be >= 1, <= 9):
10.003 x 10^2 = 1.0003 x 10^3
Observation: By hand, the precision is unlimited. For computer hardware, precision is limited to a fixed number of bits.
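As a sketch (not from the slides), the three steps in Python; the decimal_add name and the (mantissa, exponent) pair representation are illustrative assumptions:

    def decimal_add(a, b):
        """Add two (mantissa, exponent) pairs in base-10 scientific notation."""
        (ma, ea), (mb, eb) = a, b
        # Align: rewrite the smaller-exponent operand using the larger exponent
        if ea < eb:
            ma, ea = ma / 10 ** (eb - ea), eb
        else:
            mb, eb = mb / 10 ** (ea - eb), ea
        # Add the aligned mantissas
        m, e = ma + mb, ea
        # Normalize: keep the integer part in [1, 10)
        while abs(m) >= 10:
            m, e = m / 10, e + 1
        return m, e

    print(decimal_add((9.998, 2), (5.0, -1)))   # ~(1.0003, 3)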
Binary Floating Point Add
Similar to add in base-10 scientific, but must operate on the standard floating point representation, using basic computer operations
Example:
0.25 = 0 01111101 (1.)00000000000000000000000
100  = 0 10000101 (1.)10010000000000000000000
Align: compare exponents by subtracting
• Sign of result tells which is larger
• Magnitude of result tells how many places the smaller must be moved
ExpB:   10000101
ExpA: - 01111101
        00001000   # ExpB is larger by 1000two (8ten)
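These bit patterns and the exponent difference can be checked with Python's standard struct module (the fields helper is my own):

    import struct

    def fields(x):
        # Split a value's single-precision bits into sign, exponent, fraction
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

    sa, ea, fa = fields(0.25)    # (0, 0b01111101, 0)
    sb, eb, fb = fields(100.0)   # (0, 0b10000101, 0b10010000000000000000000)
    print(eb - ea)               # 8: shift the smaller operand right 8 places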
Binary Floating Point Add (cont.)
Shift the smaller number to the right by the magnitude of the exponent subtraction, including the hidden bit in the shifting. Set the smaller exponent equal to the larger exponent:
0 10000101 (0.)00000001000000000000000
0 10000101 (1.)10010000000000000000000
Add the mantissas (including hidden bits):
  0 10000101 (0.)00000001000000000000000
+ 0 10000101 (1.)10010000000000000000000
  0 10000101 (1.)10010001000000000000000
Binary Floating Point Add (cont.)
Normalize the result (get the "hidden bit" to be 1)
• This example is already normalized
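Putting the steps together, a minimal Python sketch for adding two positive single-precision values (f32_fields and fp_add are my names; shifted-out bits are simply truncated, and rounding, denormals, and zero are not handled):

    import struct

    def f32_fields(x):
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        return (bits >> 23) & 0xFF, bits & 0x7FFFFF

    def fp_add(a, b):
        (ea, fa), (eb, fb) = f32_fields(a), f32_fields(b)
        ma, mb = (1 << 23) | fa, (1 << 23) | fb   # restore the hidden bits
        if ea < eb:                               # align: shift smaller right
            ma, ea = ma >> (eb - ea), eb
        else:
            mb, eb = mb >> (ea - eb), ea
        m, e = ma + mb, ea                        # add mantissas
        while m >= (1 << 24):                     # normalize (carry-out case)
            m, e = m >> 1, e + 1
        bits = (e << 23) | (m & 0x7FFFFF)
        return struct.unpack('>f', struct.pack('>I', bits))[0]

    print(fp_add(0.25, 100.0))   # 100.25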
Floating Point Subtract
Align binary points, like addition
Then the algorithm for sign-magnitude numbers takes over
• Negate the second operand, then add
Must set the sign bit of the result to be consistent with the outcome

Floating Point Multiply
Similar to multiply in base-10 scientific:
• Multiply mantissas
• Add exponents
• Normalize
Example:
  3.0 x 10^2
x 5.0 x 10^1
 15.0 x 10^3 -> 1.5 x 10^4

Binary Floating Point Multiply
Similar to base 10, but must deal with the standard floating point representation
Example (using only 4 mantissa bits):
  0 10000100 0100
x 1 00111100 1100
• Multiply the mantissas, don't forget the hidden 1s:
       1.0100
     x 1.1100
    ---------
        00000
       00000
      10100
     10100
    10100
   ----------
   1000110000 -> 10.00110000

Binary Floating Point Multiply (cont.)
• Add exponents:
    10000100
  + 00111100
    11000000
• Exponent now has double bias (one for each term), so subtract 127:
    11000000
  - 01111111
    01000001
• Compute the result sign bit: XOR of operand sign bits (here 0 XOR 1 = 1)
• Reconstruct the result:
  1 01000001 10.00110000
Binary Floating Point Multiply (cont.)
• Normalize the result
  Shift binary point if the integer part is > 1, increment the exponent:
  1 01000010 1.000110000
• "Trim" the excess low-order bits and the hidden bit:
  1 01000010 0001
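The whole 4-bit example can be replayed in Python (BIAS and fp4_mul are my own names for this toy format; note the result sign is 1, the XOR of the operand signs):

    BIAS = 127

    def fp4_mul(sa, ea, fa, sb, eb, fb):
        # Toy format: 8-bit biased exponents, 4-bit fractions, hidden 1s
        ma, mb = 0b10000 | fa, 0b10000 | fb   # 1.xxxx mantissas
        m = ma * mb                           # 10-bit product, 8 fraction bits
        e = ea + eb - BIAS                    # add exponents, drop double bias
        s = sa ^ sb                           # sign = XOR of operand signs
        if m >= (1 << 9):                     # integer part >= 2: shift right,
            m, e = m >> 1, e + 1              # increment exponent
        return s, e, (m >> 4) & 0b1111        # trim back to 4 fraction bits

    print(fp4_mul(0, 0b10000100, 0b0100, 1, 0b00111100, 0b1100))
    # (1, 66, 1), i.e. sign 1, exponent 01000010, fraction 0001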
Floating Point Division
Dual of multiplication:
• Divide mantissas (include hidden bits)
• Subtract exponents. Must add back in the bias (127) because it has been eliminated
• Normalize the result
  hidden bit must be 1
  may require left shifts, exponent decrement
• Trim excess bits

MIPS Floating Point Hardware
Floating point arithmetic can be done in software using integer instructions, but the MIPS architecture also provides for floating point hardware
Floating point hardware is 100x faster, or more, than floating point software
The MIPS architecture defines two distinct coprocessors (units):
• coprocessor 0 executes integer instructions
• coprocessor 1 executes floating point instructions
Processors with no floating point unit can be built for special applications

MIPS Floating Point Hardware
Instruction addresses and data addresses are sent to memory by the integer unit
The floating point unit "listens" to the instruction sequence and partially decodes each instruction
• executes FP instructions, ignores integer instructions
• the integer unit ignores FP instructions
[Diagram: FP unit and integer unit side by side; the integer unit sends instruction/data addresses to memory]

FP Registers
Integer registers and FP registers are separate
• the integer unit can only access integer registers, the FP unit only the FP registers
• separate instructions are used for loading/storing to/from integer and FP registers:
  lw $x, 0($y)   vs.   l.s $fx, 0($y)
[Diagram: integer unit (int ALU, int regs) and FP unit (FP ALU, FP regs), both connected to memory]

FP Registers (cont.)
Separate instructions are used to move data between FP and integer registers:
  mfc1 $x, $fy
  mtc1 $x, $fy
The floating point unit executes instructions that convert integer -> FP format and FP -> integer format
• Source and destination registers are both floating point registers:
  cvt.s.w $fx, $fy
  cvt.w.s $fx, $fy
• Data can then be moved to/from integer registers before/after conversion
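The difference between moving bits (mfc1/mtc1) and converting formats (cvt.s.w) can be illustrated in Python with the standard struct module (a sketch, not MIPS semantics verbatim):

    import struct

    n = 100                                   # a 32-bit integer

    # Moving the raw bits (as mtc1 does) does not produce a meaningful float:
    moved = struct.unpack('>f', struct.pack('>i', n))[0]
    print(moved)                              # ~1.4e-43, a tiny denormal

    # Converting (as cvt.s.w does) produces the FP encoding of 100:
    bits = struct.unpack('>I', struct.pack('>f', float(n)))[0]
    print(f'{bits:032b}')   # 0 10000101 10010000000000000000000 (no spaces)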
FP Registers (cont.)
FP registers are 32 bits
Double precision values are stored in adjacent register pairs, starting with an even-numbered FP reg
• thus can store only 16 double precision values in regs, vs. 32 single precision
Loading/storing a double precision value to/from memory takes two instructions, one for the lower word, one for the upper
The FP unit includes an instruction that converts from single precision FP to double precision

Arithmetic Instructions
The MIPS processor has separate instructions for integer and floating point arithmetic
Floating point operations are slower; use integer if possible:
• Integer Add                   1 time unit
• Fl. Pt. Add / Int. Multiply   2 time units
• Fl. Pt. Multiply              3 time units
• Fl. Pt. Divide                20 time units
Can often eliminate FP divide from loops (see the sketch below)
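For example, using the slide's relative timings, a divide by a loop-invariant value can be replaced by one divide plus cheap multiplies (a Python sketch; the function names are mine):

    def scale_slow(xs, d):
        return [x / d for x in xs]       # one 20-unit FP divide per element

    def scale_fast(xs, d):
        r = 1.0 / d                      # a single divide, hoisted out
        return [x * r for x in xs]       # 3-unit FP multiplies in the loop

The two versions can differ in the last bit, since 1/d is itself rounded before the multiplies.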
Floating Point Range: Maximum
Floating point can represent very large numbers. The largest number is:
0 11111110 (1.)11111111111111111111111
• Maximum exponent is: 254 - 127 = 127 => weight of 2^127
• Mantissa is: 1 + 1/2 + 1/4 + ... + 1/2^23 = 2 - 1/2^23 ≅ 2
• Thus the largest number ≅ 2^128 ≅ 10^38
• Do we need larger numbers? Probably not:
  Radius of the universe = age of the universe x speed of light
  = (10^10 years) x (10^2.5 days/year) x (10^5 seconds/day) x (10^8.5 meters/sec) = 10^26 meters
  Bill Gates' net worth is < 10^11 dollars
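This maximum can be checked in Python (a sketch using the standard struct module):

    import struct

    big = (2 - 2**-23) * 2.0**127        # mantissa ~2 times weight 2^127
    bits = struct.unpack('>I', struct.pack('>f', big))[0]
    print(hex(bits))                     # 0x7f7fffff = 0 11111110 111...1
    print(big)                           # ~3.4e38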
Floating Point Range: Minimum
Floating point can represent numbers with very small magnitude. Normally, the smallest magnitude number is:
0 00000001 (1.)00000000000000000000000
• Minimum exponent is: 1 - 127 = -126 => weight of 2^-126
• Mantissa is: 1
• Thus the smallest number ≅ 2^-126 ≅ 10^-38
• Do we need smaller magnitude numbers? Probably not:
  Mass of an electron: 10^-27 grams
But somebody was small-minded:

Denormalized Numbers
There is no hidden 1 when the exponent is 00000000, i.e. the number is no longer normalized
Exponent 00000000 has the same meaning as 00000001, i.e., 2^-126
A denormalized number can have a leading 1 in any bit position, allowing even smaller numbers
• However, smaller numbers have fewer bits of precision
The smallest denormalized number is:
0 00000000 (0.)00000000000000000000001
which represents 2^-149 ≅ 10^-45
• only has 1 bit of precision
Denormalization complicates floating point hardware design, and is of questionable usefulness
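Both limits can be checked in Python by building the bit patterns directly (bits_to_f32 is my own helper):

    import struct

    def bits_to_f32(u):
        return struct.unpack('>f', struct.pack('>I', u))[0]

    print(bits_to_f32(0x00800000) == 2**-126)   # True: smallest normalized
    print(bits_to_f32(0x00000001) == 2**-149)   # True: smallest denormalized
    print(bits_to_f32(0x00000001))              # ~1.4e-45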
Overflow and Underflow
When a result is larger than
0 11111110 (1.)11111111111111111111111
overflow occurs, and the result becomes +infinity, which is represented as:
0 11111111 00000000000000000000000
• -infinity is the same, but with a 1 in the sign bit
Operations with infinity produce the expected result: # + infinity = infinity
Underflow occurs when the result is smaller than the smallest denormalized number
• result becomes 0
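A small Python demonstration (to_f32 is my helper; note that CPython's struct raises an error rather than returning infinity when packing an overflowing single, so overflow is shown here via infinity arithmetic):

    import struct

    def to_f32(x):
        # Round a Python double to single precision
        return struct.unpack('>f', struct.pack('>f', x))[0]

    inf = float('inf')
    print(5.0 + inf)           # inf: # + infinity = infinity
    print(to_f32(2**-150))     # 0.0: below the smallest denormal, underflows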
Precision
The set of all real numbers is infinite. The set of floating point numbers is finite: 2^32
We must map a range of real numbers onto a single floating point number
• The mapping cannot be precise, so some precision is lost. How much?

Integer Precision
First, consider the precision of mapping real numbers to binary integers:
[Number line: ...00  ...01  ...10  ...11  ...00]
The real numbers between two integers must be mapped to one of those integers.
• Simplest method is truncation: discard the fractional part

Truncation Precision
With truncation, assuming real numbers are randomly distributed between integers, what is the expected loss in precision for a given number?
[Graph: loss along the number line ...00 ...01 ...10 ...11 ...00, with Loss on the vertical axis ranging from 0 to 1]

Cumulative Truncation Error
Adding up a large list of numbers can quickly result in significant cumulative error.
What is the expected roundoff error for adding a large list?
What is the maximum roundoff error?
The fastest computers can add more than 10^9 numbers/second ⇒ cumulative error can reach 32 bits (4 x 10^9) in a few seconds
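A Monte Carlo sketch of both questions (assuming uniformly distributed fractional parts):

    import random

    N = 10**6
    lost = [x - int(x) for x in (random.uniform(0, 100) for _ in range(N))]
    print(sum(lost) / N)    # ~0.5: expected truncation loss per number
    print(sum(lost))        # ~N/2: cumulative error grows linearly with N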
Rounding Precision
With rounding to nearest integer, assuming real numbers are randomly distributed between integers, what is the magnitude of the expected loss in precision for a given number?
[Graph: loss along the number line ...00 ...01 ...10 ...11 ...00, with Loss on the vertical axis ranging from -1 to +1]

Cumulative Roundoff Error
What is the expected cumulative roundoff error when adding up a large list of numbers?
What is the maximum roundoff error?
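The same Monte Carlo sketch with rounding instead of truncation; the sqrt(N/12) figure assumes independent, uniform individual errors:

    import random

    N = 10**6
    lost = [x - round(x) for x in (random.uniform(0, 100) for _ in range(N))]
    print(sum(lost) / N)    # ~0: individual rounding errors cancel on average
    print(sum(lost))        # a random walk, typically ~sqrt(N/12), not ~N/2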
Floating Point Precision
Just as the precision of an integer is relative to the weight of the LSB (1), the precision of a floating point number is relative to the weight of its LSB
• LSB weight is determined by the exponent
• LSB weight is 2^-23 times 2^exp
FP precision ranges from -0.5 to +0.5 LSB for rounding, 0 to 1 LSB for truncation

Double Precision Floating Point
If computers can accumulate errors so quickly, what to do?
Use a larger representation: 2 words = 64 bits
IEEE double precision format:
  1-bit sign | 11-bit exponent | 52-bit mantissa
The exponent is biased by 1023, and the mantissa has a hidden 1.
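The layout can be inspected in Python (a sketch with the standard struct module):

    import struct

    bits = struct.unpack('>Q', struct.pack('>d', 100.0))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF           # 11 exponent bits
    frac = bits & ((1 << 52) - 1)        # 52 mantissa bits
    print(sign, exp - 1023, hex(frac))   # 0 6 0x9000000000000: 1.5625 * 2^6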
Cumulative Double Precision Error
Even the fastest computers will take a while to accumulate enough error to run out of precision using double precision FP:
(10^9 operations/sec) x (10^5 seconds/day) x (10^1.5 days/month)
= 10^15.5 operations/month ≅ 2^52 operations/month
⇒ a couple of months before the max error magnitude approaches the magnitude of the mantissa
Some supercomputers have used quad precision floating point (128-bit) to avoid error accumulation for huge computations.

Double Precision Range
Double precision allows much larger/smaller numbers due to the larger exponent:
2^1023 ≅ 10^308
• For comparison, there are an estimated 10^70 atoms in the universe
Increased precision is much more important than increased range.

Half Precision FP
Some graphics applications use a 16-bit half precision floating point representation
• Where the need for precision and range is limited
• 5 bits for exponent, 10 bits for fraction
• Exponent bias is 15
Rounding Methods
There are various methods for rounding, all of which are useful in certain situations:
• Truncation (rounding toward zero)
• Rounding to nearest
• Rounding toward +infinity:
   1.22 ->  1.3
   1.28 ->  1.3
  -2.81 -> -2.8
  -2.89 -> -2.8
• Rounding toward -infinity:
   1.22 ->  1.2
   1.28 ->  1.2
  -2.81 -> -2.9
  -2.89 -> -2.9
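The four methods, applied to the slide's examples in Python (a sketch; the scale-by-10 trick rounds at one decimal place):

    import math

    for x in (1.22, 1.28, -2.81, -2.89):
        print(x,
              math.trunc(x * 10) / 10,    # toward zero
              round(x, 1),                # to nearest
              math.ceil(x * 10) / 10,     # toward +infinity
              math.floor(x * 10) / 10)    # toward -infinity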