Chapter 6: Floating-Point Arithmetic
EEC 70, Fall 2010
Professor Wilken

About Floating Point Arithmetic
The basic operations on floating point numbers are:
• Add, Subtract, Multiply, Divide
• Transcendental operations (sine, cosine, log, exp, etc.) are synthesized from these

Floating Point Addition
Like addition using base 10 scientific representation:
• Align decimal points
• Add
• Normalize the result
Example:
     9.998 x 10^2
   + 5.0   x 10^-1
   ---------------
Align (make the smaller number's exponent the same as the larger's -- why?):
     9.998 x 10^2
   +  .005 x 10^2
   ---------------

Floating Point Addition (cont.)
Add:
     9.998 x 10^2
   +  .005 x 10^2
   ---------------
    10.003 x 10^2
Normalize (integer part must be >= 1 and <= 9):
    10.003 x 10^2 = 1.0003 x 10^3
Observation: By hand, the precision is unlimited. For computer hardware, precision is limited to a fixed number of bits.

Binary Floating Point Add
Similar to addition in base 10 scientific notation, but must operate on the standard floating point representation, using basic computer operations.
Example:
   0.25 = 0 01111101 00000000000000000000000   (1.)
   100  = 0 10000101 10010000000000000000000   (1.)
Align: compare exponents by subtracting
• The sign of the result tells which is larger
• The magnitude of the result tells how many places the smaller number must be moved
   ExpA:   01111101
   ExpB: - 10000101
   -----------------
           00001000    # ExpB is larger by 1000_two (8_ten)

Binary Floating Point Add (cont.)
Shift the smaller number to the right by the magnitude of the exponent subtraction, including the hidden bit in the shifting. Set the smaller exponent equal to the larger exponent:
   0 10000101 00000001000000000000000   (0.)
   0 10000101 10010000000000000000000   (1.)
Add the mantissas (including hidden bits):
     0 10000101 00000001000000000000000   (0.)
   + 0 10000101 10010000000000000000000   (1.)
   --------------------------------------
     0 10000101 10010001000000000000000   (1.)

Binary Floating Point Add (cont.)
Normalize the result (get the "hidden bit" to be 1)
• This example is already normalized

Floating Point Subtract
Align binary points, as in addition.
Then the algorithm for sign-magnitude numbers takes over:
• Negate the second operand, then add
Must set the sign bit of the result to be consistent with the outcome.

Floating Point Multiply
Similar to multiply in base 10 scientific notation:
• Multiply mantissas
• Add exponents
• Normalize
Example:
     3.0 x 10^2
   x 5.0 x 10^-1
   -------------
    15.0 x 10^1  ->  1.5 x 10^2

Binary Floating Point Multiply
Similar to base 10, but must deal with the standard floating point representation.
Example (using only 4 mantissa bits):
     0 10000100 0100
   x 1 00111100 1100
   -----------------
• Multiply the mantissas, don't forget the hidden 1s:
        1.0100
      x 1.1100
      --------
         00000
        00000
       10100
      10100
     10100
   -----------
    1000110000  ->  10.00110000

Binary Floating Point Multiply (cont.)
• Add exponents:
     10000100
   + 00111100
   ----------
     11000000
• The exponent now has a double bias (one for each term), so subtract 127:
     11000000
   - 01111111
   ----------
     01000001
• Compute the result sign bit: XOR of the operand sign bits (here 0 XOR 1 = 1)
• Reconstruct the result:
   1 01000001 10.00110000

Binary Floating Point Multiply (cont.)
• Normalize the result: shift the binary point if the integer part is greater than 1, and increment the exponent:
   1 01000010 1.001100000
• "Trim" the excess low-order bits and drop the hidden bit:
   1 01000010 0011
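The multiply steps above can also be sketched in software. The following is a minimal Python sketch (not part of the course material): the helper names fields and fp_mul are invented for illustration, and the code ignores zeros, denormals, infinity/NaN, exponent overflow, and rounding. It follows the same steps as the slides: multiply mantissas with hidden 1s, add exponents and remove the double bias, XOR the signs, normalize, and trim.

```python
import struct

def fields(x):
    """Split a value into IEEE-754 single-precision sign, exponent, and mantissa fields."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp_mul(a, b):
    """Multiply two normalized single-precision values following the slide's steps.
    Sketch only: no zeros, denormals, infinities, overflow handling, or rounding."""
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    # Multiply the mantissas, including the hidden 1s (24 bits x 24 bits -> up to 48 bits)
    prod = ((1 << 23) | ma) * ((1 << 23) | mb)
    # Add exponents; the sum is double-biased, so subtract one bias (127)
    exp = ea + eb - 127
    # Normalize: the product of two values in [1, 2) lies in [1, 4),
    # so at most one right shift of the binary point is needed
    if prod >= (1 << 47):
        prod >>= 1
        exp += 1
    # Trim the excess low-order bits and drop the hidden bit (truncation, no rounding)
    mant = (prod >> 23) & 0x7FFFFF
    # Result sign is the XOR of the operand signs
    sign = sa ^ sb
    bits = (sign << 31) | (exp << 23) | mant
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(fp_mul(1.25, -1.75))   # -2.1875, the same mantissas (1.0100 x 1.1100) as the worked example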
Floating Point Division
Dual of multiplication:
• Divide mantissas (include hidden bits)
• Subtract exponents. Must add back in the bias (127), because the subtraction eliminated it.
• Normalize the result: the hidden bit must be 1; this may require left shifts and exponent decrements
• Trim excess bits

MIPS Floating Point Hardware
Floating point arithmetic can be done in software using integer instructions, but the MIPS architecture also provides for floating point hardware.
Floating point hardware is 100x faster, or more, than floating point software.
The MIPS architecture defines two distinct coprocessors (units):
• coprocessor 0 executes integer instructions
• coprocessor 1 executes floating point instructions
Processors with no floating point unit can be built for special applications.

MIPS Floating Point Hardware (cont.)
Instruction addresses and data addresses are sent to memory by the integer unit.
The floating point unit "listens" to the instruction sequence and partially decodes each instruction:
• it executes FP instructions and ignores integer instructions
• the integer unit ignores FP instructions
[figure: FP unit and integer unit both connected to memory; the integer unit supplies instruction and data addresses]

FP Registers
Integer registers and FP registers are separate:
• the integer unit can only access the integer registers, and the FP unit only the FP registers
• separate instructions are used for loading/storing to/from integer and FP registers:
   lw $x, 0($y)   vs.   l.s $fx, 0($y)
[figure: integer unit (int ALU, int regs.) and FP unit (FP ALU, FP regs.) both connected to memory]

FP Registers (cont.)
Separate instructions are used to move data between FP and integer registers:
   mfc1 $x, $fy
   mtc1 $x, $fy
The floating point unit executes instructions that convert integer format to FP format, and FP format to integer format:
• Source and destination registers are both floating point registers:
   cvt.s.w $fx, $fy
   cvt.w.s $fx, $fy
• Data can then be moved to/from integer registers before/after conversion

FP Registers (cont.)
FP registers are 32 bits.
Double precision values are stored in adjacent register pairs, starting with an even-numbered FP register:
• thus the FP unit can store only 16 double precision values in its registers, vs. 32 single precision values
Loading/storing a double precision value to/from memory takes two instructions, one for the lower word and one for the upper word.
The FP unit includes an instruction that converts from single precision FP to double precision.

Arithmetic Instructions
The MIPS processor has separate instructions for integer and floating point arithmetic.
Floating point operations are slower, so use integer operations if possible:
• Integer Add                1 time unit
• Fl. Pt. Add / Int Mult     2 time units
• Fl. Pt. Multiply           3 time units
• Fl. Pt. Divide            20 time units
Can often eliminate FP divide from loops.

Floating Point Range: Maximum
Floating point can represent very large numbers. The largest number is:
   0 11111110 (1.)11111111111111111111111
• Maximum exponent is: 254 - 127 = 127 => weight of 2^127
• Mantissa is: 1 + 1/2 + 1/4 + ... + 1/2^23 = 2 - 1/2^23 ≅ 2
• Thus the largest number ≅ 2^128 ≅ 10^38
• Do we need larger numbers? Probably not:
   Radius of the universe = age of the universe x speed of light
     = (10^10 years) x (10^2.5 days/year) x (10^5 seconds/day) x (10^8.5 meters/sec) = 10^26 meters
   Bill Gates' net worth is < 10^11 dollars
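As a quick check of the arithmetic above, the largest single-precision bit pattern can be decoded directly. This short Python snippet (an illustration, not part of the course) builds the pattern from the slide and compares it against (2 - 1/2^23) x 2^127.

```python
import struct

# Largest finite single-precision value: sign 0, exponent 11111110, all-ones mantissa
bits = (0 << 31) | (0b11111110 << 23) | 0x7FFFFF
largest = struct.unpack('>f', struct.pack('>I', bits))[0]

print(largest)                        # 3.4028234663852886e+38
print((2.0 - 2.0**-23) * 2.0**127)    # the same value: (2 - 1/2^23) x 2^127, roughly 2^128
```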
Floating Point Range: Minimum
Floating point can represent numbers with very small magnitude. Normally, the smallest magnitude number is:
   0 00000001 (1.)00000000000000000000000
• Minimum exponent is: 1 - 127 = -126 => weight of 2^-126
• Mantissa is: 1
• Thus the smallest number ≅ 2^-126 ≅ 10^-38
• Do we need smaller magnitude numbers? Probably not:
   Mass of an electron: 10^-27 grams
But somebody was small minded:

Denormalized Numbers
There is no hidden 1 when the exponent is 00000000, i.e. the number is no longer normalized.
An exponent of 00000000 has the same meaning as 00000001, i.e., 2^-126.
A denormalized number can have a leading 1 in any bit position, allowing even smaller numbers.
• However, smaller numbers have fewer bits of precision
The smallest denormalized number is:
   0 00000000 (0.)00000000000000000000001
which represents 2^-149 ≅ 10^-45
• it has only 1 bit of precision
Denormalization complicates floating point hardware design, and its usefulness is questionable.

Overflow and Underflow
When a result is larger than
   0 11111110 (1.)11111111111111111111111
overflow occurs, and the result becomes +infinity, which is represented as:
   0 11111111 00000000000000000000000
• -infinity is the same, but with a 1 in the sign bit
Operations with infinity produce the expected result: # + infinity = infinity
Underflow occurs when the result is smaller than the smallest denormalized number:
• the result becomes 0

Precision
The set of all real numbers is infinite. The set of floating point numbers is finite: there are at most 2^32 of them.
We must map a range of real numbers onto a single floating point number.
• The mapping cannot be precise; some precision is lost. How much?

Integer Precision
First, consider the precision of mapping real numbers to binary integers:
   [figure: number line marked with consecutive integers ...00, ...01, ...10, ...11, ...00]
The real numbers between two integers must be mapped to one of those integers.
• The simplest method is truncation: discard the fractional part

Truncation Precision
With truncation, assuming real numbers are randomly distributed between integers, what is the expected loss in precision for a given number?
   [figure: truncation loss vs. position between consecutive integers, ranging from 0 to 1]

Cumulative Truncation Error
Adding up a large list of numbers can quickly result in significant cumulative error.
What is the expected roundoff error for adding a large list?
What is the maximum roundoff error?
The fastest computers can add more than 10^9 numbers/second ⇒ the cumulative error can reach 32 bits (4 x 10^9) in a few seconds.

Rounding Precision
With rounding to the nearest integer, assuming real numbers are randomly distributed between integers, what is the magnitude of the expected loss in precision for a given number?
   [figure: rounding loss vs. position between consecutive integers, axis from -1 to +1]

Cumulative Roundoff Error
What is the expected cumulative roundoff error when adding up a large list of numbers?
What is the maximum roundoff error?

Floating Point Precision
Just as the precision of an integer is relative to the weight of the LSB (1), the precision of a floating point number is relative to the weight of its LSB:
• LSB weight is determined by the exponent
• LSB weight is 2^-23 times 2^exp
FP precision ranges from -0.5 to 0.5 LSB for rounding, and from 0 to 1 LSB for truncation.

Double Precision Floating Point
If computers can accumulate errors so quickly, what to do?
Use a larger representation: 2 words = 64 bits.
IEEE double precision format:
   sign (1 bit) | exponent (11 bits) | mantissa (52 bits)
The exponent is biased by 1023, and the mantissa has a hidden 1.
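The difference in error accumulation between single and double precision can be demonstrated with a short experiment. This Python sketch (not from the course; the helper to_f32 is an invented name) simulates single-precision arithmetic by rounding every intermediate result back to 32 bits, and compares it with an ordinary double-precision sum.

```python
import struct

def to_f32(x):
    """Round a Python float (double precision) to the nearest IEEE single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

# Sum 0.1 one million times: once simulating single-precision arithmetic
# (every operand and intermediate result rounded to 24-bit significands),
# once in ordinary double precision. The exact answer is 100000.
tenth32 = to_f32(0.1)
single = 0.0
double = 0.0
for _ in range(1_000_000):
    single = to_f32(single + tenth32)   # round the running sum back to single precision
    double += 0.1

# On a typical IEEE round-to-nearest system, the single-precision sum drifts to
# roughly 100958 (an error of almost 1%), while the double-precision sum stays
# within about 1e-6 of 100000.
print(single, double)
```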
Cumulative Double Precision Error
Even the fastest computers will take a while to accumulate enough error to run out of precision using double precision FP:
   (10^9 operations/sec) x (10^5 seconds/day) x (10^1.5 days/month) = 10^15.5 operations/month ≅ 2^52 operations/month
⇒ a couple of months before the maximum error magnitude approaches the magnitude of the mantissa.
Some supercomputers have used quad precision floating point (128-bit) to avoid error accumulation in huge computations.

Double Precision Range
Double precision allows much larger/smaller numbers due to the larger exponent:
   2^1023 ≅ 10^308
• For comparison, there are an estimated 10^70 atoms in the universe
Increased precision is much more important than increased range.

Half Precision FP
Some graphics applications use a 16-bit half precision floating point representation:
• where the need for precision and range is limited
• 5 bits for exponent, 10 bits for fraction
• the exponent bias is 15
A worked comparison of the rounding directions listed below is sketched after this list.

Rounding Methods
There are various methods for rounding, all of which are useful in certain situations:
• Truncation (rounding toward zero)
• Rounding to nearest
• Rounding toward +infinity:
   1.22 -> 1.3    1.28 -> 1.3    -2.81 -> -2.8    -2.89 -> -2.8
• Rounding toward -infinity:
   1.22 -> 1.2    1.28 -> 1.2    -2.81 -> -2.9    -2.89 -> -2.9
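The four rounding directions can be seen side by side using Python's decimal module (an illustration, not part of the course). The slide's examples round to one decimal place, so the sketch quantizes to a step of 0.1 under each rounding rule.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR

# Round each value to one decimal place under four different rounding rules.
values = ['1.22', '1.28', '-2.81', '-2.89']
modes = [('toward zero', ROUND_DOWN),
         ('to nearest ', ROUND_HALF_EVEN),
         ('toward +inf', ROUND_CEILING),
         ('toward -inf', ROUND_FLOOR)]

for name, mode in modes:
    rounded = [str(Decimal(v).quantize(Decimal('0.1'), rounding=mode)) for v in values]
    print(name, rounded)

# toward zero: ['1.2', '1.2', '-2.8', '-2.8']
# to nearest : ['1.2', '1.3', '-2.8', '-2.9']
# toward +inf: ['1.3', '1.3', '-2.8', '-2.8']
# toward -inf: ['1.2', '1.2', '-2.9', '-2.9']
```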