Solutions for Homework 5
Foundations of Computational Math 1
Fall 2010

Problem 5.1

Consider the following numbers:

• 122.9572
• 457932
• 0.0014973

5.1.a. Express the numbers as floating point numbers with $\beta = 10$ and $t = 4$ using rounding to even and using chopping.

5.1.b. Express the numbers as floating point numbers in single precision IEEE format using rounding to even. It is strongly recommended that you implement a program to do this rather than computing the representation manually.

5.1.c. Calculate the relative error for each number and verify that it satisfies the bounds implied by the floating point system used.

Solution:

The conversion to decimal floating point is straightforward and is easily done by inspection.

$$
\begin{aligned}
122.9572 &= 0.1229572 \times 10^{3} \approx 0.1229 \times 10^{3} \ \text{(chopped)} \approx 0.1230 \times 10^{3} \ \text{(rounded)} \\
457932 &= 0.457932 \times 10^{6} \approx 0.4579 \times 10^{6} \ \text{(chopped and rounded)} \\
0.0014973 &= 0.14973 \times 10^{-2} \approx 0.1497 \times 10^{-2} \ \text{(chopped and rounded)}
\end{aligned}
$$

For IEEE single precision, the number of bits and the possibility of a nonterminating fraction complicate doing this by hand. However, the procedure is easily coded on an IEEE machine using double precision arithmetic to compute the single precision representation. We illustrate one point of view on 122.9572 in detail. Consider first the integer part, 122. We start with the largest power of 2 less than or equal to 122, which is $2^{6} = 64$, and work down through the powers of 2, setting a bit and subtracting whenever the power fits in the remainder $f$:

$$
\begin{aligned}
f &= 122 \\
f \ge 2^{6} = 64 \;&\rightarrow\; b_6 = 1 \text{ and } f \leftarrow f - 2^{6} = 58 \\
f \ge 2^{5} = 32 \;&\rightarrow\; b_5 = 1 \text{ and } f \leftarrow f - 2^{5} = 26 \\
f \ge 2^{4} = 16 \;&\rightarrow\; b_4 = 1 \text{ and } f \leftarrow f - 2^{4} = 10 \\
f \ge 2^{3} = 8 \;&\rightarrow\; b_3 = 1 \text{ and } f \leftarrow f - 2^{3} = 2 \\
f < 2^{2} = 4 \;&\rightarrow\; b_2 = 0 \text{ and } f \leftarrow f \\
f \ge 2^{1} = 2 \;&\rightarrow\; b_1 = 1 \text{ and } f \leftarrow f - 2^{1} = 0 \\
f < 2^{0} = 1 \;&\rightarrow\; b_0 = 0 \text{ and } f \leftarrow f = 0
\end{aligned}
$$

$$
\begin{aligned}
122 &= b_6 2^{6} + b_5 2^{5} + b_4 2^{4} + b_3 2^{3} + b_2 2^{2} + b_1 2^{1} + b_0 2^{0} \\
&= 2^{6} + 2^{5} + 2^{4} + 2^{3} + 2^{1} \\
&= 64 + 32 + 16 + 8 + 2 \\
122 &= (1111010)_2
\end{aligned}
$$

The fractional part can also be converted to an expansion in terms of $2^{-i}$ by repeated comparison and subtraction. A code does this easily.
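The compare-and-subtract steps used above for the integer part 122 can be sketched in Python. This is only an illustration of the hand procedure, not code from the original solution; the function name `integer_to_binary` is hypothetical.

```python
def integer_to_binary(n):
    """Greedy compare-and-subtract conversion of a positive integer to a bit string.

    Illustrative helper (not from the original solution): it mirrors the
    hand computation, starting at the largest power of 2 that is <= n.
    """
    if n <= 0:
        raise ValueError("expects a positive integer")
    top = n.bit_length() - 1      # largest k with 2**k <= n
    f = n                         # running remainder
    bits = []
    for k in range(top, -1, -1):
        power = 2 ** k
        if f >= power:            # the power fits: set the bit and subtract
            bits.append("1")
            f -= power
        else:                     # the power does not fit: bit is 0, f unchanged
            bits.append("0")
    return "".join(bits)


print(integer_to_binary(122))     # prints 1111010
```

The same helper applied to 457932 gives the 19-bit integer expansion of the second number, since that value has no fractional part.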
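When computed in double precision, all of the relevant negative powers of 2 needed for a single precision representation can be computed, with enough extra bits to apply chopping or a rounding of your choice. The code below is essentially MATLAB code.

    % Generates n fractional binary bits into b(1:n)
    f = double(0.9572);     % fractional part to convert
    s = double(1.0);        % current power of 2
    n = 40;
    b(1:n) = 0;
    for k = 1:n
        s = s / double(2.0);    % s = 2^(-k)
        if f >= s
            b(k) = 1;
            f = f - s;
        end
    end

Using this code on the fractional part 0.9572 to produce many more bits than are needed for single precision (the vertical line marks the last mantissa bit) yields:

$$
\begin{aligned}
f = 122.9572 &= 1.11101011110101000010110 \mid 0001111001001111 \times 2^{6} \\
\sigma\,\mu &= (10000101 \;\; 11101011110101000010110)_2
\end{aligned}
$$

Here the 8-bit field 10000101 is the biased exponent $6 + 127 = 133$, and the 23-bit field holds the fraction bits that follow the leading 1.

Repeating the exercise on the other two numbers yields:

$$
\begin{aligned}
f = 457932 &= 1.10111111001100110000000 \times 2^{18} \\
\sigma\,\mu &= (10010001 \;\; 10111111001100110000000)_2 \\
f = 0.0014973 &= 1.10001000100000100001101 \mid 000100001 \times 2^{-10} \\
\sigma\,\mu &= (01110101 \;\; 10001000100000100001101)_2
\end{aligned}
$$

The value 457932 is represented exactly. For 0.0014973 the bits after the vertical line begin with 0, so rounding to nearest leaves the 23 stored fraction bits unchanged.

Problem 5.2

Consider the function f(x) = 1.01 + ...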
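Returning to Problem 5.1: as a cross-check on the single precision bit patterns asked for in 5.1.b and the relative error bound asked for in 5.1.c, the following is a minimal Python sketch using the standard `struct` module. It is not the author's program. The names `single_fields` and `UNIT_ROUNDOFF` are illustrative, and the sketch assumes the platform converts double to single with round to nearest (ties to even), which is the usual default. The bound used is the unit roundoff $2^{-24}$ for normalized single precision numbers.

```python
import struct


def single_fields(x):
    """Round x to IEEE single precision and return
    (8-bit exponent field, 23-bit fraction field, rounded value).

    Illustrative helper; assumes the double-to-single conversion done by
    struct.pack rounds to nearest, ties to even.
    """
    packed = struct.pack(">f", x)            # 4 bytes, big-endian single
    (word,) = struct.unpack(">I", packed)    # same bytes as a 32-bit integer
    bits = format(word, "032b")              # sign | exponent | fraction
    (rounded,) = struct.unpack(">f", packed) # single value widened back to double
    return bits[1:9], bits[9:], rounded


# Relative error bound for round to nearest with 24 significant bits.
UNIT_ROUNDOFF = 2.0 ** -24

for x in (122.9572, 457932.0, 0.0014973):
    exponent, fraction, fl_x = single_fields(x)
    # x is itself a double approximation of the decimal value; its own
    # error is far below single precision resolution.
    rel_err = abs(fl_x - x) / abs(x)
    print(x, exponent, fraction, rel_err, rel_err <= UNIT_ROUNDOFF)
```

Each printed line shows the exponent field, the 23 fraction bits, the measured relative error, and whether that error is within the bound.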
This note was uploaded on 07/25/2011 for the course MAD 5403 taught by Professor Gallivan during the Spring '11 term at University of Florida.