es reading the value +∞ from dest, multiplying this by val to get +∞, and then storing this back at dest. Evidently, some part of this computation requires much longer than the normal five clock cycles required by floating-point multiplication. In fact, running measurements on this operation, we find it takes between 110 and 120 cycles to multiply a number by infinity. Most likely, the hardware detected this as a special case and issued a trap that caused a software routine to perform the actual computation. The CPU designers felt such an occurrence would be sufficiently rare that they did not need to deal with it as part of the hardware design. Similar behavior could happen with underflow.

When we run the benchmarks on data for which every vector element equals 1.0, combine3 achieves a CPE of 10.00 cycles for both double and single precision. This is much more in line with the times measured for the other data types and operations, and comparable to the time for combine4. This example illu...
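The overflow behavior described above can be reproduced with a minimal sketch. It assumes IEEE 754 double precision; the function name combine3_prod and its plain-array interface are hypothetical stand-ins, since the excerpt does not show the actual combine3 code. The only property it tries to capture is that the running product is read from and written back to dest on every iteration.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical stand-in for combine3 (not the original code):
 * the running product lives in memory at *dest, so each iteration
 * reads *dest, multiplies by the next element, and stores it back. */
void combine3_prod(const double *data, size_t len, double *dest)
{
    *dest = 1.0;                       /* identity for multiplication */
    for (size_t i = 0; i < len; i++)
        *dest = *dest * data[i];       /* read, multiply, store back */
}

/* Returns 1 if the product of the elements overflowed to +infinity. */
int product_overflows(const double *data, size_t len)
{
    double result;
    combine3_prod(data, len, &result);
    return isinf(result) && result > 0.0;
}
```

With elements such as 1e200 (an assumed test value), the second multiplication already exceeds the largest finite double, so every later iteration multiplies +∞ by val, which is the slow case the text measures. With every element equal to 1.0 the product stays finite and no special case arises.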