eas we achieve maximum performance for the other operations by introducing some, but not too much, parallelism. The overall performance gain of 27.6X and better from our original code is quite impressive.

5.11.1 Floating-Point Performance Anomaly
One of the most striking features of Figure 5.27 is the dramatic drop in the cycle time for floating-point multiplication when we go from combine3, where the product is accumulated in memory, to combine4, where the product is accumulated in a floating-point register. By making this small change, the code suddenly runs 23.4 times faster. When an unexpected result such as this one arises, it is important to hypothesize what could cause this behavior and then devise a series of tests to evaluate this hypothesis. Examining the table, it appears that something strange is happening for the case of floating-point multiplication when we accumulate the results in memory. The performance is far worse than for floating-point addition or integer multipli...
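
To make the difference between the two versions concrete, here is a minimal sketch of the two accumulation styles the passage contrasts. It is an illustration under stated assumptions, not the original listing: the names combine3_sketch and combine4_sketch are hypothetical, a plain float array stands in for whatever vector type the original code uses, and float is an assumed element type. The only point it shows is where the running product lives during the loop.

```c
#include <stddef.h>

/* Assumed stand-in for combine3: the running product lives in memory.
 * Every iteration reads *dest, multiplies, and writes *dest back. */
void combine3_sketch(const float *data, size_t n, float *dest)
{
    *dest = 1.0f;
    for (size_t i = 0; i < n; i++)
        *dest = *dest * data[i];
}

/* Assumed stand-in for combine4: the running product lives in a local
 * variable that the compiler can keep in a register. The result is
 * stored to memory once, after the loop finishes. */
void combine4_sketch(const float *data, size_t n, float *dest)
{
    float acc = 1.0f;
    for (size_t i = 0; i < n; i++)
        acc = acc * data[i];
    *dest = acc;
}
```

Both functions compute the same product for ordinary inputs; calling each on the same array and timing the loops for different element values is one way to start testing a hypothesis about the anomaly.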