# lecture9 - Code Optimization II: Machine Dependent...

Code Optimization II: Machine Dependent Optimizations

## Topics

Machine-Dependent Optimizations

- Pointer code
- Unrolling
- Enabling instruction level parallelism

Understanding Processor Operation

- Translation of instructions into operations
- Out-of-order execution of operations

## Previous Best Combining Code

Task

- Compute sum of all elements in vector
- Vector represented by C-style abstract data type
- Achieved CPE of 2.00
  - CPE = cycles per element

```c
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}
```

## General Forms of Combining

Data Types

- Use different declarations for `data_t`: `int`, `float`, `double`

```c
void abstract_combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP data[i];
    *dest = t;
}
```

Operations

- Use different definitions of `OP` and `IDENT`: `+` / `0` and `*` / `1`

## Machine Independent Opt. Results

Optimizations

- Reduce function calls and memory references within loop

CPE by method:

| Method | Integer + | Integer * | Floating Point + | Floating Point * |
|---|---|---|---|---|
| Abstract -g | 42.06 | 41.86 | 41.44 | 160.00 |
| Abstract -O2 | 31.25 | 33.25 | 31.25 | 143.00 |
| Move vec_length | 20.66 | 21.25 | 21.15 | 135.00 |
| data access | 6.00 | 9.00 | 8.00 | 117.00 |
| Accum. in temp | 2.00 | 4.00 | 3.00 | 5.00 |

Performance Anomaly

- Computing the FP product of all elements is exceptionally slow.
- Very large speedup when accumulating in a temporary
- Caused by a quirk of IA32 floating point
  - Memory uses a 64-bit format, registers use 80 bits
  - Benchmark data caused overflow of 64 bits, but not 80

## Pointer Code

Optimization

- Use pointers rather than array references
- CPE: 3.00 (Compiled -O2)
  - Oops! We're not making progress here!

Warning: Some compilers do a better job optimizing array code

```c
void combine4p(vec_ptr v, int *dest)
{
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int *dend = data + length;
    int sum = 0;
    while (data < dend) {
        sum += *data;
        data++;
    }
    *dest = sum;
}
```

## Pointer vs. Array Code Inner Loops

Array Code

```
.L24:                          # Loop:
    addl (%eax,%edx,4),%ecx    #   sum += data[i]
    incl %edx                  #   i++
    cmpl %esi,%edx             #   i:length
    jl .L24                    #   if < goto Loop
```

Pointer Code

```
.L30:                          # Loop:
    addl (%eax),%ecx           #   sum += *data
    addl $4,%eax               #   data++
    cmpl %edx,%eax             #   data:dend
    jb .L30                    #   if < goto Loop
```

Performance

- Array Code: 4 instructions in 2 clock cycles
- Pointer Code: Almost same 4 instructions in 3 clock cycles

## Modern CPU Design

[Figure: CPU block diagram. Execution: Functional Units (Integer/Branch, FP Add, FP Mult/Div, Load, Store), Data Cache. Instruction Control: Instruction Cache, Fetch Control...]
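The topics list names unrolling as one of the machine-dependent optimizations, but the preview ends before that slide. As a rough illustration only, the sketch below applies loop unrolling to the `combine4` loop shown earlier. The unroll factor of 3, the name `combine4_unroll3`, and the `vec_rec` struct with its two accessors are assumptions made so the example compiles on its own; they are not taken from the lecture, and no CPE figure is claimed for this version.

```c
/* Hypothetical minimal stand-in for the lecture's vector ADT (assumption). */
typedef struct {
    int length;
    int *data;
} vec_rec, *vec_ptr;

static int vec_length(vec_ptr v) { return v->length; }
static int *get_vec_start(vec_ptr v) { return v->data; }

/* combine4 as given in the slides: one element per iteration. */
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    *dest = sum;
}

/* Sketch of unrolling by 3 (factor chosen for illustration):
 * each pass of the main loop handles three elements, so the
 * loop overhead (increment, compare, branch) is paid once per
 * three additions instead of once per addition. */
void combine4_unroll3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int limit = length - 2;          /* last i with i+2 in bounds, plus one */
    int *data = get_vec_start(v);
    int sum = 0;

    /* Main loop: three elements per iteration. */
    for (i = 0; i < limit; i += 3)
        sum += data[i] + data[i + 1] + data[i + 2];

    /* Cleanup loop: the 0 to 2 elements left over. */
    for (; i < length; i++)
        sum += data[i];

    *dest = sum;
}
```

Calling both functions on the same vector should give the same sum; the cleanup loop is what keeps the unrolled version correct when the length is not a multiple of 3.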
## This note was uploaded on 04/13/2008 for the course CS 211 taught by Professor Chakraborty during the Spring '08 term at Rutgers.
