Lecture 7 - DLP

ECE 565: Computer Architecture
Instructor: Vijay S. Pai, Fall 2011
Course administration: via Blackboard

Slide 2. This Unit: Data/Thread-Level Parallelism
• Data-level parallelism
  • Vector processors
  • Message-passing multiprocessors
• Thread-level parallelism
  • Shared-memory multiprocessors
• Flynn taxonomy
[Figure: system-layer stack showing Application, OS, Firmware, Compiler, CPU, I/O, Memory, Digital Circuits]

Slide 3. Latency, Bandwidth, and Parallelism
• Latency: time to perform a single task
  – Hard to make smaller
• Bandwidth: number of tasks that can be performed in a given amount of time
  + Easier to make larger: overlap tasks, execute tasks in parallel
• One form of parallelism: instruction-level parallelism (ILP)
  • Parallel execution of insns from a single sequential program
  • Pipelining: overlap processing stages of different insns
  • Superscalar: multiple insns in one stage at a time
  • Have seen
Slide 4. Exposing and Exploiting ILP
• ILP is out there…
  • Integer programs (e.g., gcc, gzip): ~10–20
  • Floating-point programs (e.g., face-rec, weather-sim): ~50–250
  + It does make sense to build at least a 4-way superscalar
• …but compiler/processor must work hard to exploit it
  • Independent insns separated by branches, stores, function calls
  • Overcome with dynamic scheduling and speculation
  – Modern processors extract ILP of 1–3

Slide 5. Fundamental Problem with ILP
• Clock rate and IPC are at odds with each other
  • Pipelining
    + Fast clock
    – Increased hazards lower IPC
  • Wide issue
    + Higher IPC
    – N² bypassing slows down clock
• Can we get both a fast clock and wide issue?
  • Yes, but with a parallelism model less general than ILP
• Data-level parallelism (DLP)
  • Single operation repeated on multiple data elements
  • Less general than ILP: parallel insns are the same operation

Slide 6. Data-Level Parallelism (DLP)

    for (I = 0; I < 100; I++)
        Z[I] = A*X[I] + Y[I];

    0: ldf  X(r1),f1    // I is in r1
       mulf f0,f1,f2    // A is in f0
       ldf  Y(r1),f3
       addf f2,f3,f4
       stf  f4,Z(r1)
       addi r1,4,r1
       blti r1,400,0

• One example of DLP: inner-loop-level parallelism
• Iterations can be performed in parallel
Slide 7. Exploiting DLP With Vectors
• One way to exploit DLP: vectors
  • Extend the processor with a vector "data type"
  • Vector: array of MVL 32-bit FP numbers
• Maximum vector length (MVL): typically 8–64
• Vector register file: 8–16 vector registers (v0–v15)
[Figure: pipeline datapath with regfile, I$, BP, D$, plus a vector register file]

Slide 8. Vector ISA Extensions
• Vector operations
  • Versions of scalar operations: op.v
  • Each performs an implicit loop over MVL elements:
      for (I = 0; I < MVL; I++) op[I];
• Examples
  • ldf.v X(r1),v1 : load vector
      for (I = 0; I < MVL; I++) ldf X+I(r1),v1[I];
  • stf.v v1,X(r1) : store vector
      for (I = 0; I < MVL; I++) stf v1[I],X+I(r1);
  • addf.vv v1,v2,v3 : add two vectors
      for (I = 0; I < MVL; I++) addf v1[I],v2[I],v3[I];
  • addf.vs v1,f2,v3 : add scalar to each vector element
      for (I = 0; I < MVL; I++) addf v1[I],f2,v3[I];

This note was uploaded on 01/10/2012 for the course ECE 565 taught by Professor Pai during the Fall '11 term at Purdue.
