lec33 - SUB R20, R4, Rx LD F0, a LV V1, Rx MULTSV V2, F0,...

LECTURE - 33

Lecture Outline
- Vector Processors
- Scribe for today?
Why Vector Processing
- Deep pipeline ==> more parallelism
  - But more dependences
  - Need to fetch and issue many instructions (Flynn bottleneck)
- Same issues with multiple-issue processor
- Operations on vectors:
  - No data dependences
  - No control hazards
  - Single instn. ==> instn. bandwidth reduced
  - Well defined memory access pattern

Basic Architecture
- Vector-register processors vs. memory-memory vector processors
- DLXV: vector extn. of DLX (vector-register)
- Components:
  - Vector registers (V0..V7), 64-element
  - Vector functional units:
    - ADD/SUB, MUL, DIV, Integer, Logical
    - Each is pipelined, can start a new opn. every cycle
  - Vector load/store unit: also pipelined
  - Scalar registers and scalar unit (like in DLX)
Some Vector Instructions
- ADDV V1, V2, V3
- ADDSV V1, F0, V2
- SUBV V1, V2, V3
- SUBVS V1, V2, F0
- SUBSV V1, F0, V2
- Similar for MUL and DIV
- LV V1, R1
- SV R1, V1
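The operand order of the vector-scalar forms is easy to mix up, so here is a minimal Python sketch of what the arithmetic instructions compute. It is only an illustrative model, not DLXV itself: vector registers are modelled as plain lists, scalar registers as floats, and the function names (`addv`, `addsv`, `subv`, `subvs`, `subsv`) are hypothetical helpers named after the mnemonics.

```python
# Hypothetical model: a vector register is a Python list, a scalar register is a float.

def addv(v2, v3):
    """ADDV V1, V2, V3: V1[i] = V2[i] + V3[i]."""
    return [a + b for a, b in zip(v2, v3)]

def addsv(f0, v2):
    """ADDSV V1, F0, V2: V1[i] = F0 + V2[i]."""
    return [f0 + b for b in v2]

def subv(v2, v3):
    """SUBV V1, V2, V3: V1[i] = V2[i] - V3[i]."""
    return [a - b for a, b in zip(v2, v3)]

def subvs(v2, f0):
    """SUBVS V1, V2, F0: vector minus scalar, V1[i] = V2[i] - F0."""
    return [a - f0 for a in v2]

def subsv(f0, v2):
    """SUBSV V1, F0, V2: scalar minus vector, V1[i] = F0 - V2[i]."""
    return [f0 - b for b in v2]

v2 = [5.0, 7.0]
v3 = [1.0, 2.0]
print(addv(v2, v3))    # [6.0, 9.0]
print(subvs(v2, 2.0))  # [3.0, 5.0]
print(subsv(2.0, v2))  # [-3.0, -5.0]
```

Subtraction needs both SUBVS and SUBSV because it is not commutative; addition gets by with the single ADDSV form.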

SAXPY/DAXPY Loop
- Y = aX + Y (caps ==> vector)

      LD     F0, a
      ADDI   R4, Rx, 512
Loop: LD     F2, 0(Rx)
      MULTD  F2, F0, F2
      LD     F4, 0(Ry)
      ADDD   F4, F2, F4
      SD     0(Ry), F4
      ADDI   Rx, Rx, 8
      ADDI   Ry, Ry, 8
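To make the pointer arithmetic in the scalar loop concrete, the following Python sketch mirrors it one instruction at a time. It rests on a few assumptions that are not in the slides: memory is modelled as a dict from byte address to double, the base addresses `X_BASE` and `Y_BASE` are made up, and the loop is assumed to stop when Rx reaches R4 (512 bytes / 8 bytes per double = 64 elements).

```python
def daxpy_scalar(mem, a, rx, ry, nbytes=512, elem=8):
    """One element per iteration; Rx and Ry advance by 8 bytes each time."""
    r4 = rx + nbytes            # ADDI R4, Rx, 512  (end address of X)
    iterations = 0
    while r4 - rx != 0:         # assumed loop test: stop when Rx reaches R4
        f2 = mem[rx]            # LD    F2, 0(Rx)
        f2 = a * f2             # MULTD F2, F0, F2
        f4 = mem[ry]            # LD    F4, 0(Ry)
        f4 = f2 + f4            # ADDD  F4, F2, F4
        mem[ry] = f4            # SD    0(Ry), F4
        rx += elem              # ADDI  Rx, Rx, 8
        ry += elem              # ADDI  Ry, Ry, 8
        iterations += 1
    return iterations

# Hypothetical memory layout: X at byte 0, Y at byte 1024, 64 doubles each.
X_BASE, Y_BASE = 0, 1024
mem = {}
for i in range(64):
    mem[X_BASE + 8 * i] = float(i)   # X[i] = i
    mem[Y_BASE + 8 * i] = 1.0        # Y[i] = 1

n = daxpy_scalar(mem, 2.0, X_BASE, Y_BASE)
print(n)                       # 64 iterations
print(mem[Y_BASE + 8 * 63])    # 2*63 + 1 = 127.0
```

Every iteration re-fetches and re-issues the whole loop body, which is the instruction-bandwidth cost that the vector version avoids.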

(Scalar loop, continued)

      SUB    R20, R4, Rx

Vector version:

      LD      F0, a
      LV      V1, Rx
      MULTSV  V2, F0, V1
      LV      V3, Ry
      ADDV    V4, V2, V3
      SV      Ry, V4

- Reduction in instn. bandwidth
- Fewer pipeline interlocks

Estimating Execution Time
- Convoy: set of vector instructions which can begin execution in same cycle
  - Check for structural, data hazards
- For simplicity: convoy must complete before initiating next convoy
- Chime: time taken to execute one vector opn.
- Approximations:
  - Only one instn. can be initiated per cycle
  - Pipeline setup latency

Adding Flexibility
- Vector-length register (VLR), Maximum vector length (MVL)
  - MOVI2S VLR, R1
  - MOVS2I R1, VLR
- Vector longer than MVL ==> use strip-mining
- Vector stride:
  - LVWS V1, (R1, R2)
  - SVWS (R1, R2), V1
- Memory-bank conflicts?

Enhancing Vector Performance
- Chaining: data-forwarding
- Conditional execution:
  - Vector Mask Register
  - Some related instructions
    - SNEV V1, V2
    - SGTSV F0, V1
    - CVM
- Sparse matrices: scatter-gather
  - LVI V1, (R1+V2)
  - SVI (R1+V2), V1
...
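Strip-mining is easiest to see in code. The sketch below applies it to DAXPY on a vector longer than MVL, as a rough model only: Python lists stand in for memory and vector registers, `daxpy_strip_mined` is a hypothetical helper, and MVL = 64 is taken from the 64-element registers described earlier. Each pass handles at most MVL elements, and the length chosen for the pass is the value that would be moved into VLR. This version puts the short piece last; handling the odd-sized piece first is an equally valid ordering.

```python
MVL = 64  # maximum vector length (64-element vector registers)

def daxpy_strip_mined(a, x, y, mvl=MVL):
    """Compute y = a*x + y in place, in strips of at most mvl elements.

    Returns the list of vector lengths used, i.e. the successive VLR values.
    """
    n = len(x)
    vlr_values = []
    start = 0
    while start < n:
        vlr = min(mvl, n - start)               # value that would go into VLR
        v1 = x[start:start + vlr]               # LV     V1, Rx
        v2 = [a * e for e in v1]                # MULTSV V2, F0, V1
        v3 = y[start:start + vlr]               # LV     V3, Ry
        v4 = [p + q for p, q in zip(v2, v3)]    # ADDV   V4, V2, V3
        y[start:start + vlr] = v4               # SV     Ry, V4
        vlr_values.append(vlr)
        start += vlr                            # advance to the next strip
    return vlr_values

x = [float(i) for i in range(150)]
y = [1.0] * 150
lengths = daxpy_strip_mined(2.0, x, y)
print(lengths)   # [64, 64, 22]
print(y[149])    # 2*149 + 1 = 299.0
```

With 150 elements the loop body runs three times instead of 150, so the loop overhead is paid once per strip rather than once per element.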

This note was uploaded on 07/14/2011 for the course CS 422 taught by Professor Hogakoi during the Spring '10 term at IIT Kanpur.
