
# 11_vectors - How to Compute This Fast Performing the same...


CIS 501 (Martin): Vectors

CIS 501 Computer Architecture
Unit 11: Vectors

Slides originally developed by Amir Roth with contributions by Milo Martin at University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

## How to Compute This Fast?

- Performing the same operations on many data items
  - Example: SAXPY
- Instruction-level parallelism (ILP): fine grained
  - Loop unrolling with static scheduling, or dynamic scheduling
  - Wide-issue superscalar (non-)scaling limits benefits
- Thread-level parallelism (TLP): coarse grained
  - Multicore
- Can we do some “medium grained” parallelism?

The SAXPY loop:

    for (I = 0; I < 1024; I++) {
      Z[I] = A*X[I] + Y[I];
    }

and its scalar assembly:

    L1: ldf [X+r1]->f1    // I is in r1
        mulf f0,f1->f2    // A is in f0
        ldf [Y+r1]->f3
        addf f2,f3->f4
        stf f4->[Z+r1]
        addi r1,4->r1
        blti r1,4096,L1

## Data-Level Parallelism

- Data-level parallelism (DLP)
  - Single operation repeated on multiple data elements
  - SIMD (Single-Instruction, Multiple-Data)
  - Less general than ILP: parallel insns are all the same operation
- Exploit with vectors
- Old idea: Cray-1 supercomputer from late 1970s
  - Eight 64-entry x 64-bit floating point “vector registers”
    - 4096 bits (0.5KB) in each register! 4KB for vector register file
  - Special vector instructions to perform vector operations
    - Load vector, store vector (wide memory operation)
    - Vector+Vector addition, subtraction, multiply, etc.
    - Vector+Constant addition, subtraction, multiply, etc.
    - In Cray-1, each instruction specifies 64 operations!
  - ALUs were expensive, so the machine did not perform all 64 operations in parallel!

## Today’s Vectors / SIMD
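To make the Cray-1 bullet "each instruction specifies 64 operations" concrete, here is a minimal C sketch of the SAXPY loop in two forms: the plain scalar loop, and a version that walks the arrays in strips of 64 elements (commonly called strip-mining), where each inner strip stands in for what one group of vector instructions would cover. The function names, the `VLEN` constant, and the use of 32-bit `float` elements are assumptions for illustration; the slides do not show this code, and the Cray-1's vector elements were 64-bit.

```c
#include <stddef.h>

/* Scalar reference: one multiply and one add per loop iteration. */
void saxpy_scalar(size_t n, float a, const float *x, const float *y, float *z) {
    for (size_t i = 0; i < n; i++) {
        z[i] = a * x[i] + y[i];
    }
}

/* Assumed vector length: 64 elements per vector register, as on the Cray-1. */
enum { VLEN = 64 };

/* Strip-mined version (illustrative): the outer loop advances one strip of
 * up to VLEN elements at a time. The inner loop models the element-wise work
 * that a vector load, multiply, add, and store would each specify at once.
 * Returns the number of strips processed. */
size_t saxpy_strips(size_t n, float a, const float *x, const float *y, float *z) {
    size_t strips = 0;
    for (size_t base = 0; base < n; base += VLEN) {
        /* The last strip may be shorter than VLEN. */
        size_t len = (n - base < VLEN) ? (n - base) : VLEN;
        for (size_t i = 0; i < len; i++) {
            z[base + i] = a * x[base + i] + y[base + i];
        }
        strips++;
    }
    return strips;
}
```

For the 1024-element loop on the slide, this sketch would take 16 strips, so the loop overhead (`addi`, `blti`) would be paid 16 times rather than 1024 times, even on hardware that does not execute all 64 element operations in parallel.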

## Example Vector ISA Extensions (SIMD)

- Extend ISA with floating point (FP) vector storage …
  - Vector register: fixed-size array of 32- or 64-bit FP elements
  - Vector length: for example 4, 8, 16, 64, …
- … and example operations for vector length of 4 (v1_i denotes element i of v1)
  - Load vector: ldf.v [X+r1]->v1
    - ldf [X+r1+0]->v1_0
    - ldf [X+r1+1]->v1_1
    - ldf [X+r1+2]->v1_2
    - ldf [X+r1+3]->v1_3
  - Add two vectors: addf.vv v1,v2->v3
    - addf v1_i,v2_i->v3_i (where i is 0,1,2,3)
  - Add vector to scalar: addf.vs v1,f2->v3
    - addf v1_i,f2->v3_i (where i is 0,1,2,3)
- Today’s vectors: short (128 bits), but fully parallel

## Example Use of Vectors: 4-wide

- Operations
  - Load vector: ldf.v [X+r1]->v1
  - Multiply vector by scalar: mulf.vs v1,f2->v3
  - Add two vectors: addf.vv v1,v2->v3
  - Store vector: stf.v v1->[X+r1]
- Performance?
  - Best case: 4x speedup
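The semantics of the four vector operations can be sketched in plain C by modeling a vector register as a small array. This is only an emulation under stated assumptions: the `vec4` type and the function names (chosen to echo the slide mnemonics `ldf.v`, `mulf.vs`, `addf.vv`, `stf.v`) are hypothetical, the vector length is fixed at 4, and `n` is assumed to be a multiple of 4. Real SIMD hardware would perform each helper's loop as a single instruction.

```c
#include <stddef.h>

enum { VL = 4 };                       /* vector length used in the 4-wide example */
typedef struct { float e[VL]; } vec4;  /* hypothetical model of one vector register */

/* ldf.v [p]->v : load VL consecutive elements from memory. */
vec4 ldf_v(const float *p) {
    vec4 v;
    for (int i = 0; i < VL; i++) v.e[i] = p[i];
    return v;
}

/* mulf.vs v,s->r : multiply every element of v by scalar s. */
vec4 mulf_vs(vec4 v, float s) {
    vec4 r;
    for (int i = 0; i < VL; i++) r.e[i] = v.e[i] * s;
    return r;
}

/* addf.vv a,b->r : element-wise add of two vectors. */
vec4 addf_vv(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < VL; i++) r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* stf.v v->[p] : store VL consecutive elements to memory. */
void stf_v(vec4 v, float *p) {
    for (int i = 0; i < VL; i++) p[i] = v.e[i];
}

/* SAXPY built from the four operations above; processes full groups of VL
 * elements only and returns the number of loop iterations executed. */
size_t saxpy_vec4(size_t n, float a, const float *x, const float *y, float *z) {
    size_t iters = 0;
    for (size_t i = 0; i + VL <= n; i += VL) {
        vec4 v1 = ldf_v(&x[i]);        /* load 4 elements of X */
        vec4 v2 = mulf_vs(v1, a);      /* A * X[i..i+3]        */
        vec4 v3 = ldf_v(&y[i]);        /* load 4 elements of Y */
        vec4 v4 = addf_vv(v2, v3);     /* A*X + Y              */
        stf_v(v4, &z[i]);              /* store 4 elements of Z */
        iters++;
    }
    return iters;
}
```

With 1024 elements the loop body would run 256 times instead of 1024. If each vector instruction costs the same as its scalar counterpart, that is where the best-case 4x figure comes from.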

## This note was uploaded on 10/19/2011 for the course CIS 501 taught by Professor Martin during the Fall '10 term at UPenn.
