Lecture 13
Loop Transformations for Parallelism and Locality

1. Examples
2. Affine Partitioning: Do-all
3. Affine Partitioning: Pipelining

Readings: Chapter 11–11.3, 11.6–11.7.4, 11.9–11.9.6

Carnegie Mellon M. Lam CS243: Loop Transformations

Shared Memory Machines

Performance on Shared Address Space Multiprocessors: Parallelism & Locality

Parallelism and Locality

• Parallelism DOES NOT imply speedup!
• Parallel performance: improve locality with loop transformations
  – Minimize communication
  – Operations using the same data are executed on the same processor
• Sequential performance: improve locality with loop transformations
  – Minimize cache misses
  – Operations using the same data are executed close in time

Loop Permutation (Loop Interchange)

Original:
    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

Permuted:
    for J' = 1 to 3
      for I' = 1 to 4
        Z[I',J'] = Z[I'-1,J']

\[ \begin{bmatrix} j' \\ i' \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

[Figure: iteration spaces before (axes I, J) and after (axes J', I') the permutation]

Loop Fusion

Original:
    for I = 1 to 4
      T[I] = A[I]+B[I]        (s1)
    for I' = 1 to 4
      C[I'] = T[I'] x T[I']   (s2)

Fused:
    for J = 1 to 4
      T[J] = A[J]+B[J]        (s1)
      C[J] = T[J] x T[J]      (s2)

s1: \[ [j] = [1]\,[i] \]
s2: \[ [j] = [1]\,[i'] \]

[Figure: iteration spaces of the two original loops (axes I, I') and of the fused loop (axis J)]

Affine Partitioning: A Contrived but Illustrative Example

    FOR j = 1 TO n
      FOR i = 1 TO n
        A[i,j] = A[i,j]+B[i-1,j];   (S1)
        B[i,j] = A[i,j-1]*B[i,j];   (S2)

[Figure: iteration space with axes i and j, showing the instances of S1 and S2]

Best Parallelization Scheme

The algorithm finds an affine partition mapping for each instruction:
  S1: Execute iteration (i, j) on processor i-j.
  S2: Execute iteration (i, j) on processor i-j+1.
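The two mappings above can be sanity-checked by brute force. The following is a minimal sketch, not part of the lecture: it enumerates every iteration of the example loop, records which processor touches each array element under a given pair of mappings, and reports whether any element is shared between processors. The function names (`processors_per_element`, `is_communication_free`, `s1_proc`, `s2_proc`) are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def processors_per_element(n, map_s1, map_s2):
    """Record which processors touch each array element of the example loop."""
    touched = defaultdict(set)
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            # S1: A[i,j] = A[i,j] + B[i-1,j]
            p1 = map_s1(i, j)
            touched[("A", i, j)].add(p1)
            touched[("B", i - 1, j)].add(p1)
            # S2: B[i,j] = A[i,j-1] * B[i,j]
            p2 = map_s2(i, j)
            touched[("B", i, j)].add(p2)
            touched[("A", i, j - 1)].add(p2)
    return touched


def is_communication_free(n, map_s1, map_s2):
    """True if every array element is touched by exactly one processor."""
    table = processors_per_element(n, map_s1, map_s2)
    return all(len(procs) == 1 for procs in table.values())


def s1_proc(i, j):
    return i - j          # mapping for S1 stated on the slide


def s2_proc(i, j):
    return i - j + 1      # mapping for S2 stated on the slide


if __name__ == "__main__":
    # Mappings from the slide.
    print(is_communication_free(6, s1_proc, s2_proc))
    # Illustrative alternative: the same mapping i-j for both statements.
    print(is_communication_free(6, s1_proc, s1_proc))
```

The second call uses i-j for both statements as a contrast case; under that choice A[i,j-1] is read by one processor and written by another, which is why the slide's S2 mapping carries the +1 offset.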
SPMD code: let p be the processor's ID number

    if (1-n <= p <= n) then
      if (1 <= p) then
        B[p,1] = A[p,0] * B[p,1];                    (S2)
      for i1 = max(1,1+p) to min(n,n-1+p) do
        A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];      (S1)
        B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];    (S2)
      if (p <= 0) then
        A[n+p,n] = A[n+p,n] + B[n+p-1,n];            (S1)

2. Iteration Space

    FOR i = 0 to 5
      FOR j = i to 7
        …

• n-deep loop nests: n-dimensional polytope
• Iterations: coordinates in the iteration space
• Assume: the iteration index is incremented in the loop
• Sequential execution order: lexicographic order
  – [0,0], [0,1], …, [0,6], [0,7], [1,1], …, [1,6], [1,7], …

Maximum Parallelism & No Communication

[Figure: loop iterations mapped to array elements by $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs by $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$]

For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,
find $C_1, c_1, C_2, c_2$ such that

\[ \forall\, i_1, i_2:\quad F_1 i_1 + f_1 = F_2 i_2 + f_2 \;\rightarrow\; C_1 i_1 + c_1 = C_2 i_2 + c_2 \]

with the objective of maximizing the rank of $C_1, C_2$.

Rank of Partitioning = Degree of Parallelism

[Figure: affine mappings of rank 0, 1, and 2, showing which iterations are mapped to the same processor]

Example 1: Loop Transform

Find affine partitioning $c_1, c_2, c_0$ such that

\[ p = \begin{bmatrix} c_1 & c_2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + c_0 \]

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

Suppose iterations (i, j) and (i', j') refer to the same location:
    i = i' - 1
    j = j'

No communication means:

\[ \begin{aligned}
c_1 i + c_2 j + c_0 &= c_1 i' + c_2 j' + c_0 \\
c_1 (i' - 1) + c_2 j' + c_0 &= c_1 i' + c_2 j' + c_0 \\
c_1 &= 0 \\
p &= c_2 j + c_0
\end{aligned} \]

Pick the simplest $c_2, c_0$: $c_2 = 1$, $c_0 = 0$, so p = j.

Code Generation

• Naive
  – Each processor visits all the iterations
  – Executes only if it owns that iteration
• Optimization
  – Removes unnecessary looping and condition evaluation
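The two code-generation strategies can be compared on Example 1, where the partition is p = j. The sketch below is illustrative only: the array size, the initial values, and the helper names (`make_z`, `naive_spmd`, `optimized_spmd`, `run_all_processors`) are assumptions, and the three processors are simulated one after another on a single shared array rather than run in parallel.

```python
def make_z():
    """Z[0..4][0..3] with distinct values; row 0 and column 0 are boundary cells."""
    return [[10 * i + j for j in range(4)] for i in range(5)]


def sequential(z):
    """The original loop nest of Example 1."""
    for i in range(1, 5):
        for j in range(1, 4):
            z[i][j] = z[i - 1][j]
    return z


def naive_spmd(z, p):
    """Processor p visits every iteration but executes only those it owns (j == p)."""
    for i in range(1, 5):
        for j in range(1, 4):
            if j == p:
                z[i][j] = z[i - 1][j]


def optimized_spmd(z, p):
    """Same work with the j loop and the guard removed by substituting j = p."""
    for i in range(1, 5):
        z[i][p] = z[i - 1][p]


def run_all_processors(body, order=(1, 2, 3)):
    # Assumption: processors are simulated serially, in the given order.
    z = make_z()
    for p in order:
        body(z, p)
    return z


if __name__ == "__main__":
    expected = sequential(make_z())
    print(run_all_processors(naive_spmd) == expected)
    print(run_all_processors(optimized_spmd) == expected)
    # Processor order should not matter when no communication is needed.
    print(run_all_processors(optimized_spmd, order=(3, 1, 2)) == expected)
```

Because each processor owns a whole column of Z, the simulated processors can run in any order and still reproduce the sequential result, which is what the last comparison exercises.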
Code Generation

Original:
    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

Partition: p = j

Naive:
    for P = 1 to 3
      for I = 1 to 4
        for J = 1 to 3
          if (J == P)
            Z[I,J] = Z[I-1,J]

Optimized:
    for P = 1 to 3
      for I = 1 to 4
        Z[I,P] = Z[I-1,P]

SPMD (single program multiple data) code:
    for I = 1 to 4
      Z[I,P] = Z[I-1,P]

Loop Permutation (Loop Interchange)

Original:
    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

Permuted:
    for P = 1 to 3
      for I = 1 to 4
        Z[I,P] = Z[I-1,P]

\[ \begin{bmatrix} p' \\ i' \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

[Figure: iteration spaces before (axes I, J) and after (axes P, I) the permutation]

Example 2: Loop Fusion

Find affine partitioning $c_{1,1}, c_{1,0}, c_{2,1}, c_{2,0}$ such that

s1: \[ [p] = [c_{1,1}]\,[i] + c_{1,0} \]
s2: \[ [p] = [c_{2,1}]\,[i'] + c_{2,0} \]

    for I = 1 to 4
      T[I] = A[I]+B[I]        (s1)
    for I' = 1 to 4
      C[I'] = T[I'] x T[I']   (s2)

Suppose iterations i and i' refer to the same location:
    i = i'

No communication means:

\[ \begin{aligned}
c_{1,1}\, i + c_{1,0} &= c_{2,1}\, i' + c_{2,0} \\
c_{1,1} &= c_{2,1} \\
c_{1,0} &= c_{2,0}
\end{aligned} \]

Pick the simplest values: $c_{1,1} = c_{2,1} = 1$, $c_{1,0} = c_{2,0} = 0$, so p = i and p = i'.

Loop Fusion

Original:
    for I = 1 to 4
      T[I] = A[I]+B[I]        (s1)
    for I' = 1 to 4
      C[I'] = T[I'] x T[I']   (s2)

s1: \[ [p] = [1]\,[i] \]
s2: \[ [p] = [1]\,[i'] \]

Naive:
    for P = 1 to 4
      for I = 1 to 4
        if (I == P)
          T[I] = A[I]+B[I]        (s1)
      for I' = 1 to 4
        if (I' == P)
          C[I'] = T[I'] x T[I']   (s2)
Optimized:
    for P = 1 to 4
      T[P] = A[P]+B[P]        (s1)
      C[P] = T[P] x T[P]      (s2)

Example 3: 2 Nested, Parallel Loops

Find affine partitioning $c_1, c_2, c_0$ such that

\[ p = \begin{bmatrix} c_1 & c_2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + c_0 \]

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I,J]+1

Suppose iterations (i, j) and (i', j') refer to the same location:
    i = i'
    j = j'

No communication means:

\[ \begin{aligned}
c_1 i + c_2 j + c_0 &= c_1 i' + c_2 j' + c_0 \\
c_1 i' + c_2 j' + c_0 &= c_1 i' + c_2 j' + c_0
\end{aligned} \]

No constraints.
Two basis vectors: $[c_1\ c_2] = [1\ 0]$ or $[c_1\ c_2] = [0\ 1]$.
Two answers for p: two degrees of parallelism.

\[ \begin{bmatrix} p_1 \\ p_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

Example 3: 2 Nested, Parallel Loops

Original:
    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I,J]+1

Naive:
    for p1 = 1 to 4
      for p2 = 1 to 3
        for I = 1 to 4
          for J = 1 to 3
            if (I == p1 & J == p2)
              Z[I,J] = Z[I,J]+1

Optimized:
    for p1 = 1 to 4
      for p2 = 1 to 3
        Z[p1,p2] = Z[p1,p2]+1

Optimizing Arbitrary Loop Nesting Using Affine Partitions (chotst, NAS)

[Figure: nesting diagram of loops 1 to 9; labels A, B, EPSS, L]

    DO 6 I = 0, NRHS
      DO 7 K = 0, N
        DO 8 L = 0, NMAT
          B(I,L,K) = B(I,L,K) * A(L,0,K)
        DO 7 JJ = 1, MIN (M, N-K)
          DO 7 L = 0, NMAT
            B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
      DO 6 K = N, 0, -1
        DO 9 L = 0, NMAT
          B(I,L,K) = B(I,L,K) * A(L,0,K)
        DO 6 JJ = 1, MIN (M, K)
          DO 6 L = 0, NMAT
            B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)
Chotst: Results with Affine Partitioning + Blocking

(Unimodular: a subset of affine partitioning for perfect loop nests)

[Figure: speedup (0 to 8) versus number of processors (1 to 8) for two schemes: Unimodular + Blocking and Affine Partitioning + Blocking]

Chotst source, loops 1 to 5:

    DO 1 J = 0, N
      I0 = MAX ( -M, -J )
      DO 2 I = I0, -1
        DO 3 JJ = I0 - I, -1
          DO 3 L = 0, NMAT
            A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
        DO 2 L = 0, NMAT
          A(L,I,J) = A(L,I,J) * A(L,0,I+J)
      DO 4 L = 0, NMAT
        EPSS(L) = EPS * A(L,0,J)
      DO 5 JJ = I0, -1
        DO 5 L = 0, NMAT
          A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
      DO 1 L = 0, NMAT
        A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

Summary of Affine Partitioning

Communication-Free
[Figure: loop iterations mapped to array elements by $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs by $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$]

Advanced topic: Pipelining

SOR (Successive Over-Relaxation): An Example

    for i = 0 to m
      for j = 0 to n
        X[j+1] = c * (X[j] + X[j+1])

[Figure: iteration space of the SOR loop nest with axes i and j]

Finding the Maximum Degree of Pipelining

[Figure: loop iterations with $i_1 \le i_2$ mapped to array elements by $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages by $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$]

For every pair of data dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$,
let $B_1 i_1 + b_1 \ge 0$ and $B_2 i_2 + b_2 \ge 0$ be the corresponding loop bound constraints.
Find $C_1, c_1, C_2, c_2$ such that

\[ \forall\, i_1, i_2 \text{ with } B_1 i_1 + b_1 \ge 0,\ B_2 i_2 + b_2 \ge 0:\quad (i_1 \le i_2) \wedge (F_1 i_1 + f_1 = F_2 i_2 + f_2) \;\rightarrow\; C_1 i_1 + c_1 \le C_2 i_2 + c_2 \]

with the objective of maximizing the rank of $C_1, C_2$.

Key Insight

• Choice in time mapping => (pipelined) parallelism
• Rank(C) – 1 degrees of parallelism with 1 degree of synchronization
• Can create blocks with Rank(C) dimensions
• Finding time partitions is not as straightforward as finding space partitions
  – Need to deal with linear inequalities
  – Solved using Farkas' Lemma (no simple intuitive proof)
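To make the pipelining idea concrete for the SOR loop, here is a hedged sketch that runs the iterations in wavefronts instead of in sequential order. The slides do not give a concrete schedule; the time step t = 2i + j used below is an assumption chosen for this illustration. It follows from reading the loop body: iteration (i, j) depends on (i, j-1), on (i-1, j), and must follow (i-1, j+1), which reads X[j+1] before it is overwritten, so t must grow along the distances (0, 1), (1, 0), and (1, -1). The function names are hypothetical, and the wavefronts are simulated serially.

```python
def sor_sequential(x, c, m, n):
    """The SOR loop nest in its original order; x needs n + 2 elements."""
    x = list(x)
    for i in range(m + 1):
        for j in range(n + 1):
            x[j + 1] = c * (x[j] + x[j + 1])
    return x


def sor_wavefront(x, c, m, n):
    """Same iterations, grouped by the assumed time step t = 2*i + j."""
    x = list(x)
    for t in range(2 * m + n + 1):
        # Iterations in one wave touch disjoint pairs X[j], X[j+1],
        # so they could run on different processors at the same time.
        wave = [(i, t - 2 * i) for i in range(m + 1) if 0 <= t - 2 * i <= n]
        # Reversed on purpose: order inside a wave should not matter.
        for i, j in reversed(wave):
            x[j + 1] = c * (x[j] + x[j + 1])
    return x


if __name__ == "__main__":
    x0 = [float(k) for k in range(8)]      # n = 6 needs 8 elements
    print(sor_wavefront(x0, 0.5, 4, 6) == sor_sequential(x0, 0.5, 4, 6))
```

Each wave holds the iterations that can proceed concurrently, while the step from one wave to the next plays the role of the single degree of synchronization mentioned in the Key Insight bullets.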
Summary of Affine Partitioning

Communication-Free
[Figure: loop iterations mapped to array elements by $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs by $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$]

Pipelining
[Figure: loop iterations with $i_1 \le i_2$ mapped to array elements by $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages by $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$]

...