# l13 - Lecture 13 Loop Transformations for Parallelism and...


Lecture 13: Loop Transformations for Parallelism and Locality

1. Examples
2. Affine Partitioning: Do-all
3. Affine Partitioning: Pipelining

Readings: Chapter 11–11.3, 11.6–11.7.4, 11.9–11.9.6

Carnegie Mellon M. Lam CS243: Loop Transformations

## Shared Memory Machines

Performance on Shared Address Space Multiprocessors: Parallelism & Locality

## Parallelism and Locality

- Parallelism DOES NOT imply speedup!
- Parallel performance: improve locality with loop transformations
  - Minimize communication
  - Operations using the same data are executed on the same processor
- Sequential performance: improve locality with loop transformations
  - Minimize cache misses
  - Operations using the same data are executed close in time

## Loop Permutation (Loop Interchange)

Original loop nest:

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

After interchange:

    for J' = 1 to 3
      for I' = 1 to 4
        Z[I',J'] = Z[I'-1,J']

$$
\begin{bmatrix} j' \\ i' \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
$$

(Figure: iteration spaces before and after the interchange, with axes I, J and J', I'.)

## Loop Fusion

Original loops:

    for I = 1 to 4
      T[I] = A[I]+B[I]         (s1)
    for I' = 1 to 4
      C[I'] = T[I'] x T[I']    (s2)

Fused loop:

    for J = 1 to 4
      T[J] = A[J]+B[J]         (s1)
      C[J] = T[J] x T[J]       (s2)

Mappings:

$$
\text{s1: } [j] = [1]\,[i] \qquad \text{s2: } [j] = [1]\,[i']
$$

(Figure: iteration spaces with axes I, I' and J.)

## Affine Partitioning: A Contrived but Illustrative Example

    FOR j = 1 TO n
      FOR i = 1 TO n
        A[i,j] = A[i,j]+B[i-1,j];    (S1)
        B[i,j] = A[i,j-1]*B[i,j];    (S2)

(Figure: dependences between S1 and S2 in the iteration space, with axes i and j.)

## Best Parallelization Scheme

The algorithm finds an affine partition mapping for each instruction:

- S1: Execute iteration (i, j) on processor i-j.
- S2: Execute iteration (i, j) on processor i-j+1.
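The claim that these two mappings need no communication can be checked by brute force. The sketch below is an illustration only, not the partitioning algorithm from the lecture: the initial array contents, the helper names (`make_arrays`, `partitioned`), and the choice n = 5 are all assumptions. It groups every statement instance by its processor (i-j for S1, i-j+1 for S2), runs the processors one after another in reverse ID order, and compares the result with the sequential loop nest.

```python
from collections import defaultdict

def make_arrays(n):
    # (n+1) x (n+1) integer arrays; row 0 and column 0 hold boundary values.
    A = [[(3 * i + j) % 7 + 1 for j in range(n + 1)] for i in range(n + 1)]
    B = [[(i + 2 * j) % 5 + 1 for j in range(n + 1)] for i in range(n + 1)]
    return A, B

def s1(A, B, i, j):
    A[i][j] = A[i][j] + B[i - 1][j]

def s2(A, B, i, j):
    B[i][j] = A[i][j - 1] * B[i][j]

def sequential(n):
    A, B = make_arrays(n)
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            s1(A, B, i, j)
            s2(A, B, i, j)
    return A, B

def partitioned(n):
    A, B = make_arrays(n)
    work = defaultdict(list)
    # Walk the iterations in sequential order so that each processor's list
    # keeps the original relative order of its own statement instances.
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            work[i - j].append((s1, i, j))       # S1 on processor i-j
            work[i - j + 1].append((s2, i, j))   # S2 on processor i-j+1
    # Run the processors one at a time, in reverse ID order.
    for p in sorted(work, reverse=True):
        for stmt, i, j in work[p]:
            stmt(A, B, i, j)
    return A, B

n = 5
assert partitioned(n) == sequential(n)
```

The order in which processors run does not matter here because each element of A and B is touched by exactly one processor: A[i,j] by S1(i,j) and S2(i,j+1), both on processor i-j, and B[i,j] by S2(i,j) and S1(i+1,j), both on processor i-j+1.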
SPMD code: let p be the processor's ID number.

    if (1-n <= p <= n) then
      if (1 <= p) then
        B[p,1] = A[p,0] * B[p,1];                     (S2)
      for i1 = max(1,1+p) to min(n,n-1+p) do
        A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];       (S1)
        B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];     (S2)
      if (p <= 0) then
        A[n+p,n] = A[n+p,n] + B[n+p-1,n];             (S1)

## 2. Iteration Space

    FOR i = 0 to 5
      FOR j = i to 7
        …

- n-deep loop nests: n-dimensional polytope
- Iterations: coordinates in the iteration space
- Assume: iteration index is incremented in the loop
- Sequential execution order: lexicographic order
  - [0,0], [0,1], …, [0,6], [0,7], [1,1], …, [1,6], [1,7], …

## Maximum Parallelism & No Communication

(Figure: loop iterations map to array elements through the accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs through the partitions $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.)

For every pair of data-dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, find $C_1, c_1, C_2, c_2$ such that

$$
\forall\, i_1, i_2:\quad F_1 i_1 + f_1 = F_2 i_2 + f_2 \;\rightarrow\; C_1 i_1 + c_1 = C_2 i_2 + c_2
$$

with the objective of maximizing the rank of $C_1, C_2$.

## Rank of Partitioning = Degree of Parallelism

(Figure: affine mappings of rank 0, 1, and 2, showing which iterations are mapped to the same processor.)

## Example 1: Loop Transform

Find an affine partitioning $c_1, c_2, c_0$ such that

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

$$
p = \begin{bmatrix} c_1 & c_2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + c_0
$$

Suppose iterations (i, j) and (i', j') refer to the same location:

$$
i = i' - 1, \qquad j = j'
$$

No communication means:

$$
\begin{aligned}
c_1 i + c_2 j + c_0 &= c_1 i' + c_2 j' + c_0 \\
c_1 (i' - 1) + c_2 j' + c_0 &= c_1 i' + c_2 j' + c_0 \\
c_1 &= 0 \\
p &= c_2 j + c_0
\end{aligned}
$$

Pick the simplest $c_2, c_0$: $c_2 = 1$, $c_0 = 0$, so $p = j$.

(Figure: iteration space with axes I and J.)

## Code Generation

- Naive
  - Each processor visits all the iterations
  - Executes only if it owns that iteration
- Optimization
  - Removes unnecessary looping and condition evaluation
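The two strategies can be compared on Example 1, whose partition is p = j. This is a minimal sketch under illustrative assumptions: a Python dictionary stands in for the array Z, the function names are hypothetical, and the three processors are simulated one at a time by an outer loop.

```python
def init_z():
    # Z[I, J] for I in 0..4, J in 1..3; row I = 0 is the boundary read by I = 1.
    return {(i, j): 10 * i + j for i in range(5) for j in range(1, 4)}

def sequential():
    z = init_z()
    for i in range(1, 5):
        for j in range(1, 4):
            z[i, j] = z[i - 1, j]
    return z

def naive_spmd():
    z = init_z()
    guards = 0
    for p in range(1, 4):              # simulated processor ID
        for i in range(1, 5):
            for j in range(1, 4):
                guards += 1            # every processor tests every iteration
                if j == p:             # execute only the iterations it owns
                    z[i, j] = z[i - 1, j]
    return z, guards

def optimized_spmd():
    z = init_z()
    for p in range(1, 4):              # simulated processor ID
        for i in range(1, 5):          # J loop and guard replaced by J = p
            z[i, p] = z[i - 1, p]
    return z

z_naive, guard_tests = naive_spmd()
assert z_naive == sequential() == optimized_spmd()
print(guard_tests)  # 36 guard evaluations for 12 useful assignments
```

Both versions compute the same array; the optimized one simply avoids the 36 guard tests by substituting the owned value of J directly.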
## Code Generation

Original loop nest, with partition p = j:

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

Naive code:

    for P = 1 to 3
      for I = 1 to 4
        for J = 1 to 3
          if (J == P)
            Z[I,J] = Z[I-1,J]

Optimized code:

    for P = 1 to 3
      for I = 1 to 4
        Z[I,P] = Z[I-1,P]

SPMD (single program multiple data) code:

    for I = 1 to 4
      Z[I,P] = Z[I-1,P]

(Figure: iteration space with axes I and J.)

## Loop Permutation (Loop Interchange)

Original loop nest:

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I-1,J]

After the transformation:

    for P = 1 to 3
      for I = 1 to 4
        Z[I,P] = Z[I-1,P]

$$
\begin{bmatrix} p' \\ i' \end{bmatrix} =
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
$$

(Figure: iteration spaces with axes I, J and P, I.)

## Example 2: Loop Fusion

Find an affine partitioning $c_{1,1}, c_{1,0}, c_{2,1}, c_{2,0}$ such that

    for I = 1 to 4
      T[I] = A[I]+B[I]         (s1)
    for I' = 1 to 4
      C[I'] = T[I'] x T[I']    (s2)

$$
\text{s1: } [p] = [c_{1,1}]\,[i] + c_{1,0} \qquad \text{s2: } [p] = [c_{2,1}]\,[i'] + c_{2,0}
$$

Suppose iterations i and i' refer to the same location:

$$
i = i'
$$

No communication means:

$$
c_{1,1}\, i + c_{1,0} = c_{2,1}\, i' + c_{2,0}
\quad\Rightarrow\quad
c_{1,1} = c_{2,1}, \qquad c_{1,0} = c_{2,0}
$$

Pick the simplest values: $c_{1,1} = c_{2,1} = 1$ and $c_{1,0} = c_{2,0} = 0$, so $p = i$ and $p = i'$.

## Loop Fusion

Original loops:

    for I = 1 to 4
      T[I] = A[I]+B[I]         (s1)
    for I' = 1 to 4
      C[I'] = T[I'] x T[I']    (s2)

Naive code:

    for P = 1 to 4
      for I = 1 to 4
        if (I == P)
          T[I] = A[I]+B[I]         (s1)
      for I' = 1 to 4
        if (I' == P)
          C[I'] = T[I'] x T[I']    (s2)

Mappings:

$$
\text{s1: } [p] = [1]\,[i] \qquad \text{s2: } [p] = [1]\,[i']
$$
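The partition p = i, p = i' from Example 2 places s1 and s2 for the same index on one processor, which is what allows the two loops to be fused. A small sketch of that equivalence follows; the input lists `a` and `b`, the 0-based indexing, and the function names are assumptions made for illustration.

```python
def separate_loops(a, b):
    n = len(a)
    t = [0] * n
    c = [0] * n
    for i in range(n):          # s1
        t[i] = a[i] + b[i]
    for i in range(n):          # s2
        c[i] = t[i] * t[i]
    return t, c

def fused_per_processor(a, b):
    n = len(a)
    t = [0] * n
    c = [0] * n
    # Processor p runs s1 and then s2 for index p. Processors are visited in
    # reverse order to show that no ordering between them is required.
    for p in reversed(range(n)):
        t[p] = a[p] + b[p]      # s1
        c[p] = t[p] * t[p]      # s2
    return t, c

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
assert fused_per_processor(a, b) == separate_loops(a, b)
print(separate_loops(a, b)[1])  # [121, 484, 1089, 1936]
```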
Optimized code, with the guards removed:

    for P = 1 to 4
      T[P] = A[P]+B[P]         (s1)
      C[P] = T[P] x T[P]       (s2)

(Figure: iteration spaces with axes I, I' and J.)

## Example 3: 2 Nested, Parallel Loops

Find an affine partitioning $c_1, c_2, c_0$ such that

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I,J]+1

$$
p = \begin{bmatrix} c_1 & c_2 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + c_0
$$

Suppose iterations (i, j) and (i', j') refer to the same location:

$$
i = i', \qquad j = j'
$$

No communication means:

$$
\begin{aligned}
c_1 i + c_2 j + c_0 &= c_1 i' + c_2 j' + c_0 \\
c_1 i' + c_2 j' + c_0 &= c_1 i' + c_2 j' + c_0
\end{aligned}
$$

No constraints. There are two basis vectors: $[c_1\ c_2] = [1\ 0]$ or $[c_1\ c_2] = [0\ 1]$.

Two answers for p: two degrees of parallelism.

$$
\begin{bmatrix} p_1 \\ p_2 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
$$

## Example 3: 2 Nested, Parallel Loops

Original loop nest:

    for I = 1 to 4
      for J = 1 to 3
        Z[I,J] = Z[I,J]+1

Naive code:

    for p1 = 1 to 4
      for p2 = 1 to 3
        for I = 1 to 4
          for J = 1 to 3
            if (I == p1 & J == p2)
              Z[I,J] = Z[I,J]+1

Optimized code:

    for p1 = 1 to 4
      for p2 = 1 to 3
        Z[p1,p2] = Z[p1,p2]+1

$$
\begin{bmatrix} p_1 \\ p_2 \end{bmatrix} =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
$$

(Figure: iteration space with axes I and J.)

## Optimizing Arbitrary Loop Nesting Using Affine Partitions (chotst, NAS)

          DO 1 J = 0, N
            I0 = MAX ( -M, -J )
            DO 2 I = I0, -1
              DO 3 JJ = I0 - I, -1
                DO 3 L = 0, NMAT
    3             A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
              DO 2 L = 0, NMAT
    2           A(L,I,J) = A(L,I,J) * A(L,0,I+J)
            DO 4 L = 0, NMAT
    4         EPSS(L) = EPS * A(L,0,J)
            DO 5 JJ = I0, -1
              DO 5 L = 0, NMAT
    5           A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
            DO 1 L = 0, NMAT
    1         A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )

          DO 6 I = 0, NRHS
            DO 7 K = 0, N
              DO 8 L = 0, NMAT
    8           B(I,L,K) = B(I,L,K) * A(L,0,K)
              DO 7 JJ = 1, MIN (M, N-K)
                DO 7 L = 0, NMAT
    7             B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
            DO 6 K = N, 0, -1
              DO 9 L = 0, NMAT
    9           B(I,L,K) = B(I,L,K) * A(L,0,K)
              DO 6 JJ = 1, MIN (M, K)
                DO 6 L = 0, NMAT
    6             B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

(Figure annotations: arrays A, B, and EPSS; loop index L.)
## Chotst: Results with Affine Partitioning + Blocking

(Unimodular: a subset of affine partitioning for perfect loop nests)

(Figure: speedup, on a scale of 0 to 8, versus number of processors, 1 to 8, for two series: Unimodular + Blocking and Affine Partitioning + Blocking.)

## Summary of Affine Partitioning

Communication-Free

(Figure: loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.)

## Advanced topic: Pipelining

SOR (Successive Over-Relaxation): An Example

    for i = 0 to m
      for j = 0 to n
        X[j+1] = c * (X[j] + X[j+1])

(Figure: iteration space with axes i and j.)

## Finding the Maximum Degree of Pipelining

(Figure: loop iterations with $i_1 \le i_2$ map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.)

For every pair of data-dependent accesses $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, let $B_1 i_1 + b_1 \ge 0$ and $B_2 i_2 + b_2 \ge 0$ be the corresponding loop bound constraints. Find $C_1, c_1, C_2, c_2$ such that

$$
\forall\, i_1, i_2:\quad B_1 i_1 + b_1 \ge 0,\; B_2 i_2 + b_2 \ge 0,\;
(i_1 \le i_2) \wedge (F_1 i_1 + f_1 = F_2 i_2 + f_2) \;\rightarrow\; C_1 i_1 + c_1 \le C_2 i_2 + c_2
$$

with the objective of maximizing the rank of $C_1, C_2$.

## Key Insight

- Choice in time mapping => (pipelined) parallelism
- Rank(C) – 1 degrees of parallelism with 1 degree of synchronization
- Can create blocks with Rank(C) dimensions
- Finding time partitions is not as straightforward as finding space partitions
  - Need to deal with linear inequalities
  - Solved using Farkas' Lemma (no simple intuitive proof)

## Summary of Affine Partitioning

Communication-Free

(Figure: loop iterations map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to processor IDs through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.)

Pipelining

(Figure: loop iterations with $i_1 \le i_2$ map to array elements through $F_1 i_1 + f_1$ and $F_2 i_2 + f_2$, and to time stages through $C_1 i_1 + c_1$ and $C_2 i_2 + c_2$.)

...
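The pipelining condition can be explored on the SOR loop by enumeration. The sketch below is not the Farkas-based method the lecture refers to: it is a brute-force check over a small iteration space, and the function names, the bounds m = 3 and n = 5, the integer constant c = 2, and the candidate coefficients 0 and 1 are all assumptions. It lists the dependence distances, keeps the time-mapping rows [c_i, c_j] that never send a dependent iteration to an earlier stage, and then runs the loop stage by stage using t = 2i + j, an illustrative choice formed by adding the two legal rows.

```python
from itertools import product

def accesses(i, j):
    # Iteration (i, j) of X[j+1] = c * (X[j] + X[j+1]): (reads, writes).
    return {j, j + 1}, {j + 1}

def dependence_distances(m, n):
    """Distance vectors between dependent iterations, found by enumeration."""
    iters = list(product(range(m + 1), range(n + 1)))  # lexicographic order
    dists = set()
    for a, (i1, j1) in enumerate(iters):
        r1, w1 = accesses(i1, j1)
        for i2, j2 in iters[a + 1:]:
            r2, w2 = accesses(i2, j2)
            if (w1 & (r2 | w2)) or (r1 & w2):
                dists.add((i2 - i1, j2 - j1))
    return dists

def legal_time_mappings(dists, coeffs=(0, 1)):
    # A row [c_i, c_j] is legal if c . d >= 0 for every dependence distance d.
    return [c for c in product(coeffs, repeat=2)
            if c != (0, 0)
            and all(c[0] * di + c[1] * dj >= 0 for di, dj in dists)]

def sor_sequential(x, m, n, c):
    x = list(x)
    for i in range(m + 1):
        for j in range(n + 1):
            x[j + 1] = c * (x[j] + x[j + 1])
    return x

def sor_wavefront(x, m, n, c):
    # Stage t = 2i + j is strictly positive on every distance found above, so
    # iterations sharing a stage are independent; they are run in reverse
    # order of i here to demonstrate that their order does not matter.
    x = list(x)
    for t in range(2 * m + n + 1):
        for i in reversed(range(m + 1)):
            j = t - 2 * i
            if 0 <= j <= n:
                x[j + 1] = c * (x[j] + x[j + 1])
    return x

m, n = 3, 5
print(legal_time_mappings(dependence_distances(m, n)))  # [(1, 0), (1, 1)]
x0 = list(range(1, n + 3))
assert sor_wavefront(x0, m, n, 2) == sor_sequential(x0, m, n, 2)
```

The two legal rows are linearly independent, so the time mapping has rank 2, which matches the Key Insight slide: Rank(C) – 1 = 1 degree of parallelism. The row [0, 1] is rejected because of the distance (1, -1): iteration (i, j) overwrites X[j+1], which iteration (i-1, j+1) must read first.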

## This document was uploaded on 03/12/2012.
