Lecture 12: Parallelization
I.   Basic Parallelization
II.  Data dependence analysis
III. Interprocedural parallelization
Chapter 11.1-11.1.4
M. Lam, CS243: Parallelization (Carnegie Mellon)

Parallelization of Numerical Applications
•  DoAll loop parallelism
   –  Find loops whose iterations are independent
   –  The number of iterations typically scales with the problem size
   –  Usually much larger than the number of processors in a machine
   –  Divide the iterations up across processors

Basic Parallelism
•  Examples:
     FOR i = 1 TO 100
       A[i] = B[i] + C[i]

     FOR i = 11 TO 20
       a[i] = a[i-1] + 3

     FOR i = 11 TO 20
       a[i] = a[i-10] + 3
•  Does there exist a data dependence edge between two different iterations?
•  A data dependence edge is loop-carried if it crosses iteration boundaries
•  DoAll loops: loops without loop-carried dependences

Recall: Data Dependences
•  True dependence:
     a =
       = a
•  Anti-dependence:
       = a
     a =
•  Output dependence:
     a =
     a =

Affine Array Accesses
•  Common patterns of data accesses (i, j, k are loop indexes):
     A[i], A[j], A[i-1], A[0], A[i+j], A[2*i], A[2*i+1]
     A[i,j], A[i-1, j+1]
•  Array indexes are affine expressions of the surrounding loop indexes
   –  Loop indexes: iₙ, iₙ₋₁, ..., i₁
   –  Integer constants: cₙ, cₙ₋₁, ..., c₀
   –  Array index: cₙiₙ + cₙ₋₁iₙ₋₁ + ... + c₁i₁ + c₀
   –  Affine expression: a linear expression plus a constant term (c₀)

II. Formulating Data Dependence Analysis
     FOR i := 2 TO 5 DO
       A[i-2] = A[i] + 1;
•  Between the read access A[i] and the write access A[i-2] there is a dependence if:
   –  there exist two iterations ir and iw within the loop bounds, such that
   –  iterations ir and iw read and write the same array element, respectively:
        ∃ integers iw, ir :  2 ≤ iw, ir ≤ 5  and  ir = iw − 2
•  Between one instance of the write access A[i-2] and another there is a dependence if:
        ∃ integers iw, iv :  2 ≤ iw, iv ≤ 5  and  iw − 2 = iv − 2
   –  To rule out the case where an instance depends on itself, add the constraint iw ≠ iv

Memory Disambiguation is Undecidable at Compile Time
     read(n)
     FOR i = ...
       a[i] = a[n]

Domain of Data Dependence Analysis
•  Only handle loop bounds and array indexes that are affine functions of loop variables
     for i = 1 to n
       for j = 2i to 100
         a[i+2j+3][4i+2j][i*i] = ...
         ...                   = a[1][2i+1][j]
•  Assume a data dependence between the read and write operations if there exist:
   –  a read instance with indexes ir, jr, and
   –  a write instance with indexes iw, jw, such that
        ∃ integers ir, jr, iw, jw :
          1 ≤ iw, ir ≤ n
          2iw ≤ jw ≤ 100
          2ir ≤ jr ≤ 100
          iw + 2jw + 3 = 1
          4iw + 2jw = 2ir + 1
•  Equate each dimension of the array accesses; ignore the non-affine ones
   –  No solution ⇒ no data dependence
   –  Solution ⇒ there may be a dependence

Complexity of Data Dependence Analysis
•  For every pair of accesses (F1, f1) and (F2, f2), not necessarily distinct,
   at least one of which is a write:
   –  Let B1i1 + b1 ≥ 0 and B2i2 + b2 ≥ 0 be the corresponding loop bound constraints.
      A dependence requires
        ∃ integers i1, i2 :
          B1i1 + b1 ≥ 0,  B2i2 + b2 ≥ 0
          F1i1 + f1 = F2i2 + f2
   –  If the accesses are not distinct, add the constraint i1 ≠ i2
•  Equivalent to integer linear programming:
        ∃ integer vector i :  A1i ≤ b1,  A2i = b2
•  Integer linear programming is NP-complete
   –  exponential in the size of the coefficients, or O(nⁿ) in the number of variables
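For small loop bounds, the integer system above can be checked exhaustively. A minimal sketch of the read/write dependence test for the two affine dimensions of the example (the bound n = 10 is a hypothetical value, since n is only known at run time):

```python
# Brute-force dependence test for the slides' example:
#   write a[i+2j+3][4i+2j][i*i]   vs   read a[1][2i+1][j]
# Equating the two affine dimensions gives
#   iw + 2jw + 3 = 1   and   4iw + 2jw = 2ir + 1,
# subject to 1 <= iw, ir <= n and 2i <= j <= 100 for each instance.
# The non-affine third dimension (i*i vs. jr) is ignored, as on the slide.

def has_dependence(n):
    for iw in range(1, n + 1):
        for jw in range(2 * iw, 101):
            if iw + 2 * jw + 3 != 1:      # first dimension must match
                continue
            for ir in range(1, n + 1):
                for jr in range(2 * ir, 101):
                    if 4 * iw + 2 * jw == 2 * ir + 1:  # second dimension
                        return True
    return False

# Both equations are unsatisfiable: iw + 2jw + 3 = 1 has no solution with
# iw >= 1, jw >= 2, and 4iw + 2jw is even while 2ir + 1 is odd.
print(has_dependence(10))  # -> False: no data dependence
```

Because both equations fail for arithmetic reasons (not because of the bounds), the conclusion holds for any n, which is exactly the kind of fact the cheap tests on the next slide detect without enumeration.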
Data Dependence Analysis Algorithm
•  Typically solving many tiny, repeated problems
   –  Integer linear programming packages optimize for large problems instead
   –  Use memoization to remember the results of simple tests
•  Apply a series of relatively simple tests
   –  GCD test: e.g. 2*i vs. 2*i+1; a GCD test also exists for simultaneous equations
   –  Test whether the index ranges overlap
•  Back these up with a more expensive algorithm
   –  Use Fourier-Motzkin elimination to test whether there is a real solution
      •  Keep eliminating variables to see if a solution remains
      •  If there is no real solution, then there is no integer solution

Fourier-Motzkin Elimination
•  Eliminates a variable from a set of linear inequalities
•  To eliminate a variable x1:
   –  Rewrite all constraints involving x1 as lower or upper bounds on x1
   –  Create a transitive constraint for each pair of a lower and an upper bound
•  Example: let L and U denote lower and upper bounds, respectively.
   To eliminate x1 from
        L1(x2, ..., xn) ≤ x1 ≤ U1(x2, ..., xn)
        L2(x2, ..., xn) ≤ x1 ≤ U2(x2, ..., xn)
   replace the constraints with
        L1(x2, ..., xn) ≤ U1(x2, ..., xn)
        L1(x2, ..., xn) ≤ U2(x2, ..., xn)
        L2(x2, ..., xn) ≤ U1(x2, ..., xn)
        L2(x2, ..., xn) ≤ U2(x2, ..., xn)

Example
     FOR i = 1 TO 5
       FOR j = i+1 TO 5
         A[i,j] = f(A[i,i], A[i-1,j])
•  Loop bounds (unprimed for the write iteration, primed for the read iteration):
        1 ≤ i ≤ 5,   i + 1 ≤ j ≤ 5
        1 ≤ i' ≤ 5,  i' + 1 ≤ j' ≤ 5
•  Case 1: data dependence between A[i,j] and A[i,i]?
   –  Equate the subscripts: i = i', j = i'
   –  Substituting into i + 1 ≤ j gives i' + 1 ≤ i': a contradiction, so no dependence
•  Case 2: data dependence between A[i,j] and A[i-1,j]?
   –  Equate the subscripts: i = i' − 1 (i.e. i + 1 = i'), j = j'
   –  Substituting:  1 ≤ i + 1,  i + 1 ≤ 5,  i + 2 ≤ j,  j ≤ 5
   –  Combining:     1 ≤ i,  i ≤ 4,  i ≤ j − 2,  j ≤ 5
   –  Eliminating i: 1 ≤ 4;  1 ≤ j − 2;  j ≤ 5,  i.e.  3 ≤ j ≤ 5
   –  Eliminating j: 3 ≤ 5, which holds, so there may be a dependence
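The GCD test named above can be sketched in a few lines (a minimal single-subscript version; real compilers also handle simultaneous equations and loop bounds, which this sketch ignores):

```python
from math import gcd

def gcd_test(a1, c1, a2, c2):
    """Could the accesses X[a1*i + c1] and X[a2*i' + c2] touch the
    same element?  Rearranged, a1*i - a2*i' == c2 - c1, which has an
    integer solution iff gcd(a1, a2) divides c2 - c1.
    (Bounds are ignored, so True only means "maybe dependent".)"""
    return (c2 - c1) % gcd(a1, a2) == 0

# a[2i] vs. a[2i+1]: gcd(2, 2) = 2 does not divide 1 -> independent
print(gcd_test(2, 0, 2, 1))   # -> False
# a[2i] vs. a[4i+2]: gcd(2, 4) = 2 divides 2 -> may be dependent
print(gcd_test(2, 0, 4, 2))   # -> True
```

Note the asymmetry the slides emphasize: a failed test proves independence, but a passed test only says a dependence *may* exist, which is when the more expensive Fourier-Motzkin machinery is invoked.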
Data Dependence Analysis Algorithm (continued)
•  Add heuristics to encourage finding an integer solution
   –  Create two subproblems if a real, but not integer, solution is found
      •  For example, if x = 0.5 is a solution, create two subproblems by adding
         x ≤ 0 and x ≥ 1, respectively, to the original constraints

Relaxing Dependences
•  Privatization
   –  Scalar:
        for i = 1 to n
          t = (A[i] + B[i]) / 2;
          C[i] = t * t;
   –  Array:
        for i = 1 to n
          for j = 1 to n
            t[j] = (A[i,j] + B[i,j]) / 2;
          for j = 1 to n
            C[i,j] = t[j] * t[j];
•  Reduction:
        for i = 1 to n
          sum = sum + A[i];

Interprocedural Parallelization
•  Why? Amdahl's Law
•  Interprocedural symbolic analysis
   –  Find interprocedural array indexes that are affine expressions of outer loop indices
•  Interprocedural parallelization analysis
   –  Data dependence based on summaries of the array regions accessed
      •  If the regions do not intersect, there is no dependence
   –  Find privatizable scalar variables and arrays
   –  Find scalar and array reductions

Conclusions
•  Basic parallelization
   –  DoAll loop: a loop with no loop-carried data dependences
   –  Data dependence for affine loop indexes = integer linear programming
•  Coarse-grain parallelism, because of Amdahl's Law
   –  Interprocedural analysis is useful for affine indices
   –  Ask users for help on unresolved dependences
Extra Material: Blocking
•  Example: Matrix Multiplication
   (M. Lam, CS243: Loop Transformations)

Experimental Results
[Figure: matrix-multiply performance with blocking vs. without blocking]

Code Transform
•  Before:
     for (i = 0; i < n; i++) {
       for (j = 0; j < n; j++) {
         for (k = 0; k < n; k++) {
           Z[i,j] = Z[i,j] + X[i,k]*Y[k,j];
     }}}
•  After (B is the block size):
     for (ii = 0; ii < n; ii = ii+B) {
       for (jj = 0; jj < n; jj = jj+B) {
         for (kk = 0; kk < n; kk = kk+B) {
           for (i = ii; i < min(n, ii+B); i++) {
             for (j = jj; j < min(n, jj+B); j++) {
               for (k = kk; k < min(n, kk+B); k++) {
                 Z[i,j] = Z[i,j] + X[i,k] * Y[k,j];
     }}}}}}
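A runnable sketch of the blocked loop nest above, checked against the naive version (Python used for clarity; in practice the payoff comes from cache behavior in a compiled language):

```python
def matmul_blocked(X, Y, B=2):
    """Blocked (tiled) matrix multiply, mirroring the 'After' code:
    the three outer loops walk over B x B blocks, and the min() in the
    inner bounds handles blocks that hang past the matrix edge."""
    n = len(X)
    Z = [[0] * n for _ in range(n)]
    for ii in range(0, n, B):
        for jj in range(0, n, B):
            for kk in range(0, n, B):
                for i in range(ii, min(n, ii + B)):
                    for j in range(jj, min(n, jj + B)):
                        for k in range(kk, min(n, kk + B)):
                            Z[i][j] += X[i][k] * Y[k][j]
    return Z

def matmul_naive(X, Y):
    """The 'Before' triple loop, for comparison."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Y = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_blocked(X, Y, B=2) == matmul_naive(X, Y))  # -> True
```

The transformation only reorders the additions into Z[i][j]; since each is still performed exactly once, the two versions compute identical results on integer inputs.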