11f6643lec21 - CS 6643 F'11: Synchronization

Primitives for Shared-Memory Programming
- Synchronization of several processes using barrier synchronization
- Use of mutexes, semaphores, locks, ... to control access to shared variables
- Communication among processes through reads and writes of shared variables
- Declaring shared and private variables
- Spawning and combining processes

Shared Memory Programming Models
- Use Unix-like processes
  - High overhead
  - All data private to processes, unless otherwise specified
- Use light-weight processes or threads
  - Low overhead
  - Managed by a threads or light-weight process library in user space
  - All data are global except the thread stack and locally declared variables
- Directive-based programming
  - Extends the threads model
  - Creates and synchronizes threads automatically

Shared-memory Model
- Processors interact and synchronize with each other through shared variables.
  (Figure: several processors connected to a shared memory; the threads library sits in user space above the kernel.)

Advantages of Threads
- Software portability: POSIX threads are commonly used; parallel processing is easy, with the same binary running on uni- and multiprocessors.
- Latency hiding: increased throughput in I/O-bound applications.
- Responsiveness: a server can respond to new requests while servicing existing ones by spawning threads as needed.
- Scheduling and load balancing: easy to balance load in unstructured, dynamic applications; owing to the low overhead, many threads can be spawned with a small amount of work per thread.

Thread Structures
  (Figure: a process, its data, and the structures of the threads within it.)

Threads
- A thread is a stream of instructions that can be executed independently.
- A thread is like a user-level process, with a lot less overhead.
- Each process has one or more threads.
- Each thread contains minimal program state: CPU registers, program counter, stack pointer, stack, ...
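As an illustration of this thread model, a minimal POSIX threads sketch (not from the slides; the worker function and the fixed thread count of 4 are hypothetical):

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical worker: each thread executes this stream of instructions
       independently, sharing the process's global data. */
    void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d running\n", id);
        pthread_exit(NULL);                /* terminate this thread only */
    }

    int main(void) {
        pthread_t tid[4];
        int ids[4];
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);   /* spawn */
        }
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);                       /* combine */
        return 0;
    }

Compile with the platform's threads flag (e.g., cc -pthread).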
POSIX Thread Functions

    void pthread_exit(void *value_ptr);

Threads Example
- Consider a code fragment in which each of the n^2 iterations of a doubly nested loop computes one element of a matrix product as dot_product(get_row(a,row), get_col(b,col)).
- Each of the n^2 iterations can be executed independently, using a thread per iteration, as follows:

    create_thread(dot_product(get_row(a,row), get_col(b,col)));

Fork/Join Parallelism
- Initially only the master thread is active.
- The master thread executes sequential code.
- Fork: the master thread creates (or awakens) additional threads to execute parallel code.
- Join: at the end of the parallel code, the created threads die or are suspended.
- For loops are good sources of parallelism.
  (Figure: over time, the master thread repeatedly forks a set of other threads and joins them.)

OpenMP
- OpenMP: an application programming interface (API) for parallel programming on multiprocessors.
  - Compiler directives
  - Library of support functions
- OpenMP works in conjunction with Fortran, C, or C++.

Shared-memory vs. Message-passing
- Shared-memory model
  - The number of active threads is 1 at the start and finish of the program and changes dynamically during execution.
  - Parallelization can be done incrementally: execute and profile the sequential program, parallelize selectively, and stop when further effort is not warranted.
- Message-passing model
  - All processes are active throughout the execution of the program.
  - The sequential-to-parallel transformation requires major effort; it is done in one giant step rather than many tiny steps.

What's OpenMP Good For?
- C + OpenMP is sufficient to program multiprocessors.
- C + MPI + OpenMP is a good way to program multicomputers built from multicore, multi-CPU servers.
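A minimal OpenMP sketch of the fork/join model (not from the slides; it uses the parallel pragma and the thread-management functions, both introduced below, and compiles with, e.g., cc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        printf("before: only the master thread is active\n");
        #pragma omp parallel              /* fork: a team of threads executes the block */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                 /* implicit join: the team synchronizes here */
        printf("after: back to the master thread\n");
        return 0;
    }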
Canonical Shape of for Loop Control Clause
- The loop control clause must have the form

    for (index = start; index <relop> end; <increment>)

- where <relop> is one of <, <=, >=, > and <increment> is one of:

    index++        ++index        index--        --index
    index += inc   index -= inc
    index = index + inc    index = inc + index    index = index - inc

Pragmas
- Pragma: a compiler directive in C or C++.
- Stands for "pragmatic information": a way for the programmer to communicate with the compiler.
- The compiler is free to ignore pragmas.
- Syntax:

    #pragma omp <rest of pragma>

Execution Context
- Every thread has its own execution context.
- Execution context: the address space containing all of the variables the thread may access.
- Contents of the execution context:
  - static variables
  - dynamically allocated data structures in the heap
  - variables on the run-time stack
  - an additional run-time stack for functions invoked by the thread

Parallel for Pragma
- C programs often express data-parallel operations as for loops.
- OpenMP makes them easy to parallelize: the compiler takes care of generating code that forks/joins threads and allocates the iterations to threads.
- Format:

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

- The compiler must be able to verify that the run-time system will have the information it needs to schedule the loop iterations (hence the canonical loop shape above).

Declaring Private Variables
- Consider this loop nest:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k] + tmp);

- Either loop could be executed in parallel.
- We prefer to make the outer loop parallel, to reduce the number of forks/joins.
- We must give each thread its own private copy of variable j, using the optional clause private(x,y,...):

    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k] + tmp);

Shared and Private Variables
- Shared variable: has the same address in the execution context of every thread.
- Private variable: has a different address in the execution context of every thread.
- A thread cannot access the private variables of another thread.

    int main (int argc, char *argv[]) {
        int b[3];
        char *cptr;
        int i;

        cptr = malloc(1);
        #pragma omp parallel for
        for (i = 0; i < 3; i++)
            b[i] = i;
        ...
    }

  (Figure: b and cptr are shared; the master thread (thread 0) and thread 1 each hold a private copy of i on their own stacks.)

firstprivate Clause
- Used to create private variables having initial values identical to the value of the variable controlled by the master thread as the loop is entered.
- Variables are initialized once per thread, not once per loop iteration.
- If a thread modifies a variable's value in an iteration, subsequent iterations will get the modified value. (A short sketch follows the table of functions below.)

Thread Management Functions

    Function                            Use
    int  omp_get_num_procs (void)       Returns the number of processors available.
    int  omp_get_num_threads (void)     Returns the number of active threads (1 for sequential execution).
    int  omp_get_thread_num (void)      Returns the calling thread's ID (the master thread's ID is 0).
    void omp_set_num_threads (int t)    Sets the number of threads used in parallel execution; can be called several times.
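The sketch referred to above, illustrating firstprivate (not from the slides; the array, its contents, and the starting value 100 are hypothetical):

    /* Each thread's private copy of offset starts at the master's value, 100;
       with plain private(offset) the copies would start uninitialized. */
    void add_offset(int *a, int n) {
        int i, offset = 100;
        #pragma omp parallel for firstprivate(offset)
        for (i = 0; i < n; i++) {
            if (a[i] < 0)
                offset++;        /* visible to this thread's later iterations only */
            a[i] += offset;
        }
    }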
lastprivate Clause
- Sequentially last iteration: the iteration that occurs last when the loop is executed sequentially.
- lastprivate clause: copies back to the master thread's copy of a variable the private copy held by the thread that executed the sequentially last iteration.

Critical Sections
- Consider this C program segment to compute pi using the rectangle rule:

    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Critical Section (cont.)
- Parallelization that ignores concurrent accesses to the data:

    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Race Condition
- The unprotected update causes a race condition: one thread may "race ahead" of another and not see its change to the shared variable area.
- Example: with area = 11.667, threads A and B both execute area += 4.0/(1.0 + x*x); A computes 11.667 + 3.765 = 15.432 and B computes 11.667 + 3.563 = 15.230; whichever write happens last is kept, so area ends up as 15.432 (or 15.230) instead of the correct 18.995.

Race Condition Time Line
  (Figure: the value of area over time as threads A and B both read 11.667 and write back 15.230 and 15.432.)

critical Pragma
- Avoid race conditions by placing #pragma omp critical before a block of critical-section code:

    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i+0.5)/n;
        #pragma omp critical
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Source of Inefficiency
- The update to area is inside a critical section.
- Only one thread at a time may execute the statement; i.e., it is sequential code.
- The time to execute the statement is a significant part of the loop.
- By Amdahl's Law, we know the speedup will be severely constrained.

Reductions
- Reductions are so common that OpenMP provides support for them.
- A reduction clause may be added to the parallel for pragma.
- Specify the reduction operation and the reduction variable.
- OpenMP takes care of storing partial results in private variables and combining the partial results after the loop.

reduction Clause
- The reduction clause has this syntax: reduction (<op> : <variable>)
- Operators:

    +     Sum
    *     Product
    &     Bitwise and
    |     Bitwise or
    ^     Bitwise exclusive or
    &&    Logical and
    ||    Logical or

Performance Improvement #1
- Too many fork/joins can lower performance.
- Inverting loops may help performance if
  - the parallelism is in the inner loop,
  - after inversion, the outer loop can be made parallel, and
  - inversion does not significantly lower the cache hit rate.

Performance Improvement #2
- If a loop has too few iterations, the fork/join overhead is greater than the time saved by parallel execution.
- The if clause instructs the compiler to insert code that determines at run time whether the loop should be executed in parallel; e.g.,

    #pragma omp parallel for if(n > 5000)
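A small sketch (not from the slides) combining a non-arithmetic reduction operator with the if clause; the array and the 5000-iteration threshold are hypothetical:

    /* Check whether every element of v is nonnegative.  reduction(&&:ok) gives
       each thread a private copy of ok, combined with logical AND after the
       loop; if(n > 5000) skips the fork/join overhead for short loops. */
    int all_nonnegative(const double *v, int n) {
        int ok = 1;
        int i;
        #pragma omp parallel for reduction(&&:ok) if(n > 5000)
        for (i = 0; i < n; i++)
            ok = ok && (v[i] >= 0.0);
        return ok;
    }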
Pi-finding Code with Reduction Clause

    double area, pi, x;
    int i, n;
    ...
    area = 0.0;
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Performance Improvement #3
- We can use the schedule clause to specify how the iterations of a loop are allocated to threads.
- Static schedule: all iterations are allocated to threads before any iterations execute.
- Dynamic schedule: only some iterations are allocated to threads at the beginning of the loop's execution; the remaining iterations are allocated to threads that complete their assigned iterations.

Chunks
- A chunk is a contiguous range of iterations.
- Increasing the chunk size reduces overhead and may increase the cache hit rate.
- Decreasing the chunk size allows finer balancing of workloads.

schedule Clause
- Syntax: schedule (<type>[, <chunk>])
- The schedule type is required; the chunk size is optional.
- Allowable schedule types:
  - static: static allocation
  - dynamic: dynamic allocation
  - guided: guided self-scheduling
  - runtime: type chosen at run time based on the value of the environment variable OMP_SCHEDULE

Scheduling Options
- schedule(static): block allocation of about n/t contiguous iterations to each thread
- schedule(static,C): interleaved allocation of chunks of size C to threads
- schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
- schedule(dynamic,C): dynamic allocation of C iterations at a time to threads

Scheduling Options (cont.)
- schedule(guided,C): dynamic allocation of chunks to tasks using a guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, and the minimum chunk size is C.
- schedule(guided): guided self-scheduling with minimum chunk size 1.
- schedule(runtime): schedule chosen at run time based on the value of OMP_SCHEDULE; Unix example: setenv OMP_SCHEDULE "static,1"

Static vs. Dynamic Scheduling
- Static scheduling: low overhead; may exhibit high workload imbalance.
- Dynamic scheduling: higher overhead; can reduce workload imbalance.
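A short sketch (not from the slides) showing the schedule clause on a loop with uneven work per iteration; the triangular loop and the chunk size of 16 are hypothetical choices:

    /* Iteration i does about i units of work, so a static block schedule would
       be imbalanced; dynamic chunks of 16 iterations let threads that finish
       early take on more work (schedule(guided, 16) would also fit here). */
    void triangular(double *w, int n) {
        int i, j;
        #pragma omp parallel for schedule(dynamic, 16) private(j)
        for (i = 0; i < n; i++) {
            w[i] = 0.0;
            for (j = 0; j < i; j++)
                w[i] += 1.0 / (i + j + 1);
        }
    }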
More General Data Parallelism
- Our focus has been on the parallelization of for loops.
- Other opportunities for data parallelism:
  - processing items on a "to do" list
  - a for loop plus additional code outside the loop

Processing a "To Do" List
  (Figure: the master thread and thread 1 each hold a private task_ptr; job_ptr and the task list are shared variables on the heap.)

Sequential Code (1/2)

    int main (int argc, char *argv[]) {
        struct job_struct *job_ptr;
        struct task_struct *task_ptr;
        ...
        task_ptr = get_next_task (&job_ptr);
        while (task_ptr != NULL) {
            complete_task (task_ptr);
            task_ptr = get_next_task (&job_ptr);
        }
        ...
    }

Sequential Code (2/2)

    struct task_struct *get_next_task(struct job_struct **job_ptr) {
        struct task_struct *answer;

        if (*job_ptr == NULL) answer = NULL;
        else {
            answer = (*job_ptr)->task;
            *job_ptr = (*job_ptr)->next;
        }
        return answer;
    }

Parallelization Strategy
- Every thread should repeatedly take the next task from the list and complete it, until there are no more tasks.
- We must ensure that no two threads take the same task from the list; i.e., we must declare a critical section.

parallel Pragma
- The parallel pragma precedes a block of code that should be executed by all of the threads.
- Note: execution is replicated among all threads.

Use of parallel Pragma

    #pragma omp parallel private(task_ptr)
    {
        task_ptr = get_next_task (&job_ptr);
        while (task_ptr != NULL) {
            complete_task (task_ptr);
            task_ptr = get_next_task (&job_ptr);
        }
    }

Critical Section for get_next_task

    struct task_struct *get_next_task(struct job_struct **job_ptr) {
        struct task_struct *answer;

        #pragma omp critical
        {
            if (*job_ptr == NULL) answer = NULL;
            else {
                answer = (*job_ptr)->task;
                *job_ptr = (*job_ptr)->next;
            }
        }
        return answer;
    }

Functions for SPMD-style Programming
- The parallel pragma allows us to write SPMD-style programs.
- In these programs we often need to know the number of threads and the thread ID.
- OpenMP provides functions to retrieve this information (see the thread management functions above).
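As a small illustration of SPMD style (not from the slides), the sketch below block-partitions a loop by hand using the thread ID and thread count; the arrays and their length are hypothetical:

    #include <omp.h>

    /* Each thread derives its own block of 0..n-1 from its ID instead of
       letting a for pragma divide the iterations. */
    void spmd_add(double *a, const double *b, const double *c, int n) {
        #pragma omp parallel
        {
            int t  = omp_get_thread_num();
            int nt = omp_get_num_threads();
            int lo = n * t / nt;
            int hi = n * (t + 1) / nt;
            for (int i = lo; i < hi; i++)
                a[i] = b[i] + c[i];
        }
    }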
for Pragma
- The parallel pragma instructs every thread to execute all of the code inside the block.
- If we encounter a for loop inside the block that we want to divide among the threads, we use the for pragma:

    #pragma omp for

Example Use of for Pragma

    #pragma omp parallel private(i,j)
    for (i = 0; i < m; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            printf ("Exiting (%d)\n", i);
            break;
        }
        #pragma omp for
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
    }

single Pragma
- Suppose we only want to see the output once.
- The single pragma directs the compiler that only a single thread should execute the block of code the pragma precedes.
- Syntax:

    #pragma omp single

Use of single Pragma

    #pragma omp parallel private(i,j)
    for (i = 0; i < m; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            #pragma omp single
            printf ("Exiting (%d)\n", i);
            break;
        }
        #pragma omp for
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
    }

nowait Clause
- The compiler puts a barrier synchronization at the end of every parallel for statement.
- In our example this is necessary: if a thread leaves the loop and changes low or high, it may affect the behavior of another thread.
- If we make these private variables, then it would be okay to let threads move ahead, which could reduce execution time.

Use of nowait Clause

    #pragma omp parallel private(i,j,low,high)
    for (i = 0; i < m; i++) {
        low = a[i];
        high = b[i];
        if (low > high) {
            #pragma omp single
            printf ("Exiting (%d)\n", i);
            break;
        }
        #pragma omp for nowait
        for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
    }

Functional Parallelism
- To this point all of our focus has been on exploiting data parallelism.
- OpenMP also allows us to assign different threads to different portions of code (functional parallelism).

Functional Parallelism Example

    v = alpha();
    w = beta();
    x = gamma(v, w);
    y = delta();
    printf ("%6.2f\n", epsilon(x,y));

- We may execute alpha, beta, and delta in parallel.
  (Figure: dependence graph in which alpha and beta feed gamma, and gamma and delta feed epsilon.)

parallel sections Pragma
- Precedes a block of k blocks of code that may be executed concurrently by k threads.
- Syntax:

    #pragma omp parallel sections

section Pragma
- Precedes each block of code within the encompassing block preceded by the parallel sections pragma.
- May be omitted for the first parallel section after the parallel sections pragma.
- Syntax:

    #pragma omp section

Example of parallel sections

    #pragma omp parallel sections
    {
        #pragma omp section   /* Optional */
        v = alpha();
        #pragma omp section
        w = beta();
        #pragma omp section
        y = delta();
    }
    x = gamma(v, w);
    printf ("%6.2f\n", epsilon(x,y));

sections Pragma
- Appears inside a parallel block of code.
- Has the same meaning as the parallel sections pragma.
- If there are multiple sections pragmas inside one parallel block, this may reduce fork/join costs.

Another Approach
- Execute alpha and beta in parallel; then execute gamma and delta in parallel.
  (Figure: the same dependence graph, divided into two parallel stages.)

Use of sections Pragma

    #pragma omp parallel
    {
        #pragma omp sections
        {
            v = alpha();
            #pragma omp section
            w = beta();
        }
        #pragma omp sections
        {
            x = gamma(v, w);
            #pragma omp section
            y = delta();
        }
    }
    printf ("%6.2f\n", epsilon(x,y));

Summary
- OpenMP: an API for shared-memory parallel programming.
- Shared-memory model based on fork/join parallelism.
- Data parallelism: parallel for pragma, reduction clause.
- Functional parallelism: parallel sections pragma.
- SPMD-style programming: parallel pragma.
- Critical sections: critical pragma.
- Enhancing the performance of parallel for loops: inverting loops, conditionally parallelizing loops, changing loop scheduling.

Summary (2): OpenMP vs. MPI

    Characteristic                          OpenMP   MPI
    Suitable for multiprocessors            Yes      Yes
    Suitable for multicomputers             No       Yes
    Supports incremental parallelization    Yes      No
    Minimal extra code                      Yes      No
    Explicit control of memory hierarchy    No       Yes

Explicit Threads vs. OpenMP
- Environment: explicit threads are managed by a user-space or OS threads library; OpenMP is a software layer over threads, supported by the compiler (user input optional).
- Management and creation of threads: explicit and coded by the user with threads; implicit and compiler generated with OpenMP.
- Data sharing and synchronization: more explicit with threads; not obvious from the code with OpenMP.
- Suitability: explicit threads suit dynamic and complex task interactions; OpenMP suits static and regular task interactions.