OpenMP-Tutorial-HPCC2007-HO

Course: TLC 2, Fall 2009
School: U. Houston
HPCC '07 OpenMP Tutorial
RvdP/V1

Outline
  Introduction into Parallelization
  Multicore Processor Architectures
  An Overview of OpenMP
  Data Races
  Guest Speakers (slides not included here)
    OpenMP Under The Hood (Lei Huang, UH)
    Cluster OpenMP (Larry Meadows, Intel)
  OpenMP and Performance

Shameless Plug - "Using OpenMP"
  "Using OpenMP" by Chapman, Jost, van der Pas
  Published by MIT Press
  ISBN-10: 0-262-53302-2
  ISBN-13: 978-0-262-53302-7

Introduction Into Parallelization

What is Parallelization ?
  Parallelization is simply another optimization technique to get your results sooner
  To this end, more than one processor is used to solve the problem
  The "elapsed time" (also called wallclock time) should come down, but the total CPU time probably goes up
  The latter is a difference with serial optimization, where one makes better use of existing resources, i.e. the cost comes down

What Is Parallelization ?
  An attempt to give you a sort of definition:
    "Something" is parallel if there is a certain level of independence in the order of operations
  "Something" can be (figure label: granularity):
    A collection of program statements
    An algorithm
    A part of your program
    The problem you're trying to solve

What is a thread ?
  Loosely said, a thread consists of a series of instructions with its own program counter ("PC") and state
  A parallel program executes threads in parallel
  These threads are then scheduled onto processors
  [Figure: Thread 0 to Thread 3, each a series of instructions with its own PC, scheduled onto processors (P)]

Parallel overhead
  The total CPU time may exceed the serial CPU time:
    The newly introduced parallel portions in your program need to be executed
    Threads need time sending data to each other and synchronizing ("communication")
      Often the key contributor, spoiling all the fun
  Typically, things also get worse when increasing the number of threads
  Efficient parallelization is about minimizing the communication overhead

Communication
  [Figure: wallclock time bars for serial execution, parallel execution without communication, and parallel execution with communication]
  Parallel - Without communication
    Embarrassingly parallel: 4x faster
    Wallclock time is 1/4 of serial wallclock time
  Parallel - With communication
    Additional communication
    Less than 4x faster
    Consumes additional resources
    Wallclock time is more than 1/4 of serial wallclock time
    Total CPU time increases

Load balancing
  [Figure: per-thread timelines; with load imbalance, threads 1, 2 and 3 are idle part of the time]
  Perfect Load Balancing
    All threads finish in the same amount of time
    No thread is idle
  Load Imbalance
    Different threads need a different amount of time to finish their task
    Total wall clock time increases
    Program does not scale well

Different levels of parallelism
  Parallelization at the highest ( ) level:
    Low communication cost
    Limited to 5 processors only
    Potential load balancing issue
  Parallelization at the lowest ( ) level:
    Higher communication cost
    Not limited to a certain number of processors
    Load balancing probably less of an issue

About scalability
  Define the speed-up S(P) as S(P) := T(1)/T(P)
  The efficiency E(P) is defined as E(P) := S(P)/P
  In the ideal case, S(P)=P and E(P)=P/P=1=100%
  [Figure: speed-up S(P) versus P, with the ideal line and the point Popt marked]
  Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve
  Past this point Popt, the
  application sees less and less benefit from adding processors
  Note that both metrics give no information on the actual run-time
    As such, they can be dangerous to use
  In some cases, S(P) exceeds P
    This is called "superlinear" behaviour
    Don't count on this to happen though

Amdahl's Law
  Assume our program has a parallel fraction "f"
  This implies the execution time T(1) = f*T(1) + (1-f)*T(1)
  On P processors: T(P) = (f/P)*T(1) + (1-f)*T(1)
  Amdahl's law:
    S(P) = T(1)/T(P) = 1 / (f/P + 1-f)
  Comments:
    This "law" describes the effect the non-parallelizable part of a program has on scalability
    Note that the additional overhead caused by parallelization and speed-up because of cache effects are not taken into account

Amdahl's Law
  [Figure: speed-up (1 to 16) versus number of processors (1 to 16) for f = 1.00, 0.99, 0.75, 0.50, 0.25, 0.00]
  It is easy to scale on a small number of processors
  Scalable performance however requires a high degree of parallelization, i.e. f is very close to 1
  This implies that you need to parallelize that part of the code where the majority of the time is spent
  Use the performance analyzer to find these parts

Amdahl's Law in practice
  We can estimate the parallel fraction "f"
  Recall: T(P) = (f/P)*T(1) + (1-f)*T(1)
  It is trivial to solve this equation for "f":
    f = (1 - T(P)/T(1))/(1 - (1/P))
  Example:
    T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
    f = (1-37/100)/(1-(1/4)) = 0.63/0.75 = 0.84 = 84%
  Estimated performance on 8 processors is then:
    T(8) = (0.84/8)*100 + (1-0.84)*100 = 26.5
    S(8) = T(1)/T(8) = 3.77

Cache Coherence

Typical cache based system
  [Figure: CPU with L1 cache, L2 cache and memory; access speeds labelled fastest, faster and slow]
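
The "Amdahl's Law in practice" estimate above can be checked with a few lines of C. This is only a sketch of the slide formulas; the function names amdahl_fraction, amdahl_time and amdahl_speedup are made up for this illustration and do not come from the tutorial.

```c
/* Sketch of the Amdahl's Law formulas from the slides.
   All function names are hypothetical. */

/* Parallel fraction estimated from measured run times:
   f = (1 - T(P)/T(1)) / (1 - 1/P) */
double amdahl_fraction(double t1, double tp, int p)
{
    return (1.0 - tp / t1) / (1.0 - 1.0 / (double)p);
}

/* Predicted run time on p processors:
   T(P) = (f/P)*T(1) + (1-f)*T(1) */
double amdahl_time(double f, double t1, int p)
{
    return (f / (double)p) * t1 + (1.0 - f) * t1;
}

/* Predicted speed-up: S(P) = 1 / (f/P + 1 - f) */
double amdahl_speedup(double f, int p)
{
    return 1.0 / (f / (double)p + 1.0 - f);
}
```

With the slide's numbers, amdahl_fraction(100.0, 37.0, 4) gives 0.84, amdahl_time(0.84, 100.0, 8) gives 26.5, and amdahl_speedup(0.84, 8) is 100/26.5, roughly 3.77.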

Cache line modifications
  [Figure: a cache line from a page in main memory is held in several caches; a modification leads to a cache line inconsistency !]

Parallel Architectures

The Debate
  There is an on-going debate about labelling systems:
    It is hard to classify architectures in the first place
    Most systems share some characteristics, but not all
    For example, when do we call a system cc-NUMA ?
    Even a cache based workstation might qualify ...
  In the overview we're going to present, we classify systems based on
    Main Memory: Shared or Distributed ?
      Can all processors access all of memory, or a subset only ?
    Memory access time(s)
      Uniform, or non-uniform?

Uniform Memory Access (UMA)
  [Figure: CPUs with caches connected through an interconnect to shared memory and I/O]
  Also called "SMP" (Symmetric Multi Processor)
  Memory Access time is Uniform for all CPUs
  Interconnect is "cc":
    Bus
    Crossbar
  No fragmentation
    Memory and I/O are shared resources
  Pro
    Easy to use and to administer
    Efficient use of resources
  Con
    Said to be expensive
    Said to be non-scalable

NUMA
  [Figure: CPUs with caches, each with its own memory (M) and I/O, connected through an interconnect]
  Also called "Distributed Memory" or NORMA (No Remote Memory Access)
  Memory Access time is Non-Uniform
    Hence the name "NUMA"
  Interconnect is not "cc":
    Ethernet, Infiniband, etc, ......
  Runs 'N' copies of the OS
  Memory and I/O are distributed resources
  Pro
    Said to be cheap
    Said to be scalable
  Con
    Difficult to use and administer
    Inefficient use of resources

Cluster of SMP nodes
  [Figure: SMP nodes connected by a second-level interconnect]
  Second-level interconnect is not cache coherent
    Ethernet, Infiniband, etc, ....
Hybrid Architecture with all Pros and Cons: UMA within one SMP node NUMA across nodes RvdP/V1 HPCC '07 OpenMP Tutorial HPCC '07 OpenMP Tutorial 24 cc-NUMA 2-nd level Interconnect Two-level interconnect: UMA/SMP within one system NUMA between the systems Both interconnects support cache coherence i.e. the system is fully cache coherent Has all the advantages ('look and feel') of an SMP Downside is the Non-Uniform Memory Access time RvdP/V1 HPCC '07 OpenMP Tutorial 12 HPCC '07 OpenMP Tutorial 25 Parallel Programming Models RvdP/V1 HPCC '07 OpenMP Tutorial HPCC '07 OpenMP Tutorial 26 How To Program A Parallel Computer? There are numerous parallel programming models The ones most well-known are: y r An ste r MP u /o S Cl d an gle n si Distributed Memory PVM - Parallel Virtual Machine (obsolete) MPI - Message Passing Interface (de-facto std) Shared Memory P SM ly on Posix Threads (standardized, low level) OpenMP (de-facto standard) Automatic Parallelization (compiler does it for you) RvdP/V1 HPCC '07 OpenMP Tutorial 13 HPCC '07 OpenMP Tutorial 27 Parallel Programming Models Distributed Memory - MPI RvdP/V1 HPCC '07 OpenMP Tutorial HPCC '07 OpenMP Tutorial 28 Message Passing Interface/MPI M M 0 1 y r An ste r MP u /o S Cl d an gle n si 3 M 5 4 M 6 M M 2 M An MPI (or PVM) application consists of a set of processes All communication and data transfer is under the control of the programmer RvdP/V1 HPCC '07 OpenMP Tutorial 14 HPCC '07 OpenMP Tutorial 29 Distributed Memory model T private T private Programming Model All threads have access to their own, private, memory only Data transfer and most synchronization has to be programmed explicitly By default, data is private Transfer Mechanism T private private T private T Data is shared explicitly by exchanging buffers This programming model makes it very hard to develop auto-parallelizing compilers HPCC '07 OpenMP Tutorial RvdP/V1 HPCC '07 OpenMP Tutorial 30 Example MPI Program Send 10 integers from one thread to the other 
      include 'mpif.h'
      integer ier, count, me, you, him
      integer data(10), status(MPI_STATUS_SIZE)     ! status of receive operation
      .....
      you = 1
      him = 0
      call MPI_Init(ier)                            ! initialize MPI environment
      call MPI_Comm_Rank(MPI_COMM_WORLD, me, ier)   ! get TID within rank
      If ( me .eq. 0 ) Then
!        Node 0 sends
         call MPI_Send(data, 10, MPI_INTEGER, you, 1957, MPI_COMM_WORLD, ier)
      Else If ( me .eq. 1 ) Then
!        Node 1 receives
         call MPI_Recv(data, 10, MPI_INTEGER, him, 1957, MPI_COMM_WORLD, status, ier)
      End If
      If ( ier .ne. 0 ) stop 'Beagle 2, we got a problem here'
      .....
      call MPI_Finalize(ier)                        ! leave MPI environment
      ....

Run-time behavior
  TID = 0: you = 1, him = 0, me = 0
    MPI_Send: 10 integers, destination = you = 1, label = 1957
  TID = 1: you = 1, him = 0, me = 1
    MPI_Recv: 10 integers, sender = him = 0, label = 1957
  Yes ! Connection established

Parallel Programming Models
  Shared Memory - OpenMP

  [Figure: processors 0, 1, ..., P attached to one shared memory]
  http://www.openmp.org

Shared Memory model
  [Figure: threads (T), each with private data, all attached to shared memory]
  Programming Model
    All threads have access to the same, globally shared, memory
    Data can be shared or private
    Shared data is accessible by all threads
    Private data can only be accessed by the thread that owns it
    Data transfer is transparent to the programmer
    Synchronization takes place, but it is mostly implicit

About data
  In a shared memory parallel program variables have a "label" attached to them:
    Labelled "Private"
      Visible to one thread only
      Change made in local data, is not seen by others
      Example - Local variables in a function that is executed in parallel
    Labelled "Shared"
      Visible to all threads
      Change made in global data, is seen by all others
      Example - Global data
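
The "private" and "shared" labels described under "About data" can be illustrated with a small C function. This is a sketch under assumptions: the function name fill_squares and its arguments are invented for this example. If the compiler is run without OpenMP support the pragma is ignored and the loop simply runs serially with the same result.

```c
/* Hypothetical example (not from the slides): fills result[i] with i*i. */
void fill_squares(int *result, int n)
{
    int i;    /* labelled private: every thread has its own loop index      */
    int tmp;  /* labelled private: a change made by one thread is not seen
                 by the others                                              */

    /* result and n are labelled shared: all threads see the same array,
       and each thread writes a distinct set of elements. */
    #pragma omp parallel for default(none) shared(result, n) private(i, tmp)
    for (i = 0; i < n; i++) {
        tmp = i * i;
        result[i] = tmp;
    }
}
```

Because tmp is private, no thread can overwrite another thread's intermediate value; making tmp shared instead would introduce a data race.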

Example - Matrix times vector
    #pragma omp parallel for default(none) \
            private(i,j,sum) shared(m,n,a,b,c)
    for (i=0; i<m; i++)
    {
       sum = 0.0;
       for (j=0; j<n; j++)
          sum += b[i][j]*c[j];
       a[i] = sum;
    }
  [Figure: a = b * c, with row index i and column index j]
  TID = 0: for (i=0,1,2,3,4)
    i = 0: sum = b[i=0][j]*c[j]; a[0] = sum
    i = 1: sum = b[i=1][j]*c[j]; a[1] = sum
  TID = 1: for (i=5,6,7,8,9)
    i = 5: sum = b[i=5][j]*c[j]; a[5] = sum
    i = 6: sum = b[i=6][j]*c[j]; a[6] = sum
  ... etc ...

OpenMP performance
  [Figure: performance (Mflop/s, 0 to 2400) versus memory footprint (KByte, 1 to 1000000) for 1 thread, 2 threads and 4 threads; annotations: "Matrix too small *" and "scales"]
  SunFire 6800, UltraSPARC III Cu @ 900 MHz, 8 MB L2-cache
  *) With the IF-clause in OpenMP this performance degradation can be avoided

A Black and White comparison
  MPI                                     OpenMP
  De-facto standard                       De-facto standard
  Endorsed by all key players             Endorsed by all key players
  Runs on any number of (cheap) systems   Limited to one (SMP) system
  "Grid Ready"                            Not (yet?) "Grid Ready"
  High and steep learning curve           Easier to get started (but, ...)
  You're on your own                      Assistance from compiler
  All or nothing model                    Mix and match model
  No data scoping (shared, private, ..)   Requires data scoping
  More widely used (but ....)             Increasingly popular (CMT !)
  Sequential version is not preserved     Preserves sequential code
  Requires a library only                 Need a compiler
  Requires a run-time environment         No special environment
  Easier to understand performance        Performance issues implicit

Multicore Processor Architectures

What is a Multicore Architecture ?
  In a Multicore processor, there is more than "one" core
    A "core" is not well defined - A (very) simplified view is to see it as a CPU
  Different implementations possible and available
    Could be two levels of parallelism for example
      Like Sun's UltraSPARC(TM) T1 or T2 processor
      Multiple cores and multiple threads within one processor
  Often, there is also a cache hierarchy of private and shared caches

The impact of Multicore
  Multicore has not only arrived, it is mainstream now and will be with us for a long time to come
  To illustrate the differences and similarities, we now briefly present and discuss several multicore processors that are available today
  For the developers, it mostly matters there is hardware parallelism at the chip level they can take advantage of

A Generic Multicore Architecture
  [Figure: several cores, each running multiple threads and having its own cache(s), connected to shared cache(s) and a system interconnect (Memory, I/O, etc)]

The UltraSPARC IV+ Processor
  [Figure: die photo with labels L2/L3 Cntl, L2 Data (Left), L2 Tags, L2 Data (Right), L3 Tags, SIU, MCU]

UltraSPARC IV+ - Block diagram
  [Figure: two cores, each a 14 stage pipeline, 4-way superscalar, One Thread, with L1 D-cache and L1 I-cache; L2 cache, L3 cache and a system interconnect (Memory, I/O, etc)]

The UltraSPARC T1 Processor

Throughput Computing
  Key Concept
    Other threads continue while one or more threads wait for a resource
  [Figure: threads T0 and T1 alternating between Execute and Wait phases]
  Processor utilization improves

UltraSPARC T1 - Block diagram
  [Figure: eight cores, each with Four Threads, a 6 stage pipeline (1 way), an L1 D-cache and an L1 I-cache; a crossbar connects the cores to the FPU, I/O, and an L2 cache with bank 0 to bank 3 in front of memory]

The UltraSPARC T2 Processor

UltraSPARC T2 - Block diagram
  [Figure: eight cores, each with Eight Threads, two 8 stage pipelines (1 way each), an FPU, an L1 D-cache and an L1 I-cache; a crossbar connects the cores to I/O and an L2 cache with bank 0 to bank 7 in front of memory]
  FPU has a 12 stage pipeline

AMD Opteron

AMD Opteron - Dual core
  [Figure: two cores, each a 12 stage pipeline, 3-way superscalar (x86 insts), One Thread, with L1 D-cache, L1 I-cache and its own L2 cache; System Request Queue, Cross bar, HyperTransport Links and Memory]

AMD Opteron - Quad core
  [Figure: four cores, each a 12 stage pipeline, 3-way superscalar (x86 insts), One Thread, with L1 I-cache, L1 D-cache and its own L2 cache; L3 cache, System Request Queue, Cross bar, HyperTransport Links and Memory]

HyperTransport Interconnect
  [Figure: system with AMD 8131 and AMD 8111 components]
  Each CPU has dedicated memory bandwidth
  I/O is independent of memory access
  Adding CPUs adds memory and I/O bandwidth

Intel Xeon 5300 ('Clovertown')

Intel Xeon 5300
  [Figure: four cores, each a 14 stage pipeline, 4-way superscalar (x86 insts), One Thread, with L1 I-cache and L1 D-cache; two L2 caches; Northbridge and Memory]

Summary
  Multicore has arrived and is here to stay
  Substantial differences between architectures
    Number of cores
    Number of threads per core
    Cache organization
      Caches private to one core
        Typical at the L1 level (instruction, data, TLB)
      Shared caches
        Could be more than one
        How many cores share one cache
  To the developer this means that about every processor is, or soon will be, a (small) parallel computer

Wrap-Up

Parallelism is everywhere!
  Multiple levels of parallelism (figure label: granularity):
    Instruction Level (ILP)
    Chip Level (Multicore)
    System Level (SMP)
    Grid Level (LAN/WAN)

Generic Parallel Architecture *
  [Figure: boards with processors (P) and caches ($): registers, cache, local memory, Interconnect Level 1, Interconnect Level 2, remote memory; latency increases and bandwidth decreases along this path]
  *) and simplified ....

Programming Models revisited
  [Table: Architecture (UMA/SMP, NUMA, Cluster of SMPs, cc-NUMA) versus "Shared Memory: Efficient ?" and "Distributed Memory: Efficient ?"; entries as extracted: yes, not available, yes (very !), maybe *, maybe *, yes, yes (within one node), depends]
  One can map any programming model onto any architecture
  Making it efficient is the key problem to solve
  *) Depends on interconnect

Parallelizing an Application
  The question whether an application is parallel, or not, has nothing to do with the programming model
  Two possibilities (for the time consuming part):
    If parallel, decide on the programming model:
      Message Passing
        Do It Yourself
      Shared Memory
        Use the compiler
        May need directives to assist the compiler
    If not parallel:
      Try to rewrite or change the algorithm and go back to step

An Overview of OpenMP

Outline
  OpenMP Guided Tour
  OpenMP Overview
    Directives
    Environment variables
    Run-time environment
  Global Data

OpenMP Guided Tour
  http://www.openmp.org
  http://www.compunity.org

What is OpenMP?
  De-facto standard API for writing shared memory parallel applications in C, C++, and Fortran
  Consists of:
    Compiler directives
    Run time routines
    Environment variables
  Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org)
  Latest Specification: Version 2.5
    Supported by the Sun Studio 12 compilers
  Version 3.0 has been in the works since September 2007, final specification expected late 2007/early 2008

When to consider OpenMP?
  The compiler may not be able to do the parallelization in the way you like to see it:
    A loop is not parallelized
      The data dependence analysis is not able to determine whether it is safe to parallelize or not
    The granularity is not high enough
      The compiler lacks information to parallelize at the highest possible level
  This is when explicit parallelization through OpenMP directives and functions comes into the picture

Advantages of OpenMP
  Good performance and scalability
    If you do it right ....
  De-facto standard
  An OpenMP program is portable
    Supported by a large number of compilers
  Requires little programming effort
  Allows the program to be parallelized incrementally
  Maps naturally onto a multicore architecture:
    Lightweight
    Each OpenMP thread in the program can be executed by a hardware thread

A first OpenMP example
  For-loop with independent iterations:
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
  For-loop parallelized using an OpenMP pragma:
    #pragma omp parallel for \
            shared(n, a, b, c) \
            private(i)
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
  Compile and run:
    % cc -xopenmp source.c
    % setenv OMP_NUM_THREADS 4
    % a.out

The OpenMP execution model
  Fork and Join Model
  [Figure: the master thread runs alone until a parallel region, where worker threads join in; the threads synchronize at the end of each parallel region]

Example parallel execution
  [Figure: a + b = c computed by four threads; Thread 0: iterations 1-250, Thread 1: 251-500, Thread 2: 501-750, Thread 3: 751-1000]

A loop parallelized with OpenMP
  C:
    #pragma omp parallel default(none) \
            shared(n,x,y) private(i)
    {
       #pragma omp for
       for (i=0; i<n; i++)
          x[i] += y[i];
    } /*-- End of parallel region --*/
  Fortran:
    !$omp parallel default(none) &
    !$omp shared(n,x,y) private(i)
    !$omp do
          do i = 1, n
             x(i) = x(i) + y(i)
          end do
    !$omp end do
    !$omp end parallel
  (default, shared and private are the clauses)

Components of OpenMP
  Directives
    Parallel regions
    Work sharing
    Synchronization
    Data-sharing attributes
      private
      firstprivate
      lastprivate
      shared
      reduction
    Orphaning
  Environment variables
    Number of threads
    Scheduling type
    Dynamic thread adjustment
    Nested parallelism
  Runtime environment
    Number of threads
    Thread ID
    Dynamic thread adjustment
    Nested parallelism
    Timers
    API for locking

Directive format
  C:
    directives are case sensitive
    Syntax: #pragma omp directive [clause [clause] ...]
    Continuation: use \ in pragma
    Conditional compilation: _OPENMP macro is set
  Fortran:
    directives are case insensitive
    Syntax: sentinel directive [clause [[,] clause]...]
    The sentinel is one of the following:
      !$OMP or C$OMP or *$OMP (fixed format)
      !$OMP (free format)
    Continuation: follows the language syntax
    Conditional compilation: !$ or C$ -> 2 spaces

A more elaborate example
    #pragma omp parallel if (n>limit) default(none) \
            shared(n,a,b,c,x,y,z) private(f,i,scale)
    {                                /* parallel region */
       f = 1.0;                      /* statement is executed by all threads */

       #pragma omp for nowait        /* parallel loop (work is distributed) */
       for (i=0; i<n; i++)
          z[i] = x[i] + y[i];

       #pragma omp for nowait        /* parallel loop (work is distributed) */
       for (i=0; i<n; i++)
          a[i] = b[i] + c[i];

       #pragma omp barrier           /* synchronization */
       ....
       scale = sum(a,0,n) + sum(z,0,n) + f;   /* statement is executed by all threads */
       ....
    } /*-- End of parallel region --*/

Another OpenMP example
     1  void mxv_row(int m,int n,double *a,double *b,double *c)
     2  {
     3   int i, j;
     4   double sum;
     5
     6  #pragma omp parallel for default(none) \
     7              private(i,j,sum) shared(m,n,a,b,c)
     8   for (i=0; i<m; i++)
     9   {
    10     sum = 0.0;
    11     for (j=0; j<n; j++)
    12       sum += b[i*n+j]*c[j];
    13     a[i] = sum;
    14   } /*-- End of parallel for --*/
    15  }

    % cc -c -fast -xrestrict -xopenmp -xloopinfo mxv_row.c
    "mxv_row.c", line 8: PARALLELIZED, user pragma used
    "mxv_row.c", line 11: not parallelized

OpenMP performance
  [Figure: performance (Mflop/s, 0 to 2400) versus memory footprint (KByte, 1 to 1000000) for OpenMP - 1 CPU, OpenMP - 2 CPUs and OpenMP - 4 CPUs; annotations: "Matrix too small *" and "scales"]
  SunFire 6800, UltraSPARC III Cu @ 900 MHz, 8 MB L2-cache
  *) With the IF-clause in OpenMP this performance degradation
  can be avoided

OpenMP Directives

Terminology and behavior
  OpenMP Team := Master + Workers
  A Parallel Region is a block of code executed by all threads simultaneously
    The master thread always has thread ID 0
    Thread adjustment (if enabled) is only done before entering a parallel region
    Parallel regions can be nested, but support for this is implementation dependent
    An "if" clause can be used to guard the parallel region; in case the condition evaluates to "false", the code is executed serially
  A work-sharing construct divides the execution of the enclosed code region among the members of the team; in other words: they split the work

About OpenMP clauses
  Many OpenMP directives support clauses
  These clauses are used to specify additional information with the directive
  For example, private(a) is a clause to the for directive:
    #pragma omp for private(a)
  Before we present an overview of all the directives, we discuss several of the OpenMP clauses first
  The specific clause(s) that can be used, depends on the directive

The if/private/shared clauses
  if (scalar expression)
    Only execute in parallel if expression evaluates to true
    Otherwise, execute serially
      #pragma omp parallel if (n > threshold) \
              shared(n,x,y) private(i)
      {
         #pragma omp for
         for (i=0; i<n; i++)
            x[i] += y[i];
      } /*-- End of parallel region --*/
  private (list)
    No storage association with original object
    All references are to the local object
    Values are undefined on entry and exit
  shared (list)
    Data is accessible by all threads in the team
    All threads access the same address space

About storage association
  Private variables are undefined on entry and exit of the parallel region
  The value of the original variable
  (before the parallel region) is undefined after the parallel region !
  A private variable within a parallel region has no storage association with the same variable outside of the region
  Use the first/last private clause to override this behavior
  We illustrate these concepts with an example

Example private variables
    main()
    {
      A = 10;

    #pragma omp parallel
      {
      #pragma omp for private(i) firstprivate(A) lastprivate(B)...
      for (i=0; i<n; i++)
      {
         ....
         B = A + i;    /*-- A undefined, unless declared firstprivate --*/
         ....
      }
      C = B;           /*-- B undefined, unless declared lastprivate --*/
      } /*-- End of OpenMP parallel region --*/
    }
  Other variants of the pragma shown on the slide:
    #pragma omp for private(i,B) firstprivate(A) ...
    #pragma omp for private(i,A,B) ...
  Disclaimer: This code fragment is not very meaningful and only serves to demonstrate the clauses

The first/last private clauses
  firstprivate (list)
    All variables in the list are initialized with the value the original object had before entering the parallel construct
  lastprivate (list)
    The thread that executes the sequentially last iteration or section updates the value of the objects in the list

The default clause
  default ( none | shared | private )    Fortran
  default ( none | shared )              C/C++
  Note: default(private) is not supported in C/C++
  none
    No implicit defaults
    Have to scope all variables explicitly
  shared
    All variables are shared
    The default in absence of an explicit "default" clause
  private
    All variables are private to the thread
    Includes common block data, unless THREADPRIVATE

The reduction clause - Example
          sum = 0.0
    !$omp parallel default(none) &
    !$omp shared(n,x) private(i)
    !$omp do reduction (+:sum)
          do i = 1, n
             sum = sum + x(i)
          end do
    !$omp end do
    !$omp end parallel
          print *,sum
  Variable SUM is a shared variable
  Care needs to be taken when
  updating shared variable SUM
  With the reduction clause, the OpenMP compiler generates code such that a race condition is avoided

The reduction clause
  reduction ( [operator | intrinsic] : list )    Fortran
  reduction ( operator : list )                  C/C++
  Reduction variable(s) must be shared variables
  A reduction is defined as (check the docs for details):
    Fortran
      x = x operator expr
      x = expr operator x
      x = intrinsic (x, expr_list)
      x = intrinsic (expr_list, x)
    C/C++
      x = x operator expr
      x = expr operator x
      x++, ++x, x--, --x
      x <binop> = expr
  Note that the value of a reduction variable is undefined from the moment the first thread reaches the clause till the operation has completed
  The reduction can be hidden in a function call

Barrier/1
  Suppose we run each of these two loops in parallel over i:
    for (i=0; i < N; i++)
       a[i] = b[i] + c[i];

    for (i=0; i < N; i++)
       d[i] = a[i] + b[i];
  This may give us a wrong answer (one day)
  Why ?

Barrier/2
  We need to have updated all of a[ ] first, before using a[ ] *
    for (i=0; i < N; i++)
       a[i] = b[i] + c[i];

    wait ! (barrier)

    for (i=0; i < N; i++)
       d[i] = a[i] + b[i];
  All threads wait at the barrier point and only continue when all threads have reached the barrier point
  *) If there is the guarantee that the mapping of iterations onto threads is identical for both loops, there will not be a data race in this case

Barrier/3
  [Figure: threads reaching the barrier region at different times; early threads are idle until the last thread arrives]
  Barrier syntax in OpenMP:
    #pragma omp barrier
    !$omp barrier

When to use barriers ?
When data is updated asynchronously and the data integrity is at risk
Examples:
  Between parts in the code that read and write the same section of memory
  After one timestep/iteration in a solver
Unfortunately, barriers tend to be expensive and also may not scale to a large number of processors
Therefore, use them with care

The nowait clause

To minimize synchronization, some OpenMP directives/pragmas support the optional nowait clause
If present, threads do not synchronize/wait at the end of that particular construct
In Fortran the nowait clause is appended at the closing part of the construct
In C, it is one of the clauses on the pragma

    #pragma omp for nowait
    {
       :
    }

    !$omp do
       :
       :
    !$omp end do nowait

The Parallel Region

A parallel region is a block of code executed by multiple threads simultaneously

    !$omp parallel [clause[[,] clause] ...]
       "this is executed in parallel"
    !$omp end parallel   (implied barrier)

    #pragma omp parallel [clause[[,] clause] ...]
    {
       "this is executed in parallel"
    }   (implied barrier)

The Parallel Region - Clauses

A parallel region supports the following clauses:

    if             (scalar expression)
    private        (list)
    shared         (list)
    default        (none|shared)            (C/C++)
    default        (none|shared|private)    (Fortran)
    reduction      (operator: list)
    copyin         (list)
    firstprivate   (list)
    num_threads    (scalar_int_expr)

Work-sharing constructs

The OpenMP work-sharing constructs

    #pragma omp for         #pragma omp sections     #pragma omp single
    {                       {                        {
       ....                    ....                     ....
    }                       }                        }

    !$OMP DO                !$OMP SECTIONS           !$OMP SINGLE
       ....                    ....                     ....
    !$OMP END DO            !$OMP END SECTIONS       !$OMP END SINGLE

The work is distributed over the threads
Must be enclosed in a parallel region
Must be encountered by all threads in the team, or none at all
No implied barrier on entry; implied barrier on exit (unless nowait is specified)
A work-sharing construct does not launch any new threads

The workshare construct

Fortran has a fourth worksharing construct:

    !$OMP WORKSHARE
       <array syntax>
    !$OMP END WORKSHARE [NOWAIT]

Example:

    !$OMP WORKSHARE
       A(1:M) = A(1:M) + B(1:M)
    !$OMP END WORKSHARE NOWAIT

The omp for/do directive

The iterations of the loop are distributed over the threads

    #pragma omp for [clause[[,] clause] ...]
       <original for-loop>

    !$omp do [clause[[,] clause] ...]
       <original do-loop>
    !$omp end do [nowait]

Clauses supported:
    private
    firstprivate
    lastprivate
    reduction
    ordered*
    schedule   (covered later)
    nowait
*) Required if ordered sections are in the dynamic extent of this construct

The omp for directive - Example

    #pragma omp parallel default(none)\
            shared(n,a,b,c,d) private(i)
    {
       #pragma omp for nowait
       for (i=0; i<n-1; i++)
          b[i] = (a[i] + a[i+1])/2;

       #pragma omp for nowait
       for (i=0; i<n; i++)
          d[i] = 1.0/c[i];

    } /*-- End of parallel region --*/
      (implied barrier)

The sections directive

The individual code blocks are distributed over the threads

    #pragma omp sections [clause(s)]
    {
    #pragma omp section
       <code block1>
    #pragma omp section
       <code block2>
    #pragma omp section
       :
    }

    !$omp sections [clause(s)]
    !$omp section
       <code block1>
    !$omp section
       <code block2>
    !$omp section
       :
    !$omp end sections [nowait]

Clauses supported:
    private
    firstprivate
    lastprivate
    reduction
    nowait
Note: The SECTION directive must be within the lexical extent of the SECTIONS/END SECTIONS pair
The sections directive - Example

    #pragma omp parallel default(none)\
            shared(n,a,b,c,d) private(i)
    {
       #pragma omp sections nowait
       {
          #pragma omp section
          for (i=0; i<n-1; i++)
             b[i] = (a[i] + a[i+1])/2;

          #pragma omp section
          for (i=0; i<n; i++)
             d[i] = 1.0/c[i];

       } /*-- End of sections --*/
    } /*-- End of parallel region --*/

Combined work-sharing constructs

Single PARALLEL loop
    #pragma omp parallel              #pragma omp parallel for
    #pragma omp for                   for (....)
    for (...)

    !$omp parallel                    !$omp parallel do
    !$omp do                          ...
    ...                               !$omp end parallel do
    !$omp end do
    !$omp end parallel

Single WORKSHARE loop
    !$omp parallel                    !$omp parallel workshare
    !$omp workshare                   ...
    ...                               !$omp end parallel workshare
    !$omp end workshare
    !$omp end parallel

Single PARALLEL sections
    #pragma omp parallel              #pragma omp parallel sections
    #pragma omp sections              { ...}
    { ... }

    !$omp parallel                    !$omp parallel sections
    !$omp sections                    ...
    ...                               !$omp end parallel sections
    !$omp end sections
    !$omp end parallel

Orphaning

    :
    !$omp parallel
    :
    call dowork()
    :
    !$omp end parallel
    :

    subroutine dowork()
    :
    !$omp do               ! orphaned work-sharing directive
    do i = 1, n
       :
    end do
    !$omp end do
    :

The OpenMP standard does not restrict worksharing and synchronization directives (omp for, omp single, critical, barrier, etc.) to be within the lexical extent of a parallel region. These directives can be orphaned
That is, they can appear outside the lexical extent of a parallel region

More on orphaning

    (void) dowork();          /*-- Sequential FOR --*/

    #pragma omp parallel
    {
       (void) dowork();       /*-- Parallel FOR --*/
    }

    void dowork()
    {
    #pragma omp for
       for (i=0;....)
       {
          :
       }
    }

When an orphaned worksharing or synchronization directive is encountered in the sequential part of the program (outside the dynamic extent of any parallel region), it is executed by the master thread only.
In effect, the directive will be ignored

Parallelizing bulky loops

    for (i=0; i<n; i++) /* Parallel loop */
    {
       a = ...
       b = ... a ..
       c[i] = ....
       ......
       for (j=0; j<m; j++)
       {
          <a lot more code in this loop>
       }
       ......
    }

Step 1: "Outlining"

    for (i=0; i<n; i++) /* Parallel loop */
    {
       (void) FuncPar(i,m,c,...)
    }

    void FuncPar(i,m,c,....)
    {
       float a, b; /* Private data */
       int j;
       a = ...
       b = ... a ..
       c[i] = ....
       ......
       for (j=0; j<m; j++)
       {
          <a lot more code in this loop>
       }
       ......
    }

Still a sequential program
Should behave identically
Easy to test for correctness
But, parallel by design

Step 2: Parallelize

    #pragma omp parallel for private(i) shared(m,c,..)
    for (i=0; i<n; i++) /* Parallel loop */
    {
       (void) FuncPar(i,m,c,...)
    } /*-- End of parallel for --*/

    void FuncPar(i,m,c,....)
    {
       float a, b; /* Private data */
       int j;
       a = ...
       b = ... a ..
       c[i] = ....
       ......
       for (j=0; j<m; j++)
       {
          <a lot more code in this loop>
       }
       ......
    }

Minimal scoping required
Less error prone

Single processor region/1

This construct is ideally suited for I/O or initializations

Original Code
    .....
    "read a[0..N-1]";
    .....

Parallel Version
    "declare A to be shared"
    #pragma omp parallel
    {
       .....
                              /* one volunteer requested */
       "read a[0..N-1]";
                              /* thanks, we're done */
       .....                  /* May have to insert a barrier here */
    }

Single processor region/2

Usually, there is a barrier at the end of the region
Might therefore be a scalability bottleneck (Amdahl's law)

[Figure: threads along a time axis; one thread executes the single processor region while the other threads wait in the barrier]

SINGLE and MASTER construct

Only one thread in the team executes the code enclosed

    #pragma omp single [clause[[,] clause] ...]
    {
       <code-block>
    }

    !$omp single [clause[[,] clause] ...]
       <code-block>
    !$omp end single [nowait]

Only the master thread executes the code block:

    #pragma omp master
    {<code-block>}

    !$omp master
       <code-block>
    !$omp end master

There is no implied barrier on entry or exit !

Critical region/1

If sum is a shared variable, this loop can not run in parallel

    for (i=0; i < N; i++){
       .....
       sum += a[i];
       .....
    }

We can use a critical region for this:

    for (i=0; i < N; i++){
       .....
                              /* one at a time can proceed */
       sum += a[i];
                              /* next in line, please */
       .....
    }

Critical region/2

Useful to avoid a race condition, or to perform I/O (but which still has random order)
Be aware that your parallel computation may be serialized and so this could introduce a scalability bottleneck (Amdahl's law)

[Figure: threads along a time axis passing through the critical region one at a time]

Critical and Atomic constructs

Critical: All threads execute the code, but only one at a time:

    #pragma omp critical [(name)]
    {<code-block>}

    !$omp critical [(name)]
       <code-block>
    !$omp end critical [(name)]

There is no implied barrier on entry or exit !

Atomic: only the load and store are atomic ....

    #pragma omp atomic
       <statement>

    !$omp atomic
       <statement>

This is a lightweight, special form of a critical section

    #pragma omp atomic
       a[indx[i]] += b[i];

More synchronization constructs

The enclosed block of code is executed in the order in which iterations would be executed sequentially:

    #pragma omp ordered
    {<code-block>}

    !$omp ordered
       <code-block>
    !$omp end ordered

May introduce serialization (could be expensive)

Ensure that all threads in a team have a consistent view of certain objects in memory:

    #pragma omp flush [(list)]
    !$omp flush [(list)]

In the absence of a list, all visible variables are flushed; this could be expensive

Load Balancing

Load balancing is an important aspect of performance
For regular operations (e.g. a vector addition), load balancing is not an issue
For less regular workloads, care needs to be taken in distributing the work over the threads
Examples:
  Transposing a matrix
  Multiplication of triangular matrices
  Parallel searches in a linked list
For these irregular situations, the schedule clause supports various iteration scheduling algorithms

The schedule clause/1

    schedule ( static | dynamic | guided [, chunk] )
    schedule (runtime)

static [, chunk]
  Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
  In absence of "chunk", each thread executes one block of approx. N/P iterations for a loop of length N and P threads

Example: Loop of length 16, 4 threads:

    TID          0        1        2        3
    no chunk     1-4      5-8      9-12     13-16
    chunk = 2    1-2      3-4      5-6      7-8
                 9-10     11-12    13-14    15-16

The schedule clause/2

dynamic [, chunk]
  Fixed portions of work; size is controlled by the value of chunk
  When a thread finishes, it starts on the next portion of work
guided [, chunk]
  Same dynamic behavior as "dynamic", but size of the portion of work decreases exponentially
runtime
  Iteration scheduling scheme is set at runtime through environment variable OMP_SCHEDULE

The experiment

500 iterations on 4 threads

[Figure: thread ID (0-3) versus iteration number (0-500), one panel each for static; dynamic, 5; guided, 5]

OpenMP Environment Variables

    OpenMP environment variable           Default for Sun OpenMP
    OMP_NUM_THREADS n                     1
    OMP_SCHEDULE "schedule,[chunk]"       static, "N/P" (1)
    OMP_DYNAMIC { TRUE | FALSE }          TRUE (2)
    OMP_NESTED { TRUE | FALSE }           FALSE (3)

(1) The chunk size approximately equals the number of iterations (N) divided by the number of threads (P)
(2) The number of threads is limited to the number of on-line processors in the system. This can be changed by setting OMP_DYNAMIC to FALSE.
(3) Multi-threaded execution of inner parallel regions in nested parallel regions is supported as of Sun Studio 10
Note: The names are in uppercase, the values are case insensitive

OpenMP Run-time Environment

OpenMP run-time environment

OpenMP provides several user-callable functions
  To control and query the parallel environment
  General purpose semaphore/lock routines
    OpenMP 2.0: supports nested locks
    Nested locks are not covered in detail here
The run-time functions take precedence over the corresponding environment variables
Recommended to use under control of an #ifdef for _OPENMP (C/C++) or conditional compilation (Fortran)
C/C++ programs need to include <omp.h>
Fortran: may want to use "USE omp_lib"

OpenMP run-time library

OpenMP Fortran library routines are external functions
Their names start with OMP_ but usually have an integer or logical return type
Therefore these functions must be declared explicitly
On Sun systems the following features are available:
  USE omp_lib
  INCLUDE 'omp_lib.h'
  #include "omp_lib.h" (preprocessor directive)
Compilation with -Xlist also reports any type mismatches
The f95 -XlistMP option for more extensive checking can be used as well

Run-time library overview

    Name                   Functionality
    omp_set_num_threads    Set number of threads
    omp_get_num_threads    Return number of threads in team
    omp_get_max_threads    Return maximum number of threads
    omp_get_thread_num     Get thread ID
    omp_get_num_procs      Return maximum number of processors
    omp_in_parallel        Check whether in parallel region
    omp_set_dynamic        Activate dynamic thread adjustment (but implementation is free to ignore this)
    omp_get_dynamic        Check for dynamic thread adjustment
    omp_set_nested         Activate nested parallelism (but implementation is free to ignore this)
    omp_get_nested         Check for nested parallelism
    omp_get_wtime          Returns wall clock time
    omp_get_wtick          Number of seconds between clock ticks

Example

    #pragma omp parallel single(...)
       NumP = omp_get_num_threads();

    allocate WorkSpace[NumP][N];

    #pragma omp parallel for (...)
    for (i=0; i < N; i++)
    {
       TID = omp_get_thread_num();
       .....
       WorkSpace[TID][i] = .... ;
       .....
       ... = WorkSpace[TID][i];
       .....
    }

[Figure: WorkSpace drawn as an array with dimensions N and NumP]

OpenMP locking routines

Locks provide greater flexibility over critical sections and atomic updates:
  Possible to implement asynchronous behavior
  Not block structured
The so-called lock variable is a special variable:
  Fortran: type INTEGER and of a KIND large enough to hold an address
  C/C++: type omp_lock_t and omp_nest_lock_t for nested locks
Lock variables should be manipulated through the API only
It is illegal, and behavior is undefined, in case a lock variable is used without the appropriate initialization

Nested locking

Simple locks: may not be locked if already in a locked state
Nestable locks: may be locked multiple times by the same thread before being unlocked
In the remainder, we discuss simple locks only
The interface for functions dealing with nested locks is similar (but using nestable lock variables):

    Simple locks          Nestable locks
    omp_init_lock         omp_init_nest_lock
    omp_destroy_lock      omp_destroy_nest_lock
    omp_set_lock          omp_set_nest_lock
    omp_unset_lock        omp_unset_nest_lock
    omp_test_lock         omp_test_nest_lock

OpenMP locking example

    parallel region - begin
    TID = 0                   TID = 1
    acquire lock              Other Work
    Protected Region          Other Work
    release lock              acquire lock
                              Protected Region
                              release lock
    parallel region - end

The protected region contains the update of a shared variable
One thread acquires the lock and performs the update
Meanwhile, the other thread performs some other work
When the lock is released again, the other thread performs the update

Locking example - The code

          Program Locks
          ....
          Call omp_init_lock (LCK)              ! Initialize lock variable
    !$omp parallel shared(SUM,LCK) private(TID)
          TID = omp_get_thread_num()
          ! Check availability of lock (also sets the lock)
          Do While ( omp_test_lock (LCK) .EQV. .FALSE. )
             Call Do_Something_Else(TID)
          End Do
          Call Do_Work(SUM,TID)
          Call omp_unset_lock (LCK)             ! Release lock again
    !$omp end parallel
          Call omp_destroy_lock (LCK)           ! Remove lock association
          Stop
          End

Example output for 2 threads

    TID: 1 at 09:07:27 => entered parallel region
    TID: 1 at 09:07:27 => done with WAIT loop and has the lock
    TID: 1 at 09:07:27 => ready to do the parallel work
    TID: 1 at 09:07:27 => this will take about 18 seconds
    TID: 0 at 09:07:27 => entered parallel region
    TID: 0 at 09:07:27 => WAIT for lock - will do something else for 5 seconds
    TID: 0 at 09:07:32 => WAIT for lock - will do something else for 5 seconds
    TID: 0 at 09:07:37 => WAIT for lock - will do something else for 5 seconds
    TID: 0 at 09:07:42 => WAIT for lock - will do something else for 5 seconds
    TID: 1 at 09:07:45 => done with my work
    TID: 1 at 09:07:45 => done with work loop - released the lock
    TID: 1 at 09:07:45 => ready to leave the parallel region
    TID: 0 at 09:07:47 => done with WAIT loop and has the lock
    TID: 0 at 09:07:47 => ready to do the parallel work
    TID: 0 at 09:07:47 => this will take about 18 seconds
    TID: 0 at 09:08:05 => done with my work
    TID: 0 at 09:08:05 => done with work loop - released the lock
    TID: 0 at 09:08:05 => ready to leave the parallel region
    Done at 09:08:05 - value of SUM is 1100     (used to check the answer)

Note: program has been instrumented to get this information

Global Data
Global data - An example

    program global_data
    ....
    include "global.h"
    ....
    !$omp parallel do private(j)
    do j = 1, n
       call suba(j)
    end do
    !$omp end parallel do
    ......

file global.h:
    common /work/a(m,n),b(m)

    subroutine suba(j)
    .....
    include "global.h"
    .....
    do i = 1, m
       b(i) = j                      ! Data Race !
    end do
    do i = 1, m
       a(i,j) = func_call(b(i))
    end do
    return
    end

Global data - A Data Race!

    Thread 1                         Thread 2
    call suba(1)                     call suba(2)

    subroutine suba(j=1)             subroutine suba(j=2)
    do i = 1, m                      do i = 1, m
       b(i) = 1                         b(i) = 2
    end do                           end do
    ....                             ....
    do i = 1, m                      do i = 1, m
       a(i,1)=func_call(b(i))           a(i,2)=func_call(b(i))
    end do                           end do

Array b is shared between the two threads

Example - Solution

    program global_data
    ....
    include "global_ok.h"
    ....
    !$omp parallel do private(j)
    do j = 1, n
       call suba(j)
    end do
    !$omp end parallel do
    ......

file global_ok.h:
    integer, parameter:: nthreads=4
    common /work/a(m,n)
    common /tprivate/b(m,nthreads)

    subroutine suba(j)
    .....
    include "global_ok.h"
    .....
    TID = omp_get_thread_num()+1
    do i = 1, m
       b(i,TID) = j
    end do
    do i = 1, m
       a(i,j)=func_call(b(i,TID))
    end do
    return
    end

By expanding array B, we can give each thread unique access to its storage area
Note that this can also be done using dynamic memory (allocatable, malloc, ....)
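
The array-expansion fix can be sketched in C along the lines of the "malloc" remark above. This is only an illustration under stated assumptions: fill_columns and its column-major layout are hypothetical names and choices, func_call is a made-up stand-in for the routine on the slide, and the serial fallbacks follow the earlier recommendation to guard run-time calls with #ifdef _OPENMP.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP support. */
int omp_get_max_threads(void) { return 1; }
int omp_get_thread_num(void)  { return 0; }
#endif

/* Hypothetical stand-in for func_call() on the slide. */
double func_call(double v) { return 2.0 * v; }

/* Fills the m-by-n matrix a (column j stored at a + j*m).
   Each thread works in its own scratch row of length m, selected by
   its thread ID, so no two threads ever write the same scratch data.
   Returns 0 on success, -1 if the scratch space cannot be allocated. */
int fill_columns(double *a, int m, int n)
{
    int nthreads = omp_get_max_threads();   /* upper bound on team size */
    double *b = malloc((size_t)nthreads * (size_t)m * sizeof *b);
    int j;

    if (b == NULL) return -1;

    #pragma omp parallel for default(none) shared(a, b, m, n) private(j)
    for (j = 0; j < n; j++) {
        /* Row of b that belongs to the executing thread. */
        double *mine = b + (size_t)omp_get_thread_num() * (size_t)m;
        int i;
        for (i = 0; i < m; i++) mine[i] = (double)(j + 1);
        for (i = 0; i < m; i++) a[(size_t)j * m + i] = func_call(mine[i]);
    }

    free(b);
    return 0;
}
```

Sizing the scratch area with omp_get_max_threads() instead of a fixed nthreads=4 avoids hard-coding the team size; the trade-off is one allocation per call.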
About global data

Global data is shared and requires special care
A problem may arise in case multiple threads access the same memory section simultaneously:
  Read-only data is no problem
  Updates have to be checked for race conditions
It is your responsibility to deal with this situation
In general one can do the following:
  Split the global data into a part that is accessed in serial parts only and a part that is accessed in parallel
  Manually create thread private copies of the latter
  Use the thread ID to access these private copies
Alternative: Use OpenMP's threadprivate directive

The threadprivate directive

OpenMP's threadprivate directive

    !$omp threadprivate (/cb/ [,/cb/] ...)
    #pragma omp threadprivate (list)

Thread private copies of the designated global variables and common blocks are created
Several restrictions and rules apply when doing this:
  The number of threads has to remain the same for all the parallel regions (i.e. no dynamic threads)
    Sun implementation supports changing the number of threads
  Initial data is undefined, unless copyin is used
  ......
Check the documentation when using threadprivate !

Example - Solution 2

    program global_data
    ....
    include "global_ok2.h"
    ....
    !$omp parallel do private(j)
    do j = 1, n
       call suba(j)
    end do
    !$omp end parallel do
    ......
    stop
    end

file global_ok2.h:
    common /work/a(m,n)
    common /tprivate/b(m)
    !$omp threadprivate(/tprivate/)

    subroutine suba(j)
    .....
    include "global_ok2.h"
    .....
    do i = 1, m
       b(i) = j
    end do
    do i = 1, m
       a(i,j) = func_call(b(i))
    end do
    return
    end

The compiler creates thread private copies of array B, to give each thread unique access to its storage area
Note that the number of copies is automatically adjusted to the number of threads

The copyin clause

    copyin (list)

Applies to THREADPRIVATE common blocks only
At the start of the parallel region, data of the master thread is copied to the thread private copies
Example:

    common /cblock/velocity
    common /fields/xfield, yfield, zfield

    ! create thread private common blocks
    !$omp threadprivate (/cblock/, /fields/)

    !$omp parallel &
    !$omp default (private) &
    !$omp copyin ( /cblock/, zfield )

OpenMP - A Summary

A very powerful, yet simple, programming model
Portable
Easy to learn and to use
Preserves sequential version of the program
Very good fit with multicore architectures

Data Races

Shared Memory programming is usually fairly straightforward
There are some areas that require special care though
One of these is called "data race"
If a data race occurs, silent data corruption could result
To complicate matters further, this behavior may not be (easily) reproducible
It is difficult for a conventional debugger to detect these and a special tool is a "must have"

About Parallelism

    Parallelism    Independence    No Fixed Ordering

"Something" that does not obey this rule, is not parallel (at that level ...)
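
One way to read the rule above: if visiting the iterations of a loop in a different order changes the result, the loop is not parallel at that level. The plain C sketch below compares a forward and a reverse traversal of two illustrative loops. All function names are hypothetical, and reversal is only one of the many orderings a parallel run may produce, so a mismatch shows that a loop is order dependent while a match does not prove that it is safe to parallelize.

```c
#include <string.h>

#define N 8

/* Independent iterations: a[i] depends on element i only. */
void add_forward(int *a, const int *b)
{
    for (int i = 0; i < N; i++) a[i] = a[i] + b[i];
}

void add_reverse(int *a, const int *b)
{
    for (int i = N - 1; i >= 0; i--) a[i] = a[i] + b[i];
}

/* Dependent iterations: a[i] reads a[i+1], so a needs N+1 elements. */
void shift_forward(int *a, const int *b)
{
    for (int i = 0; i < N; i++) a[i] = a[i + 1] + b[i];
}

void shift_reverse(int *a, const int *b)
{
    for (int i = N - 1; i >= 0; i--) a[i] = a[i + 1] + b[i];
}

/* Runs both traversal orders on identical input and returns 1 if the
   resulting arrays are equal, 0 otherwise. */
int order_independent(void (*fwd)(int *, const int *),
                      void (*rev)(int *, const int *))
{
    int x[N + 1], y[N + 1], b[N];

    for (int i = 0; i <= N; i++) x[i] = y[i] = i + 1;
    for (int i = 0; i < N; i++) b[i] = 1;

    fwd(x, b);
    rev(y, b);
    return memcmp(x, y, sizeof x) == 0;
}
```

Calling order_independent(add_forward, add_reverse) compares the two orders for the independent loop, and order_independent(shift_forward, shift_reverse) does the same for the loop that reads its right-hand neighbour.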
Shared Memory Programming

[Figure: several threads (T), each with its own private memory, grouped around a shared memory holding X and Y]

Threads communicate via shared memory

What is a Data Race?

Two different threads in a multi-threaded shared memory program
  Access the same (=shared) memory location
  Concurrently
and
  Without holding any common exclusive locks
and
  At least one of the accesses is a write/store

Example of a data race

    #pragma omp parallel shared(n)
    {n = omp_get_thread_num();}

[Figure: two threads (T), each with private memory, both write (W) to n in shared memory]

Another example

    #pragma omp parallel shared(x)
    {x = x + 1;}

[Figure: two threads (T), each with private memory, both read and write (R/W) x in shared memory]

About Data Races

Loosely described, a data race means that the update of a shared variable is not well protected
A data race tends to show up in a nasty way:
  Numerical results are (somewhat) different from run to run
  Especially with Floating-Point data difficult to distinguish from a numerical side-effect
  Changing the number of threads can cause the problem to seemingly (dis)appear
    May also depend on the load on the system
    May only show up using many threads

A parallel loop

    for (i=0; i<8; i++)
       a[i] = a[i] + b[i];

Every iteration in this loop is independent of the other iterations

    Thread 1              Thread 2
    a[0]=a[0]+b[0]        a[4]=a[4]+b[4]
    a[1]=a[1]+b[1]        a[5]=a[5]+b[5]
    a[2]=a[2]+b[2]        a[6]=a[6]+b[6]
    a[3]=a[3]+b[3]        a[7]=a[7]+b[7]
    (rows ordered in time)

Not a parallel loop

    for (i=0; i<8; i++)
       a[i] = a[i+1] + b[i];

The result is not deterministic when run in parallel !

    Thread 1              Thread 2
    a[0]=a[1]+b[0]        a[4]=a[5]+b[4]
    a[1]=a[2]+b[1]        a[5]=a[6]+b[5]
    a[2]=a[3]+b[2]        a[6]=a[7]+b[6]
    a[3]=a[4]+b[3]        a[7]=a[8]+b[7]
    (rows ordered in time)

About the experiment

We manually parallelized the previous loop
  The compiler detects the data dependence and does not parallelize the loop
Vectors a and b are of type integer
We use the checksum of a as a measure for correctness:
  checksum += a[i] for i = 0, 1, 2, ...., n-2
The correct, sequential, checksum result is computed as a reference
We ran the program using 1, 2, 4, 32 and 48 threads
  Each of these experiments was repeated 4 times

Numerical results

    threads:  1   checksum 1953   correct value 1953
    threads:  1   checksum 1953   correct value 1953
    threads:  1   checksum 1953   correct value 1953
    threads:  1   checksum 1953   correct value 1953
    threads:  2   checksum 1953   correct value 1953
    threads:  2   checksum 1953   correct value 1953
    threads:  2   checksum 1953   correct value 1953
    threads:  2   checksum 1953   correct value 1953
    threads:  4   checksum 1905   correct value 1953
    threads:  4   checksum 1905   correct value 1953
    threads:  4   checksum 1953   correct value 1953
    threads:  4   checksum 1937   correct value 1953
    threads: 32   checksum 1525   correct value 1953
    threads: 32   checksum 1473   correct value 1953
    threads: 32   checksum 1489   correct value 1953
    threads: 32   checksum 1513   correct value 1953
    threads: 48   checksum  936   correct value 1953
    threads: 48   checksum 1007   correct value 1953
    threads: 48   checksum  887   correct value 1953
    threads: 48   checksum  822   correct value 1953

Data Race In Action !

Another example of a data race/1

    #pragma omp parallel default(none) private(i,k,s) \
            shared(n,m,a,b,c,d,dr)
    {
       #pragma omp for
       for (i=0; i<m; i++)
       {
          int max_val = 0;

          s = 0 ;
          for (k=0; k<i; k++)
             s += a[k]*b[k];
          c[i] = s;

          dr = c[i];
          c[i] = 3*s - c[i];
          if (max_val < c[i]) max_val = c[i];
          d[i] = c[i] - dr;
       }
    } /*-- End of parallel region --*/

Where is the data race ?
Another example of a data race/2

    #pragma omp parallel default(none) private(i,k,s) \
            shared(n,m,a,b,c,d,dr)
    {
       #pragma omp for
       for (i=0; i<m; i++)
       {
          int max_val = 0;

          s = 0 ;
          for (k=0; k<i; k++)
             s += a[k]*b[k];
          c[i] = s;

          dr = c[i];
          c[i] = 3*s - c[i];
          if (max_val < c[i]) max_val = c[i];
          d[i] = c[i] - dr;
       }
    } /*-- End of parallel region --*/

Here is the data race !

    % cc -xopenmp -fast -xvpara -xloopinfo -c data-race.c
    "data-race.c", line 9: Warning: inappropriate scoping
            variable 'dr' may be scoped inappropriately as 'shared'
            . read at line 24 and write at line 21 may cause data race

Bottom line about Data Races

Data Races Are Easy To Put In
But Very Hard To Find
That is why a special tool to find data races is a "must have"

The Sun Thread Analyzer

Sun Studio Thread Analyzer

New tool from Sun - Available in Sun Studio 12
Detects threading errors in a multi-threaded program
  Now:
    Data race detection
    Deadlock detection
  Future:
    Other run time errors using POSIX threads and/or OpenMP

Sun Studio Thread Analyzer

Parallel Programming Models supported*:
  OpenMP
  POSIX Threads
  Solaris Threads
Platforms: Solaris on SPARC, Solaris/Linux on x86/x64
Languages: C, C++, Fortran
API provided to inform Thread Analyzer of user-defined synchronizations
  Reduce the number of false positive data races reported
*) Legacy Sun and Cray parallel directives are supported too

About Sun Studio Thread Analyzer

Getting Started:
  http://developers.sun.com/sunstudio/downloads/ssx/tha/tha_getting_started.html
Provide feedback and ask questions on the Sun Studio Tools Forum
  http://developers.sun.com/sunstudio/community/forums/index...
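
Returning to the data race example above, where the compiler warns that 'dr' is scoped as shared: one possible repair, sketched here under stated assumptions, is to make dr private so that each thread has its own copy. The function name compute, the int element type and the argument list are hypothetical, and max_val is dropped because its value is never used in the fragment.

```c
/* Computes c[i] and d[i] for i in [0, m); a and b must hold at least
   m-1 elements. With dr private, the write "dr = c[i]" and the later
   read in "d[i] = c[i] - dr" always refer to the same thread's copy. */
void compute(int m, const int *a, const int *b, int *c, int *d)
{
    int i, k, s, dr;

    #pragma omp parallel for default(none) shared(m, a, b, c, d) \
            private(i, k, s, dr)
    for (i = 0; i < m; i++) {
        s = 0;
        for (k = 0; k < i; k++)
            s += a[k] * b[k];
        c[i] = s;

        dr = c[i];
        c[i] = 3 * s - c[i];
        d[i] = c[i] - dr;
    }
}
```

Declaring dr inside the loop body would have the same effect; listing it in the private clause keeps the sketch closest to the scoping style used on the slide.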

U. Houston - TLC - 2
Test &amp; Evaluation ProjectEvaluation of Contraband Cell Phone DetectorUsing the resources of the Center and the University of Houston, the Southwest Public Safety Technology Center is establishing a broad-based facility for the development, testin
U. Houston - TLC - 2
Texas Congressional Members Deliver Funding Priorities for Security at Port of Houston, U.S. Ports. NPA 03-18 100508306 NDN- 214-0496-1986-8 HOUSTON, April 23 /PR Newswire/ - Democratic members of the House Committee on Homeland Security held a press
U. Houston - TLC - 2
Texas Internet Grid for Research and EducationDocument Name Current Version Date last updated TIGRE Membership Policy Document 1.9 December 12, 2007Abstract: This document describes the constitution of TIGRE Steering Committee and policies for joi
U. Houston - TLC - 2
Minimum Requirements for Participation in the Test Bed for the Texas Internet Grid for Research and EducationRevision 0.2 February 28, 2006PurposeThis document establishes the minimum requirements for participation the Texas Internet Grid for Re
U. Houston - TLC - 2
Use of Grid Computing in Ensemble Kalman Filter Based Data Assimilation for Hydrocarbon ReservoirsAjitabh Kumar, Ravi Vadapalli, Taesung Kim Advisor: Dr Akhil Datta-Gupta Texas A&amp;M UniversityOutlineObjective Approach Implementation Results Conclu
U. Houston - TLC - 2
Texas Internet Grid for Research and EducationDocument Name Current Version Date last updated TIGRE User Agreement and Responsibility Form 1.5 December 6, 2007Abstract: This document describes the acceptable use policies, user agreements and respo
U. Houston - TLC - 2
TB, KR, PMB/238453, 16/04/2007IOP PUBLISHING Phys. Med. Biol. 52 (2007) 119 PHYSICS IN MEDICINE AND BIOLOGY UNCORRECTED PROOFClinical CT-based calculations of dose and positron emitter distributions in proton therapy using the FLUKA Monte Carlo co
U. Houston - TLC - 2
Texas Internet Grid for Research and EducationDocument Name Current Version Date last updated TIGRE Site Operational Policies Version 1.3 December 12, 2007Abstract: This document describes the service agreement for all providers of TIGRE services.
U. Houston - TLC - 2
Appendix AProject PlanTexas Internet Grid for Research and Education (TIGRE)Rice University, Texas A &amp; M University, Texas Tech University, University of Houston, and The University of Texas at AustinRevision 1.2 March 3, 200621. Introduc
U. Houston - TLC - 2
PET/CT imaging for treatment verification after proton therapy: A study with plastic phantoms and metallic implantsKatia Parodi,a Harald Paganetti, Ethan Cascio, and Jacob B. FlanzMassachusetts General Hospital, Department of Radiation Oncology, 30
U. Houston - TLC - 2
Texas Internet Grid for Research and EducationDocument Name Current Version Date last updated TIGRE Site Operational Policies Version 1.3 December 12, 2007Abstract: This document describes the service agreement for all providers of TIGRE services
U. Houston - TLC - 2
Texas Internet Grid for Research and EducationDocument Name Current Version Date last updated TIGRE Membership Policy Document 1.9 December 12, 2007Abstract: This document describes the constitution of TIGRE Steering Committee and policies for jo
U. Houston - TLC - 2
Minimum Requirements for Participation in the Test Bed for the Texas Internet Grid for Research and EducationRevision 0.2 February 28, 2006PurposeThis document establishes the minimum requirements for participation the Texas Internet Grid for Re
U. Houston - TLC - 2
University of Houston HiPCAT Institutional UpdateFebruary, 2005 Resources &amp; Services TLC2 successfully completed the letter of credit and contract to procure two dark fiber rings for several institutions in Houston and LEARN and most likely also NL
U. Houston - TLC - 2
Winners year 2008 1st Place: Kevin Shen, Bellaire High School (Houston, TX). 2nd Place: Qianning Zhang, Bellaire High School (Houston, TX) 3rd Place: Sachin Subramanian, Bellaire High School (Houston, TX) Winners year 2007 1st Place: Sailesh Prabhu f
U. Houston - TLC - 2
25.942778,97.518889,026.035,97.786111,126.050278,97.710833,126.064167,97.760833,126.081667,97.836111,126.091111,97.955833,026.093056,97.617778,026.095278,98.201389,126.125833,97.938889,026.148889,97.910278,026.155833,97.961667,026.164167,9
U. Houston - TLC - 2
TLC2 OverviewLennart Johnsson Director Cullen Prof of Computer Science, Mathematics, and Electrical and Computer EngineeringTLC2 Missionto foster and support collaborative multidisciplinary research, education and training in Computational Scienc
U. Houston - TLC - 2
TIGRE StatusSeptember, 2006TIGRE Overview Integrate resources of Texas institutions to enhance research and educational capabilities Foster academic, private, and government partnerships State of Texas funded Texas Internet Grid for Research a
U. Houston - TLC - 2
Lighting the NextGeneration Network Across TexasA Briefing forHigh Performance Computing Across Texas (HiPCAT)Sep 22, 2006 Akbar KaraLEARN Origins2002 &quot;Texas&quot; invited to join NLR No unified network/organization: 6 separate state university sy
U. Houston - TLC - 2
Forming a LEARN Research Advisory CouncilRichard Ewing and Guy Almes 7 September 2006OutlineLEARN, Cyberinfrastructure, and Texas Objectives The Research Advisory Council Initial ActivitiesLEARN, Cyberinfrastructure, and Texas Cyberinfrastructu
U. Houston - TLC - 2
UH &amp; TLC2 @ CERN/LHC &amp; NASA (The Tragedy of the Anti-Commons)L. Pinsky Physics Department University of Houston HIPCAT September 22, 2006 Houston, TexasALICE-USA CollaborationRoadmap. Acknowledgments and Disclaimers. A little bit about LHC