
Lec02 Fundamentals of parallelism-Address space organization

Course: CSE 260, Fall 2006
School: UCSD

Lecture 2: Fundamentals of parallelism
- Address space organization
- Control mechanism
- Performance
- The PRAM

Announcements
- Homework #1 has been posted; due next Thursday in class
- Valkyrie is available - try it out! (See latest web posting)
- Web board

9/26/06 Scott B. Baden /CSE 260/ Fall 2006

The Anita Borg Institute for Women and Technology and the Association for Computing Machinery Present
- San Diego, October 4-7, 2006
- Over 1,000 women in computing
- Events for undergraduates considering careers & grad school
- Events for graduate students
- Shirley Tilghman, President, Princeton University
- Sally Ride, UCSD professor and former astronaut
- Helen Greiner, President, iRobot
- Parties, company representatives, and more!
- Volunteers Needed! http://www.cs.ucsd.edu/~bsimon
- Free registration!

Address Space Organization
- Recall that we classify the address space organization of a parallel computer according to whether or not it provides global memory
- When there is global memory: a "shared memory" architecture, also known as a multiprocessor
- When there is no global memory: a "shared nothing" architecture, also known as a multicomputer

Shared memory organization
- The address space is global to all processors
- Hardware automatically performs the global to local mapping using address translation mechanisms
- We classify shared memory architectures still further according to the uniformity of memory access times

UMA
- UMA = Uniform Memory Access time
- All processors observe the same access time to memory in the absence of contention
- Also called Symmetric Multiprocessors
- Usually bus based: not a scalable solution
[Figure: one memory (M) shared by four processors (P), each with its own cache (C)]

NUMA: Non-Uniform Memory Access time
- Processors see distance-dependent access times to memory
- Implies physically distributed memory
- Distributed shared memory architectures
- SGI Origin 3000, up to 512 processors
- Elaborate interconnect monitors memory sharing

But why do we use shared memory?
- A natural extension of the existing single-processor execution model
- We don't have to distinguish local from remote data
- More natural for the programmer and compiler writers

Disadvantages
- An efficient program may end up having to mimic a message passing program!
- For a given level of aggregate memory bandwidth, shared memory appears more costly than shared nothing architectures

Architectures without shared memory
- A collection of P processor-memory pairs
- Processors communicate by sending messages over an interconnect

Hybrid organizations
- Today's multicomputers generally have a hybrid design
- Hierarchically organized: each node is a multiprocessor
- Nodes communicate by passing messages; processors within a node communicate via shared memory
- Also called a multi-tier computer
[Figure: an interconnection network joining four nodes; each node has one memory (M) shared by four processors (P), each with its own cache (C)]

Locality
- A parallel computer extends the notion of a traditional memory hierarchy
- The notions of locality which apply to virtual memory, cache memory, and registers also apply to parallel memory hierarchies
- Vital to preserve the locality inherent to the application, and to amortize fixed costs

Control mechanisms
- In addition to address space organization, architectures are also classified according to their control mechanism
- How do the processors issue instructions?
- Today, most parallel computers execute their instruction streams independently
- Some special purpose machines execute a global instruction stream in lock-step

Flynn's classification (1966)
- MIMD: Multiple Instruction, Multiple Data
  [Figure: processing elements, each with its own control unit (PE + CU), joined by an interconnect]
- SIMD: Single Instruction, Multiple Data
  [Figure: one control unit driving several processing elements (PE) joined by an interconnect]

SIMD
- Two classic SIMD designs
  - ILLIAC IV (1960s)
  - Connection Machine Model 1 and 2 (1980s)
- Efficiently compute on regular arrays of data
- Irregular or data dependent computations lead to poor performance

    forall ( i=0:n-1)
        if ( x[i] < 0) then y[i] = x[i] else y[i] = x[i] end if

MIMD
- SIMD machines have a niche market
- We'll focus on MIMD shared nothing architectures

Measures of Performance
- Why do we measure performance?
- Measures of performance
  - Completion time
  - Processor-time product: completion time × # processors
  - Throughput: amount of work that can be accomplished in a given amount of time
  - Relative performance: given a reference architecture or implementation, AKA speedup

Parallel Speedup and Efficiency
- How much of an improvement did our parallel algorithm obtain over the serial algorithm?
- Define the parallel speedup, $S_P$:
  $S_P = \dfrac{\text{Running time of the best serial program on 1 processor}}{\text{Running time of the parallel program on } P \text{ processors}}$
- $T_1$ is defined as the running time of the "best serial algorithm"
- In general: not the running time of the parallel algorithm on 1 processor
- Definition: Parallel efficiency $E_P = S_P / P$

What can go wrong with speedup?
- Speedup is not always an accurate way to compare different algorithms or machines
- We might be able to obtain a better speedup at the cost of a longer running time
- We have a super-linear speedup when $S_P > P$, equivalently $E_P > 1$
- Super-linear speedups are often an artifact of inappropriate measurement technique
- Where there is a super-linear speedup, a better serial algorithm may be lurking

Scalability
- Sometimes communication is the bottleneck that limits performance as we increase P
- Other factors may limit performance, too
- We say that a computation is scalable if performance increases as a "nice function" with the number of processors: linear or even p lg p

Limits to scalability
- In practice scalability can be hard to achieve
- "Non-productive" work associated with exploiting parallelism, e.g. communication
- Serial sections: portions of the code that run on only one processor, e.g. initialization
- Load imbalance: work assigned unevenly to processors
- Some algorithms present intrinsic barriers to realizing scalability and in these cases we seek alternatives

    forall i=0:n-1
        sum = sum + x[i]

Amdahl's law (1967)
- A serial section limits scalability
- Let f = fraction of $T_1$ that runs serially
- Amdahl's Law (1967): As $P \to \infty$, $S_P \to 1/f$
[Figure: plot with curves labeled 0.1, 0.2, 0.3]

Performance limited by a serial section
[Figure: plot with curves labeled 0.1, 0.2, 0.3, 0.4]

Scaled Speedup
- Amdahl's law led many to take a pessimistic outlook on the benefits of parallelism
- Observation: Amdahl's law assumes that the workload (W) remains fixed
- But parallel computers are used to tackle more ambitious workloads
- W increases with P
- f often decreases with W

Computing scaled speedup
- Instead of asking what the speedup is, let's ask how long a parallel program would run on a single processor [J. Gustafson 1992] http://www.scl.ameslab.gov/Publications/Gus/FixedTime/FixedTime.pdf
- Let $T_P = 1$
- f = fraction of the parallel program's running time spent in serial code
- $T_1 = f + (1 - f) P$; since $T_P = 1$, the scaled speedup is $S_P = T_1 / T_P = f + (1 - f) P$
- Scaled speedup is linear in P

Isoefficiency
- A consequence of Gustafson's observation is that we increase N with P
- Kumar: We can maintain constant efficiency so long as we increase N appropriately
- The isoefficiency function specifies the growth of N in terms of P
- If N is linear in P, we have a scalable computation
- More on this later on

A theoretical basis: the PRAM
- Parallel Random Access Machine
- Idealized parallel computer
  - Unbounded number of processors
  - Shared memory of unbounded size
  - Constant access time
- Access time is comparable to that of a machine instruction
- All processors execute in lock step
[Figure: processing elements (PE) connected to a single shared memory]

Why is the PRAM interesting?
- Inspires real world system and algorithm designs
- Formal basis for fundamental limitations
  - If a PRAM algorithm is inefficient, then so is any parallel algorithm
  - If a PRAM algorithm is efficient, does it follow that any parallel algorithm is efficient?

How do we handle concurrent accesses?
- Our options are to prohibit or permit concurrency in reads and writes
- There are therefore 4 flavors
- We'll focus on CRCW = Concurrent Read Concurrent Write
- All processors may read or write

CRCW PRAM
- What happens when more than one processor attempts to write to the same location?
- We need a rule for combining multiple writes
  - Common: All processors must write the same value
  - Arbitrary: Only allow 1 arbitrarily chosen processor to write
  - Priority: Assign priorities to the processors, and allow the highest-priority processor's write
  - Combine the written values in some meaningful way, e.g. sum, max, using an associative operator.
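
The four CRCW write rules can be made concrete with a small sequential sketch in Python. This is only an illustration, not part of the lecture: the function name `resolve_crcw`, the `(processor_id, value)` representation of a write, and the convention that a lower processor id means higher priority are all assumptions made for the example.

```python
from functools import reduce
import operator


def resolve_crcw(writes, rule, combine=operator.add):
    """Resolve concurrent writes aimed at ONE memory cell in one PRAM step.

    writes:  list of (processor_id, value) pairs (hypothetical representation).
    rule:    'common', 'arbitrary', 'priority', or 'combine'.
    combine: associative operator used by the 'combine' rule.
    """
    if not writes:
        raise ValueError("no writes to resolve")
    values = [value for _, value in writes]
    if rule == "common":
        # All processors must write the same value; anything else is an error.
        if any(v != values[0] for v in values):
            raise ValueError("common rule violated: values differ")
        return values[0]
    if rule == "arbitrary":
        # Any single write may win; this sketch just takes the first one listed.
        return values[0]
    if rule == "priority":
        # Assumption: the lowest processor id has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if rule == "combine":
        # Fold all written values together with the associative operator.
        return reduce(combine, values)
    raise ValueError(f"unknown rule: {rule}")


writes = [(2, 5), (0, 7), (1, 3)]
print(resolve_crcw(writes, "priority"))      # 7  (processor 0 wins)
print(resolve_crcw(writes, "combine"))       # 15 (sum of all values)
print(resolve_crcw(writes, "combine", max))  # 7  (maximum value)
```

The "combine" rule with summation is the one rank sort relies on later in the lecture.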

A natural programming model for a PRAM: the data parallel model
- Apply an operation uniformly over all processors in a single step
- Assign each array element to a virtual processor
- Implicit barrier synchronization between each step

    [ 2  8 18 12 ] = [ 1 -2  7 10 ] + [ 1 10 11  2 ]

Forall

    forall var0 = <range>, var1 = <range>, ...
        <assignment>

- Evaluate the entire RHS of <assignment> for all index values (in any order) and assign to a temporary
- Perform all assignments (in any order) using the temporary
- No more than one value for each element on the left hand side

    forall i = 0:n-1
        x[i] = (i*2.0/n)-1.0

    forall i = 0:n-1, j = 0:m-1
        H[i,j] = 1.0/(i+j)

Sorting on a PRAM
- A 2 step algorithm called rank sort
- Compute the rank (position in sorted order) for each element in parallel
  - Compare all possible pairings of input values in parallel, $n^2$-fold parallelism
  - CRCW model with update on write using summation
- Move each value to its correctly sorted position according to the rank: n-fold parallelism
- O(1) running time

Rank sort on a PRAM
- Compute the rank for all possible pairings of inputs in parallel, $n^2$-fold parallelism
- Move each value into position according to the rank: n-fold parallelism

    forall i=0:n-1, j=0:n-1
        if ( x[i] > x[j] ) then rank[i] = 1 end if
    forall i=0:n-1
        y[rank[i]] = x[i]

Compute Ranks
- $O(N^2)$ parallelism
- Update on write: summation

    forall i=0:n-1, j=0:n-1
        if ( x[i] > x[j] ) then rank[i] = 1 end if

[Figure: comparison matrix for x = (1, 7, 3, -1, 5, 6); summing the 1s written in each row gives rank = (1, 5, 2, 0, 3, 4)]

Route the data using the ranks

    forall i=0:n-1
        y[rank[i]] = x[i]

    index        0   1   2   3   4   5
    rank         1   5   2   0   3   4
    x            1   7   3  -1   5   6
    y (sorted)  -1   1   3   5   6   7
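
The two rank sort steps can be checked with a sequential simulation in Python. This is a sketch rather than a parallel implementation: the nested loops stand in for the n²-fold parallel comparisons, the `+= 1` stands in for the summation update-on-write rule, and the function name `rank_sort` is chosen here for illustration. It assumes all input values are distinct, since equal values would receive the same rank and collide in the output.

```python
def rank_sort(x):
    """Sequential simulation of the two-step PRAM rank sort.

    Assumes the values in x are distinct.
    """
    n = len(x)
    # Step 1: n*n independent comparisons. Each true comparison "writes" a 1
    # to rank[i]; summation on concurrent writes adds them up.
    rank = [0] * n
    for i in range(n):
        for j in range(n):
            if x[i] > x[j]:
                rank[i] += 1
    # Step 2: n independent moves; each value goes to the slot named by its rank.
    y = [None] * n
    for i in range(n):
        y[rank[i]] = x[i]
    return rank, y


rank, y = rank_sort([1, 7, 3, -1, 5, 6])
print(rank)  # [1, 5, 2, 0, 3, 4]
print(y)     # [-1, 1, 3, 5, 6, 7]
```

Run sequentially, this costs n² comparisons, which is why the constant running time only holds on the idealized PRAM with n² processors.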

Parallel speedup and efficiency
- Recall the parallel speedup on P processors:
  $S_P = \dfrac{\text{Running time of the best serial program on 1 processor}}{\text{Running time of the parallel program on } P \text{ processors}}$
- The speedup is (n lg n) / O(1) = O(n lg n)
- No matter how many processors we have, the speedup for this workload is limited by the amount of available work

Enter real world constraints
- The PRAM provides a necessary condition for an efficient algorithm on physical hardware
- But the condition is not sufficient; e.g. rank sort

    forall ( i=0:n-1, j=0:n-1 )
        if ( x[i] > x[j]) then rank[i] = 1 end if
    forall ( i=0:n-1 )
        y[rank[i]] = x[i]

- Not all computations can execute efficiently in lockstep
- Real world computers have finite resources: memory and communication network capacity
- Fast networks are expensive

SPMD execution model
- Most parallel programming is implemented under the Same Program Multiple Data programming model = SPMD
- Other names for this model are "loosely synchronous" or "bulk synchronous"
- Programs execute as a set of P processes
  - We specify P when we run the program
  - Each process is usually assigned to a different physical processor
- Each process
  - is initialized with the same code
  - has an associated rank, a unique integer in the range 0:P-1
  - executes instructions at its own rate
- Processes communicate via messages or shared memory, but we'll assume message passing
- The sequence of instructions each process executes depends on the process' rank and the outcome of communication
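
The SPMD model can be emulated in a few lines of Python. This is a rough sketch under stated assumptions, not how a real message passing program is written: Python threads stand in for the P processes, a `queue.Queue` stands in for the message channel to rank 0, and the names `spmd_sum` and `worker` are hypothetical. Every worker runs the same code, and only its rank decides which block of data it owns and whether it sends or receives.

```python
import queue
import threading


def spmd_sum(data, P):
    """Run P copies of the same worker; rank 0 collects partial sums via messages."""
    inbox = queue.Queue()  # stands in for the message channel to rank 0
    result = {}

    def worker(rank):
        # Same code on every rank; the rank selects the block of data to own.
        n = len(data)
        lo = rank * n // P
        hi = (rank + 1) * n // P
        partial = sum(data[lo:hi])
        if rank != 0:
            inbox.put((rank, partial))  # "send" the partial sum to rank 0
        else:
            total = partial
            for _ in range(P - 1):      # "receive" one message per other rank
                _, value = inbox.get()
                total += value
            result["total"] = total

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


print(spmd_sum(list(range(100)), 4))  # 4950
```

Messages may arrive at rank 0 in any order, which reflects the point that each process executes at its own rate; the sum is the same regardless.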