09_multicore-4up

CIS 501 Computer Architecture
Unit 9: Multicore (Shared Memory Multiprocessors)

Slides originally developed by Amir Roth, with contributions by Milo Martin at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.

This Unit: Shared Memory Multiprocessors
• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models
[Figure: system stack with applications (App), system software, CPU, Mem, and I/O]

Readings
• Textbook (MA:FSPTCM): Sections 7.0, 7.1.3, 7.2-7.4, and Section 8.2

Beyond Implicit Parallelism
• Consider "daxpy":
  daxpy(double *x, double *y, double *z, double a):
      for (i = 0; i < SIZE; i++)
          z[i] = a*x[i] + y[i];
• Lots of instruction-level parallelism (ILP)
  • Great! But how much can we really exploit? 4 wide? 8 wide?
  • Limits to (efficient) super-scalar execution
• But, if SIZE is 10,000, the loop has 10,000-way parallelism!
  • How do we exploit it?
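For concreteness, the sequential daxpy the slide describes can be written as the following compilable C sketch. The SIZE value and the array setup in main are illustrative assumptions, not part of the slides; the point is simply that every iteration is independent of the others.

    #include <stdlib.h>

    #define SIZE 10000   /* illustrative; the slides use 10,000 as an example */

    /* z[i] = a*x[i] + y[i]: each iteration is independent, so the loop has
       SIZE-way parallelism, far more than a 4- or 8-wide core can exploit. */
    void daxpy(double *x, double *y, double *z, double a) {
        for (int i = 0; i < SIZE; i++)
            z[i] = a * x[i] + y[i];
    }

    int main(void) {
        double *x = malloc(SIZE * sizeof(double));
        double *y = malloc(SIZE * sizeof(double));
        double *z = malloc(SIZE * sizeof(double));
        for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 2.0 * i; }
        daxpy(x, y, z, 3.0);
        free(x); free(y); free(z);
        return 0;
    }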
Explicit Parallelism
• Consider "daxpy":
  daxpy(double *x, double *y, double *z, double a):
      for (i = 0; i < SIZE; i++)
          z[i] = a*x[i] + y[i];
• Break it up into N "chunks" on N cores!
  • Done by the programmer (or maybe a really smart compiler)
  daxpy(int chunk_id, double *x, double *y, double *z, double a):
      chunk_size = SIZE / N
      my_start = chunk_id * chunk_size
      my_end = my_start + chunk_size
      for (i = my_start; i < my_end; i++)
          z[i] = a*x[i] + y[i]
• Assumes local variables are "private" and x, y, and z are "shared"
• Assumes SIZE is a multiple of N (that is, SIZE % N == 0)

Example: SIZE = 400, N = 4
  Chunk ID   Start   End
         0       0    99
         1     100   199
         2     200   299
         3     300   399

(A runnable threaded version of this chunking is sketched after the hardware examples below.)

Multiplying Performance
• A single processor can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
  • Limited cache hierarchy
• What if we need even more computing power? Use multiple processors!
  • But how?
• High-end example: Sun Ultra Enterprise 25k
  • 72 UltraSPARC IV+ processors, 1.5 GHz
  • 1024 GBs of memory
  • Niche: large database servers
  • $$$

Multicore: Mainstream Multiprocessors
• Multicore chips
  • IBM Power5: two 2+ GHz PowerPC cores, shared 1.5 MB L2, L3 tags
  • AMD Quad Phenom: four 2+ GHz cores, per-core 512 KB L2 cache, shared 2 MB L3 cache
  • Intel Core i7 Quad: four cores, private L2s, shared 6 MB L3
  • Sun Niagara: 8 cores, each 4-way threaded, shared 2 MB L2, shared FP; for servers, not desktops
[Figure: Power5 die diagram showing Core 1, Core 2, the shared 1.5 MB L2, and L3 tags]
• Why multicore? What else would you do with 1 billion transistors?
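The slides leave the actual thread creation implicit. As a sketch only (the course does not tie the chunked daxpy to any particular threading API), here is one way to run it on N POSIX threads. N, the chunk_arg struct, and the initialization in main are assumptions for illustration, and SIZE is assumed to be a multiple of N, as the slide states.

    #include <pthread.h>

    #define SIZE 400   /* matches the slide's table (SIZE = 400, N = 4) */
    #define N      4   /* number of chunks/threads; assumed here */

    static double x[SIZE], y[SIZE], z[SIZE];   /* shared by all threads */
    static double a = 3.0;

    /* Per-thread argument: just the chunk id, as in daxpy(chunk_id, ...). */
    struct chunk_arg { int chunk_id; };

    static void *daxpy_chunk(void *p) {
        int chunk_id   = ((struct chunk_arg *)p)->chunk_id;
        int chunk_size = SIZE / N;              /* assumes SIZE % N == 0 */
        int my_start   = chunk_id * chunk_size;
        int my_end     = my_start + chunk_size;
        /* i, my_start, my_end are private to this thread; x, y, z are shared. */
        for (int i = my_start; i < my_end; i++)
            z[i] = a * x[i] + y[i];
        return NULL;
    }

    int main(void) {
        pthread_t tid[N];
        struct chunk_arg args[N];
        for (int i = 0; i < SIZE; i++) { x[i] = i; y[i] = 2.0 * i; }

        /* One thread per chunk; the OS can place them on different cores. */
        for (int c = 0; c < N; c++) {
            args[c].chunk_id = c;
            pthread_create(&tid[c], NULL, daxpy_chunk, &args[c]);
        }
        for (int c = 0; c < N; c++)
            pthread_join(tid[c], NULL);
        return 0;
    }

With SIZE = 400 and N = 4, thread 0 handles elements 0-99, thread 1 handles 100-199, and so on, matching the chunk table above (compile with -pthread).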
