09_multicore - CIS 501 Computer Architecture, Unit 9: Multicore

CIS 501 (Martin): Multicore 1

CIS 501 Computer Architecture
Unit 9: Multicore (Shared Memory Multiprocessors)

Slides developed by Milo Martin & Amir Roth at the University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
This Unit: Shared Memory Multiprocessors

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency

[Slide diagram: multiple CPUs sharing memory and I/O, with system software and applications layered on top]
Readings

• Textbook (MA:FSPTCM)
  • Sections 7.0, 7.1.3, 7.2-7.4
  • Section 8.2
Beyond Implicit Parallelism

Consider “daxpy”:

  double a, x[SIZE], y[SIZE], z[SIZE];
  void daxpy():
    for (i = 0; i < SIZE; i++)
      z[i] = a*x[i] + y[i];

• Lots of instruction-level parallelism (ILP)
  • Great! But how much can we really exploit? 4-wide? 8-wide?
  • Limits to (efficient) superscalar execution
• But, if SIZE is 10,000, the loop has 10,000-way parallelism!
  • How do we exploit it?
Explicit Parallelism

Consider “daxpy”:

  double a, x[SIZE], y[SIZE], z[SIZE];
  void daxpy():
    for (i = 0; i < SIZE; i++)
      z[i] = a*x[i] + y[i];

• Break it up into N “chunks” on N cores!
  • Done by the programmer (or maybe a really smart compiler)

  void daxpy(int chunk_id):
    chunk_size = SIZE / N
    my_start = chunk_id * chunk_size
    my_end = my_start + chunk_size
    for (i = my_start; i < my_end; i++)
      z[i] = a*x[i] + y[i]

• Assumes
  • Local variables are “private” and x, y, and z are “shared”
  • SIZE is a multiple of N (that is, SIZE % N == 0)

Example (SIZE = 400, N = 4):

  Chunk ID | Start | End
         0 |     0 |  99
         1 |   100 | 199
         2 |   200 | 299
         3 |   300 | 399
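The chunking arithmetic above can be sketched as runnable code. This is a minimal Python rendering of the slide's pseudocode; the names SIZE, N, and the chunk index math follow the slide, while the array contents are illustrative values chosen so the result is easy to check.

```python
SIZE = 400   # total elements, matching the slide's example table
N = 4        # number of chunks (one per core)

a = 2.0
x = [1.0] * SIZE
y = [3.0] * SIZE
z = [0.0] * SIZE

def daxpy_chunk(chunk_id):
    """Compute z[i] = a*x[i] + y[i] over this chunk's slice of the arrays."""
    chunk_size = SIZE // N          # assumes SIZE % N == 0, as the slide notes
    my_start = chunk_id * chunk_size
    my_end = my_start + chunk_size  # chunk 0 covers indices 0..99, chunk 1 covers 100..199, ...
    for i in range(my_start, my_end):
        z[i] = a * x[i] + y[i]

# Run the chunks (serially here; the next slide adds the parallel driver)
for tid in range(N):
    daxpy_chunk(tid)
```

Note that each chunk writes a disjoint slice of z, which is what makes the "x, y, and z are shared" assumption safe: no two chunks ever touch the same element.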
Explicit Parallelism

Consider “daxpy”:

  double a, x[SIZE], y[SIZE], z[SIZE];
  void daxpy(int chunk_id):
    chunk_size = SIZE / N
    my_start = chunk_id * chunk_size
    my_end = my_start + chunk_size
    for (i = my_start; i < my_end; i++)
      z[i] = a*x[i] + y[i]

• Main code then looks like:

  parallel_daxpy():
    for (tid = 0; tid < CORES; tid++) {
      spawn_task(daxpy, tid);
    }
    wait_for_tasks(CORES);
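The spawn/wait driver above can be sketched concretely. The slide's spawn_task() and wait_for_tasks() are abstract primitives, not a real API; in this Python sketch, threading.Thread plays their role (start() is the spawn, join() is the wait).

```python
import threading

SIZE, CORES = 400, 4
a = 2.0
x = [1.0] * SIZE
y = [3.0] * SIZE
z = [0.0] * SIZE

def daxpy(chunk_id):
    """One task's share of the work: a contiguous chunk of the arrays."""
    chunk_size = SIZE // CORES
    my_start = chunk_id * chunk_size
    for i in range(my_start, my_start + chunk_size):
        z[i] = a * x[i] + y[i]

def parallel_daxpy():
    # "spawn_task(daxpy, tid)" -> create and start one thread per chunk
    tasks = [threading.Thread(target=daxpy, args=(tid,))
             for tid in range(CORES)]
    for t in tasks:
        t.start()
    # "wait_for_tasks(CORES)" -> join every spawned thread
    for t in tasks:
        t.join()

parallel_daxpy()
```

Because the chunks write disjoint slices of z, no locking is needed here; later parts of this unit cover what happens when tasks do share data.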
Explicit (Loop-Level) Parallelism

• Another way: “OpenMP” annotations to inform the compiler

  double a, x[SIZE], y[SIZE], z[SIZE];
  void daxpy() {
    #pragma omp parallel for
    for (i = 0; i < SIZE; i++) {
      z[i] = a*x[i] + y[i];
    }
  }

• Look familiar? Hint: homework #1
• But this only works if the loop is actually parallel
  • If it is not, incorrect behavior may result in unpredictable ways
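OpenMP itself targets C/C++/Fortran, so it cannot be shown directly in Python; as a rough analogue of the same loop-level idea, this sketch hands the loop body to a thread pool via concurrent.futures. This is an illustrative substitute, not the slide's mechanism, but it makes the same point: the transformation is only safe because each iteration is independent.

```python
from concurrent.futures import ThreadPoolExecutor

SIZE = 400
a = 2.0
x = [1.0] * SIZE
y = [3.0] * SIZE

def body(i):
    # One loop iteration: reads only x[i] and y[i], touches no shared state,
    # so iterations can run in any order or concurrently.
    return a * x[i] + y[i]

# The pool.map call plays the role of "#pragma omp parallel for":
# distribute independent iterations across worker threads.
with ThreadPoolExecutor() as pool:
    z = list(pool.map(body, range(SIZE)))
```

If iterations were *not* independent (say, z[i] depended on z[i-1]), this rewrite would silently compute wrong answers, which is exactly the "incorrect behavior in unpredictable ways" hazard the slide warns about.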
Multicore & Multiprocessor Hardware
Multiplying Performance

• A single processor can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
  • Limited cache hierarchy
• What if we need even more computing power?