CIS 501 (Martin): Multicore
CIS 501
Computer Architecture
Unit 9: Multicore
(Shared Memory Multiprocessors)
Slides originally developed by Amir Roth with contributions by Milo Martin
at University of Pennsylvania with sources that included University of
Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
This Unit: Shared Memory Multiprocessors
• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models
[Figure: system stack diagram: App, App, App / System software / Mem, CPU, I/O]
Readings
• Textbook (MA:FSPTCM)
  • Sections 7.0, 7.1.3, 7.2-7.4
  • Section 8.2
Beyond Implicit Parallelism
• Consider “daxpy”:
    daxpy(double *x, double *y, double *z, double a):
      for (i = 0; i < SIZE; i++)
        z[i] = a*x[i] + y[i];
• Lots of instruction-level parallelism (ILP)
  • Great!
  • But how much can we really exploit? 4 wide? 8 wide?
  • Limits to (efficient) super-scalar execution
• But, if SIZE is 10,000, the loop has 10,000-way parallelism!
  • How do we exploit it?
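The "10,000-way parallelism" claim rests on the iterations being independent: iteration i reads only x[i] and y[i] and writes only z[i]. A minimal sketch of that property in C, assuming z does not overlap x or y, is to run the loop in both directions and observe that the result does not depend on the order. The names daxpy_forward and daxpy_backward and the explicit size parameter are illustrative, not from the slides.

```c
#include <stddef.h>

/* Sequential daxpy: z[i] = a*x[i] + y[i], visiting i in ascending order. */
void daxpy_forward(size_t size, const double *x, const double *y,
                   double *z, double a)
{
    for (size_t i = 0; i < size; i++)
        z[i] = a * x[i] + y[i];
}

/* Same computation, visiting i in descending order. Each iteration touches
 * only element i, so the order of iterations cannot change the result
 * (assumption: z does not alias x or y). */
void daxpy_backward(size_t size, const double *x, const double *y,
                    double *z, double a)
{
    for (size_t i = size; i-- > 0; )
        z[i] = a * x[i] + y[i];
}
```

If any order gives the same answer, the iterations can also run at the same time, which is what the following slides exploit.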
Explicit Parallelism
• Consider “daxpy”:
    daxpy(double *x, double *y, double *z, double a):
      for (i = 0; i < SIZE; i++)
        z[i] = a*x[i] + y[i];
• Break it up into N “chunks” on N cores!
  • Done by the programmer (or maybe a really smart compiler)
    daxpy(int chunk_id, double *x, double *y, double *z, double a):
      chunk_size = SIZE / N
      my_start = chunk_id * chunk_size
      my_end = my_start + chunk_size
      for (i = my_start; i < my_end; i++)
        z[i] = a*x[i] + y[i]
• Assumes
  • Local variables are “private” and x, y, and z are “shared”
  • Assumes SIZE is a multiple of N (that is, SIZE % N == 0)
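The second assumption can be relaxed. One common approach, sketched here and not taken from the slides, gives the first SIZE % N chunks one extra element each. The struct chunk_range and the function chunk_bounds are hypothetical names; the range is half-open, matching the loop test i < my_end.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [start, end) owned by one chunk. */
struct chunk_range {
    size_t start;
    size_t end;   /* one past the last index, as in "i < my_end" */
};

/* Split `size` elements over `n` chunks. The first (size % n) chunks get
 * one extra element, so chunk sizes differ by at most one. */
struct chunk_range chunk_bounds(size_t chunk_id, size_t size, size_t n)
{
    assert(n > 0 && chunk_id < n);
    size_t base  = size / n;   /* elements every chunk gets */
    size_t extra = size % n;   /* leftover elements to hand out */
    struct chunk_range r;
    /* Chunks before this one contributed min(chunk_id, extra) extras. */
    r.start = chunk_id * base + (chunk_id < extra ? chunk_id : extra);
    r.end   = r.start + base + (chunk_id < extra ? 1 : 0);
    return r;
}
```

When SIZE is a multiple of N this reduces to the slide's formula: chunk_bounds(1, 400, 4) yields [100, 200). With SIZE = 10 and N = 4 the four chunks are [0, 3), [3, 6), [6, 8), and [8, 10).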
Chunk ID   Start   End
0          0       99
1          100     199
2          200     299
3          300     399

SIZE = 400, N = 4
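The chunked daxpy can be sketched with POSIX threads, one thread per chunk. This is an illustration under stated assumptions, not the course's reference code: the names daxpy_args, daxpy_chunk, daxpy_parallel, and the MAX_THREADS bound are hypothetical, and SIZE must be a multiple of N as on the slide. The arguments struct plays the role of the "private" per-thread state, while x, y, and z point at the "shared" arrays. Note that my_end is exclusive, so for chunk 0 of the table it is 100 while the last index processed is 99.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64   /* hypothetical upper bound for this sketch */

/* Per-thread arguments: chunk_id is private, the arrays are shared. */
struct daxpy_args {
    size_t chunk_id, chunk_size;
    const double *x, *y;
    double *z;
    double a;
};

/* Thread body: the chunked daxpy loop from the slide. */
static void *daxpy_chunk(void *p)
{
    const struct daxpy_args *arg = p;
    size_t my_start = arg->chunk_id * arg->chunk_size;
    size_t my_end   = my_start + arg->chunk_size;   /* exclusive */
    for (size_t i = my_start; i < my_end; i++)
        arg->z[i] = arg->a * arg->x[i] + arg->y[i];
    return NULL;
}

/* Run daxpy over `size` elements on `n` threads.
 * Returns 0 on success, -1 if a thread could not be created. */
int daxpy_parallel(size_t size, size_t n, const double *x, const double *y,
                   double *z, double a)
{
    assert(n > 0 && n <= MAX_THREADS && size % n == 0);
    pthread_t tid[MAX_THREADS];
    struct daxpy_args args[MAX_THREADS];
    size_t started = 0;
    int status = 0;

    for (; started < n; started++) {
        args[started] = (struct daxpy_args){ started, size / n, x, y, z, a };
        if (pthread_create(&tid[started], NULL, daxpy_chunk,
                           &args[started]) != 0) {
            status = -1;
            break;
        }
    }
    /* Wait for every thread that was actually started. */
    for (size_t t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return status;
}
```

A call such as daxpy_parallel(400, 4, x, y, z, a) runs the four chunks of the table above. No locking is needed because the chunks write disjoint ranges of z. On most systems this builds with the compiler's -pthread option.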
Multiplying Performance
• A single processor can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
  • Limited cache hierarchy
• What if we need even more computing power?
  • Use multiple processors!
  • But how?
• High-end example: Sun Ultra Enterprise 25k
  • 72 UltraSPARC IV+ processors, 1.5 GHz
  • 1024 GB of memory
  • Niche: large database servers
  • $$$
Multicore: Mainstream Multiprocessors
• Multicore chips
• IBM Power5
  • Two 2+ GHz PowerPC cores
  • Shared 1.5 MB L2, L3 tags
• AMD Quad Phenom
  • Four 2+ GHz cores
  • Per-core 512 KB L2 cache
  • Shared 2 MB L3 cache
• Intel Core i7 Quad
  • Four cores, private L2s
  • Shared 6 MB L3
• Sun Niagara
  • 8 cores, each 4-way threaded
  • Shared 2 MB L2, shared FP
  • For servers, not desktop
[Figure: dual-core chip diagram labeled Core 1, Core 2, 1.5MB L2, L3 tags]
Why multicore? What else would you do with 1 billion transistors?

- Fall '10
- Martin