Single Node Performance: Locality and Memory Hierarchies
CS503, Spring 2011
Jacqueline Chame and Robert Lucas
Announcements
- I have requested accounts on HPCC for all students currently registered and on Blackboard. Please make sure your name, USC email address, and 10-digit USC ID are correct on Blackboard.
- There is a waiting list for registering in this course. If you are going to drop CS503, please do it soon so that other students can register.
- Thank you!
Outline
- Single processor performance
- Memory hierarchies
- Case study: matrix multiplication (a minimal sketch of the kernel follows below)

This lecture is based on Jim Demmel's CS267 class notes (UC Berkeley) and on my previous lectures on Parallel Programming and Advanced Compiler Technology for HPC.
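As a point of reference for the case study, here is a minimal C sketch (my own, not taken from the slides) of naive matrix multiplication, computing C = C + A*B for n-by-n row-major matrices; the lecture may use a different loop order or layout:

    /* Naive matrix multiply: C = C + A*B, n-by-n, row-major 1-D arrays.
     * A minimal sketch for the case study. */
    void matmul(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double cij = C[i*n + j];        /* read C(i,j) once */
                for (int k = 0; k < n; k++)
                    cij += A[i*n + k] * B[k*n + j];
                C[i*n + j] = cij;               /* write C(i,j) once */
            }
    }

Note that the inner loop streams through a row of A with unit stride but walks down a column of B with stride n; this memory behavior is exactly what the rest of the lecture examines.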
Motivation
- Most applications run at < 10% of the "peak" performance of a system. Peak is the maximum the hardware can physically execute.
- Much of this performance is lost on a single processor: the code running on one processor often runs at only 10-20% of the processor's peak.
- Most of the single-processor performance loss is in the memory system: moving data takes much longer than arithmetic and logic.
- To understand this, we need to look under the hood of modern processors. For today, we will look at only a single "core" processor.
- These issues will exist on processors within any parallel computer.
Single processor: idealized model
- The processor names bytes, words, etc. in its address space. These represent integers, floats, pointers, arrays, etc.
- Operations include:
  - Reads and writes into very fast memory called registers
  - Arithmetic and other logical operations on registers
- Order is specified by the program:
  - A read returns the most recently written data
  - The compiler and architecture translate high-level expressions into "obvious" lower-level instructions
  - The hardware executes instructions in the order specified by the compiler
- Idealized cost: each operation has roughly the same cost (read, write, add, multiply, etc.)

Example: A = B + C
    Read address(B) to R1
    Read address(C) to R2
    R3 = R1 + R2
    Write R3 to address(A)
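To make the idealized cost model concrete, here is a small self-contained C sketch; the R1-R3 names in the comments are the hypothetical registers from the slide, not real machine registers, and a real compiler may reorder or fuse these steps:

    #include <stdio.h>

    int main(void) {
        double a, b = 1.5, c = 2.5;

        /* Under the idealized model, this one statement becomes four
         * equal-cost steps:
         *   Read address(b) into R1
         *   Read address(c) into R2
         *   R3 = R1 + R2
         *   Write R3 to address(a)
         * Total: 4 units, regardless of where a, b, and c live. */
        a = b + c;

        printf("a = %f\n", a);
        return 0;
    }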
Single processor: in practice
Real processors have:
- Registers and caches: small amounts of fast memory that store the values of recently used or nearby data. Different memory operations can have very different costs.
- Parallelism: multiple "functional units" that can run in parallel. Different instruction orders and mixes have different costs.
- Pipelining: a form of parallelism, like an assembly line in a factory.

Why is this important? In theory, compilers understand all of this and can optimize your program; in practice, they don't.
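A quick way to see that different memory operations can have very different costs is to sweep the same array with different strides: the operation count is identical, so any time difference comes from the memory hierarchy, not the arithmetic. This is a sketch under my own assumptions (the array size, stride values, and use of clock() are mine, not from the slides):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)  /* 16M doubles, ~128 MB: much larger than cache */

    /* Sum all N elements, visiting them with the given stride.
     * Every call performs exactly N reads and N adds. */
    static double sweep(double *a, size_t stride, double *sum_out) {
        clock_t start = clock();
        double sum = 0.0;
        for (size_t s = 0; s < stride; s++)
            for (size_t i = s; i < N; i += stride)
                sum += a[i];
        *sum_out = sum;  /* keep the compiler from deleting the loop */
        return (double)(clock() - start) / CLOCKS_PER_SEC;
    }

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double sum;
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = 1.0;

        /* Unit stride uses every byte of each cache line; with stride 16
         * (128 bytes), each access lands on a different cache line. */
        double t1 = sweep(a, 1, &sum);
        printf("stride  1: %.3f s (sum %.0f)\n", t1, sum);
        double t16 = sweep(a, 16, &sum);
        printf("stride 16: %.3f s (sum %.0f)\n", t16, sum);

        free(a);
        return 0;
    }

On typical hardware the large-stride sweep is several times slower, even though both loops execute exactly the same arithmetic.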