lec10-performance_considerations

Lecture 10: Performance Considerations

But First!
Always measure where your time is going!
  Even if you think you know where it is going
  Start coarse, go fine-grained as need be
Keep in mind Amdahl's Law when optimizing any part of your code
  Don't continue to optimize once a part is only a small fraction of overall execution time
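
As a rough illustration of the Amdahl's Law point, the following C++ sketch computes the overall speedup when a fraction p of the run time is made s times faster and the rest is left unchanged. The function name and both parameters are hypothetical, chosen for this example and not taken from the lecture.

```cpp
#include <cassert>

// Amdahl's Law: overall speedup when a fraction `p` of the run time
// is accelerated by a factor `s` and the remaining (1 - p) is unchanged.
// `p` and `s` are illustrative names, not symbols from the lecture.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0);  // p is a fraction of total run time
    assert(s > 0.0);               // s is the local speedup factor
    return 1.0 / ((1.0 - p) + p / s);
}
```

Under this model, amdahl_speedup(0.05, 1000.0) stays below 1.06: a part that accounts for only 5% of the run time gains almost nothing overall, however fast it becomes.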

Performance Considerations
Memory Coalescing
Shared Memory Bank Conflicts
Control-Flow Divergence
Occupancy
Kernel Launch Overheads

MEMORY COALESCING

Memory Coalescing
Off-chip memory is accessed in chunks
  Even if you read only a single word
  If you don't use the whole chunk, bandwidth is wasted
Chunks are aligned to multiples of 32/64/128 bytes
  Unaligned accesses will cost more
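
To make the cost of partial and unaligned chunks concrete, here is a small C++ sketch. It is a simplified model that assumes a single fixed chunk size per access, whereas real hardware picks the size per the coalescing rules. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of aligned chunks of `chunk` bytes touched when reading `bytes`
// contiguous bytes starting at byte address `addr`.
std::uint64_t chunks_touched(std::uint64_t addr, std::uint64_t bytes,
                             std::uint64_t chunk) {
    assert(bytes > 0 && chunk > 0);
    const std::uint64_t first = addr / chunk;               // chunk holding the first byte
    const std::uint64_t last = (addr + bytes - 1) / chunk;  // chunk holding the last byte
    return last - first + 1;
}

// Fraction of the fetched bytes that the program actually requested.
double bandwidth_efficiency(std::uint64_t addr, std::uint64_t bytes,
                            std::uint64_t chunk) {
    return static_cast<double>(bytes) /
           static_cast<double>(chunks_touched(addr, bytes, chunk) * chunk);
}
```

In this model, reading one 4-byte word with 32-byte chunks uses 4 of the 32 fetched bytes. A 64-byte read starting at address 116 with 64-byte chunks touches two chunks, while the same read starting at the aligned address 128 touches one.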

Threads 0-15 access 4-byte words at addresses 116-176
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127
[Figure: threads t0-t15 drawn above an address axis labeled 0 to 288 in steps of 32; 128B segment marked]

Threads 0-15 access 4-byte words at addresses 116-176
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 64B)
[Figure: threads t0-t15 drawn above an address axis labeled 0 to 288 in steps of 32; 64B segment marked]

Threads 0-15 access 4-byte words at addresses 116-176
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 32B)
[Figure: threads t0-t15 drawn above an address axis labeled 0 to 288 in steps of 32; 32B transaction marked]

Threads 0-15 access 4-byte words at addresses 116-176
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255
[Figure: threads t0-t15 drawn above an address axis labeled 0 to 288 in steps of 32; 128B segment marked]

Threads 0-15 access 4-byte words at addresses 116-176
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255 (reduce to 64B)
[Figure: threads t0-t15 drawn above an address axis labeled 0 to 288 in steps of 32; 64B transaction marked]

Mem Access Examples

Coalescing Algorithm
Find the memory segment that contains the address requested by the lowest numbered active thread:
  32B segment for 8-bit data
  64B segment for 16-bit data
  128B segment for 32, 64 and 128-bit data
Find all other active threads whose requested address lies in the same segment
Reduce the transaction size, if possible:
  If size == 128B and only the lower or upper half is used, reduce transaction to 64B
  If size == 64B and only the lower or upper half is used, reduce transaction to 32B
Carry out the transaction, mark threads as inactive
Repeat until all threads in the half-warp are serviced
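
The steps above can be sketched as a host-side C++ simulation. This is a minimal model, not the hardware's actual implementation: it assumes every thread in the half-warp is active and that each address is aligned to the word size. The names Transaction and coalesce_half_warp are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Transaction {
    std::uint64_t base;  // first byte address covered by the transaction
    std::uint64_t size;  // transaction size in bytes: 32, 64 or 128
};

// addrs[i] is the byte address requested by thread i; word_bytes is the
// size of each requested word (1, 2, 4, 8 or 16 bytes).
std::vector<Transaction> coalesce_half_warp(
    const std::vector<std::uint64_t>& addrs, std::uint64_t word_bytes) {
    assert(word_bytes == 1 || word_bytes == 2 || word_bytes == 4 ||
           word_bytes == 8 || word_bytes == 16);
    for (std::uint64_t a : addrs) {
        assert(a % word_bytes == 0);  // assumption: word-aligned addresses
    }
    // Initial segment size: 32B for 8-bit, 64B for 16-bit, 128B otherwise.
    const std::uint64_t initial =
        word_bytes == 1 ? 32 : (word_bytes == 2 ? 64 : 128);

    std::vector<bool> active(addrs.size(), true);
    std::vector<Transaction> transactions;

    for (std::size_t t = 0; t < addrs.size(); ++t) {
        if (!active[t]) continue;  // thread t was already serviced
        // Thread t is now the lowest numbered active thread.
        std::uint64_t size = initial;
        std::uint64_t base = addrs[t] / size * size;
        std::uint64_t lo = addrs[t];                   // lowest byte used
        std::uint64_t hi = addrs[t] + word_bytes - 1;  // highest byte used
        // Collect all active threads in the same segment and retire them.
        for (std::size_t i = t; i < addrs.size(); ++i) {
            if (active[i] && addrs[i] / size * size == base) {
                lo = std::min(lo, addrs[i]);
                hi = std::max(hi, addrs[i] + word_bytes - 1);
                active[i] = false;
            }
        }
        // Halve the transaction while only one half of it is used.
        while (size > 32) {
            const std::uint64_t half = size / 2;
            if (hi < base + half) {
                size = half;  // only the lower half is used
            } else if (lo >= base + half) {
                base += half;  // only the upper half is used
                size = half;
            } else {
                break;  // both halves are used
            }
        }
        transactions.push_back({base, size});
    }
    return transactions;
}
```

For 16 threads reading 4-byte words at addresses 116, 120, ..., 176, this model yields a 32B transaction at address 96 followed by a 64B transaction at address 128.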

Comparing Compute Capabilities
Compute capability < 1.2
  Requires threads in a half-warp to:
    Access a single aligned 64B, 128B, or 256B segment
    Threads must issue addresses in sequence
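
The two requirements listed for compute capability < 1.2 can be expressed as a simple check. This C++ sketch rests on assumptions the slide does not state: all 16 threads are active, and the segment size is 16 times the word size (64B for 4-byte, 128B for 8-byte, 256B for 16-byte words). The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when a fully active half-warp accesses one aligned segment
// with thread k reading the k-th word of that segment.
// Assumption: segment size = 16 * word_bytes (64B, 128B or 256B).
bool coalesces_pre_cc12(const std::vector<std::uint64_t>& addrs,
                        std::uint64_t word_bytes) {
    assert(addrs.size() == 16);
    assert(word_bytes == 4 || word_bytes == 8 || word_bytes == 16);
    const std::uint64_t segment = 16 * word_bytes;
    if (addrs[0] % segment != 0) return false;  // segment must be aligned
    for (std::size_t k = 1; k < addrs.size(); ++k) {
        // Addresses must be issued in sequence, one word per thread.
        if (addrs[k] != addrs[0] + k * word_bytes) return false;
    }
    return true;
}
```

With 4-byte words, the sequence starting at address 128 passes this check, while the sequence starting at 116 from the earlier example fails it because it is not aligned to a 64B segment.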


This note was uploaded on 10/31/2011 for the course EE 101 taught by Professor Gibbons during the Spring '09 term at Michigan State University.
