Lecture 10: Performance Considerations
But First!
Always measure where your time is going!
  Even if you think you know where it is going
  Start coarse, go fine-grained as need be
Keep in mind Amdahl’s Law when optimizing any part of your code
  Don’t continue to optimize once a part is only a small fraction of overall execution time
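As a rough illustration of why a small fraction of run time is not worth further tuning, Amdahl's Law can be written as a pair of small host-side functions. The names `amdahl_speedup` and `amdahl_limit` are hypothetical and chosen only for this sketch; they are not part of the lecture.

```cpp
#include <cassert>

// Overall speedup predicted by Amdahl's Law.
// fraction:      share of total run time spent in the part being optimized (0..1)
// local_speedup: how many times faster that part becomes (> 0)
double amdahl_speedup(double fraction, double local_speedup) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}

// Upper bound on the overall speedup, reached only if the optimized
// part were to take no time at all.
double amdahl_limit(double fraction) {
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

For example, `amdahl_limit(0.05)` is about 1.05: a part that accounts for 5% of execution time can improve the whole program by roughly 5% at most, however much effort goes into it.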
Performance Considerations
  Memory Coalescing
  Shared Memory Bank Conflicts
  Control-Flow Divergence
  Occupancy
  Kernel Launch Overheads
MEMORY COALESCING
Memory Coalescing
Off-chip memory is accessed in chunks
  Even if you read only a single word
  If you don’t use the whole chunk, bandwidth is wasted
Chunks are aligned to multiples of 32/64/128 bytes
Unaligned accesses will cost more
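The cost of an unaligned access can be made concrete by counting how many aligned chunks a group of threads touches. The following is a minimal host-side sketch, not a model of any particular GPU; `chunks_touched` and `contiguous` are hypothetical helper names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Number of distinct aligned chunks of `chunk_bytes` touched when each
// thread reads `word_bytes` starting at its own address.
std::size_t chunks_touched(const std::vector<std::uint64_t>& addrs,
                           std::uint64_t word_bytes,
                           std::uint64_t chunk_bytes) {
    assert(word_bytes > 0 && chunk_bytes > 0);
    std::set<std::uint64_t> chunks;
    for (std::uint64_t a : addrs) {
        // A word may straddle a chunk boundary, so cover its first to last byte.
        const std::uint64_t last = (a + word_bytes - 1) / chunk_bytes;
        for (std::uint64_t c = a / chunk_bytes; c <= last; ++c) {
            chunks.insert(c);
        }
    }
    return chunks.size();
}

// Addresses of `n` consecutive words of `word_bytes` starting at `base`.
std::vector<std::uint64_t> contiguous(std::uint64_t base,
                                      std::uint64_t word_bytes,
                                      std::size_t n) {
    std::vector<std::uint64_t> v;
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(base + i * word_bytes);
    }
    return v;
}
```

With 64-byte chunks, `chunks_touched(contiguous(0, 4, 16), 4, 64)` touches one chunk, while the same 64 bytes starting at address 116 touch two, so twice as much data crosses the bus for the same useful payload.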
Threads 0-15 access 4-byte words at addresses 116-176
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127
[Figure: threads t0-t15 over an address axis from 0 to 288 in steps of 32; the 128B segment is highlighted]
Threads 0-15 access 4-byte words at addresses 116-176
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 64B)
[Figure: threads t0-t15 over an address axis from 0 to 288 in steps of 32; the 64B segment is highlighted]
Threads 0-15 access 4-byte words at addresses 116-176
Thread 0 is lowest active, accesses address 116
128-byte segment: 0-127 (reduce to 32B)
[Figure: threads t0-t15 over an address axis from 0 to 288 in steps of 32; the 32B transaction is highlighted]
Threads 0-15 access 4-byte words at addresses 116-176
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255
[Figure: threads t0-t15 over an address axis from 0 to 288 in steps of 32; the 128B segment is highlighted]
Threads 0-15 access 4-byte words at addresses 116-176
Thread 3 is lowest active, accesses address 128
128-byte segment: 128-255 (reduce to 64B)
[Figure: threads t0-t15 over an address axis from 0 to 288 in steps of 32; the 64B transaction is highlighted]
Mem Access Examples
Coalescing Algorithm
Find the memory segment that contains the address requested by the lowest numbered active thread:
  32B segment for 8-bit data
  64B segment for 16-bit data
  128B segment for 32, 64 and 128-bit data
Find all other active threads whose requested address lies in the same segment
Reduce the transaction size, if possible:
  If size == 128B and only the lower or upper half is used, reduce transaction to 64B
  If size == 64B and only the lower or upper half is used, reduce transaction to 32B
Carry out the transaction, mark threads as inactive
Repeat until all threads in the half-warp are serviced
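The steps above can be sketched as a small host-side simulation. This is only a model of the listed procedure under stated assumptions: every thread in the half-warp is active, each address is aligned to the word size, and the names `Transaction`, `coalesce`, and `half_warp_addresses` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One memory transaction: (start address, size in bytes).
using Transaction = std::pair<std::uint64_t, std::uint64_t>;

// Addresses for n threads, thread i requesting base + i * stride_bytes.
std::vector<std::uint64_t> half_warp_addresses(std::uint64_t base,
                                               std::uint64_t stride_bytes,
                                               std::size_t n) {
    std::vector<std::uint64_t> v;
    for (std::size_t i = 0; i < n; ++i) v.push_back(base + i * stride_bytes);
    return v;
}

// Models the listed steps for one half-warp. addrs[i] is the address
// requested by thread i. Assumes all threads are active and every
// address is aligned to word_bytes.
std::vector<Transaction> coalesce(const std::vector<std::uint64_t>& addrs,
                                  std::uint64_t word_bytes) {
    assert(word_bytes == 1 || word_bytes == 2 || word_bytes == 4 ||
           word_bytes == 8 || word_bytes == 16);
    // Initial segment size: 32B for 8-bit, 64B for 16-bit, 128B otherwise.
    const std::uint64_t initial =
        word_bytes == 1 ? 32 : (word_bytes == 2 ? 64 : 128);

    std::vector<bool> active(addrs.size(), true);
    std::vector<Transaction> out;

    for (std::size_t lowest = 0; lowest < addrs.size(); ++lowest) {
        if (!active[lowest]) continue;

        // Segment containing the lowest numbered active thread's address.
        std::uint64_t size = initial;
        std::uint64_t start = addrs[lowest] / size * size;

        // Collect all active threads whose address lies in that segment.
        std::uint64_t lo = addrs[lowest];
        std::uint64_t hi = addrs[lowest];
        for (std::size_t t = lowest; t < addrs.size(); ++t) {
            if (active[t] && addrs[t] >= start && addrs[t] < start + size) {
                if (addrs[t] < lo) lo = addrs[t];
                if (addrs[t] > hi) hi = addrs[t];
                active[t] = false;  // serviced by this transaction
            }
        }
        hi += word_bytes - 1;  // last byte actually used

        // Halve the transaction while only one half of it is used.
        while (size > 32) {
            const std::uint64_t mid = start + size / 2;
            if (hi < mid) {
                size /= 2;                 // only the lower half is used
            } else if (lo >= mid) {
                start = mid;               // only the upper half is used
                size /= 2;
            } else {
                break;                     // both halves are used
            }
        }
        out.push_back(Transaction(start, size));
    }
    return out;
}
```

Run on the example from the earlier slides, `coalesce(half_warp_addresses(116, 4, 16), 4)` yields a 32B transaction at address 96 followed by a 64B transaction at address 128, which matches the two transactions shown there.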
Comparing Compute Capabilities
Compute capability < 1.2
  Requires threads in a half-warp to:
    Access a single aligned 64B, 128B, or 256B segment
    Issue addresses in sequence
  If requirements are not satisfied:
    Separate 32B transaction for each thread
Compute capability 1.2
  Does not require sequential addressing by threads
  Performance degrades gracefully when a half-warp addresses multiple segments
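The stricter rule for compute capability < 1.2 can be expressed as a simple predicate. The sketch below assumes all threads of the half-warp are active, that "in sequence" means thread k requests the k-th word of the segment, and that the segment is aligned to the total size of the access; `coalesced_pre_1_2` and `sequential_addresses` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Addresses for n threads, thread k requesting base + k * word_bytes.
std::vector<std::uint64_t> sequential_addresses(std::uint64_t base,
                                                std::uint64_t word_bytes,
                                                std::size_t n) {
    std::vector<std::uint64_t> v;
    for (std::size_t k = 0; k < n; ++k) v.push_back(base + k * word_bytes);
    return v;
}

// True if the half-warp access meets the compute capability < 1.2
// requirements as modeled here (assumption: all threads active).
bool coalesced_pre_1_2(const std::vector<std::uint64_t>& addrs,
                       std::uint64_t word_bytes) {
    assert(!addrs.empty());
    assert(word_bytes == 4 || word_bytes == 8 || word_bytes == 16);
    // For 16 threads this is 64B, 128B, or 256B.
    const std::uint64_t span = word_bytes * addrs.size();
    if (addrs[0] % span != 0) return false;  // segment is not aligned
    for (std::size_t k = 0; k < addrs.size(); ++k) {
        if (addrs[k] != addrs[0] + k * word_bytes) return false;  // out of sequence
    }
    return true;
}
```

Under this model the earlier example (4-byte words starting at address 116) fails the alignment test, so a device with compute capability < 1.2 would issue a separate transaction for each of the 16 threads, whereas the compute capability 1.2 procedure services it with two.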
Consider the stride of your accesses
__global__ void foo( int * input, float3 * input2)
{
    int i = blockDim.x
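The kernel above is cut off in this copy, so the following is not a reconstruction of it. As a separate, hedged illustration of why stride matters, this host-side sketch estimates what fraction of the fetched bytes is useful when thread i reads one word at byte offset i * stride. It assumes memory is fetched in aligned 32B pieces, the smallest transaction size mentioned in this lecture; `bus_utilization` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Fraction of fetched bytes that are actually used when `threads` threads
// each read one word of `word_bytes`, thread i at byte offset i * stride_bytes.
// Assumption: memory is fetched in aligned 32-byte pieces.
double bus_utilization(std::uint64_t stride_bytes,
                       std::uint64_t word_bytes,
                       std::size_t threads) {
    assert(word_bytes > 0 && stride_bytes >= word_bytes && threads > 0);
    const std::uint64_t piece = 32;
    std::set<std::uint64_t> pieces;
    for (std::size_t i = 0; i < threads; ++i) {
        const std::uint64_t first = i * stride_bytes;
        const std::uint64_t last = first + word_bytes - 1;
        // Record every 32B piece the word overlaps.
        for (std::uint64_t p = first / piece; p <= last / piece; ++p) {
            pieces.insert(p);
        }
    }
    const double useful = static_cast<double>(threads * word_bytes);
    const double fetched = static_cast<double>(pieces.size() * piece);
    return useful / fetched;
}
```

A `float3` holds three 4-byte floats, so reading a single float field per thread from a `float3` array gives a 12-byte stride: `bus_utilization(12, 4, 16)` is one third, compared with 1.0 for `bus_utilization(4, 4, 16)`, where consecutive threads read consecutive 4-byte words.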