lec09-performance_considerations

Lecture 09: Performance Considerations

But First!
  Always measure where your time is going!
    Even if you think you know where it is going
    Start coarse, go fine-grained as need be
  Keep in mind Amdahl's Law when optimizing any part of your code
    Don't continue to optimize once a part is only a small fraction of overall execution time
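The two pieces of advice above can be sketched together: time a kernel with CUDA events, then use Amdahl's Law to bound what optimizing it could buy. This is a minimal sketch, not code from the lecture. The kernel `scale`, the helper `time_scale_ms`, and the function `amdahl_speedup` are hypothetical names introduced only for illustration.

```cuda
#include <cuda_runtime.h>

// Amdahl's Law: overall speedup when a fraction p of the run time
// is made s times faster and the rest is left unchanged.
__host__ __device__ inline double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

// Hypothetical kernel standing in for whatever is being measured.
__global__ void scale(float *data, int n, float factor) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Times one launch of scale() with CUDA events.
// Returns elapsed milliseconds, or -1.0f if allocation or the timing query fails.
float time_scale_ms(int n) {
    float *d_data = nullptr;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event have completed

    float ms = -1.0f;
    if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}
```

As a worked case: if the measured kernel accounts for 10% of total run time, `amdahl_speedup(0.10, 2.0)` is about 1.05, and even an infinitely fast kernel could not push the overall speedup past 1/0.9, roughly 1.11.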
Performance Considerations
  Memory Coalescing
  Shared Memory Bank Conflicts
  Control-Flow Divergence
  Occupancy
  Kernel Launch Overheads

MEMORY COALESCING
Memory Coalescing
  Off-chip memory is accessed in chunks
    Even if you read only a single word
    If you don't use the whole chunk, bandwidth is wasted
  Chunks are aligned to multiples of 32/64/128 bytes
    Unaligned accesses will cost more
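The cost of misalignment can be made concrete by counting how many aligned chunks a contiguous range of bytes touches. The helper below is an illustrative sketch only: the name `chunks_touched` is hypothetical, and the model ignores caches and any hardware-specific behavior.

```cuda
// Number of aligned chunks of size chunk_bytes touched by the byte range
// [addr, addr + bytes). Assumes bytes > 0 and chunk_bytes > 0.
__host__ __device__ inline unsigned chunks_touched(unsigned addr,
                                                   unsigned bytes,
                                                   unsigned chunk_bytes) {
    unsigned first = addr / chunk_bytes;                // chunk holding the first byte
    unsigned last = (addr + bytes - 1) / chunk_bytes;   // chunk holding the last byte
    return last - first + 1;
}
```

Under this model, 64 contiguous bytes starting at address 0 fit in one 64-byte chunk, while the same 64 bytes starting at address 4 straddle two chunks, so twice as much data is fetched for the same useful payload.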

Threads 0-15 access 4-byte words at addresses 116-176
  Thread 0 is lowest active, accesses address 116
  128-byte segment: 0-127
  [Figure: threads t0 ... t15 over an address axis marked 0, 32, 64, ..., 288; label: 128B segment]
Threads 0-15 access 4-byte words at addresses 116-176
  Thread 0 is lowest active, accesses address 116
  128-byte segment: 0-127 (reduce to 64B)
  [Figure: threads t0 ... t15 over an address axis marked 0, 32, 64, ..., 288; label: 64B segment]

Threads 0-15 access 4-byte words at addresses 116-176
  Thread 0 is lowest active, accesses address 116
  128-byte segment: 0-127 (reduce to 32B)
  [Figure: threads t0 ... t15 over an address axis marked 0, 32, 64, ..., 288; label: 32B transaction]
Threads 0-15 access 4-byte words at addresses 116-176
  Thread 3 is lowest active, accesses address 128
  128-byte segment: 128-255
  [Figure: threads t0 ... t15 over an address axis marked 0, 32, 64, ..., 288; label: 128B segment]

Threads 0-15 access 4-byte words at addresses 116-176
  Thread 3 is lowest active, accesses address 128
  128-byte segment: 128-255 (reduce to 64B)
  [Figure: threads t0 ... t15 over an address axis marked 0, 32, 64, ..., 288; label: 64B transaction]
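The walk-through above can be approximated by a small host-side simulation. This is a sketch based on my reading of the slides: take the lowest active thread, pick its 128-byte segment, gather the other active threads in that segment, then halve the transaction while only one half is used. The rule is assumed here for illustration and may not describe newer GPUs. The names `Transaction` and `coalesce_half_warp` are hypothetical, and every address is assumed to be 4-byte aligned.

```cuda
struct Transaction {
    unsigned base;  // first byte address covered by the transaction
    unsigned size;  // transaction size in bytes: 128, 64, or 32
};

// Simulates one half-warp of 16 threads, each reading one 4-byte word.
// Writes the resulting transactions to out and returns how many there are.
__host__ inline int coalesce_half_warp(const unsigned addr[16], Transaction out[16]) {
    bool active[16];
    for (int t = 0; t < 16; ++t) active[t] = true;

    int count = 0;
    for (int t = 0; t < 16; ++t) {
        if (!active[t]) continue;             // t is now the lowest active thread
        unsigned base = addr[t] / 128 * 128;  // 128-byte segment containing its address
        unsigned size = 128;
        unsigned lo = addr[t];                // lowest byte used in this segment
        unsigned hi = addr[t] + 3;            // highest byte used in this segment

        // Serve every active thread whose word falls in the same segment.
        for (int u = t; u < 16; ++u) {
            if (active[u] && addr[u] / 128 * 128 == base) {
                if (addr[u] < lo) lo = addr[u];
                if (addr[u] + 3 > hi) hi = addr[u] + 3;
                active[u] = false;
            }
        }

        // Halve the transaction while only its lower or upper half is used.
        while (size > 32) {
            unsigned half = size / 2;
            if (hi < base + half) {
                size = half;                  // only the lower half is used
            } else if (lo >= base + half) {
                base += half;                 // only the upper half is used
                size = half;
            } else {
                break;                        // both halves are used
            }
        }

        out[count].base = base;
        out[count].size = size;
        ++count;
    }
    return count;
}
```

For the addresses on the slides (thread t reads address 116 + 4t), the simulation produces two transactions: 32 bytes covering 96-127 for threads 0-2, and 64 bytes covering 128-191 for threads 3-15.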
Consider the stride of your accesses
  __global__ void foo(int *input, float3 *input2)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;

      // Stride 1
      int a = input[i];

      // Stride 2, half the bandwidth is wasted
      int b = input[2*i];

      // Stride 3, 2/3 of the bandwidth wasted
      float c = input2[i].x;
  }
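To experiment with the effect of stride, the access pattern can be turned into a kernel parameter. The following is a minimal sketch rather than the lecture's code: `strided_read` and the host helper `strided_read_ok` are hypothetical names, and the helper only checks correctness, so timing it for different strides is left to the reader.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread reads in[i * stride]. With stride 1 a warp reads consecutive
// words; with larger strides the same reads are spread over more memory chunks.
__global__ void strided_read(const int *in, int *out, int n_out, int stride) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n_out) out[i] = in[i * stride];
}

// Hypothetical host helper: fills in[k] = k, runs the kernel, and checks
// that out[i] == i * stride. Returns false on any CUDA error or mismatch.
bool strided_read_ok(int n_out, int stride) {
    std::vector<int> h_in((size_t)n_out * stride);
    for (size_t k = 0; k < h_in.size(); ++k) h_in[k] = (int)k;

    int *d_in = nullptr;
    int *d_out = nullptr;
    bool ok = cudaMalloc((void **)&d_in, h_in.size() * sizeof(int)) == cudaSuccess
           && cudaMalloc((void **)&d_out, n_out * sizeof(int)) == cudaSuccess;

    std::vector<int> h_out(n_out);
    if (ok) {
        cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int), cudaMemcpyHostToDevice);
        int threads = 256;
        int blocks = (n_out + threads - 1) / threads;
        strided_read<<<blocks, threads>>>(d_in, d_out, n_out, stride);
        // This copy waits for the kernel because both use the default stream.
        ok = cudaMemcpy(h_out.data(), d_out, n_out * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return false;

    for (int i = 0; i < n_out; ++i) {
        if (h_out[i] != i * stride) return false;
    }
    return true;
}
```

Launching `strided_read` with stride 1, 2, and 3 on the same output size reproduces the three access patterns in the slide's `foo` kernel.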

Example: Array of Structures (AoS)
  struct record
  {
      int key;
      int value;
      int flag;
  };

  record *d_records;
  cudaMalloc((void **)&d_records, ...);
Example: Structure of Arrays (SoA)
  struct SoA
  {
      int *keys;
      int *values;
      int *flags;
  };

  SoA d_SoA_data;
  cudaMalloc((void **)&d_SoA_data.keys, ...);
  cudaMalloc((void **)&d_SoA_data.values, ...);
  cudaMalloc((void **)&d_SoA_data.flags, ...);
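How the two layouts differ inside a kernel can be sketched as follows. This is not the lecture's own comparison, only an illustration that reuses the `record` and `SoA` definitions from the slides. The kernels `copy_keys_aos` and `copy_keys_soa` and the helper `layouts_agree` are hypothetical names, and only the `keys` array is allocated because the sketch never touches values or flags.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct record { int key; int value; int flag; };     // AoS element, as on the slide
struct SoA { int *keys; int *values; int *flags; };  // SoA layout, as on the slide

// AoS: consecutive threads read keys that are sizeof(record) = 12 bytes apart,
// so two thirds of every fetched chunk is value/flag data that goes unused.
__global__ void copy_keys_aos(const record *aos, int *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = aos[i].key;
}

// SoA: consecutive threads read consecutive 4-byte keys (stride 1).
__global__ void copy_keys_soa(SoA soa, int *out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = soa.keys[i];
}

// Hypothetical host check: both kernels must return the same keys.
bool layouts_agree(int n) {
    std::vector<record> h_aos(n);
    std::vector<int> h_keys(n);
    for (int i = 0; i < n; ++i) {
        h_aos[i] = record{i, 0, 0};
        h_keys[i] = i;
    }

    record *d_aos = nullptr;
    SoA d_soa = {nullptr, nullptr, nullptr};  // values and flags stay unused here
    int *d_out_aos = nullptr;
    int *d_out_soa = nullptr;
    bool ok = cudaMalloc((void **)&d_aos, n * sizeof(record)) == cudaSuccess
           && cudaMalloc((void **)&d_soa.keys, n * sizeof(int)) == cudaSuccess
           && cudaMalloc((void **)&d_out_aos, n * sizeof(int)) == cudaSuccess
           && cudaMalloc((void **)&d_out_soa, n * sizeof(int)) == cudaSuccess;

    std::vector<int> a(n), b(n);
    if (ok) {
        cudaMemcpy(d_aos, h_aos.data(), n * sizeof(record), cudaMemcpyHostToDevice);
        cudaMemcpy(d_soa.keys, h_keys.data(), n * sizeof(int), cudaMemcpyHostToDevice);
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        copy_keys_aos<<<blocks, threads>>>(d_aos, d_out_aos, n);
        copy_keys_soa<<<blocks, threads>>>(d_soa, d_out_soa, n);
        ok = cudaMemcpy(a.data(), d_out_aos, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(b.data(), d_out_soa, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess
          && a == b && a == h_keys;
    }
    cudaFree(d_aos);
    cudaFree(d_soa.keys);
    cudaFree(d_out_aos);
    cudaFree(d_out_soa);
    return ok;
}
```

Both kernels compute the same result; the difference is only in how many bytes each warp has to pull from memory to get its keys, which is the stride-3 versus stride-1 situation described earlier.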

Example: SoA vs. AoS
  __global__ void bar(record *AoS_data,

This note was uploaded on 10/31/2011 for the course EE 101 taught by Professor Gibbons during the Spring '09 term at Michigan State University.
