# EE 7700-1 Take-Home Pre-Final Examination

Name: ____________  Alias: ____________

Wednesday, 29 April 2009 to early Monday morning, 4 May 2009

| Problem | Points |
|---|---|
| Problem 1 | 20 pts |
| Problem 2 | 20 pts |
| Problem 3 | 20 pts |
| Problem 4 | 20 pts |
| Problem 5 | 20 pts |
| Exam Total | 100 pts |

Good Luck!

**Problem 1:** [20 pts] The CUDA code below performs some calculation on each element of an array, `global_data`. It operates on a GeForce-8000-like GPU with:

- eight multiprocessors;
- a core clock frequency of 1.5 GHz;
- a 400-cycle global access latency.

The `cuda_normal_thread` routine consists of 9 instructions, including one global load.

```c
__host__ void launch(int element_amt_lg)
{
    const int element_amt = 1 << element_amt_lg;
    dim3 block_dim(256, 1, 1);
    dim3 grid_dim(1 << (element_amt_lg - 8), 1, 1);
    cuda_normal_thread<<<grid_dim, block_dim>>>(element_amt);
}

__global__ void cuda_normal_thread(int element_amt)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const float coord = global_data[idx];
    output_data[idx] = some_func(transform_matrix, coord);
}
```

**(a)** Determine either the time needed for 2^20 elements (`element_amt_lg` is 20), or else determine the computation rate in elements per cycle. (Two ways of expressing the same thing.)

**(b)** The code below may be the true assembly language for `cuda_normal_thread`. (See `http://www.cs.rug.nl/~wladimier/de`...) Based on this code, what would be the minimum number of threads per multiprocessor needed to achieve peak performance? Explain.

```
0: mov.b16                  $r0.hi, %ntid.y
1: cvt.u32.u16              $r1, $r0.lo
2: mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
3: shl.u32                  $r1, $r0, 0x00000002
4: add.u32                  $r0, $r1, c0[0x0040]
5: mov.u32                  $r0, g[$r0]      // <- Global load of data into r0.
6: add.u32                  $r1, $r1, c0[0x0044]
7: add.rn.f32               $r0, $r0, 0x3f800000
8: mov.end.u32              g[$r1], $r0
```

**Problem 2:** [20 pts] The CUDA code below does exactly the same thing as the code from the previous problem, but it does so in an un-CUDA-like fashion: looping over elements within a thread rather than spawning one thread per element. The code runs on the system described in the previous problem. ...
## This note was uploaded on 01/03/2012 for the course EE 7700 taught by Professor Staff during the Spring '08 term at LSU.
