16.1-autotuning

16.1-autotuning - An introduction to autotuning...

This preview shows pages 1–12 of 57.

Unformatted text preview:

An introduction to autotuning
Richard (Rich) Vuduc, Assistant Professor, CSE
Guest Lecture for CS/CSE 6230: HPC Tools & Apps
Wednesday, October 20, 2010

What problem are we solving?

[Figure: Gflop/s vs. n (matrix dimension), n = 200 to 2000. Series: Untuned (1-core). On Hogwarts cluster: 2-socket x 4-core Intel “Nehalem”, 2.27 GHz.]

[Figure: same plot, annotated. In cache, 1.5–2 Gflop/s. Out of cache, 0.2 Gflop/s.]

[Figure: Gflop/s vs. n (matrix dimension). Series: Untuned (1-core), Cilk++. Parallelized, 8 threads: ~5.4x in-cache, ~7.5x out of cache.]

Architecture 101: Memory hierarchy

To deal with slow memory, augment system with smaller but faster memory.

[Diagram: CPU, fast memory, slow memory. Fast = cache or RAM or “local”. Slow = RAM or disk or “remote”. Label: 10–10^4 x.]

HPC 101: Blocking is I/O-optimal. Hong & Kung (1981)

Computation: C ← A ⋅ B

[Diagram: C, A, B for the naïve and blocked algorithms.]

Thm [Hong & Kung (‘81)]: A blocked algorithm minimizes slow memory word transfers.

Tuning parameter: Block size

[Figure: Gflop/s vs. n (matrix dimension). Series: Untuned (1-core), Cilk++, Cilk++ recursive. Parallelized using block-recursive algorithm: ~6.6x in-cache, ~46x out (!)]

[Figure: Gflop/s vs. n (matrix dimension). Series: Untuned (1-core), Cilk++, Cilk++ recursive, Vendor tuned (1-core). Intel’s code on 1 thread (not even multithreaded!)]

...
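The idea of treating block size as a tuning parameter can be sketched in a few lines. The code below is a minimal illustration, not the lecture's implementation: it multiplies two n x n matrices in b x b blocks and then times a handful of candidate block sizes, keeping the fastest. The function names `matmul_blocked` and `autotune_block_size`, the candidate list, and the test matrices are all assumptions made for this example. In pure Python the interpreter overhead hides most cache effects, so the timings only show the shape of the search; the same loop structure in a compiled language is where block size would be expected to matter.

```python
import time


def matmul_blocked(A, B, n, b):
    """Return C = A * B for n x n matrices stored as flat row-major lists.

    The three outer loops walk over b x b blocks; the three inner loops
    multiply one block of A by one block of B into a block of C.
    """
    C = [0.0] * (n * n)
    for ii in range(0, n, b):
        for kk in range(0, n, b):
            for jj in range(0, n, b):
                for i in range(ii, min(ii + b, n)):
                    for k in range(kk, min(kk + b, n)):
                        a = A[i * n + k]
                        for j in range(jj, min(jj + b, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C


def autotune_block_size(n, candidates, trials=3):
    """Time each candidate block size; return (fastest_b, timings dict)."""
    # Hypothetical test inputs: any n x n matrices would do.
    A = [float((i % 7) + 1) for i in range(n * n)]
    B = [float((i % 5) + 1) for i in range(n * n)]
    timings = {}
    for b in candidates:
        best = float("inf")
        for _ in range(trials):
            t0 = time.perf_counter()
            matmul_blocked(A, B, n, b)
            best = min(best, time.perf_counter() - t0)
        timings[b] = best  # keep the best of several runs to reduce noise
    fastest = min(timings, key=timings.get)
    return fastest, timings


if __name__ == "__main__":
    fastest, timings = autotune_block_size(n=64, candidates=[4, 8, 16, 32, 64])
    for b, t in sorted(timings.items()):
        print(f"b = {b:3d}: {t:.4f} s")
    print("fastest block size on this machine:", fastest)
```

The search here is exhaustive over a short list, which is the simplest possible strategy; the result depends on the machine and on n, which is the reason for measuring rather than fixing b in advance.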

This note was uploaded on 11/04/2010 for the course CSE 6530, taught by Professor Jeffrey Vetter during the Fall '10 term at Georgia Tech.
