03 - Announcements

Announcements

• Registration is full and almost closed
• Class schedule
  – Class on Aug 30, Sept 1
  – NO class on Sept 3, 6 (holiday), 8
  – Class on Sept 10
• TA
  – Jonathan Brownsworth
  – j.brownsworth@gatech.edu
• Action items
  – Brainstorm term project ideas and collaborators
  – Get familiar with HPC resources
• Homework #1
  – Survey at http://www.surveymonkey.com/s/SRTJ7GL
  – Submit the survey at that website
  – Due Mon, Aug 30 @ 8am Eastern
• Homework #2 coming next week
• T-Square mailing lists
  – Testing this week
  – Did you receive my email?
• Slides/syllabus/project list available at
  – T-Square

CS6230: HPC Tools and Applications
Term Project Discussion
Jeffrey Vetter

Pre-proposals

• Title
• Participants
  – List all participants that will collaborate on this project
• Goal
  – What do you hope to learn, prove, implement, publish, etc. (in no more than 200 words)?
  – Expected outcome
• Approach
  – What is your initial plan to accomplish this goal?
• Metrics
  – How will you show that you have met your goal? Time to solution? Accuracy? Scaling? Others?
• Required resources
  – Short list of the required resources: software, hardware, source code, data sets, and/or knowledge (e.g., domain-specific, algorithmic)
• Risks
  – Short list of what might go wrong with the project and keep you from reaching your goal
• Related work
  – Are you working on related work for this project in any other class?

Nice Example from CDC/Chris Lynberg

Term Project (1)

• The term project will focus on a research- or application-related HPC project and will depend greatly on the student's skills and background.
  – Students will identify term projects in the first month and work on them throughout the semester.
  – The final project is chosen by the student in conjunction with the instructor and will probe some area of high-performance computing of interest to the student.
  – Each team will present a pre-proposal by Oct 1.
  – Each team will present a project summary during the last weeks of the class.
  – Required components of the term project are the
    • written project proposal
    • presentation on the project proposal
    • final project presentation
    • final project written report
  – Possible implementation languages range from FORTRAN to C/C++ and potentially some new or exploratory languages (depending on the student's previous experience).

Term Project (2)

• Class projects may be performed by students individually or by teams of two or three.
  – For Computer Science or Computer Engineering students, the effective outcome of the course is a deeper understanding of parallel/distributed computing and its communication mechanisms, operating systems, middleware, and programming support.
  – Other students learn the topics listed above while focusing on the use of high-performance machines for realistic applications.
  – Typical application domains include scientific and engineering computation, real-time systems, large-scale data analysis and visualization, and parallel optimization methods.
  – Students will be asked to evaluate their team partners for their final grade.
Term Project Ideas (examples)

Will post a project idea listing online later this week.

• Parallelize an application on a contemporary HPC system
• Implement application kernels (e.g., DGEMM, ZGEMM, FFT, MD) on emerging architectures
  – GPUs
  – High-scale FPGAs
  – Cray massively multithreaded systems: MTA or XMT
• Investigate new programming models
  – OpenCL
  – CUDA
  – CAF
  – UPC
• Develop novel methods for scalable performance analysis and optimization using data mining and machine learning techniques
• Compare/evaluate algorithms used in consumer gaming to those used in high-performance computing in order to understand the potential leverage of commodity gaming on HPC
• Optimize the HPC Challenge benchmarks for SC07
• Design (paper, simulator) extensions to existing memory technologies (such as FBDIMM) that allow remote memory operations
• Build an Eclipse plugin that uses empirical measurements to identify regions of application source code as good candidates for acceleration and optimization
• Construct tools to assist users in developing symbolic performance models from our Modeling Assertions instrumentation
• Design a symbolic simulator for large-scale interconnects that takes its input from our Modeling Assertions tool

Term Project Ideas (more examples from last class)

• Parallel File Systems on Object-Based Storage Devices
• Community Detection Using Parallel Extremal Optimization
• Evaluating the Performance of an MPI-OpenMP Hierarchical Approach for LBE LES
• Pthreads + MPI + NVIDIA CUDA Ray Tracer
• Optimized Parallel Asynchronous Variational Integrators (OPAVI)
• Parallel Dual-Tree Nearest Neighbor Search
• Building a Fast CFD Solver for Surgical Planning
• Parallelization of RNA Secondary Structure Prediction
• Implementation and Performance Comparison of Parallel Matrix Multiplication Algorithms on NVIDIA G80
• Heuristic Method for Graph Partitioning on Cell BE
• Accelerating Distribution Ray Tracing Through Parallel Computing
• Parallel Segmentation of Images
• Parallelized Logic Circuit Simulation

Other ideas

• Integrate a benchmark from WRF (climate code); five test problems are mentioned on this site, and some have existing CUDA code
  – http://www.mmm.ucar.edu/wrf/WG2/GPU/
• Machine learning benchmark
• Bioinformatics (Smith-Waterman, BLAST, etc.)
  – http://www.nvidia.com/object/tesla_bio_workbench.html
• Graph algorithm benchmark: BFS, APSP, and several others
• Multi-dimensional FFT: we have 1D; a student could try 2D or 3D
• Create a truly parallel GPU benchmark with MPI + CUDA/OpenCL
• More exotic GPU programming models like Concurrent Collections
CS6230: HPC Tools and Applications
Parallelism
Jeffrey S. Vetter
Computational Science and Engineering, College of Computing, Georgia Institute of Technology

The Hierarchy of Parallelism

• Parallelism exists at every level of the architecture; it's your job to exploit it
  – Core: instruction-level parallelism
  – Socket: multi-core parallelism
  – Node: multi-socket parallelism
  – System: multi-node parallelism

Parallelism >> Core

• Each core is able to 'execute' multiple instructions simultaneously
  – Superscalar
    • Intel Woodcrest can issue 4 different instructions per clock
  – Vector
    • A Cray X1E SSP can issue 1 vector instruction over 64 data items per clock
  – Multithreading
    • The Cray XMT / Tera MTA2 can manage 128 threads

Exploiting Parallelism in the Core

• Contemporary cores are very complex
• Performance is very unstable and sensitive to small changes in code
• To get high performance across the system, you must start with the core
(Figure: AMD family 10h core, courtesy of AMD)

Expected Performance from a Core

• Peak IPC
  – Capable of issuing multiple instructions per cycle
    • Peak IPC of 4-8
    • Realized IPC of 0.3-2.5
• Floating-point peak
  – Commonly used metric in HPC and computational science (e.g., TOP500)
  – Clock speed * number of FP ops per cycle
    • 2.6 GHz * 4 flops/cycle = 10.4 GFLOPS (see the sketch below)
  – This can be complex to calculate!
    • Must use specific instructions and features of the core
    • Must start on every cycle
    • Must be pipelined
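The peak number in the bullet above is just the product of two machine parameters. The short C sketch below reproduces that arithmetic; the clock rate and flops-per-cycle values are the example numbers from the slide, not properties of any particular machine.

    /*
     * A minimal sketch of the peak calculation, assuming the example
     * numbers from the slide: a 2.6 GHz clock and 4 floating-point
     * operations retired per cycle. Real peak depends on the specific
     * core (SIMD width, FMA support, whether the FP units can issue
     * every cycle), so treat these values as placeholders.
     */
    #include <stdio.h>

    int main(void) {
        double clock_ghz       = 2.6;  /* clock rate in GHz (example value) */
        double flops_per_cycle = 4.0;  /* FP operations per cycle (example value) */

        double peak_gflops = clock_ghz * flops_per_cycle;
        printf("Per-core peak: %.1f GFLOPS\n", peak_gflops);  /* prints 10.4 */
        return 0;
    }

The next slide lists everything this upper bound ignores.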
Sustained Performance (vs. Peak)

• What does peak ignore?
  – Non-flop instructions
  – Resource contention
  – Startup latencies
  – Overheads
  – Memory access patterns
  – Communication for I/O and interconnects
  – Algorithm selection
  – Correctness
  – Time to solution
  – OS services
• So why do people use it?
  – Easy to calculate
    • No simulations
    • No workloads or benchmarking
    • No other architectural features involved
  – Until 5-10 years ago, it was a reasonable first-order performance estimate

Today's Cores Have Very Specific Requirements for Approaching Peak FP Performance

• Generally
  – Must use SIMD extensions: SSE, AltiVec
  – Must have aligned, packed memory
• Thankfully, the compiler assists in optimizing at this level
  – Some hand coding remains
    • The STI Cell is still evolving
• Libraries are often hand-optimized by experts
  – Intel MKL
  – AMD ACML
  – ATLAS
• And there is some hardware support for measuring performance
  – Hardware performance counters
    • Flops, memory activity, TLB activity, etc.
  – PAPI

Get to Know Your SI Prefixes

  Prefix   Symbol   10^n     Short-scale name
  yotta-   Y        10^24    septillion
  zetta-   Z        10^21    sextillion
  exa-     E        10^18    quintillion
  peta-    P        10^15    quadrillion
  tera-    T        10^12    trillion
  giga-    G        10^9     billion
  mega-    M        10^6     million
  kilo-    k        10^3     thousand
  hecto-   h        10^2     hundred
  deca-    da       10^1     ten
  (none)            10^0     one
  deci-    d        10^-1    tenth
  centi-   c        10^-2    hundredth
  milli-   m        10^-3    thousandth
  micro-   µ        10^-6    millionth
  nano-    n        10^-9    billionth
  pico-    p        10^-12   trillionth
  femto-   f        10^-15   quadrillionth
  atto-    a        10^-18   quintillionth
  zepto-   z        10^-21   sextillionth
  yocto-   y        10^-24   septillionth

Source: Wikipedia

Parallelism >> Socket

• Sockets, or multi-chip modules, or chip multiprocessors
  – Multiple cores sharing a common set of basic resources
    • Memory
    • I/O
    • Power and cooling (adaptive voltage and clocks)

Multiple Cores per Module

• AMD's quad core (similar cores: AMD 10h)
• Dedicated
  – L1 cache
  – L2 cache
• Shared
  – L3 cache
  – HyperTransport links
• Independent clock frequencies
(Figure courtesy of AMD)

Memory Hierarchy

• Main memory is shared across all cores
  – Latency and bandwidth across cores are reasonably similar
• Temporal and spatial locality
  – More important because of 3 levels of cache
  – Top 2 levels are dedicated: no interference, no contention

Parallelism >> Node (AMD)
(Figure courtesy of AMD)

Parallelism >> Node (AMD, 4 Socket)
(Figure)

Parallelism >> Node (Intel)
(Figure)

Parallelism >> Node Observations

• Until a few years ago, this was the typical architecture of a multiprocessor SMP
  – One core per socket
• Now, with multicore, we have much higher levels of intranode parallelism
• Programming *can* remain the same (see the OpenMP sketch below)
  – Shared-memory model across all cores
  – Wide range of bandwidths and latencies
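To make the shared-memory bullet concrete, here is a minimal OpenMP sketch of my own (an illustration under the assumptions noted in the comments, not code from the course): one process, many threads, one address space spanning all the cores of the node.

    /*
     * A minimal OpenMP sketch of the shared-memory model within a node,
     * written for illustration (not course code). All threads read and
     * write the same arrays; the runtime splits the loop across the
     * cores. The array size is an arbitrary example.
     * Compile with something like: gcc -O2 -fopenmp node_sketch.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));

        /* Every thread sees the same a[] and b[]: shared memory on the node.
         * Initializing in parallel also tends to place pages near the threads
         * that touch them first on systems with a first-touch policy. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) {
            a[i] = (double)i;
            b[i] = 2.0 * a[i];
        }

        printf("threads available: %d, b[N-1] = %.1f\n",
               omp_get_max_threads(), b[N - 1]);
        free(a);
        free(b);
        return 0;
    }

The same loop runs unchanged on one core or on all of them; what changes across cores is the bandwidth and latency each thread actually sees.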
Parallelism >> System (Cray XT4)
(Figure courtesy of Cray)

Nodes Connect via I/O Channel
(Figure)

Parallelism >> System Observations

• The interconnect is shared by all cores on the node
  – Resource contention
    • Your own application's message traffic
    • Other applications' message traffic
    • I/O traffic
    • Other
• Node memory systems are NOT shared among nodes
  – In some systems this isn't the case
• Locality matters!
• The application is responsible for orchestrating data movement at all levels (for performance)
  – Overlapping communication and computation

PARALLELISM IS EVERYWHERE!

• 2-core CPU
• 2-core CPU + 32-core GPU (↑ perf + power)
• 2-core CPU + 32-core GPU (↑ perf + power) + 16-core GPU (↓ perf + power)

PARALLELISM CAN BE COMPLEX

Challenge: Machine Complexity

• "Roadrunner" system at Los Alamos National Laboratory, an early petaflop system (source: Paul Henning, LANL)
  – 2 x 2-core x86
  – 4 Cell processors (e.g., Sony PS3) = 4 x (1 PowerPC + 8 SIMD cores)
  – 192 nodes per cluster x 18 clusters
  – 3 compilers
  – 2 byte-orderings

PROGRAMMING TO A HIERARCHY OF PARALLELISM

Programming

• Core
  – Assembler
  – Compiler
  – Libraries, frameworks
• Socket: multicore
  – Threads: Pthreads, OpenMP
  – Distributed-memory models like MPI, or GAS languages
  – Libraries, frameworks
• Node
  – Threads: Pthreads, OpenMP
  – Distributed-memory models like MPI, or GAS languages
  – Memory-thread affinity becomes much more important
  – Libraries, frameworks
• System
  – Distributed-memory models like MPI, GAS languages
  – Libraries, frameworks

Memory Hierarchy

• Main memory is shared across all cores
  – Latency can differ significantly across cores in an SMP
  – Thread affinity
  – Memory allocation policies
    • First touch
• Temporal and spatial locality
  – Top 2 levels are dedicated: no interference, no contention
  – Multiple shared L3 caches
  – Multiple TLBs
  – Use local main memory if possible

Observations

• Currently, applications must use multiple programming models to exploit parallelism at this scale (see the hybrid sketch below)
  – SIMD
  – MPI
• Shared memory is not provided by the hardware at the highest levels
• Distributed-memory models require more explicit management of communication
  – Cell, Polaris, and other architectures have limited or no caches
    • Explicit communication is coming soon
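As a concrete sketch of the first observation, consider a hybrid MPI plus OpenMP skeleton. This is my own illustrative code under common assumptions (a few MPI ranks per node, OpenMP threads within each rank, SIMD left to the compiler); it is not taken from the course material, and the loop and its bounds are arbitrary.

    /*
     * Sketch of the "multiple programming models" observation: MPI for
     * distributed memory across nodes, OpenMP threads for the
     * shared-memory cores inside each rank. Illustrative skeleton only.
     * Build with an MPI wrapper, e.g.: mpicc -O2 -fopenmp hybrid.c
     */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;

        /* Ask for an MPI library that tolerates OpenMP threads. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Internode decomposition: each rank takes a strided slice. */
        double local = 0.0;

        /* Intranode parallelism: threads share this rank's memory. */
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < 1000000; i += nranks)
            local += 1.0 / (double)(i + 1);

        /* Explicit communication between ranks (no shared memory here). */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d sum=%f\n",
                   nranks, omp_get_max_threads(), global);

        MPI_Finalize();
        return 0;
    }

The communication between ranks is fully explicit, while the threads inside a rank rely on the node's shared memory, which is exactly the split the slide describes.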
Peak Floating Point Metric

• Core FLOPS
  – Clock rate * number of flops per cycle
    • 2.6 GHz * 2 = 5.2 GFLOPS
• Again, why? Easy to calculate
  – No simulations
  – No benchmarking
• Socket FLOPS
  – Core FLOPS * cores per socket
    • 5.2 GFLOPS * 2 = 10.4 GFLOPS
• Node FLOPS
  – Socket FLOPS * sockets per node
    • 10.4 GFLOPS * 1 = 10.4 GFLOPS
  – Provides a performance upper bound
• System FLOPS
  – Node FLOPS * nodes per system
    • 10.4 GFLOPS * 11,508 ≈ 119 TFLOPS (Jaguar @ ORNL)
• Sustained performance on real workloads is a much better measure
  – But it does take work!
  – And speculation

CS6230: HPC Tools and Applications
Benchmarks
Jeffrey Vetter

Benchmarks

• Benchmarks provide a consistent, realistic workload (and perhaps a methodology) for measuring different systems
• A representative and concrete target
• Benchmark caveats
  – 'Benchmarkology': marketing, spin
  – "Fooling the masses" (David Bailey)
    • http://crd.lbl.gov/~dhbailey/dhbtalks/dhb-12ways.pdf

Benchmarking Spectrum
(Figure: DARPA HPCS)

Examples of Different Levels of Benchmarks

• Microbenchmarks
  – STREAM, DGEMM, 1D FFT, RandomAccess (a STREAM-style sketch follows at the end of this deck)
• Kernels
  – Solvers, signal processing, I/O
• Compact applications
  – NAS Parallel Benchmarks, some other benchmark suites, HPCS SSCA benchmarks
• Complete applications (which have been prepared for benchmarking)
  – POP, HYCOM, WRF, CAM, GYRO, AMBER

HPC Challenge Benchmarks (and Awards)

• HPCC is designed to emphasize
  – different aspects of computer performance (in contrast to TOP500)
  – benchmark guidelines
• HPCC Awards were initiated in 2005; submissions are open for 2007
  – Class 1: Best Performance (4 awards)
    • Best performance on a run submitted to the website
      – HPL
      – RandomAccess
      – STREAM
      – FFT
    • The prize will be $500 plus a certificate for each benchmark
  – Class 2: Most Productivity
    • Most "elegant" implementation of at least two benchmarks
    • 50% on performance
    • 50% on code elegance, clarity, and size
    • The prize will be $1500 plus a certificate for this award
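To close, here is the triad-style sketch referenced in the benchmark-levels list, in the spirit of the STREAM microbenchmark. It is an illustration only, not the official STREAM code: the array size, repetition count, and timing method are arbitrary choices, and STREAM's validation and reporting rules are omitted.

    /*
     * A triad-style sketch in the spirit of the STREAM microbenchmark.
     * NOT the official STREAM code; sizes and timing are simplified
     * choices for illustration.
     * Compile with something like: gcc -O2 -fopenmp triad_sketch.c
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)   /* ~16M doubles per array; pick something larger than cache */
    #define REPS 10

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double s = 3.0;

        /* Parallel initialization (also helps first-touch page placement). */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        for (int r = 0; r < REPS; r++) {
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + s * c[i];   /* triad: 2 flops, 24 bytes moved */
        }
        double t = omp_get_wtime() - t0;

        double bytes = (double)REPS * 3.0 * (double)N * sizeof(double);
        printf("triad: %.2f GB/s sustained (a[0] = %.1f)\n",
               bytes / t / 1e9, a[0]);

        free(a); free(b); free(c);
        return 0;
    }

Comparing the reported figure to the node's peak memory bandwidth gives a first feel for the gap between sustained and peak performance discussed earlier.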