52 Pages

David Patterson

Course: CS 596, Fall 2008
School: USC

Word Count: 3424

The Landscape of Parallel Computing Research: A View from Berkeley 2.0
Krste Asanovic, Ras Bodik, Jim Demmel, John Kubiatowicz, Kurt Keutzer, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, and Kathy Yelick
June 2007

Parallel Revolution, Ready or Not
- Power Wall + Memory Wall + ILP Wall = Brick Wall
  - End of uniprocessors
- New Moore's Law is 2X cores per chip every technology generation, same clock
  - Not a breakthrough but a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures
- Sea change for HW & SW industries, since it changes the programmer model and responsibilities
- HW/SW industries bet their future that parallelism will succeed, despite a track record of failures
  - Dead Computer Society: Illiac IV, Encore, Sequent, Transputer, Thinking Machines, MasPar, NCUBE, Kendall Square Research, Convex, (SGI)

Need a Fresh Approach
- Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism
  - Krste Asanovic, Ras Bodik, Jim Demmel, John Kubiatowicz, Kurt Keutzer, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick
  - Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
- Tried to learn from successes in parallel embedded (BWRC) and high performance computing (LBNL)
- Berkeley View Tech. Report: see view.eecs.berkeley.edu
- This talk is the next step in ideas: Berkeley View 2.0
  A. The computing platforms of the future
  B. Novel approach to design of future systems
  C. 7 questions to frame parallel research

Bell's Evolution Of Computer Classes
- Technology enables two paths:
  1. Increasing performance, same cost (and form factor): constant price, increasing performance
     [Figure: log price vs. time (to the 2010s) for Mainframes, WS, PC]
  2. Constant performance, decreasing cost (and form factor)
     [Figure: log price vs. time for Mainframe, Mini, WS, PC, Laptop, Handset, Ubiquitous]

A: Re-inventing Client/Server
- "The Datacenter is the Computer"
  - Building-sized computers: Google, MS
- "The Laptop/Handheld is the Computer"
  - Dell no. of laptops > desktops? '06 profit?
  - 1B cell phones/yr, increasing in function
  - Apple iPhone
  - Otellini demoed "Universal Communicator" '08: combination cell phone, PC, and video device
- Laptop/Handheld as future client, Datacenter as future server

Laptop/Handheld Reality
- Use with large displays at home or office
- What % of time disconnected? 10%? 30%? 50%?
- Disconnectedness due to economics
  - Cell towers and support system are expensive to maintain: charge for use to recover costs, so it costs to communicate
  - Policy varies, but most countries allow wireless investment where it makes the most money: cities well covered, rural areas will never be covered
- Disconnectedness due to technology
  - No coverage deep inside buildings
  - Satellite communication uses up batteries
- Need computation & storage in L/H

B: Try Application Driven Research?
- Conventional wisdom in CS research
  - Users don't know what they want
  - Computer scientists solve individual parallel problems with a clever language feature (e.g., dataflow), new compiler pass, or novel hardware widget (e.g., SIMD)
  - Approach: push (foist?) CS nuggets/solutions on users
  - Problem: stupid users don't learn/use the proper solution
- Another approach
  - Work with domain experts developing compelling apps
  - Provide HW/SW infrastructure necessary to build, compose, and understand parallel software written in multiple languages
  - Research guided by commonly recurring patterns actually observed in compelling application codes

C: 7 Questions for Parallelism
- Applications:
  1. What are the apps?
  2. What are kernels of apps?
- Hardware:
  3. What are the HW building blocks?
  4. How to connect them?
- Programming Model & Systems Software:
  5. How to describe apps and kernels?
  6. How to program the HW?
- Evaluation:
  7. How to measure success?
(Inspired by a view of the Golden Gate Bridge from Berkeley)

Apps and Kernels Tower: What are the problems?
- Who needs 100 cores to run M/S Word?
  - Failure of imagination? (CS education?)
  - Need compelling apps that use 100s of cores
- What about parallel benchmarks?
  - Few examples (e.g., SPLASH, NAS)
  - Optimized to old models, languages, architectures
- How to invent parallel systems of the future when tied to old code and programming models of the past?

Coronary Heart Disease
- [Figure: before and after]
- Modeling to help patient compliance?
  - 400k deaths/year, 16M with symptoms, 72M BP
- Massively parallel, real-time variations
  - CFD FE: solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol

Content Based Image Retrieval
- [Figure: query by example -> similarity metric over image database (1000s of images) -> candidate results -> relevance feedback -> final result]
- Built around key characteristics of personal databases
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects

Compelling Laptop/Handheld Apps
- Health Coach
  - Since laptop/handheld is always with you, record images of all meals, weigh plate before and after, analyze calories consumed so far
    - "What if I order a pizza for my next meal? A salad?"
  - Since laptop/handheld is always with you, record amount of exercise so far, show how body would look if you maintain this exercise and diet pattern for the next 3 months
    - "What would I look like if I regularly ran 2 miles? 4 miles?"
- Face Recognizer/Name Whisperer
  - Laptop/handheld scans faces, matches image database, whispers name in ear (relies on Content Based Image Retrieval)

Compelling Laptop/Handheld Apps
- Meeting Diarist
  - Laptops/Handhelds at meeting coordinate to create speaker-identified, partially translated text diary of meeting
- Teleconference speaker identifier, speech helper
  - L/Hs used for teleconference, identifies who is speaking, "closed caption" hint of what is being said

Compelling Laptop/Handheld Apps
- Music Enhancer
  - Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
  - Laptop/Handheld recreates 3D sound over ear buds
- Hearing Augmenter
  - Laptop/Handheld as accelerator for hearing aid
- Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for Laptop/Handheld
  - New forms of bodily engagement when performing musical material
- Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array in a single location with a radiation pattern controlled by an onboard FPGA: a 10-inch-diameter icosahedron incorporating 120 1.25-inch drivers developed by Meyer Sound, each driven from a separate audio channel, plus circuit boards for all of the control and class-D amplification functions. A laptop controls the array via Gigabit Ethernet.

Can find patterns widely used?
- Look for common patterns of communication and computation
  1. Embedded Computing (EEMBC benchmark)
  2. Desktop/Server Computing (SPEC2006)
  3. Data Base / Text Mining Software
     - Advice from Jim Gray of Microsoft and Joe Hellerstein of UC
  4. Games/Graphics/Vision
  5. Machine Learning
     - Advice from Mike Jordan and Dan Klein of UC Berkeley
  6. High Performance Computing (Original 7 Dwarfs)
- Result: 13 Dwarfs

Dwarf Popularity (Red Hot -> Blue Cool)
[Figure: heat map. Rows: Finite State Mach., Combinational, Graph Traversal, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Dynamic Prog, N-Body, MapReduce, Backtrack/B&B, Graphical Models, Unstructured Grid. Columns: Embed, SPEC, DB, Games, ML, HPC.]

13 Dwarfs (so far)
 1. Finite State Machine
 2. Combinational Logic
 3. Graph Traversal
 4. Structured Grids
 5. Dense Linear Algebra
 6. Sparse Linear Algebra
 7. Spectral Methods (FFT)
 8. Dynamic Programming
 9. N-Body Methods
10. MapReduce
11. Back-track / Branch & Bound
12. Graphical Model Inference
13. Unstructured Grids
- Claim: parallel arch., lang., compiler must do at least these well to do future parallel apps well
- Note: MapReduce is embarrassingly parallel; perhaps FSM is embarrassingly sequential?

Dwarf Popularity (Red Hot -> Blue Cool): How do compelling apps relate to 13 dwarfs?
[Figure: the same heat map with added columns Health, Image, Speech, Music.]

Dwarf Popularity (Red Hot -> Blue Cool): Stanford Transactional Apps for Multi Processing? (http://stamp.stanford.edu)
[Figure: the same heat map with STAMP applications marked: Kmeans Clustering, Vacation Reservation System, Gene Sequencing, Delaunay mesh generation.]

4 Valuable Roles of Dwarfs
1. "Anti-benchmarks": not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware
2. Universal, understandable vocabulary to talk across disciplinary boundaries
3. Define building blocks for creating libraries & frameworks that cut across app domains
4. They decouple research, allowing analysis of HW & SW engineering and programming without waiting years for full apps

Berkeley View 2.0: Apps Tower
- Application-Driven Research vs. CS Solution-Driven Research
- Drill down on 4 app areas to guide research agenda
- Dwarfs to represent broader set of apps to guide research agenda
- Dwarfs help break through traditional interfaces
  - Benchmarking, multidisciplinary conversations, target for libraries, and parallelizing parallel research

How do we describe apps and kernels?
- Observation: Use Dwarfs.
- Dwarfs are of 2 types. Algorithms in the dwarfs can be implemented either as:
  - Libraries: compact parallel computations within a traditional library
    - Dense matrices
    - Sparse matrices
    - Spectral
    - Combinational
    - Finite state machines
  - Patterns/Frameworks: compute/communicate pattern implemented as a framework
    - MapReduce
    - Graph traversal, graphical models
    - Dynamic programming
    - Backtracking/B&B
    - N-Body
    - (Un)Structured Grid
- Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a Map-Reduce framework, mapping 1D FFTs and then transposing (generalized reduce)

Composing dwarfs to build apps
- Any parallel application of arbitrary complexity may be built by composing parallel and serial components
  - Serial code invoking parallel libraries, e.g., FFT, matrix ops.
  - Parallel patterns with serial plug-ins, e.g., MapReduce
- Composition is hierarchical

Programming the Hardware
- 2 types of programmers, 2 layers
- Productivity Layer (90% of today's programmers)
  - Domain experts / naive programmers productively build parallel apps using frameworks & libraries
  - Frameworks & libraries composed to form app frameworks
- Efficiency Layer (10% of today's programmers)
  - Expert programmers build:
    - Frameworks: software that supports general structural patterns of computation and communication, e.g., MapReduce
    - Libraries: software that supports compact computational expressions, e.g., Sketch for Combinational or Grid computation
  - "Bare metal" efficiency possible at Efficiency Layer
- Effective composition techniques allow the efficiency programmers to be highly leveraged

Content Based Image Retrieval: Extract Features
- Summarize image meaningfully and distinctively
- [Figure: RGB histogram]
- Features:
  - Global and segment RGB and HSV histograms
  - Average gray value
  - Color moments (mean, variance, kurtosis)
  - DCT coefficients
  - Edge direction histogram
  - RGB and HSV correlograms
  - Wavelet transforms (tree structured and pyramid)
  - Scale Invariant Feature Transform histogram
  - Color Coherence Vector
  - Geometry from extracted faces

Coordination & Composition in CBIR Application
- Parallelism in CBIR is hierarchical
  - Mostly independent tasks/data with combining
- [Figure: feature extraction pipeline]
  - Input: stream of images
  - Feature extraction: stream parallel over images
  - Task parallel over extraction algorithms (DCT extractor, DWT, Face Recog, ?)
  - DCT extractor: data parallel map of DCT over tiles; combine by reduction on histograms from each tile; output one histogram (feature vector)
  - Combine: concatenate feature vectors
  - Output: stream of feature vectors

Coordination & Composition Language
- Coordination & Composition language for productivity
- 2 key challenges
  1. Correctness: ensuring independence using decomposition operators, copying, and requirements specifications on frameworks
  2. Efficiency: resource management during composition; domain-specific OS/runtime support
- Language control features hide core resources, e.g.,
  - "Map DCT over tiles" in the language becomes a set of DCTs/tiles per core
  - Hierarchical parallelism managed using OS mechanisms
- Data structures hide memory structures
  - Partitioners on arrays, graphs, trees produce independent data
  - Framework interfaces give independence requirements: e.g., a map-reduce function must be independent, either by copying or by application to a partitioned data object (set of tiles from partitioner)

How do we program the HW? What are the problems?
- For parallelism to succeed, must provide productivity, efficiency, and correctness simultaneously for scalable hardware
  - Can't make SW productivity even worse!
  - Why do it in parallel if efficiency doesn't matter?
  - Correctness usually considered an orthogonal problem
  - Productivity slows if code is incorrect or inefficient
- Most programmers not ready to produce correct parallel programs
  - IBM SP customer escalations: concurrency bugs worst, can take months to fix

Ensuring Correctness
- Productivity Layer:
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove concurrency errors (nondeterminism from execution order, not just low-level data races)
  - E.g., the race-free program "atomic delete + atomic insert" does not compose to an "atomic replace"; need higher-level properties, not solved with transactions (or locks)
- Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on framework and libraries; some techniques applicable to third-party software

Support Software: What are the problems?
- Compilers and Operating Systems are large, complex, resistant to innovation
  (Worst possible time to be resistant to innovation!)
  - Takes a decade for compiler innovations to show up in production compilers?
  - Time for idea in SOSP to appear in production OS?
- Traditional OSes brittle, insecure, memory hogs
  - Traditional monolithic OS image uses lots of precious memory * 100s - 1000s times (e.g., AIX uses GBs of DRAM / CPU)

Deconstructing Operating Systems
- Resurgence of interest in virtual machines
  - Hypervisor: thin SW layer between guest OS and HW
- Future OS: libraries where only functions needed are linked into app, on top of thin hypervisor providing protection and sharing of resources
  - Opportunity for OS innovation
- Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within partition

21st Century Code Generation
- Problem: generating optimal code is like searching for a needle in a haystack
- New approach: "Auto-tuners"
  - 1st run variations of program on computer to heuristically search for best combinations of optimizations (blocking, padding, ...) and data structures, then produce C code to be compiled for that computer
  - E.g., PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFT-W
  - [Figure: search space for matmul block sizes; axes are block dimensions, temperature is speed]
- Example: Sparse Matrix (SPMv) for 3 multicores
  - Fastest SPMv: 2X OSKI/PETSc on Intel Clovertown, 4X on AMD Opteron
  - Optimization space: register blocking, cache blocking, TLB blocking, prefetching/DMA options, NUMA, BCOO v. BCSR data structures, 16b v. 32b indices, ...

Example: Sparse Matrix * Vector

Name                       Clovertown          Opteron             Cell
Chips*Cores                2*4 = 8             2*2 = 4             1*8 = 8
Architecture               4-issue, 2-SSE3,    3-issue, 1-SSE3,    2-VLIW, SIMD,
                           OOO, caches,        OOO, caches,        local store, DMA
                           prefetch            prefetch
Clock Rate                 2.3 GHz             2.2 GHz             3.2 GHz
Peak MemBW                 21.3 GB/s           21.3 GB/s           25.6 GB/s
Peak GFLOPS                74.6 GF             17.6 GF             14.6 GF (DP Fl. Pt.)
Naive SPMv                 1.0 GF              0.6 GF              -
 (median of many matrices)
Efficiency %               1%                  3%                  -
Autotune SPMv              1.5 GF              1.9 GF              3.4 GF
Auto Speedup               1.5X                3.2X                -

Software Span Summary
- Greater productivity and efficiency for SPMv?
  - Parallelizing compiler + multicore + multilevel caches + SW and HW prefetching
  - Autotuner + multicore + local store + DMA
  - SPMv: easier to autotune single local store + DMA than multilevel caches + HW and SW prefetching
- Deconstructed OS relying on thin hypervisors
- Productivity Layer & Efficiency Layer
- C&C Language to compose Libraries/Frameworks

Hardware Tower: What are the problems?
- Multicore (2, 4) v. Manycore (64, 128)?
- How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
  - "Efficiency" instead of "performance" to capture energy as well as performance
  - Also, power, design and verification costs, low yield, higher error rates

HW Solution: Small is Beautiful
- Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector, SIMD PEs
  - Small cores not much slower than large cores
- Parallel is the energy-efficient path to performance: CV^2F
  - Lower threshold and supply voltages lower energy per op
- Redundant processors can improve chip yield
  - Cisco Metro 188 CPUs + 4 spares; Sun Niagara sells 6 or 8 CPUs
- Small, regular processing elements easier to verify
- One size fits all?
  - Amdahl's Law -> heterogeneous processors
  - Special function units to accelerate popular functions

HW features supporting Parallel SW
- Want Composable Primitives, Not Packaged Solutions
  - Transactional Memory is usually a Packaged Solution
- Partitions
- Fast Barrier Synchronization & Atomic Fetch-and-Op
- Active messages plus user-level event handling
  - Used by parallel language runtimes to provide fast communication, synchronization, thread scheduling
- Configurable Memory Hierarchy (Cell v. Clovertown)
  - Can configure on-chip memory as cache or local store
  - Programmable DMA to move data without occupying CPU
  - Cache coherence: mostly HW, but SW handlers for complex cases
  - Hardware logging of memory writes to allow rollback

Partitions
- Partition: hardware-isolated group
  - Chip divided into hardware-isolated partitions, under control of supervisor software
  - User-level software has almost complete control of hardware inside partition
- [Figure: InfiniCore chip with 16x16 tile array; partitions are power-of-2 size, naturally aligned]

Multiple Fast Barrier Networks
- Multiple parallel barrier networks (1 bit each) at each level
- Signals propagate combinationally, with varying clock divisor (optimize latency, not bandwidth)
- Hypervisor sets taps saying where partition sees barrier
- [Figure: tiles, distributed barrier logic, tap for 4-way partition, tap for 8-way partition]

Measuring Success: What are the problems?
1. Only companies can build HW, and it takes years
2. Software people don't start working hard until hardware arrives
   - 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
3. How to quickly get 100-1000 CPU systems in the hands of researchers to let them innovate in algorithms, compilers, languages, OS, architectures, ASAP?
4. Can we avoid waiting 4 years + 3 months between HW/SW iterations?

Build Academic Manycore from FPGAs
- As 10 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from 100 FPGAs?
  - 8 32-bit simple "soft core" RISC at 100 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs; 2X CPUs, 1.2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box Manycore
  - E.g., 1000-processor, standard ISA binary-compatible, 64-bit, cache-coherent supercomputer @ 150 MHz/CPU in 2007
  - Ideal for heterogeneous chip architectures
  - RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
- "Research Accelerator for Multiple Processors" as a vehicle to lure more researchers to the parallel challenge and decrease time to parallel salvation

768 CPU "RAMP Blue" (Wawrzynek, Krasnov, ... at Berkeley)
- 768 = 12 32-bit RISC cores/FPGA * 4 FPGAs/board * 16 boards, $10k/board
  - Simple MicroBlaze soft cores @ 90 MHz
  - Full star-connection between modules
  - 1008-node RAMP Blue soon (21 boards)
- NASA Advanced Supercomputing (NAS) Parallel Benchmarks (all class S)
  - UPC versions (C plus shared-memory abstraction): CG, EP, IS, MG; DEMO?
- RAMPants creating HW & SW for manycore community using next-gen FPGAs
  - Chuck Thacker & Microsoft designing next boards
  - 3rd party to manufacture and sell boards: 1H08

Universities help start 19 $1B+ Industries
- $1B+ industries:
   1. Timesharing
   2. Client/server
   3. Graphics
   4. Entertainment
   5. Internet
   6. LANs
   7. Workstations
   8. GUI
   9. VLSI design
  10. RISC processors
  11. Relational DB
  12. Parallel DB
  13. Data mining
  14. Parallel computing
  15. RAID disk arrays
  16. Portable comm.
  17. World Wide Web
  18. Speech recognition
  19. Broadband last mile
- [Table: industries (rows) by university (columns); individual cell marks not recoverable. Columns: Cal, Caltech, CERN, CMU, Illinois, MIT, Purdue, Rochester, Stanford, Tokyo, UCLA, Utah, Wisc.]
- Totals, in column order: 7, 2, 2, 3, 2, 5, 1, 1, 4, 1, 2, 1, 2
- Source: Innovation in Information Technology, National Research Council Press, 2003.

Change directions of research funding for parallelism?
- Historically: get leading experts per discipline (across US) working together to work on parallelism
- [Figure: sites Cal, UIUC, MIT, Stanford against disciplines Application, Language, Compiler, Libraries, Networks, Architecture, Hardware, CAD]

Change directions of research funding for parallelism?
- To increase cross-disciplinary bandwidth, get experts per site collaborating on (app-driven) parallelism
- [Figure: sites Cal, UIUC, MIT, Stanford against disciplines Application, Language, Compiler, Libraries, Networks, Architecture, Hardware, CAD]

Summary: A Berkeley View 2.0
- Our industry has bet its future on parallelism (!)
  - Goal: productive, efficient, correct parallel programs while doubling the number of cores every 2 years (!)
- Try Apps-Driven vs. CS Solution-Driven Research
- Laptop-Handhelds/Datacenters as modern Client/Server
- 13 Dwarfs as lingua franca, anti-benchmarks
- Composition is critical to parallel computing success
  - Composition is an open problem for Transactional Memory
  - Use C&C Lang to reuse experts' code: create libraries and frameworks for use in the productivity layer
- Productivity layer for 90% of today's programmers
- Efficiency layer for 10% of today's programmers
- Autotuners over Parallelizing Compilers
- OS & HW: Composable Primitives over CS Solutions

Acknowledgments
- Berkeley View material comes from discussions with:
  - Profs Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, John Wawrzynek, Kathy Yelick
  - UCB grad students Bryan Catanzaro, Jike Chong, Joe Gebis, William Plishker, and Sam Williams
  - LBNL: Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf
  - See view.eecs.berkeley.edu
- RAMP based on work of RAMP Developers:
  - Krste Asanovic (MIT/Berkeley), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), and John Wawrzynek (Berkeley, PI)
  - See ramp.eecs.berkeley.edu
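
Worked sketch: map over tiles, then reduce (Coordination & Composition in CBIR)

The CBIR slides describe a data-parallel map over image tiles followed by a reduction on the per-tile histograms. The following is a minimal Python sketch of that pattern, not the authors' implementation. It substitutes a simple gray-level histogram for the DCT extractor named on the slide, and the names partition, tile_histogram, combine, and image_feature are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(image, tile):
    """Partitioner: copy the 2D list into independent tile x tile blocks.

    Copying is what gives each map task its own data (assumed stand-in for
    the decomposition operators mentioned on the slide).
    """
    rows, cols = len(image), len(image[0])
    return [
        [row[c:c + tile] for row in image[r:r + tile]]
        for r in range(0, rows, tile)
        for c in range(0, cols, tile)
    ]


def tile_histogram(block, bins=4, max_value=256):
    """Serial plug-in: gray-level histogram of one tile (stand-in for a DCT)."""
    hist = [0] * bins
    for row in block:
        for value in row:
            hist[value * bins // max_value] += 1
    return hist


def combine(h1, h2):
    """Reduction step: element-wise sum of two histograms."""
    return [x + y for x, y in zip(h1, h2)]


def image_feature(image, tile=2, workers=4):
    """Map tile_histogram over the tiles, then reduce to one feature vector."""
    tiles = partition(image, tile)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(tile_histogram, tiles))
    return reduce(combine, partial)


if __name__ == "__main__":
    image = [
        [0, 64, 128, 255],
        [0, 64, 128, 255],
        [10, 70, 130, 250],
        [10, 70, 130, 250],
    ]
    print(image_feature(image))
```

Because the reduction is an element-wise sum, the result does not depend on the order in which tiles finish, which is the independence requirement the slide asks framework interfaces to state.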
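
Worked sketch: atomic delete + atomic insert is not an atomic replace (Ensuring Correctness)

The Ensuring Correctness slide notes that two individually atomic operations do not compose into an atomic "replace". The sketch below illustrates that point in Python under stated assumptions: AtomicSet and naive_replace are hypothetical names, and the observe callback stands in for another thread that happens to run between the two steps, so the interleaving is deterministic here.

```python
import threading


class AtomicSet:
    """A set whose individual operations are each protected by one lock."""

    def __init__(self, items=()):
        self._items = set(items)
        self._lock = threading.Lock()

    def delete(self, x):
        with self._lock:
            self._items.discard(x)

    def insert(self, x):
        with self._lock:
            self._items.add(x)

    def snapshot(self):
        with self._lock:
            return frozenset(self._items)


def naive_replace(s, old, new, observe):
    """'Replace' built from two atomic steps; each step is race-free."""
    s.delete(old)
    # Another thread scheduled here sees a state with neither old nor new.
    observe(s.snapshot())
    s.insert(new)


if __name__ == "__main__":
    seen = []
    s = AtomicSet({"a"})
    naive_replace(s, "a", "b", seen.append)
    print(seen, s.snapshot())
```

Every access goes through the lock, so there is no low-level data race, yet an observer can still see the intermediate empty state. That is the kind of higher-level property the slide says is needed beyond race freedom.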
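
Worked sketch: a toy auto-tuner for matmul block size (21st Century Code Generation)

The code generation slide describes auto-tuners that run variations of a program on the target machine and keep the fastest. The following Python sketch shows only the search step, under loud assumptions: it times a pure-Python blocked matrix multiply for a few candidate block sizes and returns the fastest, whereas the tools named on the slide (PHiPAC, Atlas, Spiral, FFT-W) search much larger spaces and emit C code. The names blocked_matmul and autotune are hypothetical.

```python
import time


def blocked_matmul(a, b, n, bs):
    """Multiply two n x n matrices (lists of lists) using bs x bs blocking."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c


def autotune(n, candidates, repeats=3):
    """Time each candidate block size on this machine; return (best, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, bs)
            best = min(best, time.perf_counter() - start)
        timings[bs] = best  # keep the fastest of the repeated runs
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best, timings = autotune(48, [4, 8, 16, 48])
    print("best block size on this machine:", best)
```

The winning block size depends on the machine and interpreter, which is the reason the slide argues for measuring on the target rather than predicting in a compiler.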
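
Worked sketch: a naive sparse matrix-vector multiply (Example: Sparse Matrix * Vector)

The SPMv slides compare naive and auto-tuned sparse matrix-vector multiply. For readers unfamiliar with the kernel, here is a minimal Python sketch of an untuned SPMv. It assumes the plain compressed sparse row (CSR) layout, which the deck does not name explicitly; the blocked BCSR and BCOO variants listed in the optimization space are refinements of this idea. The function name spmv_csr is hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in values
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside values
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access x[col_idx[p]] is what makes SPMv memory-bound.
            acc += values[p] * x[col_idx[p]]
        y[i] = acc
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0],
    #      [4, 0, 5]]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

Each row can be computed independently, so the outer loop is the natural place to split work across cores; the optimizations listed on the slide mainly target the irregular reads of x and the index overhead.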
USC - CS - 596
Guidelines for the Final Project I. II. Programming Critical review (>2-3 pages) Dont repeat what the paper says. 1. 2. 3. 4. * III. Problem: Whats the problem? Method: How to solve it? Results: Bottom line? So what? Critique: Flaw? Improvement (how
USC - CS - 653
84Computer Physics Communications 63 (1991) 8494 North-HollandVisualizing quantum scattering on the CM-2 supercomputerJohn L. Richardson1Received 2 February 1990 Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142-1214, USAWe
USC - CS - 653
CSCI653: High Performance Computing & SimulationsAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of
USC - CS - 653
Molecular Dynamics I: PrinciplesBasics of the molecular-dynamics (MD) method1-3 are described, along with corresponding data structures in program, md.c.Newtons Second Law of MotionTRAJECTORY, COORDINATE, AND ACCELERATION Physical system = a set
USC - CS - 653
Minimal Complex Analysis Complex function: A mapping from a complex variable z = x + iy (i = f(z) C."1) to a complex numberDifferentiation: A complex function f(z) at z is differentiable if the quantity ! f (z + "z) # f (z)"zconverges to a
USC - CS - 653
JOURNALOF COMPUTATIONALPHYSICS73, 315-348 (1987)A Fast Algorithmfor ParticleANDSimulations*L. GREENCARDV. ROKHLINDepartment of Computer Science, Yale Lnipersiry, New Haven, Connecticut 06520 Received June 10. 1986; revised February
USC - CS - 653
ELSEYIER2s -/.-I@Computer Physics Communications Computer Physics Communications ( 1997)59-69 104Parallel multilevel preconditioned conjugate-gradient approach to variable-charge molecular dynamicsAiichiro Nakano Depurtment of Computer Scienc
USC - CS - 653
Computer Physics Communications 153 (2003) 445461 www.elsevier.com/locate/cpcScalable and portable implementation of the fast multipole method on parallel computers Shuji Ogata a , Timothy J. Campbell b , Rajiv K. Kalia c,d , Aiichiro Nakano c,d,
USC - CS - 653
Reversiblemultipletime scale moleculardynamicsM. Tuckermar?) G. J. Martynaand B. J. BerneDepartment of Chemistry, Columbia University, New York, New York 10027 Department of Chemistry, University of Pennsylvania, Philadelphia, Pennsylvani
USC - CS - 653
Computer Physics CommunicationsELSEVIERComputer Physics Communications 83 (1994) 197214Multiresolution molecular dynamics algorithm for realistic materials modeling on parallel computersAiichiro Nakano, Rajiv K. Kalia, Priya VashishtaConcurrent
USC - CS - 653
Message Passing Interface (MPI) ProgrammingMPI (Message Passing Interface) is a standard message passing system that enables us to write and run applications on parallel computers. In 1992, MPI Forum was formed to develop a portable message passing
USC - CS - 653
OpenMP ProgrammingAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of Southern California Email: ana
USC - CS - 653
Hybrid MPI+OpenMP Parallel MDAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of Southern California
USC - CS - 653
Parallel Molecular DynamicsThis chapter explains the example parallel MD program, pmd.c, in detail.Spatial Decomposition Spatial decomposition: The physical system to be simulated is partitioned into subsystems of equal volume. Processors in a pa
USC - CS - 653
Parallel Molecular DynamicsAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of Southern California E
USC - CS - 653
Scalability Metrics for Parallel Molecular DynamicsParallel EfficiencyWe define the efficiency of a parallel program running on P processors to solve a problem of size W. Let T(W, P) be the execution time of this parallel program. Speed of the prog
USC - CS - 653
Anton, a Special-Purpose Machine for Molecular Dynamics SimulationDavid E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey S. Kuskin, Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood,
USC - CS - 653
A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle InteractionsDAVID E. SHAWD. E. Shaw Research and Development, LLC and Center for Computational Biology and Bioinformatics, Columbia University, 120 W. 45th S
USC - CS - 653
Divide-and-Conquer Parallelization ParadigmDivide-and-Conquer Simulation Algorithms Divide-and-conquer (DC) algorithms: Recursively partition a problem into subprogram of roughly equal size. If subprogram can be solved independently, there is a pos
USC - CS - 653
Divide-&-Conquer ParallelismAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of Southern California
USC - CS - 653
Computational Materials Science 38 (2007) 642652 www.elsevier.com/locate/commatsciA divide-and-conquer/cellular-decomposition framework for million-to-billion atom simulations of chemical reactionsAiichiro Nakano a,*, Rajiv K. Kalia a, Ken-ichi No
USC - CS - 653
DE NOVO ULTRASCALE ATOMISTIC SIMULATIONS ON HIGH-END PARALLEL SUPERCOMPUTERSAiichiro Nakano Rajiv K. Kalia1 Ken-ichi Nomura1 Ashish Sharma1, 2 Priya Vashishta1 Fuyuki Shimojo1,3 4 Adri C. T. van Duin 4 William A. Goddard, III 5 Rupak Biswas Deepak S
USC - CS - 653
Load BalancingAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of Southern California Email: anakano
USC - CS - 653
Lanczos Method for EigensystemsAiichiro NakanoCollaboratory for Advanced Computing & Simulations Department of Computer Science Department of Physics & Astronomy Department of Chemical Engineering & Materials Science University of Southern Californ
USC - CS - 653
Supplementary Derivations for the Lanczos-Algorithm LectureSpectral representation The eigenvalues and eigenvectors satisfyn n$ Aij q (" ) = #" qi(" ) = $ qi(" ) #%&%" , jj=1%=1()(1)where "#$ = 1 ($ = #); 0 ($ % #). Define an orthogona
USC - CS - 653
C3P 913June 1990Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh CalculationsRoy D. WilliamsConcurrent Supercomputing Facility California Institute of Technology Pasadena, CaliforniaAbstract If a nite element mesh has a
USC - CS - 653
SIAM J. SCI. COMPUT. Vol. 20, No. 1, pp. 359392c 1998 Society for Industrial and Applied MathematicsA FAST AND HIGH QUALITY MULTILEVEL SCHEME FOR PARTITIONING IRREGULAR GRAPHSGEORGE KARYPIS AND VIPIN KUMAR Abstract. Recently, a number of researc