This preview shows pages 1–2. Sign up to view the full content.
This preview has intentionally blurred sections. Sign up to view the full version.View Full Document
Unformatted text preview: Anatomy of High-Performance Matrix Multiplication KAZUSHIGE GOTO The University of Texas at Austin and ROBERT A. VAN DE GEIJN The University of Texas at Austin We present the basic principles which underlie the high-performance implementation of the matrix- matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance. Categories and Subject Descriptors: G.4 [ Mathematical Software ]: Efficiency General Terms: Algorithms;Performance Additional Key Words and Phrases: linear algebra, matrix multiplication, basic linear algebra subprogrms 1. INTRODUCTION Implementing matrix multiplication so that near-optimal performance is attained requires a thorough understanding of how the operation must be layered at the macro level in combination with careful engineering of high-performance kernels at the micro level. This paper primarily addresses the macro issues, namely how to exploit a high-performance inner-kernel, more so than the the micro issues related to the design and engineering of that inner-kernel. In [Gunnels et al. 2001] a layered approach to the implementation of matrix multiplication was reported. The approach was shown to optimally amortize the required movement of data between two adjacent memory layers of an architecture with a complex multi-level memory. Like other work in the area [Agarwal et al. 1994; Whaley et al. 2001], that paper ([Gunnels et al. 2001]) casts computation in terms of an inner-kernel that computes C := AB + C for some m c k c matrix A that is stored contiguously in some packed format and fits in cache memory. Unfortunately, the model for the memory hierarchy that was used is unrealistic in Authors addresses: Kazushige Goto, Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX 78712, firstname.lastname@example.org . Robert A. van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712, email@example.com . Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee....
View Full Document