MIMS_lapack1111

MIMS_lapack1111 - A Class of Parallel Tiled Linear Algebra...


A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Alfredo Buttari, Julien Langou, Jakub Kurzak and Jack Dongarra

October 2007

MIMS EPrint: 2007.122

Manchester Institute for Mathematical Sciences
School of Mathematics
The University of Manchester

Reports available from: http://www.manchester.ac.uk/mims/eprints
And by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK

ISSN 1749-9097

LAPACK Working Note # 191

Alfredo Buttari (1), Julien Langou (4), Jakub Kurzak (1), Jack Dongarra (1, 2, 3)

(1) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee
(2) Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
(3) University of Manchester, Manchester, UK
(4) Department of Mathematical Sciences, University of Colorado at Denver and Health Sciences Center, Colorado

Abstract. As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations in which the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.

1 Introduction

In the last twenty years, microprocessor manufacturers have been driven towards higher performance rates only by the exploitation of higher degrees of Instruction Level Parallelism (ILP). Based on this approach, several generations of processors have been built with ever higher clock frequencies and ever deeper pipelines. As a result, applications could benefit from these innovations and achieve higher performance simply by relying on compilers that could efficiently exploit ILP. Due to a number of physical limitations (mostly power consumption and heat dissipation), this approach cannot be pushed any further. For this reason, chip designers have moved their focus from ILP to Thread Level Parallelism (TLP), where higher performance can be achieved by replicating execution units (or cores) on the die while keeping the...
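
The abstract's idea of a factorization expressed as small tasks on square blocks, ordered only by their data dependencies, can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the standard right-looking tiled Cholesky breakdown into POTRF, TRSM, SYRK and GEMM kernels (real LAPACK/BLAS routine names), and the functions `cholesky_tasks` and `schedule_waves` are hypothetical helpers invented for this illustration. No numerical work is done; the code only lists the tasks and groups them into "waves" of tasks whose inputs are ready.

```python
from collections import defaultdict


def cholesky_tasks(nt):
    """Yield (name, tiles_read, tile_written) in sequential order for a
    right-looking tiled Cholesky on an nt x nt grid of tiles.
    Assumed kernel breakdown, not quoted from the paper."""
    for k in range(nt):
        # Factor the diagonal tile (k, k).
        yield (f"POTRF({k})", [], (k, k))
        # Triangular solves for the tiles below the diagonal in column k.
        for i in range(k + 1, nt):
            yield (f"TRSM({i},{k})", [(k, k)], (i, k))
        # Update the trailing tiles of the lower triangle.
        for i in range(k + 1, nt):
            for j in range(k + 1, i):
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j))
            yield (f"SYRK({i},{k})", [(i, k)], (i, i))


def schedule_waves(nt):
    """Group tasks into waves: a task is placed one wave after the latest
    task that last wrote any tile it reads or overwrites."""
    last_writer = {}  # tile -> name of the most recent task that wrote it
    level = {}        # task name -> wave index
    for name, reads, write in cholesky_tasks(nt):
        deps = [last_writer[t] for t in reads + [write] if t in last_writer]
        level[name] = 1 + max((level[d] for d in deps), default=-1)
        last_writer[write] = name
    waves = defaultdict(list)
    for name, lv in level.items():
        waves[lv].append(name)
    return [waves[lv] for lv in sorted(waves)]


if __name__ == "__main__":
    for step, wave in enumerate(schedule_waves(3)):
        print(step, wave)
```

Tracking only the last writer of each tile is enough here because, in this particular task list, every tile a task reads is already final for that step; a general scheduler would also have to track write-after-read hazards. Tasks within a wave are independent of each other, so a runtime with enough cores could execute them concurrently.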