Optimization of collective reduction operation


Optimization of Collective Reduction Operations

Rolf Rabenseifner
High-Performance Computing-Center (HLRS), University of Stuttgart
Allmandring 30, D-70550 Stuttgart, Germany
rabenseifner@hlrs.de, www.hlrs.de/people/rabenseifner/

Published in International Conference on Computational Science, June 7–9, Krakow, Poland, LNCS, Springer-Verlag, 2004. © Springer-Verlag, http://www.springer.de/comp/lncs/index.html

Abstract. A 5-year profiling in production mode at the University of Stuttgart has shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations have been available for about 10 years and all vendors are committed to this Message Passing Interface standard, the vendors' and publicly available reduction algorithms could be accelerated with new algorithms by a factor between 3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors. This paper presents five algorithms optimized for different choices of vector size and number of processes. The focus is on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes, optimizing the load balance in communication and computation.

Keywords. Message Passing, MPI, Collective Operations, Reduction.

1 Introduction and Related Work

MPI_Reduce combines the elements provided in the input vector (buffer) of each process using an operation (e.g. sum, maximum), and returns the combined values in the output buffer of a chosen process named root. MPI_Allreduce is the same as MPI_Reduce, except that the result appears in the receive buffer of all processes. MPI_Allreduce is one of the most important MPI routines, and most vendors use algorithms that can be improved by a factor of more than 2 for long vectors. Most current implementations are optimized only for short vectors.
A 5-year profiling [11] of most MPI-based applications (in production mode) of all users of the Cray T3E 900 at our university has shown that 8.54% of the execution time is spent in MPI routines. 37.0% of the MPI time is spent in MPI_Allreduce and 3.7% in MPI_Reduce. The profiling has also shown that 25% of all execution time was spent with a non-power-of-two number of processes. Therefore, a second focus is the optimization for non-power-of-two numbers of processes.

Early work on collective communication implemented the reduction operation as an inverse broadcast and did not try to optimize the protocols based on different buffer sizes [1]. Other work already handled allreduce as a combination of basic routines; e.g., [2] proposed the combine-to-all (allreduce) as a combination of distributed combine (reduce_scatter) and collect (allgather). Collective algorithms for wide-area clusters are developed in [5,7,8], further protocol tuning can be found in [3,4,9,12], and automatic tuning in [13]. The main focus of the
work presented in this paper is to optimize the algorithms for different numbers of processes (non-power-of-two and power-of-two) and


