The Analysis and Optimization of Collective Communications on a Beowulf Cluster

Wi Bing Tan and Peter Strazdins
Department of Computer Science, Australian National University

Abstract

This paper gives a performance analysis of the All-Gather, All-Reduce and Reduce-Scatter collective communication operations on a Beowulf cluster. This cluster has a contention-free switch-based network with multiple network interface cards per node, permitting overlapping of message transmission under certain circumstances. As well as considering traditional algorithms developed previously for parallel computers with vendor-specific networks, we also examine simpler algorithms made up of repeated sub-operations, such as broadcasts. We find that for the kind of network on the Beowulf cluster, a somewhat different performance modelling of the algorithms is required, and that some simple simulation tools had to be developed in order to fully understand some of the algorithms' performance.

Our results indicate that the LAM MPI implementations for these operations may be significantly improved, and the algorithms with data exchange and potential contention perform well on the cluster. Furthermore, they indicate that algorithms permitting message overlap are slightly favoured, with a new and simple algorithm which modestly out-performs the best traditional algorithms in the case of Reduce-Scatter. With the exception that the degree of overlapping proved difficult to estimate, our performance models fitted closely with the results, and together with the simulation tools, permit a detailed understanding of the cluster's communication pattern performance.

1. Introduction

Collective communication operations, where a group of nodes on a parallel processor must co-operate in a communication, form an important step in many supercomputing applications, including those for dense linear algebra.
For example, broadcast and reduction operations are the main communications in LU and QR factorizations in ScaLAPACK [3]; where more sophisticated load balance strategies are used, some of these operations are replaced by the All-Gather and Reduce-Scatter operations [5, 1, 10]. These collective operations, and the All-Reduce operation, have been considered sufficiently important to have been introduced into the MPI standard [8].

Research in this area has established that a variety of algorithms (or communication patterns) exist for such operations, and their performance has been modelled and analysed on mesh and hypercube communication networks on vendor-supplied parallel computers [9, 2, 7]. Here, message contention was often a major issue in the choice of optimal communication patterns.

Recent years have seen the advent of cluster computers as an alternative model of parallel computer. Typically, their COTS communication networks are relatively slow, and hence the (software) optimization of communication operations becomes more important. Furthermore, these networks tend to be switch-based; these are free of
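
As a rough illustration of what the three collective operations named in the abstract compute, the following sketch simulates their semantics on plain Python lists, with one list entry standing in for each node. It performs no real message passing and is not one of the algorithms analysed in the paper or the LAM MPI implementation. All function names are hypothetical, and `all_gather_by_broadcasts` is only an assumed reading of a "simpler algorithm made up of repeated sub-operations, such as broadcasts".

```python
# Illustrative only: simulates the *semantics* of three collectives on plain
# Python lists. There is no real communication, and none of these functions
# belongs to MPI or LAM; every name here is a hypothetical helper.
from functools import reduce
from operator import add


def all_gather(blocks):
    """Node i contributes blocks[i]; every node ends up with all blocks."""
    return [list(blocks) for _ in blocks]


def all_reduce(vectors, op=add):
    """Every node ends up with the element-wise reduction of all vectors."""
    reduced = [reduce(op, column) for column in zip(*vectors)]
    return [list(reduced) for _ in vectors]


def reduce_scatter(vectors, op=add):
    """Node i ends up with only element i of the element-wise reduction.

    Assumes each input vector holds exactly one element per node.
    """
    reduced = [reduce(op, column) for column in zip(*vectors)]
    return [[reduced[i]] for i in range(len(vectors))]


def broadcast(value, num_nodes):
    """The root's value is copied to every node (the root is implicit)."""
    return [value for _ in range(num_nodes)]


def all_gather_by_broadcasts(blocks):
    """All-Gather expressed as one broadcast per node, issued in rank order."""
    num_nodes = len(blocks)
    result = [[] for _ in range(num_nodes)]
    for root in range(num_nodes):
        received = broadcast(blocks[root], num_nodes)
        for node in range(num_nodes):
            result[node].append(received[node])
    return result
```

With three simulated nodes holding `[1, 2, 3]`, `[10, 20, 30]` and `[100, 200, 300]`, `all_reduce` leaves every node with the full sum vector, while `reduce_scatter` leaves node `i` with only the `i`-th element of that sum. The sketch says nothing about cost: the paper's interest is in how different communication patterns for the same result perform on the cluster's network.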