On optimizing collective communication_ Relative order of send recv

On Optimizing Collective Communication

Ernie W. Chan, Marcel F. Heimlich, Avi Purkayastha, and Robert A. van de Geijn
The University of Texas at Austin
{echan, heimlich, rvdg}@cs.utexas.edu, avijit@tacc.utexas.edu

Abstract

In this paper we discuss issues related to the high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Pentium 4 (R) processor cluster are included.

1 Introduction

In this paper a set of collective communication algorithms implemented on distributed-memory systems is presented. These already efficient algorithms are reengineered to further improve performance. To increase performance, we change the type of communication used between processors and combine algorithms (hybridization) that are appropriate for different data volumes. Using the methods detailed in this paper, it is possible to significantly decrease the cost of collective communication operations relative to commonly available public domain implementations such as MPICH [13, 14].

Extensive research over the past decade has been reported on collective communication and implementations of algorithms for distributing data between processors. It has been shown that effective communication algorithms can be implemented using simple yet powerful techniques [1, 2, 3, 4, 5, 7, 9, 11, 12, 16, 17]. Even though these algorithms have been extensively researched, public domain and vendor implementations are frequently still suboptimal [15].
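The idea of hybridization mentioned above can be sketched as a simple dispatch on data volume: small messages favor latency-optimal algorithms, large messages favor bandwidth-optimal ones. The threshold value and algorithm names below are illustrative assumptions, not values taken from the paper:

```python
# Hypothetical sketch of "hybridization": choosing a broadcast algorithm
# class based on message size. SHORT_MSG_THRESHOLD is an assumed crossover
# point; real implementations tune it per architecture.

SHORT_MSG_THRESHOLD = 8192  # bytes (illustrative assumption)

def choose_broadcast(nbytes):
    """Pick an algorithm class appropriate for the data volume."""
    if nbytes <= SHORT_MSG_THRESHOLD:
        # Latency-bound regime: minimize the number of communication steps.
        return "minimum-spanning-tree"
    # Bandwidth-bound regime: minimize the volume each node must forward.
    return "scatter-then-allgather"
```

A real hybrid collective would dispatch to the corresponding implementation rather than return a label, but the selection logic is the essence of the technique.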
These techniques are revisited, refined, and reimplemented to achieve even higher performance than in past presentations.

2 Model of parallel computation

To analyze the performance of the presented algorithms, it is useful to assume a simple model of parallel computation. The following assumptions are made:

Target architectures: The target architectures are distributed-memory parallel architectures.

Indexing: This paper assumes a parallel architecture with p computational nodes (nodes hereafter). The nodes are indexed from 0 to p - 1. Each node has one computational processor.

Logically fully connected: We will assume that any node can send directly to any other node, where a communication network with some topology provides automatic routing.

Communicating between nodes: At any given time, each node can send only one message to one other node. Similarly, it can only receive one message from one other node. We will assume it can send and receive simultaneously.
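The communication assumptions above can be made concrete with a small simulation (a sketch of this paper's model, not of any particular algorithm in it): p nodes, each sending one message and receiving one message per step, concurrently. Under this model, shifting data around a logical ring lets every node see every token in p - 1 steps:

```python
# Minimal simulation of the stated model: p nodes indexed 0..p-1, each able
# to send one message and receive one message per step (simultaneously).
# A ring shift moves every node's token one hop per step, so circulating
# all p tokens past every node takes exactly p - 1 steps.

def ring_shift_steps(p):
    """Count the steps until every node has seen every token."""
    seen = {i: {i} for i in range(p)}   # tokens observed by node i
    held = list(range(p))               # token currently held by node i
    steps = 0
    while any(len(s) < p for s in seen.values()):
        # One step: node i sends its token to (i+1) % p and receives
        # from (i-1) % p -- one send and one recv, performed concurrently.
        held = [held[(i - 1) % p] for i in range(p)]
        for i in range(p):
            seen[i].add(held[i])
        steps += 1
    return steps
```

This matches the intuition that, with simultaneous send and receive, a full circulation costs p - 1 communication steps rather than the 2(p - 1) that unidirectional links would require.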

This note was uploaded on 12/08/2009 for the course HPC NST105 taught by Professor Hameed during the Spring '09 term at Punjab Engineering College.

