Efficient shared memory and RDMA based collectives on multi-rail QsNet SMP clusters


Cluster Comput (2008) 11: 341–354
DOI 10.1007/s10586-008-0065-8

Efficient shared memory and RDMA based collectives on multi-rail QsNet II SMP clusters

Ying Qian · Ahmad Afsahi

Received: 14 January 2008 / Accepted: 24 September 2008 / Published online: 16 October 2008
© Springer Science+Business Media, LLC 2008

Y. Qian · A. Afsahi
Department of Electrical and Computer Engineering, Queen's University, Kingston, ON, Canada K7L 3N6
e-mail: [email protected]
Y. Qian
e-mail: [email protected]

Abstract Clusters of Symmetric Multiprocessors (SMP) are more commonplace than ever in achieving high-performance. Scientific applications running on clusters employ collective communications extensively. Shared memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches in addressing the increasing demand on intra-node and inter-node communications, and thereby in boosting the performance of collectives in emerging multi-core SMP clusters. In this regard, this paper designs and evaluates two classes of collective communication algorithms directly at the Elan user-level over multi-rail Quadrics QsNet II with message striping: 1) RDMA-based traditional multi-port algorithms for gather, all-gather, and all-to-all collectives for medium to large messages, and 2) RDMA-based and SMP-aware multi-port all-gather algorithms for small to medium size messages.

The multi-port RDMA-based Direct algorithm for gather and all-to-all collectives gain an improvement of up to 2.15 for 4 KB messages over elan_gather(), and up to 2.26 for 2 KB messages over elan_alltoall(), respectively. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms including elan_gather() for 512 B to 8 KB messages, with a 1.96 improvement factor for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB, and outperforms elan_gather() by a factor of 1.49 for 32 KB messages. Experimentation with real applications has shown up to 1.47 communication speedup can be achieved using the proposed all-gather algorithms.

Keywords Collective communications · RDMA · Multi-rail networks · Clusters · Shared-memory · SMP

1 Introduction

SMP Clusters are the predominant platforms for high-performance computing due to their cost-performance effectiveness. Interconnection networks and the communication system software play a key role on the performance of such clusters. In this regard, several modern networks such as Quadrics [2], InfiniBand [12], Myrinet [19], and 10-Gigabit iWARP Ethernet [24] have been introduced to support scalable and efficient communications. While SMP nodes are traditionally equipped with multiple single-core processors, the emergence of multi-core SMP nodes in clusters, where each core will run at least one process with multiple intra-node and inter-node connections to several other processes, will put immense pressure on the interconnection network...
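The abstract names a Bruck all-gather without room to describe it. As a rough illustration only, the sketch below simulates the textbook single-port Bruck all-gather in plain Python: in each round a process receives, from the peer a power-of-two distance ahead of it, the blocks that peer has collected so far, so P processes finish in ceil(log2 P) rounds. This is not the paper's SMP-aware, multi-port, RDMA-based variant, and it makes no Elan calls; the function name `bruck_allgather` and the list-of-lists representation are assumptions made for this example.

```python
def bruck_allgather(blocks):
    """Simulate a textbook single-port Bruck all-gather (illustrative only).

    blocks[r] is the block contributed by rank r. Returns a pair
    (result, rounds) where result[r] is rank r's final buffer in rank
    order and rounds is the number of communication rounds used.
    """
    p = len(blocks)
    # held[r][j] is the block that originated at rank (r + j) % p.
    held = [[b] for b in blocks]
    rounds = 0
    dist = 1
    while dist < p:
        # In the last round only the blocks still missing are exchanged.
        count = min(dist, p - dist)
        # Snapshot what every rank sends before anyone appends, so the
        # exchange behaves as if all ranks communicate simultaneously.
        outgoing = [held[r][:count] for r in range(p)]
        for r in range(p):
            src = (r + dist) % p  # rank r receives from rank r + dist
            held[r].extend(outgoing[src])
        dist *= 2
        rounds += 1
    # Final local rotation: place each block at its originating rank's index.
    result = []
    for r in range(p):
        ordered = [None] * p
        for j, blk in enumerate(held[r]):
            ordered[(r + j) % p] = blk
        result.append(ordered)
    return result, rounds


if __name__ == "__main__":
    final, rounds = bruck_allgather(["a", "b", "c", "d", "e"])
    print(rounds, final[0])
```

The logarithmic round count is the usual reason a Bruck-style schedule suits small messages, where per-message startup cost dominates; a Direct schedule instead sends each block straight to its destination in more, independent transfers.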

This note was uploaded on 12/08/2009 for the course HPC NST105 taught by Professor Hameed during the Spring '09 term at Punjab Engineering College.
