Cluster Comput (2008) 11: 341–354
DOI 10.1007/s10586-008-0065-8

Efficient shared memory and RDMA based collectives on multi-rail QsNet II SMP clusters

Ying Qian · Ahmad Afsahi

Received: 14 January 2008 / Accepted: 24 September 2008 / Published online: 16 October 2008
© Springer Science+Business Media, LLC 2008

Abstract Clusters of Symmetric Multiprocessors (SMP) are more commonplace than ever in achieving high performance. Scientific applications running on clusters employ collective communications extensively. Shared memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches in addressing the increasing demand on intra-node and inter-node communications, and thereby in boosting the performance of collectives in emerging multi-core SMP clusters. In this regard, this paper designs and evaluates two classes of collective communication algorithms directly at the Elan user-level over multi-rail Quadrics QsNet II with message striping: 1) RDMA-based traditional multi-port algorithms for gather, all-gather, and all-to-all collectives for medium to large messages, and 2) RDMA-based and SMP-aware multi-port all-gather algorithms for small to medium size messages. The multi-port RDMA-based Direct algorithm gains an improvement factor of up to 2.15 for 4 KB messages over elan_gather() for the gather collective, and up to 2.26 for 2 KB messages over elan_alltoall() for the all-to-all collective. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms including elan_gather() for 512 B to 8 KB messages, with a 1.96 improvement factor for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB, and outperforms elan_gather() by a factor of 1.49 for 32 KB messages. Experimentation with real applications has shown that a communication speedup of up to 1.47 can be achieved using the proposed all-gather algorithms.

Keywords Collective communications · RDMA · Multi-rail networks · Clusters · Shared-memory · SMP

Y. Qian · A. Afsahi
Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, Canada K7L 3N6
e-mail: [email protected]
Y. Qian
e-mail: [email protected]

1 Introduction

SMP clusters are the predominant platforms for high-performance computing due to their cost-performance effectiveness. Interconnection networks and the communication system software play a key role in the performance of such clusters. In this regard, several modern networks such as Quadrics [2], InfiniBand [12], Myrinet [19], and 10-Gigabit iWARP Ethernet [24] have been introduced to support scalable and efficient communications. While SMP nodes are traditionally equipped with multiple single-core processors, the emergence of multi-core SMP nodes in clusters, where each core will run at least one process with multiple intra-node and inter-node connections to several other processes, will put immense pressure on the interconnection network...
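
The abstract names a Bruck all-gather among the proposed algorithms without describing its communication pattern in this excerpt. As a rough illustration only, the following Python sketch simulates the textbook single-port Bruck all-gather on plain lists. It is not the paper's SMP-aware, multi-port RDMA implementation, and the function name `bruck_allgather` and its list-based "ranks" are assumptions made for this example.

```python
def bruck_allgather(blocks):
    """Simulate the textbook Bruck all-gather (illustrative, not the paper's code).

    blocks[i] is the data contributed by rank i. Returns a pair
    (result, steps): result[i] is rank i's final buffer in rank order,
    and steps is the number of communication rounds used.
    """
    p = len(blocks)
    # Each rank starts with only its own block.
    buffers = [[blocks[i]] for i in range(p)]
    steps = 0
    dist = 1
    while dist < p:
        # Number of blocks exchanged this round (the last round may be partial).
        count = min(dist, p - dist)
        # Rank i receives the first `count` blocks held by rank (i + dist) % p.
        # All messages are snapshotted first so the round is simultaneous.
        incoming = [buffers[(i + dist) % p][:count] for i in range(p)]
        for i in range(p):
            buffers[i].extend(incoming[i])
        dist *= 2
        steps += 1
    # Rank i now holds blocks i, i+1, ..., i+p-1 (mod p); rotate into rank order.
    result = [buf[p - i:] + buf[:p - i] for i, buf in enumerate(buffers)]
    return result, steps
```

Calling `bruck_allgather(["a", "b", "c", "d", "e"])` completes in three rounds, the ceiling of log2 of the rank count, and leaves every simulated rank with all five blocks in rank order. The small number of rounds is the usual reason this pattern is favoured for small messages, where per-message latency dominates.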
This note was uploaded on 12/08/2009 for the course HPC NST105 taught by Professor Hameed during the Spring '09 term at Punjab Engineering College.