Efficient High Performance Collective Communication for the Cell Blade *

Qasim Ali
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN 47907
qali@purdue.edu

Samuel P. Midkiff
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN 47907
smidkiff@purdue.edu

Vijay S. Pai
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN 47907
vpai@purdue.edu

ABSTRACT

This paper presents high-performance collective communication algorithms and implementations that exploit the unique architectural features of the Cell heterogeneous multicore processor. This paper specifically describes novel algorithms for the barrier, broadcast, reduce, all-reduce, and all-gather collective operations, and shows the efficiency of these by comparing them to the previous fastest known implementations of these operations targeting the Cell. The new implementations are faster than the published state-of-the-art, achieving up to 19.21 times the performance (95% reduction in latency) of the previous published collective communication work for the Cell [19, 25]. The results presented show performance both within a chip and across the two Cell chips on a Cell blade [10].

Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Parallel programming

General Terms
Algorithms, Performance, Design

Keywords
Collective communication, algorithms, reductions, Cell processor

1. INTRODUCTION

Accelerator-based computing has proven to be an effective paradigm for achieving high performance, power-efficiency, and space-efficiency, with examples such as the Roadrunner petascale machine that includes Cell processors [1].
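As a point of reference, the results expected from the five collectives named in the abstract can be modeled on a single machine by treating a list as the per-rank data, with the list index standing in for the rank. The sketch below is only a statement of what each operation must produce. It is not the paper's algorithm (the algorithms are not part of this excerpt), and every function name in it is hypothetical. The last helper checks the abstract's arithmetic: a speedup of s corresponds to a latency reduction of 1 - 1/s.

```python
from functools import reduce as fold
import operator


def broadcast(values, root=0):
    """Every rank ends up holding the root's value."""
    return [values[root]] * len(values)


def reduce_to_root(values, op=operator.add, root=0):
    """Only the root holds the combined value; other ranks hold None."""
    total = fold(op, values)
    return [total if rank == root else None for rank in range(len(values))]


def all_reduce(values, op=operator.add):
    """Every rank ends up holding the combined value."""
    total = fold(op, values)
    return [total] * len(values)


def all_gather(values):
    """Every rank ends up holding a copy of every rank's contribution."""
    return [list(values) for _ in values]


def barrier(arrival_times):
    """No rank leaves before the last one arrives, so all leave at the max."""
    release = max(arrival_times)
    return [release] * len(arrival_times)


def latency_reduction(speedup):
    """Fraction of latency removed by a given speedup factor."""
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    ranks = list(range(1, 9))  # eight ranks, chosen here to mirror eight SPEs
    print(all_reduce(ranks))
    print(reduce_to_root(ranks))
    print(f"{latency_reduction(19.21):.1%}")
```

With eight ranks contributing 1 through 8, `all_reduce` leaves 36 on every rank while `reduce_to_root` leaves it only on rank 0. `latency_reduction(19.21)` evaluates to about 0.948, which matches the 95% figure quoted in the abstract after rounding.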
A Cell processor contains a single general-purpose 64-bit PowerPC processor (the PPE), which is a dual-issue in-order RISC core, along with eight special-purpose high-performance SIMD processors called synergistic processing elements (SPEs). The first-generation Cell has a peak performance of 204.8 Gflops in single precision and 14.6 Gflops in double precision [12]. The second-generation Cell has a peak performance of 102.4 Gflops in double precision [10]. Cell processor features that can affect the performance and structure of both communication operations and computations are the limited storage that each SPE can directly access (the on-chip local store, which has only 256 KB of SRAM per SPE), DMA operations that are the primary mechanism to transfer data, and dedi-

* This work is supported in part by the National Science Foundation under Grant Nos. CCF-0325603, CNS-0509390, CCF-0532448 and CNS-0751153.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS'09, June 8–12, 2009, Yorktown Heights, New York, USA.
Copyright 2009 ACM 978-1-60558-498-0/09/06...$5.00.