Low-Latency Communication on the IBM RISC System/6000 SP

Chi-Chao Chang, Grzegorz Czajkowski, Chris Hawblitzel and Thorsten von Eicken
Department of Computer Science, Cornell University, Ithaca NY 14853

Abstract

The IBM SP is one of the most powerful commercial MPPs, yet, in spite of its fast processors and high network bandwidth, the SP's communication latency is inferior to that of older machines such as the TMC CM-5 or Meiko CS-2. This paper investigates the use of Active Messages (AM) communication primitives as an alternative to the standard message passing in order to reduce communication overheads and to offer a good building block for higher layers of software. The first part of this paper describes an implementation of Active Messages (SP AM) which is layered directly on top of the SP's network adapter (TB2). With comparable bandwidth, SP AM's low overhead yields a round-trip latency that is 40% lower than IBM MPL's. The second part of the paper demonstrates the power of AM as a communication substrate by layering Split-C as well as MPI over it. Split-C benchmarks are used to compare the SP to other MPPs and show that low message overhead and high throughput compensate for the SP's high network latency. The MPI implementation is based on the freely available MPICH version and achieves performance equivalent to IBM's MPI-F on the NAS benchmarks.

1 Introduction

The IBM RISC System/6000 SP has established itself as one of the most powerful commercial massively parallel processors (MPPs) because of its fast Power2 processors, high-bandwidth interconnection network, and scalability. Nevertheless, the SP's network latency is 2 to 4 times higher than that of older MPPs such as the TMC CM-5 or Meiko CS-2, mainly due to overheads in the communication software and in the network interface architecture.

Active Messages (AM) provide simple communication primitives that are well suited as building blocks for higher layers of software such as parallel languages and complex message interfaces.
Originally developed for the CM-5 [12], implementations are also available for the Meiko CS-2 [10], HP workstations on an FDDI ring [9], the Intel Paragon, and the U-Net ATM cluster of Sun Sparcstations [13]. All the implementations are based on the Generic Active Message Specification Version 1.1 [3].

(Title-page footnote: This work has been sponsored by IBM under the joint project agreement 112691-A and University Agreement MHVU5851, and by NSF under contracts CDA-9024600 and ASC-8902827. Chi-Chao Chang is supported in part by a doctoral fellowship (200812/94-7) from CNPq/Brazil. In the Proceedings of ACM/IEEE Supercomputing, Pittsburgh, PA, November 1996. Copyright © 1996 IEEE.)

Message passing is the most widely used communication model in parallel computing and is now standardized in the Message Passing Interface (MPI) [5]. It supports blocking and nonblocking sends and receives, collective communication, and noncontiguous messages, and contains facilities for dealing with groups of processes and libraries. Since much of MPI's functionality is machine-independent, a freely available MPICH [7] implementation of MPI was developed to take care of the upper layers of MPI while providing an abstract device interface (ADI) to the machine-dependent layers.

This paper investigates the use of AM communication primitives as an alternative to message passing on the SP in order to reduce communication overheads and deliver high bandwidth with small messages. The first part of this paper (Section 2) describes the IBM SP implementation of AM (SP AM), which is layered directly on top of the SP's network adapter (TB2) and which does not use any IBM software on the Power2 processor. SP AM achieves a one-word message round-trip time of 51 µs, which is only 4 µs higher than the raw application-to-application round-trip latency and 40% lower than the 88 µs measured using IBM's message passing library (MPL).
SP AM bulk transfers achieve an asymptotic network bandwidth of 34.3 MBytes/s, which is comparable with the 34.6 MBytes/s measured using IBM MPL. Furthermore, SP AM has a message half-power point ($n_{1/2}$) of only 260 bytes using asynchronous bulk transfers.

The second part of the paper demonstrates the power of AM as a communication substrate by porting Split-C, a split-phase shared-memory extension to C, and implementing MPI over SP AM. Split-C benchmarks are used in Section 3 to compare the SP to other MPPs and show that low message overhead and high message throughput compensate for the SP's high network latency.

The MPI implementation (Section 4) is built entirely over SP AM and is based on MPICH. The goal is to demonstrate that the communications core of MPI can be implemented over AM in a simple fashion and still provide very high performance. This simplicity eases both portability and optimizations that might otherwise be unwieldy. The implementation focuses on the basic point-to-point communication primitives used by MPICH's ADI and relies on the higher-level MPICH routines for collective communication and non-contiguous sends. Extending the implementation to specialize these functions to use AM more directly would be straightforward. The current MPI over SP AM matches MPI-F's performance for very small and very large messages and outperforms MPI-F by 10 to 30% for medium-size (8 KByte to 20 KByte) messages. The NAS benchmarks (Section 6) achieve the same performance using MPICH over SP AM as using MPI-F.

1.1 Active Messages background

AM is a low-latency communication mechanism for multiprocessors that emphasizes the overlap of communication and computation [12]. Messages are in the form of requests and matching replies and contain the address of a handler that is invoked on receipt of the message along with up to four words of arguments.
The handler pulls the arguments out of the network, integrates them into the ongoing computation and potentially sends a reply back. Large messages are sent using bulk transfer operations called stores and gets, which transfer data between blocks of memory specified by the node initiating the transfer. Message delivery is generally done by explicit network polling, in which case each call to am_request checks the network and explicit checks can be added using am_poll. Interrupt-driven reception is also available but not used in this analysis of SP AM. AM guarantees reliable delivery of messages but does not recover from crash failures, network partitions, and other similar kinds of failures. The interface is summarized in Table 1.

Function                        Operation
am_request_M (M = 1, 2, 3, 4)   Send an M-word request
am_reply_M (M = 1, 2, 3, 4)     Send an M-word reply
am_store                        Send a long message, blocking
am_store_async                  Send a long message, non-blocking
am_get                          Fetch data from remote node
am_poll                         Poll the network

Table 1: Summary of AM interface. The am_request_M functions invoke the handler on the remote node with i1, ..., iM as parameters. The handler may reply using the similar am_reply_M functions. am_store copies nbytes of local data to the specified remote address and invokes a handler on the remote node after the transfer is complete. am_store blocks until the source memory region can be reused, while am_store_async returns immediately and a separate completion function is called on the sending side at the end of the data transfer. am_get initiates a data transfer from a remote memory block to a local one and invokes a local handler at the end of the transfer.

1.2 SP Overview

The IBM SP is an MPP consisting of Power2 RS/6000 nodes interconnected by a custom network fabric as well as by Ethernet. Each node has its own memory, CPU, operating system (AIX), MicroChannel I/O bus, Ethernet adapter, and high performance switch adapter [8]. The SP processing nodes operate at a clock speed of 66 MHz and offer a peak performance of 266 Mflops. A model 390 thin node contains a 64 KByte data cache with 64-byte lines, a 64-bit memory bus, and 64 to 512 MBytes of main memory. Its SPEC ratings are 114.3 SPECint92 and 205.3 SPECfp92. A model 590 wide node differs from a thin node in that it has a 256 KByte data cache with 256-byte lines, a larger main memory of up to 2 GBytes, and SPEC ratings of 121.6 SPECint92 and 259.7 SPECfp92.

The processing nodes are organized in racks of up to 16 thin nodes or 8 wide nodes each and are connected by a high-performance scalable switch. The switch provides four different routes between each pair of nodes, a hardware latency of about 500 ns, and a bandwidth close to 40 MBytes/s.

SP nodes are connected to the high-speed interconnection switch via communication adapters (referred to as TB2) [11] which contain an Intel i860 microprocessor with 8 MBytes of DRAM. As shown in Figure 1, the adapter plugs into the 32-bit MicroChannel I/O bus, which has an 80 MB/s peak transfer rate, and contains a custom Memory and Switch Management Unit (MSMU) to interface to the network. Data transfers between the MSMU and the MicroChannel are performed using two DMA engines and an intermediate 4 KByte FIFO. Direct programmed I/O from the host to the adapter RAM is also possible.

[Figure 1: Schematic of the SP network interface. Labels recoverable from the figure: CPU, cache ($), memory bus, MicroChannel (20-80 MB/s), adapter, i860, i860 bus, 8 MB DRAM, sendQ, recvQ, length array, DMA1, DMA2, FIFOs, MSMU, switch link.]

2 Active Messages Implementation

This section describes the software interface to the network adapter, the basic mechanisms of sending and receiving a packet, the flow control strategies employed for reliable delivery, and the main performance characteristics of SP AM.

2.1 Basic Send and Receive Mechanisms

SP AM relies on the standard TB2 network adapter firmware but does not use any IBM software on the Power2 processor.
The adapter firmware allows one user process per node to access a set of memory-mapped send and receive FIFOs directly. These FIFOs allow user-level communication layers to access the network directly without any operating system intervention: the adapter monitors the send FIFO and delivers messages into the receive FIFO, all using DMA for the actual data transfer (Figure 1). The send FIFO has 128 entries while the receive FIFO has 64 entries per active processing node (determined at runtime). Each entry consists of 256 bytes and corresponds to a packet.

A packet length array is associated with the send FIFO. Its slots correspond to entries in the send FIFO and indicate the number of bytes to be transferred for each packet. The adapter monitors this length array, which is located in adapter memory, and transmits a packet when the corresponding slot in the packet length array becomes nonzero.

A packet is sent by placing the data into the next entry of the send FIFO along with a header indicating destination node and route. Since the RS/6000 memory bus does not support cache coherency, the relevant cache lines must be flushed out to main memory explicitly. Finally, the transfer size (1 byte) is stored in the packet length array located in adapter memory on the MicroChannel bus. In bulk transfers this store across the I/O bus can be optimized by writing the lengths of several packets at a time.

To receive a packet, the data in the top entry of the receive FIFO is copied out to the user buffers. After being flushed out of the data cache in preparation for a FIFO wrap-around, the entry is popped from the adapter's receive FIFO. This is done lazily (after some fixed number of messages polled) to reduce the number of MicroChannel accesses (each access costs around 1 µs).
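The send path described above (fill the next FIFO entry, flush the cache, then store a nonzero length to trigger the adapter) can be illustrated with a small host-side model. This is only a sketch under stated assumptions: the class `SendFifo` and its methods are hypothetical names, the cache flush is represented by a comment, and the adapter clearing a length slot after transmission is an assumption made so the model can reuse entries; none of this is the actual SP AM or TB2 firmware code.

```python
from dataclasses import dataclass, field

ENTRY_BYTES = 256    # one FIFO entry holds one packet (from the text)
SEND_ENTRIES = 128   # depth of the send FIFO (from the text)


@dataclass
class SendFifo:
    """Toy model of the send FIFO and its packet length array (hypothetical)."""
    entries: list = field(default_factory=lambda: [b""] * SEND_ENTRIES)
    lengths: list = field(default_factory=lambda: [0] * SEND_ENTRIES)
    head: int = 0  # index of the next entry the host will fill

    def send(self, packet: bytes) -> int:
        """Host side: queue one packet and return the slot it occupies."""
        if not 0 < len(packet) <= ENTRY_BYTES:
            raise ValueError("packet must be between 1 and 256 bytes")
        slot = self.head
        if self.lengths[slot] != 0:
            raise BufferError("send FIFO is full")
        self.entries[slot] = packet        # 1. place header and data in the entry
        # 2. on the real machine the cache lines would be flushed here
        self.lengths[slot] = len(packet)   # 3. nonzero length makes the packet visible
        self.head = (slot + 1) % SEND_ENTRIES
        return slot

    def adapter_step(self, slot: int):
        """Adapter side (assumed behavior): transmit if the slot is nonzero."""
        n = self.lengths[slot]
        if n == 0:
            return None                    # nothing queued in this slot
        data = self.entries[slot][:n]
        self.lengths[slot] = 0             # assumption: slot is cleared after sending
        return data
```

The point of the ordering in `send` is that the length store comes last, so the adapter never observes a slot whose data has not yet been written.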
2.2 Flow Control

SP AM provides reliable, ordered delivery of messages. The design is optimized for lossless SP switch behavior, given that the switch is highly reliable. Because packets can still be lost due to input buffer overflows, flow control and fast retransmission have proved essential in attaining reliable delivery without harming the performance of the layer. Sequence numbers are used to keep track of packet losses and a sliding window is used for flow control; unacknowledged messages are saved by the sender for retransmissions. When a message with the wrong sequence number is received, it is dropped and a negative acknowledgement is returned to the sender, forcing a retransmission of the missing as well as subsequent packets. Acknowledgements are piggybacked onto requests and replies whenever possible; otherwise explicit acknowledgements are issued when one-quarter of the window remains unacknowledged.

[Figure 2: Flow-control protocol: initially, two chunks are transmitted and the next chunk is sent only when the previous-to-last chunk is acknowledged. Sequence shown in the figure: 1st chunk, 2nd chunk, ack for 1st chunk, 3rd chunk, ack for 2nd chunk.]

During a bulk transfer, data is divided into chunks of 8064 bytes (footnote 1). Packets making up a chunk carry the same sequence number, and the window slides by the number of packets in a chunk; address offsets are used to order packets within a chunk and each chunk requires only one acknowledgement. Figure 2 illustrates the flow-control protocol. The transmission of chunks is pipelined such that chunk N is sent after an ack for chunk N-2 has been received (but typically before an ack for chunk N-1 arrives). The overhead for sending a chunk (175 µs) is higher than one round-trip, which ensures that the pipeline remains filled. Note that with this chunk protocol, there is virtually no distinction between blocking and non-blocking stores for very large transfer sizes.
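The chunking and pipelining rule above can be made concrete with a short sketch. It uses the paper's figures of 224 data bytes per packet and 36 packets per chunk; the function names are hypothetical, and the event ordering assumes acknowledgements arrive in order and exactly when the sender needs them, which is an idealization rather than a trace of SP AM.

```python
PACKET_DATA_BYTES = 224                             # data bytes per packet
PACKETS_PER_CHUNK = 36                              # packets per chunk
CHUNK_BYTES = PACKET_DATA_BYTES * PACKETS_PER_CHUNK # 8064 bytes


def chunk_sizes(nbytes):
    """Split a bulk transfer into full chunks plus an optional short last chunk."""
    full, rest = divmod(nbytes, CHUNK_BYTES)
    return [CHUNK_BYTES] * full + ([rest] if rest else [])


def pipelined_schedule(num_chunks):
    """Order of events when chunk i may only be sent after the ack for chunk i-2.

    Idealized: each ack is consumed at the moment the sender needs it.
    """
    events = []
    acked = 0  # number of chunks acknowledged so far
    for i in range(num_chunks):
        while i - 2 >= acked:          # wait for the ack of chunk i-2
            events.append(("ack", acked))
            acked += 1
        events.append(("send", i))
    while acked < num_chunks:          # drain the remaining acks
        events.append(("ack", acked))
        acked += 1
    return events
```

For three chunks, `pipelined_schedule(3)` begins with two sends, then the ack for the first chunk, then the third send, which is the order shown in Figure 2.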
Although a chunk of 36 packets slightly exceeds half the size of the receive FIFO (64 packets per active node), the sender is unlikely to overflow the receive FIFO in practice. Given the flow control scheme, the window size must be at least twice as large as a chunk (72 packets). To accommodate start-up request messages, the window size is chosen to be 75 packets for requests and 76 for replies.

A keep-alive protocol is triggered when messages remain unacknowledged for a long period of time (footnote 2). This protocol forces negative acknowledgements to be issued to the protocol initiator, causing the retransmission of any lost messages.

Footnote 1: A packet has 224 bytes of data and 32 bytes of header. A chunk corresponds to 36 packets.
Footnote 2: Timeouts are emulated by counting the number of unsuccessful polls.

2.3 Round-trip Latency

A simple ping-pong benchmark using am_request_1 and am_reply_1 shows a one-word round-trip latency of 51.0 µs on thin nodes. This value increases by about 0.25 µs per word when two, three, or four 32-bit words are transferred. This round-trip latency compares well with a raw message (no data or sequence number) ping-pong latency of 46.6-47.0 µs. The additional overhead of 4 µs is due to the cost of the cache flushes and the flow control bookkeeping. The same ping-pong test using MPL's mpc_send and mpc_recv yields a round-trip latency of 88 µs.

2.4 Bandwidth

Several tests are used to measure the asymptotic network bandwidth ($r_\infty$) and the data size at which the transfer rate is half the asymptotic rate ($n_{1/2}$). $r_\infty$ indicates how fast the network adapter moves data from the virtual buffers to the network, while $n_{1/2}$ characterizes the performance of bulk transfers for small messages.

The bandwidth benchmarks involve two processing nodes and measure the one-way bandwidth for data sizes varying from 16 bytes to 1 MByte (footnote 3). They were run using SP AM bulk transfer primitives as well as IBM MPL send and receive primitives for comparison.
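To give a feel for how $r_\infty$ and $n_{1/2}$ relate, the sketch below uses a simple linear cost model, T(n) = t0 + n / r_inf, in which a transfer pays a fixed start-up time t0 and then streams at the asymptotic rate. This model is an assumption introduced here for illustration and is not taken from the paper; the measured curves, which include chunked flow control, need not follow it. Under the model, the half-power point is simply t0 * r_inf. The numeric inputs (260 bytes and 34.3 MBytes/s) are the SP AM values reported in the paper, and 1 MByte is assumed to mean 10**6 bytes.

```python
def transfer_time(nbytes, t0, r_inf):
    """Linear cost model (assumed): fixed start-up time plus streaming time."""
    return t0 + nbytes / r_inf


def bandwidth(nbytes, t0, r_inf):
    """Achieved bandwidth in bytes/s for a transfer of nbytes."""
    return nbytes / transfer_time(nbytes, t0, r_inf)


def half_power_point(t0, r_inf):
    """Transfer size at which bandwidth equals r_inf / 2 under the model."""
    return t0 * r_inf


R_INF = 34.3e6   # bytes/s, assuming 1 MByte = 10**6 bytes
N_HALF = 260     # bytes, SP AM asynchronous stores

# Start-up time the linear model would need to reproduce the reported n_1/2.
implied_t0 = N_HALF / R_INF
```

The implied start-up time is a property of the assumed model only; it should not be read as a measured overhead.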
The blocking transfer bandwidth test measures synchronous transfer requests by issuing blocking requests (am_store and am_get) and waiting for their completion. For MPL, an mpc_bsend is followed by a 0-byte mpc_brecv. The pipelined asynchronous transfer bandwidth test uses a number of small requests to transfer a large block. This benchmark sends N bytes of data using $\lceil N/n \rceil$ transfers of n bytes, where N is 1 MByte and n varies from 64 bytes to 1 MByte, using am_store_async and mpc_send respectively.

Figure 3 shows the results. The $r_\infty$ achieved by pipelining am_store_async and am_get is 34.3 MBytes/s compared to MPL's 34.6 MBytes/s using mpc_send. The $n_{1/2}$ value of about 260 bytes for am_store_async (slightly higher for am_get) compared to about 450 bytes for mpc_send indicates that SP AM achieves better performance with small messages. The bandwidth of SP AM's synchronous stores and gets also converges to 34.3 MBytes/s but at a slower rate due to the round-trip latency, as the sender blocks after every transfer waiting for an acknowledgement. Also, for smaller transfer sizes, the performance for gets is slightly lower than for stores because of the overhead of the get request. Consequently, the bandwidth curve for synchronous gets shows an $n_{1/2}$ of 3000 bytes compared to the 2800 bytes for stores. The effect of this overhead on the bandwidth vanishes as the transfer size increases, explaining the overlapping of both curves for sizes larger than 4 KBytes. Despite a higher $r_\infty$ of 34.6 MBytes/s, synchronous transfers using MPL's sends and receives have an $n_{1/2}$ greater than 3500 bytes.

[Figure 3: Bandwidth of blocking and non-blocking bulk transfers. x-axis: message size (bytes), 100 to 1000000; y-axis: MBytes per second, 0 to 40. Curves: Sync Store, Sync Get, MPL send/reply, Pipelined Async Store, Pipelined Async Get, Pipelined MPL Send.]

Figure 3 clearly shows that SP AM's asynchronous transfers are no better than their blocking counterparts for message sizes larger than one chunk (8064 bytes), which is when the flow control kicks in.

Footnote 3: Measurements of the bandwidth on exchange can be found in [2].

2.5 Overheads

Table 2 summarizes the time needed to complete a successful request or reply call. The difference between these two operations is that am_request_N() calls am_poll() after the message is sent while am_reply_N() does not. The time needed to complete am_poll() varies with the number and kind of messages received. The overhead of polling an empty network is 1.3 µs. The additional overhead per received message is about 1.8 µs.

N    am_request_N    am_reply_N
1    7.7 µs          4.0 µs
2    7.9 µs          4.1 µs
3    8.0 µs          4.3 µs
4    8.2 µs          4.4 µs

Table 2: Cost of calling am_request_N() and am_reply_N() functions. (The column for am_request_N() assumes that no messages are received as part of the am_poll which is performed inside am_request.)

2.6 Summary and Comparison with MPL

Performance Metric                             AM             MPL
One-word round-trip latency                    51.0 µs        88.0 µs
Asymptotic bandwidth, $r_\infty$               34.3 MBytes/s  34.6 MBytes/s
Half-power point (non-blocking), $n_{1/2}$     260 bytes      450 bytes
Half-power point (blocking), $n_{1/2}$         2800 bytes     3500 bytes

Table 3: Performance Summary of SP AM and IBM MPL

MPL's user space round-trip latency is measured with a ping-pong benchmark which sends a one-word message back and forth between two processing nodes using mpc_bsend and mpc_recv. The asymptotic bandwidth of SP AM is comparable to MPL's asymptotic bandwidth. A half-power point comparison indicates that SP AM delivers better performance than MPL for short messages.

3 Split-C Application Benchmarks

Split-C [4] is a simple parallel extension to C for programming distributed memory machines using a global address space abstraction. It is implemented on top of Generic Active Messages and is used here to demonstrate the impact of SP AM on applications written in a parallel language.
Split-C has been implemented on the CM-5, Intel Paragon, Meiko CS-2, Cray T3D, a network of Sun Sparcs over U-Net/ATM, as well as on the IBM SP using both SP AM and MPL. A small set of application benchmarks is used here to compare the two SP versions of Split-C with each other and to the CM-5, Meiko CS-2, and U-Net cluster versions. Table 4 compares the machines with respect to one another: the CM-5's processors are slower than the Meiko's and the U-Net cluster's, but its network has lower overheads and latencies. The CS-2 and the U-Net cluster have very similar characteristics. The SP has the fastest CPU, a network bandwidth comparable to the CS-2, but a relatively high network latency.

The Split-C benchmark set used here consists of three programs in five variants: a blocked matrix multiply, a sample sort in one version optimized for small messages and another optimized to use bulk transfers, and a radix sort likewise in versions optimized for small and large transfers. All the benchmarks have been instrumented to account for the time spent in local computation phases and in communication phases separately, such that the time spent in each can be related to the processor and network performance of the machines. The absolute execution times for runs on eight processors are shown in Table 5. Execution times normalized to SP AM are shown in Figure 4. Detailed explanation of the benchmarks can be found in [2].

The two matrix multiply runs use matrices of 4 by 4 blocks with 128 by 128 double floats per block and of 16 by 16 blocks with 16 by 16 double floats each, respectively. For large blocks, the performance of Split-C over SP AM and MPL is the same, which can be explained by the comparable bandwidth in large block transfers. The floating-point performance of Power2 gives the SP an additional edge over the CM-5, CS-2, and the U-Net/ATM cluster. For smaller blocks, however, the performance over MPL degrades significantly with respect to SP AM because of higher message overheads.
Notice that the results over SP AM exhibit a smaller network time compared to all other machines. As long as the transfer sizes remain below 8064 bytes, flow control is not activated and thus overhead matters more than latency.

For radix and sample sorts, Figure 4 shows that the SP spends less time in local computation phases because of the faster CPU. SP AM spends about the same amount of time, if not less, in the communication phases as the CM-5 and CS-2. Although the SP's round-trip latency is relatively higher, SP AM combines low message overhead with high network bandwidth to achieve a higher message throughput. Again, the performance over MPL for small messages suffers from the high message overhead. For large messages (albeit not large enough to activate flow control), both SP AM and MPL outperform the other machines in both computation and communication phases.

The Split-C benchmark results show that SP AM's low message overhead and high throughput compensate for the SP's high network latency. The software overhead in MPL degrades the communication performance of fine-grain applications, allowing machines with slower processors (CM-5) or even higher network latencies (U-Net/ATM cluster) to outperform the SP.

Machine      CPU speed             Msg Overhead   Round-trip Latency   Bandwidth
TMC CM-5     33 MHz Sparc-2        3 µs           12 µs                10 MBytes/s
Meiko CS-2   40 MHz Sparc-20       11 µs          25 µs                39 MBytes/s
U-Net ATM    50/60 MHz Sparc-20    3 µs           66 µs                14 MBytes/s
IBM SP       66 MHz RS6000-590     4 µs           51 µs                34 MBytes/s

Table 4: Comparison of the TMC CM-5, Meiko CS-2, U-Net ATM cluster, and IBM SP performance characteristics

Benchmark         IBM SP AM   IBM SP MPL   TMC CM-5   Meiko CS-2   SS20/U-Net/ATM
mm 128x128        1.094s      1.180s       4.606s     2.516s       4.470s
mm 16x16          0.229s      0.489s       0.970s     0.371s       0.415s
smpsort sm 512K   4.393s      18.570s      10.448s    9.845s       15.730s
smpsort lg 512K   1.814s      1.811s       8.612s     7.432s       2.792s
rdxsort sm 512K   9.894s      54.652s      27.106s    21.255s      81.344s
rdxsort lg 512K   3.543s      3.587s       20.011s    7.995s       6.126s

Table 5: Absolute Execution Times (seconds)
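The normalization behind Figure 4 can be reproduced from the absolute times in Table 5. The following sketch divides every entry by the SP AM time for the same benchmark; the dictionary layout and the function name `normalize` are illustrative choices, and only total times are covered because the table does not break them down into computation and communication phases.

```python
# Absolute execution times in seconds, copied from Table 5.
TABLE5 = {
    "mm 128x128":      {"IBM SP AM": 1.094, "IBM SP MPL": 1.180, "TMC CM-5": 4.606,
                        "Meiko CS-2": 2.516, "SS20/U-Net/ATM": 4.470},
    "mm 16x16":        {"IBM SP AM": 0.229, "IBM SP MPL": 0.489, "TMC CM-5": 0.970,
                        "Meiko CS-2": 0.371, "SS20/U-Net/ATM": 0.415},
    "smpsort sm 512K": {"IBM SP AM": 4.393, "IBM SP MPL": 18.570, "TMC CM-5": 10.448,
                        "Meiko CS-2": 9.845, "SS20/U-Net/ATM": 15.730},
    "smpsort lg 512K": {"IBM SP AM": 1.814, "IBM SP MPL": 1.811, "TMC CM-5": 8.612,
                        "Meiko CS-2": 7.432, "SS20/U-Net/ATM": 2.792},
    "rdxsort sm 512K": {"IBM SP AM": 9.894, "IBM SP MPL": 54.652, "TMC CM-5": 27.106,
                        "Meiko CS-2": 21.255, "SS20/U-Net/ATM": 81.344},
    "rdxsort lg 512K": {"IBM SP AM": 3.543, "IBM SP MPL": 3.587, "TMC CM-5": 20.011,
                        "Meiko CS-2": 7.995, "SS20/U-Net/ATM": 6.126},
}


def normalize(table, baseline="IBM SP AM"):
    """Express each time as a multiple of the baseline machine's time."""
    return {
        bench: {machine: t / row[baseline] for machine, t in row.items()}
        for bench, row in table.items()
    }


if __name__ == "__main__":
    for bench, row in normalize(TABLE5).items():
        cells = ", ".join(f"{m}: {v:.2f}x" for m, v in row.items())
        print(f"{bench}: {cells}")
```

Ratios above 1 mean the machine is slower than SP AM on that benchmark; the small-message sorts over MPL stand out, which is consistent with the discussion of message overhead above.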
[Figure 4: Split-C benchmark results normalized to SP. Bars are split into cpu and net time on a scale of 0 to 8 for the machines spam, spmpl, cm5, meiko and unet, on the benchmarks mm128x128, mm16x16, smpsortsm512K, rdxsortsm512K, smpsortlg512K and rdxsortlg512K.]

4 MPI Implementation over Active Messages

Implementations of MPI currently fall into two broad categories: those implemented from scratch and tuned to the platform, and those built using the portable public-domain MPICH package. MPICH contains a vast set of machine-independent functions as well as a machine-specific abstract device interface (ADI). The main design goal of MPICH was portability, while performance considerations took second rank. For this reason MPI implementations built from scratch are expected to outperform MPICH-based ones.

A close look at MPICH's ADI, however, reveals that the basic communication primitives can be implemented rather efficiently using AM and that, with a few optimizations to the ADI and the higher layers of MPICH, the package should yield excellent performance. This section describes such an implementation and presents a number of micro-benchmarks which demonstrate that MPICH layered over AM (MPI-AM) indeed achieves point-to-point performance competitive with or better than IBM's from-scratch MPI-F implementation.

4.1 Basic Implementation

The major difficulty in layering MPI's basic send (MPI_Send or MPI_Isend) over AM lies in resolving the naming of the receive buffer: am_store requires that the sender specify the address of the receive buffer, while message passing in general lets the receiver supply that address. This discrepancy can be resolved either by using a buffered protocol, where the message is stored into some temporary buffer at the receiver and then copied, or by using a rendez-vous protocol, where the receiver sends the receive buffer address to the sender, which then stores directly from the send buffer into the receive buffer (Figure 6).

For small messages, the buffered protocol is most appropriate because the extra copy cost is insignificant. Each receiver holds one buffer (currently 16 KBytes) for every other process in the system. To send a message, the sender allocates space within its buffer at the receiver (this allocation is done entirely at the sender side and involves no communication) and performs an am_store into that buffer. After the receiver has copied the message into the user's receive buffer, it sends a reply to free up the temporary buffer space.

The buffered protocol's requirements are well matched to am_store: the store transfers the data and invokes a handler at the receiving end which can update the MPICH data structures and send a small reply message back using am_reply. If the store handler finds that the receive has been posted, it can copy the message and use the reply message to free the buffer space. If a matching receive has not been posted, the message's arrival is simply recorded in an unexpected messages list and an empty reply is sent back (it is actually used for flow control by the underlying AM implementation). The buffer space is only freed when a matching receive is eventually posted.

For large messages the copy overhead and the size of the preallocated buffer become prohibitive and a rendez-vous protocol is more efficient. The sender first issues a request-for-address message to the receiver. When the application posts a matching receive, a reply containing the receive buffer address is sent back. The sender can then use a store to transfer the message. This protocol may lead to deadlock when using MPI_Send and MPI_Recv because the sender blocks while waiting for the receive buffer address. This is inherent in the message passing primitives, and MPI offers nonblocking alternatives (MPI_Isend and MPI_Irecv).

In the implementation of the rendez-vous protocol, MPI_Send or MPI_Isend causes a request to be sent to the receiving node. If a matching receive (MPI_Recv or MPI_Irecv) has been posted, the handler replies with the receive buffer address; otherwise the request is placed in the unexpected messages list and the receive buffer address is sent when the receive is eventually posted (see Figure 5). At the sender side, the handler for the receive buffer address message is not allowed to do the actual data transfer due to the restrictions placed on handlers by AM. Instead, it places the information in a list, and the store is performed by the blocked MPI_Send or, for nonblocking MPI_Isends, by any MPI communication function that explicitly polls the network.

[Figure 5: Rendez-vous protocol over AM, when MPI_Recv is posted before MPI_Send (left) and after MPI_Send (right). Each panel shows node 1 and node 2 over time exchanging a request, a reply and a store, with the handlers they invoke.]

MPI specifies that messages be delivered in order, and the current implementation assumes that messages from one processor to another are delivered in order. Although this is not guaranteed by the Generic Active Messages standard, SP AM does provide ordered delivery. On AM platforms without ordered delivery, a sequence number would have to be added to each store and request message to ensure ordering.

The current MPI-AM uses the polling version of SP AM. To ensure timely dispatch of handlers for incoming messages, am_poll is called explicitly in all MPI communication functions which would not otherwise service the network. For applications which have long gaps between calls to MPI functions, a timer may be used to periodically poll for messages, although this has not been tested yet.

4.2 Optimizations

Profiling of the basic buffered and rendez-vous protocols uncovered inefficiencies that led to a number of simple optimizations.
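As a point of reference for the baseline being profiled, the sender-side bookkeeping of the buffered protocol can be sketched as a first-fit allocator over the 16 KByte region that a sender owns at each receiver. This is a minimal illustration, not the MPICH or MPI-AM code: the class `FirstFitRegion` and its methods are hypothetical, and what the basic implementation does when no hole is large enough is not covered here, so the sketch just returns `None`.

```python
BUFFER_BYTES = 16 * 1024  # per-sender buffer held at each receiver (from the text)


class FirstFitRegion:
    """Sender-side view of its buffer at one receiver (hypothetical sketch)."""

    def __init__(self, size=BUFFER_BYTES):
        self.free = [(0, size)]  # holes as (offset, length), sorted by offset

    def alloc(self, nbytes):
        """Return the offset of the first hole that fits, or None if none does."""
        for i, (off, length) in enumerate(self.free):
            if length >= nbytes:
                if length == nbytes:
                    del self.free[i]                       # hole used up exactly
                else:
                    self.free[i] = (off + nbytes, length - nbytes)
                return off
        return None

    def release(self, off, nbytes):
        """Called when the receiver's reply frees the space; merges neighbors."""
        self.free.append((off, nbytes))
        self.free.sort()
        merged = []
        for o, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((o, length))
        self.free = merged
```

Because both `alloc` and `release` run at the sender, no communication is needed to pick a destination offset for the store; the linear scan over holes on every small send is the kind of cost a profile would expose.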
The first-fit allocation of receive buffers in the buffered protocol turned out to be a major cost in sending small messages. The optimized implementation uses a binned allocator for small messages (currently 8 bins of 1 KByte each) and reverts to the first-fit algorithm only for intermediate messages. Using a message for freeing the small buffers was another source of overhead, and combining several free-buffer replies into a single message speeds up the execution of the receiver's store handler. These two optimizations, along with some slight code reorganization to cut down on function calls, improved the small message latency to within a microsecond of MPI-F.

Using two distinct strategies for small and large messages means that the implementation has to switch from one to the other at some intermediate message length. This often causes discontinuities in the performance, as is the case in MPI-F, where the bandwidth achieved using messages of 5 KBytes is actually lower than with 4 KByte messages because of the rendez-vous latency introduced for the larger messages. The optimized MPI-AM augments the rendez-vous protocol by sending out a small prefix (4 KBytes) of the message into a temporary buffer at the receiver while waiting for the rendez-vous reply. This hybrid buffered/rendez-vous protocol keeps the pipeline full while avoiding excessive buffer space requirements. (If no buffer space can be allocated, the hybrid protocol simply reverts to a regular rendez-vous protocol.) By using the hybrid ...

Georgia State - VIEWCOMPLE - 2
Report of the Academic Program Review Committee for the Department of Computer Science for the Review Period FY2004-FY2006 Approved by APRC February 27, 2007 Approved by CAP, March 20, 2006 APRC Review Committee: A. Faye Borthick (chair), Leonard R.
Georgia State - VIEWCOMPLE - 2
Report of the Academic Program Review Committee for the Department of Political Science for the Review Period FY2004-FY2006 Draft as of April 21, 2007 APRC Review Committee: Irene Duhaime, Valerie Miller, Wendy Venet I. Department Profile for FY2004-
Georgia State - VIEWCOMPLE - 2
ACADEMIC PROGRAM REVIEW ACTION PLAN DEPARTMENT OF PSYCHOLOGY December 22, 2006 Approved at psychology faculty meeting December 15. 2006 Revised per Executive Committee recommendations, May 22, 2007 Approved at psychology faculty meeting June 1, 2007
Georgia State - RMICTR - 98
INSURANCE TAXATION IN GEORGIA: ANALYSIS AND OPTIONSMartin F. GraceFiscal Research Center School of Policy Studies Georgia State University Atlanta, Georgia FRP Report No. 17 July 19981TABLE OF CONTENTS Page Executive Summary .. iv Introduction
Georgia State - TOPPAGE - 1
THE PH.D. PROGRAM IN THE FINANCE DEPARTMENTThe Department of Finance has a large faculty and offers a doctoral student specialization, training, and guidance in a wide variety of areas of finance. Areas of faculty research and expertise include corp
Georgia State - ROBINSON - 2006
Registration Form Name: Title: Institution Affiliation: E-mail address: Telephone: If you are a member of the CFEA Executive Committee, please check below. Member of CFEA Executive Committee The conference begins at 12 p.m. with lunch on Friday and e
Georgia State - REQUEST - 1103
Print FormHow to request an Internet Native Banner (INB) UseridRead the Policy/Guidelines on the Use of the INB System. There are 4 Request forms - Functional, College, Administrative and Student Assistant. Complete the appropriateRequest for
Georgia State - REQUEST - 1103
CLASS Validation TablesOBJECTAll Users DESCRIPTIONROLEBAN_DEFAULT_QGEN_UNV_USR_APP_M01 GUAPMNU GUAPPRF GUAPSWD GUAUPRF GUAINIT GUAGMNU GUQINTF GUAMESG GUAJOBS GEN_UNV_USR_APP_Q01 GOAADDR GOAEACC GOAEMAL GOAPGEO GOASGEO GUAABOT GUACALC GUACAL
Georgia State - CSC - 3210
Chapter 3 Digital Logic and Binary NumbersThese are lecture notes to accompany the book SPARC Architecture, Assembly Language Programming, and C, by Richard P. Paul, 2nd edition, 2000. By Michael WeeksRichard P. Paul, SPARC Architecture, Assembly
Georgia State - CSC - 3210
/* This program has a bug in it! It is supposed to demonstrate that 01010101 (binary) becomes 10101010 (binary) after the "not" instruction. It should print but instead, it prints: */ What is the problem? ffffffaa ff34007c.section ".rodata" .align
Georgia State - CSC - 3210
Chapter 4 Binary ArithmeticThese are lecture notes to accompany the book SPARC Architecture, Assembly Language Programming, and C, by Richard P. Paul, 2nd edition, 2000. By Michael WeeksRichard P. Paul, SPARC Architecture, Assembly Language Program
Georgia State - CSC - 3210
The Stack Chapter 5Lecture notes for SPARC Architecture, Assembly Language Programming and C, Richard P. Paul by Anu G. BourgeoisMemory Addresses are 32 bits wide Therefore, 232 bytes in memory Each location is numbered consecutively Memory da
Georgia State - CSC - 3210
Chapter 8 Machine InstructionsThese are lecture notes to accompany the book SPARC Architecture, Assembly Language Programming, and C, by Richard P. Paul, 2nd edition, 2000. By Michael WeeksRichard P. Paul, SPARC Architecture, Assembly Language Pro
Georgia State - CSC - 3210
CSC 3210 Notes Computer Organization and ProgrammingChapter 1 Dr. Anu BourgeoisLayout of Chapter 1 Hand-programmable calculator Fundamental definition of a computer Basic computer cycle Classic implementations of the computer Stack machine ar
Georgia State - CSC - 3210
Assembly language: sto 0 1 rcl 0 7 * rcl 0 11 / g rtn/ type in particular value for x / execute program to compute y Memory used to store program Memory is addressed May compute memory addresses unlike registers Registers may be selected no
Georgia State - CSC - 3210
Chapter 2 Part I SPARC ARCHITECTURE Dr. A.P. Preethy2.1 Introduction SPARC architecture is a load/store architecture, and the load and store instructions move data between registers and memory. The machine has 32 data lines, 30 address lines (addr
Georgia State - CSC - 3210
Chapter 3 Digital Logic and Binary NumbersThese are lecture notes to accompany the book SPARC Architecture, Assembly Language Programming, and C, by Richard P. Paul, 2nd edition, 2000. By Michael WeeksRichard P. Paul, SPARC Architecture, Assembly
Georgia State - CSC - 3210
Chapter 4 Binary ArithmeticThese are lecture notes to accompany the book SPARC Architecture, Assembly Language Programming, and C, by Richard P. Paul, 2nd edition, 2000. By Michael WeeksRichard P. Paul, SPARC Architecture, Assembly Language Program
Georgia State - CSC - 3210
Chapter 7 Subroutines Dr. A.P. Preethy7.1 Introduction There is frequently a need either to repeat a computation or to repeat the computation with different arguments. Subroutines can be used in such situations Subroutines may be either open or cl
Georgia State - CSC - 3210
Chapter 8 Machine InstructionsThese are lecture notes to accompany the book SPARC Architecture, Assembly Language Programming, and C, by Richard P. Paul, 2nd edition, 2000. By Michael WeeksRichard P. Paul, SPARC Architecture, Assembly Language Pro
Georgia State - CSC - 3210
I/O: SPARC AssemblyKen D. Nguyen Department of Computer Science Georgia State UniversityI/O on SPARC(File I/O will be covered in chapter 10) I/O thru buffers allocated for each terminal is an tremendous work in assembly. In SPARC, using system
Georgia State - CSC - 3210
CSC 3210 Assignment #1 Summer 2008due Tuesday, June 17th 10:00 a.m. Objective: To become familiar with the Unix system and assembly code. Requirements: 1. Type the given Hello, World! program, written in C, to a file called hello.c 2. Use the gcc c
Georgia State - CSC - 3210
CSC 3210 Assignment 2 Summer 2008 due Thursday, July 3rd at 10:00 a.m. Program Description: Write an assembly language program to find the maximum of y = x6 14x2 - 56x for the range -4 x 6, by stepping one by one through the range. During each it
Georgia State - CSC - 3210
Program Assignment #3 Summer 2008 CSC 3210 Due Thursday, July 10th at 10:00 am Write a program to perform various logical functions on a range of numbers. At the start of your program define a constant xval to have a certain value. The TA will change
Georgia State - CSC - 3210
CSC 3210 Summer 2008 Assignment #4 Due Tuesday, July 22nd Write an assembly program that: + prompts the user to enter a number. + prompts the user for a bit position (0-31). + displays the users input in hexadecimal notation and the bit at specified
Georgia State - CSC - 3210
CSC 3210 Assignment #5 Summer 2008 due Friday, August 1st, by midnight no late assignments accepted Overview: This program will present the user with a short menu and process their input accordingly using subroutines. The program will ask the user t
Georgia State - P - 12
Georgia State - CSC - 8820
Advanced Graphics AlgorithmsYing Zhu Georgia State University Lecture 02 Overview of Computer Graphics ProgrammingComputer Graphics: A Brief History1960s: William Fetter of Boeing coins the term Computer Graphics (1960) Basic computer
Georgia State - CSC - 8820
Advanced Graphics AlgorithmsYing Zhu Georgia State UniversityLecture 03 The Graphics Rendering PipelineWhat is graphics rendering pipeline?A process of generating a 2D image, given a virtual camera (eye), 3D objects, light sources, textures,
Georgia State - CSC - 8820
Advanced Graphics Algorithms Ying Zhu Georgia State University Lecture 04Geometric objects & User InteractionOutlineGraphics primitives in OpenGL Input devices and user interaction GLUT Programming event-driven input Animating interactive prog
Georgia State - CSC - 8820
Introducing BlenderWhy Blender? Its free Its powerful Its small A large and growing community of Blender users http:/www.blender.orgHow to learn Blender? Learn the interface first http:/www.blender.org/education-help/video-tutorials/gettin
Georgia State - WWW2CAS - 2
Directed Readings 4999 ApplicationCollege of Arts and SciencesEligibility and Conditions: Undergraduate Senior who is within two semesters of graduation. Course 4999 is designed to assist the senior who has a curriculum problem fulfilling the requi
Georgia State - WWW2CAS - 2
PART 1GeorgiaState UniversityPlease type or print in ink.COLLEGE OF ARTS AND SCIENCESAPPLICATION FOR GRADUATE STUDYOffice of Graduate Studies 8th Floor Haas Howell Building Atlanta, Georgia 30303 404-651-2297SECTION I. APPLICANT INFORMATION
Georgia State - WWW2CAS - 2
Guide for StudioThesis Preparation andGeorgiaState UniversitySubmissionSpring, 2006 Graduate Office College of Arts and Sciences8th Floor Haas Howell Building Atlanta, Georgia 30303 404-651-2297TABLE OF CONTENTSIntroduction A. Formatting You
Georgia State - WWW2CAS - 2
Eligibility does not guarantee that a student will be admitted into a course. Admission must be approved by the course instructor, the department/school Director of Graduate Studies, the department/school/Chair/Director, and the Associate Dean for Re
Georgia State - WWW2CAS - 2
GeorgiaState UniversityGuide for Digital Thesis and Dissertation Preparation and SubmissionOffice of Graduate Studies College of Arts and Sciences8th Floor Haas Howell Building Atlanta, Georgia 30303 404-651-2297Guide to Thesis and Dissertation
Georgia State - WWW2CAS - 2
Survey of Earned DoctoratesJuly1, 2007 to June 30, 2008Conducted by forNSFSEDLast NamePlease complete:First Name Middle Name Suffix (e.g., Jr.)Cross Reference: Birth name or former name legally changed Name of Doctoral Institution Type of
Georgia State - WWW2CAS - 2
Petition for Deviation from Graduate Catalog RegulationsOffice of the Dean College of Arts and Sciences 741 General Classroom Building (404) 651-2294_ Student Identification Number__ DateDegree (circle one):MA MAT MHP MS MMu MFA MAEd PhD_
Georgia State - WWW2CAS - 2
CRNApplication for Course 6999 College of Arts and Sciences Georgia State UniversityIn accordance with the provisions of the current Graduate Catalog, Course 6999 in any department that offers graduate work is designed to assist the graduate stude
Georgia State - WWW2CAS - 2
REVIEW OF LECTURERS AND PROMOTION OF LECTURERS TO SENIOR LECTURERS College of Arts & Sciences Georgia State UniversityA. Overview This document describes the process for the review of lecturers and for the promotion of lecturers to senior lecturer.
Georgia State - WWW2CAS - 2
Approved by Senate | 4/29/04Policy on Lecturers (Approved by Faculty Affairs Committee on February 19, 2004) In conformity with Board of Regents Policy 803.03 the colleges and schools of Georgia State University are permitted to employ full-time le
Georgia State - WWW2CAS - 2
CALENDAR FOR THIRD-YEAR REVIEW OF LECTURERS COLLEGE OF ARTS AND SCIENCES 2007 Review Process Lecturers provide all required materials to the Chair/Director. Chair/Director provides all materials to departmental review committee. This is an elected co
Georgia State - WWW2CAS - 2
CALENDAR FOR FIFTH-YEAR REVIEW OF LECTURERS COLLEGE OF ARTS AND SCIENCES 2007 Review Process Lecturers in their fifth year will provide all required materials to the chair. The chair will provide the departmental fifth-year lecturer review committee
Georgia State - WWW2CAS - 2
ERNEST G. WELCH SCHOOL OF ART & DESIGN LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the College of Arts and Sciences Promotion and Tenure Review Board October 5, 2004Lecturers must consult
Georgia State - WWW2CAS - 2
DEPARTMENT OF APPLIED LINGUISTICS AND ENGLISH AS A SECOND LANGAUGE LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the College of Arts and Sciences Promotion and Tenure Review Board October 5, 2
Georgia State - WWW2CAS - 2
DEPARTMENT OF BIOLOGY LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the College of Arts and Sciences Promotion and Tenure Review Board November 1, 2004Lecturers must consult the College of A
Georgia State - WWW2CAS - 2
effective date: Fall 04DEPARTMENT OF CHEMISTRY PROMOTION AND TENURE MANUAL COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the Department of Chemistry March 17, 2003Approved by the Promotion and Tenure Review Board July 7, 2003
Georgia State - WWW2CAS - 2
DEPARTMENT OF COMMUNICATION LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the College of Arts and Sciences Promotion and Tenure Review Board November 1, 2004Lecturers must consult the Colleg
Georgia State - WWW2CAS - 2
DEPARTMENT OF ENGLISH LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the College of Arts and Sciences Promotion and Tenure Review Board October 5, 2004Lecturers must consult the College of Ar
Georgia State - WWW2CAS - 2
1DEPARTMENT OF HISTORY LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by Department of History: October 25, 2005 Approved by the College of Arts and Sciences Promotion and Tenure Review Board:
Georgia State - WWW2CAS - 2
DEPARTMENT OF GEOSCIENCES LECTURER REVIEW AND PROMOTION POLICY COLLEGE OF ARTS AND SCIENCES GEORGIA STATE UNIVERSITYApproved by the College of Arts and Sciences Promotion and Tenure Review Board December 7, 2004Lecturers must consult the College