Computer Science 6365
February 7 and 12, 2008
Lecture #8 and 9: Interconnection Networks
Professor: S. Lennart Johnsson TA: Wei Ding
1 Networks
A bus-based parallel architecture is practical only for a small number of processors. For several thousand processors the bus would need to have a width of tens of thousands, or possibly a few hundred thousand, wires (bits). Similarly, it is not possible to provide complete interconnectivity for highly parallel systems. For N = 10,000, on the order of N² = 100,000,000 channels would be required. Not only is the total number of channels unrealistic, but so is the number of channels per node.

One of the major constraints in system design is the packaging. The number of connections between a chip and a printed circuit (PC) board, or between a PC board and a backplane, or between backplanes, is limited by the need for sufficient mechanical strength in the connectors, which enforces a minimum width for each connector. Pin Grid Arrays (PGA) offer 200-400 pins per chip package, while current state-of-the-art so-called Land Grid Arrays (LGA) offer up to 2,000 connectors with spacings of about 1 mm, requiring an area of close to 2,500 mm². Board connectors may offer about 2-3 pins per mm, or about 1,000 for a standard PC board. Thus, all highly parallel systems use some form of sparse network to interconnect the processors and memory modules.

Locality of reference is important for performance at all levels in most systems. With respect to memory chip technology, accessing elements within the same DRAM page offers a significantly faster access time than if data is scattered over different pages. Reducing the demands on memory bandwidth by reusing register contents, and the contents of intermediate fast memory in the form of cache, is important as well. Limitations on the data transfer rates are imposed not only by the fundamental characteristics of chip technologies, but also by packaging technologies.
In the following we will assume that memory is distributed among the processing nodes such that if there is locality of reference and it is properly exploited through data mapping and scheduling of operations, then the demands on the communication system are reduced. The architectural model is shown in Figure 1. Though it is true in most of the current generation of distributed memory architectures that a processor is involved in the data transfer to or from the memory associated with it, there are also efforts under way to allow for remote direct memory accesses, which conceptually results in a distributed memory system architecture as shown in Figure 2. Separate communications processors (not shown) handle remote requests for memory accesses.

In the BSP, the network clearly was an integral part of the memory system, since it had to provide the required alignment for data accesses. But, conceptually, the BSP was a physically shared memory system rather than a distributed memory system. The alignment network had to support the full bandwidth of the memory system. Distributed memory systems are usually designed with the assumption that there is a certain amount of locality of reference, and that the network need not support
COSC6365 Spring 2008 Lecture #8 and 9: Interconnection Networks
the aggregate bandwidth of the memory modules. Thus, when there is no locality of reference, the performance is typically limited by the network, either by the bandwidth of the ports to the local memory, or by contention in the network.

Figure 1: A generic model of distributed memory architectures. [Diagram labels: NETWORK; processors (P); memory modules (M).]

Figure 2: The memory system for distributed memory architectures. [Diagram labels: MEMORY SYSTEM; NETWORK; memory modules (M); processors (P).]

Note that even in a memory system as shown in Figure 2, the concurrency in communication and computation is limited. The single line into each memory unit symbolizes a single port to memory. Each local memory system can support either the local processor or remote accesses. However, the architecture shown in Figure 2 allows the overhead associated with remote requests to be overlapped with local execution. The overhead of a remote access consists of address computations, which may be quite substantial since they involve a translation between local and global address spaces, creating packets to be transmitted, executing the selected communication protocol, and providing queuing and possibly also routing services. In most of today's systems, the overhead far exceeds the latency in the network. The ratio is typically about a factor of
10 for simple, fairly low level protocols, about a factor of 100 for typical message passing calls known as "send" or "receive", or about a factor of 1,000 for some high level protocols.

Selecting a network implies a tradeoff between cost and utility. Traditionally, area or volume is a good indicator of the cost of a network. The area/volume is largely determined by the wiring needs. At the lowest level, the chip level, the area is typically entirely determined by the wiring needs. The logic can be made to fit "under" the wires, which is apparent from looking at a microphotograph of almost any modern chip. Similarly, the PC board area is largely determined by the wires making up the different layers of the board. Thus, in estimating the cost of a network we will rely on a simple model for wire placement.

Another characteristic related to cost is the number of channels per node, since the size of a package may be entirely determined by the area or perimeter required to fit all the connectors (pins). Supporting concurrent communication on all the channels of a node requires many independent data paths in a node. If the network provides several edge-disjoint paths between pairs of nodes, it is desirable to make effective use of this property. We will discuss this issue further in the context of routing in networks.

Modularity is also important with respect to cost: can the network be partitioned into identical modules, in particular modules that would be the same regardless of the size of the system being constructed? The latter property would allow for the use of mass produced parts. Moreover, if the parts have no dependence on size, then on the one hand the parts place no upper limit on the size of the system that can be constructed, and on the other hand small systems do not incur any penalty for the ability to build large systems from the same parts.

Measuring the utility of the networks is a quite difficult undertaking.
If a limited set of well defined tasks is to be performed, then it may be possible to find optimum data allocations for those tasks and associated algorithms. In determining the optimum allocation it is also necessary to determine how data associated with remote references should be routed through the network in order to minimize the communication time. Today, the utility of networks is largely determined by defining a set of data reference patterns in the index space of the computations, then finding good, possibly optimum, data allocations and associated routing schemes for the networks being considered. This approach is useful when the data reference patterns are sufficiently simple that they can be easily parameterized and analyzed. For more complex data reference patterns, a more conservative approach is often taken. Instead of striving for the best possible performance for fairly regular data accesses, such as nearest neighbor references on a grid in two or three dimensions, an approach guaranteeing an acceptable worst case behavior may be followed. We will discuss both approaches in the following.

Determining the utility of a network by determining how well it can perform a certain computation is often referred to as emulation. For instance, for computations for which the dominating data reference pattern is nearest neighbor communication on a grid in one or several dimensions, it is of interest to determine how well such a grid can be emulated on the network being considered. Some of the grid edges may be mapped to network nodes far apart, which may introduce high latencies and contention for network channels and nodes.

As we will see later, it is important to consider both low and high load in the network. A high load is typically associated with nodes with a substantial memory. In such a case, many data elements may
need to be sent between a pair of nodes, or gathered or scattered by a node. For instance, in a node with 32 Mbytes of memory, a 64 × 64 × 64 subgrid with four variables in 64-bit precision would occupy 25% of the memory. The number of variables to be moved between a pair of nodes for a shift operation with a shift distance of one is 4 × 64 × 64 = 16,384, or a total of 128 kbytes. For light load, latency tends to be the factor determining the performance, while for high load the bandwidth provided by the nodes or the network is most important.

Topological properties of the networks are often used to aid in determining the (best possible) performance. Below, we define the most important ones.

Network Diameter. This is the maximum distance between any pair of nodes in the network. The distance between a pair of nodes is the smallest number of wires that have to be traversed to get from one node to the other. A small diameter is desirable because it is a lower bound for the worst-case node-to-node communication time.

Network Bisection Width. This is the minimum number of wires that have to be removed to break the network into two halves with identical (within one) numbers of nodes. A large bisection width is desirable because it is the minimum bandwidth available between two halves of the network.

Maximum Edge Length. The length of the longest edge interconnecting a pair of nodes when laid out in a plane or a three-dimensional volume.

Network Area or Volume. The area or volume consumed by the network when laid out in a plane or a three-dimensional volume.

Edge or Channel Width. The width in bits of each edge connecting a pair of nodes.

The maximum edge length is of importance with respect to how fast the network can be operated in a synchronous mode. In a self-timed system it establishes a lower bound on performance.
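The figures in the high-load example above can be rechecked with a few lines of arithmetic. This is a minimal sketch; the constant names are illustrative, and it assumes 1 Mbyte = 2^20 bytes and 1 kbyte = 1024 bytes, which the text does not state explicitly.

```python
# Worked arithmetic for the high-load example (names are illustrative).
BYTES_PER_WORD = 8            # 64-bit precision
NODE_MEMORY = 32 * 2**20      # 32 Mbytes, assuming 1 Mbyte = 2**20 bytes
SIDE = 64                     # the subgrid is SIDE x SIDE x SIDE
VARIABLES = 4                 # variables stored per grid point

# Memory occupied by the whole subgrid on one node.
subgrid_bytes = SIDE**3 * VARIABLES * BYTES_PER_WORD
memory_fraction = subgrid_bytes / NODE_MEMORY

# A shift by one moves one SIDE x SIDE face of the subgrid, for every variable.
shifted_words = SIDE * SIDE * VARIABLES
shifted_bytes = shifted_words * BYTES_PER_WORD

print(memory_fraction)          # fraction of node memory used by the subgrid
print(shifted_words)            # words moved per shift
print(shifted_bytes // 1024)    # kbytes moved per shift
```

With these assumptions the script reproduces the 25%, 16,384 and 128 kbyte figures quoted in the text.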
Finally we mention the concept of a universal network, which is a network that, in a given volume, can simulate any other network with a slowdown that in the worst case is proportional to the logarithm of the volume it occupies.

Below, we present a number of common interconnection networks. For some of the networks, it is often helpful to view the nodes as being factored into rows and columns, possibly in several dimensions. Then, the total number of nodes is factored as $N = N_0 N_1 N_2 \cdots N_{d-1}$, where d is the number of axes in the array representation of the nodes.
2 The cost of a network
In order to get a basis for the cost of a network we will use a simple, but formal, model for the layout of networks in the plane. A similar model can be defined for three dimensions. The formal model is often referred to as the Thompson grid point model.
In VLSI (Very Large Scale Integration) technology used for chip manufacturing [7], layouts are usually wire limited, i.e., the area is determined by the wiring requirements. The logic, in the form of transistors, can fit under the wires without appreciably increasing the chip area. The so-called Thompson grid model [11] is a simplified model for device and circuit layout aimed at capturing the area requirements for different networks.

The Thompson grid point model is as follows: The layout medium has two layers for wires connecting nodes. One layer is used for vertical tracks, one layer for horizontal tracks. Nodes, like transistors or processors in a macroscopic model, are placed at intersections between wires in vertical and horizontal tracks. Nodes are represented as points. Wires on different layers are connected through contact cuts. A track can only be used for a single wire at a time, i.e., a track is only wide enough to hold one wire. Tracks are spaced according to the minimum pitch required by the technology.

In VLSI technology there are usually more than two layers. In generic MOS technology, as described in [7], there are three layers: metal, poly, and diffusion. On printed circuit boards there are often many layers, say 10 or more. The grid point model results in what is often called Manhattan geometry, because of the regular vertical and horizontal track model. In Boston geometry, wires are allowed to take any path obeying minimum width and minimum separation rules. However, such geometries may be considerably more expensive to use for chip manufacturing, since for chips each path is broken down into a sequence of rectangles. The exposure time per rectangle is independent of its shape. Thus, Manhattan geometry dominates chip manufacturing. And even though more than two layers may be possible, the number of layers is very limited compared to the number of nodes or tracks.
The additional layers are not expected to change the results obtained by using the simple grid point model by more than a small constant factor.

The area of a chip has a very strong influence on its cost. Chips are produced by photolithography, with chips placed on a wafer, which today is 12 - 15 cm in diameter on state-of-the-art fabrication lines. Thus, the bigger the chip, the fewer the chips per wafer. The wafer fabrication cost is independent of what is on it. Moreover, the manufacturing process is not perfect, so some chips will have defects. The likelihood of defects grows exponentially in the area. Hence, the cost of a working chip grows very rapidly as a function of its size; see Section 2.3 of [3]. Doubling the die area may increase the cost per die fivefold! One concern in chip manufacturing is contact cuts, which both add area and tend to reduce the yield. Minimizing the number of contact cuts is important in chip layout. However, we will not pursue this issue.
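To see how a fivefold cost increase can arise from doubling the die area, the sketch below uses a simple Poisson yield model. This model, the function name, and the numbers are assumptions made here for illustration; the text only says that the likelihood of defects grows exponentially with area, and the treatment it cites in [3] may use a different yield formula.

```python
import math

def relative_cost_per_good_die(area, defect_density):
    """Cost of one working die, in units of (wafer cost / wafer area).

    Assumptions (not from the text): dies per wafer scale as 1/area,
    ignoring edge losses, and yield = exp(-defect_density * area).
    """
    die_yield = math.exp(-defect_density * area)
    return area / die_yield

# Hypothetical numbers, chosen so that defect_density * area = ln(2.5).
area = 1.0                       # cm^2, illustrative
defect_density = math.log(2.5)   # defects per cm^2, illustrative

ratio = (relative_cost_per_good_die(2 * area, defect_density)
         / relative_cost_per_good_die(area, defect_density))
print(round(ratio, 6))  # cost ratio when the die area is doubled
```

Under this model the ratio is 2·exp(defect_density · area), so the fivefold figure corresponds to an expected 0.92 defects per die at the original size; for smaller defect counts the penalty is closer to the factor of two that comes from area alone.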
Figure 3: Circuit model for charging a gate in MOS technology. [Circuit diagram labels: V, R, C, U.]

Another important factor with respect to performance is the wire length, in particular the maximum wire length, since it may determine the clock frequency. MOS (Metal Oxide Semiconductor) VLSI technology is essentially a charge transfer technology. In silicon based MOS technologies, a transistor is formed by crossing a wire in polycrystalline silicon with another wire in diffused silicon, or diffusion for short. In addition, wires in aluminum, known as metal, are used to carry signals, clock, and power. Aluminum is a much better conductor than either poly or diffusion. The gate of a transistor formed by the crossing of poly and diffusion acts as a capacitor. When charged, the transistor is open, allowing current to flow in diffusion through the gate. When the capacitor is drained, the transistor is turned off, and the gate is closed. No current can flow past the gate in the diffusion layer.

The charging of the capacitor can be modeled as shown in Figure 3. The voltage across the capacitor follows the equation

$$U = V\left(1 - e^{-t/RC}\right)$$

for an initial charge of 0. The value RC is known as the time constant. At time t = RC the voltage has reached 63% of its final value. R is directly proportional to the length of the wire. Thus, long wires mean slow circuits in MOS technology. There exist driver techniques that can overcome this problem in part [7], but at best the time to charge the gate may be proportional to the logarithm of the wire length. Note that if the speed of light is the limiting factor, then the wire length is again very important. Hence, in addition to area, we will also discuss the maximum wire length of the layouts we consider.

The Thompson grid model can be used to establish some area and wire length properties for networks as a function of their topological characteristics.

Theorem. The minimum area A of a network with bisection width B is at least of order $B^2$, i.e., $A \geq O(B^2)$.

Proof. With the network laid out on a grid, consider a vertical cut of minimum height. This cut must cross at least B - 1 horizontal tracks, since there is at most one jog to split the nodes in the left and right parts. A total of B tracks must be cut since the minimum bisection width is B. The same argument can be used for a horizontal cut, which hence must have width at least B - 1 tracks. Hence,

$$A = \text{height} \times \text{width} \geq (B - 1)^2 = O(B^2).$$

QED

Figure 4: Bisecting a collection of nodes laid out on a planar grid.

Figure 4 illustrates the idea in the proof. The theorem above allows us to establish an interesting relation between area and speed. In a computation like sorting, in the worst case half of the data must be moved from one half to the other. Thus,

$$T \geq O(N)/B \quad \text{or} \quad BT \geq O(N).$$

Squaring this relation yields

$$B^2 T^2 \geq O(N^2) \quad \text{or} \quad AT^2 \geq O(N^2).$$
The implication of this relationship, commonly known as the $AT^2$ bound, is that there is a tradeoff between area and speed. A small area must result in a relatively large running time. Conversely, a small running time must result in a relatively large area.

Turning to the wire length, we first establish a general lower bound for grid based layouts.

Theorem. A lower bound for the maximum wire length L is $L \geq \sqrt{A}/D$, where D is the diameter of the graph.

Proof. Assume an approximately square layout. Then, there are $\sqrt{A}$ tracks in each direction. Thus, there are two nodes in the graph that are at a distance of at least $\sqrt{A}$. In the graph, these nodes are at most D links apart. Thus, at least one wire on a path between them has length at least $\sqrt{A}/D$. QED

From the theorem it follows that $LT \geq O(N/D)$. Thus, for a graph of a given diameter, the faster the circuit, the longer its longest wire must be. Also, as the diameter of the graph is reduced, the longest wire must become longer for a given time T.

The average wire length can also be used as a lower bound for the maximum wire length. Thus, with the total wire length W and the total number of wires M, we have $L \geq O(W/M)$.
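The bounds of this section can be collected into a single chain. The block below is a summary sketch in which constant factors are suppressed, as in the text, so each inequality should be read as holding up to a constant.

```latex
\begin{align*}
T \geq \frac{N}{B}, \quad A \geq B^2
  \quad &\Longrightarrow \quad A T^2 \geq B^2 T^2 \geq N^2
  \quad \Longrightarrow \quad \sqrt{A}\, T \geq N, \\
L \geq \frac{\sqrt{A}}{D}
  \quad &\Longrightarrow \quad L T \geq \frac{\sqrt{A}\, T}{D} \geq \frac{N}{D}.
\end{align*}
```

The first line is the area-time tradeoff; the second combines it with the wire length theorem, which is where the dependence on the diameter D enters.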
3 Completely Connected Networks
In a completely connected network or crossbar each node is connected to every other node. For an N node network, each node has degree N - 1. The diameter is 1 and the bisection width is $\frac{N}{2} \cdot \frac{N}{2}$. Due to the large degree of the nodes, these networks are impractical for more than a few nodes. An 8 node network is shown below:

[Figure: an 8 node completely connected network.]
4 Array Networks
4.1 Linear Array
An N node linear array has the following connectivity:

$$i \rightarrow \begin{cases} i \pm 1, & 0 < i < N - 1 \\ i + 1, & i = 0 \\ i - 1, & i = N - 1 \end{cases}$$

A linear array with 10 nodes is shown below:

[Figure: 10 nodes connected in a line.]
4.2 Ring
A ring is simply a linear array with wraparound. The connectivity is:

$$i \rightarrow (i \pm 1) \bmod N$$
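The connectivity rules for the linear array and the ring translate directly into adjacency lists, from which the diameter defined earlier can be computed by breadth-first search. The following is a minimal sketch; the function names are illustrative and not part of the lecture material.

```python
from collections import deque

def linear_array(n):
    """Adjacency lists for an n-node linear array: i -> i-1, i+1 where they exist."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Adjacency lists for an n-node ring: i -> (i +/- 1) mod n (assumes n >= 3)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def diameter(adj):
    """Largest shortest-path distance over all node pairs, by BFS from every node."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

print(diameter(linear_array(10)))  # N - 1 for a linear array
print(diameter(ring(10)))          # floor(N / 2) for a ring
```

The wraparound link roughly halves the diameter, at the cost of one wire that must span the array unless the ring is folded in the layout.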
4.3 Mesh
A mesh is a 2-dimensional array where the lengths of the axes are $N_0$ and $N_1$. There are $N = N_0 N_1$ nodes. For an $N_0 \times N_1$ mesh ($N_0, N_1 > 1$), the connectivity is:

$$(i, j) \rightarrow \begin{cases}
(i \pm 1, j), & 0 < i < N_0 - 1,\ 0 \leq j < N_1 \\
(i + 1, j), & i = 0,\ 0 \leq j < N_1 \\
(i - 1, j), & i = N_0 - 1,\ 0 \leq j < N_1 \\
(i, j \pm 1), & 0 \leq i < N_0,\ 0 < j < N_1 - 1 \\
(i, j + 1), & 0 \leq i < N_0,\ j = 0 \\
(i, j - 1), & 0 \leq i < N_0,\ j = N_1 - 1
\end{cases}$$

A 3 × 6 mesh is shown below:

[Figure: 18 nodes arranged as a 3 × 6 mesh.]
In practice, a linear array may in fact be laid out in a two- or three-dimensional grid-like manner. In such situations, it is clearly of interest to find out both what the gains would be from actually creating the additional connections inherent in a two-dimensional or three-dimensional grid, as
well as the associated cost. One additional aspect of using grids of two or higher dimensions is that, for certain computations, a good algorithm for linear arrays may be known, and it is of interest to find out how the linear array can be emulated on the higher dimensional arrays. The emulation of linear arrays and rings on higher dimensional arrays is first a question of whether the graph is Hamiltonian or not, and second a question of finding an embedding that is efficient with respect to expansion and dilation. It is also of interest to consider the emulation of a mesh of two or more dimensions on another mesh of two or more dimensions, possibly with a number of dimensions different from that of the mesh to be emulated. We will discuss such emulations in later lectures.
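As a concrete illustration of emulating a linear array on a mesh, the sketch below uses the familiar row-by-row "snake" order, which visits every mesh node once and only ever steps between mesh neighbors. It is offered as an example of a dilation-one embedding, not as the construction used in the later lectures; the function names are illustrative.

```python
def snake_embedding(n0, n1):
    """Map linear-array node k to mesh node (i, j), row by row in alternating direction."""
    order = []
    for i in range(n0):
        cols = range(n1) if i % 2 == 0 else range(n1 - 1, -1, -1)
        order.extend((i, j) for j in cols)
    return order

def dilation(order):
    """Largest mesh distance between the images of adjacent linear-array nodes.

    In a mesh without wraparound, the distance between two nodes is the
    Manhattan distance between their coordinates.
    """
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for a, b in zip(order, order[1:]))

path = snake_embedding(3, 6)
print(len(path), dilation(path))  # number of nodes used, and the dilation
```

Because the path uses every node of the 3 × 6 mesh exactly once, the expansion is one as well. Embedding a ring is harder: closing the path into a cycle requires a Hamiltonian cycle, which a mesh with an odd number of nodes does not have.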
4.4 Torus
A torus is a mesh with wraparound along both axes. The connectivity is:

$$(i, j) \rightarrow \begin{cases}
((i \pm 1) \bmod N_0,\ j), & 0 \leq i < N_0,\ 0 \leq j < N_1 \\
(i,\ (j \pm 1) \bmod N_1), & 0 \leq i < N_0,\ 0 \leq j < N_1
\end{cases}$$

A 3 × 6 torus is shown below:

[Figure: 18 nodes arranged as a 3 × 6 torus.]
4.5 Twisted Torus
A twisted torus is a mesh with wraparound in which the wrapping is skewed. With a skewing distance of one, the last element in a column is connected to the first element in the next column, and similarly for the rows. If we let $s_c$ be the skewing distance between columns and $s_r$ the skewing distance between rows, then the connectivity is:
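$$(i, j) \rightarrow \begin{cases}
(i \pm 1, j), & 0 < i < N_0 - 1,\ 0 \leq j < N_1 \\
(i + 1, j), & i = 0,\ 0 \leq j < N_1 \\
(N_0 - 1,\ (j - s_c) \bmod N_1), & i = 0,\ 0 \leq j < N_1 \\
(i - 1, j), & i = N_0 - 1,\ 0 \leq j < N_1 \\
(0,\ (j + s_c) \bmod N_1), & i = N_0 - 1,\ 0 \leq j < N_1 \\
(i, j \pm 1), & 0 \leq i < N_0,\ 0 < j < N_1 - 1 \\
(i, j + 1), & 0 \leq i < N_0,\ j = 0 \\
((i - s_r) \bmod N_0,\ N_1 - 1), & 0 \leq i < N_0,\ j = 0 \\
(i, j - 1), & 0 \leq i < N_0,\ j = N_1 - 1 \\
((i + s_r) \bmod N_0,\ 0), & 0 \leq i < N_0,\ j = N_1 - 1
\end{cases}$$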
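The twisted-torus connectivity is easy to get wrong by a sign, so a small sketch that encodes the rule and checks that every link is seen from both of its endpoints may be useful. The function names are illustrative; `sc` and `sr` stand for the skewing distances between columns and rows, and the check assumes both axes have at least three nodes.

```python
def twisted_torus_neighbors(i, j, n0, n1, sc, sr):
    """Neighbors of node (i, j) in an n0 x n1 twisted torus.

    Interior nodes connect as in a mesh. Wrapping from i = n0 - 1 to i = 0
    shifts the column index by +sc (and by -sc in the other direction);
    wrapping from j = n1 - 1 to j = 0 shifts the row index by +sr.
    """
    up = (i - 1, j) if i > 0 else (n0 - 1, (j - sc) % n1)
    down = (i + 1, j) if i < n0 - 1 else (0, (j + sc) % n1)
    left = (i, j - 1) if j > 0 else ((i - sr) % n0, n1 - 1)
    right = (i, j + 1) if j < n1 - 1 else ((i + sr) % n0, 0)
    return [up, down, left, right]

def is_symmetric(n0, n1, sc, sr):
    """True if every link appears in the neighbor lists of both of its endpoints."""
    for i in range(n0):
        for j in range(n1):
            for a, b in twisted_torus_neighbors(i, j, n0, n1, sc, sr):
                if (i, j) not in twisted_torus_neighbors(a, b, n0, n1, sc, sr):
                    return False
    return True

print(is_symmetric(3, 6, sc=1, sr=1))
# Last element of column 5 in a 3 x 6 array with column skew 1 and no row skew:
print(twisted_torus_neighbors(2, 5, 3, 6, sc=1, sr=0))
```

Setting both skewing distances to zero recovers the ordinary torus of the previous section.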
Two-dimensional arrays with wraparound, known as tori, can be laid out using $2N_0$ horizontal tracks and $2N_1$ vertical tracks for wires, for a total area proportional to $4 N_0 N_1$. The maximum wire length is two. Such a layout can be created by simply applying the technique used for laying out a ring in one dimension to both dimensions of the torus. What is the area requirement for a twisted torus as a function of the twisting factors?
4.6 Three dimensional meshes
From a construction point of view, a chip and a printed circuit board are essentially two-dimensional media, even though each may have several layers. But, for the construction of systems larger than what fits in either of these media, three spatial dimensions are used. Since the diameter of a three-dimensional cubic mesh of N nodes is $3\sqrt[3]{N}$, which compares favorably with $2\sqrt{N}$ for a square mesh, it is natural to explore the merits of building and using three-dimensional meshes.

The bisection width of a cubic mesh with an even number of nodes along each axis is $N^{2/3}$. Thus, one lower bound for sorting on a cubic mesh is $\frac{1}{2}N^{1/3}$. The diameter is another lower bound. Both bounds are of order $O(N^{1/3})$. Thus, if an algorithm can be found that sorts in a time proportional to the lower bound, it can be expected to be faster than sorting on a two-dimensional mesh.

Each node in a three-dimensional mesh has six channels, the boundary nodes excepted. Thus, if the unit to be packaged into a planar unit is a single node, then the width of each channel is a third of that of a channel of a linear array, and $\frac{2}{3}$ of that of a channel of a two-dimensional array. If the subsystem packaged onto a planar unit consists of a cubic submesh of M nodes, each channel is of width $\frac{w}{6M^{2/3}}$, compared to $\frac{w}{2}$ for the linear array and $\frac{w}{4M^{1/2}}$ for the two-dimensional mesh, where w is the total width of all channels for the packaging unit. Thus, for a fixed perimeter and planar layouts, optimal sorting times for the linear array, the two-dimensional mesh and the three-dimensional mesh compare as $O(N)$, $O(M^{1/2}N^{1/2})$ and $O(M^{2/3}N^{1/3})$. We will later discuss sorting algorithms of optimal order.

Another issue in constructing three-dimensional meshes is their area requirements when laid out in a plane. Whereas an M node two-dimensional mesh can be laid out in area O(M), that is not true for a three-dimensional mesh with M nodes. Also, a two-dimensional mesh can be laid out in the plane with all
channels being of length of order O(1), even if a reshape is required. However, this is not possible for three-dimensional arrays laid out in the plane.

Regular discretization of three-dimensional domains leads to regular three-dimensional meshes. Using Jacobi iterations to solve a set of partial differential equations discretized by 7-point stencils leads to the need to emulate three-dimensional meshes. The common 7-point stencil in Figure 5 is derived similarly to the 5-point stencil in two dimensions. Another example where three-dimensional arrays lead to fewer steps than the same computation on two-dimensional arrays is matrix-matrix multiplication. We will discuss matrix-matrix multiplication on three-dimensional meshes later.

Figure 5: A 7-point stencil in two dimensions.
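To make the nearest-neighbor reference pattern of a 7-point stencil concrete, here is a minimal sketch of one Jacobi sweep. It assumes Laplace's equation with fixed boundary values, which is a choice made here for illustration since the text does not name a particular equation; the function name and the grid size are likewise illustrative.

```python
def jacobi_sweep(u):
    """One Jacobi sweep for Laplace's equation on the interior of a 3-D grid.

    u is a nested list indexed as u[i][j][k]; boundary values are kept fixed.
    Each interior point is replaced by the average of its six neighbors, so
    together with the point itself the update touches seven grid points.
    """
    n0, n1, n2 = len(u), len(u[0]), len(u[0][0])
    new = [[row[:] for row in plane] for plane in u]
    for i in range(1, n0 - 1):
        for j in range(1, n1 - 1):
            for k in range(1, n2 - 1):
                new[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k] +
                                u[i][j - 1][k] + u[i][j + 1][k] +
                                u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return new

# 4 x 4 x 4 grid with the boundary held at 1.0 and the interior started at 0.0.
n = 4
grid = [[[1.0 if 0 in (i, j, k) or n - 1 in (i, j, k) else 0.0
          for k in range(n)] for j in range(n)] for i in range(n)]
for _ in range(50):
    grid = jacobi_sweep(grid)
print(grid[1][1][1])  # approaches the boundary value 1.0
```

When the grid is partitioned into subgrids, one per node, every sweep requires each node to exchange the faces of its subgrid with its six neighbors, which is the three-dimensional mesh communication pattern discussed above.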
For the layout of a three-dimensional array, assume for simplicity that $N_0 = N_1 = N_2 = N^{1/3}$. Then, an approximately square layout is obtained by laying out the nodes on a plane perpendicular to the surface of the layout, and parallel to the horizontal tracks, between tracks representing points of the cube in the plane of the embedding surface, as shown in Figure 6. The other planes perpendicular to the embedding surface, but aligned with the vertical tracks, are treated analogously. Thus, $N^{1/3}$ tracks are needed in both the horizontal direction and the vertical direction between each pair of tracks representing rows or columns of the plane aligned with the embedding surface. Thus, a total of $N^{1/3} \cdot N^{1/3} = N^{2/3}$ horizontal and vertical tracks are required. The area of this layout is $N^{2/3} \cdot N^{2/3} = N^{4/3}$. Hence, an upper bound on the area is $O(N^{4/3})$. The maximum wire length is $O(N^{1/3})$.

This straightforward layout is indeed optimal with respect to area and wire length. It is easy to see that the bisection width is $N^{2/3}$ and that the area thus must be at least $O(N^{4/3})$. For our three-dimensional mesh laid out in the plane, the Thompson grid model yields a maximum wire length of at least $O(N^{1/3})$. Thus, the layout is optimal with respect to both area and maximum wire length.

It is easily verified that if instead the planes parallel to the embedding surface are placed along one axis to form a layout with $N^{1/3} \times N^{2/3}$ nodes, then $N^{1/3}$ tracks are still required between each pair of rows in the plane of the embedding surface, since there are $N^{1/3}$ nodes all of which must connect to another plane placed at a distance of $N^{1/3}$.

We note that, to strictly adhere to the Thompson grid model, in the layouts we have described each node should be represented as a 2 × 2 subgrid, and a total of $2N^{2/3}$ horizontal and vertical tracks are required, for an area of $2N^{2/3} \cdot 2N^{2/3} = 4N^{4/3}$.
Figure 6: Layout of a cube in two dimensions. [Diagram; dimensions labeled $N^{1/3}$.]
4.7 Multidimensional Arrays
These are straightforward generalizations of a mesh to more than two dimensions.
4.8 Multidimensional Tori
These are straightforward generalizations of a torus to more than two dimensions.
5 Tree Networks
5.1 Complete Binary Trees
The nodes of a d-level complete binary tree can be labeled from 1 to $2^d - 1$, with the connectivity being:

$$i \rightarrow 2i,\ 2i + 1, \qquad 1 \leq i < 2^{d-1}$$
A $2^4 - 1$ node complete binary tree is shown below:

[Figure: a 15 node complete binary tree. Node 1 is the root, nodes 2 and 3 form the second level, nodes 4 to 7 the third level, and nodes 8 to 15 are the leaves.]
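The labeling makes tree navigation a matter of integer arithmetic. The sketch below spells this out for the 15 node example; the helper names are illustrative and not part of the lecture material.

```python
def children(i, d):
    """Children of node i in a d-level complete binary tree labeled 1 .. 2**d - 1."""
    return [2 * i, 2 * i + 1] if 1 <= i < 2 ** (d - 1) else []

def parent(i):
    """Parent of node i; the root (node 1) has none."""
    return i // 2 if i > 1 else None

def level(i):
    """Level of node i, counting the root as level 1."""
    return i.bit_length()

d = 4
leaves = [i for i in range(1, 2 ** d) if not children(i, d)]
print(leaves)                                    # the nodes on the last level
print(children(3, d), parent(11), level(11))     # neighbors and level of sample nodes
```

The level of a node is simply the number of bits in its label, which is why the leaves of a d-level tree are exactly the labels from $2^{d-1}$ to $2^d - 1$.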
5.1.1 Layout of complete binary trees
A complete binary tree layout is shown in Figure 7. The width of this layout is (N + 1)/2 and its height is $\log_2(N + 1) - 1$, where $N = 2^{h+1} - 1$. The area is $O(N \log_2 N)$. The maximum wire length is (N + 1)/4. One of the merits of this layout is that all the leaf nodes are on the boundary. If the leaf nodes are used for input and output, then this feature is highly desirable. The layout in Figure 7 is straightforward. However, compared to the layout of the linear array and the two-dimensional mesh, this complete binary tree layout requires both larger area and longer wires. It is natural to ask:
1. Does there exist a layout with area O(N)?
2. What is the minimum maximum wire length?
3. Do there exist layouts that minimize both the area and the maximum wire length?
4. How can the binary tree be partitioned into parts that can be replicated and used for the construction of arbitrarily large trees?

Figure 7: A binary tree.

In an attempt to reduce the area we can try to place the leaf nodes along all four boundaries of the bounding box for the tree. Placing half of the leaf nodes along the top boundary and half along the bottom boundary reduces the width by a factor of two, but doubles the height. Thus, though the aspect ratio improves, the area remains the same. A few leaf nodes ($O(\log_2 N)$ nodes) can be moved from the bottom to one of the sides without increasing the height. A similar rearrangement can be made for the top nodes. This rearrangement reduces the area by a lower order term ($O(\log_2^2 N)$). The area remains $O(N \log_2 N)$. Moving additional nodes increases the height, and the area. Distributing the leaf nodes evenly along all four boundaries results in an area of order $O(N^2)$. The layout in Figure 7 indeed has an area of optimal order with leaf nodes assigned to the boundary.

Theorem. [2, 12] Any complete binary tree with N nodes laid out with the leaf nodes on the boundary of a rectangle must have area of at least $O(N \log_2 N)$.

Proof. The proof is based on the total wire length for the complete binary tree with the leaf nodes on a line. Let the total wire length of a tree of height h be W(h). Let M(h) be the total length of all wires in a tree of height h with a longest path from the root to a leaf excluded. Then,

$$M(h) \geq W(h - 1) + M(h - 1).$$

This fact can be seen as follows. Combine two h - 1 level trees to create one level h tree. Exclude a longest path from the root to one leaf in the tree of height h. Then, the total wire length in one of the two original subtrees is W(h - 1), and in the other it is M(h - 1). The idea is illustrated in Figure 8. In addition, the following relationship also holds,
$$W(h) \geq 2M(h - 1) + 2^{h-1}.$$

Figure 8: Recursive relationships of wire lengths with longest path from root to a leaf excluded. [Diagram labels: M(h), W(h-1), M(h-1).]

To see that this relationship indeed is true, consider the two subtrees of the root. Let v be the leftmost leaf node of one subtree and let u be the rightmost leaf node of the other subtree. Since u is the rightmost leaf node of the other subtree, the distance between nodes v and u must be sufficiently large to leave room for all leaf nodes, of which there are $2^{h-1}$ in a tree of height h. Thus, v and u are at least a distance of $2^{h-1}$ apart with all leaf nodes placed on a line. Now, remove the path from v to u. This path does not necessarily contain the longest path in either subtree. Furthermore, the removed path must be at least of length $2^{h-1}$. Hence, $W(h) \geq 2M(h - 1) + 2^{h-1}$. Substituting the second expression into the first expression gives

$$M(h) \geq 2M(h - 2) + M(h - 1) + 2^{h-2},$$

with M(0) = 0 and M(1) = 1. By induction, $M(h) \geq h 2^h / 6$. It follows that the total wire length W(h) satisfies the same relation. Hence, since at least half of the wire length must be either in horizontal or in vertical tracks, the area is at least W(h)/2, and we have proved that a complete binary tree with the leaf nodes on a line has an area of at least $O(N \log_2 N)$. With the arguments preceding the theorem we have now proved that any complete binary tree with the leaf nodes assigned to the boundary of a rectangle must have an area of at least $O(N \log_2 N)$. QED

With leaf nodes along both the top and bottom boundaries of Figure 7, the maximum wire length is N/8.

In the above tree layout we imposed a constraint on the placement of the leaf nodes. In discussing the layout of other networks we did not impose any constraint on the placement of nodes. However, if we impose the constraint that all nodes of an N node mesh are placed on the boundary, then an area of order $O(N\sqrt{N})$ would be required for a two-dimensional square mesh with all nodes placed on a line. The maximum wire length would be $O(\sqrt{N})$ instead of O(1).
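The induction step in the proof can be checked numerically. The sketch below iterates the combined recurrence with equality, which gives the smallest values the recurrence permits, and compares them with the closed-form bound $h 2^h / 6$; the function name is illustrative.

```python
def m_lower_bound(h_max):
    """Iterate M(h) = 2*M(h-2) + M(h-1) + 2**(h-2) with M(0) = 0, M(1) = 1.

    Taking the recurrence with equality yields the smallest sequence that
    satisfies the inequality in the proof.
    """
    m = [0, 1]
    for h in range(2, h_max + 1):
        m.append(2 * m[h - 2] + m[h - 1] + 2 ** (h - 2))
    return m

m = m_lower_bound(20)
# Compare with h * 2**h / 6 using integers only: 6 * M(h) >= h * 2**h.
assert all(6 * m[h] >= h * 2 ** h for h in range(21))
print(m[:6])
```

Substituting the bound for M(h - 1) and M(h - 2) into the right-hand side gives exactly $h 2^h / 6$, so the induction step holds with equality and the comparison above succeeds only because the base cases M(0) and M(1) sit at or above the bound.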
What is the minimum area required for a d-dimensional mesh with the constraint of all nodes on the boundary? What is the minimum maximum wire length? Relaxing the constraint on the placement of the tree nodes allows for layouts with an area of order O(N), and a maximum wire length of optimal order, $O(\sqrt{N}/\log_2 N)$. The so-called H-tree layout, shown in Figure 9, requires area O(N).
Figure 9: H-tree layout of a complete binary tree.
Theorem. An N node complete binary tree can be laid out in area O(N).
Proof. The number of nodes in the H-tree is $2 \cdot 4^i - 1 = 2^{2i+1} - 1$. The number of levels in the complete binary tree is 2i. The number of horizontal and vertical tracks is $2^{i+1} - 1$ for a complete binary tree of height 2i. This is easily shown by induction. Hence, the area is $2^{2i+2} - 2 \cdot 2^{i+1} + 1 \le 2N$. QED
The wire length in the H-tree layout in Figure 9 doubles for every pair of levels of the binary tree. Thus, the maximum wire length is $2^{i-1}$ for a tree of height 2i, that is, the maximum wire length is $\frac{\sqrt{N}}{2\sqrt{2}}$. Compared to the layout with all nodes on the boundary, the maximum wire length is reduced by a factor of $\frac{\sqrt{N}}{2\sqrt{2}}$. Relaxing the constraint to place the nodes on the boundary yields a substantial improvement in both area and maximum wire length.
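The counts used in the H-tree proof above can be checked numerically. The following is a minimal Python sketch; the function name `htree_stats` is illustrative and not part of the notes. It evaluates, for a tree of height 2i, the number of nodes, the number of tracks per side, the resulting area, and the maximum wire length quoted in the text.

```python
def htree_stats(i):
    """Quantities quoted in the H-tree proof for a tree of height 2*i (i >= 1)."""
    n_nodes = 2 * 4**i - 1      # N = 2*4^i - 1 = 2^(2i+1) - 1
    side = 2**(i + 1) - 1       # number of horizontal (and vertical) tracks
    area = side * side          # 2^(2i+2) - 2*2^(i+1) + 1
    max_wire = 2**(i - 1)       # doubles for every pair of tree levels
    return n_nodes, side, area, max_wire


for i in range(1, 6):
    n, side, area, wire = htree_stats(i)
    # The area stays below 2N for every i, as claimed in the proof.
    print(f"i={i}: N={n}, side={side}, area={area}, 2N={2 * n}, max wire={wire}")
```

For i = 1 this gives N = 7, a 3 by 3 layout of area 9, and a maximum wire length of 1.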
The area is of optimal order, but the wire length is not. A lower bound on the wire length based on the diameter is $O(\sqrt{N}/\log_2 N)$. Does there exist an area optimal layout with minimum maximum wire length? The answer is yes [8, 10, 12]. A schematic layout is shown in Figure 10.
Theorem. An N node complete binary tree can be laid out in the plane in area O(N) with a maximum wire length of $O(\sqrt{N}/\log_2 N)$.
Proof. The root of the tree is in, say, the bottom left corner of the horizontal box. This box contains a tree of h/4 levels, and thus $2^{h/4}$ leaf nodes, all of which are placed on the upper boundary. This boundary is of length $2^{h/2}$, and easily holds the leaf nodes of the tree in the bottom box. The height of the bottom box is made to be $2^{h/4}$ to allow for flexibility in the tree layout in the box. This generous height will not increase the order of the layout area, but allows a layout that does not exceed the optimal maximum wire length.
Figure 10: Maximum wire length optimal H-tree layout of a complete binary tree.
Each of the vertical boxes is just like the horizontal box. The $2^{h/4}$ leaf nodes in each of these trees are placed on the left boundary. Each of these nodes forms the root of an H-tree with h/2 levels of a complete binary tree. Each H-tree has a side of $2^{h/4}$. The total height of the H-trees connecting to a vertical box is $2^{h/2}$, which matches the height of the vertical box. The width of our layout is $(2 \cdot 2^{h/4}) \cdot 2^{h/4} = 2 \cdot 2^{h/2}$, which matches the width of the bottom box within a factor of two. The height is $2^{h/2} + 2^{h/4}$, which is $O(2^{h/2})$. Thus, we know that the area is of optimal order. The maximum wire length in each H-tree is $O(\sqrt{2^{h/2}}) = O(2^{h/4})$, which is below the bound $O(2^{h/2}/h)$. Thus, the maximum wire length must occur in the rectangular boxes. Since there are h/4 levels, no horizontal wire needs to be longer than $2^{h/2}/(h/4)$ for the bottom box. This quantity is of optimal order. Vertical wires are at most $2^{h/4}$ in length. There is ample room to move the nodes at the bottom edge to assure that the maximum horizontal wire length constraint is satisfied. The number of horizontal tracks is equal to the number of leaf nodes, and conflicts can be avoided. Thus, we have a layout that is optimal with respect to both area and maximum wire length. QED
For additional results on binary tree layouts see [1, 13].
5.1.2 Partitioning
Complete binary trees can be partitioned in a scalable way. It is possible to construct trees such that four channels per partition suffice, regardless of what size subtree fits within the partition, and regardless of what size complete binary tree is constructed out of the partitions. One spare per partition suffices to accomplish this goal. All connections necessary for the assembly of the partitions into a large complete binary tree are external to the partitions. This very nice partitioning property of complete binary trees may make them competitive with, for instance, two-dimensional meshes when channel width is important for performance. With a total of w bits crossing the partition boundary, each channel is of width w/4 for the complete binary tree, but only of width $w/(4\sqrt{N})$ for a two-dimensional square mesh, and of width $w/(6N^{2/3})$ for a three-dimensional cubic mesh.
5.2 X-Trees
X-trees are complete binary trees with the nodes of each level of the tree connected as a linear array. The connectivity is:
$$i \leftrightarrow \{2i,\ 2i + 1\}, \quad 0 < i < 2^{d-1}$$
$$i \leftrightarrow i + 1, \quad 2^j \le i < 2^{j+1} - 1,\ 0 < j < d$$
A $2^4 - 1$ node X-tree is shown below:
[Figure: an X-tree with nodes numbered 1 to 15 in level order.]
5.3 Two-Rooted Trees
Two-rooted trees are complete binary trees in which the root has been split into two nodes. A 16 node two-rooted tree is shown below:
[Figure: a 16 node two-rooted tree.]
6 Hypercubic Networks
The hypercube is one of the most powerful networks for parallel computation. It is well suited for many different tasks. More important, the hypercube can efficiently simulate any other network of the same size. Unfortunately, as we shall soon see, the degree of hypercube nodes is not constant. To address this problem, several other related networks have been proposed. These networks are all capable of emulating the hypercube with constant or logarithmic slowdown. Together with the hypercube, these related networks form the hypercubic class of networks. For an extensive discussion of these networks, see Chapter 3 of [4].
6.1 The Hypercube
The hypercube is also known as the boolean or binary cube. These networks are multidimensional arrays with an axis extent of two for each axis. Thus, $N = 2^d$, where d is the number of axes (or dimensions), and $N_j = 2$ for $0 \le j < d$. Let a node index be $i = (i_{d-1} i_{d-2} \ldots i_0)$, where $i_j$ is the index value along axis j. Note that $i_j \in \{0, 1\}$ and that the Cartesian coordinate representation of a node index is identical to the binary decomposition of i. The connectivity can be represented as:
$$i \leftrightarrow i \oplus 2^j \quad \text{for } j \in \{0, 1, \ldots, d-1\}, \text{ and } 0 \le i < N$$
where $\oplus$ denotes the bitwise exclusive OR operation. An alternative definition is:
$$i = (i_{d-1} i_{d-2} \ldots i_j \ldots i_0) \leftrightarrow (i_{d-1} i_{d-2} \ldots \bar{i}_j \ldots i_0) \quad \text{for } j \in \{0, 1, \ldots, d-1\}, \text{ and } 0 \le i < N$$
where $\bar{i}_j$ denotes the complement of $i_j$. Two, three, and four dimensional hypercubes are shown below:
[Figure: two-, three-, and four-dimensional hypercubes.]
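The exclusive-OR form of the hypercube connectivity translates directly into code. The following is a minimal Python sketch; the function names are illustrative and not part of the notes. It lists the d neighbors of a node by flipping one address bit at a time.

```python
def hypercube_neighbors(i, d):
    """Neighbors of node i in a d-dimensional binary cube: i XOR 2^j for each axis j."""
    return [i ^ (1 << j) for j in range(d)]


def hamming_distance(a, b):
    """Number of bit positions in which the addresses a and b differ."""
    return bin(a ^ b).count("1")


# Node 5 = (101) in a 3-dimensional cube has neighbors 100, 111, and 001.
print(hypercube_neighbors(5, 3))  # [4, 7, 1]
```

Every neighbor is at Hamming distance one, and the number of neighbors equals d, so the node degree grows with the size of the network.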
6.1.1 Layout
For the layout of binary cubes with an even number of dimensions we place the nodes on a $\sqrt{N} \times \sqrt{N}$ grid. Each node is represented by a $\frac{1}{2}\log_2 N \times \frac{1}{2}\log_2 N$ subgrid in order to allow for the $\log_2 N$ channels per node. Figure 11 shows the layout of a 64 node binary cube. The number of tracks for routing between nodes within a row is one for the first dimension, two for the second dimension, four for the third dimension, etc. Thus, the total number of tracks for routing between nodes in a row is $1 + 2 + 4 + 8 + \ldots + 2^{\frac{1}{2}\log_2 N - 1} = \sqrt{N} - 1$. It follows that the total height of the layout is $\sqrt{N}(\frac{1}{2}\log_2 N + \sqrt{N} - 1) \approx N$. The width is also $\approx N$. Thus the area is $\approx N^2$. This layout has an area of optimal order, since the bisection width is N/2. The wire length is N/2. The diameter of the binary cube is $\log_2 N$, and the total number of wires is $\frac{1}{2} N \log_2 N$. Thus, a lower bound for the maximum wire length is $N/\log_2 N$, whether determined by the average wire length, or by the diameter. Hence, our layout is nonoptimal with respect to wire length by at most a factor of $O(\log_2 N)$. Is there a better lower bound, or a layout with shorter maximum wire length?
Figure 11: Layout in the plane of a binary cube.
Comparing a three-dimensional mesh and a binary cube occupying the same area, we notice that we can fit $A^{3/4}$ nodes of the three-dimensional mesh, but only $A^{1/2}$ binary cube nodes. With respect to the maximum wire length, it is $A^{1/4}$ for the mesh while it is $A^{1/2}$ for the binary cube.
With respect to partitioning of the binary cube, the width of each channel that must leave the partition depends both on the number of nodes in the partition, and the number of nodes in the total system. If $A_p$ is the area for a partition, then each of the $A_p^{1/2}$ nodes in the partition requires $n - \frac{1}{2}\log_2 A_p$ channels to nodes not contained in the partition, where $n = \log_2 N$. Hence, the number of channels leaving the partition is $A_p^{1/2}(n - \frac{1}{2}\log_2 A_p)$ compared to $6A_p^{1/2}$ for a three-dimensional mesh. For both these estimates we assumed for simplicity that channels leaving the partition $A_p$ do not consume any area in the partition. This assumption is not valid when all connectors are confined to the perimeter of the partition.
As an example, consider a partition that holds a $4 \times 4 \times 4 = A^{3/4}$ submesh. Then, the partition would hold 16 binary cube nodes. The three-dimensional mesh would require $6 \cdot 16 = 96$ channels off the partition. For a 64k node system, i.e., a thousand partitions in the system, the binary cube network would require $16(16 - 4) = 192$ channels. The width of each channel is half of that of the three-dimensional mesh.
We leave it to the reader to verify that a Gray code based layout requires $\sqrt{N} - \log_2 \sqrt{N}$ tracks for routing between nodes in rows and columns, respectively. Thus, the area is only improved by a lower order term. A Gray code layout is shown in Figure 12. Note however, that even though the Gray code layout does not yield any substantial reduction in area and increases the maximum wire length by a factor of two, the Gray code layout allows a $\sqrt{N} \times \sqrt{N}$ mesh to be emulated using only wires of unit length.
Figure 12: Gray code layout in the plane of a binary cube.
The binary cube is a special case of a hypercube. A k-dimensional hypercube is a mesh of shape $M \times M \times M \times \ldots \times M = M^k$. For the binary cube M = 2. It can be shown that the hypercube can be laid out in area $O(M^{2(k-1)})$, k > 1. With a bisection width of $M^{(k-1)}$ this area is optimal.
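The partition example in this subsection can be recomputed with a few lines of Python. This is a sketch under the same simplifying assumption as the text, namely that channels leaving the partition consume no area inside it; the function names are illustrative and not part of the notes.

```python
def mesh3d_channels(side):
    """Channels leaving a side x side x side submesh: six faces of side*side nodes."""
    return 6 * side * side


def cube_channels(partition_nodes, total_nodes):
    """Channels leaving a subcube of a binary cube; both arguments are powers of two."""
    n = total_nodes.bit_length() - 1       # dimension of the whole cube, log2 N
    m = partition_nodes.bit_length() - 1   # dimensions internal to the partition
    return partition_nodes * (n - m)


# A partition area that holds a 4 x 4 x 4 submesh holds 16 binary cube nodes.
print(mesh3d_channels(4))            # 96
print(cube_channels(16, 64 * 1024))  # 192
```

With a fixed number of wires on the partition boundary, twice as many channels means each binary cube channel is half as wide as a mesh channel.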
6.2 Cube Connected Cycles (CCC) Networks
The cube connected cycles networks are derived from Boolean cube networks by replacing each node of a Boolean cube network with a ring of nodes. Each ring has one node for each edge connecting the replaced cube node with its adjacent (replaced) nodes. Hence, a d-dimensional CCC network has $d \cdot 2^d$ nodes. The nodes in the CCC network are represented by two indices (i, j), where i denotes the replaced cube node (i.e., selects the ring), and j represents the location within the ring, with $0 \le i < 2^d$ and $0 \le j < d$. The connectivity is:
$$(i, j) \leftrightarrow (i \oplus 2^j, j), \quad 0 \le i < 2^d,\ 0 \le j < d$$
$$(i, j) \leftrightarrow (i, (j \pm 1) \bmod d), \quad 0 \le i < 2^d,\ 0 \le j < d$$
The first set of edges are the cube edges; the second set of edges are the cycle edges. A 3dimensional CCC is shown below:
[Figure: a 3-dimensional CCC network.]
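The two edge sets of the CCC can be written out as a short Python sketch. The function name `ccc_neighbors` is illustrative and not part of the notes, and the sketch assumes d >= 3 so that the two cycle neighbors are distinct.

```python
def ccc_neighbors(i, j, d):
    """Neighbors of node (i, j) in a d-dimensional CCC (assumes d >= 3)."""
    return [
        (i ^ (1 << j), j),   # cube edge: flip bit j of the ring index
        (i, (j + 1) % d),    # cycle edge to the next position in the ring
        (i, (j - 1) % d),    # cycle edge to the previous position in the ring
    ]


print(ccc_neighbors(0, 0, 3))  # [(1, 0), (0, 1), (0, 2)]
```

Every node has exactly three neighbors regardless of d, which is the constant-degree property that motivates replacing each cube node by a ring.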
The CCC can also be drawn in a way similar to the way in which we drew the butterfly network described in the next subsection. The modifications are as follows: each row is made into a ring instead of a linear array, i.e., the output is connected to the input in the same row; instead of the cross-connections between a pair of rows a single connection is used; and one rank is omitted. A CCC drawn in this way has d ranks of $2^d$ nodes each. Figure 13 shows a CCC drawn in this manner. From Figure 13 it is clear that the layout properties of the CCC are similar to those of the butterfly network. The relationship of the butterfly network to the binary cube is also clear. If each of the cycles is broken by removing one edge, for instance the edge that connects the last and first ranks in Figure 13, then the only difference between the CCC and the butterfly network is the replacement of the cross-connections in the butterfly network by the vertical lines in Figure 13. Identifying all nodes on a line in Figure 13 yields the binary cube network.
Figure 13: A CCC drawn similarly to a butterfly network.
6.3 Butterfly Networks
The butterfly network, also known as the FFT network, can be defined in terms of two indices (i, j), where i is the row index, $0 \le i < N_0 = 2^d$, and j is the column index, $0 \le j \le d = N_1 - 1$. Thus, the total number of nodes N is $(d + 1)2^d$. The connectivity of the butterfly network is:
$$(i, j) \leftrightarrow (i, j + 1), \quad 0 \le i < 2^d,\ 0 \le j < d$$
$$(i, j) \leftrightarrow (i \oplus 2^j, j + 1), \quad 0 \le i < 2^d,\ 0 \le j < d$$
Butterfly networks can also be defined with wraparound, in which case column d is identified with column 0, i.e., there are only d columns and $0 \le j < d$. The total number of nodes is $d \cdot 2^d$.
$$(i, j) \leftrightarrow (i, (j + 1) \bmod d), \quad 0 \le i < 2^d,\ 0 \le j < d$$
$$(i, j) \leftrightarrow (i \oplus 2^j, (j + 1) \bmod d), \quad 0 \le i < 2^d,\ 0 \le j < d$$
As with the multidimensional arrays, wrapped butterfly networks can be defined with skewing. A butterfly network is shown below:
[Figure: a butterfly network with 8 rows and 4 columns, nodes labeled (i, j) for 0 ≤ i < 8 and 0 ≤ j ≤ 3.]
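The butterfly connectivity, with and without wraparound, can be enumerated with a short Python sketch. The function name `butterfly_edges` and the `wraparound` flag are illustrative and not part of the notes; edges are stored as ordered pairs from column j to the following column.

```python
def butterfly_edges(d, wraparound=False):
    """Edges of a butterfly network with 2^d rows, as (from_node, to_node) pairs."""
    edges = set()
    for i in range(2**d):
        for j in range(d):
            nxt = (j + 1) % d if wraparound else j + 1
            edges.add(((i, j), (i, nxt)))             # straight edge
            edges.add(((i, j), (i ^ (1 << j), nxt)))  # cross edge
    return edges


edges = butterfly_edges(3)
nodes = {node for edge in edges for node in edge}
print(len(nodes), len(edges))  # 32 48
```

For d = 3 without wraparound this gives the (d + 1) 2^d = 32 nodes of the 8 by 4 network, each of the first d columns contributing two outgoing edges per node.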
6.3.1 Layouts of butterfly networks
Butterfly networks are closely related to binary cube networks. For the butterfly network an area optimal layout with maximum wire length $O(N/\log_2 N)$ is known. The maximum wire length is within a factor of $O(\log_2 N)$ of the lower bound, just as for the binary cube.
Figure 14: An 8 × 4 butterfly network.
A butterfly network for d = 3 and no wraparound is shown in Figure 14. For the layout of a butterfly network without wraparound we allow the nodes to be three units high to accommodate the necessary number of horizontal tracks per node. With nodes labeled (i, j) there is one connection to node (i, j + 1) for every node in column j, $0 \le j < d$. This set of wires accounts for one horizontal track per node. The wires $(i, j) \leftrightarrow (i \oplus 2^j, j + 1)$ account for the other two horizontal tracks per node. There are $2^{j+1}$ vertical tracks between columns j and j + 1. The total number of vertical tracks for wiring between columns is $2 + 4 + \ldots + 2^d = 2(2^d - 1)$. In addition, there are d + 1 vertical tracks reserved for the nodes. Thus, the total width of the layout is $2^{d+1} + d - 1$, and the total height is $3 \cdot 2^d$.
Figure 15: Area optimal layout of a butterfly network.
The layout area is $6 \cdot 2^{2d} + 3d \cdot 2^d - 3 \cdot 2^d = O(N^2/\log_2^2 N)$. This area is of optimal order, since the bisection width is $2^d \approx N/\log_2 N$.
The maximum wire length in our layout is $3 \cdot 2^{d-1} + 2^d = O(N/\log_2 N)$. The lower bound, based on either the diameter or the average wire length, is $O(N/\log_2^2 N)$. Thus, our area optimal layout is nonoptimal with respect to wire length by at most a factor of $\log_2 N$.
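The track counts for the butterfly layout without wraparound can be summed directly. The following is a minimal Python sketch; the function name `butterfly_layout` is illustrative and not part of the notes. It reproduces the width, height, and area expressions from the per-column track counts.

```python
def butterfly_layout(d):
    """Width, height, and area of the butterfly layout without wraparound."""
    wiring_tracks = sum(2**(j + 1) for j in range(d))  # 2 + 4 + ... + 2^d
    width = wiring_tracks + (d + 1)                    # plus one track per node column
    height = 3 * 2**d                                  # nodes are three units high
    return width, height, width * height


for d in range(1, 6):
    width, height, area = butterfly_layout(d)
    print(f"d={d}: width={width}, height={height}, area={area}")
```

For d = 3 the layout is 18 wide and 24 high, for an area of 432, which agrees with the closed-form area expression.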
6.4 High Radix Butterfly Networks
The butterfly network defined above is a radix-2 network. The fan-in and the fan-out of each node are two. Similar networks can be defined for a higher fan-in and fan-out. A fan-in and fan-out equal to some power of two is most common, such as radix-4, 8, and 16. The number of columns is decreased accordingly.
6.5 Beneš Networks
Beneš networks are two butterfly networks of the same size connected together to form a network of 2d + 1 columns and $2^d$ rows. The second network is the mirror image of the first network, i.e., the column index runs backwards.
6.6 Shuffle-Exchange Networks
The shuffle-exchange network is often defined in terms of two sets of edges: shuffle edges and exchange edges. The number of nodes is $N = 2^d$. The binary decomposition, $i = (i_{d-1} i_{d-2} \ldots i_0)$, is used for the definition. The shuffle edges are:
$$i = (i_{d-1} i_{d-2} \ldots i_0) \leftrightarrow sh(i) = (i_{d-2} i_{d-3} \ldots i_0 i_{d-1}), \quad 0 \le i < 2^d$$
$$i = (i_{d-1} i_{d-2} \ldots i_0) \leftrightarrow sh^{-1}(i) = (i_0 i_{d-1} i_{d-2} \ldots i_2 i_1), \quad 0 \le i < 2^d$$
The exchange edges are:
$$i \leftrightarrow i \oplus 1 \quad \text{for } 0 \le i < 2^d.$$
An 8-node shuffle-exchange network is shown below. Thick lines denote shuffle edges and thin lines denote exchange edges.
[Figure: an 8-node shuffle-exchange network with nodes labeled 000 to 111.]
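The shuffle and exchange operations are simple bit manipulations on the node index. The following is a minimal Python sketch; the function names are illustrative and not part of the notes. The shuffle is a left cyclic rotation of the d-bit address, its inverse is a right cyclic rotation, and the exchange flips the lowest-order bit.

```python
def shuffle(i, d):
    """sh(i): rotate the d-bit address of i one position to the left."""
    msb = (i >> (d - 1)) & 1
    return ((i << 1) & ((1 << d) - 1)) | msb


def unshuffle(i, d):
    """Inverse shuffle: rotate the d-bit address of i one position to the right."""
    lsb = i & 1
    return (i >> 1) | (lsb << (d - 1))


def exchange(i):
    """Exchange edge: complement the lowest-order bit."""
    return i ^ 1


# In the 8-node network, 011 shuffles to 110 and exchanges with 010.
print(format(shuffle(0b011, 3), "03b"), format(exchange(0b011), "03b"))  # 110 010
```

Nodes 000 and 111 are fixed points of the shuffle, so their shuffle edges are self-loops.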
6.7 de Bruijn Networks
The de Bruijn network is related to the shuffle-exchange network. However, unlike the shuffle-exchange, all of the edges are directed. The connectivity is:
$$i = (i_{d-1} i_{d-2} \ldots i_0) \rightarrow (i_{d-2} i_{d-3} \ldots i_0\, 0), \quad 0 \le i < 2^d$$
$$i = (i_{d-1} i_{d-2} \ldots i_0) \rightarrow (i_{d-2} i_{d-3} \ldots i_0\, 1), \quad 0 \le i < 2^d$$
A d-dimensional de Bruijn network can be obtained from a (d + 1)-dimensional shuffle-exchange graph by merging nodes that share an exchange edge. An 8-node de Bruijn network is shown below:
[Figure: an 8-node de Bruijn network with nodes labeled 000 to 111.]
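The relationship between the de Bruijn network and the shuffle-exchange graph can be checked by brute force for small d. The following Python sketch uses illustrative function names that are not part of the notes; it assumes that a merged pair of shuffle-exchange nodes is identified by dropping the lowest-order address bit, and that the shuffle edges are directed from i to sh(i).

```python
def de_bruijn_successors(i, d):
    """Successors of node i: shift the d-bit address left and append 0 or 1."""
    shifted = (i << 1) & ((1 << d) - 1)
    return [shifted, shifted | 1]


def shuffle(i, d):
    """Left cyclic rotation of the d-bit address of i."""
    return ((i << 1) & ((1 << d) - 1)) | ((i >> (d - 1)) & 1)


def merged_shuffle_edges(d):
    """Shuffle edges of a (d+1)-bit shuffle-exchange graph after merging exchange pairs."""
    return {(x >> 1, shuffle(x, d + 1) >> 1) for x in range(2**(d + 1))}


d = 3
de_bruijn = {(i, s) for i in range(2**d) for s in de_bruijn_successors(i, d)}
print(de_bruijn == merged_shuffle_edges(d))  # True
```

Node 101 has successors 010 and 011, and the two edge sets coincide, including the self-loops at 000 and 111.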
6.8 Ω Networks
An Ω network is a multistage network like the butterfly network. The connections between stages are defined as shuffle-connections. There are $2^d$ rows and d + 1 columns. The connections between columns are the same for all successive column pairs:
$$(i, j) \leftrightarrow (sh(i), j + 1), \quad 0 \le i < 2^d,\ 0 \le j < d$$
The Ω network and the butterfly network are actually identical. They are just drawn (and labeled) differently.
6.9 Mesh of Trees
An n-dimensional mesh of trees of $N = \prod_{i=1}^{n} 2^{k_i}$ leaf nodes consists of an n-dimensional grid with $2^{k_i}$ nodes along the i-th axis. The leaf nodes along each axis are connected as a complete binary tree. Leaf nodes can be given addresses in a multidimensional address space, such that a node address is $(a_{n-1}, a_{n-2}, \ldots, a_0)$, where $0 \le a_i < 2^{k_i}$. Leaf nodes are in the same tree if and only if they differ in precisely one digit position. For example, all nodes with address $(a_{n-1}, *, a_{n-3}, \ldots, a_0)$ are in the same tree.
[Figure: four 2 × 2 mesh-of-trees networks.]
[Figure: a 4 × 4 mesh-of-trees network.]
6.10 Fat-Trees
Fat-trees are trees in which the width of the channels increases towards the root. Fat-trees can be constructed as ordinary trees, such as binary trees, or in general k-ary trees. A binary fat-tree is shown in Figure 16. In this case the width of the channels doubles for every level of the tree. Thus, for this tree, each leaf node in effect has its own channel to the root. There is no contention for channels even if all nodes in one subtree of the root must exchange their data with the other subtree of the root. Compared to a binary tree, this is an improvement in the data motion capacity of the root by a factor equal to the number of leaf nodes. However, it is not inherent in the definition of a binary fat-tree that the capacity between a node and its parent must double. There must, however, be some growth in the channel width from leaves to root for a tree to be considered a fat-tree. Since most nodes in a fat-tree have multiple inputs and outputs, they require some routing capability between input and output channels. It is not necessary to guarantee that every input channel can route to any output channel in every node in order to assure that a message from any leaf node can be routed to any other leaf node. We will discuss the routing issues further later.
Figure 16: A binary fat-tree with channel widths doubling for each level towards the root.
Figure 17 shows a two-level 4-ary fat-tree. Each node has four children. The number of channels from a node to its parents doubles for every level in this 4-ary fat-tree. Thus, the total number of channels is reduced by a factor of two for each level in moving from the leaves towards the root. Though this may at first seem like no gain compared to a complete binary tree, it indeed represents an improvement since the tree is a 4-ary tree. For an ordinary 4-ary tree N/4 nodes may be competing for a channel at the root. With the fat-tree in Figure 17 at most $\sqrt{N}/2$ leaf nodes may compete for a channel at the root, an improvement by a factor of $\sqrt{N}/2$.
The internal node structure in the fat-tree in Figure 17 is a one-stage butterfly network. There are two connections between a leaf and its parents. Each leaf node consists of a 2-input 2-output butterfly network. The number of inputs and outputs in a node doubles for each level of the tree. Thus, in the level above the leaf nodes, each node consists of a butterfly stage with four inputs and four outputs. In the next level up, each node consists of an 8-input 8-output butterfly stage. The fat-tree in Figure 17 can be viewed as a slender butterfly network. The Connection Machine CM-5 network is a fat-tree network. It is based on a 4-ary tree.
As in Figure 17, the number of channels is reduced by a factor of two between a pair of levels, but only for the two levels closest to the leaves. For the higher levels of the tree, the number of channels is the same for all levels. Such fat-trees are scalable with respect to communications bandwidth. The contention for communication channels in the worst case is a constant factor, regardless of the tree size.
Figure 17: A 4-ary fat-tree of height two.
Network | Area A | Bisection width B | Diameter D | Max wire length L | Lower bound max(D, N/B) | Part. size M, wires cut
1-D array | $N$ | $1$ | $N$ | $1$ | $N/2$ | $2$
2-D array | $N$ | $N^{1/2}$ | $2N^{1/2}$ | $1$ | $2N^{1/2}$ | $4M^{1/2}$
3-D array | $4N^{4/3}$ | $N^{2/3}$ | $3N^{1/3}$ | $N^{1/3}$ | $3N^{1/3}$ | $6M^{2/3}$
k-D array | $(d-1)^2 N^{2-2/k}$ | $N^{1-1/k}$ | $kN^{1/k}$ | $N^{1-2/k}$ | $kN^{1/k}$ | $2kM^{1-1/k}$
Compl. bin. tree | $4N$ | $1$ | $\log_2 N$ | $N^{1/2}/\log_2 N$ | $N/2$ | $4$
Binary cube | $N^2$ | $N/2$ | $\log_2 N$ | $N/2$ | $\log_2 N$ | $M(\log_2 N - \log_2 M)$
Butterfly | $N^2/\log_2^2 N$ | $N/(2\log_2 N)$ | $2\log_2 N$ | $N/(2\log_2 N)$ | $2\log_2 N$ | $2M/\log_2 M$
CCC | $N^2/\log_2^2 N$ | $N/(2\log_2 N)$ | $2\log_2 N$ | $N/(2\log_2 N)$ | $2\log_2 N$ |
Shuffle exch. | $N^2/\log_2^2 N$ | $N/(2\log_2 N)$ | $2\log_2 N - 1$ | $N/(2\log_2 N)$ | $2\log_2 N$ |
Table 1: Some properties of layouts in the plane.
7 A layout comparison
Some of the network characteristics we have derived are summarized in Table 1. Of the networks considered so far, the binary cube has both a small diameter and a high bisection width, resulting in good lower bounds whether based on the diameter of the graph or the bisection width. These bounds are both important for operations such as sorting. But the area required for the binary cube is high, indeed the highest of all networks that we considered. The butterfly network, the cube-connected-cycles network and the shuffle-exchange network all have a diameter that is only twice that of the binary cube, and a bisection width that is a factor of $\log_2 N$ less than that of the binary cube. But the area required is a factor of $\log_2^2 N$ less. In VLSI technology, the cost increases very rapidly with the chip area due to reduced yield and fewer chips per wafer. What is on the chip does not have any significant impact on its cost. And the chip area is largely determined by wiring needs. Transistors only account for a few percent of the total area. The cost of printed circuit boards also increases with the area, though not as dramatically as for chips. However, the cost increases at least in proportion to the area. Area or volume are good indicators of the cost of a computing system. Hence, an interesting question to ask is which network yields the best performance for a fixed area (cost). To answer this question faithfully, it is necessary to account not only for the difference in the number of nodes that can fit in a fixed area for different networks, but also for how wire lengths and partitioning the system into parts with a fixed, limited perimeter affect the performance. For simplicity, we will assume here that all channels are equally fast, and ignore the effects of partitioning, though we will comment on it later.
Which network will be the fastest also depends upon the computations to be performed. For instance, assume that we are choosing between building a two-dimensional mesh or a binary cube and the main task is to solve Poisson's equation in two dimensions with Jacobi's method, as described before. Then, according to the Thompson grid point model and our layout results, we can construct a mesh with O(A) nodes, or a binary cube with $O(\sqrt{A})$ nodes in area A. Then, each binary cube node must emulate $\sqrt{A}$ nodes of the mesh. With respect to computation time, the binary cube is a factor of $\sqrt{A}$ slower. With respect to communication time, for a fixed area the width of each of the channels of the cube is a factor of $A^{1/4}$ narrower than that of the mesh. If the subgrid for the mesh machine is of shape $M \times M$, then the subgrid for the binary cube is of shape $MA^{1/4} \times MA^{1/4}$.
Communicating the boundary of the local mesh requires a time that is $A^{1/4} \cdot A^{1/4} = A^{1/2}$ longer in the binary cube case, assuming that only a single channel between a pair of nodes is used to emulate the mesh connection. Using all channels of the binary cube will improve the communication time by at most a factor of $\log_2 A^{1/4}$. Thus, even though the mesh can be embedded in the binary cube preserving adjacency (using a Gray code), the binary cube is slower than the mesh by a factor of $\sqrt{A}$ both with respect to computation and communication. In addition, the mesh only has short wires, while some of the binary cube wires are long, and may reduce the performance further.
Emulating a two-dimensional mesh on a binary cube and comparing the performance to that of a two-dimensional mesh is unfair to the binary cube. It cannot be expected that a computation for which a two-dimensional mesh is optimal will perform as well on a binary cube. Thus, let us instead consider sorting. Then, the mesh may require a time proportional to $\sqrt{A}$. The constant of proportionality depends upon whether the diameter or the bisection width determines the time in the case of one element per node. With P > A elements distributed evenly, and the communication time determined by the bisection width, the mesh may require a time proportional to $\frac{P}{2\sqrt{A}}$. The binary cube will require a time of $\frac{P}{2(\sqrt{A}/2)} = \frac{P}{\sqrt{A}}$, since there are $\sqrt{A}$ nodes in area A, and hence the bisection width is $\frac{1}{2}\sqrt{A}$. Thus, in this case, the communication times for the binary cube and the mesh are of the same order, with the lower bound for the mesh being slightly better. With respect to comparisons, each binary cube node must perform the work of $\sqrt{A}$ mesh nodes. Hence, for sorting, if the communication time dominates then the binary cube may be faster if the diameter determines the communication time, while if the bisection width is the determining factor, then the mesh may have a slight advantage. On the other hand, if the comparison time determines the performance, the mesh is expected to have an advantage by a factor of $\sqrt{A}$, using straight emulation techniques.
The area expense for the binary cube network is in many situations not paying off in performance for a fixed area (cost). Hence, let us consider a butterfly network instead of the binary cube. Then, in area A we can lay out a network with $\sqrt{A}\log_2 A$ nodes. The bisection width is $\frac{1}{2}\sqrt{A}$ as for the binary cube. Computations limited by the bisection width would still require time $\frac{P}{\sqrt{A}}$. Compared to the mesh, each node in the butterfly network must emulate $\frac{\sqrt{A}}{\log_2 A}$ nodes. Thus, we again find that the butterfly network, though less area consuming, does not offer any advantage over the mesh, unless the computation time is determined by the diameter. Meshes cannot be embedded in butterfly networks with constant dilation. With respect to computation, the butterfly network must emulate $O(\frac{\sqrt{A}}{\log_2 A})$ mesh nodes.
In these examples we see that if the area is the same for a mesh, a binary cube and a butterfly network, then the time for operations that are bandwidth limited is of the same order, while the mesh is faster if the desired computation indeed is that of a mesh emulation. The speed advantage of the mesh may be as much as $\sqrt{A}$ for the binary cube, and $\frac{\sqrt{A}}{\log_2 A}$ for the butterfly network. And again, the mesh only has short wires. But a mesh may be considerably slower than the other networks when the diameter of the network is the limiting factor. The diameter may be the factor determining the speed in a lightly loaded network.
The problem with the networks we have discussed so far is that the networks with a diameter of order $O(\log_2 N)$ have a relatively large area requirement, or a bisection width that is somewhat too large for the number of nodes in the network. On the other hand, the complete binary tree has desirable area requirements, but the bisection width is too small. It is desirable that it is at least as large as for the mesh.
Do there exist N node networks with diameter $O(\log_2 N)$, area O(N) and a bisection width of order $\frac{\sqrt{N}}{\log_2 N}$? The area of such a network must be at least $N/\log_2^2 N$. Since the bisection width is a factor of $\log_2 A$ less than the maximum possible, computations for which the bisection width determines the performance cannot be suboptimal by more than a factor of $O(\log_2 A)$. Since we have O(N) nodes in area A, the computational slowdown is at most a constant factor. Networks that can simulate any other network with at most a logarithmic slowdown $O(\log_2 A)$ are known as area universal networks.
8
Area Universal networks
Fat-tree networks can be made area universal [5]. We have earlier shown both a binary fat-tree, in which the channel capacity doubles for each level proceeding from the leaves to the root (Figure 16), and a 4-ary fat-tree, in which the total bandwidth is reduced by a factor of two for each level (Figure 17). The binary fat-tree in Figure 16 cannot be laid out in area $O(N)$, since its bisection width is $O(N)$. Thus, to achieve our goal of area universality we must reduce the total bandwidth between levels as we move from the leaves towards the root. We will now show that an $N$-node 4-ary fat-tree can be laid out in area $O(N)$, just as a complete binary tree can be laid out in area $O(N)$. We will also show that the bisection width is of order $\frac{\sqrt{N}}{\log_2 N}$.
In the 4-ary fat-tree each node has four children. Each node must be capable of routing a message from any of its children to any other child. Similarly, it must be possible to route a message from any child to the node's parent. Since, in general, there are several channels between a pair of nodes, it is also necessary that the load can be balanced among these channels. Thus, each internal node of the fat-tree must possess good switching capabilities, though a full crossbar switch is not necessary.
In the 4-ary fat-tree in Figure 17, each leaf node has two connections to its parent node, while a parent of leaf nodes has a total of eight connections going to its children. The parents of the leaf nodes have four connections to their parents, which in turn have 16 connections to their children. The number of connections between a node and its parent doubles for every level of the tree.
Each node in our 4-ary fat-tree consists of one stage of a butterfly network. Each leaf node consists of a 2-input, 2-output butterfly stage. In the level above the leaf nodes, each node consists of a 4-input, 4-output butterfly stage. In the next level up, each node consists of an 8-input, 8-output butterfly stage. Thus, the number of butterfly inputs and outputs per node doubles for every level of the 4-ary tree. The 4-ary fat-tree in Figure 17 can be viewed as a slender butterfly network.
A $k$-level 4-ary tree has $4^k$ leaf nodes. An $N$-leaf 4-ary tree has height $k = \log_4 N$. The number of nodes in such a tree is $(4N - 1)/3$. The length of a path from the root to a leaf is $\log_4 N$. In our 4-ary fat-tree, each leaf node contains four butterfly nodes, each node at the next level contains eight butterfly nodes, each node at the level above that contains 16 butterfly nodes, etc. Thus, there is a total of $4 \cdot 4^k$ butterfly nodes at the leaf level. At the level above, there are $4^{k-1}$ tree nodes, each of which has 8 butterfly nodes, for a total of $8 \cdot 4^{k-1}$ butterfly nodes.
The total number of butterfly nodes is
$$4 \cdot 4^k + 4 \cdot 2 \cdot 4^{k-1} + 4 \cdot 2^2 \cdot 4^{k-2} + \cdots + 4 \cdot 2^k = 4^{k+1}\left(2 - \frac{1}{2^k}\right).$$
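The node counts for the 4-ary fat-tree can be checked numerically. The sketch below uses hypothetical helper names and assumes, as in the text, that tree level $i$ (with $i = 0$ at the leaves) has $4^{k-i}$ tree nodes, each holding $4 \cdot 2^i$ butterfly nodes. The closed form is rewritten in integers as $2 \cdot 4^{k+1} - 2^{k+2}$.

```python
def tree_nodes(k):
    """Number of nodes in a complete 4-ary tree with 4**k leaves."""
    return sum(4**i for i in range(k + 1))

def butterfly_nodes_by_level(k):
    """Butterfly nodes per tree level, leaf level first."""
    return [4 * 2**i * 4**(k - i) for i in range(k + 1)]

def butterfly_nodes_closed_form(k):
    """4**(k+1) * (2 - 1/2**k), written with integer arithmetic only."""
    return 2 * 4**(k + 1) - 2**(k + 2)

if __name__ == "__main__":
    for k in range(6):
        n_leaves = 4**k
        total = sum(butterfly_nodes_by_level(k))
        # Both identities from the text hold exactly for every k tried.
        assert tree_nodes(k) == (4 * n_leaves - 1) // 3
        assert total == butterfly_nodes_closed_form(k)
        print(k, n_leaves, tree_nodes(k), total)
```

Since the per-level counts form a geometric series with ratio 1/2, the total is less than twice the number of butterfly nodes at the leaf level.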
With each tree node containing one butterfly stage, the path length from the tree root to a tree leaf node is $2k - 1$, since one link must be traversed within each internal tree node. With respect to butterfly nodes, the maximum distance from root to leaf is $2k + 1$. The diameter of the fat-tree is $4\log_4 N = 2\log_2 N$. The number of input nodes at the tree root is $2 \cdot 2^k = 2\sqrt{N}$. The bisection width is the same, i.e., $2\sqrt{N}$. A lower bound on the area is $2N$.
The fat-tree can be laid out in an H-tree like manner. Each node in the H-tree represents a $2 \times 2$ butterfly network switch. Each such switch can be laid out on a $5 \times 2$ subgrid using the Thompson grid model. At the lowest level, four such switches connect to two $2 \times 2$ switches at the level above. The two $2 \times 2$ switches are obtained through a simple permutation of the columns in an internal tree node as drawn in Figure 17. At the next level, four sets each containing two $2 \times 2$ switches connect to four $2 \times 2$ switches, then four sets each containing four $2 \times 2$ switches connect to eight $2 \times 2$ switches, etc. A fat-tree layout is shown in Figure 18. Each of the boxes in this figure represents a $2 \times 2$ butterfly switch. To arrive at this layout from Figure 17, label the nodes at each level of the fat-tree from 0 in left-to-right order, then reorder the nodes within a level by bit-reversing the node index.
Figure 18: H-tree layout of the 4-ary fat-tree.
The width of the H-tree layout for the fat-tree satisfies
$$w(N) = 2w(N/4) + O(\sqrt{N}).$$
This recurrence has a solution of the form $w(N) = \sqrt{N}\log_2 N + \sqrt{N}\,w(1)$. Now, let each leaf node in fact consist of a $\log_2 N \times \log_2 N$ array. Then $w(1) = \log_2 N$, and the area is $A = O(N\log_2^2 N)$, which is of the same order as the total number of nodes. The fat-tree had $O(N)$ leaf nodes before adding the $\log_2 N \times \log_2 N$ arrays at the leaves. Adding the arrays increases the diameter by a constant factor. Thus, our H-tree layout of the fat-tree is area optimal. What is the maximum wire length?
As for the emulation capacity, the maximum bisection width of any other network in area $O(N\log_2^2 N)$ is at most $O(\sqrt{N}\log_2 N)$. The bisection width of the fat-tree is of order $\sqrt{N}$. Thus, the slowdown caused by the emulation on the fat-tree is at most $O(\log_2 N)$. We earlier showed that a two-dimensional mesh, a binary cube, and a butterfly network could perform bisection width limited computations in time $O(\sqrt{A})$ in area $A$. The fat-tree can perform the same computations in a time of at most $O(\sqrt{A}\log_2 A)$, i.e., a slowdown by a factor of at most $O(\log_2 A)$. Compared to the complete binary tree, the fat-tree offers a speedup of $O(\frac{\sqrt{A}}{\log_2 A})$ for computations limited by the bisection width.
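The layout-width recurrence can be unrolled numerically as a sanity check. The sketch below is illustrative only: the function names are hypothetical, and the $O(\sqrt{N})$ term is assumed to be exactly $c\sqrt{N}$ with an arbitrary constant $c$. For $N = 4^k$ the recurrence then has the exact solution $\sqrt{N}\,(c\log_4 N + w(1))$, which matches the stated form up to the constant relating $\log_4 N$ and $\log_2 N$.

```python
def layout_width(k, leaf_width, c=1):
    """Unroll w(N) = 2*w(N/4) + c*sqrt(N) for N = 4**k with w(1) = leaf_width."""
    w = leaf_width
    for level in range(1, k + 1):
        # sqrt(4**level) == 2**level, so everything stays in integers.
        w = 2 * w + c * 2**level
    return w

def layout_width_closed_form(k, leaf_width, c=1):
    """sqrt(N) * (c*log4(N) + w(1)) for N = 4**k."""
    return 2**k * (c * k + leaf_width)

if __name__ == "__main__":
    for k in range(1, 11):
        n = 4**k
        log2_n = 2 * k
        # Leaves are log2(N) x log2(N) arrays, so w(1) = log2(N).
        w = layout_width(k, leaf_width=log2_n)
        assert w == layout_width_closed_form(k, leaf_width=log2_n)
        ratio = w * w / (n * log2_n**2)
        print(f"N = 4^{k:<2d} width = {w:8d}  area / (N log2^2 N) = {ratio:.2f}")
```

With $c = 1$ and $w(1) = \log_2 N$ the width is $\frac{3}{2}\sqrt{N}\log_2 N$, so the ratio of the layout area to $N\log_2^2 N$ stays at the constant 9/4 for every $k$, consistent with an area of order $N\log_2^2 N$.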
We note a few properties of the fat-tree: A $k$-level fat-tree contains a $2^{k+1}(k + 2)$-node butterfly network as a subgraph. A $k$-level fat-tree contains a $(2k + 1)$-level complete binary tree from any of its root nodes. There exists a unique path from any leaf node to any root node. Making random choices on the path to the root will select the root randomly. The path to another leaf is fully determined once the root is chosen. How should a fat-tree be partitioned with respect to minimizing the number of edges cut? The fat-tree used in the Connection Machine system CM-5 is described in [6].
9
Partitioning
One desirable characteristic of a network is modularity, in the sense that big networks can be built from smaller networks and that the smaller parts are independent of the size of the larger network. Meshes and butterfly networks have this property, while binary cubes do not: the largest network to be built has to be anticipated at design time and built into the subcubes. Binary trees of arbitrary size can be constructed from parts with a constant number of connectors, regardless of the size of the tree constructed and of the subtree on the part.
Another important issue is the channel width. With a fixed perimeter of a chip, a multichip carrier, or a board edge, the larger the number of channels cut, the narrower each channel becomes. In [9] we show that for two networks of the same area and the same number of nodes, but different channel widths for a fixed bisection width, the mesh and the binary cube can emulate a butterfly network in the same time, if the dependence upon the wire length is ignored. Factoring in the dependence upon the wire length results in a speed advantage for the mesh by a factor of $O(\log_2 A)$ or $O(\sqrt{A})$. The first factor applies for a logarithmic time model for the channel transfer time as a function of its length. The second factor applies for a linear time dependence.
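The modularity and channel-width arguments can be illustrated by counting the channels that leave a part holding $m$ nodes. The sketch below is not taken from [9]. It uses two standard counts: a $\sqrt{m} \times \sqrt{m}$ block in the interior of a two-dimensional mesh without wraparound has $4\sqrt{m}$ external channels, and an $m$-node subcube of an $n$-dimensional binary cube has $m(n - \log_2 m)$. The function names and the budget of 1,000 connectors per part are assumptions made for the example.

```python
import math

def mesh_part_channels(m):
    """External channels of a sqrt(m) x sqrt(m) block inside a larger 2D mesh."""
    side = math.isqrt(m)
    assert side * side == m, "m must be a perfect square"
    return 4 * side

def cube_part_channels(m, n):
    """External channels of an m-node subcube of an n-dimensional binary cube."""
    d = m.bit_length() - 1
    assert 1 << d == m and d <= n, "m must be a power of two with log2(m) <= n"
    return m * (n - d)

def channel_width(pins, channels):
    """Wires available per channel when `pins` connectors are shared equally."""
    return pins / channels

if __name__ == "__main__":
    pins, m = 1000, 64  # assumed connector budget and nodes per part
    print("mesh:", channel_width(pins, mesh_part_channels(m)))
    for n in (10, 14, 20):
        print(f"{n}-cube:", channel_width(pins, cube_part_channels(m, n)))
```

The mesh part needs the same number of external channels whatever the size of the final machine, whereas the cube part needs more channels, and hence narrower ones, as the cube dimension $n$ grows. This is why the largest cube has to be anticipated when the part is designed.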
References
[1] Sandeep N. Bhatt and Charles E. Leiserson. How to assemble tree machines. In Fourteenth Annual ACM Symposium on the Theory of Computing, 1982.
[2] Richard P. Brent and H.T. Kung. On the area of binary tree layouts. Information Processing Letters, 11(1):44-46, 1980.
[3] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1990.
[4] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[5] Charles E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Computers, 34:892-901, October 1985.
[6] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. The network architecture of the Connection Machine CM-5. In SPAA '92, pages 272-285. ACM Press, 1992.
[7] Carver A. Mead and Lynn Conway. Introduction to VLSI Systems. Addison-Wesley, 1980.
[8] M.S. Paterson, W.L. Ruzzo, and Larry Snyder. Bounds on minimax edge length for complete binary trees. In Proc. of the 13th Annual Symposium on the Theory of Computing, pages 293-299. ACM, 1981.
[9] Abhiram Ranade and S. Lennart Johnsson. The communication efficiency of meshes, Boolean cubes, and cube connected cycles for wafer scale integration. In 1987 International Conf. on Parallel Processing, pages 479-482. IEEE Computer Society, 1987.
[10] W.L. Ruzzo and L. Snyder. Minimum edge length planar embeddings of trees. In VLSI Systems and Computations, pages 119-123. Computer Science Press, 1981.
[11] C.D. Thompson. A complexity theory for VLSI. Technical report, Dept. of Computer Science, Carnegie-Mellon Univ., 1980.
[12] Jeffrey D. Ullman. Computational Aspects of VLSI. Computer Science Press, 1984.
[13] Andrew C. Yao. The entropic limitations of VLSI computations. In The Thirteenth Annual ACM Symposium on the Theory of Computing, pages 308-311. ACM, 1981.