Parallel Min on a Hypercube
CDA 5110/COP 4520
The problem
Given one million integers, find the smallest. Do this in parallel on a hypercube with 8 processors.
The specifications
Assume each processor has 125,000 integers. Clock: 2 GHz. Integer compare an
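A minimal sketch of the reduction, simulated sequentially in Python rather than run on real hardware. It assumes the usual hypercube pattern: each processor first scans its own block (124,999 comparisons for 125,000 integers), then partial minima are combined across one dimension per step, so 8 processors need log2(8) = 3 exchange steps. The function name `hypercube_min` and the sample data are illustrative, not taken from the slides.

```python
import random


def hypercube_min(local_data):
    """Simulate a min-reduction on a hypercube.

    local_data[p] is the block of integers held by processor p;
    the number of processors must be a power of two.
    """
    p = len(local_data)
    dims = p.bit_length() - 1
    if p < 1 or (1 << dims) != p:
        raise ValueError("processor count must be a power of two")

    # Phase 1: every processor finds the minimum of its own block.
    partial = [min(block) for block in local_data]

    # Phase 2: one exchange per dimension. In step d, node | 2^d sends
    # its partial minimum to node, which keeps the smaller value.
    for d in range(dims):
        bit = 1 << d
        for node in range(0, p, 2 * bit):
            partner = node | bit
            partial[node] = min(partial[node], partial[partner])

    # After log2(p) steps, processor 0 holds the global minimum.
    return partial[0]


if __name__ == "__main__":
    # Small illustrative blocks; the slides assume 125,000 integers each.
    blocks = [[random.randint(0, 10**6) for _ in range(1000)] for _ in range(8)]
    print(hypercube_min(blocks) == min(min(b) for b in blocks))
```

Only the first phase depends on the block size; the second phase always takes 3 steps on 8 processors.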
Batcher's Networks for Merging / Sorting
Comparator (compare-exchange operation), the building block: inputs a and b; output L = min(a, b), output H = max(a, b).
A sequence a_0, a_1, a_2, ..., a_{m-1}, a_m, ..., a_{n-1} of length n in which the first part a_0, a_1, a_2, ..., a_{m-1} is monotonically increasing and
Mesh Implementations
Cannon's Algorithm
Uses a mesh of processors with wraparound connections (a torus) to shift the A elements (or submatrices) left and the B elements (or submatrices) up.
1. Initially, processor P_{i,j} has elements a_{i,j} and b_{i,j} (0 <= i < n
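A minimal sketch of the shifting scheme, simulated in plain Python with one element per processor. Because the step list above is cut off, the initial alignment used here (row i of A shifted left by i places, column j of B shifted up by j places) follows the standard textbook formulation of Cannon's algorithm and is an assumption about what the slide goes on to say. The name `cannon_multiply` is illustrative.

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices with Cannon's shifting pattern.

    a[i][j] and b[i][j] model the values currently held by processor P(i, j).
    """
    n = len(A)

    # Initial alignment: shift row i of A left by i, column j of B up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]

    for _ in range(n):
        # Every processor multiplies its two local elements and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        # Shift A one place left and B one place up, with wraparound.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]

    return C


if __name__ == "__main__":
    print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After t shifts, processor P(i, j) holds A[i][k] and B[k][j] for k = (i + j + t) mod n, so n rounds cover every term of the inner product.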
Convergence of Gauss-Seidel iterative method for 2 simultaneous equations
Proceeding from the kth to the (k+1)th iteration (1) (2) Substituting the value of from eqn. (2) into eqn. (1)
(3)
Likewise, from (k+1)th iteration to (k+2)th iteration (4) Subtract
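A small numerical sketch of the iteration discussed above, assuming the two equations are written as a11*x1 + a12*x2 = b1 and a21*x1 + a22*x2 = b2 (this notation, the function name and the sample coefficients are assumptions, since the equations on the slide did not survive extraction).

```python
def gauss_seidel_2x2(a11, a12, a21, a22, b1, b2, x1=0.0, x2=0.0, iterations=25):
    """Run Gauss-Seidel sweeps on a 2 x 2 system and return all iterates."""
    history = []
    for _ in range(iterations):
        # Update x1 from the first equation using the latest x2.
        x1 = (b1 - a12 * x2) / a11
        # Update x2 from the second equation using the x1 just computed.
        x2 = (b2 - a21 * x1) / a22
        history.append((x1, x2))
    return history


if __name__ == "__main__":
    # Hypothetical system: 4*x1 + x2 = 9, x1 + 3*x2 = 5 (solution x1 = 2, x2 = 1).
    for step, (x1, x2) in enumerate(gauss_seidel_2x2(4, 1, 1, 3, 9, 5)[:5], start=1):
        print(step, x1, x2)
```

For a system in this form, the error in each unknown is multiplied by (a12*a21)/(a11*a22) per sweep, so the iteration converges when that ratio has magnitude below 1; in the sample system the ratio is 1/12.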
Architecture
Single processor:
single instruction stream, single data stream (von Neumann model)
Multiple processors:
Flynn's taxonomy
Flynn's Taxonomy
Many instruction streams: MISD, MIMD
Interconnection, shared memory, master-slave, large grain, fine grain?
Iterative, Row-oriented Algorithm
Series of inner product (dot product) operations
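A minimal sketch of the row-oriented formulation: each element of C is the dot product of one row of A with one column of B. The function names are illustrative.

```python
def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def matmul_row_oriented(A, B):
    """Compute C = A * B one row of C at a time."""
    columns_of_B = list(zip(*B))  # transpose once so columns are easy to reach
    return [[dot(row, col) for col in columns_of_B] for row in A]


if __name__ == "__main__":
    print(matmul_row_oriented([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Note that producing a single row of C touches every column of B, that is, every element of B.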
Performance as n Increases
Reason: Matrix B Gets Too Big for Cache
Computing a row of C requires accessing every element of B
Block Matrix Multiplication
Replace scala
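A hedged sketch of block (tiled) matrix multiplication, assuming the usual scheme in which the matrices are partitioned into square blocks and each block product is accumulated into the matching block of C. The block size of 2 and the function name are illustrative; real code would choose a block size so that the working blocks fit in cache.

```python
def matmul_blocked(A, B, block=2):
    """Compute C = A * B by iterating over block x block tiles."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i0 in range(0, n, block):          # tile row of C
        for j0 in range(0, p, block):      # tile column of C
            for k0 in range(0, m, block):  # tiles along the shared dimension
                # Multiply one tile of A by one tile of B and accumulate into C.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += a_ik * B[k][j]
    return C


if __name__ == "__main__":
    print(matmul_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
```

The arithmetic is the same as in the unblocked version; only the order of the operations changes, so that each tile of B is reused while it is still in cache.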
Network Topologies
Linear Array, Star, Tree, Mesh, Hypercube, Completely Connected
Linear Array
Linear Ring (wrap-around)
Tree
Star
Completely Connected
Mesh
2D Torus
Hypercube
Recursive construction
a k-dimensional hypercube is constructed from tw
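A minimal sketch of the recursive construction, assuming the standard definition: take two copies of a (k-1)-dimensional hypercube and link each node in one copy to the corresponding node in the other. Node labels are integers whose binary digits give the coordinates; the function name is illustrative.

```python
def hypercube_edges(k):
    """Return the edge list of a k-dimensional hypercube on nodes 0 .. 2^k - 1."""
    if k == 0:
        return []  # a single node, no links
    sub = hypercube_edges(k - 1)
    half = 1 << (k - 1)
    # Second copy of the smaller cube: same edges, labels offset by 2^(k-1).
    upper = [(u + half, v + half) for u, v in sub]
    # Links joining corresponding nodes of the two copies.
    links = [(u, u + half) for u in range(half)]
    return sub + upper + links


if __name__ == "__main__":
    edges = hypercube_edges(3)
    print(len(edges))  # a 3-dimensional hypercube has 3 * 2^2 = 12 links
```

With this labelling, two nodes are neighbours exactly when their labels differ in one bit, and each of the 2^k nodes has k neighbours.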
Batcher's Networks for Merging / Sorting
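[Figure: sorted inputs a pass through two comparators (L, H) in the first stage and one comparator (L, H) in the second stage, producing sorted outputs c.]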
2x2 O-E merging Network
# of comparators = 3; depth, delay or time = 2 (length of the lon
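A small sketch of a 2x2 odd-even merging network built from three compare-exchange operations in two stages, matching the counts above. The wiring shown (first elements compared together, second elements compared together, then the two middle values compared) is the standard Batcher arrangement and is assumed here, since the figure itself is not recoverable. Function names are illustrative.

```python
def compare_exchange(a, b):
    """Comparator: return (L, H) = (min(a, b), max(a, b))."""
    return (a, b) if a <= b else (b, a)


def odd_even_merge_2x2(a, b):
    """Merge two sorted pairs a and b into one sorted list of four values."""
    a0, a1 = a
    b0, b1 = b
    # Stage 1: two comparators working in parallel.
    c0, d = compare_exchange(a0, b0)   # c0 is the overall minimum
    e, c3 = compare_exchange(a1, b1)   # c3 is the overall maximum
    # Stage 2: one comparator orders the two middle values.
    c1, c2 = compare_exchange(d, e)
    return [c0, c1, c2, c3]


if __name__ == "__main__":
    print(odd_even_merge_2x2((1, 4), (2, 3)))
```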
Prefix Sums (or Partial Sums)
Input: a sequence of n integers a_0, a_1, ..., a_{n-1}
Output: a sequence of sums s_0, s_1, ..., s_{n-1}, where
s_i = a_0 + a_1 + ... + a_i = \sum_{j=0}^{i} a_j
Example:
Input: 2, 8, 4, 7, 1, 1
Output: 2, 10, 14, 21, 22, 23
Sequential Algorithm
Complexity: O(n) for n i
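A minimal sketch of the sequential algorithm: a single pass that keeps a running total, so it performs one addition per input element. The function name is illustrative.

```python
def prefix_sums(a):
    """Return [s_0, s_1, ..., s_{n-1}] where s_i = a_0 + ... + a_i."""
    sums = []
    running = 0
    for x in a:
        running += x          # s_i = s_{i-1} + a_i
        sums.append(running)
    return sums


if __name__ == "__main__":
    print(prefix_sums([2, 8, 4, 7, 1, 1]))  # [2, 10, 14, 21, 22, 23]
```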
Parallel Processing
Several processing elements working to solve a single problem
Primary consideration: elapsed time
NOT: throughput, sharing resources, etc.
Downside: complexity (system, algorithm design)
Given a sequence (array) A of n elements (unso