Aydin_CSC11_Indexing


Parallel Sparse Matrix Indexing and Assignment
Aydın Buluç, Lawrence Berkeley National Laboratory
John R. Gilbert, University of California, Santa Barbara

Sparse adjacency matrix and graph
[Figure: a directed graph on vertices 1..7 and the matrix-vector product A'*x on its adjacency matrix]
•  Every graph is a sparse matrix and vice versa
•  Adjacency matrix: sparse array w/ nonzeros for graph edges
•  Storage-efficient implementation from sparse data structures

Linear-algebraic primitives for graphs
•  Sparse matrix-matrix multiplication (SpGEMM)
•  Element-wise operations
•  Sparse matrix-sparse vector multiplication
•  Sparse matrix indexing
All defined on matrices over semirings, e.g. (×, +), (and, or), (+, min).

Indexed reference and assignment
Matlab internal names: subsref, subsasgn. For the sparse special case, we use SpRef and SpAsgn:
  SpRef:  B = A(I,J)
  SpAsgn: B(I,J) = A
A, B: sparse matrices; I, J: vectors of indices.
SpRef using mixed-mode sparse matrix-matrix multiplication (SpGEMM). Ex: B = A([2,4], [1,2,3])

Why are SpRef/SpAsgn important?
Subscripting and colon notation ⇒ batched and vectorized operations ⇒ high performance and parallelism.
[Figure: spy plots of three orderings of A = rmat(15)]
•  A = rmat(15):           − load balance hard, ± some locality
•  A(r,r), r random:       + load balance easy, − no locality
•  A(r,r), r = symrcm(A):  − load balance hard, + good locality

More applications
Prune isolated vertices, in a plug-and-play way (Graph 500):
[Figure: spy plots of A before and after pruning; nz = 36 in both]
  sa = sum(A);                  % A is symmetric, for an undirected graph
  nonisov = find(sa > 0);
  A = A(nonisov, nonisov);      % keep only connected vertices

More applications
Extracting (induced) subgraphs: P*A*P'
[Figure: a 7-vertex graph split into Area A and Area B, and the two induced subgraphs]
•  Per-area analysis on power grids
•  Subroutine for recursive algorithms on graphs

Sequential algorithms
  function B = spref(A,I,J)
  R = sparse(1:length(I), I, 1, length(I), size(A,1));
  Q = sparse(J, 1:length(J), 1, size(A,2), length(J));
  B = R*A*Q;

T_spref = flops(R·A) + flops(RA·Q) = nnz(R·A) + nnz(RA·Q) = O(nnz(A))
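The spref triple product above translates directly to any sparse-matrix library. A minimal scipy.sparse sketch (Python rather than the slides' Matlab; 0-based indices, and the variable names R, Q follow the slide):

```python
import numpy as np
import scipy.sparse as sp

def spref(A, I, J):
    # B = R*A*Q selects rows I and columns J of A:
    # R is len(I)-by-m with R[k, I[k]] = 1; Q is n-by-len(J) with Q[J[k], k] = 1.
    m, n = A.shape
    R = sp.csr_matrix((np.ones(len(I)), (np.arange(len(I)), I)), shape=(len(I), m))
    Q = sp.csr_matrix((np.ones(len(J)), (J, np.arange(len(J)))), shape=(n, len(J)))
    return (R @ A) @ Q   # evaluated left to right, matching B = (R*A)*Q

A = sp.random(6, 6, density=0.5, format="csr", random_state=0)
I, J = [1, 3], [0, 1, 2]   # 0-based analogue of the slide's A([2,4],[1,2,3])
B = spref(A, I, J)
```

The left-to-right evaluation order matters: it is the same assumption the complexity analysis later makes for B = (R*A)*Q.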
  function C = spasgn(A,I,J,B)
  [ma,na] = size(A);
  [mb,nb] = size(B);
  R = sparse(I, 1:mb, 1, ma, mb);
  Q = sparse(1:nb, J, 1, nb, na);
  S = sparse(I, I, 1, ma, ma);
  T = sparse(J, J, 1, na, na);
  C = A + R*B*Q - S*A*T;

In block form, the three terms are

  C = A + [0 0 0; 0 B 0; 0 0 0] - [0 0 0; 0 A(I,J) 0; 0 0 0]

T_spasgn = O(nnz(A))

Parallel algorithm for SpRef
1. Forming R from I in parallel, on a 3×3 processor grid (SCATTER).
•  Vector distributed only on diagonal processors, for illustration.
•  Full (2D) vector distribution: SCATTER → ALLTOALLV.
•  Forming Q^T from J is identical, followed by Q = Q^T.Transpose().

2. SpGEMM using memory-efficient Sparse SUMMA: C_ij += P_ik * A_kj.
Minimize temporaries by:
•  Splitting the local matrix and broadcasting multiple times
•  Deleting P (and A, if in-place) immediately after forming C = P*A

2D vector distribution
[Figure: blocks A_{i,j} of size (n/pr)×(n/pc) and vector pieces x_{i,j}, interleaved on the same processor grid]
•  Default distribution in Combinatorial BLAS.
•  Matrix and vector distributions are interleaved on each other.
•  Performance change is marginal (dominated by SpGEMM)
•  Scalable with increasing number of processes
•  No significant load imbalance

Complexity analysis
SpGEMM:
  T_comp = Θ( (nnz(A)/p) · lg( length(I)/p + length(J)/p + √p ) )
  T_comm = Θ( α·√p + β·nnz(A)/√p )
Matrix formation: Θ( α·lg(p) + β·(length(I) + length(J))/√p ); dominated by SpGEMM.
Bottleneck: bandwidth costs.  Speedup: Θ(√p).
Assumptions:
•  The triple product is evaluated from left to right: B = (R*A)*Q
•  Nonzeros are uniformly distributed over processors (chicken-and-egg?)
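The spasgn identity C = A + R*B*Q - S*A*T can be checked numerically. A scipy.sparse sketch (assuming distinct, 0-based indices; names R, Q, S, T follow the slide):

```python
import numpy as np
import scipy.sparse as sp

def spasgn(A, I, J, B):
    # C = A + R*B*Q - S*A*T: scatter B into rows I / columns J of A,
    # subtracting the old A(I,J) block that the assignment overwrites.
    ma, na = A.shape
    mb, nb = B.shape
    R = sp.csr_matrix((np.ones(mb), (I, np.arange(mb))), shape=(ma, mb))
    Q = sp.csr_matrix((np.ones(nb), (np.arange(nb), J)), shape=(nb, na))
    S = sp.csr_matrix((np.ones(len(I)), (I, I)), shape=(ma, ma))  # selects rows I
    T = sp.csr_matrix((np.ones(len(J)), (J, J)), shape=(na, na))  # selects cols J
    return A + R @ B @ Q - S @ A @ T

A = sp.random(6, 6, density=0.5, format="csr", random_state=0)
B = sp.random(2, 3, density=0.8, format="csr", random_state=1)
I, J = [1, 4], [0, 2, 5]   # hypothetical index vectors for illustration
C = spasgn(A, I, J, B)
```

S*A*T keeps exactly the A(I,J) block in place and zeros everything else, so adding R*B*Q and subtracting S*A*T implements the in-place assignment A(I,J) = B without densifying anything.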
Strong scaling of SpRef
[Figure: time (seconds) and speedup vs. cores, 1 to 1024; random symmetric permutation, i.e. relabeling the graph vertices]
•  RMAT scale 22; edge factor = 8; a = .6, b = c = d = .4/3
•  Franklin/NERSC; each node is a quad-core AMD Budapest

Strong scaling of SpRef
[Figure: time (seconds) and speedup vs. cores, 1 to 1024]
•  Extracts 10 random (induced) subgraphs, each with |V|/10 vertices
•  Higher span ⇒ decreased parallelism ⇒ lower speedup

Conclusions
•  Parallel algorithms for SpRef and SpAsgn
•  Systemic algorithm structure imposed by SpGEMM
•  Analysis made possible for the general case
•  Good strong scaling for 1000-way concurrency
•  Many applications in the sparse matrix and graph world
Caveat: avoid load imbalance by indexing non-monotonically, e.g.
  I = [1 3 4 6 7 8]'  vs.  I = [7 3 8 1 6 4]'
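The scaling experiments relabel the graph with a random symmetric permutation before extraction, exactly the non-monotone indexing the caveat recommends for load balance. A scipy.sparse sketch of that relabeling (sizes and density are hypothetical):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
r = rng.permutation(A.shape[0])   # random relabeling of the vertices
Ap = A[r, :][:, r]                # Ap = A(r, r): same graph, new labels
```

The permutation scatters each processor's submatrix uniformly over the nonzeros of A, which is what makes the "nonzeros uniformly distributed" assumption of the analysis hold in practice.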

This note was uploaded on 12/27/2011 for the course CMPSC 240A taught by Professor Gilbert during the Fall '09 term at UCSB.
