IsoRank
CMSC 858L
Singh, Xu, Berger. RECOMB 2007.

Local alignment:
1. Which nodes are dissimilar [low sim(u,v)] but have similar
neighbors / neighborhoods? (e.g. Bandyopadhyay et al.)
Functional orthologs: proteins that play the same role,
but may look very different.
2. Which edges are real and important, e.g. form a conserved
pathway in the cell?

Global alignment:
Singh et al., 2007 propose:
Maximum common subgraph: find the largest graph
H that is isomorphic to subgraphs of two given graphs G1
and G2.

Maximum Common Subgraph
Input: weighted graphs G1 and G2 with weights between 0 and 1.
Output:
• Maximum common subgraph: the largest subgraph B that is
isomorphic to a subgraph of both G1 and G2.
• Mapping of nodes between G1 and G2 such that each node is mapped
to ≤ 1 other node.

Maximum Common Subgraph
Intuition: mapping i↔j is good if the neighbors of i can be mapped
to the neighbors of j.

Define Rij as the “quality” of mapping i↔j:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$

The sum runs over all pairings between
the neighbors of i and the neighbors of j.
Ruv has 1 unit to give, and it
spreads it evenly over its
|N(u)||N(v)| neighbors.

Example (Figure from Singh, Xu, Berger, 2007)

The Weighted Cases
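Unweighted case:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$

Weighted case:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{w(i,u)\,w(j,v)}{W(u)\,W(v)} R_{uv}$$

where $W(u) = \sum_{x \in N(u)} w(x,u)$ ← “weighted degree”

(Figure: edge (i, u) carries weight w(i, u); edge (j, v) carries weight w(j, v).)

Matrix Form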
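Before the update rule is gathered into a matrix, it can help to see it applied pair by pair. The following is a minimal Python sketch, not the authors' implementation: the names `weighted_degree` and `update_scores`, the dict-of-dicts graph encoding, and the two toy path networks are assumptions made for illustration. Setting every weight to 1 recovers the unweighted case.

```python
from itertools import product


def weighted_degree(graph, u):
    """W(u): sum of the weights of the edges incident to u."""
    return sum(graph[u].values())


def update_scores(g1, g2, R):
    """Apply the weighted update once to every pair (i, j).

    g1, g2: {node: {neighbor: weight}} (hypothetical encoding).
    R: {(i, j): score} holding the current values.
    """
    new_R = {}
    for i, j in product(g1, g2):
        total = 0.0
        for u, w_iu in g1[i].items():
            for v, w_jv in g2[j].items():
                share = (w_iu * w_jv) / (
                    weighted_degree(g1, u) * weighted_degree(g2, v)
                )
                total += share * R[(u, v)]
        new_R[(i, j)] = total
    return new_R


# Toy networks (assumed): paths a-b-c and x-y-z, all weights 1.
g1 = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
g2 = {"x": {"y": 1.0}, "y": {"x": 1.0, "z": 1.0}, "z": {"y": 1.0}}

# Start from a uniform score vector and perform one sweep.
R0 = {pair: 1.0 / 9 for pair in product(g1, g2)}
R1 = update_scores(g1, g2, R0)
```

On these toy paths one sweep keeps the total score at 1 and gives the pair (b, y), whose neighbors all pair up, the largest value (4/9).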
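Many equations:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$

Want to find the Rij values. Gather into matrix form:

$$R = AR$$

where

$$A[i,j][u,v] = \begin{cases} \dfrac{1}{|N(u)|\,|N(v)|} & \text{if } (i,u) \in G_1 \text{ and } (j,v) \in G_2 \\ 0 & \text{otherwise} \end{cases}$$

A is an n1n2 × n1n2 matrix: rows are indexed by pairs i↔j and columns by pairs u↔v.
It is 10^8 by 10^8 for the yeast-fly alignment, but sparse.

Finding R: want an R vector such that R = AR,
i.e. R is an eigenvector of A (with eigenvalue 1).

A Random Walk View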
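Because R = AR makes R an eigenvector of A with eigenvalue 1, one common way to approximate it is power iteration: start from a uniform vector and apply A repeatedly. The following is a hedged Python sketch; `build_A`, `power_iteration`, the sparse dict layout, and the toy graphs are illustrative assumptions, and the graphs are chosen to contain triangles so that the iteration settles rather than oscillates.

```python
from itertools import product


def build_A(g1, g2):
    """Sparse A as {(i, j): {(u, v): 1 / (|N(u)| |N(v)|)}}.

    g1, g2: {node: set of neighbors}; entries that would be 0 are omitted.
    """
    A = {}
    for i, j in product(g1, g2):
        A[(i, j)] = {
            (u, v): 1.0 / (len(g1[u]) * len(g2[v]))
            for u in g1[i]
            for v in g2[j]
        }
    return A


def power_iteration(A, iterations=100):
    """Repeatedly apply R <- AR, renormalizing so the entries sum to 1."""
    pairs = list(A)
    R = {p: 1.0 / len(pairs) for p in pairs}
    for _ in range(iterations):
        R = {p: sum(a * R[q] for q, a in A[p].items()) for p in pairs}
        norm = sum(R.values())
        R = {p: r / norm for p, r in R.items()}
    return R


# Toy graphs (assumed): a triangle a-b-c with a tail node d, and a triangle x-y-z.
g1 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
g2 = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

R = power_iteration(build_A(g1, g2))
```

For this unweighted, sequence-free formulation the fixed point is proportional to deg(i)·deg(j), and the toy run reproduces that: `R[("c", "x")]` comes out at 3·2/(8·6) = 0.125.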
R = AR
Think of A as an adjacency matrix of a graph G:
V = {ij with i ∈ G1 and j ∈ G2}
E = {(ij, uv) : (i,u) ∈ G1 and (j,v) ∈ G2}
Then the vector R is a stationary distribution for a random
walk on G.
(Figure: pair ij is joined to pair uv through edge (i, u) with weight w(i, u) and edge (j, v) with weight w(j, v).)

Accounting For Sequence Similarity
Bij = sequence similarity between i and j
Normalize: E = B / |B|
New problem: weight neighbors and similarity with
parameter α:

$$R = \alpha A R + (1 - \alpha) E$$

When α = 1, only the network is used; when α = 0, only sequence
information is used.

Convert this to the format R' = A'R':

$$\begin{bmatrix} R \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha A & (1 - \alpha) E \\ 0 \cdots 0 & 1 \end{bmatrix} \begin{bmatrix} R \\ 1 \end{bmatrix}$$

Finding the Mapping, Given R
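One way to obtain R for the mapping step is to iterate the blended equation R = αAR + (1 − α)E directly rather than building the augmented matrix. The sketch below is an illustration under stated assumptions, not the published code: `isorank_scores`, the dict `B` of raw similarity scores (missing pairs treated as 0), and the toy networks are hypothetical, and every node is assumed to have at least one neighbor.

```python
from itertools import product


def isorank_scores(g1, g2, B, alpha=0.6, iterations=200):
    """Iterate R <- alpha * A R + (1 - alpha) * E on unweighted graphs.

    g1, g2: {node: set of neighbors}.
    B: {(i, j): raw sequence similarity}; absent pairs count as 0.
    """
    pairs = list(product(g1, g2))
    total = sum(B.get(p, 0.0) for p in pairs)
    E = {p: B.get(p, 0.0) / total for p in pairs}  # E = B / |B|
    R = dict(E)
    for _ in range(iterations):
        R = {
            (i, j): alpha * sum(
                R[(u, v)] / (len(g1[u]) * len(g2[v]))
                for u in g1[i]
                for v in g2[j]
            ) + (1 - alpha) * E[(i, j)]
            for (i, j) in pairs
        }
    return R


# Toy input (assumed): two 3-node paths with similarity only on a-x, b-y, c-z.
g1 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
g2 = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
B = {("a", "x"): 1.0, ("b", "y"): 1.0, ("c", "z"): 1.0}

R = isorank_scores(g1, g2, B, alpha=0.6)
```

With α = 0 the function returns E unchanged; with α = 0.6 the pair (b, y) ends up with the highest score on this toy input, since it is supported by both its own similarity and its neighbors.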
Method 1: maximum matching
Build a bipartite graph with an edge of weight Rij between each i and j,
and take a maximum-weight matching.

Method 2: greedy
F = ∅
Repeat:
    Output the highest-weight pair (p,q) such that p,q ∉ F
    F = {p,q} ∪ F

Fly vs. Yeast
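Method 2 (greedy) from the mapping step can be sketched in a few lines of Python. The function name `greedy_mapping` and the toy scores are made up for illustration, and the sketch assumes node names are distinct across the two networks so that a single set can play the role of F.

```python
def greedy_mapping(R):
    """Pick pairs in decreasing score order, skipping nodes already used.

    R: {(p, q): score}. Returns a list of (p, q, score) tuples.
    """
    used = set()  # F in the pseudocode
    mapping = []
    ranked = sorted(R.items(), key=lambda item: item[1], reverse=True)
    for (p, q), score in ranked:
        if p not in used and q not in used:
            mapping.append((p, q, score))
            used.update((p, q))
    return mapping


# Hypothetical scores for two nodes per network.
scores = {("a", "x"): 0.5, ("a", "y"): 0.4, ("b", "x"): 0.3, ("b", "y"): 0.1}
mapping = greedy_mapping(scores)
```

Here greedy picks (a, x) and (b, y) for a total of 0.6, while the matching (a, y), (b, x) totals 0.7, so the two methods can give different mappings.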
Networks had > 25,000
edges each.
Complete alignment
had 1420 edges, split
into many components.
(Figure: largest component (35 edges) of the fly-yeast alignment.)
(Figures from Singh, Xu, Berger, 2007)

Including even a tiny bit of
sequence information improves
the performance greatly.
Experiment: take a 200-node subgraph of yeast
and several randomized versions of it.
Map the random versions to the real one.
(Figure from Singh, Xu, Berger, 2007; y-axis: average fraction of nodes mapped to themselves.)

Choosing α:
Chose the α (= 0.6) that matched the
most Inparanoid database entries.

Vs. PathBLAST:
• Of IsoRank’s 701 aligned pairs, 83% were seen in at least
1 local alignment of PathBLAST (PB).
• PB aligns the same protein to many different proteins: if
aligned, a yeast protein is aligned to an average of 5.38
fly proteins.
• E.g. PathBLAST maps SNF1 to 71 different fly proteins.

Summary
• Global alignment guarantees consistent mapping of
nodes.
• Values (Rij) for each pair of nodes model the
goodness of that mapping. (Can these Rij values be used
for something else?)
• Via eigenvector, seek “equilibrium” values for the Rij.
• Then select a high-weight, consistent subset of those
pairs to form the mapping. (Is there a better algorithm
than the greedy?)

Graemlin: General and robust
alignment of multiple large
interaction networks
Flannick, Novak, Srinivasan, McAdams,
Batzoglou, Genome Res. 2006.

The 4 Big Ideas of Graemlin
1. Node scores via “likelihood” of common
evolutionary history
2. Edge scores based on “edge-scoring matrices”
3. Seeded alignment based on good matches between a
small number of nodes (and greedily extended)
4. Progressive alignment to align multiple networks

Graemlin: Aligning Multiple Networks
Multiple Sequence Alignment: / Multiple Network Alignment:
(Figure: columns for Species 1, Species 2, Species 3, Species 4.)
One difference: a species can
have multiple nodes in each
“column”.
Just as with MSA, require items in
the same column to be homologous.

Scoring Alignments: “Column” Scores
Estimate evolutionary history:
• protein duplication
• protein divergence
• protein creation (insertion)
• protein loss (deletion)
• sequence similarity: sum of pairs of
sequence distances

Score(u) = log S(M, u) / log S(R, u)
M: parameters for events taken from real data
R: parameters for events taken from random data

Edge Scores
For every pair of proteins u, v that are in the same
species but in different equivalence classes, let w := w(u, v) and define:

$$\mathrm{Score}(w) := \log \frac{\Pr_M[w - \delta < x < w + \delta]}{\Pr_R[w - \delta < x < w + \delta]}$$

(w − δ = 0 and w + δ = lowest possible edge score
if there is no edge.)
PrM[x]:
Equivalence classes are
assigned labels from the
ESM (edge-scoring matrix).
(Figure from Flannick et al.)

Alignments: Seeding
d-cluster: a node u and its d − 1
closest neighbors.
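As a rough illustration of the d-cluster definition, the Python sketch below collects a node and its d − 1 closest neighbors. The slides do not say how closeness is measured, so this sketch assumes hop count in an unweighted graph (breadth-first search) with ties broken alphabetically; the function name `d_cluster` and the toy graph are likewise hypothetical.

```python
from collections import deque


def d_cluster(graph, u, d):
    """Return u plus up to d - 1 nearest nodes by hop count (BFS order).

    graph: {node: set of neighbors}. Fewer than d nodes are returned
    when the connected component of u is smaller than d.
    """
    cluster = [u]
    seen = {u}
    queue = deque([u])
    while queue and len(cluster) < d:
        node = queue.popleft()
        for nbr in sorted(graph[node]):  # deterministic tie-breaking (assumption)
            if nbr not in seen:
                seen.add(nbr)
                cluster.append(nbr)
                queue.append(nbr)
                if len(cluster) == d:
                    break
    return cluster


# Toy graph (assumed): c - a - b - d - e
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a"},
    "d": {"b", "e"},
    "e": {"d"},
}
cluster_a = d_cluster(graph, "a", 4)
```

In this toy graph the 4-cluster of a is a, b, c, d: the two direct neighbors first, then the nearest node two hops away.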
For each node, generate its d-cluster.
For every pair of d-clusters,
compute the best alignment
exhaustively.
Toss out all d-cluster
alignments that score below
some threshold T.
The highest-scoring pairs in
the remaining d-cluster
alignments become seeds
around which Graemlin
attempts to grow an alignment.

Greedy Growing
(Figure from Flannick et al., 2006)
Frontier: nodes that are
neighbors of nodes in the
current alignment.
Repeat: add the node or
pair of nodes from the
frontier to the alignment
that will increase the
score the most.

Graemlin: Summary
• Pairwise alignment that accounts for
  • how likely a “column” is to have arisen by evolution
  • edge scores that specify the broad topology desired
• Multiple alignment achieved via “progressive pairwise alignments”
• Sped up via seeds to find good initial matches.
...
This note was uploaded on 01/13/2012 for the course CMSC 423 taught by Professor Staff during the Fall '07 term at Maryland.