# isorank - IsoRank CMSC 858L Singh Xu Berger RECOMB 2007...

IsoRank. CMSC 858L. Singh, Xu, Berger. RECOMB 2007.

## Local alignment

1. Which nodes are dissimilar [low sim(u,v)] but have similar neighbors / neighborhoods? (e.g. Bandyopadhyay et al.) These are functional orthologs: proteins that play the same role but may look very different.
2. Which edges are real and important, e.g. form a conserved pathway in the cell?

## Global alignment

Singh et al., 2007 propose a maximum common subgraph formulation: find the largest graph H that is isomorphic to subgraphs of two given graphs G1 and G2.

## Maximum Common Subgraph

Input: weighted graphs G1 and G2 with weights between 0 and 1.

Output:
• Maximum common subgraph: the largest subgraph B that is isomorphic to a subgraph of G1 and to a subgraph of G2.
• A mapping of nodes between G1 and G2 such that each node is mapped to ≤ 1 other node.

Intuition: the mapping i↔j is good if the neighbors of i can be mapped to the neighbors of j.

Define $R_{ij}$ as the "quality" of mapping i↔j:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$

The sum runs over all pairings between the neighbors of i and the neighbors of j. $R_{uv}$ has 1 unit to give, and it spreads it evenly over the $|N(u)|\,|N(v)|$ neighboring pairs of (u, v).

Example: (Figure from Singh, Xu, Berger, 2007)

## The Weighted Cases

Unweighted case:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$

Weighted case:

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{w(i,u)\,w(j,v)}{W(u)\,W(v)} R_{uv}$$

where $W(u) = \sum_{x \in N(u)} w(x,u)$ is the "weighted degree" of u.

## Matrix Form

There are many equations, one for each pair (i, j):

$$R_{ij} := \sum_{u \in N(i)} \sum_{v \in N(j)} \frac{1}{|N(u)|\,|N(v)|} R_{uv}$$

We want to find the $R_{ij}$ values. Gather the equations into matrix form, $R = AR$, where

$$A[i,j][u,v] = \frac{1}{|N(u)|\,|N(v)|} \quad \text{if } (i,u) \in G_1 \text{ and } (j,v) \in G_2,$$

and 0 otherwise. Rows of A are indexed by pairs i↔j and columns by pairs u↔v, so A is an $n_1 n_2 \times n_1 n_2$ matrix: about $10^8 \times 10^8$ for the yeast-fly alignment, but sparse.

## Finding R

We want a vector R such that $R = AR$. That is, R is an eigenvector of A (with eigenvalue 1).
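The fixed point $R = AR$ for the unweighted case can be approximated by power iteration. The following is a minimal sketch, not the authors' implementation: the function name `isorank_scores`, the dict-of-sets graph representation, and the two toy graphs are assumptions made for illustration. It also assumes every node has at least one neighbor, and that each graph contains an odd cycle so that the plain iteration settles instead of oscillating.

```python
from itertools import product

def isorank_scores(adj1, adj2, iterations=200, tol=1e-12):
    """Power iteration for R_ij = sum_{u in N(i)} sum_{v in N(j)} R_uv / (|N(u)| |N(v)|).

    adj1, adj2: dict mapping each node to the set of its neighbors
    (undirected, unweighted, no isolated nodes; an assumed representation).
    Returns {(i, j): score} with the scores normalized to sum to 1.
    """
    pairs = list(product(adj1, adj2))
    r = {p: 1.0 / len(pairs) for p in pairs}  # uniform starting vector
    for _ in range(iterations):
        new = {}
        for i, j in pairs:
            # Each neighboring pair (u, v) passes on an equal share of its score.
            new[(i, j)] = sum(
                r[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                for u in adj1[i]
                for v in adj2[j]
            )
        total = sum(new.values())
        new = {p: s / total for p, s in new.items()}
        delta = max(abs(new[p] - r[p]) for p in pairs)
        r = new
        if delta < tol:
            break
    return r

# Toy input: two copies of a triangle with one pendant node (hypothetical data).
g1 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
g2 = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
scores = isorank_scores(g1, g2)
```

With no sequence term, substituting $R_{uv} \propto |N(u)|\,|N(v)|$ into the defining equation shows that degree products are a fixed point, so on this toy input the scores approach $|N(i)|\,|N(j)|/64$. Network topology alone therefore mostly rewards high-degree pairs, which is one way to see why the sequence-similarity term matters.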
## A Random Walk View

$R = AR$. Think of A as the adjacency matrix of a graph G:

V = {ij with i ∈ G1 and j ∈ G2}
E = {(ij, uv) : (i,u) ∈ G1 and (j,v) ∈ G2}

Then the vector R is a stationary distribution for a random walk on G.

## Accounting For Sequence Similarity

$B_{ij}$ = sequence similarity between i and j.

Normalize: $E = B / |B|$

New problem: weight neighbors and similarity with a parameter α:

$$R = \alpha A R + (1 - \alpha) E$$

When α = 1, only the network is used; when α = 0, only sequence information is used.

Convert this to the format $R' = A' R'$:

$$\begin{bmatrix} R \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha A & (1 - \alpha) E \\ 0 \cdots 0 & 1 \end{bmatrix} \begin{bmatrix} R \\ 1 \end{bmatrix}$$

## Finding the Mapping, Given R

Method 1: maximum matching. Build a bipartite graph with an edge of weight $R_{ij}$ between each i and j, and take a maximum-weight matching.

Method 2: greedy.

F = ∅
Repeat:
  Output the highest-weight pair (p,q) such that p,q ∉ F
  F = {p,q} ∪ F

## Fly vs. Yeast

Networks had > 25,000 edges each. The complete alignment had 1420 edges, split into many components.

(Figure: largest component (35 edges) of the fly-yeast alignment. Figures from Singh, Xu, Berger, 2007)

Including even a tiny bit of sequence information improves the performance greatly. The test takes a 200-node subgraph of yeast and several randomized versions of it, then maps the random versions to the real one.

(Figure: average fraction of nodes mapped to themselves. Figure from Singh, Xu, Berger, 2007)

Choosing α: chose the α (= 0.6) that matched the most Inparanoid database entries.

## Vs. PathBLAST

• Of IsoRank's 701 aligned pairs, 83% were seen in at least 1 local alignment of PathBLAST (PB).
• PB aligns the same protein to many different proteins: if aligned, a yeast protein is aligned to an average of 5.38 fly proteins.
• E.g. PathBLAST maps SNF1 to 71 different fly proteins.

## Summary

• Global alignment guarantees a consistent mapping of nodes.
• Values ($R_{ij}$) for each pair of nodes model the goodness of that mapping. (Can these $R_{ij}$ values be used for something else?)
• Via an eigenvector, seek "equilibrium" values for the $R_{ij}$.
• Then select a high-weight, consistent subset of those pairs to form the mapping. (Is there a better algorithm than the greedy?)
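The sequence-weighted update $R = \alpha A R + (1 - \alpha) E$ and the greedy selection (Method 2) described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the published implementation: the names `isorank_blend` and `greedy_mapping`, the dict-based inputs, and the toy data are hypothetical; $|B|$ is taken to be the sum of all entries of B; and node labels are assumed to be distinct across the two graphs.

```python
def isorank_blend(adj1, adj2, seq_sim, alpha=0.6, iterations=200):
    """Iterate R <- alpha * A R + (1 - alpha) * E on the pair graph.

    adj1, adj2: dict node -> set of neighbors (no isolated nodes; assumed format).
    seq_sim: dict {(i, j): nonnegative similarity B_ij}; missing pairs count as 0.
    E is B scaled to sum to 1 (the normalization assumed here for |B|).
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    total_b = sum(seq_sim.get(p, 0.0) for p in pairs)
    e = {p: seq_sim.get(p, 0.0) / total_b for p in pairs}
    r = dict(e)  # start from the sequence-only solution
    for _ in range(iterations):
        # The comprehension reads the old r; r is rebound only once it finishes.
        r = {
            (i, j): alpha * sum(
                r[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                for u in adj1[i]
                for v in adj2[j]
            ) + (1 - alpha) * e[(i, j)]
            for i, j in pairs
        }
    return r

def greedy_mapping(scores):
    """Method 2: repeatedly take the highest-scoring pair whose endpoints are both unused.

    Sorting once is equivalent to repeated arg-max because scores never change.
    """
    used = set()  # the set F from the pseudocode
    mapping = []
    for (p, q), _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if p not in used and q not in used:
            mapping.append((p, q))
            used.update((p, q))
    return mapping

# Hypothetical toy data: two labeled copies of the same small network.
g1 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
g2 = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
sim = {("a", "A"): 1.0, ("b", "B"): 1.0, ("c", "C"): 1.0, ("d", "D"): 1.0}
blended = isorank_blend(g1, g2, sim, alpha=0.6)
alignment = greedy_mapping(blended)
```

On the question of whether greedy is optimal: for the hand-made scores `{("a","x"): 0.5, ("a","y"): 0.4, ("b","x"): 0.3, ("b","y"): 0.1}`, the greedy pass picks (a, x) and then (b, y) for a total of 0.6, while the matching (a, y), (b, x) totals 0.7. Greedy is simple and fast, but Method 1 can find a higher-weight mapping.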
## Graemlin: General and robust alignment of multiple large interaction networks

Flannick, Novak, Srinivasan, McAdams, Batzoglou, Genome Res. 2006.

## The 4 Big Ideas of Graemlin

1. Node scores via the "likelihood" of a common evolutionary history
2. Edge scores based on "edge-scoring matrices"
3. Seeded alignment based on good matches between a small number of nodes (and greedily extended)
4. Progressive alignment to align multiple networks

## Graemlin: Aligning Multiple Networks

(Figure: panels "Multiple Sequence Alignment" and "Multiple Network Alignment", labeled Species 1, Species 2, Species 3, Species 4.)

One difference: a species can have multiple nodes in each "column". Just as with MSA, items in the same column are required to be homologous.

## Scoring Alignments: "Column" Scores

Estimate the evolutionary history:
• protein duplication
• protein divergence
• protein creation (insertion)
• protein loss (deletion)
• sequence similarity: sum of pairs of sequence distances

Score(u) = log S(M, u) / log S(R, u)

Here S(M, u) uses parameters for events taken from real data, and S(R, u) uses parameters for events taken from random data.

## Edge Scores

For every pair of proteins u, v that are in the same species but in different equivalence classes, let $w := w(u, v)$ and

$$\mathrm{Score}(w) := \log \frac{\Pr_M[w - \delta < x < w + \delta]}{\Pr_R[w - \delta < x < w + \delta]}$$

(w − δ = 0 and w + δ = lowest possible edge score if there is no edge.)

$\Pr_M[x]$: equivalence classes are assigned labels from the ESM (edge-scoring matrix).

(Figure from Flannick et al.)

## Alignments: Seeding

A d-cluster is a node u and its d − 1 closest neighbors.

• For each node, generate its d-cluster.
• For every pair of d-clusters, compute the best alignment exhaustively.
• Toss out all d-cluster alignments that score below some threshold T.
• The highest-scoring pairs in the remaining d-cluster alignments become seeds around which an alignment is grown.
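The first seeding step, generating a d-cluster for a node, can be sketched as below. This is only an illustration: it assumes "closest" means fewest hops in an unweighted graph and breaks ties by sorted node label, neither of which is stated in the slides, and the name `d_cluster` and the dict-of-sets input are hypothetical.

```python
from collections import deque

def d_cluster(adj, u, d):
    """Return u plus its d-1 closest nodes by hop count.

    adj: dict node -> set of neighbors (assumed representation, sortable labels).
    Nodes are collected in breadth-first order, so nearer nodes come first;
    ties at the same distance are broken by sorted label (an assumption).
    Returns fewer than d nodes if the component of u is smaller than d.
    """
    cluster, seen, queue = [], {u}, deque([u])
    while queue and len(cluster) < d:
        node = queue.popleft()
        cluster.append(node)
        for nb in sorted(adj[node]):
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return cluster

# Hypothetical toy network: a path 1 - 2 - 3 - 4 - 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
cluster = d_cluster(path, 3, 3)  # node 3 and its two nearest nodes
```

Keeping d small matters because the next step aligns every pair of d-clusters exhaustively, and that cost grows quickly with d.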
## Greedy Growing

(Figure from Flannick et al., 2006)

Frontier: nodes that are neighbors of nodes in the current alignment.

Repeat: add the node or pair of nodes from the frontier that will increase the alignment score the most.

## Graemlin: Summary

Pairwise alignment that accounts for:
• how likely a "column" is to have arisen by evolution
• edge scores that specify the broad topology desired

Multiple alignment achieved via "progressive pairwise alignments".

Sped up via seeds to find good initial matches. ...
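The frontier-based growth step described above can be illustrated with a heavily simplified sketch. It grows a node set inside a single network and adds one node at a time, whereas Graemlin grows an alignment across networks and may add pairs of nodes. The function `greedy_grow`, the `gain` callback, and the toy scoring rule are all hypothetical stand-ins for the real alignment score.

```python
def greedy_grow(adj, seed, gain, max_size=None):
    """Grow a node set from `seed` by adding the best frontier node each round.

    adj: dict node -> set of neighbors (assumed representation, sortable labels).
    gain(current, candidate) -> float is a hypothetical callback standing in
    for the change in alignment score; growth stops when no frontier node
    has a positive gain, the frontier is empty, or max_size is reached.
    """
    current = set(seed)
    while max_size is None or len(current) < max_size:
        # Frontier: neighbors of the current set that are not yet in it.
        frontier = {nb for n in current for nb in adj[n]} - current
        if not frontier:
            break
        best = max(sorted(frontier), key=lambda c: gain(current, c))
        if gain(current, best) <= 0:
            break
        current.add(best)
    return current

# Hypothetical toy network: triangle a-b-c with a tail c-d-e.
net = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e"}, "e": {"d"}}

def toy_gain(current, candidate):
    # Made-up rule: reward candidates with at least two links into the set.
    return len(net[candidate] & current) - 1.5

grown = greedy_grow(net, {"a", "b"}, toy_gain)
```

With this made-up rule the set grows from {a, b} to include c, which has two links into the set, and then stops at d, which has only one.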