Color Coding
Speeding up Network Searches
858L

• Efficient Algorithms for Detecting Signaling Pathways in Protein Interaction Networks. Scott, Ideker, Karp, Sharan. RECOMB 2005.
• Color Coding: Alon et al, 1995.

Searching for High Scoring Paths
Weighted network G:
G might be an alignment graph, a PPI network, a metabolic network, etc.

p(u,v) = probability that the edge (u,v) exists
$w(u,v) = -\log p(u,v)$

P = simple path
Weight(P) = sum of the w(u,v) values along its edges
Length(P) = number of nodes in P

Goal: Low-weight, simple, length-k paths
Given: Graph G, a subset of nodes I, and a node v.
Find: The lowest-weight path P that:
(1) starts at some vertex in I
(2) ends at v
(3) is of length k and is simple (doesn't use any vertex twice)

Set I lets us specify, e.g., that the path should start at a surface receptor protein.
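The objective above can be made concrete with a small sketch. It assumes an undirected network stored as a dictionary from node pairs to edge probabilities; the names `edge_weight`, `path_weight`, `is_feasible`, and the toy network are hypothetical and only illustrate the definitions of Weight(P), Length(P), and the three conditions on P.

```python
import math

def edge_weight(p):
    """Convert an edge probability p in (0, 1] to the weight w = -log p."""
    return -math.log(p)

def path_weight(path, prob):
    """Sum of -log p(u, v) over consecutive node pairs of the path."""
    return sum(edge_weight(prob[frozenset((u, v))]) for u, v in zip(path, path[1:]))

def is_feasible(path, sources, target, k):
    """Check the three conditions: starts in I, ends at v, has k distinct nodes."""
    return (
        len(path) == k
        and len(set(path)) == k      # simple: no vertex used twice
        and path[0] in sources
        and path[-1] == target
    )

# Hypothetical toy network (assumption): undirected edges with probabilities.
prob = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.8,
    frozenset(("a", "c")): 0.5,
}
path = ["a", "b", "c"]
print(is_feasible(path, {"a"}, "c", 3))
print(path_weight(path, prob))
```

Because the weights are negative logarithms, the weight of a path equals the negative log of the product of its edge probabilities, so a low-weight path is a high-probability one.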
Is this Problem Hard?
Given: Graph G, a subset of nodes I, and a node v.
Find: The lowest-weight, simple, length-k path between I and v.

Yes. It's NP-hard. Why? Reduce Hamiltonian Cycle (HC) to it: To solve an HC instance <$G_H$>, let G = $G_H$, I = {v}, and k = n.

Without the simple condition or the length-k condition, the problem is easy.

Dynamic Programming Algorithm
W(v, S) := minimum weight of a simple path that starts at I, visits each vertex in S, ends at v, and is of length |S|. Here v ∈ S, and S is a set of ≤ k vertices.
W(v, S) := ∞ if no such path exists.

$$W(v, \{v\}) = \begin{cases} 0 & \text{if } v \in I \\ \infty & \text{if } v \notin I \end{cases}$$

$$W(v, S) = \min_{u \in S - \{v\}} \big[ W(u, S - \{v\}) + w(u, v) \big]$$

The set S − {v} on the right-hand side is smaller than S, so we can compute W(•, •) in order of increasing size of S.
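The recurrence for W(v, S) can be sketched directly as a table filled in order of increasing |S|. This is a minimal illustration, not the authors' implementation: the function names, the dictionary-of-edges representation, and the toy graph are assumptions made for the example.

```python
import math
from itertools import combinations

def undirected(edges):
    """Expand {(u, v): weight} into a lookup that contains both directions."""
    w = {}
    for (u, v), weight in edges.items():
        w[(u, v)] = weight
        w[(v, u)] = weight
    return w

def min_weight_simple_path(nodes, w, sources, target, k):
    """Exact DP: W[(v, S)] is the minimum weight of a simple path that starts
    in `sources`, visits exactly the vertex set S, and ends at v."""
    W = {}
    for v in nodes:
        W[(v, frozenset([v]))] = 0.0 if v in sources else math.inf
    for size in range(2, k + 1):                 # increasing size of S
        for subset in combinations(nodes, size):
            S = frozenset(subset)
            for v in S:
                rest = S - {v}
                W[(v, S)] = min(
                    (W[(u, rest)] + w[(u, v)] for u in rest if (u, v) in w),
                    default=math.inf,
                )
    # OPT(I, v): best over all sets S of size k that contain the target.
    return min(
        (W[(target, frozenset(s))] for s in combinations(nodes, k) if target in s),
        default=math.inf,
    )

# Hypothetical toy graph (assumption) with weights already in -log form.
nodes = ["a", "b", "c", "d"]
w = undirected({("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 5.0,
                ("c", "d"): 1.0, ("b", "d"): 4.0})
print(min_weight_simple_path(nodes, w, {"a"}, "d", 4))
print(min_weight_simple_path(nodes, w, {"a"}, "d", 3))
```

The table has one entry per (vertex, subset) pair, which is why this approach only works for very small k.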
Ok, So:
$$\mathrm{OPT}(I, v) = \min_{S : |S| = k} W(v, S)$$

Note how "simple" this algorithm is: try all possible sets of k nodes, compute their optimal order, and return the best set.

What's the running time?
Number of sets we will consider = all possible subsets of nodes of size ≤ k:
$$\sum_{i=0}^{k} \binom{n}{i} = O(n^k)$$
For each set, computing the min takes at most O(k) steps.
Therefore: Running time = $O(kn^k)$.

Color Coding
• $O(kn^k)$ is too slow for any interesting k.
• Can we do better?
• Idea: rather than keeping track of all of S, we'll keep track of less information about which nodes we've already visited.
• This will introduce a problem: we may miss the optimum path...

Color Coding
Main Step: Randomly color each node with a color from
{1,2,...,k}. Let c(u) be the color of node u.
Define: a path is “colorful” if it contains exactly 1 vertex of each color.
Note: any colorful path is simple.
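The coloring step and the "colorful" test are easy to sketch. The helper names `random_coloring` and `is_colorful` and the example coloring are hypothetical; the last check illustrates the note above, since a path that repeats a vertex has fewer than k distinct colors among its k positions.

```python
import random

def random_coloring(nodes, k, rng):
    """Assign each node an independent, uniformly random color from {1, ..., k}."""
    return {u: rng.randint(1, k) for u in nodes}

def is_colorful(path, color, k):
    """A length-k path is colorful if it uses every color in {1, ..., k} exactly once."""
    return len(path) == k and {color[u] for u in path} == set(range(1, k + 1))

rng = random.Random(0)   # seeded only to make the sketch reproducible
coloring = random_coloring(["a", "b", "c", "d"], 3, rng)

# Hypothetical fixed coloring (assumption) to show the three cases.
color = {"a": 1, "b": 2, "c": 3, "d": 1}
print(is_colorful(["a", "b", "c"], color, 3))   # all three colors appear
print(is_colorful(["a", "b", "d"], color, 3))   # color 1 repeats, color 3 is missing
print(is_colorful(["a", "b", "a"], color, 3))   # not simple, so it cannot be colorful
```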
So, we consider this modified problem:
Given: Graph G, a subset of nodes I, and a node v.
Find: The lowest-weight, colorful, length-k path between I and v.

Color Coding DP Algorithm
W(v, C) := minimum weight of a path that starts at I, visits a vertex of each color in C, ends at v, and is of length |C|. Here c(v) ∈ C, and C is a set of ≤ k colors.
W(v, C) := ∞ if no such path exists.

$$W(v, C) = \min_{u : c(u) \in C - \{c(v)\}} \big[ W(u, C - \{c(v)\}) + w(u, v) \big]$$

“C” keeps track of the colors used so far, and therefore of which colors remain allowed.

Intuition for faster run time: we must consider only $2^k$ possible sets “C” instead of $O(n^k)$ sets “S”:
$$\sum_{i=0}^{k} \binom{k}{i} = 2^k$$

Alternative View of Color Coding Algorithm
Let I be the given starting node set.
Let colorings(u, j) be the collection of valid path colorings (sets of colors) for paths with j−1 edges (j nodes) from I to u.

For all u in I: colorings(u, 1) = { {c(u)} }
For j = 1, ..., k:
    For every edge (u, w):
        For every C in colorings(u, j):
            If c(w) not in C:
                Add C ∪ {c(w)} to colorings(w, j+1).

[Figure: example in which u carries the color sets {1,2,8} and {1,5,8}, and w receives {1,2,5,8}.]
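The layered procedure above can be sketched as follows. Two things differ from the pseudocode and are assumptions of this example: each (node, color set) state also stores the minimum path weight reaching it, and the loop walks over stored states and their outgoing edges rather than over every edge first. The function name, the adjacency-list format, and the toy graph are hypothetical.

```python
import math

def colorful_path_dp(adj, color, sources, target, k):
    """best[j][(u, C)] is the minimum weight of a path with j nodes that
    starts in `sources`, ends at u, and uses exactly the colors in C."""
    best = [dict() for _ in range(k + 1)]
    for u in sources:
        best[1][(u, frozenset([color[u]]))] = 0.0
    for j in range(1, k):
        for (u, C), weight_so_far in best[j].items():
            for nxt, weight in adj.get(u, []):
                if color[nxt] not in C:          # extending keeps the path colorful
                    key = (nxt, C | {color[nxt]})
                    candidate = weight_so_far + weight
                    if candidate < best[j + 1].get(key, math.inf):
                        best[j + 1][key] = candidate
    return min(
        (wt for (u, C), wt in best[k].items() if u == target),
        default=math.inf,
    )

# Hypothetical toy graph (assumption): adjacency lists of (neighbor, weight).
adj = {
    "a": [("b", 1.0), ("c", 5.0)],
    "b": [("a", 1.0), ("c", 1.0), ("d", 4.0)],
    "c": [("a", 5.0), ("b", 1.0), ("d", 1.0)],
    "d": [("b", 4.0), ("c", 1.0)],
}
good = {"a": 1, "b": 2, "c": 3, "d": 4}   # a-b-c-d happens to be colorful
bad = {"a": 1, "b": 2, "c": 2, "d": 4}    # only three colors appear
print(colorful_path_dp(adj, good, {"a"}, "d", 4))
print(colorful_path_dp(adj, bad, {"a"}, "d", 4))
```

With the second coloring no colorful path of length 4 exists, so the search returns infinity even though a simple path does exist. That is the failure case the repeated trials are meant to handle.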
Running time:
$$\sum_{j=0}^{k} \binom{k}{j}\, |E|\, j = O(2^k k |E|)$$
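The bound follows from the identity $\sum_{j} j \binom{k}{j} = k\,2^{k-1}$, which a short numerical check confirms. The function name `dp_work` is hypothetical, and the edge count in the last line is borrowed from the yeast network size quoted in these slides purely as an illustration.

```python
from math import comb

def dp_work(k, num_edges):
    """Evaluate sum_{j=0}^{k} C(k, j) * |E| * j from the running-time bound."""
    return sum(comb(k, j) * num_edges * j for j in range(k + 1))

# The sum equals k * 2^(k-1) * |E|, which is O(2^k * k * |E|).
for k in range(1, 11):
    assert dp_work(k, 1) == k * 2 ** (k - 1)

print(dp_work(8, 14500))
```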
So:
We had an algorithm that was ≈ $O(n^k)$.
We converted it into an ≈ $O(2^k)$ algorithm, but with an ε probability we'll miss the optimal answer.

[Plot for n = 100: k from 2 to 10 on the horizontal axis; vertical axis from $10^4$ to $10^{20}$.]

What if the optimal path is not colorful?
Have to repeat this procedure enough times so that the probability of that happening is low.

k! ways to make a path colorful.
$k^k$ ways to color a path.
Pr[Path is colorful] = $k!/k^k \ge e^{-k}$.
Pr[OPT is colorful] ≥ $e^{-k}$.
Pr[OPT is not colorful] < $(1 - e^{-k})$.

Repeat algorithm $-e^k \ln \varepsilon$ times (equivalently, $e^k \ln(1/\varepsilon)$ times).

Pr[OPT is never colorful] ≤
$$\left(1 - e^{-k}\right)^{-e^k \ln \varepsilon} = \left(1 + \frac{1}{-e^k}\right)^{-e^k \ln \varepsilon} \le e^{\ln \varepsilon} = \varepsilon$$

[Plot: $k!/k^k$ and $e^{-k}$ for k from 6 to 10; vertical axis from 0.005 to 0.015.]
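The repetition count can be checked numerically. This sketch assumes a target failure probability of ε = 0.01, chosen only for illustration; the function names are hypothetical.

```python
import math

def colorful_probability(k):
    """Exact probability k!/k^k that a fixed k-node path is colorful."""
    return math.factorial(k) / k ** k

def trials_needed(k, eps):
    """Number of independent colorings from the bound e^k * ln(1/eps), rounded up."""
    return math.ceil(math.exp(k) * math.log(1.0 / eps))

def failure_bound(k, trials):
    """Upper bound (1 - e^-k)^trials on the chance that OPT is never colorful."""
    return (1.0 - math.exp(-k)) ** trials

eps = 0.01   # assumption: acceptable failure probability for this example
for k in range(6, 11):
    t = trials_needed(k, eps)
    print(k, colorful_probability(k), math.exp(-k), t, failure_bound(k, t))
```

Since $k!/k^k$ is larger than $e^{-k}$, the count $e^k \ln(1/\varepsilon)$ is conservative: the true failure probability after that many trials is below the printed bound.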
Running Times
Yeast Network with ~4,500 nodes and ~14,500 edges:

Pheromone Response Pathway
[Figure: three network diagrams of yeast proteins, running from STE3 down to STE12.]
(a) Known pathway
(b) Best length-9 pathway between STE3 and STE12
(c) Collection of all low-weight paths between STE3 and STE12

Fig. 2. The pheromone response signaling pathway in yeast. (a) The main chain of the known pathway, adapted from [13]. (b) The best path of the same length (9) in the network. (c) The assembly of all light-weight paths starting at STE3 and ending …

Color Coding Summary
• Turned a slow, $O(n^k)$ algorithm into a less-slow $O(2^k)$ algorithm that is correct with high probability.
• Directly extends to finding good-scoring pathways in the alignment graph of PathBLAST.
• Used on yeast to identify signaling pathways.
• Color Coding: Alon et al, 1995.
This note was uploaded on 01/13/2012 for the course CMSC 423 taught by Professor Staff during the Fall '07 term at Maryland.