Unformatted text preview: Network Archaeology: Uncovering Ancient
Networks from Presentday Interactions Saket Navlakha Department of Computer Science University of Maryland, College Park, USA Joint work with Carl Kingsford Yeast PPI network: 2,599 proteins 8,275 interaction edges Did these proteins interact
10 million years ago? Which model of evolution
best characterizes the
structure of this network? How have these protein
complexes reconﬁgured
over evolutionary time? Last.fm social network: 2,957 users 9,659 friendship edges When did this user
enter the network? How do growth principles of
social networks differ from
those of biological networks? Who recruited these users
to join the network? How are networks changing over time? Difﬁcult to study in biology
because we do not have access
to ancestral networks Extant networks are insufﬁcient
to track ﬁnelevel changes: Difﬁcult to study in sociology
if networks are not crawled
regularly or if data remains
hidden (privacy issues) (2002) ? ? human ﬂy worm ? ? yeast 1 billion years – widely separated in time (2010) What can we do with ancestor networks? Estimate the age of a node
or a edge Track the emergence of
clusters and motifs 1 3 ✔ DMC 2 4 5 Downsample to create
realistic & smaller network Validate growth models by
comparing histories t = 100 t = 500 Networks vs. sequence
view of evolution FF ✗ ✗ ER ✗ PA Privacy implications of
predicting user activity Truth Sequence Networks Actual Predicted Challenge: Given a present
day network G and a probabilistic model by which G putatively evolved, can we reconstruct an ancestral version of G? Related work * Do not handle node labels: – Leskovec and Faloutsos (ICML 2007) and Hormozdiari+ (PLoS Comp. Biol. 2007) ﬁnd seeds graphs
(putative ancestral subnetworks) under the Kronecker and a duplicationbased model * Do not model the evolution of interactions over time: – Flannick+ (Genome Res. 2006), Kelley+ (PNAS 2003), and Singh+ (RECOMB 2007) align multiple
static networks to ﬁnd conserved substructures * Do not reconstruct ancestral networks: – Middendorf+ (PNAS 2005), Wiuf+ (PNAS 2006), Bezáková+ (ICML 2006), and Leskovec+ (KDD 2008)
compute the likelihood that a growth model produced a given network * Do not apply to general graphs: – Dutkowski+ (Bioinformatics 2007) and Gibson and Goldberg (PSB 2009) use the history encoded by
the gene sequence phylogeny to reconstruct ancient networks – Ahmed and Xing (PNAS 2009) use node attributes (e.g. gene expression, Senator voting patterns) to
recover a hidden underlying static network – Mithani+ (Bioinformatics 2009) use principles from metabolic networks to model edge gain & loss
amongst ﬁxed set of nodes Network growth models Preferential Attachment Forest Fire Barabási+ (Science 1999) v Leskovec+ (KDD 2005) Degree distribution: 1
v 1 3 2 4 5 2 3 4 5 new node v links to k existing nodes according to the histogram u node v chooses anchor u, starts probabilistic ﬁre, links to burnt nodes Popular technique to study how graph properties evolve under various
growth mechanisms DuplicationMutation with Complementarity model A biological growth mechanism that closely characterizes PPI networks1,2,3 u Step 1 v u Step 2 v u v Step 3 Step 1: Node v enters, chooses random anchor node u, and copies its links Step 2: For each neighbor x of v, delete either (u,x) or (v,x) with prob. qmod Step 3: Add edge (u,v) with prob. qcon 1Vazquez+ (Complexus 2003); 2Pereira
Leal+ (Phil. Trans. R. Soc. Lond. B. Biol. 2006); 3Middendorf+ (PNAS 2005) BUT: the growth model produces a random and unlabeled graph whose history does not correspond to the history of a real network Network archaeology Main idea: reverse a network backwards in time as per the model a ab cg efd Network Model b e f g Ancestor graph at time t4 Duplicationbased (DMC) Preferential Attachment (PA) Forest Fire (FF) ErdösRenyi (ER) WattsStrogatz c d Input graph at time t … ✔ maintains labels ✔ models interactions ✔ likelihood estimate Given Gt and DMC, ﬁnd the most likely precursor network, G∗−∆t :
t
present
day network ∗
Gt−∆t Zme steps := argmax Pr(Gt−∆t  Gt , M, ∆t)
Gt−∆t ancestor network DMC model Given Gt and DMC, ﬁnd the most likely precursor network, G∗−∆t :
t
present
day network ∗
Gt−∆t Zme steps := argmax Pr(Gt−∆t  Gt , M, ∆t)
Gt−∆t ancestor network DMC model BUT: number of possible paths is Ω(n!) G1 G1 G1 …… Gt
1 Gt Gt
1 … G1 High
likelihood path Gt
1 ≥n−1 ≥n Find: most likely path of graphs (G1, G2, …, Gt
1) that produced Gt under DMC Given Gt and DMC, ﬁnd the most likely precursor network, G∗−∆t :
t
present
day network ∗
Gt−∆t := argmax Pr(Gt−∆t  Gt , M, ∆t)
Gt−∆t ancestor network Instead, set ∆t = 1, greedily approximate: ∗
Gt−1 Zme steps DMC model prior probability (uniform) Pr(Gt  Gt−1 , M) Pr(Gt−1  M)
:= argmax
Pr(Gt  M)
Gt−1 Bayes’ rule ignore denominator (constant over all candidates) What is the likelihood that u is a duplicate of v? 1
LDMC (u, v ) =
n−1 N (u)∩N (v ) Prob. that u
was chosen as
v’s anchor E.g: v u uv ⇒ Gt
1 1 − qmod Common neighbors
of u and v (no divergence) 1
q
(1 − qmod )2 ( mod )2 (qcon )
n−1
2 N (u)N (v ) qmod (γuv )
2 qcon if (u,v) exists Symmetric difference 1qcon else of u and v (divergence) In each step, merge the pair with the
highest likelihood of duplicating: argmax LDMC (u, v )
u,v ∈Gt Gt Synthetic DMC experiments 100 graphs, 100 nodes each
DMC(0.1,0.1) Assess performance when evolutionary
history is known DMC(0.1,0.3) DMC(0.1,0.5) DMC(0.1,0.7) DMC(0.1,0.9) Reverse
Gt=100 Compare
histories Repeat 100x DMC(0.3,0.1) . . . .
.
. DMC(0.9,0.9) Store known node/anchor phylogeny and node arrival Zmes Synthetic DMC results 5060% of node/anchor
relationships correctly identiﬁed Randomly
replace 10%
of true edges
in Gt=100 Better ordering with low qmod (less divergence)
and high qcon (connected to anchor) Recovery of ancient PPI networks Started only with extant PPI
network for yeast Validating arrival times: older
proteins have orthologs in
more eukaryotic species –
COG age class1 Result: The biologicallyinspired DMC model best
matches the sequencebased
age classes (P < 0.01) PA DMC
(0.4,0.7) FF (0.3) Evolutionary principles reﬂected in best DMC
parameters: medlow qmod = 0.4, medhigh qcon = 0.7
(new proteins are usually connected to their duplicates2,3
and share some interaction partners) 1Clusters of Orthologous Genes, Tatusov+ BMC Bioinforma=cs 2003; 2Ispolatov+ NAR 2005; 3Pereira
Leal+ Genome. Biol. 2007. Validating node anchors If two proteins are duplicates, they likely share a functional annotation (tested using MIPS complexes
) u Gt
1 u v Gt Result: 84% of identiﬁed DMC node/anchors belong to same complex vs.
68% for FF and 55% baseline (no anchors for PA) The evolution rate of proteins How does the evolution rate of a protein depend on its degree? Result: Duplication rate is
inversely proportional to its #
of binding partners (agreeing
with rates found via sequence
analysis1,2) Highdegree proteins evolve
slowly because more of the
protein is directly involved in
its function 1Fraser+ Science 2002; 2Makino+ Gene 2006 Number of duplicaZons Core vs. peripheral complex members Coreness of a protein: percentage of likeannotated neighbors Are core members of a protein complex
older than peripheral members? Result:Yes, somewhat: R = 0.37, P < 0.01 Also found via 3D protein structure
analysis1 by looking at age distribution of
domains amongst eukaryotic species 1Kim + Marcode PLoS. Comp. Biol. 2008 ½, newer (ignore) ? x u ¾, older Contrasting with Last.fm growth Breadthﬁrst crawl of the Last.fm
music social network, starting
from user ‘rj‘ Component induced by ﬁrst
2,957 nodes visited (9,659
edges) Validation: registration times Results: PA is most applicable Thus, histories provide a way to
validate and distinguish between
growth models PA DMC
FF (0.3) (0.7,0.3) Conclusions Introduced a framework to formally study and generate
ancient networks (“network archaeology”) Validation: using network topology alone * inferred protein ages that matched sequencebased estimates * related protein age with its coreperipheral structural position * related protein age with # of binding partners * found node/anchor pairs that are often functionally related * validated models based on their ability to produce accurate histories Also experimented with: * Reverse parameters (good performance if near true parameters) * Synthetic FF and PA data (easier to reverse than DMC) Acknowledgements Carl Kingsford Asst. Prof. & Advisor NSF grants EF
0849899 and IIS
0812111 to C.K. Thanks! Manuscript available online at: hdp://arxiv.org/abs/1008.5166 ...
View
Full Document
 Fall '07
 staff
 DMC, likely precursor network, Synthetic DMC experiments00, Synthetic DMC results00

Click to edit the document details