LecR2-networkarch - Network Archaeology: Uncovering Ancient...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Network Archaeology: Uncovering Ancient Networks from Present-day Interactions Saket Navlakha Department of Computer Science University of Maryland, College Park, USA Joint work with Carl Kingsford Yeast PPI network: 2,599 proteins 8,275 interaction edges Did these proteins interact 10 million years ago? Which model of evolution best characterizes the structure of this network? How have these protein complexes reconfigured over evolutionary time? Last.fm social network: 2,957 users 9,659 friendship edges When did this user enter the network? How do growth principles of social networks differ from those of biological networks? Who recruited these users to join the network? How are networks changing over time? Difficult to study in biology because we do not have access to ancestral networks Extant networks are insufficient to track fine-level changes: Difficult to study in sociology if networks are not crawled regularly or if data remains hidden (privacy issues) (2002) ? ? human fly worm ? ? yeast 1 billion years – widely separated in time (2010) What can we do with ancestor networks? Estimate the age of a node or a edge Track the emergence of clusters and motifs 1 3 ✔ DMC 2 4 5 Down-sample to create realistic & smaller network Validate growth models by comparing histories t = 100 t = 500 Networks vs. sequence view of evolution FF ✗ ✗ ER ✗ PA Privacy implications of predicting user activity Truth Sequence Networks Actual Predicted Challenge: Given a present ­day network G and a probabilistic model by which G putatively evolved, can we reconstruct an ancestral version of G? Related work * Do not handle node labels: –  Leskovec and Faloutsos (ICML 2007) and Hormozdiari+ (PLoS Comp. Biol. 2007) find seeds graphs (putative ancestral subnetworks) under the Kronecker and a duplication-based model * Do not model the evolution of interactions over time: –  Flannick+ (Genome Res. 2006), Kelley+ (PNAS 2003), and Singh+ (RECOMB 2007) align multiple static networks to find conserved substructures * Do not reconstruct ancestral networks: –  Middendorf+ (PNAS 2005), Wiuf+ (PNAS 2006), Bezáková+ (ICML 2006), and Leskovec+ (KDD 2008) compute the likelihood that a growth model produced a given network * Do not apply to general graphs: –  Dutkowski+ (Bioinformatics 2007) and Gibson and Goldberg (PSB 2009) use the history encoded by the gene sequence phylogeny to reconstruct ancient networks –  Ahmed and Xing (PNAS 2009) use node attributes (e.g. gene expression, Senator voting patterns) to recover a hidden underlying static network –  Mithani+ (Bioinformatics 2009) use principles from metabolic networks to model edge gain & loss amongst fixed set of nodes Network growth models Preferential Attachment Forest Fire Barabási+ (Science 1999) v Leskovec+ (KDD 2005) Degree distribution: 1 v 1 3 2 4 5 2 3 4 5 new node v links to k existing nodes according to the histogram u node v chooses anchor u, starts probabilistic fire, links to burnt nodes Popular technique to study how graph properties evolve under various growth mechanisms Duplication-Mutation with Complementarity model A biological growth mechanism that closely characterizes PPI networks1,2,3 u Step 1 v u Step 2 v u v Step 3 Step 1: Node v enters, chooses random anchor node u, and copies its links Step 2: For each neighbor x of v, delete either (u,x) or (v,x) with prob. qmod Step 3: Add edge (u,v) with prob. qcon 1Vazquez+ (Complexus 2003); 2Pereira ­Leal+ (Phil. Trans. R. Soc. Lond. B. Biol. 2006); 3Middendorf+ (PNAS 2005) BUT: the growth model produces a random and unlabeled graph whose history does not correspond to the history of a real network Network archaeology Main idea: reverse a network backwards in time as per the model a ab cg efd Network Model b e f g Ancestor graph at time t-4 Duplication-based (DMC) Preferential Attachment (PA) Forest Fire (FF) Erdös-Renyi (ER) Watts-Strogatz c d Input graph at time t … ✔ maintains labels ✔ models interactions ✔ likelihood estimate Given Gt and DMC, find the most likely precursor network, G∗−∆t : t present ­day network ∗ Gt−∆t Zme steps := argmax Pr(Gt−∆t | Gt , M, ∆t) Gt−∆t ancestor network DMC model Given Gt and DMC, find the most likely precursor network, G∗−∆t : t present ­day network ∗ Gt−∆t Zme steps := argmax Pr(Gt−∆t | Gt , M, ∆t) Gt−∆t ancestor network DMC model BUT: number of possible paths is Ω(n!) G1 G1 G1 …… Gt ­1 Gt Gt ­1 … G1 High ­likelihood path Gt ­1 ≥n−1 ≥n Find: most likely path of graphs (G1, G2, …, Gt ­1) that produced Gt under DMC Given Gt and DMC, find the most likely precursor network, G∗−∆t : t present ­day network ∗ Gt−∆t := argmax Pr(Gt−∆t | Gt , M, ∆t) Gt−∆t ancestor network Instead, set ∆t = 1, greedily approximate: ∗ Gt−1 Zme steps DMC model prior probability (uniform) Pr(Gt | Gt−1 , M) Pr(Gt−1 | M) := argmax Pr(Gt | M) Gt−1 Bayes’ rule ignore denominator (constant over all candidates) What is the likelihood that u is a duplicate of v? 1 LDMC (u, v ) = n−1 ￿ N (u)∩N (v ) Prob. that u was chosen as v’s anchor E.g: v u uv ⇒ Gt ­1 1 − qmod Common neighbors of u and v (no divergence) 1 q (1 − qmod )2 ( mod )2 (qcon ) n−1 2 ￿ N (u)￿N (v ) qmod (γuv ) 2 qcon if (u,v) exists Symmetric difference 1-qcon else of u and v (divergence) In each step, merge the pair with the highest likelihood of duplicating: argmax LDMC (u, v ) u,v ∈Gt Gt Synthetic DMC experiments 100 graphs, 100 nodes each DMC(0.1,0.1) Assess performance when evolutionary history is known DMC(0.1,0.3) DMC(0.1,0.5) DMC(0.1,0.7) DMC(0.1,0.9) Reverse Gt=100 Compare histories Repeat 100x DMC(0.3,0.1) . . . . . . DMC(0.9,0.9) Store known node/anchor phylogeny and node arrival Zmes Synthetic DMC results 50-60% of node/anchor relationships correctly identified Randomly replace 10% of true edges in Gt=100 Better ordering with low qmod (less divergence) and high qcon (connected to anchor) Recovery of ancient PPI networks Started only with extant PPI network for yeast Validating arrival times: older proteins have orthologs in more eukaryotic species – COG age class1 Result: The biologicallyinspired DMC model best matches the sequence-based age classes (P < 0.01) PA DMC (0.4,0.7) FF (0.3) Evolutionary principles reflected in best DMC parameters: med-low qmod = 0.4, med-high qcon = 0.7 (new proteins are usually connected to their duplicates2,3 and share some interaction partners) 1Clusters of Orthologous Genes, Tatusov+ BMC Bioinforma=cs 2003; 2Ispolatov+ NAR 2005; 3Pereira ­Leal+ Genome. Biol. 2007. Validating node anchors If two proteins are duplicates, they likely share a functional annotation (tested using MIPS complexes ) u Gt ­1 u v Gt Result: 84% of identified DMC node/anchors belong to same complex vs. 68% for FF and 55% baseline (no anchors for PA) The evolution rate of proteins How does the evolution rate of a protein depend on its degree? Result: Duplication rate is inversely proportional to its # of binding partners (agreeing with rates found via sequence analysis1,2) High-degree proteins evolve slowly because more of the protein is directly involved in its function 1Fraser+ Science 2002; 2Makino+ Gene 2006 Number of duplicaZons Core vs. peripheral complex members Coreness of a protein: percentage of like-annotated neighbors Are core members of a protein complex older than peripheral members? Result:Yes, somewhat: R = 0.37, P < 0.01 Also found via 3D protein structure analysis1 by looking at age distribution of domains amongst eukaryotic species 1Kim + Marcode PLoS. Comp. Biol. 2008 ½, newer (ignore) ? x u ¾, older Contrasting with Last.fm growth Breadth-first crawl of the Last.fm music social network, starting from user ‘rj‘ Component induced by first 2,957 nodes visited (9,659 edges) Validation: registration times Results: PA is most applicable Thus, histories provide a way to validate and distinguish between growth models PA DMC FF (0.3) (0.7,0.3) Conclusions Introduced a framework to formally study and generate ancient networks (“network archaeology”) Validation: using network topology alone * inferred protein ages that matched sequence-based estimates * related protein age with its core-peripheral structural position * related protein age with # of binding partners * found node/anchor pairs that are often functionally related * validated models based on their ability to produce accurate histories Also experimented with: * Reverse parameters (good performance if near true parameters) * Synthetic FF and PA data (easier to reverse than DMC) Acknowledgements Carl Kingsford Asst. Prof. & Advisor NSF grants EF ­0849899 and IIS ­0812111 to C.K. Thanks! Manuscript available online at: hdp://arxiv.org/abs/1008.5166 ...
View Full Document

Ask a homework question - tutors are online