Course: BIO 203, Fall 2009
School: Stanford

Treebuilding 1 - Trees *are* Evolution
[Figure: two tree diagrams relating sequences 1, 2, and 3]
ONLY ONE TRUE TREE

Treebuilding 2 - Tree Parts
[Figure: a rooted tree with terminal nodes A, B, C, D, labeling the root, the internal and terminal branches, the internal nodes, and the terminal nodes]
- The Root is a special node: it signifies the last common ancestor of all represented sequences.
- Internal nodes are last common ancestors of subsets of the sequences.
- Terminal nodes signify extant sequences.
- Branches are the lineages; they are bounded by nodes.

Treebuilding 3 - Tree Displays
[Figure: several equivalent drawings of the same tree on A, B, C, D. One axis shows # of events or rate; the other axis has no meaning.]
Sometimes trees are displayed on their side.
Same topology, with branch lengths shown (preferable).

Treebuilding 4 - Rooted vs. Unrooted
[Figure: a rooted and an unrooted drawing of the tree on A, B, C, D with internal nodes 1 and 2. Same tree!]
An unrooted tree can be obtained from a rooted tree by removing the root. The topology remains the same.
Virtually all tree-based inference (building and rate estimation) is done on unrooted trees because they have certain convenient properties.

Treebuilding 5 - Rooting the Unrooted Tree
But what if we want a rooted tree?
Use an OUTGROUP with which to root the tree.
[Figure: the unrooted tree of Drosophila, Ciona, Fish, and You, and the same tree rooted with Drosophila as the outgroup]
This is the root of the tree we're interested in: the chordate ancestor.
Knowledge of the outgroup must be based on information independent from the analysis at hand.

Treebuilding 6 - Rooting the Unrooted
One unrooted tree for three sequences.
Now let's add an outgroup to root the tree of A, B, C.
[Figure: the unrooted tree of A, B, C with three possible attachment points (1, 2, 3) for the outgroup O, and the three rooted trees that result]

Treebuilding 7 - Number of Possible Trees 1
One unrooted tree for three sequences.
Three unrooted trees for four sequences.
[Figure: the unrooted tree of A, B, C and the three unrooted trees of A, B, C, O]
How many unrooted trees for 5 sequences?

Treebuilding 8 - Number of Possible Trees 2
Number of unrooted trees (n = number of sequences): $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

n      Number of unrooted trees
10     2027025
20     221643095476699771875
30     8.7 x 10^36
50     2.8 x 10^74
100    my calculator can't do this

Treebuilding 9 - Alignment 1
Before trees can be built, we must have a Multiple Alignment of homologous sequences.
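An aside on the Number of Possible Trees slides: the tree-count formula is easy to evaluate with exact integer arithmetic, which sidesteps the calculator overflow mentioned for 100 sequences. The following is a minimal sketch in Python; the function names are illustrative and not part of the lecture.

```python
from math import factorial


def unrooted_tree_count(n: int) -> int:
    """Number of unrooted bifurcating trees for n labeled sequences.

    Implements (2n-5)! / (2^(n-3) * (n-3)!) with exact integers.
    """
    if n < 3:
        raise ValueError("the formula needs at least 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_tree_count(n: int) -> int:
    """Number of rooted trees for n sequences.

    Rooting with an outgroup is the same as adding one more sequence
    to an unrooted tree, so this is the unrooted count for n + 1.
    """
    return unrooted_tree_count(n + 1)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20, 30, 50, 100):
        print(n, unrooted_tree_count(n))
```

Python integers do not overflow, so the value for 100 sequences prints in full. `rooted_tree_count` follows the outgroup argument from the rooting slides: three sequences plus an outgroup give three rooted trees.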
The alignment needs to be accurate, i.e., not only do the sequences need to be homologs, but Homologous Sites need to be correctly lined up in the columns of the alignment.

Treebuilding 10 - Alignment 2
- Indels obscure homology. Because we need to be sure that we have aligned homologous positions, regions of uncertain homology are excluded from analysis.
- A tree given to you by a multiple sequence alignment algorithm is an impression, not a robust tree.
- It's OK to analyze parts of proteins (those that align well), and if a species tree is to be built, lots of parts from different proteins (more data => better result).
- These concepts also apply to sequence alignment and treebuilding using DNA sequences.

Treebuilding 11 - Alignment 3
Protein Multiple Alignment
- Confined in scale ... proteins just aren't very long
- ProbCons (Do, Batzoglou; probcons.stanford.edu)
- MUSCLE (Edgar, www.drive5.com)
- ClustalW (old workhorse, try not to use it)
Genomic DNA Multiple Alignment
- Area of active research by several groups because the scale of the problem is huge when genomes are aligned
- Large-scale evolutionary events (large indels, rearrangements, duplications) require homology detection before actual nucleotide alignment
- MLAGAN (Brudno, Batzoglou) http://lagan.stanford.edu/lagan_web/index.shtml
- TBA, BlastZ/MultiZ (Miller, Kent)
- Revolver (Asimenos, Edgar, Sidow, Batzoglou)
Gene211, Genomics (Winter): Genome concepts, PERL

Treebuilding 12 - Distance Methods 1
1. Convert sequence data into a matrix of all pairwise distances (count differences and apply multiple hit correction).
2. Fit the distances to a bifurcating tree (very fast).

      1     2     3     4
1     0     0.40  0.50  0.65
2           0     0.30  0.45
3                 0     0.35
4                       0

Which Tree is the Right One?
[Figure: a candidate tree for sequences 1 to 4, drawn with branch lengths 0.3, 0.1, 0.1, 0.25, 0.1]
Note: Distance methods lose information (because of the conversion to pairwise data); only preferable if accuracy has to be sacrificed for speed.
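For four sequences, the fitting step on the Distance Methods 1 slide can be worked by hand or with a few lines of code. The sketch below uses the four-point condition (the correct pairing has the smallest sum of within-pair distances) and simple averaging for the branch lengths. That is one possible way to fit a quartet, not necessarily the procedure used to draw the slide, and it is not Neighbor Joining. The function names are illustrative.

```python
# Pairwise distances from the Distance Methods 1 matrix.
DISTANCES = {
    (1, 2): 0.40, (1, 3): 0.50, (1, 4): 0.65,
    (2, 3): 0.30, (2, 4): 0.45, (3, 4): 0.35,
}


def dist(d, i, j):
    """Symmetric lookup in a dictionary keyed by sorted pairs."""
    return d[tuple(sorted((i, j)))]


def best_split(d, taxa=(1, 2, 3, 4)):
    """Pick the pairing whose within-pair distances sum to the least."""
    a, b, c, e = taxa
    splits = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return min(splits, key=lambda s: dist(d, *s[0]) + dist(d, *s[1]))


def fit_branch_lengths(d, split):
    """Estimate the five branch lengths of a quartet tree by averaging."""
    (i, j), (k, l) = split
    within = dist(d, i, j) + dist(d, k, l)
    across = (dist(d, i, k) + dist(d, j, l)
              + dist(d, i, l) + dist(d, j, k)) / 2
    lengths = {"internal": (across - within) / 2}

    def terminal(tip, sibling, others):
        # Length of the branch leading to `tip`, averaged over both
        # sequences on the other side of the internal branch.
        far_tip = sum(dist(d, tip, o) for o in others) / 2
        far_sib = sum(dist(d, sibling, o) for o in others) / 2
        return (dist(d, tip, sibling) + far_tip - far_sib) / 2

    lengths[i] = terminal(i, j, (k, l))
    lengths[j] = terminal(j, i, (k, l))
    lengths[k] = terminal(k, l, (i, j))
    lengths[l] = terminal(l, k, (i, j))
    return lengths


if __name__ == "__main__":
    split = best_split(DISTANCES)
    print("pairing:", split)
    print("branch lengths:", fit_branch_lengths(DISTANCES, split))
```

On this matrix the pairing of 1 with 2 and 3 with 4 has the smallest within-pair sum (0.75 versus 0.95 for the other two), and the fitted lengths reproduce the 0.3, 0.1, 0.1, 0.25, 0.1 shown on the slide up to floating-point rounding.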
[Figure, continued: OR the same four sequences in an alternative topology (tips drawn in the order 1, 4, 2, 3), or the third possibility]

Treebuilding 13 - Distance Methods 2
- Algorithms that cluster the most similar sequences first often give the wrong answer (this statement also applies to your brain).
- Neighbor Joining is smarter and is the de facto standard for distance methods because of its speed.
- Distance methods should only be used when speed limitations preclude the use of Likelihood-based character-state methods (next).

Treebuilding 14 - Character-state Methods
... do not convert sequences into distances but assess whether the patterns of states at all sites/positions in the alignment ...

Seq1 ...GATCG...
Seq2 ...AATCG...
Seq3 ...AATAA...
Seq4 ...GATAA...

... favor one tree over another (from among the tested trees).
[Figure: the three possible unrooted trees for sequences 1 to 4, compared against each other]
(Maximum) Parsimony and (Maximum) Likelihood
Both require searching for the best tree, i.e., evaluating one tree after another. Smart algorithms do a good job at that unless the number of sequences is very large.

Treebuilding 15 - Parsimony 1
Parsimony: Occam's Razor. Find the tree that explains the sequence data by invoking the fewest changes.
Basic Idea:

Seq1 ...G...
Seq2 ...G...
Seq3 ...A...
Seq4 ...A...

This site favors the tree that groups Seq1 with Seq2 and Seq3 with Seq4: one change from G to A (or A to G).
The other two trees (Tree 2 and Tree 3) would require two changes.
[Figure: the three trees with G or A marked at the tips and at the internal nodes]

Treebuilding 16 - Parsimony 2

Pos/site  123456
Seq1   ...ATACGA...
Seq2   ...ATCCTC...
Seq3   ...ATCCTT...
Seq4   ...ACCTTA...

- Pos 1: No change
- Pos 2: One change on lineage to 4
- Pos 3: One change on lineage to 1
- Pos 4, 5: Same as 2, 3
- Site 6: Two changes for all trees, so no discrimination
[Figure: Tree 1, Tree 2, and Tree 3 with the site 6 states (A, C, T, A) marked at the tips]
None of these positions are informative, as none of them favors one tree over another.

Treebuilding 17 - Parsimony Exercise
The tree requiring the fewest changes wins.
Let's count changes for each position (positions 1 to 12).
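The per-site counting used on the Parsimony slides can also be sketched in code. The example below applies Fitch's small-parsimony counting to the six sites of the Parsimony 2 slide for each of the three four-sequence trees. Writing trees as nested tuples and the names `fitch`, `changes_per_site`, and `TREES` are choices made for this illustration, not part of the lecture.

```python
# Alignment columns from the Parsimony 2 slide (sites 1 to 6).
ALIGNMENT = {
    "Seq1": "ATACGA",
    "Seq2": "ATCCTC",
    "Seq3": "ATCCTT",
    "Seq4": "ACCTTA",
}

# The three possible unrooted trees, each written as a pair of pairs.
TREES = {
    "12|34": (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    "13|24": (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    "14|23": (("Seq1", "Seq4"), ("Seq2", "Seq3")),
}


def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0  # a tip: its observed state, no change
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one extra change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def changes_per_site(tree, alignment):
    """Minimum number of changes at each alignment position."""
    length = len(next(iter(alignment.values())))
    counts = []
    for pos in range(length):
        states = {name: seq[pos] for name, seq in alignment.items()}
        counts.append(fitch(tree, states)[1])
    return counts


if __name__ == "__main__":
    for label, tree in TREES.items():
        counts = changes_per_site(tree, ALIGNMENT)
        print(label, counts, "total:", sum(counts))
```

All three trees get the same count at every site here, which is what it means for these positions to be uninformative. The same functions can be pointed at any four-sequence alignment, including one with an outgroup.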
Pos:  123456789012
Seq1: GATCGATCGATC
Seq2: AATCGTTAGATG
Seq3: AACCGTGCAAAC
Outg: AACCGCGTGCAG

Changes per position for each of the three trees (O = outgroup):

Tree           1 2 3 4 5 6 7 8 9 10 11 12   Total
(1,2 | 3,O)    1 0 1 0 0 2 1 2 1 1  1  2    12
(1,3 | 2,O)    1 0 2 0 0 2 2 2 1 1  2  1    14
(1,O | 2,3)    1 0 2 0 0 2 2 2 1 1  2  2    15

Treebuilding 18 - Summary
- Parsimony implicitly assumes that there are no multiple hits. It has no stochastic model for substitution.
  - Great for learning, not so good for actual science
  - Substitutions must be rare on any given lineage
  - Parsimony's counting-only approach may be OK for very closely related sequences
- Likelihood is the way to be on sound statistical footing (versus parsimony) without loss of information (versus distance methods). Use SEMPHY for treebuilding, PAML for inferring evolutionary parameters.
- However (1): For applications other than sequence-based treebuilding, a parsimony approach (Occam's razor) can be appropriate.
- However (2): For mapping traits onto a tree, parsimony can be appropriate.

next
- Do treebuilding: distance, character states
- Understand tree-based inference: orthology and paralogy, evolutionary events
- Discuss evolutionary constraint and how it matters: proteins, genomes
- Explore genome structure and how it is shaped by evolution

Gene Duplication 1
[Figure: the universal symbol for Gene Duplication]
Both paralogs are retained if:
- High expression level is needed
  - Actins, Tubulins, Cow stomach lysozyme, rRNA
  - Concerted evolution after duplication by gene conversion or cycles of duplication and loss
- Neofunctionalization
  - New biochemical function or new expression domain
  - p53 vs p63,73; RNA polymerases; hemoglobins; etc.
  - Independent evolution after duplication
- Subfunctionalization
  - Of a protein's specific function or of expression region
  - Many developmental regulators in vertebrates
  - Independent evolution after duplication

Gene Duplication 2
[Figure: Species Tree with outgroup O and species 1, 2, 3]
Where is the Gene Duplication on the Species Tree?
[Figure: Gene Tree with outgroup O and two gene copies, a and b, each present in species 1, 2, 3]

Gene Duplication 3
There are only two types of nodes in a tree:
- Gene Duplication
- Last Common Species Ancestor
[Figure: the gene tree with copies a and b, outgroup O, and species 1, 2, 3]
If independent evolution in the lineages was precipitated by
- a gene duplication: Paralogy
- a speciation event: Orthology
(Give some examples of orthologs and paralogs from the tree above.)
(Hint: trace back to the last common node; is it a gene dup or a last common species ancestor?)
When a gene duplication occurred is inferred by comparison of the Gene Tree with the Species Tree.

Gene Duplication 4 - Some Exercises
[Figure: four exercise trees with tips (Human, Frog, ZFish, Ciona); (ZFish1, ZFish2, Frog, Ciona); (Hum2, Frog, Hum1, ZFish, Ciona); (Human, ZFish, Frog, Ciona)]

Gene Duplication 5 - Rerooting Gene Trees
[Figure: two drawings of the Same Tree]
- Left: rooted arbitrarily by treebuilding program
- Right: rooted by user under reasonable outgroup assumption
TOPOLOGY IS THE SAME

Gene Duplication 6 - Some Gene Trees
[Figure: gene trees for p53, p63, p73 and for Patched]

Mapping Traits onto Tree 1
[Figure: Gene Tree]

Sequence name        Trait: C-terminal domain
Mollusk p53/63/73    present
Vertebrate p63       present
Vertebrate p73       present
Fish p53             absent
Chick p53            absent
Mouse p53            absent

Mapping Traits onto Tree 2
[Figure: Gene Tree]

Sequence name        Trait: expression domains
Fly YFG              eye
Vertebrate YFG1      fin/limb bud
Vertebrate YFG2      eye, fin/limb bud
Fish YFG3            eye
Chick YFG3           eye
Mouse YFG3           eye, placenta

next
- Do treebuilding: distance, character states
- Understand tree-based inference: orthology and paralogy, evolutionary events
- Discuss evolutionary constraint and how it matters: proteins, genomes
- Explore genome structure and how it is shaped by evolution

Source and Use of Evolutionary Constraint
Constraint on Proteins:
- Structural: Folding or Packing
- Functional: Catalysis or Binding
Constraint on DNA:
- Structural: ?
- Functional: TFBS, Exons, etc
[Diagram: Constraint determines Deleteriousness of Mutation, which determines the next item in the chain; reading the chain in reverse: infer!]
Evolutionary Rate: estimate from extant homologous sequences by standard molecular evolutionary methodology

Constraint vs Conservation
- Conservation implies binary logic: either it's conserved or it's not conserved.
- Constraint implies quantification, a degree to which the evolution of a functional unit has been restrained.
- The upper limit of constraint is no evolutionary change.
- The lower limit of constraint is neutral.
- The whole range of evolutionary rates is in between.

Alignments and Trees
Analyses of Constraint Require
- Alignment of sequences
  - Protein or DNA
- Tree relating the sequences
  - Topology only ...
  - ... or species tree with neutral branch lengths
(There are other types of constraint that evolutionary rates of single base substitutions do not capture; not in this course.)

Constraint in Proteins
Structure and Function
Inference of Constraint by Evolutionary Analysis

Constraint on Proteins - Rates of Evolution
[Figure: gene trees for p53, p63, p73 and for Patched]

Example: p53 Single Site Constraint
[Figure: p53 with sites colored by constraint]
Colors correspond to strength of evolutionary constraint. Red, weak constraint; Blue, strong constraint; Rainbow in between.

Example: p53 Regional Rate Plot
Rates of evolution of neighboring sites are averaged and smoothed in order to see regional features such as domains.
[Figure: Normalized Rate of Evolution (y-axis) against Position in the Alignment (x-axis), with the mean rate and the Highly Constrained Regions marked]

Regional vs Single Site Constraint
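The averaging and smoothing behind a regional rate plot can be illustrated with a centered moving average over per-site rates, scaled so that the overall mean rate equals 1. This is only a sketch: the window size, the function name `regional_rates`, and the example rates are assumptions made for illustration, and the smoothing actually used for the p53 plot is not specified on the slides.

```python
def regional_rates(site_rates, window=5):
    """Centered moving average of per-site rates, normalized by the mean.

    Values below 1.0 mark regions evolving more slowly than average,
    i.e. regions under stronger constraint.
    """
    if not site_rates:
        raise ValueError("need at least one site")
    mean_rate = sum(site_rates) / len(site_rates)
    if mean_rate == 0:
        raise ValueError("mean rate is zero; nothing to normalize by")
    half = window // 2
    smoothed = []
    for i in range(len(site_rates)):
        # The window is truncated at the ends of the alignment.
        lo = max(0, i - half)
        hi = min(len(site_rates), i + half + 1)
        chunk = site_rates[lo:hi]
        smoothed.append(sum(chunk) / len(chunk) / mean_rate)
    return smoothed


if __name__ == "__main__":
    # Made-up per-site rates: a slow stretch in the middle.
    rates = [2.0, 1.5, 2.5, 0.1, 0.0, 0.2, 0.1, 1.8, 2.2, 1.6]
    for position, value in enumerate(regional_rates(rates, window=3), start=1):
        print(position, round(value, 2))
```

A wider window gives a smoother curve at the cost of blurring short features, so the choice depends on the size of the domains one hopes to see.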
