LecturesPart05



Computational Biology, Part 5
Multiple Sequence Alignment
Robert F. Murphy
Copyright © 1996-2006. All rights reserved.

Multiple Sequence Alignment
- Goal: Create best possible “overall” alignment of a family of sequences (more than two)
- Ideal approach: Compare all sequences “simultaneously”
- Short-cut approach: Align all of the members pairwise with one of the members

Pairwise Multiple Sequence Alignment Example - MacVector
- “Align to Folder”
  - Inputs
    - An open sequence file
    - A folder containing a set of sequences
    - Settings for sequence comparison
  - Outputs
    - Aligned sequence map (graphical)
    - Aligned sequence listing (text)

Pairwise Multiple Sequence Alignment Example 1
- Input: Folder containing a set of protein sequences for various α and β tubulin chains
- Task: Compare first sequence to all others and display map
- Open first tubulin sequence (A23035)
- Under Database, pull down to Align to Folder
- Click on Folder to Search
- Select folder containing tubulins
- Use defaults for search settings and click OK
- Click all boxes for Display Options and OK

Aligned Sequence Description List for Tubulins

Search Analysis for Sequence: A23035
Search from 1 to 451 where origin = 1
Date: February 20, 1997   Time: 00:48:27
Matrix: pam250 matrix
Score Region from 1 to 451
Maximum possible score: 2265
Database: UserFolder: tubulins

     Sequence  Opt.  Init.  Description
 1.  A23035    2265  2265   Tubulin alpha chain - Human
 2.  A25873    2193  2176   Tubulin alpha chain - Human
 3.  UBUTA     1995  1995   Tubulin alpha chain - Trypanosoma brucei rhodesiense
 4.  A25601    1952  1948   Tubulin alpha chain - Slime mold (Physarum polycephalum)
 5.  UBUTB     1058   754   Tubulin beta chain - Trypanosoma brucei rhodesiense
 6.  A29141    1051   755   Tubulin beta chain - Chlamydomonas reinhardtii
 7.  A26561    1050   773   Tubulin beta chain - Human
 8.  UBPGB     1044   765   Tubulin beta chain - Pig
 9.  UBCHB     1043   765   Tubulin beta chain, embryonic - Chicken
10.  A25377    1022   762   Tubulin beta chain - Neurospora crassa
11.  UBBYB     1002   699   Tubulin beta chain - Yeast (Saccharomyces cerevisiae)
12.  UBURAL     826   826   Tubulin alpha chain - Sea urchin (Lytechinus pictus) (fragm
13.  A25342     532   361   Tubulin beta chain - Slime mold (Physarum polycephalum)
14.  UBURB      354   182   Tubulin beta chain - Sea urchin (Lytechinus pictus) (fragme

Aligned Sequence Map for Tubulins
[Figure: graphical aligned sequence map]

Pairwise Multiple Sequence Alignment Example 2
- Input: Folder containing sequence files for just three tubulins
- Task: Compare using different query sequences and examine aligned sequence listing for differences
- Open first of three sequences
- Align to Folder with itself and two other tubulins

Alignment List

Search Analysis for Sequence: A23035
Search from 1 to 451 where origin = 1
Date: February 20, 1997   Time: 00:51:34
Matrix: pam250 matrix
Score Region from 1 to 451
Maximum possible score: 2265
Database: UserFolder: subset of tubulins

Residues 1 to 60 of the query, in blocks of 10 (lowercase letters mark residues that differ from the query):

Query  A23035           MRECISIHVG QAGVQIGNAC WELYCLEHGI QPDGQMPSDK TIGGGDDSFN TFFSETGAGK
1.     A23035 [ 2265 ]  MRECISIHVG QAGVQIGNAC WELYCLEHGI QPDGQMPSDK TIGGGDDSFN TFFSETGAGK
2.     A25601 [ 1952 ]  MREvISIHiG QAGtQvGNAC WELYCLEHGI QPDGQMPSDK svGyGDDaFN TFFSETGAGK
3.     UBUTB  [ 1058 ]  MREivcvqaG QcGnQIGskf WEvisdEHGv dPtGtyqgDs dl--qleriN vyFdEatgGr

Pairwise Multiple Sequence Alignment Example 2
- Close first, open second sequence and repeat Align to Folder

Alignment List

Search Analysis for Sequence: A25601
Search from 1 to 450 where origin = 1
Date: February 20, 1997   Time: 00:52:46
Matrix: pam250 matrix
Score Region from 1 to 450
Maximum possible score: 2161
Database: UserFolder: subset of tubulins

Query  A25601           MREVISIHIG QAGTQVGNAC WELYCLEHGI QPDGQMPSDK SVGYGDDAFN TFFSETGAGK
1.     A25601 [ 2161 ]  MREVISIHIG QAGTQVGNAC WELYCLEHGI QPDGQMPSDK SVGYGDDAFN TFFSETGAGK
2.     A23035 [ 1948 ]  MREcISIHvG QAGvQiGNAC WELYCLEHGI QPDGQMPSDK tiGgGDDsFN TFFSETGAGK
3.     UBUTB  [ 1014 ]  MREivcvqaG QcGnQiGskf WEvisdEHGv dPtGtyqgDs dlql-er-iN vyFdEatgGr

Pairwise Multiple Sequence Alignment Example 2

Residues 41 to 50 of the query in each alignment:

          1st alignment   2nd alignment
A23035    TIGGGDDSFN      tiGgGDDsFN
A25601    svGyGDDaFN      SVGYGDDAFN
UBUTB     dl--qleriN      dlql-er-iN

Note that a different (better) alignment of A25601 and UBUTB is obtained from direct comparison to each other (2nd alignment) than when each is compared indirectly via comparison to A23035 (1st alignment).

“True” Multiple Sequence Alignment
- Find optimal alignment of multiple sequences by considering all possible alignments of each position of k sequences
  - NP hard
- Approximation:
  - First do all k(k-1)/2 pairwise alignments (not just one sequence with all others)
  - Then combine pairwise alignments to generate overall alignment

Global vs. Local
- Just as for pairwise alignments, we can choose to find either a global multiple alignment (in which the final alignment is full length) or a local alignment that may match only blocks of conserved sequence shared among the sequences

How do we score a multiple sequence alignment?
- What score should we assign to each position in a multiple alignment?
- Each position corresponds to aligning three (or more) amino acids
- How good is matching L with S with T?
- Solution: Carrillo-Lipman sum of pairs method

How do we score a multiple sequence alignment?
- Any position in an MSA can be projected onto individual sequence pairs and the scores of those pairs summed to give the overall score for that position

Scoring Global Multiple Sequence Alignments
- SP method
- For a set of sequences x and a position k, sum scores using scoring function S for all pairs of sequences at that position

$$g(k) = \sum_{i} \sum_{j \ne i} S\left(x(k)_i,\, x(k)_j\right)$$

The pair counts on the next slide take each unordered pair once, giving n(n-1)/2 pairs for n sequences.

Problem with SP method
- Using SP method, what is the effect of adding one (or a small number) of sequences that don’t match at a given position to a set that matches perfectly?

$$g_{n_{match}=n} = S_{match}\,\frac{n(n-1)}{2}$$

$$g_{n_{match}=n-1} = S_{match}\left(\frac{n(n-1)}{2} - (n-1)\right) + S_{mismatch}\,(n-1)$$

$$\Delta g = (S_{match} - S_{mismatch})(n-1)$$

$$\frac{\Delta g}{g_{n_{match}=n}} = \frac{2}{n}\,\frac{S_{match} - S_{mismatch}}{S_{match}}$$

Problem with SP method
- Thus the larger the set of sequences, the smaller the relative effect of having one mismatch, since the ratio above falls as 1/n. This is counter to our expectation for a reasonable method.
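The SP column score and the relative effect of a single mismatch can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the toy values S_match = 2 and S_mismatch = -1 are assumptions chosen for illustration (a real analysis would use a matrix such as PAM250), and each unordered pair is counted once.

```python
from itertools import combinations

def toy_score(a, b, s_match=2, s_mismatch=-1):
    """Hypothetical scoring function S (an assumption, not PAM250 or BLOSUM)."""
    return s_match if a == b else s_mismatch

def sp_column_score(column, score=toy_score):
    """Sum-of-pairs score for one column, counting each unordered pair once."""
    return sum(score(a, b) for a, b in combinations(column, 2))

def relative_drop(n):
    """Relative score change when one of n matching residues is replaced."""
    all_match = sp_column_score("L" * n)
    one_mismatch = sp_column_score("L" * (n - 1) + "S")
    return (all_match - one_mismatch) / all_match

# Print the relative effect of one mismatch for growing sets of sequences.
for n in (3, 5, 10, 20):
    print(n, relative_drop(n))
```

With these toy values the ratio works out to 3/n, which agrees with (2/n)(S_match - S_mismatch)/S_match, so the printed values shrink as n grows.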
Alternative scoring approaches
- Use a phylogenetic tree to order the sequences and only sum scores for all nodes of the tree

Multiple Sequence Alignment programs
- MSA
  - attempts to find optimal alignment by simultaneous dynamic programming
  - available via web server
- ClustalW
  - progressive pairwise alignment of sequences
  - available via web server
  - included within MacVector

MSA
- Use pairwise alignments to build a tree showing how sequences could have diverged
- Use it to define region examined for multiple alignment
- Then use dynamic programming in that region

Weighting scores to de-emphasize distant relatives
- When scoring a position, MSA calculates weights for each pair of sequences using the tree

MSA
- Only possible for small number of short sequences
- U. Washington server can handle 8 sequences of 500 amino acids
- PSC supercomputer can handle 10 sequences of 1000 amino acids

CLUSTALW
- Group sequences by similarity and then align them progressively from most to least similar
- The grouping is done by building a phylogenetic tree (called a guide tree)
- Can be done for large number of long sequences even on desktop computers

ClustalW within MacVector
- Use same set of tubulins from the MacVector Sample Files folder
- Open each sequence
- Under Analyze, select ClustalW Alignment

ClustalW Guide Tree
- From pairwise alignments, build tree that links similar sequences
[Figure: guide tree with branch lengths; leaves are UBUTB, UBUTA, UBURAL, A23035, A25873, A25601, UBBYB, A25377, A25342, CVJB, FVVFBA, UBURB, UBCHB, UBPGB, A26561, A29141]
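The grouping step described on the CLUSTALW slide can be sketched as follows. This is a simplified illustration under stated assumptions: it uses plain average-linkage clustering rather than the tree-building method ClustalW itself applies, the function name `guide_tree` is hypothetical, and the labels and distances are made-up values, not taken from the tubulin data.

```python
def guide_tree(names, dist):
    """Average-linkage clustering (a simplification, not ClustalW's own method).

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) to a pairwise distance
    Returns nested tuples giving the order in which groups are joined.
    """
    # Each cluster is (subtree, list of member labels).
    clusters = [(name, [name]) for name in names]

    def avg_dist(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical labels and distances (assumptions for illustration only).
names = ["a1", "a2", "b1", "b2"]
dist = {
    frozenset(("a1", "a2")): 0.1,
    frozenset(("b1", "b2")): 0.2,
    frozenset(("a1", "b1")): 0.6,
    frozenset(("a1", "b2")): 0.6,
    frozenset(("a2", "b1")): 0.6,
    frozenset(("a2", "b2")): 0.6,
}
tree = guide_tree(names, dist)
print(tree)  # (('a1', 'a2'), ('b1', 'b2'))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the most similar pair first and adding the more distant groups last.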
ClustalW within MacVector
- Three ways to save results
  - Click Save button on initial “Aligned Sequences Formatted Alignments” screen (or do Save As... on “Aligned Sequences”)
    - get BW PICT file
  - Use Save As... on “Aligned Sequences Multiple Alignments”
    - get interleaved text file useful for printing but requires significant editing for input to other programs
  - Use Save As... on “Aligned Sequences” display
    - Select FAST format
    - get sequential text file useful for input to other programs

ClustalW graphical display (partial)
- Color (or shade of gray) shows type of amino acid

ClustalW interleaved file

Clustal W(1.4) multiple sequence alignment
16 Sequences Aligned.
Gaps Inserted = 114
Alignment Score = 145639
Conserved Identities = 1
Pairwise Alignment Mode: Slow
Pairwise Alignment Parameters:
  Open Gap Penalty = 10.0
  Extend Gap Penalty = 0.1
  Similarity Matrix: blosum
Multiple Alignment Parameters:
  Open Gap Penalty = 10.0
  Extend Gap Penalty = 0.1
  Delay Divergent = 40%
  Gap Distance = 8
  Similarity Matrix: blosum
Processing time: 21.2 seconds

Sequence labels, first block: UBUTB UBUTA UBURB UBURAL UBPGB UBCHB UBBYB FVVFBA CVJB A29141 A26561 A25873 A25601 A25377 A25342 A23035

Aligned rows, first block:
MREIVCVQAGQCGNQIGSKFWEVISDEHGVDPTGTYQGDSDLQL--ERINVYFDEATGGRYVPRSVLIDLEPGTMDSVRAGPYGQIFRPDNFIFGQSGAG
MREAICIHIGQAGCQVGNACWELFCLEHGIQPDGAMPSDKTIGVEDDAFNTFFSETGAGKHVPRAVFLDLEPTVVDEVRTGTYRQLFHPEQLISGKEDAA
MREIVHIQAGQCGNQIGAKFWEVISDEHGIDPTGSYHGDSDLQL--ERINVYYNEAAGNKYVPRAILVDLEPGTMDSVRSGPFGQIFRPDNFVFGQSGAG
MREIVHIQAGQCGNQIGAKFWEVISDEHGIDPTGSYHGDSDLQL--ERINVYYNEATGNKYVPRAILVDLEPGTMDSVRSGPFGQIFRPDNFVFGQSGAG
MREIIHISTGQCGNQIGAAFWETICGEHGLDFNGTYHGHDDIQK--ERLNVYFNEASSGKWVPRSINVDLEPGTIDAVRNSAIGNLFRPDNYIFGQSSAG
TDE--I--------------------TSFS-------IP--KFR---P---D--------Q---P-NLIF-Q---G
ADT--I------------------VAVELDT------YPNTDIGD--PS--------------YP----------
MREIVHIQGGQCGNQIGAKFWEVVSDEHGIDPTGTYHGDSDLQL--ERINVYFNEATGGRYVPRAILMDLEPGTMDSVRSGPYGQIFRPDNFVFGQTGAG
MREIVHIQAGQCGNQIGAKFWEVISDEHGIDPTGTYHGDSDLQL--DRISVYYNEATGGKYVPRAILVDLEPGTMDSVRSGPFGQIFRPDNFVFGQSGAG
MRECISVHVGQAGVQMGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFTTFFCETGAGKHVPRAVFVDLEPTVIDEIRNGPYRQLFHPEQLITGKEDAA
MREVISIHIGQAGTQVGNACWELYCLEHGIQPDGQMPSDKSVGYGDDAFNTFFSETGAGKXXXXAVFLDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAA
MREIVHLQTGQCGNQIGAAFWQTISGEHGLDASGVYNGTSELQL--ERMNVYFNEASGNKYVPRAVLVDLEPGTMDAVRAGPFGQLFRPDNFVFGQSGAG
MREIVHIQAGQCGNQIGAKFWEVISDEHGIDPTGSYHGDSDLQL--ERINVYYNEATGGKYVPRAVLVDLEPGTMDSVRAGPFGQIFRPDNFVFGQTGAG
MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGKHVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAA

Sequence labels, second block: UBUTB UBUTA UBURB UBURAL UBPGB UBCHB

Aligned rows, second block:
NNWAKGHYTEGAELIDSVLDVCCKEAESCDCLQGFQICHSLGGGTGSGMGTLLISKLREQYPDRIMMTFSIIPSPKVSDTVVEPYNTTLSVHQLVENSDE
NNYARGHYTIGKEIVDLCLDRIRKLADNCTGLQGFLVYHAVGGGTGSGLGALLLERLSVDYGKKSKLGYTVYPSPQVSTAVVEPYNSVLSTHSLLEHTDV
NNWAKGHYTEGAELVDSVLDVVRKESESCDCLQGFQLTHSLGGGTGSGMGTLLISKIREEYPDRIMNTFSVVPSPKVSDTVVEPYNATLSVHQLVENTDE
NNWAKGHYTEGAELVDSVLDVVRKESESCDCLQGFQLTHSLGGGTGSGMGTLLISKIREEYPDRIMNTFSVMPSPKVSDTVVEPYNATLSVHQLVENTDE

ClustalW sequential file

MREIVCVQAGQCGNQIGSKFWEVISDEHGVDPTGTYQGDSDLQL--ERINVYFDEATGGR
YVPRSVLIDLEPGTMDSVRAGPYGQIFRPDNFIFGQSGAGNNWAKGHYTEGAELIDSVLD
VCCKEAESCDCLQGFQICHSLGGGTGSGMGTLLISKLREQYPDRIMMTFSIIPSPKVSDT
VVEPYNTTLSVHQLVENSDESMCIDNEALYDICFRTLKLTTPTFGDLNHLVSAVVSGVTC
CLRFPGQLNSDLRKLAVNLVPFPRLHFFMMGFAPLTSRGSQQYRGLSVPELTQQMFDAKN
MMQAADPRHGRYLTASALFRGRMSTKEVDEQMLNVQNKNSSYFIEWIPNNIKSSVCDIP---PKG----LKMAVTFIGNNTCIQEMFRRV-GEQFTLMFRRKAFLHWYTGEGMDEMEFT
EAESNMNDLVSEYQQYQDATIEEEGE------FDEEEQY--------
MREAICIHIGQAGCQVGNACWELFCLEHGIQPDGAMPSDKTIGVEDDAFNTFFSETGAGK
HVPRAVFLDLEPTVVDEVRTGTYRQLFHPEQLISGKEDAANNYARGHYTIGKEIVDLCLD
RIRKLADNCTGLQGFLVYHAVGGGTGSGLGALLLERLSVDYGKKSKLGYTVYPSPQVSTA
VVEPYNSVLSTHSLLEHTDVAAMLDNEAIYDLTRRNLDIERPTYTNLNRLIGQVVSSLTA
SLRFDGALNVDLTEFQTNLVPYPRIHFVLTSYAPVISAEKAYHEQLSVSEISNAVFEPAS
MMTKCDPRHGKYMACCLMYRGDVVPKDVNAAVATIKTKRTIQFVDWSPTGFKCGINYQPP
TVVPGGDLAKVQRAVCMIANSTAIAEVFARI-DHKFDLMYSKRAFVHWYVGEGMEEGEFS
EAREDLAALEKDYEEVGAESADMDGE--------EDVEEY----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------FAPLTSRGSQQYRALTVSELTQQMFDAKN
MMAACDPRHGRYLTVAAIFRGRMSMKEVDEQMLNVQNKNSSYFVEWIPNNVKTAVCDIP---PRG----LKMSATFIGNSTAIQELFKRI-SEQFTAMFRRKAFLHWYTGEGMDEMEFT
EAESNMNDLVSEYQQYQDATAEEEGE------FDEEEGDEEAA-----

Multiple Alignment Servers
- CLUSTALW
  - http://www.bioinformatics.nl/tools/clustalw.html
- BCM Multiple Sequence Alignment Launcher
  - http://searchlauncher.bcm.tmc.edu/multi-align/mult
- ExPASy
  - http://www.expasy.org/tools/#align

Reading for next class
- Mount, Chapter 6 through p. 259
- Altschul paper

...

This note was uploaded on 01/13/2012 for the course BIO 101 taught by Professor Staff during the Fall '10 term at DePaul.
