seq-comparison


Bio-sequence Comparison
Ying Xu (徐鹰)

Bio-molecules
• Three major types of bio-molecules in our cells
  – nucleotides (DNA, RNA)
  – proteins
  – (poly)sugar

Bio-sequences
• The first two classes of bio-molecules have linear structures, so they can be represented as bio-sequences
  – DNA sequences (consisting of four types of letters: A, C, G, T)
  – RNA sequences (consisting of four types of letters: A, C, G, U)
  – protein sequences (consisting of 20 types of letters)

DNA sequence
ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtgggtagtagctgatatgatgcga
ggtaggggataggatagcaacagatgagcggatgctgagtgcagtggcatgcgatgtcgatgatagcggta
ggtagacttcgcgcataaagctgcgcgagatgattgcaaagragttagatgagctgatgctagaggtcagtg
actgatgatcgatgcatgcatggatgatgcagctgatcgatgtagatgcaataagtcgatgatcgatgatgatg

protein sequence
SAANLEYLKNVLLQFIFLKPG-SERERLLPVINTMLQLSPEEKGKLAAV
NEKNMEYLKNVFVQFLKPESVPAERDQLVIVLQRVLHLSPKEVEILKAA

Bio-sequence Comparison
• Bio-sequence comparison is one of the most basic problems in bioinformatics
• The basic computational problem is to determine whether two sequences are “similar” or partially similar, and how similar they are
  – AACGGTA versus ATCGGGT

DNA Sequence Comparison through Sequence Alignment
• Defining DNA sequence (dis)similarity in terms of two parameters, gaps and mismatches
• Example 1: AACG and AACG
    AACG
    ||||
    AACG
• Example 2: AAGG and AACG
    AAGG
    || |
    AACG
    1 mismatch
• Example 3: AACGGTATGC and ATCGGGTTGC
    AACG-GTATGC
    ATCGGGT-TGC
    2 gaps and 1 mismatch

DNA Sequence Alignment
• Best alignment: the alignment of two sequences with the smallest possible number of mismatches and gaps
• Score: each matched position: +2; each mismatch or gap: -1
    AACG
    ||||
    AACG
    score = 8

    AAGG
    || |
    AACG
    score = 5

    AACG-GTATGC
    ATCGGGT-TGC
    score = 13

Protein Sequence Alignment
• Protein sequence alignment: it is more complex to measure protein sequence similarity than that of DNA sequences
  – protein sequence alignment: “degree” of similarity
• Each pair of amino acids has a similarity score, which varies for different amino acids
  – Example: (A, A) = 4; (R, R) = 5; (A, R) = -1; (C, A) = 0

Blosum Matrix
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R     5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N        6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D           6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C              9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q                 5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E                    5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G                       6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H                          8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I                             4  2 -3  1  0 -3 -2 -1 -3 -1  3
L                                4 -2  2  0 -3 -2 -1 -2 -1  1
K                                   5 -1 -3 -1  0 -1 -3 -2 -2
M                                      5  0 -2 -1 -1 -1 -1  1
F                                         6 -4 -2 -2  1  3 -1
P                                            7 -1 -1 -4 -3 -2
S                                               4  1 -3 -2 -2
T                                                  5 -2 -2  0
W                                                    11  2 -3
Y                                                        7 -1
V                                                           4

Protein Sequence Alignment
• Aligning protein sequences (gap = -5)
  – FDSKTHRGHR and FESYWTHGHR
    FDSK-THRGHR
    :.:  :: :::
    FESYWTH-GHR
    Score: 6+2+4-2-5+5+8-5+6+8+5 = 32

    FDSKTHRGHR - FESYWTHWHR
    Score: -5-3+0+0-2-2-1-5-2-2+0-5 = -27
Amino acids with similar physicochemical properties have higher similarity scores among them

Sequence Alignment
• We can possibly find the best alignment between two sequences of 5 or 10 letters by hand, on paper
• .. but how about genes that are hundreds to thousands of letters long?
• .. and how about attempting to align two genomes with millions up to billions of letters?
We need sequence alignment algorithms ..
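The scoring rules on the DNA and protein alignment slides can be checked mechanically. What follows is a minimal Python sketch, not part of the original lecture: the function names and the `BLOSUM_SUBSET` table are illustrative assumptions, and the table holds only the pairs needed for the slide example, copied from the matrix above. It uses (H, H) = 8 from that matrix for both H/H columns, so the protein example sums to 32.

```python
# Partial substitution table: only the pairs needed for the slide example,
# copied from the BLOSUM matrix above (an assumption of this sketch; it is
# not the full 20x20 matrix).
BLOSUM_SUBSET = {
    ("F", "F"): 6, ("D", "E"): 2, ("S", "S"): 4, ("K", "Y"): -2,
    ("T", "T"): 5, ("H", "H"): 8, ("G", "G"): 6, ("R", "R"): 5,
}


def dna_pair_score(a, b):
    """Slide rule for DNA: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


def protein_pair_score(a, b):
    """Look up a pair in the partial table; the matrix is symmetric."""
    if (a, b) in BLOSUM_SUBSET:
        return BLOSUM_SUBSET[(a, b)]
    return BLOSUM_SUBSET[(b, a)]  # raises KeyError for pairs not in the subset


def score_alignment(top, bottom, pair_score, gap):
    """Sum the column scores of a gapped alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        else:
            total += pair_score(a, b)
    return total


if __name__ == "__main__":
    print(score_alignment("AACG", "AACG", dna_pair_score, gap=-1))
    print(score_alignment("AAGG", "AACG", dna_pair_score, gap=-1))
    print(score_alignment("AACG-GTATGC", "ATCGGGT-TGC", dna_pair_score, gap=-1))
    print(score_alignment("FDSK-THRGHR", "FESYWTH-GHR", protein_pair_score, gap=-5))
```

Passing the pair-scoring function as an argument keeps one routine for both DNA and protein alignments; only the scoring table and the gap penalty change.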
Existing Sequence Alignment Algorithms
• Needleman and Wunsch (1970)
• Smith and Waterman (1981)
• FASTA (1988)
• BLAST (1990, 1997)
• PatternHunter (2002)

The Basic Idea of Sequence Alignment
• Find an alignment between two sequences that minimizes the weighted gap and mismatch penalties
• This problem can be solved using a dynamic programming algorithm

Sequence Alignment by DP
Two sequences: AACG and AAGG
    AAGG
    || |
    AACG
Step #1: calculating alignment matrix
      A   A   C   G
  A   2   1  -3  -4
  A   1   4   3   2
  G  -3   3   3   5
  G  -4   2   2   5
Rule:
1: initialization: fill the first row and column with matching scores
2: fill an empty cell based on the scores of its left, upper and upper-left neighbors + the matching score of the current cell
3: choose the one giving the highest score

Sequence Alignment by DP
Step #2: Tracing back to recover the alignment
      A   A   C   G
  A   2   2  -3  -4
  A   2   4   3   2
  G  -3   3   3   5
  G  -4   2   2   5
Rule:
1: start from the lower-right corner
2: trace back to the left, upper or upper-left neighbor which gives the current cell’s score
3: keep doing this until it cannot continue

Sequence Alignment by DP
Two sequences: AACGGTA and ATCGGGT
    AACG-GTA
    ATCGGGT-
Step #1: calculating alignment matrix
[Alignment matrix for AACGGTA and ATCGGGT; cell values as extracted, row and column labels not recoverable]
   2  -2  -3  -4  -5  -6  -7
   1   1   0  -1  -2  -3  -4
  -3   0   3   2   1   0  -1
  -4  -1   2   5   4   3   2
  -5  -2   1   4   7   6   5
  -6  -3   0   3   6   6   8
  -3  -1   2   5   5   7  -4

Sequence Alignment by DP
Step #2: Tracing back to recover the alignment
[Alignment matrix as in Step #1]
    AACG-GTA
    ATCGGGT-

Applications
• Sequence alignments have been used for solving MANY biological problems
• The basic information an alignment provides is whether two sequences are homologous, i.e., whether they have evolved from a common ancestor
• It also provides (semi-)quantitative information about evolutionary distances among different sequences

Applications
• Actually it provides much more information, as we will show in the next few lectures …
  – taxonomy information
  – molecular structure information
  – molecular functional information
  – …..

Applications
• Functional prediction: homologous proteins generally have the same or similar functions
• Functional site prediction: conserved sites among aligned sequences tend to be functional
• Structural prediction: homologous proteins tend to have similar structures

Homologous Genes
• Using sequence alignment information to determine if two genes are homologous
As genes evolve, homologous genes accumulate mutations across their sequences, so their sequence similarity goes down …

Homologous Genes
• How can we tell if genes are homologous, given that they went their separate ways billions of years ago?
SAANLEYLKNVLLQFIFLKPG--SERERLLPVINTMLQLSPEEKGKLAAV
NEKNMEYLKNVFVQFLKPESVP-AERDQLVIVLQRVLHLSPKEVEILKAA
KNEKIAYIKNVLLGFLEHKE----QRNQLLPVISMLLQLDSTDEKRLVMS
REINFEYLKHVVLKFMSCRES---EAFHLIKAVSVLLNFSQEEENMLKET
EPTEFEYLRKVMFEYMMGR-----ETKTMAKVITTVLKFPDDQAQKILER
DPAEAEYLRNVLYRYMTNRESLGKESVTLARVIGTVARFDESQMKNVISS
STSEIDYLRNIFTQFLHSMGSPNAASKAILKAMGSVLKVPMAEMKIIDKK

Detection of Homology
• Does a higher sequence alignment score always mean a higher probability of being homologous?
• No, sequence alignment scores depend not only on the quality of an alignment but also on sequence compositions

Detection of Homology
Query sequence: AAAA
Sequence #1: AATTAATACATTAATATAATAAAATTACTGA
Sequence #2: CGGTAGTACGTAGTGTTTAGTAGCTATGAA
Which of these two sequences has a better chance of matching the query sequence well after being randomly reshuffled?

Detection of Homology
E-value
One way to assess the true quality of a particular alignment is to derive the background alignment-score distribution of “similar” sequences with the same “letter” composition.
[Figure: background alignment-score distribution, with regions labelled “not significant” and “significant”]

Detection of Homology
[Figure: example alignments; scores shown: -1500, -720, -1120, -900; E-values shown: e-1, e-2, 0.5 e-1, e-21]
The lower an E-value, the higher the statistical significance of an alignment and the more probable it is that the sequences are homologous

Multiple Sequence Alignment
• The goal is to find the best alignment among multiple sequences
• This is a highly challenging and unsolved problem!
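Two ideas from the slides above, alignment by dynamic programming and a background score distribution from sequences with the same letter composition, can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the lecture's own implementation: it uses the textbook Needleman-Wunsch layout with an extra row and column for leading gaps (which may differ from the matrix initialization shown on the slides), the DNA scores +2/-1/-1 from the slides, and a simple shuffle-based empirical p-value rather than the E-value that BLAST reports. The names `global_align` and `shuffle_pvalue` are hypothetical.

```python
import random

MATCH, MISMATCH, GAP = 2, -1, -1  # DNA scores from the slides


def pair_score(a, b):
    return MATCH if a == b else MISMATCH


def global_align(s, t):
    """Needleman-Wunsch global alignment: return (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # dp[i][j] = best score for aligning s[:i] with t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(s[i - 1], t[j - 1]),  # upper-left
                dp[i - 1][j] + GAP,                                 # upper
                dp[i][j - 1] + GAP,                                 # left
            )
    # Trace back from the lower-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + pair_score(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + GAP:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def shuffle_pvalue(query, target, trials=200, seed=0):
    """Empirical p-value: fraction of shuffled targets scoring >= the observed score."""
    rng = random.Random(seed)
    observed = global_align(query, target)[0]
    letters = list(target)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps the letter composition, destroys the order
        if global_align(query, "".join(letters))[0] >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)


if __name__ == "__main__":
    print(global_align("AACG", "AAGG"))        # score 5
    print(global_align("AACGGTA", "ATCGGGT"))  # score 7
    print(shuffle_pvalue("AACGGTA", "ATCGGGT"))
```

When several alignments share the best score, the traceback returns one of them, so the gap placement may differ from the alignment drawn on the slide even though the score is the same.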
Take-Home Message
• Sequence alignment is the basis for biological data analysis
• The techniques for pair-wise sequence alignment are fairly mature now
• The main challenging issues lie in developing efficient algorithms for doing multiple sequence alignments

Homework
• Find the homologous sequences using NCBI BLAST for the following sequence
MSEEKPKEGVKTENDHINLKVAGQDGSVVQFKIKRHTPLSKLMKAYCERQGLS
MRQIRFRFDGQPINETDTPAQLEMEDEDTIDVFQQQTGGVPESSLAGHSF
1. Use protein-protein BLAST (BLASTp)
2. Paste the query sequence into the “Search” window
3. Click on “BLAST”
4. Click on the “red” bar in the result page ...
This note was uploaded on 06/16/2011 for the course BIO 127 taught by Professor Xuyin during the Spring '10 term at Georgetown.