LecturesPart02

Computational Biology, Part 2
Sequence Comparison with Dot Matrices
Robert F. Murphy
Copyright © 1996, 1999-2006. All rights reserved.

Sequence Alignment
- Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences
  - Pair-wise alignment: compare two sequences
  - Multiple sequence alignment: compare more than two sequences

Example sequence alignment
- Task: align “abcdef” with “abdgf”
- Write second sequence below the first

    abcdef
    abdgf

- Move sequences to give maximum match between them
- Show characters that match using vertical bar

Example sequence alignment

    abcdef
    ||
    abdgf

- Insert gap between b and d on lower sequence to allow d and f to align

Example sequence alignment

    abcdef
    || | |
    ab-dgf

- Note e and g don’t match

Matching Similarity vs. Identity
- Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
- More on how to define similarity later

Global vs. Local Alignment
- We distinguish
  - Global alignment algorithms, which optimize overall alignment between two sequences
  - Local alignment algorithms, which seek only relatively conserved pieces of sequence
    - Alignment stops at the ends of regions of strong similarity
    - Favors finding conserved patterns in otherwise different pairs of sequences

Global vs. Local Alignment
- Global

    LGPSSKQTGKGS-SRIWDN
    | | ||| || |||
    LN-ITKSAGKGAIMRLGDA

- Local

    --------GKG-------
            |||
    --------GKG-------

Global vs. Local Alignment
- Global

    LGPSSKQTGKGS-SRIWDN
    | | ||| || |||
    LN-ITKSAGKGAIMRLGDA

- Local

    -------TGKG-------
            |||
    -------AGKG-------

Why do sequence alignments?
- To find whether two (or more) genes or proteins are evolutionarily related to each other
- To find structurally or functionally similar regions within proteins

Origin of similar genes
- Similar genes arise by gene duplication
- Copy of a gene inserted next to the original
- Two copies mutate independently
- Each can take on separate functions
- All or part can be transferred from one part of genome to another

Methods for Pairwise Alignment
- Dot matrix analysis
- Dynamic Programming
- Word or k-tuple methods (FASTA and BLAST)

Sequence comparison with dot matrices
- Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

Sequence comparison with dot matrices
- Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.

Examples for protein sequences
- (Demonstration A6, Sequence 1 vs. 2)
- (Demonstration A6, Sequence 2 vs. 3)

Interpretation of dot matrices
- Regions of similarity appear as diagonal runs of dots
- Reverse diagonals (perpendicular to diagonal) indicate inversions
- Reverse diagonals crossing diagonals (Xs) indicate palindromes
  - (Demonstration A6, Sequence 4 vs. 4)

Interpretation of dot matrices
- Can link or "join" separate diagonals to form alignment with "gaps"
  - Each a.a. or base can only be used once
    - Can't trace vertically or horizontally
    - Can't double back
  - A gap is introduced by each vertical or horizontal skip

Uses for dot matrices
- Can use dot matrices to align two proteins or two nucleic acid sequences
- Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
  - Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
    - (Demonstration A6, Sequence 5 vs. 6)

Uses for dot matrices
- Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
- Excellent approach for finding sequence transpositions

Filtering to remove “noise”
- A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (i.e., one A)
- Solution: use a window and a threshold
  - compare character by character within a window (have to choose window size)
  - require certain fraction of matches within window in order to display it with a “dot”

Example spreadsheet with window
- (Demonstration A7)

How do we choose a window size?
- Window size changes with goal of analysis
  - size of average exon
  - size of average protein structural element
  - size of gene promoter
  - size of enzyme active site

How do we choose a threshold value?
- Threshold based on statistics
  - using shuffled actual sequence
    - find average (m) and s.d. (σ) of match scores of shuffled sequence
    - convert original (unshuffled) scores (x) to Z scores
      - Z = (x - m)/σ
    - use threshold Z of 3 to 6
  - using analysis of other sets of sequences
    - provides “objective” standard of significance

Dot matrix analysis with DNA Strider (Mount, Fig 3.4)
- Get phage λ cI and phage P22 c2 repressor sequences from GenBank (X00166 and V01153 respectively)
- Use DNA Strider 1.4 (contact TA to get a copy)
- Use window size of 11 and stringency of 7

Dot matrix (Mount Fig 3.4)
[Figure: dot matrix plot; axis ticks every 100, running to 600 and 700]
- Note set of diagonals in lower right that do not line up due to insertion near 475 on cI

Dot matrix analysis with DNA Strider (Mount, Fig 3.6)
- Get human LDL receptor protein sequence from GenBank (P01130)
- Use weighting “Identity”
- Use window size of 1 and stringency of 1
- Use window size of 23 and stringency of 7

Dot matrix (Mount Fig 3.6)
[Figure: dot matrix plot; axis ticks every 100, up to 800 on both axes]
- W=1 S=1
- Note set of stacked diagonals in upper left

Dot matrix (Mount Fig 3.6)
[Figure: dot matrix plot; axis ticks every 100, up to 800 on both axes]
- W=23 S=7
- Note set of stacked diagonals in upper left

Reading for next class
- Mount, Chapter 3 through page 93
- Look over paper by Needleman and Wunsch on web site
- (03-510/710) Durbin et al, pp 17-32 ...
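
The basic dot matrix method, the window-and-stringency filter, and the Z-score formula from the slides above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the course demonstrations or DNA Strider. The function names are made up for this example. Placing the dot at the window's start position and using the sample standard deviation for σ are both assumptions, since the slides do not specify either.

```python
from statistics import mean, stdev


def dot_matrix(seq_left, seq_top):
    """Basic method: one row per element of seq_left (down the left side),
    one column per element of seq_top (across the top).
    A cell is True (a dot) if and only if the two elements are the same."""
    return [[r == c for c in seq_top] for r in seq_left]


def windowed_dot_matrix(seq_left, seq_top, window, stringency):
    """Filtered dot matrix.

    For each pair of start positions (i, j), compare the two windows of
    length `window` character by character along the diagonal, and place
    a dot only if at least `stringency` of those positions match.
    Assumption: the dot is recorded at the window's start position.
    """
    n_rows = max(len(seq_left) - window + 1, 0)
    n_cols = max(len(seq_top) - window + 1, 0)
    dots = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(
                seq_left[i + k] == seq_top[j + k] for k in range(window)
            )
            row.append(matches >= stringency)
        dots.append(row)
    return dots


def z_score(x, shuffled_scores):
    """Z = (x - m) / sigma, with m and sigma taken from the match scores
    of shuffled sequences. Assumption: sigma is the sample s.d."""
    m = mean(shuffled_scores)
    sigma = stdev(shuffled_scores)
    return (x - m) / sigma


def render(dots):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


if __name__ == "__main__":
    # The slide example: "abcdef" across the top, "abdgf" down the left.
    print(render(dot_matrix("abdgf", "abcdef")))
```

With a window of 1 and a stringency of 1 the filtered matrix is identical to the basic one, which corresponds to the W=1 S=1 setting used for Mount Fig 3.6. A Z score computed this way would then be compared against the threshold of 3 to 6 suggested in the slides.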