Computational Biology, Part 2
Sequence Motifs

Robert F. Murphy
Copyright 1996, 1999-2009. All rights reserved.

Slides from Chapter 4
- Ch04_Motifs_mod.ppt

Describing features using frequency matrices
- Goal: Describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences
- Need to describe how often particular bases are found at particular positions in a sequence feature

Describing features using frequency matrices
- Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature

Frequency matrices (continued)
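As a rough illustration of the definition above, the following sketch builds a frequency matrix from a small aligned set using only base MATLAB. The sequences and variable names are hypothetical and are not taken from the slides.

```matlab
% Hypothetical aligned DNA sequences, one per row (not from the slides).
seqs = ['TATAAT'; 'TATGAT'; 'TACAAT'; 'TATACT'];
alphabet = 'ACGT';                 % n = 4 characters
[numSeqs, m] = size(seqs);         % m = feature length
n = length(alphabet);

% F(j,i) = fraction of sequences with alphabet(j) at position i.
F = zeros(n, m);
for j = 1:n
    F(j, :) = sum(seqs == alphabet(j), 1) / numSeqs;
end

disp(F);
```

Each column of F sums to 1, because every sequence contributes exactly one character at each aligned position.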
- Three uses of frequency matrices
  - Describe a sequence feature
  - Calculate probability of occurrence of feature in a random sequence
  - Calculate degree of match between a new sequence and a feature

Matlab Demonstration
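% Read some aligned sequences provided with the Bioinformatics Toolbox
seqs = fastaread('pf00002.fa');
% Display the aligned sequences
seqdisp(seqs);
% Compute the frequency matrix P and its symbol list S for columns 4 to 13
startposition = 4;
endposition = 13;
[P,S] = seqprofile(seqs,'limits',[startposition endposition]);
% Print a header row of position numbers, then one row of frequencies per symbol
disp([' ' sprintf('%2d ',[1:size(P,2)])]);
for i=1:length(S)
    disp([S(i) ' ' sprintf('%4.3f ',P(i,:))])
end
% Draw the sequence logo for the same columns
seqlogo(seqs,'startat',startposition,'endat',endposition,'alphabet','aa');

Frequency matrix

Logo Example

Logos for displaying sequence motifs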
- http://www.ccrnp.ncifcrf.gov/~toms/sequencelogo.html
- Free logo maker at http://weblogo.berkeley.edu/

Frequency Matrices, PSSMs, and Profiles
- A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores
- PSSMs are also called Position Weight Matrices (PWMs) or Profiles

Methods for converting frequency matrices to PSSMs
- Using log ratio of observed to expected:
    $$\mathrm{score}(j,i) = \log \frac{m(j,i)}{f(j)}$$
  - where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)
- Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]

Pseudocounts
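A minimal sketch of the log-ratio conversion, assuming base-2 logarithms and a uniform background f(j). The matrix values, the pseudocount value, and the renormalization step are all hypothetical choices for illustration. The sketch also shows why zero frequencies are a problem and one simple way of replacing them with a small number.

```matlab
% Hypothetical 4-by-3 frequency matrix m(j,i); rows are A, C, G, T.
M = [0.75 0.00 0.25;
     0.25 0.00 0.25;
     0.00 0.00 0.25;
     0.00 1.00 0.25];
f = [0.25; 0.25; 0.25; 0.25];      % assumed background frequencies f(j)

% Direct log ratio: entries with zero frequency score -Inf.
rawScores = log2(M ./ repmat(f, 1, size(M, 2)));

% Pseudocount fix: put a small number in the zero entries only, then
% renormalize each column so it still sums to 1 (an assumption here).
epsilon = 0.01;                    % hypothetical pseudocount value
Mp = M;
Mp(M == 0) = epsilon;
Mp = Mp ./ repmat(sum(Mp, 1), size(Mp, 1), 1);

pssm = log2(Mp ./ repmat(f, 1, size(Mp, 2)));
disp(pssm);
```

Positive scores mark characters seen more often than the background predicts, negative scores mark characters seen less often, and a column that matches the background exactly scores 0 throughout.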
- How do we get a score for a position with zero counts for a particular character? Can't take log(0).
- Solution: add a small number to all positions with zero frequency

Finding occurrences of a sequence feature using a Profile
- As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
- For each candidate position, we calculate a score by "looking up" the profile value corresponding to the base at each position of the feature and summing these values

Block Diagram for Building a PSSM: Aligned Sequences
Inputs: Set of Aligned Sequence Features; Expected frequencies of each sequence element
Process: PSSM builder
Output: PSSM

Block Diagram for Building a PSSM: Unaligned Sequences
Inputs: Set of unaligned sequences; Parameters for aligning (i.e., expected length); Expected frequencies of each sequence element
Process: PSSM builder
Output: PSSM

Block Diagram for Searching with a PSSM
Inputs: PSSM; Threshold; Set of Sequences to search
Process: PSSM search
Outputs: Sequences that match above threshold; Positions and scores of matches

Block Diagram for Searching for sequences related to a family with a PSSM
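The search stage described above could be sketched roughly as follows, using only base MATLAB. The PSSM values, target sequence, and threshold are made up for illustration. Every window of the target is scored by summing the looked-up values, and windows at or above the threshold are reported.

```matlab
% Hypothetical 4-by-3 PSSM (rows A, C, G, T); values are made up.
pssm = [ 1.5 -2.0  0.0;
        -1.0 -2.0  0.0;
        -2.0 -2.0  0.0;
        -2.0  2.0  0.0];
alphabet = 'ACGT';
target = 'GGATCATGAC';             % hypothetical sequence to search
threshold = 3.0;                   % hypothetical cutoff

m = size(pssm, 2);                 % feature length
numWindows = length(target) - m + 1;
scores = zeros(1, numWindows);
for p = 1:numWindows
    window = target(p:p+m-1);
    for i = 1:m
        j = find(alphabet == window(i));   % PSSM row for this base
        scores(p) = scores(p) + pssm(j, i);
    end
end

hits = find(scores >= threshold);  % start positions of matches
disp([hits; scores(hits)]);        % row 1: positions, row 2: scores
```

With these made-up values, the windows starting at positions 3 and 6 (both ATC or ATG) score 3.5 and are the only ones reported.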
Inputs: Set of Aligned Sequence Features; Expected frequencies of each sequence element
Process: PSSM builder
Output: PSSM
Inputs: PSSM; Threshold; Set of Sequences to search
Process: PSSM search
Outputs: Sequences that match above threshold; Positions and scores of matches

Consensus sequences vs. PSSMs
- Should I use a consensus sequence or a frequency matrix to describe my site?
  - If all allowed characters at a given position are equally "good", use IUB codes to create consensus sequence
    - Example: Restriction enzyme recognition sites
  - If some allowed characters are "better" than others, use PSSM
    - Example: Promoter sequences

Consensus sequences vs. frequency matrices
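For the case where all allowed characters are equally good, a consensus written in IUB codes can be searched for with a regular expression. The sketch below translates the HincII recognition site GTYRAC into a pattern; the target sequence is hypothetical, and only the IUB codes needed for this example are mapped.

```matlab
% Map a few IUB ambiguity codes to regular-expression character classes.
% Only the codes needed here are included; the full IUB table has more.
codes   = {'A', 'C', 'G', 'T', 'R', 'Y', 'N'};
classes = {'A', 'C', 'G', 'T', '[AG]', '[CT]', '[ACGT]'};
iubMap  = containers.Map(codes, classes);

consensus = 'GTYRAC';              % HincII recognition site
pattern = '';
for k = 1:length(consensus)
    pattern = [pattern iubMap(consensus(k))]; %#ok<AGROW>
end

target = 'AAGTCGACTTGTTAACGG';     % hypothetical sequence to search
starts = regexp(target, pattern);  % start positions of all matches
disp(pattern);
disp(starts);
```

A consensus match is all or nothing: every position either fits the allowed set or it does not, so no score or threshold is involved.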
- Advantages of consensus sequences: smaller description, quicker comparison
- Disadvantage: lose quantitative information on preferences at certain locations

Reading for next class
- Jones/Pevzner Ch 6 through section 6.9 (p. 185)
- Read paper by Needleman and Wunsch on web site
- (recommended) Durbin et al, pp 17-32 ...
This note was uploaded on 12/03/2011 for the course BIO 118 taught by Professor Staff during the Fall '08 term at Rutgers.