Computational Biology, Part 8
Representing and Finding Sequence Features
Robert F. Murphy
Copyright © 1996-2006. All rights reserved.

Sequence Analysis Tasks

⇒ Representing sequence features, and finding sequence features using consensus sequences and frequency matrices

Definition

- A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function

Sequence features

- Features following an exact pattern
  - restriction enzyme recognition sites
- Features with approximate patterns
  - promoters
  - transcription initiation sites
  - transcription termination sites
  - polyadenylation sites
  - ribosome binding sites
  - protein features

Consensus sequences

- A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature
- Consensus sequences are regular expressions
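
The statement that consensus sequences are regular expressions can be made concrete with a short Python sketch. It assumes the standard IUB/IUPAC nucleotide ambiguity codes; the function names `consensus_to_regex` and `find_sites` and the test sequence are hypothetical, chosen only for illustration.

```python
import re

# Standard IUB/IUPAC nucleotide ambiguity codes and the bases each one allows.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def consensus_to_regex(consensus):
    """Translate an IUB consensus string into a regular expression."""
    parts = []
    for code in consensus.upper():
        bases = IUB[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # The lookahead matches without consuming characters, so overlaps are kept.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

acci = consensus_to_regex("GTMKAC")
hits = find_sites("GTMKAC", "AAGTCGACTTGTATACGG")  # made-up test sequence
```

Here `acci` is the character-class form of the AccI site, and `hits` lists where that site starts in the made-up sequence.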

Finding occurrences of consensus sequences

- Example: recognition site for a restriction enzyme
  - EcoRI recognizes GAATTC
  - AccI recognizes GTMKAC
- Basic Algorithm
  - Start with first character of sequence to be searched
  - See if enzyme site matches starting at that position
  - Advance to next character of sequence to be searched
  - Repeat previous two steps until all positions have been tested
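
The basic algorithm above can be sketched directly in Python, without a regular-expression library. This is a minimal illustration, not the demo program referred to in the lecture: the names `site_matches_at` and `basic_search` and both test sequences are hypothetical, and the IUB table is cut down to the codes these examples need (M = A or C, K = G or T).

```python
# Subset of the IUB ambiguity codes; the full table has more entries.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "K": "GT", "N": "ACGT"}

def site_matches_at(site, sequence, start):
    """True if every code in the site allows the base at the matching position."""
    if start + len(site) > len(sequence):
        return False  # the site would run off the end of the sequence
    return all(sequence[start + k] in IUB[code] for k, code in enumerate(site))

def basic_search(site, sequence):
    """Test every start position in turn and return the 0-based match positions."""
    positions = []
    for start in range(len(sequence)):  # advance one character at a time
        if site_matches_at(site, sequence, start):
            positions.append(start)
    return positions

ecori_hits = basic_search("GAATTC", "CCGAATTCGGGAATTC")    # made-up sequence
acci_hits = basic_search("GTMKAC", "AAGTCGACTTGTATACGG")   # made-up sequence
```

Each call returns the list of positions where matches occur, which is the output shown in the search block diagram.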

Interactive Demonstration

- (A1 Pattern matching demo)

Block Diagram for Search with a Consensus Sequence

Inputs: Consensus sequence (in IUB codes); Sequence to be searched
→ Search Engine →
Output: List of positions where matches occur

Describing features using frequency matrices

- Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences
- Need to describe how often particular bases are found in particular positions in a sequence feature

Describing features using frequency matrices

- Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
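
As an illustration of this definition, the following Python sketch builds an n by m frequency matrix from a set of aligned sequences. The function name `frequency_matrix` and the four example sites are hypothetical, invented only to show the calculation.

```python
def frequency_matrix(aligned, alphabet="ACGT"):
    """Return {character: [frequency at each position]}: n rows, m columns."""
    m = len(aligned[0])
    if any(len(seq) != m for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    counts = {ch: [0] * m for ch in alphabet}
    for seq in aligned:
        for i, ch in enumerate(seq):
            counts[ch][i] += 1
    # Divide counts by the number of sequences to get frequencies.
    return {ch: [c / len(aligned) for c in row] for ch, row in counts.items()}

# Hypothetical aligned examples of a 6-base feature (not real data).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
freq = frequency_matrix(sites)
```

Each column of `freq` sums to 1, and a position where all examples agree has frequency 1 for that base and 0 for the others.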

Frequency matrices (continued)

- Three uses of frequency matrices
  - Describe a sequence feature
  - Calculate probability of occurrence of feature in a random sequence
  - Calculate degree of match between a new sequence and a feature

Interactive Demonstration

- (A4 Frequency matrix demo)

Frequency Matrices, PSSMs, and Profiles

- A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores
- PSSMs are also called Position Weight Matrices (PWMs) or Profiles

Methods for converting frequency matrices to PSSMs

- Using log ratio of observed to expected

  $$\mathrm{score}(j, i) = \log \frac{m(j, i)}{f(j)}$$

  - where $m(j,i)$ is the frequency of character j observed at position i and $f(j)$ is the overall frequency of character j (usually in some large set of sequences)
- Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]

Pseudocounts

- How do we get a score for a position with zero counts for a particular character? Can't take log(0).
- Solution: add a small number to all positions with zero frequency
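
A minimal Python sketch of the log-ratio conversion with this pseudocount fix follows. Several things here are assumptions rather than part of the lecture: the base-2 logarithm, the pseudocount value 0.01, the function name `pssm_from_frequencies`, and the small example matrix. A common variant instead adds a pseudocount to every cell and renormalizes each column.

```python
import math

def pssm_from_frequencies(freq, background, pseudo=0.01):
    """score(j, i) = log2(m(j, i) / f(j)); zero frequencies become `pseudo`."""
    pssm = {}
    for ch, row in freq.items():
        # Adding the small number to a zero frequency gives exactly `pseudo`.
        pssm[ch] = [math.log2((m if m > 0 else pseudo) / background[ch]) for m in row]
    return pssm

# Hypothetical 3-position frequency matrix and uniform background (not real data).
freq = {
    "A": [0.0, 1.0, 0.25],
    "C": [0.0, 0.0, 0.25],
    "G": [0.0, 0.0, 0.25],
    "T": [1.0, 0.0, 0.25],
}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = pssm_from_frequencies(freq, background)
```

A base observed four times more often than expected scores 2, a base at its background frequency scores 0, and a base never observed gets a large negative but finite score.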

Finding occurrences of a sequence feature using a Profile

- As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches
- For each candidate position, we calculate a score by "looking up" the value corresponding to the base at each position of the profile and summing the looked-up values
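
The search described above can be sketched in Python as a sliding window that looks up and sums scores. The function name `scan_with_pssm`, the score values, the threshold and the target sequence are all hypothetical and serve only as an illustration.

```python
def scan_with_pssm(pssm, sequence, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    width = len(next(iter(pssm.values())))  # number of columns in the profile
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Look up the score of each base at its position in the profile and sum.
        score = sum(pssm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical scores for a 3-base feature (illustrative values only).
pssm = {
    "A": [-2.0, 1.5, -2.0],
    "C": [-2.0, -2.0, -2.0],
    "G": [-2.0, -2.0, 1.0],
    "T": [2.0, -1.0, 1.0],
}
hits = scan_with_pssm(pssm, "CCTAGGTATC", threshold=4.0)
```

Raising the threshold keeps only the best matches; lowering it also reports weaker ones.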

Interactive Demonstration

- (A5 Searching with Profile demo)

Block Diagram for Building a PSSM

Inputs: Set of aligned sequence features; Expected frequencies of each sequence element
→ PSSM builder →
Output: PSSM

Block Diagram for Searching with a PSSM

Inputs: PSSM; Threshold; Set of sequences to search
→ PSSM search →
Outputs: Sequences that match above threshold; Positions and scores of matches

Block Diagram for Searching for sequences related to a family with a PSSM

Inputs: Set of aligned sequence features; Expected frequencies of each sequence element
→ PSSM builder → PSSM
Inputs: PSSM; Threshold; Set of sequences to search
→ PSSM search →
Outputs: Sequences that match above threshold; Positions and scores of matches

Consensus sequences vs. frequency matrices

- Should I use a consensus sequence or a frequency matrix to describe my site?
  - If all allowed characters at a given position are equally "good", use IUB codes to create consensus sequence
    - Example: Restriction enzyme recognition sites
  - If some allowed characters are "better" than others, use frequency matrix
    - Example: Promoter sequences

Consensus sequences vs. frequency matrices

- Advantages of consensus sequences: smaller description, quicker comparison
- Disadvantage: lose quantitative information on preferences at certain locations

Sequence Analysis Tasks

⇒ Representing and finding sequence features using hidden Markov models

Markov chains

- If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain
- A Markov chain is defined as a sequence of states in which each state depends only on the previous state

Formalism for Markov chains

- $M = (Q, \pi, P)$ is a Markov chain, where
  - Q = vector (1,..,n) is the list of states
    - Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA
  - π = vector (p1,..,pn) is the initial probability of each state
    - $\pi(i) = p_{Q(i)}$ (e.g., $\pi(1) = p_A$ for DNA)
  - P = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (≡ dinucleotide probabilities)
    - $P(i,j) = p^*_{Q(i)Q(j)}$ (e.g., $P(1,2) = p^*_{AC}$ for DNA)
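
To make the formalism concrete, the sketch below estimates Q, π and P from a single DNA sequence by counting bases and dinucleotides. It is one simple way to obtain these quantities, not necessarily the one used in the course demos. The function name `estimate_markov_chain`, the uniform fallback for a base with no successor, and the example sequence are assumptions.

```python
from collections import Counter

def estimate_markov_chain(sequence, states="ACGT"):
    """Estimate (Q, pi, P) from one sequence by counting bases and dinucleotides."""
    Q = list(states)
    base_counts = Counter(sequence)
    pi = [base_counts[s] / len(sequence) for s in Q]   # pi(i) = p_Q(i)
    pair_counts = Counter(zip(sequence, sequence[1:]))  # dinucleotide counts
    P = []
    for prev in Q:
        row_total = sum(pair_counts[(prev, nxt)] for nxt in Q)
        if row_total:
            # Conditional probability of each next state given the previous state.
            row = [pair_counts[(prev, nxt)] / row_total for nxt in Q]
        else:
            # Assumption: use a uniform row if prev is never followed by anything.
            row = [1.0 / len(Q)] * len(Q)
        P.append(row)
    return Q, pi, P

Q, pi, P = estimate_markov_chain("ACGTACGGCATTAGCA")  # made-up sequence
```

Every row of `P` sums to 1, as the definition requires, and `pi` sums to 1 over the four bases.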

Generating Markov chains

- Given Q, π, P (and a random number generator), we can generate sequences that are members of the Markov chain M
- If π, P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others
- If π, P are derived from a sampled set of sequences, the family of sequences generated by M will be the population from which that set has been sampled
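
Generation from Q, π and P can be sketched in a few lines of Python using the standard library's random number generator. The function name `generate_sequence`, the `seed` parameter and the parameter values are hypothetical, invented for illustration; the sketch assumes `length` is at least 1.

```python
import random

def generate_sequence(Q, pi, P, length, seed=None):
    """Draw a sequence of `length` states from the Markov chain M = (Q, pi, P)."""
    rng = random.Random(seed)
    index = rng.choices(range(len(Q)), weights=pi)[0]  # first state drawn from pi
    out = [Q[index]]
    for _ in range(length - 1):
        # The next state depends only on the previous state (row `index` of P).
        index = rng.choices(range(len(Q)), weights=P[index])[0]
        out.append(Q[index])
    return "".join(out)

# Hypothetical parameters (each row of P sums to 1).
Q = ["A", "C", "G", "T"]
pi = [0.25, 0.25, 0.25, 0.25]
P = [
    [0.4, 0.2, 0.2, 0.2],
    [0.1, 0.4, 0.4, 0.1],
    [0.1, 0.4, 0.4, 0.1],
    [0.2, 0.2, 0.2, 0.4],
]
seq = generate_sequence(Q, pi, P, length=20, seed=0)
```

Fixing `seed` makes a run repeatable; leaving it as `None` gives a different member of the family on each call.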

Interactive Demonstration

- (A11 Markov chains)

Discriminating between two states with Markov chains

- To determine which of two states a sequence is more likely to have resulted from, we calculate

$$S(x) = \log \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$
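
A minimal Python sketch of this score follows. It assumes the + and - transition tables given in this lecture for the CpG island example, read with the previous base as the row and the next base as the column. It also assumes base-2 logarithms and scores only transitions between consecutive bases, with no begin-state term. The names `beta` and `log_odds` are hypothetical.

```python
import math

BASES = "ACGT"
# a+ (CpG island) and a- (not CpG island) transition probabilities from this
# lecture; key = previous base, list = next base in the order A, C, G, T.
A_PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
A_MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def beta(prev, nxt):
    """Log2 likelihood ratio for the single transition prev -> nxt."""
    j = BASES.index(nxt)
    return math.log2(A_PLUS[prev][j] / A_MINUS[prev][j])

def log_odds(x):
    """S(x): sum of beta over consecutive pairs of bases in x."""
    return sum(beta(p, n) for p, n in zip(x, x[1:]))

s = log_odds("CGC")            # beta(CG) + beta(GC)
ratio = 2 ** s                 # relative probability P(+) / P(-)
p_plus = ratio / (1 + ratio)   # assumes P(+) + P(-) = 1
```

Because `beta` is recomputed from the unrounded ratios, `s` comes out near 2.27 rather than matching a sum of rounded table entries to the last digit.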

State probabilities for + and - models

- Given example sequences that are from either + model (CpG island) or - model (not CpG island), can calculate the probability that each nucleotide will occur for each model (the a values for each model)

(rows: previous nucleotide; columns: next nucleotide)

+       A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

-       A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

Transition probabilities converted to log likelihood ratios

β        A      C      G       T
A   -0.740  0.419  0.580  -0.803
C   -0.913  0.302  1.812  -0.685
G   -0.624  0.461  0.331  -0.730
T   -1.169  0.573  0.393  -0.679

Example

- What is relative probability of $C^{+}G^{+}C^{+}$ compared with $C^{-}G^{-}C^{-}$?
- First calculate log-odds ratio:
  $S(CGC) = \beta(CG) + \beta(GC) = 1.812 + 0.461 = 2.273$
- Convert to relative probability:
  $2^{2.273} = 4.833$
- Relative probability is ratio of (+) to (-):
  $P(+) = 4.833\, P(-)$
- Convert to percentage:
  $P(+) + P(-) = 1$
  $4.833\, P(-) + P(-) = 1$
  $P(-) = 1/5.833 = 17\%$
- Conclusion:
  $P(+) = 83\%$, $P(-) = 17\%$

Block Diagram for Generating Sequences with a Markov Model

Inputs: alphabet; initial probabilities; transition probabilities; number of characters to generate
→ Markov Model Sequence Generator →
Output: sequence

Hidden Markov models

- “Hidden” connotes that the sequence is generated by two or more states that have different transition probability matrices

More definitions

- $\pi_i$ = state at position i in a path
- $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
  - probability of going from one state to another
  - “transition probability”
- $e_k(b) = P(x_i = b \mid \pi_i = k)$
  - probability of emitting a b when in state k
  - “emission probability”

Decoding

- The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence
- This is called “decoding” in the jargon of speech recognition

More definitions

- Can calculate the joint probability of a sequence x and a state sequence π

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

  requiring $\pi_{L+1} = 0$

Determining the optimal path: the Viterbi algorithm

- Viterbi algorithm is a form of dynamic programming
- Definition: Let $v_k(i)$ be the probability of the most probable path ending in state k with observation i

Determining the optimal path: the Viterbi algorithm

- Initialisation (i = 0): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
- Recursion (i = 1..L):
  $v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$
  $\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1)\, a_{kl} \right)$
- Termination:
  $P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$
  $\pi^*_L = \arg\max_k \left( v_k(L)\, a_{k0} \right)$
- Traceback (i = L..1): $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$

Block Diagram for Viterbi Algorithm

Inputs: alphabet; initial probabilities; transition probabilities; sequence; position i; state k
→ Viterbi Algorithm →
Output: probability sequence was generated with position i being in state k

Multiple paths can give the same sequence

- The Viterbi algorithm finds the most likely path given a sequence
- Other paths could also give rise to the same sequence
- How do we calculate the probability of a sequence given an HMM?

Probability of a sequence

- Sum the probabilities of all possible paths that give that sequence
- Let P(x) be the probability of observing sequence x given an HMM

$$P(x) = \sum_{\pi} P(x, \pi)$$
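
For a very short sequence this sum can be evaluated by brute force, which makes the definition concrete. The Python sketch below uses a hypothetical two-state HMM whose probabilities are invented for illustration. The begin/end state 0 is represented by separate `START` and `END` tables, and `joint_probability` follows the joint probability formula for a sequence and a path.

```python
from itertools import product

# Hypothetical two-state HMM (all values invented for illustration).
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                                    # a_{0k}
TRANS = {"+": {"+": 0.8, "-": 0.1}, "-": {"+": 0.1, "-": 0.8}}  # a_{kl}
END = {"+": 0.1, "-": 0.1}                                      # a_{k0}
EMIT = {                                                        # e_k(b)
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def joint_probability(x, path):
    """P(x, pi): start transition, then one emission and one transition per position."""
    p = START[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        p *= EMIT[state][symbol]
        # The transition out of the last position goes to the end state.
        p *= TRANS[state][path[i + 1]] if i + 1 < len(path) else END[state]
    return p

def sum_over_paths(x):
    """P(x) by enumerating every state path (cost grows exponentially with len(x))."""
    return sum(joint_probability(x, path) for path in product(STATES, repeat=len(x)))

p_x = sum_over_paths("CGCG")
best = max(product(STATES, repeat=4), key=lambda path: joint_probability("CGCG", path))
```

With K states and a sequence of length L there are K^L paths, so this enumeration is only practical for toy inputs.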

Probability of a sequence

- Can find P(x) using a variation on Viterbi algorithm using sum instead of max
- This is called the forward algorithm
- Replace $v_k(i)$ with $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$

Forward algorithm

- Initialisation (i = 0): $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$
- Recursion (i = 1..L): $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
- Termination: $P(x) = \sum_k f_k(L)\, a_{k0}$
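
Since the forward recursion is the Viterbi recursion with the max replaced by a sum, the two can be sketched side by side in Python. The two-state HMM is hypothetical, with values invented for illustration. The begin/end state 0 is folded into `START` and `END` tables, so the first column already includes the step from $v_0(0) = 1$ or $f_0(0) = 1$. For long sequences a real implementation would work with logarithms to avoid numerical underflow.

```python
# Hypothetical two-state HMM (all values invented for illustration).
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                                    # a_{0k}
TRANS = {"+": {"+": 0.8, "-": 0.1}, "-": {"+": 0.1, "-": 0.8}}  # a_{kl}
END = {"+": 0.1, "-": 0.1}                                      # a_{k0}
EMIT = {                                                        # e_k(b)
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(x):
    """Return the most probable state path and its joint probability P(x, pi*)."""
    v = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]  # v_k(1)
    ptr = []
    for symbol in x[1:]:
        prev, col, back = v[-1], {}, {}
        for l in STATES:
            best_k = max(STATES, key=lambda k: prev[k] * TRANS[k][l])
            col[l] = EMIT[l][symbol] * prev[best_k] * TRANS[best_k][l]
            back[l] = best_k
        v.append(col)
        ptr.append(back)
    last = max(STATES, key=lambda k: v[-1][k] * END[k])  # termination
    prob = v[-1][last] * END[last]
    path = [last]
    for back in reversed(ptr):                           # traceback
        path.append(back[path[-1]])
    return path[::-1], prob

def forward(x):
    """Return P(x): the same recursion with the max over k replaced by a sum."""
    f = {k: START[k] * EMIT[k][x[0]] for k in STATES}    # f_k(1)
    for symbol in x[1:]:
        f = {l: EMIT[l][symbol] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f[k] * END[k] for k in STATES)            # termination

path, prob = viterbi("CGCG")
p_x = forward("CGCG")
```

`prob` is the probability of the single best path, while `p_x` sums over all paths, so `p_x` is never smaller than `prob`.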

Backward algorithm

- We may need to know the probability that a particular observation $x_i$ came from a particular state k given a sequence x, $P(\pi_i = k \mid x)$
- Use algorithm analogous to forward algorithm but starting from the end

Backward algorithm

- Initialisation (i = L): $b_k(L) = a_{k0}$ for all k
- Recursion (i = L-1, …, 1): $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$
- Termination: $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$

Estimating probability of state at particular position

- Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position

$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$
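
The posterior calculation can be sketched in Python by filling a forward table and a backward table and combining them position by position. The two-state HMM is hypothetical, with values invented for illustration, and the begin/end state 0 is represented by the `START` and `END` tables. The function names and the test sequence are assumptions.

```python
# Hypothetical two-state HMM (all values invented for illustration).
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                                    # a_{0k}
TRANS = {"+": {"+": 0.8, "-": 0.1}, "-": {"+": 0.1, "-": 0.8}}  # a_{kl}
END = {"+": 0.1, "-": 0.1}                                      # a_{k0}
EMIT = {                                                        # e_k(b)
    "+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def forward_table(x):
    """f[i][k] = P(x_1..x_{i+1}, state k at that position), 0-based i."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][symbol] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f

def backward_table(x):
    """b[i][k] = P(rest of x after position i | state k at position i), 0-based i."""
    b = [{k: END[k] for k in STATES}]        # initialisation: b_k(L) = a_{k0}
    for symbol in reversed(x[1:]):           # symbol is x_{i+1} for i = L-1, ..., 1
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][symbol] * nxt[l] for l in STATES)
                     for k in STATES})
    return b

def posterior(x):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x) for every position i."""
    f, b = forward_table(x), backward_table(x)
    p_x = sum(f[-1][k] * END[k] for k in STATES)
    return [{k: f[i][k] * b[i][k] / p_x for k in STATES} for i in range(len(x))]

post = posterior("ATCGCGAT")  # made-up sequence
```

At every position the posterior probabilities over the states sum to 1, which is a convenient check on an implementation.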