15853
Page 1
15-853: Algorithms in the Real World
Computational Biology III
– Multiple Sequence Alignment
– Sequencing the Genome
Multiple Alignment
A C T _ G T A
A C A C G T T
A G T G _ T A
C C _ G C T A
Goal: match the “maximum” number of aligned pairs of symbols.
Applications:
– Assembling multiple noisy reads of fragments of sequences
– Finding a canonical sequence among members of a family and studying how the members differ
The problem is NP-hard.
Example Output
Output from typical multiple alignment software:
DNAMAN (using ClustalW)
Scoring Multiple Alignments
1. Distance from consensus $S_c$:
$$D = \sum_{S_i \in S} D(S_i, S_c)$$
2. Pairwise distances:
$$D = \sum_{S_i \in S} \; \sum_{S_j \in S \setminus \{S_i\}} D(S_i, S_j)$$
3. Evolutionary Tree Alignment:
[Figure: tree with leaves $S_1, S_2, S_3, S_4, S_5$]
$$D = D(S_1, S_2) + D(S_{12}, S_3) + D(S_4, S_5) + D(S_{123}, S_{45})$$
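A minimal sketch of the first two scores, applied to the four-row alignment from the Multiple Alignment slide. The unit column cost (0 for identical symbols, 1 otherwise, with a gap treated as an ordinary symbol) and the majority-vote consensus are assumptions made for illustration; the slides do not fix either choice, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def column_distance(x, y):
    """Assumed unit cost: 0 for identical symbols (gaps included), 1 otherwise."""
    return 0 if x == y else 1

def distance(s, t):
    """Distance between two equal-length rows of an alignment."""
    return sum(column_distance(x, y) for x, y in zip(s, t))

def consensus(rows):
    """Assumed consensus: the most common symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def consensus_score(rows):
    """Score 1: sum of distances from every row to the consensus."""
    c = consensus(rows)
    return sum(distance(r, c) for r in rows)

def pairwise_score(rows):
    """Score 2, summed over unordered pairs of rows."""
    return sum(distance(s, t) for s, t in combinations(rows, 2))

rows = ["ACT_GTA", "ACACGTT", "AGTG_TA", "CC_GCTA"]
print(consensus(rows))        # ACTGGTA
print(consensus_score(rows))  # 9
print(pairwise_score(rows))   # 24
```

The double sum in the pairwise formula ranges over ordered pairs, so it counts every pair twice; the sketch sums over unordered pairs, which gives half that value and the same ranking of alignments.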
Approaches
Dynamic programming: optimal, but takes time that is exponential in p, the number of sequences
Center Star Method: approximation
Clustering Methods: also called iterative pairwise alignment. Typically an approximation.
Many variants, many software packages
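The slides only name the Center Star Method. As it is commonly described, the first step picks as the "center" the sequence whose total pairwise distance to all the others is smallest; every other sequence is then aligned against it. The sketch below covers only that first step, and it assumes unit-cost edit distance as the pairwise measure. The function names and the toy input are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]

def choose_center(seqs):
    """Return (index of center, list of total distances per sequence)."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    best = min(range(len(seqs)), key=totals.__getitem__)
    return best, totals

seqs = ["AAAA", "AAAT", "AATT", "TTTT"]  # toy input, not from the slides
best, totals = choose_center(seqs)
print(best, totals)  # 1 [7, 5, 5, 9]
```

Ties, as between the second and third sequences here, are broken by taking the first minimum.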
Using Dynamic Programming
For $p$ sequences of length $n$ we can fill in a $p$-dimensional array in $n^p$ time and space.
For example for $p = 3$:
$$D_{i,j,k} = \min \begin{cases} D_{i-1,j-1,k-1} + d(a_i, b_j, c_k) \\ D_{i-1,j-1,k} + d(a_i, b_j, \_) \\ D_{i-1,j,k} + d(a_i, \_, \_) \\ \ldots \end{cases} \qquad \text{(7 cases)}$$
where
$$d(a, b, c) = d(a, b) + d(b, c) + d(a, c)$$
assuming the pairwise distance metric.
Takes time exponential in $p$. Perhaps OK for $p = 3$.
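A minimal sketch of the three-sequence recurrence, returning only the optimal cost. It assumes a unit pairwise cost d (0 for equal symbols, 1 otherwise, with gap against gap costing 0); the slides leave d unspecified. The 7 cases are generated as all ways for each of the three sequences to either advance or take a gap, excluding the all-gap column.

```python
from itertools import product

GAP = "_"

def d2(x, y):
    """Assumed pairwise cost: 0 if equal (two gaps cost nothing), else 1."""
    return 0 if x == y else 1

def d3(a, b, c):
    """Column cost d(a,b,c) = d(a,b) + d(b,c) + d(a,c)."""
    return d2(a, b) + d2(b, c) + d2(a, c)

def align3(A, B, C):
    """Minimum sum-of-pairs cost of aligning three strings."""
    n, m, l = len(A), len(B), len(C)
    INF = float("inf")
    D = [[[INF] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    D[0][0][0] = 0
    # 7 cases: each sequence advances (1) or takes a gap (0), not all gaps.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = A[i - 1] if di else GAP
            b = B[j - 1] if dj else GAP
            c = C[k - 1] if dk else GAP
            D[i][j][k] = min(D[i][j][k], D[pi][pj][pk] + d3(a, b, c))
    return D[n][m][l]

print(align3("AC", "AC", "A"))  # 2
```

For general p the same loop would have $2^p - 1$ cases per cell, which is one more reason the approach is limited to very small p.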
Example
Optimization
As in the case of pairwise alignment, we can view the array as a graph and find shortest paths.
This approach is used in a program called MSA, which can align 6 strings of 200 bp each in a “practical” amount of time.
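To illustrate the graph view in the simpler pairwise case: each array cell (i, j) is a node, each of the three DP cases is a weighted edge, and the optimal alignment cost is the shortest path from (0, 0) to (len(a), len(b)). The sketch below uses Dijkstra's algorithm with unit costs; both choices are assumptions for illustration and are not a description of how MSA itself is implemented.

```python
import heapq

def edit_distance_shortest_path(a, b):
    """Edit distance as a shortest path over the alignment grid."""
    target = (len(a), len(b))
    dist = {(0, 0): 0}
    heap = [(0, 0, 0)]  # (cost so far, i, j)
    while heap:
        cost, i, j = heapq.heappop(heap)
        if (i, j) == target:
            return cost
        if cost > dist.get((i, j), float("inf")):
            continue  # stale heap entry
        edges = []
        if i < len(a):
            edges.append((i + 1, j, 1))  # a[i] against a gap
        if j < len(b):
            edges.append((i, j + 1, 1))  # b[j] against a gap
        if i < len(a) and j < len(b):
            edges.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        for ni, nj, w in edges:
            nc = cost + w
            if nc < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = nc
                heapq.heappush(heap, (nc, ni, nj))
    return dist[target]

print(edit_distance_shortest_path("kitten", "sitting"))  # 3
```

The potential saving is that nodes whose distance from the start exceeds the optimal cost are never expanded, so similar sequences touch only a band of the array.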
Using Clustering
1. Compute D(S_i, S_j) for all pairs
2. Bottom-up cluster
   I. All sequences start as their own cluster
   II. Repeat
      a) Find the two “closest” clusters and join them into one
      b) Find best alignment of the two clusters being joined
[Figure: cluster tree over S_1, S_2, S_3, S_4, S_5]
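The steps above can be sketched as a small agglomerative loop. This sketch only records the merge tree; step (b), aligning the two clusters being joined, is left as a comment. The distance table is made up for illustration, and the function names and the choice of an average-based cluster distance are assumptions.

```python
from itertools import combinations

def average_linkage(c1, c2, D):
    """Mean pairwise distance between members of two clusters."""
    return sum(D[frozenset((s, t))] for s in c1 for t in c2) / (len(c1) * len(c2))

def bottom_up_cluster(names, D, cluster_distance):
    """Return the merge order as nested tuples.

    D maps frozenset({x, y}) to the precomputed distance D(x, y).
    """
    # Each cluster is (members, tree); every sequence starts alone.
    clusters = [((name,), name) for name in names]
    while len(clusters) > 1:
        # a) find the two closest clusters (x < y by construction)
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]][0], clusters[p[1]][0], D),
        )
        (m1, t1), (m2, t2) = clusters[x], clusters[y]
        # b) a full implementation would align the two clusters here
        clusters[x] = (m1 + m2, (t1, t2))
        del clusters[y]
    return clusters[0][1]

names = ["S1", "S2", "S3", "S4", "S5"]
D = {frozenset(p): 8 for p in combinations(names, 2)}  # hypothetical distances
D[frozenset(("S1", "S2"))] = 1
D[frozenset(("S4", "S5"))] = 2
D[frozenset(("S1", "S3"))] = 3
D[frozenset(("S2", "S3"))] = 3

tree = bottom_up_cluster(names, D, average_linkage)
print(tree)  # ((('S1', 'S2'), 'S3'), ('S4', 'S5'))
```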
Distances between Clusters
Could use the distance between the consensus sequences of the two clusters.
A popular technique is called the “Unweighted Pair Group Method using arithmetic Averages” (UPGMA).
It takes the average of all pairwise distances between members of the two clusters.
Implemented in Clustal and Pileup.
actg_a
attg_a
actgga
_accca
aaccga
D?
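A minimal sketch of the UPGMA cluster distance on the five rows above. The split into a three-row cluster and a two-row cluster, and the column-mismatch distance between aligned rows, are both assumptions for illustration; the figure that showed the actual grouping is not available here.

```python
def row_distance(s, t):
    """Assumed cost: number of columns where two aligned rows differ."""
    return sum(x != y for x, y in zip(s, t))

def upgma_distance(cluster_a, cluster_b):
    """Average of all pairwise distances between the two clusters."""
    total = sum(row_distance(s, t) for s in cluster_a for t in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

cluster_a = ["actg_a", "attg_a", "actgga"]  # assumed grouping
cluster_b = ["_accca", "aaccga"]
print(upgma_distance(cluster_a, cluster_b))
```

Under these assumptions the six cross-cluster distances are 5, 4, 5, 4, 5 and 3, so D = 26/6, roughly 4.33.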
Summary of Matching
Types of matching:
– Global: align two sequences A and B
– Local: align A with any part of B
– Multiple: align k sequences (NP-complete)
Cost models:
– LCS and MED
– Scoring matrices: Blosum, PAM
– Gap cost: affine, general
Methods:
– Dynamic programming: many optimizations
– “Fingerprinting”: hashing of small sequences (approx.)
– Clustering: for multiple alignment (approx.)
Sequencing the Genome
One of the great achievements of the 21st century.
Tools of the Trade
Cutting: Arber, Nathans, and Smith, Nobel Prize in Medicine (1978) for “the discovery of restriction enzymes and their application to problems of molecular genetics”.
This note was uploaded on 01/26/2010 for the course COMPUTER S 15853 taught by Professor Guy Blelloch during the Fall '09 term at Carnegie Mellon.