Integrative Biology 200A
“PRINCIPLES OF PHYLOGENETICS”
University of California, Berkeley
Will- 21 Feb 2008
Two or more sequences (bases, amino acids, proteins, etc.) are matched either globally (two sequences matched
over their whole length) or locally (some subset of the sequences matched while other regions are not expected
to match).
Sequence similarity can simply be a mathematical distance between two sequences.
Establishing an initial estimate of homology (basically similarity) is essential. Unaligned sequence data have no
established base homology. As a consequence, the fixed alignment, achieved by one method or another, is treated as
prior, or background, knowledge. Recall the hierarchy of characters and states, and that only the states are really
tested in the analyses. The outcome of phylogenetic analyses is often strongly influenced by the alignment.
BLAST (Altschul, SF, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. J Mol Biol 215(3):403-10, 1990). For
example, if a gene is newly identified and its function understood in one organism, a researcher can BLAST the
database of the human genome to look for similar gene sequences.
Very basic description of BLAST
1. Uses short segments (“words”) of sequence to find other sequences that contain the same set.
2. Does “ungapped” alignment extending from the matched subsequence regions to find high-scoring matches.
3. Does a rapid gapped alignment to select and rank close matches.
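The first two steps above can be illustrated with a toy seed-and-extend routine. This is only a minimal sketch under simplifying assumptions: exact word matches only, a flat match/mismatch score, and a simple drop-off rule for stopping the extension. The function names (word_index, extend_ungapped, seed_and_extend) and the parameters (w, match, mismatch, xdrop) are made up for illustration; they are not BLAST's actual interface, and real BLAST additionally uses scoring matrices, neighborhood words and significance statistics.

```python
from collections import defaultdict


def word_index(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend_ungapped(query, subject, qi, si, w, match=1, mismatch=-1, xdrop=2):
    """Extend an exact word hit to the right and left without gaps.

    Extension in a direction stops once the running score has fallen
    more than xdrop below the best score seen so far.
    Returns (best_score, query_start, query_end), end exclusive.
    """
    best = score = w * match
    best_start, best_end = qi, qi + w
    # Extend to the right of the seed.
    i, j = qi + w, si + w
    while i < len(query) and j < len(subject) and best - score <= xdrop:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_end = score, i
    # Extend to the left of the seed, starting from the best score so far.
    score = best
    i, j = qi - 1, si - 1
    while i >= 0 and j >= 0 and best - score <= xdrop:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, best_start = score, i
        i -= 1
        j -= 1
    return best, best_start, best_end


def seed_and_extend(query, subject, w=3):
    """Find shared words, extend each hit, and rank the results by score.

    Each result is (score, query_start, query_end, subject_start).
    """
    index = word_index(subject, w)
    hits = set()
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):
            score, qs, qe = extend_ungapped(query, subject, qi, si, w)
            hits.add((score, qs, qe, si - (qi - qs)))
    return sorted(hits, reverse=True)


print(seed_and_extend("acgt", "ggacgtgg", w=3))
```

Different seeds that extend to the same segment collapse into a single hit because the results are stored in a set. The gapped alignment of step 3 is not attempted here.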
For two sequences of length n, i.e. pairwise alignment, if no gaps are allowed then there is one or a few optimal
alignment(s). If gaps are allowed, i.e. there is sequence length variation, then the number of possible alignments
grows explosively with n; even for n=50 it is astronomically large. Enumeration is not an option! We need heuristic
searches based on optimality and scoring.
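To get a feel for the size of the search space, the number of gapped alignments can be computed directly. The exact figure depends on how alignments are counted, so the sketch below shows two conventions as assumptions rather than as the count intended in these notes: the central binomial coefficient C(2n, n), which is often quoted in textbooks, and the Delannoy number, which counts every distinct arrangement of match, insertion and deletion columns. The function names are illustrative.

```python
from math import comb


def central_binomial(n):
    """C(2n, n): one commonly quoted count of alignments of two length-n sequences."""
    return comb(2 * n, n)


def delannoy(m, n):
    """Number of column-by-column alignments of sequences of length m and n.

    Each column is a match/mismatch, a gap in the first sequence, or a gap
    in the second, giving D(i, j) = D(i-1, j) + D(i-1, j-1) + D(i, j-1).
    """
    row = [1] * (n + 1)          # D(0, j) = 1 for all j
    for _ in range(m):
        new = [1]                # D(i, 0) = 1
        for j in range(1, n + 1):
            new.append(row[j] + row[j - 1] + new[j - 1])
        row = new
    return row[n]


for n in (5, 10, 50):
    print(n, f"{central_binomial(n):.2e}", f"{delannoy(n, n):.2e}")
```

Under the first convention, n=50 already gives about 1.0 x 10^29 alignments, and the Delannoy count is larger still, which is why exhaustive enumeration is ruled out.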
Two problems: how to find alignments and how to choose among them.
Alignment really attempts to balance the number of indels against the number of base substitutions, normally based on
some cost differential. Of course it is possible to account for all differences by inserting enough gaps (the trivial
alignment). In the simplest model the cost is the “edit distance”, or the minimal number of events required to
transform one sequence into another using some scheme of insertions, deletions and substitutions.
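As a minimal sketch of the edit distance idea, the standard dynamic programming recurrence can be written in a few lines. The parameter names sub_cost and indel_cost are hypothetical and stand in for the cost differential between substitutions and indels; with both set to 1 the function returns the plain edit distance.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Minimal total cost of turning sequence a into sequence b.

    Allowed events: substitution (sub_cost), insertion and deletion
    (indel_cost each). Matching bases cost nothing.
    """
    # prev[j] = cost of turning the empty prefix of a into b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel_cost]  # turn a[:i] into the empty string
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            curr.append(min(diag,                       # match or substitute
                            prev[j] + indel_cost,       # delete a[i-1]
                            curr[j - 1] + indel_cost))  # insert b[j-1]
        prev = curr
    return prev[-1]


print(edit_distance("acctga", "agcta"))              # 2 with unit costs
print(edit_distance("acctga", "agcta", sub_cost=3))  # substitutions made expensive
```

Raising sub_cost above twice indel_cost makes a deletion plus an insertion cheaper than any substitution, which pushes the result toward the gap-heavy trivial alignment described above.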
Go from acctga to agcta: