Multiple Alignment, Index Structures, Suffix Trees

Jens Stoye
stoye@TechFak.Uni-Bielefeld.DE

1 Introduction

String comparison is a fundamental task in bioinformatics. The comparison of two sequences (pairwise alignment), and how to compute good pairwise alignments by means of the algorithmic technique of dynamic programming, was the topic of Robert Giegerich's lecture on sequence similarities and dynamic programming. Here we want to go beyond this.

In the first part (Section 2) we discuss methods for the simultaneous comparison of more than two sequences: multiple sequence alignment. In the second part (Section 3) we introduce a very powerful data structure for preprocessing (indexing) large sequence data in order to allow flexible and efficient string comparison: the suffix tree. We also describe a few applications of suffix trees in bioinformatics. Finally, in Section 4 we take a short look at a recent development where suffix trees meet multiple sequence comparison: the multiple genome aligner (MGA).

2 Multiple Sequence Alignment

2.1 Why Multiple Sequence Alignment?

Multiple sequence alignment is the study of more than two (related) bio-sequences at the same time. The following are the two main arguments why multiple sequence alignment is more than just a mathematical generalization of pairwise alignment:

- When several evolutionarily related (homologous) protein or DNA sequences are compared, family-specific patterns (motifs) that are shared by all the sequences are more easily detected than in pairwise comparisons, because the chance of random similarities occurring in all sequences at the same time is much lower.

- Evolutionary relationships become much clearer if several sequences are compared simultaneously, because this allows character-based studies rather than distance-based analyses. That is why studies in molecular phylogeny are usually based on multiple alignments.
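The first argument can be made concrete with a small calculation. The sketch below assumes a deliberately simple background model that is not part of the lecture notes: every character is drawn independently and uniformly from the alphabet. The function name chance_identical_column is hypothetical. Under this assumption, the probability that one alignment column is identical across k unrelated sequences is s^(1-k) for an alphabet of size s, so it shrinks geometrically as more sequences are added.

```python
def chance_identical_column(k: int, alphabet_size: int = 4) -> float:
    """Probability that k independent, uniformly random characters are all equal.

    Assumption (not from the notes): characters are i.i.d. and uniform.
    The first character is free; each of the other k - 1 characters must
    match it, which happens with probability 1 / alphabet_size each.
    """
    if k < 1 or alphabet_size < 1:
        raise ValueError("k and alphabet_size must be positive")
    return float(alphabet_size) ** (1 - k)


if __name__ == "__main__":
    # Compare a pairwise DNA comparison with larger sequence families.
    for k in (2, 5, 10):
        print(k, chance_identical_column(k))
```

Real sequences have biased composition and are not independent, so these numbers only indicate the trend: a column conserved across many sequences is far less likely to be a coincidence than a match between two.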
These arguments reflect the two main uses of multiple sequence alignments:

- Detect common similarities (specific signals) in the sequences under consideration, which can then be used, e.g., for more sensitive database searches.

- Detect dissimilarities (differences between related sequences), which express mutations and can be used, e.g., for phylogenetic analyses.

This section is divided into three parts. The first two (Sections 2.2 and 2.3) discuss the multiple sequence alignment problem in a more theoretical way, while the last part (Section 2.4) considers methods for computing multiple sequence alignments as they are used in practice.

2.2 Optimal Multiple Alignment

We begin with a formal definition of the multiple sequence alignment problem and how it can be solved optimally. This method, while conceptually rather simple, is computationally very demanding, so that only small problem instances can be solved. Speed-ups and variations of the basic algorithm are sketched in Section 2.3. ...
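To give a feeling for "computationally very demanding", the following sketch counts the work of a straightforward generalization of pairwise dynamic programming to k sequences. This is an assumption, since the algorithm itself is not shown in this excerpt: one table cell per combination of prefix lengths, and one predecessor per non-empty subset of sequences that contribute a character to the last column. The function names are hypothetical.

```python
from typing import Sequence


def dp_lattice_size(lengths: Sequence[int]) -> int:
    """Number of cells in a k-dimensional DP table for the given sequence lengths.

    Assumption: one cell per tuple of prefix lengths (i_1, ..., i_k) with
    0 <= i_j <= n_j, as in the usual extension of the pairwise DP matrix.
    """
    cells = 1
    for n in lengths:
        cells *= n + 1
    return cells


def predecessors_per_cell(k: int) -> int:
    """Predecessor cells consulted for one interior cell.

    Each sequence either contributes a character or a gap to the last
    alignment column; the all-gap column is excluded.
    """
    return 2 ** k - 1


if __name__ == "__main__":
    # Table size and per-cell work for k sequences of length 100.
    for k in (2, 3, 5):
        print(k, dp_lattice_size([100] * k), predecessors_per_cell(k))
```

Under these assumptions both factors grow exponentially in the number of sequences, which matches the statement that only small instances can be solved optimally.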
This note was uploaded on 10/01/2009 for the course CS BCB/Co taught by Professor Olivereulenstein during the Fall '06 term at Iowa State.