MSA_SPMetric_NPC - The complexity of multiple sequence...

The complexity of multiple sequence alignment with SP-score that is a metric

Paola Bonizzoni (a,1), Gianluca Della Vedova (a,1)

(a) Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano - Bicocca, via Bicocca degli Arcimboldi 8, 20126 Milano - Italy, e-mail: {bonizzoni, dellavedova}@disco.unimib.it

(1) Supported by grant Cofinanziato 98: “Modelli di calcolo innovativi: metodi sintattici e combinatori”.

Preprint submitted to Elsevier Preprint, 7 May 2001

Abstract

This paper analyzes the computational complexity of computing the optimal alignment of a set of sequences under the SP (sum of all pairs) score scheme. We solve an open question by showing that the problem is NP-complete in the very restricted case in which the sequences are over a binary alphabet and the score is a metric. This result establishes the intractability of multiple sequence alignment under a score function of mathematical interest, which has indeed received much attention in biological sequence comparison.

Key words: multiple sequence alignment, SP-score, intractability.

1 Introduction

Multiple sequence alignment is one of the most popular and important problems in computational biology [7]. It has applications in molecular biology, mainly in two related areas: finding information about the structure and function of molecules, and estimating the evolutionary distance between species from their associated sequences. An alignment of k sequences is defined by a k × m matrix in which each row contains a sequence interleaved with spaces. The similarity of the sequences in the alignment is then measured by a score, or distance, between elements of the matrix. More precisely, for DNA (or RNA) sequences the alphabet contains four letters, and the score assigned to the comparison between two letters (or nucleotides) may be zero if there is a match, i.e. the letters are identical, and one otherwise.
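To make the definitions above concrete, the following minimal sketch computes the SP (sum of all pairs) score of a given alignment matrix under the zero/one letter score. It is an illustration, not the authors' code: the names `GAP`, `unit_score` and `sp_score` are hypothetical, and the treatment of spaces (a space is handled as one more symbol, so a space against a letter costs one and a space against a space costs zero) is an assumption, since the text here only specifies the score between letters.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for a space in the alignment matrix


def unit_score(a, b):
    """Zero/one score: 0 for identical symbols, 1 otherwise.

    Assumption: GAP is treated like any other symbol.
    """
    return 0 if a == b else 1


def sp_score(alignment, score=unit_score):
    """Sum, over all pairs of rows, of the column-by-column scores."""
    # Every row of a k x m alignment matrix must have the same length m.
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        total += sum(score(x, y) for x, y in zip(row_a, row_b))
    return total


# A 3 x 4 alignment of the sequences ACT, AGT and ACGT.
alignment = ["AC-T", "A-GT", "ACGT"]
print(sp_score(alignment))  # prints 4
```

This only evaluates one fixed alignment. Finding the alignment that minimizes the value over all possible placements of spaces is the optimization problem whose complexity the paper studies.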
A popular assumption in biological alignment is that the score is a metric, that is, the distance between identical letters is zero and the score satisfies the triangle inequality. Among the different score schemes, the sum of all pairs score, in short the SP-score, is the one that has received the most attention, mainly for its mathematical elegance. The SP-score assigns a value to a multiple alignment; an optimal alignment is one that minimizes this value over all possible alignments. Several methods have been developed for multiple sequence alignment [3,2], but no efficient method is known for finding the optimal alignment. Recently, a polynomial time approximation algorithm for the problem has been proposed by Gusfield [6], who achieved a 2 − 2/k approximation factor by assembling an alignment of k sequences from optimal alignments of pairs of sequences. The approximation ratio has been improved to a 2 − l/k factor, for any fixed l, by Bafna, Lawler and Pevzner [1]. Despite these results, it was an open question whether the problem is...