The complexity of multiple sequence alignment with SP-score that is a metric

Paola Bonizzoni (a,1), Gianluca Della Vedova (a,1)

(a) Dipartimento di Informatica, Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, via Bicocca degli Arcimboldi 8, 20126 Milano, Italy, email: {bonizzoni, dellavedova}@disco.unimib.it

(1) Supported by grant Cofinanziato 98: “Modelli di calcolo innovativi: metodi sintattici e combinatori”.

Preprint submitted to Elsevier Preprint, 7 May 2001

Abstract

This paper analyzes the computational complexity of computing the optimal alignment of a set of sequences under the SP (sum of all pairs) score scheme. We solve an open question by showing that the problem is NP-complete in the very restricted case in which the sequences are over a binary alphabet and the score is a metric. This result establishes the intractability of multiple sequence alignment under a score function of mathematical interest, which has indeed received much attention in biological sequence comparison.

Key words: multiple sequence alignment, SP-score, intractability.

1 Introduction

Multiple sequence alignment is one of the most popular and important problems in computational biology [7]. It finds different applications in molecular biology, mainly in two related areas: finding information about the structure and function of the molecules, and estimating the evolutionary distance between species from their associated sequences. An alignment of k sequences is defined by a k × m matrix in which each row contains a sequence interleaved with spaces. The similarity of the sequences in the alignment is then measured by using a score or distance between elements of the matrix. More precisely, in DNA (or RNA) sequences, the alphabet contains four letters, and the score assigned to the comparison between two letters (or nucleotides) may be zero if there is a match, i.e. the letters are identical; otherwise the score may be one.
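The matrix view of an alignment and the zero/one letter score can be illustrated with a minimal Python sketch, which also evaluates the sum-of-all-pairs value named in the abstract. The names `GAP`, `unit_score`, and `sp_score` are hypothetical, and the treatment of spaces is an assumption not stated in the text: a space against a letter is taken to cost one, and a space against a space zero.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for a space in the alignment matrix


def unit_score(a, b):
    """Return 0 for identical symbols and 1 otherwise.

    Assumption: a space is treated like any other symbol, so a space
    against a letter costs 1 and a space against a space costs 0.
    """
    return 0 if a == b else 1


def sp_score(alignment, score=unit_score):
    """Sum the pairwise score over every pair of rows, column by column."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of the k x m matrix must have length m")
    return sum(
        score(x, y)
        for row1, row2 in combinations(alignment, 2)
        for x, y in zip(row1, row2)
    )


# A 3 x 4 alignment matrix: each row is one sequence interleaved with spaces.
alignment = ["AC" + GAP + "T", "A" + GAP + "GT", "ACGT"]
print(sp_score(alignment))
```

Here the three row pairs contribute 2, 1, and 1, so the sketch prints 4. A different alignment of the same sequences would in general give a different value.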
A popular assumption in biological alignment is that the score is a metric, that is, the distance between identical letters is zero and the score satisfies the triangle inequality. Among different score schemes, the sum of all pairs score, in short the SP-score, is the one that has received the most attention, mainly for its mathematical elegance. By means of the SP-score a value is assigned to a multiple alignment; an optimal alignment is one that minimizes this value over all possible alignments.

Several methods have been developed for multiple sequence alignment [3,2], but no efficient methods are known to find the optimal alignment. Recently, a polynomial time approximation algorithm for the problem has been proposed by Gusfield [6], who achieved a 2 − 2/k approximation factor by assembling an alignment of k sequences from optimal alignments of pairs of sequences. The approximation ratio has been improved to a 2 − l/k factor, for any fixed l, by Bafna, Lawler and Pevzner [1]. Beyond these results, however, it was an open question whether the problem is...
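The metric condition on the score can be checked mechanically for a small alphabet. The following Python sketch is illustrative only: `is_metric` and `unit` are hypothetical names, symmetry and non-negativity are added as the usual metric requirements even though the text lists only zero self-distance and the triangle inequality, and including the space symbol `-` in the checked alphabet is an assumption.

```python
from itertools import product


def is_metric(symbols, d):
    """Check that d behaves as a metric on the given symbols."""
    # Identical letters must be at distance zero.
    if any(d(a, a) != 0 for a in symbols):
        return False
    # Assumed extra requirements: symmetry and non-negativity.
    for a, b in product(symbols, repeat=2):
        if d(a, b) != d(b, a) or d(a, b) < 0:
            return False
    # Triangle inequality over every ordered triple.
    return all(
        d(a, c) <= d(a, b) + d(b, c)
        for a, b, c in product(symbols, repeat=3)
    )


def unit(a, b):
    """Zero/one score: 0 on a match, 1 otherwise."""
    return 0 if a == b else 1


# Binary alphabet plus an assumed space symbol.
print(is_metric(["0", "1", "-"], unit))
```

The zero/one score passes the check, whereas a table that charges 3 for comparing `0` with `1` and 1 for every other mismatch fails, because going through the space would cost only 2.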