MSA_Just_NPC

# MSA_Just_NPC - COMPUTATIONAL COMPLEXITY OF MULTIPLE...

COMPUTATIONAL COMPLEXITY OF MULTIPLE SEQUENCE ALIGNMENT WITH SP-SCORE

Winfried Just, Department of Mathematics, Ohio University, Athens, Ohio 45701, U.S.A.
E-mail: just@math.ohiou.edu; phone: (740)-593-1260; fax: (740)-593-9805.

Abstract. It is shown that the multiple alignment problem with SP-score is NP-hard for each scoring matrix in a broad class $\mathcal{M}$ that includes most scoring matrices actually used in biological applications. The problem remains NP-hard even if sequences can only be shifted relative to each other and no internal gaps are allowed. It is also shown that there is a scoring matrix $M_0$ such that the multiple alignment problem for $M_0$ is MAX-SNP-hard, regardless of whether or not internal gaps are allowed.

Key words and phrases. Sequence alignment, scoring matrix, NP-hardness, MAX-SNP-hardness, polynomial time approximation scheme.

## 1. Introduction

The importance of good multiple sequence alignment algorithms is evidenced by the large number of programs that have been developed for this task (Fasman and Salzberg 1998). Finding an optimal alignment of $k$ sequences appears to quickly become computationally intractable as $k$ increases. For example, dynamic programming algorithms that are guaranteed to find a best scoring alignment of $k$ sequences with mean length $n$ have a running time of $O(n^k)$ (Carillo and Lipman 1988). This explains the widespread use of heuristic algorithms for multiple alignment.

It has been formally proved by Wang and Jiang (1994) and Bonizzoni and Della Vedova (2000) that there are scoring matrices for which the problem of finding a multiple alignment of $k$ sequences with optimal SP-score is NP-hard. Unfortunately, the scoring matrix used by Wang and Jiang (1994) for obtaining this result is not a metric, which makes it very different from the matrices that are actually used in biological applications.
The proof technique used by Bonizzoni and Della Vedova (2000) uses matrices in which the indel (insertion/deletion) penalties depend on which character a space symbol is aligned with. While such variable indel penalties are sometimes used for aligning amino acid sequences, the use of scoring schemes with uniform indel penalties seems much more common. Thus for most scoring schemes used in practice, computational intractability of the multiple alignment problem had not been formally proven prior to the results of the present paper. Here we show that the multiple alignment problem is NP-hard for each scoring matrix from a broad class $\mathcal{M}$ that includes most scoring schemes that are actually used in biological applications.

A brute force algorithm for finding optimal multiple alignments would have to evaluate all possibilities of inserting gaps into the sequences to be aligned. However, the optimal alignments found in practice usually contain relatively few gaps (Pascarella and Argos 1992; Benner et al. 1993). This observation led to the question whether the problem becomes less complex if one limits the number of gaps that can be inserted into the sequences (Jiang 1999). An extreme version of such restrictions is what we call here gap-0 alignment. In this version, sequences can be shifted relative to each other, but no internal gaps are allowed. Unpublished results of Bonizzoni, Della Vedova, and Jiang show that there is a scoring matrix that does not satisfy the triangle inequality for which gap-0 alignment is still NP-hard, and the problem is even MAX-SNP-hard if the scoring matrix is considered part of the input (Jiang 1999). Subsequently, a fixed scoring matrix $M$ was found such that $M$ is a metric and gap-0 multiple alignment for $M$ is NP-hard (Just 1999). Here we show that the gap-0 multiple alignment problem is NP-hard for each scoring matrix from a broad class $\mathcal{M}_1 \supseteq \mathcal{M}$.
We also show that there is a fixed scoring matrix $M_0$ over a three-letter alphabet such that the multiple alignment problem and the gap-0 multiple alignment problem for $M_0$ are MAX-SNP-hard. Unfortunately, $M_0$ does not satisfy the triangle inequality.

## 2. Definitions and Results

Let us formally state the multiple alignment problem and the gap-0 multiple alignment problem. At the outset, we are given a finite alphabet $\Sigma = \{a_1, \dots, a_w\}$ and a $(w+1) \times (w+1)$ scoring matrix $M = (s_{i,j})_{i \le w,\, j \le w}$. Intuitively, for $i, j > 0$, $s_{i,j}$ represents the penalty for aligning character $a_i$ with character $a_j$. For $i > 0$, the numbers $s_{0,i}, s_{i,0}$ are called indel penalties. Penalties $s_{0,i}, s_{i,0}$ are incurred whenever the character $a_i$ is aligned with a special character $- \notin \Sigma$ that stands for a space. A given scoring scheme may also specify additional gap opening penalties that are incurred in addition to the indel penalties for aligning $a_i$ with the first or last in a string of $-$'s (in this case, what we call "indel penalty" will usually be called gap extension penalty). Our results do not depend on whether or not gap opening penalties are added to the indel penalties.

We will say that a scoring matrix is metric if it satisfies the following conditions:

1) $s_{i,j} > 0$ for all $i \ne j$;
2) $s_{i,i} = 0$ for all $i$;
3) $s_{i,j} = s_{j,i}$ for all $i, j$;
4) $s_{i,j} + s_{j,k} \ge s_{i,k}$ for all $i, j, k$.

The last of the above properties is called the triangle inequality. Metric scoring matrices are of considerable theoretical interest, since they allow for the natural interpretation of pairwise alignment scores as distances between sequences (see, e.g., Wheeler (1993) and Fitch (1993) for a discussion of the role of the triangle inequality in this context). However, scoring matrices used in practice, such as the PAM matrices of Dayhoff et al. (1978) and the BLOSUM matrices of Henikoff and Henikoff (1992), give log-odds scores rather than distances.
In particular, for the latter type of matrices, the multiple alignment problem will be formally cast as a maximization rather than a minimization problem. In this paper we will use the language of "distances" as a convenient and intuitive metaphor, but our development of the theory and our results will not require any of the properties 1)-4). A maximization problem can of course be transformed into an equivalent minimization problem by multiplying each score by $-1$.

Given two sequences $t_0, t_1$ of symbols from $\Sigma \cup \{-\}$ of length $n$ and a scoring matrix $M$, we define a distance $d_M(t_0, t_1)$ as the sum of penalties specified by $M$ for aligning the $j$-th character $t_{0,j}$ of $t_0$ with the $j$-th character $t_{1,j}$ of $t_1$, plus gap opening penalties if applicable, where $j$ ranges over the length of the sequences. If we have a $k$-tuple $\langle t_0, \dots, t_{k-1} \rangle$ of sequences of equal length, then the SP-score for these sequences is given by

$$SP_M(t_0, \dots, t_{k-1}) = \sum_{i < j < k} d_M(t_i, t_j).$$

For a $k$-tuple $\langle t_0, \dots, t_{k-1} \rangle$ of sequences as above, an alignment $a$ of these sequences is obtained by preserving the order of symbols in each sequence, but possibly inserting space symbols $-$. We will always assume that there are suitable numbers of space symbols inserted at the end of each sequence so that the aligned sequences $\langle at_0, \dots, at_{k-1} \rangle$ are all of the same length. Alignments are not allowed to contain columns that consist entirely of space symbols. An alignment $a$ is called a gap-0 alignment if spaces are possibly added at the beginning and at the end of sequences, but not between symbols (i.e., sequences can be shifted relative to each other, but no internal gaps are allowed). A gap-0-1 alignment is a gap-0 alignment of sequences of equal length such that each of the aligned sequences contains exactly one space, either at its end or at its beginning. Given an alignment $a$ of sequences $\langle t_0, \dots, t_{k-1} \rangle$, we define the SP-score with respect to $M$ for this alignment as $SP_M(at_0, \dots, at_{k-1})$.
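The definitions of $d_M$ and of the SP-score above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the paper: the penalty values in `SCORE`, the helper names, and the two sample sequences are all made up for the example, and gap opening penalties are ignored.

```python
from itertools import combinations

GAP = "-"  # the space symbol

def symmetric_table(entries):
    """Expand {(a, b): penalty} into a lookup that also answers (b, a)."""
    table = dict(entries)
    table.update({(b, a): s for (a, b), s in entries.items()})
    return table

# Illustrative penalties only; these numbers are assumptions for the sketch.
SCORE = symmetric_table({
    (GAP, GAP): 0, (GAP, "A"): 3, (GAP, "T"): 2,
    ("A", "A"): 0, ("A", "T"): 1, ("T", "T"): 0,
})

def pair_distance(t0, t1, score):
    """d_M(t0, t1): sum of column penalties for two rows of equal length."""
    if len(t0) != len(t1):
        raise ValueError("aligned sequences must have equal length")
    return sum(score[(a, b)] for a, b in zip(t0, t1))

def sp_score(rows, score):
    """SP-score: sum of d_M over all unordered pairs of rows."""
    return sum(pair_distance(r, s, score) for r, s in combinations(rows, 2))

unaligned = ["TTATT", "TATAT"]
# A gap-0-1 alignment: one space at the end of the first row,
# one space at the beginning of the second row.
shifted = ["TTATT" + GAP, GAP + "TATAT"]

print(sp_score(unaligned, SCORE))  # 3: three A-T mismatches
print(sp_score(shifted, SCORE))    # 5: one A-T mismatch plus two T-space pairs
```

On sequences this short the two indel penalties outweigh the A-A match created by the shift, so the shifted alignment scores worse here.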
Now let us formally define the multiple alignment problem, the gap-0 multiple alignment problem, and the gap-0-1 multiple alignment problem for a given alphabet $\Sigma$ and scoring matrix $M$. In each case, the instance is a $k$-tuple of sequences of common length [1] of characters from $\Sigma$. The problem is to find a multiple alignment (respectively gap-0 multiple alignment, or gap-0-1 multiple alignment) of the given sequences that minimizes the SP-score with respect to $M$.

Footnote 1. In most biological applications, the sequences to be aligned have approximately equal length, but not necessarily exactly equal length. Note that if multiple alignment of sequences of exactly equal length is computationally intractable, then so is the more general problem of multiple alignment of sequences of "roughly equal" length.

Now let $\Sigma = \{A, T\}$ and let us say that a scoring matrix $M$ is generic if it is of the form

| | $-$ | A | T |
|---|---|---|---|
| $-$ | $x$ | $y$ | $z$ |
| A | $y$ | $v_A$ | $u$ |
| T | $z$ | $u$ | $v_T$ |

FIG. 1. A generic scoring matrix.

where the parameters $x$, $y$ and $z$ are fixed nonnegative numbers [2] and the inequality $u > \max\{0, v_A, v_T\}$ holds. Let us say that a $(w+1) \times (w+1)$ scoring matrix $N$ contains a generic submatrix if there are $1 \le i, j \le w$ such that after deleting all rows and columns of $N$ except those numbered $0, i, j$ one obtains a generic matrix $M$.

Footnote 2. In matrices of practical interest, $x = 0$. Our proofs work regardless of whether $x = 0$ or $x > 0$.

Now let $\mathcal{M}_2$ be the class of all scoring matrices that contain a generic submatrix $M$, let $\mathcal{M}_1$ be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix $M$ with $z > v_T$, and let $\mathcal{M}$ be the class of all scoring matrices that contain a submatrix isomorphic to a generic matrix $M$ with $y > u$ and $z > v_T$.

Recall that an optimization problem is NP-hard if the existence of a polynomial-time algorithm that is guaranteed to find the optimal solution for all instances of this problem implies that P = NP (Garey and Johnson 1979). Here is the main result of this paper.

Theorem 1.
(a) For every scoring matrix $M \in \mathcal{M}$, the multiple alignment problem is NP-hard.
(b) For every scoring matrix $M \in \mathcal{M}_1$, the gap-0 multiple alignment problem is NP-hard.
(c) For every scoring matrix $M \in \mathcal{M}_2$, the gap-0-1 multiple alignment problem is NP-hard.

Of course we have $\mathcal{M}_2 \supseteq \mathcal{M}_1 \supseteq \mathcal{M}$. Even the class $\mathcal{M}$ is very broad; note that $\mathcal{M}$ contains each scoring matrix $M$ for which there is $a_i \in \Sigma$ such that $M$ penalizes mismatches of $a_i$ with some $a_j \in \Sigma$ relative to $a_i$-$a_i$ and $a_j$-$a_j$ matches, penalizes all spaces aligned with $a_i$ more heavily than mismatches between $a_i$ and $a_j$, and penalizes all spaces to some extent. Thus $\mathcal{M}$ appears to cover most scoring schemes used in biological applications. A notable exception are scoring schemes that use a fixed gap penalty or a fixed penalty for gaps that exceed a specified length. Our proof will implicitly show that the gap-0-1 multiple alignment problem for the latter scoring schemes is still NP-hard, but the question remains open for gap-0 multiple alignment and multiple alignment. Some scoring schemes used in practice do not penalize insertion of spaces at the beginning and end of sequences. While such scoring schemes do not formally belong to the classes $\mathcal{M}_2$, $\mathcal{M}_1$ and $\mathcal{M}$, it will be clear from the proofs that the analogue of Theorem 1 remains valid for them.

We will also consider the following scoring matrix $M_0$ for the alphabet $\Sigma_0 = \{A, T, C\}$:

| | $-$ | A | T | C |
|---|---|---|---|---|
| $-$ | 0 | 2 | 2 | 2 |
| A | 2 | 0 | 1 | 0 |
| T | 2 | 1 | 0 | 0 |
| C | 2 | 0 | 0 | 0 |

FIG. 2. The scoring matrix $M_0$.

This scoring matrix does belong to $\mathcal{M}$, but it does not satisfy the triangle inequality and thus is not metric.

Some NP-hard optimization problems have so-called polynomial time approximation schemes (abbreviated PTAS), that is, for every $\varepsilon > 0$ there exists a polynomial-time algorithm $A_\varepsilon$ that is guaranteed to find for each instance a solution that is within a factor of $1 + \varepsilon$ of the optimal solution for this instance [3]. It can be shown that if an optimization problem belongs to a class called MAX-SNP-hard problems, then it does not have a PTAS (unless P = NP) (Arora et al. 1992).

Footnote 3. Many authors use a slightly more stringent definition of a PTAS that requires $\varepsilon$ to be a parameter of a single algorithm. But MAX-SNP-hardness implies the nonexistence even of the weak kind of PTAS defined here.

Theorem 2. For the three-letter alphabet $\Sigma_0$ and the scoring matrix $M_0$ defined above, each of the following problems is MAX-SNP-hard:
(a) The multiple alignment problem.
(b) The gap-0 multiple alignment problem.
(c) The gap-0-1 multiple alignment problem.

It is not known whether there exists a scoring matrix $N$ that is a metric such that the multiple alignment problem, the gap-0 alignment problem, or the gap-0-1 multiple alignment problem for $N$ is MAX-SNP-hard (Jiang et al. 1999). This question is open even if one only requires that all diagonal entries are zero, whereas all off-diagonal entries are positive (Della Vedova 1999).

## 3. Proofs

We will prove Theorems 1 and 2 by reducing the SIMPLE MAX-CUT(B) problem to the respective multiple alignment problems. Recall that an instance of size $k$ of the SIMPLE MAX-CUT(B) problem is a simple graph $G = \langle V, E \rangle$ such that $|V| = k$ and each vertex of $G$ has degree at most $B$. The problem is to find a partition of the set of vertices $V$ into disjoint sets $V_0$ and $V_1$ such that the number of edges that connect a vertex in $V_0$ with a vertex in $V_1$, i.e., the size of the cut determined by $\langle V_0, V_1 \rangle$, is as large as possible. There exists a fixed positive integer $B$ such that the SIMPLE MAX-CUT(B) problem is NP-hard; in fact, $B = 3$ works (Garey and Johnson 1979).

Proof of Theorem 1. Clearly, if the gap-0 multiple alignment problem is NP-hard for each generic scoring matrix $M$ with $z > v_T$, then the gap-0 multiple alignment problem is NP-hard for all matrices in $\mathcal{M}_1$. Analogous observations can be made for $\mathcal{M}_2$ and $\mathcal{M}$. This allows us to prove Theorem 1 by proving NP-hardness of the multiple alignment problems mentioned in it for the respective generic scoring matrices $M$.

Let $k$ be a positive integer, and let $B$ be such that the SIMPLE MAX-CUT(B) problem is NP-hard. Given a graph $G = \langle V, E \rangle$ with $k$ vertices and degree at most $B$, we define a $k^2$-tuple $t_G = \langle t_0, \dots, t_{k^2-1} \rangle$ of sequences as follows: Enumerate $V = \{v_0, \dots, v_{k-1}\}$, $E = \{e_0, \dots, e_{\ell-1}\}$. Each sequence $t_i$ will have length $k^{12}$. Intuitively speaking, for $i < k$, the sequence $t_i$ will encode the vertex $v_i$. Sequences $t_i$ for $i \ge k$ will be dummy sequences consisting entirely of T's. The role of the latter is to ensure that undesirable alignments are heavily penalized. Edge $e_m = \{v_i, v_h\}$ will be encoded by characters $t_{h,j}, t_{i,j}$, where $j = k^7 n + k^7 m + r$, $n < k^5$, $r \in \{1, 2, 3\}$.

More precisely, we define $t_{i,j}$, the $j$-th character in $t_i$, as follows. For $m < \ell$, $e_m = \{v_h, v_i\}$, $h < i$, $n < k^5$ we let:

$$t_{h,\,k^7 n + k^7 m + 2} = t_{i,\,k^7 n + k^7 m + 1} = t_{i,\,k^7 n + k^7 m + 3} = A.$$

In all other cases, we let $t_{i,j} = T$.

Figure 3 illustrates this construction. We exhibit a situation where $e_m = \{v_h, v_i\}$, $e_{m'} = \{v_g, v_h\}$, $m < m'$, $n < n' < k^5$. Each cell shows the five columns $j, \dots, j+4$ for the indicated $j$.

| Sequence | $j = k^7 n + k^7 m$ | $j = k^7 n + k^7 m'$ | $j = k^7 n' + k^7 m$ |
|---|---|---|---|
| $t_g$ | T T T T T | T T A T T | T T T T T |
| $t_h$ | T T A T T | T A T A T | T T A T T |
| $t_i$ | T A T A T | T T T T T | T A T A T |
| $t_p$ | T T T T T | T T T T T | T T T T T |

FIG. 3. Coding a graph in the proof of Theorem 1.

Now consider a gap-0-1 alignment $a$ of the sequences $t_G$. Such an alignment naturally induces a partition of $V$ into disjoint subsets $V_0^a$ and $V_1^a$, where $V_1^a$ consists of all vertices $v_i$ such that $a$ appends a space at the beginning of $t_i$ (i.e., shifts $t_i$ to the right) and $V_0^a$ consists of all vertices $v_i$ such that $a$ appends a space at the end of $t_i$ (i.e., $t_i$ remains in place).
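The encoding of a graph as sequences and the partition induced by a gap-0-1 alignment can be sketched as follows. This is a scaled-down illustration under explicit assumptions: `copies`, `spacer`, and `num_dummies` are hypothetical parameters standing in for the large powers of $k$ in the text, and the gadget for edge number $m$ in copy $n$ is placed at column `spacer * (n * len(edges) + m)`, a simplified layout chosen for this sketch rather than the index formula of the paper.

```python
def encode_graph(num_vertices, edges, copies, spacer, num_dummies):
    """Return sequences over {A, T}: one per vertex, then all-T dummies.

    Each edge (h, i) with h < i gets `copies` gadgets; in a gadget starting
    at column `base`, row h carries A at base+2 and row i carries A at
    base+1 and base+3. The layout is an assumption of this sketch.
    """
    if spacer < 5:
        raise ValueError("spacer must be at least 5 to keep gadgets apart")
    length = spacer * copies * len(edges)
    rows = [["T"] * length for _ in range(num_vertices + num_dummies)]
    for n in range(copies):
        for m, (h, i) in enumerate(edges):
            base = spacer * (n * len(edges) + m)
            rows[h][base + 2] = "A"
            rows[i][base + 1] = rows[i][base + 3] = "A"
    return ["".join(r) for r in rows]

def apply_gap01(seqs, shifts):
    """shift 1: space at the beginning; shift 0: space at the end."""
    return ["-" + t if s else t + "-" for t, s in zip(seqs, shifts)]

def partition_from_shifts(shifts, num_vertices):
    """Decode (V0, V1) from the shifts of the vertex rows only."""
    v1 = {i for i in range(num_vertices) if shifts[i] == 1}
    v0 = set(range(num_vertices)) - v1
    return v0, v1

def cut_size(edges, v1):
    """Number of edges with exactly one endpoint in v1."""
    return sum(1 for h, i in edges if (h in v1) != (i in v1))

# Path graph v0 - v1 - v2, one copy per edge, one dummy sequence.
EDGES = [(0, 1), (1, 2)]
seqs = encode_graph(3, EDGES, copies=1, spacer=5, num_dummies=1)
shifts = [0, 1, 0, 0]                # shift only the row of v1 to the right
aligned = apply_gap01(seqs, shifts)
v0, v1 = partition_from_shifts(shifts, 3)
print(seqs)
print(aligned)
print(v0, v1, cut_size(EDGES, v1))   # both edges are cut by this partition
```

Decoding only reads one flag per vertex row, which is why the partition can be recovered from an alignment in polynomial time.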
Let $c_a$ denote the number of edges in $G$ that connect vertices in $V_0^a$ with vertices in $V_1^a$, i.e., $c_a$ denotes the size of the cut induced by the partition $\langle V_0^a, V_1^a \rangle$. We will show that if $k$ is sufficiently large (i.e., $k \ge k_0$ for some fixed $k_0$) and $a$ is an optimal gap-0-1 alignment for a generic matrix $M$ of the sequences $t_G$, then $c_a$ is maximal. To see that this suffices for the proof of Theorem 1(c), note that the partition $\langle V_0^a, V_1^a \rangle$ can be decoded from $a$ by a polynomial-time algorithm and every partition of $V$ can be represented as $\langle V_0^a, V_1^a \rangle$ for a suitable gap-0-1 alignment $a$. It follows that if there exists a polynomial-time algorithm $A$ for gap-0-1 alignment with respect to $M$, then a polynomial-time algorithm for the SIMPLE MAX-CUT(B) problem can be obtained as follows: For graphs with $k \ge k_0$ vertices, encode the graph as a multiple sequence alignment problem in the way described above, run algorithm $A$ to find the optimal gap-0-1 alignment, and then decode the partition $\langle V_0^a, V_1^a \rangle$ from the alignment. For the finitely many graphs of degree at most $B$ with fewer than $k_0$ vertices, construct a lookup table of optimal solutions of the SIMPLE MAX-CUT problem, and use it for the algorithm. Note that using the lookup table only adds a constant (although possibly a large one) to the execution time of the algorithm. Throughout the remainder of this paper, we will without further comments always assume that $k$ is "sufficiently large."

So let $M$ be a generic scoring matrix. Let us estimate the SP-score for the aligned sequences $\langle at_0, \dots, at_{k^2-1} \rangle$. This score has two components: indel (plus possibly gap opening) penalties and scores for character matches/mismatches. Since indel and gap opening penalties occur only in the first and last columns, the total of those penalties will be of order $O(k^4)$, which for sufficiently large $k$ will be negligible.
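The claim that an optimal gap-0-1 alignment yields a maximum cut can be checked by brute force on a toy instance. The sketch below makes several assumptions that are not in the paper: it uses no dummy sequences, a simplified gadget layout with hypothetical parameters `copies` and `spacer`, made-up penalties ($u = y = z = 1$, $x = v_A = v_T = 0$), and exhaustive search over all $2^k$ shift patterns in place of a hypothetical polynomial-time alignment algorithm.

```python
from itertools import combinations, product

def encode(num_vertices, edges, copies, spacer=5):
    """Vertex rows over {A, T}; the gadget layout is an assumption of this sketch."""
    length = spacer * copies * len(edges)
    rows = [["T"] * length for _ in range(num_vertices)]
    for n in range(copies):
        for m, (h, i) in enumerate(edges):   # edges are stored with h < i
            base = spacer * (n * len(edges) + m)
            rows[h][base + 2] = "A"
            rows[i][base + 1] = rows[i][base + 3] = "A"
    return ["".join(r) for r in rows]

# Illustrative generic matrix: x = v_A = v_T = 0 and u = y = z = 1.
SCORE = {
    frozenset("-"): 0, frozenset("A"): 0, frozenset("T"): 0,
    frozenset("AT"): 1, frozenset("-A"): 1, frozenset("-T"): 1,
}

def sp_score(rows, score):
    """Sum of column penalties over all unordered pairs of rows."""
    return sum(score[frozenset((a, b))]
               for r, s in combinations(rows, 2) for a, b in zip(r, s))

def cut_size(edges, shifts):
    """Edges whose endpoints received different shifts."""
    return sum(1 for h, i in edges if shifts[h] != shifts[i])

def best_gap01_alignment(seqs, score):
    """Exhaustive search over all gap-0-1 alignments; returns (score, shifts)."""
    best = None
    for shifts in product((0, 1), repeat=len(seqs)):
        aligned = ["-" + t if s else t + "-" for t, s in zip(seqs, shifts)]
        candidate = (sp_score(aligned, score), shifts)
        if best is None or candidate < best:
            best = candidate
    return best

# A 4-cycle with one chord; its maximum cut has size 4.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
seqs = encode(4, EDGES, copies=3)
best_score, best_shifts = best_gap01_alignment(seqs, SCORE)
max_cut = max(cut_size(EDGES, s) for s in product((0, 1), repeat=4))
print(best_score, best_shifts, cut_size(EDGES, best_shifts), max_cut)
```

With `copies=1` the first-column and last-column indel penalties are large enough to tie with the gain from the cut on this graph, which mirrors the role of "sufficiently large $k$" in the argument: the repeated gadgets must outweigh the boundary terms.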
Recall that $u$, the penalty for A-T mismatches, was assumed to be greater than $\max\{0, v_A, v_T\}$. The total number of character mismatches in the unaligned sequences is $3\ell k^5(k^2 - 1)$. The idea of the proof is to find a gap-0-1 alignment $a$ that maximally reduces this number by creating as many A-A matches as possible. A gap-0-1 alignment can create an A-A match only if the two A's are in adjacent columns, and each such newly created match will eliminate precisely two A-T mismatches. Note that whenever $e = \{v_h, v_i\} \in E$ and $v_h, v_i$ end up in different parts of the partition $\langle V_0^a, V_1^a \rangle$ (i.e., the edge $e$ is cut by the partition), then a total of $k^5$ A-A matches between sequences $t_h$ and $t_i$ are created, that is, $2k^5$ A-T mismatches between these sequences are eliminated. No other A-T mismatches can be eliminated by a gap-0-1 alignment, nor can a gap-0-1 alignment introduce additional A-T mismatches. It follows that the total SP-score for the aligned sequences is equal to

$$k^{12} v_T k^2 (k^2 - 1)/2 + 3\ell k^5 (u - v_T)(k^2 - 1) - c_a k^5 (2u - v_A - v_T) + O(k^4),$$

and thus for sufficiently large $k$, the optimal gap-0-1 alignment of $t_G$ yields a partition of $V$ that maximizes $c_a$.

For the proof of Theorem 1(b), let $M$ be a generic scoring matrix with $z > \max\{0, v_T\}$. We will refer to the vector $\langle at_{0,j}, \dots, at_{k^2-1,j} \rangle$ of $j$-th characters of the aligned sequences as the $j$-th column of the alignment. Note that we can compute the SP-score (excluding gap opening penalties) of an alignment $a$ as $\sum_i \sum_j 0.5\, sc_a(t_{i,j})$, where $i$ ranges over the sequences in the alignment, $j$ ranges over the columns in the alignment, and $sc_a(t_{i,j})$ is the sum over all pairwise scores between $t_{i,j}$ and the other symbols in the same column. (In particular, if $a_0$ is the alignment without any space symbols, then $sc_{a_0}(t_{i,j}) = \sum_{i' \ne i} d_M(t_{i,j}, t_{i',j})$.)

Lemma 3. If $z > \max\{0, v_T\}$ and $a$ is an optimal gap-0 alignment or an optimal multiple alignment of the sequences $t_G$, then at most $O(k^6)$ columns of $a$ contain space symbols.

Proof. Consider the alignment $a_0$ that does not contain any spaces whatsoever, and let $a$ be an alignment with better score than $a_0$. Note that our assumption on $z$ implies that the score for $a_0$ can be improved only by replacing some A-T mismatches by T-T matches, or, if $y < u$, by A-$-$ pairs. On the other hand, replacing any T-T match by a T-$-$ pair will worsen the score by $z - v_T$. Since $\ell \le Bk/2$, only $O(k^6)$ of the columns of the unaligned sequences contain any A's. Thus the maximum possible improvement in the score of $a_0$ that can be achieved by inserting spaces is of the order $O(k^8)$.

For each column $c$ of $a$, let us define the net gain contributed by this column as

$$ng(c) = \sum_{t_{i,j} \in c} \big( sc_{a_0}(t_{i,j}) - sc_a(t_{i,j}) \big).$$

Of course, a negative net gain is a net loss. Now suppose a column $c$ of $a$ contains at least one space symbol and $ng(c) \ge 0$. If $z > u$, then it is easy to see that this column must contain at least one occurrence of A. If $z \le u$, then either $c$ contains at least one occurrence of A, or $c$ contains at most $\lfloor u/z \rfloor$ space symbols and at least $\lceil (k^2 - 1)(z - v_T)/(u - v_T) \rceil$ T's from columns of $a_0$ that contain an occurrence of A. Let us relax these requirements a little and say that column $c$ of $a$ is benign if either it contains an occurrence of A or $c$ contains at most $2\lceil u/z \rceil + 1$ space symbols and at least $0.5 \lceil (k^2 - 1)(z - v_T)/(u - v_T) \rceil$ T's from columns of $a_0$ that contain an occurrence of A. Then there are at most $O(k^6)$ benign columns in $a$, and each column that is not benign contributes a net loss of at least $0.5 (k^2 - 1) \min\{z - v_T, (z - v_T)^2/(u - v_T)\}$. Since the total gain of order $O(k^8)$ must outweigh the combined net loss of all columns, we conclude that all but $O(k^6)$ columns of $a$ are benign, and the lemma follows.
The definition of the partition $\langle V_0^a, V_1^a \rangle$ for a gap-0-1 alignment $a$ of $t_G$ can be generalized to gap-0 alignments in a natural way. In the latter case, $V_0^a$ will consist of all vertices $v_i$ such that $a$ appends an even number of spaces at the beginning of $t_i$, and $V_1^a$ will consist of all vertices $v_i$ such that $a$ appends an odd number of spaces at the beginning of $t_i$. For each gap-0 alignment $a$ one can define a gap-0-1 alignment $a^*$ that appends a space at the beginning of $t_i$ if and only if $i < k$ and $v_i \in V_1^a$. Then $V_0^{a^*} = V_0^a$ and $c_{a^*} = c_a$.

Let $a_0$ denote the alignment that contains no spaces, and let us analyse how much the SP-score of $a_0$ can be reduced by an optimal gap-0 alignment $a$. The total penalty for A-T mismatches can be reduced by creating A-A matches or, if $y < u$, by shifting some offending A's to the side where they are aligned with spaces rather than T's. The A's come in groups of three that reside in consecutive columns of $a_0$ and are separated by spacers of length $k^7 - 3$. Lemma 3 implies that $a$ can shift sequences only by distances that are much shorter than the spacers. It follows that $a$ can create matches only between two A's that sit in adjacent columns of $a_0$, and $a$ cannot reduce penalties by shifting more than the three leftmost A's "to the side." But for each match between A's from neighboring columns of $a_0$ that is created by $a$, such a match is also created by $a^*$. Thus, the SP-score for the optimal gap-0 alignment $a$ will again be equal to

$$k^{12} v_T k^2 (k^2 - 1)/2 + 3\ell k^5 (u - v_T)(k^2 - 1) - c_a k^5 (2u - v_A - v_T) + O(k^4),$$

and $a$ induces a partition of $V$ that maximizes $c_a$, which implies Theorem 1(b).

Finally, let $M$ be a generic scoring matrix with $y > u$ and $z > v_T$, and let $a$ be a multiple alignment that minimizes $SP_M(at_0, \dots, at_{k^2-1})$. Let us think of the sequences $t_G$ as forming $k^5$ consecutive blocks, where block number $n$ consists of all columns of $a_0$ numbered $k^7 n$ through $k^7 (n + 1)$. For $0 < n < k^5 - 1$, let us refer to columns numbered $k^7 n - \lfloor k^7/2 \rfloor$ through $k^7 (n + 1) - \lfloor k^7/2 \rfloor - 1$ of $a$ as a-block number $n$. Furthermore, a-block number 0 will consist of all positions to the left of a-block number 1, and a-block number $k^5 - 1$ will consist of all positions to the right of a-block number $k^5 - 2$. Lemma 3 implies that for all $n$, the A's from block number $n$ of the unaligned sequences must end up in a-block number $n$ of the aligned sequences $\langle at_0, \dots, at_{k^2-1} \rangle$.

Now let us consider a-block number $n$, which will be denoted by $B_n$, and let us estimate the combined net gain or net loss over all columns of $B_n$. There are two possibilities:

Case 1: $B_n$ does not contain a space symbol. In this case, we let $V_0^{a,n}$ be the set of all $v_i$ such that $a$ inserts an even number of space symbols into $t_i$ to the left of $B_n$, and let $V_1^{a,n}$ be the set of all $v_i$ such that $a$ inserts an odd number of space symbols into $t_i$ to the left of $B_n$. Let $c_{a,n}$ be the size of the cut determined by the partition $\langle V_0^{a,n}, V_1^{a,n} \rangle$ of $V$. An argument as in the proof of part (b) shows that the combined net gain of all columns of $a$ on $B_n$ will be at most $2 c_{a,n} (2u - v_A - v_T)$.

Case 2: $B_n$ does contain a space symbol. First note that insertion of space symbols might increase the number of A-A matches over what can be achieved by a gap-0 alignment, since the number of such matches will no longer be bounded by the size of any cut. However, Lemma 3 still implies that these matches have to be between A's from adjacent columns. Thus the number of A-A matches is bounded by $\ell$; in other words, the combined net gain $sc_{a_0}(t_{i,j}) - sc_a(t_{i,j})$ over all symbols $t_{i,j}$ in $B_n$ is bounded by $2\ell(u - v_A - v_T)$, which is of order $O(k)$, since $\ell \le Bk/2$. Now let $\varepsilon = \min\{y - u, z - v_T\}$. Then any column that contains a space symbol contributes a net loss of at least $\varepsilon (k^2 - 1) - 2u + v_A - v_T$, and it follows that the SP-score for $a$ on $B_n$ is worse than the SP-score for $a_0$ on $B_n$.

Now let us estimate the total SP-score for the alignment $a$.
Let $U$ be the set of all $n < k^5$ such that a-block number $n$ does not contain spaces. Then

$$SP_M^a(t_G) \ge v_T (k^{16} - k^{14})/2 + 3\ell k^5 (u - v_T)(k^2 - 1) - \sum_{n \in U} c_{a,n} k^5 (2u - v_A - v_T) + O(k^4).$$

Let $b$ be an optimal gap-0-1 multiple alignment of the sequences $t_G$. Since the optimal multiple alignment $a$ cannot have a score that is worse than that of an optimal gap-0-1 multiple alignment, we must have

$$SP_M^a(t_G) \le v_T (k^{16} - k^{14})/2 + 3\ell k^5 (u - v_T)(k^2 - 1) - 2 c_b k^5 (2u - v_A - v_T) + O(k^4).$$

It follows that $c_{a,n} = c_b$ for most $n$, and thus for most $n$ the partition $\langle V_0^{a,n}, V_1^{a,n} \rangle$ maximizes the size of the cut in $G$. Since the largest of the numbers $c_{a,n}$ and the corresponding partition $\langle V_0^{a,n}, V_1^{a,n} \rangle$ can easily be extracted from $a$ by a polynomial-time algorithm, part (a) of Theorem 1 follows.

Proof of Theorem 2. Our argument does not require a formal definition of the class MAX-SNP. It suffices to know that there is a positive integer $B$ such that the SIMPLE MAX-CUT(B) problem is MAX-SNP-complete (Papadimitriou and Yannakakis 1991). We will show MAX-SNP-hardness of our multiple alignment problems by showing that there are L-reductions of the SIMPLE MAX-CUT(B) problem to scaled versions of each of the following problems: gap-0-1 multiple alignment for $M_0$, gap-0 multiple alignment for $M_0$, and multiple alignment for $M_0$. This establishes MAX-SNP-hardness in the sense of Arora and Lund (1997), who call an optimization problem $\Pi'$ MAX-SNP-hard if there exist a MAX-SNP-complete problem $\Pi$ and a gap-preserving reduction of $\Pi$ to $\Pi'$. This definition explicitly allows scaling of objective functions (see Arora and Lund (1997), page 411).

Let us recall the notion of an L-reduction.
If $\Pi$ and $\Pi'$ are two optimization (maximization or minimization) problems, then $\Pi$ L-reduces to $\Pi'$ if there are two polynomial-time algorithms $f, g$ and constants $\alpha, \beta > 0$ such that for each instance $I$ of $\Pi$:

(a) Algorithm $f$ produces an instance $I' = f(I)$ of $\Pi'$, such that the optima of $I$ and $I'$, $OPT(I)$ and $OPT(I')$, respectively, satisfy $OPT(I') \le \alpha\, OPT(I)$.
(b) Given any solution of $I'$ with cost $c'$, algorithm $g$ produces a solution of $I$ with cost $c$ such that $|c - OPT(I)| \le \beta\, |c' - OPT(I')|$.

Let us define a minimization problem $\Pi_0$ as follows: An instance of $\Pi_0$ is a simple graph $G = \langle V, E \rangle$ with degree at most $B$. For every partition $P = \langle V_0, V_1 \rangle$ of $V$, let $c_P$ be the size of the cut determined by $P$. The objective of $\Pi_0$ is to find a partition $P$ of $V$ that minimizes the number $d_P = 3|E| - 2c_P$.

Here is the L-reduction of $\Pi_0$ to scaled versions of the multiple alignment problems: Given a graph $G = \langle V, E \rangle$ with $k$ vertices and degree at most $B$, we define a $k^2$-tuple $t_G = \langle t_0, \dots, t_{k^2-1} \rangle$ of sequences as follows: Enumerate $V = \{v_0, \dots, v_{k-1}\}$, $E = \{e_0, \dots, e_{\ell-1}\}$. Each sequence $t_i$ will have length $k^{12}$. We define $t_{i,j}$, the $j$-th character in $t_i$, as follows. For $m < \ell$, $e_m = \{v_h, v_i\}$, $h < i$, $n < k^5$ we let:

$$t_{h,\,k^7 n + k^7 m + 2} = t_{i,\,k^7 n + k^7 m + 1} = t_{i,\,k^7 n + k^7 m + 3} = A,$$

$$t_{h,\,k^7 n + k^7 m + 1} = t_{h,\,k^7 n + k^7 m + 3} = t_{i,\,k^7 n + k^7 m + 2} = T.$$

In all other cases, we let $t_{i,j} = C$.

Figure 4 illustrates this construction. Again, we exhibit a situation where $e_m = \{v_h, v_i\}$, $e_{m'} = \{v_g, v_h\}$, $m < m'$, $n < n' < k^5$. Each cell shows the five columns $j, \dots, j+4$ for the indicated $j$.

| Sequence | $j = k^7 n + k^7 m$ | $j = k^7 n + k^7 m'$ | $j = k^7 n' + k^7 m$ |
|---|---|---|---|
| $t_g$ | C C C C C | C T A T C | C C C C C |
| $t_h$ | C T A T C | C A T A C | C T A T C |
| $t_i$ | C A T A C | C C C C C | C A T A C |
| $t_p$ | C C C C C | C C C C C | C C C C C |

FIG. 4. Coding a graph in the proof of Theorem 2.
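The two conditions of an L-reduction and the objective $d_P = 3|E| - 2c_P$ defined above can be checked exhaustively on a small graph. The following is a sketch under stated assumptions: both maps $f$ and $g$ are taken to be the identity, the constants `alpha` and `beta` are plain parameters (the sample call uses $\alpha = 3B$ with $B = 3$ and $\beta = 2$, the values chosen later in the proof), and the example graph and function names are made up for illustration.

```python
from itertools import product

def cut_size(edges, side):
    """Size of the cut for a 0/1 labelling `side` of the vertices."""
    return sum(1 for h, i in edges if side[h] != side[i])

def d_value(edges, side):
    """Objective of the minimization problem: d_P = 3|E| - 2 c_P."""
    return 3 * len(edges) - 2 * cut_size(edges, side)

def check_l_reduction(num_vertices, edges, alpha, beta):
    """Check conditions (a) and (b) with f and g both the identity map."""
    sides = list(product((0, 1), repeat=num_vertices))
    opt_cut = max(cut_size(edges, s) for s in sides)   # OPT(I), maximization
    opt_d = min(d_value(edges, s) for s in sides)      # OPT(I'), minimization
    cond_a = opt_d <= alpha * opt_cut
    cond_b = all(
        abs(cut_size(edges, s) - opt_cut) <= beta * abs(d_value(edges, s) - opt_d)
        for s in sides
    )
    return opt_cut, opt_d, cond_a, cond_b

# A 4-cycle with one chord; every vertex has degree at most B = 3.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
result = check_l_reduction(4, EDGES, alpha=9, beta=2)
print(result)
```

Because $d_P$ is a decreasing affine function of $c_P$, a partition minimizes $d_P$ exactly when it maximizes the cut, which is what the exhaustive check confirms on this instance.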
An argument very similar to the reasoning in the proof of Theorem 1 shows that if $a$ is the optimal gap-0-1 multiple alignment, gap-0 multiple alignment, or multiple alignment for $M_0$, then

$$(*) \quad SP_{M_0}^a(t_G) = (3\ell - 2c_a) k^5 + O(k^4),$$

where $c_a$ is the size of the maximal cut in $G$.

Now it is immediately clear that $\Pi_0$ L-reduces to each of the three alignment problems, if the SP-score is scaled by a factor of $k^{-5/2}$ for every multiple alignment problem that involves $k$ sequences.

Since L-reductions compose, it now suffices to show that the SIMPLE MAX-CUT(B) problem L-reduces to $\Pi_0$. Let $G = \langle V, E \rangle$ be a simple graph of degree at most $B$. The functions $f$ and $g$ in the definition of an L-reduction will simply be the identity. Note that for any partition $P$ of $V$ that maximizes the size of the corresponding cut, each vertex of degree at least 1 contributes at least one adjacent edge to the cut induced by $P$: If not, the size of the cut could be increased by moving the offending vertex to the other side of the partition. It follows that if the degrees in $G$ are bounded by $B$, then $c_P \ge |E|/B$. Since $d_P = 3|E| - 2c_P \le 3|E|$, we can set $\alpha = 3B$. Since any increase of $c_P$ by 1 corresponds to a decrease of $d_P$ by 2, we can set $\beta = 2$, and the conditions of an L-reduction will be satisfied.

## 4. Acknowledgements

I would like to thank Liming Cai for bringing the problem to my attention, and David Juedes, Gianluca Della Vedova and Tao Jiang for valuable comments on earlier versions of this paper and a prequel (Just 1999) to it. I also thank the referee of the first version of this paper for pointing out a mistake in the argument.

## 5. References

Arora, S., Lund, C., Motwani, R., Sudan, M., and Szegedy, M. 1992. Proof verification and intractability of approximation problems, 13-22. In Proc. 33rd IEEE Symp. on Foundations of Computer Science.

Arora, S., and Lund, C. 1997. Hardness of Approximations, 399-446. In Hochbaum, D.S., ed., Approximation Algorithms for NP-hard Problems, PWS.

Benner, S.A., Cohen, M.A., and Gonnet, G.H. 1993. Empirical and Structural Models for Insertions and Deletions in the Divergent Evolution of Proteins. J. Mol. Biol. 229, 1065-1082.

Bonizzoni, P., and Della Vedova, G. 2000. The complexity of multiple alignment with SP-score that is a metric. To appear in Theoretical Computer Science.

Carillo, H., and Lipman, D. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48(5), 1073-1082.

Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In M.O. Dayhoff, ed., Atlas of Protein Sequence and Structure, 345-352. Nat. Biomed. Res. Found., 5, supp. 3.

Della Vedova, G. 1999. Personal Communication.

Fasman, K.H., and Salzberg, S.L. 1998. An introduction to biological sequence analysis, 21-42. In Salzberg, S.L., Searls, D.B., and Kasif, S., eds., Computational Methods in Molecular Biology, Elsevier.

Fitch, W.M. 1993. Letter to the Editor: Commentary on the letter by Ward C. Wheeler. Molecular Biology and Evolution 10(3), 713-714.

Garey, M.R., and Johnson, D.S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman.

Henikoff, S., and Henikoff, J. 1992. Amino acid substitution matrices from protein blocks. Proc. Nat. Acad. of Sci., USA, 89, 10915-10919.

Jiang, T. 1999. Personal Communication.

Jiang, T., Kearney, P., and Li, M. 1999. Some Open Problems in Computational Molecular Biology. SIGACT News 30(3), 43-49.

Just, W. 1999. On the computational complexity of gap-0 multiple alignment. Preprint.

Papadimitriou, C., and Yannakakis, M. 1991. Optimization, approximation and complexity classes. J. of Computer and System Sciences 43, 425-440.

Pascarella, S., and Argos, P. 1992. Analysis of Insertions/Deletions in Protein Structures. J. Mol. Biol. 224, 461-471.

Wang, L., and Jiang, T. 1994. On the complexity of multiple sequence alignment. Journal of Computational Biology 1(4), 337-348.

Wheeler, W.C. 1993. Letter to the Editor: The Triangle Inequality and Character Analysis. Molecular Biology and Evolution 10(3), 707-712.

## This note was uploaded on 05/20/2011 for the course CAP 5515 taught by Professor Ungor during the Spring '08 term at University of Florida.
