COMPLEXITY OF COMMON SUBSEQUENCE AND SUPERSEQUENCE PROBLEMS AND RELATED PROBLEMS

V. G. Timkovskii

UDC 519.6

The article examines polynomial-time and intractable longest common subsequence and subword problems and shortest common supersequence and superword problems, both old and new. The results provide a more complete complexity characterization of these problems. Some applications are discussed, as well as the dual problems of common nonsubwords, nonsuperwords, nonsubsequences, and nonsupersequences.

In this paper, we consider old and new polynomial-time and NP-hard problems of finding longest common subsequences and subwords and shortest common supersequences and superwords [1, pp. 289, 290]. The results provide a more complete characterization of the complexity of these problems. We also discuss the dual problems of common nonsubwords, nonsuperwords, nonsubsequences, and nonsupersequences. Some open questions are formulated in the conclusion.

INTRODUCTION

Statement of the Problems

Let x and y be words in some alphabet. A word x is called a subsequence of the word y, and y a supersequence* of the word x, if x can be obtained by erasing from y any number of letters (zero or more). A subsequence x of the word y is called a subword of the word y, and a supersequence y of the word x is called a superword of the word x, if y can be obtained by adjoining any number of letters (zero or more) to the right and to the left of x. For example, the word THEM is a subsequence and a subword, and the word ATTIC is a subsequence but not a subword, of the word MATHEMATICS. The word MATHEMATICS is a supersequence and a superword of the word THEM, and a supersequence but not a superword of the word ATTIC.

Let L be a nonempty finite language in some alphabet A. A word z in this alphabet is called a common subsequence of the words from L if z is a subsequence of each word from L. A common supersequence, a common subword, and a common superword are defined similarly. We thus have the following natural optimization problems, asking for the corresponding strings in a given language L over the alphabet A:

LCS -- longest common subsequence;
SCS -- shortest common supersequence;
LCSW -- longest common subword;
SCSW -- shortest common superword.

Generalizations and modifications of these problems can be found in [1-3]. Without loss of generality, we assume that all the letters of the alphabet A occur in words from L and that A contains more than one letter; for a one-letter alphabet, all four problems are trivial. The orbit of the letter b is the set O_b of all its occurrences in the words of the language L. For example, if A = {1, 2, 3, 4} and L = {413, 2343, 432}, then 43 is an LCS, 234132 is an SCS, 4 is an LCSW, and 41323432 is an SCSW. For any language L, we obviously have length LCSW ≤ length LCS and length SCS ≤ length SCSW. We use the following notation: |f| is cardinality when f is a set and length when f is a word;
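These definitions are easy to check mechanically; a minimal sketch in Python (the function names are ours, not the paper's):

```python
def is_subsequence(x, y):
    """True if x can be obtained from y by erasing letters (zero or more)."""
    it = iter(y)
    return all(c in it for c in x)  # each letter of x must appear, in order

def is_subword(x, y):
    """True if x occurs in y as one contiguous block."""
    return x in y

# The paper's examples:
assert is_subsequence("THEM", "MATHEMATICS") and is_subword("THEM", "MATHEMATICS")
assert is_subsequence("ATTIC", "MATHEMATICS") and not is_subword("ATTIC", "MATHEMATICS")
```

The `c in it` idiom consumes the iterator over y, so the letters of x are matched left to right in a single pass.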

a = |A|;

l = |L|;

*Only finite alphabets are considered, and the terms "subsequence" and "supersequence" refer to words of finite length. Translated from Kibernetika, No. 5, pp. 1-13, September-October, 1989. Original article submitted March 10, 1988. 0011-4235/89/2505-0565$12.50

© Plenum Publishing Corporation


m = min{|x| : x ∈ L}; n = max{|x| : x ∈ L}; r = max{|O_b| : b ∈ A}; s = Σ_{x∈L} |x|.

For our example, a = 4, l = 3, m = 3, n = 4, r = 4, s = 10; O_3 is the maximum orbit. The particular cases of the LCS, SCS, LCSW, and SCSW problems considered in this paper are obtained by fixing one or two parameters from the set {a, l, m, n, r, s - a}.* The parameter s may be viewed as the dimension of each of the four problems. Obvious polynomial-time algorithms exist for checking whether a given word is a subsequence (supersequence, subword, superword) of another given word. The recognition analogs of all four problems are therefore contained in the class NP [1].

Applications

These four problems have fairly broad applications. Judging from the literature, the two-word LCS problem arose in the 1960s in molecular genetics in connection with the evolution of protein molecules [4-6]. The LCS length provided a convenient measure of similarity for long molecules considered as nucleotide sequences [7], and also for arbitrary sequential objects, such as files. Somewhat later, related problems began to be considered in the context of data processing systems. Text editing must deal with the problem of converting one word into another by a minimum number of edit operations -- deletion and insertion of a letter, replacement of a letter with another, transposition of neighboring letters [8, 9]. Data compression [10] requires storing a large number of similar files with maximum space economy. This can be achieved by storing the LCS or SCS of the files together with a collection of shorter codes for restoring the files. When the LCSW or SCSW is used for the same purpose, the file restore operation becomes faster, with some loss of space efficiency. In cluster analysis, LCS are used to estimate the closeness of an object to a template [11]. SCS and LCS problems have recently found applications in mechanical engineering [12]. Here, SCS play a special role: they are used to identify the shortest standardized technological process for a given set of separate technological processes, viewed as sequences of processing operations [13, 14]. The LCSW problem can be used to identify larger standard operation blocks.

1. LCS AND LCSW PROBLEMS

LCS for Two Words

A variety of algorithms are available for finding the LCS of two words [8, 9, 15-26]. The following table lists the time and space characteristics of some of these algorithms. Here, p is the LCS length and q is the number of pairs of matching occurrences of a single letter in the two words.

Time                             Memory     Year   Source
O(mn)                            O(mn)      1974   [8]
O(mn)                            O(m + n)   1975   [15]
O(np)                            O(mn)      1977   [17]
O(p(m - p) log n)                O(mn)      1977   [17]
O((n + q) log n)                 O(q + n)   1977   [18]
O(mn / log m)                    O(mn)      1980   [19]
O(n(m - p))                      O(mn)      1982   [20]
O(pm log(n/p) + pm)              O(mn)      1984   [25]
O(n log a + pm log(n/m) + pm)    O(mn)      1984   [25]
O(n(m - p))                      O(m + n)   1987   [26]

One of the first algorithms was proposed in [8]. This is a dynamic programming algorithm that can be used to solve a more general editing problem than the one mentioned above. The algorithms of [25] are the fastest for the case when one word is much longer than the other, while the algorithm of [20] is the fastest for two similar words. Among linear-space algorithms, the best time is observed for the algorithm of [26]. Upper and lower bounds on the number of letter comparisons to find the LCS of two words were obtained in [27, 22].
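The quadratic-time, quadratic-space dynamic program in the spirit of [8] can be sketched as follows (an illustrative Python version, not the published code):

```python
def lcs(x, y):
    """One longest common subsequence of two words by the classical O(mn) table."""
    m, n = len(x), len(y)
    # L[i][j] = LCS length of the prefixes x[:i] and y[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Recover one LCS by walking back through the table.
    out, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

assert lcs("413", "2343") == "43"  # the two-word case of the paper's example
```

Keeping only two rows of the table gives the O(m + n)-space length computation of [15]; recovering an actual LCS in linear space requires the divide-and-conquer refinement.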

* The difference s - a may be regarded as a measure of dissimilarity of the words from L by the composition of letters that they contain: as s - a tends to zero, the length of any common subsequence tends to zero and the length of any shortest common supersequence tends to s.


The development of processor-dependent algorithms substantially reduces the time needed to find a two-word LCS. The algorithm described in [23] runs in time O(m⌈n/w⌉) on a w-bit machine and in time O(m) on an n-bit machine. An O(m + n)-time algorithm was proposed in [24] for a systolic processor.
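The idea of packing a whole DP row into machine words can be conveyed by a bit-parallel recurrence in the style of Allison and Dix; here Python's unbounded integers stand in for the w-bit machine words (an illustrative sketch, not the algorithm of [23]):

```python
def lcs_length_bitparallel(x, y):
    """LCS length with one integer holding a whole DP row: set bits mark the
    positions where the row of the classical table increases."""
    # M[c]: bitmask of the positions of letter c in y (bit j <-> y[j]).
    M = {}
    for j, c in enumerate(y):
        M[c] = M.get(c, 0) | (1 << j)
    row = 0
    for c in x:
        u = row | M.get(c, 0)
        # Bit-parallel row update; the subtraction propagates carries the way
        # the column-by-column DP propagates its maxima.
        row = u & ~(u - ((row << 1) | 1))
    return bin(row).count("1")

assert lcs_length_bitparallel("413", "2343") == 2
```

On a real w-bit machine the row is split into ⌈n/w⌉ words and the carries of the subtraction are chained across them, which is what yields the O(m⌈n/w⌉) behavior.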

LCS for Many Words

In general, an LCS can be found in time O(l2^m n) by a simple enumerative procedure that identifies the shortest word in L and all the subsequences of this word, and then checks the other words in L for the presence of these subsequences. The LCS problem is thus solvable in polynomial time for any fixed m (and therefore for any fixed n, because m ≤ n). We will show in what follows how to find an exact solution by dynamic programming and an approximate solution by the tournament method. For a fixed a = 2, the problem is NP-hard [28]. The proof is fairly involved, using the vertex cover problem [1].

LCSW

A similar enumerative procedure that identifies all subwords, instead of subsequences, will find an LCSW in polynomial time O(lm²n). A linear-time algorithm that recognizes the inclusion of one word in another [2, 29] should be used in this case to check for the presence of the identified subwords in the other words of the language L. An O(m + n)-time algorithm that finds the LCSW of two words was proposed in [30]. It uses the concept of a positional tree [2], which is a convenient instrument for the construction of effective algorithms for other subword problems as well [2]. A simplified version of the algorithm of [30] is described in [3]. Effective algorithms recognizing all the occurrences of one word in another, and algorithms for a host of other subword problems, can be found in [2, 32].

2. FINDING LCS AND SCS BY DYNAMIC PROGRAMMING

Dynamic programming finds an LCS in time O(n^l), and the LCS problem is therefore polynomial for any fixed l. This result follows from an obvious generalization of the algorithm of [8] and has been published on several occasions (see, e.g., [1, 27]). In this section we will show that the dynamic programming method will also find an SCS in the same time. Therefore the SCS problem is also solvable in polynomial time for any fixed l.
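The l-dimensional generalization of the two-word dynamic program can be sketched as a memoized recursion over vectors of prefix lengths (an illustrative Python version; for fixed l the number of states is O(n^l)):

```python
from functools import lru_cache

def lcs_many(words):
    """One LCS of several words by dynamic programming over prefix-length vectors."""
    l = len(words)

    @lru_cache(maxsize=None)
    def f(t):
        # t[i] is the length of the prefix of words[i] under consideration.
        if min(t) == 0:
            return ""
        last = words[0][t[0] - 1]
        if all(words[i][t[i] - 1] == last for i in range(l)):
            # All prefixes end in the same letter: some LCS ends with it.
            return f(tuple(j - 1 for j in t)) + last
        # Otherwise shorten one of the prefixes and take the best result.
        return max((f(t[:i] + (t[i] - 1,) + t[i + 1:]) for i in range(l)), key=len)

    return f(tuple(len(w) for w in words))

assert lcs_many(["413", "2343", "432"]) == "43"  # the paper's example
```

When not all last letters coincide, at least one word's final position is unused by any optimal embedding, so dropping one last letter at a time is exhaustive; this is the same argument as in the two-word case.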

Notation and Definitions

Let L = {w_1, ..., w_l}, let n_i be the length of the word w_i, and let b_{ij} be the j-th letter, counting left to right, in the word w_i, i = 1, ..., l. We assume that, in each word in L, the 0-th letter from the left is the empty letter #, which does not affect the word length, i.e., for any i = 1, ..., l we take b_{i0} = # and |#w_i| = |w_i|. The vector t = (j_1, ..., j_l), where j_i ∈ {0, 1, ..., n_i}, will be called a transversal for L. Among all the transversals, we identify the left transversal o = (0, ..., 0) and the right transversal τ = (n_1, ..., n_l).

For a transversal t ≠ o and a subset H of the set I = {1, ..., l}, we define the transversal t_H, which is obtained from t by replacing each component j_h > 0, h ∈ H, with the component j_h - 1. Any transversal t defines a multilanguage* L_t = {w_{1t}, ..., w_{lt}}, where w_{it} is the prefix of length j_i of the word w_i, and a set of ≅-equivalence classes on I, defined by the rule u ≅ v ⇔ b_{u,j_u} = b_{v,j_v}. Thus, each class K represents # or some letter of the alphabet A, which we denote by β_K. By E_t we denote the subset of ≅-equivalence classes representing only letters from A. The length of the SCS of the words from L_t will be denoted† by λ(t).

Recurrence Equations

It is easy to verify the following propositions.

1. L_τ = L, and λ(τ) is the length of the SCS for L.

*We use the term "multilanguage" because L_t in general is a multiset.
†LCS and SCS for a multilanguage are defined as LCS and SCS for the set of its distinct words.


2. If c_1 c_2 ... c_{λ(t)} is an SCS for L_t and c_{λ(t)} = β_K for some class K from E_t, then c_1 c_2 ... c_{λ(t)-1} is an SCS for L_{t_K}.

3. If z is the last letter in an SCS for L_t, then E_t contains a class K such that z = β_K.

Propositions 2 and 3 lead to the recurrence equation

λ(t) = min {λ(t_K) + 1 : K ∈ E_t},   λ(o) = 0.

N_{21} = {v : in(v) ≥ 2 & out(v) = 1},
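The recurrence for λ translates directly into a memoized procedure (an illustrative Python sketch; transversals are represented as tuples of prefix lengths, and each class K of E_t is represented by its letter β_K):

```python
from functools import lru_cache

def scs_length(words):
    """SCS length via the transversal recurrence:
    lambda(o) = 0;  lambda(t) = 1 + min over letters b ending some nonempty
    prefix of lambda(t with every prefix ending in b shortened by one)."""
    l = len(words)

    @lru_cache(maxsize=None)
    def lam(t):
        if not any(t):  # the left transversal o = (0, ..., 0)
            return 0
        # E_t, represented by the letters at the ends of the nonempty prefixes.
        letters = {words[i][t[i] - 1] for i in range(l) if t[i] > 0}
        # t_K: decrement every component whose prefix ends in the chosen letter.
        return 1 + min(
            lam(tuple(t[i] - 1 if t[i] > 0 and words[i][t[i] - 1] == b else t[i]
                      for i in range(l)))
            for b in letters
        )

    return lam(tuple(len(w) for w in words))  # lambda(tau)

assert scs_length(["413", "2343", "432"]) == 6  # 234132 in the paper's example
```

The number of distinct transversals is at most (n + 1)^l, which gives the claimed polynomial bound for any fixed l.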

*In what follows, we omit the index L of G_L and P_L, implying that the graph G and the partition P were constructed from the language L by this technique.


Fig. 2. Transformation of digraph H to digraph H'.

N_{22} = {v : in(v) ≥ 2 & out(v) ≥ 2},
N_{1221} = {v : in(v) = 1 & out(v) = 2 ∨ in(v) = 2 & out(v) = 1}.

Transform the digraph H = (N, D) into the digraph H' = (N', D') by removing all the vertices from N_{00} together with their incident arcs and replacing the vertices v from N \ (N_{1221} ∪ N_{00}), together with their incident arcs, by the fragments shown in Fig. 2. Thus, N' = (N \ N_{00}) ∪ W, where W is the set of new vertices (light circles in Fig. 2) that appeared in the process of transformation of H into H'. Clearly, H' is the digraph of an instance of 3-CCVS. To complete its construction, set c' = c. The instance (H, c) of CCVS is decidable if and only if the instance (H', c') of 3-CCVS is decidable. This follows from some obvious propositions.

--No cycle in H passes through vertices from N_{00}.
--Any subset S cutting all the cycles in H also cuts all the cycles in H'.
--The cycles in H' cut by any vertex from W constitute a subset of the set of cycles cut by some vertex from N.

The transformation of H to H' is obviously done in polynomial time. Let us now show that the SCS problem is NP-hard for the case n = 2, r = 3 by proving the following stronger result.

THEOREM 1. The SCS problem for two-letter words is NP-hard even if all the orbits are of cardinality 3.

The proof uses a polynomial-time equivalent analog of our problem -- the CRR problem, which is restated as the following recognition problem.

BOUNDED REGULAR REFINEMENT (BRR). Given a positive integer constant k, a digraph G = (V, E) in the form of a collection of separate paths, and a partition P of the set V, decide if there exists a regular refinement Q of the partition P with ... ≥ 2.

For any k ≥ 1, the language ℒ is contained in the class Π_k^P of the polynomial-time hierarchy if and only if there exist polynomials p_1, p_2, ..., p_k and a (k + 1)-dimensional relation R on Γ*, recognizable in polynomial time, such that for all x ∈ Γ* we have

x ∈ ℒ ⇔ [∀y_1 ∈ Γ*, |y_1| ≤ ...
