The Complexity of the Morphism Equivalence Problem for Context-Free Languages by Wojciech Plandowski Ph.D. thesis prepared under the supervision of Prof. Wojciech Rytter

Department of Mathematics, Informatics and Mechanics Warsaw University 1995

Contents

1 Introduction 5
   1.1 Basic Notions 5
   1.2 The Morphism Equivalence Problem - the Definition 7
   1.3 Main Results of the Thesis 9
   1.4 Organization of the Thesis 12

2 Preliminaries 15

3 Test Sets 21
   3.1 Examples of Test Sets 22
   3.2 Linear Context-Free Languages 26
   3.3 General Context-Free Languages 30
   3.4 Lower Bounds 33
   3.5 Test Sets in Semigroups 37

4 The Morphism Equivalence Problem 41
   4.1 The Words Equivalence Problem 42
   4.2 The Split Operation 45
   4.3 The Compact Operation 57
   4.4 The Algorithm 59

5 Applications 63
   5.1 Deterministic Generalized Sequential Machines 64
   5.2 Algebraic Systems of Equations 67
   5.3 Recursive Sequences of Words 69

6 Conclusion and Open Problems 73

Chapter 1
Introduction

Formal language theory originated in the 1930s. Its methods combine those of algebra, the theory of algorithms, and logic. Its later evolution in the 1940s and 1950s was stimulated by the development of computers and by progress in understanding natural languages. Nowadays the theory is firmly established in theoretical computer science, with applications to computational complexity, the theory of programming languages and compilers, and data compression.

1.1 Basic Notions

The central notion of formal language theory is the notion of a word, which is understood as a finite sequence of letters. Such a general definition of a word allows it to be a proper mathematical model for various real objects, such as texts in a natural language, texts of computer programs or even the DNA structure in organic cells. The set of all letters which can be used to build words is called an alphabet. Throughout the thesis, the alphabet is assumed to be finite. Classes of similar objects correspond to certain sets of words. Sets of words, called (formal) languages, can be defined in two ways. The first way, semantic, consists in defining properties of the objects represented by the words of a language. The second way, syntactic, consists in defining rules for constructing the words of a language. Formal language theory studies only the syntactic ways of representing languages.


One of the syntactical representations of formal languages is given by rewriting systems, of which grammars are the most popular. There are four basic types of grammars: regular grammars, context-free grammars, context-sensitive grammars and recursively enumerable grammars. Let us denote them by RegG, CFG, CSG and RecG, respectively. The types of grammars are ordered from the most restrictive to the least restrictive, i.e. RegG ⊂ CFG ⊂ CSG ⊂ RecG. Regular grammars represent the languages which are used in lexical analyzers in compilers. Context-free grammars are used to define the syntax of programming languages and play a significant role in the construction of syntax analyzers in compilers. Recursively enumerable grammars define all languages whose words are generated by some algorithm; there is no algorithm generating all words of a language represented by a grammar from outside this class. The class of languages defined by context-sensitive grammars does not play such a fundamental role in computer science as the other classes, so we skip its interpretation.

The descriptive power of the individual types of grammars is illustrated by the fact that almost all problems dealing with regular grammars are decidable, the majority of them in polynomial time with respect to the size of the input data. On the other hand, almost all problems dealing with context-sensitive and recursively enumerable grammars are undecidable. The class of context-free grammars is a boundary class, for which some problems are decidable and others are undecidable.

Grammars are regarded as sequential rewriting systems. Parallel rewriting systems correspond to L systems. The simplest ones are D0L systems and HD0L systems. These systems define sequences of words which correspond to consecutive phases of the evolution of certain organisms. An important role in the definitions of L systems is played by morphisms. These are the most basic functions defined on words. A morphism is fully determined by its values on one-letter words. The morphic image of a longer word is obtained from the input word by replacing its letters by their morphic images.
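As a small illustration of this determination by letter images, a morphism can be sketched as a table of one-letter images; the alphabet and the particular images below are hypothetical, chosen only for the sketch.

```python
# A morphism is determined by its values on single letters; the image of a
# longer word is obtained by replacing each letter with its image.
def apply_morphism(images, word):
    """images: dict mapping each letter to its image word."""
    return "".join(images[letter] for letter in word)

# Hypothetical morphism on the alphabet {a, b}.
h = {"a": "ab", "b": "ba"}
print(apply_morphism(h, "abba"))  # -> abbabaab
```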


1.2 The Morphism Equivalence Problem - the Definition

Equivalence problems are of special interest in theoretical computer science. Such a problem consists in testing whether two properties of objects are equivalent on a subset of all objects. A property is a function whose domain is the set of all objects. Formally, equivalence problems can be formulated as follows.

Equivalence problems

Let Objects be the set of all objects and P be an arbitrary set. Given two functions f, g : Objects → P and a set of objects obj ⊆ Objects, decide whether or not the equation f(x) = g(x) holds for each x ∈ obj.

The problem is fundamental in studying relationships between sets of objects and properties of objects. A typical example of such a problem comes from databases. An object is a record of some type T and a property is a query to a type T database. The equivalence problem is to decide, given two queries and some type T database, whether the queries are equivalent on this database. Our next problem comes from cryptography. One of the simplest methods of encoding uses code books. A code book is a set of pairs: a word and its code. Encrypting consists in replacing the words of the message to be encoded by their codes from the code book. Two code books are equivalent for some set of messages if, for each message in the set, both code books give the same code. The equivalence problem consists in deciding, given two code books and a set of messages, whether or not the two code books are equivalent for this set of messages.

Example 1 The code books in Fig. 1.1 are equivalent for the set of messages {Barbara is a spy, Donald is a spy}. Indeed, using both code books the code of the message `Barbara is a spy' is `32212112' and the code of the message `Donald is a spy' is `1112212112'.

    word     | code book 1 | code book 2
    ---------+-------------+------------
    a        | 12          | 121
    Barbara  | 32          | 3
    Donald   | 1112        | 111
    is       | 2           | 22
    spy      | 112         | 12

Figure 1.1: Two code books which are equivalent on the set of messages {Barbara is a spy, Donald is a spy}.

The problem of code books can be formulated in terms of morphisms and languages. In this new formulation the problem is known as the morphism equivalence problem. The alphabet is the set of all words which can be used in messages. A code book defines a morphism in which the morphic image of a word is its code. The code of a message is the morphic image of this message. Then the problem of code books can be reformulated in the following way.

Morphism equivalence problem

Given two morphisms f, g and a language L, decide whether or not the equation f(x) = g(x) is satisfied for each word x in L.

In the original formulation of the problem the language L is assumed to belong to a certain class of languages ℒ. Then the problem is called the morphism equivalence problem for the class ℒ. The morphism equivalence problem was formulated in 1978 by Salomaa and Culik [7]. Since that time many papers have been written on this and related topics. The reason is that the problem is closely related to problems in formal language and automata theory [2, 6, 25], the theory of equations in semigroups [6, 16, 11], combinatorics on words [17, 23] and data compression [18, 19]. In some cases a solution to the problem is very simple.
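Before turning to such cases, the code-book instance of Example 1 can be checked mechanically. The sketch below treats each code book as a morphism on message words; splitting messages on spaces is an assumption about how the code books are applied, made only for this sketch.

```python
book1 = {"a": "12", "Barbara": "32", "Donald": "1112", "is": "2", "spy": "112"}
book2 = {"a": "121", "Barbara": "3", "Donald": "111", "is": "22", "spy": "12"}

def encode(book, message):
    # Replace each word of the message by its code and concatenate.
    return "".join(book[word] for word in message.split())

print(encode(book1, "Barbara is a spy"))  # -> 32212112

# The two code books agree on both messages of Example 1.
for m in ["Barbara is a spy", "Donald is a spy"]:
    assert encode(book1, m) == encode(book2, m)
```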

Example 2 Denote by w^i the word which is the sequence of i copies of the word w. Assume that the alphabet is the set {a, b}. Define two morphisms f and g in the following way:

    f(a) = ab,  f(b) = ba,  g(a) = a,  g(b) = bba.

Let L = {(ab)^i : i ≥ 1}. Then the solution to the morphism equivalence problem is trivial. Since f(ab) = abba = g(ab), we have f((ab)^i) = f(ab)^i = g(ab)^i = g((ab)^i), and the morphisms are equal on each word in L.

1.3 Main Results of the Thesis

The basic questions related to the morphism equivalence problem are:

• Is the problem decidable for a certain class of languages?
• What is the exact complexity of the problem for a certain class of languages?

The answers to the above questions depend on how the language L is represented. Suppose that the language L is represented by a grammar. Recall that RegG ⊂ CFG ⊂ CSG ⊂ RecG. The boundary between decidability and undecidability of the morphism equivalence problem lies between CFG and CSG: if the input language is represented by a context-sensitive grammar, then the morphism equivalence problem is in general undecidable, even if we restrict the problem to deterministic context-sensitive grammars [6], and if the language is represented by a context-free grammar (or by a regular grammar), then the problem is decidable [7]. We show that the difference between the complexities of the problem for CSG and CFG is unexpectedly high, namely: the problem for CFG is decidable in polynomial time. Previous results gave double exponential [3] and single exponential [13] time solutions to the problem. The polynomial time complexity of the problem for regular grammars was shown earlier [13]. The morphism equivalence problem for D0L and HD0L systems is decidable [14], but its exact complexity is still not known.

The basic tools for solving the morphism equivalence problem are test sets. Note that in Example 2, independently of the morphisms f and g,


it is enough to check that the morphisms are equal on the word `ab' to be sure that they are equal on all words in the language L = {(ab)^i : i ≥ 1}. A subset of a language L which has the above property is called a test set for L. The usefulness of test sets in the morphism equivalence problem is obvious. To solve the problem it is enough to construct a test set for the input language L which is as small as possible, and to test whether the input morphisms are equal on all words in this test set. If they are, then they are equal on the whole language L; otherwise they are not equal on L. Thus in some sense the test set is equivalent to the whole language in the morphism equivalence problem. The most famous theorem on test sets is the Ehrenfeucht Conjecture.
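The role of the test set {ab} in Example 2 can be illustrated directly: the two morphisms agree on the single test word, and a brute-force check on a finite sample of L confirms agreement on longer words (the sample is for illustration only; the proof in Example 2 covers all of L).

```python
def apply(images, word):
    # Morphic image: replace each letter by its image and concatenate.
    return "".join(images[c] for c in word)

f = {"a": "ab", "b": "ba"}
g = {"a": "a", "b": "bba"}

# Agreement on the test word "ab" ...
assert apply(f, "ab") == apply(g, "ab") == "abba"

# ... propagates to the words (ab)^i of L; we sample i = 1..10.
assert all(apply(f, "ab" * i) == apply(g, "ab" * i) for i in range(1, 11))
```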

Ehrenfeucht Conjecture

Each language possesses a finite test set.

From the point of view of the morphism equivalence problem the conjecture says that each language is equivalent to one of its finite subsets. Of course, the constructibility of such subsets depends on the language and its representation. The Ehrenfeucht Conjecture was posed by Ehrenfeucht in the early 70s and proved to be true in 1985, independently by Albert and Lawrence [4] and by Guba [10]. The basic complexity question dealing with the conjecture remained open:

• Given a language L, how small can a test set T for it be?

The answer to this question depends on the parameter describing the set L on which the size of T is to depend. Since morphisms are fully determined by their values on letters, a natural choice for the parameter is the number n of letters we use to build words. Let f(n) be the least number such that each set of words over an n-letter alphabet possesses a test set of size f(n). It is not difficult to prove that f(1) = 1, but the exact value of f(2) is still not known. The existing bounds are 2 ≤ f(2) ≤ 3 [8, 9]. Up to now no upper bound for f(3) is known. The only obvious lower bound for the number f(n) is n. We show the first nontrivial bound for f(n), which is Ω(n³). Recently, this bound has been improved to Ω(n⁴) [16]. A generalization of this problem to semigroups other than free ones is considered in [16, 11].


If we are interested in constructing the test set T, then the natural choice for the parameter describing the language L is the size m of a representation of L. Now we are interested not only in the existence of test sets but in the complexity of their construction as well. If the language L is represented by a regular grammar, then it possesses a test set of size O(m) and the test set can be constructed in polynomial time [13]. If the language L is represented by a context-free grammar, then the existing constructions give double exponential [3] and single exponential [13] bounds for the size of a test set. We prove that this bound can be lowered to a polynomial, O(m⁶). The words in the test set cannot be listed in polynomial time, since they can be of exponential length, but it is possible to construct in polynomial time a context-free grammar representing this set. Our upper bound for the number of words in the test set does not seem to be tight, but it cannot be lowered too much. We prove that there are context-free languages for which all test sets are of size Ω(m³). Nothing is known about the upper bound for the size of test sets for context-sensitive grammars, but even if such a bound exists, test sets for those languages are in general not constructible, which is a simple consequence of the undecidability of the morphism equivalence problem for those languages. The problem of the existence of polynomial size test sets for context-free languages can be considered for arbitrary semigroups, not only free ones [16, 11]. We formulate a simple necessary and sufficient condition for a group to have the property that, in this group, each context-free language possesses a polynomial size test set.

SUMMARY OF MAIN RESULTS

The main results of the thesis are:

1. proving that each language represented by a context-free grammar of size m possesses a test set containing O(m⁶) words,

2. proving that the polynomial m⁶ in the previous result cannot be replaced by a polynomial of degree lower than 3,


3. formulation of a necessary and sufficient condition for a group to have polynomial size test sets for context-free languages,

4. construction of a polynomial time algorithm for the morphism equivalence problem for context-free grammars.

Result 4 looks like a simple consequence of result 1. Notice, however, that if result 1 is applied directly, then the lengths of the words in the test set could be exponential. Moreover, each word in a language represented by a context-free grammar can be of exponential length with respect to the size of the grammar. Thus result 4 requires developing a nontrivial algorithm.

1.4 Organization of the Thesis

In Chapter 2 we introduce the basic notions and definitions used in the rest of the thesis. Next, in Chapter 3, we prove the results concerning test sets, i.e. results 1, 2 and 3. The most difficult part of the chapter is the proof of the existence of polynomial size test sets for context-free languages. The proof takes up three sections. First, we prove that a certain set of 16 words possesses a test set consisting of 15 words. This is the most difficult part of the whole proof. The first proof took up 10 pages [17]. Our proof is considerably shorter (2 pages). This is obtained by generalizing the problem to free groups and solving the more general problem. Using the above result we prove the existence of polynomial size test sets for linear context-free grammars, a subclass of the class CFG. The test set is constructed on the basis of a graph representation of the grammar. In the last part of the proof it is shown that each context-free language possesses a test set which is a linear context-free language.

The results of Chapter 3 are used in the next chapter, Chapter 4, to present a polynomial time algorithm for the morphism equivalence problem. The most difficult part of the algorithm is the part which checks whether two words given by their compressed representations are the same. Since the words can be of exponential length, the algorithm has to test their equality without uncompressing them. We solve this problem by studying the periodicity structure of some subwords of the


words. The periodicity structure of a word of exponential length can be stored in a polynomial number of bits. The techniques we use in the algorithm have been applied in polynomial time algorithms for searching for a compressed pattern in a compressed text [19, 18].

Chapter 5 contains applications of our results in automata theory, the theory of systems of equations on words, and combinatorics on words. In the first section of this chapter we consider a generalization of the morphism equivalence problem which is called the dgsm equivalence problem. Dgsms (deterministic generalized sequential machines) define a wider class of functions than morphisms. We prove that the equivalence problem for those functions can also be solved in polynomial time. In the next section we show that the morphism equivalence problem and test sets have a natural interpretation in the theory of equations on words, and we give this interpretation for our results from Chapters 3 and 4. The last section of this chapter is devoted to an application of the algorithm from Chapter 4 in combinatorics on words. It deals with sequences of words that are defined by recurrent formulae. We show that it is possible to decide in polynomial time whether or not the first m elements of two such sequences are the same. The difficulty of the problem lies in the fact that the words in the sequences can be of exponential length. An interesting consequence of this result concerns the famous 2n-conjecture for D0L systems. D0L systems define sequences of words. The 2n-conjecture says that if the first 2n elements of two D0L sequences are the same, then the sequences are the same; here n is the number of letters that are used to build words. We prove that if the 2n-conjecture for D0L systems is true, then the sequence equivalence problem for D0L systems (deciding whether two such sequences are the same) can be solved in polynomial time. We also show that an analogous conjecture for HD0L systems does not hold, and that if some f(n)-conjecture holds for these systems then f(n) = Ω(n³).

The last chapter, Chapter 6, contains conclusions and open problems.


The thesis is mainly based on two papers:

• [17] J. Karhumäki, W. Plandowski, W. Rytter, Polynomial size test sets for context-free languages, JCSS 50 (1995), 11-19.
• [23] W. Plandowski, Testing equivalence of morphisms on context-free languages, in Proc. ESA'94, LNCS 855, 460-470, 1994.

The first paper contains the proofs of results 1 and 2. The second one presents a short proof of the main lemma of the first paper and a proof of results 3 and 4.

Chapter 2
Preliminaries

We will use some basic notions of formal language theory. We refer the reader to [12] for detailed definitions.

Any finite sequence of elements from a set Σ is called a word over the alphabet Σ. The word a_1, a_2, ..., a_k, for k ≥ 1, is usually denoted by a_1 a_2 ... a_k. We assume that the alphabet Σ is finite. The elements of Σ are called letters. The i-th letter of the word w is denoted by w[i]. A continuous subsequence of letters of the word w is called a subword of w. The subword starting with the i-th letter of w and ending with the j-th one is denoted by w[i..j]. A subword which starts with the first letter is called a prefix and a subword which ends with the last letter is called a suffix. The sequence of 0 elements is called the empty word and is denoted by 1. The length of a word w is the number of elements in the sequence w and is denoted by |w|.

Let Σ* be the set of all words over Σ. Any subset of the set Σ* is called a language over the alphabet Σ. Concatenation is a binary function from Σ* × Σ* to Σ*. It takes two words w and u as its arguments and returns the word, denoted by wu, which is the sequence of letters of w followed by the sequence of letters of u. It is easy to verify that concatenation is associative, i.e. for three words u, w, v we have (uw)v = u(wv), and that the empty word is a neutral element of concatenation, i.e. each word w satisfies the equations 1w = w1 = w. Thus the set of words Σ* with concatenation forms a monoid. It is called the free monoid over Σ and denoted by Σ*. Denote by w^i, for i ≥ 1, the concatenation of i copies of the word w. Let w^0 = 1 for each word w.
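The basic operations on words just defined map directly onto string operations; the sketch below only translates the 1-based indexing of the definitions into Python's 0-based indexing.

```python
def letter(w, i):
    """The i-th letter w[i], with 1-based indexing as in the text."""
    return w[i - 1]

def subword(w, i, j):
    """The subword w[i..j], inclusive on both ends, 1-based."""
    return w[i - 1:j]

w = "abbab"
print(letter(w, 2))      # -> b
print(subword(w, 2, 4))  # -> bba
print(w[:3], w[3:])      # a prefix and the complementary suffix
print(len(w))            # the length |w| -> 5
```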


The free monoid Σ* can be embedded in a group. Before we define it, we have to introduce auxiliary notions. Let f be a one-to-one function whose domain is Σ and such that the sets f(Σ) and Σ are disjoint. Let w be a word over Σ ∪ f(Σ). Starting from the word w, remove subwords f(a)a or af(a), for a ∈ Σ, as long as it is possible. Independently of the order of removal of the subwords, the final word is the same. We denote it by r(w). The elements of the group are the words over Σ ∪ f(Σ) that do not contain a subword f(a)a or af(a) for any a in Σ. The concatenation of elements u, v in the group is the word r(uv). It can be proved that this concatenation is associative and that 1 is its neutral element. The inverse w^{-1} of an element w = a_1 ... a_k, for a_i ∈ Σ ∪ f(Σ), k ≥ 0, is defined as follows:

    w^{-1} = 1                               if w = 1,
    w^{-1} = f(w)                            if w ∈ Σ,
    w^{-1} = f^{-1}(w)                       if w ∈ f(Σ),
    w^{-1} = a_k^{-1} ... a_2^{-1} a_1^{-1}  otherwise.
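The reduction described above can be sketched directly. Here the formal inverse of a letter a is represented by the pair (a, -1) and a itself by (a, +1), an encoding chosen only for this sketch; a single left-to-right pass with a stack performs all removals, which also illustrates why the result does not depend on the order of removal.

```python
def reduce_word(word):
    """Free reduction: cancel adjacent pairs (a,+1)(a,-1) and (a,-1)(a,+1)."""
    stack = []
    for letter, sign in word:
        if stack and stack[-1] == (letter, -sign):
            stack.pop()          # a pair a a^{-1} (or a^{-1} a) cancels
        else:
            stack.append((letter, sign))
    return stack

# a b b^{-1} a^{-1} c reduces to c.
w = [("a", 1), ("b", 1), ("b", -1), ("a", -1), ("c", 1)]
print(reduce_word(w))  # -> [('c', 1)]
```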

A free group generated by an alphabet Σ is denoted by F(Σ). Denote by w^i, for an integer i and an element w of a free group, the reduced form of the i-fold concatenation of w if i ≥ 0, and the reduced form of the (-i)-fold concatenation of w^{-1} otherwise.

Define a morphism to be a function φ : Σ* → Δ* such that for any two words u, v in Σ*, φ(uv) = φ(u)φ(v). It is enough to define the morphism on all letters in Σ to define it on Σ*. Indeed, let w = a_1 ... a_k for a_i ∈ Σ. Then φ(w) = φ(a_1) ... φ(a_k). Sometimes a morphism is defined for semigroups other than free ones. Then, for a semigroup S, a morphism is a function φ : Σ* → S such that for two words u, v in Σ*, φ(uv) = φ(u) ∘ φ(v), where ∘ is the semigroup operation in S. Again it is enough to define the morphism on all letters in Σ to define it on Σ*. We say that two morphisms φ and ψ agree on a language L if φ(w) = ψ(w) for each word w in L.

A context-free grammar G is a 4-tuple (Σ, N, S, P), where Σ is a terminal alphabet, N is a nonterminal alphabet, S ∈ N is the start symbol (or start nonterminal), and P is a set of productions. Elements of the terminal alphabet are called terminal letters or simply terminals. Similarly, elements of the nonterminal alphabet are called nonterminal

letters or nonterminals. A production in P is a pair (A, α) such that A ∈ N and α ∈ (Σ ∪ N)*. The pair is usually denoted by A → α. The size of a production A → α is the length of the word α. The size |G| of a grammar G is the sum of the numbers of elements in the terminal and nonterminal alphabets plus the sum of the sizes of all its productions. In this and the next chapters we use the convention that capital letters denote nonterminals and lower-case letters denote terminal letters.

A derivation in a grammar G is a nonempty sequence of words α_1, α_2, ..., α_k, usually denoted by α_1 → α_2 → ... → α_k, such that for each 1 ≤ i < k there are words β_i, γ_i over Σ ∪ N and a production A → δ satisfying α_i = β_i A γ_i and α_{i+1} = β_i δ γ_i. If there is a derivation which begins with a word α and ends with a word β, then we say that β is derivable from α and write α →* β. Any derivation starting from a nonterminal can be viewed as a labeled tree, called a derivation tree, in which internal vertices are labeled by nonterminals and leaves are labeled by terminal or nonterminal letters. The labeling satisfies the condition that if A is the label of an internal vertex and α_1, ..., α_k are the labels of its sons, written from the leftmost son to the rightmost one, then A → α_1 ... α_k is a production in G. Such a tree has the property that the labels of the leaves, written from the label of the leftmost leaf to the label of the rightmost one, form the last word of the derivation; see [12] for details. A derivation tree corresponding to a derivation S →* α, where S is the start symbol, is called a derivation tree for α.

A word which consists of terminal letters is called a terminal word. The language L(G) generated by the grammar G consists of all terminal words derivable from the start nonterminal S. We say that a nonterminal A is useless if there is no derivation of a word in L(G) which contains the nonterminal A.

Example 3 Let G = (Σ, N, S, P) be a context-free grammar, where Σ = {a, b}, N = {S}, P = {S → aSb, S → 1}. Clearly, the language L(G) generated by this grammar is {a^i b^i : i ≥ 0}. The only derivation of the word a³b³ in this grammar looks as follows:

    S → aSb → a²Sb² → a³Sb³ → a³b³.

The derivation tree corresponding to the above derivation is presented in Fig. 2.1.
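The derivation above can be replayed mechanically: the sketch below applies the production S → aSb three times and then S → 1 (the empty word), replacing the leftmost occurrence of the nonterminal at each step.

```python
def derive(word, production):
    """Replace the leftmost occurrence of the production's left-hand side."""
    lhs, rhs = production
    return word.replace(lhs, rhs, 1)

word = "S"
for _ in range(3):
    word = derive(word, ("S", "aSb"))   # S -> aSb
word = derive(word, ("S", ""))          # S -> 1 (the empty word)
print(word)  # -> aaabbb
```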


            S
          / | \
         a  S  b
          / | \
         a  S  b
            |
            1

Figure 2.1: The derivation tree of the word a³b³.

Other devices that generate languages are L systems [24]. The simplest such systems are D0L systems. A D0L system is a triple (Σ, h, w), where Σ is an alphabet, w is a word and h is a morphism h : Σ* → Σ*. The D0L system defines the infinite sequence of words w, h(w), h²(w), ..., h^i(w), ..., where h^i(w) denotes h(h(...(h(w))...)), the i-fold application of h. A

sequence generated by a D0L system is called a D0L sequence. An HD0L system is a quadruple (Σ, g, h, w), where Σ is an alphabet, w is a word and g, h are morphisms g, h : Σ* → Σ*. The HD0L system generates the sequence of words g(w), g(h(w)), ..., g(h^i(w)), .... Note that this sequence is a morphic image of the sequence generated by the D0L system (Σ, h, w). A sequence generated by an HD0L system is called an HD0L sequence.

Example 4 Consider a D0L system (Σ, h, w) where Σ = {a, b}, w = a, h(a) = b, h(b) = ba. The D0L sequence generated by this system is

    a, b, ba, bab, babba, babbabab, babbababbabba, ...

It is the well-known sequence of Fibonacci words. The sequence of Fibonacci words is defined by the recurrent formulae f_1 = a, f_2 = b, f_n = f_{n-1} f_{n-2} for n ≥ 3.
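The first few words of this D0L sequence can be generated directly from the definition, a short sketch:

```python
def d0l_sequence(images, start, n):
    """First n words of the D0L sequence start, h(start), h^2(start), ..."""
    words, w = [], start
    for _ in range(n):
        words.append(w)
        w = "".join(images[c] for c in w)  # apply the morphism h
    return words

h = {"a": "b", "b": "ba"}
print(d0l_sequence(h, "a", 7))
# -> ['a', 'b', 'ba', 'bab', 'babba', 'babbabab', 'babbababbabba']
```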

Throughout the thesis, the word 'iff' abbreviates the phrase 'if and only if'.

Chapter 3
Test Sets

As we know from the introduction, the notion of a test set is a useful tool for solving the morphism equivalence problem.

Definition 5 Let Σ, Δ be alphabets. A language T is a test set for a language L ⊆ Σ* iff

1. T is a subset of L, and

2. for any two morphisms f, g : Σ* → Δ*, if they agree on T, then they agree on L.

The implication in condition 2 says that it is enough to check whether two morphisms agree on a test set to prove that they agree on the whole language. Moreover, since T is a subset of L, if they do not agree on T, then they do not agree on L. This suggests an efficient solution of the morphism equivalence problem for a language L: construct a small test set for L and check whether the input morphisms agree on it. If they do, then the morphisms agree on L; otherwise they do not agree on L. Clearly, the smaller the test set the better.

This chapter is mainly devoted to proving that each context-free language possesses a test set with a polynomial number of words with respect to the size of a context-free grammar representing the language. The proof is divided into three parts which correspond to the first three sections. In Section 3.1 we find test sets for two finite languages. We shall see that the most complicated part of this section is to prove that a set


of words is a test set. To shorten the proofs we embed free monoids in free groups. Next, in Section 3.2, the construction of test sets for linear context-free languages is shown. Section 3.3 contains the construction of test sets for general context-free languages. The next two sections contain our other results on test sets. In Section 3.4 we prove a cubic lower bound for the size of test sets for context-free languages. In the last section we extend the notion of a test set to semigroups and show a simple necessary and sufficient condition for groups to have polynomial size test sets for context-free languages.

3.1 Examples of Test Sets

In this section we show the construction of test sets for two finite languages. We shall see that the construction is essential for proving our main result in this chapter. The most difficult part is to prove that a set of words is a test set. To shorten the proofs we prove all facts in a free group. We illustrate how this is possible with the proof of the following fact.

Fact 6 X = {uv, uw, xv} is a test set for Y = {uv, uw, xv, xw}.

Proof: (in a free monoid) Let X, Y ⊆ Σ* and let f, g : Σ* → Δ* be two morphisms. It is enough to prove that if f(z) = g(z) for z ∈ X, then f(xw) = g(xw). Denote s' = f(s) and s'' = g(s) for any word s. Since u'v' = u''v'', two cases are considered.

Case 1: u' is a prefix of u''. There is a word δ such that u'' = u'δ. Using this equation in the equations u'v' = u''v'' and u'w' = u''w'' and simplifying them, we obtain v' = δv'' and w' = δw'', respectively. Since x'v' = x''v'' we obtain x'' = x'δ and finally x''w'' = x'δw'' = x'w'.

Case 2: u'' is a prefix of u'. The proof in this case is symmetric to the one in the previous case. □

Proof: (in a free group) Let X, Y ⊆ Σ* and let f, g : Σ* → F(Δ) be two morphisms. It is enough to prove that if f(z) = g(z) for z ∈ X, then f(xw) = g(xw). Denote s' = f(s) and s'' = g(s) for any word s. We have

    x'w' = x'v'(v')^{-1}(u')^{-1}u'w' = x'v'(u'v')^{-1}u'w' = x''v''(u''v'')^{-1}u''w'' = x''w''.

This completes the proof. □

The proof in the free group uses inversion of elements, while the proof in the free monoid is divided into cases depending on the lengths of u' and u''. Here the cases are symmetric, but this is not a typical situation. The original proof of our next lemma considers 16 cases, and after removing symmetric ones four cases still remain [17]. The proof in those cases takes up 10 pages and uses advanced properties of periods in words. Our proof takes up 2 pages and uses one well-known property of free groups.

Consider the finite languages L_k defined by the following linear context-free grammars G_k:

    A_i → a_i A_{i-1} ā_i | b_i A_{i-1} b̄_i   for 1 < i ≤ k,
    A_1 → a_1 | b_1,

where nonterminals are written as capital letters, terminal symbols as lower-case letters, and A_k is the start symbol. The language L_k consists of 2^k words and is defined over an alphabet consisting of 4k - 2 symbols. Let κ_k be the word in L_k consisting of the letters b_i and b̄_i. Define T_k = L_k \ {κ_k}. In the next lemma we prove that T_4 is a test set for L_4. Denote w^a = a^{-1}wa. Note that (w^a)^b = w^{ab}.

Lemma 7 Let F be a free group. Consider morphisms f, g : Σ* → F. Denote s' = f(s) and s'' = g(s) for each word s. Let τ, δ, ω, γ, σ, μ, a, b, c, d be elements of the free group F. Then

1. if ωτ = τω and δτ = τδ and τ ≠ 1, then ωδ = δω;

2. if δ^a γ^c = γ^c δ^a and δ^b γ^c = γ^c δ^b and δ^a γ^d = γ^d δ^a, then δ^b γ^d = γ^d δ^b;

3. if σx' = x'μ for all x ∈ T_3, then σκ'_3 = κ'_3 μ;

4. if x' = x'' for all x ∈ T_4, then κ'_4 = κ''_4;

5. T_4 is a test set for L_4.

Proof:

1. A nontrivial identity is an equation whose left hand side di ers from its right hand side. The implication in this part is a consequence of the following property of free groups [21].

Proposition 8 If two elements a, b of a free group satisfy a nontrivial identity then there are an element c and integers i, j such that a = ci and b = cj .

Since  , ! satisfy the nontrivial identity ! = !, there is an element q in F such that  = qi and ! = qj . For  6= 1 we have i 6= 0 so that qi = qi is a nontrivial identity. Hence, there is an element c such that  = ck and q = cl for some integers l, k. As a consequence  and ! are powers of c and therefore satisfy ! = !. 2. We may assume that  6= 1 and  6= 1. We apply part 1 to the rst two equations to obtain ab = ba. Now, we apply part 1 again to the obtained equation and the third0 equation. 3. We have  = (x0)?1 x0 0 =0?1x for x 2 T3. Hence x0 = y0 , for x; y 2 T3, and therefore x y = , for x; y 2 T3. Substituting x = a3a2a1a2a3, y = a3a2b1a2a3 we obtain

α^{a₃′a₂′a₁′ā₁′(b₁′b̄₁′)⁻¹a₂′⁻¹a₃′⁻¹} = α.  (3.1)

Similarly, for x = a₃b₂a₁ā₁b̄₂ā₃, y = a₃b₂b₁b̄₁b̄₂ā₃ and for x = b₃a₂a₁ā₁ā₂b̄₃, y = b₃a₂b₁b̄₁ā₂b̄₃ we obtain

α^{a₃′b₂′a₁′ā₁′(b₁′b̄₁′)⁻¹b₂′⁻¹a₃′⁻¹} = α,  (3.2)
α^{b₃′a₂′a₁′ā₁′(b₁′b̄₁′)⁻¹a₂′⁻¹b₃′⁻¹} = α.  (3.3)

Now we transform equation (3.1). Let Δ = a₁′ā₁′(b₁′b̄₁′)⁻¹. We have α^{a₃′a₂′Δa₂′⁻¹a₃′⁻¹} = α. Thus α^{a₃′a₂′Δa₂′⁻¹} = α^{a₃′} and therefore (α^{a₃′})^{Δ^{a₂′⁻¹}} = α^{a₃′}.


Finally, we obtain α^{a₃′}Δ^{a₂′⁻¹} = Δ^{a₂′⁻¹}α^{a₃′}. Similarly we transform equations (3.2), (3.3) to obtain α^{a₃′}Δ^{b₂′⁻¹} = Δ^{b₂′⁻¹}α^{a₃′} and α^{b₃′}Δ^{a₂′⁻¹} = Δ^{a₂′⁻¹}α^{b₃′}, respectively. Now we apply part 2 to obtain α^{b₃′}Δ^{b₂′⁻¹} = Δ^{b₂′⁻¹}α^{b₃′}. After transforming the last equation in the reverse order than previously, we obtain α^{b₃′b₂′Δb₂′⁻¹b₃′⁻¹} = α, which is equivalent to α^{ζ′(λ̄₃′)⁻¹} = α, where ζ = b₃b₂a₁ā₁b̄₂b̄₃. Since ζ ∈ T₃, we have αζ′ = ζ′β, and finally α^{λ̄₃′} = β, which is equivalent to αλ̄₃′ = λ̄₃′β.

4. The system of equations x′ = x″, for x ∈ T₄, can be written in the form

a₄′x′ā₄′ = a₄″x″ā₄″  for x ∈ L₃,
b₄′x′b̄₄′ = b₄″x″b̄₄″  for x ∈ T₃.

Hence, for x ∈ T₃,

a₄′x′ā₄′ = a₄″x″ā₄″ = a₄″b₄″⁻¹(b₄″x″b̄₄″)b̄₄″⁻¹ā₄″ = a₄″b₄″⁻¹b₄′x′b̄₄′b̄₄″⁻¹ā₄″

and therefore

(b₄′⁻¹b₄″a₄″⁻¹a₄′)x′ = x′(b̄₄′b̄₄″⁻¹ā₄″ā₄′⁻¹).

Now we apply part 3 to obtain

(b₄′⁻¹b₄″a₄″⁻¹a₄′)λ̄₃′ = λ̄₃′(b̄₄′b̄₄″⁻¹ā₄″ā₄′⁻¹).

Hence

b₄″a₄″⁻¹(a₄′λ̄₃′ā₄′)ā₄″⁻¹b̄₄″ = b₄′λ̄₃′b̄₄′.

Since a₄′λ̄₃′ā₄′ = a₄″λ̄₃″ā₄″, we have b₄″λ̄₃″b̄₄″ = b₄′λ̄₃′b̄₄′, i.e. λ̄₄″ = λ̄₄′.

5. It is enough to prove that the implication in part 4 holds for arbitrary morphisms f, g: Σ* → Σ*. Since we may embed the free monoid Σ* in a free group, the implication holds. □

The structure of the proof of the above lemma may not seem clear, but if we trace the proof of its parts in reverse order the structure becomes visible: each part of the lemma is obtained by a simplification of the next part, in the sense that to prove the next part it is enough to prove the previous one. The least natural simplification is that of part 3, and it seems to be the trickiest step of the proof.


3.2 Linear Context-Free Languages

A linear context-free grammar is a context-free grammar in which productions are of the form A → uBv or A → u, where u, v are terminal words and A, B are nonterminal symbols. A linear context-free grammar G can be viewed as a labeled digraph graph(G) in which all vertices but one correspond to nonterminals and all edges correspond to productions. The vertex which does not correspond to a nonterminal is labeled by t and called the terminal node. All other vertices are labeled by the corresponding nonterminal; in what follows we identify vertices with their labels. The edges of the graph are labeled by pairs of terminal words. The edge corresponding to a production of the form A → uBv goes from A to B and is labeled by the pair (u, v). The edge corresponding to a production of the form A → u goes from A to the terminal node and is labeled by (u, ε). The derivations in G have a nice graph interpretation: they correspond to paths of edges. For each path π in graph(G) we define two words μ(π) and μ̄(π) in the following way. If (w₁, w̄₁), …, (wₖ, w̄ₖ) are the labels of the consecutive edges of π, then μ(π) = w₁…wₖ and μ̄(π) = w̄ₖ…w̄₁.

Fact 9
1. A →* wBw̄ in G ⟺ there exists a path π in graph(G) which starts at the vertex A and ends at B, and such that w = μ(π) and w̄ = μ̄(π).
2. A →* w in G ⟺ there exists a path π in graph(G) which starts at the vertex A and ends at t, and such that w = μ(π)μ̄(π).

Proof: We prove only part 1, the proof of part 2 being similar. A derivation D

A → α₁ → α₂ → … → αₖ → wBw̄

in a linear context-free grammar has the property that all the intermediate words αᵢ contain exactly one nonterminal. Thus derivations in a linear context-free grammar are uniquely determined by the sequence of applied productions. The derivation D corresponds to the path


[Figure 3.1: graph(G₄). The rightmost path from S to t corresponds to the only derivation of λ̄₄.]

π = e₁, …, eₖ₊₁ such that the edge eᵢ corresponds to the i-th production applied in the derivation. Such a path starts at A, ends at B, and satisfies w = μ(π), w̄ = μ̄(π). Conversely, a path from A to B corresponds to the derivation determined by the sequence of productions corresponding to its consecutive edges. This completes the proof. □

We say that a terminal word w corresponds to a path π from S to t if w = μ(π)μ̄(π). By Fact 9, such a word w is in the language L(G).

Example 10 The graph graph(G₄) for the grammar G₄ is presented in Fig. 3.1. The rightmost path of the graph corresponds to the following derivation of the word λ̄₄:

A₄ → b₄A₃b̄₄ → b₄b₃A₂b̄₃b̄₄ → b₄b₃b₂A₁b̄₂b̄₃b̄₄ → λ̄₄.
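The graph view is easy to make concrete. A minimal sketch (function names `graph_of` and `word_of_path` are ours, not the thesis's; barred letters are written as capitals, so `B3` stands for b̄₃):

```python
# Sketch of graph(G) for a linear context-free grammar.  A production
# A -> uBv is given as (A, u, B, v) and A -> u as (A, u, None, "").

def graph_of(productions):
    """Return the labeled edges of graph(G); 't' is the terminal node."""
    return [(A, B if B is not None else "t", (u, v))
            for (A, u, B, v) in productions]

def word_of_path(path_labels):
    """For a path pi from S to t with edge labels (w1, wbar1), ..., (wk, wbark),
    return mu(pi) + mubar(pi) = w1 ... wk wbark ... wbar1."""
    mu = "".join(u for (u, _) in path_labels)
    mubar = "".join(v for (_, v) in reversed(path_labels))
    return mu + mubar

# The rightmost path of graph(G4) from Example 10:
labels = [("b4", "B4"), ("b3", "B3"), ("b2", "B2"), ("b1B1", "")]
print(word_of_path(labels))  # b4b3b2b1B1B2B3B4
```

The printed word is λ̄₄ in this encoding, matching the derivation of Example 10.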

For each node v of graph(G) construct a tree tree(v) rooted at v and containing all vertices reachable from v in graph(G). The trees


[Figure 3.2: The path from S to t is associated with the sequence of three edges (u₁, v₁), (u₂, v₂), (u₃, v₃).]

are used to associate paths with certain sequences of edges. A sequence of not necessarily adjacent edges (u₁, v₁), …, (uₖ, vₖ) is associated with the path which starts at S, goes in tree(S) to u₁, traverses (u₁, v₁), goes in tree(v₁) to the vertex u₂, runs through (u₂, v₂), and so on. Finally, it goes via (uₖ, vₖ) and tree(vₖ) to t; see Fig. 3.2. Observe that this path, if it exists, is unique, for there is exactly one path in a tree from the root to any vertex of the tree. Let Fₖ(G) be the set of words corresponding to paths associated with a sequence of at most k edges. Denote by m the number of productions in G. The number of words in Fₖ(G) does not exceed the number of sequences of at most k edges in graph(G), i.e. Σᵢ₌₀ᵏ mⁱ = O((k + 1)mᵏ). For a fixed k it is a polynomial in m.

Lemma 11 Let G be a linear context-free grammar. F₆(G) is a test set for L(G).

[Figure 3.3: All paths except the lowest one are in F_{i−1}(G).]

Proof: Assume that f₁, f₂ are two morphisms that agree on F₆(G) but do not agree on the language L(G). Let i > 6 be the least number for which there exists a word w in Fᵢ(G) such that f₁(w) ≠ f₂(w).

Denote by π a path in graph(G) which corresponds to the word w and which is associated with a sequence of i edges e₁, e₂, …, eᵢ. After removing the three edges e₂, e₄, e₆ from the path π we obtain four subpaths λ₁, λ₂, λ₃ and λ₄, see Fig. 3.3. For 1 ≤ j ≤ i, let uⱼ and vⱼ be the beginning and the end of the edge eⱼ, respectively. Denote by λ′₁ the path in tree(S) from S to the vertex u₂. Similarly, denote by λ′₂, λ′₃, λ′₄ the paths in tree(v₂) from v₂ to u₄, in tree(v₄) from v₄ to u₆, and in tree(v₆) from v₆ to the terminal node t. There are 16 not necessarily distinct paths from S to t of the form p₁, e₂, p₂, e₄, p₃, e₆, p₄ where pⱼ = λⱼ or pⱼ = λ′ⱼ. All paths except the path π are defined by fewer than i edges, so the words


corresponding to them are in F_{i−1}(G). Therefore the morphisms f₁, f₂ agree on them, but they do not agree on the word w corresponding to the path π. We now construct two morphisms f̂₁, f̂₂ that agree on T₄ but do not agree on L₄; the existence of such morphisms contradicts Lemma 7. The morphisms are defined in the following way:

f̂ₖ(bⱼ) = fₖ(μ(λⱼ)), f̂ₖ(b̄ⱼ) = fₖ(μ̄(λⱼ)), f̂ₖ(aⱼ) = fₖ(μ(λ′ⱼ)), f̂ₖ(āⱼ) = fₖ(μ̄(λ′ⱼ)),

for k = 1, 2 and j = 1, …, 4. This completes the proof. □

The construction of the set F₆(G) follows the steps of the definition of the set. First we construct the graph graph(G), and then for each vertex v of the graph we construct the tree tree(v). Then for each sequence of at most 6 edges we try to build a path associated with this sequence. If the path exists, we scan the labels of the consecutive edges of the path and construct the word corresponding to the path. The path consists of the distinguished edges and of at most seven subpaths in the trees tree(v). The length of a subpath in the trees does not exceed the number n of vertices in graph(G). Therefore the length of the path does not exceed 7(n + 1). Hence each step of the construction can be realized in polynomial time with respect to the size of the grammar. We state this in the following theorem.

Theorem 12 Let L be a linear context-free language generated by a linear context-free grammar G with m productions. Then the language L possesses a test set consisting of O(m⁶) words. The test set may be constructed in polynomial time with respect to the size of G.

3.3 General Context-Free Languages

In this section we prove that each context-free language has a test set with a polynomial number of words. We say that a context-free grammar is in weak Chomsky form if each nonterminal generates at least one terminal word and the productions of the grammar are of the form A → BC, or A → B, or A → w, where A, B, C are nonterminal symbols and w is a terminal symbol or the empty word. Every context-free grammar G can be transformed in polynomial time into a context-free


grammar Ch(G) in weak Chomsky form in the following way. For each terminal symbol a which occurs on the right-hand side of a production in G we define a new nonterminal Nₐ and create for it a new production Nₐ → a. Next we replace each terminal symbol occurring on the right-hand side of a production in G by the corresponding nonterminal. Now each production is of the form Nₐ → a, or A → ε, or A → A₁A₂…Aₖ for k ≥ 1. It remains to transform the productions of the third form for k ≥ 3. For each production A → A₁A₂…Aₖ create k − 2 new nonterminal symbols A′ᵢ, for 2 ≤ i ≤ k − 1, and replace the production by the productions A → A₁A′₂, A′ᵢ → AᵢA′ᵢ₊₁ for 2 ≤ i ≤ k − 2, and A′ₖ₋₁ → Aₖ₋₁Aₖ. The resulting grammar Ch(G) generates the same language as the grammar G. Observe also that |Ch(G)| = O(|G|).

Let G be a grammar in weak Chomsky form. For each nonterminal A choose a word w_A derivable from A in G. Denote by lin(G) the linear context-free grammar created from G by replacing each production of the form A → BC by the three productions A → w_B C, A → B w_C, A → w_B w_C. Denote by L_d the set of words that are generated by derivation trees in which the productions corresponding to vertices lying at level at most d − 1 are in lin(G) and the productions corresponding to the other vertices are in G. Assume the root lies at level 0. Let trunc_h(X) be the set of those words of X for which there are derivation trees of height at most h.
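The transformation into weak Chomsky form can be sketched directly (a sketch under our own representation: a grammar is a dict mapping each nonterminal to a list of right-hand sides, each a tuple of symbols; symbols that are dict keys are nonterminals, the rest terminals; helper names are ours):

```python
def weak_chomsky(grammar):
    out = {A: [] for A in grammar}
    counter = [0]
    def fresh(prefix):
        counter[0] += 1
        return f"_{prefix}{counter[0]}"
    term_nt = {}                      # terminal a -> its nonterminal N_a
    for A, rhss in grammar.items():
        for rhs in rhss:
            if len(rhs) <= 1 and all(s not in grammar for s in rhs):
                out[A].append(rhs)    # A -> a  or  A -> ()  (empty word)
                continue
            # replace every terminal on the right-hand side by N_a
            body = []
            for s in rhs:
                if s in grammar:
                    body.append(s)
                else:
                    if s not in term_nt:
                        term_nt[s] = fresh("N")
                        out[term_nt[s]] = [(s,)]
                    body.append(term_nt[s])
            # split bodies longer than two into a chain of binary rules
            head = A
            while len(body) > 2:
                X = fresh("X")
                out[X] = []
                out[head].append((body[0], X))
                head, body = X, body[1:]
            out[head].append(tuple(body))
    return out

print(weak_chomsky({"S": [("a", "b", "c")]}))
```

Every production of the result has the form A → BC, A → B, or A → w with w a terminal or ε, and the grammar grows only linearly, as |Ch(G)| = O(|G|) requires.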

Lemma 13 Let G be a context-free grammar in weak Chomsky form. Then
1. for h, d ≥ 0, the set trunc_h(L_{d+1}) is a test set for trunc_h(L_d),
2. L(lin(G)) is a test set for L(G).

Proof:

1. Let f, g be two morphisms that agree on trunc_h(L_{d+1}). Let w be a word in trunc_h(L_d) − trunc_h(L_{d+1}). There is a derivation tree for w in which the internal vertices of level at most d correspond to productions in lin(G) and there is a vertex lying at level d + 1 which corresponds to a production of the form A → BC. Thus w = pyzq, where B →* y and C →* z, see Fig. 3.4. After substituting u = pw_B, v = w_C q, x = py and w = zq in Lemma 6, the lemma says that the set of words


[Figure 3.4: Derivation trees for the words pyzq, pw_B zq, pyw_C q.]

X = {pw_B zq, pyw_C q, pw_B w_C q} is a test set for the set X ∪ {w}. Since X ⊆ trunc_h(L_{d+1}), the morphisms f, g agree on X and therefore they have to agree on w. This completes the proof.

2. Assume that L(lin(G)) is not a test set for L(G). Let f, g be two morphisms that agree on L(lin(G)) but not on L(G). Let h be the least number such that there is a word w in L(G) for which there is a derivation tree of height h and f(w) ≠ g(w). As a consequence of part 1 we have that trunc_h(L(lin(G))) = trunc_h(L_h) is a test set for trunc_h(L(G)) = trunc_h(L₀). Since f and g agree on trunc_h(L(lin(G))), they have to agree on trunc_h(L(G)), and thus on w too. This contradiction completes the proof. □

We say that a grammar G defines a word w if it is in weak Chomsky form without useless nonterminals, each nonterminal is on the left-hand side of exactly one production, and L(G) = {w}. In such a grammar the word w has exactly one derivation tree. We treat a grammar defining w as a compressed representation of w.

Example 14 Some grammars define a word of exponential size with respect to the number of their productions, e.g. the grammar with productions Aᵢ → Aᵢ₋₁Aᵢ₋₁ for 1 ≤ i ≤ k, A₀ → a and the start symbol Aₖ defines the word a^{2^k}.

Observe here that sometimes it is impossible to list in polynomial time


the words that are in a test set for a context-free language. Since the language in Example 14 consists of one word, the only test set for it is the language itself, and the length of the only word in the test set is exponential with respect to the size of the grammar defining the language.

Theorem 15 Let L be a context-free language generated by a context-free grammar G. Then the language L possesses a test set consisting of O(|G|⁶) words. Grammars defining the words in the test set may be constructed in polynomial time with respect to the size of G.

Proof: Let G_lin = lin(Ch(G)). By Lemma 13 and Theorem 12, the language F₆(G_lin) is a test set for the language L. Since the number of productions in G_lin is O(|G|), the test set F₆(G_lin) for the language L(G) contains O(|G|⁶) words. The shortest words derivable from nonterminals in the grammar Ch(G) can be of exponential size. Fortunately, for each nonterminal A there is a shortest word derivable from A which is defined by a subgrammar of Ch(G). Take such a word as w_A. Now we treat each w_A as a terminal symbol of the grammar G_lin, remembering that these symbols represent words which are defined by grammars. On the basis of the graph for G_lin we find F₆(G_lin). The words are of length O(|G|) and they can be found in polynomial time. They consist of terminal symbols of G_lin which correspond to words defined by grammars. For each word in the test set we can therefore find a grammar defining this word. □
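The lengths of those shortest words — possibly exponential in |G| — can be computed without ever building the words. A sketch under our own naming, by relaxation to a fixed point:

```python
def shortest_word_lengths(grammar):
    """grammar in weak Chomsky form: dict A -> list of right-hand sides,
    each either a terminal string ('a' or '') or a tuple of nonterminals.
    Returns the length of a shortest terminal word derivable from each
    nonterminal; lengths stay as integers, the words are never built."""
    INF = float("inf")
    best = {A: INF for A in grammar}
    changed = True
    while changed:                      # at most |N| rounds are needed
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                cand = (len(rhs) if isinstance(rhs, str)
                        else sum(best[B] for B in rhs))
                if cand < best[A]:
                    best[A] = cand
                    changed = True
    return best

G = {"A0": ["a"], "A1": [("A0", "A0")], "A2": [("A1", "A1"), "bb"]}
print(shortest_word_lengths(G))  # {'A0': 1, 'A1': 2, 'A2': 2}
```

Recording, for each nonterminal, which right-hand side attained the minimum yields exactly the subgrammar defining a shortest word, as used in the proof.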

3.4 Lower Bounds

In this section we estimate a lower bound on the size of test sets for context-free languages. We present a family Π of finite languages which has the property that for each L ∈ Π the only test set for L is L itself. The languages are generated by grammars of size O(m) and consist of Ω(m³) words. We start by proving an auxiliary lemma.

Lemma 16 T3 is not a test set for L3.

34

Chapter 3. Test Sets

Proof: It is enough to show two morphisms h and g that agree on T₃ but do not agree on L₃, i.e. that do not agree on the word λ̄₃ = b₃b₂b₁b̄₁b̄₂b̄₃.

[Figure 3.5: graph(G₃) and graph_f(G₃).]

Consider the graph graph(G₃) corresponding to the language L₃, see Fig. 3.5. For a given morphism f, graph_f(G₃) denotes the graph obtained from graph(G₃) by replacing each edge label (x, x̄) by the label (f(x), f(x̄)). It is easy to see that each path corresponding to a word w in graph(G₃) corresponds to the word f(w) in graph_f(G₃). To check whether two morphisms h and g agree on w, it is enough to compare the words corresponding to the appropriate paths in graph_h(G₃) and graph_g(G₃). Let h and g be the morphisms for which graph_h(G₃) and graph_g(G₃) look as in Figure 3.6. In other words, h and g are defined as follows:

h(a₃) = h(b̄₃) = ε, h(ā₃) = h(b₃) = p,
h(a₂) = h(b̄₂) = ε, h(ā₂) = h(b₂) = q,
h(a₁) = h(b₁) = h(ā₁) = ε, h(b̄₁) = qp,

g(a₃) = g(b̄₃) = q, g(ā₃) = g(b₃) = ε,
g(a₂) = g(b̄₂) = p, g(ā₂) = g(b₂) = ε,
g(a₁) = g(b₁) = g(ā₁) = ε, g(b̄₁) = qp.
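These images can be checked mechanically. The sketch below (our encoding: barred letters as capitals, ε as the empty string; the letter images are as reconstructed above, so treat the exact values as an assumption) verifies that h and g agree on the seven words of T₃ and differ exactly on λ̄₃:

```python
from itertools import product

h = {"a3": "", "A3": "p", "b3": "p", "B3": "",
     "a2": "", "A2": "q", "b2": "q", "B2": "",
     "a1": "", "A1": "",  "b1": "",  "B1": "qp"}
g = {"a3": "q", "A3": "", "b3": "", "B3": "q",
     "a2": "p", "A2": "", "b2": "", "B2": "p",
     "a1": "",  "A1": "", "b1": "", "B1": "qp"}

def image(m, word):
    return "".join(m[x] for x in word)

# L3 = words c3 c2 c1 cbar1 cbar2 cbar3 with each ci in {ai, bi}
L3 = [[c3 + "3", c2 + "2", c1 + "1",
       c1.upper() + "1", c2.upper() + "2", c3.upper() + "3"]
      for (c3, c2, c1) in product("ab", repeat=3)]
lam3bar = ["b3", "b2", "b1", "B1", "B2", "B3"]

for w in L3:
    if w == lam3bar:
        assert image(h, w) != image(g, w)   # disagreement on lambda-bar_3
    else:
        assert image(h, w) == image(g, w)   # agreement on T_3
print(image(h, lam3bar), image(g, lam3bar))  # pqqp qppq
```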


[Figure 3.6: graph_h(G₃) and graph_g(G₃).]

It is straightforward to check that h, g agree on each word in T₃. On the other hand, h(λ̄₃) = pqqp ≠ qppq = g(λ̄₃). This completes the proof. □

Define linear context-free grammars G^{k₁,k₂,k₃} = (N^{k₁,k₂,k₃}, T^{k₁,k₂,k₃}, P^{k₁,k₂,k₃}, S^{k₁,k₂,k₃}), for k₁, k₂, k₃ ≥ 1, in the following way:

N^{k₁,k₂,k₃} = {A₁, A₂, A₃},
T^{k₁,k₂,k₃} = {a_{i,j}, ā_{i,j} : 1 ≤ j ≤ kᵢ for 1 ≤ i ≤ 3},
S^{k₁,k₂,k₃} = A₃,
P^{k₁,k₂,k₃} = {Aᵢ → a_{i,j}Aᵢ₋₁ā_{i,j} : 1 ≤ j ≤ kᵢ for i = 2, 3} ∪ {A₁ → a_{1,j}ā_{1,j} : 1 ≤ j ≤ k₁}.

Let Π = {L(G^{k₁,k₂,k₃})}_{k₁,k₂,k₃ ≥ 1}. Our next lemma shows that Π has the interesting property we are looking for.

Lemma 17 For k₁, k₂, k₃ ≥ 1, the language L(G^{k₁,k₂,k₃}) is the only test set for L(G^{k₁,k₂,k₃}).

Proof: Suppose that the language L(G^{k₁,k₂,k₃}) possesses a test set T ⊆ L(G^{k₁,k₂,k₃}) such that T ≠ L(G^{k₁,k₂,k₃}). Let

x = a_{3,j₃}a_{2,j₂}a_{1,j₁}ā_{1,j₁}ā_{2,j₂}ā_{3,j₃}


be any word in L(G^{k₁,k₂,k₃}) ∖ T. We define two morphisms h′, g′ in the following way:

g′(a_{i,j}) = g(aᵢ) if j ≠ jᵢ, and g′(a_{i,j}) = g(bᵢ) if j = jᵢ,
g′(ā_{i,j}) = g(āᵢ) if j ≠ jᵢ, and g′(ā_{i,j}) = g(b̄ᵢ) if j = jᵢ,
h′(a_{i,j}) = h(aᵢ) if j ≠ jᵢ, and h′(a_{i,j}) = h(bᵢ) if j = jᵢ,
h′(ā_{i,j}) = h(āᵢ) if j ≠ jᵢ, and h′(ā_{i,j}) = h(b̄ᵢ) if j = jᵢ,

for 1 ≤ j ≤ kᵢ and 1 ≤ i ≤ 3, where g and h are the morphisms from the previous lemma. It follows from the construction that h′ and g′ agree on all words in L(G^{k₁,k₂,k₃}) except x. This contradicts our assumption that T is a test set for L(G^{k₁,k₂,k₃}). □

Now we are ready to prove the main result of this section.

Theorem 18

(a) The lower bound on the size of a test set for languages from the family of linear context-free languages is Ω(m³), where m is the number of productions in a linear context-free grammar which generates the language.

(b) The lower bound on the size of a test set for languages from the family of all finite languages is Ω(n³), where n is the size of the alphabet over which the languages are defined.

(c) There are two different HD0L sequences that coincide on the first Ω(n³) words, where n is the size of the alphabet over which the HD0L systems are defined.

Proof: The family Π is a subfamily of the linear context-free languages and of the finite languages. L(G^{k,k,k}) consists of k³ words, it is produced by a linear context-free grammar with 3k productions, and it is defined over a 6k-letter alphabet. This completes the proof of parts (a) and (b) of the theorem.


Let k₁, k₂, k₃ be pairwise relatively prime. Then the language L(G^{k₁,k₂,k₃}) is generated by a D0L system D = (Σ, h, w), where

Σ = T^{k₁,k₂,k₃},
w = a_{3,1}a_{2,1}a_{1,1}ā_{1,1}ā_{2,1}ā_{3,1},
h(a_{i,j}) = a_{i,next(j)}, h(ā_{i,j}) = ā_{i,next(j)}, where next(j) = j + 1 if j < kᵢ and next(j) = 1 otherwise.

The two HD0L sequences are the morphic images of the D0L sequence generated by the system D; the two morphisms are h′ and g′, for jᵢ = kᵢ, 1 ≤ i ≤ 3, from the previous lemma. The HD0L sequences coincide on the first k₁·k₂·k₃ − 1 elements, but their k₁·k₂·k₃-th elements are different. This completes the proof. □

3.5 Test Sets in Semigroups

In this section, as in [16, 11], we extend the definition of test sets to semigroups. As we shall see, most results of the previous sections can be generalized to semigroups and groups.

Definition 19 Let Σ be an alphabet. A language T is a test set for a language L in a semigroup S ⟺
1. T is a subset of L, and
2. for any two morphisms f, g: Σ* → S: if they agree on T, then they agree on L.

Clearly, the definition of test sets in free monoids generated by at least two generators is equivalent to our previous definition of test sets. First we generalize Lemma 11. For a linear context-free grammar G define Fₖ(G) as previously.

Lemma 20 Let G be a linear context-free grammar and S a semigroup. If Tₖ is a test set for Lₖ in S, then F_{2k−2}(G) is a test set for L(G) in S.


Proof: The proof is a simple generalization of the proof of Lemma 11. □

The basic fact which we needed to prove Lemma 13 works for all groups; its proof is almost the same as the one for free groups.

Fact 21 X = {uv, uw, xv} is a test set for Y = {uv, uw, xv, xw} in every group.

Lemma 13 can be generalized to all groups. The proof is exactly the same if the references to Fact 6 are replaced by references to Fact 21.

Lemma 22 Let G be a context-free grammar in weak Chomsky form. Then
1. for h, d ≥ 0, the set trunc_h(L_{d+1}) is a test set for trunc_h(L_d) in every group,
2. L(lin(G)) is a test set for L(G) in every group.

Now we state the main result of this section. We give a necessary and sufficient condition for a group to have polynomial-size test sets for context-free languages.

Theorem 23 Let H be a group. The following conditions are equivalent.
1. There exists a constant k > 0 such that each language generated by a context-free grammar of size m possesses a test set in H consisting of O(m^k) words.
2. There exists a constant k₀ > 0 such that T_{k₀} is a test set for L_{k₀} in H.

Proof: Let G be a context-free grammar. If there exists k > 0 such that Tₖ is a test set for Lₖ in H, then, by Lemma 20 and Lemma 22, F_{2k−2}(lin(Ch(G))) is a test set for L(G). Since F_{2k−2}(lin(Ch(G))) contains O(|G|^{2k−2}) words, the polynomial upper bound in H exists.

Suppose there does not exist k > 0 such that Tₖ is a test set for Lₖ in H. Then the only test set for Lₖ is Lₖ itself. Since the number of


words in Lₖ is 2^k and Lₖ is defined by a grammar Gₖ of size |Gₖ| = O(k), the size of a test set in H cannot be polynomially bounded. □

By Theorem 23, to prove the existence of polynomial-size test sets for context-free languages in a group, it is enough to find a k such that Tₖ is a test set for Lₖ in this group. In the case of free groups, by Lemma 7, T₄ is a test set for L₄, and therefore in free groups the polynomial upper bound exists.

Chapter 4

The Morphism Equivalence Problem

Our polynomial time algorithm solving the morphism equivalence problem for context-free languages consists of two stages:

• construct a polynomial number of grammars defining words that form a test set for the input context-free language,
• for each word in the test set, construct grammars defining its morphic images and check whether the obtained grammars define the same word.

The first stage of the algorithm was described in Chapter 3. The first part of the second stage is simple: given a grammar defining a word w and a morphism f, it is easy to construct a grammar defining the word f(w). The resulting grammar is obtained from G by replacing each production of the form A → u, where u is a terminal word, by A → f(u), and then by transforming the grammar into weak Chomsky form. The second part of the second stage is solved by the polynomial time algorithm for the words equivalence problem. The words equivalence problem is a generalization of the problem of verifying whether two grammars define the same word. This chapter is entirely devoted to the description of this algorithm. Some ideas in the algorithm are similar to the ones in Makanin's algorithm [22] to decide whether or not an equation on words has a solution.
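The first part of the second stage can be sketched as follows (representation and names are ours: a word-defining grammar is a dict mapping each nonterminal to its single right-hand side):

```python
def image_grammar(grammar, f):
    """Apply a morphism f (dict letter -> string) to a grammar defining a
    word: terminal productions A -> u become A -> f(u); binary
    productions are kept unchanged."""
    return {A: ("".join(f[c] for c in rhs) if isinstance(rhs, str) else rhs)
            for A, rhs in grammar.items()}

def expand(grammar, A):
    """Decompress w_A (exponential in general; only for tiny checks)."""
    rhs = grammar[A]
    if isinstance(rhs, str):
        return rhs
    return "".join(expand(grammar, B) for B in rhs)

G = {"A2": ("A1", "A1"), "A1": ("A0", "A0"), "A0": "ab"}
f = {"a": "xy", "b": "z"}
print(expand(image_grammar(G, f), "A2"))  # xyzxyzxyzxyz
```

After the substitution the right-hand sides f(u) may be long, so the final step mentioned in the text — re-normalizing into weak Chomsky form — would still be applied.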


4.1 The Words Equivalence Problem

We say that a grammar G defines a set of words W if the grammar is in weak Chomsky form (there can be useless nonterminals), each nonterminal is on the left-hand side of exactly one production, each nonterminal generates exactly one terminal word, and W is the set of words derivable from the nonterminals of G. In particular, if G defines a word, then it defines the set of words derivable from the nonterminals of G. Note that for any nonterminal of G there is exactly one derivation tree whose root is labeled by this nonterminal and whose leaves are labeled by terminal letters. Denote by w_A the terminal word derivable from the nonterminal A in G. We say that a nonterminal A is simple if A → w_A is a production in G. The production with a nonterminal A on its left-hand side is called the production for A.

Example 24 For a fixed k consider a grammar G containing the following set of productions:

Aᵢ → Aᵢ₋₁Aᵢ₋₁ for 1 ≤ i ≤ k,
A₀ → a,
A′ → A₀A′ₖ,
A′ᵢ → A′ᵢ₋₁Aᵢ₋₁ for 1 < i ≤ k,
A′₁ → a.

Since w_{Aᵢ} = a^{2^i} for 0 ≤ i ≤ k, w_{A′ᵢ} = a^{2^i − 1} for 1 ≤ i ≤ k, and w_{A′} = a^{2^k}, the grammar G defines the set of words

{a^{2^i} : 0 ≤ i ≤ k} ∪ {a^{2^i − 1} : 1 ≤ i ≤ k}.

We describe a polynomial time algorithm for the following problem:

Words equivalence problem

Let G be a grammar defining a set of words and let S be a set of pairs of nonterminals of G. Decide whether or not

∀(A,B)∈S : w_A = w_B.

The words equivalence problem is a generalization of the problem of testing whether two grammars define the same word. Indeed, assume we are to check whether grammars G₁, G₂ define the same word.


We may assume that the sets of nonterminals of these grammars are disjoint. Let P₁ and P₂ be the sets of productions of G₁ and G₂, respectively, and let A₁ and A₂ be the start symbols of G₁ and G₂, respectively. To check whether G₁, G₂ define the same word it is enough to solve the words equivalence problem for the grammar with the set of productions P₁ ∪ P₂ and for S = {(A₁, A₂)}.

Let G be a grammar with n productions which defines a set of words W. Since the shortest words derivable from a nonterminal of a grammar in weak Chomsky form with n productions are of length at most 2ⁿ, the lengths of words in W can be stored in O(n) bits. Standard algorithms for basic operations (comparison, addition, subtraction) on such numbers work in polynomial time with respect to n. We shall see that this allows us to compute the length of each word in W in polynomial time.

Fact 25 Let G be a grammar which defines a set of words W. The lengths of the words in W can be computed in time polynomial in |G|.

Proof: Define a relation R on the nonterminals of G in the following way: A R B ⟺ the production for A is of the form A → B, or A → BC, or A → CB, for some nonterminal C. The relation R can be extended to a linear order ≻. The lengths of the words w_A are computed for consecutive nonterminals in ≻, starting from the smallest nonterminal in this order. If a nonterminal A is simple, then the number |w_A| is computed directly from the production for A. If the production for A is of the form A → BC or A → CB, for some nonterminals B, C, then |w_A| is the sum of |w_B| and |w_C|; since A ≻ B and A ≻ C, the lengths of w_B and w_C have already been computed. Similarly, if the production for A is of the form A → B, then |w_A| is equal to |w_B|, and since A ≻ B, the number |w_B| has been computed earlier. This completes the proof. □

The key point of the algorithm for the words equivalence problem is to check dependences between the words w_A. The dependences are stored as sets of triples (A, B, i), where A, B are nonterminal or terminal symbols and i is a nonnegative integer. We divide the triples into two groups, which we call suffix and subword triples. A triple (A, B, i) is a suffix triple iff i + |w_B| ≥ |w_A|, and it is a subword triple iff i + |w_B| < |w_A|.
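Fact 25 in code (a sketch with names of our own; in a word-defining grammar every nonterminal has exactly one production, so the lengths follow by structural recursion, with memoization playing the role of the linear order ≻). The example is the grammar of Example 24 for k = 3, with A′ written as `Ap`:

```python
from functools import lru_cache

def word_lengths(grammar):
    """grammar: dict A -> its single right-hand side, either a terminal
    string or a tuple of nonterminals.  Returns {A: |w_A|}; the numbers
    may be exponential in |G| but are never turned into actual words."""
    @lru_cache(maxsize=None)
    def length(A):
        rhs = grammar[A]
        if isinstance(rhs, str):
            return len(rhs)
        return sum(length(B) for B in rhs)
    return {A: length(A) for A in grammar}

G = {"A0": "a", "A1": ("A0", "A0"), "A2": ("A1", "A1"), "A3": ("A2", "A2"),
     "Ap1": "a", "Ap2": ("Ap1", "A1"), "Ap3": ("Ap2", "A2"),
     "Ap": ("A0", "Ap3")}
print(word_lengths(G))
# {'A0': 1, 'A1': 2, 'A2': 4, 'A3': 8, 'Ap1': 1, 'Ap2': 3, 'Ap3': 7, 'Ap': 8}
```

The computed values match the closed forms of Example 24: 2^i for the Aᵢ, 2^i − 1 for the A′ᵢ, and 2^k for A′.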


[Figure 4.1: Match(A, B, i) is true iff the marked subwords are the same. At the top is the case when (A, B, i) is a subword triple, and at the bottom the case when (A, B, i) is a suffix triple.]

An important role in our considerations is played by the predicate Match. Intuitively, Match(A, B, i) is true iff, starting from position i + 1 in the word w_A, consecutive symbols of w_A match consecutive symbols of w_B, see Fig. 4.1. Formally, the predicate is defined as follows.

Definition 26 Let A, B be nonterminals or terminal words and let i be an integer such that 0 ≤ i < |w_A|. Match(A, B, i) is true iff

(A, B, i) is a suffix triple and w_A[i + 1 … |w_A|] = w_B[1 … |w_A| − i], or
(A, B, i) is a subword triple and w_A[i + 1 … i + |w_B|] = w_B.

We define the predicate Match_All, the extension of Match to sets of triples rel, as follows:

Match_All(rel) = ∀(A,B,i)∈rel Match(A, B, i).

The words equivalence problem can be reformulated as follows:


Reformulation of the words equivalence problem

Given a grammar G defining a set of words and a set S of pairs of nonterminals of G, decide whether or not

(∀(A,B)∈S : |w_A| = |w_B|)  and  Match_All(∪_{(A,B)∈S} {(A, B, 0)}).

The algorithm for the words equivalence problem operates on sets of triples. It starts from the set of triples rel = ∪_{(A,B)∈S} {(A, B, 0)} and changes it by applying two operations, Split and Compact. The operations have the property that for each argument r the resulting set r′ satisfies Match_All(r′) = Match_All(r). At the end of the algorithm only simple nonterminals occur in the triples of rel; then the value of the predicate Match_All(rel) is computed directly. Careful application of the operations assures that the number of triples in rel during the execution of the algorithm stays polynomial. Until the end of this chapter we assume that a grammar G defining a set of words W is fixed.

4.2 The Split Operation

Let rel be a set of triples and A a nonterminal of G. This section is devoted to defining the Split operation and to studying some of its properties. Split is defined for two parameters: a nonterminal A and a set of triples rel. If the nonterminal A is nonsimple, then the aim of Split is to eliminate the occurrences of A from the triples in the set rel in such a way that Match_All(Split(A, rel)) = Match_All(rel). The operation is defined as follows. If A is simple, then Split(A, rel) = rel. If the production for A is of the form A → B, for some nonterminal B, then the set Split(A, rel) is obtained from rel by replacing the nonterminal A by B in all triples. If the production for A is of the form A → EF, then Split(A, rel) = ∪_{(B,C,i)∈rel} Split(A, {(B, C, i)}), and the operation for a one-element set {(B, C, i)} is defined next. The definition consists of four main cases, depending on whether B = A and whether C = A. To clarify the definition we omit the set braces for sets consisting of one triple, and in each case we informally show in figures why Match_All(Split(A, (B, C, i))) =


Match(B, C, i). The upper part of the figures illustrates the triple (B, C, i). If Match(B, C, i) is true, then the words that are shaded with the same color are equal. The Split operation uses the equation w_A = w_E w_F to split the words w_A in the upper part of the figures and to replace dependences involving w_A by dependences that involve w_E and w_F. The subcases within the main cases correspond to the various positions of w_E and w_F with respect to the other words. The resulting dependences preserve the value of the predicate Match_All if they force the equality of all words shaded with the same color. This is done by extracting from the upper part of the figures all pairs of words w_F and w_E whose subwords are shaded with the same color. The pairs correspond to the triples in Split(A, (B, C, i)).

• Case B ≠ A and C ≠ A.

Split(A, (B, C, i)) = (B, C, i).

• Case B = A and C ≠ A, see Figs. 4.2–4.4.

Split(A, (A, C, i)) =
  (E, C, i) ∪ (C, F, |w_E| − i)   if |w_E| > i and |w_C| + i > |w_E|,
  (E, C, i)                       if |w_E| > i and |w_C| + i ≤ |w_E|,
  (F, C, i − |w_E|)               if |w_E| ≤ i.

• Case B ≠ A and C = A, see Figs. 4.5–4.6.

Split(A, (B, A, i)) =
  (B, E, i)                       if |w_E| + i ≥ |w_B|,
  (B, E, i) ∪ (B, F, |w_E| + i)   if |w_E| + i < |w_B|.

[Figure 4.2: Case B = A and C ≠ A, subcase |w_E| > i and |w_C| + i > |w_E|. Split(A, (A, C, i)) = (E, C, i) ∪ (C, F, |w_E| − i).]

Figure 4.3: Case B = A and C ≠ A, subcase |w_E| > i and |w_C| + i ≤ |w_E|. Split(A, (A, C, i)) = (E, C, i).

Figure 4.4: Case B = A and C ≠ A, subcase |w_E| ≤ i. Split(A, (A, C, i)) = (F, C, i − |w_E|).

Figure 4.5: Case B ≠ A and C = A, subcase |w_E| + i ≥ |w_B|. Split(A, (B, A, i)) = (B, E, i).

• Case B = A and C = A, see Fig. 4.7-4.11.

Split(A, (A, A, i)) =
  ∅                                           if i = 0,
  (E, E, i) ∪ (E, F, |w_E| − i)               if |w_E| > i ≥ 1 and i ≥ |w_F|,
  (E, E, i) ∪ (F, F, i) ∪ (E, F, |w_E| − i)   if |w_E| > i ≥ 1 and i < |w_F|,
  (F, E, i − |w_E|)                           if |w_E| ≤ i and i ≥ |w_F|,
  (F, E, i − |w_E|) ∪ (F, F, i)               if |w_E| ≤ i and i < |w_F|.
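The case analysis above can be exercised mechanically on concrete words. The sketch below is not from the thesis: it models a triple (B, C, i) on explicit words, reading Match(B, C, i) as "w_C, laid at offset i inside w_B, agrees with w_B on their overlap" (the reading used for suffix triples in Fact 29 below), and implements Split for a production A → EF applied to a single triple. All function and variable names are ours.

```python
def match(words, triple):
    """Match(B, C, i): w_C laid at offset i inside w_B agrees on the overlap."""
    B, C, i = triple
    u, v = words[B], words[C]
    L = min(len(u) - i, len(v))  # length of the overlap
    return u[i:i + L] == v[:L]

def split_one(A, E, F, words, triple):
    """Split for a production A -> EF, applied to the single given triple."""
    B, C, i = triple
    lE, lF = len(words[E]), len(words[F])
    if B != A and C != A:
        return {triple}
    if B == A and C != A:
        if lE > i and len(words[C]) + i > lE:
            return {(E, C, i), (C, F, lE - i)}
        if lE > i:                        # here |w_C| + i <= |w_E|
            return {(E, C, i)}
        return {(F, C, i - lE)}           # |w_E| <= i
    if B != A and C == A:
        if lE + i >= len(words[B]):
            return {(B, E, i)}
        return {(B, E, i), (B, F, lE + i)}
    if i == 0:                            # B = C = A
        return set()
    out = {(E, E, i), (E, F, lE - i)} if lE > i else {(F, E, i - lE)}
    if i < lF:
        out.add((F, F, i))
    return out
```

For instance, with w_E = abab and w_F = ab (so w_A = ababab) the triple (A, A, 2) falls under the subcase of Fig. 4.8 and splits into (E, E, 2) ∪ (E, F, 2); all the produced triples still Match, a spot check of Lemma 27 below.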

The next lemma states formally that the operation Split has the required property.

Lemma 27 Match All(rel) = Match All(Split(A, rel)) for any nonterminal A and any set of triples rel.

Figure 4.6: Case B ≠ A and C = A, subcase |w_E| + i < |w_B|. Split(A, (B, A, i)) = (B, E, i) ∪ (B, F, |w_E| + i).

Figure 4.7: Case B = A and C = A, subcase i = 0. Split(A, (A, A, i)) = ∅.

Figure 4.8: Case B = A and C = A, subcase |w_E| > i ≥ 1 and i ≥ |w_F|. Split(A, (A, A, i)) = (E, E, i) ∪ (E, F, |w_E| − i).

Figure 4.9: Case B = A and C = A, subcase |w_E| > i ≥ 1 and i < |w_F|. Split(A, (A, A, i)) = (E, E, i) ∪ (E, F, |w_E| − i) ∪ (F, F, i).

Figure 4.10: Case B = A and C = A, subcase |w_E| ≤ i and i ≥ |w_F|. Split(A, (A, A, i)) = (F, E, i − |w_E|).

Figure 4.11: Case B = A and C = A, subcase |w_E| ≤ i and i < |w_F|. Split(A, (A, A, i)) = (F, E, i − |w_E|) ∪ (F, F, i).


Proof: If A is simple, then there is nothing to prove. If A → B is the production for A, then w_A = w_B, so the replacement of A by B in the triples of rel does not change the value of the predicate Match All. If A → EF is the production for A, then it is enough to prove the result for a one-element set, i.e. it is enough to prove that

Match All(Split(A, (B, C, i))) = Match(B, C, i).

The equality in the case B ≠ A and C ≠ A is obvious. The equality in the other cases is informally proved in the figures illustrating the definition of Split. We show how to transform those informal proofs into formal ones on the example of the figure for the subcase |w_E| > i ≥ 1 and i < |w_F| of the case B = A and C = A. Then the predicate Match(A, A, i) is equivalent to the equation

w_A[i+1..|w_A|] = w_A[1..|w_A| − i].

Since w_A = w_E w_F, we have

(w_E w_F)[i+1..|w_E| + |w_F|] = (w_E w_F)[1..|w_E| + |w_F| − i].

Now we use the subcase conditions |w_E| > i and i < |w_F| to obtain

w_E[i+1..|w_E|] w_F = w_E w_F[1..|w_F| − i].

This equation is illustrated in Fig. 4.9. The left-hand side of the equation is the word corresponding to the shaded part (consisting of three parts shaded with different colors) of the lower word w_A, and the right-hand side is the word corresponding to the shaded part of the upper word w_A. The two borders between the words w_E and w_F in the upper and lower words w_A divide the shaded areas into three subwords. In the lower word w_A these are w_E[i+1..|w_E|], w_F[1..i], w_F[i+1..|w_F|]; in the upper one they are w_E[1..|w_E| − i], w_E[|w_E| − i + 1..|w_E|], w_F[1..|w_F| − i]. The words shaded with the same color are equal iff the following system of equations is satisfied:

w_E[i+1..|w_E|] = w_E[1..|w_E| − i],
w_F[1..i] = w_E[|w_E| − i + 1..|w_E|],
w_F[i+1..|w_F|] = w_F[1..|w_F| − i].


In the figure this system of equations is also illustrated, in the lower part (below the arrow Split). The lower part of the figure consists of three dependences between the words w_E and w_F. Each of them corresponds to a pair of shaded words in the upper part. The lower part of the figure (and the system of equations) is equivalent to the predicate

Match All((E, E, i) ∪ (E, F, |w_E| − i) ∪ (F, F, i)),

i.e. in this case Match All(Split(A, (A, A, i))) = Match(A, A, i). The other cases can be treated similarly. This completes the proof. □

Let #suffix(rel) and #subword(rel) be the numbers of suffix and subword triples in rel, respectively, and let #(rel) be the number of all triples in rel. Clearly #(rel) = #suffix(rel) + #subword(rel). The following lemma estimates the numbers of suffix and subword triples in Split(A, rel).

Lemma 28 Let rel be a set of triples and A a nonterminal such that |w_A| ≥ max{|w_B|, |w_C|} for each (B, C, i) in rel. Then

#suffix(Split(A, rel)) ≤ 3#(rel),
#subword(Split(A, rel)) ≤ #(rel).

Proof: The first inequality is trivially true, since one triple in rel corresponds to at most three triples in Split(A, rel). The second one is a consequence of the fact that each triple in rel corresponds to at most one subword triple in Split(A, rel). Indeed, look at the definition of Split. In most cases one triple corresponds to one triple, so it remains to consider the five cases in which one triple corresponds to at least two triples. Since a triple (H, H, i) is, for each nonterminal H, a suffix triple, it remains to consider two of those five cases: the case B = A and C ≠ A with |w_E| > i and |w_C| + i > |w_E|, and the case B ≠ A and C = A with i + |w_E| < |w_B|. In the first one, since i + |w_C| > |w_E|, the triple (E, C, i) is a suffix triple; in the second one, since |w_A| ≥ |w_B| (by the assumption of the lemma), we have i + |w_E| + |w_F| = i + |w_A| > |w_B|, so that (B, F, |w_E| + i) is a suffix triple. This completes the proof. □


4.3 The Compact Operation


The second operation we define for a set of triples rel is the Compact operation. Its only parameter is the set of triples rel. The Compact operation reduces the number of suffix triples in the set rel. First, define the operation SimpleCompact(rel) in the following way. If there are three suffix triples (A, B, i), (A, B, j), (A, B, k) in rel such that i < j < k and (j − i) + (k − i) ≤ |w_A| − i, then the operation replaces them by the two triples (A, B, i), (A, B, i + gcd(j − i, k − i)). If there are no such three triples, then it does nothing. The set Compact(rel) is obtained from rel by applying the SimpleCompact operation to rel until no triple is removed. We say that a number p is a period of a word w iff the word w[1..|w| − p] is a suffix of w.
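The definition above can be rendered directly in code. The sketch below is an illustration, not the thesis's implementation; the names and the data layout (triples as tuples, a `length` map giving |w_A|, and a caller-supplied predicate marking suffix triples) are our assumptions.

```python
from math import gcd
from itertools import combinations

def simple_compact(triples, length, is_suffix):
    """Replace suffix triples (A,B,i),(A,B,j),(A,B,k), i < j < k, with
    (j-i)+(k-i) <= |w_A|-i, by (A,B,i) and (A,B,i+gcd(j-i,k-i))."""
    suffix = sorted(t for t in triples if is_suffix(t))
    for (A, B, i), (A2, B2, j), (A3, B3, k) in combinations(suffix, 3):
        if (A, B) == (A2, B2) == (A3, B3) and (j - i) + (k - i) <= length[A] - i:
            return (set(triples) - {(A, B, j), (A, B, k)}) \
                   | {(A, B, i), (A, B, i + gcd(j - i, k - i))}
    return set(triples)

def compact(triples, length, is_suffix):
    """Apply SimpleCompact until no triple is removed."""
    cur = set(triples)
    while True:
        nxt = simple_compact(cur, length, is_suffix)
        if nxt == cur:
            return cur
        cur = nxt
```

Each replacement removes at least one triple, so the iteration terminates; correctness of the merging step is exactly what Fact 29 and the periodicity lemma below establish.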

Fact 29 Assume that (A, B, i) is a suffix triple and Match(A, B, i) is satisfied. Then Match(A, B, j) is satisfied for some j greater than i iff j − i is a period of the word w_B[1..|w_A| − i].

Proof: Since Match(A, B, i) is satisfied and (A, B, i) is a suffix triple, the word w = w_B[1..|w_A| − i] is a suffix of the word w_A. Assume that Match(A, B, j) is satisfied. Since j > i and (A, B, i) is a suffix triple, (A, B, j) is a suffix triple, too. Thus the word w′ = w_B[1..|w_A| − j] is a suffix of w_A. Both w and w′ are suffixes of w_A, and since w is longer than w′, the word w′ is a suffix of w; see Fig. 4.12. Since w′ is also a prefix of w, |w| − |w′| = j − i is a period of w. Assume now that j − i is a period of w_B[1..|w_A| − i]. Then w[1..|w| − (j − i)] = w_B[1..|w_A| − j] is a suffix of w, thus it is a suffix of w_A, too. This means that Match(A, B, j) is satisfied. This completes the proof. □

The following lemma states a well known property of periods of words.

Lemma 30 (periodicity lemma [20]) If x, y are periods of a word w and x + y ≤ |w|, then gcd(x, y) is also a period of w, where gcd stands for the greatest common divisor.
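The notions involved are easy to experiment with. The following throwaway check (ours, not part of the proof) uses the thesis's definition of a period and illustrates the lemma on a word with periods 4 and 6, so that gcd(4, 6) = 2 is forced to be a period too:

```python
from math import gcd

def is_period(w, p):
    # p is a period of w iff w[1..|w|-p] is a suffix of w (the definition above),
    # i.e. w agrees with itself shifted by p positions
    return w[:len(w) - p] == w[p:]

w = "ababababab"  # |w| = 10; periods 4 and 6, and 4 + 6 <= |w|
```

Here the lemma predicts (and the check confirms) that 2 = gcd(4, 6) is also a period of w.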


Figure 4.12: j − i is a period of the word w = w_B[1..|w_A| − i]. Words shaded with the same color are the same.

Observe that the greatest common divisor of two n-bit numbers can be computed in polynomial time with respect to n [1]. Now we are ready to prove that the Compact operation preserves the value of the predicate Match All.

Lemma 31 Match All(rel) = Match All(Compact(rel)) for each set of triples rel.

Proof: It is enough to prove that Match All(rel) = Match All(SimpleCompact(rel)) when SimpleCompact changes rel. Assume that SimpleCompact replaces three suffix triples (A, B, i), (A, B, j), (A, B, k), for i < j < k, by the two triples (A, B, i), (A, B, i + gcd(j − i, k − i)). Assume that Match All(rel) is satisfied. Then Match(A, B, i), Match(A, B, j) and Match(A, B, k) are satisfied, and, by Fact 29, both j − i and k − i are periods of the word w = w_B[1..|w_A| − i]. Since (k − i) + (j − i) ≤ |w_A| − i, by the periodicity lemma gcd(j − i, k − i) is also a period of w. Hence, again by Fact 29, Match(A, B, i + gcd(j − i, k − i)) is satisfied. Assume now that Match All(SimpleCompact(rel)) is satisfied. Then Match(A, B, i) and Match(A, B, i + gcd(j − i, k − i)) are satisfied, so that, by Fact 29, gcd(j − i, k − i) is a period of w. Then j − i and k − i,


as multiples of gcd(j − i, k − i), are periods of w, too. Hence, again by Fact 29, both Match(A, B, j) and Match(A, B, k) are satisfied. This completes the proof. □

Our next lemma states that the Compact operation reduces the number of suffix triples to a polynomial.

Lemma 32 If rel is a set of triples, then #suffix(Compact(rel)) ≤ (2n + 1)n^2.

Proof: Let (A, B, i_1), (A, B, i_2), …, (A, B, i_k) be the sequence of all suffix triples of the form (A, B, i) in Compact(rel). Assume that the integers i_j, for 1 ≤ j ≤ k, are sorted in ascending order. The sequence of triples is uniquely determined by the nonterminals A and B, so the number of such sequences is at most n^2, and it is enough to prove that the length of each such sequence is at most 2n + 1. Since the SimpleCompact operation does not change the set rel, for three consecutive triples (A, B, i_r), (A, B, i_{r+1}), (A, B, i_{r+2}) in the sequence we have (i_{r+2} − i_r) + (i_{r+1} − i_r) > |w_A| − i_r. Hence, by i_{r+1} < i_{r+2}, we have 2i_{r+2} − i_r > |w_A|. Therefore (1/2)(|w_A| − i_r) > |w_A| − i_{r+2}, so that the sequence of numbers |w_A| − i_1, …, |w_A| − i_k has at most 2 log |w_A| + 1 elements. Since the word w_A is not longer than 2^n, the result follows. □

4.4 The Algorithm

Recall now the reformulation of the words equivalence problem.

Reformulation of the words equivalence problem
Given a grammar G defining a set of words and a set S of pairs of nonterminals in G, decide whether or not

(∀(A,B)∈S |w_A| = |w_B|) and Match All(⋃_{(A,B)∈S} {(A, B, 0)}).
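Note that since |w_A| = |w_B| together with Match All({(A, B, 0)}) simply says w_A = w_B, the problem is trivial if the words are expanded explicitly. The sketch below (our names; a grammar is a dict with exactly one production per nonterminal, lowercase symbols read as terminals) makes this brute-force reading precise, and also shows why it is hopeless in general: the expanded words may be exponentially long, which is exactly the blow-up the algorithm of this section avoids.

```python
def word_of(grammar, X, memo=None):
    """Expand the unique word w_X generated by nonterminal X of an acyclic
    grammar in which every nonterminal has exactly one production."""
    if memo is None:
        memo = {}
    if X not in memo:
        memo[X] = "".join(s if s.islower() else word_of(grammar, s, memo)
                          for s in grammar[X])
    return memo[X]

def words_equivalent(grammar, pairs):
    # |w_A| = |w_B| and Match_All({(A, B, 0)}) together just say w_A = w_B
    return all(word_of(grammar, A) == word_of(grammar, B) for A, B in pairs)
```

For example, with A → EF, E → ab, F → EE, the nonterminal A generates ababab, which can be compared against any other nonterminal's word.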

Our algorithm for the words equivalence problem computes the lengths of the words w_A and checks whether for each pair (A, B) in S the equation |w_A| = |w_B| is satisfied. If there is a pair (A, B) in S such that |w_A| ≠ |w_B|, then the algorithm stops and returns false. If the equation holds for each pair in S, then the nonterminals are ordered in such a way that the lengths of the words w_A are sorted in descending order. Then the set of triples rel = ⋃_{(A,B)∈S} {(A, B, 0)} is transformed by alternately applying the Split operation, to remove a nonsimple nonterminal from the triples of rel, and the Compact operation, to lower the number of suffix triples in rel. At the end the triples in rel contain only simple nonterminals, and the value of the predicate Match All(rel) is computed directly.

Algorithm Test;
{input: a grammar G which defines a set of words;
        a set S of pairs of nonterminals from G
 output: verify whether or not (∀(A,B)∈S |w_A| = |w_B|) and Match All(⋃_{(A,B)∈S} {(A, B, 0)}) }
begin
  compute |w_A| for each nonterminal A;
  if there is (A, B) ∈ S such that |w_A| ≠ |w_B| then return false;
  (A_1, …, A_n) := sort nonterminals in descending order according to |w_A|;
  rel := ⋃_{(A,B)∈S} {(A, B, 0)};
  for i := 1 to n do
  begin
    rel := Split(A_i, rel);
    rel := Compact(rel)
  end;
  {there are only simple nonterminals in the triples of rel}
  return Match All(rel)
end.

Theorem 33 Algorithm Test solves the words equivalence problem in polynomial worst-case time.

Proof: The correctness of the algorithm is a consequence of Lemma 27 and Lemma 31. Let rel_i, for i ≥ 1, be the value of the variable rel after the i-th execution of the main iteration in Algorithm Test, and let rel_0 be the value of rel just before the algorithm enters the iteration. To prove


the polynomial worst-case performance of the algorithm it is enough to prove that the number #(rel_i) can be polynomially bounded. Initially, #subword(rel_0) = 0 and #suffix(rel_0) = |S| ≤ n^2, thus #(rel_0) ≤ n^2. By Lemma 32 we have #suffix(rel_i) ≤ (2n + 1)n^2 for i ≥ 1. Since the operation Compact does not change subword triples, we have, by Lemma 28, #subword(rel_i) ≤ #(rel_{i−1}) for i ≥ 1. Hence

#(rel_0) ≤ n^2,
#(rel_i) ≤ #(rel_{i−1}) + (2n + 1)n^2 for i ≥ 1.

The solution to this recurrence is #(rel_i) ≤ (i + 1)(2n + 1)n^2, for i ≥ 0. Since i ≤ n, a polynomial upper bound for #(rel_i) exists. This completes the proof. □

Our main result of this chapter is a consequence of the above theorem and our considerations in the previous chapter.

Theorem 34 The morphism equivalence problem for context-free languages can be solved in polynomial time.

Chapter 5

Applications

This chapter is devoted to showing simple implications of the results from the previous chapters. The results have applications in automata theory, the theory of systems of equations on words, and combinatorics on words. The chapter is divided into three sections. In the first one we generalize our result for the morphism equivalence problem by considering functions more general than morphisms, namely functions defined by deterministic generalized sequential machines (dgsms for short). We prove that the dgsm equivalence problem for context-free languages can be solved in polynomial time. Our considerations are similar to the ones in [2]. In the second section we show the interpretation of our results in the theory of systems of equations on words. The ideas in this section are taken from [6]. The last section is devoted to proving that it is possible to decide whether or not two recursively defined sequences of words agree on a set of positions S. The solution is polynomial with respect to the maximal number in S. The difficulty is that the words in the sequences can be of exponential length with respect to the maximal number in S. The solution, however, is not polynomial in the usual sense, i.e. with respect to the size of the input data. If there is only one position n in the set S, then the input data size is O(log n) and our algorithm works in time polynomial in n. Recently, other applications of the techniques we introduced in Chapter 4 have been found in data compression. We refer the reader


to [19, 18] for details.

5.1 Deterministic Generalized Sequential Machines

A deterministic generalized sequential machine (dgsm) is a tuple M = (Σ, Δ, δ, σ, Q, q_0, F), where Q is a set of states, q_0 ∈ Q is the initial state, F ⊆ Q is a set of accepting states, Σ and Δ are the input and output alphabets, respectively, δ: Q × Σ → Q is a transition function and σ: Q × Σ → Δ* an output function. The functions δ, σ can be extended to functions δ̂, σ̂ whose second argument can be any word in Σ*:

δ̂(q, 1) = q, and δ̂(q, wa) = δ(δ̂(q, w), a),
σ̂(q, 1) = 1, and σ̂(q, wa) = σ̂(q, w) σ(δ̂(q, w), a).

The dgsm M defines the mapping f: Σ* → Δ* such that f(x) = σ̂(q_0, x) if δ̂(q_0, x) ∈ F, and f(x) is undefined otherwise. Denote by D_f the domain of the function f. The language D_f is regular. Observe that morphisms are defined by dgsms with one state which is simultaneously the initial and an accepting state. A natural question is whether our results for morphisms can be extended to functions defined by dgsms. We call this problem the dgsm equivalence problem.
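The definitions above can be condensed into a small executable model (a sketch with names of our choosing; None stands for "undefined"):

```python
class DGSM:
    """Minimal dgsm model: delta and sigma are dicts keyed by (state, letter)."""
    def __init__(self, delta, sigma, q0, accepting):
        self.delta, self.sigma = delta, sigma
        self.q0, self.accepting = q0, set(accepting)

    def __call__(self, x):
        """f(x) = extended sigma if the extended delta lands in F, else None."""
        q, out = self.q0, []
        for a in x:
            out.append(self.sigma[(q, a)])  # output emitted before moving
            q = self.delta[(q, a)]
        return "".join(out) if q in self.accepting else None

# a one-state dgsm whose state is initial and accepting -- i.e. a morphism
double = DGSM({(0, "a"): 0, (0, "b"): 0},
              {(0, "a"): "aa", (0, "b"): "bb"}, 0, {0})

# a two-state dgsm defined only on even-length words over {a}
parity = DGSM({(0, "a"): 1, (1, "a"): 0},
              {(0, "a"): "a", (1, "a"): "b"}, 0, {0})
```

The second machine shows why dgsms are strictly more general than morphisms: its domain is a proper regular subset of Σ*.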

Dgsm equivalence problem
Decide whether or not

∀x ∈ L(G): f^1(x) = f^2(x)

for a context-free grammar G and two mappings f^1, f^2 that are defined by dgsms.

Let M^1 and M^2 be two dgsms defining the functions f^1 and f^2, respectively, with M^i = (Σ, Δ, δ^i, σ^i, Q^i, q_0^i, F^i) for i = 1, 2. For a pair


of states q^1 ∈ Q^1, q^2 ∈ Q^2 define a function g_{q^1,q^2}: Σ* → Δ′*, where Δ′ = Σ × Q^1 × Q^2. For a word w = a_1 … a_k, let

g_{q^1,q^2}(w) = (a_1, q_1^1, q_1^2)(a_2, q_2^1, q_2^2) … (a_k, q_k^1, q_k^2),

where q_1^i = q^i and δ^i(q_j^i, a_j) = q_{j+1}^i, for 1 ≤ j < k, i = 1, 2. Notice that δ̂^i(q^i, w) = δ^i(q_k^i, a_k) for i = 1, 2. Define two morphisms h^1, h^2: Δ′* → Δ* as h^1((a, q^1, q^2)) = σ^1(q^1, a) and h^2((a, q^1, q^2)) = σ^2(q^2, a). Let g = g_{q_0^1,q_0^2}.

Fact 35 f^i(w) = h^i(g(w)) for each w in D_{f^i} and i = 1, 2.

Let G be a grammar in weak Chomsky form. By Fact 35, to solve the dgsm equivalence problem for the language L(G) it is enough to solve two problems. The first problem is to check whether there is a word in L(G) on which one of the functions f^1, f^2 is undefined and the other one is defined, i.e. to check whether the language S = ((D_{f^1} − D_{f^2}) ∪ (D_{f^2} − D_{f^1})) ∩ L(G) is empty. Since the function g is defined for all words, it is enough to check whether the language L_1 = g(S) is empty. The second problem is to solve the morphism equivalence problem for the morphisms h^1, h^2 and the language L_2 = g(D_{f^1} ∩ D_{f^2} ∩ L(G)). The languages L_1, L_2 are context-free, since the languages D_{f^1}, D_{f^2} are regular and context-free languages are closed under taking morphic images and under intersection with regular languages [12]. The following standard construction shows that context-free grammars for the languages L_1 and L_2 can be obtained in polynomial time with respect to the sizes of the automata M^1, M^2 and the grammar G. The grammars G_1 and G_2 for L_1 and L_2, respectively, differ only in the productions for the start symbol S. The set of nonterminals in both grammars is the set

{(q_1^1, q_1^2, A, q_2^1, q_2^2) : q_1^i, q_2^i ∈ Q^i for i = 1, 2, A is a nonterminal in G} ∪ {S}.

The role of a nonterminal (q_1^1, q_1^2, A, q_2^1, q_2^2) is to produce the words w ∈ Δ′* such that w = g_{q_1^1,q_1^2}(v) for some v ∈ Σ*, v is derivable from A in G and


δ̂^i(q_1^i, v) = q_2^i for i = 1, 2. The set of productions in the grammars is a sum of five sets. Four of them, P_1, P_2, P_3, P_4, are common to the grammars G_1 and G_2. The fifth one, P_5, is different: it consists of the productions for the start symbol S. Let P be the set of productions in G. The sets P_i, for i ≤ 4, are defined in the following way:

P_1 = {(q_1^1, q_1^2, A, q_2^1, q_2^2) → (q_1^1, q_1^2, B, q_3^1, q_3^2)(q_3^1, q_3^2, C, q_2^1, q_2^2) : A → BC ∈ P, q_j^i ∈ Q^i for j = 1, 2, 3, i = 1, 2},
P_2 = {(q_1^1, q_1^2, A, q_2^1, q_2^2) → (q_1^1, q_1^2, B, q_2^1, q_2^2) : A → B ∈ P, q_j^i ∈ Q^i for j = 1, 2, i = 1, 2},
P_3 = {(q_1^1, q_1^2, A, q_1^1, q_1^2) → ε : A → ε ∈ P, q_1^i ∈ Q^i for i = 1, 2},
P_4 = {(q_1^1, q_1^2, A, q_2^1, q_2^2) → (a, q_1^1, q_1^2) : A → a ∈ P, q_1^i ∈ Q^i for i = 1, 2, δ^i(q_1^i, a) = q_2^i}.

The set of productions P_5 in G_1 consists of the productions of the form S → (q_1^1, q_1^2, A, q_2^1, q_2^2), where A is the start symbol in G, q_1^i is the initial state in M^i and q_2^i ∈ F^i, for i = 1, 2. The set of productions P_5 in G_2 consists of the productions of the form S → (q_1^1, q_1^2, A, q_2^1, q_2^2), where A is the start symbol in G, q_1^i is the initial state in M^i for i = 1, 2, and either q_2^1 ∉ F^1 and q_2^2 ∈ F^2, or q_2^1 ∈ F^1 and q_2^2 ∉ F^2.

Fact 36
a) (q_1^1, q_1^2, A, q_2^1, q_2^2) derives w iff there is a word v such that w = g_{q_1^1,q_1^2}(v), A derives v in the grammar G, and δ̂^i(q_1^i, v) = q_2^i for i = 1, 2.
b) L(G_1) = L_1 and L(G_2) = L_2.

Given a context-free grammar G, the problem whether L(G) is empty can be solved in polynomial time [12]. Thus the problem whether L_1 is empty can be solved in polynomial time. This completes the proof of our main result in this section.


Theorem 37 The dgsm equivalence problem for context-free languages can be solved in polynomial time.
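The decomposition f^i = h^i ∘ g of Fact 35 can be checked on small machines. In the sketch below the two machines and all names are illustrative, and all states are taken to be accepting, so that both domains are all of Σ*:

```python
def states_before(delta, q0, w):
    """The state q_j in which letter a_j of w is read."""
    qs, q = [], q0
    for a in w:
        qs.append(q)
        q = delta[(q, a)]
    return qs

# two example dgsms over the input alphabet {a, b}, initial state 0
d1 = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
s1 = {(0, "a"): "x", (1, "a"): "y", (0, "b"): "", (1, "b"): "z"}
d2 = {(0, "a"): 0, (0, "b"): 0}
s2 = {(0, "a"): "u", (0, "b"): "v"}

def f(delta, sigma, w):        # the dgsm mapping itself
    return "".join(sigma[(q, a)]
                   for q, a in zip(states_before(delta, 0, w), w))

def g(w):                      # g = g_{q0^1, q0^2}: tag each letter with both states
    return list(zip(w, states_before(d1, 0, w), states_before(d2, 0, w)))

def h1(u):                     # morphism h^1 on the tagged alphabet Delta'
    return "".join(s1[(p, a)] for a, p, _ in u)

def h2(u):                     # morphism h^2 on the tagged alphabet Delta'
    return "".join(s2[(q, a)] for a, _, q in u)
```

Since g tags each letter with the states both machines are in when reading it, the morphisms h^1, h^2 can reconstruct the outputs letter by letter, which is exactly Fact 35.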

5.2 Algebraic Systems of Equations

We consider equations without constants. An equation over a set of variables V is a pair (u, w), usually denoted by u = w, where u, w ∈ V*. A system of equations over the set of variables V is a set of equations over V. A solution of a system of equations over V in a monoid M is a morphism h: V* → M such that h(u) = h(w) for each equation (u, w) in the system. Two systems of equations are equivalent if they have the same sets of solutions. Each language L over an alphabet Σ corresponds to a system S(L) of equations on words. Each letter a in the alphabet Σ has two corresponding variables x_a and y_a in the system, and each word w = a_1 … a_k, for a_j ∈ Σ, corresponds to an equation e(w) of the following form:

x_{a_1} … x_{a_k} = y_{a_1} … y_{a_k}.

The system S(L) consists of the equations e(w) for w ∈ L.

Example 38 Consider the alphabet Σ = {a, b, c} and the language L = {ab^i c : i ≥ 1}. The system S(L) consists of the equations

x_a (x_b)^i x_c = y_a (y_b)^i y_c, for i ≥ 1.

Our next two facts are simple consequences of the fact that the equation e(w), under the substitution x_a = f(a), y_a = g(a), is equivalent to the equation f(w) = g(w).

Fact 39 The morphism equivalence problem for a language L over an alphabet Σ and two morphisms f, g is equivalent to the problem of deciding whether or not the morphism h(x_a) = f(a), h(y_a) = g(a), for a ∈ Σ, is a solution of the system S(L).

Fact 40 A language T is a test set for a language L iff the system S(T) is equivalent to the system S(L) and the system S(T) is a subsystem of the system S(L).
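The correspondence L ↦ S(L) and Fact 39 can be illustrated with a toy encoding (our names; the variables x_a, y_a are encoded as strings):

```python
def e(w):
    """The equation e(w): x_{a1}..x_{ak} = y_{a1}..y_{ak}, as a pair of tuples."""
    return tuple("x" + a for a in w), tuple("y" + a for a in w)

def solves(subst, eq):
    """Is the substitution (variable -> word) a solution of the equation?"""
    lhs, rhs = eq
    return "".join(subst[v] for v in lhs) == "".join(subst[v] for v in rhs)

# Fact 39 in miniature: with x_a = f(a), y_a = g(a), e(w) holds iff f(w) = g(w)
f = {"a": "bb", "b": "c"}
g = {"a": "b", "b": "bc"}      # f(ab) = bbc = g(ab), but f(ba) != g(ba)
subst = {"xa": f["a"], "xb": f["b"], "ya": g["a"], "yb": g["b"]}
```

Here the substitution solves e(ab) but not e(ba), mirroring the fact that f and g agree on ab but not on ba.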


Fact 39 presents the formulation of the morphism equivalence problem in terms of systems of equations, and Fact 40 shows that the notion of a test set also has a natural interpretation in terms of systems of equations. A system of equations S over variables V is called algebraic if there are two morphisms f, g: Σ* → V*, for some alphabet Σ, and a context-free language L ⊆ Σ* such that S = {f(w) = g(w) : w ∈ L}. An algebraic system defined by morphisms f, g and by a language L is denoted by [f, g](L). We assume that the language L is represented by a context-free grammar. This definition of algebraic systems of equations is equivalent to the standard one, which is based on push-down transducers [6, 5]. Observe that S(L) = [f_x, f_y](L), where f_x(a) = x_a, f_y(a) = y_a, for each a in Σ. In particular, if the language L is context-free, then the system S(L) is algebraic.

Fact 41 Let T and L be languages. If the language T is a test set for the language L, then for any two morphisms f, g the system [f, g](T) is equivalent to the system [f, g](L).

Proof: If T is a test set for L, then, by Fact 39, the system S(T) is equivalent to the system S(L). The result follows from the fact that, for any language L, a morphism h is a solution of a system [f, g](L) iff the substitution x_a = h(f(a)), y_a = h(g(a)) is a solution of the system S(L). □

The following theorem is a straightforward consequence of Fact 41 and Theorems 15, 18 and 34 in the previous chapters.

Theorem 42
1. Each algebraic system of equations, given by a context-free grammar of size m and by two morphisms, possesses an equivalent algebraic subsystem with O(m^6) equations. A grammar and two morphisms that represent the subsystem can be generated in polynomial time.
2. There are algebraic systems of equations, represented by a grammar of size m and by two morphisms, every equivalent subsystem of which consists of Ω(m^3) equations.


3. There is a polynomial time algorithm to decide whether or not a substitution is a solution of an algebraic system of equations which is given by a context-free grammar and two morphisms.

5.3 Recursive Sequences of Words

We say that a sequence of words {f_i}_{i≥1} is recursively defined iff there are a natural number c and k functions l_i: N → N, for 1 ≤ i ≤ k, such that 1 ≤ l_i(n) < n for n > c, each l_i(n) can be computed in time polynomial in n, and f_n = f_{l_1(n)} … f_{l_k(n)} for n > c.

Example 43 The sequence of Fibonacci words is recursively defined. For this sequence c = 2, k = 2, l_1(n) = n − 1, l_2(n) = n − 2, with f_1 = a, f_2 = b and f_n = f_{n−1} f_{n−2}.
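Example 43, executably (a sketch; memoization keeps the recursion cheap even though the words themselves grow exponentially):

```python
from functools import lru_cache

# c = 2, k = 2, l1(n) = n - 1, l2(n) = n - 2, f1 = a, f2 = b
@lru_cache(maxsize=None)
def fib_word(n):
    if n <= 2:
        return "a" if n == 1 else "b"
    return fib_word(n - 1) + fib_word(n - 2)   # f_n = f_{l1(n)} f_{l2(n)}
```

The lengths |f_n| are Fibonacci numbers, so f_n is exponentially long in n; this is why the proof of Theorem 44 below feeds a grammar with productions of the shape F_n → F_{n−1} F_{n−2} to the words equivalence problem instead of the expanded words themselves.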

Recursively defined sequences form quite a big class of sequences. Any HD0L sequence can easily be generated as a regular subsequence of a recursively defined sequence. Let (Σ, g, h, w) be an HD0L system. The words of the sequence can be generated by the following system of recurrent equations:

f_{0,a} = g(a),
f_{i,a} = f_{i−1,a_1} … f_{i−1,a_s} if h(a) = a_1 … a_s, for a_j ∈ Σ, i ≥ 1,
f_{i,w} = f_{i,w_1} … f_{i,w_t} if w = w_1 … w_t, for w_j ∈ Σ, i ≥ 0.

Note that f_{i,a} = g(h^i(a)) for i ≥ 0. Thus the sequence {f_{i,w}}_{i≥1} forms the HD0L sequence. This set of words can be arranged in the following recursively defined sequence of words:

f_{0,a_1}, f_{0,a_2}, …, f_{0,a_k}, f_{0,w}, f_{1,a_1}, f_{1,a_2}, …, f_{1,a_k}, f_{1,w}, …, f_{i,a_1}, f_{i,a_2}, …, f_{i,a_k}, f_{i,w}, …,

where a_1, a_2, …, a_k is a sequence of all letters in Σ. The subsequence equivalence problem for two recursively defined sequences of words {f_i^1}_{i≥1}, {f_i^2}_{i≥1} and a finite set of numbers S ⊆ N


consists in deciding whether or not ∀s∈S f_s^1 = f_s^2. The input to the problem consists of the constants k_1, k_2, the algorithms computing the functions l_i^1, l_i^2, the numbers c_1, c_2, the words {f_i^1}_{1≤i≤c_1}, {f_i^2}_{1≤i≤c_2} and the set of numbers S. This problem can easily be transformed into the words equivalence problem.

Theorem 44 The subsequence equivalence problem for two recursively defined sequences of words and a set of numbers S can be solved in time polynomial in m = max_{s∈S} s.

Proof: Let {f_i^1}_{i≥1}, {f_i^2}_{i≥1} be two recursively defined sequences of words, and let m be the maximal number in S. First, the values of the functions l_j^1(i), l_j^2(i), for i ≤ m, are computed. This takes time polynomial in m. The input to the words equivalence problem is constructed in the following way. In the grammar G_1 below, the only word derivable from the nonterminal F_i^1 is f_i^1. The set of productions in G_1 consists of the productions F_n^1 → F^1_{l_1^1(n)} … F^1_{l_{k_1}^1(n)}, for m ≥ n > c_1, and F_n^1 → f_n^1, for 1 ≤ n ≤ c_1. Similarly, we construct a grammar G_2 for the words f_i^2. The input data for the words equivalence problem are the grammars Ch(G_1) and Ch(G_2) and the set of pairs of nonterminals {(F_s^1, F_s^2) : s ∈ S}. This completes the proof. □

Observe that our algorithm is not polynomial time with respect to the size of the input data. For instance, for two sequences f and g such that f_1 = f_2 = a, f_n = f_{n−1} f_{n−2} for n ≥ 3, and g_1 = g_2 = a, g_n = g_{n−2} g_{n−1} for n ≥ 3, and for the set S = {m}, the size of the input data to the problem is O(|S|) = O(log m), while our algorithm works in time polynomial in m, i.e. exponential in the size of the input. Let n be the size of an alphabet. A well-known 2^n conjecture for D0L sequences states that if two D0L sequences coincide on the first 2^n elements, then they are the same. Up to now it is not known whether or not the conjecture holds true. Should it turn out to be true, then our algorithm could be used to decide in time polynomial in n whether or not two D0L sequences are identical.
Notice that the definition of a D0L system contains the definition of a morphism, which is of size at least n. Thus the algorithm works in polynomial time with respect to the size of the input D0L system. An analogous conjecture for HD0L sequences does not hold, as part (c) of Theorem 3.4 in Section 3.4 shows.


It may be, however, that in the case of HD0L sequences the bound 2^n can be replaced by some polynomial in n. Then again our algorithm could be used to decide in polynomial time whether or not two HD0L sequences are the same.

Chapter 6

Conclusion and Open Problems

We proved in Chapter 4 that the morphism equivalence problem for context-free languages can be solved in polynomial time. The estimations of this complexity which can be concluded from our considerations give a polynomial of huge degree. Recall that the algorithm consists of two phases. In the first one, a representation of a test set for the input language is constructed. In the second one, representations of the morphic images of all words in the test set are constructed and, using the algorithm for the words equivalence problem, the equalities of pairs of those images are verified. The number of words in the constructed test set (and the number of equalities to verify) is O(m^6), where m is the size of the input grammar. Our estimation of the number of triples which may occur in the algorithm for the words equivalence problem is O(m^3), which gives an Ω(m^3) estimation for the time complexity of this algorithm. Therefore our estimation of the time complexity of the algorithm for the words equivalence problem would be Ω(m^9). There are two possible ways to obtain improvements. The first one is to narrow the gap between the bounds Ω(m^3) and O(m^6) for the size of test sets for context-free languages. Each improvement of the upper bound lowers the estimation of the time complexity of the algorithm. However, even if the tight bound were Θ(m^3), the estimation would be Ω(m^6), which is still huge. The second way to improve the time complexity is to develop a new algorithm. The weak point of our


algorithm is that it verifies the equalities of the morphic images separately. A more efficient algorithm could verify many equalities simultaneously, using a grammar generating the test set. The next interesting group of questions deals with the morphism equivalence problem in semigroups other than free ones. The most basic question is: for which semigroups are there polynomial time algorithms for the morphism equivalence problem or the words equivalence problem? Even the case of free groups is open. The existence of polynomial size test sets for context-free languages in a semigroup is helpful in solving the morphism equivalence problem in polynomial time. We showed that a necessary and sufficient condition for groups in which context-free languages possess polynomial size test sets is the existence of a number k such that T_k is a test set for L_k. Which groups satisfy this condition? The third group of open problems was mentioned in the section on applications of our algorithm for the words equivalence problem in combinatorics on words. Does an f(n)-conjecture hold for HD0L sequences? If it does and f(n) is a polynomial, then our algorithm can be used to verify the equality of two HD0L or D0L systems in polynomial time. Summarizing, we formulate the following open problems:

• design an efficient algorithm for the morphism equivalence problem for context-free languages,
• find a tight bound for the size of test sets for context-free languages,
• describe the semigroups in which context-free languages possess polynomial size test sets, and those in which the morphism equivalence problem can be solved in polynomial time,
• verify whether for some polynomial f(n) an f(n)-conjecture holds for D0L or HD0L systems.

Bibliography

[1] A. Aho, J. Hopcroft, J. Ullman, "The Design and Analysis of Computer Algorithms", Addison-Wesley Publishing Company, 1974.
[2] J. Albert, K. Culik II, Test sets for homomorphism equivalence on context-free languages, Inform. and Control 45 (1980), 273-284.
[3] J. Albert, K. Culik II, J. Karhumäki, Test sets for context-free languages and algebraic systems of equations, Inform. and Control 52 (1982), 172-186.
[4] M.H. Albert, J. Lawrence, A proof of Ehrenfeucht's Conjecture, Theoret. Comput. Sci. 41 (1985), 121-123.
[5] J. Berstel, "Transductions and Context-Free Languages", B.G. Teubner, Stuttgart, 1979.
[6] K. Culik II, J. Karhumäki, Systems of equations over a free monoid and Ehrenfeucht's Conjecture, Discrete Math. 43 (1983), 139-153.
[7] K. Culik II, A. Salomaa, On the decidability of homomorphism equivalence problem for languages, JCSS 17 (1978), 163-175.
[8] K. Culik II, A. Salomaa, Test sets and checking words for homomorphism equivalence, JCSS 20 (1980), 379-395.
[9] A. Ehrenfeucht, J. Karhumäki, G. Rozenberg, On binary equality sets and a solution to the Test Set Conjecture in the binary case, J. Algebra 85 (1983), 76-85.

[10] W. Guba, The equivalence of infinite systems of equations in free groups and semigroups to their finite subsystems, Math. Zametki 40(1986). [Russian]

[11] T. Harju, J. Karhumäki, W. Plandowski, Compactness of systems of equations in semigroups, in Proc. of ICALP'95. [to appear]

[12] J. Hopcroft, J. Ullman, "Introduction to Automata Theory, Languages, and Computation", Addison-Wesley Publishing Company, 1979.

[13] S. Jarominek, J. Karhumäki, W. Rytter, Efficient construction of test sets for regular and context-free languages, Theoret. Comput. Sci. 116(1993) 305–316.

[14] J. Karhumäki, The Ehrenfeucht Conjecture: A compactness claim for finitely generated free monoids, Theoret. Comput. Sci. 29(1984) 285–308.

[15] J. Karhumäki, Equations over finite sets of words and equivalence problems in automata theory, Theoret. Comput. Sci. 108(1993) 103–118.

[16] J. Karhumäki, W. Plandowski, On the size of independent systems of equations in semigroups, in Proc. of MFCS'94, Lect. Notes in Comput. Sci., Vol. 729, 443–452, 1994.

[17] J. Karhumäki, W. Plandowski, W. Rytter, Polynomial size test sets for context-free languages, JCSS 50(1995) 11–19.

[18] M. Karpinski, W. Plandowski, W. Rytter, The fully compressed string-matching for Lempel-Ziv encoding, manuscript, 1995.

[19] M. Karpinski, W. Rytter, A. Shinohara, Pattern-matching for strings with short description, in Proc. of CPM'95, 1995. [to appear]

[20] M. Lothaire, "Combinatorics on Words", Addison-Wesley Publishing Company, Massachusetts, 1983.

[21] R. Lyndon, P. Schupp, "Combinatorial Group Theory", Springer-Verlag, 1977.

[22] G.S. Makanin, The problem of solvability of equations in a free semigroup, Math. USSR Sbornik 32(1977) 129–198. [Russian]

[23] W. Plandowski, Testing equivalence of morphisms on context-free languages, in Proc. of ESA'94, Lect. Notes in Comput. Sci., Vol. 855, 460–470, 1994.

[24] G. Rozenberg, A. Salomaa, "The Mathematical Theory of L Systems", Academic Press, New York, 1980.

[25] P. Turakainen, "On Some Transducer Equivalence Problems for Families of Languages", Mathematics, University of Oulu, Finland, 1987.
