On the complexity of computing the order of repetition of a string
Juhani Karhumäki
Department of Mathematics and Turku Centre for Computer Science, Turku University, Finland. Supported by Academy of Finland under grant 14047. Email:
[email protected].
Wojciech Plandowski
Turku Centre for Computer Science, Finland. On leave from Instytut Informatyki, Uniwersytet Warszawski, Banacha 2, 02-097 Warszawa, Poland. Email:
[email protected]. Supported by KBN grant 8 T11C 039 15.
Wojciech Rytter
Instytut Informatyki, Uniwersytet Warszawski, Banacha 2, 02-097 Warszawa, Poland, and Department of Computer Science, University of Liverpool. Email:
[email protected].
Turku Centre for Computer Science TUCS Technical Report No 226 November 1998 ISBN 952-12-0345-5 ISSN 1239-1891
Abstract
We show a simple O(n log n) time algorithm computing the order of repetition in a string. A parallel version of the algorithm works in O(log n) time on n processors. The algorithm can be extended to report all squares and maximal repetitions in a string.
TUCS Research Group
Mathematical Structures of Computer Science
1 Introduction
Studying regularities in strings has a long tradition in combinatorics on words. Starting from the fundamental work of Thue [23, 24], who considered the existence of infinite repetition-free words, many authors have investigated such words in various contexts; see [7] for a review. In the case of infinite words defined by iterated morphisms, the problems of repetition-freeness become difficult even for a restricted class of morphisms [6]. On the other hand, detecting whether a morphism is square-free is simple, while, surprisingly, detecting whether it is cube-free is still open. For D0L languages, in turn, even deciding whether the order of repetition of words in the language is unbounded is a nontrivial problem [12, 14].

There are many algorithms for detecting squares in a string; see [5, 8, 20] for sequential algorithms and [2, 3, 11] for parallel ones. The sequential algorithms work in $O(n \log n)$ time, which is proved to be optimal [20]. The parallel ones are work-optimal, i.e. the total number of operations in them is $O(n \log n)$, and they work in polylogarithmic time. The algorithm in [8] additionally computes maximal repetitions in a string in time $O(n \log n)$. For a constant size alphabet there are linear time algorithms detecting square-freeness [9, 18] and leftmost maximal repetitions [19]. The algorithm in [15] detects, for a constant size alphabet, all squares in time linear in $n$ plus the number of squares in the string. A simple randomized $O(n \log n)$-time algorithm for finding the leftmost repetition in a string over a constant size alphabet is given in [22].

Our contribution to the topic is the first algorithm which computes, in all cases, the order of repetition of a string. Our algorithm works in $O(n \log n)$ time for a constant size alphabet. Its parallel version works in $O(\log n)$ time and uses $n$ processors. The algorithm can be extended to report all primitive squares and maximal repetitions.
Our notion of the order of repetition of a word allows fractional repetitions, contrary to earlier definitions [8, 16]. The order of repetition of a word $w$ is the maximal number $\alpha$ such that there is a factor of $w$ of the form $u^i u'$, with $u'$ a prefix of $u$, such that $\alpha = \frac{|u^i u'|}{|u|}$. In other words, the order of repetition of $w$ is the maximum number of the form $\frac{|v|}{\mathrm{period}(v)}$, where $v$ is a factor of $w$ and $\mathrm{period}(v)$ is the shortest period of $v$. The order of repetition is a measure of the irregularity of a string: the smaller the order, the fewer regularities of type $u^i u'$. In particular, if the order of repetition is smaller than two then the string is square-free, i.e. it does not contain a regularity of the form $uu$. Order one means that there are no regularities at all: the string is composed of pairwise different letters. From the algorithms in [8, 19] one can derive an algorithm which computes the order of repetition if it is at least 2. The techniques used there do not seem to be extendible to compute the order of repetition if it is smaller than 2.
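The definition can be made concrete with a brute-force computation. The following is only a sketch that evaluates $|v| / \mathrm{period}(v)$ for every factor $v$; it is not the algorithm of this report, and the function names `shortest_period` and `order_of_repetition_naive` are chosen here for illustration.

```python
from fractions import Fraction


def shortest_period(v: str) -> int:
    """Smallest p >= 1 such that v[k] == v[k + p] for every valid k."""
    for p in range(1, len(v) + 1):
        if all(v[k] == v[k + p] for k in range(len(v) - p)):
            return p
    raise ValueError("the empty word has no period")


def order_of_repetition_naive(w: str) -> Fraction:
    """Maximum of |v| / period(v) over all non-empty factors v of w."""
    n = len(w)
    return max(
        Fraction(j - i, shortest_period(w[i:j]))
        for i in range(n)
        for j in range(i + 1, n + 1)
    )
```

For instance, `order_of_repetition_naive("abcab")` evaluates to 5/3: the word is square-free, yet the whole word has period 3 and length 5. Exact fractions are used so that orders below 2 are compared without rounding.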
2 The algorithm
Our construction uses suffix trees. A suffix tree for a word $w$ is a compacted trie containing all suffixes of the word $w\$$, where $\$$ is a symbol which does not occur in $w$; see Fig. 1. Let $T(w)$ be a suffix tree for a string $w$. Denote by $f(v)$, for a vertex $v$ of $T(w)$, the factor of $w$ which corresponds to the path from the root of $T(w)$ to $v$. It is well known that $T(w)$ can be computed sequentially in linear time and in parallel in $O(\log n)$ time using $n$ processors; see [21, 25] and [4], respectively.
Figure 1: A suffix tree for the word $w = babaab$. Each path from the root to a leaf is labelled by a suffix of $w\$$.

A factor $f$ is called a special factor of $w$ iff there are two distinct letters $a$ and $b$ such that $fa$ and $fb$ are also factors of $w$, or $f$ is a suffix of $w$ and $f$ occurs at least twice in $w$. It is well known that the set of special factors is precisely the set of all words $f(v)$ for internal vertices $v$ of $T(w)$. Denote by $gcf(i, j)$ the greatest common factor which starts in the word $w$ at positions $i$ and $j$, i.e. the longest word occurring at both positions. Clearly, for any two distinct positions $i$, $j$ the word $gcf(i, j)$ is a special factor. The converse is also true. Indeed, a special factor $f$ is the greatest common factor of the positions of occurrences of $fa$ and $fb$ in the word, or of two positions of occurrences of $f$, one of which is a suffix.
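The correspondence between special factors and the words $gcf(i, j)$ can be checked on the word of Fig. 1. The sketch below works directly from the definitions, without building a suffix tree; positions are 1-based as in the text, and the helper names `gcf` and `special_factors` are illustrative.

```python
def gcf(w: str, i: int, j: int) -> str:
    """Longest word that starts in w at both positions i and j (1-based)."""
    a, b = i - 1, j - 1
    k = 0
    while a + k < len(w) and b + k < len(w) and w[a + k] == w[b + k]:
        k += 1
    return w[a:a + k]


def special_factors(w: str) -> set:
    """Special factors of w, computed directly from the definition."""
    n = len(w)
    factors = {w[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    special = set()
    for f in factors:
        # One-letter right extensions of f that are again factors of w.
        extensions = {f + c for c in set(w) if f + c in factors}
        occurrences = sum(w.startswith(f, s) for s in range(n - len(f) + 1))
        if len(extensions) >= 2 or (w.endswith(f) and occurrences >= 2):
            special.add(f)
    return special


w = "babaab"
pairs = {gcf(w, i, j) for i in range(1, len(w) + 1) for j in range(i + 1, len(w) + 1)}
```

For $w = babaab$ both `pairs` and `special_factors(w)` consist of the empty word together with a, b, ab and ba, which are exactly the labels of the internal vertices of the tree in Fig. 1 (the empty word corresponds to the root).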
Lemma 2.1 The order of repetition in a word $w$ is of the form $1 + \frac{|f(v)|}{p}$, where $v$ is a vertex of the suffix tree $T(w)$ and $p$ is the distance between the closest occurrences of $f(v)$ in $w$.
Proof: Suppose that $\alpha$ is the order of repetition of a string $w$. Let $u^i u'$ be a factor of $w$ such that $\alpha = \frac{|u^i u'|}{|u|}$ and which occurs in $w$ at position $j$. Consider the occurrences of $u^{i-1} u'$ at positions $j$ and $j + |u|$. Note that, since the order of repetition of $u^i u'$ is the same as that of $w$, $gcf(j, j + |u|) = u^{i-1} u'$. This means that $u^{i-1} u'$ is a special factor of $w$, i.e. $u^{i-1} u' = f(v)$ for some vertex $v$ of $T(w)$. Then we have $\alpha = 1 + \frac{|f(v)|}{|u|}$. Further, by the maximality of $\alpha$, $|u|$ has to be the shortest distance between two consecutive occurrences of $f(v)$ in $w$. □

We say that an order of repetition $\alpha$ is based on some set $S$ of positions of occurrences of a factor $f$ in $w$ iff $\alpha$ is the maximal value of $1 + \frac{|f|}{i - j}$, where $i, j \in S$ and $i > j$.
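A direct consequence of Lemma 2.1 and of the definition of $gcf$ is that the order of repetition equals the maximum of $1 + |gcf(j, i)| / (i - j)$ over all pairs of positions $j < i$ (and 1 if there is no such pair). The sketch below evaluates this maximum by brute force over all pairs, which is slower than restricting attention to closest occurrences of special factors as the lemma allows; indices are 0-based here and the function names are illustrative.

```python
from fractions import Fraction


def lcp_len(w: str, a: int, b: int) -> int:
    """Length of the longest common prefix of w[a:] and w[b:], for a < b."""
    k = 0
    while b + k < len(w) and w[a + k] == w[b + k]:
        k += 1
    return k


def order_via_position_pairs(w: str) -> Fraction:
    """Maximum of 1 + |gcf(j, i)| / (i - j) over all positions j < i."""
    best = Fraction(1)
    for j in range(len(w)):
        for i in range(j + 1, len(w)):
            best = max(best, 1 + Fraction(lcp_len(w, j, i), i - j))
    return best
```

For $w = abcab$ the pair of occurrences of $ab$ at distance 3 gives $1 + 2/3 = 5/3$, and no other pair gives more.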
Lemma 2.2 (Periodicity lemma of Fine and Wilf) Let $p$, $q$ be periods of a word $w$. If $p + q \le |w|$, then $\gcd(p, q)$ is also a period of $w$, where $\gcd$ stands for the greatest common divisor.
As an immediate consequence of the periodicity lemma we have.
Lemma 2.3 Let $p_1$, $p_2$, $p_3$ be starting positions of three consecutive occurrences of a word $f$ in $w$ that contain a common position $t$ of $w$. Then $p_2 - p_1 = p_3 - p_2$.
Proof: The assumptions are illustrated in Fig. 2. Then $p = p_2 - p_1$ and $q = p_3 - p_2$ are periods of $f$ and $p + q \le |f|$. By the periodicity lemma $\gcd(p, q)$ is also a period of $f$. Since $p_1$, $p_2$ and $p_3$ are consecutive occurrences of $f$ we have $p = \gcd(p, q) = q$. This completes the proof. □
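Lemma 2.3 can be observed experimentally. The sketch below lists the occurrences of a factor that cover a given position and tests whether all consecutive gaps are equal; for three occurrences this is the statement of the lemma, and for longer runs it follows by applying the lemma to each consecutive triple. Positions are 1-based and the helper names are illustrative.

```python
def occurrences_covering(w: str, f: str, t: int) -> list:
    """1-based starting positions of the occurrences of f in w that contain position t."""
    m = len(f)
    return [
        s + 1
        for s in range(len(w) - m + 1)
        if w.startswith(f, s) and s + 1 <= t <= s + m
    ]


def is_arithmetic(ps: list) -> bool:
    """True if all consecutive differences in ps are equal (vacuously for < 3 items)."""
    return all(ps[k + 1] - ps[k] == ps[1] - ps[0] for k in range(len(ps) - 1))


starts = occurrences_covering("ababababab", "ababab", 5)
```

Here `starts` is `[1, 3, 5]`: the three occurrences of $ababab$ all contain position 5 and are spaced by the period 2.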
A straightforward consequence of Lemma 2.3 is the key point in our algorithm.
Figure 2: An illustration of assumptions of Lemma 2.3.
Corollary 2.4 Let $p_1, p_2, \ldots, p_n$ be starting positions of consecutive occurrences of a word $f$ in $w$ that contain a common position $t$ of $w$. Then $p_1, p_2, \ldots, p_n$ form an arithmetic progression.

Let $w$ be a word of length $n$ and $v$ be a vertex of a suffix tree $T(w)$. Denote by $chain(v)$ the set of starting positions of those occurrences of the word $f(v)$ in $w$ that contain the position $\frac{n}{2}$. By Corollary 2.4, they form an arithmetic progression and therefore can be stored as three numbers: the first occurrence, the last occurrence and the period of the progression. Observe here that $chain(v)$ can be computed on the basis of the values of $chain$ for the children of $v$ in $T(w)$. Let $max(v)$ be the position of the latest occurrence of $f(v)$ in the first half of $w$, i.e. in $w[1..\frac{n}{2}]$. If such a position does not exist we set $max(v) = \infty$. Similarly, let $min(v)$ be the position of the first occurrence of $f(v)$ in the second half of $w$, i.e. in $w[\frac{n}{2}..n]$. Again, if such a position does not exist we set $min(v) = \infty$. Observe now that $max(v)$ and $min(v)$ can be easily computed on the basis of the values of $max$, $min$ and $chain$ for the children of $v$ in $T(w)$. Indeed, to compute $max(v)$ we first extract from the set $max(v') \cup chain(v')$, for each child $v'$ of $v$, the positions $i$ such that $i + |f(v)| \le \frac{n}{2}$, and take the maximum of those values over all children $v'$ of $v$.

Given the values $max(v)$, $min(v)$ and $chain(v)$, the order of repetition which is based on the positions of occurrences of $f(v)$ in the set $max(v) \cup chain(v) \cup min(v)$ can be computed using the formula
$$1 + \max\left\{ \frac{|f(v)|}{p(v)},\ \frac{|f(v)|}{first(v) - max(v)},\ \frac{|f(v)|}{min(v) - last(v)} \right\} \tag{1}$$
where $p(v)$ is the period of the arithmetic progression $chain(v)$, $first(v)$ is the first element of $chain(v)$ and $last(v)$ is the last element of $chain(v)$. If $chain(v) = \emptyset$ the formula is simplified to
$$1 + \frac{|f(v)|}{min(v) - max(v)}.$$
The above observations lead to the following algorithm which computes the order of repetition of $w$.
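The quantities $max$, $chain$, $min$ and formula (1) can be illustrated for a single factor by scanning its occurrences directly. This is a sketch under several assumptions that are not part of the text: $\frac{n}{2}$ is rounded down, a missing $max$ or $min$ is represented by `None` (its term is simply dropped), $chain$ is kept as a full list rather than three numbers, and the term with $p(v)$ is used only when $chain$ has at least two elements. The names `vertex_data` and `formula_1` are illustrative.

```python
from fractions import Fraction


def vertex_data(w: str, f: str):
    """Return (max_pos, chain, min_pos) for a non-empty factor f of w (1-based)."""
    n, m = len(w), len(f)
    half = n // 2  # assumption: n/2 rounded down
    occ = [s + 1 for s in range(n - m + 1) if w.startswith(f, s)]
    left = [s for s in occ if s + m <= half]            # inside the first half
    chain = [s for s in occ if s <= half <= s + m - 1]  # contain position n/2
    right = [s for s in occ if s > half]                # inside the second half
    return (max(left) if left else None), chain, (min(right) if right else None)


def formula_1(f_len: int, max_pos, chain, min_pos) -> Fraction:
    """Order of repetition based on max, chain and min, following formula (1)."""
    gaps = []  # distances between neighbouring recorded occurrences
    if chain:
        first, last = chain[0], chain[-1]
        if len(chain) > 1:
            gaps.append(chain[1] - chain[0])  # p(v)
        if max_pos is not None:
            gaps.append(first - max_pos)
        if min_pos is not None:
            gaps.append(min_pos - last)
    elif max_pos is not None and min_pos is not None:
        gaps.append(min_pos - max_pos)
    # The maximum of |f| / gap is attained at the smallest gap.
    return 1 + Fraction(f_len, min(gaps)) if gaps else Fraction(1)
```

For $w = abcab$ and $f = ab$ this gives $max$ undefined, $chain = \{1\}$ and $min = 4$, so formula (1) yields $1 + 2/3 = 5/3$.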
Algorithm Order_of_Repetition(w[1..n]);
  if n = 1 then return 1;
  h1 := Order_of_Repetition(w[1..n/2]);
  h2 := Order_of_Repetition(w[n/2..n]);
  build a suffix tree T(w) of w;
  for each vertex v of T(w) bottom up compute (max(v), min(v), chain(v));
  using formula (1) compute the order of repetition of w on the basis of h1, h2 and the information in T(w);
end algorithm
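The recursive structure of the algorithm can be sketched as follows. This is not the $O(n \log n)$ algorithm: the suffix tree step is replaced here by a brute-force scan over pairs of positions whose common factor crosses the middle of the word, so only the divide-and-conquer skeleton is illustrated. As a further assumption, the two halves are taken to be disjoint, and indices are 0-based.

```python
from fractions import Fraction


def lcp_len(w: str, a: int, b: int) -> int:
    """Length of the longest common prefix of w[a:] and w[b:], for a < b."""
    k = 0
    while b + k < len(w) and w[a + k] == w[b + k]:
        k += 1
    return k


def order_of_repetition_dc(w: str) -> Fraction:
    """Divide-and-conquer skeleton with a brute-force combine step."""
    n = len(w)
    if n <= 1:
        return Fraction(1)
    mid = n // 2
    h1 = order_of_repetition_dc(w[:mid])  # repetitions inside the first half
    h2 = order_of_repetition_dc(w[mid:])  # repetitions inside the second half
    best = max(h1, h2)
    # Stand-in for the suffix tree pass: repetitions w[j : i + k] that start in
    # the first half and reach into the second half.
    for j in range(mid):
        for i in range(j + 1, n):
            k = lcp_len(w, j, i)
            if i + k > mid:
                best = max(best, 1 + Fraction(k, i - j))
    return best
```

A repetition that lies entirely in one half is found by the corresponding recursive call, so the combine step only needs to look at repetitions that straddle the middle, which is the role played by $max$, $min$ and $chain$ in the algorithm above.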
Theorem 2.5 The algorithm Order_of_Repetition(w) computes the order of repetition in $w$ in time $O(|w| \log |w|)$.
Proof: Let $\alpha$ be the order of repetition of a string $w$. By Lemma 2.1 we can find a special factor $f$ of $w$ and its occurrences $i > j$ in $w$ such that $\alpha = 1 + \frac{|f|}{i - j}$. Let $x$ be the vertex of $T(w)$ with $f = f(x)$, and consider the word $v = w[j..i + |f| - 1]$. Either $v$ is completely contained in one of the halves of $w$ or it contains the position $\frac{n}{2}$. In the first case $\alpha = h_1$ or $\alpha = h_2$. In the latter case we distinguish three cases:

Case $j \le \frac{n}{2} < i$. Then, by Lemma 2.1, $i = min(x)$ and $j = last(x)$.
Case $j < i \le \frac{n}{2}$ and $j + |f| > \frac{n}{2}$. Then $i - j = p(x)$.
Case $j < i \le \frac{n}{2}$ and $j + |f| \le \frac{n}{2}$. Then $j = max(x)$ and $i = first(x)$.
This completes the proof. □

The algorithm is easy to parallelize. The recurrence of this algorithm can be resolved, giving $\log n$ levels. At the $i$-th level the word $w$ is divided into $\frac{n}{2^i}$ blocks of length $2^i$. To each block we assign $2^i$ processors, completing the job of the $i$-th level in time $O(i)$. In each level $\frac{n}{2^i}$ suffix trees have to be built, each of them for a string of size $2^i$. For that the algorithm of [4] can be used. The values of $max$, $min$ and $chain$ can be easily computed in $O(i)$ time using a tree-contraction technique [1, 13].

Theorem 2.6 The algorithm Order_of_Repetition(w) can be implemented to work in $O(\log |w|)$ time on $O(|w|)$ processors. □
3 Detection of squares and maximal repetitions
Clearly a factor $w[i..j]$ of $w$ is a square if it is of the form $uu$ for some word $u$. The square is primitive if $u$ is primitive. We use the characterization of squares which is given by the following equivalence:
$$w[i..i + 2p - 1] \text{ is a square} \iff |gcf(i, i + p)| \ge p.$$
The word $w[i..i + 2p - 1]$ is a primitive square if, additionally, there is no occurrence of $gcf(i, i + p)$ between positions $i$ and $i + p$. The square is, therefore, uniquely determined by a word $v$ which occurs at positions $i < j$ in $w$ and such that $|v| \ge j - i$ and $v = gcf(i, j)$. Then $v$ is a special factor of $w$. The squares of $w$ can be divided into three classes: those for which the occurrences of $v$ are completely contained in one of the halves of $w$, those for which one of the occurrences of $v$ contains the position $\frac{n}{2}$, and those for which both of them contain this position. Since $v$ is a special factor of $w$ we have $v = f(x)$ for a vertex $x$ of $T(w)$. Then, if both occurrences of $v$ contain the position $\frac{n}{2}$, then $i, j \in chain(x)$. Hence, if the square is primitive we have $j - i = p(x)$, and since $f(x) = gcf(i, j)$ we have $j = last(x)$. Therefore we report $w[last(x) - p(x)..last(x) + p(x) - 1]$ as a primitive square if the continuation of $f(x)$ at positions $last(x) - p(x)$ and $last(x)$ is different. Similarly, if only one of the occurrences of $v$ contains the position $\frac{n}{2}$, then $i = max(x)$ and $j = first(x)$, or $i = last(x)$ and $j = min(x)$. Again we report a primitive square $w[i..j + j - i - 1]$ if the continuation of $f(x)$ is different at positions $i$ and $j$. These remarks lead to an $O(n \log n)$ time sequential algorithm and an $O(\log n)$ time $n$-processor parallel algorithm whose general structure is the same as that of the algorithm from the previous section.

We say that a triple $(i, k, p)$ is a maximal repetition in a word $w$ if the following conditions are satisfied:
$w[i..k] = u^t u'$ where $t \ge 2$, $u'$ is a proper prefix of $u$ and $p = |u|$;
$u' w[k + 1]$ is not a prefix of $u$ or $k = n$;
$w[i - 1]$ is not the last letter of $u$ or $i = 1$.

The detection of all maximal repetitions is very similar to the detection of all squares. This is due to the following property:
$$(i, k, p) \text{ is a maximal repetition} \iff w[i..i + 2p - 1] \text{ is a primitive square and } w[i - 1] \ne w[i + p - 1] \text{ and } k = i + p + |gcf(i, i + p)| - 1.$$
The maximal repetition is uniquely determined by a special factor $v = gcf(i, j)$ and two positions of its occurrences $i$ and $j$. Then $p = j - i$ and $k = j + |gcf(i, j)| - 1$. The algorithm for computing all maximal repetitions is the same as the one for detecting primitive squares with one change: in the place where we reported the square $w[i..i + 2p - 1]$ we report a maximal repetition $(i, i + p - 1 + |gcf(i, i + p)|, p)$ only if $w[i - 1] \ne w[i + p - 1]$.
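The two characterizations of this section can be turned into a short brute-force check. The sketch below is quadratic in the number of candidate pairs $(i, p)$ and does not use suffix trees; it tests primitivity of $u$ directly with the standard doubling test rather than through occurrences of $gcf(i, i + p)$, and it treats $i = 1$ as satisfying the left-maximality condition. Positions are 1-based and the function names are illustrative.

```python
def gcf_len(w: str, i: int, j: int) -> int:
    """Length of gcf(i, j) for 1-based positions i, j of w."""
    a, b = i - 1, j - 1
    k = 0
    while max(a, b) + k < len(w) and w[a + k] == w[b + k]:
        k += 1
    return k


def is_primitive(u: str) -> bool:
    """u is primitive iff it first reoccurs in uu at offset |u|."""
    return (u + u).find(u, 1) == len(u)


def primitive_squares(w: str) -> list:
    """Pairs (i, p) such that w[i..i+2p-1] is a primitive square."""
    n = len(w)
    return [
        (i, p)
        for i in range(1, n + 1)
        for p in range(1, (n - i + 1) // 2 + 1)
        if gcf_len(w, i, i + p) >= p and is_primitive(w[i - 1:i - 1 + p])
    ]


def maximal_repetitions(w: str) -> list:
    """Triples (i, k, p) obtained from primitive squares that cannot be shifted left."""
    reps = []
    for i, p in primitive_squares(w):
        if i == 1 or w[i - 2] != w[i + p - 2]:  # w[i-1] != w[i+p-1] in 1-based terms
            reps.append((i, i + p - 1 + gcf_len(w, i, i + p), p))
    return reps
```

For $w = ababa$ the primitive squares start at positions 1 and 2 with $p = 2$, and only the first one is reported as the maximal repetition $(1, 5, 2)$, i.e. the whole word $(ab)^2 a$.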
References
[1] Abrahamson K., Dadoun N., Kirkpatrick D., Przytycka T., A simple tree-contraction algorithm, J. Algorithms 10, 287-302, 1989.
[2] Apostolico A., Optimal parallel detection of squares in strings, Algorithmica 8(4), 285-319, 1992.
[3] Apostolico A., Breslauer D., An optimal O(log log n)-time parallel algorithm for detecting all squares in a string, SIAM J. Comput. 25(6), 1318-1331, 1996.
[4] Apostolico A., Iliopoulos C., Landau G.M., Schieber B., Vishkin U., Parallel construction of a suffix tree with applications, Algorithmica 3, 347-365, 1988.
[5] Apostolico A., Preparata F., Optimal off-line detection of repetitions in a string, Theor. Comput. Sci. 22, 297-315, 1983.
[6] Cassaigne J., Motifs évitables et régularités dans les mots, Thèse de Doctorat, Université Paris 6, 1994. Rapport recherche LITP TH 9404, Institut Blaise Pascal, Paris.
[7] Choffrut C., Karhumäki J., Combinatorics of words. In G. Rozenberg and A. Salomaa (eds), Handbook of Formal Languages, vol. 1-3, Springer, 1997.
[8] Crochemore M., An optimal algorithm for computing the repetitions in a word, Inform. Proc. Letters 12(5), 244-250, 1981.
[9] Crochemore M., Transducers and repetitions, Theoret. Comput. Sci. 45(1), 63-86, 1986.
[10] Crochemore M., Rytter W., Text Algorithms, Oxford University Press, 1994.
[11] Crochemore M., Rytter W., Efficient parallel algorithm to test squarefreeness and factorize strings, Inform. Proc. Letters 38(2), 57-60, 1991.
[12] Ehrenfeucht A., Rozenberg G., Repetition of subwords in D0L languages, Information and Control 59(1-3), 13-35, 1983.
[13] Gibbons A., Rytter W., An optimal parallel algorithm for dynamic expression evaluation and its applications, Information and Computation 81, 32-45, 1989.
[14] Kobayashi Y., Otto F., Repetitiveness of D0L-languages is decidable in polynomial time, in MFCS'97, LNCS 1295, 337-346, 1997.
[15] Kosaraju R., Computation of squares in a string, in CPM'94, LNCS 807, 146-150, 1994.
[16] Kościelski A., Pacholski L., Complexity of Makanin's algorithm, J. ACM 43(4), 670-684, 1996.
[17] Lothaire M., Combinatorics on words, Addison-Wesley, 1983.
[18] Lorentz R., Main M., Linear time recognition of squarefree strings. In A. Apostolico and Z. Galil (eds), Combinatorial Algorithms on Words, 271-278, Springer-Verlag, 1985.
[19] Main M., Detecting leftmost maximal periodicities, Discrete Applied Mathematics 25(1), 145-154, 1989.
[20] Main M., Lorentz R., An O(n log n) algorithm for finding all repetitions in a string, J. Algorithms 5(3), 422-432, 1984.
[21] McCreight E. M., A space-economical suffix tree construction algorithm, J. ACM 23, 262-272, 1976.
[22] Rabin M. O., Discovering repetitions in strings. In A. Apostolico and Z. Galil (eds), Combinatorial Algorithms on Words, 279-288, Springer-Verlag, 1985.
[23] Thue A., Über die gegenseitige Lage gleicher Teile gewisser Zeichenreihen, Norske Videnskabers Selskabe Skrifter Mat-Nat. Kl. (Kristiania), 1, 1-67, 1912.
[24] Thue A., Über unendliche Zeichenreihen, Norske Videnskabers Selskabe Skrifter Mat-Nat. Kl. (Kristiania), 7, 1-22, 1906.
[25] Weiner P., Linear pattern-matching algorithm, in Proc. of the 18th ACM Symp. on Theory of Comput., 220-230, 1986.
Turku Centre for Computer Science Lemminkaisenkatu 14 FIN-20520 Turku Finland http://www.tucs.abo.
University of Turku Department of Mathematical Sciences
Abo Akademi University Department of Computer Science Institute for Advanced Management Systems Research
Turku School of Economics and Business Administration Institute of Information Systems Science