Pattern Matching in DCA Coded Text

Jan Lahoda1,2, Bořivoj Melichar1, and Jan Žďárek1

1 Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, Karlovo náměstí 13, 121 35 Praha 2, Czech Republic
{melichar,zdarekj}@fel.cvut.cz
2 Sun Microsystems Czech, V Parku 2308/8, 148 00 Praha 4, Czech Republic
[email protected]
Abstract. A new algorithm searching for all occurrences of a regular expression pattern in a text is presented. It uses only the text that has been compressed by text compression using antidictionaries, without decompressing it. The proposed algorithm runs in O(2^m · ||AD||^2 + nC + r) worst-case time, where m is the length of the pattern, AD is the antidictionary, nC is the length of the coded text, and r is the number of found matches.
1 Introduction
We present a new algorithm for searching strings from a set of strings described by a regular expression in a text coded by the Data Compression with Antidictionaries (DCA) compression method (text compression using antidictionaries [1]). The proposed algorithm and its variants run in linear time with respect to the length of the compressed text, not counting the preprocessing costs.

The paper is organized as follows. After recalling several basic notions at the beginning of Section 2, we continue with a short overview of the DCA compression method itself in Section 2.1 and of the KMP-based searching in DCA compressed text in Section 2.3. Section 3 discusses the proposed basic algorithm (3.1) and its enhancements (using almost antiwords in 3.2 and incremental construction in 3.3). We conclude in Section 5. The experimental evaluation of our algorithms is described in the appendix.
2 Basic Notions and Previous Work
Let A be a finite alphabet and its elements be called symbols. The set of all strings over A is denoted by A∗, and A^ℓ is the set of all strings of length ℓ. The empty string is denoted by ε. The power set of a set S is denoted by P(S). A language L is any subset of A∗, L ⊆ A∗. Let P ∈ A^m and T ∈ A^n be a pattern and a text, respectively, m ≤ n. An exact occurrence of P in T is an index i such that P[1, . . . , m] = T[i − m + 1, . . . , i], i ≤ n. A dictionary (antidictionary) AD is a set of words over A, AD = {w1, w2, . . . , w|AD|}, where |AD| denotes the number of strings in AD. By ||AD|| we denote the sum of the lengths of all words in AD.

This research has been partially supported by the Czech Science Foundation as project No. 201/06/1039 and by the MSMT research program MSM6840770014.

O.H. Ibarra and B. Ravikumar (Eds.): CIAA 2008, LNCS 5148, pp. 151–160, 2008. © Springer-Verlag Berlin Heidelberg 2008

A finite automaton (FA) is a quintuple (Q, A, δ, I, F): Q is a finite set of states, A is a finite input alphabet, and F ⊆ Q is a set of final states. If the FA is nondeterministic (NFA), then δ is a mapping Q × (A ∪ {ε}) → P(Q) and I ⊆ Q is a set of initial states. A deterministic FA (DFA) is (Q, A, δ, q0, F), where δ is a (partial) function Q × A → Q and q0 ∈ Q is the only initial state. A finite transducer (FT) is (Q, A, Γ, δ, q0, F), where δ is a mapping Q × (A ∪ {ε}) → Q × (Γ ∪ {ε}). A regular expression (RE) over a finite alphabet A is defined as follows: ∅, ε, and a are REs, ∀a ∈ A. Let x, y be REs; then x + y, x · y, x∗, and (x) are REs. The priority of the operators is: + (the lowest), ·, ∗ (the highest). The priority of evaluation of an RE can be modified using parentheses. The length of a regular expression is defined as the count of all symbols in the regular expression except for parentheses and the concatenation operator (·) [2].

2.1 DCA Compression Method
The DCA compression method has been proposed by Crochemore et al. [1] in 1999. The antidictionary is a dictionary of antiwords, words that do not appear in the text to be coded. Let T ∈ {0, 1}∗ be the text to be coded. The text is read from left to right. When a symbol (a bit) is read from the input text and the suffix of the text read so far is the longest proper prefix of an antiword, nothing is put to the output; the symbol is predictable, since outputting the opposite symbol would create an antiword. Otherwise, the current symbol is output. The text can be decoded since each missing symbol can be inferred from the antiwords. The coding process is based on the finite transducer E(AD) = (QE, {0, 1}, {0, 1}, δE, qE0, ∅). The encoding algorithm is shown in Algorithm 1. An example of the encoding finite transducer is given in Figure 1 a). The decoding process is also based on a finite transducer, B(AD) = (QB, {0, 1}, {0, 1}, δB, qB0, ∅). The decoding transducer is created from the encoding transducer by swapping input and output labels on all transitions. Note that additional information about the text length is required to decode the original text properly. The decoding algorithm is shown in Algorithm 2. An example of the decoding finite transducer is given in Figure 1 b). For the construction of the encoding transducer itself, please refer to [1].
Algorithm 1. Text compression using the DCA compression method
Input: Encoding transducer E(AD) = (QE, {0, 1}, {0, 1}, δE, qE0, ∅).
q = qE0
while not the end of the input do
  let a be the next input symbol
  (q′, u) = δE(q, a)
  if u ≠ ε then
    print a to the output
  end if
  q = q′
end while
Algorithm 2. Text decompression using the DCA compression method
Input: Decoding transducer B(AD) = (QB, {0, 1}, {0, 1}, δB, qB0, ∅), the length of the original text n.
q = qB0
i = 0
while not the end of the input do
  let a be the next input symbol
  (q′, u) = δB(q, a)
  print u to the output, q = q′, i = i + 1
  while i ≤ n and δB(q, ε) is defined do
    (q′, u) = δB(q, ε)
    print u to the output, q = q′, i = i + 1
  end while
end while
Fig. 1. a) Encoding and b) decoding transducer for AD = {110}
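The coding rule above can also be sketched directly in code, bypassing the transducer construction: a bit is omitted exactly when the opposite bit would complete an antiword, so the decoder can infer it. The following is our own minimal sketch (the function names and the sliding-context representation are illustrative, not from the paper):

```python
def dca_encode(text, ad):
    """DCA-style encoding over {0,1}: omit every bit that is forced,
    i.e. whose opposite would create an antiword as a suffix."""
    k = max(len(w) for w in ad)   # context window length
    out, ctx = [], ""
    for b in text:
        other = "1" if b == "0" else "0"
        # b is predictable iff writing `other` would complete an antiword
        if not any((ctx + other).endswith(w) for w in ad):
            out.append(b)
        ctx = (ctx + b)[-k:]
    return "".join(out)

def dca_decode(code, ad, n):
    """Decode back to the original text of known length n."""
    k = max(len(w) for w in ad)
    out, ctx, i = [], "", 0
    while len(out) < n:
        forced = None
        for b in "01":
            if any((ctx + b).endswith(w) for w in ad):
                forced = "1" if b == "0" else "0"  # b is forbidden
        if forced is None:
            forced = code[i]
            i += 1
        out.append(forced)
        ctx = (ctx + forced)[-k:]
    return "".join(out)
```

For AD = {110}, the text 0111 is coded as 011: the final 1 is forced, because outputting 0 after 011 would create the antiword 110.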
2.2 Regular Expression Pattern Matching
Regular expressions are commonly used for specifying full-text searches. The search can be implemented by means of finite automata. Generally, both nondeterministic (using NFA simulation [3]) and deterministic finite automata can be used for pattern matching. In this paper, however, we will focus only on pattern matching using deterministic finite automata. Pattern matching using finite automata is a two-phase process. In the first (preprocessing) phase, the searching deterministic finite automaton is constructed for the given pattern (e.g. a regular expression). In the second phase, the input text is processed by the automaton, and each time it enters a final state, an occurrence of the pattern is reported. This algorithm is outlined in Algorithm 3. For a regular expression of length m, the corresponding nondeterministic finite automaton has m + 1 states. The corresponding deterministic finite automaton has therefore at most O(2^m) states. Please note that although the exponential growth of the number of states occurs for certain regular expressions (e.g. a(a + b)^(m−1)), the number of states of the deterministic finite automaton is much smaller in many practical situations, e.g. O(m) for exact pattern matching [3].
Algorithm 3. Pattern matching using deterministic finite automaton
Input: Deterministic finite automaton M = (Q, A, δ, q0, F).
q = q0
i = 1
while not the end of the input do
  if q ∈ F then
    mark occurrence of the pattern at index i
  end if
  let a be the next input symbol
  q = δ(q, a)
  i = i + 1
end while
if q ∈ F then
  mark occurrence of the pattern at index i
end if
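To illustrate the two-phase scheme for the simplest case, exact pattern matching, the following sketch builds the standard KMP-style DFA with m + 1 states and runs it over the text, reporting 1-based end positions of matches. This is our own illustration of the general technique, not code from the paper:

```python
def match_dfa(pattern, text):
    """Build the exact-matching DFA for `pattern` (KMP-style) and
    return 1-based end positions of its occurrences in `text`."""
    m = len(pattern)
    alphabet = set(pattern) | set(text)
    # delta[state][symbol] -> state; states 0..m, state m is final
    delta = [dict() for _ in range(m + 1)]
    delta[0] = {c: 0 for c in alphabet}
    delta[0][pattern[0]] = 1
    x = 0  # restart state: mirrors the longest proper border read so far
    for j in range(1, m + 1):
        for c in alphabet:
            delta[j][c] = delta[x][c]  # on mismatch, behave like state x
        if j < m:
            delta[j][pattern[j]] = j + 1
            x = delta[x][pattern[j]]
    # second phase: run the DFA, report each entry into the final state
    occurrences, q = [], 0
    for i, c in enumerate(text, 1):
        q = delta[q][c]
        if q == m:
            occurrences.append(i)
    return occurrences
```

Note that overlapping occurrences are reported too, since the final state keeps the restart transitions of its longest border.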
2.3 KMP-Based Pattern Matching in DCA Coded Text
Shibata et al. [4] presented a KMP-based approach for pattern matching in DCA compressed text. As a part of this method, the decoding transducer with ε-transitions is converted to a generalized transducer G(AD) = (QG, {0, 1}, {0, 1}, δG, qG0, ∅) without ε-transitions. The main concept of the conversion is to concatenate sequences of transitions, consisting of a non-ε-transition and at least one ε-transition, into a single transition, see Figure 2.
Fig. 2. Generalized decoding transducer construction: a) original decoding transducer, b) generalized decoding transducer
The original decoding transducer may contain infinite sequences of ε-transitions, represented by loops of ε-transitions in the decoding transducer, as shown in Figure 3. Note that an infinite ε-transition sequence can occur only while the very last character of the coded text is being processed. As the uncoded text is of finite length, the sequence of ε-transitions is in fact not infinite; it is called semi-infinite [4]. The infinite ε-transition sequence is handled as follows. A special state ⊥ ∈ QG is defined, and transition sequences leading into a loop of ε-transitions in the original transducer are redirected into this state. The text decoded by such an infinite transition sequence is always of the form uv∗ (where u, v ∈ {0, 1}∗, u is the prefix decoded before entering the infinite loop, and v is the text decoded by the infinite loop), as shown in Figure 3.

Fig. 3. Infinite ε-transitions sequence
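The ε-elimination can be sketched as follows, under our own assumed data layout: the decoding transducer is a nested dictionary mapping a state to {"0"/"1"/"eps": (next state, output)}, and a transition chain that enters an ε-loop is mapped to ⊥ (None here). Splitting the accumulated output into the uv∗ prefix and cycle is omitted for brevity:

```python
BOTTOM = None  # the special state ⊥

def generalize(deltaB):
    """Concatenate each non-ε-transition with the chain of ε-transitions
    following it; redirect chains ending in an ε-loop to ⊥."""
    G = {}
    for q, trans in deltaB.items():
        G[q] = {}
        for a in "01":
            if a not in trans:
                continue
            p, u = trans[a]
            seen = []
            while "eps" in deltaB[p]:
                if p in seen:
                    # ε-loop reached: the decoded text is semi-infinite,
                    # of the form u v* (split into u and v omitted here)
                    G[q][a] = (BOTTOM, u)
                    break
                seen.append(p)
                p, b = deltaB[p]["eps"]
                u += b
            else:
                G[q][a] = (p, u)
    return G
```

For the antidictionary {11} (our example), the decoder's transition 1/1 followed by ε/0 collapses into a single transition 1/10, while for {110} the transition into the state reached after reading 11 enters an ε/1 loop and is redirected to ⊥.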
3 Main Result
In this section we propose the algorithm for pattern matching in DCA coded text, together with two extensions. The first extension allows searching text coded by the "almost antiwords" extension to the DCA algorithm (Crochemore and Navarro [5]). The second extension dissolves the preprocessing cost into the searching phase, which may lead to faster searches; this expectation has been verified successfully by our experiments, see Section 4. The proposed algorithm is based on the algorithm by Shibata et al. [4] and on our method described in [6] and [7].

3.1 Basic Algorithm
Pattern matching in DCA coded text is based on finite automata. The automaton MC = (QC, {0, 1}, δC, qC0, ∅) for pattern matching in DCA coded text is constructed from the decoding transducer and the deterministic pattern matching automaton M = (Q, {0, 1}, δ, q0, F) for the given pattern P. Automaton MC is constructed by "replacing" the states of the pattern matching automaton M with copies of the generalized decoding transducer G(AD) = (QG, {0, 1}, {0, 1}, δG, qG0, ∅). The states of automaton MC are therefore pairs [q, qG], q ∈ Q, qG ∈ QG. The transitions of automaton MC are then defined as: δC([q, qG], a) = [δ∗(q, outputG(qG, a)), δG(qG, a)]. While reading one symbol of the coded text, more than one transition of the original pattern matching automaton M may be performed. Consequently, performing one transition in the automaton MC may lead to more than one found match. Ordinary final states are not enough to describe this behavior, so we propose two auxiliary functions, N and I. Function N maps each transition
of MC to the set of all matches found by this transition, N : QC × {0, 1} → P(N). Function I maps each transition to the number of symbols that would be decoded by the equivalent transition in the original decoding transducer, I : QC × {0, 1} → N. These two functions allow reporting exact match occurrences at exact positions. The construction of the automaton for pattern matching in DCA coded text and of the functions N and I is described in Algorithm 4. An example is given in the appendix.
Algorithm 4. Construction of automaton for pattern matching in DCA coded text
Input: Pattern matching automaton M = (Q, {0, 1}, δ, q0, F), decoding transducer G(AD) = (QG, {0, 1}, {0, 1}, δG, qG0, ∅).
Output: Automaton MC = (QC, {0, 1}, δC, qC0, ∅) for pattern matching in DCA coded text, auxiliary functions N, I.
QC = Q × QG
qC0 = (q0, qG0)
for all qC ∈ QC (qC = (q, qG)) and a ∈ {0, 1} do
  (q′G, u) = δG(qG, a) (where u is the output text)
  if q′G ≠ ⊥ then
    δC(qC, a) = [δ∗(q, u), q′G]
    N(qC, a) = {i; δ∗(q, u[1 : i + 1]) ∈ F}
    I(qC, a) = |u|
  else
    δC(qC, a) = ⊥
  end if
end for
Theorem 1. For a pattern P of length m and an antidictionary AD of length ||AD||, the worst-case time complexity of Algorithm 4 is O(2^m · ||AD||^2), and it uses O(2^m · ||AD||^2) memory in the worst case.

Proof. The main loop of the algorithm is performed |Q| · |QG| times. The maximal number of states of Q is 2^m, where m is the length of the pattern (regular expression). The maximal number of states of QG is the size of the antidictionary, i.e. ||AD||. In each pass through the main loop, the algorithm either enters the semi-infinite loop, in which case one pass consumes O(1) time, or it uses a finite string u, in which case the pass uses O(|u|) time. As the maximal length of u is ||AD||, the maximal time spent in one pass through the main loop is O(||AD||). The total worst-case time of the algorithm is therefore O(2^m · ||AD||^2).

The algorithm for pattern matching in DCA coded text using the automaton MC is described in Algorithm 5. For the following theorem, let us assume the semi-infinite string at the end of the text is shorter than the coded text.
Algorithm 5. Pattern matching in DCA coded text
Input: Automaton MC = (QC, {0, 1}, δC, qC0, ∅) for pattern matching in DCA coded text, auxiliary functions N, I, the length of the original text |T|.
Output: List of all occurrences of the given pattern in the given text.
q = qC0
i = 1
while not the end of the input do
  a = next symbol from the input
  if δC(q, a) = ⊥ then
    process the remaining text using transducer B(AD) and pattern matching automaton M
  else
    for all n ∈ N(q, a), n + i ≤ |T| do
      report found match at index i + n in the original (uncoded) text
    end for
    i = i + I(q, a)
    q = δC(q, a)
  end if
end while
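Algorithms 4 and 5 can be sketched together in a few lines, assuming dictionary representations of both automata: the matching DFA as delta[q][c], and the generalized decoding transducer as deltaG[qG][a] = (next state, decoded output), with None standing for ⊥. All names and the 1-based end-position bookkeeping below are our own simplification of the paper's index arithmetic:

```python
def build_mc(states, delta, F, statesG, deltaG):
    """Algorithm 4 sketch: the product automaton M_C plus the auxiliary
    functions N (match offsets per transition) and I (decoded lengths)."""
    deltaC, N, I = {}, {}, {}
    for q in states:
        for qG in statesG:
            for a in "01":
                qG2, u = deltaG[qG][a]
                if qG2 is None:            # transition into ⊥
                    deltaC[(q, qG), a] = None
                    continue
                p, hits = q, []
                for i, c in enumerate(u, 1):
                    p = delta[p][c]        # run decoded output through M
                    if p in F:
                        hits.append(i)     # match ends i symbols into u
                deltaC[(q, qG), a] = (p, qG2)
                N[(q, qG), a] = hits
                I[(q, qG), a] = len(u)
    return deltaC, N, I

def search_dca(deltaC, N, I, qC0, code):
    """Algorithm 5 sketch: report 1-based match end positions in the
    original text while scanning only the coded text."""
    occ, q, i = [], qC0, 0     # i = original symbols decoded so far
    for a in code:
        nxt = deltaC[q, a]
        if nxt is None:
            break              # ⊥: fall back to decode-and-search (omitted)
        occ.extend(i + k for k in N[q, a])
        i += I[q, a]
        q = nxt
    return occ
```

For the antidictionary {11} (generalized decoder: a single state g with transitions 0/0 and 1/10) and the pattern 10, the text 0100 is coded as 010, and the single occurrence ending at position 3 is reported without decompression.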
Theorem 2. For a DCA coded text TC of length nC, Algorithm 5 runs in O(nC + r) worst-case time, where r is the number of found matches of the given pattern in the original text.

Proof. The main loop of the algorithm is performed nC times. The reporting of matches by the inner for-cycle is performed at most r times for the whole input text. The semi-infinite string handling takes (according to the assumption) O(nC) time. The rest of the inner loop runs in O(1). The total time complexity of this algorithm is therefore O(nC + r).

3.2 Almost Antiwords
Algorithm 5 can be extended to handle the extended compression scheme using almost antiwords [5]. Certain factors of the input text would improve the compression ratio significantly if they were considered antiwords. The almost antiwords extension to the DCA compression method uses these factors as antiwords; the exceptions are encoded into a separate list.

Theorem 3. Algorithm 6 runs in O(nC + r) worst-case time for a DCA coded text TC of length nC, where r denotes the number of found matches of the given pattern in the original text.

Proof. As in Crochemore and Navarro [5], let us assume the exceptions are rare and the semi-infinite string at the end of the text is shorter than the coded text. Then the proof is similar to the proof of Theorem 2, except for the exception handling. In line with the assumption, the number of exceptions in the coded text is small, hence the impact on the overall time complexity is negligible.
Algorithm 6. Pattern matching in text coded by DCA with almost antiwords
Input: Automaton MC = (QC, {0, 1}, δC, qC0, ∅) for pattern matching in DCA coded text, auxiliary functions N, I, and the sorted list of exceptions.
Output: List of all occurrences of the given pattern in the given text.
q = qC0
i = 1
while not the end of the input do
  a = next symbol from the input
  if δC(q, a) = ⊥ then
    process the remaining text using transducer B(AD) and pattern matching automaton M
  else
    i′ = i + I(q, a)
    if there is an exception in between i and i′ then
      use original decoding automaton and pattern matching automaton
    else
      for all n ∈ N(q, a), n + i ≤ |T| do
        report found match at index i + n in the original (uncoded) text
      end for
      i = i′
      q = δC(q, a)
    end if
  end if
end while
3.3 An Incremental Algorithm
In the previous algorithms, the automaton MC is completely constructed during the preprocessing phase, although parts of the automaton MC may never be used by the pattern matching algorithm. As a possible solution, we propose to construct the automaton MC "on the fly" during the pattern matching phase. The incremental algorithm embeds the preprocessing phase (described in Algorithm 4) into the searching phase (described in Algorithm 6), constructing on the fly only the needed parts of the automaton. This algorithm will construct the whole automaton MC in the worst case, so the worst-case time complexity characteristics are the same as for the previous Algorithms 5 and 6. Depending on the pattern and the coded text, parts of the automaton MC may not be created, leading to improved performance; the time complexity of this algorithm may be as low as Θ(nC + r).

An important decision is how to address states in the set QC using its components (states from Q and QG). In Algorithm 4, addressing a state is O(1), as it is simple addressing in a two-dimensional array. During the incremental construction, creating the two-dimensional array has to be avoided. Instead, we propose to use a hash table with a compound key. First, each state from Q is assigned a unique integer number, in sequence; we do the same for the states in QG. For states q ∈ Q and qG ∈ QG with unique numbers i and iG (respectively), the key for
the hash table is determined as i · |QG| + iG. Given that the key itself is a reasonable hash function, addressing values in this hash table is O(1) on average.

Algorithm 7. An incremental pattern matching in text coded by DCA with almost antiwords
Input: Pattern matching automaton M = (Q, {0, 1}, δ, q0, F), decoding transducer G(AD) = (QG, {0, 1}, {0, 1}, δG, qG0, ∅).
Output: List of all occurrences of the given pattern in the given text.
create qC0 = (q0, qG0)
q = qC0
i = 1
while not the end of the input do
  a = next symbol from the input
  if δC(q, a) is not defined then
    let q = (qM, qG)
    (q′G, u) = δG(qG, a) (where u is the output text)
    if q′G ≠ ⊥ then
      δC(q, a) = [δ∗(qM, u), q′G]
      N(q, a) = {i; δ∗(qM, u[1 : i + 1]) ∈ F}
      I(q, a) = |u|
    else
      δC(q, a) = ⊥
    end if
  end if
  if δC(q, a) = ⊥ then
    process the remaining text using transducer B(AD) and pattern matching automaton M
  else
    i′ = i + I(q, a)
    if there is an exception in between i and i′ then
      use original decoding automaton and pattern matching automaton
    else
      for all n ∈ N(q, a), n + i ≤ |T| do
        report found match at index i + n in the original (uncoded) text
      end for
      i = i′
      q = δC(q, a)
    end if
  end if
end while
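The on-demand construction can be sketched as a memoized transition function whose cache key is the compound integer key i · |QG| + iG described above. The dictionary layout and all names are our own illustration, not the paper's implementation:

```python
def make_lazy_transition(delta, F, deltaG, num, numG, nG):
    """Return a function that computes transitions of M_C on demand and
    memoizes them under the compound key num[q] * |Q_G| + numG[qG]."""
    cache = {}
    def trans(q, qG, a):
        key = (num[q] * nG + numG[qG], a)
        if key not in cache:
            qG2, u = deltaG[qG][a]
            if qG2 is None:
                cache[key] = None          # transition into ⊥
            else:
                p, hits = q, []
                for i, c in enumerate(u, 1):
                    p = delta[p][c]        # as in Algorithm 4
                    if p in F:
                        hits.append(i)
                cache[key] = ((p, qG2), hits, len(u))
        return cache[key]
    return trans
```

Only the (state, symbol) pairs actually reached while scanning the coded text are ever computed, which is exactly what makes the incremental variant faster in practice.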
4 Experimental Results
We have implemented three algorithms: the basic algorithm described in Section 3.1, the incremental algorithm described in Section 3.3 (without the almost antiwords extension), and the "decompress and search" algorithm, in which the decoded symbols are passed from the decoder directly into the pattern matching automaton. We have then compared the performance of these three algorithms on the Canterbury corpus using different lengths of regular expressions. We performed our measurements on a PC with an Intel Core 2 Duo downscaled to 1 GHz and 2 GB of main memory. Unsurprisingly, the preprocessing costs of the basic algorithm were prohibitive. The incremental algorithm, however, greatly outperformed the decompress and search algorithm: its average running time over the entire corpus was between 50% and 52% of the running time of the "decompress and search" algorithm, depending on the length of the regular expression.
5 Conclusion
We introduced the first algorithm for searching strings from a set of strings described by a regular expression in a text coded by the DCA compression method. Besides the basic variant of this algorithm, two enhancements were proposed in this paper. The incremental pattern matching algorithm improves the performance of our algorithm in practice, and our implementation outperforms the decompress-and-search algorithm significantly. The asymptotic time complexity of our algorithm and its variants is linear with respect to the length of the compressed text, not counting the preprocessing costs.
References

1. Crochemore, M., Mignosi, F., Restivo, A., Salemi, S.: Text compression using antidictionaries. In: Wiedermann, J., van Emde Boas, P., Nielsen, M. (eds.) ICALP 1999. LNCS, vol. 1644, pp. 261–270. Springer, Heidelberg (1999)
2. Crochemore, M., Hancart, C.: Automata for matching patterns. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, pp. 399–462. Springer, Berlin (1997)
3. Holub, J.: Simulation of Nondeterministic Finite Automata in Pattern Matching. PhD thesis, Faculty of Electrical Engineering, Czech Technical University, Prague, Czech Republic (2000)
4. Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Pattern matching in text compressed by using antidictionaries. In: Crochemore, M., Paterson, M. (eds.) CPM 1999. LNCS, vol. 1645, pp. 37–49. Springer, Heidelberg (1999)
5. Crochemore, M., Navarro, G.: Improved antidictionary based compression. In: SCCC, pp. 7–13 (2002)
6. Lahoda, J., Melichar, B.: Pattern matching in Huffman coded text. In: Proceedings of the 6th IS 2003, Ljubljana, Slovenia, pp. 274–279. Institut "Jožef Stefan" (2003)
7. Lahoda, J., Melichar, B.: Pattern matching in text coded by finite translation automaton. In: Proceedings of the 7th IS 2004, Ljubljana, Slovenia, pp. 212–214. Institut "Jožef Stefan" (2004)