Journal of Discrete Algorithms 33 (2015) 58–70
Contents lists available at ScienceDirect
Journal of Discrete Algorithms www.elsevier.com/locate/jda
Extended to multi-tilde–bar regular expressions and efficient finite automata constructions
Faissal Ouardi a,∗, Jean-Marc Champarnaud b, Djelloul Ziadi b
a LRI, Department of Computer Science, Faculty of Science, Mohammed V University, Morocco
b LITIS, University of Rouen, 76821 Saint-Etienne-du-Rouvray, France
Article info
Article history: Received 14 June 2013; Received in revised form 20 November 2014; Accepted 2 March 2015; Available online 17 March 2015
Keywords: Regular expressions and languages; Finite automata; Computation complexity
Abstract

Several algorithms have been designed to convert a regular expression into an equivalent finite automaton. One of the most popular constructions, due to Glushkov and to McNaughton and Yamada, is based on the computation of the Null, First, Last and Follow sets (called Glushkov functions) associated with a linearized version of the expression. Recently Mignot considered a family of extended expressions called Extended to multi-tilde–bar Regular Expressions (EmtbREs) and showed that, under some restrictions, Glushkov functions can be defined for an EmtbRE. In this paper we present an algorithm which efficiently computes the Glushkov functions of an unrestricted EmtbRE. Our approach is based on a recursive definition of the language associated with an EmtbRE, which highlights the fact that the worst case time complexity of the conversion of an EmtbRE into an automaton is related to the worst case time complexity of the computation of the Null function. Finally we show how to extend the ZPC-structure to EmtbREs, which allows us to apply to this family of extended expressions the efficient constructions based on this structure (in particular the construction of the c-continuation automaton, the position automaton, the follow automaton and the equation automaton). © 2015 Elsevier B.V. All rights reserved.
∗ Corresponding author. E-mail address: [email protected] (F. Ouardi).
http://dx.doi.org/10.1016/j.jda.2015.03.001 1570-8667/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

According to Kleene's theorem [16], regular expressions and finite automata are two equivalent representations of regular languages. The conversion from one representation into the other has given rise to numerous research works. Concerning the conversion of a regular expression into a finite automaton we can cite the following references: [1–5,9–11,13,14,17,19], for which a common aim is to reduce the space and/or worst case time complexity of the result of the conversion. In this paper we are particularly interested in the implementation of conversion algorithms which are based on the notion of position, such as the first five in the above list. Following [13,17], these algorithms are based on the computation of the Null, First, Last and Follow sets (called Glushkov functions) associated with a linearized version of the expression.

Recently Mignot [18] considered a family of extended expressions called Extended to multi-tilde–bar Regular Expressions (EmtbREs) and showed that, under some restrictions, the Glushkov functions can be defined for an EmtbRE (see also [6,7]). This class of regular expressions is superpolynomially more succinct than standard expressions [6]. In this paper we present an algorithm which efficiently computes the Glushkov functions of an unrestricted EmtbRE. Our approach is based on a recursive definition of the language associated with an EmtbRE, which highlights the fact that the worst case time complexity of the
conversion of an EmtbRE into an automaton is related to the worst case time complexity of the computation of the Null function. Finally we show how to extend the ZPC-structure [19] to EmtbREs, which allows us to apply to this family of extended expressions the efficient constructions based on this structure (in particular the construction of the c-continuation automaton [10], the position automaton [19], the follow automaton [9] and the equation automaton [10,15]).

The structure of the paper is as follows. In Section 2, we recall some basic definitions concerning regular expressions and finite automata, and we recall the notion of multi-tilde–bar expression. New properties concerning the language of a multi-tilde–bar expression are stated in Section 3. In Section 4, we give the definition of the position automaton associated with a multi-tilde–bar expression. Section 5 is devoted to an efficient computation of the position automaton from an EmtbRE, through the extension of the ZPC-structure to EmtbREs. In Section 6, we recall the notion of c-continuation and extend it to the class of EmtbREs.

2. Preliminaries

2.1. Regular expressions and finite automata

Let A be a non-empty finite set of symbols, called an alphabet. The set of all the words over A is denoted by A∗. The empty word is denoted by ε. A language over A is a subset of A∗. Regular expressions over an alphabet A and the regular languages that they denote are inductively defined as follows:
• ∅ is a regular expression denoting the language L(∅) = ∅.
• x, for all x ∈ A ∪ {ε}, is a regular expression denoting the language L(x) = {x}.
• Let F (resp. G) be a regular expression denoting the language L(F) (resp. L(G)); then:
  – (F + G) is a regular expression denoting the language L(F + G) = L(F) ∪ L(G).
  – (F · G) is a regular expression denoting the language L(F · G) = L(F) · L(G).
  – (F∗) is a regular expression denoting the language L(F∗) = (L(F))∗.

The following identities are classically used:

∅ + E = E = E + ∅,   ε · E = E = E · ε,   ∅ · E = ∅ = E · ∅.
Let E be a regular expression. Its linearized form is obtained by ranking every symbol occurrence with a subindex denoting its position in E. We say that a regular expression is in linear form if each symbol of the expression occurs only once. Subscripted symbols are called positions and the set of positions is denoted by Pos(E). We denote by h the mapping that sends each position in Pos(E) to the symbol of A that appears at this position in E. The size of E, denoted by |E|, is the number of nodes of its syntactical tree. The alphabetic width of E, denoted by ‖E‖, is the number of occurrences of symbols in the expression.

Definition 1. Let E be a regular expression denoting the language L. The set Null(E) is defined by:

$$\mathrm{Null}(E) = \begin{cases} \{\varepsilon\} & \text{if } \varepsilon \in L, \\ \emptyset & \text{otherwise.} \end{cases}$$
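Definition 1 can be made concrete with a small sketch. The tuple encoding of expressions below is a hypothetical choice made for illustration (it is not taken from the paper); the Boolean result stands for Null(E) = {ε} (True) or Null(E) = ∅ (False), using the classical inductive rules for +, · and ∗.

```python
# Hypothetical tuple encoding (an assumption, not the paper's notation):
#   ("empty",), ("eps",), ("sym", "a"), ("+", F, G), (".", F, G), ("*", F)

def null(expr):
    """Return True when the empty word belongs to L(expr), i.e. Null(expr) = {eps}."""
    tag = expr[0]
    if tag in ("empty", "sym"):
        return False
    if tag in ("eps", "*"):
        return True
    if tag == "+":
        # A union is nullable when at least one operand is nullable.
        return null(expr[1]) or null(expr[2])
    if tag == ".":
        # A concatenation is nullable when both operands are nullable.
        return null(expr[1]) and null(expr[2])
    raise ValueError(f"unknown node: {tag!r}")


a_or_eps = ("+", ("sym", "a"), ("eps",))
b_star = ("*", ("sym", "b"))
nullable_product = (".", a_or_eps, b_star)          # (a + eps) . b*
non_nullable_product = (".", ("sym", "a"), b_star)  # a . b*
```

Here `null(nullable_product)` holds while `null(non_nullable_product)` does not, since the letter a cannot be erased.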
A finite automaton (NFA) is a 5-tuple $\mathcal{A} = \langle Q, A, \delta, q_0, F \rangle$, where Q is a finite set of states, A is an alphabet, $q_0 \in Q$ is the initial state, F ⊆ Q is the set of final states and $\delta : Q \times A \to 2^Q$ is the transition function. The language recognized by $\mathcal{A}$ is denoted by $L(\mathcal{A})$.

2.2. Multi-tilde–bar expressions

We now recall the syntactical definition of extended to multi-tilde–bar regular expressions (EmtbREs) [6]. Notice that the languages of these expressions will be proven to be regular later (see Corollary 1). Let E be a regular expression. Applying the bar operator to E yields an expression denoting the language L(E) \ {ε}, and applying the tilde operator to E yields an expression denoting the language L(E) ∪ {ε}.

In [10], Caron and Ziadi proposed an algorithm for converting a position automaton with n + 1 states into an equivalent small expression having n symbols. The following automaton is not a position automaton according to the characterization given by Caron and Ziadi.
However, if we consider the class of EmtbREs, one can obtain an equivalent EmtbRE having only three symbols: E = (a + ε)(b + ε)c.
Without loss of generality, any regular expression can be considered as a concatenation product $E_1 \cdot E_2 \cdots E_n$ of n subexpressions, with n ≥ 1. Such a product is denoted by $E_{1,n}$. Its factors are represented by the set of pairs $F = \{(i, j) \mid 1 \le i \le j \le n\}$: for 1 ≤ i ≤ j ≤ n, the factor $E_{i,j}$ is represented by the pair (i, j) ∈ F. A bar operator (resp. a tilde operator) applying to the factor $E_{i,j}$ is also represented by the pair (i, j) ∈ F. Given two disjoint subsets $F_1$ and $F_2$ of F, a multi-tilde–bar operator is defined by two subsets of F: the set $B_1^n$ of bar operators applying to the factors of $F_1$ and the set $T_1^n$ of tilde operators applying to the factors of $F_2$. Finally, a multi-tilde–bar expression is defined as a product $E_{1,n}$ equipped with a set $B_1^n$ of bars and a set $T_1^n$ of tildes.

Definition 2. (See [6].) An Extended to multi-tilde–bar Regular Expression (EmtbRE) over an alphabet A is inductively defined by:
• E = ∅,
• E = x, with x ∈ A ∪ {ε},
• E = (F + G), with F and G two EmtbREs,
• E = (F · G), with F and G two EmtbREs,
• E = (F∗), with F an EmtbRE,
• E = $E_{1,n}$, where $B_1^n$ is the set of bar operators, $T_1^n$ is the set of tilde operators, and $E_{1,n}$ is a concatenation product of EmtbREs.

The EmtbRE $E_{i,j}$ is deduced from the expression $E_{1,n}$ by taking as set of bars the subset $B_i^j = \{(k_1, k_2) \in B_1^n \mid i \le k_1 \le k_2 \le j\}$ of $B_1^n$ and as set of tildes the subset $T_i^j = \{(k_1, k_2) \in T_1^n \mid i \le k_1 \le k_2 \le j\}$ of $T_1^n$. The size of the multi-tilde–bar expression $E_{1,n}$, denoted $|E_{1,n}|$, is the size of the underlying product plus the term $|T_1^n| + |B_1^n|$. The alphabetic width $\|E_{1,n}\|$ of $E_{1,n}$ is the number of occurrences of letters in the expression. An MtbRE is said to be nullable if and only if the language it denotes contains the empty word.
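Example 1. Consider the regular expression $E_{1,5} = E_1 \cdot E_2 \cdot E_3 \cdot E_4 \cdot E_5$, the set of bars $B_1^5 = \{(2, 3), (3, 5)\}$ and the set of tildes $T_1^5 = \{(1, 2), (4, 5)\}$. The EmtbREs $E_{1,5}$ and $E_{1,3}$ can be represented graphically as follows:

[Diagram: $E_{1,5} = E_1 \cdot E_2 \cdot E_3 \cdot E_4 \cdot E_5$, with the bars (2, 3) and (3, 5) and the tildes (1, 2) and (4, 5).]
[Diagram: $E_{1,3} = E_1 \cdot E_2 \cdot E_3$, which by the restriction above keeps the bar (2, 3) and the tilde (1, 2).]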
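The restriction of the bar and tilde sets to a factor, used in Example 1, can be sketched as follows. The function name `restrict` and the representation of operators as Python sets of index pairs are assumptions made for illustration.

```python
def restrict(pairs, i, j):
    """Keep the operators (k1, k2) whose factor lies inside E_{i,j},
    that is, those satisfying i <= k1 <= k2 <= j."""
    return {(k1, k2) for (k1, k2) in pairs if i <= k1 <= k2 <= j}


# Operator sets of Example 1, given for the whole product E_{1,5}.
bars_1_5 = {(2, 3), (3, 5)}
tildes_1_5 = {(1, 2), (4, 5)}

# Operator sets inherited by the factor E_{1,3}.
bars_1_3 = restrict(bars_1_5, 1, 3)
tildes_1_3 = restrict(tildes_1_5, 1, 3)
```

Under these assumptions, `bars_1_3` is `{(2, 3)}` and `tildes_1_3` is `{(1, 2)}`: the operators (3, 5) and (4, 5) reach beyond the third factor and are dropped.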
3. The language of a multi-tilde–bar expression

The original semantical definition of the language of an EmtbRE [6] is based on the description of how words are generated by overlapping tildes and bars. Our approach is different: we provide a recursive definition of the language of an EmtbRE.

Definition 3. Let $E_{1,n}$ be a multi-tilde–bar expression. The language associated with $E_{1,n}$ is recursively defined as follows:

$$L(E_{i,j}) = \begin{cases} L \cup \{\varepsilon\} & \text{if } (i, j) \in T_1^n, \\ L \setminus \{\varepsilon\} & \text{if } (i, j) \in B_1^n, \\ L & \text{otherwise,} \end{cases}$$

with

$$L = \bigcup_{k=i}^{j-1} L(E_{i,k}) \cdot L(E_{k+1,j})$$

for all 1 ≤ i < j ≤ n, and, for a single factor (where $E_k$ on the right-hand side denotes the k-th factor of the underlying product, without its operators),

$$L(E_{k,k}) = \begin{cases} L(E_k) \cup \{\varepsilon\} & \text{if } (k, k) \in T_1^n, \\ L(E_k) \setminus \{\varepsilon\} & \text{if } (k, k) \in B_1^n, \\ L(E_k) & \text{otherwise.} \end{cases}$$

Corollary 1. The language of a multi-tilde–bar expression $E_{1,n}$ is regular.

As we will see in the following, this recursive definition allows us to provide the construction of the Glushkov automaton of any EmtbRE. It is worth noting that in [6] this construction is restricted to saturated EmtbREs, that is, expressions such that in each EmtbRE subexpression every factor is equipped with either a tilde or a bar.

Let us define a particular concatenation operator, denoted by $\cdot_\varepsilon$, as follows:

$$L(E_{1,j}) \cdot_\varepsilon L(E_{j+k,n}) = \begin{cases} \bigl(L(E_{1,j}) \cdot L(E_{j+k,n})\bigr) \setminus \{\varepsilon\} & \text{if } (1, n) \in B_1^n, \\ L(E_{1,j}) \cdot L(E_{j+k,n}) & \text{otherwise.} \end{cases}$$
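The recursion of Definition 3 can be executed directly when every factor denotes a finite language. The sketch below is an illustration under that assumption: the function name `mtb_language`, the encoding of languages as sets of Python strings (the empty string standing for ε) and the 1-based index pairs are hypothetical choices. The sample tilde and bar sets are the ones that reappear later in Example 3.

```python
from functools import lru_cache


def mtb_language(factor_langs, tildes, bars):
    """Language of E_{1,n} following the recursion of Definition 3.

    factor_langs[k-1] is the finite language of the k-th factor E_k,
    tildes and bars are sets of 1-based pairs (i, j).
    """
    n = len(factor_langs)

    @lru_cache(maxsize=None)
    def lang(i, j):
        if i == j:
            base = frozenset(factor_langs[i - 1])
        else:
            # Union over all cut points k of L(E_{i,k}) . L(E_{k+1,j}).
            base = frozenset(
                u + v
                for k in range(i, j)
                for u in lang(i, k)
                for v in lang(k + 1, j)
            )
        if (i, j) in tildes:
            return base | {""}   # a tilde adds the empty word
        if (i, j) in bars:
            return base - {""}   # a bar removes the empty word
        return base

    return lang(1, n)


# E_1 = a, E_2 = (b + eps), E_3 = (c + eps)
factors = [{"a"}, {"b", ""}, {"c", ""}]
tildes = frozenset({(1, 1), (2, 3)})
bars = frozenset({(1, 2), (3, 3)})
language = mtb_language(factors, tildes, bars)
```

With these inputs the computed set contains the empty word, which agrees with the value Null(E_{1,3}) = {ε} obtained in Example 3.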
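Proposition 1. Let $E_{1,n}$ be an EmtbRE. The language associated with $E_{1,n}$ can be recursively computed as follows:

$$L(E_{1,k}) = \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \Bigl(\bigcup_{j=1}^{k-1} L(E_{1,j}) \cdot_\varepsilon \mathrm{Null}(E_{j+1,k})\Bigr) \cup \mathrm{Null}(E_{1,k}), \quad \forall\, 1 < k \le n,$$

with $L_i = L(E_{i,i}) \cup \mathrm{Null}(E_{i,i})$, ∀ 1 ≤ i ≤ n.

Proof. The proof is by induction on k, i.e. the number of factors in $E_{1,k}$. For k = 2, the proposition clearly holds:

$$L(E_{1,2}) = \bigl(L(E_{1,1}) \cdot_\varepsilon L_2\bigr) \cup \bigl(L(E_{1,1}) \cdot_\varepsilon \mathrm{Null}(E_{2,2})\bigr) \cup \mathrm{Null}(E_{1,2}).$$

We now suppose that the proposition is satisfied for the EmtbRE $E_{1,k-1}$ and we prove that it is satisfied for $E_{1,k}$:

$$\begin{aligned}
L(E_{1,k}) &\overset{\text{Def. 3}}{=} \bigcup_{j=1}^{k-1} L(E_{1,j}) \cdot_\varepsilon L(E_{j+1,k}) \;\cup\; \mathrm{Null}(E_{1,k}) \\
&= \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{j=1}^{k-2} L(E_{1,j}) \cdot_\varepsilon L(E_{j+1,k}) \;\cup\; \mathrm{Null}(E_{1,k}) \\
&\overset{\text{Ind. Hyp.}}{=} \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{j=1}^{k-2} L(E_{1,j}) \cdot_\varepsilon \Bigl( \bigl(L(E_{j+1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{l=j+1}^{k-1} L(E_{j+1,l}) \cdot_\varepsilon \mathrm{Null}(E_{l+1,k}) \cup \mathrm{Null}(E_{j+1,k}) \Bigr) \cup \mathrm{Null}(E_{1,k}) \\
&= \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{j=1}^{k-2} L(E_{1,j}) \cdot_\varepsilon L(E_{j+1,k-1}) \cdot_\varepsilon L_k \;\cup\; \bigcup_{j=1}^{k-2} \bigcup_{l=j+1}^{k-1} L(E_{1,j}) \cdot_\varepsilon L(E_{j+1,l}) \cdot_\varepsilon \mathrm{Null}(E_{l+1,k}) \\
&\qquad \cup\; \bigcup_{j=1}^{k-2} L(E_{1,j}) \cdot_\varepsilon \mathrm{Null}(E_{j+1,k}) \;\cup\; \mathrm{Null}(E_{1,k}) \\
&\overset{\text{Def. 3}}{=} \mathrm{Null}(E_{1,k}) \cup \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{j=1}^{k-2} \bigcup_{l=j+1}^{k-1} L(E_{1,j}) \cdot_\varepsilon L(E_{j+1,l}) \cdot_\varepsilon \mathrm{Null}(E_{l+1,k}) \;\cup\; \bigcup_{l=1}^{k-2} L(E_{1,l}) \cdot_\varepsilon \mathrm{Null}(E_{l+1,k})
\end{aligned}$$

A straightforward consequence of Definition 3 is that $L(E_{1,j}) \cdot_\varepsilon L(E_{j+1,l}) \subseteq L(E_{1,l})$, for all 1 ≤ j ≤ k − 2. As a consequence we have:

$$L(E_{1,k}) = \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{l=1}^{k-2} L(E_{1,l}) \cdot_\varepsilon \mathrm{Null}(E_{l+1,k}) \;\cup\; L(E_{1,k-2}) \cdot_\varepsilon L(E_{k-1,k-1}) \cdot_\varepsilon \mathrm{Null}(E_{k,k}) \;\cup\; L(E_{1,k-1}) \cdot_\varepsilon \mathrm{Null}(E_{k,k}) \;\cup\; \mathrm{Null}(E_{1,k}).$$

Finally,

$$L(E_{1,k}) = \bigl(L(E_{1,k-1}) \cdot_\varepsilon L_k\bigr) \cup \bigcup_{l=1}^{k-1} L(E_{1,l}) \cdot_\varepsilon \mathrm{Null}(E_{l+1,k}) \;\cup\; \mathrm{Null}(E_{1,k}). \qquad \Box$$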
4. The position automaton of a multi-tilde–bar expression

4.1. Glushkov functions for a regular expression

Let E be a regular expression. In order to construct a non-deterministic finite automaton recognizing L(E), Glushkov [13] and McNaughton and Yamada [17] independently introduced the so-called position automaton. Given a regular expression E in linearized form, the following sets, called Glushkov functions, are defined, where x ∈ Pos(E) and u, v ∈ Pos(E)∗:

First(E) = {x ∈ Pos(E) | xv ∈ L(E)}
Last(E) = {x ∈ Pos(E) | ux ∈ L(E)}
Follow(x, E) = {y ∈ Pos(E) | uxyv ∈ L(E)}

The position automaton of E is deduced from these position sets as follows. We first add a specific position $q_0$ to the set Pos(E) and we set $\mathrm{Pos}_0(E) = \mathrm{Pos}(E) \cup \{q_0\}$; the set $\mathrm{Last}_0(E)$ is equal to Last(E) if Null(E) = ∅ and to $\mathrm{Last}(E) \cup \{q_0\}$ otherwise; the set $\mathrm{Follow}_0(x, E)$ is equal to Follow(x, E) if x ∈ Pos(E) and to First(E) if $x = q_0$. The position automaton $P_E$ of a regular expression E is defined by the 5-tuple $\langle \mathrm{Pos}_0(E), A, \delta, q_0, \mathrm{Last}_0(E) \rangle$ such that:

$$\delta(x, a) = \{ y \mid y \in \mathrm{Follow}_0(x, E) \text{ and } h(y) = a \}, \quad \forall x \in \mathrm{Pos}_0(E),\ \forall a \in A.$$

The position automaton $P_E$ recognizes the language L(E) [13,17]. Glushkov functions can be defined for bar expressions and tilde expressions as follows, where x ∈ Pos(E):
$$\mathrm{First}(\overline{E}) = \mathrm{First}(\widetilde{E}) = \mathrm{First}(E),$$
$$\mathrm{Last}(\overline{E}) = \mathrm{Last}(\widetilde{E}) = \mathrm{Last}(E),$$
$$\mathrm{Follow}(x, \overline{E}) = \mathrm{Follow}(x, \widetilde{E}) = \mathrm{Follow}(x, E),$$

where $\overline{E}$ and $\widetilde{E}$ denote E equipped with a bar and with a tilde, respectively.

As a consequence the computation of Glushkov functions can be extended to the family of EmtbREs. Such an extension is described in [6]; it addresses the subfamily of saturated EmtbREs, for which every factor is equipped with either a tilde or a bar.

4.2. Glushkov functions for a multi-tilde–bar expression

In this section, we address the general case: we show how to compute the Glushkov functions of an EmtbRE for which there is no restriction on the distribution of tilde and bar operators over the factors of the expression.

Proposition 2. Let $E_1 \cdot E_2 \cdots E_n$, with n ≥ 1, be a product and let $E_{1,n}$ be the associated EmtbRE, in linearized form. Let k be an integer such that 1 ≤ k ≤ n and x be a position in $\mathrm{Pos}(E_k)$. The Glushkov functions associated with $E_{1,n}$ are recursively computed according to the following formulas:
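$$\mathrm{Pos}(E_{1,n}) = \biguplus_{k=1}^{n} \mathrm{Pos}(E_k)$$

E | First(E) | Last(E) | Follow(x, E)
∅, ε | ∅ | ∅ | ∅
a ∈ A | {a} | {a} | ∅
F + G | First(F) ∪ First(G) | Last(F) ∪ Last(G) | Follow(x, F) ∪ Follow(x, G)
F · G | First(F) ∪ Null(F) · First(G) | Last(G) ∪ Null(G) · Last(F) | Follow(x, F) ∪ First(G) if x ∈ Last(F); Follow(x, F) if x ∈ Pos(F) \ Last(F); Follow(x, G) if x ∈ Pos(G)
F∗ | First(F) | Last(F) | Follow(x, F) ∪ First(F) if x ∈ Last(F); Follow(x, F) otherwise

$$\mathrm{First}(E_{1,n}) = \mathrm{First}(E_1) \uplus \biguplus_{j=1}^{n-1} \mathrm{Null}(E_{1,j}) \cdot \mathrm{First}(E_{j+1,j+1}), \qquad (1)$$

$$\mathrm{Last}(E_{1,n}) = \mathrm{Last}(E_n) \uplus \biguplus_{j=1}^{n-1} \mathrm{Null}(E_{j+1,n}) \cdot \mathrm{Last}(E_{j,j}), \qquad (2)$$

$$\mathrm{Follow}(x, E_{1,n}) = \begin{cases} \mathrm{Follow}(x, E_k) & \text{if } (k = n) \lor (x \notin \mathrm{Last}(E_k)), \\ \mathrm{Follow}(x, E_k) \uplus \mathrm{First}(E_{k+1,n}) & \text{otherwise.} \end{cases} \qquad (3)$$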
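Proof. The proof is restricted to the non-classical cases.

(1) From the definition of the function First, one has $\mathrm{First}(E_{1,n}) = \{x \in \mathrm{Pos}(E_{1,n}) \mid xv \in L(E_{1,n})\}$. Using Proposition 1 and by induction on n, one can deduce the following equalities:

$$\begin{aligned}
\mathrm{First}(E_{1,n}) &= \{x \in \mathrm{Pos}(E_{1,n}) \mid xv \in L(E_{1,n-1}) \cdot L_n\} \cup \Bigl\{x \in \mathrm{Pos}(E_{1,n}) \mid xv \in \bigcup_{j=1}^{n-1} L(E_{1,j}) \cdot \mathrm{Null}(E_{j+1,n})\Bigr\} \\
&= \mathrm{First}(E_{1,n-1}) \cup \mathrm{Null}(E_{1,n-1}) \cdot \mathrm{First}(E_{n,n}) \cup \bigcup_{j=1}^{n-1} \mathrm{First}(E_{1,j}) \\
&= \mathrm{Null}(E_{1,n-1}) \cdot \mathrm{First}(E_{n,n}) \cup \bigcup_{j=2}^{n-1} \mathrm{First}(E_{1,j}) \cup \mathrm{First}(E_{1,1}) \\
&\overset{\text{Ind. Hyp.}}{=} \mathrm{First}(E_{1,1}) \uplus \biguplus_{j=1}^{n-1} \mathrm{Null}(E_{1,j}) \cdot \mathrm{First}(E_{j+1,j+1})
\end{aligned}$$

(2) From the definition of the function Last, one has $\mathrm{Last}(E_{1,n}) = \{x \in \mathrm{Pos}(E_{1,n}) \mid vx \in L(E_{1,n})\}$. Using Proposition 1 and by induction on n, one can deduce the following equalities:

$$\begin{aligned}
\mathrm{Last}(E_{1,n}) &= \{x \in \mathrm{Pos}(E_{1,n}) \mid vx \in L(E_{1,n-1}) \cdot L_n\} \cup \Bigl\{x \in \mathrm{Pos}(E_{1,n}) \mid vx \in \bigcup_{j=1}^{n-1} L(E_{1,j}) \cdot \mathrm{Null}(E_{j+1,n})\Bigr\} \\
&= \mathrm{Last}(E_{n,n}) \cup \mathrm{Last}(E_{1,n-1}) \cdot \mathrm{Null}(E_{n,n}) \cup \bigcup_{j=1}^{n-1} \mathrm{Last}(E_{1,j}) \cdot \mathrm{Null}(E_{j+1,n}) \\
&= \mathrm{Last}(E_{n,n}) \cup \bigcup_{j=1}^{n-1} \mathrm{Last}(E_{1,j}) \cdot \mathrm{Null}(E_{j+1,n}) \\
&\overset{\text{Ind. Hyp.}}{=} \mathrm{Last}(E_{n,n}) \uplus \biguplus_{j=1}^{n-1} \mathrm{Last}(E_{j,j}) \cdot \mathrm{Null}(E_{j+1,n})
\end{aligned}$$

(3) The proof is similar to that of (1) and (2). □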
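Formulas (1) to (3) of Proposition 2 can be sketched in a few lines once the Null values of all factors are known. The code below is an illustration under assumptions: the function names, the dictionaries indexed by factor number and the Boolean table `null[(i, j)]` (True for {ε}, False for ∅) are hypothetical. It also uses First(E_{j,j}) = First(E_j) and Last(E_{j,j}) = Last(E_j), as stated in Section 4.1 for bar and tilde expressions. The sample instance is a product a · b · c with a single tilde on the second factor, an example chosen here and not taken from the paper.

```python
def first_of(i, n, first, null):
    """Formula (1) applied to the factor E_{i,n}."""
    result = set(first[i])
    for j in range(i, n):
        if null[(i, j)]:            # Null(E_{i,j}) = {eps}
            result |= first[j + 1]
    return result


def last_of(n, last, null):
    """Formula (2) for E_{1,n}."""
    result = set(last[n])
    for j in range(1, n):
        if null[(j + 1, n)]:        # Null(E_{j+1,n}) = {eps}
            result |= last[j]
    return result


def follow_of(x, k, n, first, last, follow, null):
    """Formula (3) for a position x belonging to the factor E_k."""
    result = set(follow[k].get(x, set()))
    if k < n and x in last[k]:
        result |= first_of(k + 1, n, first, null)
    return result


# E_1 = a1, E_2 = b2, E_3 = c3, with the single tilde (2, 2).
first = {1: {"a1"}, 2: {"b2"}, 3: {"c3"}}
last = {1: {"a1"}, 2: {"b2"}, 3: {"c3"}}
follow = {1: {}, 2: {}, 3: {}}     # single letters have empty Follow sets
null = {(1, 1): False, (2, 2): True, (3, 3): False,
        (1, 2): False, (2, 3): False, (1, 3): False}
```

For this instance the language is {a1 b2 c3, a1 c3}, and the sketch returns First = {a1}, Last = {c3} and Follow(a1) = {b2, c3}, as expected.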
Corollary 2. The Glushkov functions of a multi-tilde–bar expression can be written as a disjoint union which involves the First, Last, and Follow sets associated with sub-expressions of the underlying product $E_1 \cdots E_n$ (not of the multi-tilde–bar expression $E_{1,n}$ itself) and the value of the function $\mathrm{Null}(E_{i,j})$ for all 1 ≤ i ≤ j ≤ n.

The following proposition can be deduced from Definition 1.

Proposition 3. Let E be an EmtbRE in linearized form. The function Null(E) can be recursively computed as follows:

Null(∅) = ∅, Null(ε) = {ε}, Null(a) = ∅,
Null(F + G) = Null(F) ∪ Null(G), Null(F · G) = Null(F) · Null(G), Null(F∗) = {ε},

$$\mathrm{Null}(E_{1,n}) = \begin{cases} \emptyset & \text{if } (1, n) \in B_1^n, \\ \{\varepsilon\} & \text{if } (1, n) \in T_1^n, \\ \bigcup_{j=1}^{n-1} \mathrm{Null}(E_{1,j}) \cdot \mathrm{Null}(E_{j+1,n}) & \text{otherwise.} \end{cases} \qquad (4)$$
Fig. 1. The position automaton $A_{E_{1,7}}$.
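Proof. The proof is by induction on the size of E. It is restricted to the non-classical case (4). If $(1, n) \in T_1^n$, then $E_{1,n}$ can be written as a tilde expression $\widetilde{F}$. Thus, by the definition of the set Null, we have $\mathrm{Null}(E_{1,n}) = \{\varepsilon\}$. If $(1, n) \in B_1^n$, then $E_{1,n}$ can be written as a bar expression $\overline{F}$. Thus, by the definition of the set Null, we have $\mathrm{Null}(E_{1,n}) = \emptyset$. Let us suppose that $(1, n) \notin T_1^n \cup B_1^n$; one has:

$$\begin{aligned}
\varepsilon \in L(E_{1,n}) &\overset{\text{Def. 3}}{\Longleftrightarrow} \varepsilon \in \bigcup_{j=1}^{n-1} L(E_{1,j}) \cdot L(E_{j+1,n}) \\
&\Longleftrightarrow \varepsilon \in \bigcup_{j=1}^{n-1} \mathrm{Null}(E_{1,j}) \cdot \mathrm{Null}(E_{j+1,n}) \\
&\Longleftrightarrow \varepsilon \in \mathrm{Null}(E_{1,n}) \qquad \Box
\end{aligned}$$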
Example 2. Let us consider the following EmtbRE:

$$E_{1,7} = a_1^* \cdot b_2 \cdot (c_3 + \varepsilon) \cdot (d_4 + \varepsilon) \cdot (e_5 + \varepsilon) \cdot f_6 \cdot g_7^*$$

The language associated with $E_{1,7}$ is:

{a1 b2, a1 b2 g7, b2, b2 c3, b2 g7, ···, b2 e5, b2 c3 d4 e5, b2 c3 d4 e5 f6, b2 c3 f6 g7, b2 d4 e5 f6 g7, ···, d4, d4 e5, d4 e5 f6, d4 e5 f6 g7, ···, e5, e5 f6, e5 f6 g7, ···}

The associated Glushkov functions (see Fig. 1) are:

Pos(E) = {a1, b2, c3, d4, e5, f6, g7}
Null(E) = ∅
First(E) = {a1, b2, d4, e5}
Last(E) = {b2, c3, d4, e5, f6, g7}
Follow(a1, E) = {a1, b2}
Follow(b2, E) = {c3, d4, g7}
Follow(c3, E) = {d4, e5, f6}
Follow(d4, E) = {e5, f6}
Follow(e5, E) = {f6}
Follow(f6, E) = {g7}
Follow(g7, E) = {g7}

5. Efficient computation of the position automaton

In this section, we present efficient algorithms to compute the Glushkov functions of a multi-tilde–bar expression E, based on the formulas of Proposition 2 and on the extension of the ZPC-structure to EmtbREs. According to Corollary 2, the worst case time complexity of these algorithms depends on the worst case time complexity of the function Null(E), which we study first.
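Before turning to Null, the position automaton of Section 4.1 can be made concrete with the Glushkov functions listed in Example 2. The sketch below is only an illustration: the function names, the string `"q0"` used for the extra initial position and the dictionary `symbol_of` playing the role of the mapping h are assumptions, and the sets are copied from Example 2 rather than computed.

```python
def position_automaton(first, last, follow, nullable, symbol_of):
    """Build (q0, final states, delta) from the Glushkov functions."""
    q0 = "q0"  # hypothetical name for the added initial position
    follow0 = dict(follow)
    follow0[q0] = set(first)                       # Follow_0(q0) = First
    finals = set(last) | ({q0} if nullable else set())

    def delta(state, letter):
        # delta(x, a) = { y in Follow_0(x) | h(y) = a }
        return {y for y in follow0.get(state, set()) if symbol_of[y] == letter}

    return q0, finals, delta


def accepts(automaton, word):
    """Run the NFA on word, tracking the set of reachable positions."""
    q0, finals, delta = automaton
    current = {q0}
    for letter in word:
        current = set().union(*(delta(q, letter) for q in current))
    return bool(current & finals)


# Glushkov functions of Example 2.
first = {"a1", "b2", "d4", "e5"}
last = {"b2", "c3", "d4", "e5", "f6", "g7"}
follow = {
    "a1": {"a1", "b2"}, "b2": {"c3", "d4", "g7"}, "c3": {"d4", "e5", "f6"},
    "d4": {"e5", "f6"}, "e5": {"f6"}, "f6": {"g7"}, "g7": {"g7"},
}
symbol_of = {p: p[0] for p in follow}              # h(a1) = "a", and so on
automaton = position_automaton(first, last, follow, False, symbol_of)
```

With these sets, `accepts(automaton, "ab")` and `accepts(automaton, "bcfg")` hold (the words a1 b2 and b2 c3 f6 g7 appear in the language listed in Example 2), while the empty word is rejected because Null(E) = ∅.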
5.1. Computation of Null(E)

According to Proposition 3, a naive computation of the function Null of the EmtbRE $E_{1,n}$ can be performed using the following algorithm.

Data: $E_{1,n}$, with its sets $B_1^n$ and $T_1^n$
Result: $\mathrm{Null}(E_{i,j})$ for all 1 ≤ i ≤ j ≤ n
for i ← 1 to n do
    if (i, i) ∈ $B_1^n$ then
        Null($E_{i,i}$) ← ∅
    else if (i, i) ∈ $T_1^n$ then
        Null($E_{i,i}$) ← {ε}
    else
        Null($E_{i,i}$) ← Null($E_i$)
    end
end
for k ← 1 to n − 1 do
    for i ← 1 to n − k do
        if (i, i + k) ∈ $B_1^n$ then
            Null($E_{i,i+k}$) ← ∅
        else if (i, i + k) ∈ $T_1^n$ then
            Null($E_{i,i+k}$) ← {ε}
        else
            Null($E_{i,i+k}$) ← $\bigcup_{j=i}^{i+k-1} \mathrm{Null}(E_{i,j}) \cdot \mathrm{Null}(E_{j+1,i+k})$
        end
    end
end
The different steps of the algorithm are illustrated through the following example.

Example 3. Consider the EmtbRE $E_{1,3}$ such that $T_1^3 = \{(1, 1), (2, 3)\}$, $B_1^3 = \{(1, 2), (3, 3)\}$, and $E_1 = a$, $E_2 = (b + \varepsilon)$, $E_3 = (c + \varepsilon)$. The diagram below is a graphical representation of the recursive dependency between the different values of $\mathrm{Null}(E_{i,j})$.

EmtbRE | Null | Reason
$E_{1,1}$ | {ε} | (1, 1) ∈ $T_1^3$
$E_{2,3}$ | {ε} | (2, 3) ∈ $T_1^3$
$E_{1,2}$ | ∅ | (1, 2) ∈ $B_1^3$
$E_{3,3}$ | ∅ | (3, 3) ∈ $B_1^3$

It holds:

$$\mathrm{Null}(E_{1,3}) = \bigl(\mathrm{Null}(E_{1,1}) \cdot \mathrm{Null}(E_{2,3})\bigr) \cup \bigl(\mathrm{Null}(E_{1,2}) \cdot \mathrm{Null}(E_{3,3})\bigr) = \{\varepsilon\} \cup \emptyset = \{\varepsilon\}$$
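The naive algorithm can be sketched as a bottom-up table filling, here run on the data of Example 3. The function name `null_table`, the Boolean encoding (True for {ε}, False for ∅) and the list `factor_null`, which holds the nullability of each factor E_k without its operators, are assumptions made for illustration.

```python
def null_table(factor_null, tildes, bars):
    """Return a dict mapping every pair (i, j), 1 <= i <= j <= n,
    to True when Null(E_{i,j}) = {eps} and to False when it is empty."""
    n = len(factor_null)
    null = {}
    # Factors of length one.
    for i in range(1, n + 1):
        if (i, i) in bars:
            null[(i, i)] = False
        elif (i, i) in tildes:
            null[(i, i)] = True
        else:
            null[(i, i)] = factor_null[i - 1]
    # Longer factors, by increasing length k + 1.
    for k in range(1, n):
        for i in range(1, n - k + 1):
            j = i + k
            if (i, j) in bars:
                null[(i, j)] = False
            elif (i, j) in tildes:
                null[(i, j)] = True
            else:
                # Union over the cut points m of Null(E_{i,m}) . Null(E_{m+1,j}).
                null[(i, j)] = any(
                    null[(i, m)] and null[(m + 1, j)] for m in range(i, j)
                )
    return null


# Example 3: E_1 = a, E_2 = (b + eps), E_3 = (c + eps).
table = null_table(
    [False, True, True],
    tildes={(1, 1), (2, 3)},
    bars={(1, 2), (3, 3)},
)
```

The entry `table[(1, 3)]` is True, matching Null(E_{1,3}) = {ε}. The three nested loops (over k, i and the cut point m) are where the cubic term of the complexity comes from.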
Let us consider the case of an EmtbRE $E_{1,n}$. There are (n − k) vertices on the kth line, corresponding to tilde or bar operators (1, 1 + k), (2, 2 + k), . . . The computation of the associated $\mathrm{Null}(E_{i,i+k})$ functions requires:

• a constant number of elementary test operations: whether $(i, i + k) \in T_1^n$ or $(i, i + k) \in B_1^n$,
• (k − 1) concatenations $\mathrm{Null}(E_{i,j}) \cdot \mathrm{Null}(E_{j+1,i+k})$,
• (k − 2) unions.

Finally, $\sum_{k=2}^{n} 2 \cdot k \cdot (n - k + 1)$ operations are needed to compute the function $\mathrm{Null}(E_{1,n})$.
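As a check on the order of growth, the sum above can be evaluated in closed form. The derivation below is a worked computation added for illustration; it only uses the standard formulas for $\sum k$ and $\sum k^2$.

```latex
\begin{align*}
\sum_{k=1}^{n} k\,(n-k+1)
  &= (n+1)\sum_{k=1}^{n} k \;-\; \sum_{k=1}^{n} k^2
   = \frac{n(n+1)^2}{2} - \frac{n(n+1)(2n+1)}{6}
   = \frac{n(n+1)(n+2)}{6}, \\
\sum_{k=2}^{n} 2\,k\,(n-k+1)
  &= 2\left(\frac{n(n+1)(n+2)}{6} - n\right)
   = \frac{n(n+1)(n+2)}{3} - 2n .
\end{align*}
```

For n = 3 this gives 20 − 6 = 14, which matches the direct evaluation 2·2·2 + 2·3·1 = 14, and in general the count grows as n³/3, hence the cubic term in the complexity bound.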
Proposition 4. Let $E_{1,n}$ be an EmtbRE. The function $\mathrm{Null}(E_{1,n})$ can be computed in $O(|E_{1,n}| + n^3)$ time.

Notice that the function $\mathrm{Null}(E_{1,n})$ can be computed by making use of one of the numerous algorithms which compute the transitive closure of a DAG (see for example [12]). Although these algorithms have the same $O(n^3)$ worst case time complexity as the naive algorithm, they are likely to have a better running time in practice.

In the next section, we first show how a multi-tilde–bar expression can be converted into its ZPC-structure. Then we review the major properties of ZPC-structures.

5.2. ZPC-structure of a multi-tilde–bar expression

The ZPC-structure has been introduced by Champarnaud, Ponty and Ziadi in order to convert in quadratic time a regular expression into the associated position automaton [11,19]. Later, this structure has been generalized to weighted regular expressions [8]. It can be computed in linear time w.r.t. the size of the expression. The ZPC-structure offers many advantages for efficient constructions of finite automata from regular expressions (follow automaton [9], c-continuation automaton [10], and equation automaton [15]). In this section, we first extend this structure to EmtbREs. Then, we deduce efficient computations of Glushkov functions using some of its major properties.

5.2.1. From multi-tilde–bar expression to its ZPC-structure

Let us provide a description of the conversion of an EmtbRE into its ZPC-structure in the non-classical case. For more details about the ZPC-structure of a classical regular expression, we refer to [11,19]. Let us consider an EmtbRE $E = E_{1,n}$ and consider the concatenation as a right-associative operation. Let T(E) be the syntactical tree of E. A node in T(E) is denoted by ν. The set of nodes of T(E) is written Nodes(T(E)).
If ν ∈ Nodes(T(E)) is a node in T(E), then sym(ν), fathers(ν), sons(ν), right(ν), and left(ν) respectively denote the symbol, the fathers, the sons, the right sons and the left sons of the node ν. The relation of descendance over the syntactical tree is denoted by ≼. The expression $E_\nu$ denotes the subexpression associated with the subtree rooted at the node ν, and $\mathrm{Null}(\nu) = \mathrm{Null}(E_\nu)$.

The ZPC-structure associated with the multi-tilde–bar expression E is based on two forests deduced from the syntactical tree T(E), called Last forest and First forest, respectively denoted TL(E) and TF(E). These forests respectively encode the Last sets and the First sets of the subexpressions of E. The transition function of the position automaton appears as a collection of links from the Last forest to the First forest.

To represent the semantics of the tilde operators $T_1^n$ and bar operators $B_1^n$ over the ZPC-structure, the '•'-nodes in the product $E_{1,n}$ are ranked with a sub-index denoting their position in E. It is helpful to add two special positions, \$ and #, and to process the expression $E = \$ \bullet_0 E_{1,n} \bullet_n \#$, where

$$E_{1,n} = (\ldots((E_1 \bullet_1 E_2) \bullet_2 E_3) \cdots E_{n-1}) \bullet_{n-1} E_n)$$

On the right-hand side, the First forest TF(E) is obtained from T(E) as follows: the edge connecting any '•'-node to its right son is deleted if the value of the function Null associated with its left son is empty. An edge between some '$\bullet_i$'-node and '$\bullet_j$'-node is added if $\mathrm{Null}(E_{i,j-1}) = \{\varepsilon\}$, for all 0 ≤ i ≤ n and 1 ≤ j ≤ n − 1 such that $\bullet_n = \#$; formally, in that case $\bullet_j \in \mathrm{sons}(\bullet_i)$. A node in TF(E) is denoted by $\nu^F$ and the edges are oriented from the leaves to the root node.

On the left-hand side, the Last forest TL(E) is obtained from T(E) as follows: the edge connecting any '•'-node to its left son is deleted if the value of the function Null associated with its right son is empty. A node in TL(E) is denoted by $\nu^L$ and the edges are oriented from the root node to the leaves.

The two forests are connected as follows: if a node of TL(E) is labeled '•', its left son is linked to the right son of the corresponding node in TF(E). If a node is labeled '∗', its son is linked to the corresponding node in TF(E). Such links are called follow links. Formally, the follow links can be defined by the function $f : \mathrm{Nodes}(TL(E)) \cup \{\bot\} \to \mathrm{Nodes}(TF(E)) \cup \{\bot\}$ as follows:

$$f(\nu^L) = \begin{cases} \mathrm{father}(\nu^F) & \text{if } \mathrm{sym}(\mathrm{father}(\nu)) = * \text{ and } \nu = \nu_E, \\ \mathrm{right}(\mathrm{father}(\nu^F)) & \text{if } \mathrm{sym}(\mathrm{father}(\nu)) = \bullet, \\ \bot & \text{otherwise.} \end{cases} \qquad (5)$$
Example 4. Let us consider the EmtbRE:

$$E_{1,6} = a_1 \cdot (b_2 + \varepsilon) \cdot (c_3 + \varepsilon) \cdot (d_4 + \varepsilon) \cdot e_5 \cdot f_6^*$$

Let E be the MtbRE $\$ \bullet_0 E_{1,6} \bullet_6 \#$.
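The following table gives the different values of the function $\mathrm{Null}(E_{i,j})$ associated with the factors of E.

Table: $\mathrm{Null}(E_{i,j})$ for i, j ∈ {1, …, 6} ("–" marks the 15 entries with i > j). Flattened cell contents, in extraction order:
∅ ∅ ∅ ε ε ε – – – – ∅ ∅ ∅ ∅ – – – – – ∅ ∅ – – – ∅ ∅ ∅ ∅ ∅ ∅ – ε ε ε – –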
One can construct the ZPC-structure of E as follows:

Fig. 2. The ZPC-structure associated with the multi-tilde–bar expression E.

One has: $\mathrm{sons}(\bullet_2^F) = \{+^F, \bullet_3^F, \bullet_6^F, \#^F\}$, where $E_{+^F} = E_{2,2}$, and $f(a_1) = \bullet_2$. In the following, we show that these forests respectively encode the Last and First sets associated with subexpressions of E.

5.2.2. ZPC-structure properties

In this section, we give an efficient computation of the Glushkov functions associated with an EmtbRE using some of the major properties of the ZPC-structure. Let E be an EmtbRE. Without loss of generality, one can consider that $E = E_{1,n}$ and that the regular expressions $E_i$, ∀ 1 ≤ i ≤ n, are classical.

Computation of the First set

Let us associate with a node $\nu^F$ in TF(E) the set of positions

$$\mathrm{First}(\nu^F) = \{x \in \mathrm{Pos}(E) \mid x \preceq \nu'^F, \text{ such that } \nu'^F \in \mathrm{sons}(\nu^F)\}$$
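Proposition 5. Let $\nu^F$ be a node in TF(E). Then:

$$\mathrm{First}(E_{\nu^F}) = \mathrm{First}(\nu^F)$$

Proof. The proof is restricted to the non-classical case. Let us consider the EmtbRE $E_{i,n} = E_{\bullet_i}$, for some 0 ≤ i ≤ n, and prove that $\mathrm{First}(E_{\bullet_i}) = \mathrm{First}(\bullet_i)$. Using equality (1) of Proposition 2 and by induction on n, one can deduce:

$$\begin{aligned}
\mathrm{First}(E_{i,n}) &= \mathrm{First}(E_i) \uplus \biguplus_{k=i}^{n-1} \mathrm{Null}(E_{i,k}) \cdot \mathrm{First}(E_{k+1,k+1}) \\
&= \{x \in \mathrm{Pos}(E_i) \mid x \preceq \nu^F, \text{ s.t. } \nu^F \in \mathrm{left}(\bullet_i)\} \uplus \biguplus_{k=i}^{n-1} \mathrm{Null}(E_{i,k}) \cdot \{x \in \mathrm{Pos}(E_{k+1}) \mid x \preceq \nu^F, \text{ s.t. } \nu^F \in \mathrm{left}(\bullet_{k+1})\}
\end{aligned}$$

By the definition of the ZPC-structure, one has:

$$\begin{aligned}
\mathrm{First}(E_{i,n}) &= \{x \in \mathrm{Pos}(E_i) \mid x \preceq \nu^F, \text{ s.t. } \nu^F \in \mathrm{left}(\bullet_i)\} \uplus \Bigl\{x \in \biguplus_{k=i}^{n-1} \mathrm{Pos}(E_{k+1}) \mid x \preceq \nu^F, \text{ s.t. } \nu^F \in \mathrm{right}(\bullet_i)\Bigr\} \\
&= \{x \in \mathrm{Pos}(E_{i,n}) \mid x \preceq \nu^F, \text{ s.t. } \nu^F \in \mathrm{sons}(\bullet_i)\} \\
&= \mathrm{First}(\bullet_i) \qquad \Box
\end{aligned}$$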
Computation of the Last set

Let us associate with a node $\nu^L$ in TL(E) the set of positions

$$\mathrm{Last}(\nu^L) = \{x \in \mathrm{Pos}(E) \mid x \preceq \nu^L\}$$

Proposition 6. Let $\nu^L$ be a node in TL(E). Then:

$$\mathrm{Last}(E_{\nu^L}) = \mathrm{Last}(\nu^L)$$

Proof. The proof is similar to that of the previous proposition. □
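Computation of the Follow set

Let x be a position in $\mathrm{Pos}(E) \cap \mathrm{Pos}(E_i)$ for some 1 ≤ i ≤ n. Let us show how to compute the set Follow(x, E) from the set of follow links associated with x, defined by the function f.

Proposition 7. Let E be an EmtbRE and $x \in \mathrm{Pos}(E) \cap \mathrm{Pos}(E_i)$ for some 1 ≤ i ≤ n. Then:

$$\mathrm{Follow}(x, E) = \{ y \in \mathrm{First}(\nu^F) \mid \exists \nu^L \in TL(E), \text{ s.t. } f(\nu^L) = \nu^F \text{ and } x \preceq \nu^L \}$$

Proof. The proof is restricted to the non-classical case. Let us consider the EmtbRE $E_{i,n} = E_{\bullet_i}$, for some 0 ≤ i ≤ n, and consider a position x in $\mathrm{Pos}(E_i)$. By the definition of the ZPC-structure, we have $f(\mathrm{left}(\bullet_i)) = \bullet_{i+1}$. From equality (3) of Proposition 2 and by induction on n, one has:

– If i = n or $x \notin \mathrm{Last}(E_i)$, then:

$$\begin{aligned}
\mathrm{Follow}(x, E_{i,n}) &= \mathrm{Follow}(x, E_i) \\
&\overset{\text{Ind. Hyp.}}{=} \{ y \in \mathrm{First}(\nu^F) \mid \exists \nu^L \in TL(E_i), \text{ s.t. } f(\nu^L) = \nu^F \text{ and } x \preceq \nu^L \}
\end{aligned}$$

– Else,

$$\begin{aligned}
\mathrm{Follow}(x, E_{i,n}) &= \mathrm{Follow}(x, E_i) \uplus \mathrm{First}(E_{i+1,n}) \\
&\overset{\text{Ind. Hyp.}}{=} \{ y \in \mathrm{First}(\nu^F) \mid \exists \nu^L \in TL(E_i), \text{ s.t. } f(\nu^L) = \nu^F \text{ and } x \preceq \nu^L \} \uplus \mathrm{First}(E_{i+1,n}) \\
&\overset{\text{Prop. 5}}{=} \{ y \in \mathrm{First}(\nu^F) \mid \exists \nu^L \in TL(E_i), \text{ s.t. } f(\nu^L) = \nu^F \text{ and } x \preceq \nu^L \} \uplus \mathrm{First}(\bullet_{i+1}) \\
&= \{ y \in \mathrm{First}(\nu^F) \mid \exists \nu^L \in TL(E_i), \text{ s.t. } f(\nu^L) = \nu^F \text{ and } x \preceq \nu^L \} \\
&\qquad \uplus \{ y \in \mathrm{First}(\bullet_{i+1}) \mid \exists\, \mathrm{left}(\bullet_i) \in TL(E_i), \text{ s.t. } f(\mathrm{left}(\bullet_i)) = \bullet_{i+1} \text{ and } x \preceq \mathrm{left}(\bullet_i) \} \\
&= \{ y \in \mathrm{First}(\nu^F) \mid \exists \nu^L \in TL(E_{i,n}), \text{ s.t. } f(\nu^L) = \nu^F \text{ and } x \preceq \nu^L \} \qquad \Box
\end{aligned}$$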
According to Corollary 2 and the previous properties of the ZPC-structure, for an EmtbRE $E_{1,n}$, the functions $\mathrm{First}(E_{1,n})$ (resp. $\mathrm{Last}(E_{1,n})$) and $\mathrm{Follow}(x, E_{1,n})$ can be written as disjoint unions of some $\mathrm{First}(E_{i,i})$ (resp. $\mathrm{Last}(E_{i,i})$) sets. Their computation can be done by a simple linear-time tree traversal algorithm. Thus, the following proposition holds.

Proposition 8. Let $E_{1,n}$ be an EmtbRE and $x \in \mathrm{Pos}(E_{1,n})$. The functions $\mathrm{First}(E_{1,n})$, $\mathrm{Last}(E_{1,n})$, and $\mathrm{Follow}(x, E_{1,n})$ can be computed in $O(|E_{1,n}| + n^3)$ time.

The ZPC-structure [19] can be extended to multi-tilde–bar expressions in a natural way (see Fig. 2), by representing the tilde and bar operators by edges connecting the '•'-nodes of the product. Therefore, all the algorithms based on the ZPC-structure, i.e. the construction of the c-continuation automaton [10], of the equation automaton [10], of the follow automaton [9] and of the weighted position automaton [8], also work for multi-tilde–bar expressions. Moreover the worst case time complexity in the case of multi-tilde–bar expressions is the worst case time complexity of the standard case augmented with the worst case time complexity of the function Null. Therefore, the following theorem can be stated.

Theorem 1. Let E be a multi-tilde–bar expression and N the worst case time complexity of the function Null. The position automaton associated with E can be computed in $O(|E| \times \|E\| + N)$ time.

6. C-continuation computation

The notion of c-continuation has been introduced by Champarnaud and Ziadi [10] in order to efficiently compute the equation automaton [1]. This notion can be naturally extended to the class of multi-tilde–bar expressions. Let E be an EmtbRE. We say that E is linear if each symbol in E occurs only once. In the following, we consider the concatenation as a left-associative operation.

Definition 4. (See [10].) For every symbol x of a linear MtbRE E, the c-continuation $c_x(E)$ is such that:
$c_x(x) = \varepsilon$,
$c_x(F + G)$ = if $c_x(F)$ exists then $c_x(F)$ else $c_x(G)$,
$c_x(F \cdot G)$ = if $c_x(F)$ exists then $c_x(F) \cdot G$ else $\mathrm{Null}(F) \cdot c_x(G)$,
$c_x(F^*) = c_x(F) \cdot F^*$.

Corollary 3. (See [10].) For every symbol x of a linear MtbRE E, the c-continuation $c_x(E)$ is either ε, or a subexpression of E, or a product of subexpressions.

Recall that a c-continuation $c_x(E)$ is a concatenation of distinct sub-expressions $H_i$ of E. The following proposition shows how $c_x(E)$ can be computed over the ZPC-structure associated with the linear EmtbRE E.

Proposition 9. (See [10].) Let E be an EmtbRE and x a position in Pos(E). The c-continuation $c_x(E)$ is as follows:

$$c_x(E) = \prod_{\substack{x \,\preceq\, \nu^L \,\preceq\, \nu_E^L \\ f(\nu^L) \neq \bot}} E_{f(\nu^L)}$$

The computation of a c-continuation using the ZPC-structure can be done in a similar way as in the standard case. Let $(l_1, r_1), (l_2, r_2), \ldots, (l_k, r_k)$, where $f(l_i) = r_i$, ∀ 1 ≤ i ≤ k, be the list of follow links on the path going from a position x to the root of the Last tree. Let us denote by $F_i$ the subexpression associated with the node $r_i$ in the First tree. Then the c-continuation $c_x$ associated with x is the expression $F_1 \cdots F_k$. In our example we have:
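\[
\begin{aligned}
c_{a_1}(E) &= (b_2 + \varepsilon) \cdot (c_3 + \varepsilon) \cdot (d_4 + \varepsilon) \cdot e_5 \cdot f_6^*\\
c_{b_2}(E) &= (c_3 + \varepsilon) \cdot (d_4 + \varepsilon) \cdot e_5 \cdot f_6^*
\end{aligned}
\]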
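The traversal described above (walk from a position to the root of the Last tree and concatenate the subexpressions reached by the follow links) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LastNode` record, the dictionary encoding and the node names `P1`..`P5` and `B` are hypothetical, and the toy tree is hand-built to be compatible with the two continuations listed in the example; it is not the ZPC-structure of Fig. 2 nor the authors' data structure.

```python
from typing import Dict, List, NamedTuple, Optional


class LastNode(NamedTuple):
    """One node of a toy Last tree (hypothetical encoding, not the authors')."""
    parent: Optional[str]  # next node on the way to the root; None at the root
    follow: Optional[str]  # subexpression at f(node) in the First tree; None encodes ⊥


def c_continuation(last_tree: Dict[str, LastNode], x: str) -> List[str]:
    """Collect F1, ..., Fk along the path from position x to the root."""
    factors: List[str] = []
    node: Optional[str] = x
    while node is not None:
        entry = last_tree[node]
        if entry.follow is not None:  # keep only nodes whose follow link is defined
            factors.append(entry.follow)
        node = entry.parent
    return factors


# Hand-built toy tree (assumption): a left-associative product whose
# follow links leave the left operand of each product node.
TOY_LAST_TREE: Dict[str, LastNode] = {
    "a1": LastNode(parent="P1", follow="(b2 + ε)"),
    "b2": LastNode(parent="B", follow=None),
    "B": LastNode(parent="P1", follow=None),
    "P1": LastNode(parent="P2", follow="(c3 + ε)"),
    "P2": LastNode(parent="P3", follow="(d4 + ε)"),
    "P3": LastNode(parent="P4", follow="e5"),
    "P4": LastNode(parent="P5", follow="f6*"),
    "P5": LastNode(parent=None, follow=None),
}
```

With this toy tree, `c_continuation(TOY_LAST_TREE, "a1")` collects the five factors of the first continuation in the example, and `c_continuation(TOY_LAST_TREE, "b2")` collects the last four, since the two nodes between `b2` and `P1` carry no follow link. The walk visits each node on the path once, so its cost is proportional to the depth of the position in the tree.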
7. Conclusion

In this paper, we give some answers to open questions raised in [6]. First, we formalize an explicit definition of the language associated with a multi-tilde–bar expression, which allows us to give a recursive computation of its Glushkov functions. Next, we show that the worst-case time complexity of constructing the position automaton depends on the worst-case time complexity of the function Null(E). This function can straightforwardly be replaced by another type of function in order to control the application of each tilde or bar. Last, we provide an algorithm to convert a multi-tilde–bar expression into its position automaton, with a cubic worst-case time complexity with respect to the size of the multi-tilde–bar expression.

As future work, one can generalize the semantics of the tilde and bar operators to obtain a more expressive class of regular expressions. According to Corollary 2, a multi-tilde–bar expression can be viewed as a standard regular expression equipped with a specific computation for the function Null. The computation of the Glushkov functions of a multi-tilde–bar expression obviously depends on the definition of the function Null: for example, an alternative interpretation of the tilde operator and the bar operator could be associated with the following definition of Null:
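\[
\mathrm{Null}(E_1 \cdot E_2 \cdots E_n) = \{\varepsilon\} \iff \varepsilon \in L(E_1) \wedge \varepsilon \in L(E_n)
\]
\[
\mathrm{Null}(E_1 \cdot E_2 \cdots E_n) = \{\varepsilon\} \iff \varepsilon \in L(E_1) \vee \varepsilon \in L(E_n)
\]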
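The two candidate definitions above can be read as interchangeable predicates over the factors of a product. The following minimal sketch makes that reading concrete; it assumes that the nullability of each factor $E_i$ is already known and passed in as a boolean, and the function names are hypothetical and do not come from the paper.

```python
from typing import Sequence


def null_first_and_last(nullable: Sequence[bool]) -> bool:
    """True iff ε ∈ L(E1) and ε ∈ L(En); nullable[i] tells whether ε ∈ L(E_{i+1})."""
    return len(nullable) > 0 and bool(nullable[0] and nullable[-1])


def null_first_or_last(nullable: Sequence[bool]) -> bool:
    """True iff ε ∈ L(E1) or ε ∈ L(En); nullable[i] tells whether ε ∈ L(E_{i+1})."""
    return len(nullable) > 0 and bool(nullable[0] or nullable[-1])
```

For a product of three factors where only the first is nullable, `null_first_and_last([True, False, False])` is false while `null_first_or_last([True, False, False])` is true, so the choice of predicate changes which products are treated as nullable.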
References

[1] V. Antimirov, Partial derivatives of regular expressions and finite automaton constructions, Theor. Comput. Sci. 155 (1996) 291–319.
[2] G. Berry, R. Sethi, From regular expressions to deterministic automata, Theor. Comput. Sci. 48 (1) (1986) 117–126.
[3] J. Berstel, J.-E. Pin, Local languages and the Berry–Sethi algorithm, Theor. Comput. Sci. 155 (2) (1996) 439–446.
[4] J.A. Brzozowski, Derivatives of regular expressions, J. ACM 11 (4) (1964) 481–494.
[5] A. Brüggemann-Klein, Regular expressions into finite automata, Theor. Comput. Sci. 120 (1993) 117–126.
[6] P. Caron, J.-M. Champarnaud, L. Mignot, Multi-tilde–bar expressions and their automata, Acta Inform. 49 (6) (2012) 413–436.
[7] P. Caron, J.-M. Champarnaud, L. Mignot, Acyclic automata and small expressions using multi-tilde–bar operators, Theor. Comput. Sci. 411 (38–39) (2010) 3423–3435.
[8] J.-M. Champarnaud, E. Laugerotte, F. Ouardi, D. Ziadi, From regular weighted expressions to finite automata, Int. J. Found. Comput. Sci. 15 (5) (2004) 687–700.
[9] J.-M. Champarnaud, F. Nicart, D. Ziadi, From the ZPC structure of a regular expression to its follow automaton, Int. J. Appl. Cryptogr. 16 (1) (2006) 17–34.
[10] J.-M. Champarnaud, D. Ziadi, Canonical derivatives, partial derivatives and finite automaton constructions, Theor. Comput. Sci. 289 (1) (2002) 137–163.
[11] J.-M. Champarnaud, J.-L. Ponty, D. Ziadi, From regular expressions to finite automata, Int. J. Comput. Math. 72 (1999) 415–431.
[12] Y. Chen, A new algorithm for computing transitive closures, in: ACM Symposium on Applied Computing, 2004, pp. 1091–1092.
[13] V.M. Glushkov, The abstract theory of automata, Russ. Math. Surv. 16 (1961) 1–53.
[14] L. Ilie, S. Yu, Follow automata, Inf. Comput. 186 (1) (2003) 140–162.
[15] A. Khorsi, F. Ouardi, D. Ziadi, Fast equation automaton computation, J. Discrete Algorithms 6 (3) (2008) 433–448.
[16] S. Kleene, Representation of events in nerve nets and finite automata, in: Automata Studies, in: Ann. Math. Stud., vol. 34, Princeton U. Press, 1956, pp. 3–41.
[17] R.F. McNaughton, H. Yamada, Regular expressions and state graphs for automata, IEEE Trans. Electron. Comput. 9 (March 1960) 39–57.
[18] L. Mignot, Des Codes Barres pour les langages rationnels, PhD thesis, LITIS, Université de Rouen, France, 2010, available online http://ludovicmignot.free.fr.
[19] D. Ziadi, J.-L. Ponty, J.-M. Champarnaud, Passage d’une expression rationnelle à un automate fini non-déterministe, in: Journées Montoises, Bull. Belg. Math. Soc. 4 (1997) 177–203.