Identifying Regular Languages over Partially-Commutative Monoids

Claudio Ferretti - Giancarlo Mauri
Dipartimento di Scienze dell'Informazione, Università di Milano
via Comelico 39, 20135 Milano, ITALY

fferretti,[email protected]

NeuroCOLT Technical Report Series NC-TR-95-030, May 1995. Produced as part of the ESPRIT Working Group in Neural and Computational Learning, NeuroCOLT 8556

NeuroCOLT Coordinating Partner

Department of Computer Science, Egham, Surrey TW20 0EX, England. For more information contact John Shawe-Taylor at the above address or email [email protected]

Received March 7, 1995


Abstract

We define a new technique useful in identifying a subclass of regular languages defined on a free partially commutative monoid (regular trace languages), using equivalence and membership queries. Our algorithm extends an algorithm defined by Dana Angluin in 1987 to learn DFAs. The words of a trace language can be seen as equivalence classes of strings. We show how to extract, from a given equivalence class, a string of an unknown underlying regular language. These strings can drive the original learning algorithm, which identifies a regular string language that also defines the target trace language. In this way the algorithm applies also to classes of unrecognizable regular trace languages and, as a corollary, to a class of unrecognizable string languages. We also discuss bounds on the number of examples needed to identify the target language and on the time required to process them.

1 Introduction

The problem of learning languages from examples has been studied by many authors, starting from the classical work by Gold [Gol67]. In recent years, Dana Angluin gave many results about the learnability of regular languages, especially in [Ang87], where an efficient algorithm to learn deterministic finite automata by membership and equivalence queries with counterexamples was given. We call this algorithm DFAL, and we will show how to use it to learn languages not represented by automata. Other results on the learnability of regular languages are reviewed in [Pit89].

Recently, research on formal models for concurrent processes has underlined the importance of trace languages [Maz85], defined as subsets of a free partially commutative monoid (f.p.c.m.), and a theory of trace languages has been developed [BMS89, AR86], parallel to that of classical languages on free noncommutative monoids (string languages). As this formalism becomes widely used to model concurrency, it would be interesting to have a method to learn the corresponding trace languages. Given a concurrent environment, and accepting the constraints needed to model it by traces, a learning system could experiment on it to automatically obtain its formal description. But the problem of finding a learning system for trace languages is especially interesting from a theoretical point of view, as it involves some specific techniques and has seldom been studied.

A fundamental difference between trace languages and string languages is that regular trace languages are in general not recognized by a finite state automaton on the f.p.c.m., i.e. Kleene's theorem cannot be generalized to trace languages. As known results about the identification of regular languages are based on automata, they cannot be directly used on regular trace languages. In this work we explore a method to identify regular trace languages on a class of f.p.c. monoids.
In Section 2 we introduce the definition of regular trace languages and some useful notations. In Section 3 we describe a method to extract a regular language from a regular trace language, and in Section 4 we use it to state our main results. In Section 5 we discuss some relations with other recent works and some open problems.

2 Definitions and Notations

In the following $\Sigma$ denotes a finite alphabet, of cardinality $k$, and $\Sigma^*$ the free monoid generated by $\Sigma$. The empty word is denoted by $\epsilon$. A concurrence relation $\theta$ is a subset of $\Sigma \times \Sigma$, and $\equiv_\theta$ denotes the congruence relation of $\Sigma^*$ generated by the set $C_\theta = \{(ab, ba) \mid (a, b) \in \theta\}$. The quotient $M(\Sigma, \theta) = \Sigma^* / \equiv_\theta$ is the free partially commutative monoid associated with the concurrence relation $\theta$. An element of $M(\Sigma, \theta)$ is a trace, and can be seen as a set of strings. Given a string $s$, $[s]_\theta$ is the trace containing $s$; given a string language $L$, $[L]_\theta$ is the set of traces containing at least one string from $L$. Any subset $T$ of $M(\Sigma, \theta)$ will be said to be a trace language on $M(\Sigma, \theta)$.

As in the sequential case, the class RTL of regular trace languages on $M(\Sigma, \theta)$ can be defined as the least class containing the finite trace languages and closed with respect to set-theoretic union, concatenation and $(\cdot)^*$ closure of languages, the concatenation of two traces being the equivalence class of the concatenations of their strings. We know that these languages can always be defined by regular expressions on finite sets. Moreover, it can be shown that a trace language $T$ is regular if and only if there is a regular string language $L$ such that $T = [L]_\theta$ [BMS89].

We will consider only the case in which $\theta$ is a transitive relation. As a consequence, the maximal cliques of the graph associated to the concurrence relation induce a partition on $\Sigma$. Two letters will be in the same element of this partition if and only if they are nodes of the same maximal clique in the graph associated to $\theta$, i.e., if and only if they commute in $C_\theta$. For instance, consider $\Sigma = \{a, b, c, x, y\}$ and $\theta = \{(a, b), (a, c), (b, c), (x, y)\}$. The corresponding graph on $\Sigma$ is:

[Figure: the graph of $\theta$ on the vertices a, b, c, x, y.]

and the cliques are $\{a, b, c\}$ and $\{x, y\}$. Then we can define in a unique way the alphabet as a partition: $\Sigma = \bigcup_{i=1}^{n} c_i$, where $n$ is the number of maximal cliques in the graph of $\theta$. The term clique is from now on extended to the elements $c_i$ of the partition on $\Sigma$, when not explicitly referred to the graph of $\theta$. Under these conditions we can prove a useful result, Theorem 1, in which letters from $\Sigma$ are grouped as the variables in a usual algebraic monomial.
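The partition of the alphabet into cliques can be computed mechanically. The following is a minimal sketch, assuming the concurrence relation is given as a list of pairs and is transitive (so that connected components of its graph coincide with maximal cliques); the function name `clique_partition` is a hypothetical name chosen for illustration and does not appear in the report.

```python
def clique_partition(alphabet, theta):
    """Partition `alphabet` into the classes induced by a transitive relation `theta`.

    Assumption: `theta` is treated as symmetric, and transitivity guarantees
    that each connected component of its graph is a maximal clique.
    """
    neighbours = {a: {a} for a in alphabet}
    for a, b in theta:
        neighbours[a].add(b)
        neighbours[b].add(a)

    cliques, seen = [], set()
    for letter in sorted(alphabet):
        if letter in seen:
            continue
        # Depth-first search collects the whole component of `letter`.
        component, stack = set(), [letter]
        while stack:
            x = stack.pop()
            if x not in component:
                component.add(x)
                stack.extend(neighbours[x] - component)
        seen |= component
        cliques.append(frozenset(component))
    return cliques


# The example relation of this section.
theta = [("a", "b"), ("a", "c"), ("b", "c"), ("x", "y")]
print([sorted(c) for c in clique_partition("abcxy", theta)])
```

On the example relation this yields the two classes {a, b, c} and {x, y}.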


Theorem 1 Each trace $t$ in $M(\Sigma, \theta)$, with transitive $\theta$, can be uniquely represented as a sequence of monomials $t_1 \ldots t_m$, where each monomial contains letters of a unique clique, and any two adjacent monomials are from different cliques.

Proof. Any string of the trace $t$ can be divided into substrings by grouping consecutive letters from the same clique. Any other string in $t$ will have these groups of letters in the same order, even if internally permuted, as in $C_\theta$ they commute among themselves and with no other letter from the contiguous sequences. Given a group, any permutation in it will be possible, and will create a string in the same trace $t$. So we only count the occurrences of each letter in each group and give this number as degree to the letter itself, finally constituting a factor of one monomial for each group. □
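The construction in the proof can be sketched in a few lines. This is only an illustration under stated assumptions: the mapping `clique_of` from letters to clique indices and the function name `monomials` are hypothetical, and a monomial is represented as a pair (clique index, letter degrees).

```python
from collections import Counter


def monomials(s, clique_of):
    """Return the monomial sequence of the trace containing the string `s`.

    Consecutive letters from the same clique are grouped, and each group is
    reduced to the degree of every letter occurring in it.
    """
    result = []
    for letter in s:
        c = clique_of[letter]
        if result and result[-1][0] == c:
            result[-1][1][letter] += 1      # same clique: extend the current monomial
        else:
            result.append((c, Counter({letter: 1})))  # new clique: start a new monomial
    return result


# Cliques {a, b, c} and {x, y} from the example of Section 2.
clique_of = {"a": 0, "b": 0, "c": 0, "x": 1, "y": 1}

# Two strings of the same trace give the same sequence of monomials.
print(monomials("abcxyabb", clique_of) == monomials("cbayxbab", clique_of))
```

Two strings belong to the same trace exactly when their monomial sequences coincide, which is what the final comparison checks on one pair of strings.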

Given a monomial $t_i$, $|t_i|_{a_j}$ denotes the degree of $a_j$ in $t_i$, and $MCD(|t_i|)$ denotes the Maximum Common Divisor of the degrees of the letters appearing in $t_i$. Our results apply to the restricted class of regular trace languages of the following definition:

Definition 1 Consider the trace languages defined by a transitive $\theta$ and by regular expressions where, when an operation of the expression joins two different traces, the trailing letters of the first don't commute in $\theta$ with the leading letters of the second. We call them isolating regular expressions and isolating regular trace languages.

This means that in such a regular expression the joining of different traces never mixes letters, while mixing is still allowed when concatenating one or more copies of the same trace. Briefly, no different words are concatenated on letters of the same clique. This is the subclass of RTL we are going to study, as it offers a way to extract strings with interesting properties from each trace. In other words, we allow any concatenation between traces that doesn't mix the joining letters, and any concatenation that mixes letters between two copies of the same trace. The subclass includes both recognizable and unrecognizable regular trace languages. Given $\Sigma = \{a, b, x\}$ and $\theta = \{(a, b)\}$, isolating languages are: $[ax\ \ axb^*]_\theta$, $[\{ab\}^*]_\theta$.

3 Choosing a String

The learning algorithm we will use doesn't work directly on traces. It receives one string from the equivalence class of strings constituting a given trace, chosen as specified in the following definition:

Definition 2 Given the monoid $M(\Sigma, \theta)$, with $\theta$ transitive, we can choose from any trace $t$, represented by the sequence of monomials $t_1 \ldots t_m$, the string $os(t) = s_1 \ldots s_m$ made in the following way: for each $t_i$ write the string $s_i = a_1^{p_1} a_2^{p_2} \cdots a_1^{p_1} a_2^{p_2} \cdots$, that is, the block $a_1^{p_1} a_2^{p_2} \cdots$ repeated $MCD(|t_i|)$ times, where $a_j$ is a letter and $p_j = |t_i|_{a_j} / MCD(|t_i|)$. We call these strings ordered strings.


E.g., given that $(a, b)$ is in $\theta$, the trace $[aaabbbbbb]_\theta$ can be represented by the monomial $a^3 b^6$, and the corresponding ordered string is $abbabbabb$. The first key property of this rule is that the ordered string of a trace obtained by concatenating any number of copies of the same unknown trace is the concatenation of the ordered strings of the repeated trace.

Lemma 1 If the trace $t$ is represented by a single monomial, and $os(t)$ is the ordered string of $t$, then $os(t \cdot t) = os(t) \cdot os(t)$.

Proof. The single monomial representing $t \cdot t$ will have each letter with twice the degree it has in $t$. Therefore the MCD is doubled too, and the exponents $p_i$ in the resulting ordered string will be the same. Then this string will be the concatenation of two copies of the ordered string of $t$. □
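The extraction rule of Definition 2 and the property of Lemma 1 can be checked on the example above. The sketch below makes two assumptions that are not in the text: letters inside a monomial are written in alphabetical order (Definition 2 leaves the order of the letters $a_1, a_2, \ldots$ implicit), and the helper names `ordered_string` and `clique_of` are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import gcd


def ordered_string(s, clique_of):
    """Return os([s]): the ordered string of the trace containing `s`."""
    groups = []  # one (clique, degrees) pair per monomial, as in Theorem 1
    for letter in s:
        c = clique_of[letter]
        if groups and groups[-1][0] == c:
            groups[-1][1][letter] += 1
        else:
            groups.append((c, Counter({letter: 1})))

    pieces = []
    for _, degrees in groups:
        d = reduce(gcd, degrees.values())          # MCD of the degrees
        # Assumption: letters of a monomial are listed alphabetically.
        block = "".join(a * (degrees[a] // d) for a in sorted(degrees))
        pieces.append(block * d)                   # block repeated MCD times
    return "".join(pieces)


clique_of = {"a": 0, "b": 0}
t = "aaabbbbbb"                                     # the monomial a^3 b^6
print(ordered_string(t, clique_of))                 # the ordered string of [t]
print(ordered_string(t + t, clique_of) == 2 * ordered_string(t, clique_of))
```

The second line printed is the instance of Lemma 1 for this particular monomial; it is a spot check, not a proof.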

When the concatenated trace is more complex we can state a weaker property: from any trace generated by the closure of a regular language, the ordered string we choose belongs to a slightly bigger regular language generating the same traces.

Lemma 2 If $\theta$ is transitive, $s = s_1 s_2 s_3$ is a string on $\Sigma$, with strings $s_1$ and $s_3$ containing letters from the same clique, and $s_2$ an ordered string with trailing and leading letters from cliques different from that of $s_1$ and $s_3$, then

$$[\{L \cup \{s_1 s_2 s_3\}\}^*]_\theta = [\{L \cup \{s_1 s_2 \{s_0 s_2\}^* s_3\}\}^*]_\theta,$$

where $s_0$ is the ordered string of $[s_3 s_1]_\theta$.

Proof. Clearly, $[s_0]_\theta = [s_3 s_1]_\theta$, and the inner closure adds strings to the language between square brackets, but doesn't add new traces to the trace language. Any trace $[\ldots s_1 s_2 s_3 s_1 s_2 s_3 \ldots]_\theta$, generated by the first language, will also contain the string $\ldots s_1 s_2 s_0 s_2 s_3 \ldots$, which belongs to the second language and is its ordered string. □

Given any regular expression for $T$ we can find an equivalent, w.r.t. $\equiv_\theta$, regular expression on $\Sigma^*$ made of ordered strings. This means that the ordered strings of traces of $T$ belong to a regular language $L$ such that $[L]_\theta = T$.
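A small instance of Lemma 2 can be checked with ordinary regular expressions. The sketch below is an illustration under assumptions: it takes $\Sigma = \{a, b, x\}$, $\theta = \{(a, b)\}$, $s_1 = a$, $s_2 = x$, $s_3 = b$ (so that $s_0 = os([ba]_\theta) = ab$), writes the letters of a monomial in alphabetical order, and uses the hypothetical helper name `ordered_string`.

```python
import re
from collections import Counter
from functools import reduce
from math import gcd


def ordered_string(s, clique_of):
    """Ordered string of the trace containing `s` (letters sorted inside a monomial)."""
    groups = []
    for letter in s:
        c = clique_of[letter]
        if groups and groups[-1][0] == c:
            groups[-1][1][letter] += 1
        else:
            groups.append((c, Counter({letter: 1})))
    pieces = []
    for _, degrees in groups:
        d = reduce(gcd, degrees.values())
        block = "".join(a * (degrees[a] // d) for a in sorted(degrees))
        pieces.append(block * d)
    return "".join(pieces)


clique_of = {"a": 0, "b": 0, "x": 1}
original = re.compile(r"(axb)*")      # {s1 s2 s3}*
enlarged = re.compile(r"ax(abx)*b")   # s1 s2 {s0 s2}* s3, with s0 = ab

for n in range(1, 6):
    os_n = ordered_string("axb" * n, clique_of)
    # The ordered string of [(axb)^n] always lies in the enlarged language,
    # while for n >= 2 it is not a string of the original one.
    print(n, os_n, bool(enlarged.fullmatch(os_n)), bool(original.fullmatch(os_n)))
```

For $n \geq 2$ the ordered string falls outside $\{axb\}^*$ but inside the enlarged expression, which is the reason the lemma enlarges the string language without changing the trace language.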

Theorem 2 Given an isolating regular trace language $T$ over a transitive concurrence relation $\theta$, there exists a regular language $L$ on $\Sigma^*$ such that $[L]_\theta = T$ and any ordered string extracted from a $t \in M(\Sigma, \theta)$ belongs to $L$ if and only if $t$ belongs to $T$.

Proof. Consider the regular expression that defines $T$ as being built from finite trace languages, applying to them a sequence of union, concatenation, and closure operations. We will build a regular expression on $\Sigma^*$ that defines a language $L$ satisfying our statement, by induction on the structure of the regular expression for $T$, using the properties stated for ordered strings over regular operations.

Any finite trace language $T'$ is clearly equal to $[L']_\theta$, where $L'$ is the finite language made of the ordered words of each trace in $T'$. Moreover, these words also immediately define the regular expression of $L'$.

Now suppose we have proved the theorem on two trace languages $T'$ and $T''$, having found regular languages $L'$ and $L''$ satisfying our statement, and their regular expressions. A trace language $T = T' \cup T''$ is easily verified to be equal to $[L' \cup L'']_\theta$, and the regular expression is defined consequently.

For a trace language $T = T' \cdot T''$ we consider three situations. $T$ being isolating, the traces of $T'$ and $T''$ in their regular expressions can: either join on letters of different cliques, and in this case the concatenation of the respective ordered strings will do for us, as no letter would flow in the concatenated string; or be the same trace $t = t_1 \ldots t_m$, but with $t_1$ and $t_m$ monomials on the same clique, in which case their concatenation will be represented by the sequence of monomials $t' = t_1 \ldots t_{m-1} t_0 t_2 \ldots t_m$, where $t_0$ is the monomial obtained by merging $t_m$ and $t_1$, and we can simply take the concatenation of the two original expressions and use the union of it and of the set containing only the ordered string of the new trace represented by $t'$; or be the same trace $t = t_1$, and by Lemma 1 we can keep only the concatenation of the original expressions.

When $T = T'^*$, it being a union of concatenations, we could have the same three cases of concatenation, and we could make the same considerations, but this time the operation being unbounded we have to act differently in the second situation. In this case, each string in the regular expression defining $L'$ of the form $s_1 s_2 s_3$ can be substituted, by Lemma 2, by the expression $s_1 s_2 (s_0 s_2)^* s_3$. In this way we obtain an expression defining a regular language $R$ such that $[R^*]_\theta = [L'^*]_\theta = T$. □









3.1 Some Observations

Our definition of isolating trace languages gives two conditions. The first is related to the concurrence relation on the alphabet: we require it to be transitive. Then, we are allowed to apply a result shown in [BMS83]. It states that for any regular trace language $T$, on a transitive $\theta$, there exists a regular string language $L$ with two properties: $T = [L]_\theta$, and in each trace of $T$ there is exactly one string of $L$. Unfortunately, it is easy to build such a string language when we have the regular expression on traces which defines $T$, but not when we are allowed to see only some traces. But we will see that it is sufficient to consider a string language different from the one just described, but w.r.t. which we know how to choose one positive string from each positive trace, even if this language can have more than one string in each trace.

The second condition defining isolating trace languages is related to the concatenation of traces where letters can flow from one trace to the other. Our condition states that in this case the two traces must be the same, and with this we show some nice properties for concatenation and for $(\cdot)^*$ closure. We could use a different property, requiring only that the concatenation of traces joining letters from the same clique should occur only on identical adjacent monomials. That is, $[abcxyabb]_\theta$ and $[abbxxyyy]_\theta$ can be concatenated, being joined on the same $ab^2$ monomial, for a given $\theta = \{(a, b), (a, c), (b, c), (x, y)\}$. In this case it would be easier to show that we can consider a regular language with only ordered strings, to describe $T$, using a property similar to that shown in Lemma 1.

4 Identifying Isolating Languages

We will learn an isolating target language $T$ using a Minimally Adequate Teacher (MAT) working on $T$ and a corresponding algorithm DFAL working on $\Sigma^*$ [Ang87] to identify a regular language $L$ such that $[L]_\theta = T$. DFAL identifies an automaton, and in this way it represents $L$ and then also the regular trace language $[L]_\theta$. An important fact is that some of the regular trace languages, also in the subclass of isolating regular trace languages, have no finite automaton recognizing them, as is the case for $[\{ab\}^*]_\theta$ (otherwise one could obtain by regular operations $\{a^n b^n \mid n \geq 0\}$, which is not regular).

Given any regular expression for $T$, by Theorem 2 we know that there exists an equivalent, w.r.t. $\equiv_\theta$, regular expression made of ordered strings. This means that the ordered strings of traces of $T$ belong to a regular language $L$ such that $[L]_\theta = T$. This same $L$ is the real target of DFAL.

Then, we can imagine a correct MAT for (isolating) regular trace languages. It will be able to answer the queries of the learning algorithm. These queries, we recall, give to it the representation of a trace language, to express an equivalence query, or an element of the domain, in this case a trace, to ask for membership. In its turn, the MAT answers with a trace and a label, when an equivalence query is not correct, or, even simpler, with only a label to a membership query.

The algorithm at the base of our design is DFAL, and it creates queries and expects answers only about strings and deterministic finite state automata. First of all we stress that a DFA can as well represent a regular trace language, as it represents a string language $L$ and $[L]_\theta$ is the corresponding trace language. What we will do is to define an algorithmic way to process the items coming from the two different domains, in their travelling between MAT and DFAL. From the first we receive counterexamples containing traces, from the second we receive membership queries for strings.

In the first case we will see how to choose a suitable string from the trace. In the second case, we are faced with an easier job. We only need to transform a string $s$ into a corresponding trace $[s]_\theta$. In detail, using algorithm DFAL there may be three different interactions with the MAT that require processing:


a membership query: DFAL gives a string $s$ and we can translate it to the representation of $[s]_\theta$, giving this to the MAT,

an equivalence query with a positive counterexample: we give to DFAL the ordered string of the counterexample trace, as it belongs to the target $L$, by Theorem 2, and no string in the trace is recognized by the current hypothesis language $L'$ unless $[L']_\theta = T$,

an equivalence query with a negative counterexample: we can give to DFAL a string of the counterexample trace that is recognized by the current hypothesis (this action may take time). This is correct because no string in the counterexample belongs to the target regular language, while at least one string of it belongs to the hypothesis. Our problem in this situation is that we are not guaranteed that this string, recognized by the hypothesis, is the ordered string of the counterexample trace. But the time required to find this string in a trace $[s]_\theta$, with $s$ of length $m$, is exponential in $m$, since there is in general a number exponential in $m$ of strings in the trace.

The fourth interaction, the equivalence query with no counterexample, requires no processing, as it states the successful identification of the target language.

The time needed by DFAL to identify a canonical DFA with $n$ states, after having received counterexamples of length at most $m$, has been shown to be polynomial in $n$ and in $m$ [Ang87]. Given the regular expression for a trace language $T$, let $n$ be the number of states of the minimal DFA recognizing a regular string language $L$ such that $[L]_\theta = T$. Our operations enlarge the corresponding regular expression as shown in Theorem 2, to build the regular expression of the language we are going to learn with our protocol. The operations that enlarge the regular expression can only add permutations of strings from the original expression, that is accepting paths in the automaton, and star closures, that is loops in the automaton.

But these additions cannot require more than a polynomial, in $n$, number of new states in the automaton recognizing the new regular language. The number of states of the resulting canonical DFA will then be polynomial in $n$, and we will have to learn it to identify $T$. Then, together with the observation on the time required to process a negative counterexample, we can state the following





Theorem 3 Isolating regular trace languages with transitive concurrence relation, and generated by a regular language recognized by a DFA of $n$ states, can be exactly identified in polynomial time in $n$ and in the length of positive counterexamples, but in general in exponential time in the length of negative counterexamples.
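The exponential term in Theorem 3 comes from the processing of negative counterexamples, where a string accepted by the current hypothesis has to be searched for inside the counterexample trace. The sketch below illustrates this brute-force search; it is not the report's procedure in any detailed sense. The names `strings_of_trace` and `accepted_representative` are hypothetical, and the hypothesis is modelled as an arbitrary Boolean predicate on strings (here a regular expression standing in for the hypothesis DFA).

```python
import re
from itertools import permutations, product


def strings_of_trace(s, clique_of):
    """Enumerate every string of the trace containing `s`.

    Each maximal run of letters from one clique may be permuted freely,
    so the number of strings can grow exponentially with len(s).
    """
    runs = []
    for letter in s:
        if runs and clique_of[runs[-1][0]] == clique_of[letter]:
            runs[-1] += letter
        else:
            runs.append(letter)
    per_run = [sorted({"".join(p) for p in permutations(run)}) for run in runs]
    for choice in product(*per_run):
        yield "".join(choice)


def accepted_representative(s, clique_of, hypothesis_accepts):
    """Return some string of the trace accepted by the hypothesis, or None."""
    for candidate in strings_of_trace(s, clique_of):
        if hypothesis_accepts(candidate):
            return candidate
    return None


clique_of = {"a": 0, "b": 0, "x": 1}

def hypothesis(w):
    # Stand-in for the current hypothesis automaton (an assumption of this sketch).
    return re.fullmatch(r"(ba)*x", w) is not None

print(len(list(strings_of_trace("aabbx", clique_of))))        # size of the trace
print(accepted_representative("aabbx", clique_of, hypothesis))
```

Positive counterexamples avoid this search, since their ordered string can be computed directly from the monomial representation of the trace.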

As a corollary to results on trace languages, we can use this method to learn string languages that are the union of the traces of a regular trace language. When the trace language is not recognizable, the union of its traces cannot be recognizable. Otherwise one could work on the automaton for the latter language to write an automaton for the former. But for our method of processing information flowing between the MAT and DFAL, having as target a trace language or the union of its traces makes no difference.

4.1 Problems with non-Isolating Languages

We now ask whether this technique applies only to isolating languages. Consider $\theta = \{(a, b), (b, a)\}$ and the regular trace language $[b]_\theta \cdot [ab]_\theta^*$. This is not an isolating language. The traces belonging to this language are represented by the monomials $a^n b^{n+1}$, and it is easy to see that the Maximum Common Divisor of the two exponents is always 1, and then $os([a^n b^{n+1}]_\theta) = a^n b^{n+1}$. This means that the ordered string of such traces will never be of the form $b(ab)^*$, and the set of the ordered strings of positive traces won't build a regular language able to drive DFAL.

We conjecture that without restricting ourselves to a subclass of regular trace languages any extraction rule will have similar troubles in collecting strings from a regular trace language into a regular string language. This trouble arises with the parts of regular expressions generating an unbounded number of traces belonging to the given language. These parts always involve some $(\cdot)^*$-expression. As our aim is to show that a regular string language containing the strings we choose, and defining the trace language we have to learn, does exist, we can always add our strings when they are finite in number. This is not the case if we try to add all the strings we choose when traces are generated by some $(\cdot)^*$ closure, as we should do in the case shown above, because the number of these strings is unbounded.
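The failure on this non-isolating language can be observed directly. The following sketch assumes the same conventions as Definition 2 with letters of a monomial written in alphabetical order; the helper name `ordered_string` is hypothetical. It computes the ordered strings of the traces $[b(ab)^n]_\theta$ for a few values of $n \geq 1$ and compares them with the expression $b(ab)^*$.

```python
import re
from collections import Counter
from functools import reduce
from math import gcd


def ordered_string(s, clique_of):
    """Ordered string of the trace containing `s` (letters sorted inside a monomial)."""
    groups = []
    for letter in s:
        c = clique_of[letter]
        if groups and groups[-1][0] == c:
            groups[-1][1][letter] += 1
        else:
            groups.append((c, Counter({letter: 1})))
    pieces = []
    for _, degrees in groups:
        d = reduce(gcd, degrees.values())
        block = "".join(a * (degrees[a] // d) for a in sorted(degrees))
        pieces.append(block * d)
    return "".join(pieces)


clique_of = {"a": 0, "b": 0}          # a and b commute
pattern = re.compile(r"b(ab)*")

for n in range(1, 7):
    s = "b" + "ab" * n                # a string of the trace a^n b^(n+1)
    os_n = ordered_string(s, clique_of)
    # gcd(n, n + 1) is 1, so the ordered string is a^n b^(n+1).
    print(n, gcd(n, n + 1), os_n, bool(pattern.fullmatch(os_n)))
```

The set of ordered strings obtained this way is $\{a^n b^{n+1}\}$, which is not a regular language, in agreement with the observation above.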







5 Discussion and Open Problems

We extend the applications of the DFAL algorithm [Ang87] to regular trace languages. Observe that other extensions have been considered in the literature: [BR87] apply it to a subclass of context-free languages, [Sak90] to bottom-up tree automata. The idea of learning regular expressions instead of automata, which would be useful in our context, is studied for expressions without union by [Bra93].

Our technique acts on the interaction between the hidden trace language and the DFAL algorithm. An interesting open problem is to refine the algorithm directly, to make use of the information we have about the structure of the extracted regular language, defined only by ordered strings. In this way we could give clear bounds on the efficiency of the learning system on this class of regular trace languages, trying to reduce the exponential dependence on the length $m$ of negative counterexamples.

References

[AR86] IJ. Aalbersberg, G. Rozenberg. Theory of traces. Theoretical Computer Science, 60:1-83, 1986.

[Ang87] D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106, 1987.

[BR87] P. Berman, R. Roos. Learning one-counter languages in polynomial time. In Proceedings of the Symposium on Foundations of Computer Science, 61-67, 1987.

[BMS83] A. Bertoni, G. Mauri, N. Sabadini. Unambiguous regular trace languages. Proceedings of Colloquia Mathematica Societatis Janos Bolyai, 113-123, 1983.

[BMS89] A. Bertoni, G. Mauri, N. Sabadini. Membership problems for regular and context-free trace languages. Information and Computation, 82:135-150, 1989.

[Bra93] A. Brazma. Efficient identification of regular expressions from representative examples. In Proceedings of the Computational Learning Theory Conference, 236-242, 1993.

[Gol67] E.M. Gold. Language identification in the limit. Inform. Contr., 10:447-474, 1967.

[Maz85] A. Mazurkiewicz. Semantics of concurrent systems: A modular fixed point trace approach. Lecture Notes in Comp. Sci., vol. 188, 353-375, Springer-Verlag, 1985.

[Pit89] L. Pitt. Inductive inference, DFAs, and computational complexity. In Proceedings of the Analogical and Inductive Inference Workshop, 18-, 1989.

[Sak90] Y. Sakakibara. Inductive inference of logic programs based on algebraic semantics. New Generation Computing, 7:365-, 1990.