Polynomial Time Learning of Some Multiple Context-Free Languages with a Minimally Adequate Teacher

Ryo Yoshinaka¹* and Alexander Clark²

¹ MINATO Discrete Structure Manipulation System Project, ERATO, Japan Science and Technology Agency
[email protected]
² Department of Computer Science, Royal Holloway, University of London
[email protected]

Abstract. We present an algorithm for the inference of some Multiple Context-Free Grammars from Membership and Equivalence Queries, using the Minimally Adequate Teacher model of Angluin. This is an extension of the congruence-based methods for learning some Context-Free Grammars proposed by Clark (ICGI 2010). We define the natural extension of the syntactic congruence to tuples of strings, and demonstrate that we can efficiently learn the class of Multiple Context-Free Grammars where the non-terminals correspond to the congruence classes under this relation.

1 Introduction

In this paper we look at efficient algorithms for the inference of multiple context-free grammars (mcfgs). Mcfgs are a natural extension of context-free grammars (cfgs), in which the non-terminal symbols derive tuples of strings, which are then combined with each other in a limited range of ways. Since some natural language phenomena were found not to be context-free, the notion of mildly context-sensitive languages was proposed and studied as a better way of describing natural languages while keeping tractability [1]. Mcfgs are regarded as a representative mildly context-sensitive formalism, which has many equivalent formalisms such as linear context-free rewriting systems [2], multicomponent tree adjoining grammars [3, 4], minimalist grammars [5], hyperedge replacement grammars [6], etc.

In recent years, there has been rapid progress in the field of context-free grammatical inference using techniques of distributional learning. The first result along these lines was given by Clark and Eyraud [7]. They showed a polynomial

* Ryo Yoshinaka is concurrently working in the Graduate School of Information Science and Technology, Hokkaido University. This work was supported in part by a Grant-in-Aid for Young Scientists (B-20700124) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

identification in the limit result from positive data alone for a class of languages called the substitutable context-free languages.

We assume we have a finite non-empty alphabet Σ and an extra hole or gap symbol □, which is not in Σ. A context is a string l□r with one hole, and the distribution of a string u in a language L is defined as L/u = { l□r | lur ∈ L }. There is a natural congruence relation, called the syntactic congruence, which is defined by u ≡L v iff L/u = L/v. The learning approach of Clark and Eyraud is based on the following observations:

Lemma 1. If u ≡L u′ then uw ≡L u′w. If u ≡L u′ and v ≡L v′ then uv ≡L u′v′.

This means that it is trivial to construct a cfg where the non-terminals correspond to congruence classes under this relation, as we can add rules of the form [uv] → [u][v]. All rules of this form will be valid since [u][v] ⊆ [uv] by Lemma 1.

Let us consider now the language Lc = { wcwc | w ∈ {a, b}∗ }. This is a slight variant of the classic copy language, which is not context-free. It is formally similar to the constructions in Swiss German which established that the class of natural languages is not included in the class of context-free languages, and it is thus a classic test case for linguistic representation [8, 9].

We will extend the approach by considering relationships not between strings, but between tuples of strings. Here, for concreteness, we consider the case of tuples of dimension 2, i.e. pairs; in the remainder of the paper we define this for tuples of arbitrary dimension. Let ⟨u, v⟩ be an ordered pair of strings. We define L/⟨u, v⟩ = { l□m□r | lumvr ∈ L } and we then define an equivalence relation between pairs of strings which is analogous to the syntactic congruence: ⟨u, v⟩ ≡L ⟨u′, v′⟩ iff L/⟨u, v⟩ = L/⟨u′, v′⟩. Note that when we look at Lc it is easy to see that ⟨c, c⟩ ≡Lc ⟨ac, ac⟩ ≡Lc ⟨bc, bc⟩, but that ⟨a, a⟩ is not congruent to ⟨b, b⟩, since for example □□caac is in L/⟨a, a⟩ but not in L/⟨b, b⟩.
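These definitions are easy to experiment with. The sketch below is our own illustration (the names `in_Lc`, `wrap`, `distribution` and `CTXS` are ours, not the paper's): it restricts the distribution of a pair to a finite set of candidate 2-contexts and checks that ⟨c, c⟩ and ⟨ac, ac⟩ agree on them, while one context separates ⟨a, a⟩ from ⟨b, b⟩.

```python
# Pair distributions in the copy-like language Lc = { wcwc | w in {a,b}* }.
# A 2-context is a triple (l, m, r); plugging in a pair <u, v> yields l u m v r.

def in_Lc(s: str) -> bool:
    """Membership in Lc = { wcwc | w in {a,b}* }."""
    if len(s) % 2 != 0:
        return False
    half = s[: len(s) // 2]
    return s == half * 2 and half.endswith("c") and set(half[:-1]) <= set("ab")

def wrap(ctx, pair):
    """Plug the pair <u, v> into the 2-context (l, m, r)."""
    l, m, r = ctx
    u, v = pair
    return l + u + m + v + r

def distribution(pair, contexts):
    """The finite restriction Lc/<u, v> ∩ contexts."""
    return {ctx for ctx in contexts if in_Lc(wrap(ctx, pair))}

# A few candidate 2-contexts; ("", "", "caac") encodes the context with two
# leading holes followed by caac.
CTXS = {("", "", ""), ("", "c", "c"), ("a", "ca", "c"), ("", "", "caac")}
```

Of course, agreement on a finite context set only approximates the congruence; the paper's learner refines such finite approximations with membership queries.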
Just as with Lemma 1 above, this congruence has interesting properties.

Lemma 2. If ⟨u, v⟩ ≡L ⟨u′, v′⟩ and ⟨x, y⟩ ≡L ⟨x′, y′⟩, then ⟨ux, vy⟩ ≡L ⟨u′x′, v′y′⟩.

Note that there is now more than one way of combining these two elements. If they are single strings, then there are only two ways: uv and vu. When there are multiple strings we can combine them in many different ways. For example, it is also the case that ⟨uxy, v⟩ ≡L ⟨u′x′y′, v′⟩, but it is not the case in general that ⟨uyx, v⟩ ≡L ⟨u′y′x′, v′⟩. In the specific case of Lemma 2 we can consider it as a function f : (Σ∗ × Σ∗)² → Σ∗ × Σ∗ defined as f(⟨u, v⟩, ⟨x, y⟩) = ⟨ux, vy⟩. If we use the notation w for tuples, then the lemma says that if u ≡L u′ and v ≡L v′, then f(u, v) ≡L f(u′, v′).

We can then generalise Lemma 2 to all functions f that satisfy a certain set of conditions; as we discuss formally later on, these must be linear regular and non-permuting. Our learning algorithm will work by constructing an mcfg where each nonterminal generates a congruence class of tuples of strings.

For our learning model, we will use in this paper the Minimally Adequate Teacher (mat) model introduced by Angluin in [10]. This model assumes the existence of a teacher who can answer two sorts of queries: first, the learner can ask Membership Queries (mqs), asking whether a given string is in the target language or not; secondly, the learner can ask Equivalence Queries (eqs), asking whether a particular hypothesis is correct or not, and if the hypothesis is not correct, the teacher gives a counter-example to the learner. The learner is required to use only a limited amount of computation before terminating: in particular, there must be a polynomial p such that the total amount of computation used is less than p(n, ℓ), where n is the size of the target representation and ℓ is the length of the longest counter-example returned by the teacher.

This model is rather abstract, but it has a number of important advantages that make it appropriate for this paper. First of all, it is a very restrictive model: there are strong negative results [11] showing that certain natural classes, such as regular grammars and cfgs, are not learnable under this mat paradigm. Secondly, it is also permissive: it allows reasonably large classes of languages to be learned. Angluin's famous lstar algorithm for the inference of regular languages is a classic and well-studied algorithm, and one of the great success stories of grammatical inference. Finally, if we look at the history of regular and context-free grammatical inference, learnability in the mat model appears to correspond well, as a substitute, with learnability under more realistic probabilistic models. Probabilistic dfas turn out to be pac-learnable under a very rigorous model [12], albeit stratified by certain parameters. Thus, though this model is unrealistic, it seems a good intermediate step, and it is technically easy to work with.

The main result of this paper is a polynomial mat learning algorithm for a class of mcfgs. In Section 2 we introduce our notation. Our learning algorithm is discussed in Section 3. We conclude this paper in Section 4.

2 Preliminaries

2.1 Basic Definitions and Notations

The set of non-negative integers is denoted by N and this paper will consider only numbers in N. The cardinality of a set S is denoted by |S|. If w is a string over an alphabet Σ, |w| denotes its length. ∅ is the empty set and λ is the empty string. Σ ∗ denotes the set of all strings over Σ. We write Σ + = Σ ∗ − {λ}, Σ k = { w ∈ Σ ∗ | |w| = k } and Σ ≤k = { w ∈ Σ ∗ | |w| ≤ k }. Any subset of Σ ∗ is called a language (over Σ). If L is a finite language, its size is defined

as ‖L‖ = |L| + Σw∈L |w|.

An m-word is an m-tuple of strings and we denote the set of m-words by (Σ∗)⟨m⟩. Similarly we define (·)⟨∗⟩, (·)⟨+⟩, (·)⟨≤m⟩. Any m-word is called a multiword. Thus (Σ∗)⟨∗⟩ denotes the set of all multiwords. For w = ⟨w1, . . . , wm⟩ ∈ (Σ∗)⟨m⟩, |w| denotes its length m and ‖w‖ denotes its size m + Σ1≤i≤m |wi|.

We use the symbol □, assuming that □ ∉ Σ, for representing a "hole", which is supposed to be replaced by another string. We write Σ□ for Σ ∪ {□}. A string x over Σ□ is called an m-context if x contains m occurrences of □. m-contexts are also called multicontexts. For an m-context x = x0□x1□ . . . □xm with x0, . . . , xm ∈ Σ∗ and an m-word y = ⟨y1, . . . , ym⟩ ∈ (Σ∗)⟨m⟩, we define

x ⊙ y = x0 y1 x1 . . . ym xm

and say that y is a sub-multiword of x ⊙ y. Note that □ is the empty 1-context and we have □ ⊙ ⟨y⟩ = y for any y ∈ Σ∗. For L ⊆ Σ∗ and p ≥ 1, we define

S≤p(L) = { y ∈ (Σ+)⟨≤p⟩ | x ⊙ y ∈ L for some x ∈ Σ□∗ },
C≤p(L) = { x ∈ Σ□∗ | x ⊙ y ∈ L for some y ∈ (Σ+)⟨≤p⟩ },

and for y ∈ (Σ∗)⟨∗⟩, we define

L/y = { x ∈ Σ□∗ | x ⊙ y ∈ L }.
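These operations are straightforward to implement. In the sketch below (our own code; the names `plug` and `sub_multiwords` are ours), an m-context is represented by the tuple of its m + 1 constant segments, so that x ⊙ y interleaves segments and components, and S≤p({w}) is enumerated by choosing up to p pairwise disjoint non-empty substrings of w from left to right.

```python
def plug(x, y):
    """x ⊙ y: x is an m-context given by its m + 1 segments, y an m-word."""
    assert len(x) == len(y) + 1
    out = [x[0]]
    for yi, xi in zip(y, x[1:]):
        out += [yi, xi]
    return "".join(out)

def sub_multiwords(w, p):
    """S≤p({w}): tuples of at most p disjoint non-empty substrings of w."""
    result = set()
    def rec(start, parts):
        if parts:
            result.add(tuple(parts))
        if len(parts) == p:
            return
        # choose the next substring w[i:j] strictly to the right of the last one
        for i in range(start, len(w)):
            for j in range(i + 1, len(w) + 1):
                rec(j, parts + [w[i:j]])
    rec(0, [])
    return result
```

For example, sub_multiwords("ab", 2) yields the four multiwords ⟨a⟩, ⟨b⟩, ⟨ab⟩ and ⟨a, b⟩, which is exactly the set K = S≤2({ab}) used in the example of Section 3.2.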

Obviously the computation of S≤p(L) can be done in O(‖L‖^{2p}) time if L is finite.

2.2 Linear Regular Functions

Let us suppose a countably infinite set Z of variables disjoint from Σ. A function f from (Σ∗)⟨m1⟩ × · · · × (Σ∗)⟨mn⟩ to (Σ∗)⟨m⟩ is said to be linear if there is ⟨α1, . . . , αm⟩ ∈ ((Σ ∪ { zij ∈ Z | 1 ≤ i ≤ n, 1 ≤ j ≤ mi })∗)⟨m⟩ such that each variable zij occurs at most once in ⟨α1, . . . , αm⟩ and

f(w1, . . . , wn) = ⟨α1[z := w], . . . , αm[z := w]⟩

for any wi = ⟨wi1, . . . , wimi⟩ ∈ (Σ∗)⟨mi⟩ with 1 ≤ i ≤ n, where αk[z := w] denotes the string obtained by replacing each variable zij with the string wij. We call f linear regular if every variable zij occurs in ⟨α1, . . . , αm⟩. We say that f is λ-free when no αk from ⟨α1, . . . , αm⟩ is λ. A linear regular function f is said to be non-permuting if zij always occurs to the left of zi(j+1) in α1 . . . αm for 1 ≤ i ≤ n and 1 ≤ j < mi. The rank rank(f) of f is defined to be n and the size size(f) of f is ‖⟨α1, . . . , αm⟩‖.

Example 1. Among the functions defined below, where a, b ∈ Σ, f is not linear, while g and h are linear regular. Moreover h is λ-free and non-permuting.

f(⟨z11, z12⟩, ⟨z21⟩) = ⟨z11z12, z11z21z11⟩,
g(⟨z11, z12⟩, ⟨z21⟩) = ⟨z12, z11bz21, λ⟩,
h(⟨z11, z12⟩, ⟨z21⟩) = ⟨a, z11bz21, z12⟩.
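These conditions are mechanical to check. In the sketch below (our own encoding, not the paper's), a linear function is given by its tuple ⟨α1, . . . , αm⟩, with a variable zij encoded as the tuple `("z", i, j)` and terminals as plain strings; `G_FN` and `H_FN` encode g and h from Example 1.

```python
def apply_fn(alphas, args):
    """Evaluate the linear function given by its alpha-tuple on args."""
    def expand(alpha):
        return "".join(args[s[1] - 1][s[2] - 1] if isinstance(s, tuple) else s
                       for s in alpha)
    return tuple(expand(a) for a in alphas)

def is_nonpermuting(alphas, arities):
    """z_ij must occur to the left of z_i(j+1) in alpha_1 ... alpha_m."""
    flat = [s for alpha in alphas for s in alpha if isinstance(s, tuple)]
    return all(flat.index(("z", i, j)) < flat.index(("z", i, j + 1))
               for i, mi in enumerate(arities, start=1)
               for j in range(1, mi))

def is_lambda_free(alphas):
    """No component of the alpha-tuple may be the empty string lambda."""
    return all(alpha for alpha in alphas)

# g permutes z11 and z12 and has a lambda component; h is well-behaved.
G_FN = ((("z", 1, 2),), (("z", 1, 1), "b", ("z", 2, 1)), ())
H_FN = (("a",), (("z", 1, 1), "b", ("z", 2, 1)), (("z", 1, 2),))
```

Note that `is_nonpermuting` assumes its input is linear regular (every variable occurs exactly once), matching the definition above.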

The following lemma is the generalization of Lemmas 1 and 2 mentioned in the introduction, and it underlies our approach.

Lemma 3. For any language L ⊆ Σ∗ and any u1, . . . , un, v1, . . . , vn ∈ (Σ∗)⟨∗⟩ such that |ui| = |vi| and L/ui = L/vi for all i, we have L/f(u1, . . . , un) = L/f(v1, . . . , vn) for any non-permuting linear regular function f.

Proof. Let mi = |ui| = |vi|. Suppose that x ∈ L/f(u1, . . . , un), i.e., x ⊙ f(u1, . . . , un) ∈ L. Writing □⟨m⟩ for the m-word all of whose components are □, the following chain of inferences is valid:

x ⊙ f(□⟨m1⟩, u2, . . . , un) ∈ L/u1 = L/v1
=⇒ x ⊙ f(v1, u2, . . . , un) ∈ L
=⇒ x ⊙ f(v1, □⟨m2⟩, u3, . . . , un) ∈ L/u2 = L/v2
=⇒ x ⊙ f(v1, v2, u3, . . . , un) ∈ L
=⇒ . . . =⇒ x ⊙ f(v1, . . . , vn) ∈ L.

Hence x ∈ L/f(v1, . . . , vn). ⊓⊔

2.3 Multiple Context-Free Grammars

A multiple context-free grammar (mcfg) is a tuple G = ⟨Σ, Vdim, F, P, I⟩, where

– Σ is a finite set of terminal symbols,
– Vdim = ⟨V, dim⟩ is a pair of a finite set V of nonterminal symbols and a function dim giving a positive integer, called a dimension, to each element of V,
– F is a finite set of linear functions (we identify a function with its name for convenience),
– P is a finite set of rules of the form A → f(B1, . . . , Bn) where A, B1, . . . , Bn ∈ V and f ∈ F maps (Σ∗)⟨dim(B1)⟩ × · · · × (Σ∗)⟨dim(Bn)⟩ to (Σ∗)⟨dim(A)⟩,
– I is a subset of V and all elements of I have dimension 1. Elements of I are called initial symbols.

We note that our definition of mcfgs is slightly different from the original [13], where grammars have exactly one initial symbol, but this change does not affect the generative capacity of mcfgs. We will simply write V for Vdim if no confusion occurs. If a rule has a function f, then its right hand side must have rank(f) occurrences of nonterminals by definition. If rank(f) = 0 and f() = v, we may write A → v instead of A → f(). If rank(f) = 1 and f is the identity, we may write A → B instead of A → f(B), where dim(A) = dim(B). The size ‖G‖ of G is defined as ‖G‖ = |P| + Σρ∈P size(ρ), where size(A → f(B1, . . . , Bn)) = size(f) + n + 1.

For A ∈ V, A-derivation trees are recursively defined as follows:

– If a rule π ∈ P has the form A → w, then π is an A-derivation tree for w. We call w the yield of π.
– If a rule π ∈ P has the form A → f(B1, . . . , Bn) and ti is a Bi-derivation tree for wi for all i = 1, . . . , n, then the tree whose root is π with immediate subtrees t1, . . . , tn from left to right, which is denoted by π(t1, . . . , tn), is an A-derivation tree for f(w1, . . . , wn), which is called the yield of π(t1, . . . , tn).

– Nothing else is an A-derivation tree.

For all A ∈ V we define

L(G, A) = { w ∈ (Σ∗)⟨dim(A)⟩ | there is an A-derivation tree for w }.

The language L(G) generated by G is the set { w ∈ Σ∗ | ⟨w⟩ ∈ L(G, S) for some S ∈ I }, which is called a multiple context-free language (mcfl). If S ∈ I, an S-derivation tree for ⟨w⟩ is simply called a derivation tree for w ∈ L(G). Two grammars G and G′ are equivalent if L(G) = L(G′).

This paper assumes that all linear functions from F are linear regular, λ-free and non-permuting. In fact those assumptions do not affect the generative capacity modulo λ [13, 14]. We denote by G(p, r) the collection of mcfgs G whose nonterminals are assigned a dimension at most p and whose functions have a rank at most r. Then we define L(p, r) = { L(G) | G ∈ G(p, r) }. We also write G(p, ∗) = ∪r∈N G(p, r) and L(p, ∗) = ∪r∈N L(p, r). The class of context-free grammars (cfgs) is identified with G(1, ∗), and all cfgs in Chomsky normal form are in G(1, 2). Thus L(1, 2) = L(1, ∗).

Example 2. Let G be the mcfg ⟨Σ, V, F, P, {S}⟩ over Σ = {a, b, c, d} whose rules are

π1 : S → f(A, B) with f(⟨z11, z12⟩, ⟨z21, z22⟩) = ⟨z11z21z12z22⟩,
π2 : A → g(A) with g(⟨z1, z2⟩) = ⟨az1, cz2⟩,
π3 : A → ⟨a, c⟩,
π4 : B → h(B) with h(⟨z1, z2⟩) = ⟨z1b, z2d⟩,
π5 : B → ⟨b, d⟩,

where V = {S, A, B} with dim(S) = 1, dim(A) = dim(B) = 2, and F consists of f, g, h and the constant functions appearing in the rules π3 and π5. One can see, for example, that aabccd ∈ L(G) thanks to the derivation tree π1(π2(π3), π5): we have ⟨a, c⟩ ∈ L(G, A) by π3, ⟨aa, cc⟩ ∈ L(G, A) by π2, ⟨b, d⟩ ∈ L(G, B) by π5 and ⟨aabccd⟩ ∈ L(G, S) by π1. We have L(G) = { a^m b^n c^m d^n | m, n ≥ 1 }.

Seki et al. [13] and Rambow and Satta [15] have investigated the hierarchy of mcfls.

Proposition 1 (Seki et al. [13], Rambow and Satta [15]). For p ≥ 1, L(p, ∗) ⊊ L(p + 1, ∗). For p ≥ 2, r ≥ 1, L(p, r) ⊊ L(p, r + 1) except for L(2, 2) = L(2, 3). For p ≥ 1, r ≥ 3 and 1 ≤ k ≤ r − 2, L(p, r) ⊆ L((k + 1)p, r − k).

Theorem 1 (Seki et al. [13], Kaji et al. [16]). Let p and r be fixed. It is decidable in O(‖G‖² |w|^{p(r+1)}) time whether w ∈ L(G) for any mcfg G ∈ G(p, r) and w ∈ Σ∗.

2.4 Congruential Multiple Context-Free Grammars

Now we introduce the class of languages that is our learning target.

Definition 1. We say that an mcfg G ∈ G(p, r) is p-congruential if for every nonterminal A and any u, v ∈ L(G, A), it holds that L(G)/u = L(G)/v.

In a p-congruential mcfg (cmcfg), one can merge two nonterminals A and B without changing the language whenever L(G)/u = L(G)/v for u ∈ L(G, A) and v ∈ L(G, B). We let CG(p, r) denote the class of cmcfgs from G(p, r) and CL(p, r) the corresponding class of languages. Our learning target is CL(p, r) for each p, r ≥ 1. The class CL(1, 2) corresponds to the congruential cfls introduced by Clark [17], which include all regular languages.

The grammar G in Example 2 is 2-congruential. It is easy to see that for any ⟨a^m, c^m⟩ ∈ L(G, A), we have

L(G)/⟨a^m, c^m⟩ = { a^i □ a^j b^n c^k □ c^l d^n | i + j = k + l ≥ 0 and n ≥ 1 },

which is independent of m. Similarly one sees that all elements of L(G, B) share the same set of multicontexts. Obviously L(G)/⟨a^m b^n c^m d^n⟩ = { □ } for any ⟨a^m b^n c^m d^n⟩ ∈ L(G, S).

Another example of a cmcfl is Lc = { wcwc | w ∈ {a, b}∗ } ∈ CL(2, 1). On the other hand, neither L1 = { a^m b^n | 1 ≤ m ≤ n } nor L2 = { a^n b^n | n ≥ 1 } ∪ { a^n b^2n | n ≥ 1 } is p-congruential for any p. We here explain why L2 ∉ CL(∗, ∗); a similar argument applies to L1. If an mcfg generates L2, it must have a nonterminal that derives multiwords u and v with different numbers of occurrences of a. Then it is not hard to see that u and v do not share the same set of multicontexts. However, { a^n b^n c | n ≥ 1 } ∪ { a^n b^2n d | n ≥ 1 } ∈ L(1, 1) ∩ CL(2, 1) − CL(1, ∗).

3 Learning of Congruential Multiple Context-Free Grammars with a Minimally Adequate Teacher

3.1 Minimally Adequate Teacher

Our learning model is based on Angluin's mat learning [18]. The learner has a teacher who answers two kinds of queries from the learner: membership queries (mqs) and equivalence queries (eqs). An instance of an mq is a string w over Σ, and the teacher answers "Yes" if w ∈ L(G∗) and "No" otherwise, where G∗ is the grammar of the learning target. An instance of an eq is a grammar Ĝ, and the teacher answers "Yes" if L(Ĝ) = L(G∗) and otherwise returns a counter-example w ∈ (L(G∗) − L(Ĝ)) ∪ (L(Ĝ) − L(G∗)). If w ∈ L(G∗) − L(Ĝ), it is called a positive counter-example, and if w ∈ L(Ĝ) − L(G∗), it is called a negative counter-example. The teacher is supposed to answer every query in constant time.

In this learning scheme, the learner is required to output a grammar Ĝ equivalent to the target G∗ in polynomial time in the size of G∗ and the length of a longest counter-example given by the teacher. We remark that we do not restrict an instance of an eq to belong to the target class of grammars; queries of this type are often called extended eqs, to emphasize the difference from the restricted type of eqs.

3.2 Hypotheses

Hereafter we arbitrarily fix two natural numbers p ≥ 1 and r ≥ 1. Let L∗ ⊆ Σ∗ be the target language from CL(p, r). Our learning algorithm computes grammars in G(p, r) from three parameters K ⊆ S≤p(L∗), X ⊆ C≤p(L∗) and L∗, where K and X are always finite. Of course we cannot take L∗ as part of the input, but in fact a finite number of mqs is enough to construct the following mcfg G_r(K, X, L∗) = ⟨Σ, V, F, P, I⟩.

The set of nonterminal symbols is V = K, and we will write [[v]] instead of v to clarify that it denotes a nonterminal symbol (indexed with v). The dimension dim([[v]]) is |v|. The set of initial symbols is I = { [[⟨w⟩]] | ⟨w⟩ ∈ K and w ∈ L∗ }, where every element of I is of dimension 1. The set F of functions consists of the λ-free and non-permuting functions that appear in the definition of P. The rules of P are divided into the following two types:

– (Type I) [[v]] → f([[v1]], . . . , [[vn]]), if 0 ≤ n ≤ r, v, v1, . . . , vn ∈ K and v = f(v1, . . . , vn) for f λ-free and non-permuting;
– (Type II) [[u]] → [[v]], if L∗/u ∩ X = L∗/v ∩ X and u, v ∈ K,

where rules of the form [[v]] → [[v]] are of Type I and Type II at the same time, but they are in any case superfluous. Obviously the rules of Type II form an equivalence relation on V. One could merge the nonterminals in each equivalence class to compact the grammar, but the results are slightly easier to present in this non-compact form.

We want each nonterminal symbol [[v]] to derive v. Here the construction of I appears to be trivial: initial symbols derive elements of K if and only if they are in the language L∗. This property is realized by the rules of Type I. For example, for p = r = 2 and K = S≤2({ab}), one has the following rules π1, . . . , π5 of Type I that have [[⟨a, b⟩]] on their left hand side:

π1 : [[⟨a, b⟩]] → ⟨a, b⟩,
π2 : [[⟨a, b⟩]] → fa([[⟨b⟩]]) with fa(⟨z⟩) = ⟨a, z⟩,
π3 : [[⟨a, b⟩]] → fb([[⟨a⟩]]) with fb(⟨z⟩) = ⟨z, b⟩,
π4 : [[⟨a, b⟩]] → g([[⟨a⟩]], [[⟨b⟩]]) with g(⟨z1⟩, ⟨z2⟩) = ⟨z1, z2⟩,
π5 : [[⟨a, b⟩]] → [[⟨a, b⟩]],

where π1 indeed derives ⟨a, b⟩, while π5 is superfluous. Instead of deriving ⟨a, b⟩ directly by π1, one can derive it in two steps with π3 and π6 : [[⟨a⟩]] → ⟨a⟩ (or with π2 and π7 : [[⟨b⟩]] → ⟨b⟩), or in three steps by π4, π6 and π7. One may regard an application of a rule of Type I as a decomposition of the multiword that appears on its left hand side. It is easy to see that there are finitely many rules of Type I, because K is finite and the nonterminals on the right hand side of a rule are all λ-free sub-multiwords of the one on the left hand side. If the grammar had only rules of Type I, it would derive all and only the elements of I.

The intuition behind the rules of Type II is explained as follows. If L∗/u = L∗/v, then L∗ is closed under exchanging occurrences of u and v in any string, and such an exchange is realized by the two symmetric rules [[u]] → [[v]] and [[v]] → [[u]] of Type II. However, the algorithm cannot check whether L∗/u = L∗/v in finitely many steps, while it can approximate this relation with L∗/u ∩ X = L∗/v ∩ X by mqs, because X is finite. Clearly L∗/u = L∗/v implies that L∗/u ∩ X = L∗/v ∩ X, but the converse is not true. We say that a rule [[u]] → [[v]] of Type II is incorrect (with respect to L∗) if L∗/u ≠ L∗/v.

In this construction of G_r(K, X, L∗), the initial symbols are determined by K and L∗ and the rules of Type I are constructed solely from K, while X is used only for determining the rules of Type II. Our algorithm monotonically increases K, which monotonically increases the set of rules, and monotonically increases X, which decreases the set of rules of Type II.
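The approximation L∗/u ∩ X = L∗/v ∩ X amounts to comparing rows of membership bits. The sketch below is our own code (all names are ours, and the membership function stands in for mqs); the hole positions in the contexts of X are our reconstruction, chosen to reproduce the observation table of the example in Section 3.3 for L∗ = { a^m b^n c^m d^n | m + n ≥ 1 }.

```python
import re

def plug(ctx, word):
    """ctx ⊙ word, with an m-context given by its m + 1 segments."""
    out = [ctx[0]]
    for w, c in zip(word, ctx[1:]):
        out += [w, c]
    return "".join(out)

def member(s):
    """mq oracle for L* = { a^m b^n c^m d^n | m + n >= 1 }."""
    m = re.fullmatch(r"(a*)(b*)(c*)(d*)", s)
    return bool(m) and len(m.group(1)) == len(m.group(3)) \
        and len(m.group(2)) == len(m.group(4)) and len(s) > 0

def type2_rules(K, X, member):
    """All pairs (u, v), u != v, with L*/u ∩ X = L*/v ∩ X, found by mqs."""
    def row(u):
        return frozenset(x for x in X
                         if len(x) == len(u) + 1 and member(plug(x, u)))
    rows = {u: row(u) for u in K}
    return {(u, v) for u in K for v in K
            if u != v and len(u) == len(v) and rows[u] == rows[v]}

K = {("abcd",), ("a", "c"), ("b", "d"), ("ab", "cd"), ("aab", "ccd")}
X = {("", ""),            # the empty 1-context
     ("a", "bc", "d"),    # the 2-context a _ bc _ d
     ("ab", "cd", "")}    # the 2-context ab _ cd _
```

On these parameters, ⟨a, c⟩, ⟨ab, cd⟩ and ⟨aab, ccd⟩ share a row and are connected by Type II rules, while ⟨b, d⟩ is kept apart.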

3.3 Observation Tables

The process of computing the rules of Type II can be handled by a collection of matrices, called observation tables. For each dimension m ≤ p, we have an observation table Tm. Let Km and Xm be the sets of m-words from K and m-contexts from X, respectively. The rows of the table Tm are indexed with the elements of Km and the columns are indexed with the elements of Xm. For each pair u, v ∈ Km, to compare the sets L∗/u ∩ X and L∗/v ∩ X, one needs to know whether x ⊙ u ∈ L∗ or not for all x ∈ Xm. The membership of x ⊙ u is recorded in the corresponding entry of the observation table with the aid of an mq. By comparing the entries of the rows indexed with u and v, one can determine whether the grammar should have the rule [[u]] → [[v]].

Example 3. Let p = 2, r = 1 and L∗ = { a^m b^n c^m d^n | m + n ≥ 1 } ∈ CL(2, 1). Indeed L∗ is generated by the cmcfg G∗ ∈ CG(2, 1) whose rules are

S → f(E) with f(⟨z1, z2⟩) = ⟨z1z2⟩,
E → ga(E) with ga(⟨z1, z2⟩) = ⟨az1, cz2⟩,
E → gb(E) with gb(⟨z1, z2⟩) = ⟨z1b, z2d⟩,
E → ⟨a, c⟩,
E → ⟨b, d⟩,

and whose only initial symbol is S.

Let Ĝ = G_1(K, X, L∗) for

K = { ⟨abcd⟩, ⟨a, c⟩, ⟨b, d⟩, ⟨ab, cd⟩, ⟨aab, ccd⟩ },
X = { □, a□bc□d, ab□cd□ }.

We have the following rules of Type I:

π1 : [[⟨abcd⟩]] → ⟨abcd⟩,
π2 : [[⟨abcd⟩]] → f([[⟨ab, cd⟩]]),
π3 : [[⟨a, c⟩]] → ⟨a, c⟩,
π4 : [[⟨b, d⟩]] → ⟨b, d⟩,
π5 : [[⟨ab, cd⟩]] → ⟨ab, cd⟩,
π6 : [[⟨ab, cd⟩]] → gb([[⟨a, c⟩]]),
π7 : [[⟨ab, cd⟩]] → ga([[⟨b, d⟩]]),
π8 : [[⟨aab, ccd⟩]] → ⟨aab, ccd⟩,
π9 : [[⟨aab, ccd⟩]] → ga([[⟨ab, cd⟩]]),
π10 : [[⟨aab, ccd⟩]] → gaa([[⟨b, d⟩]]) with gaa(⟨z1, z2⟩) = ⟨aaz1, ccz2⟩,
π11 : [[⟨aab, ccd⟩]] → h1([[⟨a, c⟩]]) with h1(⟨z1, z2⟩) = ⟨z1ab, z2cd⟩,
π12 : [[⟨aab, ccd⟩]] → h2([[⟨a, c⟩]]) with h2(⟨z1, z2⟩) = ⟨z1ab, cz2d⟩,
π13 : [[⟨aab, ccd⟩]] → h3([[⟨a, c⟩]]) with h3(⟨z1, z2⟩) = ⟨az1b, z2cd⟩,
π14 : [[⟨aab, ccd⟩]] → h4([[⟨a, c⟩]]) with h4(⟨z1, z2⟩) = ⟨az1b, cz2d⟩,

where superfluous rules of the form [[v]] → [[v]] are suppressed. On the other hand we have the following observation tables:

T1:
              □
⟨abcd⟩        1

T2:
              a□bc□d   ab□cd□
⟨a, c⟩           1        0
⟨b, d⟩           1        1
⟨ab, cd⟩         1        0
⟨aab, ccd⟩       1        0

Thus the three nonterminals [[⟨a, c⟩]], [[⟨ab, cd⟩]] and [[⟨aab, ccd⟩]] can be identified, thanks to the corresponding rules of Type II:

ρ1 : [[⟨a, c⟩]] → [[⟨ab, cd⟩]],
ρ2 : [[⟨a, c⟩]] → [[⟨aab, ccd⟩]],
ρ3 : [[⟨ab, cd⟩]] → [[⟨aab, ccd⟩]],
ρ4 : [[⟨ab, cd⟩]] → [[⟨a, c⟩]],
ρ5 : [[⟨aab, ccd⟩]] → [[⟨a, c⟩]],
ρ6 : [[⟨aab, ccd⟩]] → [[⟨ab, cd⟩]].

The unique initial symbol of Ĝ is [[⟨abcd⟩]]. Let us see some derivations of Ĝ. It is clear that

⟨a, c⟩ ∈ L(Ĝ, [[⟨a, c⟩]]) = L(Ĝ, [[⟨ab, cd⟩]])

and thus ac ∈ L(Ĝ) by π3, ρ4, π2. Moreover, derivation trees of the form

ρ3(π9(. . . (ρ3(π9(ρ4(π3)))) . . . )),

where the pair ρ3, π9 is applied (m − 1) times, give us ⟨a^m, c^m⟩ ∈ L([[⟨ab, cd⟩]]) for all m ≥ 1. Similarly,

π6(ρ1(. . . (π6(ρ1(ρ3(π9(. . . (ρ3(π9(ρ4(π3)))) . . . ))))) . . . )),

where the pair π6, ρ1 is applied n times on top of such a tree, gives us ⟨a^m b^n, c^m d^n⟩ ∈ L([[⟨ab, cd⟩]]) for all n ≥ 0. Finally, by applying π2, we see a^m b^n c^m d^n ∈ L(Ĝ).

However, the rules ρ1, ρ2, ρ4, ρ5 of Type II, which involve [[⟨a, c⟩]], are all incorrect, because □ab□cd ∈ L∗/⟨a, c⟩ − L∗/⟨ab, cd⟩ = L∗/⟨a, c⟩ − L∗/⟨aab, ccd⟩. In fact the derivation tree π2(ρ3(π11(ρ1(π5)))), for instance, yields ababcdcd ∈ L(Ĝ) − L∗.

Lemma 4. One can compute Ĝ = G_r(K, X, L∗) in polynomial time in the sizes of K and X.

Proof. We first estimate the number of rules of Type I that have a fixed [[v]] on the left hand side, which are of the form [[v]] → f([[v1]], . . . , [[vn]]). Roughly speaking, this is the number of ways to decompose v into sub-multiwords v1, . . . , vn with the aid of a linear regular function f. Once one determines where the occurrence of each substring from v1, . . . , vn starts and ends in v, the function f is uniquely determined. We have at most pr substrings from v1, . . . , vn, hence the number of ways to determine such starting and ending points is at most ‖v‖^{2pr}. Thus, the number of rules of Type I is at most O(|K| ℓ^{2pr}), where ℓ is the maximal size of elements of K. Clearly the description size of each rule is at most O(ℓ). One can construct the observation tables by calling the oracle at most |K||X| times. Then one can determine the initial symbols and the rules of Type II in polynomial time. ⊓⊔

The next subsection discusses how our learner determines K and X.

3.4 Undergeneralization

The problems that we have to deal with are undergeneralization and overgeneralization. We first show when the data suffice to avoid undergeneralization.

Lemma 5. Let D ⊆ L∗ = L(G∗) be such that every rule of G∗ occurs in a derivation tree for some w ∈ D. Then, for any X such that □ ∈ X, we have L∗ ⊆ L(Ĝ), where Ĝ = G_r(S≤p(D), X, L∗).

Proof. Let G∗ = ⟨Σ, V∗, F∗, P∗, I∗⟩ and K = S≤p(D). By induction we show that if w ∈ L(G∗, A), then w ∈ L(Ĝ, [[v]]) for some v ∈ L(G∗, A) ∩ K. In particular, when A ∈ I∗, [[v]] will be an initial symbol of Ĝ, so this proves the lemma.

Suppose that π(t1, . . . , tn) is an A-derivation tree for w in G∗, where π is of the form A → f(B1, . . . , Bn) and ti is a Bi-derivation tree for wi for all i = 1, . . . , n. By the induction hypothesis, there are ui ∈ L(G∗, Bi) ∩ K such that wi ∈ L(Ĝ, [[ui]]) for all i. On the other hand, by the assumption on the rule π ∈ P∗, there is some w′ ∈ D that is derived using the rule π, which can be represented as

w′ = x ⊙ v = x ⊙ f(v1, . . . , vn),

where vi ∈ L(G∗, Bi) ∩ K for all i = 1, . . . , n and v = f(v1, . . . , vn) ∈ L(G∗, A) ∩ K. Then Ĝ has a rule

[[v]] → f([[v1]], . . . , [[vn]])

of Type I, and moreover the rules [[vi]] → [[ui]] of Type II for all i = 1, . . . , n, whatever X is, since G∗ is a cmcfg. By applying those rules of Types I and II to wi ∈ L(Ĝ, [[ui]]) for i = 1, . . . , n, we obtain w ∈ L(Ĝ, [[v]]) with v ∈ L(G∗, A) ∩ K. ⊓⊔

When our algorithm gets a positive counter-example w ∈ L∗ − L(Ĝ), it adds all multiwords in S≤p({w}) to K.

3.5 Overgeneralization

Overgeneralization is a little more difficult to treat. Because overgeneralization is caused by incorrect rules of Type II, we must add appropriate multicontexts to X in order to remove them.

Suppose that our conjecture Ĝ is a correct cmcfg for the target L∗. Then it must satisfy the following property:

– for any [[v]] ∈ V and any u ∈ L(Ĝ, [[v]]), it holds that L∗/u = L∗/v.

When we get a negative counter-example w from the teacher, this means that we have ⟨w⟩ ∈ L(Ĝ, [[⟨v⟩]]) for some initial symbol [[⟨v⟩]], but □ ∈ L∗/⟨v⟩ − L∗/⟨w⟩. The triple ([[⟨v⟩]], ⟨w⟩, □) is a witness for the fact that Ĝ does not satisfy the desired property above. It is not hard to see by Lemma 3 that in such a case Ĝ must have an incorrect rule of Type II which is used for deriving w. In order to find such an incorrect rule, we first parse the string w with Ĝ and get a derivation tree t for w. We then look for an incorrect rule in t that causes this violation by recursively searching t in a top-down manner, as described below.

Suppose that we have a pair (t, x) such that t is a [[v]]-derivation tree for u and x ∈ L∗/v − L∗/u. We have two cases.

Case 1. Suppose that t = π(t1, . . . , tn) for some rule π of Type I. Then there are [[v1]], . . . , [[vn]] ∈ V, u1, . . . , un ∈ (Σ+)⟨∗⟩ and f such that

– π is of the form [[v]] → f([[v1]], . . . , [[vn]]), i.e., v = f(v1, . . . , vn),
– u = f(u1, . . . , un),
– ti is a [[vi]]-derivation tree for ui for i = 1, . . . , n.

By the assumption we have

x ⊙ v = x ⊙ f(v1, . . . , vn) ∈ L∗,
x ⊙ u = x ⊙ f(u1, . . . , un) ∉ L∗.

This means that n cannot be 0. One can find k such that

x ⊙ f(v1, . . . , vk−1, vk, uk+1, . . . , un) ∈ L∗,
x ⊙ f(v1, . . . , vk−1, uk, uk+1, . . . , un) ∉ L∗.

That is, for x′ = x ⊙ f(v1, . . . , vk−1, □⟨|vk|⟩, uk+1, . . . , un), we have x′ ∈ L∗/vk − L∗/uk. We then recurse with (tk, x′).

Case 2. Suppose that t = π(t′) where π : [[v]] → [[v′]] is a rule of Type II. Then t′ is a [[v′]]-derivation tree for u. If x ⊙ v′ ∉ L∗, this means that [[v]] → [[v′]] is an incorrect rule, because x ∈ L∗/v − L∗/v′. We then add x to X in order to remove the incorrect rule [[v]] → [[v′]] and its converse [[v′]] → [[v]]. Otherwise, we know that x ∈ L∗/v′ − L∗/u. We then recurse with (t′, x).
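The recursive search just described can be sketched in code. The encoding below is ours, not the paper's: derivation trees are dicts, contexts are strings over Σ ∪ {□}, and `member` stands in for mqs on L∗ = { a^m b^n c^m d^n | m + n ≥ 1 }. Run on our reconstruction of the overgenerating derivation tree π2(ρ3(π11(ρ1(π5)))) from the example of Section 3.3, with the negative counter-example ababcdcd, it returns the witness □ab□cd.

```python
import re

def member(s):
    """mq oracle for L* = { a^m b^n c^m d^n | m + n >= 1 }."""
    m = re.fullmatch(r"(a*)(b*)(c*)(d*)", s)
    return bool(m) and len(m.group(1)) == len(m.group(3)) \
        and len(m.group(2)) == len(m.group(4)) and len(s) > 0

def plug(ctx, word):
    """Replace the holes of ctx, left to right, by the components of word."""
    parts = ctx.split("□")
    assert len(parts) == len(word) + 1
    out = parts[0]
    for w, p in zip(word, parts[1:]):
        out += w + p
    return out

def tree_yield(t):
    if t["type"] == 2:                     # Type II node [[v]] -> [[v']]
        return tree_yield(t["child"])
    if not t["children"]:                  # Type I leaf [[v]] -> v
        return t["v"]
    return t["f"](*[tree_yield(c) for c in t["children"]])

def find_context(t, x):
    """Invariant: x ⊙ v ∈ L* and x ⊙ u ∉ L*, for label v and yield u of t."""
    if t["type"] == 2:                     # Case 2
        if not member(plug(x, t["vprime"])):
            return x                       # x witnesses that [[v]] -> [[v']] is incorrect
        return find_context(t["child"], x)
    vs = [c["v"] for c in t["children"]]   # Case 1 (never reached at a leaf,
    us = [tree_yield(c) for c in t["children"]]  # where label = yield)
    for k in range(len(vs)):               # first k where membership flips
        if member(plug(x, t["f"](*vs[:k + 1], *us[k + 1:]))):
            hole = tuple("□" for _ in vs[k])
            newx = plug(x, t["f"](*vs[:k], hole, *us[k + 1:]))
            return find_context(t["children"][k], newx)

# Our encoding of the derivation tree pi2(rho3(pi11(rho1(pi5)))):
pi5 = {"type": 1, "v": ("ab", "cd"), "children": []}
rho1 = {"type": 2, "v": ("a", "c"), "vprime": ("ab", "cd"), "child": pi5}
pi11 = {"type": 1, "v": ("aab", "ccd"), "children": [rho1],
        "f": lambda z: (z[0] + "ab", z[1] + "cd")}
rho3 = {"type": 2, "v": ("ab", "cd"), "vprime": ("aab", "ccd"), "child": pi11}
pi2 = {"type": 1, "v": ("abcd",), "children": [rho3],
       "f": lambda z: (z[0] + z[1],)}
```

Starting from the empty context □, the search descends through ρ3 and π11 and stops at ρ1, whose target ⟨ab, cd⟩ fails the membership test under the accumulated context, which is exactly the multicontext to be added to X.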

We refer to this procedure, which computes a multicontext witnessing incorrect rules from a negative counter-example w, as FindContext(Ĝ, w). The size of the output x is not too large, because x is obtained from w by replacing some occurrences of nonempty sub-multiwords ui by vi ∈ K and an occurrence of a sub-multiword by holes. That is, |x| ≤ |w|ℓ for ℓ = max{ ‖v‖ | v ∈ K }. The procedure FindContext contains parsing as a subroutine; we can use any of the polynomial-time parsing algorithms for mcfgs proposed so far [13, 19].

Lemma 6. The procedure FindContext(Ĝ, w) works in polynomial time in |w| and ‖Ĝ‖.

3.6 Algorithm

We are now ready to describe the overall structure of our algorithm. The pseudocode is presented in Algorithm 1.

Algorithm 1 Learn CL(p, r)

  let K := ∅; X := {□}; Ĝ := G_r(K, X, L∗);
  while the teacher does not answer "Yes" to the eq on Ĝ do
      let w be the counter-example from the teacher;
      if w ∈ L∗ − L(Ĝ) then
          let K := K ∪ S≤p({w});
      else
          let X := X ∪ {FindContext(Ĝ, w)};
      end if
      let Ĝ := G_r(K, X, L∗);
  end while
  output Ĝ;
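The control flow of Algorithm 1 can be separated from the grammar construction. Below is a minimal skeleton of that loop (our own code: `construct`, `eq`, `member`, `sub_multiwords` and `find_context` are injected parameters, and the toy instantiation at the end is only there to exercise the loop on a trivial target, so the Type II machinery is never reached). A single mq on the counter-example suffices to classify it, since any counter-example lies in the symmetric difference of L∗ and L(Ĝ).

```python
def mat_learn(construct, eq, member, sub_multiwords, find_context):
    """Skeleton of Algorithm 1: K and X grow monotonically until eq says yes."""
    K, X = set(), {"□"}                  # start from the empty context only
    G = construct(K, X)
    while (w := eq(G)) is not None:      # None encodes the teacher's "Yes"
        if member(w):                    # w in L* - L(G): positive counter-example
            K |= sub_multiwords(w)
        else:                            # w in L(G) - L*: negative counter-example
            X.add(find_context(G, w))
        G = construct(K, X)
    return G

# Toy instantiation (illustration only): the target language is {"ab"} and a
# "grammar" is just the set of strings its initial symbols would derive.
target = lambda s: s == "ab"
construct = lambda K, X: frozenset(t[0] for t in K if len(t) == 1 and target(t[0]))
eq = lambda G: "ab" if "ab" not in G else None
subs = lambda w: {(w[i:j],) for i in range(len(w))
                  for j in range(i + 1, len(w) + 1)}
learned = mat_learn(construct, eq, target, subs, lambda G, w: None)
```

The real algorithm instantiates `construct` with G_r(K, X, L∗), `sub_multiwords` with S≤p, and `find_context` with FindContext.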

Let G∗ = ⟨Σ, V∗, F∗, P∗, I∗⟩ ∈ CG(p, r) be the target grammar.

Lemma 7. The algorithm receives a positive counter-example at most |P∗| times. The cardinality of K is bounded by O(|P∗| ℓ^{2p}), where ℓ is the length of a longest positive counter-example given so far.

Proof. The proof of Lemma 5 in fact shows that any of the rules of G∗ that are used for deriving positive counter-examples can be simulated by Ĝ. That is, whenever the learner gets a positive counter-example w from the teacher, there is a rule of G∗ that is used for deriving w but was not used for any of the previously presented positive counter-examples. Therefore the learner receives a positive counter-example at most |P∗| times. It is easy to see that |S≤p({w})| ∈ O(|w|^{2p}). Therefore |K| ∈ O(|P∗| ℓ^{2p}), where ℓ is the length of a longest positive counter-example given so far. ⊓⊔

Lemma 8. The number of times that the algorithm receives a negative counter-example and the cardinality of X are both bounded by O(|P∗| ℓ^{2p}).

Proof. Each time the learner receives a negative counter-example, one element is added to X and some incorrect rules of Type II are removed. That is, the number of equivalence classes induced by rules of Type II increases. Because there can be at most |K| equivalence classes in V, the learner expands X at most |K| times; that is, |X| ∈ O(|P∗|ℓ^{2p}) by Lemma 7. ⊓⊔

Theorem 2. Our algorithm outputs a grammar that derives the target language and halts in polynomial time in the description size of the target grammar G∗ ∈ CG(p, r) and the length ℓ of a longest counter-example given by the teacher.

Proof. By Lemmas 4 and 6, each time the learner receives a counter-example, it computes the next conjecture in polynomial time in ‖G∗‖ and ℓ. By Lemmas 7 and 8, it makes eqs at most polynomially many times in ‖G∗‖ and ℓ. All in all, it runs in polynomial time in ‖G∗‖ and ℓ. The correctness of the output of the algorithm is guaranteed by the teacher. ⊓⊔

3.7 Slight Enhancement

An easy modification properly expands the learnable class. We say that L ∈ L(p, r) is almost p-congruential if $L$ ∈ CL(p, r), where $ is a new marker not in Σ. While every language in CL(p, r) is almost p-congruential, the converse does not hold. An example in the difference is Lcopy = { w w̄ ∈ {a, b, ā, b̄}∗ | w ∈ {a, b}∗ }, where ·̄ denotes the homomorphism that maps a, b to ā, b̄, respectively. One can easily modify our learning algorithm so that it learns all almost p-congruential languages in L(p, r), as done in [20].
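As a concrete illustration, membership in Lcopy is easy to decide directly, even though the language is not context-free. The names and the ASCII encoding below are ours: we render ā, b̄ as "A", "B".

```python
BAR = {"a": "A", "b": "B"}  # ASCII stand-ins for the barred copies of a, b

def in_copy_language(s):
    """Decide membership in Lcopy = { w w-bar | w in {a, b}* }."""
    if len(s) % 2:
        return False
    half = len(s) // 2
    w, w_bar = s[:half], s[half:]
    return (all(c in BAR for c in w)
            and w_bar == "".join(BAR[c] for c in w))
```

A grammar for Lcopy needs a nonterminal deriving the pairs (w, w̄) as 2-multiwords, which is exactly what the mcfg formalism provides.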

4 Conclusion

This work is based on a combination of the mat algorithm for learning congruential cfgs presented in [17] and the algorithm for learning some sorts of mcfgs from positive data and mqs in [20].

The conjecture Ĝ of our algorithm is not always consistent with the observation tables, in the sense that it might be the case that L(Ĝ) ∩ (X ⊙ K) ≠ L∗ ∩ (X ⊙ K), even though Ĝ is computed from the observation tables. If one finds x ∈ X and y ∈ K such that x ⊙ y ∈ L∗ − L(Ĝ), then x ⊙ y can be seen as a positive counter-example. Symmetrically, if x ⊙ y ∈ L(Ĝ) − L∗, it is a negative counter-example. One can modify our algorithm so that the conjecture is always consistent by substituting this self-diagnosis method for eqs to the teacher, which would be used only when the learner by itself does not find any inconsistency. Verification of the consistency of the conjecture can be done in polynomial time. However, we do not incorporate this idea into our algorithm, because each time the learner computes a counter-example, the observation tables are expanded, and this might cause exponential-time computation, since the size of a counter-example computed by the learner itself is not considered to be a parameter of the input size.

Lemma 3 offers another method for finding incorrect rules in some cases without getting a negative counter-example from the teacher. That is, if there are u, u1, . . . , un, v, v1, . . . , vn ∈ K such that u = f(u1, . . . , un) and v = f(v1, . . . , vn) for some linear regular function f, L∗/ui ∩ X = L∗/vi ∩ X for all i, and L∗/u ∩ X ≠ L∗/v ∩ X, then one knows that [[uk]] → [[vk]] is an incorrect rule for some k. Yet we do not take this idea into our algorithm, for the same reason as discussed in the previous paragraph.

In Clark et al. [21] the approach uses representational primitives that correspond not to congruence classes but to minimal elements of the syntactic concept lattice:

  { v | L/v ⊇ L/u }    (1)

An advantage of using congruence classes is that any element of the class will be as good as any other element; if we use the idea of Equation 1, then we cannot be sure that we will get the right elements, and it is difficult to get a pure mat result.

Distributional lattice grammars (dlgs) are another interesting development in the field of rich language classes that are also learnable [22]. While these models can represent some non-context-free languages, it is not yet clear whether they are rich enough to account for the sorts of cross-serial dependency that occur in natural languages; they are basically context-free grammars with an additional operation, somewhat analogous to that of Conjunctive Grammars [23].

We suggest two directions for future research. First, a pac-learning result along the lines of the pac result for nts languages in [24] seems like a natural extension, since mcfgs have a natural stochastic variant. Secondly, combining dlgs and mcfgs might be possible; we would need to replace the single concatenation operation in dlgs with a more complex range of operations that reflect the variety of ways that tuples of strings can be combined.

We also note that, given the well-known relationship between synchronous cfgs and mcfgs [25], this algorithm gives as a special case the first learning algorithm for synchronous context-free grammars, and therefore for syntax-directed translation schemes.
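The encoding behind that observation can be shown in miniature. The toy rule below is ours, not from the paper: a pair of synchronized cfg rules becomes a single mcfg rule over 2-tuples of strings.

```python
def rule(t):
    # the synchronized pair <A -> a A b, A -> b A a> rendered as the
    # mcfg rule A(a x1 b, b x2 a) <- A(x1, x2) over 2-tuples
    x1, x2 = t
    return ("a" + x1 + "b", "b" + x2 + "a")

def derive(n):
    """Apply the rule n times starting from the base case A(eps, eps)."""
    t = ("", "")
    for _ in range(n):
        t = rule(t)
    return t
```

Here derive(2) yields the translation pair ("aabb", "bbaa"); a learner for mcfgs thus learns the underlying translation relation as a set of 2-multiwords.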
Given the current interest in syntax-based statistical machine translation, we intend to explore this particular class of algorithms further.

References

1. Joshi, A.K.: Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In Dowty, D.R., Karttunen, L., Zwicky, A., eds.: Natural Language Parsing. Cambridge University Press, Cambridge, MA (1985) 206–250
2. Vijay-Shanker, K., Weir, D.J., Joshi, A.K.: Characterizing structural descriptions produced by various grammatical formalisms. In: Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics, Stanford (1987) 104–111
3. Joshi, A.K., Levy, L.S., Takahashi, M.: Tree adjunct grammars. Journal of Computer and System Sciences 10(1) (1975) 136–163
4. Joshi, A.K.: An introduction to tree adjoining grammars. In Manaster-Ramer, A., ed.: Mathematics of Language. John Benjamins (1987)
5. Stabler, E.P.: Derivational minimalism. In Retoré, C., ed.: LACL. Volume 1328 of Lecture Notes in Computer Science, Springer (1996) 68–95
6. Engelfriet, J., Heyker, L.: The string generating power of context-free hypergraph grammars. Journal of Computer and System Sciences 43(2) (1991) 328–360
7. Clark, A., Eyraud, R.: Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research 8 (2007) 1725–1745
8. Huybrechts, R.A.C.: The weak inadequacy of context-free phrase structure grammars. In de Haan, G., Trommelen, M., Zonneveld, W., eds.: Van Periferie naar Kern. Foris, Dordrecht, Holland (1984)
9. Shieber, S.M.: Evidence against the context-freeness of natural language. Linguistics and Philosophy 8 (1985) 333–343
10. Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2) (1987) 87–106
11. Angluin, D., Kharitonov, M.: When won't membership queries help? Journal of Computer and System Sciences 50(2) (1995) 336–355
12. Clark, A., Thollard, F.: PAC-learnability of probabilistic deterministic finite state automata. Journal of Machine Learning Research 5 (2004) 473–497
13. Seki, H., Matsumura, T., Fujii, M., Kasami, T.: On multiple context-free grammars. Theoretical Computer Science 88(2) (1991) 191–229
14. Kracht, M.: The Mathematics of Language. Volume 63 of Studies in Generative Grammar. Mouton de Gruyter (2003) 408–409
15. Rambow, O., Satta, G.: Independent parallelism in finite copying parallel rewriting systems. Theoretical Computer Science 223(1–2) (1999) 87–120
16. Kaji, Y., Nakanishi, R., Seki, H., Kasami, T.: The universal recognition problems for parallel multiple context-free grammars and for their subclasses. IEICE Transactions on Information and Systems E75-D(7) (1992) 499–508
17. Clark, A.: Distributional learning of some context-free languages with a minimally adequate teacher. In: Proceedings of the 10th International Colloquium on Grammatical Inference (ICGI), Valencia, Spain (2010)
18. Angluin, D.: Learning regular sets from queries and counterexamples. Information and Computation 75(2) (1987) 87–106
19. Kanazawa, M.: A prefix-correct Earley recognizer for multiple context-free grammars. In: Proceedings of the Ninth International Workshop on Tree Adjoining Grammars and Related Formalisms (2008) 49–56
20. Yoshinaka, R.: Polynomial-time identification of multiple context-free languages from positive data and membership queries. In: Proceedings of the 10th International Colloquium on Grammatical Inference (ICGI), Valencia, Spain (2010)
21. Clark, A., Eyraud, R., Habrard, A.: A polynomial algorithm for the inference of context free languages. In Clark, A., Coste, F., Miclet, L., eds.: ICGI. Volume 5278 of Lecture Notes in Computer Science, Springer (2008) 29–42
22. Clark, A.: A learnable representation for syntax using residuated lattices. In: Proceedings of the 14th Conference on Formal Grammar, Bordeaux, France (2009)
23. Okhotin, A.: Conjunctive grammars. Journal of Automata, Languages and Combinatorics 6(4) (2001) 519–535
24. Clark, A.: PAC-learning unambiguous NTS languages. In: Proceedings of the 8th International Colloquium on Grammatical Inference (ICGI) (2006) 59–71
25. Melamed, I.D.: Multitext grammars and synchronous parsers. In: Proceedings of NAACL/HLT (2003) 79–86
