VIII Simposium on Pattern Recognition and Image Analysis. Vol.1, pp.119−126. Bilbao, 1999.
LEARNING OF STOCHASTIC CONTEXT-FREE GRAMMARS FROM BRACKETED CORPORA BY MEANS OF REESTIMATION ALGORITHMS

F. Amaya†, J.M. Benedí and J.A. Sánchez
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia (Spain). e-mail: {famaya, jmbenedi, jandreu}@dsic.upv.es
Abstract

The learning of stochastic context-free grammars from bracketed corpora by means of reestimation algorithms is explored in this work. Comparisons of the Inside-Outside algorithm (IO) and its bracketed version (IOb), and of the Viterbi-Score algorithm (VS) and its bracketed version (VSb), were carried out. A complete experiment with the Penn Treebank was performed in order to study the performance of the resulting models and the speed of convergence of the process. A combination of the IOb model with the VSb and VS models was also studied.
Keywords: Stochastic Context-Free Grammars, Inside-Outside Algorithm, Viterbi Score Algorithm, Bracketed Sentences
1 Introduction

The most widely-used method for learning Stochastic Context-Free Grammars (SCFGs) is based on the well-known Inside-Outside (IO) algorithm [2, 6, 3]. Unfortunately, the application of this algorithm presents important problems which are accentuated in real tasks: the time complexity per iteration and the large number of iterations that are necessary to converge. An alternative to the IO algorithm is an estimation algorithm based on the Viterbi score (VS algorithm) [8]. The convergence of the VS algorithm is faster than that of the IO algorithm, since the VS algorithm only considers the information obtained from the best derivation. However, the SCFGs obtained are not, in general, well learned. A modification of the IO algorithm which learns SCFGs from partially bracketed corpora has been proposed [9]. This algorithm only considers the derivations that are possible according to the partial parses defined on the bracketed corpus. The first results reported in [9] have shown that the convergence of this algorithm is significantly faster than that of the IO algorithm. Here, we explore estimation alternatives that exploit the structural information contained in bracketed corpora. We study the behaviour of the extension of the IO algorithm (IOb) proposed in [9] and, along with this idea, we describe a new algorithm based on the VS algorithm (VSb). We also present a complete experimental work on a part of the Wall Street Journal task processed in the Penn Treebank project [7], emphasizing the study of the number of iterations that are necessary to converge and the goodness of the estimated SCFGs.

Work partially supported by the Spanish CICYT under contracts TIC95/0884-C04 and TIC98/0423C06. † Granted by Universidad del Cauca, Popayán (Colombia).
In the following sections, the formulation of the extensions of the IO and VS algorithms is presented together with the notation used. Next, the experiments illustrating the behaviour of these algorithms are reported.
2 Notation and definitions

A Context-Free Grammar (CFG) $G$ is a four-tuple $(N, \Sigma, P, S)$, where $N$ is a finite set of non-terminal symbols, $\Sigma$ is a finite set of terminal symbols ($N \cap \Sigma = \emptyset$), $P$ is a finite set of rules of the form $A \to \alpha$ ($A \in N$ and $\alpha \in (N \cup \Sigma)^{+}$; we only consider grammars with no empty rules) and $S$ is the initial symbol ($S \in N$). A CFG in Chomsky Normal Form is a CFG in which the rules are of the form $A \to BC$ or $A \to a$ ($A, B, C \in N$ and $a \in \Sigma$). A left-derivation of $x \in \Sigma^{+}$ in $G$ is a sequence of rules $d_x = (p_1, p_2, \ldots, p_m)$, $m \geq 1$, such that $S \stackrel{p_1}{\Rightarrow} \alpha_1 \stackrel{p_2}{\Rightarrow} \alpha_2 \cdots \stackrel{p_m}{\Rightarrow} x$, where $\alpha_i \in (N \cup \Sigma)^{+}$, $1 \leq i \leq m-1$, and $p_i$ rewrites the left-most non-terminal of $\alpha_{i-1}$. The language generated by $G$ is defined as $L(G) = \{x \in \Sigma^{+} \mid S \stackrel{+}{\Rightarrow} x\}$.

A Stochastic Context-Free Grammar (SCFG) $G_s$ is defined as a pair $(G, q)$ where $G$ is a CFG and $q: P \to \,]0, 1]$ is a probability function of rule application such that $\forall A \in N: \sum_{\alpha \in (N \cup \Sigma)^{+}} q(A \to \alpha) = 1$. Let $d_x$ be a left-derivation (derivation from now on) of the string $x$; the expression $\mathrm{N}(A \to \alpha, d_x)$ represents the number of times that the rule $A \to \alpha$ has been used in the derivation $d_x$, and $\mathrm{N}(A, d_x)$ is the number of times that the non-terminal $A$ has been derived in $d_x$. We define the probability of the derivation $d_x$ of the string $x$ as $\Pr(x, d_x \mid G_s) = \prod_{(A \to \alpha) \in P} q(A \to \alpha)^{\mathrm{N}(A \to \alpha, d_x)}$. The probability of the string $x$ is defined as $\Pr(x \mid G_s) = \sum_{\forall d_x} \Pr(x, d_x \mid G_s)$. We define the probability of the best derivation of the string $x$ as $\widehat{\Pr}(x \mid G_s) = \max_{\forall d_x} \Pr(x, d_x \mid G_s)$, and we define the best derivation, $\hat{d}_x$, as the argument which maximizes this function. The language generated by $G_s$ is defined as $L(G_s) = \{x \in L(G) \mid \Pr(x \mid G_s) > 0\}$.
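For concreteness, the following minimal Python sketch (our own illustration, not part of the original paper) represents a CNF SCFG as a rule-probability table and computes the probability of a single derivation as the product of the probabilities of the rules it uses; the grammar shown is the one of Figure 1 below.

```python
# Illustrative sketch: a CNF SCFG as a rule table q and the probability of one
# left-derivation d_x as the product of the probabilities of its rules.
from math import prod

q = {                                  # rules are (A, (B, C)) or (A, (a,))
    ("S", ("A", "C")): 0.4, ("S", ("B", "D")): 0.4,
    ("S", ("A", "A")): 0.1, ("S", ("B", "B")): 0.1,
    ("C", ("S", "A")): 1.0, ("D", ("S", "B")): 1.0,
    ("A", ("a",)): 1.0,     ("B", ("b",)): 1.0,
}

def derivation_probability(derivation):
    """Pr(x, d_x | Gs) = product over the rules used in d_x of q(rule)."""
    return prod(q[rule] for rule in derivation)

# Left-most derivation of "aa": S => AA => aA => aa
d_aa = [("S", ("A", "A")), ("A", ("a",)), ("A", ("a",))]
print(derivation_probability(d_aa))    # 0.1 * 1.0 * 1.0 = 0.1
```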
Partially bracketed text
Informally, a partially bracketed corpus is a set of sentences annotated with parentheses marking constituent frontiers [9]. More precisely: given a bracketed corpus, a bracketed string is a pair $(x, B)$ where $x$ is the string and $B$ is the bracketing of $x$. Given the string $x = x_1 x_2 \ldots x_n$, a pair of integers $(i, j)$, $0 \leq i \leq j \leq n$, forms a span of $x$. A span $(i, j)$ delimits the substring $x_i \ldots x_j$. A bracketing $B$ of $x$ is a finite set of spans on $x$, $B = \{(i, j) \mid 0 \leq i \leq j \leq n\}$, which satisfies a consistency condition: no two spans overlap. The span $(i, j)$ overlaps $(k, l)$ if $i < k < j < l$ or $k < i < l < j$. Two bracketings of $x$ are compatible if their union is consistent. The bracketing of a derivation is defined as follows: let $(x, B)$ be a bracketed string and $d_x$ a derivation of $x$ with the SCFG $G_p$. If the SCFG does not have useless symbols, then every non-terminal that appears in every sentential form of the derivation derives a substring $x_i \ldots x_j$ of $x$, $1 \leq i \leq j \leq |x|$, and thus defines a span $(i, j)$. A derivation of $x$ is compatible with $B$ if all of its spans are compatible with $B$. Given a SCFG and a bracketed corpus, for each bracketed string $(x, B)$ we define the function:
$$c(i, j) = \begin{cases} 1 & \text{if } (i, j) \text{ does not overlap any } b \in B \\ 0 & \text{otherwise} \end{cases}$$
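As an illustration (our own sketch, not from the original paper), the overlap test and the compatibility function $c(i, j)$ can be written directly from the definitions above:

```python
def overlaps(span_a, span_b):
    """(i,j) overlaps (k,l) iff i < k < j < l or k < i < l < j."""
    i, j = span_a
    k, l = span_b
    return (i < k < j < l) or (k < i < l < j)

def c(i, j, bracketing):
    """c(i,j) = 1 if the span (i,j) does not overlap any b in B, 0 otherwise."""
    return 0 if any(overlaps((i, j), b) for b in bracketing) else 1

# Example: with B = {(1,3)}, the nested span (2,3) is compatible (c = 1),
# while the crossing span (2,4) is not (c = 0).
B = {(1, 3)}
print(c(2, 3, B), c(2, 4, B))  # 1 0
```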
3 Estimation from bracketed corpora
Bracketed IO algorithm
The Inside-Outside (IO) algorithm [2, 3, 6] used in the reestimation of the rule probabilities is modified to take advantage of the bracketing of a string [9]. Given a string $x = x_1 \ldots x_n$, if $e(A\langle i, j\rangle)$ is the probability of deriving the substring $x_i \ldots x_j$ from the non-terminal $A$, and $f(A\langle i, j\rangle)$ is the probability of deriving the sentential form $x_1 \ldots x_{i-1} A x_{j+1} \ldots x_n$ from the initial symbol $S$, then for each bracketed sentence in the corpus the inside probabilities can be computed by the recurrence:
$$e(A\langle i, i\rangle) = p(A \to x_i), \quad 1 \leq i \leq |x|;$$
$$e(A\langle i, j\rangle) = c(i, j) \sum_{B, C \in N} p(A \to BC) \sum_{k=i}^{j-1} e(B\langle i, k\rangle)\, e(C\langle k+1, j\rangle), \quad 1 \leq i < j \leq |x|.$$

Thus, $\Pr(x \mid G_p) = e(S\langle 1, |x|\rangle)$. Similarly, the outside probabilities can be computed by the recurrence:
$$f(A\langle 1, |x|\rangle) = \begin{cases} 1 & \text{if } A = S \\ 0 & \text{if } A \neq S \end{cases}$$
$$f(A\langle i, j\rangle) = c(i, j) \sum_{B, C \in N} \left( p(B \to CA) \sum_{k=1}^{i-1} f(B\langle k, j\rangle)\, e(C\langle k, i-1\rangle) + p(B \to AC) \sum_{k=j+1}^{|x|} f(B\langle i, k\rangle)\, e(C\langle j+1, k\rangle) \right), \quad 1 \leq i \leq j \leq |x|.$$

Thus, $\Pr(x \mid G_p) = \sum_{A \in N} f(A\langle i, i\rangle)\, p(A \to x_i)$, $1 \leq i \leq |x|$.
Observing the modifications introduced by the brackets in the inside and outside probabilities, and keeping in mind the method used to estimate the rule probabilities with the usual IO algorithm [2, 3], no further modifications of the IO algorithm are needed to compute the rule probabilities.
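To make the effect of the bracketing constraint explicit, the following Python sketch (our own illustration, reusing the `c` function from the previous sketch and assuming rule tables `binary` and `lexical` for a CNF grammar) computes the bracketed inside probabilities; the reestimation of the rule probabilities on top of them is the standard IO step:

```python
from collections import defaultdict

def bracketed_inside(x, bracketing, binary, lexical, nonterminals):
    """e[(A, i, j)]: probability of deriving x_i..x_j from A (1-based i, j).
    binary[A] is a list of (B, C, p) for rules A -> BC; lexical[(A, a)] = p(A -> a)."""
    n = len(x)
    e = defaultdict(float)
    for i in range(1, n + 1):                   # e(A<i,i>) = p(A -> x_i)
        for A in nonterminals:
            e[(A, i, i)] = lexical.get((A, x[i - 1]), 0.0)
    for length in range(2, n + 1):              # fill wider spans bottom-up
        for i in range(1, n - length + 2):
            j = i + length - 1
            if c(i, j, bracketing) == 0:        # span crosses a bracket: pruned
                continue
            for A in nonterminals:
                total = 0.0
                for B, C, p in binary.get(A, []):
                    for k in range(i, j):       # sum over split points
                        total += p * e[(B, i, k)] * e[(C, k + 1, j)]
                e[(A, i, j)] = total
    return e                                    # Pr(x | G_p) = e[('S', 1, n)]
```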
Bracketed VS algorithm
In the Viterbi-Score algorithm only the best derivation of the string $x$ (the derivation with maximum probability) is taken into account. If $\hat{e}(A\langle i, j\rangle)$ is the probability of the best derivation of the substring $x_i \ldots x_j$, then the bracketed version of the Viterbi algorithm that we propose may be expressed as follows:
$$\hat{e}(A\langle i, i\rangle) = p(A \to x_i), \quad 1 \leq i \leq |x|;$$
$$\hat{e}(A\langle i, j\rangle) = c(i, j) \max_{B, C \in N} p(A \to BC) \max_{k=i, \ldots, j-1} \hat{e}(B\langle i, k\rangle)\, \hat{e}(C\langle k+1, j\rangle), \quad 1 \leq i < j \leq |x|.$$
Thus, $\widehat{\Pr}(x \mid G_p) = \hat{e}(S\langle 1, |x|\rangle)$. Something similar to what was said for the IO case holds for the Viterbi algorithm: the reestimation method used in the raw (unbracketed) case remains valid in the bracketed case.
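With respect to the inside sketch above, the only change needed for VSb is to replace the summations by maximizations (again our own illustrative sketch, with the same assumed data structures):

```python
from collections import defaultdict

def bracketed_viterbi_inside(x, bracketing, binary, lexical, nonterminals):
    """e_hat[(A, i, j)]: probability of the best derivation of x_i..x_j from A."""
    n = len(x)
    e_hat = defaultdict(float)
    for i in range(1, n + 1):
        for A in nonterminals:
            e_hat[(A, i, i)] = lexical.get((A, x[i - 1]), 0.0)
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            if c(i, j, bracketing) == 0:
                continue
            for A in nonterminals:
                best = 0.0
                for B, C, p in binary.get(A, []):
                    for k in range(i, j):       # max over rules and split points
                        best = max(best, p * e_hat[(B, i, k)] * e_hat[(C, k + 1, j)])
                e_hat[(A, i, j)] = best
    return e_hat                                # best-derivation prob.: e_hat[('S', 1, n)]
```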
4 Experimental results with a synthetic task

Firstly, we studied the behaviour of the proposed IOb and VSb algorithms in contrast to the classical IO and VS algorithms. Given the characteristics of the IO algorithm, it was only possible to carry out a synthetic experiment. For this experiment, we selected the palindrome language with two terminals which had previously been used in other works [6, 9]. A random sample of 100 strings was generated with the SCFG shown in Figure 1. The initial grammar was composed of all the rules that can be generated with 5 non-terminals and 2 terminals, i.e. 135 rules. Initial probabilities were assigned at random. Two aspects were considered. First, the convergence process for the four algorithms (shown in Figure 2): although VS and VSb converged faster than the respective IO algorithms, the performance of the models was worse; as expected, IOb was significantly faster than the IO algorithm, and while IOb took 30 iterations to converge, IO took more than 300. The second aspect considered was the capability of the models to learn the structural information.
S → AC (0.4)    S → BD (0.4)    S → AA (0.1)    S → BB (0.1)
C → SA (1)      D → SB (1)      A → a (1)       B → b (1)

Figure 1: The SCFG used to generate the training sample. This SCFG generates the language $L = \{w w^R \mid w \in \{a, b\}^+\}$ [9].
Figure 2: Evolution of the function maximized by each algorithm. IOp and IOnp are labels for IOb and IO, vitp and vitnp for VSb and VS.

The final models were used to generate a test sample. With the original SCFG (Figure 1) we checked the rate of palindromes and non-palindromes (see Table 1). As Table 1 shows, the greatest part of the probability mass of the test sample in the first three models was concentrated in the palindrome sentences. In the IOb case, 61.7% of the probability mass which was assigned to the sample was concentrated in the palindromes. The percentages in the IO case and the VSb case were 86.5% and 72.2% respectively.
Table 1: For a test sample (without repetitions) generated by each of the four final models, the percentage of palindromes (and non-palindromes) and the probability mass represented by them.

Estimation algorithm   Pal. %   Pal. Σ Pr   Non-pal. %   Non-pal. Σ Pr
IOb                    11.3     0.21        88.7         0.13
IO                     33.4     0.45        66.6         0.07
VSb                    14.8     0.39        85.2         0.15
VS                     12.2     0.09        87.8         0.62
5 Experimental results with the Penn Treebank corpus

The corpus used in the experiments was the part of the Wall Street Journal which had been processed in the Penn Treebank project¹ [7]. This corpus consists of English texts of approximately one million words collected from Wall Street Journal editions of the late eighties. This corpus was automatically labelled, analyzed and manually checked as described in [7] (an example is shown in Figure 3). There are two kinds of labelling: a part-of-speech (POStag) labelling and a syntactic labelling. The size of the vocabulary is greater than 25,000 different words, the POStag vocabulary is composed of 45 labels² and the syntactic vocabulary is composed of 14 labels.

( (S
    (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
            (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
    (VP (MD will)
        (VP (VB join)
            (NP (DT the) (NN board) )
            (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
            (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
Figure 3: The sentence Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. labelled and analyzed in the Penn Treebank project.

We decided to work only with the POStag labelling, since the vocabulary of the original corpus was too large for the experiments to be carried out. The corpus was divided into sentences. A sequence of POStags which ended with the label ".", with an end of paragraph (a sequence of "=" labels) or with an end of file was considered a sentence. In this way, we obtained a corpus whose main characteristics are shown in Table 2. Given the time complexity of the bracketed IO algorithm, we decided not to consider the sentences with more than 15 POStags in order to reduce the computational effort. The corpus was organized in 25 directories (from 00 to 24) and it was divided into a training corpus and a test corpus. The characteristics of these sets can be seen in Table 3.

¹ Release 2 of this data set can be obtained from the Linguistic Data Consortium with catalogue number LDC94T4B (http://www.ldc.upenn.edu/ldc/noframe.html).
² There are 48 labels defined in [7]; however, three do not appear in the corpus.
Table 2: Characteristics of the Penn Treebank corpus once it was divided into sentences.

No. of sent.   Av. length   Std. dev.   Min. length   Max. length
54,393         23.75        11.31       1             249

Table 3: Characteristics of the data sets defined for the experiments when the sentences with more than 15 POStags were removed.

Data set   Directories     No. of sent.   Average length   Standard deviation
Training   from 00 to 19   10,631         10.50            3.61
Test       from 20 to 24   2,788          10.29            3.63

The perplexity per word was used to evaluate the goodness of the obtained model [1, 5]. This measure is evaluated with a data set (test set) which has not been used in the training process. When the model is a SCFG, this measure is defined as [4]:

$$PP(T_s, G_p) = e^{-\frac{\sum_{x \in T_s} \log \Pr(x \mid G_p)}{\sum_{x \in T_s} |x|}}.$$

We can observe that, when the test set is exactly the training set, maximizing the likelihood of the sample is equivalent to minimizing this measure. Therefore, if the test set has a distribution similar to the training set, then this value is expected to decrease. For reference, the perplexity³ with 3-grams was 9.63. An initial SCFG to be estimated was constructed. This SCFG had the maximum number of rules which can be composed with 45 terminal symbols (the number of POStags) and 14 non-terminal symbols (the number of syntactic labels), which adds up to 3,374 rules ($14^3 + 14 \cdot 45$). The probabilities were randomly generated and three seeds were tested, but only one is reported.
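As a small illustration (our own sketch, not part of the paper), the test-set perplexity per word defined above can be computed directly from the per-sentence log-probabilities $\log \Pr(x \mid G_p)$, obtained for instance with the inside pass sketched in Section 3:

```python
import math

def perplexity_per_word(log_probs, lengths):
    """PP(Ts, Gp) = exp(- sum_x log Pr(x | Gp) / sum_x |x|) over the test set Ts."""
    return math.exp(-sum(log_probs) / sum(lengths))

# Hypothetical values for three test sentences (natural-log probabilities).
log_probs = [-22.1, -35.7, -18.4]
lengths = [10, 15, 8]
print(perplexity_per_word(log_probs, lengths))  # ~10.07 for this toy set
```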
Results on the Penn Treebank
Table 4 shows the perplexity of the final model for the three algorithms considered (IOb, VS, VSb). The column headed IOb+VS(30) represents the perplexity of the model obtained as follows: the grammar resulting from 30 iterations of the learning process with the IOb algorithm (the process had not yet converged) was used as the initial model for the VS algorithm, expecting the convergence process to accelerate, and then the VS algorithm was used to train until convergence was obtained. The same was done with the VSb algorithm, and similarly for IOb+VS(50), IOb+VSb(30) and IOb+VSb(50), as Table 4 shows.

Table 4: Test set perplexity for the different algorithms. The training set was composed of all the other partitions.

Algorithm        IOb     VSb     VS      IOb+VS(30)   IOb+VS(50)   IOb+VSb(30)   IOb+VSb(50)
Test set perp.   13.54   22.05   21.14   14.29        13.62        14.18         13.55

As expected, IOb had the lowest perplexity. The perplexity of the combined model IOb+VSb(30) was slightly greater than that of IOb, and the model had a similar performance. Figure 4 shows the development of the likelihood function of IOb together with those of IOb+VS(30) and IOb+VS(50).

³ The values were computed with the software tool described in [10] (release 2.04 is available at http://svrwww.eng.cam.ac.uk/ prc14/toolkit.html).
Figure 4: Evolution of the convergence process and the effect of the use of the VS and VSb algorithms from the 30th and 50th iterations.

The IOb algorithm converged after approximately 55 iterations, but applying the VS or the VSb algorithm to the model obtained after 30 iterations of IOb accelerated the convergence of the IOb+VS(30) models to approximately 33 iterations, thus obtaining a model with a performance similar to IOb. In the case of the models IOb+VS(50) and IOb+VSb(50), the perplexity was lower than that of the corresponding (30) models; this may be due to the convergence of the IOb algorithm, since the likelihood function had practically converged at the 50th iteration of IOb.
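A schematic view of the IOb+VS(k) and IOb+VSb(k) combinations described above (our own sketch; `iob_step` and `vs_step` stand for one reestimation pass of the respective algorithms and are assumptions, not the authors' code):

```python
def combined_training(grammar, corpus, k, iob_step, vs_step, tol=1e-4, max_iter=200):
    """Phase 1: k iterations of bracketed IO reestimation.
    Phase 2: Viterbi-Score (or bracketed Viterbi-Score) reestimation until the
    log-likelihood improvement falls below tol."""
    for _ in range(k):
        grammar, _ = iob_step(grammar, corpus)
    prev_ll = float("-inf")
    for _ in range(max_iter):
        grammar, ll = vs_step(grammar, corpus)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return grammar
```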
6 Conclusions

In this work, comparisons of the learning process of stochastic context-free grammars using the classical reestimation algorithms have been made, and a new algorithm (VSb) has been presented. The experiments with the palindrome task demonstrate the advantage of IOb over the other algorithms, and the experiments with the Penn Treebank show the same advantage. Combining IOb with the VS algorithms accelerates the convergence and produces models with good characteristics.
References

[1] L.R. Bahl, F. Jelinek, and R.L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, PAMI-5(2):179–190, 1983.
[2] J.K. Baker. Trainable grammars for speech recognition. In Klatt and Wolf, editors, Speech Communications for the 97th Meeting of the Acoustical Society of America, pages 31–35. Acoustical Society of America, June 1979.
[3] F. Casacuberta. Growth transformations for probabilistic functions of stochastic grammars. IJPRAI, 10(3):183–201, 1996.
[4] P. Dupont. Utilisation et apprentissage de modèle de langage pour la reconnaissance de la parole continue. Ph.D. dissertation, ?, 1996.
[5] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1998.
[6] K. Lari and S.J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer, Speech and Language, 4:35–56, 1990.
[7] M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[8] H. Ney. Stochastic grammars and pattern recognition. In P. Laface and R. De Mori, editors, Speech Recognition and Understanding. Recent Advances, pages 319–344. Springer-Verlag, 1992.
[9] F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 128–135. University of Delaware, 1992.
[10] R. Rosenfeld. The CMU statistical language modeling toolkit and its use in the 1994 ARPA CSR evaluation. In ARPA Spoken Language Technology Workshop, Austin, Texas, USA, 1995.