Performance of a SCFG-based language model with training data sets of increasing size*

J.A. Sánchez (1), J.M. Benedí (1), and D. Linares (2)

(1) Depto. Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia (Spain), {jandreu,jbenedi}@dsic.upv.es
(2) Pontificia Universidad Javeriana - Cali, Calle 18 No. 118-250, Av. Cañasgordas, Cali (Colombia), [email protected]

Abstract. In this paper, a hybrid language model which combines a word-based n-gram and a category-based Stochastic Context-Free Grammar (SCFG) is evaluated for training data sets of increasing size. Different estimation algorithms for learning SCFGs in General Format and in Chomsky Normal Form are considered. Experiments on the UPenn Treebank corpus are reported. These experiments have been carried out in terms of the test set perplexity and the word error rate in a speech recognition experiment.

1 Introduction

Language modeling is an important aspect to consider in the development of speech and text recognition systems. N-gram models are the most extensively used models for a wide range of domains [1]. A drawback of n-gram models is that they cannot characterize the long-term constraints of the sentences of the tasks. Stochastic Context-Free Grammars (SCFGs) efficiently model the long-term relations of the sentences. The two main obstacles to using these models in complex real tasks are the difficulties of learning SCFGs and of integrating them into the recognition process.

With regard to the learning of SCFGs, given the existence of robust techniques for the automatic estimation of the probabilities of SCFGs from samples [7, 11, 14, 8], in this work we consider the learning of SCFGs by means of a probabilistic estimation process [11]. When the SCFG is in Chomsky Normal Form (CNF), an initial exhaustive ergodic grammar is iteratively estimated by using the inside-outside algorithm or the Viterbi algorithm [7, 11, 10, 3]. When a treebank corpus is available, it is possible to directly obtain an initial SCFG in General Format (GF) from the syntactic structures that are present in the treebank corpus. These SCFGs in GF are then estimated by using the Earley algorithm [14, 8].

With regard to the problem of SCFG integration in a recognition system, several proposals have attempted to solve this problem by combining a word n-gram model

* This work has been partially supported by the Spanish MCyT under contract TIC2002/04103-C03-03 and by Agencia Valenciana de Ciencia y Tecnología under contract GRUPOS03/031.

and a structural model in order to take into account the syntactic structure of the language [5]. Along the same line, in [3] we proposed a general hybrid language model, defined as a linear combination of a word n-gram model, which was used to capture the local relations between words, and a stochastic grammatical model, which was used to represent the global relations between syntactic structures. In order to capture the long-term relations between syntactic structures and to solve the main problems derived from large-vocabulary complex tasks, we also proposed a stochastic grammatical model defined by a category-based SCFG together with a probabilistic model of word distribution into the categories.

Previous works have shown that the weight of the stochastic grammatical model defined in [8, 3] was less than expected and that most of the information was conveyed by the n-gram model. This seemed reasonable, because there was enough data to adequately estimate the n-gram model. However, the performance of the hybrid language model has not been adequately studied when there is little training data. Taking this idea into consideration, in this work we study the performance of the hybrid language model with training data sets of increasing size.

In the following section, we briefly describe the hybrid language model and the estimation of the models. Then, we present experiments with the UPenn Treebank corpus. The experiments have been carried out in terms of the test set perplexity and the word error rate.

2 The Language Model

An important problem related to statistical language modeling is the computation of the expression Pr(w_k | w_1 ... w_{k-1}). The n-gram language models are the most widely used for a wide range of domains [1]. The n-gram model reduces the history to the last n-1 words, w_{k-n+1} ... w_{k-1}. The n-grams are simple and robust models and adequately capture the local restrictions between words. Moreover, the way to estimate the parameters of the model and the way to integrate it in a speech or text recognition system are well known. However, n-grams cannot characterize the long-term constraints of the sentences in these tasks.

Some works have proposed combining a word n-gram model and a structural model in order to take into account the syntactic structure of the language [5, 12]. Along the same line, in [3] we proposed a general hybrid language model defined as a linear combination of a word n-gram model, which is used to capture the local relations between words, and a stochastic grammatical model M_s, which is used to represent the global relations between syntactic structures and which allows us to generalize the word n-gram model:

  Pr(w_k | w_1 ... w_{k-1}) = α Pr(w_k | w_{k-n+1} ... w_{k-1}) + (1 - α) Pr_{M_s}(w_k | w_1 ... w_{k-1}),   (1)

where 0 ≤ α ≤ 1 is a weight factor that depends on the task.

The first term of expression (1) is the word probability of w_k given by the word n-gram model. The parameters of this model can be easily estimated, and the expression Pr(w_k | w_{k-n+1} ... w_{k-1}) can be efficiently computed [1].

In order to capture long-term relations between syntactic structures and to solve the main problems derived from large-vocabulary complex tasks, we proposed a stochastic

grammatical model M_s defined as a combination of two different stochastic models: a category-based SCFG (G_c) and a stochastic model of word distribution into categories (C_w). Thus, the second term of expression (1) can be written as:

  Pr_{G_c,C_w}(w_k | w_1 ... w_{k-1}).   (2)
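As an illustration of the interpolation in expression (1), the following minimal Python sketch combines the two terms. The functions ngram_prob and grammar_prob are hypothetical stand-ins for the word n-gram model and the grammatical model M_s; this is not the implementation used in the experiments.

```python
def hybrid_word_prob(word, history, alpha, ngram_prob, grammar_prob, n=3):
    """Linear interpolation of expression (1).

    ngram_prob(word, context)   -- hypothetical n-gram model: Pr(w_k | w_{k-n+1}..w_{k-1})
    grammar_prob(word, history) -- hypothetical grammatical model M_s: Pr_Ms(w_k | w_1..w_{k-1})
    alpha                       -- task-dependent weight, 0 <= alpha <= 1
    """
    context = tuple(history[-(n - 1):])              # n-gram history: last n-1 words
    p_ngram = ngram_prob(word, context)              # local, word-based term
    p_grammar = grammar_prob(word, tuple(history))   # global, category-based term
    return alpha * p_ngram + (1.0 - alpha) * p_grammar
```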

In this proposal, there are still two important questions to consider: the definition and learning of G_c and C_w, and the computation of expression (2).

2.1 Learning of the Models

Here, we explain the estimation of the models. First, we introduce some notation. Then, we present the framework in which the estimation process is carried out. Finally, we describe how the parameters of G_c and C_w are estimated.

A Context-Free Grammar (CFG) G is a four-tuple (N, Σ, P, S), where N is a finite set of non-terminal symbols, Σ is a finite set of terminal symbols, P is a finite set of rules, and S is the initial symbol. A CFG is in Chomsky Normal Form (CNF) if the rules are of the form A → BC or A → a (A, B, C ∈ N and a ∈ Σ). We say that the CFG is in General Format (GF) if no restriction is imposed on the format of the right side of the rules. A Stochastic Context-Free Grammar (SCFG) G_s is defined as a pair (G, q), where G is a CFG and q : P → (0, 1] is a probability function of rule application such that ∀A ∈ N: Σ_{α ∈ (N ∪ Σ)+} q(A → α) = 1. We define the probability of the derivation d_x of the string x, Pr_{G_s}(x, d_x), as the product of the probabilities of all the rules used in the derivation d_x. We define the probability of the string x as: Pr_{G_s}(x) = Σ_{∀d_x} Pr_{G_s}(x, d_x).

Estimation Framework. In order to estimate the probabilities of a SCFG, it is necessary to define both an objective function to be optimized and a framework to carry out the optimization process. In this work, we have considered the framework of Growth Transformations [2] in order to obtain the expression that allows us to optimize the objective function. As the function to be optimized, we consider the likelihood of a sample, defined as Pr_{G_s}(Ω) = Π_{x ∈ Ω} Pr_{G_s}(x), where Ω is a multiset of strings. Given an initial SCFG G_s and a finite training sample Ω, the iterative application of the following transformation can be used to modify the probabilities (∀(A → α) ∈ P):

  q'(A → α) = [ Σ_{x ∈ Ω} (1 / Pr_{G_s}(x)) Σ_{∀d_x} N(A → α, d_x) Pr_{G_s}(x, d_x) ] / [ Σ_{x ∈ Ω} (1 / Pr_{G_s}(x)) Σ_{∀d_x} N(A, d_x) Pr_{G_s}(x, d_x) ].   (3)

The expression N(A → α, d_x) represents the number of times that the rule A → α has been used in the derivation d_x, and N(A, d_x) is the number of times that the non-terminal A has been derived in d_x. This transformation optimizes the function Pr_{G_s}(Ω). Algorithms based on transformation (3) are gradient descent algorithms and, therefore, the choice of the initial grammar is a fundamental aspect, since it affects both the maximum achieved and the convergence process. Different methods have been proposed elsewhere in order to obtain the initial grammar.
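The following Python sketch applies transformation (3) literally, assuming that all derivations of each training string, their probabilities, and the rules used in each derivation can be enumerated explicitly. That assumption only holds for toy grammars; the estimation algorithms cited in the text (IO, IOb, VS, IOE, IOEb) compute these expectations efficiently instead of enumerating derivations.

```python
from collections import defaultdict

def reestimate(rule_probs, derivations):
    """One application of transformation (3) over explicitly enumerated derivations.

    rule_probs  -- dict {(A, alpha): q(A -> alpha)} for the current grammar
    derivations -- dict {x: [(rules_used, prob), ...]}, where rules_used lists (with
                   repetitions) the rules applied in one derivation d_x of string x
                   and prob is Pr(x, d_x) under the current grammar.
    """
    num = defaultdict(float)   # expected counts of each rule A -> alpha
    den = defaultdict(float)   # expected counts of each left-hand side A
    for x, derivs in derivations.items():
        px = sum(p for _, p in derivs)        # Pr(x) = sum over all derivations of x
        if px == 0.0:
            continue
        for rules_used, p in derivs:
            for (A, alpha) in rules_used:     # each occurrence contributes N(.) * Pr(x,d_x)/Pr(x)
                num[(A, alpha)] += p / px
                den[A] += p / px
    return {(A, alpha): num[(A, alpha)] / den[A]
            for (A, alpha) in rule_probs if den[A] > 0.0}
```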

Estimation of SCFGs in CNF. When the grammar is in CNF, transformation (3) can be adequately formulated and it becomes the well-known Inside-Outside (IO) algorithm [7]. If a bracketed corpus is available, this algorithm can be modified in order to take advantage of this information, and we get the IOb algorithm [11]. If we use only the best derivation of each string, then transformation (3) becomes the Viterbi-Score (VS) algorithm [10]. The initial grammar for these estimation algorithms is typically constructed in a heuristic fashion by building all possible rules that can be composed from a given set of terminal symbols and a given set of non-terminal symbols [7, 3].

Estimation of SCFGs in GF. When the grammar is in GF, transformation (3) can be adequately computed by using an Earley-based algorithm [14, 8] (the IOE algorithm). When a bracketed corpus is available, the algorithm can be modified by using a function similar to the one described in [11], and we get the IOEb algorithm [8]. If we use only the best derivation of each string, then transformation (3) becomes the Viterbi-Score (VSE) algorithm [8]. In these algorithms, the initial grammar can be obtained from a treebank corpus [4, 8].

Estimation of the parameters of C_w. We work with a tagged corpus, where each word of the sentence is labeled with a part-of-speech tag (POStag). From now on, these POStags are referred to as word categories in C_w and are the terminal symbols of the SCFG G_c. The parameters of the word-category distribution, C_w = Pr(w|c), are computed in terms of the number of times that the word w has been labeled with the POStag c. It is important to note that a word w can belong to different categories. In addition, a word in the test set may not appear in the training set, and, therefore, its probability Pr(w|c) is not defined. We solve this problem by adding the term Pr(UNK|c) for all categories, where Pr(UNK|c) is the probability assigned to unseen words of the test set.

2.2 Integration of the Model

The computation of probability (2) can be expressed as:

  Pr_{G_c,C_w}(w_k | w_1 ... w_{k-1}) = Pr_{G_c,C_w}(w_1 ... w_k ...) / Pr_{G_c,C_w}(w_1 ... w_{k-1} ...),   (4)

where Pr_{G_c,C_w}(w_1 ... w_k ...)
represents the probability of generating an initial substring given G_c and C_w. Expression (4) can be easily computed by a simple modification [3] of the LRI algorithm [6] when the SCFG is in CNF, and by an adaptation [8] of the forward algorithm [14] when the SCFG is in GF.
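A minimal sketch of how expression (4) turns prefix probabilities into conditional word probabilities is given below. The function prefix_prob is a hypothetical stand-in for the modified LRI algorithm (CNF case) or the adapted forward algorithm (GF case); the convention that the empty prefix has probability 1 is an assumption of this sketch.

```python
def grammar_conditionals(words, prefix_prob):
    """Pr_{Gc,Cw}(w_k | w_1 .. w_{k-1}) for every position k, via expression (4).

    prefix_prob(prefix) -- hypothetical: Pr_{Gc,Cw}(prefix ...), the probability of
                           generating `prefix` as an initial substring, given Gc and Cw.
    """
    probs = []
    prev = prefix_prob(())                    # empty prefix: probability 1 by convention
    for k in range(1, len(words) + 1):
        cur = prefix_prob(tuple(words[:k]))   # Pr(w_1 .. w_k ...)
        probs.append(cur / prev if prev > 0.0 else 0.0)
        prev = cur                            # reuse as Pr(w_1 .. w_{k-1} ...) at the next step
    return probs
```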

3 Experiments with the UPenn Treebank Corpus

In this section, we describe the experiments which were carried out to test the language model proposed in the previous section for training sets of increasing size.

The corpus used in the experiments was the part of the Wall Street Journal that had been processed in the UPenn Treebank project [9]. It contains approximately one million words distributed in 25 directories. This corpus was automatically labeled, analyzed,

and manually checked as described in [9]. There are two kinds of labeling: a POStag labeling and a syntactic labeling. The size of the vocabulary is greater than 49,000 different words; the POStag vocabulary is composed of 45 labels; and the syntactic vocabulary is composed of 14 labels. The corpus was divided into sentences according to the bracketing. For the experiments, the corpus was divided into three sets: training (see Table 1), tuning (directories 21-22; 80,156 words) and test (directories 23-24; 89,537 words). Sentences labeled with POStags were used to learn the category-based SCFGs, and sentences labeled with both POStags and words were used to estimate the parameters of the hybrid language model.

First, we present the perplexity results for the described task. Second, we present word error rate results on a speech recognition experiment. In both experiments, the hybrid language model has been tested with training data sets of increasing size. Preliminary results of these experiments appeared in [8, 3].

3.1 Perplexity Results

We carried out the experiments taking into account the restrictions considered in other works [5, 12, 8, 3]. The restrictions that we considered were the following: all words that had the POStag CD (cardinal number [9]) were replaced by a special symbol which did not appear in the initial vocabulary; all capital letters were uncapitalized; and the vocabulary was composed of the 10,000 most frequent words that appear in the training set.

Baseline Model. We now describe the estimation of a 3-gram model to be used both as a baseline model and as a part of the hybrid language model. The parameters of the 3-gram model were estimated using the software tool described in [13]. Linear discounting was used as the smoothing technique with the default parameters in order to compare the obtained results with results reported in other works [8]. The out-of-vocabulary words were used in the computation of the perplexity, and back-off from context cues was excluded.

Hybrid Language Model. In this section, we describe the estimation of the stochastic grammatical model M_s and the experiments which were carried out with the hybrid language model.

The parameters of the word-category distribution (C_w) were computed from the POStags and the words of the training corpus. The unseen events of the test corpus were all mapped to the same word UNK, and we assigned them a probability based on the classification of unknown words into categories in the tuning set. A small probability ε was assigned if no unseen event was associated with the category.

With regard to the estimation of the category-based SCFG (G_c) of the hybrid model, we first describe the estimation of SCFGs in CNF, and we then describe the estimation of SCFGs in GF.

For initial SCFGs in CNF, a heuristic initialization based on an exhaustive ergodic model was carried out. This initial grammar in CNF had the maximum number of rules that can be formed using 35 non-terminal symbols and 45 terminal symbols. In this way, the initial SCFG had 44,450 rules. Then, the parameters of this initial SCFG were estimated using several estimation algorithms: the VS algorithm and the IOb algorithm. Note that the IO algorithm was not used to estimate the SCFG in CNF because of the time that is necessary per iteration and the number of iterations that it needs to converge.
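A sketch of the exhaustive ergodic initialization just described: with 35 non-terminals and the 45 POStag terminals, building every rule of the forms A → BC and A → a yields 35·35·35 + 35·45 = 44,450 rules. The uniform assignment of the initial probabilities per left-hand side used below is only one possible choice; the actual initialization of the probabilities is not specified in the text.

```python
import itertools

def exhaustive_cnf_grammar(num_nonterminals=35, num_terminals=45):
    """Exhaustive ergodic CNF grammar: all rules A -> B C and A -> a.

    Probabilities are assigned uniformly per left-hand side here, as one option;
    the cited works use heuristic initializations, so treat this as a sketch.
    """
    nonterminals = [f"N{i}" for i in range(num_nonterminals)]
    terminals = [f"POS{j}" for j in range(num_terminals)]   # hypothetical POStag labels
    rules = {}
    for A in nonterminals:
        rhs_list = [(B, C) for B, C in itertools.product(nonterminals, nonterminals)]
        rhs_list += [(a,) for a in terminals]
        p = 1.0 / len(rhs_list)                 # uniform over all rules with this LHS
        for rhs in rhs_list:
            rules[(A, rhs)] = p
    return rules

grammar = exhaustive_cnf_grammar()
print(len(grammar))   # 35*35*35 + 35*45 = 44450 rules
```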

For SCFGs in GF, given that the UPenn Treebank corpus was used, an initial grammar was obtained from the syntactic information which is present in the corpus. Probabilities were attached to the rules according to the frequency of each one in the training corpus. Then, this initial grammar was estimated using several estimation algorithms based on the Earley algorithm: the VSE algorithm, the IOE algorithm, and the IOEb algorithm.

Finally, once the parameters of the hybrid language model were estimated, we applied expression (1). In order to compute expression (4), we used:

– the modified version of the LRI algorithm [3] with SCFGs in CNF, which were estimated as described above;
– the modified version of the forward algorithm described in [8] with SCFGs in GF, which were estimated as described above.

The tuning set was used to determine the best value of α, that is, the weight factor, for the hybrid model. In order to study the influence of the size of the training data set on the learning of the hybrid language model, we carried out the following experiment. All the parameters of the hybrid language model were estimated for different training sets of increasing size. The same restrictions described above for estimating the parameters of the models (the category-based SCFG, the word distribution into categories and the n-gram model) were applied for each training set. In addition, a new baseline was computed for each training set. The tuning and test sets were the same in all cases. The results obtained can be seen in Table 1.

Table 1. Test set perplexity, percentage of improvement and value of α for the hybrid model for SCFGs estimated with different estimation algorithms and training data sets of increasing size (in number of words).

Directories       00-02    00-04    00-06    00-08    00-10    00-12    00-14    00-16    00-18    00-20
Training size     142,218  232,392  328,551  391,392  487,836  590,119  700,717  817,716  912,344  1,004,073
n-gram baseline   253.4    231.2    211.2    203.5    197.5    189.0    181.4    174.4    171.2    167.3
HLM-VS            224.6    209.7    194.7    188.6    184.0    175.6    169.8    163.7    161.4    157.2
  % improv.       11.4     9.3      7.8      7.3      6.8      7.1      6.4      6.1      5.7      6.0
  α               0.61     0.66     0.70     0.72     0.74     0.75     0.76     0.78     0.79     0.79
HLM-IOb           190.8    185.9    174.9    169.6    166.3    157.9    151.9    145.2    143.8    142.3
  % improv.       24.7     19.6     17.2     16.6     15.8     16.5     16.3     16.7     16.0     15.0
  α               0.45     0.53     0.54     0.59     0.61     0.62     0.62     0.62     0.62     0.65
HLM-VSE           185.8    178.2    167.5    163.7    159.9    154.1    149.6    145.0    142.8    140.4
  % improv.       26.7     22.9     20.7     19.6     19.0     18.5     17.5     16.9     16.6     16.1
  α               0.47     0.54     0.58     0.60     0.61     0.63     0.65     0.66     0.67     0.67
HLM-IOE           184.9    177.0    166.7    162.5    158.5    152.4    147.6    143.1    140.9    138.6
  % improv.       27.0     23.4     21.1     20.2     19.8     19.4     18.6     18.0     17.7     17.2
  α               0.46     0.53     0.57     0.58     0.61     0.63     0.64     0.66     0.66     0.66
HLM-IOEb          188.7    180.9    169.7    165.3    161.8    155.6    151.0    146.4    144.1    142.1
  % improv.       25.5     21.8     19.7     18.8     18.1     17.7     16.8     16.1     15.8     15.1
  α               0.47     0.54     0.58     0.60     0.62     0.63     0.65     0.66     0.67     0.67
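For reference, the sketch below shows one plain way the tuning-set perplexity and the selection of α could be computed: a grid search that minimizes perplexity on the tuning sentences, assuming a hypothetical hybrid_word_prob function like the one sketched in Section 2. The actual search procedure used to pick α is not detailed in the text.

```python
import math

def perplexity(sentences, word_prob):
    """Per-word perplexity of a set of tokenized sentences, given Pr(w_k | history)."""
    log_sum, n_words = 0.0, 0
    for sent in sentences:
        for k, w in enumerate(sent):
            log_sum += math.log(word_prob(w, sent[:k]))
            n_words += 1
    return math.exp(-log_sum / n_words)

def tune_alpha(tuning_sentences, make_word_prob, grid=None):
    """Pick the weight alpha with the lowest tuning-set perplexity (simple grid search).

    make_word_prob(alpha) -- hypothetical factory returning a word_prob(w, history) function
                             for the hybrid model with that weight.
    """
    grid = grid if grid is not None else [i / 100.0 for i in range(101)]
    return min(grid, key=lambda a: perplexity(tuning_sentences, make_word_prob(a)))
```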

Table 1 shows that the test set perplexity with the n-gram models increased as the size of the training set decreased. The test set perplexity with the hybrid language model improved in all cases. It is important to note that both the percentage of improvement and the weight of the grammatical part increased as the size of the training set decreased. For SCFGs in GF, the percentage of improvement was better than for SCFGs in CNF. These results are significant because they show that the proposed hybrid language model can be very useful when little training data is available.

3.2 Word Error Rate Results

Here, we describe preliminary speech recognition experiments which were carried out to evaluate the hybrid language model. Given that our hybrid language model is not integrated in a speech recognition system, we reproduced the experiments described in [12, 8] in order to compare our results with those reported in those works. The experiment consisted of rescoring a list of n-best hypotheses provided by a speech recognizer that used a different language model. In our case, the speech recognizer and the language model were the ones described in [5]. The list was then reordered with the proposed language model. In order to avoid the influence of the language model of the speech recognizer, it is important to use a large value of n; however, for these experiments, this value was lower.

The experiments were carried out using the DARPA '93 HUB1 test setup. This test set consists of 213 utterances read from the Wall Street Journal with a total of 3,446 words. The corpus comes with a baseline trigram model that uses a 20,000-word open vocabulary and is trained on approximately 40 million words. The 50 best hypotheses from each lattice were computed using Ciprian Chelba's A* decoder, along with the acoustic and trigram scores. Unfortunately, in many cases, 50 different string hypotheses were not provided by the decoder [12]. An average of 22.9 hypotheses per utterance were rescored.

The hybrid language model was used to compute the probability of each word in the list of hypotheses. The probability obtained using the hybrid language model was combined with the acoustic score, and the results can be seen in Table 2 along with the results obtained for different language models. The word error rate without a language model, that is, using only the acoustic scores, was 16.8.

Table 2. Word error rate results for several models, using different training sizes, and the best language model weight.

Directories       00-02  00-04  00-06  00-08  00-10  00-12  00-14  00-16  00-18  00-20
n-gram baseline   16.8   16.7   16.8   16.5   16.8   16.7   16.7   16.7   16.7   16.6
  LM weight       2      3.5    3.5    3      3      5      2.5    5.5    3      5
HLM-IOb           16.7   16.5   16.7   16.3   16.4   16.3   16.3   16.3   16.3   16.0
  LM weight       2.2    2.3    2.4    5.1    4      5.7    5.2    5      5.1    6
HLM-IOE           16.7   16.6   16.8   16.4   16.5   16.4   16.3   16.2   16.4   16.2
  LM weight       4.2    5.0    5.7    5.1    5.2    5.9    5.4    6.1    5.9    4

Table 2 shows that, in all cases, the hybrid language model slightly improved on the n-gram model. However, the good perplexity results did not carry over to the WER results: while the percentage of improvement in perplexity increased as the training data size decreased, the WER did not reflect this improvement. It should be pointed out that better WER results were obtained in [12]. However, the model proposed in [12] is more complex, whereas our stochastic grammatical model is simpler and is learned by means of well-known estimation algorithms.
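A minimal sketch of the rescoring procedure described above: each n-best hypothesis is scored by combining its acoustic score with the hybrid language model score through a language model weight, and the best-scoring hypothesis is selected. The log-linear combination and the score names used here are assumptions; the exact combination used in the experiments is not spelled out in the text.

```python
import math

def rescore_nbest(hypotheses, lm_word_prob, lm_weight):
    """Rerank an n-best list and return the best hypothesis.

    hypotheses   -- list of (words, acoustic_log_score) pairs from the recognizer
    lm_word_prob -- hypothetical hybrid model: Pr(w_k | w_1 .. w_{k-1})
    lm_weight    -- language model weight (cf. the 'LM weight' rows of Table 2)
    """
    def total_score(hyp):
        words, acoustic_log_score = hyp
        lm_log = sum(math.log(lm_word_prob(w, words[:k]))   # hybrid LM log probability
                     for k, w in enumerate(words))
        return acoustic_log_score + lm_weight * lm_log
    return max(hypotheses, key=total_score)
```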

4 Conclusions

We have studied the performance of a SCFG-based language model with different training set sizes. Two hybrid models were considered: one uses a SCFG in CNF and the other uses a SCFG in GF. Both models were tested in an experiment on the UPenn Treebank corpus. Both models showed good perplexity results, and their percentage of improvement increased as the size of the training set decreased. The WER results were not as good as the perplexity results, and the performance seemed to increase as the size of the training set increased. For future work, we propose to test the proposed hybrid language models on other real tasks.

Acknowledgements

The authors would like to thank Dr. Brian Roark for providing them with the n-best lists for the HUB1 test set.

References

1. L.R. Bahl, F. Jelinek, and R.L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-5(2):179–190, 1983.
2. L.E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.
3. J.M. Benedí and J.A. Sánchez. Estimation of stochastic context-free grammars and their use as language models. Computer Speech and Language, 2005. To appear.
4. E. Charniak. Tree-bank grammars. Technical report, Department of Computer Science, Brown University, Providence, Rhode Island, January 1996.
5. C. Chelba and F. Jelinek. Structured language modeling. Computer Speech and Language, 14:283–332, 2000.
6. F. Jelinek and J.D. Lafferty. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics, 17(3):315–323, 1991.
7. K. Lari and S.J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.
8. D. Linares, J.M. Benedí, and J.A. Sánchez. A hybrid language model based on a combination of n-grams and stochastic context-free grammars. ACM Trans. on Asian Language Information Processing, 3(2):113–127, June 2004.
9. M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
10. H. Ney. Stochastic grammars and pattern recognition. In P. Laface and R. De Mori, editors, Speech Recognition and Understanding. Recent Advances, pages 319–344. Springer-Verlag, 1992.
11. F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 128–135. University of Delaware, 1992.
12. B. Roark. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2):249–276, 2001.
13. R. Rosenfeld. The CMU statistical language modeling toolkit and its use in the 1994 ARPA CSR evaluation. In ARPA Spoken Language Technology Workshop, Austin, Texas, USA, 1995.
14. A. Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–200, 1995.