Introduction of rules into a stochastic approach for language modelling
Thierry Spriet, Marc El-Bèze
LIA - Université d'Avignon
CERI - BP 1228, 84911 Avignon cedex - France
(spriet,elbeze)@univ-avignon.fr
I - Introduction
Automatic tagging at the morpho-syntactic level is an area of Natural Language Processing where statistical approaches have been more successful than rule-based methods [1]. During the last decade, probabilistic systems have held a leading position among the systems built to solve the tagging problem. Nevertheless, even when they are the best available, they present serious drawbacks: they are unable to capture long-span dependencies and to model rare structures. In fact, the weaknesses of statistical techniques may be compensated by the strong points of rule-based methods. In order to take advantage of their complementary features, we have developed a hybrid approach in ECSta, the tagger developed by the LIA team. It is difficult, not to say impossible, to use rules during the Viterbi search [2] or when applying the Forward-Backward algorithm. Although ECSta mainly relies on statistical models, it jointly uses some knowledge rules. In doing so, we took care to preserve the consistency of the mathematical model. In ECSta, we mix a stack-decoding algorithm with the classical Viterbi one. In this paper, we report which strategy has been implemented in order to combine the two components into a complete algorithm.
A probabilistic tagger is a process designed to disambiguate, for each lexical component of a word sequence, a multiple choice of syntactic tags. We call the structure used by such a system a hypotheses lattice (Fig. 1).
[Figure 1: hypotheses lattice — one column of candidate tags C_{1,i}, .., C_{M,i} for each word w_i of the sentence w_1, .., w_N.]

The probability of a tag sequence depends on the history of each tag (A). Most often this history is limited to the two previous units. Classical algorithms only keep in memory the best history (Viterbi) or the probabilities of the paths leading to the tags (Baum-Welch) [3]:

$$P(C) = \arg\max_{C} \prod_{i=1}^{N} P(c_i \mid H(c_i)) \times P(w_i \mid c_i) \qquad (A)$$

To introduce new, history-dependent information in P(C), the complete history must be saved for each unit of the lattice. Since it is not conceivable to explore all the paths (about a billion for a typical French sentence), we propose the use of a specific algorithm based on an A* strategy.
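To make equation (A) concrete before turning to the search strategy, here is a minimal Python sketch that scores a single path of the hypotheses lattice with a triclass model, i.e. with the history H(c_i) reduced to the two previous tags. The probability tables, their values and the back-off constant are toy placeholders of our own, not the actual ECSta model.

```python
# Toy triclass tables (hypothetical values, not the ECSta model).
trigram = {                       # P(c_i | c_{i-2}, c_{i-1}); "<s>" pads the sentence start
    ("<s>", "<s>", "PPERMS"): 0.10,
    ("<s>", "PPERMS", "VA3S"): 0.30,
    ("PPERMS", "VA3S", "VPPMS"): 0.20,
    ("VA3S", "VPPMS", "VPPMS"): 0.05,
}
lexical = {                       # P(w_i | c_i)
    ("il", "PPERMS"): 0.50,
    ("a", "VA3S"): 0.40,
    ("été", "VPPMS"): 0.30,
    ("pris", "VPPMS"): 0.10,
}

def path_probability(words, tags, backoff=1e-6):
    """Score of one lattice path according to equation (A) with a triclass history."""
    prob = 1.0
    c_prev2, c_prev1 = "<s>", "<s>"
    for word, tag in zip(words, tags):
        prob *= trigram.get((c_prev2, c_prev1, tag), backoff)   # P(c_i | c_{i-2}, c_{i-1})
        prob *= lexical.get((word, tag), backoff)               # P(w_i | c_i)
        c_prev2, c_prev1 = c_prev1, tag
    return prob

# The sentence of Figure 3, tagged with two past participles after the auxiliary.
print(path_probability(["il", "a", "été", "pris"], ["PPERMS", "VA3S", "VPPMS", "VPPMS"]))
```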
II - Stack decoding strategy
II - 1 Algorithm
The A* algorithm [4,5] finds the optimal path in a graph where, in the present application, the nodes are partial syntactic interpretations of the sentence and each connection corresponds to the tagging of a word or of a word compound. Expanding a node means extending the tag sequence with all the tags proposed by the dictionary for the next word. This algorithm is admissible and explores the minimum number of hypotheses (Fig. 2).
[Figure 2: hypotheses developed in the stack — each node is a partial tag sequence such as C_{1,1}, C_{1,1}C_{2,1}, C_{1,1}C_{2,1}..C_{i,1}, expanded with every tag proposed for the next word.]
For an A* strategy, a good evaluation function is needed or, at least, an excellent estimation of this function: the better the estimation, the fewer hypotheses are explored. This estimation function, called E(C), is described in the next section. The principle of the algorithm is to partly develop the most promising hypothesis: at each step, E(C) points out the hypothesis which is the best candidate.
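The best-first loop can be sketched as follows. This is only an illustration of the stack-decoding principle under a simplified interface of our own (a `candidate_tags` dictionary lookup and an `evaluate` function playing the role of E), not the actual ECSta code.

```python
import heapq

def stack_decode(words, candidate_tags, evaluate):
    """Best-first (A*-style) search over the hypotheses lattice.

    words          : the sentence w_1..w_N
    candidate_tags : word -> list of tags proposed by the dictionary (assumed interface)
    evaluate       : (words, partial_tags) -> E(CS_1,i), an upper bound of the final score

    If evaluate never under-estimates the best completion, the first complete
    hypothesis popped from the stack is the optimal tag sequence (admissibility).
    """
    # heapq is a min-heap, so scores are negated to pop the most promising hypothesis first.
    stack = [(-evaluate(words, ()), ())]
    while stack:
        neg_score, tags = heapq.heappop(stack)
        if len(tags) == len(words):              # complete path: return it
            return list(tags), -neg_score
        next_word = words[len(tags)]
        for tag in candidate_tags(next_word):    # expand with every tag of the next word
            extended = tags + (tag,)
            heapq.heappush(stack, (-evaluate(words, extended), extended))
    return None, 0.0
```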
II - 2 The evaluation function
This function must estimate an upper bound of the score which can be reached by a tagged word sequence if it is completely developed. Its computation can be divided into three parts. If CS_{1,i} is a tagged word sequence c_1..c_i in a sentence of N words, where c_k is the tag of word k, we have

$$E(CS_{1,i}) = V(c_1, .., c_{i-1}) \times S(c_i) \times VB(c_1, .., c_i)$$

where
- V(c_1, .., c_{i-1}) is the score obtained by the sequence c_1, .., c_{i-1}; it can be provided by a probabilistic model,
- S(c_i) is the consequence of the choice of the tag c_i for the word i,
- VB(c_1, .., c_i) is an estimation of the best score which can be obtained by the further development of the sequence.
The function E must be continuous and decreasing with the size of the sequence, that is E(CS_{1,i-1}) ≥ E(CS_{1,i}). The probabilities provided by a stochastic model respect these properties and lead to a straightforward computation of the first two components. With a triclass model, we have:
$$V(c_1, .., c_{i-1}) = V(c_1, .., c_{i-2}) \times S(c_{i-1}) \quad \text{with} \quad S(c_i) = P(c_i \mid c_{i-2}, c_{i-1}) \times P(w_i \mid c_i)$$

and, if C_{i+1,N} is a sequence of tags for w_{i+1}, .., w_N,

$$VB(c_1, .., c_i) = \arg\max_{C_{i+1,N}} P(C_{i+1,N}) = \arg\max_{C_{i+1,N}} \prod_{k=i}^{N-1} P(c_{k+1} \mid c_{k-1}, c_k) \times P(w_{k+1} \mid c_{k+1})$$

So the score E(CS_{1,i}) is identical to a Viterbi score whenever CS_{1,i} is part of the sequence pointed out by a classical Viterbi algorithm. VB can be estimated, in a very simple way, through a Viterbi-Back scheme: a first pass is performed in the right-to-left direction, so that all the VB estimations are available before the search for the best path in the lattice (a sketch of this pass is given at the end of this section). It is now possible to introduce the valuation of extra information in E(CS_{1,i}) by adding a third term to S(c_i):

$$S(c_i) = P(c_i \mid c_{i-2}, c_{i-1}) \times P(w_i \mid c_i) \times T(c_i)$$

where T(c_i) is a confidence rate associated with the choice c_i. By default, T(c_i) = 1, which can be interpreted as: «we trust our stochastic model blindly!». In practice, T(c_i) is the result of the application of some rules according to the history, which may now be as large as needed and no longer restricted to only two words.

II - 3 What processing?
Notwithstanding the gain directly obtained by these rules, the algorithm permits some context corrections that a post-processing cannot perform. When the evaluation function has penalised a sequence of tags, the algorithm explores competing paths in the lattice. By doing so, we can:
- find some errors on a tag c_{i-k} by detecting an error on the tag c_i,
- or obtain a different analysis of the right context of c_i, if a rule gives another interpretation for this tag.
A post-processing can only change one tag into another, but does not analyse the consequences for the other tags of the sequence. Changing a tag in a sequence, however, changes the probability of the whole sequence and can lead the model to revise the assignment of neighbouring tags. Figure 3 shows the kind of corrections which can be made by a stack decoding strategy.

[Figure 3: Double correction. Sentence «il a été pris»; Viterbi output: PPERMS VA3S NMS AMS; stack decoding output with rules: PPERMS VA3S VPPMS VPPMS.]

In this example, a classical algorithm makes two mistakes. The first one comes from a training drift, which is explained in the following section. The second one demonstrates the ability of stack decoding to avoid error propagation by taking advantage of the extra information and producing a different analysis of the tag sequence: here, the statistical model chooses the tag AMS (masculine singular adjective) for pris because the sequence Noun-Adjective is more frequent than the sequence VPPMS-VPPMS (two past participles).
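The Viterbi-Back pass mentioned in section II - 2 can be sketched as a simple right-to-left dynamic programme. The accessors `p_trigram` and `p_lexical` and the indexing of cells by (previous tag, current tag) are assumptions of ours made for illustration, not necessarily the ECSta implementation.

```python
def viterbi_back(words, candidate_tags, p_trigram, p_lexical):
    """Right-to-left pass computing VB for every cell of the lattice.

    vb[i][(prev, tag)] is the best score obtainable by tagging words i+1..N,
    given that words i-1 and i carry the tags (prev, tag).
    Sketch only: p_trigram(c_prev2, c_prev1, c) and p_lexical(w, c) are assumed
    accessors to the triclass model.
    """
    n = len(words)
    vb = [dict() for _ in range(n)]
    # Base case: after the last word there is nothing left to tag.
    prev_tags = candidate_tags(words[n - 2]) if n > 1 else ["<s>"]
    for prev in prev_tags:
        for tag in candidate_tags(words[n - 1]):
            vb[n - 1][(prev, tag)] = 1.0
    # Backward recursion, mirroring the product from k = i to N-1 given above.
    for i in range(n - 2, -1, -1):
        prev_tags = candidate_tags(words[i - 1]) if i > 0 else ["<s>"]
        for prev in prev_tags:
            for tag in candidate_tags(words[i]):
                best = 0.0
                for nxt in candidate_tags(words[i + 1]):
                    score = (p_trigram(prev, tag, nxt)
                             * p_lexical(words[i + 1], nxt)
                             * vb[i + 1][(tag, nxt)])
                    best = max(best, score)
                vb[i][(prev, tag)] = best
    return vb
```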
III - Rules
III - 1 Correction of biases
Training a good model involves taking a lot of precautions but, however good the model is, some biases may have been introduced into it. As a baseline, we take a classical tri-class language model decoded with the Viterbi algorithm [6]. It was trained on newspaper corpora (Le Monde) of approximately 40 million words, with a tagset of 103 parts of speech. We have observed an error rate of
nearly 6%. More details can be found in [7]. Our purpose here is not to discuss this result but to analyse our errors. It is interesting to note that this error rate can be divided into three equal parts.
The first error set is the most interesting: it is due to only two kinds of error, both of them coming from a divergence of the model:
- The past participle été gets the tag Singular Masculine Noun (summer) after the auxiliary verb avoir, which is impossible in French. In order to cope with this problem, we have implemented the rule: «If the tag auxiliary, for any derivation of the verb avoir, appears just before the word été, then T(Singular Masculine Noun) = 0» (a possible encoding of this rule is sketched at the end of this section).
- The preposition en is wrongly tagged Adverb, and more often Personal Pronoun, before a country name or a noun. As this is also impossible in French, we have introduced a similar kind of rule into the rule set.
The second set of errors is due to the assignment of the OOV tag (Out-Of-Vocabulary). Proper nouns are involved in roughly two thirds of these errors. Most of these cases could be resolved by a few simple rules, but the situation is somewhat different here because the rules need to introduce a new tag (Proper Noun) into the lattice. In doing so, the VB scores become inconsistent, because this new tag was not taken into account in the computation of the probabilities. In practice, if we want to use this kind of rule, we have to produce two tags (OOV and Proper Noun) for all OOV units in the first step of the algorithm, which computes the Viterbi-Back scores.
For the last set, the frequencies of the errors are so low that it is not particularly interesting to write specific rules. We find here spelling mistakes, dictionary errors, under-represented structures and long-span dependencies.
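As an illustration, the first correction rule could be encoded along the following lines. The tag names (VA3S, VA3P, NMS), the rule interface and the way the tagger would call it are assumptions of ours, not the ECSta implementation.

```python
AVOIR_AUXILIARY_TAGS = {"VA3S", "VA3P"}           # hypothetical subset of "avoir" auxiliary tags

def rule_ete(words, tags, i):
    """Confidence rate T(c_i) for the tag proposed at position i (illustrative interface).

    words : the word sequence of the hypothesis
    tags  : the tag sequence of the hypothesis, tags[i] being the tag under evaluation
    """
    if (words[i] == "été" and i > 0
            and tags[i - 1] in AVOIR_AUXILIARY_TAGS
            and tags[i] == "NMS"):                # the noun reading "summer" is impossible here
        return 0.0
    return 1.0                                    # default: trust the stochastic model blindly

# T(c_i) then multiplies the local score:
# S(c_i) = P(c_i | c_{i-2}, c_{i-1}) * P(w_i | c_i) * T(c_i)
```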
III - 2 Under-represented structures and long-span dependencies
The two sample rules given above take a binary decision: the structure is either possible or impossible. In this part, we show how to estimate T(c_i) for under-represented structures or long-span dependencies which are completely inhibited by the statistical model. Unfortunately, we are still obliged to perform a manual error analysis. The first step is, of course, to describe as accurately as possible the under-represented structure or the dependency. If it is possible to describe exactly the appearance context AC(c_i) of the structure, so that no other structure can appear in this context, we can use the same kind of rule as before. Otherwise, we have to train T(c_i) for this context. This means counting, on a test corpus, NM(c_i), the number of mistakes made by the statistical model on AC(c_i). If TN(c_i) is the number of occurrences of AC(c_i), we have:
$$T(\lnot c_i) = 1 - \frac{NM(c_i)}{TN(c_i)} \qquad \text{and} \qquad T(c_i) = \frac{TN(c_i) + NM(c_i)}{TN(c_i)}$$
where ¬c_i represents all the tags proposed by the model except c_i. It is worth reminding the reader that we can use all the tags of the sequence to describe AC(c_i), contrary to the Viterbi algorithm, in which we only have access to the two previous tags. Using this particularity, it is possible to develop rules for long-span dependencies. One way to produce rules automatically is to use specific decision trees [8]; the questions used to train the tree must then concern a larger history than the one used by the n-gram model, since the constraints which can be expressed with a short history are already captured by that model.
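A direct transcription of the two formulas above, with toy counts; the function name and the example values are ours.

```python
def confidence_rates(tn, nm):
    """Confidence rates trained for an appearance context AC(c_i).

    tn : TN(c_i), number of occurrences of AC(c_i) in the test corpus
    nm : NM(c_i), number of mistakes made by the statistical model on AC(c_i)
    """
    t_ci = (tn + nm) / tn            # boost for the under-represented tag c_i
    t_not_ci = 1.0 - nm / tn         # discount for every other tag proposed by the model
    return t_ci, t_not_ci

# Toy example: the model errs 12 times over 200 occurrences of the context.
print(confidence_rates(200, 12))     # -> (1.06, 0.94)
```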
IV - Multi level interactions
IV - 1 Speech Recognition
LINGUISTIC AND SYNTACTIC
Another application where this mixed approach can be interesting is the linguistic disambiguation of the homophone outputs of a speech recognition system (Fig. 4). We begin by developing all the nodes in order to associate syntactic information with each word proposed by the acoustic component; we thus obtain a hypotheses lattice which can be used by a tagger.
[Figure 4: lattice of homophones — each column i contains the homophones w_{i,1}, .., w_{i,x} proposed by the acoustic component.]

In this graph, all the words w_{i,1}, .., w_{i,x} are homophones and can have different tags. As shown in Figure 5, the number of paths increases sharply: even if one could be tempted to develop all the paths for a classical tagging task, it is clearly impossible here. A classical approach can search for the best "linguistic" sequence (with an n-gram) or for the best syntactic sequence (with an n-class), but there is no real interaction between these two levels; even with a combination of n-gram and n-class, the computations of the probabilities are not completely interactive. With the kind of rules we propose, these two levels can be used jointly: for example, a rule can verify that a word sequence does not appear with a particular tag sequence. In an n-class model, this inter-connection of the syntactic and linguistic levels is only expressed through the probability P(w_i | c_i) [9].
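One possible shape for such a joint lexical/syntactic rule is sketched below; the forbidden pattern and the interface are illustrative assumptions of ours.

```python
# A multi-level rule inspects the word sequence and the tag sequence of a hypothesis
# at the same time, which a pure n-class model can only approximate through P(w_i | c_i).
FORBIDDEN_PATTERNS = [
    # (word pattern, tag pattern): this word sequence never carries this tag sequence.
    # Hypothetical example on the homophone pair les / lait (plural determiner
    # followed by a singular noun).
    (("les", "lait"), ("DETMP", "NMS")),
]

def rule_word_and_tags(words, tags):
    """Confidence rate T for the last tag of a partial hypothesis (illustrative)."""
    for word_pat, tag_pat in FORBIDDEN_PATTERNS:
        n = len(word_pat)
        if (len(words) >= n
                and tuple(words[-n:]) == word_pat
                and tuple(tags[-n:]) == tag_pat):
            return 0.0            # penalise the joint word/tag combination
    return 1.0
```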
Unfortunately, we have not yet run any tests, so we cannot present results here.
[Figure 5: hypotheses lattice in a speech task — each node is a (word, tag) pair w_{i,j}, c_k, so every homophone of column i appears with each of its possible tags.]

PHONOLOGY
In the framework of the decoding strategy, we develop specific rules based on phonological constraints on the lexical unit sequence. The possibility of combining the phonetic, lexical and syntactic levels in a single rule is very useful for handling some peculiarities of a given language, for example the liaison phenomenon in French. Figure 6 gives an example of this.

[Figure 6: A phonetic constraint — liaison between les and a vowel-initial noun. Competing nodes at positions i and i+1 include les/DETMP, les/PPOB, lait/NMS, enfants/NMP and enfant/NMS; path 1 is les/DETMP followed by enfants/NMP.]

In this case, the 3-class tagger prefers path 1 because of the frequent syntactic structure DETMP-NMP (masculine plural determiner followed by a masculine plural noun). If the acoustic component has not detected the liaison between les and enfants, ECSta penalises path 1 and finally chooses path 2.

IV - 2 Semantic level
Still with the aim of linguistically disambiguating the homophone outputs of a speech recognition system, we are looking at using rules at a semantic level. We plan to obtain such rules automatically, derived from word collocations.

V - Conclusion
In the debate that opposes the partisans of probabilistic methods to those who advocate a rule-based approach, we propose a truly mixed solution. The strategy presented in this paper permits the joint use of a set of rules and a stochastic model. The first benefit of our method is that it introduces the modelling of under-represented structures and long-span dependencies into a stochastic model. Another benefit, not yet fully developed, is the possibility of multi-level interactions within the same rule.

VII - Bibliography
[1] Brill E., A simple rule-based part of speech tagger, 3rd Conf. on Applied Natural Language Processing, Trento, Italy, April 1992, pp. 63-66.
[2] Jelinek F., Continuous speech recognition by statistical methods, Proceedings of the IEEE, vol. 64, April 1976, pp. 532-556.
[3] Derouault A.M., Merialdo B., Natural language modeling for phoneme-to-text transcription, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-8, No. 6, Nov. 1986.
[4] Paul D.B., Algorithms for an optimal A* search and linearizing the search in the stack decoder, IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 693-696, 1991.
[5] Hart P., Nilsson N., Raphael B., A formal basis for the heuristic determination of minimum cost paths, IEEE Trans. Systems Sci. Cybern., 1968, pp. 100-107.
[6] Jardino M., Adda G., Automatic determination of a stochastic bi-gram class language model, Proc. Grammatical Inference and Applications, 2nd Int. Coll. ICGI-94, Spain, Sept. 1994, pp. 57-65.
[7] El-Bèze M., Spriet T., Intégration de contraintes syntaxiques dans un système d'étiquetage probabiliste, TAL, January 1996.
[8] Kuhn R., De Mori R., The application of semantic classification trees to natural language understanding, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, No. 5, May 1995.
[9] Mérialdo B., Tagging text with a probabilistic model, ICASSP 1991, Toronto, vol. S2, pp. 809-812.