INTEGRATION OF TWO STOCHASTIC CONTEXT-FREE GRAMMARS

Anna Corazza
Università di Milano – Dipartimento di Tecnologie dell'Informazione
via Bramante 65 – 26013 Crema, Italy
[email protected]

ABSTRACT

Some problems in speech and natural language processing involve combining two information sources, each modeled by a stochastic context-free grammar. Such cases include parsing the output of a speech recognizer with a context-free language model, finding the best solution among all possible ones in language generation, and preserving ambiguity in machine translation. In these cases, usually at least one of the two grammars is non-recursive. In order to find the best solution while taking into account both grammars, the two probabilities must be integrated. One of the most important advantages of using a non-recursive context-free model is its compactness, so it is important to exploit this property when searching for the solution. In this paper, an algorithm aiming at this goal is presented, based on a recent work [1] that considers the non-probabilistic case.

1. INTRODUCTION

One important advantage of representing regular languages, including finite languages, by a context-free grammar (CFG) is compactness. The N-best lists or word lattices output by a speech recognizer are a case in which a regular language can be represented much more compactly by a non-recursive CFG [2], as the different elements of the language have large substrings in common. In addition, probabilities can be added to the CFG in order to take into account not only the acoustic score of each word, but also the fact that a given substring occurs many times in the N-best list and is therefore more likely. The representation so obtained is processed by a linguistic language model, which could be an unrestricted stochastic CFG (SCFG). In these cases, it would be of great interest to have an algorithm able to exploit the compactness of the language model to find the best solution, either as the most likely string of words or as the best derivation compatible with both SCFGs, without considering each string separately, but by efficiently exploiting the factorization given by the non-recursive SCFG representation.

Recently, approaches based on the intersection of two models have also been proposed for language generation [3], where a forest represented by a non-recursive CFG is processed by another language model. In that case, stochastic bigrams were used, but the approach could be extended to a context-free language model to find the most likely string of words realizing a given conceptual representation. This case too is an instance of the problem of integrating two SCFGs, at least one of which is non-recursive. Finally, another application is machine translation, where the aim of preserving ambiguity suggests the intersection of two non-recursive CFG languages [4]: in this case as well, a statistical version of the algorithm could help in ranking the fluency of the different solutions. In [1] an efficient solution is presented for the case in which neither of the two CFGs is stochastic. Interestingly, it performs the parsing of the non-recursive CFG with a pushdown automaton (PDA) and partitions the computation into segments centered around each scan transition. When the grammars are stochastic, this partitioning can be effectively used to define probabilities which are then combined following a search strategy driven by the second SCFG. As the skeleton of the approach is the same, the computational complexity in practical cases is of the same order of magnitude.

The following section discusses the definitions and the notation used in the paper. It is followed by a section on the computation of the probabilities derived from the input grammar. In Section 4, the combination of such probabilities with the other SCFG is presented. A discussion section closes the paper.

2. NOTATION AND DEFINITIONS

A SCFG can be defined as a 5-tuple [5] {Σ, N, S, R, P}, where Σ and N are the finite sets of terminal and nonterminal symbols respectively, with Σ ∩ N empty; S ∈ N is the start symbol; R is the set of rewriting rules; and P gives the corresponding probabilities. The two SCFGs are the non-recursive input grammar, G^i = {Σ, N^i, S^i, R^i, P^i}, and the parsing grammar, G^p = {Σ, N^p, S^p, R^p, P^p}, which share Σ. Both grammars are proper, that is, the probabilities of the rules rewriting each nonterminal symbol sum to 1, and consistent, that is, the probability of the language generated by the grammar is 1. Moreover, G^i contains no epsilon rules and only one rule rewrites its start symbol, while G^p is in Chomsky normal form (i.e., the right-hand side of each rule consists of either two nonterminals or one terminal). In the following, a, b, c, ... refer to symbols in Σ, while the symbols in N^i and in N^p are indicated by A^i, B^i, C^i, ... and A^p, B^p, C^p, ... respectively; α^i, β^i, γ^i, ... and α^p, β^p, γ^p, ... indicate strings of terminal and nonterminal symbols in the two grammars.
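To make the notation concrete, here is a minimal sketch of an SCFG as data (the Python representation, field names, and properness check are my own illustration, not part of the paper):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    lhs: str      # nonterminal being rewritten
    rhs: tuple    # tuple of terminal/nonterminal symbols
    prob: float   # Pr(lhs -> rhs)

@dataclass
class SCFG:
    terminals: set      # Sigma, shared by both grammars
    nonterminals: set   # N^i or N^p
    start: str          # S^i or S^p
    rules: list = field(default_factory=list)

    def is_proper(self, eps=1e-9):
        # Properness: the rule probabilities for each nonterminal sum to 1.
        for A in self.nonterminals:
            total = sum(r.prob for r in self.rules if r.lhs == A)
            if abs(total - 1.0) > eps:
                return False
        return True
```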

In order to integrate the two SCFGs, the product of the two probabilities is associated to each hypothesis, as the two models are statistically independent. In this way, a new language model is defined on the intersection of the two languages, but the result is not consistent. In fact, if we call p^i(·) and p^p(·) the probability distributions induced by G^i and G^p on the languages L^i and L^p respectively, and L = L^i ∪ L^p (with p^i = 0 outside L^i and p^p = 0 outside L^p), then:

    Σ_{w ∈ L^i} p^i(w) = 1;        Σ_{w ∈ L^p} p^p(w) = 1

    Σ_{w ∈ L} p^i(w) · p^p(w)  ≤  Σ_{w^i ∈ L^i} Σ_{w^p ∈ L^p} p^i(w^i) · p^p(w^p) = 1
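To make the inconsistency concrete, consider a toy case (my own example, not from the paper): let L^i = {ab} with p^i(ab) = 1, and L^p = {ab, cd} with p^p(ab) = 0.4 and p^p(cd) = 0.6. The only string with a nonzero product score is ab, with score 1 · 0.4 = 0.4, so the scores sum to 0.4 < 1; dividing by the normalization factor N = 0.4 restores a proper distribution.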

Whenever the score is only used to compare hypotheses, its inconsistency is irrelevant. However, it becomes important for obtaining measures such as perplexity, or for integrating this probability with others. In Section 4, the computation of the corresponding normalization factor is also presented. The computation is divided into two steps:


Input probabilities. The parsing of G^i is based on the construction of a PDA with a stack of bounded size. The parsing is then partitioned into a sequence of segments, whose probabilities are the building blocks of the subsequent computations.

Parsing probabilities. The input probabilities are then combined by applying CKY parsing to G^p.

3. THE INPUT PROBABILITIES

As said before, the input grammar parsing is based on a PDA with a stack of bounded size. The PDA is a 5-tuple (Σ, N^s, X_init, X_final, ∆), where Σ is the same terminal symbol set as in G^i and G^p; N^s is a set of stack symbols, each referring to a rule of G^i and a dot position in its right-hand side (RHS), in general [A^i → α^i • β^i]; X_init is the initial stack symbol [S^i → •α^i]; and X_final is the final stack symbol [S^i → α^i•] (S^i → α^i being the only rule rewriting S^i in G^i). The rule set ∆ contains all the PDA rules: in this section, these rules are described and a probability is associated to each of them, to be used in the computation.

These probabilities are derived from the input grammar probabilities, following the principle of accounting for the probability of each rule as soon as the rule itself is introduced. In this way, while preserving the same probability distribution on the language, the search for the best solution can be focused by making use of all the information available at a given point. For simplicity, in the following, Q, X, Y, Z indicate symbols of the PDA, and q, x, y, z strings of PDA symbols. For each pair of input grammar rules A^i → α^i B^i β^i and B^i → γ^i, the rule set of the PDA contains the following rule (note that it corresponds to a predictor step):

    [A^i → α^i • B^i β^i] → [A^i → α^i • B^i β^i] [B^i → •γ^i]    (1)

with probability Pr(B^i → γ^i). This is a push rule, as it has the form X → XY. For the same pair of rules, ∆ contains a unit-probability pop rule (of the form XY → Q), corresponding to a completer step:

    [A^i → α^i • B^i β^i] [B^i → γ^i •] → [A^i → α^i B^i • β^i]    (2)

Finally, the scan transitions: for each rule A^i → α^i a β^i, the following rule is included in the PDA rule set:

    [A^i → α^i • a β^i] →_a [A^i → α^i a • β^i]    (3)

that is, X →_a Y (the subscript indicating that a is scanned), with probability one or with the acoustic probability Pr(a) of the word, or part of word, a.
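The construction of ∆ can be sketched as follows, building on the Rule objects above (the dotted-item encoding and the uppercase-nonterminal convention are assumptions of this sketch, not of the paper):

```python
def build_pda_rules(rules):
    # rules: a list of Rule(lhs, rhs, prob) objects (see the sketch in
    # Section 2).  A stack symbol is a dotted item (lhs, rhs, dot).
    # Nonterminals are assumed to be uppercase strings (my convention).
    push, pop, scan = [], [], []
    for r in rules:
        for d, sym in enumerate(r.rhs):
            item = (r.lhs, r.rhs, d)
            if sym.isupper():                      # nonterminal at the dot
                for r2 in rules:
                    if r2.lhs == sym:
                        start = (r2.lhs, r2.rhs, 0)
                        done = (r2.lhs, r2.rhs, len(r2.rhs))
                        # (1) push (predictor), weight Pr(B -> gamma)
                        push.append((item, (item, start), r2.prob))
                        # (2) pop (completer), weight 1
                        pop.append(((item, done), (r.lhs, r.rhs, d + 1), 1.0))
            else:                                  # terminal at the dot
                # (3) scan, weight 1 (or the acoustic score Pr(a))
                scan.append((item, sym, (r.lhs, r.rhs, d + 1), 1.0))
    return push, pop, scan
```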

3.1. Input probability computations

One of the most interesting points in [1] is the factorization of the parsing of the input grammar into segments x →_a y, where x is the top part of the stack before the segment and y after it: only the symbols involved in the transitions belonging to the segment are considered. Moreover, the hypothesis is made that these segments are minimal and that no useless transition is inserted. If these conditions hold, then each segment is a sequence of (zero or more) pushes, one scan of a, and a sequence of (zero or more) pops, and at least one of x and y has unitary size. Following the deduction system defined in Figure 1 of [1], a set of probabilities associated to these segments is computed, namely Pr(x, y|a), the probability associated to the segment x →_a y. All the triples which cannot be derived in this way correspond to invalid segments, so the corresponding probabilities are null. As the PDA stack size is bounded, the number of these probabilities is also bounded. Moreover, not all combinations of symbols in the stack are allowed, as can be seen from the stack rules above. In addition, only the non-null probabilities are considered in the computation, which limits the complexity in the average case.


3.1.1. Initialization

All probabilities not explicitly considered are null. For every scan production X →_a Y in ∆, add the corresponding probability to Pr(X, Y|a):

    Pr(X, Y|a) = Pr(X →_a Y)    (4)

As X →_a Y is a scan production, X has the form [A^i → α^i • a β^i]. Therefore, for every X, there is only one terminal symbol a and only one PDA symbol Y for which the relation holds.
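Continuing the sketch, the initialization fills the segment table from the scan transitions, as in Equation (4); the data layout is my own:

```python
from collections import defaultdict

def init_segment_probs(scan):
    # scan: the weighted scan transitions from build_pda_rules above.
    # Returns Pr(x, y | a) keyed by (stack-before, stack-after, terminal);
    # stacks are tuples of PDA symbols (of length one at this stage).
    seg = defaultdict(float)
    for (X, a, Y, p) in scan:
        seg[((X,), (Y,), a)] += p    # Equation (4)
    return seg
```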


3.1.2. Bilateral rule

If S(a) is the set composed of all and only the pairs (X, Y) such that Pr(Q → QX) > 0, Pr(X, Y|a) > 0 and Pr(QY → Z) > 0, then it is possible to move from Q to Z by these three moves. Therefore:

    Pr(Q, Z|a) = Σ_{(X,Y) ∈ S(a)} Pr(Q → QX) · Pr(X, Y|a) · Pr(QY → Z)    (5)

The pairs (X, Y) belong to S(a) if and only if the corresponding term is greater than zero. Note that in (5) it is not necessary to consider the terms deriving from (4), as the two cases are mutually exclusive:

    Pr(Q →_a Z) > 0  ⇒  Q : [B^i → α^i • a β^i]
    Pr(Q → QX) > 0   ⇒  Q : [B^i → α^i • A^i β^i]    (6)

where Q cannot assume both forms at the same time, since a ∈ Σ, A^i ∈ N^i, and Σ ∩ N^i is empty. As the grammar is non-recursive, it is possible to find a total ordering of N^i such that a symbol A^i precedes B^i whenever A^i appears in the RHS of some rule rewriting B^i. Every stack symbol X is associated to a rule in R^i and hence to its left-hand side (LHS) symbol. Therefore, an ordering of the stack symbols can be found such that, if Q follows X, then the LHS of Q follows the LHS of X and cannot appear in the RHS of the rule associated to X. In this case, Pr(X → XQ) = 0 and the computation of Pr(X, Y|a), for all Y, does not depend on Pr(Q, Z|a), for any a or Z. This implies that the computation can safely be performed following this order. Up to now, only stacks of unitary size have been considered, while the following rules involve longer stacks; therefore, the bilateral rule can safely be applied before the following two rules. Last but not least, note that the symmetrical action of this rule is crucial in maintaining the minimality property of the PDA stack, as no unused symbol is introduced in the new triple.
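A possible rendering of the bilateral step under the ordering just described (assuming the structures from the earlier sketches; `order` is assumed to list the unit stack symbols compatibly with the grammar ordering):

```python
def bilateral(seg, push, pop, order):
    # seg: Pr(x, y | a) so far; order: unit stack symbols Q sorted so that
    # Pr(X, .|a) is complete before Q is processed (possible because the
    # input grammar is non-recursive).
    push_p = {(Q, X): p for (Q, (_, X), p) in push}        # Pr(Q -> QX)
    pop_p = {(Q, Y): (Z, p) for ((Q, Y), Z, p) in pop}     # Pr(QY -> Z)
    for Q in order:
        for ((xs, ys, a), p_seg) in list(seg.items()):
            if len(xs) != 1 or len(ys) != 1:
                continue
            X, Y = xs[0], ys[0]
            if (Q, X) in push_p and (Q, Y) in pop_p:
                Z, p_pop = pop_p[(Q, Y)]
                # Equation (5): push Q -> QX, segment X ->a Y, pop QY -> Z
                seg[((Q,), (Z,), a)] += push_p[(Q, X)] * p_seg * p_pop
    return seg
```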

3.1.3. Left and right rules

For the minimality condition, a pop rule can be applied to the second stack of a triple only if a push condition justifies its presence in the first stack, or if the symbol on the other side is derived from the preceding step; this is true if it is the result of a pop or scan transition. Only in this case can a pop be performed on the second stack. Therefore, whenever Pr(P1 P2 → X) > 0 or Pr(P3 →_b X) > 0 or, equivalently, if X has the form [A^i → α^i • β^i] with |α^i| > 0:

    Pr(Q x X, Z|a) = Pr(x X, Y|a) · Pr(QY → Z)    (7)

which, of course, becomes interesting as long as the left-hand side of Equation (7) is greater than zero. As each computation involves an increment of the stack size, it can safely be halted when no new probabilities can be computed. For reasons analogous to the preceding case, a push transition can be added to the first stack only in the case of Equation (5) or if the corresponding symbol in the second stack is part of the following step. Therefore, whenever Y → Y P1 or Y →_b P2 belongs to ∆:

    Pr(Q, Q z Y|a) = Pr(X, z Y|a) · Pr(Q → QX)    (8)

which, again, is relevant only when the right-hand side is greater than zero. The condition on Y can also be imposed by giving its form: [A^i → α^i • β^i], with |β^i| > 0. With relations (7) and (8), the computation is performed using probabilities of shorter stacks; all of them can be computed in finite time, as the stack size is bounded. The preceding relations, on the other side, allow such probabilities to be computed from G^i. Note that, for every triple, all possible compatible derivation segments are considered, not only the best one.
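The right rule (8) might be rendered as follows (the left rule (7) is symmetric, growing the first stack with a final pop); processing segments by increasing size of the second stack is a simplification this sketch assumes:

```python
def right_rule(seg, push, max_stack):
    # Equation (8): from Pr(X, zY | a) and a push rule Q -> QX, derive
    # Pr(Q, QzY | a).  Segments are visited by increasing size of the
    # second stack, so each source probability is complete when used.
    push_p = [(Q, X, p) for (Q, (_, X), p) in push]
    for size in range(1, max_stack):
        for ((xs, ys, a), p_seg) in list(seg.items()):
            if len(xs) != 1 or len(ys) != size:
                continue
            for (Q, X, p_push) in push_p:
                if X == xs[0]:
                    seg[((Q,), (Q,) + ys, a)] += p_seg * p_push
    return seg
```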

4. THE PARSING PROBABILITIES

The segment probabilities computed in the preceding section are now combined, respecting the constraints imposed by the parsing grammar G^p. While [1] considers two parsing strategies, bottom-up (CKY) and top-down with bottom-up filtering (Earley), the work presented here focuses on the former, for which [1] presents experimental results showing that the number of hypotheses considered is computationally acceptable. While traditionally the parsing algorithm works on an input string and uses terminal symbols of the grammar, together with their positions in the input, to initialize the CKY table, here the triples x, y|a defined above are used instead. Such triples describe the transformation of the PDA stack from x to y while scanning the terminal symbol a. In general, the computation of the deduction system used in [1] can be followed, with all the implementation suggestions given there, such as representing stacks by tries and limiting the computation to the triples involving legal stacks. In fact, only the hypotheses obtained from the deduction system are considered in the probability computation; all the others are impossible, and their probability is left at zero.

The algorithm can be adapted to solve different problems. First of all, it is interesting to find the total score of the intersection of the two languages generated by G^i and G^p, as this gives the normalization factor N that allows the score to be transformed into a probability distribution. This problem is dealt with by computing N = S_N(S^p, X_init, X_final), as shown in the top part of Table 1. When looking for an optimal solution, one of two approaches is usually followed: maximum probability derivation (best derivation) or maximum probability string (best string). For the former, a Viterbi-like strategy [5] is used to compute S_V(S^p, X_init, X_final). The latter, on the other hand, considers that the probability of each string is given by the sum of the probabilities of all its derivations in the grammar, and requires an Inside-like approach. The quantities used to solve these problems are called scores, namely S_N, S_V, and S_I respectively, as they do not sum to one over all possible events. Nevertheless, it is easy to derive a probability distribution from each of them by using N. The three algorithms are very similar, differing only in the application of the maximum operator instead of the sum to the probabilities of partial hypotheses. In general, the final solution is given by the score computed on the combination (S^p, X_init, X_final) and can be computed by a dynamic programming approach based on the relations presented in Table 1.
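As an illustration, the normalization score S_N of Table 1 can be computed by memoized recursion over triples (A^p, x, y); the helper layout is hypothetical, and the sketch assumes the recursion bottoms out (in general, an agenda-driven computation over the deduction items of [1] is needed; S_V and S_I replace the sum with a max where appropriate):

```python
from functools import lru_cache

def make_SN(term_rules, bin_rules, seg, stack_symbols):
    # term_rules: {(A, a): Pr(A -> a)}, bin_rules: {(A, B, C): Pr(A -> B C)}
    # for the parsing grammar G^p in Chomsky normal form; seg holds the
    # segment probabilities Pr(x, y | a) from Section 3.
    @lru_cache(maxsize=None)
    def SN(A, x, y):
        # Terminal relation: A^p covers exactly one segment x ->a y.
        total = sum(p * seg.get((x, y, a), 0.0)
                    for (A2, a), p in term_rules.items() if A2 == A)
        if len(x) > 1:   # relation for S_N(A^p, qx, z)
            q, rest = x[:1], x[1:]
            for (A2, B, C), p in bin_rules.items():
                if A2 == A:
                    for Y in stack_symbols:
                        total += p * SN(B, rest, (Y,)) * SN(C, q + (Y,), y)
        if len(y) > 1:   # relation for S_N(A^p, x, qz)
            q, rest = y[:1], y[1:]
            for (A2, B, C), p in bin_rules.items():
                if A2 == A:
                    for Y in stack_symbols:
                        total += p * SN(B, x, q + (Y,)) * SN(C, (Y,), rest)
        return total
    return SN
```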

Normalization factor: N = S_N(S^p, X_init, X_final)

    S_N(A^p, x, y)  = Σ_a Pr(A^p → a) · Pr(x, y|a)
    S_N(A^p, qx, z) = Σ_{B^p, C^p, y} Pr(A^p → B^p C^p) · S_N(B^p, x, y) · S_N(C^p, qy, z)
    S_N(A^p, x, qz) = Σ_{B^p, C^p, y} Pr(A^p → B^p C^p) · S_N(B^p, x, qy) · S_N(C^p, y, z)

Best derivation: S_V(S^p, X_init, X_final)

    S_V(A^p, x, y)  = max_a Pr(A^p → a) · Pr(x, y|a)
    S_V(A^p, qx, z) = max_{B^p, C^p, y} Pr(A^p → B^p C^p) · S_V(B^p, x, y) · S_V(C^p, qy, z)
    S_V(A^p, x, qz) = max_{B^p, C^p, y} Pr(A^p → B^p C^p) · S_V(B^p, x, qy) · S_V(C^p, y, z)

Best string: S_I(S^p, X_init, X_final)

    S_I(A^p, x, y)  = max_a Pr(A^p → a) · Pr(x, y|a)
    S_I(A^p, qx, z) = Σ_{B^p, C^p, y} Pr(A^p → B^p C^p) · S_I(B^p, x, y) · S_I(C^p, qy, z)
    S_I(A^p, x, qz) = Σ_{B^p, C^p, y} Pr(A^p → B^p C^p) · S_I(B^p, x, qy) · S_I(C^p, y, z)

Table 1. Parsing probability relations.

5. CONCLUSIONS AND FUTURE WORK

As the intersection of two unrestricted CFGs is known to be intractable, the problem must be restricted by imposing further constraints on at least one of the two grammars. The hypothesis that one of the two CFGs is non-recursive is acceptable in most cases of practical interest. On the other hand, statistical approaches have been widely adopted for their effectiveness in capturing behaviors that are difficult to model in other ways. In this paper an algorithm has been presented that integrates the probability distributions induced by two SCFGs without sacrificing their compactness. This algorithm makes it possible, on one side, to adopt context-free language models, which are more informative and compact than regular models such as N-grams, and, on the other, to guide the search with all the information available in the form of probabilities, also on partial hypotheses. This latter aspect can be further explored by considering different parsing strategies, such as Earley parsing, which can prune more hypotheses and thus attain better efficiency in the search for the best solution.

6. REFERENCES

[1] M.J. Nederhof and G. Satta, "Parsing Non-Recursive Context-Free Grammars," in Proceedings of ACL-02, Philadelphia, PA, 2002.

[2] A. Corazza and A. Lavelli, "An N-best representation for bidirectional parsing strategies," in Proc. of the AAAI-94 Workshop on the Integration of Natural Language and Speech Processing, Seattle, Washington, USA, July 1994, pp. 7–14.

[3] I. Langkilde, "Forest-based statistical sentence generation," in 6th Applied Natural Language Processing Conference and 1st Meeting of the North American Chapter of the ACL, 2000, pp. 170–177.

[4] K. Knight and I. Langkilde, "Preserving Ambiguities in Generation via Automata Intersection," in National Conference on Artificial Intelligence (AAAI), 2000.

[5] C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, USA, 2000.
