Using symbol clustering to improve probabilistic automaton inference

Pierre Dupont1 and Lin Chase2

1 EURISE, Université Jean Monnet

23, rue P. Michelon, 42023 Saint-Etienne Cedex – France. [email protected]

2 LIMSI/CNRS

B.P. 133, 91403 Orsay Cedex – France. [email protected]

Abstract. In this paper we show that clustering alphabet symbols before PDFA inference is performed reduces perplexity on new data. This result is especially important in real tasks, such as spoken language interfaces, in which data sparseness is a significant issue. We describe the application of the ALERGIA algorithm combined with an independent clustering technique to the Air Travel Information System (ATIS) task. A 25% reduction in perplexity was obtained. This result outperforms a trigram model under the same simple smoothing scheme.

1 Introduction

Inference of a deterministic finite automaton (DFA) from positive and negative data can be solved by the RPNI algorithm, proposed independently by Trakhtenbrot et al. [16] and by Oncina et al. [13]. This algorithm was used by Lang in his extensive experimental study of learning random deterministic automata from sparse samples [10]. An adapted version of this algorithm proved to be successful in the recent Abbadingo competition [9]. However, the use of the RPNI algorithm and other grammatical inference techniques requires several adaptations in order to be feasible for real applications such as speech recognition interfaces. Since most databases do not include negative information, the learning can be controlled by probabilistic information instead. Several inference algorithms for probabilistic automata are known [15, 3, 14]. In particular, the ALERGIA algorithm proposed by Carrasco and Oncina [3] is the probabilistic version of the RPNI algorithm. As the purpose of the machines built by these algorithms is not to define hard decision boundaries, the quality of learning is not estimated by classification errors on new data. Instead we use a distance measure, such as the Kullback-Leibler divergence, between the learned distribution and the target distribution. Because the target distribution is unknown in practice, this distance measure is approximated by the test set perplexity (see Section 2 for more details).

Another problem arises from the fact that simulated learning curves do not accurately predict the performance on actual data sets of limited size. First, learning curves usually depend on the size of the target machine, which is unknown. Second, real alphabets, at least for speech applications, typically contain up to several thousand symbols instead of just a few. Thus only an extremely small fraction of all strings up to the maximal observed length is actually available. This data sparseness problem is a key concern to researchers who deal with human language applications. In this context, clustering alphabet symbols can lead to a denser training set at the cost of a reduced ability to discriminate between individual symbols. The present work aims to show that the performance of the ALERGIA algorithm improves when a clustering algorithm is used before the actual inference, as illustrated by the sketch below.
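As a minimal illustration, the following Python sketch (the class names and word-to-class mapping are hypothetical, not taken from the paper) shows how a training sample over a large word alphabet can be rewritten over a smaller alphabet of symbol classes before inference; the mapping itself may come from any independent clustering technique, and the inference algorithm is left unchanged.

    # Hypothetical illustration: map each word of the training sample to its
    # cluster label, then run PDFA inference on the reduced alphabet.
    word_to_class = {
        "boston": "CITY", "denver": "CITY",     # assumed clusters, for illustration
        "monday": "DAY",  "tuesday": "DAY",
        # words not listed here stay in their own singleton class
    }

    def relabel(sample, word_to_class):
        """Rewrite every string (a list of words) over the class alphabet."""
        return [[word_to_class.get(w, w) for w in string] for string in sample]

    training_sample = [["flight", "to", "boston", "monday"],
                       ["flight", "to", "denver", "tuesday"]]

    # Both strings collapse to the same class sequence, densifying the sample.
    print(relabel(training_sample, word_to_class))
    # [['flight', 'to', 'CITY', 'DAY'], ['flight', 'to', 'CITY', 'DAY']]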

2 Definitions and notations

A probabilistic DFA (PDFA) is a 6-tuple $(Q, \Sigma, \delta, q_0, \gamma, \tau)$ where $Q$ is a finite set of states, $\Sigma$ is an alphabet, $\delta$ is the transition function, i.e. a mapping from $Q \times \Sigma$ to $Q$, $q_0$ is the initial state, $\gamma$ is the next symbol probability function, i.e. a mapping from $Q \times \Sigma$ to $[0, 1]$, and $\tau$ is the end of string probability function, i.e. a mapping from $Q$ to $[0, 1]$. The probability functions must satisfy the following constraints:

\[
\gamma(q, a) = 0 \quad \text{if } \delta(q, a) = \emptyset, \qquad
\sum_{a \in \Sigma} \gamma(q, a) + \tau(q) = 1, \quad \forall q \in Q
\]

The probability $P_A(x)$ of generating a string $x = x_1 \ldots x_n$ from a PDFA $A = (Q, \Sigma, \delta, q_0, \gamma, \tau)$ is defined as

\[
P_A(x) =
\begin{cases}
\displaystyle \prod_{i=1}^{n} \gamma(q_i, x_i)\, \tau(q_{n+1}) & \text{if } \delta(q_i, x_i) \neq \emptyset,\ \text{with } q_{i+1} = \delta(q_i, x_i) \text{ for } 1 \le i \le n \text{ and } q_1 = q_0 \\
0 & \text{otherwise}
\end{cases}
\]
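A direct transcription of this definition, assuming a simple dictionary-based encoding of the PDFA (the data structures are illustrative, not taken from the paper):

    # Illustrative PDFA encoding: delta and gamma are dictionaries indexed by
    # (state, symbol), tau by state.
    def string_probability(x, q0, delta, gamma, tau):
        """Return P_A(x) for the string x (a sequence of symbols)."""
        q = q0
        p = 1.0
        for a in x:
            if (q, a) not in delta:      # undefined transition: probability 0
                return 0.0
            p *= gamma[(q, a)]           # next-symbol probability
            q = delta[(q, a)]
        return p * tau[q]                # end-of-string probability in the final state

    # Example: a two-state PDFA over {a, b}
    delta = {(0, 'a'): 1, (1, 'b'): 1}
    gamma = {(0, 'a'): 1.0, (1, 'b'): 0.5}
    tau = {0: 0.0, 1: 0.5}
    print(string_probability(['a', 'b'], 0, delta, gamma, tau))   # 0.25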

Our definition of a probabilistic automaton is equivalent to a stochastic regular grammar used as a string generator. Thus, $\sum_{x \in \Sigma^*} P_A(x) = 1$. Note that some works on the learning of discrete distributions use distributions defined on $\Sigma^n$ (that is, $\sum_{x \in \Sigma^n} P(x) = 1$, for any $n \ge 1$) instead of $\Sigma^*$ (see for instance [1, 7]).

The probabilistic automaton $A/\pi$ denotes the automaton derived from the probabilistic automaton $A$ with respect to the partition $\pi$ of $Q$, also called the quotient automaton $A/\pi$. It is obtained by merging the states of $A$ belonging to the same subset in $\pi$. When $q$ results from the merging of the states $q'$ and $q''$, the following equalities must hold:

\[
\gamma(q, a) = \gamma(q', a) + \gamma(q'', a), \ \forall a \in \Sigma \qquad\quad \tau(q) = \tau(q') + \tau(q'')
\]
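A small sketch of these equalities, reusing the hypothetical dictionary encoding of the previous example (the function name and encoding are assumptions, not the paper's):

    # Illustrative merge: fold the probability mass of state q2 into state q1,
    # applying the equalities above.
    def merge_probabilities(q1, q2, gamma, tau):
        for (q, a) in list(gamma):
            if q == q2:
                gamma[(q1, a)] = gamma.get((q1, a), 0.0) + gamma.pop((q, a))
        tau[q1] = tau.get(q1, 0.0) + tau.pop(q2, 0.0)

    gamma = {(0, 'a'): 1.0, (1, 'b'): 0.5}   # the two-state PDFA above
    tau = {0: 0.0, 1: 0.5}
    merge_probabilities(0, 1, gamma, tau)
    print(gamma, tau)   # {(0, 'a'): 1.0, (0, 'b'): 0.5} {0: 0.5}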

Let PPTA($I_+$) denote the probabilistic prefix tree acceptor built from a positive sample $I_+$. Let $C(q)$ denote the count of state $q$, that is the number of times the state $q$ was used while generating $I_+$ from PPTA($I_+$). Let $C(q, \varepsilon)$ denote the number of times a string of $I_+$ ended on $q$ ($\varepsilon$ denotes the empty string). Let $C(q, a)$ denote the count of the transition $\delta(q, a)$ in PPTA($I_+$). The PPTA($I_+$) is the maximal likelihood estimate built from $I_+$. In particular, for PPTA($I_+$) the probability estimates are

\[
\hat{\gamma}(q, a) = \frac{C(q, a)}{C(q)} \qquad \text{and} \qquad \hat{\tau}(q) = \frac{C(q, \varepsilon)}{C(q)}.
\]

Let Lat(PPTA($I_+$)) denote the lattice of automata which can be derived from PPTA($I_+$). Let $A$ be a target PDFA and $\hat{A}$ a hypothesis PDFA; $D(P_A \parallel P_{\hat{A}})$ denotes the Kullback-Leibler divergence or cross entropy:

\[
D(P_A \parallel P_{\hat{A}}) = \sum_{x \in \Sigma^*} P_A(x) \log \frac{P_A(x)}{P_{\hat{A}}(x)}
\]

The divergence can be rewritten as follows:

\[
D(P_A \parallel P_{\hat{A}})
= \sum_{x \in \Sigma^*} P_A(x) \log \frac{1}{P_{\hat{A}}(x)} + \sum_{x \in \Sigma^*} P_A(x) \log P_A(x)
= E_{P_A}\!\left[\log \frac{1}{P_{\hat{A}}(x)}\right] - H(P_A)
\]
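A minimal sketch, assuming strings are stored as tuples of symbols, of how the PPTA counts and the maximum likelihood estimates $\hat{\gamma}$ and $\hat{\tau}$ can be computed from a positive sample (the data structures are illustrative):

    # Illustrative PPTA construction (the encoding is an assumption, not the paper's).
    from collections import defaultdict

    def build_ppta(sample):
        """Build prefix-tree counts C(q), C(q, a), C(q, eps) from a list of strings."""
        delta = {}                      # (state, symbol) -> state
        count_state = defaultdict(int)  # C(q)
        count_trans = defaultdict(int)  # C(q, a)
        count_end = defaultdict(int)    # C(q, eps)
        next_state = 1                  # state 0 is the root (empty prefix)
        for string in sample:
            q = 0
            count_state[q] += 1
            for a in string:
                if (q, a) not in delta:
                    delta[(q, a)] = next_state
                    next_state += 1
                count_trans[(q, a)] += 1
                q = delta[(q, a)]
                count_state[q] += 1
            count_end[q] += 1
        gamma = {qa: c / count_state[qa[0]] for qa, c in count_trans.items()}
        tau = {q: count_end[q] / count_state[q] for q in count_state}
        return delta, gamma, tau

    delta, gamma, tau = build_ppta([('a', 'b'), ('a',), ('a', 'b')])
    # gamma[(0, 'a')] == 1.0, gamma[(1, 'b')] == 2/3, tau[1] == 1/3, tau[2] == 1.0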

The first term above denotes the expectation of $\log \frac{1}{P_{\hat{A}}(x)}$ according to the distribution $P_A$, and $H(P_A)$ denotes the entropy of $P_A$. In other words, the divergence is the additional number of bits¹ needed to encode data generated from $P_A$ when the optimal code is used for $P_{\hat{A}}$ instead of $P_A$. Note that the second term does not depend on the hypothesis. Thus the first term can be used to measure the relative quality of several hypotheses.

In practice, the target distribution is unknown. Instead, an independent sample $S$, which is assumed to have been drawn according to the distribution $P_A$, can be used. Let $\bar{P}_S(x)$ denote the empirical distribution computed from the sample $S$ containing $|S|$ strings $x$ and $\|S\|$ symbols $x_i$. Letting $C_S(x)$ denote the count of the string $x$ in $S$, one can write

\[
E_{P_A}\!\left[\log \frac{1}{P_{\hat{A}}(x)}\right]
\simeq \sum_{x \in S} \bar{P}_S(x) \log \frac{1}{P_{\hat{A}}(x)}
= -\frac{1}{|S|} \sum_{x \in S} C_S(x) \log P_{\hat{A}}(x)
= -\frac{1}{\|S\|} \sum_{i=1}^{\|S\|} \log P_{\hat{A}}(x_i \mid q_i)
\]

Thus the quality measure is the average log-likelihood of $x$ according to the distribution $P_{\hat{A}}$ computed on the sample $S$. Most commonly used is the sample perplexity $PP$ given by

\[
PP = 2^{\,-\frac{1}{\|S\|} \sum_{i=1}^{\|S\|} \log P_{\hat{A}}(x_i \mid q_i)}
\]

¹ Base 2 is assumed for the log function.

The minimal perplexity $PP = 1$ is reached when the next symbol is always predicted with probability 1, while $PP = |\Sigma|$ corresponds to random guessing from a lexicon of size $|\Sigma|$.
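The following self-contained sketch (illustrative, reusing the dictionary encoding of the earlier examples and counting the end-of-string event as one prediction, which is a convention choice) computes the test-set perplexity from the per-symbol log-probabilities:

    import math

    def perplexity(test_sample, q0, delta, gamma, tau):
        """Test-set perplexity 2^(-average log2-likelihood per symbol).

        Unseen transitions make the perplexity infinite unless the model
        is smoothed; smoothing is not shown here.
        """
        log_likelihood = 0.0
        n_events = 0
        for string in test_sample:
            q = q0
            for a in string:
                log_likelihood += math.log2(gamma[(q, a)])
                q = delta[(q, a)]
                n_events += 1
            log_likelihood += math.log2(tau[q])
            n_events += 1
        return 2 ** (-log_likelihood / n_events)

    # Example with the PPTA estimates from the earlier sketch:
    delta = {(0, 'a'): 1, (1, 'b'): 2}
    gamma = {(0, 'a'): 1.0, (1, 'b'): 2 / 3}
    tau = {0: 0.0, 1: 1 / 3, 2: 1.0}
    print(perplexity([('a', 'b')], 0, delta, gamma, tau))   # about 1.14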

3 Probabilistic DFA Inference

Algorithm ALERGIA
input:
    I+                                  // A positive sample
    α                                   // A precision parameter
output:
    a PDFA                              // A probabilistic DFA
begin
    // N is the number of states of PPTA(I+)
    π ← {{0}, {1}, ..., {N − 1}}        // One block for each prefix, in standard order
    A ← PPTA(I+)
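The remainder of the pseudocode is not reproduced in this excerpt. For completeness, the sketch below shows only the Hoeffding-style compatibility test on which ALERGIA's state-merging decisions rely, following Carrasco and Oncina [3]; the function name is illustrative and the full merging loop is omitted.

    import math

    def different(f1, n1, f2, n2, alpha):
        """Hoeffding bound used by ALERGIA: return True when the observed
        frequencies f1/n1 and f2/n2 are incompatible at precision alpha."""
        bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
        return abs(f1 / n1 - f2 / n2) > bound

    # Two states are merged only if, for every symbol a and for the
    # end-of-string counts, different(C(q', a), C(q'), C(q'', a), C(q''), alpha)
    # is False, and the same holds recursively for their successor states.
    print(different(45, 100, 5, 100, alpha=0.05))    # True: 0.45 vs 0.05 differ
    print(different(45, 100, 40, 100, alpha=0.05))   # False: within the bound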
