A Single Pass Heuristic Search for Segmental Speech Recognizers

Nick Cremelie and Jean-Pierre Martens
ELIS, University of Gent, St.-Pietersnieuwstraat 41, B-9000 Gent (Belgium)
(Research Associate of the National Fund for Scientific Research)
Abstract
The continuous speech recognition problem is usually modeled as a search for the best path in a network of transitions between states. A full search can be very expensive in terms of computation and storage requirements. By adopting a segment based rather than a frame based approach, one can already obtain a reduction of these requirements, but this may still be insufficient to allow for real time recognition. For our segment based Neural Network / Dynamic Programming hybrid, we have therefore developed a heuristic search method performing the search in a single forward pass. The key problem was to identify a suitable heuristic function which estimates the score of the best path yet to be determined. We found that a simple heuristic function, taking into account an average path score per segment, does very well. Even if the tolerated loss in recognition accuracy is kept small, our heuristic search method outperforms a traditional Viterbi beam search algorithm.
1 Introduction

A very common model for continuous speech recognition is that of a search for the best path in a network of transitions between states. This search can be performed with the well known Viterbi algorithm, which is based on the principle of Dynamic Programming. However, as the vocabulary of the recognizer grows, a full Viterbi search tends to become (too) expensive in terms of computation and storage requirements, even with the reduction in complexity obtained by adopting a phonetic segment approach rather than a frame based approach. This problem has urged us to look for a single pass heuristic search method. In this paper we will first describe heuristic search methods in general. Next we will explain how such a method can be applied efficiently in a speech recognizer, in particular in a Neural Network / Dynamic Programming hybrid recognizer [1, 2]. Finally, we will compare the results of our method with those of more classical search methods.
2 Heuristic Search Methods

Many search problems, including the speech recognition problem, are of the following type: given a set N of nodes n, a set T of possible transitions between nodes (each with an associated score), a set $S \subseteq N$ of source nodes and a set $G \subseteq N$ of goal nodes, find the sequence (path) of joining transitions, starting in a node of S and ending in a node of G, with the largest possible accumulated score (i.e. the sum of the scores of all transitions along the path). Blind search methods, like the Viterbi search, will examine all paths between S and G in order to find the best one. They do so by expanding all nodes (expanding a node means examining all transitions leaving this node). In heuristic search methods, only the nodes with the "strongest belief" of being on the global best path are selected for expansion (best-first principle). Moreover, the search is completed as soon as one path to G is found. If one is guaranteed that this path is actually the global best path, the heuristic search is called admissible. The selection of nodes is based on an evaluation function $\hat{f}(n)$. A special class of heuristic search methods emerges when the evaluation function is equal to $\hat{f}(n) = \hat{g}(n) + \hat{h}(n)$, with $\hat{g}(n)$
representing the score of the best path found so far between S and n, and with $\hat{h}(n)$ estimating the score of the best path from n to G. The term $\hat{h}(n)$ is called the heuristic function. One can show that the heuristic search will be admissible if $\hat{h}(n)$ is an upper bound of the exact best path score from n to G for all nodes n (intuitively: when a goal node is selected for expansion, its $\hat{f}$ is at least the $\hat{f}(n)$ of every open node n, and $\hat{f}(n) = \hat{g}(n) + \hat{h}(n)$ then upper-bounds the score of any path through n, so no unexplored path can do better). A practical algorithm was proposed by Nilsson in [3]. The idea is to maintain two lists, one with the open nodes (i.e. nodes to which a path has been found, but which have not been expanded yet), and one with the closed nodes (i.e. expanded nodes). The algorithm runs as follows:

1. Put all source nodes on the open list.
2. If the open list is empty, then the problem has no solution and the search is stopped; else remove from the open list the node n with the largest value of $\hat{f}$, put it in the closed list, and go to 3 (the expansion step) with that node n.
3. If n is a goal node, then the solution is found and the search is stopped; else, for all successor nodes $n_i$ of n: calculate $\hat{f}(n_i)$; if $n_i$ is neither in the open list nor in the closed list, then put $n_i$ in the open list and store the path to $n_i$; else, if $\hat{f}(n_i)$ is greater than the $\hat{f}$ of the old path to $n_i$, then replace the old path by the new one and put $n_i$ in the open list.
4. Go to 2.

As can be noted from this description, each expansion step in a heuristic search requires some overhead compared to the expansion of a node in a full search. However, a heuristic search aims at reducing the number of nodes to be expanded. Apparently this requires that $\hat{h}(n)$ is close to the exact best path score from n to G. In fact, if $\hat{h}(n)$ is exact, then it is possible to show that only nodes on the best path from S to G will be expanded.
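For illustration, the algorithm above can be sketched in Python as follows. This is a minimal sketch, not the optimized implementation described in Section 4: it uses a heap where the algorithm speaks of an open list, and the `successors` generator and heuristic `h` are assumed to be supplied by the caller.

```python
import heapq
from itertools import count

def heuristic_search(sources, goals, successors, h):
    """Single pass best-first search for the maximum-score path.

    sources    -- iterable of source nodes
    goals      -- set of goal nodes
    successors -- callable: successors(n) yields (n_i, transition_score)
    h          -- heuristic: estimated best score from n to a goal node

    Returns (score, path) for the first complete path found, else None.
    """
    tie = count()                        # break score ties without comparing nodes
    g = {n: 0.0 for n in sources}        # g(n): best score from S to n found so far
    parent = {n: None for n in sources}  # back-pointers for path recovery
    closed = set()
    # heapq is a min-heap, so -f is pushed in order to pop the LARGEST f first
    heap = [(-h(n), next(tie), n) for n in sources]
    heapq.heapify(heap)

    while heap:                          # empty open list: no solution
        neg_f, _, n = heapq.heappop(heap)
        if n in closed or -neg_f < g[n] + h(n):
            continue                     # stale entry: a better path was pushed later
        if n in goals:                   # first goal reached ends the search
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            path.reverse()
            return g[path[-1]], path
        closed.add(n)                    # n becomes a closed node
        for ni, score in successors(n):  # the expansion step
            if ni not in g or g[n] + score > g[ni]:
                g[ni] = g[n] + score     # better path to ni: (re)open it
                parent[ni] = n
                closed.discard(ni)
                heapq.heappush(heap, (-(g[ni] + h(ni)), next(tie), ni))
    return None
```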
3 A Heuristic Function for Segment Based Recognition

Stochastic segment based speech recognition [4] can be viewed as a search in a trellis network with the initial segment boundaries $(b_1, \ldots, b_I)$ in one dimension and the states $(s_1, \ldots, s_J)$ of the statistical word models in the other dimension. Each transition between states can be accomplished on a segment between two initial boundaries. In our system [1, 2], only segments comprising zero to four initial segments are examined, yielding five successors per node in the trellis.
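As an illustration of this trellis structure, successor generation can be sketched as follows, assuming for simplicity a strictly left-to-right state sequence $s_1, \ldots, s_J$; the `segment_score` function is a hypothetical stand-in for the acoustic and language model scoring of the recognizer.

```python
def trellis_successors(node, I, J, segment_score):
    """Enumerate the (at most five) successors of a trellis node.

    A node (i, j) combines a boundary index i (0 <= i <= I) with a word
    model state index j (0 <= j <= J).  A state transition consumes
    between zero and four initial segments.  `segment_score` is a
    hypothetical stand-in for the real scoring of the recognizer.
    """
    i, j = node
    if j + 1 > J:
        return                           # no further states to enter
    for k in range(5):                   # the segment spans k initial segments
        if i + k <= I:
            # score of entering state j+1 on the segment (b_i, b_{i+k})
            yield (i + k, j + 1), segment_score(i, i + k, j + 1)
```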
Various systems are based on the algorithm described above [5, 6]. However, the heuristic search is often applied only in a second pass (see e.g. [6]: the best path scores between S and all nodes n are gathered during a Viterbi search, and used in a backward heuristic search in order to obtain the N best hypotheses). It seems more appealing to conceive a single pass algorithm. Obviously, the main problem then is to find a suitable heuristic function $\hat{h}(n)$. We have used a heuristic function which is based on the properties of the initial segmentation algorithm that produced the boundaries $b_i$ for the utterance:

$$\hat{h}(n) = \hat{h}(i, j) = N_p(i)\, S_p + N_w(i)\, S_w$$

with $(i, j)$ the coordinates of node n in the trellis; $N_p(i)$ the estimated number of phonetic segments between $b_i$ and $b_I$; $S_p$ the average score per phonetic segment; $N_w(i)$ the estimated number of inter-word transitions between $b_i$ and $b_I$; and $S_w$ the average score of an inter-word transition (emerging from a language model). This function takes into account the scores of the intra-word as well as the inter-word transitions. An estimate of $N_w(i)$ can be derived from $N_p(i)$: if the expected number of phonetic segments per word is q, then $N_p(i)/q$ is a possible estimate of $N_w(i)$. The accuracy of this estimate will obviously depend on the vocabulary and on the language model. It seems reasonable to take the number of phonetic segments $N_p(i)$ proportional to $(I - i)$, the number of boundaries following $b_i$. During the evaluation of the initial segmentation algorithm, it was established that the number of phonetic segments was equal to about 70% of the number of boundaries I. Consequently, $0.7(I - i)$ was considered a good estimate of $N_p(i)$. If the language model (grammar) has a perplexity $P_{LM}$, the score $S_w$ is given by $\ln(1/P_{LM})$. The average (acoustical) score per segment $S_p$ will slightly depend on $P_{LM}$, as the acoustical scores contribute more to the total score when $P_{LM}$ is larger. However, this dependency is hard to quantify. Therefore we have estimated $S_p$ by processing a set of training utterances (of course using a full search and the same language model that will be used afterwards). Taking all this into account, we obtain:

$$\hat{h}(i, j) = 0.7(I - i)\left[S_p + \frac{1}{q}\,\ln\frac{1}{P_{LM}}\right] \triangleq S_i\,(I - i)$$
The quantity $S_i$ can be interpreted as an average score per initial segment, incorporating the contributions of both the intra-word and the inter-word transitions. Clearly one may expect that for an independent test set, the optimal value of $S_p$ (and thus $S_i$) will be somewhat smaller.
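As an illustration, the factor $S_i$ and the resulting heuristic function can be computed as follows. This is a sketch: the values of $S_p$ and q are assumptions that must come from training data, as described above.

```python
import math

def make_heuristic(I, S_p, q, P_LM):
    """Build h(i, j) = S_i * (I - i) from the quantities defined above.

    I    -- number of initial segment boundaries in the utterance
    S_p  -- average score per phonetic segment (estimated on training
            utterances with a full search and the same language model)
    q    -- expected number of phonetic segments per word
    P_LM -- perplexity of the language model
    """
    # Average score per initial segment: about 70% of the boundaries start
    # a phonetic segment, each contributing S_p plus, once every q segments,
    # an inter-word score ln(1 / P_LM).
    S_i = 0.7 * (S_p + math.log(1.0 / P_LM) / q)

    def h(i, j):
        return S_i * (I - i)   # the state index j does not enter the estimate
    return h, S_i
```

Section 5 reports the values obtained this way on our training data ($S_i = 0.92$ for the bigram model); a smaller $S_i$ can be used to trade a little accuracy for speed.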
[Figure 1: An N-ary maximum tree; the terminal nodes are search space nodes (drawn here in one dimension), the other nodes are tree nodes. Tree nodes $a_1$, $a_2$ and $a_3$ are the ancestors of a terminal node n; $a_4$ and $a_5$ are ancestors of a terminal node $n_i$.]

4 A New Implementation of Nilsson's Algorithm

As stated before, each expansion in a heuristic search carries a higher computational cost than a similar expansion in a full search: first one must identify the open node n with the largest score (this is the node to be expanded); then, for each of the successor nodes $n_i$, one must check whether $n_i$ is already in the open list, in the closed list, or in neither. Although the description of the algorithm suggests an implementation with lists, we found that such an implementation is highly inefficient, especially when the lists grow large. We have therefore chosen a totally different approach, relying on the fact that in the case of speech recognition, all nodes are known beforehand. Thus, one can simply set up an array comprising, for each node, the status (open/closed) and the value of $\hat{f}$. To keep track of the open node with the highest $\hat{f}$, we use an N-ary maximum tree (figure 1). The search space is divided into M groups of N nodes. These nodes constitute the terminal nodes of the N-ary tree. Each remaining tree node a holds a pointer to the best open terminal node reachable from a in the tree. Consequently, the root node of the N-ary tree will point to the best open node in the entire search space. For reasons of simplicity, we will call the value of $\hat{f}$ in the node pointed to by a tree node a "the maximum of tree node a". Initially, only the source nodes are open nodes, so that initializing the tree is straightforward. During each step of the search, only small portions of the tree need to be updated. There are two different steps in the expansion of a node:

1. The node n becomes closed. As n was selected for expansion, it was the open node with the highest $\hat{f}$. This means that the maximum of all ancestors of n ($a_1$, $a_2$ and $a_3$ in figure 1) was $\hat{f}(n)$ and is no longer valid. Hence an update of these tree nodes is necessary. Updating the parent of n ($a_3$ in figure 1) requires $N + \alpha N$ tests (with $\alpha \le 1$): check the open/closed status of each child, and if open, check whether the parent's maximum has to be updated. Updating the remaining ancestors of n ($a_1$, $a_2$ in figure 1) requires exactly $N - 1$ tests per ancestor. The update process thus requires a total of $N + \alpha N + (N - 1)(\lceil \log_N M \rceil - 1) \triangleq T_1$ tests.

2. The value of $\hat{f}$ in a node $n_i$, reached by the expansion of n, changes. The node $n_i$ becomes or remains open. The new $\hat{f}(n_i)$ must first be compared to the maximum of its parent ($a_5$ in figure 1). Should the maximum change, then $\hat{f}(n_i)$ must also be compared to the maximum of its grandparent ($a_4$ in figure 1), and so on, until an ancestor's maximum remains unaltered or until the root of the tree is reached. The number of tests for this update is thus $\beta \lceil \log_N M \rceil \triangleq T_2$, with $\beta \le 1$.

The values of $\alpha$ and $\beta$ can only be estimated a posteriori. During the expansion of n, step 1 occurs only once, while step 2 can occur for each successor of n. The expected number of times that step 2 occurs can be estimated from an experiment with training utterances. The value of N can then be chosen so as to minimize the expected number of tests per expansion, given by $T_1$ plus the expected number of step 2 occurrences times $T_2$.
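A minimal Python sketch of such a maximum tree follows. It is an illustration under stated assumptions, not our actual implementation: the scores `f` and open/closed flags `is_open` are assumed to live in external arrays maintained by the search, and the tree levels are plain lists of pointers (indices of terminal nodes, -1 meaning "no open node below").

```python
import math

class MaxTree:
    """N-ary maximum tree over a fixed search space of nodes 0..len(f)-1.

    Every tree node stores the index of the best open terminal node below
    it, so the root always points to the open node with the highest f.
    """

    def __init__(self, f, is_open, N):
        self.f, self.is_open, self.N = f, is_open, N
        self.levels = []                       # levels[0] groups the leaves
        size = len(f)
        while size > 1:
            size = math.ceil(size / N)
            self.levels.append([-1] * size)    # -1: no open node below
        for leaf in range(len(f)):             # only sources are open initially
            if is_open[leaf]:
                self.node_improved(leaf)

    def _best_child(self, level, a):
        """Re-scan the N children of tree node a at the given level."""
        lo, hi = a * self.N, (a + 1) * self.N
        if level == 0:                         # children are terminal nodes
            cand = [c for c in range(lo, min(hi, len(self.f))) if self.is_open[c]]
        else:                                  # children are lower tree nodes
            prev = self.levels[level - 1]
            cand = [prev[c] for c in range(lo, min(hi, len(prev))) if prev[c] != -1]
        return max(cand, key=lambda n: self.f[n], default=-1)

    def node_closed(self, node):
        """Step 1: node was expanded; re-scan every ancestor (the T1 tests)."""
        a = node
        for level in range(len(self.levels)):
            a //= self.N
            self.levels[level][a] = self._best_child(level, a)

    def node_improved(self, node):
        """Step 2: f[node] increased and node is (re)opened; bubble upward
        until an ancestor's maximum remains unaltered (the T2 tests)."""
        a = node
        for level in range(len(self.levels)):
            a //= self.N
            cur = self.levels[level][a]
            if cur != -1 and cur != node and self.f[cur] >= self.f[node]:
                break                          # ancestor's maximum unaltered
            self.levels[level][a] = node

    def best_open(self):
        """Index of the open node with the highest f, or -1 if none."""
        return self.levels[-1][0] if self.levels else -1
```

In the search loop, `best_open()` thus replaces the scan for the open node with the largest $\hat{f}$; `node_closed(n)` corresponds to step 1 and `node_improved(n_i)` to step 2. The branching factor N can then be tuned, as described above, to minimize the expected number of tests per expansion.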
5 Results

Our system was tested on a speaker independent continuous speech recognition task (a vocabulary of 413 Dutch words). The test set consisted of 130 different sentences (10 different speakers, 13 sentences per speaker). We used a bigram language model and an artificial language model obtained by randomly adding word pairs to the bigram model in order to increase the perplexity. Three different search methods were evaluated: the standard Viterbi search, a Viterbi search with threshold pruning (beam search), and the heuristic search method presented in this paper. During the beam search, the maximum score on a particular segment boundary, diminished by some amount $\Delta$, serves as the pruning threshold at that boundary. For the heuristic search we have calculated the factor $S_i$ as indicated before. We found $S_i = 0.92$ for the bigram and $S_i = 0.71$ for the artificial language model. On the test set we have mainly applied smaller values of $S_i$ in order to obtain a faster (although somewhat less accurate) recognition.

The results are summarized in tables 1 and 2. Displayed are the total word error rate (deletions + insertions + substitutions), the average percentage of nodes being expanded, the average percentage of transitions being examined, and the average CPU-time per second of speech, both for the acoustic processing and for the search (measured on an IBM RS/6000).

                                   word     % nodes    % transitions   CPU-time/sec. speech (in sec.)
                                   errors   expanded   examined        acoust. pr.   search   total
  Viterbi Search                   3.46%    100%       100%            0.66          1.21     1.87
  Beam Search, Δ = ln(200000)      3.46%    4.42%      5.17%           0.66          0.22     0.88
               Δ = ln(100000)      4.07%    3.72%      4.40%           0.66          0.21     0.87
               Δ = ln(20000)       4.44%    2.28%      2.92%           0.66          0.18     0.84
  Heuristic Search, S_i = 0.92     3.46%    3.25%      3.60%           0.66          0.18     0.84
                    S_i = 0.50     3.70%    0.48%      0.66%           0.66          0.07     0.73
                    S_i = 0.25     4.07%    0.31%      0.45%           0.66          0.04     0.70

Table 1: Results with the bigram language model (775 word pairs, test set perplexity = 4.1)

                                   word     % nodes    % transitions   CPU-time/sec. speech (in sec.)
                                   errors   expanded   examined        acoust. pr.   search   total
  Viterbi Search                   19.26%   100%       100%            0.66          2.48     3.14
  Beam Search, Δ = ln(200000)      18.77%   14.05%     11.09%          0.66          0.44     1.10
               Δ = ln(20000)       19.88%   6.87%      5.34%           0.66          0.32     0.98
               Δ = ln(5000)        21.73%   4.07%      3.18%           0.66          0.25     0.91
  Heuristic Search, S_i = 0.71     18.77%   9.20%      7.60%           0.66          0.58     1.24
                    S_i = 0.50     19.26%   3.81%      3.14%           0.66          0.26     0.92
                    S_i = 0.31     21.60%   1.81%      1.47%           0.66          0.14     0.80

Table 2: Results with the artificial language model (7000 word pairs, test set perplexity = 26.5)

Apparently the heuristic search method outperforms the Viterbi beam search: it can attain the same recognition accuracy with fewer expanded nodes and fewer computations. Although the heuristic search has a higher computational cost per node expansion, its fixed computational cost (examination of nodes) is smaller than that of the beam search.

The calculated values of $S_i$ did not result in an admissible heuristic function. There were definitely a number of best path errors, but they did not affect the recognition results. Apparently, there are enough near-optimal paths yielding the same word string. Using the calculated values of $S_i$, we obtained similar degrees of inadmissibility (about 18% best path errors) for both tasks. It appears that in most cases where the best path is not found, the heuristic method finds another path with a very high ranking, whereas this is not always true with a beam search. This can be an important advantage of our heuristic method. If desirable, the number of best path errors can be reduced by continuing the search until more than one path to a goal node is found or until a maximum number of expansions is performed. Allowing 0.5% additional expansions after the first path to a node of G is found, we obtained a 60% reduction of the number of best path errors. In fact, this method turned out to be computationally cheaper than increasing the value of $S_i$ in order to reach the same number of best path errors.

Part of the success of our method undoubtedly emerges from the fact that our heuristic function is based on the properties of a good speaker-independent initial segmentation algorithm. Note that it was shown in [7] that a heuristic function similar to ours, but based on the number of frames left to be recognized, was not at all capable of yielding an efficient search. One possible drawback of the heuristic search method is its higher storage requirement. In our case, about 3.5 times the storage necessary for a beam search was required. We are currently examining how to reduce the storage requirement as well.
6 Conclusion

In this paper we have presented an efficient single pass heuristic search method for segment based continuous speech recognition. We introduced a heuristic function based on the initial segmentation of the utterance, and we proposed an optimized implementation of Nilsson's algorithm. The fact that the search is not strictly admissible did not essentially alter the recognition accuracy.
References

[1] J.P. Martens, A. Vorstermans, N. Cremelie (1993), "A new Dynamic Programming / Multi-Layer Perceptron Hybrid for continuous speech recognition," in Proceedings of EUROSPEECH-93, 1937-1940.
[2] A. Vorstermans, J.P. Martens, N. Cremelie (1993), "Speaker-independent Phone Recognition with a Dynamic Programming / Multi-Layer Perceptron Hybrid," IEEE ProRisc-93, 335-340.
[3] N.J. Nilsson (1980), Principles of Artificial Intelligence, Palo Alto, California: Tioga.
[4] M. Ostendorf, S. Roucos (1989), "A Stochastic Segment Model for Phoneme-based Continuous Speech Recognition," IEEE Trans. ASSP-37, 1857-1869.
[5] D.B. Paul (1992), "An Efficient A* Stack Decoder Algorithm for Continuous Speech Recognition with a Stochastic Language Model," in Proceedings of ICASSP-92, I25-I28.
[6] S. Austin, R. Schwartz, P. Placeway (1991), "The Forward-Backward Search Algorithm," in Proceedings of ICASSP-91, 697-700.
[7] H. Ney (1992), "A Comparative Study of Two Search Strategies for Connected Word Recognition: Dynamic Programming and Heuristic Search," IEEE Trans. PAMI-14, No. 5, 586-595.