Noise Adaptive Spoken Dialog System based on Selection of Multiple Dialog Strategies

Akinori Ito, Takanobu Oba, Takashi Konashi, Motoyuki Suzuki and Shozo Makino
Graduate School of Engineering, Tohoku University, Japan
{aito,bacchi,konashi,moto,makino}@makino.ecei.tohoku.ac.jp
Abstract

Speech recognition under noisy environments is one of the hottest topics in speech recognition research. In this paper, we propose a method to improve the accuracy of a spoken dialog system from a dialog strategy point of view. In the proposed method, the dialog system automatically changes its dialog strategy according to the recognition accuracy estimated for the noisy environment, in order to keep the performance of the system constant. In a noisy environment, the system restricts its grammar and vocabulary to improve recognition accuracy; in a noise-free environment, the system accepts any utterance from the user. To realize this strategy, we investigated a method to avoid the user's out-of-grammar utterances through an instruction given by the system to the user. Furthermore, we developed a method to estimate recognition accuracy from features extracted from the noise signal. Finally, we constructed the proposed dialog system and confirmed its effectiveness.
1. Introduction

A spoken dialog based user interface is one of the most promising ways to communicate with a mobile robot [1] or an electronic appliance. The spoken dialog based interface has many advantages: it is easy to use, no special device such as a remote controller is needed, and there is no need to touch the equipment. However, several factors degrade the performance of speech recognition in a real environment, the biggest of which is environmental noise. Many efforts have been made toward speech recognition under noise [2], and improving recognition accuracy under noisy environments is very important for realizing a noise-robust spoken dialog system. However, recognition accuracy under noise is still not sufficient. Here, we propose a new approach that improves the robustness of a spoken dialog system from a dialog strategy point of view. The basic idea of this work is that the user can help the spoken dialog system: if the user knows that the dialog system has difficulty under a noisy environment, the user can help the system by limiting vocabulary or using simple expressions.

In the next section, we give an overview of the proposed spoken dialog system. To realize the proposed system, two issues have to be solved. One is how to reduce out-of-task utterances; this is discussed in section 3. The other is how to estimate speech recognition performance under a certain environmental condition; this is discussed in section 4. In section 5, the result of an experiment is presented to prove the effectiveness of the proposed system.
2. Overview of the system

We assume a relatively small-vocabulary dialog system as the target of this work. The system aims at a user interface for mobile robots or home appliances such as a TV or an air conditioner. The vocabulary size is assumed to be up to several hundred words, and the task is assumed to have up to six slots to be filled through the dialog. The proposed system relies on the fact that the simpler the grammar of the recognition system is, the higher the recognition accuracy is. Under a noise-free environment, recognition accuracy is expected to be high; in such an environment, the dialog system can accept a wide variety of utterances, that is, an utterance can contain any number of words to fill slots and their order can be scrambled. On the other hand, recognition accuracy degrades under a noisy environment; in this case, the system can keep recognition accuracy high by limiting the variety of acceptable expressions. Therefore, if the recognition accuracy under a certain environment can be estimated, the dialog system can change its recognition grammar according to the estimated accuracy: when the system is used in a clean environment it uses a large vocabulary and a complex grammar, and if the environment is noisy it uses a smaller vocabulary and a simpler grammar. To realize this system, we have to solve the following two issues.

1. How to avoid out-of-grammar utterances. Even if the system employs a small vocabulary and a simple grammar, the dialog will fail if the user does not know what he/she can utter to the system. Therefore, we should develop a method to inform the user of the grammatical limitation, in other words, a method that leads the user to follow the system's instructions and suppresses out-of-grammar utterances.

2. How to estimate recognition accuracy under a certain environment. There are many factors that affect recognition accuracy. We have to find an effective set of features for the estimation, as well as develop a method to estimate the accuracy. At the same time, a method to choose a proper vocabulary and grammar should also be developed.

Figure 1: Out-of-grammar rate for each dialog strategy
3. Suppression of out-of-grammar utterances

In this section, we examine dialog strategies to suppress a user's out-of-grammar (OOG) utterances. To avoid OOG utterances, we tried changing the system's dialog strategy. Here, we call two factors the 'dialog strategy': the first is the instruction the system initially gives to the user, and the second is the form of the questions the system asks. We compared the following five dialog strategies.
- 'One item (OI)' strategy: the system asks one question to obtain one item.
- 'Two items (TI)' strategy: the system asks one question to obtain up to two items.
- 'Answer the question (AQ)' strategy: the questions are the same as in the OI strategy, but the system first asks the user to answer only the question asked, in order to suppress utterances the system is not ready to receive.
- 'Answer using a word (AW)' strategy: similar to the AQ strategy, but the system asks the user to answer with an isolated word. If the user obeys this instruction, the system can keep its grammar quite simple.
- 'Choice (CH)' strategy: the system tells the user all available choices and their numbers.

We carried out a dialog experiment. The subjects were 22 males and 10 females, and each subject carried out two dialogs per strategy. The task domain was the operation of electric appliances, and the goal of each dialog was given to the subject before the dialog started. The dialog was performed with the Wizard-of-Oz method. The OOG rates for each strategy are shown in Figure 1. The OOG rate of the CH strategy was 0% (no OOG utterance was observed), and for the AQ strategy there was only one OOG utterance. The OI strategy gave a higher OOG rate than the TI strategy because users tend to utter an answer containing more than one item even when the system requires only one. From this result, the CH, AQ and TI strategies were found to be effective for avoiding OOG utterances.
4. Estimation of recognition accuracy under noisy environment

The second issue in realizing the proposed system is to develop a method to estimate recognition accuracy under a certain environment. To tackle this problem, we employed a neural network based method. We examined the following 11 parameters.
- S/N ratio
- Standard deviation of power divided by mean power
- Mean and variance of the correlation between two contiguous frames
- Mean and variance of the spectral slope of each frame
- Number of phonemes output by the speech recognizer when the noise signal is input to it
- Vocabulary size
- Number of sentences generated from the recognition grammar
- Number of words included in the above sentences
- Branching factor of the grammar

The correlation between two frames is used to detect voiced periods. The number of phonemes output by the recognizer is used as a practical index of how similar the noise is to speech. The numbers of sentences and words can be calculated because we use finite state automata without loops as grammars.

Table 1: Properties of the examined grammars

We then reduced the number of parameters to investigate which of them are essential for estimating recognition accuracy. The selection procedure is as follows. When k parameters are given, we can create k models, each of which uses k-1 parameters by removing one of them. Then the estimation
errors are measured for all the models, and the model with the minimum error is chosen. In this way, we can reduce the number of parameters one by one.

In the experiment, 52 grammars were examined. The minimum and maximum values of each parameter over all grammars are shown in Table 1. Utterances for the recognition experiment were chosen from those gathered through the experiments described in the previous section: for each grammar, five utterances accepted by that grammar were chosen. For noise signals, we chose eight environmental sounds from the JEIDA noise database [4] and four musical pieces from the RWC music database [5]; 135 noise data were chosen from these databases. The first 1 or 2 seconds of each noise datum are used to estimate the noise parameters, and the observed noise signal is analyzed using a 25 msec Hamming window with a 25 msec frame shift. The five speech data are mixed into each noise signal, with a 1 second interval between them. Seven SNRs (40, 30, 20, 15, 10, 5 and 0 dB) were examined for each combination of speech and noise. In total, we prepared 52 × 135 × 7 = 49,140 data, each of which contains five utterances. Speech recognition was carried out on all the data using the corresponding grammar. 4,000 data were drawn randomly from the whole set for the open test; the remaining 45,140 data were used for training, of which 4,000 were used for the closed test. A 4-layer neural network was trained for the estimation: the number of neurons in the input layer equals the number of parameters, each of the two hidden layers has 20 neurons, and the output layer has one neuron. The experimental result is shown in Figure 2, which plots the correlation coefficient between the estimated and the real recognition accuracies. The correlation is around 0.7, and it gets worse when fewer than 4 parameters are used. Therefore, we chose four parameters in the experiments described later.
The selected four parameters were the S/N ratio, the mean of the spectral slope of each frame, the number of phonemes output by the speech recognizer when noise is input to it, and the vocabulary size.
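The backward elimination loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: an ordinary least-squares fit stands in for the paper's 4-layer neural network so the example stays self-contained, and all function names are hypothetical.

```python
# Sketch of backward feature elimination: repeatedly drop the one feature
# whose removal gives the smallest held-out estimation error.
# A linear least-squares fit is a stand-in for the paper's neural network.
import numpy as np

def fit_predict(Xtr, ytr, Xte):
    # least squares with a bias column, used as a stand-in estimator
    A = np.c_[Xtr, np.ones(len(Xtr))]
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    return np.c_[Xte, np.ones(len(Xte))] @ w

def backward_select(X, y, keep):
    """Reduce the feature set one parameter at a time until `keep` remain,
    always keeping the subset with the minimum held-out squared error."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2 :]
    features = list(range(X.shape[1]))
    while len(features) > keep:
        best_err, best_subset = None, None
        for f in features:
            subset = [g for g in features if g != f]
            pred = fit_predict(X[np.ix_(tr, subset)], y[tr],
                               X[np.ix_(te, subset)])
            err = np.mean((pred - y[te]) ** 2)
            if best_err is None or err < best_err:
                best_err, best_subset = err, subset
        features = best_subset
    return features
```

With the paper's setup, `X` would hold the 11 candidate parameters per condition and `y` the measured recognition accuracy; the loop then mirrors the "remove one parameter out of k" procedure described above.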
Figure 2: Correlation coefficient between estimated and real accuracies (horizontal axis: number of parameters; closed and open tests)
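Several of the noise-side parameters listed above can be computed directly from the observed signal. The sketch below shows two of them, the S/N ratio and the mean inter-frame correlation used to detect voiced periods; the helper names are hypothetical (not the authors' code), and 16 kHz sampling is assumed so that the paper's 25 msec analysis window is 400 samples.

```python
# Hypothetical helpers for two of the candidate noise features:
# S/N ratio in dB, and mean correlation between contiguous frames.
import numpy as np

def frame_signal(x, frame_len, hop):
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def snr_db(speech, noise):
    # ratio of average powers, expressed in dB
    ps = np.mean(speech.astype(float) ** 2)
    pn = np.mean(noise.astype(float) ** 2)
    return 10.0 * np.log10(ps / pn)

def mean_frame_correlation(x, frame_len=400, hop=400):  # 25 ms @ 16 kHz
    frames = frame_signal(x.astype(float), frame_len, hop)
    cors = []
    for a, b in zip(frames[:-1], frames[1:]):
        c = np.corrcoef(a, b)[0, 1]
        if not np.isnan(c):  # skip zero-variance (silent) frames
            cors.append(c)
    return float(np.mean(cors)) if cors else 0.0
```

A periodic (voiced-like) signal yields inter-frame correlations near 1, while broadband noise yields correlations near 0, which is why this feature helps separate speech-like noise from stationary noise.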
5. Realization and evaluation of the dialog system

In this section, we describe the realization of the proposed dialog system and the result of the dialog experiment. The proposed system works as follows. First, the system observes the noise signal and extracts the noise parameters. The dialog strategy selector receives the parameters, estimates the recognition accuracy for each grammar, and chooses the optimum grammar and dialog strategy. The selected grammar is passed to the speech recognizer, which then gets ready to hear the user's utterance. In this system, the estimation of accuracy is carried out only once, before the dialog begins. It would be possible to estimate the accuracy before each utterance, but it is still unclear whether switching the dialog strategy utterance by utterance is useful; therefore, we decided to determine the strategy at the very beginning of the dialog.

Next, we explain how the strategy selector determines the optimum dialog strategy. The basic criteria for selecting a strategy are the number of user utterances and the recognition accuracy: the selector chooses the strategy whose estimated number of user utterances is minimum among those whose estimated recognition accuracy is higher than a certain threshold. First, the noise parameter extractor calculates the three noise parameters selected in the previous section. The dialog strategy selector receives these parameters and estimates the recognition accuracy using the vocabulary size of each prepared dialog strategy; if a strategy uses more than one grammar, the average size of the vocabularies is used. Let the strategies be S_1, ..., S_n, and let V_i be the vocabulary size of the grammar used in strategy S_i. From the tuple of the three noise parameters and V_i, the neural network estimates the recognition accuracy A_i. Then the estimated number of user utterances U_i is calculated as follows. Let I_i be the number of items expected to be contained
Table 2: V_i and I_i of the dialog strategies

  strategy                    V_i     I_i
  Free utterance (FR)        223.0    3.0
  Free+choice (FC)            97.6    2.0
  Two items (TI)              75.0    1.5
  Answer the question (AQ)    42.0    1.0

in an utterance when dialog strategy S_i is employed. Assume the system should obtain N items to achieve the task. Here,
  U_i = ⌈ N / I_i ⌉                      (1)

  i* = argmin_i { U_i : A_i ≥ θ }        (2)

where θ is a threshold on the estimated accuracy.
Table 3: Average number of utterances and word accuracy for one dialog

  strategy   #utterances   accuracy (%)
  FR             4.09         52.96
  FC             3.59         59.38
  TI             2.97         63.92
  AQ             3.18         71.78
  Selected       2.91         72.31
  Oracle         2.09         68.74
Finally, the dialog strategy with the smallest estimated number of user utterances, among those whose estimated recognition accuracy is higher than the threshold, is selected. Next, we describe the dialog strategies actually used in the experiment. We prepared four dialog strategies: TI, AQ and the following two.
- Free utterance (FR): the system first prompts "May I help you?", and the user then utters without any restriction.
- Free+choice (FC): the system asks the first question using the CH strategy, and the following dialog is carried out using the FR strategy.

The vocabulary sizes of the grammars used in these strategies are shown in Table 2, together with the average number of items conveyed by one utterance; these values were estimated from the result of the experiment in section 3. In the experiment, 58 kinds of noise were chosen from the JEIDA database and the RWC music database; therefore, there were 58 × 4 = 232 conditions to be tested. 24 users (19 males and 5 females) were asked to carry out dialogs with the system. Each user carried out dialogs with all four strategies under several environments, and the order of the strategies was randomized. The number of items to obtain was set to 3 for all dialogs, and the threshold was set to 60% according to a preliminary experiment. Table 3 shows the number of utterances and the recognition accuracy. In Table 3, 'FR', 'FC', 'TI' and 'AQ' denote the results obtained using each fixed strategy, 'Selected' is the result when the proposed dialog strategy selection method was used, and 'Oracle' shows the result when the strategy that minimizes the number of user utterances is chosen a posteriori. These results show that the number of user utterances of the proposed method was smaller than that of any fixed strategy, and that its recognition accuracy was the highest among all strategies.
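The selection rule evaluated above can be sketched as follows; this is a minimal illustration, where `estimate_accuracy` stands in for the trained neural network, the function name is hypothetical, and the vocabulary sizes and items-per-utterance values are those of Table 2.

```python
# Sketch of the dialog strategy selector: estimate accuracy per strategy,
# discard strategies below the accuracy threshold, and pick the one with
# the fewest expected user utterances.
import math

# (vocabulary size, expected items per utterance), from Table 2
STRATEGIES = {
    "FR": (223.0, 3.0),
    "FC": (97.6, 2.0),
    "TI": (75.0, 1.5),
    "AQ": (42.0, 1.0),
}

def select_strategy(noise_params, n_items, estimate_accuracy, threshold=0.60):
    """Pick the strategy with the fewest estimated user utterances among
    those whose estimated recognition accuracy clears the threshold."""
    best = None
    for name, (vocab, items_per_utt) in STRATEGIES.items():
        acc = estimate_accuracy(noise_params, vocab)  # stand-in for the NN
        if acc < threshold:
            continue
        n_utt = math.ceil(n_items / items_per_utt)    # estimated utterances
        if best is None or n_utt < best[1]:
            best = (name, n_utt)
    # fall back to the most restrictive strategy if nothing clears the bar
    return best[0] if best else "AQ"
```

In a quiet environment the estimator returns high accuracy for every vocabulary, so FR (one utterance for three items) wins; as noise pushes the estimated accuracy of large-vocabulary grammars below the threshold, the selector falls back to the more restrictive strategies.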
6. Conclusion

A noise-robust dialog system was proposed. The system changes its dialog strategy according to the environmental noise. To achieve this, the following two issues were investigated: a dialog strategy that avoids out-of-grammar utterances, and a method for estimating recognition accuracy under a given noise environment. Finally, we realized the proposed system and carried out a dialog experiment. The results confirmed that the proposed system achieves a task with a smaller number of user utterances than the conventional fixed strategies.
7. References

[1] T. Matsui, H. Asoh, J. Fry, Y. Motomura, F. Asano, T. Kurita, I. Hara and N. Otsu, "Integrated natural spoken dialogue system of Jijo-2 mobile robot for office services," Proc. AAAI-99, 1999.
[2] S. Nakagawa, "A survey on automatic speech recognition," Trans. IEICE, Vol. E85-D, No. 3, pp. 465-486, 2002.
[3] Y. Niimi, T. Nishimoto and Y. Kobayashi, "Analysis of interactive strategy to recover from misrecognition of utterances including multiple information items," Proc. Eurospeech '97, Vol. 4, pp. 2251-2254, 1997.
[4] S. Itahashi, "Creating speech corpora for speech science and technology," Trans. IEICE, Vol. E74, No. 7, pp. 1906-1910, 1991.
[5] M. Goto, H. Hashiguchi, T. Nishimura and R. Oka, "RWC Music Database: popular, classical, and jazz music databases," Proc. ISMIR 2002, pp. 287-288, 2002.
[6] T. Konashi, M. Suzuki, A. Ito and S. Makino, "A portable spoken dialog system for autonomous robots," Proc. Int. Workshop on Language Understanding and Agents for Real-world Interaction, pp. 79-84, 2003.