2009 International Nuclear Atlantic Conference - INAC 2009 Rio de Janeiro, RJ, Brazil, September 27 to October 2, 2009 ASSOCIAÇÃO BRASILEIRA DE ENERGIA NUCLEAR - ABEN ISBN: 978-85-99141-03-8
MAN-SYSTEM INTERFACE BASED ON AUTOMATIC SPEECH RECOGNITION: INTEGRATION TO A VIRTUAL CONTROL DESK Carlos Alexandre F. Jorge1, Diogo V. Nomiya2, Antônio Carlos A. Mól1, Cláudio M.N.A. Pereira1, Maurício Alves C. Aghina1 1
Instituto de Engenharia Nuclear, IEN - CNEN R. Hélio de Almeida, 75 Cidade Universitária - Ilha do Fundão 21941-906 Rio de Janeiro, RJ
[email protected],
[email protected],
[email protected],
[email protected] 2
Universidade Federal do Rio de Janeiro Av. Athos da Silveira Ramos, 149 Cidade Universitária - Ilha do Fundão Rio de Janeiro, RJ
[email protected]
ABSTRACT This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter aims to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse: users can operate the virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented.
1. INTRODUCTION Nuclear power plant (NPP) operation poses high safety requirements, thus NPP simulation for personnel training programs must faithfully reproduce the conditions that operators face during real operational routines. The simulations' purpose is two-fold: (i) to supply personnel training for NPP operation; (ii) to support ergonomic evaluations for the design of new NPP control systems or the modernization of existing ones. Operators interact with a NPP through a control desk, monitoring variables' measurements and dealing with actuator controls, to keep operation under normal conditions. In the case of abnormal situations, they have to identify them correctly from the corresponding variables and alarm indications, and to mitigate the incident or accident effectively. Also, tests are usually carried out to evaluate human-system interfaces from the ergonomics and human-factors points of view [1]-[5]. All these tasks are typically based on systems that comprise a computer simulator, which runs NPP operational simulations, and a human-system interface [6]. There are a number of international guidelines in the literature with related recommendations [6]-[16]. The traditional approach required the construction of physical (reduced-scale) control desks, with
an associated computer simulation program, an arrangement known as full-scope simulation, where operators or trainees face control desk models that are very similar to the original ones. The disadvantage is clearly the need to construct those physical models. Figure 1 shows the reduced-scale control desk, developed by the Korea Atomic Energy Research Institute (KAERI), on which this work has been based.
Figure 1. Reduced-scale control desk.
An alternative approach consists of the use of synoptic windows, in which computer monitors with schematic diagrams substitute for the physical control desk. This kind of simulator is known as a compact simulator [17]. Our staff has worked in cooperation with KAERI, through a project supported by the International Atomic Energy Agency (IAEA), in the development of a NPP compact simulator based on synoptic windows [18]. The disadvantages of this approach are: (i) the synoptic windows do not resemble much the original control desk on which they are based; (ii) operators or trainees need to navigate through a number of synoptic window screens during operational training. The navigation may sometimes become confusing, especially during simulated incidents or accidents: trainees have to navigate through many synoptic windows to see the alarms and the affected systems, to identify the type of malfunction and to make decisions to bring the plant back to normal operational condition. These questions are treated in more detail elsewhere [19], which also deals with improvements that are being carried out to overcome the drawbacks. Other related improvements are treated in [20], [21]. Figure 2 shows the synoptic windows implementation of a pressurized water reactor (PWR) NPP compact simulator at Instituto de Engenharia Nuclear (IEN/CNEN).
Figure 2. Synoptic windows approach.
More recently, virtual control desks (VCD) have been developed to replace the synoptic windows [22]-[25]. This approach is based on virtual reality (VR) technology, aiming to reproduce visually, with fidelity, the layout of the original control desk on which the simulator is based. Thus, this approach combines the online dynamic operation functionalities of a PWR NPP compact simulator with a realistic view of the corresponding control desk. Our staff has been developing one such VCD, based on the full-scope control desk shown in Fig. 1, to operate in conjunction with the existing PWR NPP compact simulator through a computer network; preliminary results were reported in [26]-[28]. This VCD was operated through computer keyboard and mouse, thus requiring users to sit in front of a personal computer (PC). Our objective is that users can navigate and control this VCD free from keyboard and mouse, standing in front of a PC monitor or a projection screen, so they can perform the operational simulations in a more natural way. We have now developed an interface based on automatic speech recognition (ASR), so that spoken commands (isolated words) can be automatically recognized and the corresponding tasks executed.
2. THE AUTOMATIC SPEECH RECOGNITION SYSTEM The ASR methodology is based on two techniques: (i) cepstral analysis for the pre-processing of speech signals; (ii) artificial neural networks (ANNs) for the recognition task. The cepstral analysis [29],[30] is a
well-known technique to extract parameters of the vocal tract, while the ANNs [31] identify the spoken commands from the cepstral coefficients. The pre-processing stage is required for two reasons [32]: (i) to isolate, from the recorded signal, the interval that contains only the spoken word itself; (ii) to extract relevant features from the speech signal for pattern recognition. Figure 3 shows a block diagram of a typical ASR system for isolated word identification. The first and second blocks, from left to right, comprise the pre-processing stage.
Figure 3. Typical speech recognition system (block diagram): speech signal → blank space elimination → feature extraction → pattern recognition → recognized word.
The vocal tract is a slowly time-varying system that can be considered approximately stationary over time intervals ranging from 10 to 30 ms [32]. Thus, the speech signal must be segmented in this range before any further signal processing is performed. All analysis techniques are then applied in short time; Fourier analysis, for example, is applied as the Short-Time Fourier Transform (STFT), resulting in the well-known speech spectrogram [32]. Seven commands were used: "abaixo", "acima", "direita", "esquerda", "aproxima", "afasta" and "para", Portuguese words for "down", "up", "right", "left", "zoom in", "zoom out" and "stop", respectively. The latter word is needed because, once the system recognizes any of the other commands, it continues to execute the corresponding task indefinitely (for example, zooming in the scene), so the user must say "para" to stop the task execution. 2.1. Pre-processing Stage 2.1.1. Blank Space Elimination Blank spaces have to be eliminated from the word to be recognized, as these "silence" parts contain no useful information, only background noise recorded from the environment. A simple approach was used, considering only the short-time energy information; such an approach was reported in [33] for speaker recognition, showing good results despite its simplicity. Segmentation was performed with adjacent 20 ms segments, without overlap. The short-time energy was evaluated over all segments, and those with energies below a threshold of 1 % of the maximum energy in the whole signal were eliminated; the resulting valid segments were kept for subsequent processing.
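A minimal sketch of this stage in C follows, under the assumptions that "maximum energy in the whole signal" refers to the maximum short-time segment energy and that the samples are already normalized; the function name, sampling-rate parameter and buffer handling are illustrative, since the paper does not publish its source code.

```c
/* Energy-based blank space elimination: split the signal into adjacent
 * 20 ms segments (no overlap) and discard segments whose short-time
 * energy falls below 1 % of the maximum segment energy. */
#include <stddef.h>

/* Returns the number of samples kept in `out` (caller-allocated, >= n). */
size_t trim_blanks(const double *x, size_t n, double fs, double *out)
{
    size_t seg_len = (size_t)(0.020 * fs);   /* 20 ms segments */
    size_t n_seg = n / seg_len;
    double max_e = 0.0;
    /* First pass: find the maximum short-time energy. */
    for (size_t s = 0; s < n_seg; s++) {
        double e = 0.0;
        for (size_t i = 0; i < seg_len; i++)
            e += x[s * seg_len + i] * x[s * seg_len + i];
        if (e > max_e) max_e = e;
    }
    /* Second pass: keep only segments at or above 1 % of the maximum. */
    size_t kept = 0;
    for (size_t s = 0; s < n_seg; s++) {
        double e = 0.0;
        for (size_t i = 0; i < seg_len; i++)
            e += x[s * seg_len + i] * x[s * seg_len + i];
        if (e >= 0.01 * max_e)
            for (size_t i = 0; i < seg_len; i++)
                out[kept++] = x[s * seg_len + i];
    }
    return kept;
}
```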
2.1.2. Parameter Extraction The extracted parameters constitute a vocal tract model. Since the vocal tract dynamics is related to the spoken word, this model constitutes a signature that can be used for pattern recognition. This work makes use of the cepstral coefficients [29],[30], derived from the FFT [34],[35], given the simplicity of this approach. Speech segmentation is performed with Hamming windows, to avoid the spurious high frequencies that abrupt amplitude variations in the speech signal would cause with rectangular windows. An overlap of 50 % between adjacent segments was adopted, to compensate for the signal attenuation at the segment borders. After the blank space elimination stage, the resulting signals have variable length, depending on how fast or slowly the word was spoken. Since this work uses ANNs for the pattern recognition stage, the number of inputs of the feed-forward ANN must be fixed [31]. Thus, a strategy similar to one cited elsewhere [36],[37] was adopted, in which both the overlap percentage and the number of segments are fixed, letting the segment length vary to accommodate the word length. The cepstral analysis is summarized in the following. For speech processing applications, the real cepstrum is sufficient [30]. A simplified model of speech production considers the speech signal s(n) as resulting from a convolution between the excitation u(n) and the vocal tract impulse response h(n), the latter corresponding to the vocal tract model in the time domain, as indicated by Eq. (1). To extract the vocal tract impulse response from the speech signal, the cepstral coefficients perform a deconvolution of the speech signal, as indicated by Eq. (2).
$$ s(n) = u(n) \ast h(n) \qquad (1) $$

$$ c_S = \mathrm{IDFT}\{\log|\mathrm{DFT}\{s(n)\}|\} = \mathrm{IDFT}\{\log|S(k)|\} = \mathrm{IDFT}\{\log|U(k)\,H(k)|\} = \mathrm{IDFT}\{\log|U(k)| + \log|H(k)|\} = c_U + c_H \qquad (2) $$

where:
• $S(k)$: DFT of the speech signal;
• $U(k)$: DFT of the excitation;
• $H(k)$: DFT of the vocal tract impulse response;
• $c_S$: cepstrum of the speech signal;
• $c_U$: cepstrum of the excitation;
• $c_H$: cepstrum of the vocal tract impulse response.
As the two cepstral components are well separated in the quefrency domain [30], they can be separated by a low-pass liftering process, in general performed with a sinusoidal lifter [38]. In this R&D, the DFT was obtained with a 512-point FFT, by zero-padding [30], since it was verified that the segment lengths were around 300 samples.
Twelve cepstral coefficients were used, discarding the zero-index one, which corresponds to the excitation [39].
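The sketch below summarizes this feature extraction in C, under stated assumptions: with 50 segments and 50 % overlap, a word of n samples gives a segment length L obtained from n ≈ L + 49·L/2, i.e. L = 2n/51 (our inference from the strategy above, not a formula given in the paper); a naive O(N²) DFT is used here for self-containedness, whereas the paper used an FFT implemented after Numerical Recipes [42]. All names are illustrative.

```c
/* Cepstral feature extraction: 50 Hamming-windowed segments with 50 %
 * overlap, each zero-padded to 512 points; 12 real-cepstrum
 * coefficients (indices 1..12, discarding index 0) per segment give
 * the 600-dimensional ANN input vector. */
#include <math.h>
#include <string.h>

#define NFFT 512
#define NSEG 50
#define NCEP 12
#define PI   3.14159265358979323846

/* Real cepstrum of one zero-padded segment: c = IDFT{ log|DFT{x}| }. */
static void real_cepstrum(const double *x, double *c, int ncep)
{
    double logmag[NFFT];
    for (int k = 0; k < NFFT; k++) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < NFFT; n++) {
            re += x[n] * cos(2.0 * PI * k * n / NFFT);
            im -= x[n] * sin(2.0 * PI * k * n / NFFT);
        }
        logmag[k] = log(sqrt(re * re + im * im) + 1e-12); /* avoid log(0) */
    }
    /* log|DFT| of a real signal is real and even, so the inverse DFT
     * reduces to a cosine sum. */
    for (int q = 1; q <= ncep; q++) {
        double sum = 0.0;
        for (int k = 0; k < NFFT; k++)
            sum += logmag[k] * cos(2.0 * PI * k * q / NFFT);
        c[q - 1] = sum / NFFT;
    }
}

/* Builds the 600-dimensional feature vector from a trimmed word of
 * n samples; segment length adapts to the word length (L = 2n/51). */
void extract_features(const double *word, int n, double feat[NSEG * NCEP])
{
    int L = 2 * n / 51;
    int hop = L / 2;                        /* 50 % overlap */
    double seg[NFFT];
    for (int s = 0; s < NSEG; s++) {
        memset(seg, 0, sizeof seg);         /* zero-padding to 512 */
        for (int i = 0; i < L && i < NFFT; i++) {
            double w = 0.54 - 0.46 * cos(2.0 * PI * i / (L - 1)); /* Hamming */
            seg[i] = word[s * hop + i] * w;
        }
        real_cepstrum(seg, &feat[s * NCEP], NCEP);
    }
}
```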
2.2. Classification Stage 2.2.1. The Classifiers The inputs to the ANNs are generated by appending the cepstral coefficients of all segments, resulting in an input dimension of 600, corresponding to 12 cepstral coefficients over 50 segments. Two types of ANN were used: (i) a feed-forward (FF) ANN, trained with the backpropagation algorithm; (ii) a General Regression Neural Network (GRNN). The backpropagation FF ANN was designed with three layers: one input layer and two neuron layers with non-linear sigmoidal activation functions, as recommended for classification purposes [31]. The number of outputs corresponds to the number of commands to be recognized − in this case, seven. The pre-processed signal realizations were split into three sets, with approximately the following percentages: (i) training set: 60 %; (ii) validation set (used to avoid overtraining): 20 %; (iii) test set (also known as production set, used to assess the ANNs' generalization capability): 20 %. Preliminary results have been reported before, only for a FF ANN [40],[41]. Currently, given the good results achieved, this R&D focused on the practical implementation of the techniques used, for integration with the existing VCD. For this purpose, FFT/IFFT code was implemented in the C language, based on reference [42]; all the programming is now in C as well.
2.2.2. Training Results Results are shown for both the FF ANN and the GRNN. Right answers arise when the ANN result agrees with the desired output; wrong answers arise when it does not. Besides the results obtained independently with each type of ANN, another configuration considers the overall results obtained with both types of ANN in parallel. In this case, right answers arise when both ANNs agree on the result and this result agrees with the desired output; wrong answers arise when both ANNs agree on the result but this result does not agree with the desired output; "don't know" answers arise when the two ANNs disagree. The use of two ANNs of different types in parallel also improves the ASR robustness, since the cases in which the ANNs disagree are rejected instead of acted upon; a minimal sketch of this decision rule is given after this paragraph. The data consist of a total of 140 utterances, 20 for each of the seven commands. Table 1 and Table 2 show the results for two runs, in which commands were spoken to the system not in sequence, but randomly.
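The sketch assumes each ANN's decision is taken as the arg-max of its seven outputs, which the paper does not state explicitly; the enum and function names are illustrative.

```c
/* Parallel decision rule for the two classifiers: a command is
 * accepted only when the FF ANN and the GRNN agree; otherwise the
 * system answers "don't know" and takes no action. */
enum { CMD_ABAIXO, CMD_ACIMA, CMD_DIREITA, CMD_ESQUERDA,
       CMD_APROXIMA, CMD_AFASTA, CMD_PARA, CMD_UNKNOWN };

static int argmax7(const double *y)
{
    int best = 0;
    for (int i = 1; i < 7; i++)
        if (y[i] > y[best]) best = i;
    return best;
}

int decide(const double *ff_out, const double *grnn_out)
{
    int a = argmax7(ff_out);
    int b = argmax7(grnn_out);
    return (a == b) ? a : CMD_UNKNOWN;   /* disagreement -> "don't know" */
}
```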
Table 1. Results for the ANNs – first run (FF ANN, GRNN, and both ANNs in parallel).

                             FF         GRNN       FF and GRNN
Total of utterances          140        140        140
Right answers                139        139        138
Right answers (%)            99.29 %    99.29 %    98.57 %
Wrong answers                1          1          —
Wrong answers (%)            0.71 %     0.71 %     —
"Don't know" answers         —          —          2
"Don't know" answers (%)     —          —          1.43 %
Table 2. Results for the ANNs – second run (FF ANN, GRNN, and both ANNs in parallel).

                             FF         GRNN       FF and GRNN
Total of utterances          140        140        140
Right answers                133        136        129
Right answers (%)            95.00 %    97.14 %    92.14 %
Wrong answers                7          4          —
Wrong answers (%)            5.00 %     2.86 %     —
"Don't know" answers         —          —          11
"Don't know" answers (%)     —          —          7.86 %
3. THE VIRTUAL CONTROL DESK The VCD now replaces the synoptic windows, and also includes the networking capabilities to operate the PWR NPP compact simulator remotely. Figure 4 shows the new networking scheme, in which the synoptic windows are replaced by the VCD connected through a TCP/IP socket. Figure 5 shows a complete view of the VCD; compare it with Fig. 1. Figure 6 shows a partial view of the VCD in comparison with the real desk.
Figure 4. The new networking scheme (PWR NPP simulator, shared memory, TCP/IP socket).
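The paper does not describe the message protocol over this link, so the following is only a hedged sketch of the VCD side of the connection: a plain POSIX TCP client, with hypothetical host and port, from which the VCD would then read plant variables.

```c
/* Minimal POSIX TCP client for the VCD side of the link in Fig. 4.
 * Host, port and the subsequent read loop are illustrative only. */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int connect_to_simulator(const char *host, int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* the VCD then exchanges plant data over this socket */
}
```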
Figure 5. The virtual control desk.
Figure 6. Comparison between the real control desk and the VCD: a) a module of the real control desk; b) the virtual counterpart of the corresponding module.
4. INTEGRATION WITH THE VIRTUAL CONTROL DESK For the integration of the ASR system with the existing VCD, the spoken commands are captured automatically online with a microphone. The input signal is monitored every 0.2 s and recorded when it exceeds a threshold, for a time that is sufficient to record an entire word in most typical situations. This signal is then further processed through blank space elimination, parameter extraction and recognition, as already explained. The integration of the ASR system with the VCD was performed with OpenGL, a graphics library for the C language that was also used for the VCD development. This library offers a function related to the computer keyboard, for the execution of tasks according to the keys pressed by the user, which is already in use in this VCD. In this work, the OpenAL audio library was also used, to capture the speech signal; the position of the camera (which implements the VCD 3D view) is then changed according to the recognized command.
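A hedged sketch of this capture loop follows, using OpenAL's standard capture API (alcCaptureOpenDevice, alcCaptureStart, alcGetIntegerv with ALC_CAPTURE_SAMPLES, alcCaptureSamples). The sampling rate, amplitude threshold, 2 s word window and the `recognize` callback are all assumptions for illustration; the paper gives only the 0.2 s polling interval.

```c
/* Online capture: poll the microphone every 0.2 s; when the signal
 * exceeds a threshold, record a window long enough to hold one spoken
 * word and hand it to the ASR pipeline. */
#include <AL/al.h>
#include <AL/alc.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FS       22050             /* assumed sampling rate */
#define POLL_US  200000            /* 0.2 s polling interval */
#define THRESH   1000              /* assumed 16-bit amplitude threshold */

void capture_loop(void (*recognize)(const short *, int))
{
    ALCdevice *dev = alcCaptureOpenDevice(NULL, FS, AL_FORMAT_MONO16,
                                          FS * 2 /* 2 s ring buffer */);
    if (!dev) return;
    alcCaptureStart(dev);
    short chunk[FS / 5];           /* one 0.2 s block */
    short word[FS * 2];            /* up to 2 s for a whole word */
    for (;;) {
        usleep(POLL_US);
        ALCint avail;
        alcGetIntegerv(dev, ALC_CAPTURE_SAMPLES, 1, &avail);
        if (avail < FS / 5) continue;
        alcCaptureSamples(dev, chunk, FS / 5);
        /* Trigger: any sample in the block above the threshold. */
        int active = 0;
        for (int i = 0; i < FS / 5; i++)
            if (abs(chunk[i]) > THRESH) { active = 1; break; }
        if (!active) continue;
        /* Record enough audio to hold an entire word, then recognize. */
        int got = FS / 5;
        memcpy(word, chunk, got * sizeof(short));
        while (got < FS * 2) {
            alcGetIntegerv(dev, ALC_CAPTURE_SAMPLES, 1, &avail);
            if (avail > 0) {
                if (avail > FS * 2 - got) avail = FS * 2 - got;
                alcCaptureSamples(dev, word + got, avail);
                got += avail;
            } else usleep(POLL_US);
        }
        recognize(word, got);  /* blank elimination -> features -> ANNs */
    }
}
```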
5. CONCLUSIONS In this work, an ASR system was further developed and adapted, using C programming, for online operation, with the purpose of integrating it with an existing PWR NPP VCD. The ASR system now monitors and captures speech signals from users automatically, processing them for command recognition. The recognized commands are then input to the VCD as if they were commands coming from the computer keyboard, by using the OpenAL and OpenGL libraries. The test results show the ASR system's robustness for isolated word recognition. Besides the high rates of right answers for each of the ANNs used, some other strategies contribute to
improve its robustness: (i) the use of two different types of ANN in parallel; (ii) the use of the command "para" ("stop"). The former helps avoid wrongly recognized commands in the cases when the ANNs disagree on the result; it cannot avoid, though, those cases when both ANNs agree on the result but both are wrong. The latter lets the user stop the task execution immediately, not only when they really want to stop the camera movement, but also when they notice the executed task is not the one they wanted. This system can be readily used, not only for this VCD, but for other applications. The ASR system can also be adapted for speaker recognition, by slightly adapting the training scheme, which may improve the whole system by permitting only authorized personnel to operate the control desk.
ACKNOWLEDGMENTS Conselho Nacional de Desenvolvimento Científico e Tecnológico – CNPq and Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro – FAPERJ.
REFERENCES
1. T. B. Malone, "Human Factors of Control Room Design and Operator Performance at Three Mile Island", Washington, USA (1980).
2. B. H. Kantowitz, R. D. Sorkin, Human Factors: Understanding People System Relationships, Wiley, New York, USA (1983).
3. R. N. Pikaar, "Ergonomics in control room design", Ergonomics, Volume 33 (5), pp. 589-600 (1990).
4. J. O'Hara, W. Stubler, J. Wachtel, "Methodological Issues in The Validation of Complex Human-Machine Systems", Computer-based human support systems: technology, methods and future, Philadelphia, USA, June 25-29 (1995).
5. M. P. Feher, "Application of human factors to the Candu 6 Control Room upgrade", IAEA Workshop Specialists' Meeting on Approaches for the Integration of Human Factors into the Upgrading and Refurbishment of Control Rooms, Halden, Norway, August 23-25 (1999).
6. ANSI/ANS-3.5, "Nuclear power plants simulators for use in operators training and examinations" (1993).
7. ANSI/ANS-3.1, "Selection, qualification and training of personnel for nuclear power plants" (1993).
8. ANSI/ANS-3.2, "Administrative controls and quality assurance to the operational phase of nuclear power plants" (1994).
9. IAEA-TECDOC 565, "Control rooms and man-machine interface in nuclear power plants" (1990).
10. IAEA-TECDOC 668, "The role of automation and humans in nuclear power plants" (1992).
11. IAEA-TECDOC 812, "Control room systems design for nuclear power plants" (1995).
12. IAEA-TECDOC 812, "Control room systems design for nuclear power plants" (1995).
13. IEC 964, "Design of control rooms for nuclear power plants" (1989).
14. IEC 1227, "Nuclear power plants control rooms – operator controls" (1993).
15. IEC 1771, "Nuclear power plants – main control room – verification and validation of design" (1995).
16. IEC 1772, "Nuclear power plants – main control room – application of visual display units" (1995).
17. IAEA-TECDOC 995, "Selection, specification, design and use of various nuclear power plant training simulators" (1998).
18. P. V. R. Carvalho, I. J. Obadia, "Projeto e implementação do Laboratório de Interfaces Homem Sistema do Instituto de Engenharia Nuclear", Revista Brasileira de Pesquisa e Desenvolvimento, Volume 4 (2), pp. 226-231 (2002).
19. P. V. R. Carvalho, I. J. A. L. Santos, J. O. Gomes, M. R. S. Borges, S. Guerlain, "Human factors approach for evaluation and redesign of human-system interfaces of a nuclear power plant simulator", Displays, Volume 29 (3), pp. 273-284 (2008).
20. I. J. A. L. Santos, D. V. Teixeira, F. T. Ferraz, P. V. R. Carvalho, "The use of a simulator to include human factors issues in the interface design of a nuclear power plant control room", Journal of Loss Prevention in the Process Industries, Volume 21 (3), pp. 227-238 (2008).
21. M. V. Oliveira, I. J. A. L. Santos, P. V. R. Carvalho, A. C. A. Mól, C. H. S. Grecco, D. M. Moreira, "Design and evaluation of human-system interfaces for industrial plants", 2007 International Nuclear Atlantic Conference (INAC 2007), Santos, Brazil, September 30 - October 05, DVD-ROM (2007).
22. A. Drøivoldsmo, M. N. Louka, "Virtual reality tools for testing control room concepts", In: Liptak, B.G. (Ed.), Instrument Engineer's Handbook: Process Software and Digital Networks, 3rd ed., CRC Press, Boca Raton, USA, Volume 3, pp. 209-306 (2002).
23. E. Nystad, S. Strand, "Using virtual reality technology to include field operators in simulation and training", 27th Annual Canadian Nuclear Society Conference and 30th CNS/CNA Student Conference, Toronto, Canada, June 11-14 (2006).
24. S. Markidis, Rizwan-uddin, "A virtual control room with an embedded, interactive nuclear reactor simulator", The American Nuclear Society Winter Meeting & Nuclear Technology Expo, Albuquerque, USA, November 12-16 (2006).
25. L. F. Hanes, J. Naser, "Use of 2.5-D and 3-D technology to evaluate control room upgrades", The American Nuclear Society Winter Meeting & Nuclear Technology Expo, Albuquerque, USA, November 12-16 (2006).
26. M. A. C. Aghina, A. C. A. Mól, A. A. H. Almeida, C. M. N. A. Pereira, T. F. B. Varela, G. G. Cunha, "Full scope simulator of a nuclear power plant control room using 3D stereo virtual reality techniques for operators training", 2007 International Nuclear Atlantic Conference (INAC 2007), Santos, Brazil, September 30 - October 05, DVD-ROM (2007).
27. M. A. C. Aghina, A. C. A. Mól, C. A. F. Jorge, C. M. N. A. Pereira, T. F. B. Varela, G. G. Cunha, L. Landau, "Virtual control desks for nuclear power plant simulation: improving operator training", IEEE Computer Graphics and Applications, Volume 28 (4), pp. 6-9 (2008).
28. M. A. C. Aghina, Mesa de controle virtual para treinamento de operadores: um estudo de caso para um simulador de usina nuclear, M.Sc. Dissertation, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil (2009); available online (in Portuguese) at: http://www.coc.ufrj.br/index.php?option=com_docman&task=doc_download&gid=2000&Itemid=84.
29. S. Furui, "Cepstral analysis technique for automatic speaker verification", IEEE Transactions on Acoustics, Speech and Signal Processing, Volume ASSP-29 (2), pp. 254-272 (1981).
30. A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, USA (1989).
31. S. Haykin, Neural Networks – A Comprehensive Foundation, Prentice-Hall, Upper Saddle River, USA (1999).
32. L. R. Rabiner, R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, USA (1978).
33. R. G. Pinto, H. L. Pinto, L. P. Calôba, "Using neural networks for automatic speaker recognition", Proceedings of the IEEE 38th Midwest Symposium on Circuits and Systems, Rio de Janeiro, Brazil, August 13-16, Volume 2, pp. 1078-1080 (1995).
34. A. A. Lima, Análises comparativas em sistemas de reconhecimento de voz, M.Sc. Dissertation, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil (2000); available online (in Portuguese) at: http://www.lps.ufrj.br/profs/sergioln/theses/msc02amarolima.pdf.
35. A. A. Lima, M. S. Francisco, S. Lima Netto, F. G. V. Resende Jr, "Análise comparativa de sistemas de reconhecimento de voz", XVIII Simpósio Brasileiro de Telecomunicações, Gramado, Brazil, September 3-6, pp. 001-004 (2000).
36. S. S. Diniz, Uso de técnicas neurais para o reconhecimento de comandos à voz, M.Sc. Dissertation, Instituto Militar de Engenharia, Rio de Janeiro, RJ, Brazil (1997); available online (in Portuguese) at: http://www.labic.nce.ufrj.br/downloads/suelaine_dissertacao.zip.
37. S. Diniz, A. G. Thomé, "Uso de técnica neural para o reconhecimento de comandos à voz", IV Simpósio Brasileiro de Redes Neurais, Goiânia, Brazil, pp. 23-26 (1997).
38. B.-H. Juang, L. R. Rabiner, J. G. Wilpon, "On the use of bandpass liftering in speech recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, Volume ASSP-35 (7), pp. 947-954 (1987).
39. J. R. Deller Jr., J. G. Proakis, J. H. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, USA (1993).
40. C. A. F. Jorge, M. A. C. Aghina, A. C. A. Mól, C. M. N. A. Pereira, "Man machine interface based on speech recognition", 2007 International Nuclear Atlantic Conference (INAC 2007), Santos, Brazil, September 30 - October 05, DVD-ROM (2007).
41. C. A. F. Jorge, A. C. A. Mól, "Reconhecimento de Voz baseado em Análise Cepstral e Redes Neurais", Relatório Técnico RT-IEN-03/2007; available online (in Portuguese) at: http://memoria.cnen.gov.br/doc/pdf/Textos%20Completos/E0000857.pdf.
42. W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing, 2nd edition, Cambridge University Press, New York, USA (1992).