FIRST EXPERIMENTS OF AUTOMATIC SPEECH ACTIVITY DETECTION, SOURCE LOCALIZATION AND SPEECH RECOGNITION IN THE CHIL PROJECT

Dušan Macho, Jaume Padrell, Alberto Abad, Climent Nadeu, Javier Hernando
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, SPAIN
Email: [email protected]

Maurizio Omologo, Alessio Brutti, Piergiorgio Svaizer
Istituto Trentino di Cultura (ITC)-irst, Via Sommarive 18, Povo, Trento, ITALY
Email: [email protected]

John McDonough, Ulrich Klee, Matthias Wölfel
Institut für Logik, Komplexität und Deduktionssysteme, Universität Karlsruhe, Karlsruhe, GERMANY
Email: [email protected]

Gerasimos Potamianos, Stephen Chu
Human Language Technologies, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Email: [email protected]

In the workspace of the future, a so-called “ambient intelligence” will be realized through the widespread use of sensors (e.g., cameras, microphones, directed audio devices) connected to computers that are unobtrusive to their human users. Towards this end of ubiquitous computing, technological advances in multi-channel acoustic analysis are needed in order to solve several basic problems, including speaker localization and tracking, speech activity detection (SAD) and distant-talking Automatic Speech Recognition (ASR) [1]. The long-term goal is the ability to monitor speakers and noise sources in a real reverberant environment, without any constraint on the number or the distribution of microphones in the space, nor on the number of sound sources active at the same time. This problem is exceedingly difficult, given that the speech signals collected by a given set of microphones are severely degraded by both background noise and reverberation.

The European Commission integrated project CHIL, Computers in the Human Interaction Loop, aims to make significant advances in the three technologies mentioned above, and to integrate them in several technology demonstrators for seminar and meeting scenarios. To this end, most of the CHIL partners have set up an experimental room for data collection and demonstrator development, which includes a variety of sensors: close-talking microphones, table-top microphones, T-shaped microphone arrays each with four omnidirectional sensors, and a 64-channel Mark III microphone array developed at the National Institute of Standards and Technology (NIST). For every sensor, the input signal is recorded at 44.1 kHz with 24 bits per sample. The higher sample rate is preferable because it permits more accurate TDOA estimation, while the higher bit depth is necessary to accommodate the large dynamic range of the far-field speech data. The CHIL consortium plans to make a portion of the data available to NIST for use in upcoming meeting evaluations.

The first purpose of this contribution is to describe the current status of acoustic data collection and labeling in CHIL, based on seminars collected at the Universität Karlsruhe in Karlsruhe, Germany. In particular, manual labeling will be addressed, as it plays a relevant role in training and testing the systems under study. For instance, very different “observations” of the same acoustic scenario can be extracted from different sensors, leading to the need for new criteria for multi-channel labeling of speech and of other acoustic events that will eventually have to be classified by automatic speech activity detectors and speaker localization systems. Another critical aspect is the lack of synchronization between the audio and video recording systems, and even between different clusters of microphone signals; this is still an open problem and it prevents exploiting all the available sensors in an ideal way.

The second purpose of this contribution is to present the results of initial experiments conducted at the four partner sites, using a first portion of the CHIL Seminar Corpus (CSC). While the experimental rooms were being set up, some preliminary data were recorded during a series of seminars held in the Fall of 2003 by students at the Universität Karlsruhe. The students spoke English, but with German or other European accents, and with varying degrees of fluency.
This data collection was done in a very natural setting, as the students were far more concerned with the content of their seminars, their presentation in a foreign language, and the questions from the audience than with the recordings themselves. Moreover, the seminar room is a common work space used by other students who are not seminar participants. Hence, there are many “real world” events heard in the recordings, such as door slams, printers, ventilation fans, typing, background chatter, and the like.

In these seminars, the speakers were recorded with a Sennheiser close-talking microphone as well as two linear eight-channel microphone arrays. The sample rate of those recordings was 16 kHz with 16 bits per sample. In addition to the audio data capture, the seminars were simultaneously recorded with four calibrated video cameras at a rate of 15 frames per second.

Regarding Speech Activity Detection (SAD), results were expressed in terms of Mismatch Rate (MR), Speech Detection Error Rate (SDER) and Non-speech Detection Error Rate (NDER). The techniques explored at ITC-irst and UPC were based on energy information and on filter-bank outputs combined with LDA, respectively (see [2] for more details). Preliminary results are given in Table 1.

Site        MR (%)   SDER (%)   NDER (%)
ITC-irst    17.3     10.1       43.0
UPC         12.6     11.4       15.0

Table 1. Speech activity detection results.
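For illustration only, and not as a description of the actual ITC-irst or UPC detectors, a minimal frame-level energy-based SAD and the three scores could be sketched as follows. The frame parameters and the metric definitions (frame-level miss, false-alarm and overall mismatch rates) are assumptions, since they are not spelled out above.

import numpy as np

def frame_log_energy(x, frame_len=400, hop=160):
    # Short-time log energy; 25 ms frames with 10 ms hop at 16 kHz (assumed setup).
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)

def energy_sad(x, margin_db=15.0):
    # Mark a frame as speech if its energy exceeds a crude noise-floor estimate by margin_db.
    e = frame_log_energy(x)
    noise_floor = np.percentile(e, 10)
    return (e > noise_floor + margin_db).astype(int)

def sad_scores(hyp, ref):
    # Frame-level scores, assuming the usual definitions:
    #   SDER = missed speech frames / reference speech frames
    #   NDER = falsely detected frames / reference non-speech frames
    #   MR   = all misclassified frames / all frames
    hyp, ref = np.asarray(hyp), np.asarray(ref)
    speech, nonspeech = ref == 1, ref == 0
    sder = np.sum(speech & (hyp == 0)) / max(np.sum(speech), 1)
    nder = np.sum(nonspeech & (hyp == 1)) / max(np.sum(nonspeech), 1)
    mr = float(np.mean(hyp != ref))
    return mr, sder, nder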

The results of the first acoustic source localization experiments are shown in Table 2. In this case, all the partners explored the use of a GCC-PHAT based localization system. The differences among the results may be due to the different number of microphone pairs used to derive the time delay estimates, as well as to the post-processing applied: in this regard, UKA adopted a Kalman filter to smooth the resulting estimates, while UPC applied a correction (based on some knowledge of the experimental scenario) in the case of out-of-room estimates. Finally, note that UKA also compared GCC-PHAT with the technique recently proposed by Benesty et al. [3]; no interpolation was implemented for this new method. As shown in Table 2, the technique of Benesty and his coauthors provided azimuth and range estimates comparable to those obtained with GCC plus interpolated TDOA estimates, and hence is of interest for further study. Other work on source localization at UKA with the CHIL seminar corpus is described in [4].

Site            Azimuth (°)   Depth (cm)   Number of mic. pairs
UKA (Benesty)   12.3          91.2         N/A
UKA             10.9          95.5         12
ITC-irst        9.75          98.1         2
UPC             9.05          73.5         4

Table 2. Speaker localization results.
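As an illustration of the shared building block behind Table 2, and not any partner's actual implementation, a minimal GCC-PHAT time-delay estimate for a single microphone pair could look as follows; the sample rate and the absence of sub-sample interpolation are assumptions.

import numpy as np

def gcc_phat_tdoa(x1, x2, fs=44100, max_tau=None):
    # Estimate the time difference of arrival (in seconds) between two microphone
    # signals using the generalized cross-correlation with PHAT weighting.
    n = 2 * max(len(x1), len(x2))                        # zero-pad to avoid circular wrap-around
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)  # PHAT: keep phase information only
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # re-center lag zero
    return (np.argmax(np.abs(cc)) - max_shift) / fs

Each T-shaped cluster yields several such pairwise delays; with the array geometry known, these delays can be intersected (e.g., in a least-squares sense) to obtain azimuth and depth estimates such as those reported above, and a Kalman filter can then smooth the resulting trajectory over time, as done at UKA. Note that at 44.1 kHz a one-sample error in the correlation peak corresponds to roughly 7.8 mm of path-length difference (with c ≈ 343 m/s), which is why the higher sampling rate mentioned earlier benefits TDOA accuracy.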

The CHIL seminar data present significant challenges to both modeling components used in ASR, namely the language and acoustic models. For instance, large portions of the data contain spontaneous, disfluent, and interrupted speech, due to the interactive nature of seminars and the varying degree of the speakers’ comfort with their topics. On the acoustic modeling side, and in addition to the latter difficulty, the seminar speakers exhibit moderate to heavy German accents in their English speech. So far, ASR results on the above-mentioned seminars have been obtained only at IBM and UKA, and regard only close-talking input signals: at the two sites the word error rates were 37.3% and 41.6%, respectively. In both cases, the recognition systems were based on state-of-the-art HMM technology, used large training corpora (not collected in the target environment), and employed MAP and MLLR adaptation. Although the results presented here are only for the close-talking microphone, they show the very high complexity of the given task; the CHIL consortium is actively investigating the use of the microphone network to enhance ASR performance in the far-field case, where presently the best result is no better than about 70% WER.

References

[1] M. Brandstein and D. Ward, Eds., Microphone Arrays, Springer Verlag, 2000.
[2] J. Padrell, D. Macho, and C. Nadeu, “Robust speech activity detection using LDA applied to FF parameters,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP), 2005.
[3] J. Chen, J. Benesty, and Y. Huang, “Robust time-delay estimation exploiting redundancy among multiple microphones,” IEEE Trans. on Speech and Audio Processing, vol. 11, pp. 540-557, 2003.
[4] U. Klee and J. McDonough, “Kalman Filtering for Acoustic Source Localization based on Time Delay of Arrival,” submitted to HSCMA, 2005.
