15th European Signal Processing Conference (EUSIPCO 2007), Poznan, Poland, September 3-7, 2007, copyright by EURASIP

SPEECH EVENT DETECTION BY NON NEGATIVE MATRIX DECONVOLUTION

Carla Lopes 1,2, Fernando Perdigão 1,3

1 Instituto de Telecomunicações, 2 Instituto Politécnico de Leiria - ESTG, 3 Universidade de Coimbra - DEEC
Pólo II, P-3030-290 Coimbra, Portugal
phone: +351 239 79 62 36, fax: +351 239 79 62 93, email: {calopes, fp}@co.it.pt, web: www.it.pt

ABSTRACT

Support Vector Machines (SVM) are applied to the problem of detecting and classifying broad acoustic-phonetic classes (events). In this paper an approach based on Non-Negative Matrix Deconvolution (NMD) is proposed to merge frame-based SVM predictions into segmental events. To turn the frame-based SVM outputs into a signal segmented in terms of events, two different event-merger methods were tested, and the results, using TIMIT speech data, were compared to a broad-class detector built using HMMs with an MFCC front-end. Results show that NMD efficiently controls the number of insertion and deletion errors and outperforms the HMMs' accuracy. The quality of the event segmenter was measured by means of a recently proposed methodology for evaluating the performance of event detectors, and the results show that the proposed approach also outperforms the competing ones.

1. INTRODUCTION

Automatic Speech Recognition (ASR) systems are mainly based on Hidden Markov Modelling, which takes a top-down approach. This approach may benefit from cues generated by bottom-up front-end processing, instead of considering only acoustic parameters, to achieve improved recognition results. In this paper we propose a bottom-up approach which focuses on the identification of important elements (events) present in the speech structure, other than the conventional phonemes. These events can provide important cues to a traditional ASR system, for example by allowing the use of different parameters for different classes of events and by significantly reducing the search path.

There are several advantages to the prior detection of events. Since events can describe the phonemic characteristics of the signal, they become an additional source of information, and ASR will be the result of a combination of information sources. On the other hand, if we have accurate event detection, analysis can be anchored around the events, instead of processing speech in uniform blocks, as is usual.
Events may be associated with the signal acoustics, the signal production, the language, the speaker, etc., because any significant change in the speech signal may be treated as an event. In the literature, event-based systems are described in several contexts [1]-[6]. In this paper we address the classification of the speech signal into broad classes according to the presence of some specific features in the acoustic structure of the signal.

2. EVENT-BASED SYSTEM DESCRIPTION

A front-end which performs utterance segmentation in terms of a sequence of events along time is proposed. For that purpose, four attributes to be detected were defined: silence, frication, stops and sonorancy, such that the output of the proposed front-end is a signal segmented in terms of four broad classes: silences, fricatives, stops and sonorants. These classes have already been shown to provide a good acoustic characterization of speech [10].

The proposed event-based system has a modular structure, as shown in Figure 1. The first module consists of an SVM classifier, which is described in detail in Section 3. The outputs of this module provide membership predictions for each event class, for each frame. With this modular structure, this classifier may easily be replaced by another classifier, such as an artificial neural network. Since SVMs do not naturally output posterior probabilities, the predictions were normalized by a softmax function, ensuring that a number between 0 and 1 is allocated to each class (with unit sum over classes). To turn the frame-based SVM outputs into a signal segmented in terms of events, an event merger is required. This event merger outputs segmental events and is described in Section 4.
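The softmax normalization described above can be sketched as follows (a minimal illustration, not the authors' code; the example scores are hypothetical, and the class names are taken from the text):

```python
import numpy as np

def softmax(scores):
    """Map raw per-frame SVM decision values to pseudo-posteriors
    in (0, 1) that sum to one over the event classes."""
    z = np.exp(scores - np.max(scores))  # shift by max for numerical stability
    return z / np.sum(z)

# Example: hypothetical raw SVM outputs for one frame, one score per class
classes = ["silence", "fricative", "stop", "sonorant"]
frame_scores = np.array([0.3, -1.2, 0.1, 2.0])
posteriors = softmax(frame_scores)
```

Subtracting the maximum before exponentiating leaves the result unchanged but avoids overflow for large decision values.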

[Figure 1: Speech Signal → SVM Classifier → Softmax Function → Event Merger Block (Event Merger followed by a Rule-Based Stage) → Speech Signal Segmented in 4 broad classes of events.]

Figure 1 – Event-based system's modular structure.
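The NMD-based event merger itself is presented later in the paper. As general background, non-negative matrix deconvolution in the Smaragdis sense factors a non-negative matrix V (here it would be class posteriors over frames) as V ≈ Σ_t W(t)·shift_t(H), using multiplicative KL-divergence updates. The following is a generic sketch under assumed conventions, not the authors' implementation; all sizes and variable names are illustrative:

```python
import numpy as np

def shift(M, t):
    """Shift columns of M right by t frames, zero-padding on the left."""
    if t == 0:
        return M
    out = np.zeros_like(M)
    out[:, t:] = M[:, :-t]
    return out

def shift_left(M, t):
    """Shift columns of M left by t frames, zero-padding on the right."""
    if t == 0:
        return M
    out = np.zeros_like(M)
    out[:, :-t] = M[:, t:]
    return out

def nmd(V, R, T, n_iter=200, seed=0):
    """Deconvolve V (K x N) as V ~ sum_t W[t] @ shift(H, t), with
    W[t] (K x R) and H (R x N) non-negative (KL multiplicative updates)."""
    rng = np.random.default_rng(seed)
    K, N = V.shape
    W = rng.random((T, K, R)) + 0.1
    H = rng.random((R, N)) + 0.1
    eps = 1e-9
    ones = np.ones_like(V)
    for _ in range(n_iter):
        Q = V / (sum(W[t] @ shift(H, t) for t in range(T)) + eps)
        num = sum(W[t].T @ shift_left(Q, t) for t in range(T))
        den = sum(W[t].T @ ones for t in range(T)) + eps
        H *= num / den                                # update activations
        Q = V / (sum(W[t] @ shift(H, t) for t in range(T)) + eps)
        for t in range(T):
            Ht = shift(H, t)
            W[t] *= (Q @ Ht.T) / (ones @ Ht.T + eps)  # update templates
    return W, H

# Tiny demo: recover sparse activations from a synthetic mixture
rng = np.random.default_rng(1)
K, R, N, T = 5, 2, 40, 3
W_true = rng.random((T, K, R))
H_true = np.zeros((R, N))
H_true[0, 5] = 1.0
H_true[1, 20] = 1.0
V = sum(W_true[t] @ shift(H_true, t) for t in range(T)) + 1e-6
W_est, H_est = nmd(V, R, T)
recon = sum(W_est[t] @ shift(H_est, t) for t in range(T))
rel_err = np.linalg.norm(V - recon) / np.linalg.norm(V)
```

Because the updates are multiplicative, factors initialized positive stay non-negative throughout, which is what lets the templates W(t) be read as temporal event patterns.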



Two methods of event merging were tested and compared. Due to the high number of event insertions, we combined the event merger with a rule-based stage, which is also described in Section 4.
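As a baseline illustration of what an event merger must do, per-frame decisions can be collapsed into segments by merging runs of consecutive identical labels. This is a naive sketch of the task, not one of the paper's two merger methods:

```python
def merge_frames(labels):
    """Collapse a per-frame label sequence into (label, start, end)
    segments; end indices are exclusive."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # close the current run at the sequence end or on a label change
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments

# Hypothetical frame-level decisions at a fixed frame rate
frames = ["sil", "sil", "fric", "fric", "fric", "sil"]
events = merge_frames(frames)
```

Such a merger produces one inserted event for every spurious one-frame label flip, which is why a rule-based cleanup stage (or a model-based merger such as NMD) is needed on top of it.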

[Figure: the speech signal undergoes Acoustic Feature (AF) Extraction, yielding per-frame features including maximum amplitude, spectral flatness measure, spectral centroid, log energy ratio at high/low frequencies, median of the energy in a 9th filter bank, and energy.]
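Two of the listed spectral features can be computed per frame as sketched below. This is an illustrative implementation using standard definitions; the sampling rate, frame length, and window choice are assumptions, not the paper's settings:

```python
import numpy as np

def spectral_features(frame, sr=16000):
    """Spectral flatness (geometric / arithmetic mean of the magnitude
    spectrum) and spectral centroid (magnitude-weighted mean frequency)
    of one speech frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-12
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    flatness = np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum)
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)
    return flatness, centroid

# Sanity check: a pure tone is spectrally peaked, white noise is flat
rng = np.random.default_rng(0)
noise_flat, _ = spectral_features(rng.standard_normal(400))
t = np.arange(400) / 16000.0
tone_flat, tone_centroid = spectral_features(np.sin(2 * np.pi * 1000.0 * t))
```

Flatness near 1 indicates a noise-like (fricative) spectrum, while values near 0 indicate a tonal (sonorant) spectrum, which is why this measure is useful for broad-class discrimination.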