PORTING AN AUDIO PARTITIONER ACROSS DOMAINS

Mauro Cettolo
ITC-irst, Centro per la Ricerca Scientifica e Tecnologica
I-38010 Povo di Trento - Italy
[email protected]

ABSTRACT

Partitioning an audio stream means segmenting it into acoustically homogeneous chunks, classifying the segments into acoustic classes, and clustering the speech segments. The process represents the earliest stage of automatic transcription stations, since it allows portions of the audio not containing speech to be filtered out and recognition accuracy to be improved through the use of condition-dependent acoustic models and adaptation techniques. Hence, when transcription systems are applied to new domains, the porting process involves the partitioner module too. In this work, the porting of the partitioner of the ITC-irst broadcast news transcription system to the domain of historical films is described in detail and experimentally evaluated. Moreover, a new technique for the automatic estimation of the working point of the BIC-based segmentation algorithm, which makes the porting easier, is introduced.

1. INTRODUCTION

The problem of acoustic partitioning has become crucial to the application of automatic speech recognition to audio stream processing. For instance, in order to generate transcripts of broadcast news programs, it is necessary to isolate and filter out portions of the signal which do not contain speech, e.g. jingles and signature tunes. Moreover, transcription accuracy can be significantly improved by using condition-dependent acoustic models, if the speech signal is segmented and classified according to bandwidth, speaker gender, and speaker identity.
In recent years, several different architectures of the partitioner module have been presented. Many of them include an algorithm for segmenting the audio stream based on some "distance" measure, like Hotelling's T²-test [1], the Kullback-Leibler distance [2, 3], the generalized likelihood ratio [4], the entropy loss [3], and the Bayesian Information Criterion (BIC) [5, 6, 7, 8, 9, 10]. In the BIC, a threshold is usually introduced to tune the algorithm to the particular data under processing.
This means that whenever the segmenter is ported to a new domain, that threshold has to be set.
In this paper, the porting of the partitioner of the ITC-irst broadcast news (BN) transcription system [11] to the domain of historical films is described in detail and experimentally evaluated. Moreover, a new technique for the automatic estimation of the working point of the BIC-based segmentation algorithm is introduced, and its effectiveness is shown.
(This work was partially financed by the European Commission under the projects CORETEX (IST-1999-11876) and ECHO (IST-1999-11994).)

2. CORPORA

The starting partitioner was that of the ITC-irst BN transcription system. The BN domain covers both radio and television broadcast news programs. The two subtasks are rather similar both from the acoustic viewpoint and from that of content: this justifies the choice of developing a single transcription station instead of two. For developing the BN partitioner, training and test data were selected from the IBNC corpus [12], a collection of radio news programs (Grr), and from internal recordings of television news programs (Tg). The sizes of the data sets are reported in the three rightmost columns of Table 1. In addition, 21 minutes of excerpts from different types of music were employed for training purposes.
           Echo       BN: Grr    BN: Tg     BN: Grr+Tg
training   6h:01'     28h:50'    4h:10'     33h:00'
test       3h:15'     1h:15'     0h:40'     1h:55'
Table 1. Training and test set sizes.

The goal of the work presented here was to port the BN partitioner to the Echo domain. Echo is a European project (http://pc-erato2.iei.pi.cnr.it/echo/) - the acronym stands for European CHronicles On-line - in which automatic recognition techniques are applied to transcribe the audio track of Italian historical films. Data were provided by Istituto Luce (http://archivioluce.com); the sizes of the training and test sets employed in the porting described in this work are given in the "Echo" column of Table 1.
3. THE PARTITIONER

The partitioner consists of two main modules (see Figure 1): the Bayesian Information Criterion (BIC)-based segmenter and the Gaussian Mixture Model (GMM)-based classifier.

[Fig. 1. Architecture of the Partitioner: the input audio stream is split by the BIC segmenter into a segment sequence (seg 1, ..., seg s); each segment is then classified by a Viterbi decoder over a finite state network (FSN) of audio classes AC1, ..., ACM, each modeled by a GMM, with loop weight α < 1.0, yielding the "segmented" and classified output.]
3.1. BIC-based Segmentation

Segmenting an audio stream means detecting the time indexes corresponding to changes in the nature of the audio, in order to isolate segments that are homogeneous in terms of bandwidth and speaker. Our technique bases segmentation on a statistical model selection criterion, by applying the BIC [13, 5]. Let x_1, ..., x_k, ..., x_n be an ordered sample of acoustic feature vectors. Two competing models are compared: M_n, which assumes that the whole sample is generated by a single Gaussian, and M_k, which assumes that x_1, ..., x_k and x_{k+1}, ..., x_n are generated by two distinct Gaussians. A change point is hypothesized at time k if:

    (log L_k - P_k) - (log L_n - P_n) > 0        (4)

where L_k and L_n are the likelihoods given by the models M_k and M_n respectively, while P_k and P_n are the corresponding penalties. The BIC penalty is one of a number of penalty functions proposed in the literature [14], and is defined as:

    P = (k/2) log n

where k is the number of free parameters in the model. The sensitivity of the method can be tuned by replacing the zero threshold of Eq. 4 with a value λ that is used to adjust the method to the particular task under consideration. This means that whenever the segmenter is ported to a new domain, that threshold has to be set. In order to apply the above described method to an arbitrarily large number of potential change points, a sliding analysis window is applied to the input audio signal, as described in [8].
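As an illustration, the following minimal sketch implements the decision rule above on a window of feature vectors. It is not the ITC-irst implementation: the helper names (gauss_loglik, delta_bic, detect_changes), the covariance regularization, and the window, step and margin sizes are assumptions chosen for the example, while the actual sliding-window strategy is the one described in [8].

```python
import numpy as np

def gauss_loglik(x):
    """Maximized log-likelihood of frames x (N x d) under one full-covariance Gaussian."""
    n, d = x.shape
    cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)   # ML covariance, lightly regularized
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)

def delta_bic(x, k):
    """Left-hand side of Eq. 4 for a change hypothesized after frame k of window x."""
    n, d = x.shape
    n_par = d + d * (d + 1) / 2.0             # free parameters of one full-covariance Gaussian
    p_k = 0.5 * (2 * n_par) * np.log(n)       # penalty of M_k (two Gaussians)
    p_n = 0.5 * n_par * np.log(n)             # penalty of M_n (one Gaussian)
    log_lk = gauss_loglik(x[:k]) + gauss_loglik(x[k:])
    log_ln = gauss_loglik(x)
    return (log_lk - p_k) - (log_ln - p_n)

def detect_changes(x, lam=0.0, win=500, step=100, margin=50):
    """Slide a window over x and collect change points whose delta_bic exceeds the threshold lam."""
    changes, start = [], 0
    while start + win <= len(x):
        w = x[start:start + win]
        ks = list(range(margin, win - margin))
        scores = [delta_bic(w, k) for k in ks]
        best = int(np.argmax(scores))
        if scores[best] > lam:                # Eq. 4 with the zero threshold replaced by lambda
            changes.append(start + ks[best])
            start += ks[best]                 # restart the analysis right after the change
        else:
            start += step                     # no change found: slide the window forward
    return changes
```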
3.2. Segment Classification

The final goal of the partitioning stage is to classify each acoustically homogeneous segment in terms of broad acoustic classes, in order to supply the recognizer with information useful both for activating specialized acoustic models and for properly performing adaptation processes. The acoustic classes are modeled by GMMs. The classification is done through the Viterbi algorithm on a search space in which the activation of a new class is possible at each time instant (loop topology). This process induces a further segmentation of the segments output by the BIC, since the time indexes of class changes correspond to segment boundaries. In this way, changes missed by the BIC can be recovered. Porting the classifier from one domain to another requires the adaptation/training of the GMMs and the setting of the weight of the loop network.
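The following sketch illustrates one way such a class loop could be decoded; the function names, the single log(α) switching penalty, and the frame-synchronous formulation are assumptions made for this example, not the ITC-irst decoder. The per-frame log-likelihood matrix can be obtained, for instance, by scoring every frame with each class GMM.

```python
import numpy as np

def viterbi_classify(frame_loglik, loop_weight=0.9):
    """
    frame_loglik: (T x M) per-frame log-likelihoods, one column per acoustic-class GMM.
    loop_weight:  the alpha < 1.0 of the class loop; switching class costs log(alpha), staying costs 0.
    Returns the best class index for every frame.
    """
    T, M = frame_loglik.shape
    switch = np.log(loop_weight)
    delta = frame_loglik[0].copy()                # best partial score ending in each class
    backptr = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        stay = delta                              # no class change
        jump = delta.max() + switch               # change class, paying the loop weight
        backptr[t] = np.where(stay >= jump, np.arange(M), delta.argmax())
        delta = np.maximum(stay, jump) + frame_loglik[t]
    path = np.empty(T, dtype=int)                 # backtrack the best path
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

def class_boundaries(path):
    """Frame indexes where the decoded class changes (the classifier-induced segmentation)."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```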
3.3. Automatic Tuning of the BIC λ

The decision threshold λ of the BIC can be estimated by exploiting the segmentation induced by the classifier in the following way. The whole audio stream is given as input to the GMM-based classifier; as explained above, it induces a segmentation. For each hypothesized boundary, the BIC threshold λ that would allow that boundary to be hypothesized is computed. The set of these λs is then used to select the operating point of the BIC algorithm. For example, a possible choice is the minimum of all the λs, since in this way the BIC would be able to hypothesize at least all the boundaries induced by the classifier, in addition to the boundaries internal to the acoustic classes, which the classifier cannot detect. Other choices are the average of a certain number of the lowest λs, or the mean of all of them; both alternatives make the method more robust to errors in the segmentation induced by the classifier. Actually, the best experimental results were obtained by setting the BIC decision threshold to the mean of all the λs. With this technique the tuning is completely automatic, and the new partitioner depends only on the availability of a set of GMMs for modeling the acoustic classes. (The weight of the loop network is kept fixed in this work: the experimental results will show that this parameter is not critical for cross-domain porting of the partitioner.)
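A minimal sketch of this estimation is given below. It reuses the hypothetical delta_bic helper from the Section 3.1 sketch and assumes that the classifier-induced boundaries are available as frame indexes (for instance from class_boundaries in the Section 3.2 sketch); the 250-frame analysis context around each boundary is an assumption, not a value from the paper.

```python
import numpy as np

def estimate_lambda(features, boundaries, context=250, use="mean"):
    """
    For each classifier-induced boundary, compute the largest threshold that would still
    let the BIC hypothesize it, i.e. delta_bic (Section 3.1 sketch) evaluated on a window
    centred on the boundary; then combine the values into a single operating point.
    """
    lambdas = []
    for b in boundaries:
        lo, hi = max(0, b - context), min(len(features), b + context)
        if b - lo < 50 or hi - b < 50:        # skip boundaries too close to the stream edges
            continue
        lambdas.append(delta_bic(features[lo:hi], b - lo))
    if not lambdas:
        return 0.0                            # fall back to the plain (zero-threshold) BIC
    return float(np.min(lambdas)) if use == "min" else float(np.mean(lambdas))
```

Setting use="min" reproduces the first choice discussed above, while the default mean corresponds to the configuration that gave the best experimental results.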
4. PORTING THE PARTITIONER

In this section it will be experimentally shown that the partitioner developed for the BN domain cannot be applied to the Echo domain without a proper porting. This is true in particular for the GMMs of the classifier (see Subsection 4.3), while the λ of the BIC has to be estimated on a development set (see Subsection 4.2), unless the automatic tuning is employed, whose effectiveness is proved by the results of Subsection 4.4.

4.1. Evaluation Metrics

In order to experimentally assess the steps of the porting process, some metrics are defined first.

4.1.1. Evaluation of the Segmenter

Performance of automatic spectral change (SC) detection should be calculated with respect to a set of target SCs. Each target SC is usually associated with a time interval [S_SC, E_SC], rather than a single point, because silence or other non-speech events may occur between changes. Tolerances in detecting SCs can be introduced by extending such intervals. Hence, a hypothesized SC is considered correct if it falls inside one of the augmented target intervals [S_SC - tol, E_SC + tol], where tol is the admitted tolerance. For comparing target and hypothesized SCs, the recall and precision measures are adopted. A combined measure of recall R and precision P is the F0-score which, if equal relative importance of recall and precision is assumed, is defined as F0 = 2PR / (P + R).

4.1.2. Evaluation of the Classifier

For evaluating the classifier, we simply compute the frame classification accuracy (FCA), that is, the percentage of frames classified in the correct class out of the total number of frames. The classes under consideration are the following 5 generic acoustic classes:

WB : wideband speech {female, male}
NB : narrow-band speech
NonSpeech : silence, music, noise, ...

The choice of such broad classes tries to summarize in just one measure the gender and bandwidth errors, the rate of speech lost, and the rate of non-speech wrongly classified as speech.

4.1.3. Evaluation of the Partitioner

Since the partitioner is a module of the transcription station, whose final goal is to transcribe the speech of the input audio stream, the most important metric for evaluating its performance is the word error rate (WER) of the whole system. Hence, for the most significant experiments, the WER is also supplied.
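These metrics can be computed, for instance, as in the sketch below; the representation of target SCs as (S_SC, E_SC) pairs in seconds and the matching policy (a hypothesis is correct if it falls inside any augmented interval, a target is recalled if at least one hypothesis falls inside it) are an interpretation of the definitions above, not code from the paper.

```python
def segmentation_scores(hyp_changes, target_intervals, tol=0.5):
    """
    hyp_changes: hypothesized change times in seconds; target_intervals: list of (S_SC, E_SC) pairs.
    A hypothesis is correct if it falls inside an interval extended by tol on both sides.
    Returns (recall, precision, F0).
    """
    augmented = [(s - tol, e + tol) for s, e in target_intervals]
    hit_targets, correct_hyps = set(), 0
    for h in hyp_changes:
        matched = [i for i, (s, e) in enumerate(augmented) if s <= h <= e]
        if matched:
            correct_hyps += 1
            hit_targets.update(matched)
    recall = len(hit_targets) / len(target_intervals) if target_intervals else 0.0
    precision = correct_hyps / len(hyp_changes) if hyp_changes else 0.0
    f0 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return recall, precision, f0

def frame_classification_accuracy(hyp_labels, ref_labels):
    """FCA: fraction of frames assigned to the correct acoustic class."""
    assert len(hyp_labels) == len(ref_labels)
    correct = sum(1 for h, r in zip(hyp_labels, ref_labels) if h == r)
    return correct / len(ref_labels)
```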
4.2. Porting the Segmenter

As shown in Figure 2, the value of λ strongly affects the performance of the segmenter. In particular, the optimal values of λ for the two domains are quite different. On the training set of Echo, the best segmentation in terms of F0 is obtained with λ = 300; from here on, that value will be referred to as λ_Echo.

[Fig. 2. Precision, recall and F0 on the BN and Echo test sets as functions of the BIC λ (tol=500ms), for λ ranging from 0 to 1000.]
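Explicit tuning on a development set, as done here to obtain λ_Echo = 300, can be sketched as a simple grid search. The snippet below reuses the hypothetical detect_changes and segmentation_scores helpers from the earlier sketches; the grid (0 to 1000 in steps of 50, mirroring the range of Figure 2), the frame rate, and the tolerance are assumptions of the example.

```python
import numpy as np

def tune_lambda_on_devset(features, target_intervals, frame_rate=100.0,
                          grid=np.arange(0.0, 1001.0, 50.0), tol=0.5):
    """Run the Section 3.1 sketch for each candidate lambda and keep the one with the highest F0."""
    best_lam, best_f0 = None, -1.0
    for lam in grid:
        hyp_frames = detect_changes(features, lam=lam)
        hyp_times = [f / frame_rate for f in hyp_frames]        # frame indexes -> seconds
        _, _, f0 = segmentation_scores(hyp_times, target_intervals, tol=tol)
        if f0 > best_f0:
            best_lam, best_f0 = lam, f0
    return best_lam, best_f0
```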
4.3. Porting the Classifier

In Figure 3, both classification results in terms of FCA and recognition results in terms of WER on the Echo test set are given as functions of the training data. These experiments were conducted by employing the ITC-irst BN recognizer, with just the acoustic model adapted to the Echo domain, and with λ_Echo for the BIC-based segmentation. It emerges that GMMs trained on BN data classify Echo documents poorly, although speech recognition is rather accurate, while even a small amount of Echo training data (i.e. 1 hour) yields performance quite close to that corresponding to the use of all the available training data (about 6 hours).

[Fig. 3. FCA and WER as functions of the training data: using BN GMMs and using an increasing amount of ECHO data (1h to 6h) for training.]
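As an illustration of how a set of domain-specific class GMMs could be trained, the sketch below uses scikit-learn; the 32 components and 39-dimensional features match the set-up quoted in Section 5, while the diagonal covariances, the class names, and the data layout are assumptions of this example, not the ITC-irst tools.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(features_by_class, n_components=32, seed=0):
    """
    features_by_class: dict mapping a class name (e.g. 'WB-female', 'NB', 'NonSpeech')
    to an (N x 39) array of acoustic feature vectors for that class.
    Returns one GMM per class (diagonal covariances assumed here; 32 components as in Section 5).
    """
    gmms = {}
    for name, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              max_iter=50,
                              random_state=seed)
        gmm.fit(feats)
        gmms[name] = gmm
    return gmms

def frame_logliks(gmms, features, class_order):
    """Stack per-class frame log-likelihoods into the (T x M) matrix used by viterbi_classify()."""
    return np.column_stack([gmms[c].score_samples(features) for c in class_order])
```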
4.4. Impact of the Partitioner on the WER

In order to quantify the impact of the automatic partitioner on the performance of the transcription station, a set of recognition experiments was performed by using different partitioners on both the Echo and the BN domains. Recognition results are reported in Table 2. The first row shows WERs obtained by using manual annotations instead of any automatic partitioner; this performance has to be considered as the reference. The second row gives the performance obtained with domain-specific partitioners, in which both the BIC λs and the GMMs are tuned/trained on the domain-specific training sets. The last row refers to experiments in which the partitioner with the automatic tuning of the BIC λ is employed; it has to be noted that in these cases, while the GMMs are domain specific, the segmenter is exactly the same for the two domains.

partitioner        Echo    Grr     Tg      Grr+Tg
manual             32.8    17.7    20.0    18.5
λ_Echo, λ_BN       35.9    18.9    21.3    19.7
AutEst             35.0    18.8    21.5    19.7
Table 2. Recognition results (WERs) with different partitioners.

5. COMMENTS

(a) The degradation of recognition performance due to the automatic partitioner is similar across domains/subdomains. Among other things, this confirms that it is possible to keep the same weight of the loop network across domains (as in fact we did) without affecting performance.
(b) The automatic estimation of the BIC λ allows performance close to that obtained with its explicit tuning on a development set; in the case of the Echo domain, it even allows better accuracy. This is possible because, unlike the explicit tuning, in which a single λ is used for segmenting all the test documents, the automatic tuning is done for each document, permitting a more refined behavior, at least in theory.
(c) Both the classification and the automatic estimation of λ need a set of domain-specific GMMs. Figure 3 shows that not much data is required to train such models well; this is due to the fact that the acoustic classes to be discriminated are few and well separated, and the models have few free parameters to be estimated (in our set-up, the GMMs have 32 components, and the feature space size is 39). In the future, the use of a small amount of annotated data together with techniques of adaptation from available models of other domains and/or techniques of unsupervised training will be investigated.
6. REFERENCES

[1] S. Wegmann, P. Zhan, and L. Gillick, "Progress in broadcast news transcription at Dragon Systems," in Proc. ICASSP, Phoenix, AZ, 1999, pp. 33-36.
[2] M. A. Siegler, U. Jain, B. Raj, and R. M. Stern, "Automatic segmentation, classification and clustering of broadcast news audio," in Proc. of the DARPA Speech Recognition Workshop, Chantilly, VA, February 1997.
[3] T. Kemp, M. Schmidt, M. Westphal, and A. Waibel, "Strategies for automatic segmentation of audio data," in Proc. ICASSP, Istanbul, Turkey, 2000, pp. III:1423-1426.
[4] H. Gish, M. H. Siu, and R. Rohlicek, "Segregation of speakers for speech recognition and speaker identification," in Proc. ICASSP, Toronto, Canada, 1991, vol. 2, pp. 873-876.
[5] S. S. Chen and P. S. Gopalakrishnan, "Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion," in Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, 1998.
[6] M. Harris, X. Aubert, R. Haeb-Umbach, and P. Beyerlein, "A study of broadcast news audio stream segmentation and segment clustering," in Proc. EUROSPEECH, Budapest, Hungary, 1999, vol. 3, pp. 1027-1030.
[7] A. Tritschler and R. Gopinath, "Improved speaker segmentation and segments clustering using the Bayesian Information Criterion," in Proc. EUROSPEECH, Budapest, Hungary, 1999, vol. 2, pp. 679-682.
[8] M. Cettolo, "Segmentation, classification and clustering of an Italian broadcast news corpus," in Proc. of the 6th RIAO (Content-Based Multimedia Information Access) Conference, Paris, France, 2000.
[9] P. Sivakumaran, J. Fortuna, and A. M. Ariyaeeinia, "On the use of the Bayesian Information Criterion in multiple speaker detection," in Proc. EUROSPEECH, Aalborg, Denmark, 2001, vol. 2, pp. 795-798.
[10] C. J. Wellekens, "Seamless navigation in audio files," in Proc. of the ITRW on Speaker Recognition, Crete, Greece, 2001, pp. 9-12.
[11] F. Brugnara, M. Cettolo, M. Federico, and D. Giuliani, "Advances in automatic transcription of broadcast news," in Proc. ICSLP, Beijing, China, 2000, pp. II:660-663.
[12] M. Federico, D. Giordani, and P. Coletti, "Development and evaluation of an Italian broadcast news corpus," in Proc. of the 2nd Int. Conf. on Language Resources and Evaluation (LREC), Athens, Greece, 2000.
[13] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[14] M. Cettolo and M. Federico, "Model selection criteria for acoustic segmentation," in Proc. of the ISCA Automatic Speech Recognition Workshop, Paris, France, 2000.