Exploiting speaker segmentations for automatic role detection
An application to broadcast news documents

Benjamin Bigot

Isabelle Ferrané

Julien Pinquier

IRIT - Université de Toulouse
118, route de Narbonne - 31062 Toulouse Cedex 9 - France
{bigot, ferrane, pinquier}@irit.fr

Abstract

In the field of automatic audiovisual content-based indexing and structuring, finding events like interviews, debates, reports, or live commentaries requires bridging the gap between low-level feature extraction and such high-level event detection. In our work, we consider that detecting speaker roles in order to enrich interaction sequences between speakers is a first step toward this goal. The generic method we propose follows a data mining approach. We assume that speaker roles can emerge from parameters extracted from speaker segmentations, without taking any prior information into account. Each speaker is then represented by a feature vector carrying temporal, signal and prosodic information. In this paper, we study how methods for dimensionality reduction and classification can help to recognize speaker roles. This method is applied to the corpus of the ESTER2 evaluation campaign, and our best result reaches about 72% of correctly recognized roles, which corresponds to nearly 79% of the speech time.

1. Introduction

Archiving and exploiting audiovisual data masses require automatic methods for content indexing and structuring, as well as for filtering. It is of prime importance to guarantee efficient access to information through complex queries at a high semantic level, and to develop tools which enable relevant browsing through audiovisual data masses. Finding events like interviews, debates, reports, or live commentaries in audiovisual documents requires bridging the gap between low-level feature extraction and high-level event detection using relevant descriptors. Although tools which automatically extract low-level features from audio and video are numerous, this is not the case for high-level features. For the past few years, methods have been developed in several domains in order to bridge this gap, particularly in web document search [17], summarization, information retrieval in sports video [13] and content discovery in audio data [4].

First, we present our main motivations for detecting interaction and speaker roles. Section 3 explains how our work on speaker role detection stands among existing work. The generic method we propose is described in section 4. It is based on the idea that a data mining approach, applied to various low-level features extracted from temporal segmentations as well as from prosodic or basic signal analysis, can make dominant descriptors emerge. To do so, methods for dimensionality reduction are applied, followed by a classification method. Finally, experiments carried out on the audio corpus are presented in section 5.

2. Motivations

2.1. Detecting interaction sequences

In order to achieve high-level indexing of audio or video contents, we are particularly interested in detecting and characterizing interaction sequences between speakers. As a matter of fact, these sequences are important clues about content structure because they can (1) delimit other sequences, (2) be part of recurring patterns or (3) be considered as high-level events if they correspond to interviews or debates. Furthermore, developing methods for uncovering interaction sequences can be a way to focus on verbal exchanges that are more informal or less structured, because they may contain spontaneous and conversational speech. We try to bring our contribution to this field by studying the temporal structuring of audiovisual contents, and by detecting and characterizing interaction zones between speakers. This work has found an applicative framework through the EPAC Project [7], conducted within the French ANR Project ANR-06CIS6-MDCA-006.

Detecting conversational speech sequences could be a way to anticipate difficulties met by automatic speech transcription systems on this type of data [12]. Segmenting the data flow into interaction zones could also bring some interesting clues in the field of named entity recognition, as such sequences generally start with the presenter greeting his guests or introducing them to the listeners, and end with the presenter taking his leave of them. Finally, detecting interaction zones and characterizing them, as debates for instance, will also be helpful in the field of opinion mining.

2.2. Detecting speaker roles

We think that role detection is a central element in content-based indexing, and that studying interaction and extracting information about the main speaker roles will help to go a step further toward high-level event detection. Usually in audiovisual shows, the behaviours adopted by speakers depend on the role they are in charge of. Some of them can be present all along the show, while others make only a short appearance in it. They may also appear alone or interact with one or more other speakers. Some people act as the anchorman of the sequence and others as guests. The way people interact together depends on this, as does the liveliness of the interaction, which can be soft or stronger if speakers are involved in an animated debate. That is why we aim to extract some information about speaker roles and study their behaviour in terms of interaction. In order to better place our contribution, the next section gives an overview of the state of the art in role recognition.

3. Related work

In work related to role recognition, three main categories of roles have classically been studied: anchorman, journalist, and a third one gathering the other speakers, most of the time called guests or others. In a content structuring perspective, different approaches are followed.

3.1. State of the art

Role detection is a quite recent research field and, to our knowledge, little work on this topic has been reported in the computer science literature. A first type of research work is based on observing role sequence patterns that can be found in a set of documents recorded from the same program, while other work builds summaries of broadcast news by detecting the journalist who presents the headlines. In 1999, Stolcke [15] worked on broadcast news document structuring, pointing out relations between changes of speaker roles along each document as well as changes of topics. In 2000, Barzilay [1] presented one of the first results for an automatic role recognition task. This work proposes a parameterization of roles based on lexical and contextual features extracted from audio transcriptions, as well as a first accurate definition of role categories: anchorman, journalist and guest, characterized by means of linguistic considerations. The corpus used for evaluation was composed of 35 recordings of the same broadcast show, i.e. with similar content and structure, which represents 17 hours of audio data. In 2006, Liu [14] proposed two different approaches in order to attribute speaker roles chosen among three categories: anchorman, reporter and others. The first approach exploits Hidden Markov Models, the second one is based on a Maximum Entropy classifier. In both cases, manual transcriptions are used to train role-specific N-gram language models and predict roles for a set of test documents. Evaluation was conducted on 336 broadcast news shows from different sources (170 hours of audio data), and 77% of the overall number of speech turns were correctly labelled by these two classifiers. More recently, Vinciarelli [16] has proposed two methods for role detection, both applied to speaker segmentations. The first method exploits intervention duration distributions, while the second one is based on Social Network Analysis. Both have been applied to a very homogeneous corpus consisting of 96 recordings of the same twelve-minute broadcast news show (around 19 hours). Performances reached 85% of the overall duration correctly labelled using 6 different roles (anchorman, second anchorman, guest, interview participant, abstract and meteo). These propositions have been followed by Favre's contribution [9], who integrated a Social Affiliation Network to extract features characterizing speaker interactions and speaker positions within documents. The prediction is achieved with Hidden Markov Models and N-gram language models on two corpora: (1) a less homogeneous one made of 36 hours of broadcast news and talk shows, and (2) a less structured one made of 45 hours of meetings. The performances, 80% on (1) and 45% on (2), highlight the difficulty of detecting roles in poorly structured documents.

3.2. About our contribution

Most of the research work mentioned above is based on very homogeneous audio corpora. Our first goal is to propose a generic method able to process any type of document, from different sources and programs, because this diversity has an impact on speaker roles, document durations as well as document structures. In the perspective of data mass structuring, we have to be able to make different types of structure emerge in order to automatically cluster documents according to their temporal structure. Another reason for proposing a generic method is that we want to act before the transcription step, in order to help transcription.

4.3. Parameterization

To our knowledge, this is the first time these types of descriptors have been used for speaker role detection. As a result of our investigations in [2], we think that temporal features combined with prosodic features can be characteristic of speaker roles in the context of broadcast news documents, whatever the source and the program are. For each speaker, we extract a set of 34 features. Five of them are based on temporal measurements: his overall speaking time, his speaking span (duration between his first and last segments), his inactivity rate, which is based on the difference between the two previous parameters, the number of his segments, and his speaking time over the duration of the show in which he appears. A second subset of 25 features is based on signal energy and is directly extracted from the audio files. For example, we look for characteristics describing the zones of silence: length, number, rate, mean, variance, minimum and maximum. We also perform these computations over the high-energy zones, the Signal-to-Noise Ratio and the results of a telephone detector. The last four features are related to the pitch of the speech signal: pitched zone rate, pitch average, pitch variance and maximal pitch. Depending on the document we are processing, several of these features may be insignificant or strongly correlated for a speaker role detection task. However, since we do not use any prior knowledge about speakers, we have to propose an exhaustive set of features to guarantee that our method is suitable for any document.
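As an illustration, the following minimal sketch shows how the five temporal features could be computed from a speaker segmentation given as a list of (start, end) times in seconds. The function name and the exact formulation of the inactivity rate are illustrative assumptions, not code from our system.

```python
# A sketch of the five temporal descriptors, assuming each speaker is given
# as a list of (start, end) segments in seconds and the show duration is
# known. Names and the inactivity-rate formula are illustrative assumptions.

def temporal_features(segments, show_duration):
    """Compute the five temporal descriptors for one speaker."""
    segments = sorted(segments)
    speaking_time = sum(end - start for start, end in segments)
    # Span: duration between the first and the last segment of the speaker.
    span = segments[-1][1] - segments[0][0]
    # Inactivity rate: share of the span during which the speaker is silent.
    inactivity_rate = (span - speaking_time) / span if span > 0 else 0.0
    nb_segments = len(segments)
    # Speaking time relative to the duration of the whole show.
    speaking_ratio = speaking_time / show_duration
    return [speaking_time, span, inactivity_rate, nb_segments, speaking_ratio]

# Example: a speaker with two interventions in a 60-second show.
print(temporal_features([(0.0, 5.0), (20.0, 30.0)], show_duration=60.0))
```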

4.4. Dimensionality reduction

In this work, reducing the dimensionality of the feature vectors is carried out using two different classical methods. First, we apply a Principal Component Analysis (PCA) [5] and keep the main components, which represent 95% of the variance of the original representation. The second method is Canonical Discriminant Analysis (CDA) [10]. The number of dimensions is then reduced to (N - 1), where N is the number of classes (role types).
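The following sketch illustrates the two reduction steps on a hypothetical (n_speakers x 34) feature matrix. scikit-learn's LinearDiscriminantAnalysis is used here as a stand-in for CDA, since both project the data along Fisher's discriminant directions [10]; the placeholder data are an assumption.

```python
# A sketch of the two reduction steps, assuming a feature matrix X of shape
# (n_speakers, 34) and role labels y. LinearDiscriminantAnalysis stands in
# for CDA; the random data below are placeholders only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(100, 34)            # placeholder feature vectors
y = np.random.randint(0, 3, size=100)  # placeholder roles: 0, 1, 2

# PCA: keep the leading components carrying 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)

# CDA/LDA: projection to N-1 dimensions for N classes (here 3 - 1 = 2).
cda = LinearDiscriminantAnalysis(n_components=len(set(y)) - 1)
X_cda = cda.fit_transform(X, y)
print(X_pca.shape, X_cda.shape)
```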

4.5. Classification methods

A lot of tests involving either unsupervised or supervised classification methods [5] were carried out. The unsupervised classification is based on the assumption that speakers who share the same role should share the same features, and thus constitute clusters in the feature space. We applied the K-means algorithm and DBSCAN [8], which is a density-based clustering method. The supervised classification methods used in this work are Gaussian Mixture Models (GMM), Support Vector Machines (SVM) and k-Nearest Neighbours (k-NN). The GMM is usually efficient when the learning samples are numerous enough. SVM is insensitive to differences between the number of training examples in each class; we will use this method on two-class problems. The k-NN algorithm offers the advantage of remaining efficient for small learning populations. These sets of methods and their properties will allow us to treat different corpora in future work. In the next section, we describe the corpus and the creation of the role ground truth.
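As an illustration, a minimal sketch of the three supervised classifiers applied to already-reduced feature vectors is given below. The hyperparameters (k = 5 neighbours, RBF kernel) and the single-Gaussian-per-class decision rule are illustrative assumptions, not the exact settings of our experiments.

```python
# A sketch of the supervised classifiers on reduced feature vectors. The
# data split, k=5 and the RBF kernel are assumptions for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

X_train, y_train = np.random.rand(80, 2), np.random.randint(0, 3, 80)
X_test = np.random.rand(20, 2)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
svm = SVC(kernel="rbf").fit(X_train, y_train)  # Gaussian kernel

# One Gaussian per class; prediction by maximum log-likelihood.
models = {c: GaussianMixture(n_components=1).fit(X_train[y_train == c])
          for c in np.unique(y_train)}
scores = np.column_stack([models[c].score_samples(X_test)
                          for c in sorted(models)])
gauss_pred = scores.argmax(axis=1)
print(knn.predict(X_test), svm.predict(X_test), gauss_pred)
```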

5. Test data and ground truth

5.1. Corpus

For our experiments, we used several documents taken from the corpus of the ESTER2 evaluation campaign [11] and focused on the development and test sets. As shown in tables 1 and 2, this corpus consists of 46 radio shows recorded from 4 radio stations.

Table 1. The ESTER2-DEV corpus.
Radio      nb.  time slot      type    speakers
Fr. Inter  3    7-7.20 pm      news    20
Fr. Inter  1    7.20-8 pm      debate  13
Fr. Inter  1    12 am-1 pm     debate  4
TVME       4    8.45-9 pm      news    14
Africa 1   9    7.30-7.45 am   news    13
RFI        1    6.30-6.50 am   news    16
RFI        1    7-8 am         news    21

Table 2. The ESTER2-TEST corpus.
Radio      nb.  time slot      type     speakers
Fr. Inter  3    7-7.20 pm      news     20
Fr. Inter  1    7.20-8 pm      debate   13
Fr. Inter  2    10-11 am       society  10
TVME       4    9.35-9.50 pm   news     20
Africa 1   2    7-7.10 pm      news     6
Africa 1   7    12-12.10 am    news     9
RFI        6    8.30-8.40 pm   news     7
RFI        1    9.30-9.40 am   news     7

There are 13 different programs, differing in structure, time slots, duration, number of speakers, and even in document type, since 5 shows are not broadcast news. The ground truth for speaker segmentation was provided by the organizers of the ESTER2 campaign; we can therefore measure the quality of the automatic speaker segmentations used as input.

5.2. Ground truth for speaker roles

Since we participated in the speaker diarization task of ESTER2, the speaker segmentation references called ESTER2-DEV-ref and ESTER2-TEST-ref are at our disposal. We have created the speaker role reference for these segmentations as well as for their automatic versions, called ESTER2-DEV-auto and ESTER2-TEST-auto. Table 3 shows the number of speakers for each role in the corpus.

Table 3. Number of speakers per role.
            ESTER2-DEV-ref  ESTER2-TEST-ref  ESTER2-TEST-auto
Anchorman   20              26               26
Journalist  149             143              90
Other       117             128              87

The Anchorman class is less represented than the two other classes, since there is usually only one anchorman per show. Among the overall number of 583 speakers present in the reference (ESTER2-DEV-ref and ESTER2-TEST-ref), some are not taken into account because we chose to consider only "significant" speakers, whose speaking activity is longer than 10 seconds. The ESTER2-DEV-ref corpus is used for the training (or the tuning) of the supervised classification methods. Role recognition is performed on the ESTER2-TEST corpus. In order to evaluate the influence of the errors introduced by the automatic speaker diarization, the performances obtained on ESTER2-TEST-ref and on ESTER2-TEST-auto are compared. Tests are also done after applying a PCA or a CDA to reduce the dimensionality of the feature vectors.
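A short sketch of this evaluation protocol is given below. The 10-second threshold comes from the text, while the per-speaker record layout and function names are illustrative assumptions.

```python
# A sketch of the protocol: keep only "significant" speakers (more than
# 10 seconds of speech), then score role predictions against the reference.
# The dict-based speaker records are an assumed, illustrative data layout.

def significant_speakers(speakers, min_speaking_time=10.0):
    """Keep speakers whose total speaking time exceeds the threshold."""
    return [s for s in speakers if s["speaking_time"] > min_speaking_time]

def accuracy(predicted, reference):
    """Proportion of speakers whose role is correctly labelled, in %."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)

# Example with toy records and labels.
kept = significant_speakers([{"speaking_time": 4.0}, {"speaking_time": 42.0}])
print(len(kept), accuracy(["Anchorman", "Other"], ["Anchorman", "Journalist"]))
```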

6. Experiments and results

First, a basic 3-class recognition process (Anchorman, Journalist, Other) is applied. Results are reported in table 4. The accuracy expresses the proportion of speakers whose role has been correctly labelled. The k-NN classifier reaches higher performances than the GMM and the SVM. Actually, because of the small number of samples in the Anchorman class, we were not able to apply the GMM method but only a Gaussian Model, which may be too simple to model the Journalist and the Other classes. Among the different SVM kernels, the best results were reached using a Gaussian kernel. The two dimensionality reduction methods (PCA and CDA) reach similar results. A positive result is that the Diarization Error Rate (DER) of the speaker diarization tool (11.35%) has no real impact on the performances: results on either manual or automatic segmentations are equivalent, which suggests the robustness of our approach.

Table 4. Role recognition results.
Accuracy (%)    Gauss. Mod.  k-NN   SVM
TEST-ref  PCA   61.75        63.60  61.12
TEST-auto PCA   57.64        64.04  60.01
TEST-ref  CDA   56.68        63.59  60.56
TEST-auto CDA   53.69        63.05  58.96

In a second experiment, a preprocessing step aims to separate punctual speakers from non-punctual ones. We define a punctual speaker as a speaker who appears in only one segment; in this case, his span and his speaking activity are equal. In the test corpus, 48 Journalist and 36 Other speakers are punctual (no Anchorman). Thanks to this strategy, performances increase significantly: about 6% for the k-NN classifier and more than 12% for the Gaussian Model with the CDA method (see table 5). The best result (70.92% accuracy) is obtained with the PCA and k-NN combination.

Table 5. Role recognition results using the punctual/non-punctual distinction.
Accuracy (%)    Gauss. Mod.  k-NN   SVM
TEST-auto PCA   60.50        70.92  64.15
TEST-auto CDA   65.96        69.02  64.11

The confusion matrices (tables 6 and 7) describe more precisely the results for the best classifier (k-NN), according to the dimensionality reduction method applied. CDA is the best pre-processing method to isolate the Anchorman class: all its speakers are detected. Conversely, the PCA method provides the best results for discriminating between Journalist and Other.

Table 6. Confusion matrix with PCA.
            Anchorman  Journalist  Other
Anchorman   19         4           3
Journalist  0          61          29
Other       4          27          56

Table 7. Confusion matrix with CDA.
            Anchorman  Journalist  Other
Anchorman   26         0           0
Journalist  2          77          11
Other       6          45          36

In a third experiment, we improve the overall results by using another specific strategy after a PCA dimensionality reduction: we apply a Gaussian Model classifier to the punctual speakers and the k-NN algorithm to the non-punctual ones. The fusion of these two sub-systems correctly attributes the roles of 71.92% of the overall number of speakers, which corresponds to 78.66% of the overall speech time correctly annotated in terms of role.
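A minimal sketch of this fusion strategy is given below, under the assumption that speakers are represented as records carrying a segment count and a reduced feature vector; the two predictor callables stand for the classifiers of the previous sketches.

```python
# A sketch of the fusion: punctual speakers (a single segment, so span
# equals speaking activity) go to the Gaussian model, the others to k-NN.
# The record layout and predictor callables are illustrative assumptions.

def is_punctual(speaker):
    """A punctual speaker appears in exactly one segment."""
    return speaker["nb_segments"] == 1

def fused_prediction(speakers, gauss_predict, knn_predict):
    """Route each speaker to the sub-system suited to its activity pattern."""
    roles = {}
    for s in speakers:
        predict = gauss_predict if is_punctual(s) else knn_predict
        roles[s["id"]] = predict(s["features"])
    return roles
```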

7. Conclusion and perspectives

In this paper, we described our contribution to the domain of speaker role recognition for 3 generic roles occurring in broadcast news: Anchorman, Journalist and Other. We assume there are clues about roles in temporal, prosodic and basic signal features extracted from the audio files and from speaker segmentations. Evaluations were conducted on 13 hours of radio documents coming from the ESTER2 campaign and corresponding to 13 different radio programs. The performances reach 71.92% of correctly recognized speakers, and 78.66% of the overall document duration is correctly annotated. These good results, comparable to the state of the art, have to be highlighted because, first, they are obtained from automatic speaker segmentations and, second, the data come from a heterogeneous corpus. Besides, a complete detection of the speakers from the Anchorman class can be achieved, which is on the one hand quite motivating and above all essential in the context of document structuring. Indeed, this role is known in the literature as the most central one, either for document classification or for information retrieval. To go a step further and validate the generic aspect of our approach, future work will need to increase the size and diversity of our corpus, for instance by taking TV shows into account. This will also lead us to extend our parameter set. Studying interactions between speakers, given their roles, will help to better characterize interaction sequences, to make high-level events emerge and to find patterns thanks to which clustering will be possible. These encouraging results open the way to applications in audiovisual content structuring.

References

[1] R. Barzilay, M. Collins, J. Hirschberg, and S. Whittaker. The rules behind roles: Identifying speaker role in radio broadcasts. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 679-684. AAAI Press / The MIT Press, 2000.
[2] B. Bigot and I. Ferrané. From audio content analysis to conversational speech detection and characterization. In ACM SIGIR Workshop: Searching Spontaneous Conversational Speech (SSCS), Singapore, pages 62-65, 2008.
[3] B. Bigot, I. Ferrané, and Z. A. A. Ibrahim. Towards the detection and the characterization of conversational speech zones in audiovisual documents. In International Workshop on Content-Based Multimedia Indexing (CBMI), pages 162-169. IEEE, 2008.
[4] R. Cai, L. Lu, and A. Hanjalic. Unsupervised content discovery in composite audio. In MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 628-637, 2005.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000.
[6] E. El Khoury, C. Senac, and R. André-Obrecht. Speaker diarization: Towards a more robust and portable system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 489-492, Honolulu, Hawaii, USA, April 2007. IEEE.
[7] EPAC. The EPAC Project. http://epac.univ-lemans.fr/.
[8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226-231. AAAI Press, 1996.
[9] S. Favre, A. Vinciarelli, and A. Dielmann. Automatic role recognition in multiparty recordings using social networks and probabilistic sequential models. In ACM International Conference on Multimedia, Beijing, October 2009.
[10] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[11] S. Galliano, G. Gravier, and L. Chaubard. The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts. In INTERSPEECH 2009, pages 6-10, Brighton, UK, 2009.
[12] L. Lamel and J. Gauvain. Alternate phone models for conversational speech. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 1:1005-1008, 2005.
[13] B. Li, J. H. Errico, H. Pan, and I. Sezan. Bridging the semantic gap in sports video retrieval and summarization. Journal of Visual Communication and Image Representation, 15(3):393-424, March 2004.
[14] Y. Liu. Initial study on automatic identification of speaker role in broadcast news speech. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 81-84, New York City, USA, 2006. Association for Computational Linguistics.
[15] A. Stolcke, E. Shriberg, D. Hakkani-Tür, G. Tür, Z. Rivlin, and K. Sönmez. Combining words and speech prosody for automatic topic segmentation. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 61-64, 1999.
[16] A. Vinciarelli. Speakers role recognition in multiparty audio recordings using social network analysis and duration distribution modeling. IEEE Transactions on Multimedia, 9(6):1215-1226, October 2007.
[17] R. Zhao and W. Grosky. Narrowing the semantic gap - improved text-based web document retrieval using visual features. IEEE Transactions on Multimedia, 4(2):189-200, 2002.