Automatic Detection and Annotation of Disfluencies in Spoken French Corpora
George Christodoulides 1, Mathieu Avanzi 1,2
1 Centre VALIBEL, Institute for Language & Communication, Université catholique de Louvain, Belgium
2 Department of Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages, University of Cambridge, United Kingdom
Introduction

An important characteristic of spoken language is the prevalence of a class of phenomena known as disfluencies (e.g. Shriberg 1994). They are interesting in their own right, especially in psycholinguistics, and they must also be taken into account when applying natural language processing (NLP) techniques to speech data.
• Disfluencies help manage the flux of time, as speakers incrementally construct their message (Levelt 1989) through planning and self-monitoring.
• In specific contexts, disfluencies may be used as communicative devices, e.g. to manage dialogue interaction (Moniz et al. 2009) or to indicate information status (Arnold et al. 2003); they may coincide with other prosodic events or influence their perception (e.g. prominence, major boundaries).

Previous work has focused on describing disfluencies (e.g. their co-occurrence with syntactic, prosodic and interactional events) and their acoustic properties, and on their automatic detection with a view to improving ASR systems. Various machine learning algorithms and feature sets have been proposed for the automatic detection of disfluencies (decision trees, HMM models, CRF models, CRF models + ILP constraints etc.). Germesin et al. (2009) suggest using “different techniques, each specialised in its own disfluency domain”.

Motivation and objectives of this study:
• Propose a language- and theory-independent, yet detailed, disfluency annotation scheme and test its coverage by manually annotating a spoken corpus.
• Develop a system for the semi-automatic detection and annotation of disfluencies in manually-transcribed data.
• Break the vicious circle of “not having enough data” to train automatic systems for French, by making it easier to add disfluency annotation to existing and new, publicly-available spoken language corpora.

DisMo Disfluency Annotation Scheme

The annotation scheme is based on the output of the DisMo annotator (Christodoulides et al. 2014), which is structured in three levels: minimal tokens, multi-word expressions and discourse structuring devices.
[Figure: representation of this system in a Praat TextGrid]

Automatic Detection

The automatic disfluency detection system is organised in modules. Each module is responsible for detecting specific types of disfluencies, using the most appropriate method for each type.
• Filled pauses (FIL) and lexical false starts (FST): matched with regular expressions, based on transcription conventions.
• Hesitation-related lengthening (LEN): optional detection, used for corpora that include a reliable automatic or manual syllabification. A Support Vector Machine (SVM) classifier uses prosodic features: length of the syllable relative to windows of ±3 neighbouring syllables (normalised for articulation rate), syllable structure (consonants – vowels), syllable position within the token, and relative pitch over the same windows.
• Repetitions (REP): two algorithms are available for detecting repetitions:
   • Pattern matching: sequences of i = 1…n tokens are matched with j = 1…m following tokens (default n = 7, m = 8, based on the analysis of our corpus).
   • Conditional Random Fields (CRF) model with post-processing.
• Editing-type disfluencies (INS, SUB, DEL): CRF model with the following features: word form, POS, lemma, edit distance to the next 1…7 tokens / lemmas, plus an interruption point hypothesis (from an SVM classifier based on prosodic features: difference in articulation rate, mean pitch and mean intensity 500 ms before and after each possible interruption point).
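The FIL and FST module described above can be illustrated with a short sketch. This is not the DisMo implementation: it only assumes the transcription conventions visible in the examples on this poster, where a filled pause is transcribed as "euh" and a lexical false start is a word fragment ending in "/" (as in "so/"). The pattern names and the tag_simple helper are hypothetical, and a real corpus would probably need a longer list of filled-pause forms.

```python
import re

# Assumed conventions (taken from the poster's own examples, not from DisMo's source):
#   filled pause -> the token "euh"
#   false start  -> a word fragment ending in "/" (e.g. "so/", "ac/")
FILLED_PAUSE = re.compile(r"^euh$", re.IGNORECASE)
FALSE_START = re.compile(r"^\w+/$")


def tag_simple(tokens):
    """Return one code per token: "FIL", "FST", or "" when nothing matches."""
    codes = []
    for tok in tokens:
        if FILLED_PAUSE.match(tok):
            codes.append("FIL")
        elif FALSE_START.match(tok):
            codes.append("FST")
        else:
            codes.append("")
    return codes


if __name__ == "__main__":
    print(tag_simple("comme infirmière so/ sociale".split()))
    print(tag_simple("j' hésite euh un peu".split()))
```

Keeping one compiled pattern per disfluency type makes it easy to adapt the sketch to a corpus with different transcription conventions.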
Corpus Data

Corpus CPROM-PFC (Avanzi 2014), created from the PFC corpus (Durand et al. 2002, 2009):
• 14 regional varieties of French recorded in France, Belgium and Switzerland (4 different cities per country)
• 112 speakers (4 M, 4 F per city) aged 20-80
• Semi-directed sociolinguistic interviews (3 min per speaker) + reading of a text (398 words)
• In total, approximately 11 hours
[Figure: recording locations (Tournai, Brussels, Béthune, Gembloux, Liège, Paris, Ogéviller, Brécey, Nyon, Lyon, Neuchâtel, Fribourg, Martigny, Geneva); region labels: BeF, NoF, SwF]
For this study we used the Spontaneous Speech sub-corpus (approx. 7 hours).
Annotating with the Disfluency Editor

To help annotators apply the DisMo scheme, we have developed a tool called the Disfluency Editor. The user selects a contiguous region of tokens and the type of disfluency, and the tool automatically assigns the correct annotation codes.
Presented at INTERSPEECH 2015, 6-10 September 2015, Dresden, Germany
The level of minimal tokens includes the tiers tok-min, pos-min and disfluency. Multi-word units may span several minimal tokens (tiers tok-mwu and pos-mwu); this makes it possible to annotate multi-word expressions (MWEs) whose POS function differs from that of their constituent parts. Discourse markers (DMs) and related phenomena (tier discourse) may span multiple tokens; thus the POS tags of any token or expression functioning as a DM are preserved.

We propose a detailed multi-level annotation scheme for disfluencies, combining aspects of the systems proposed in Shriberg (2001), Brugos et al. (2012), Heeman et al. (2006) and others. The unit of annotation for disfluencies under our system is the minimal token, with optional extensions that add annotations at the syllable level.

Level 1: Simple disfluencies are those affecting only one token.
• FIL (filled pause)
   Tokens: c’ est pour ça que j’ hésite euh un peu en parler
   Codes: FIL
• LEN (hesitation-related lengthening)
   Tokens: au cercle d’oenologie de= Bruxelles
   Codes: LEN
• FST (lexical false start)
   Tokens: comme infirmière so/ sociale
   Codes: FST
• WDP (pause within word)
   Tokens: il m’ a dit ça su+ _ +ffit
   Codes: WDP

Level 2: Repetitions, where a token or a series of tokens is repeated (in exactly the same form). The numbering describes the repetition pattern.
• REP (repetition)
   Tokens: les disques et et lancer les jingles
   Codes: REP* REP_
   Tokens: il a il a il a dit que
   Codes: REP:1 REP:2 REP:1 REP*:2 REP_ REP_
   Tokens: c’ est pas c’ est pas un système génial hein
   Codes: REP:1 REP:2 REP*:3 REP_ REP_ REP_
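The numbering pattern in the Level 2 examples can be reproduced with a small pattern-matching sketch. It is an illustration under stated assumptions, not the authors' algorithm: it only finds immediately adjacent, exact repeats of units of up to 7 tokens (the default n mentioned for the pattern-matching module), it does not model the m parameter, and the code-assignment rule (position numbers for multi-token units, an asterisk on the last token before the final copy, an underscore on the final copy) is inferred from the three examples above. The function name annotate_repetitions is hypothetical.

```python
def annotate_repetitions(tokens, max_len=7):
    """Return one REP code per token; "" means no repetition was found.

    Assumption: only exact, immediately adjacent repeats are detected,
    and the shortest repeating unit is preferred.
    """
    words = [t.lower() for t in tokens]
    codes = [""] * len(words)
    pos = 0
    while pos < len(words):
        advance = 1
        for size in range(1, max_len + 1):
            unit = words[pos:pos + size]
            if len(unit) < size:
                break  # not enough tokens left for a unit of this size
            # Count how many consecutive copies of the unit start at pos.
            copies = 1
            while words[pos + copies * size: pos + (copies + 1) * size] == unit:
                copies += 1
            if copies == 1:
                continue  # no repeat for this unit size, try a longer unit
            for c in range(copies):
                for k in range(size):
                    idx = pos + c * size + k
                    if c == copies - 1:
                        codes[idx] = "REP_"  # final copy = repair region
                    else:
                        # Interruption point: last token before the final copy.
                        ip = "*" if (c == copies - 2 and k == size - 1) else ""
                        num = f":{k + 1}" if size > 1 else ""
                        codes[idx] = f"REP{ip}{num}"
            advance = copies * size
            break
        pos += advance
    return codes


if __name__ == "__main__":
    for utterance in ("les disques et et lancer les jingles",
                      "il a il a il a dit que",
                      "c' est pas c' est pas un système génial hein"):
        toks = utterance.split()
        print(list(zip(toks, annotate_repetitions(toks))))
```

Preferring the shortest repeating unit is a design choice of this sketch; a production system would also need to handle repeats separated by filled pauses or editing terms.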
Level 3: Structured editing disfluencies are those that follow the pattern: reparandum – interruption point – interregnum (including optional editing terms) – repair.
• DEL (deletion)
   Tokens: c’ est vraiment un en tout cas la parole
   Codes: DEL DEL DEL DEL*
• SUB (substitution)
   Tokens: cette personne était enfin c’ est un ami de
   Codes: SUB* SUB:edt SUB_ SUB_
• INS (insertion)
   Tokens: c’ est vrai que Béthune euh vivre à Béthune ça aurait été
   Codes: INS* INS+FIL INS_ INS_ INS_

Level 4: Complex disfluencies are combinations of several structured ones that cannot be decomposed as L1+L2+L3. They are annotated separately using a backtracking table (adopting the method proposed by Heeman et al. 2006).
• COM (complex)
   Tokens: les ac/
   Codes: COM(1,1) COM(2,1)+FST
   Tokens: les actions enfin
   Codes: COM(1,2) COM(2,2) COM(3,2):edt
   Tokens: les activités enfin professionnelles
   Codes: COM(1,3)_ COM(2,3)_ COM(3,3):edt COM(4,3)_

The annotation scheme is hierarchical: Level 1 annotation codes (simple disfluencies) may combine with Level 2 and Level 3 codes, and Level 2 codes (repetitions) may combine with Level 3 codes (structured disfluencies). This allows us to model the co-occurrence of simple disfluencies within the interregnum part or near the interruption point of structured disfluencies. The interruption point (IP) is indicated by appending an asterisk (*) to the annotation code. Optional editing terms are marked by appending “:edt”; these may be discourse markers or any other phrase used by the speaker to indicate their intention to change the utterance. The repair region is marked by appending an underscore (_) to the code. For more examples, see www.corpusannotation.org/disfluency

Results

Table 1: Results of 5-fold cross validation
Disfluency type / Method               Prec     Recall   F-meas
LEN – SVM classifier                   78.2%    87.4%    82.5%
REP – CRF model                        84.3%    75.8%    79.8%
IP – Interruption point hypotheses     76.7%    52.0%    62.0%
SUB, INS, DEL – CRF models             (see Table 2)

Table 2: Evaluation of 4 BIO (begin-inside-outside) schemes for editing-type disfluencies. Editing terms are annotated as a separate region (methods 1 and 3) or included in the reparandum region (methods 2 and 4). The repair region is predicted (methods 1 and 2) or not (methods 3 and 4).
Method   Editing terms region   Repair region   Prec     Recall   F-meas
Gold standard interruption points (upper limit)
1        Separate               Predict         77.6%    51.4%    61.9%
2        Merged                 Predict         74.7%    44.7%    55.9%
3        Separate               Ignore          82.4%    62.8%    71.3%
4        Merged                 Ignore          76.9%    53.2%    62.9%
Predicted interruption points (actual performance)
1        Separate               Predict         54.3%    36.5%    43.7%
2        Merged                 Predict         48.6%    31.3%    38.0%
3        Separate               Ignore          62.6%    42.1%    50.3%
4        Merged                 Ignore          59.2%    36.2%    44.9%
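As a quick consistency check on the reported figures, the F-measure can be recomputed from precision and recall. The sketch below assumes the balanced definition F = 2PR / (P + R), which the poster does not state explicitly; the f_measure helper is hypothetical. With the precision and recall values reported for LEN, REP and IP, it gives the reported F-measures after rounding to one decimal place.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall), in the same unit as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as reported for the first three result rows.
reported = {
    "LEN": (78.2, 87.4),
    "REP": (84.3, 75.8),
    "IP": (76.7, 52.0),
}

if __name__ == "__main__":
    for name, (p, r) in reported.items():
        print(name, round(f_measure(p, r), 1))
```

Because the published precision and recall are themselves rounded, a value recomputed this way can differ from a published F-measure in the last decimal.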
The strategy of considering the reparandum and the editing terms as one contiguous region, contrasted with the repair region, yields marginally better F-measure results.
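The four BIO schemes compared in the evaluation can be made concrete with a small conversion sketch. It is only an illustration: the label names (B-reparandum, I-edit and so on) and the helpers region and to_bio are hypothetical, and the mapping from codes to regions relies solely on the suffix markers of the annotation scheme (an underscore for the repair region, ":edt" for editing terms). Any other code is treated as part of the reparandum, which is a simplification.

```python
def region(code):
    """Map an annotation code to a region name, using the scheme's suffix markers."""
    if not code:
        return None
    if code.endswith("_"):
        return "repair"
    if ":edt" in code:
        return "edit"
    return "reparandum"  # simplification: everything else counts as reparandum


def to_bio(codes, separate_edit, predict_repair):
    """Convert a sequence of codes into BIO labels for one of the four schemes."""
    labels = []
    prev = None
    for code in codes:
        reg = region(code)
        if reg == "edit" and not separate_edit:
            reg = "reparandum"  # methods 2 and 4: editing terms merged into the reparandum
        if reg == "repair" and not predict_repair:
            reg = None  # methods 3 and 4: the repair region is not predicted
        if reg is None:
            labels.append("O")
        else:
            labels.append(("I-" if reg == prev else "B-") + reg)
        prev = reg
    return labels


# Code sequence of the substitution example, padded with unannotated tokens.
codes = ["", "SUB*", "SUB:edt", "SUB_", "SUB_", ""]

if __name__ == "__main__":
    print(1, to_bio(codes, separate_edit=True, predict_repair=True))
    print(2, to_bio(codes, separate_edit=False, predict_repair=True))
    print(3, to_bio(codes, separate_edit=True, predict_repair=False))
    print(4, to_bio(codes, separate_edit=False, predict_repair=False))
```

The two boolean parameters correspond to the two dimensions of the comparison: whether editing terms form their own region, and whether the repair region is labelled at all.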
Conclusion and Perspectives

• We presented a detailed, language-independent annotation scheme for disfluencies in spoken corpora and applied it to a 7-hour French spontaneous speech corpus.
• The automatic detection of the majority of speech disfluencies in a transcribed corpus is feasible, even if, as expected, some phenomena are more easily recognizable automatically than others.
• Our system is currently applied to several large publicly-available French spoken language corpora. An iterative re-training methodology is used to improve accuracy. This annotation campaign opens up perspectives for further research.
• As part of the DisMo multi-level annotator, our system can be applied to newly-acquired corpora. Disfluency detection is used to improve POS tagging and chunking accuracy.