Combining pattern matching and shallow parsing techniques for detecting and correcting spoken language extragrammaticalities

Mohamed-Zakaria KURDI
Natural Interactive Systems Laboratory (NISLab), University of Southern Denmark
[email protected]
http://www.nis.sdu.dk/~kurdi

Abstract
In the context of spoken language understanding systems, we investigated the possibility of normalizing oral utterances. Our approach is based on the integration of different knowledge sources for detecting and correcting spoken language extragrammaticalities. The processing is done in two main steps: first, a lexical processing step normalizes oral word forms and other spontaneous phenomena such as hesitations and filled pauses. Then, post-filtering modules normalize repetitions, self-repairs, false starts and incomplete utterances (supra-lexical extragrammaticalities). This is done by means of a set of patterns and rules representing the various manifestations of these phenomena. Our evaluation on a transcribed corpus shows the efficiency of our approach in detecting and correcting the main forms of spoken language extragrammaticalities.
Introduction

Spontaneous spoken language has many properties that make its processing very difficult. These difficulties occur at various levels, from the lexicon to pragmatics. One of the most important problems is the prevalence of speech extragrammaticalities: spontaneous phenomena that have traditionally been viewed as noisy, irregular events that have to be ignored in order to understand an utterance correctly. The most common way of handling sentences with extragrammaticalities consists in using selective symbolic or stochastic strategies that enable the system to ignore the words it cannot process, as in Lavie (1997) and Mayfield (1995). Generally, this kind of strategy allows the system to avoid parsing failures, but it reduces the chances of obtaining an acceptable semantic interpretation.

The other solution consists in incorporating models of the extragrammaticalities into the system. These models give the system the ability to normalize these different phenomena. Normalization consists in detecting and correcting the extragrammaticalities (Shriberg (1994), Heeman and Allen (1994)). This is done with a set of patterns and rules representing the various manifestations of the extragrammaticalities. The outline of this paper is as follows: section 1 presents the characteristics of spontaneous speech and especially speech extragrammaticalities; section 2 reviews the literature; our approach is presented in section 3, processing examples in section 4 and the results of our system's evaluation in section 5. A summary and proposals for future work conclude the paper.
1. Spoken language extragrammaticalities
Speech extragrammaticalities can be divided into five types:
1. A lexical extragrammaticality (LE) may consist of a filled pause (oh, uh, mm, ...), an oral word or expression (wanna, I'll, ...), a fragment of a word, etc.
2. The repetition of one or several words1: (..) you need you need to get five 2 five boxcars of orange there.
3. A false start, where the speaker abandons what he or she was saying and starts again: so we could just probably so now the th- so what we're probably gonna have to do is to (…).
4. A self-repair consists of modifying what was said before: So engine two is going uh to corning to dansville pick up three boxcars and back to Corning.
5. Incomplete utterances, which lack one or more syntactic units or whose last unit is incomplete: (…) and would + + the w- + p- +.

1 All the examples presented in this paper are extracted from the Trains corpus.
2 In the Trains corpus annotation schema the mark : is used to indicate a silence.
2. Literature review
In the spoken language processing framework, several strategies have been investigated to allow traditional systems to process spoken language. Two main types of approaches can be cited in this context:
• Pattern based approaches: Heeman and Allen (1994) and Heeman (1997) investigated the integration of multiple knowledge sources in a statistical framework. They used 72 different patterns for detecting and correcting speech repairs. To reduce the overgeneration of the patterns, they used a POS (Part Of Speech) tagger as a statistical filter which judges the repetitions and self-corrections proposed by the pattern builder.
• Syntax based approaches: these are usually used only for the correction of the extragrammaticalities, not for their detection, as in the work of Hindle (1983), who supposed that the extragrammaticalities were already detected, and Core (1999), who used Heeman's system for the detection. The main idea behind such approaches is to use general meta-rules like the following: X → X Hes. In this meta-rule, used for processing hesitations, there is no precise information about the nature of the non-terminal involved. Despite the economy of this device (one rule covers all the non-terminals of the grammar), it is not enough to constrain the cases in which we need to know precisely the nature of the non-terminal involved and the nature of its right hand and left hand contexts.
Two remarks can be made about these previous works:
1. The use of only syntax or only pattern based processing techniques seems to be limited. In fact, pattern matching and syntax based normalization each seem to be suited to a specific task: pattern matching is efficient for normalizing repetitions and self-corrections, while syntactic information is necessary for processing false starts and incompleteness.
2. The syntactic frameworks used seem to be very general and do not allow the syntactic nature of the extragrammatical segment and its context to be taken into consideration in a precise way.

3. Our approach
To overcome the shortcomings of the previous approaches, we adopted an approach with the following two features:
• It is based on the combination of two different techniques, pattern matching and shallow parsing, in order to combine their advantages for processing the different types of extragrammaticalities.
• It adopts chunk parsing (Abney, 1991) as the general framework for syntactic normalization. This allows a more flexible analysis of the input. Furthermore, it is suited to our model, which is based on the segmentation of the extragrammaticalities into different areas (chunks).
Our study is based on the TRAINS corpus (Heeman and Allen, 1995)3, a corpus of problem solving dialogues. The corpus consists of 93 dialogues, about 441 KB of words and about 5600 speaker turns. In this corpus, we found about 928 cases of Supra Lexical Extragrammaticalities (SLEs) and about 6000 cases of LEs. Only 619 cases of SLEs and around 4000 cases of LEs were used for writing the patterns and the grammar. As shown in figure 1, our system can be divided into two components: lexical processing and supra-lexical processing.

3.1 Lexical processing
This component serves as a preprocessing step before the normalization of SLEs. Its main function is to normalize LEs and to provide the information that enables the repair processing component to normalize SLEs. Lexical processing is done in the following steps:
3 The corpus is available at the following address: http://www.cs.rochester.edu/research/speech/93dialogs/
3.1.1 Lexical normalizing
The main function of this component is the correction of LEs. This kind of normalization is very useful for reducing tagging errors and for allowing the detection of some cases of SLEs like the one shown in the following example (cf. section 4 for a detailed presentation of the processing of this example): (…) the only place where there are there's a warehouse at Corning (…). At this stage, our work consisted in extracting these extragrammaticalities from the training corpus by hand and associating them with their regular forms. A conversion table is created and used in the processing.
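As an illustration only (the table entries and function below are hypothetical, not the system's actual code), such a conversion table can be applied as a simple lookup over the tokens of an utterance:

import re

# Hypothetical conversion table: oral or contracted forms -> regular written forms.
# The table described in the paper was extracted by hand from the training corpus.
CONVERSION_TABLE = {
    "wanna": "want to",
    "gonna": "going to",
    "there's": "there is",
    "i'll": "i will",
}

# Filled pauses, kept but marked so that later modules can use them as editing terms.
FILLED_PAUSES = {"uh", "um", "er", "mm"}

def normalize_lexically(utterance):
    """Replace oral forms by their written forms and tag filled pauses."""
    tokens = re.findall(r"[\w']+", utterance.lower())
    out = []
    for token in tokens:
        if token in FILLED_PAUSES:
            out.append((token, "ED"))        # marked as an editing term
        else:
            for word in CONVERSION_TABLE.get(token, token).split():
                out.append((word, "WORD"))
    return out

print(normalize_lexically("uh I wanna go to Corning there's a warehouse there"))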
3.1.2 POS tagging
The Part Of Speech (POS) tagger associates each lexical item with one morphological category. It also allows the recognition of unknown words. The information provided by this module permits the SLE processing module to detect self-repairs, incomplete utterances, false starts and repetitions. We are using the Xerox tagger for this task.

3.1.3 Post-tagging
The main function of this module is to correct the systematic POS tagging errors that we observed while processing the specific data of our corpus. For example, the word Corning, which is used in our corpus as a city name, was systematically tagged as a present participle (+PARTPRES). We therefore created a conversion table that associates the words that are systematically analysed incorrectly with their correct POS tags.
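A post-tagging correction table can be sketched in the same spirit (the tag names below are illustrative, not the Xerox tagger's actual tag set):

# Hypothetical correction table: words systematically mis-tagged in this domain
# are mapped to the tag they should receive.
POST_TAGGING_TABLE = {
    "corning": "PROPN",     # city name, systematically tagged as a present participle
    "dansville": "PROPN",
}

def post_tag(tagged_tokens):
    """Overwrite systematically wrong tags using the correction table."""
    return [(word, POST_TAGGING_TABLE.get(word.lower(), tag))
            for word, tag in tagged_tokens]

print(post_tag([("engine", "NOUN"), ("to", "PREP"), ("Corning", "PARTPRES")]))
# -> [('engine', 'NOUN'), ('to', 'PREP'), ('Corning', 'PROPN')]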
[Figure 1. The general architecture of our system: transcribed spontaneous utterances are passed through lexical normalisation, POS tagging and post-tagging, then through local pattern matching, global pattern matching, syntactic normalization and syntactic parsing, producing normalized utterances.]
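To make the data flow of figure 1 concrete, the following minimal sketch (with identity stubs standing in for the real modules) shows one possible way of chaining the stages; the function names are ours, not part of the system's code:

# Each stage is a function from a token sequence to a token sequence; the real
# modules are of course far richer than these identity stubs.
def lexical_normalisation(tokens):   return tokens   # 3.1.1
def pos_tagging(tokens):             return tokens   # 3.1.2
def post_tagging(tokens):            return tokens   # 3.1.3
def local_pattern_matching(tokens):  return tokens   # 3.2.1, first pass
def global_pattern_matching(tokens): return tokens   # 3.2.1, second pass
def syntactic_normalization(tokens): return tokens   # 3.2.2
def syntactic_parsing(tokens):       return tokens   # chunk parser

PIPELINE = [lexical_normalisation, pos_tagging, post_tagging,
            local_pattern_matching, global_pattern_matching,
            syntactic_normalization, syntactic_parsing]

def normalize(utterance):
    data = utterance.split()
    for stage in PIPELINE:
        data = stage(data)
    return data

print(normalize("do i i need two do i need two"))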
3.2 SLEs processing
As we said, we are using two different techniques: pattern matching and syntax based normalization.
3.2.1 The processing by patterns
We began by building the patterns on the training corpus (60 dialogues). The extragrammaticalities were hand annotated following the scheme proposed by researchers at SRI International (Shriberg, 1994), with some modifications. The main labels used in the annotation process are presented in Table 1.
Mx   Word matching
Rx   Replacement
Ed   Editing terms (silence, hesitation, incomplete words, phatics, etc.)
X    Neutral and inserted words

Table 1. Labels used in annotating
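One possible in-memory representation of this annotation scheme (purely illustrative, not the format used in our implementation) attaches a label and an occurrence index to each word:

from dataclasses import dataclass

@dataclass
class AnnotatedWord:
    word: str
    label: str = "X"   # "M" (word matching), "R" (replacement), "Ed" (editing term), "X" (neutral)
    index: int = 0     # the x of Mx / Rx, pairing the first and second occurrences

# A fragment of a repetition with an edited word, in this representation:
fragment = [
    AnnotatedWord("do", "M", 1), AnnotatedWord("i", "M", 2), AnnotatedWord("i", "Ed"),
    AnnotatedWord("do", "M", 1), AnnotatedWord("i", "M", 2),
]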
Extragrammaticalities annotation is done in two steps:
• Global annotation: in this step, the labelling consists of collecting the occurrences of extragrammaticalities in an utterance without any segmentation. A sample of global annotation is given in figure 2.
[Figure 2. A sample of global annotation: in the utterance (..) with the do i i need two do i need two engines (..), the matched words are labelled M1, M2, M3, M4 (first occurrences) and M12, M22, M32, M42 (second occurrences), and the edited words are labelled E.]
• Local annotation: this kind of annotation consists in collecting the segmented extragrammaticalities. We obtained 12 patterns corresponding to repetitions and 35 corresponding to self-corrections. The obtained patterns were extended by hand during the implementation: for example, after observing the pattern R1M1 R1'M1 we added the non-observed patterns R1M2 R1'M2, R1M3 R1'M3, etc. We thus finally obtained 61 patterns for repetitions and self-corrections.

Repetitions                  Self-corrections
Pattern      %               Pattern        %
M1 ed M1     43,95           R1 R1'         24,71
M2 ed M2     25,82           M1R1 M1R1'     8,64
M1 M1        10,98           M2R1 M2R1'     6,74
M4 ed M4     4,39            R1 ed R1'      6,74
M3 ed M3     3,46            R1M1 R1'M1     6,74

Table 2. The most frequently observed patterns for repetitions and self-corrections
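A minimal sketch of how patterns such as M1 M1, M1 ed M1 and R1 R1' can be matched over POS-tagged words is given below (the tag names, editing-term list and helper functions are illustrative assumptions, not the system's code):

EDITING_TERMS = {"uh", "um", "er", "well"}   # illustrative list of editing terms

def is_repetition(w1, w2):
    """M1 M1: the same word occurring twice."""
    return w1[0] == w2[0]

def is_self_correction(w1, w2):
    """R1 R1': two different words sharing the same morphological category."""
    return w1[0] != w2[0] and w1[1] == w2[1]

def match_simple_patterns(tagged):
    """Scan (word, tag) pairs for M1 M1, R1 R1' and M1 ed M1 configurations."""
    matches = []
    for i in range(len(tagged) - 1):
        if is_repetition(tagged[i], tagged[i + 1]):
            matches.append(("M1 M1", i, i + 1))
        elif is_self_correction(tagged[i], tagged[i + 1]):
            matches.append(("R1 R1'", i, i + 1))
        if (i + 2 < len(tagged) and tagged[i + 1][0] in EDITING_TERMS
                and is_repetition(tagged[i], tagged[i + 2])):
            matches.append(("M1 ed M1", i, i + 2))
    return matches

tagged = [("maybe", "ADV"), ("maybe", "ADV"), ("i", "PRON"), ("will", "MD"),
          ("try", "VB"), ("taking", "VBG"), ("um", "ED"), ("taking", "VBG")]
print(match_simple_patterns(tagged))
# -> [('M1 M1', 0, 1), ('M1 ed M1', 5, 7)]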
3.2.2 The syntactic processing
The syntactic normalization module has two main tasks:
• the detection of false starts;
• the delimitation of the extragrammatical area.
To do this, the system must be able to take into consideration the syntactic dependencies between the different chunks of the utterance that are involved in the extragrammaticality. We are using a meta-rule based approach. The main difference between our work and previous ones is that we consider the extragrammaticalities in terms of related areas, each corresponding to a precise non-terminal of the grammar (in previous works, general non-terminals are used instead for the different categories). This allows the consideration of a detailed context which may vary following the nature of the non-terminals of the grammar. The general schema of the meta-rules is presented in figure 3.

[Figure 3. General schema of the false start rule: False_start → Beginning_frontier Extragrammatical_segment Editor Final_frontier]

As we can see in the previous figure, the non-terminals in our meta-rules refer to precise entities in the extragrammaticalities.
1. Beginning frontier: this part of the rule is designed to help detect the start of the extragrammaticality. The beginning frontier may be the beginning of the utterance, or any other indicator of a sentence or phrase beginning, like coordinators, hesitations, some adverbs, etc.
2. Extragrammatical segment: following our observation of the corpus, we distinguish between two types of extragrammatical segments:
• Absolutely extragrammatical segments: these segments are extragrammatical for reasons relative to their own nature and are considered extragrammatical whatever the context they occur in. For example, the occurrence of two determiners is extragrammatical whatever the context in which they occur.
• Relatively extragrammatical segments: this kind of segment is observed much more frequently than the previous one. It consists of segments that are extragrammatical in a specific context while being completely grammatical in another one. For example, the segment pronpers + vpres + infto may be considered grammatical if it is followed by an infinitive verb, but it is considered extragrammatical if it is followed by an NP.
3. Editor: it may consist of a hesitation, a discourse connector, or any phatic expression. The editing area may be
mandatory in some contexts and optional in others.
4. Final frontier: the final frontier has two functions. On the one hand, it helps to decide whether a segment is extragrammatical or not by playing the role of the right hand context. On the other hand, it allows the system to delimit the span of the extragrammaticality, which is useful for the correction process. The final frontier may be more than the indicator of the beginning of a new segment: in fact, it usually consists of one or more syntactic units or constructions that we consider necessary to determine whether a segment is grammatical or not. Thus, we created a set of possible final frontiers corresponding to the different configurations that we observed in our corpus. Examples of these configurations are: subject pronoun, NP, VP, NP+VP, etc. As we can see from these possible right hand contexts, our model allows both large chunks and fragments of a very precise nature (a subject pronoun is a subtype of NP) to be used as the right hand context, following the needs. The general schema of the meta-rules used for processing incompleteness is similar to the one used for false starts, except that the final frontier of the incompleteness is the actual end of the utterance. The grammars used both for the meta-rules and for the chunk syntactic parser are CFG-equivalent. These grammars are implemented as a Recursive Transition Network extended with specific features for robust parsing, like a selective algorithm and a partial parsing strategy (cf. Kurdi (2000a) for a more detailed presentation of this approach).
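Under the assumption that the chunker outputs a flat string of chunk labels, one way to encode such a meta-rule is as a pattern over that string; the label names below are illustrative, not the actual non-terminals of the grammar:

import re

BEGINNING_FRONTIER = r"(^|COORD |HES |ADV )"      # utterance start, coordinator, hesitation, ...
EXTG_SEGMENT       = r"NP VPRES INFTO "           # e.g. "I have to": relatively extragrammatical
EDITOR             = r"(HES |PHATIC )?"           # optional editing area
FINAL_FRONTIER     = r"(ADV )?NP VP"              # a following declarative utterance

FALSE_START = re.compile(BEGINNING_FRONTIER + "(?P<extg>" + EXTG_SEGMENT + ")"
                         + EDITOR + "(?=" + FINAL_FRONTIER + ")")

# "Because I have to mm maybe I'll try ..." as a sequence of chunk labels:
chunks = "COORD NP VPRES INFTO HES ADV NP VP "
match = FALSE_START.search(chunks)
if match:
    print("false start detected:", match.group("extg").strip())
# -> false start detected: NP VPRES INFTO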
3.3 Special problems

3.3.1 Multiple extragrammaticalities processing
The distinction between global and local annotations allowed us to observe the cases of multiple extragrammaticalities in the same utterance and to see whether there is any conflict between their corresponding patterns. This problem is particularly important because multiple occurrences of extragrammaticalities cover 14% of the total number of occurrences. Multiple occurrences of extragrammaticalities are even more sophisticated (and consequently harder to detect) when some patterns are totally or partially contained in others, as in the example presented in figure 2. In order to avoid conflicts between the different patterns, we adopted a double pass strategy:
1. Local pattern matching: it consists of using the micro patterns (M1 E M1, M1 M1, M1- M1-, M1 M2 M1 M2, M1 M2 E M1 M2, etc.) for processing the simple and small extragrammaticalities.
2. Global pattern matching: it consists of using the macro patterns (the rest of the patterns) for processing the largest extragrammaticalities.
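The following sketch illustrates the double pass idea on the example of figure 2; it simply deletes the first copy of a repeated block, whereas the real system labels the words as unnecessary rather than removing them:

def reduce_repeated_blocks(words, max_len):
    """Apply the generic pattern M1..Mk M1..Mk (a block of up to max_len words
    immediately repeated), dropping the first copy, longest blocks first."""
    for k in range(max_len, 0, -1):
        i = 0
        while i + 2 * k <= len(words):
            if words[i:i + k] == words[i + k:i + 2 * k]:
                words = words[:i] + words[i + k:]   # keep only the second copy
            else:
                i += 1
    return words

utterance = "do i i need two do i need two".split()

# Pass 1 (local): only the micro pattern M1 M1 (blocks of one word).
after_local = reduce_repeated_blocks(utterance, 1)
print(after_local)    # ['do', 'i', 'need', 'two', 'do', 'i', 'need', 'two']

# Pass 2 (global): larger patterns, here up to M1 M2 M3 M4 M1 M2 M3 M4.
print(reduce_repeated_blocks(after_local, 4))    # ['do', 'i', 'need', 'two']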
3.3.2 Editing terms modelling and overgeneration resolution
Another important aspect of processing the extragrammaticalities consists in modelling the editing terms. Editing terms play an important role in the processing through two main factors:
1. Their number: the number of editing terms plays an important role in the reduction of overgeneration (false positive cases). The likelihood of overgeneration is higher when the editing terms are more numerous. In order to avoid overgeneration, patterns with editing fragments containing more than two words were not considered.
2. Their meaning: the overgeneration of a pattern also depends on the meaning of the editing fragment. For example, a simple enumeration like two engines and two boxcars can be considered a self-repair and consequently be wrongly corrected. Moreover, there are phatic expressions whose word count exceeds two and which may be used as editing expressions, like: it's wait for a moment please it's five o'clock. We use two different methods to take the meaning of the editing fragments into consideration:
i. To avoid the correction of false positive expressions that can be semantically (or syntactically) modelled (like enumerations, comparisons, ...), we use a set of control patterns representing the different manifestations of these phenomena.
ii. To avoid failing to detect phatic expressions, we represent them by a semantic grammar, which is a shallow formalism based on both topological (word order) and semantic/pragmatic information (Kurdi, 2000a).
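As an illustration of the first method, a control pattern can block the correction of an enumeration that would otherwise look like a self-repair (the tag names are illustrative assumptions):

COORDINATORS = {"and", "or"}

def is_enumeration(tagged, start):
    """True if tagged[start:start+5] looks like NUM NOUN COORD NUM NOUN,
    as in "two engines and two boxcars": an enumeration, not a repair."""
    window = tagged[start:start + 5]
    if len(window) < 5:
        return False
    tags = [tag for _, tag in window]
    words = [word for word, _ in window]
    return tags == ["NUM", "NOUN", "COORD", "NUM", "NOUN"] and words[2] in COORDINATORS

tagged = [("two", "NUM"), ("engines", "NOUN"), ("and", "COORD"),
          ("two", "NUM"), ("boxcars", "NOUN")]
print(is_enumeration(tagged, 0))   # True -> the candidate self-repair is not corrected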
4. Processing Examples

Below are simple examples of SLE processing4.
Example 1: the processing of the utterance there are ther's is done as follows:
• First, a lexical rule normalizes ther's by associating it with its written form there is. The input sequence becomes there are there is.
• The sequence is then tagged and becomes: (there PRON), (are VBPRES), (there PRON), (is VBPRES).
• The sequence is detected by the pattern M1R1 M1R1' (where R1 and R1' are two different words from the same morphological category).
• The system corrects the input by labelling the unnecessary chunk: [[unnecessary [there are]] [normal [there is]]].
Example 2: the processing of the utterance presented in figure 2, (..) do I I need two do I need two (..), is done as follows:
• After the lexical processing, the local normalization module detects the repetition of I and corrects it by labelling the unnecessary word. This is done with the micro pattern M1 M1.
• In the second pass, the pre-corrected input (do I need two do I need two) becomes detectable by the pattern M1M2M3M4 M1M2M3M4 (which is the largest possible pattern in this case).
• The input is then corrected by labelling the unnecessary words. The final output is: [[unnecessary [do I need two]] [normal [do I need two]]].
Example 3: as an example of the processing of a false start nested with a repetition, let us take the following utterance: Because I have to mm maybe maybe I'll try taking um taking taking one boxcar that would be sufficient.
• After the lexical processing, the local pattern matching module detects and corrects the simple repetition of the word maybe and the edited repetition of taking by using respectively the patterns M1 M1 and M1 Ed M1. The output of this module is: Because I have to mm maybe I'll try taking taking one boxcar would be sufficient.
• The pre-processed utterance is then analysed by the syntactic normalisation and parsing modules. These modules detect the false start Because I have to mm and correct it by using the following meta-rule: False_start → Beginning_frontier inc_chunk_vpres_infto Editor End_frontier_declarative_utterance. This rule means that when a segment NP V INFTO is followed by a declarative utterance (like adv NP VP), this segment is considered a false start and marked with the appropriate category. As we can see in this example, the consideration of the extended right-hand side context allowed us to detect and delimit the false start. In fact, the detection is done thanks to the disambiguation of the syntactic dependency of the segment (adv NP), which may formally depend on the two VPs on its left and right hand sides, have to and will try.
5. Evaluation and Results
Our test corpus consists of 305 utterances that were not used for writing the grammar and the patterns. Among these utterances, 255 contain 309 SLEs and the other 50 are clean (without extragrammaticalities); the latter were selected at random from different dialogues in order to test the overgeneration of the system. Our results are presented in Table 3.
4 For making the presentation simple, only the relevant steps of the processing are considered in the examples.
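For reference, recall and precision of the kind reported in Table 3 can be computed from simple detection or correction counts; the counts in the example below are purely illustrative, not the paper's:

def recall_precision(true_positives, false_negatives, false_positives):
    """Recall and precision (in %) from detection or correction counts."""
    recall = 100.0 * true_positives / (true_positives + false_negatives)
    precision = 100.0 * true_positives / (true_positives + false_positives)
    return round(recall, 2), round(precision, 2)

# e.g. 40 of 50 extragrammaticalities found, with 5 false alarms:
print(recall_precision(40, 10, 5))   # -> (80.0, 88.89)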
             Detection              Correction
Phenomena    Recall     Precision   Recall     Precision
SLEs         81,56      88          71,79      76,44
Total        89,67      92,76       84,47      86,61

Table 3. Results of SLEs processing (values in %)
Our results show the effectiveness of our system in general for detecting and correcting the SLEs. The good precision rates show in particular the effectiveness of our approach in avoiding overgeneration. In order to have a more precise idea about the performance of our system, we made a detailed analysis of the results for the different phenomena, as shown in figure 4.

[Figure 4. The results of correction for the different types of extragrammaticalities: recall and precision (in %) for LEs, repetitions, self-corrections, false starts and incompleteness.]
The main reasons for the detection errors are:
• Undergeneration: the system was not able to handle some of the cases that were not observed in the training corpus.
• Tagging errors.
• Complex cases, as in the following example: (...) make them into oranges er orange juice. This case is not detectable because, on the one hand, the transition between the morphological categories of orange and oranges (adjective, noun) is possible and, on the other hand, the two chunks (make .. oranges, orange juice) are well formed.
The correction failures are due, in general, to complicated cases like the interaction between an extragrammaticality and another linguistic phenomenon (in general anaphora or ellipsis): we're gonna have this engine we're gonna hook it.
A comparison of our work with that of Heeman (1997)5 is made for the following two types of phenomena:
• Modification repairs: modification repairs cover both repetitions and self-corrections. We obtained a recall of 81,27% and a precision of 87,81%. If we compare these results with Heeman's (a recall of 77,95% and a precision of 80,36%), we see that our system achieves an improvement of about 3% in recall and 7% in precision.
• False starts: we obtained a recall of 53,19% and a precision of 60,97%, which means that, compared with Heeman's results (a recall of 36,21% and a precision of 51,59%), our system achieves an improvement of about 17% in recall and 9% in precision. This advantage shows the effectiveness of our combined approach in general and especially the suitability of our syntactic normalization paradigm for normalizing false starts.
5 The work of Heeman is the closest to ours, both in terms of the phenomena addressed and of the linguistic resource (Trains corpus). Furthermore, to our knowledge, Heeman's system is the one that gives the best results in the literature. Finally, for many reasons, such as differences in the taxonomy, the evaluation methodology, the utterances used in the evaluation, etc., this comparison should be seen as approximate rather than precise.
6. Summary and future work

In this paper we described a method that integrates different knowledge sources for detecting and correcting speech extragrammaticalities, exploiting high level information with shallow techniques. The key aspects of our approach are:
1. Overgeneration resolution by means of control patterns and a semantic grammar.
2. The resolution of conflicts between patterns by means of a two pass processing strategy.
3. The proposal and adoption of a syntactic framework that allows detailed contextual information to be taken into consideration for processing self-corrections and incompleteness cases.
Our results show the efficiency of our approach in detecting and correcting speech extragrammaticalities and particularly its effectiveness in avoiding overgeneration. We are extending the use of the semantic grammar to the detection of some interpretable extragrammatical phenomena, like insistence. This will allow us to introduce the recognition of the extragrammaticalities directly into the interpretation process. A first integration of a version of Corrector adapted to French with a speech recognizer was done in the framework of the Oasis system (Kurdi, 2002). The results of our first evaluation of this new version showed a performance comparable with Corrector. Although this may be considered a positive indicator of the suitability of our approach for dealing with speech recognition output, it is not enough to prove formally that the use of speech recognition output has no negative effect on the normalisation of the extragrammaticalities. In fact, on the one hand, the French corpus used in our experiments contains far fewer SLEs than the Trains corpus and, on the other hand, French and English have different linguistic properties, especially in terms of morphological richness, which plays a key role in the detection of the extragrammaticalities (POS taggers for French perform better than those used for English). We are therefore working on the integration of Corrector with a speech recognizer in order to evaluate formally the potential effect of real speech recognition output on the normalisation process. Finally, we are investigating the use of learning based techniques in order to optimise the trade-off between the recall and the precision of our syntactic meta-rules.
References

ABNEY, Steven (1991). Parsing by chunks. In Robert BERWICK, Steven ABNEY and Carol TENNY, eds., Principle-Based Parsing. Kluwer Academic Publishers.
CORE, Mark G. (1999). Dialogue parsing: from speech repairs to speech acts. Ph.D. dissertation, University of Rochester, USA.
HEEMAN, Peter A. (1997). Speech Repairs, Intonational Boundaries and Discourse Markers: Modeling Speakers' Utterances in Spoken Dialog. Ph.D. dissertation, University of Rochester.
HEEMAN, P. and ALLEN, J. (1994). Detecting and correcting speech repairs. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL-94), pages 295-302, Las Cruces, June 1994.
HEEMAN, P. and ALLEN, J. (1995). The Trains 93 dialogues. TRAINS Technical Note 94-2, Computer Science Department, University of Rochester, March 1995.
HINDLE, D. (1983). Deterministic parsing of syntactic non-fluencies. In Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics.
KURDI, Mohamed-Zakaria (2002). Analyse linguistique robuste et profonde du langage oral spontané. Thèse de doctorat, Université de Grenoble 1.
KURDI, Mohamed-Zakaria (2001). A spoken language understanding approach which combines the parsing robustness with the interpretation deepness. In Proceedings of the International Conference on Artificial Intelligence IC-AI'01, Las Vegas, USA, June 25-28.
KURDI, M. Z. (2000a). A semantic based approach to spoken dialogue understanding. In D.-N. Christodoulakis, ed., Natural Language Processing - NLP'00, Lecture Notes in Artificial Intelligence 1835, Berlin: Springer.
KURDI, M. Z. (2000b). Une approche intégrée pour la normalisation des extragrammaticalités de la parole spontanée. In 7th Conference on Automatic Natural Language Processing TALN'00, 15-18 October, Lausanne, Switzerland.
LAVIE, A. (1997). GLR*: a robust grammar-focused parser for spontaneously spoken language. Ph.D. dissertation, Carnegie Mellon University.
LICKLEY, R. (1994). Detecting disfluency in spontaneous speech. Ph.D. dissertation, University of Edinburgh.
MAYFIELD, L., et al. (1995). Concept based speech translation. In Proceedings of ICASSP-95, Detroit.
SHRIBERG, E. (1994). Preliminaries to a theory of speech disfluencies. Ph.D. dissertation, University of California, Berkeley.