Processing Self Corrections

Jörg Spilker, Martin Klarner and Günther Görz
University of Erlangen-Nuremberg - Computer Science Institute, IMMD 8 - Artificial Intelligence, Am Weichselgarten 9, 91058 Erlangen - Tennenlohe, Germany

Abstract

Self corrections are a frequent phenomenon in spontaneous speech. The ability to handle this phenomenon is therefore necessary for every automatic speech understanding system. This paper presents a framework in which self corrections are handled by a method derived from statistical machine translation. It realizes a filter for word lattices that detects self corrections and inserts the intended word sequence as an alternative path into the lattice. Finally, a lattice parser selects the best path.

Kurzfassung (translated from German)

Self corrections are a frequently occurring phenomenon in spontaneous speech. The ability to cope with this kind of phenomenon is therefore necessary for every automatic language processing system. This article describes an approach in which self corrections are processed with methods from statistical machine translation. It realizes a filter for word lattices that recognizes self corrections and inserts the intended word sequence as an alternative path into the lattice. A word lattice parser finally selects the best path.



Introduction

Robustness is one of the major research topics in current natural language engineering. On the one hand, automatic speech understanding systems have to be robust against system errors like recognition failures. On the other hand, systems also have to be robust against user errors. One kind of user error are self corrections, where the speaker starts a sentence, then believes that something went wrong and corrects part of the utterance. This phenomenon occurs frequently in spontaneous speech. We found that in the Verbmobil corpus, a corpus dealing


with appointment scheduling and travel planning, nearly 21% of all turns contain at least one self correction. A system that copes with these self corrections (= repairs) must recognize the spoken words and identify the repair in order to get the intended meaning of an utterance. To characterize a repair, it is commonly segmented into the following four parts (cf. Figure 1):

• reparandum: the "wrong" part of the utterance
• interruption point (IP): marker at the end of the reparandum
• editing term: special phrases which indicate a repair, like "well" or "I mean", or filled pauses such as "uhm" or "uh"
• reparans: the correction of the reparandum
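The four-part segmentation above can be sketched as a small data structure. This is a minimal illustration; the class and method names are invented here, not taken from the Verbmobil implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Repair:
    """Four-part segmentation of a self correction (illustrative names)."""
    reparandum: List[str]     # the "wrong" part of the utterance
    editing_term: List[str]   # e.g. "no", "I mean", "uhm"
    reparans: List[str]       # the correction
    # the interruption point is implicit: it marks the end of the reparandum

    def intended_words(self, prefix: List[str], suffix: List[str]) -> List[str]:
        """The speaker's intended utterance skips reparandum and editing term."""
        return prefix + self.reparans + suffix

# The example of Figure 1
rep = Repair(reparandum=["I", "cannot"], editing_term=["no"], reparans=["I", "can"])
print(rep.intended_words(["on", "Thursday"], ["meet", "after", "one"]))
# → ['on', 'Thursday', 'I', 'can', 'meet', 'after', 'one']
```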

Even though repairs are defined by the violation of syntactic and semantic well-formedness [1], we observe that most of them are local phenomena affecting only a few words.

(This work is part of the Verbmobil project and was funded by the German Federal Ministry for Research and Technology (BMBF) in the framework of the Verbmobil project under Grant BMBF 01 IV 701 V0. The responsibility for the contents of this study lies with the authors.)

[Figure 1: A repair example. Utterance: "on Thursday I cannot no I can meet "ah after one", with reparandum "I cannot", the interruption point directly after it, editing term "no", and reparans "I can".]

Depending on the structure of the repair, we can distinguish between restarts, where the speaker completely aborts the started syntactic construction, and modification repairs², where only some words are modified. Modification repairs have a strong correspondence between reparandum and reparans, whereas restarts are less structured. This leads us to believe that a complete syntactic analysis is not needed to detect and correct most modification repairs. Thus, for the rest of the paper we will concentrate on this kind of repair.

On the basis of such an IP hypothesis, editing terms are searched for in the lattice. Editing terms are assumed to form a finite set of short phrases; if a phrase from this set is found after an IP, the corresponding words are marked as an editing term and skipped. Now every partial path through the IP (and through the editing term, if there is one) could be a repair. But repairs are not of arbitrary length: for Verbmobil we found that 98% of all repairs have a reparandum (RD) and a reparans (RS) containing fewer than five words. Therefore we can restrict our search to four words to the left and right of the IP. Within the remaining paths, every possible (RS, RD) pair is checked by a scope model to determine the score and extent of the repair. The limitation to four words leads to at most 4 × 4 = 16 pairs per path, so scoring a path is constant in time.
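The four-word search window can be illustrated as follows. `candidate_pairs` is a hypothetical helper: the paper specifies only the length limit, not this exact interface.

```python
def candidate_pairs(words, ip, et_len=0, max_len=4):
    """Enumerate candidate (reparandum, reparans) spans around an IP.

    words  : tokens of one lattice path
    ip     : index one past the last reparandum word
    et_len : length of a detected editing term directly after the IP
    """
    pairs = []
    start = ip + et_len                       # reparans begins after the editing term
    for rd_len in range(1, max_len + 1):      # up to four words left of the IP
        if ip - rd_len < 0:
            break
        for rs_len in range(1, max_len + 1):  # up to four words right of it
            if start + rs_len > len(words):
                break
            pairs.append((words[ip - rd_len:ip], words[start:start + rs_len]))
    return pairs

words = "on Thursday I cannot no I can meet after one".split()
pairs = candidate_pairs(words, ip=4, et_len=1)   # IP after "cannot", editing term "no"
print(len(pairs))                                # → 16 (the 4 x 4 candidate pairs)
print((["I", "cannot"], ["I", "can"]) in pairs)  # → True
```

Each of these at most 16 pairs is then scored by the scope model described below.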

There are two major arguments for processing repairs before parsing. First, spontaneous speech is not always syntactically well-formed, even in the absence of self corrections. Second, (meta-)rules for repairs increase the search space for the parser. This is perhaps acceptable for transliterated speech, but not for speech recognizer output like word lattices, because these represent millions of possible spoken utterances. In addition, systems which are not based on a deep syntactic and semantic analysis, e.g. statistical dialog act prediction, require a repair processing step to resolve contradictions like the one shown in Figure 1. We propose an algorithm for word lattices that divides repair detection and correction into three steps (cf. Figure 2). First, a trigger indicates potential IPs. Second, a stochastic model tries to find an appropriate repair for each IP by guessing the most probable segmentation. To accomplish this, repair processing is treated as a statistical machine translation problem in which the reparandum is a translation of the reparans. For every repair found, a path representing the speaker's intended word sequence is inserted into the lattice. In the last step, a lattice parser selects the path which fits best.
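The three steps can be sketched end to end. In this toy sketch, the IP probabilities and the scope model are trivial stand-ins for the prosodic classifier and the statistical model described later; all names and values are invented for illustration.

```python
def process_repairs(words, ip_probs, scope_model, threshold=0.5):
    """Return the original path plus one alternative path per detected repair."""
    paths = [list(words)]
    for ip, p in enumerate(ip_probs):
        if p < threshold:                     # step 1: acoustic IP trigger
            continue
        seg = scope_model(words, ip)          # step 2: guess the segmentation
        if seg is None:
            continue
        rd_start, rs_start = seg
        # step 2b: insert the speaker's intended word sequence as a new path
        paths.append(words[:rd_start] + words[rs_start:])
    return paths                              # step 3 (parser selection) omitted

# Stand-in scope model: a two-word reparandum ending at the IP word and a
# one-word editing term directly after it (purely illustrative).
toy_scope = lambda w, ip: (ip - 1, ip + 2)

words = "on Thursday I cannot no I can meet after one".split()
ip_probs = [0, 0, 0, 0.9, 0, 0, 0, 0, 0, 0]   # IP hypothesized after "cannot"
print(process_repairs(words, ip_probs, toy_scope)[-1])
# → ['on', 'Thursday', 'I', 'can', 'meet', 'after', 'one']
```

Note that the original path is kept alongside the new one, mirroring the lattice-editing step: the final selection is left to the parser.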



Repair detection

[Figure 2: An architecture for repair processing. A speech recognizer produces a word lattice for "on Thursday I cannot no I can meet "ah after one". The lattice passes through four stages: (1) acoustic detection of the interruption point, (2) local word-based scope detection of reparandum, editing term and reparans, (3) lattice editing to represent the result as an alternative path, and (4) selection by linguistic analysis, yielding "on Thursday I can meet "ah after one".]



The scope model

The main idea of the scope model is to score possible (RD, RS) pairs by a probability function similar to the one used in stochastic machine translation [2]. In the machine translation task, a sentence S in a source language has to be translated into a sentence T in the target language. Stochastic machine translation defines a scoring function P(T′ | S) which is expected to reflect the likelihood that an arbitrary sentence T′ is a translation of S. The goal is to find the sentence T with the highest score:

Processing starts with the prosodic detection of IPs. A neural network³ based on more than 90 prosodic features calculates for each word the probability that it ends in an IP.

² Often a third kind of repair is defined: abridged repairs. These repairs consist solely of an editing term and are not repairs in our sense.

³ The prosodic classifier was developed at the Chair for Pattern Recognition, University of Erlangen-Nürnberg. Special thanks to A. Batliner, J. Buckow, R. Huber and V. Warnke.

T = argmax_{T′} P(T′ | S)

This formula is simplified using the concept of alignments. Let T be the word sequence t_1 … t_m and S the sequence s_1 … s_l. Now a correspondence between the words in T and S is postulated: one or more words s_i are translated into one or more words t_j. For a sentence pair (T, S) many different correspondences can be defined, which vary in their probabilities. A special sort of correspondence is one where each word in T corresponds to exactly one word in S; words in T which have no correspondence are linked to the empty word ε. Such a word correspondence is called an alignment. It can be represented by a vector a, where a_j = i if t_j corresponds to s_i. For instance, in the repair of Figure 1 the reparandum "I cannot" aligns to the reparans "I can" with a = (1, 2). By assuming that this is the only type of correspondence relevant in translation⁴, this leads to:

P(T | S) = Σ_{a is an alignment} P(T, a | S)

Without loss of information this can be reformulated as follows:

P(T, a | S) = P(m | S) · ∏_{j=1}^{m} P(a_j | a_1^{j−1}, T_1^{j−1}, m, S) · P(t_j | a_1^{j}, T_1^{j−1}, m, S)

Repair processing is a kind of translation as well, so we take the same fundamental formula and replace S by RS and T by RD. But the conditional probabilities involved cannot be reliably estimated because of the huge number of parameters; for example, all probabilities depend on the whole source sentence. We have to introduce some simplifications in order to reduce the parameter problem to a tractable size. Let us assume that:

• m depends only on the length l of RS, so P(m | RS) becomes P(m | l)
• a_j depends only on j, m and l
• RD_j depends only on RS_{a_j}

Thus the resulting formula is:

P(RD, a | RS) = P(m | l) · ∏_{j=1}^{m} P(a_j | j, m, l) · P(RD_j | RS_{a_j})

As repairs are characterized by syntactic and/or semantic anomalies, we do not only use word information for the "translation", but also part-of-speech (POS) tags and semantic classes. With this additional information, P(RD_j | RS_{a_j}) can be expressed by a linear combination of word replacement, POS replacement and semantic class replacement probabilities:

P(RD_j | RS_{a_j}) = α · P(Word(RD_j) | Word(RS_{a_j})) + β · P(POS(RD_j) | POS(RS_{a_j})) + γ · P(Sem(RD_j) | Sem(RS_{a_j}))

where α + β + γ = 1. Word(), POS() and Sem() denote the corresponding selector functions for reparandum and reparans.
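A minimal sketch of this scoring function follows. The probability tables, POS tags and semantic classes below are invented toy values for illustration only; they are not estimated from the Verbmobil corpus.

```python
# Toy annotations and probability tables (invented values)
POS = {"I": "PRON", "cannot": "VAUX", "can": "VAUX"}
SEM = {"I": "person", "cannot": "modal", "can": "modal"}
p_word  = {("I", "I"): 0.9, ("cannot", "can"): 0.3}
p_pos   = {("PRON", "PRON"): 0.9, ("VAUX", "VAUX"): 0.6}
p_sem   = {("person", "person"): 0.9, ("modal", "modal"): 0.7}
p_len   = {(2, 2): 0.5}                              # P(m | l)
p_align = {(0, 0, 2, 2): 0.8, (1, 1, 2, 2): 0.8}     # P(a_j = i | j, m, l)

def lexical_prob(rd_word, rs_word, alpha=0.5, beta=0.3, gamma=0.2):
    """P(RD_j | RS_aj) as a linear combination of word, POS and semantic
    class replacement probabilities; alpha + beta + gamma = 1."""
    return (alpha * p_word.get((rd_word, rs_word), 1e-4)
            + beta * p_pos.get((POS[rd_word], POS[rs_word]), 1e-4)
            + gamma * p_sem.get((SEM[rd_word], SEM[rs_word]), 1e-4))

def scope_score(rd, rs, alignment):
    """P(RD, a | RS) = P(m|l) * prod_j P(a_j|j,m,l) * P(RD_j | RS_aj)."""
    m, l = len(rd), len(rs)
    score = p_len.get((m, l), 1e-4)
    for j, i in enumerate(alignment):    # a_j = i: rd[j] aligned to rs[i]
        score *= p_align.get((i, j, m, l), 1e-4)
        score *= lexical_prob(rd[j], rs[i])
    return score

s = scope_score(["I", "cannot"], ["I", "can"], [0, 1])
print(round(s, 5))   # → 0.13536
```

Note how the POS and semantic-class terms keep the score of "cannot" vs. "can" high even though the word replacement probability alone is low; this is exactly why the additional information helps.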

Lattice Processing

How can the additional word information be obtained in a lattice? Semantic classes are constructed to be unique, but POS tags can be ambiguous. Therefore we map the word lattice onto a tag lattice. Its construction is adapted from Samuelsson [9]: for every word edge and each of its possible POS tags a corresponding tag edge is created, and the resulting probability is calculated. If a tag edge already exists, the probabilities of both edges are merged. The original words are stored in an associated list together with their unique semantic class. Paths are scored by combining the edge scores with POS trigram scores. The scope model enables us to select the best repair segmentation of a given partial lattice path including an IP. But many IP hypotheses are false alarms, which is an inherent problem of an acoustic IP detector. Therefore only repair segmentations with a score above a certain threshold are taken as correct repairs. Once a repair is found on a path, the intended word sequence has to be integrated into the lattice. To this end, a new path is created which consists of the words representing the reparans (cf. Figure 2). Integrating a new path into the lattice requires more than simply skipping reparandum and editing term, because a repair is a phenomenon that depends on a lattice path rather than on single words. Imagine that the system has detected the repair "I cannot no I can" in our example lattice (cf. Figure 2) and marked all

⁴ We do not want to discuss the sense of such an assumption, as it is only a motivation for our repair model.

words by their repair function. Then a search process (e.g. the parser) may select the path "on Thursday I cannot no icon". Here "I cannot" is not a reparandum and "no" is not an editing term. The solution is to introduce a new path into the lattice in which reparandum and editing term are deleted. As we said before, we do not want to delete these segments entirely, so they are associated with the first word of the reparans; the original path can then be reconstructed if necessary. To ensure that these new paths are comparable to other paths, we score the reparandum the same way the parser does and add the resulting value to the first word of the reparans. As a result, both the original path and the one with the repair get the same score except for one word transition: the (probably bad) transition in the original path from the last word of the reparandum to the first word of the reparans is replaced by a (probably good) transition from the reparandum onset to the reparans. We use the lattice in Figure 2 to show an example. The scope model has marked "I cannot" as the reparandum, "no" as an editing term, and "I can" as the reparans. We sum up the acoustic scores of "I", "cannot" and "no". Then we add the maximum language model scores for the transition to "I", to "cannot" given "I", and to "no" given "I" and "cannot". This score is added as an offset to the acoustic score of the second "I".
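The score bookkeeping for the new path might look as follows; the log-domain acoustic and language model scores here are invented for illustration, and the helper name is hypothetical.

```python
def repair_offset(skipped, acoustic, lm_max):
    """Score the words skipped by a repair path so that the new path stays
    comparable to the original: the acoustic score of each skipped word
    plus the best language model score for its transition."""
    offset = 0.0
    history = []
    for w in skipped:
        offset += acoustic[w] + lm_max(tuple(history[-2:]), w)
        history.append(w)
    return offset  # added to the acoustic score of the first reparans word

acoustic = {"I": -1.2, "cannot": -2.0, "no": -0.8}  # toy log scores
lm = {((), "I"): -0.5, (("I",), "cannot"): -1.0, (("I", "cannot"), "no"): -0.7}
offset = repair_offset(["I", "cannot", "no"], acoustic, lambda h, w: lm[(h, w)])
print(offset)  # total offset attached to the first word of the reparans
```

Attaching the offset to the first reparans word makes the repair path and the original path differ in exactly one transition, as described above.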



Results

To evaluate our approach we performed two different tests. In the first one we used only acoustic triggers. In the second test we additionally assumed that a perfect word fragment detector exists, which is an idealized assumption because no state-of-the-art recognizer can hypothesize word fragments. The test set contains 1737 turns with 549 repairs. Both tests are based on transliterated speech rather than the output of a speech recognizer, because we want to measure the performance of the repair module without the influence of speech recognition errors. Nevertheless, speech input is implicitly evaluated in the evaluation of the complete Verbmobil system. The processing time for all tests was restricted to 10 seconds on a Sun Ultra (300 MHz). The final parser search is simulated by a word trigram search in the resulting repair lattice.

         Detection            Correct Scope
         Recall   Precision   Recall   Precision
Test 1   49%      70%         47%      70%
Test 2   71%      85%         62%      83%

Table 1: Results

Table 1 shows the results for repair detection (that means a correct IP is found) and correct scope (that means the correct segmentation for the IP is found). The recall value is defined by:

recall = correct guesses / (correct guesses + missed events)

and gives an impression of the ability to detect repairs. But a repair detector which always guesses a repair gets a score of 100%, because recall does not take false alarms into account. The precision value indicates prediction quality. It is defined as:

precision = correct guesses / (correct guesses + false alarms)
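In code, assuming raw counts of correct guesses, missed events and false alarms (the counts below are toy values, not the figures from Table 1):

```python
def recall(correct, missed):
    """Fraction of actual repairs that were found (ignores false alarms)."""
    return correct / (correct + missed)

def precision(correct, false_alarms):
    """Fraction of hypothesized repairs that were real."""
    return correct / (correct + false_alarms)

# Toy counts: 49 correct detections, 51 missed repairs, 21 false alarms
print(recall(49, 51), precision(49, 21))   # → 0.49 0.7
```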

The results show the strong impact of word fragments on repair detection: both recall and precision increase when this additional trigger is used. Many incorrect scope guesses result from a wrong path selection in the final lattice search; we observed that the correct scope is in the lattice but is not selected by the word trigram. Sentence boundaries cause another problem. A turn can contain more than one sentence, and sometimes the repair module hypothesizes a repair across such a sentence boundary. A possible solution to this problem is the integration of sentence boundaries, which are also hypothesized by the prosodic classifier, into our algorithm.



Related Work

A direct comparison with similar work is rather difficult due to very different corpora, evaluation conditions and goals. Nakatani and Hirschberg [6] reported a recall of 83.4% and a precision of 93.9% in detecting the IP of a repair, but they did not discuss the problem of finding the correct segmentation in detail. In addition, their results were obtained on a corpus in which every utterance contained at least one repair. Stolcke et al. [8] introduced hidden events to model the IPs of different classes of repairs in the speech recognition process. This reduces their recognition errors by about 0.9% absolute, but nothing is said about recall and precision of IP detection, and they likewise made no suggestion about obtaining the correct segmentation. An early and comprehensive attempt at repair processing is described in Bear et al. [4]. They used a pattern matcher to trigger possible repairs and verified these hypotheses with a parser. The simple pattern matching algorithm achieved a recall of 76.1% and a precision of 61.8% for repair detection; 57% of the detected repairs were successfully corrected (= 43.6% recall,

48.1% precision). A second evaluation based on a different test set (26 repairs) included a verification of the hypothesized repairs by a parser: if the parser found an utterance unacceptable, the hypothesized repairs were successively parsed until a parseable utterance was selected. In this case a detection recall of 42.3% and a precision of 84.6% were obtained; correction recall was 30.8% with a precision of 61.5%. They remark that this procedure is not very efficient in a real-time speech system. Hindle [5] suggested a parsing approach using a deterministic parser. He assumed a perfect repair detector, so no comparison to our detection and correction algorithms is possible. An algorithm which is inherently capable of lattice processing is proposed by Heeman and Allen [3]. They redefine the word recognition problem as identifying the best sequence of words, corresponding POS tags and special repair tags, achieving a recall rate of 81% and a precision of 83% for detection and 78%/80% for correction. The test setup was nearly the same as the one for Test 2 (cf. Table 1). Unfortunately, they do not discuss the complexity of the algorithm. Because they extend every node in the lattice to several different tags, there are many readings for each path. A beam search is used to reduce the search space, but the search process incrementally prunes at each lattice node, so they have to take care that a repair is not pruned right before editing term and reparans are seen. As their results show, this works on transliterated speech, but nothing is said about the influence of concurrent lattice paths on pruning. Core and Schubert [7] built a parser on top of this module in a way similar to Bear et al. They observed a slight improvement of about 2% in recall but a drop of about 50% in precision.

Conclusion

We have presented an approach to score potential reparandum/reparans pairs with a relatively simple scope model. Our results show that repair processing with statistical methods and without deep syntactic knowledge is a promising approach, at least for the most frequent repair type, modification repairs. Within this framework, more sophisticated scope models can be evaluated.

References

[1] Levelt, W.: Monitoring and Self-Repair in Speech. Cognition, 1983, pp. 41-104.
[2] Brown, P. F.; Cocke, J.; Della Pietra, S. A.; Della Pietra, V. J.; Jelinek, F.; Lafferty, J. D.; Mercer, R. L.; Roossin, P. S.: A Statistical Approach to Machine Translation. Computational Linguistics, Vol. 16, No. 2, 1990, pp. 79-85.
[3] Heeman, P. A.; Allen, J. F.: Speech Repairs, Intonational Phrases, and Discourse Markers: Modelling Speakers' Utterances in Spoken Dialogue. Computational Linguistics, Vol. 24, No. 4, 1999, pp. 527-571.
[4] Bear, J.; Dowding, J.; Shriberg, E.: Integrating Multiple Knowledge Sources for Detection and Correction of Repairs in Human-Computer Dialogs. Proceedings of ACL, Newark, Delaware, June 1992, pp. 56-63.
[5] Hindle, D.: Deterministic Parsing of Syntactic Nonfluencies. Proceedings of ACL, Cambridge, Massachusetts, 1983.
[6] Nakatani, C.; Hirschberg, J.: A Speech-First Model for Repair Detection and Correction. Proceedings of ACL, Columbus, Ohio, 1993.
[7] Core, M. G.; Schubert, K.: Speech Repairs: A Parsing Perspective. Satellite Meeting, ICPhS, 1999.
[8] Stolcke, A.; Shriberg, E.; Hakkani-Tür, D.; Tur, G.: Modeling the Prosody of Hidden Events for Improved Word Recognition. EUROSPEECH, Budapest, 1999, pp. 307-310.
[9] Samuelsson, C.: A Left-to-Right Tagger for Word Graphs. Proceedings of the 5th International Workshop on Parsing Technologies, Boston, Massachusetts.
