Dictated Input for Broad-coverage Speech Translation

Mark Seligman
Université Joseph Fourier, GETA, CLIPS, IMAG-campus, BP 53, 385, rue de la Bibliothèque, 38041 Grenoble Cedex 9, France
and Spoken Translation, Inc., 1100 West View Drive, Berkeley, CA 94705
[email protected]

Mary Flanagan, Sophie Toole
Linguistix, Inc., 4 Billings Way, Framingham, MA 01701
[email protected], [email protected]

Abstract

Two demonstrations of highly-interactive, broad-coverage speech translation are reported. Both were arranged by adding commercial speech recognition and speech synthesis capabilities to an existing system for online chat translation. Focus here is upon the use of dictation products for speech input. Their interactive correction facilities allow the user to act as a filter between speech recognition and machine translation, thus assuring that translation input is correct and well-formed. We describe, discuss, and evaluate our demo system, and conclude with remarks on its place in the speech translation field.

Introduction

This paper reports on two demonstrations of highly-interactive, broad-coverage speech translation (ST). Both demos were arranged by adding commercial speech recognition and speech synthesis capabilities to an existing system for online chat translation. We focus here upon the use of dictation products to provide spoken input for such real-time machine translation (MT). Current dictation programs include extensive facilities for interactive correction of speech recognition errors, and thus permit the user to act as a filter between speech recognition (SR) and machine translation. The user is thus in a position to reduce the overall ambiguity of the system at a crucial point, and to assure the syntactic well-formedness of the input to the translation component. Before going on to describe and evaluate the demos, let us expand upon these points.

At the present state of the art, several stages of machine translation often leave ambiguities which current techniques cannot reliably resolve. In a transfer-based MT system, for instance, such residual ambiguities may arise in the analysis, transfer, or generation stage, or in several stages consecutively.

The ambiguity problem only becomes worse in systems attempting speech translation, or translation of the spoken word. After all, speech recognition introduces very considerable uncertainty of its own, and does so at the earliest stage in the overall speech translation process, where there is a maximum possibility that ambiguity will be amplified by later stages. To make matters worse, most speech translation systems accept unrehearsed input, and thus must handle such spontaneous speech features as hesitations, fragments, repetitions, etc. Consequently, even if speech recognition performed perfectly, it would often produce ill-formed or noisy text strings, inappropriate as input for most text translation systems.




To date, many experimental speech translation systems (e.g. the Verbmobil system (Wahlster 1993)) have attempted to meet the multi-level ambiguity problem head-on by bringing multiple knowledge sources to bear on ambiguity resolution. Ideally, phonological, prosodic, morphological, syntactic, semantic, discourse, and domain knowledge should all work together in close coordination to resolve both SR and translation ambiguities.¹

To address the problem of ill-formed input, several speech translation systems have incorporated robust parsers, which may ignore noisy parts of an input string, or may attempt to patch together well-formed fragments so as to form a coherent utterance (Jackson et al. 1991; Seneff 1992; Stallard and Bobrow 1992).

However, knowledge source integration is difficult to achieve, and robust parsers are difficult to construct. At present, such measures provide acceptable fully automatic translations only in very tightly constrained domains. To enable the construction of broad-coverage speech translation systems in the near term, alternative measures seem called for.

Accordingly, (Seligman 1997) proposed a speech translation design which temporarily substitutes intensive user interaction for knowledge source integration and robust parsing measures. In hopes of fielding practical, broad-coverage systems as soon as possible, a radical design simplification was suggested, in which speech recognition and translation would be kept cleanly separate. Following this approach, the user provides the sole interface between SR and MT. He or she interactively modifies the SR results until they are correct—for instance, using commercial dictation systems—and only then passes the resulting text string along to MT. Thus SR ambiguity is not allowed to compound MT ambiguity. Further, since the user can modify the input string until it becomes syntactically well-formed, the need for robust parsing is considerably reduced.

1. A precursor of this trend toward tight integration of knowledge sources was the HEARSAY-II system (Erman and Lesser 1980), which first employed a blackboard to integrate system components to aid speech recognition.
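As a rough illustration of this division of labor, the sketch below (in Python) shows the intended control flow. The functions recognize, confirm_with_user, and translate are hypothetical stand-ins for the dictation engine, the user's interactive correction step, and the MT component; none of these names come from the demo software.

# Minimal sketch of the proposed "user as filter" pipeline: the recognizer's
# hypothesis is shown to the speaker, who corrects it until it is exactly
# right, and only then is the confirmed string handed to MT.

def recognize(audio_chunk):
    """Stand-in for a commercial dictation engine's best hypothesis."""
    return "Hollow, how are you?"              # e.g. a typical recognition error

def confirm_with_user(hypothesis):
    """The speaker acts as the filter between SR and MT: accept or correct."""
    edited = input(f'Recognized: "{hypothesis}"  (Enter to accept, or retype): ')
    return edited.strip() or hypothesis

def translate(text, source="en", target="fr"):
    """Stand-in for the MT component; it sees only clean, confirmed text."""
    return f"[{source}->{target}] {text}"

def speak_one_turn(audio_chunk):
    hypothesis = recognize(audio_chunk)        # SR, with its own ambiguity
    confirmed = confirm_with_user(hypothesis)  # ambiguity resolved interactively
    return translate(confirmed)                # MT never receives ill-formed input

if __name__ == "__main__":
    print(speak_one_turn(audio_chunk=None))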


These design concepts have been informally tested in two demos, first at the Machine Translation Summit in San Diego in October, 1997, and a second time at the meeting of C-STAR II (Consortium for Speech Translation Advanced Research) in Grenoble, France, in January, 1998. Both demos were organized and conducted under the supervision of the second author (Flanagan), and both demo systems were based upon a text-based chat translation system previously built by her team using CompuServe's proprietary online chat technology (as distinct from Internet Relay Chat, or IRC (Pyra 1995)). The third author (Toole) organized and conducted the Grenoble demo.

We now proceed to discuss the demonstrations in detail. In Section 1, we describe the design of the demo systems. In Section 2, we discuss and evaluate them. We conclude with thoughts concerning the place of the reported research in the speech translation field.

1 Demo System Design

As mentioned, both demo systems were based upon a text-based chat translation system. We begin with a brief description of this system.

In an online chat session, users most often converse as a group, though one-on-one conversations are also easy to arrange. Each conversant has a small window used for typing input. Once the input text is finished, the user sends it to the chat server by pressing Return. The text comes back to the sender after an imperceptible interval, and appears in a larger window, prefaced by a header indicating the author. Since this larger window receives input from all parties to the chat conversation, it soon comes to resemble the transcript of a cocktail party, often with several conversations interleaved.


Each party normally sees the "same" transcript window. However, prior to the speech translation demos, CompuServe had arranged to place at the chat server a commercial translation system of the direct variety, enabling several translation directions. Once the user of this experimental chat system had selected a direction (say English-French), all lines in the transcript window would appear in the source language (in this case, English), even if some of the contributions originated in the target language (here, French). Bilingual text conversations were thus enabled between English typists and writers of French, German, Spanish, or Italian.


At the time of the demos, total delay from the pressing of Return until the arrival of translated text in the interlocutor's transcript window averaged about six seconds, well within tolerable limits for conversation.
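As an illustration of the relay behavior implied above, the following sketch shows one way a chat-translation server might handle each incoming line: it is shown to every participant in that participant's own selected language, translated at the server when the languages differ. This is not CompuServe's actual implementation; all names in the sketch are hypothetical.

# Hypothetical sketch of a chat-translation relay.
from dataclasses import dataclass
from typing import List

@dataclass
class Participant:
    name: str
    language: str                     # each user's selected source language, e.g. "en", "fr"

    def display(self, line: str) -> None:
        print(f"[{self.name}'s transcript] {line}")

def translate(text: str, source: str, target: str) -> str:
    """Stand-in for the commercial MT system placed at the chat server."""
    return f"[{source}->{target}] {text}"

def relay(sender: Participant, text: str, room: List[Participant]) -> None:
    for member in room:
        if member.language == sender.language:
            shown = text                                       # sender's own words, unchanged
        else:
            shown = translate(text, sender.language, member.language)
        member.display(f"<{sender.name}> {shown}")

if __name__ == "__main__":
    room = [Participant("Mark", "en"), Participant("Sophie", "fr")]
    relay(room[0], "What do you study?", room)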


Having described the pre-existing chat translation setup, we can now go on to describe its modification for speech input and output. At the suggestion of the first author (Seligman) and with his consultation, highly-interactive speech translation demos were created by adding speech recognition front ends and speech synthesis back ends to CompuServe's text-based chat translation system. The third author (Toole) played an active role in making the speech recognition and speech synthesis software operational. Two laptops were used, one running English input and output software (in addition to the CompuServe client, modified as explained below), and one running the comparable French programs.

Commercial dictation software was employed for speech recognition. For the first demo, both sides used discrete dictation, in which short pauses are required between words; for the second demo, the English side dictated continuously—that is, without required pauses—while the French side continued to dictate discretely. (Continuous French dictation was released just before the second demo, but because little testing time was available, a decision was made to forego its use.) At the time of the demos, the discrete products allowed dictation directly into the chat input buffer, but the continuous products required dictation into their own dedicated window. Thus for continuous English input it became necessary to employ third-party software² to create a macro which (1) transferred dictated text to the chat input buffer and (2) inserted a Return as a signal to send the chat. (By March 1998, upgrades of the continuous software had already made this macro less necessary. Direct dictation to the chat window would then have been possible without it, with some sacrifice of advanced features for voice-driven interactive correction of errors.)

Commercial speech synthesis programs packaged with the discrete dictation products were used for voice synthesis. Using development software sold separately by the dictation vendor, CompuServe's chat client software was customized so that, as each text string returning from the chat server was written to the transcript window, it was simultaneously sent to the speech synthesis engine to be pronounced in the appropriate language. The text read aloud in this way was either the user's own, transmitted without changes, or the translation of an interlocutor's input.

2. SpeechLinks software from SpeechTrieve, Inc.
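The two bridging steps and the synthesis hook just described can be pictured as follows. This is a hypothetical illustration in Python, not the actual SpeechLinks macro or the customized CompuServe client; the classes are invented stand-ins for the two applications' text widgets, and the synthesizer object is assumed.

# Hypothetical sketch of the bridging macro and the text-to-speech hook.

class DictationWindow:
    def __init__(self, text=""):
        self.text = text
    def read_and_clear(self):
        text, self.text = self.text, ""
        return text

class ChatInput:
    def __init__(self):
        self.buffer = ""
    def insert(self, text):
        self.buffer = text
    def press_return(self):                        # Return tells the chat client to send
        print(f"sent to chat server: {self.buffer!r}")

def bridge(dictation: DictationWindow, chat: ChatInput) -> None:
    chat.insert(dictation.read_and_clear())        # (1) move dictated text to the chat input buffer
    chat.press_return()                            # (2) insert a Return to send the chat

def speak_incoming_line(line_text, language, synthesizer) -> None:
    """Synthesis hook: each line written to the transcript is also sent to the
    speech synthesis engine for the appropriate language (synthesizer assumed)."""
    synthesizer.speak(line_text, language)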


The first demo took place in an auditorium before a quiet audience of perhaps one hundred, while the second was presented to numerous small groups in a booth in a noisy room of medium size. Each demo began with ten scripted and pre-tested utterances, and then continued with improvised utterances, sometimes solicited from the audience—perhaps six in the first demo, and fifty or more in the second. Some examples of improvised sentences:

FRENCH: Qu'est-ce que vous étudiez? (What do you study?)
ENGLISH: Computer science. (L'informatique.)
FRENCH: Qu'est-ce que vous faites plus tard? (What are you doing later?)
ENGLISH: I'm going skiing. (Je vais faire du ski.)
FRENCH: Vous n'avez pas besoin de travailler? (You don't need to work?)
ENGLISH: I'll take my computer with me. (Je prendrai mon ordinateur avec moi.)
FRENCH: Où est-ce que vous mettrez l'ordinateur pendant que vous skiez? (Where will you put the computer while you ski?)
ENGLISH: In my pocket. (Dans ma poche.)

As these examples suggest, the level of language remained basic, and sentences were purposely kept short, with standard grammar and punctuation.

2 Discussion of Demos

A primary purpose of the chat speech translation demos was to show that speech translation is both feasible and suitable for online chat users, at least at the proof-of-concept level. We feel the demos were successful in this respect. The basic feasibility of the approach appears in the fact that most demo utterances were translated comprehensibly and within tolerable time limits. It is true that the language, while mostly spontaneous, was consciously kept quite basic and standard. It is also true that there were occasional translation errors. Nevertheless, the demos can plausibly be claimed to show that chatters making a reasonable effort could successfully socialize in this way. As preliminary evidence that many users could adjust to the system's limitations, we can remark that the dozen or so utterances suggested by the audience, once repeated verbatim by the demonstrators, were successfully recognized, translated, and pronounced in every case.



The demos also highlight the potential usefulness of interactive disambiguation in moving toward practical broad-coverage systems.

Coverage was indeed broad by contemporary standards. There was no restriction on conversational topic—no need, for instance, to remain within the area of airline reservations, appointment scheduling, or street directions. As long as the speakers stayed within the dictation and translation lexica (each in the tens of thousands of words), they were free to chat and banter as they liked. We believe, in fact, that the reported demos were the first anywhere to show broad-coverage speech translation of usable quality.³ The usefulness of interaction in achieving this breadth was also clear: verbal corrections of dictation results were indeed necessary for perhaps 5-10% of the input words. To give only one example, "Hello" was once initially transcribed as "Hollow". (Here we clearly see the limitations of an approach which substitutes interactive disambiguation for automatic knowledge-based disambiguation: even the most rudimentary discourse knowledge should have allowed the program to judge which word was more likely as a dialog opener. On the other hand, the approach's capacity to compensate for lack of such knowledge was also clear: a correction was quickly made, using techniques to be described shortly.)

This reliance on interactive correction raises obvious questions: Is the current amount and type of dictation correction tolerable for practical use? Would additional interaction be useful and tolerable, e.g. for guiding or correcting translation? Only the question concerning dictation correction is in our current scope. For opinions regarding interactive translation correction, see (Seligman 1997; Blanchon 1996; Boitet 1996).

3. (Kowalski et al. 1995) arranged the only previous demonstration known to the authors of speech translation using dictated input. Since users (spectators at twin exposition displays in Boston, Massachusetts and Lyons, France) were untrained, little interactive correction of dictation was possible. For this and other reasons, translation quality was generally low (Burton Rosenberg, personal communication); but as the main purpose of the demo was to make an artistic and social statement concerning future hi-tech possibilities for cross-cultural communication, this was no great cause for concern. Text was transmitted via FTP, rather than via chat as in the experiments reported here. See (Seligman 1997) for a fuller account.

Correction of dictation

The interaction required in the current demos for correcting dictation is just that currently required for correcting text dictation in general. All current dictation products require interactive correction. The question is, do the advantages of dictation over typing nevertheless justify the cost of these products, plus the trouble of acquiring them, training them, and learning to use them? Their steadily increasing user base indicates that many users think so. (For the record, portions of this paper were produced using continuous dictation software.) We believe that, during the demos, continuous dictation with spoken corrections supplied correct text faster than moderately skilled typing would have done.

A quick description of the dictation correction process may help readers who have never tried dictating to realistically estimate the correction burden. The product used for continuous dictation in the second demo, for example, permits voice-driven application of many standard text editing commands. Users can move the cursor forward and backward by words or paragraphs, or move to the beginning or end of specified text elements; they can select various text elements; and they can copy, cut, and format the selected elements.

However, the voice commands most useful in the demos were specialized correction commands: Scratch That (which kills the most recent text block); Select <text> (which selects a block of contiguous text already dictated, as in Select "most useful"); Correct That (which puts up a menu of recognition candidates for a selected text block, from which an item can be verbally selected); and Spell That (which permits spoken spelling for rare or difficult words). Once text has been selected, it can be replaced by simply pronouncing an alternative word. Perhaps the most useful command of all for our purposes was Correct <text>, which combines Select and Correct That: the specified text block is first selected, and then a menu of candidates arises. If the intended text is among the numbered candidates, it is selected with the phrase Choose <n>. If not, the command Spell That initiates a mode in which spoken spelling is enabled, and the menu is dynamically stocked with words that begin with the letters supplied so far. In either case, once the desired text is selected from the menu, the cursor automatically returns to the end of the text in progress, so that dictation can take up where it left off.
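To make the correction cycle concrete, here is a minimal sketch of the commands described above operating on a plain text buffer. It is an illustration only, not the dictation product's implementation; the candidate menu, which the real recognizer supplies from its own alternates, is simulated here with a fixed table.

# Toy model of the voice-driven correction commands on a text buffer.
CANDIDATES = {"hollow": ["Hello", "Hollow", "Hallow"]}   # simulated recognition alternates

class Buffer:
    def __init__(self):
        self.words, self.selection = [], None

    def dictate(self, text):                 # append newly dictated words
        self.words.extend(text.split())

    def scratch_that(self):                  # Scratch That: kill the most recent block
        if self.words:
            self.words.pop()

    def select(self, phrase):                # Select <text>: find and select a dictated block
        tokens = [t.lower() for t in phrase.split()]
        for i in range(len(self.words) - len(tokens) + 1):
            if [w.lower() for w in self.words[i:i + len(tokens)]] == tokens:
                self.selection = (i, i + len(tokens))
                return True
        return False

    def correct_that(self):                  # Correct That: menu of candidates for the selection
        i, j = self.selection
        return CANDIDATES.get(" ".join(self.words[i:j]).lower(), [])

    def choose(self, menu, n):               # Choose <n>: replace the selection with candidate n
        i, j = self.selection
        self.words[i:j] = [menu[n - 1]]
        self.selection = None

buf = Buffer()
buf.dictate("Hollow how are you")
if buf.select("Hollow"):                     # Correct <text> = Select followed by Correct That
    menu = buf.correct_that()
    buf.choose(menu, 1)                      # "Choose 1" picks "Hello"
print(" ".join(buf.words))                   # -> Hello how are you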


While a strict hands-off policy was adopted for the demos, it is worth noting that typed text and commands can be freely interspersed with spoken text and commands. It is sometimes handy, for instance, to select an error using the mouse, and then to verbally apply any of the above-mentioned correction commands. Similarly, when spelling becomes necessary, typing often turns out to be faster than spoken spelling. Thus verbal input becomes one option among several, to be chosen when, as often happens, it offers the easiest or fastest path to the desired text. The question, then, is no longer whether to type or dictate the discourse as a whole, but which mode is most convenient for the input task at hand. As broad-coverage speech translation systems in the near term are likely to remain multi-modal rather than exclusively telephonic, they can take advantage of this flexibility.

Conclusions

We have described a new approach to speech translation, in which interactive correction of speech recognition (and perhaps other) errors is temporarily substituted for knowledge source integration and robust analysis. This approach, we believe, may yield broad-coverage systems with usable quality sooner than approaches which aim for maximally automatic operation.

Two demonstrations of highly-interactive speech translation were reported. For the demos, an experimental chat translation system created by CompuServe, Inc. was provided with front and back ends, using commercial dictation products for speech input and commercial speech synthesis engines for speech output. The dictation products' standard interfaces were used to interactively debug dictation results, thus permitting the user to act as a filter between speech recognition and translation.


While evaluation of these experiments remained informal, coverage was much broader than in most ST experiments to date—in the tens of thousands of words. Output quality was probably sufficient for many social exchanges.

While we believe that the unprecedented breadth of coverage in our demos indicates the usefulness of the highly-interactive approach, we do not argue that the push toward integrated, robust, and maximally automatic systems should be abandoned. Rather, the suggestion is simply that emphasizing interaction in order to ease the need for tight integration and robust analysis may offer the quickest route to widespread usability, and that experience with real use is vital for progress.

We suggest that even systems aiming for fully automatic operation recognize the need for interactive ambiguity resolution when automatic resolution is insufficient. As progress is made and increasing knowledge can be applied to automatic resolution, interactive resolution should be necessary less often. When it is necessary, its quality should be improved: questions put to the user should become more sensible and more tightly focused. Similarly, as robust parsing becomes more advanced, the need for syntactically well-formed input will decrease. In the near future, however, clean input is likely to remain important in real-time translation aiming for high quality and broad coverage.

Acknowledgements

Warmest appreciation to CompuServe, Inc. for making the chat-based speech translation demonstrations possible. The demos made use of pre-existing proprietary software. Particular thanks are due to Phil Jensen and Doug Chinnock, translation system engineers. The opinions expressed throughout are ours alone.

References

Blanchon, Hervé. 1996. "A Customizable Interactive Disambiguation Methodology and Two Implementations to Disambiguate French and English Input." In Proceedings of MIDDIM-96 (International Seminar on Multimodal Interactive Disambiguation), Col de Porte, France, August 11-15, 1996.

Boitet, Christian. 1996. "Dialogue-based Machine Translation for Monolinguals and Future Self-explaining Documents." In Proceedings of MIDDIM-96 (International Seminar on Multimodal Interactive Disambiguation), Col de Porte, France, August 11-15, 1996.

Erman, L.D. and V.R. Lesser. 1980. "The Hearsay-II Speech Understanding System: A Tutorial." In Trends in Speech Recognition, W.A. Lea, ed., Prentice-Hall, 361-381.

Jackson, E., D. Appelt, J. Bear, R. Moore, and A. Podlozny. 1991. "A Template Matcher for Robust NL Interpretation." In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 190-194, February 1991.

Kowalski, Piotr, Burton Rosenberg, and Jeffery Krause. 1995. Information Transcript. Biennale de Lyon d'Art Contemporain, December 20, 1995 to February 18, 1996. Lyon, France.

Pyra, Marianne. 1995. Using Internet Relay Chat. Indianapolis, IN: Que Corporation.

Seligman, Mark. 1997. "Interactive Real-time Translation via the Internet." In Working Notes, Natural Language Processing for the World Wide Web, AAAI-97 Spring Symposium, Stanford University, March 24-26, 1997.

Seneff, S. 1992. "Robust Parsing for Spoken Language Systems." In Proceedings of ICASSP, pp. 198-192, March 1992.

Stallard, D. and R. Bobrow. 1992. "Fragment Processing in the DELPHI System." In Proceedings of the DARPA Speech and Natural Language Workshop, pp. 305-310, February 1992.

Wahlster, Wolfgang. 1993. Verbmobil: Translation of Face-to-Face Dialogs. Research Report RR-93-34, German Research Center for Artificial Intelligence (DFKI GmbH), Saarbrücken, Germany.
