Interactive MT and Speech Translation via the Internet

Mark Seligman
Université Joseph Fourier, GETA, CLIPS, IMAG-campus
BP 53, 385, rue de la Bibliothèque, 38041 Grenoble Cedex 9, France
and Spoken Translation, Inc.
1100 West View Drive, Berkeley, CA 94705
[email protected]

Abstract

Two related ideas are put forward: first, that interactive speech recognition could be joined to interactive machine translation to yield interactive speech translation; second, that the Internet offers a tremendous opportunity for experiments with real-time translation of dialogues, whether or not speech is involved. I begin by making the case for interactive speech translation. In particular, isolated-word (as opposed to continuous) speech recognition is viewed as an especially interactive input mode with several advantages which are usually overlooked. Discussion then turns to the use of the Internet as a transmission medium for on-line MT, with special attention to Internet Relay Chat. Preparatory research concerning the components and architecture of a “quick and dirty” speech translation system is reported, and a related project organized by Piotr Kowalski and others is described.

Introduction

This informal note is intended to prompt discussion of two related ideas: first, that interactive speech recognition could be joined to interactive machine translation to yield interactive speech translation; and second, that the Internet offers a tremendous opportunity for experiments with real-time translation of dialogues, whether or not speech is involved. I've carried out no experiments in these areas so far, but I have done some preparatory research concerning the components and architecture of a “quick and dirty” speech translation system, and can briefly share the results. I'll also report on a related project now being organized by Piotr Kowalski and others.

My first aim will be to very briefly make the case for interactive speech translation. Secondarily, I'll suggest viewing isolated-word (as opposed to continuous) speech recognition as a particularly interactive input mode with several advantages which are usually overlooked. Finally, I'll discuss the Internet as a transmission medium, with special attention to Internet Relay Chat.

1 The need for interactive speech translation

Translation ambiguities tend to multiply exponentially as several sources of uncertainty permute within translation stages—within analysis, or transfer, or generation—and as earlier stages pass unresolved problems to later ones. A basic motivation for human intervention during text translation is to resolve ambiguities before they can permute or propagate. The hope is that by intervening within or between translation stages to select and prune, users can reduce the uncertainty which is passed along. To nip propagating ambiguities in the bud, intervention should take place as early as possible.
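To make the arithmetic concrete, here is a toy sketch; the branch factors are invented for illustration and do not describe any particular system.

```python
# Toy illustration of ambiguity propagation (invented branch factors,
# not measurements of any real system). Each stage multiplies the
# candidates passed downstream; confirming a stage's output early
# collapses its branch factor to 1.
from math import prod

branch = {"analysis": 4, "transfer": 3, "generation": 2}
print("no intervention:", prod(branch.values()))           # 4*3*2 = 24 candidates

pruned = dict(branch, analysis=1)                          # user confirms analysis
print("analysis confirmed early:", prod(pruned.values()))  # 1*3*2 = 6
```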

But if this argument for early human intervention is persuasive for text translation, it should be even more persuasive for translation based on spoken input. Speech recognition, after all, introduces very considerable uncertainty of its own, and does so at the earliest stage in the overall speech translation process, where there is a maximum possibility that ambiguity will be amplified by later stages. If much uncertainty remains in the speech recognition results, the whole speech translation process stands on a very shaky foundation. Any efforts to eliminate that uncertainty will certainly be worthwhile.

And so: how important does interactive disambiguation appear for speech translation? At least as important as for text translation, I would say, and—in view of the urgent need to establish a solid foundation for the whole translation process—probably more important.

In making this point, however, I don't mean to suggest that the effort to develop fully automatic speech translation should be abandoned. On the contrary, participation in the development of ATR's ASURA speech translation system and observation of the Verbmobil system at DFKI have left me with a strong impression of its excitement and challenge. Progress in this effort will require progress in knowledge source integration, clearly a fundamental issue in cognitive science and computational engineering. My argument is only that while we continue work on fully automatic speech translation systems as a long-term goal, it's smart to explore interactive approaches for the near term.

2 The high road vs. the low road

The sort of human intervention I'm discussing has been applied for practical reasons even in speech translation systems which ideally aim for automatic operation. For example, in the ASURA speech translation system used in the international demo of January 1993, speech recognition provided an n-best list of utterance candidates. If the correct text string was among the candidates, the operator selected it, and it became the input to translation—now guaranteed to be correct. If the correct string was absent, the entire utterance had to be repeated.

Granted, this sort of human interface teaches little about how to integrate speech processing and translation. However, it did increase the reliability of the system to the point of usability (and demoability), and this seems to me crucial. I presently believe that such interactive or interventionist speech-to-translation interfaces offer the quickest path to speech translation systems which are reliable enough for actual use. And I believe that actual use is vital—that actual experience with working systems will lead to rapid progress, even if the systems are initially relatively primitive.

The question is, which research path will reach the goal of largely automatic speech translation sooner: extensive experience beginning with “quick and dirty” interactive systems, or ambitious efforts to build systems which are fully automatic from the start? Right now, I'd bet on the former (while continuing basic research on the latter). As the Scottish song goes, “You take the high road, and I'll take the low road, and I'll be in Scotland beforrre ye.”
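To make this confirm-or-repeat interface concrete, here is a minimal sketch in Python. It is not ASURA's code; the recognizer is a stub and all names are mine. The point is only that a confirmed candidate becomes guaranteed-correct input to translation.

```python
# Sketch of the confirm-or-repeat interface described above. The
# recognizer is a stub standing in for one that returns an n-best
# list of full-utterance candidates; all names are illustrative.

def recognize_n_best(audio, n=10):
    """Stand-in for a real recognizer returning up to n candidate strings."""
    return ["could you book a room for me", "could you look a room for me"]

def confirmed_utterance(audio):
    """Show candidates; return the confirmed string, or None to re-speak."""
    candidates = recognize_n_best(audio)
    for i, cand in enumerate(candidates, 1):
        print(f"{i}: {cand}")
    choice = input("Number of the correct candidate (Enter = repeat): ").strip()
    if choice:
        return candidates[int(choice) - 1]  # guaranteed-correct input to MT
    return None                             # correct string absent: repeat
```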

3 What sort of interaction?

Assuming for the sake of argument that the near-term advantages of interactivity for speech translation can be accepted, many questions remain about the style of interaction, the points of interaction, and so on. For example, it's unlikely that the procedure described above, selection from an n-best list for the entire utterance, is at all optimal from an ergonomic point of view. The user had to choose the right candidate from among ten or so similar competitors. While this style of choice was manageable in a carefully planned demo, it would have been tiring in a more spontaneous dialogue. An improved interface for a connected speech recognizer might collapse the common elements of several candidates, thus presenting a word or phrase lattice to the user, who might then indicate the correct path using a pointing device. Or it might present only the best candidate, and provide a means of quickly correcting erroneous portions.

But there's another approach to the selection of speech recognition candidates which is particularly interactive, since it intervenes particularly early and particularly often to nip uncertainty and error in the bud before they can propagate. This is the use of isolated-word speech recognition, as seen in several dictation programs currently available on the market. When dictating word by word, the user corrects each word before going on to the next one. The prefix word string is thus known to be correct, and the language model can take full advantage of this knowledge in guessing what comes next. Further, the boundaries of the current word are known exactly, and this knowledge, too, greatly constrains the recognition task. When ambiguity remains, the dictation program presents a menu of candidates for the next word, and the user can choose the right one, using a pointing device or saying “Choose one” or “Choose two”.
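A purely illustrative rendering of this word-by-word loop follows; the recognizer is again a stub, and no actual dictation product's interface is being reproduced. The point is that the confirmed prefix and the known word boundaries constrain each recognition step.

```python
# Illustrative isolated-word dictation loop. The stub stands in for a
# real acoustic model whose scores would be combined with a language
# model conditioned on the already-confirmed prefix.

def candidates_for(audio_word, prefix):
    """Stub: next-word candidates, ranked given the confirmed prefix."""
    return ["for", "four", "fore"]          # placeholder output

def dictate():
    prefix = []                             # confirmed words: known correct
    while True:
        audio_word = input("Say a word (or 'stop'): ")
        if audio_word == "stop":
            break
        cands = candidates_for(audio_word, prefix)
        for i, w in enumerate(cands, 1):
            print(f"Choose {i}: {w}")
        pick = input("Which? [1] ").strip() or "1"
        prefix.append(cands[int(pick) - 1])
        print("So far:", " ".join(prefix))  # prefix is now guaranteed correct
    return " ".join(prefix)
```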

The disadvantages of isolated-word speech input are obvious enough: this entry mode is much slower than natural speech, and considerably less natural, so that some learning is required. However, the advantages of this input mode are also rather striking, and may cover a multitude of sins at the current stage of research. The first advantage has already been stressed: reliability or robustness. Good results can now be obtained even for user-independent dictation systems, those requiring no registration for new users.

Perhaps more surprising, though, is the breadth of coverage which is immediately available—the capacity to recognize words across many domains, even when they are quite rare. I was impressed, for instance, by one dictation system's ability to recognize my rendering of Abraham Lincoln's Gettysburg Address: infrequent words like continent, conceived, liberty, dedicated, proposition, etc. were correctly recognized immediately and without ambiguity, although there had been no prior registration of my voice. To a researcher accustomed to the need for extensive domain-specific training in order to recognize common words in connected speech using lexicons of a few hundred words, this success was striking indeed.

The explanation is that when word boundaries are known, the frequency of the target word and the size of the lexicon become less important. What matters much more under these conditions is confusability with similar words. A word which sounds like no other word in the lexicon can be recognized easily, even if it is quite arcane, and even if the lexicon is quite large. (The word “arcane”, for instance, would probably cause little difficulty for the dictation program under discussion.)
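The point can be illustrated crudely: what matters is a word's distance from its nearest rival in the lexicon. Real systems would compare phonetic forms; the string similarity below is only a stand-in, and the tiny lexicon is invented.

```python
# Crude illustration of confusability: a rare word is easy to recognize
# if nothing else in the lexicon sounds like it. Real systems compare
# phonetic forms; string similarity is only a stand-in here.
from difflib import SequenceMatcher

lexicon = ["arcane", "for", "four", "fore", "liberty", "proposition"]

def nearest_rival(word):
    others = (w for w in lexicon if w != word)
    return max(others, key=lambda w: SequenceMatcher(None, word, w).ratio())

for w in ["arcane", "for"]:
    rival = nearest_rival(w)
    sim = SequenceMatcher(None, w, rival).ratio()
    print(f"{w!r}: nearest rival {rival!r}, similarity {sim:.2f}")
```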

Since it would not require users to remain within a specified and extensively trained domain of discourse, a speech translation system based on isolated-word dictation could potentially be used for many purposes with a minimum of domain adaptation for speech recognition. (Of course, the need to customize the translation component would remain as an obstacle to completely unrestricted use.)

There is an additional practical advantage of isolated-word input for the early stages of speech translation: it's likely to eliminate many of the disfluencies of spontaneous speech, since it forces users to speak relatively slowly and carefully and gives continuing opportunities for revision and editing. So the syntax of whole utterances is likely to be much cleaner than in spontaneous speech, and accordingly easier to analyze. Further, whole utterances will almost certainly be shorter on average. Thus the translation processes in an interactive speech translation system using isolated-word input will gain a triple advantage over those in a fully automatic system: their input will not only be unambiguous—since it will be confirmed by the user—but shorter and syntactically simpler as well.

To sum up, I've been arguing the advantages of interactivity for near-term speech translation in general, and have taken the argument to its extreme by advocating near-term experimentation with extremely interactive speech input—that is, isolated-word dictation. The potential advantages mentioned are greatly augmented robustness; wide coverage; and a hypothesized tendency to eliminate disfluencies and to simplify and shorten the input syntactically. The suggestion is that these advantages, taken together, may well outweigh the intrinsic problems of slowness and unnaturalness, at least for several years to come. Further, because an extremely interactive, “quick and dirty” speech translation system could be put to actual use in the near term, it might hasten the advent of fully automatic systems by providing necessary experience.

So far I've been discussing interactive speech recognition as an element of an interactive speech translation system. I've said nothing about interactive translation itself, and in fact I will say little, since I hope my colleagues will say much. However, I do want to make one point related to my hypothesis that highly interactive speech recognition will shorten utterances and clean up their syntax. I'll risk this further hypothesis: because utterances will be shorter and cleaner than for connected or spontaneous speech input, less interaction will be needed after speech recognition to produce sufficient translations. At the limit, there may be applications in which no interaction during translation is required at all. These might include interactions whose purpose is mainly social contact rather than information transfer, for example between the families of exchange students, or between Internet chatters. (In point of fact, social calls do currently account for many international telephone calls employing human interpreters.) In such cases, translation bloopers are likely to provide more fun than hindrance: everyone laughs, and the ice is broken.

4 The Internet and Internet Relay Chat

Mention of Internet chat brings me to my next point: that the Net poses a tremendous opportunity for research concerning real-time translation, whether the translation is interactive or not, and whether the mode is text or speech. While there are many possible transmission routes, to my present knowledge the most natural is Internet Relay Chat, or IRC.

In the most usual setup, users from all over the world select a chat channel from among hundreds. A user types into a special input window, and hits Return when his or her utterance is finished. The finished utterance instantly appears in a larger window visible to all who have selected the same channel. This common window thus comes to resemble the transcript of a cocktail party, with numerous conversations interleaved. Any user can easily record all of the utterances in a session from his or her vantage point. Chatters can use prefixes or other tricks to direct conversation to particular fellow chatters. Some current chat software even includes voice synthesis, so that a given user not only sees but hears the incoming utterances—differentiated, if desired, by different synthesized voices. By mutual agreement, chatters can “move” into “private rooms” to hide their words from others. It is also straightforward to establish a new channel, in which the establisher controls participation.

My proposal, then, is to conduct experiments in interactive translation via Internet Relay Chat. Interactive MT would serve as a front end for IRC: a user would first interactively facilitate or debug translations in the source language, and when satisfied would press Return or give an equivalent verbal signal. The MT component would in turn pass its output, the translated text string, to chat for transmission, as in the sketch below. Experiments could be conducted with or without interactive speech recognition as an additional front-end layer. Experiments should stress questions of usability: for specific communication tasks, how much interaction is needed for what degree of quality, and how much quality is enough?
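A minimal sketch of this front-end arrangement follows. The server, channel, nick, and translate() stub are placeholders of mine; NICK, USER, JOIN, and PRIVMSG are standard IRC commands. A real client would also need to answer server PINGs and display incoming traffic.

```python
# Sketch of an interactive-MT front end for IRC. The utterance is
# translated, shown for confirmation, and only then transmitted.
# Server, channel, nick, and translate() are placeholders; a real
# client would also answer server PINGs and read incoming messages.
import socket

SERVER, PORT, CHANNEL, NICK = "irc.example.org", 6667, "#mt-chat", "mt_user"

def translate(text):
    """Stub for the interactive MT component."""
    return text  # a real system would disambiguate with the user here

def send(sock, line):
    sock.sendall((line + "\r\n").encode())

sock = socket.create_connection((SERVER, PORT))
send(sock, f"NICK {NICK}")
send(sock, f"USER {NICK} 0 * :{NICK}")
send(sock, f"JOIN {CHANNEL}")

while True:
    utterance = input("> ")                    # source-language input
    if not utterance:
        break
    target = translate(utterance)
    print("Translation:", target)
    if input("Send? [y/n] ").lower().startswith("y"):
        send(sock, f"PRIVMSG {CHANNEL} :{target}")
sock.close()
```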

Preliminary research regarding the components and architecture of a “quick and dirty” real-time MT system for the Net has been encouraging. Now available commercially or as shareware are a variety of isolated-word dictation systems; sufficiently fast text translation systems; IRC clients with architectures open enough to permit the easy addition of a front end; and operating systems with sufficient inter-application protocols. The greatest remaining uncertainty at present concerns the availability of interactive MT components sufficiently fast for online use. I'm hoping meeting participants can give pointers in this area; in return, I'd be glad to share the information I've gathered.

I hope I've said enough to prompt interest in “quick and dirty” MT on the Net. Let me repeat that in my view “quick and dirty” systems featuring high interactivity would not be final destinations, but rather alternate routes to more automatic, more integrated, and ultimately more satisfying systems. Given the furious rate of change in the software world, there is a danger that if we as researchers neglect this “low” road toward more automatic systems, events will pass us by.

For instance, while surfing the Net in search of the components mentioned above, I learned that a “quick and dirty” speech translation system for the Net is already under development, though apparently its use is planned for a limited time only. This is the MIT-Lyon Information Transcript Project organized by Piotr Kowalski and others (see http://sap.mit.edu/projects/mit-lyon/project.html). The plan is to link sites at MIT and at the Lyon Biennale for four hours. “Visitors at each site will speak freely in their native tongue…” Source language captured by IBM's isolated-word recognition software will be transmitted (along with video and the original audio). At the receiving end, it will be translated via “appropriate” software. The aim is largely artistic and inspirational: “…the intention…is to reveal some hidden aspects of communication at the time of its globalization.”

Obviously enough, this is social interaction in which translation quality is not crucial. All the same, the handwriting is on the wall: technology is now widely available which permits the organization and execution of such projects even by those outside the mainstream NLP research community. Why should we let them have all the fun?
