164th meeting: Acoustical Society of America, Kansas City, 25 October 2012
4pSC17
Evaluating an Automatic Speech-to-Speech Interpreter in Context
Jared C. Bernstein, Linguistics, Stanford University
Elizabeth P. Rosenfeld, Tasso Partners, Palo Alto, California
GOAL
Estimate functional accuracy in context for a speech-to-speech interpreter.
INTRODUCTION
A naïve speech-to-speech interpreter can be implemented as three component processes in series: speech recognition, machine translation, and speech synthesis. However, a fair evaluation of a speech-to-speech interpreting system may differ from simply calculating the product of the accuracies of those three component processes. This is because human users are sensitive to the rate at which an interpreter operates, and because a system’s communication success rate is properly measured by a listener’s correct understanding of the speaker’s intention in a particular context rather than by the system’s word-for-word accuracy and intelligibility. This poster looks at the effect of context. We are evaluating one currently available system (Jibbigo), operating in one direction for two language pairs: Spanish-to-English and Chinese-to-English. For each language pair, L2 output that is incorrect by conventional accuracy measurement is re-evaluated with respect to its potential for communicating speaker intent in context. Results for Spanish and Chinese as L1 (source language) going into English as L2 (target language) suggest that conventional accuracy (in quiet) is surprisingly high, at about 70-90% correct, and that a more appropriate functional measure, such as communication in context, nearly halves the error rate.
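For reference, the product-of-accuracies baseline that the introduction argues against can be sketched as follows. This is only an illustration: the function name and the per-stage accuracy values are hypothetical, and the calculation assumes that errors in the three stages are independent.

```python
from math import prod


def naive_pipeline_accuracy(component_accuracies):
    """Multiply per-stage accuracies for stages run in series.

    Assumes stage errors are independent, which is the simplification
    the introduction cautions against relying on.
    """
    accuracies = list(component_accuracies)
    for accuracy in accuracies:
        if not 0.0 <= accuracy <= 1.0:
            raise ValueError(f"accuracy out of range: {accuracy}")
    return prod(accuracies)


# Hypothetical per-stage accuracies (not measurements from this study).
stages = {"speech recognition": 0.95, "machine translation": 0.90, "speech synthesis": 0.99}
print(f"{naive_pipeline_accuracy(stages.values()):.3f}")
```

A figure computed this way says nothing about whether a listener in a known situation would still recover the speaker's intent, which is the question the rest of the poster addresses.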
NEXT PROBLEM

EXAMPLE DATA
Intended Utterance | Recognized String | Machine Translation
¿Cuál es el número de mi habitación? | Cuál es el número de mi habitación? | What is the number of my room?
Quisiera alquilar un coche pequeño. | Quisiera alquilar un coche pequeño. | I would like to rent a small car.
Quiero marcharme cuanto antes. | Quiero marcharme cuanto antes. | I want to leave as soon as possible.
Me gustaría una botella de vino tinto. | Me gustaría una botella de vino tinto. | I would like a bottle of red wine.
Quisiera pescado con papas hervidas. | Quisiera pescado con papas hervidas. | I would like fish with boiled potatoes.
¿Me puede alcanzar una servilleta, por favor? | Me puede alcanzar una servilleta por favor? | Can you get me a napkin please?
Un salón de té es a veces, un poco exclusivo. | Un salón de té sabes es un poco exclusivo. | A tearoom know it's a little sole.
Tráigame otra botella de vino blanco, por favor. | Traigame otra botella de vino blanco por favor. | Please bring me another bottle of white wine, please.
¿Puede usted enseñarme una habitación matrimonial, por favor? | Puede usted enseñarme una habitación matrimonial por favor? | Can you show me one twin room please?
Quiero marcharme cuanto antes. | Quiero marcharme cuanto antes. | I want to leave as soon as possible.

RESULTS
Re-Interpreting Task: EXAMPLE DATA (non-match cells in yellow)
Context | ASRxMT Translation | Re-Interpreted
A waiter speaks to a customer at a restaurant. | May I take? | May I take your order?
A waiter speaks to a customer at a restaurant. | May I take? | May I take this for you?
An officer speaks to a traveler at customs. | The purpose of your visit you. | What is the purpose of your visit?
An officer speaks to a traveler at customs. | Democratic Party. | How many in your party?
Customer speaks to the waiter in a restaurant | I love the shrimp Scampi to the aisle. | I'd like the shrimp scampi on the side.
A guest speaks to a clerk at the hotel desk. | I live in - please. | Can I check in, please?
A guest speaks to a clerk at the hotel desk. | Could you call me a taxi for one? | Could you call a Taxi for me?
A guest speaks to a clerk at the hotel desk. | Could you call me a taxi for one? | Could you call me a taxi for 1PM?
Customer speaks to a clerk in a shop | Is a portion of this cake takeout please? | Could I have a piece of this cake to go?
Waiter speaks to a customer in a restaurant | Serve birds meat and hunting. | We serve game meat here, such as birds.
Guest speaks to a clerk at the hotel desk | I take bus should I take to the University. | Which bus should I take to get to the University?
A traveler speaks to a clerk at customs. | Tour. | I'm here as a tourist.
Clerk speaks to a traveler at the travel counter | It's a it varies to the youth. | There is a different cost for those under 18.

2012 EXPERIMENT
Materials: The 197 Spanish source texts came from a Berlitz phrase book for travelers, along with reference translations in English. The 75 Chinese texts came from a Chinese website for travelers; their reference translations were adapted from those on the site. Longer utterances with clause structure were selected, along with some elliptical phrases.
Apparatus and Procedure: Sessions were conducted in whatever quiet location was convenient for the participants, mostly a home or office. L1 speakers practiced with Jibbigo running on an iPhone 4 for about 5 minutes, or until they could produce 5 perfect ASR transcriptions in a row. Participants were instructed to speak so that the device performed at its best. The Spanish speakers were permitted to try a second rendition of a text, as necessary.
Intermediate product: MTurk workers produced ~750 re-interpretations of 224 incorrect ASRxMT translations from Spanish and Chinese spoken material. These were set up in another MTurk task, shown below, and each was rated by several Turkers.
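The poster does not say how the several ratings collected for each re-interpretation were combined into a single verdict. As a minimal sketch, assuming a strict majority vote per item (an assumption, not the authors' documented procedure), the share of re-interpretations judged equivalent could be computed like this. The function name, item identifiers, and ratings are all hypothetical.

```python
from collections import defaultdict


def equivalence_rate(judgments):
    """Fraction of items that a strict majority of judges rated equivalent.

    `judgments` is an iterable of (item_id, is_equivalent) pairs, one pair
    per judge rating. Ties count as not equivalent (an assumption).
    """
    votes = defaultdict(list)
    for item_id, is_equivalent in judgments:
        votes[item_id].append(bool(is_equivalent))
    if not votes:
        raise ValueError("no judgments supplied")
    equivalent_items = sum(
        1 for item_votes in votes.values() if 2 * sum(item_votes) > len(item_votes)
    )
    return equivalent_items / len(votes)


# Hypothetical ratings for four re-interpretations (not data from the study).
ratings = [
    ("r1", True), ("r1", True), ("r1", False),
    ("r2", False), ("r2", False), ("r2", True),
    ("r3", True), ("r3", True), ("r3", True),
    ("r4", False), ("r4", True),
]
print(equivalence_rate(ratings))
```

Other aggregation rules, such as requiring unanimity or averaging graded ratings, would give different rates from the same raw judgments.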
1. Conventional accuracy measures underestimate the functional accuracy of noisy communication.
2. Automatic voice-in, voice-out translation is emerging, but its functional accuracy is unknown and unreported.
3. Functional accuracy is not well understood.
4. We present a method for estimating functional accuracy, working with actual translations produced by ASR x MT (automatic speech recognition passed to machine translation).
Participants: Six Spanish speakers and two Chinese speakers. Both Chinese talkers and four of the Spanish talkers were highly educated; the other two Spanish talkers were just literate. There were also 39 Mechanical Turk interpreters and 17 Mechanical Turk judges (in the US).
We assumed that all experimental utterances are intelligible and that the synthetic speech is also intelligible. Furthermore, no tests involved repair, accommodation, interviews, long turns, role plays, or ratings of output text in context. The results were quite good.
Equivalence Judgement Task: EXAMPLE DATA
2011 EXPERIMENT
At the 161st ASA Meeting (2011, Seattle, paper 5pSC6), the authors presented a first report on “Measuring linguistic performance of a hand-held speech-to-speech interpreter”.
Sp → (ASRxMT) → Engl
¿Cuál es el número de mi habitación?
Generic Contexts were constructed from appropriate combinations of these elements:
Speaker: Customer, Traveler, Clerk, Waiter, Guest, Officer
Addressee: Customer, Traveler, Clerk, Waiter, Someone, Officer, Guest
Location: at the hotel desk; at the travel counter; on a train; in the street; in a restaurant; at Customs; in a shop
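The construction of generic contexts from these three element lists can be sketched as below. The element values come from the poster, but the sentence template, the `with_article` helper, and the `is_plausible` filter are hypothetical: the poster says only that "appropriate" combinations were used, without giving the selection rule.

```python
from itertools import product

SPEAKERS = ["Customer", "Traveler", "Clerk", "Waiter", "Guest", "Officer"]
ADDRESSEES = ["customer", "traveler", "clerk", "waiter", "someone", "officer", "guest"]
LOCATIONS = [
    "at the hotel desk.", "at the travel counter.", "on a train.",
    "in the street.", "in a restaurant.", "at Customs.", "in a shop.",
]


def with_article(noun):
    """Add an indefinite article, except for the pronoun 'someone'."""
    if noun == "someone":
        return noun
    return ("an " if noun[0] in "aeiou" else "a ") + noun


def is_plausible(speaker, addressee, location):
    """Hypothetical stand-in for the authors' notion of an 'appropriate'
    combination: here it only rules out a role addressing itself."""
    return speaker.lower() != addressee


def generic_contexts():
    """Build one context sentence per plausible (speaker, addressee, location)."""
    return [
        f"{speaker} speaks to {with_article(addressee)} {location}"
        for speaker, addressee, location in product(SPEAKERS, ADDRESSEES, LOCATIONS)
        if is_plausible(speaker, addressee, location)
    ]


contexts = generic_contexts()
print(len(contexts))
print(contexts[0])
```

A real filter would presumably also tie roles to locations (for example, a waiter only in a restaurant), which this sketch does not attempt.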
The Amazon Mechanical Turk (MTurk) is a crowd-sourcing market that enables an experimenter to co-ordinate the use of human intelligence to perform tasks that computers are not able to do. An experimenter can post HITs (Human Intelligence Tasks), such as choosing the best among several photographs of a store-front, writing product descriptions, or identifying performers on music CDs. Workers (a.k.a. Turkers) can then select tasks to complete for a payment. [adapted from Wikipedia, the free encyclopedia; Oct. 2012]
Context | Human Re-Interpretation | Source Translation
Customer speaks to the waiter in a restaurant | I would like that well-done please. | I'd like the steak well-done - please.
Waiter speaks to a customer in a restaurant | Accept this with the complements of the house. | We serve poultry and game
Clerk speaks to a traveler at the travel counter | Is it a bar for young people? | It's a youth hostel.
Customer speaks to a clerk in a shop | I would like 12 of those and 1 of these please. | Give me two of these and one of those, please.
Traveler speaks to a clerk at the travel counter | Can you tell me when I can catch a flight? | Can I make a connection to Alicante?
Waiter speaks to a customer in a restaurant | Would you like a table for 5 people? | Do you want a table for five people?
Traveler speaks to a clerk at the travel counter | Can I make a connecting flight to Bali? | Can I make a connection to Alicante?
Guest speaks to a clerk at the hotel desk | Where can I get a bus to the beach? | Where can I get a bus to the beach?
Traveler speaks to someone on a train | Would you tell me when my stop is? | Will you tell me when to get off?
Customer speaks to a clerk in a shop | Can you repair this? | Can you get it repaired?
Customer speaks to the waiter in a restaurant | What would you suggest? | What do you recommend?
Clerk speaks to a traveler at the travel counter | It is an albino youth. | It's a youth hostel.
= =
= = = = =
Final product: MTurk judges rated 45% of the human “re-interpretations in context” of the ASRxMT material as functionally equivalent to the human-generated “source” translations.
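The arithmetic that turns this 45% figure into a functional accuracy estimate can be sketched as follows, using the 87% conventional accuracy quoted in the summary as the example input. The function name is hypothetical, and the sketch assumes the 45% recovery rate applies uniformly to all incorrect output.

```python
def functional_accuracy(conventional_accuracy, recoverable_fraction):
    """Conventional accuracy plus the share of errors that still communicate
    the speaker's intent when the listener knows the context."""
    error_rate = 1.0 - conventional_accuracy
    return conventional_accuracy + recoverable_fraction * error_rate


estimate = functional_accuracy(0.87, 0.45)
residual_error = 1.0 - estimate
print(f"functional accuracy: {estimate:.4f}")   # 0.87 + 0.45 * 0.13
print(f"residual error rate: {residual_error:.4f}")
```

With these inputs the error rate drops from 13% to about 7%, which is the sense in which context "nearly halves" the error rate.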
SUMMARY
A method was developed for estimating a functionally relevant accuracy rate suitable for use with speech-to-speech interpreter systems. The method was applied to one particular commercial system, and results are reported. Using the text output from the Jibbigo system that is normally delivered by a text-to-speech converter, we found that, given skeletal contexts {speaker, addressee, location} and no prior verbal context, about 45% of the incorrect output can be rendered by human re-interpreters into target-language text that is functionally equivalent to the correct translation. Thus, for example, an ASR x MT performance of 87% correct may reflect a functional correct rate of 93% for a language pair.
NOTES
"Overview | Requester | Amazon Mechanical Turk". Requester.mturk.com. Retrieved 2011-11-28.
Chinese source material was adapted from: http://www.tingclass.net/list-5536-1.html
The authors acknowledge generous help received from Kalen Zhang and David Rubin.