164th Meeting: Acoustical Society of America, Kansas City, 25 October 2012

4pSC17  

 


Evaluating an Automatic Speech-to-Speech Interpreter in Context


Jared C. Bernstein, Linguistics, Stanford University
Elizabeth P. Rosenfeld, Tasso Partners, Palo Alto, California

GOAL

Estimate functional accuracy in context for a speech-to-speech interpreter.

INTRODUCTION  

A naïve speech-to-speech interpreter can be implemented as three component processes in series: speech recognition, machine translation, and speech synthesis. However, a fair evaluation of a speech-to-speech interpreting system may differ from simply calculating the product of the accuracies of those three components. Human users are sensitive to the rate at which an interpreter operates, and a system’s communication success rate is properly measured by a listener’s correct understanding of the speaker’s intention in a particular context rather than by the system’s word-for-word accuracy and intelligibility. This poster looks at the effect of context. We evaluate one currently available system (Jibbigo), operating in one direction for two language pairs: Spanish-to-English and Chinese-to-English. For each language pair, L2 output that is incorrect by conventional accuracy measurement is re-evaluated with respect to its potential for communicating speaker intent in context. Results for Spanish and Chinese as L1 (source language) going into English as L2 (target language) suggest that conventional accuracy (in quiet) is surprisingly high, at about 70-90% correct, and that a more appropriate functional measure such as communication in context nearly halves the error rate.
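For reference, the component-product estimate that the introduction contrasts with functional evaluation can be sketched in a few lines. The stage accuracies below are hypothetical placeholders chosen for illustration, not values measured in this study, and the function name is our own.

```python
from math import prod

def series_accuracy(stage_accuracies):
    """Naive end-to-end estimate: multiply the accuracies of stages run in series."""
    return prod(stage_accuracies)

# Hypothetical per-stage accuracies (illustrative only, not from the poster).
stages = {
    "speech recognition": 0.95,
    "machine translation": 0.90,
    "speech synthesis": 0.99,
}

naive = series_accuracy(stages.values())
print(f"Naive series estimate: {naive:.3f}")
```

This estimate ignores context entirely, which is the gap the method on this poster is meant to address.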

 

NEXT PROBLEM

1. Conventional accuracy measures underestimate the functional accuracy of noisy communication.
2. Automatic voice-in, voice-out translation is emerging, but its functional accuracy is unknown and unreported.
3. Functional accuracy is not well understood.
4. We present a method for estimating functional accuracy, working with actual translations produced by ASR x MT (automatic speech recognition passed to machine translation).

2011 EXPERIMENT

At the 161st ASA Meeting (2011, Seattle, paper 5pSC6), the authors presented a first report on “Measuring linguistic performance of a hand-held speech-to-speech interpreter”. We assumed that all experimental utterances were intelligible and that the synthetic speech was also intelligible. Furthermore, no tests involved repair, accommodation, interviews, long turns, role plays, or ratings of output text in context. The results were quite good.

EXAMPLE DATA: Sp → (ASR x MT) → Engl

Intended Utterance | Recognized String | Machine Translation
¿Cuál es el número de mi habitación? | Cuál es el número de mi habitación? | What is the number of my room?
Quisiera alquilar un coche pequeño. | Quisiera alquilar un coche pequeño. | I would like to rent a small car.
Quiero marcharme cuanto antes. | Quiero marcharme cuanto antes. | I want to leave as soon as possible.
Me gustaría una botella de vino tinto. | Me gustaría una botella de vino tinto. | I would like a bottle of red wine.
Quisiera pescado con papas hervidas. | Quisiera pescado con papas hervidas. | I would like fish with boiled potatoes.
¿Me puede alcanzar una servilleta, por favor? | Me puede alcanzar una servilleta por favor? | Can you get me a napkin please?
Un salón de té es a veces, un poco exclusivo. | Un salón de té sabes es un poco exclusivo. | A tearoom know it's a little sole.
Tráigame otra botella de vino blanco, por favor. | Traigame otra botella de vino blanco por favor. | Please bring me another bottle of white wine, please.
¿Puede usted enseñarme una habitación matrimonial, por favor? | Puede usted enseñarme una habitación matrimonial por favor? | Can you show me one twin room please?
Quiero marcharme cuanto antes. | Quiero marcharme cuanto antes. | I want to leave as soon as possible.

2012 EXPERIMENT

Materials: The 197 Spanish source texts came from a Berlitz phrase book for travelers, along with reference translations in English. The 75 Chinese texts came from a Chinese website for travelers; their reference translations were adapted from those on the site. Longer utterances with clause structure were selected, along with some elliptical phrases.

Participants: Six Spanish speakers and two Chinese speakers. Both Chinese talkers were highly educated; four Spanish talkers were highly educated, and two were just literate. 39 Mechanical Turk interpreters; 17 Mechanical Turk judges (in US).

Apparatus and Procedure: Sessions were conducted in whatever quiet location was convenient for the participants, mostly home or office. L1 speakers practiced with Jibbigo running on an iPhone 4 for about 5 minutes, or until they could produce 5 perfect ASR transcriptions in a row. Participants were instructed to speak so that the device performed at its best. The Spanish speakers were permitted to try a second rendition of a text, as necessary.

Generic contexts were constructed from appropriate combinations of these elements:

Speaker: Customer, Traveler, Clerk, Waiter, Guest, Officer
Addressee: Customer, Traveler, Clerk, Waiter, Someone, Traveler, Officer, Guest
Location: at the hotel desk; at the travel counter; on a train; in the street; in a restaurant; at Customs; in a shop

Amazon Mechanical Turk (MTurk) is a crowd-sourcing market that enables an experimenter to coordinate the use of human intelligence to perform tasks that computers are not able to do. An experimenter can post HITs (Human Intelligence Tasks), such as choosing the best among several photographs of a store-front, writing product descriptions, or identifying performers on music CDs. Workers (a.k.a. Turkers) can then select tasks to complete for a payment. [adapted from Wikipedia, the free encyclopedia; Oct. 2012]

RESULTS

Re-Interpreting Task: EXAMPLE DATA

Context | Machine Translation | Re-Interpreted
A waiter speaks to a customer at a restaurant. | May I take? | May I take your order?
A waiter speaks to a customer at a restaurant. | May I take? | May I take this for you?
An officer speaks to a traveler at customs. | The purpose of your visit you. | What is the purpose of your visit?
An officer speaks to a traveler at customs. | Democratic Party. | How many in your party?
Customer speaks to the waiter in a restaurant | I love the shrimp Scampi to the aisle. | I'd like the shrimp scampi on the side.
A guest speaks to a clerk at the hotel desk. | I live in - please. | Can I check in, please?
A guest speaks to a clerk at the hotel desk. | Could you call me a taxi for one? | Could you call a Taxi for me?
A guest speaks to a clerk at the hotel desk. | Could you call me a taxi for one? | Could you call me a taxi for 1PM?
Customer speaks to a clerk in a shop | Is a portion of this cake takeout please? | Could I have a piece of this cake to go?
Waiter speaks to a customer in a restaurant | Serve birds meat and hunting. | We serve game meat here, such as birds.
Guest speaks to a clerk at the hotel desk | I take bus should I take to the University. | Which bus should I take to get to the University?
A traveler speaks to a clerk at customs. | Tour. | I'm here as a tourist.
Clerk speaks to a traveler at the travel counter | It's a it varies to the youth. | There is a different cost for those under 18.

Intermediate product: MTurk workers produced ~750 re-interpretations of 224 incorrect ASR x MT translations from Spanish and Chinese spoken material. These were set up in another MTurk task, shown below, and each was rated by several Turkers.

Equivalence Judgement Task: EXAMPLE DATA

Context | Human Re-Interpretation | Source Translation
Customer speaks to the waiter in a restaurant | I would like that well-done please. | I'd like the steak well-done - please.
Waiter speaks to a customer in a restaurant | Accept this with the complements of the house. | We serve poultry and game
Clerk speaks to a traveler at the travel counter | Is it a bar for young people? | It's a youth hostel.
Customer speaks to a clerk in a shop | I would like 12 of those and 1 of these please. | Give me two of these and one of those, please.
Traveler speaks to a clerk at the travel counter | Can you tell me when I can catch a flight? | Can I make a connection to Alicante?
Waiter speaks to a customer in a restaurant | Would you like a table for 5 people? | Do you want a table for five people?
Traveler speaks to a clerk at the travel counter | Can I make a connecting flight to Bali? | Can I make a connection to Alicante?
Guest speaks to a clerk at the hotel desk | Where can I get a bus to the beach? | Where can I get a bus to the beach?
Traveler speaks to someone on a train | Would you tell me when my stop is? | Will you tell me when to get off?
Customer speaks to a clerk in a shop | Can you repair this? | Can you get it repaired?
Customer speaks to the waiter in a restaurant | What would you suggest? | What do you recommend?
Clerk speaks to a traveler at the travel counter | It is an albino youth. | It's a youth hostel.

Seven of these twelve example pairs are marked "=" on the poster; which rows carry the mark is not recoverable here.

Final product: MTurk judges rated 45% of the human “re-interpretations in context” of the ASR x MT material as functionally equivalent to human-generated “source” translations.
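The 45% figure can be combined with a conventional accuracy score by simple arithmetic. The sketch below assumes, as one reading of the result rather than a formula stated on the poster, that functional accuracy is the conventional accuracy plus the share of errors that listeners can recover in context. The function and constant names are illustrative.

```python
def functional_accuracy(conventional_accuracy, recovery_rate):
    """Conventional accuracy plus the fraction of errors recovered in context.

    Assumption: recovered errors count as fully correct, and the recovery
    rate applies uniformly to all incorrect outputs.
    """
    error_rate = 1.0 - conventional_accuracy
    return conventional_accuracy + recovery_rate * error_rate

RECOVERY_RATE = 0.45  # share of incorrect outputs judged functionally equivalent

conventional = 0.87   # example ASR x MT accuracy used in the summary
estimate = functional_accuracy(conventional, RECOVERY_RATE)
residual_error = 1.0 - estimate

print(f"Functional accuracy: {estimate:.0%}")
print(f"Error rate: {1.0 - conventional:.1%} conventional, {residual_error:.2%} functional")
```

With 87% conventional accuracy this gives roughly 93%, and the error rate falls from 13% to about 7%, which is consistent with the statement that a functional measure nearly halves the error rate.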

SUMMARY

A method was developed for estimating a functionally relevant accuracy rate suitable for use with speech-to-speech interpreter systems. The method was applied to one particular commercial system and results are reported. Using the text output from the Jibbigo system that is normally delivered by a text-to-speech converter, we found that, given skeletal contexts {speaker, addressee, location} and no prior verbal context, about 45% of the incorrect output can be rendered into target-language (L2) text that is functionally equivalent to the correct translation. Thus, for example, an ASR x MT performance of 87% correct may reflect a functional correct rate of 93% for a language pair.

NOTES

"Overview | Requester | Amazon Mechanical Turk". Requester.mturk.com. Retrieved 2011-11-28.
Chinese source material was adapted from: http://www.tingclass.net/list-5536-1.html
The authors acknowledge generous help received from Kalen Zhang and David Rubin.