Construction and Optimization of a Question and Answer ... - Core

Construction and Optimization of a Question and Answer Database for a Real-environment Speecb-oriented Guidance System Shota TAKEUCHI， Tobias CINCAREK， Hiromichi KAWANAMI， HiroshiSARUWATARI， Kiyohiro SHlKANO GraduateSchool ofInformationScience，NaraInstitute ofScience and Technology 8916・5，Takayama-cho， Ikoma-shi，Nara，Japan， 630・0192 Phone: +81・743・72-5287 Fax: +81・743-72-5289 Ema凶il: {s油hoωtaかa-t，cin即c紅-t， kawanami凶IÍ，s鈎awa瓜ta釘剖r討i，sh悩ikanoωo}@iぬs.nai加s幻乱刈tι吋.j URLじ:h句:泊ρ/!sp凶a叫la油b.nai釘is向t句.JP

Abstract

omitted words using dialog history，釘'e necessary，

system.

dialog

spoken

For

building

techniques 釘e

robust

not

errors.

recognition

speech

against

enough

database (QADB)， which is useful for constructing a

state-of-the-art

but

In this paper， we describe a question and answer

Furthermore， for employing spoken dialog systerns

a

in a real environment， a speaker-independent model

real-environment speech-oriented dialog system，

A

real user由ta釘e required. In order to col1ect these

is used for speech reco伊ition.

data， we have been employing a spoken guidance

speech data is required to achieve high accuracy.

sytem since five years. A database of trainscriptions

Additionally， employing

and appropriate responses，etc. has been constructed

enhance the dialog control technology. However，

large amount of corpus

dialog

a

can

for 2 ye訂s of the collected utterances. Furthermore

there訂'e few corpora of spontaneous utterances

we introduce a QADB optimization method to

collected in a real environment. [1，3，4，9] In order to collect spon句neous user utterances

exclude inappropriate data. Using this method，

for constructing a dialog co中us，we developed two

response accuracy improves by 1 .6% to 75.7% absolute and by 2.9% to 59.8% about adults and

speech -oriented肝idance systerns“Takem訂u-kun"

children，respectively.

[5] and “Kita・chan." They are installed at a public station，

railway

a

and

center

Keywords: Speech DialogSystem， Example-based

community

Response Generation，Optimization，Leave-one-out

respectively. The systerns employ a one question to answer

one

Cross-validation

for

strategy

to

responding

user

utterances. Using an anirnated agent， the purpose of these systems is mainly guidance of the facilities

1 Introduction

A

and to give the user the impression that the system

spoken dialog system is a natural user interface

has a personality.

for human-machine communication using speech. and dialog con仕01 technology to reply to user

labels such as 仕anscriptions and appropriate system responses has been constructed by listening to each utterance. More accurate recognition models can be

studying spoken dialog system. There釘e various spoken

call-routing

system， for

dialog

service

[1]，

wea也er

example，

built by using出s database. Moreover， various

information

methods

伊idance [2] and multi-modal interface [3]. In a real environment， e.g. a public hall， many natural

language

expressions.

for

a

in

recognition

speech

real

environment have been developed， e.g. speech recognition with age group distinction [5]， noise

users will talk to a spoken dialog system using their own

since

operated

November 2002. A two-year database of utterance

utterances. Both technologies can be developing by kinds of

been

has

Takemaru-kun

This kind of system employs speech recognition

rejection using GÞ.伽[6]， unsupervised回ining of

Moreover，

acoustic models [7]， and recognition of preschool

utterances may be affected with background noise

children speech [8].

and reverberation. Therefore techniques to respond

The deployed speech-oriented guidance system

to various u悦rance expressions properly and to

uses example-based response generation in 出e

recognize speech in various situations accurately

dialog

are necessary for a real environment spoken dialog

control

module

to

cope

with

various

utterances and expressi�ns. This method re出eves

system. In this paper， we investigate the former

the response coπespon必ng to the example question

techniques for f1exible dialog processing of various

most similar to the reco伊ition r自由[9]. This

utterance expressions.

method can generate a response easier and faster

For building a reliable speech dialog system，

than

dialog control techniques， for example， semantic

analyzing

the

semantics

of

automatic

recognition results. Furthermore， it is sufficient for

understanding of input sentences and estimation of

the guidance task which requires an immediate

149

円ノ臼 nu nノω

response. Example-based response generation relies on a QADB (question and answer database) to generate appropriate responses. Errors in the QADB may ca田e an improper system response. Their inclusion is hard to suppress perfectly because the utterance database may contain label mistakes of human annotators. Consequently， a method to cons町uct an optimal QADB by excluding improper pairs in order to improve the response performance has been developed [10]. In this paper， we describe and evaluate the QADB optimization method. Fig. 1.

2 Speech-oriented Guidance System

Speech-oriented Guidance System

“Takemaru-kun. "

“Takemaru-kun"

Takemaru・kun is a speech-oriented guidance system which is installed at the public community center. Each response is using text-to・speech， agent animation and either floor maps or web page as visual information. Guidance is conducted by presenting bo白a voice-based and a visual response， and it looks like the agent itself is answering by showing agent animation. The system consists of a rnicrophone， one set of loud speakers and two displays on a desk. The computer hardware for perforrning speech dialog processing is connected to the Intemet through a gateway. Using this configuration， the state of the system can be supervised. All inputs including speech and noise are recorded and saved with a timestamp. Takemaru-kun does not respond to noise thanks to noise rejection [6]. There are 309 distinct responses for Takemaru-kun which c組be classified into eight categoríes: (a) Guidance at the public community center (center伊idance， 64 topics) (b) Guidance for the local釘ea and sightseeing (local guidance， 86 topics) (c) Time， weather， news (news， 8 topics) (d) Access to important web sites (web-site， 21 topics) (e) Profile ofTakemaru-kun (profile， 52 topics) (f) Greeting 企'Om Takemaru-kun (greeting， 7 topics) (g)加swer of utterances except gui伽ce request (chat， 68 topics) response. know" (h) “Sorry， don't 1 (out-of.・domain， 3 topics) Takemaru-kun has been employed since November 2002 to collect spontaneous utter組問S of real users continuously. System operation has never aborted， and the system's contents have been updated several times.

3 Takemaru-kun Speech Database

In the following， the contents of the Takemaru・kun speech database are described. 3.1 Method to Arranging the Database

The recorded speech data were transcribed and labeled by several human annotators. All data from November 2002 to October 2004 ぽe completely labeled and transcribed. Even noìse data are classified as“noise." The Collection of speech has been continued also after October 2004. Unlabeled speech data for additional 3 years is also available.. These speech data have been employed for research on acoustic modeling using unsupervised training， e.g. selective training [7]. Moreover， the system stores the log of speech reco伊ition results since the start of Takemaru-kun. Currently， the log is used for making statistics of the collected speech data. lt is not used to for the database construction， but the log information is useful for research on spoken dialog system. 3.2 Contents of Takemaru-kun Speech Database

All input data are annotated with the following labels: (a) Date: The name of the speech file. Also acts as identifiers of the collected data. (b) Gender: Speakers' gender èlassified by listening. Three classes of female， male， uncertam. (c) Age group: Speakers' age c1assified by listening. Six classes of infant (about 0・5・years-old)， lower school child (about 6・10・years-old)， higher school child (about 11・15・years-old)， adult， elder， uncertain. (d) Validity: Whether a speech is utterance to也e system， is intelligible， and has meaning. Data with background speech or howling are 150

qδ nU つ白

classified as invalid. (e) Transcription (Hiragana-Ka吋i): Transcription in Japanese Hiragana-Ka吋i writing. About noise or unnecess釘y speech， the information is

れり，SE、

noted. Transcription (Katak釘1a): Transcription with Japanese Katakana writing.

(g) Pronunciations (Katakana): Prommciation in Japanese Katakana writing. Differing from katakana Transcription w.r.t. on long vowels. (h) Appropriate

system

response:

Most

appropriate response to user input. Appending the response nurnber ofTakemaru品m system. Represents indirectly the meaning of the user utterance・ system response pair. The annotators have not witnessed the speech collection， so that classiちring the gender or age

elder1y

groups of some speech e.g. preschool children was

・・国国 L 。 3 且値切

not possible. There were many annotators. Each annotators processed distinct portions of the data. The response label is limited to on1y one system response， while some data had more than one

adulta hi，h ch.

low ch.

response label. Special tags are assi伊ed to noise and invalid

presch∞l

speech inputs. Only valid utterances (valid speech) 。也

are labeled with an appropriate response.

Statistics for the Takemaru・kun speech database are

1.

When considering the relative share of age

For

result suggests that a

reason for this dis仕ibution of the public community

however，

with

Considering

only

the

domain

children

of

data，

the

dialog

control，

and

is not required to perform in-depth

is not necessary to deal with the dialog history.Thus

system.

greetings

understanding，

language understanding and dialog con仕01 since it

“Takemaru・kun" than guidance requests， which are be

simple response

In a one-question-to・one-answer dialog system，

eight response categories is shown in Figure 2.

to

and

response generatlon.

The classification of valid speech inputs into白E

assumed

precise

dialog system has modules for speech re∞gnition， language

agent. This results in a m句ority of children inputs.

talk

reason，

of a spoken dialog system. A standard spoken

center is the system interface with an animated

to

this

There are many possible module configurations

children inputs proper1y. It is suggested 白紙出e

utterances

ロ。ut-。←domain

generation is needed for the system.

deployed system needs more efforts to respond to

more

� greeting

which gives appropriate and immediate responses.

groups among the valid speech data， the most are

are

・on-time

t": P向日le

speeches for each age group.

a spoken dialog system is cle釘.

There

・local guidance

・web-site

Fig. 2. Ratios of appropriate response type of valid

From this result， the necessity of noise rejection for

2]. This

100悟

8011

• center guidance

:: chat

One-third of the inputs are noise.

企om children [Table

60'‘

開lativ・.h.....

3.3 Database Statistics shown inTable

40覧

20也

the dialog system described in this paper generates a

to

response

Takemaru・kun 釘e relatively less出an frequently for

using

the

speech

recognition

result

other age groups.

directly.

4 Example-based Response Generation

sophisticated dialog model， the QADB (pairs of

“Ex佃1ple-based response generation" is named

interpreted as simple dialog model.

Although example

after the strategy to select a fixed response for each

this

approach

questions

and

does

not

responses)

use

. c佃

a be

user input using a set of example questions. The

4.2 Principle

advantage is也at semantic analysis is not 前回ssary.

The ex釘nple-based response generation approach can

4. 1 Proposed Approach

cope

with

various

user

expressions

by

providing a large amount of user questions for each

The user expects a田eful spoken dialog system

151

- 204-

system response.

matching words in the input and the example

for an input， it c佃 present the appropriate response

model e.g. TF・IDF. ln preliminary experiments

question. This method does not use vector space

If the system has stored an appropriate response

we

if the same input occurs. This principle is called

could show白紙a vector space model， e.g. based on

When using由is method， many QA pairs ωair

method. For calculating similarity， words which occur in

TF*IDF

example-based response generation.

of example question and proper response) have to

weights，

is

inferior

to

the

proposed

the input and the example question simultaneously

be collected to construct the QADB for system

implementation. On employment， it retrieves the

are counted. The word position is not considered.

most similar to 出e user input. The response

number of matched words by the maximum of

The similarity is calculated by dividing the

example question 企om the database， which is the

words in the input and the example.

coηesponding to the retrieved example is presented as system output. This method

We

performs only

and

retrieves

similarity;

the

therefore

example it

is

with

the

suitable

for

environment spoken dialog system.

words

real

The example-based response generation method

matching of word units.

and

and

E

the

of

words

exampJe

c爪勾，

and

the

denominator

for

|ト.1ト: number of wo地ins問s田e

Perform speech recognition to obtain plain using

1

input

1: set of input words (recognition result)，

text; generate a response using this sentence. similarity

set

E : set of words in the example question，

following characteristics:

the

the

normalization NI爪E).

for the current Takemaru-kun system has the

Calculate

the

to

question， the number of simultaneously occuring

highest a

define

co汀esponding

simil叩ty calèulation of each example for出e input

Thes幻im口mila訂nt守ys(ι7，疋ξ刀)iぬsd白ef白ine児ed as

symmetric

s(I，E)

Nearest neighbour search: Select the QA pair

c(I，E) _ 1 1 n EI =:一一一 =ーム寸丁τLア

of highest similarity.

N(I，E)

Q lI J

max l ， E )

Ex回ctQA pairs企om the utterance database.

ln Japanese， words of a sentence are not

separated

explicitly

by

spaces.

Therefore

The terms c(1，E) and N(I，勾 are symmetric:

a

therefore this similarity measure is s戸nmetric. This

pre-processing with a morphological analyzer [1 1 ]

meas町e takes a high value when there is a large

is carried out before matching the example question

number of matching words and the length of出e

with the recognition result.

example is similar to the input. The measure takes

values in the [0， 1 ] range wh巴n no word occurs

4.3 QADB Construction

twice. The range constrョint is kept by distinguishing

When constructing the QADB 企om ex佃lples of

real user utterances， the

system

can

duplicate words.

properly

4.4.2 Retrieval

re出eve 佃example for an utterance which a system

developer would not have thought of. Moreover by searching not only for exact

The

but also similar

example

with

the

highest

similarity

determined after calculating the similarity of

examples， it can select an example for an input

input

which does not appear in theQADB.

for

each

example

with

the

is

the

previousJy

described method.

Construction of the QADB is straight-forw釘d:

4.5 Response Accuracy

Ex仕action of the utterance transcription and the

appropriate system response for each utterance

For performance evaluation， the response accuracy

excluded.

responses for the test data. Correctness is defined aち follows: a system response of 3n input is correct when the system response is equal to an appropriate

企'om出e Takemaru database. Duplicate pairs are

is used. This measure is the rate of correct system

4.4 Algorithm Details The method for example-based response generation

response of the input. Response accuracy can also be interpreted

in this paper consists of a scoring and a retrieval

step.百le former calculates the simi加均of the

input with each ex創nple. The latter determines the most similar example.

5 QADBOptim包ation

4.4.1 Scoring

5.1 Purpose

ln this section we describe the similarity calculation.

The

similarity

is

the

as

the user satisfaction rate.

normalized

number

A problem specific to spoken dialog system is that

of

response performance may decrease due to speech

152

phd nU つ臼

recognition eηors. This differs from dialog systems which use text input. Moreover， the example-based

response generation method has some problems; inappropriate QA

pairs

or

pairs

with

similar

(Step 1) Cal凶late the degr'個d u5e1es5ne5s 01 a QA pairfor an u杭町田08.

example questions and different responses may

cause an mappropnate system response.

It is difficult to assign appropriate responses for

the human annotators， who are not completely

familiar with the topic domains of system responses.

Us曲目ness count

Therefore a QADB may include inappropriate data

when constructing it仕om an utterance database

(Step 2) Accumulate the U5e培5sne55 count 01 an QA pairs

which has been prepared by different annotators.

As solution to these problems， it is proposed to

exclude unsuitableQA pairs when constructing the

ロニミヰ> 1 内irC is us臨時|

QADB. This method is ca11ed QADB optirnization [10]. The goal ofQADB optimization is to improve

the response accuracy of血e system. Additiona11y，

Fig.3. QADB optimization.

it is expected 白紙 the QADB becomes more

uselessness is accumulated for a11 QA pairs. QA

compact.

pa出with a uselessness counter greater than zero

are removed. This QADB optirnization pro伺ss is

5.2 Principle

iterated several times. QA pむrs removed once釘e

The main purpose of QADB optimization is the

not added again to the QADB later.百le algorithm

detection of uselessQA pairs.

terminates if it is no longer possible to remove QA

If the response accuracy of utterances improves

palrs.

by excluding aQA pair企om aQADB， theQA pair is considered as useless.

5.4 Experimental Results

For QADB optirnization， it is important to

We evaluate the response accuracy irnprovement

distinguish the role of utterances合om that of QA

for the test utterance data and discuss the effect of

pairs， although they are of the same origin. The

excluding

utterance plays the role of an evaluator for the QA palrs.

The optirnization method is based on a so-ca11ed

回目

leave40ne-out

Test and training data are disjoint. Experiments

cross

with adult and child data釘e conducted separately.

validationl in order to use the whole training data

A11 utterances 釘e reco伊包ed by large vocabulary

for evaluation.

continuous speech recognizer using an N-gram

language model and a context-dependent acoustic model [12].百le recognition accuracy is shown in

5.3 Algorithm Uselessness of a QA pair w.r.t.

a validation

Table4.

utter叩ce is determined by using a tempor釘y

5.4.2 Results

QADB constructed企om the training data excluding

the validation utterance. If the number of correct

The experimental results are shown in Figure4. The

responses increases， the currently considered QA pair

is

considered

by

Qnly valid utterances are employed asなaining data.

of excluding more than one pair is not considered. also

responses

The data used in experiments are shown in Table 3.

only oneQA pair is excluded tempora11y. The effect method

system

5.4.1 Conditions

greedy algorithm to reduce the calculation amount;

The

inappropriate

QADB optimization.

as

useless

(increment

solid line shows the response accuracy阻d也e

a

dotted line shows the distinct number ofQA pairs in

uselessness counter for QA pair) otherwise it is

the opt也ùzedQADB.

considered as useful (d田rement counter for QA

The response accuracy in case of speech input

pair).

improves with optimization by 1.6% to 75.7%

After this evaluation for a11 data， the degree of

absolute for ad叫ts and by 2.9% to 59.8% for

children. Additionally， convergence of the QADB

I

size can be observed. The result for children data

A variation of cross-validation. Split training data

shows that the response accuracy improves with

to one sample and the rest and evaluate the model

one and two iterations of the QADB optirnization

with the remaining for the s叩lple. Accumulate the

algorithm although speech recognition is low.

evaluation results for a11 data to determinate the fmal evaluation result.

153

- 206-

Table 3 S

stI

We have employed a speech-oriented guidance system and coUected spontaneous speech data for five years in order to enhance speech recognition and dialog control technology. A speech database is constructed企om these utterances of two years. We described a QADB optimization method using也is database; The method cons釘ucts a QADB by excluding inappropriate data effectively. The response accuracy for speech input improves with the optimization by 1 .6% to 75.7% absolute for adults and by 2.9% to 59.8% for children.

test data

period

7

fd

# of data period

training data

LH t

6 ConcIusions

# of data

Aug.2003 adult 1053 child 6543 ー Nov. 2002 Oct.2004 (excluding test data) adult 19383 child 79346

distinct QA pairs

evaluation

Acknowledgement

adult 6345 30669 child comparing system response with single proper response of test data

This work is partly supported by MEXT leading project“e-society". 8

References

adult child

[1 ) A.L.Gorin， G. Riccardi， J.H.Wright， “How may 1 help you?，" Speech Communication， vo1.23，pp.113・127， 1997. (2) Victor Zue， Stephanie Seneff， James R. Glass， Joseph Poli台。ni， Christine

Pao， Timothy

J.

Hazen

and

Lee

Hetherington， “JUPITER: A Telephone-8ased Co但versational lnterface for Weather lnformation，" ffiEE Transaction on

76.0

[4) Nobuo Kawaguchi，Shi伊ki Matsubara， Yukiko Yamaguchi， Kazuya Taked閣， FUI町mit飽ada ltakura丸， Databaωse、:，"inP桁r。侃cof lCSLP20∞04，FrA2702p，2004 (5) Ryuichi Nisimura， Akinobu Lee， Hiroshi S訓wa削， Kiyohiro Shikano，“Public Speech-Oriented Guidance System with Adult and Child Discrimination Capability.... Proc. of lCASSP2004， vol. l ，pp.433-436， 2004.

72.0

(6) Akinobu Lee， Keisukc Nakamura， Ryuichi Nisimura，

• • • • • • . … 迫一 ::・ ' 一・ 1

“THE AUGUST SPOKEN DIALOG SYSTEM，" Proc. of EUROSPEECH'99， vo1.3， pp.l l 5 1 ・ 1 154， 1999.

ーー+ー-RA% ...・...揮。f

4

o

Kiyohiro Shikano，“Acoustic Mod沼ling for Spoken Dialogue

[8) Tobias Cincarek， lzumi Shindゆ， Tomoki Toda， Hiroshi Saruwatari， Kiyohiro Shikano，“Deve10pment of Preschool Children Subsys阻m for ASR and QA in a Real-Environment

2007，pp. 1469・1472，2007. [9) Hiroya Murao， Nobuo Kawaguchi， Shigeki Matsubara，

55.0

Takeda， Yasuyoshi Inagaki，

“Examp1e-based Spoken Dia10g System with On1ine Examp1e Augmentation，" Proc. of ICSLP2004， Spe併402p-7， pp.3073--3076，2004 [10) Manabu Kida，“Optimal Design of the Qu回tion-Answer Databaseおr a Spee唱h Guidance System，" mぉter thes時， [ 1 lJ!ltto:/Isourccfoll!e.iDfDroicctslchascn-lcl1:acy，f. (12) Akinobu L民Ta凶ya Kawahara， Kiyohiro Shikano， Open Source Real-Time Large Vocabu1ary 泊

Proc.

EUROSPEECH

30 1ー+-RA%

…且…町一一・.

O

...・・・・#

一|

of

aA P8irs

I

i 29

H

28

...__.P.由一ー一一-一一一回ー一一一 . ・ 9・ー・ -・・- _ - ..--- --� -- - -・ i 27 4

6

iter.tion e副nt

10

26

Response accuracy improvement by optímizatíon. Fig. 4.

NAlST，2007. (in Japanese)

Engine，"

31

.� 56.5 a: 56.0 55.5

Speech-oriented Guidance Task"， in Proc. of EUROSPEECH

an

5.7

200 1，

pp.16θ1・1694，200 1 .

154

志穂官aw c o u a EE 】 MA F tD A

60.0 59.5 � 59.0 舌58.5 '" 58.0 o < 57.5 .� 57.0

Systcms 8ased 0岨Unsupervised Utterance-based Selective Training，" lCSLP2006，pp. 1722・1725，2006.

b∞，gnition

10

8

Child. speech

of Unintended lnputs，" Proc. of lCSLP2004， TuA1302p・2， Vol.l，pp. 173ー176，2004. [7) Tobias Cincarek， Tomoki Toda， Hiroshi Saruwatari，

“Ju1ius ---

aA pairs

曲rati開C仙nt

Hiroshi Saruwatari， Kiyohiro Shikano， "Noice Robust Real World Spo占en Dialoguc System using GMM 8ased Rejection

Yukiko Y釘naguchi， Kazuya

{ " o p xv EE a 古C 沼・宅古事 8 3 2 6 6 1 6 D 6 9 5 5

6.4

RJM nun ‘dのuM RdnMの4 aJm za ra a勾 ae 勾a 句、" 、，司，『，『，、，、，、， E ba〈・曽 ω ω mo 、m 安}

Speech and Audiゆ Processing，vo1.8， no.I，pp.85・96，2∞O. [3) Jo依im Gustafson， Nikol吋Lindberg， Magnus Lundeberg，

， n， nU つ白