Construction and Optimization of a Question and Answer Database for a. Real-environment Speecb-oriented Guidance System. Shota TAKEUCHIï¼ ... ä¼idance [2] and multi-modal interface [3]. In a real .... in Japanese Hiragana-Kaåi writing. About noise or ..... World Spoå en Dialoguc System using GMM 8ased Rejection.
Construction and Optimization of a Question and Answer Database for a Real-environment Speecb-oriented Guidance System Shota TAKEUCHI, Tobias CINCAREK, Hiromichi KAWANAMI, HiroshiSARUWATARI, Kiyohiro SHlKANO GraduateSchool ofInformationScience,NaraInstitute ofScience and Technology 8916・5,Takayama-cho, Ikoma-shi,Nara,Japan, 630・0192 Phone: +81・743・72-5287 Fax: +81・743-72-5289 Ema凶il: {s油hoωtaかa-t,cin即c紅-t, kawanami凶IÍ,s鈎awa瓜ta釘剖r討i,sh悩ikanoωo}@iぬs.nai加s幻乱刈tι吋.j URLじ:h句:泊ρ/!sp凶a叫la油b.nai釘is向t句.JP
Abstract
omitted words using dialog history,釘'e necessary,
system.
dialog
spoken
For
building
techniques 釘e
robust
not
errors.
recognition
speech
against
enough
database (QADB), which is useful for constructing a
state-of-the-art
but
In this paper, we describe a question and answer
Furthermore, for employing spoken dialog systerns
a
in a real environment, a speaker-independent model
real-environment speech-oriented dialog system,
A
real user由ta釘e required. In order to col1ect these
is used for speech reco伊ition.
data, we have been employing a spoken guidance
speech data is required to achieve high accuracy.
sytem since five years. A database of trainscriptions
Additionally, employing
and appropriate responses,etc. has been constructed
enhance the dialog control technology. However,
large amount of corpus
dialog
a
can
for 2 ye訂s of the collected utterances. Furthermore
there訂'e few corpora of spontaneous utterances
we introduce a QADB optimization method to
collected in a real environment. [1,3,4,9] In order to collect spon句neous user utterances
exclude inappropriate data. Using this method,
for constructing a dialog co中us,we developed two
response accuracy improves by 1 .6% to 75.7% absolute and by 2.9% to 59.8% about adults and
speech -oriented肝idance systerns“Takem訂u-kun"
children,respectively.
[5] and “Kita・chan." They are installed at a public station,
railway
a
and
center
Keywords: Speech DialogSystem, Example-based
community
Response Generation,Optimization,Leave-one-out
respectively. The systerns employ a one question to answer
one
Cross-validation
for
strategy
to
responding
user
utterances. Using an anirnated agent, the purpose of these systems is mainly guidance of the facilities
1 Introduction
A
and to give the user the impression that the system
spoken dialog system is a natural user interface
has a personality.
for human-machine communication using speech. and dialog con仕01 technology to reply to user
labels such as 仕anscriptions and appropriate system responses has been constructed by listening to each utterance. More accurate recognition models can be
studying spoken dialog system. There釘e various spoken
call-routing
system, for
dialog
service
[1],
wea也er
example,
built by using出s database. Moreover, various
information
methods
伊idance [2] and multi-modal interface [3]. In a real environment, e.g. a public hall, many natural
language
expressions.
for
a
in
recognition
speech
real
environment have been developed, e.g. speech recognition with age group distinction [5], noise
users will talk to a spoken dialog system using their own
since
operated
November 2002. A two-year database of utterance
utterances. Both technologies can be developing by kinds of
been
has
Takemaru-kun
This kind of system employs speech recognition
rejection using GÞ.伽[6], unsupervised回ining of
Moreover,
acoustic models [7], and recognition of preschool
utterances may be affected with background noise
children speech [8].
and reverberation. Therefore techniques to respond
The deployed speech-oriented guidance system
to various u悦rance expressions properly and to
uses example-based response generation in 出e
recognize speech in various situations accurately
dialog
are necessary for a real environment spoken dialog
control
module
to
cope
with
various
utterances and expressi�ns. This method re出eves
system. In this paper, we investigate the former
the response coπespon必ng to the example question
techniques for f1exible dialog processing of various
most similar to the reco伊ition r自由[9]. This
utterance expressions.
method can generate a response easier and faster
For building a reliable speech dialog system,
than
dialog control techniques, for example, semantic
analyzing
the
semantics
of
automatic
recognition results. Furthermore, it is sufficient for
understanding of input sentences and estimation of
the guidance task which requires an immediate
149
円ノ臼 nu nノω
response. Example-based response generation relies on a QADB (question and answer database) to generate appropriate responses. Errors in the QADB may ca田e an improper system response. Their inclusion is hard to suppress perfectly because the utterance database may contain label mistakes of human annotators. Consequently, a method to cons町uct an optimal QADB by excluding improper pairs in order to improve the response performance has been developed [10]. In this paper, we describe and evaluate the QADB optimization method. Fig. 1.
2 Speech-oriented Guidance System
Speech-oriented Guidance System
“Takemaru-kun. "
“Takemaru-kun"
Takemaru・kun is a speech-oriented guidance system which is installed at the public community center. Each response is using text-to・speech, agent animation and either floor maps or web page as visual information. Guidance is conducted by presenting bo白a voice-based and a visual response, and it looks like the agent itself is answering by showing agent animation. The system consists of a rnicrophone, one set of loud speakers and two displays on a desk. The computer hardware for perforrning speech dialog processing is connected to the Intemet through a gateway. Using this configuration, the state of the system can be supervised. All inputs including speech and noise are recorded and saved with a timestamp. Takemaru-kun does not respond to noise thanks to noise rejection [6]. There are 309 distinct responses for Takemaru-kun which c組be classified into eight categoríes: (a) Guidance at the public community center (center伊idance, 64 topics) (b) Guidance for the local釘ea and sightseeing (local guidance, 86 topics) (c) Time, weather, news (news, 8 topics) (d) Access to important web sites (web-site, 21 topics) (e) Profile ofTakemaru-kun (profile, 52 topics) (f) Greeting 企'Om Takemaru-kun (greeting, 7 topics) (g)加swer of utterances except gui伽ce request (chat, 68 topics) response. know" (h) “Sorry, don't 1 (out-of.・domain, 3 topics) Takemaru-kun has been employed since November 2002 to collect spontaneous utter組問S of real users continuously. System operation has never aborted, and the system's contents have been updated several times.
3 Takemaru-kun Speech Database
In the following, the contents of the Takemaru・kun speech database are described. 3.1 Method to Arranging the Database
The recorded speech data were transcribed and labeled by several human annotators. All data from November 2002 to October 2004 ぽe completely labeled and transcribed. Even noìse data are classified as“noise." The Collection of speech has been continued also after October 2004. Unlabeled speech data for additional 3 years is also available.. These speech data have been employed for research on acoustic modeling using unsupervised training, e.g. selective training [7]. Moreover, the system stores the log of speech reco伊ition results since the start of Takemaru-kun. Currently, the log is used for making statistics of the collected speech data. lt is not used to for the database construction, but the log information is useful for research on spoken dialog system. 3.2 Contents of Takemaru-kun Speech Database
All input data are annotated with the following labels: (a) Date: The name of the speech file. Also acts as identifiers of the collected data. (b) Gender: Speakers' gender èlassified by listening. Three classes of female, male, uncertam. (c) Age group: Speakers' age c1assified by listening. Six classes of infant (about 0・5・years-old), lower school child (about 6・10・years-old), higher school child (about 11・15・years-old), adult, elder, uncertain. (d) Validity: Whether a speech is utterance to也e system, is intelligible, and has meaning. Data with background speech or howling are 150
qδ nU つ白
classified as invalid. (e) Transcription (Hiragana-Ka吋i): Transcription in Japanese Hiragana-Ka吋i writing. About noise or unnecess釘y speech, the information is
れり ,SE、
noted. Transcription (Katak釘1a): Transcription with Japanese Katakana writing.
(g) Pronunciations (Katakana): Prommciation in Japanese Katakana writing. Differing from katakana Transcription w.r.t. on long vowels. (h) Appropriate
system
response:
Most
appropriate response to user input. Appending the response nurnber ofTakemaru品m system. Represents indirectly the meaning of the user utterance・ system response pair. The annotators have not witnessed the speech collection, so that classiちring the gender or age
elder1y
groups of some speech e.g. preschool children was
・ ・国 国 L 。 3 且 値切
not possible. There were many annotators. Each annotators processed distinct portions of the data. The response label is limited to on1y one system response, while some data had more than one
adulta hi,h ch.
low ch.
response label. Special tags are assi伊ed to noise and invalid
presch∞l
speech inputs. Only valid utterances (valid speech) 。也
are labeled with an appropriate response.
Statistics for the Takemaru・kun speech database are
1.
When considering the relative share of age
For
result suggests that a
reason for this dis仕ibution of the public community
however,
with
Considering
only
the
domain
children
of
data,
the
dialog
control,
and
is not required to perform in-depth
is not necessary to deal with the dialog history.Thus
system.
greetings
understanding,
language understanding and dialog con仕01 since it
“Takemaru・kun" than guidance requests, which are be
simple response
In a one-question-to・one-answer dialog system,
eight response categories is shown in Figure 2.
to
and
response generatlon.
The classification of valid speech inputs into白E
assumed
precise
dialog system has modules for speech re∞gnition, language
agent. This results in a m句ority of children inputs.
talk
reason,
of a spoken dialog system. A standard spoken
center is the system interface with an animated
to
this
There are many possible module configurations
children inputs proper1y. It is suggested 白紙出e
utterances
ロ。ut-。←domain
generation is needed for the system.
deployed system needs more efforts to respond to
more
� greeting
which gives appropriate and immediate responses.
groups among the valid speech data, the most are
are
・on-time
t": P向日le
speeches for each age group.
a spoken dialog system is cle釘.
There
・local guidance
・web-site
Fig. 2. Ratios of appropriate response type of valid
From this result, the necessity of noise rejection for
2]. This
100悟
8011
• center guidance
:: chat
One-third of the inputs are noise.
企om children [Table
60'‘
開lativ・.h.....
3.3 Database Statistics shown inTable
40覧
20也
the dialog system described in this paper generates a
to
response
Takemaru・kun 釘e relatively less出an frequently for
using
the
speech
recognition
result
other age groups.
directly.
4 Example-based Response Generation
sophisticated dialog model, the QADB (pairs of
“Ex佃1ple-based response generation" is named
interpreted as simple dialog model.
Although example
after the strategy to select a fixed response for each
this
approach
questions
and
does
not
responses)
use
. c佃
a be
user input using a set of example questions. The
4.2 Principle
advantage is也at semantic analysis is not 前回ssary.
The ex釘nple-based response generation approach can
4. 1 Proposed Approach
cope
with
various
user
expressions
by
providing a large amount of user questions for each
The user expects a田eful spoken dialog system
151
- 204-
system response.
matching words in the input and the example
for an input, it c佃 present the appropriate response
model e.g. TF・IDF. ln preliminary experiments
question. This method does not use vector space
If the system has stored an appropriate response
we
if the same input occurs. This principle is called
could show白紙a vector space model, e.g. based on
When using由is method, many QA pairs ωair
method. For calculating similarity, words which occur in
TF*IDF
example-based response generation.
of example question and proper response) have to
weights,
is
inferior
to
the
proposed
the input and the example question simultaneously
be collected to construct the QADB for system
implementation. On employment, it retrieves the
are counted. The word position is not considered.
most similar to 出e user input. The response
number of matched words by the maximum of
The similarity is calculated by dividing the
example question 企om the database, which is the
words in the input and the example.
coηesponding to the retrieved example is presented as system output. This method
We
performs only
and
retrieves
similarity;
the
therefore
example it
is
with
the
suitable
for
environment spoken dialog system.
words
real
The example-based response generation method
matching of word units.
and
and
E
the
of
words
exampJe
c爪勾,
and
the
denominator
for
|ト.1ト: number of wo地ins問s田e
Perform speech recognition to obtain plain using
1
input
1: set of input words (recognition result),
text; generate a response using this sentence. similarity
set
E : set of words in the example question,
following characteristics:
the
the
normalization NI爪E).
for the current Takemaru-kun system has the
Calculate
the
to
question, the number of simultaneously occuring
highest a
define
co汀esponding
simil叩ty calèulation of each example for出e input
Thes幻im口mila訂nt守ys(ι7,疋ξ刀)iぬsd白ef白ine児ed as
symmetric
s(I,E)
Nearest neighbour search: Select the QA pair
c(I,E) _ 1 1 n EI =:一一一 =ーム寸丁τLア
of highest similarity.
N(I,E)
Q lI J
max l , E )
Ex回ctQA pairs企om the utterance database.
ln Japanese, words of a sentence are not
separated
explicitly
by
spaces.
Therefore
The terms c(1,E) and N(I,勾 are symmetric:
a
therefore this similarity measure is s戸nmetric. This
pre-processing with a morphological analyzer [1 1 ]
meas町e takes a high value when there is a large
is carried out before matching the example question
number of matching words and the length of出e
with the recognition result.
example is similar to the input. The measure takes
values in the [0, 1 ] range wh巴n no word occurs
4.3 QADB Construction
twice. The range constrョint is kept by distinguishing
When constructing the QADB 企om ex佃lples of
real user utterances, the
system
can
duplicate words.
properly
4.4.2 Retrieval
re出eve 佃example for an utterance which a system
developer would not have thought of. Moreover by searching not only for exact
The
but also similar
example
with
the
highest
similarity
determined after calculating the similarity of
examples, it can select an example for an input
input
which does not appear in theQADB.
for
each
example
with
the
is
the
previousJy
described method.
Construction of the QADB is straight-forw釘d:
4.5 Response Accuracy
Ex仕action of the utterance transcription and the
appropriate system response for each utterance
For performance evaluation, the response accuracy
excluded.
responses for the test data. Correctness is defined aち follows: a system response of 3n input is correct when the system response is equal to an appropriate
企'om出e Takemaru database. Duplicate pairs are
is used. This measure is the rate of correct system
4.4 Algorithm Details The method for example-based response generation
response of the input. Response accuracy can also be interpreted
in this paper consists of a scoring and a retrieval
step.百le former calculates the simi加均of the
input with each ex創nple. The latter determines the most similar example.
5 QADBOptim包ation
4.4.1 Scoring
5.1 Purpose
ln this section we describe the similarity calculation.
The
similarity
is
the
as
the user satisfaction rate.
normalized
number
A problem specific to spoken dialog system is that
of
response performance may decrease due to speech
152
phd nU つ臼
recognition eηors. This differs from dialog systems which use text input. Moreover, the example-based
response generation method has some problems; inappropriate QA
pairs
or
pairs
with
similar
(Step 1) Cal凶late the degr'個d u5e1es5ne5s 01 a QA pairfor an u杭町田08.
example questions and different responses may
cause an mappropnate system response.
It is difficult to assign appropriate responses for
the human annotators, who are not completely
familiar with the topic domains of system responses.
Us曲目ness count
Therefore a QADB may include inappropriate data
when constructing it仕om an utterance database
(Step 2) Accumulate the U5e培5sne55 count 01 an QA pairs
which has been prepared by different annotators.
As solution to these problems, it is proposed to
exclude unsuitableQA pairs when constructing the
ロニミヰ> 1 内irC is us臨時|
QADB. This method is ca11ed QADB optirnization [10]. The goal ofQADB optimization is to improve
the response accuracy of血e system. Additiona11y,
Fig.3. QADB optimization.
it is expected 白紙 the QADB becomes more
uselessness is accumulated for a11 QA pairs. QA
compact.
pa出with a uselessness counter greater than zero
are removed. This QADB optirnization pro伺ss is
5.2 Principle
iterated several times. QA pむrs removed once釘e
The main purpose of QADB optimization is the
not added again to the QADB later.百le algorithm
detection of uselessQA pairs.
terminates if it is no longer possible to remove QA
If the response accuracy of utterances improves
palrs.
by excluding aQA pair企om aQADB, theQA pair is considered as useless.
5.4 Experimental Results
For QADB optirnization, it is important to
We evaluate the response accuracy irnprovement
distinguish the role of utterances合om that of QA
for the test utterance data and discuss the effect of
pairs, although they are of the same origin. The
excluding
utterance plays the role of an evaluator for the QA palrs.
The optirnization method is based on a so-ca11ed
回目
leave40ne-out
Test and training data are disjoint. Experiments
cross
with adult and child data釘e conducted separately.
validationl in order to use the whole training data
A11 utterances 釘e reco伊包ed by large vocabulary
for evaluation.
continuous speech recognizer using an N-gram
language model and a context-dependent acoustic model [12].百le recognition accuracy is shown in
5.3 Algorithm Uselessness of a QA pair w.r.t.
a validation
Table4.
utter叩ce is determined by using a tempor釘y
5.4.2 Results
QADB constructed企om the training data excluding
the validation utterance. If the number of correct
The experimental results are shown in Figure4. The
responses increases, the currently considered QA pair
is
considered
by
Qnly valid utterances are employed asなaining data.
of excluding more than one pair is not considered. also
responses
The data used in experiments are shown in Table 3.
only oneQA pair is excluded tempora11y. The effect method
system
5.4.1 Conditions
greedy algorithm to reduce the calculation amount;
The
inappropriate
QADB optimization.
as
useless
(increment
solid line shows the response accuracy阻d也e
a
dotted line shows the distinct number ofQA pairs in
uselessness counter for QA pair) otherwise it is
the opt也ùzedQADB.
considered as useful (d田rement counter for QA
The response accuracy in case of speech input
pair).
improves with optimization by 1.6% to 75.7%
After this evaluation for a11 data, the degree of
absolute for ad叫ts and by 2.9% to 59.8% for
children. Additionally, convergence of the QADB
I
size can be observed. The result for children data
A variation of cross-validation. Split training data
shows that the response accuracy improves with
to one sample and the rest and evaluate the model
one and two iterations of the QADB optirnization
with the remaining for the s叩lple. Accumulate the
algorithm although speech recognition is low.
evaluation results for a11 data to determinate the fmal evaluation result.
153
- 206-
Table 3 S
stI
We have employed a speech-oriented guidance system and coUected spontaneous speech data for five years in order to enhance speech recognition and dialog control technology. A speech database is constructed企om these utterances of two years. We described a QADB optimization method using也is database; The method cons釘ucts a QADB by excluding inappropriate data effectively. The response accuracy for speech input improves with the optimization by 1 .6% to 75.7% absolute for adults and by 2.9% to 59.8% for children.
adult 6345 30669 child comparing system response with single proper response of test data
This work is partly supported by MEXT leading project“e-society". 8
References
adult child
[1 ) A.L.Gorin, G. Riccardi, J.H.Wright, “How may 1 help you?," Speech Communication, vo1.23,pp.113・127, 1997. (2) Victor Zue, Stephanie Seneff, James R. Glass, Joseph Poli台。ni, Christine
Pao, Timothy
J.
Hazen
and
Lee
Hetherington, “JUPITER: A Telephone-8ased Co但versational lnterface for Weather lnformation," ffiEE Transaction on
76.0
[4) Nobuo Kawaguchi,Shi伊ki Matsubara, Yukiko Yamaguchi, Kazuya Taked閣, FUI町mit飽ada ltakura丸, Databaωse、:,"inP桁r。侃cof lCSLP20∞04,FrA2702p,2004 (5) Ryuichi Nisimura, Akinobu Lee, Hiroshi S訓wa削, Kiyohiro Shikano,“Public Speech-Oriented Guidance System with Adult and Child Discrimination Capability.... Proc. of lCASSP2004, vol. l ,pp.433-436, 2004.
“THE AUGUST SPOKEN DIALOG SYSTEM," Proc. of EUROSPEECH'99, vo1.3, pp.l l 5 1 ・ 1 154, 1999.
ーー+ー-RA% ...・...揮。f
4
o
Kiyohiro Shikano,“Acoustic Mod沼ling for Spoken Dialogue
[8) Tobias Cincarek, lzumi Shindゆ, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano,“Deve10pment of Preschool Children Subsys阻m for ASR and QA in a Real-Environment
“Examp1e-based Spoken Dia10g System with On1ine Examp1e Augmentation," Proc. of ICSLP2004, Spe併402p-7, pp.3073--3076,2004 [10) Manabu Kida,“Optimal Design of the Qu回tion-Answer Databaseおr a Spee唱h Guidance System," mぉter thes時, [ 1 lJ!ltto:/Isourccfoll!e.iDfDroicctslchascn-lcl1:acy,f. (12) Akinobu L民Ta凶ya Kawahara, Kiyohiro Shikano, Open Source Real-Time Large Vocabu1ary 泊