Using Foot-Syllable Grammars to Customize Speech Recognizers for Dialogue Systems

Daniel Couto Vale and Vivien Mast
I5-[DiaSpace], SFB/TR8 Spatial Cognition, University of Bremen
Cartesium, Enrique-Schmidt-Straße 5, 28359 Bremen, Germany
[email protected], [email protected]

Abstract. This paper aims at improving the accuracy of user utterance understanding for an intelligent German-speaking wheelchair. We compare three different corpus-based context-free restriction grammars for its speech recognizer, which were tested for surface recognition and semantic feature extraction on a dedicated corpus of 135 utterances collected in an experiment with 13 participants. We show that grammars based on phonologically motivated units such as the foot and the syllable outperform phrase-structure grammars in complex scenarios where the extraction of a large number of semantic features is necessary.

Keywords: speech recognition, dialogue system, natural language.

1 Introduction

When designing an intelligent agent for Ambient Assisted Living (AAL) environments, automatic speech recognition is a challenging task. Restricting commands to a small list of utterances with maximally different sound patterns is not a design option; instead, agents are required to understand a large repertoire of user utterances, which may often sound similar while realizing different meanings. On the engineering side, this is a foreseeable catastrophe. Ontology-driven context-free restriction grammars have been shown to yield poor speech recognition quality for such a selection of utterances, while a more reliable corpus-driven (context-free or finite-state) restriction grammar implies the cost of compiling a very large annotated targeted corpus [9]. For this reason, a corpus-based (not corpus-driven) context-free restriction grammar would be an acceptable low-cost solution if it resulted in good speech recognition. This paper aims at improving the accuracy of speech recognition restricted by hand-crafted corpus-based context-free grammars by developing and testing restriction grammars with different sets of grammatical units in four human-machine interaction scenarios. Our quest is to determine the best types of grammatical units for achieving good speech recognition at a low engineering cost. We conclude that phonologically motivated units such as the syllable and the foot outperform graphologically and ideationally motivated ones such as the word and the phrase, leading to a positive quality-cost trade-off. We argue that the reason for this is that such units map better onto the characteristics of situated spoken text while making a large annotated corpus unnecessary.

P. Sojka et al. (Eds.): TSD 2012, LNCS 7499, pp. 591–598, 2012.
© Springer-Verlag Berlin Heidelberg 2012


2 Interaction in Ambient Assisted Living

The Bremen Ambient Assisted Living Lab (BAALL, [5]) is a fully functional 60 m², two-person apartment for testing and evaluating AAL technology [6]. Going well beyond automated light and temperature control, BAALL features sophisticated control of sliding doors and smart furniture, and enables higher-level services through integration [6]. Due to its complexity, intuitive and effective interaction between user and speaking agents is a priority [2]. An important factor for the effectiveness of dialogue in AAL environments is that the recognizable utterances should not be restricted to lists of commands that need to be learned and memorized, since learning and memorization can be particularly demanding for elderly people. The intelligent wheelchair Rolland, tested in BAALL, is a modified version of the commercially available power-wheelchair Xeno [7,8,11]. It has been equipped with sensors and an onboard computer in order to be able to perceive and reason about the environment and to interact with the user in different ways. This includes a navigation assistant that enables the wheelchair to autonomously take the user to a target location [6] with qualitative spatial reasoning, and a dialogue system, which is a mixed-initiative semi-autonomous agent using the DAISIE framework (DiaSpace's Adaptive Information State Interaction Executive), whose architecture was adapted to the specific needs of situated dialogue [12,13,14]. Among its components, it uses Nuance's VoCon as a speech recognizer [10] and OpenCCG as a parser [3].

3 Methodology

The German usage corpus for Rolland was collected during a Wizard-of-Oz experiment that simulates a real-life everyday scenario of wheelchair usage within BAALL [1]. In this experiment, 13 German native speakers of both sexes were told to execute 6 tasks using the intelligent wheelchair, for which they had to drive it to 9 destinations through free spoken commands. All commands were recorded and manually transcribed, constituting a corpus of 135 clauses. Finally, these clauses were read out loud with pauses in between by a female German native speaker and recorded again for further processing. Three context-free restriction grammars were manually written for the speech recognizer VoCon to cover 125 of the 135 clauses: namely Phrase-Word (PW) (see Table 1), Foot-Word (FW), and Foot-Syllable (FS) (see Table 2). The phrase-word grammar has traditional grammatical units such as Nouns (N), Prepositions (P), Noun Phrases (NP), Prepositional Phrases (PP), and Verbs (V), and allows any syntactically valid structure with these units while ignoring concordance (case, gender, person, and number). The foot-word grammar has a three-rank structure starting at the lowest rank with the word (W), an intermediate unit named foot (F), and an uppermost unit named curve (C), which corresponds closely to a clause. A foot is a rhythmic unit in the compositional hierarchy of spoken language, which contains syllables as its parts and which is part of a curve [4]. In German, a foot is composed of one stressed syllable (SS) and its adjacent unstressed ones (US). The boundary of the foot was determined by analyzing frequent
co-occurrences of unstressed syllables around a stressed one in our corpus. Example 1 is divided into feet, whose stressed syllables are [‘kans] in the word “kannst”, [‘tIS] in “Tisch”, and [‘fa:Rn] in “fahren”:

Example 1. kannst du mich / zum Tisch / fahren

In this grammar, we restricted word combinations within the foot to those found in our corpus, which eliminated non-occurring preposition-noun combinations such as “ins Sofa” (into the sofa) and verb-pronoun combinations such as “fahre ich” (do I drive?), favoring similar-sounding feet such as “ans Sofa” (closer to the sofa) and “fahre mich” (take me) for speech recognition. This elimination relies solely on de facto occurring feet and is written as in the Foot-Word Grammar extract below. Classes of feet were created depending on their possible combinations with others inside of a curve.

Foot-Word Grammar
<curve> : <F_verb> <F_place> | [...] ;
<F_verb> : fahre | fährst du | fahre mich | fährst du mich ;
<F_place> : ans Sofa | an das Sofa | zum Sofa | zu dem Sofa | an die Couch | zur Couch | zu der Couch ;

The foot-syllable grammar also has the foot as its intermediate grammatical unit, but feet have syllables as their parts instead of words. By having the syllable as the lowest unit, we could fine-tune the speech recognition in much detail (see below).

Foot-Syllable Grammar
<F_küche> : <SS_kü> <US_che> | [...] ;
<SS_kü> : ‘kY !pronounce("‘kY") ;
<US_che> : CE !pronounce("CE") | C$ !pronounce("C$") ;
<US_i> : I !pronounce("I") | $ !pronounce("$") ;
<US_in> : In !pronounce("In") | $n !pronounce("$n") ;
<US_ni> : nI !pronounce("nI") | n$ !pronounce("n$") ;
<US_di> : dI !pronounce("dI") | d$ !pronounce("d$") ;
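The corpus-based restriction can be emulated programmatically. The following sketch is not the authors' actual grammar-generation code: the helper names, the "/" foot delimiter, and the grouping of feet by their first word are our assumptions. It collects the feet attested in a transcribed corpus and emits one BNF-style rule listing only those alternatives, so that a non-occurring combination such as "ins Sofa" never enters the grammar:

```python
# Sketch: derive corpus-attested foot alternatives for a restriction grammar.
# Transcriptions are assumed pre-segmented into feet with "/" (as in
# Example 1: "kannst du mich / zum Tisch / fahren").
from collections import defaultdict

def collect_feet(transcriptions):
    """Group attested feet by a coarse class (here simply: their first word)."""
    feet_by_class = defaultdict(set)
    for clause in transcriptions:
        for foot in clause.split("/"):
            foot = foot.strip()
            if foot:
                feet_by_class[foot.split()[0].lower()].add(foot)
    return feet_by_class

def to_bnf(rule_name, feet):
    """Emit one BNF-style rule listing only corpus-attested alternatives."""
    return "<%s> : %s ;" % (rule_name, " | ".join(sorted(feet)))

corpus = [
    "kannst du mich / zum Tisch / fahren",
    "fahre mich / ans Sofa",
    "fahre mich / zur Couch",
]
classes = collect_feet(corpus)
print(to_bnf("F_place", classes["zum"] | classes["ans"] | classes["zur"]))
# "ins Sofa" is never emitted because it does not occur in the corpus.
```

Because the rule enumerates attested feet only, the restriction is purely paradigmatic and requires no syntactic typology.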

The audio file with the spoken clauses was then played, the speech was recognized by VoCon with these three grammars, and the highest-scoring recognized utterances were saved into three plain text files for further processing. In addition, two human transcriptions were made, one in words and another in feet. Two CCG grammars were created to process VoCon's output, one for the foot transcriptions and another for the phrase transcriptions, each covering 125 of the 135 expected clauses (the latter trying to come up with some feature structure for as many utterances as possible due to the low quality of the surface structure recognition). The texts were parsed with these two grammars and the feature structures were saved to a file. The features were defined in such a way that every utterance could be reverted to its surface structure including lexical choice (“Sofa” and “Couch”, “Rolland” and “Roller” triggering different features), but not including details such as the position of the word “bitte” in the clause.
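A foot transcription such as "Komme zudemBett" (see Table 2) joins each stressed word with its adjacent unstressed function words. The paper's foot transcriptions were produced by hand from corpus co-occurrence statistics; the sketch below uses one simple attachment heuristic (unstressed words attach to the following stressed word) and an illustrative function-word list, both our assumptions:

```python
# Sketch: convert a word transcription into a foot transcription by attaching
# unstressed function words to the following stressed word. This heuristic and
# the word list are illustrative; real foot boundaries were corpus-determined.
UNSTRESSED = {"zu", "dem", "in", "die", "zum", "an", "das", "vor", "du", "mich"}

def to_feet(words):
    feet, pending = [], []
    for w in words:
        if w.lower() in UNSTRESSED:
            pending.append(w.lower())   # unstressed: wait for a stressed head
        else:
            feet.append("".join(pending) + w)
            pending = []
    if pending:                         # trailing unstressed words join the last foot
        feet[-1] += "".join(pending)
    return feet

print(to_feet("Komme zu dem Bett".split()))   # ['Komme', 'zudemBett']
```

The output format matches the foot-level human transcription used for evaluating the FW and FS grammars.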

Table 1. Example transcriptions for the phrase-word grammar

    Human Transcription        Phrase-Word
1   komme zum Bett             komm her
2   fahre in die Küche         fahre an die Küche
3   komme her                  —
4   komme vor dem Sofa an      komme vor dem Sofa

Table 2. Example transcriptions for the foot grammars

    Human Transcription     Foot-Word           Foot-Syllable
1   Komme zudemBett         Komme zudemBett     Komm zudemBett
2   Fahre indieKüche        Fahre indieKüche    Fahre indieKüche
3   Komme Her               Komm zudemBett      Komme Her
4   Komme vordemSofa An     Komme vordasSofa    Komme vordemSofa An

In order to evaluate the performance of the three restriction grammars, we compared the recognition results on two strata: surface structure (VoCon output) and semantic features (CCG output). For the semantic stratum, we considered the requirements of different application scenarios.

3.1 Surface Structure

In order to evaluate surface structure recognition, we compared the output of the speech recognizer to human transcriptions. Recognized utterances were only counted as correct if they completely matched their human transcription equivalents.

3.2 Semantic Features

For the parsing, we compared the semantic features extracted by OpenCCG for each recognized utterance to those extracted for the human transcription in four scenarios. Different sets of features were taken into account for comparison depending on the complexity of the scenario. Feature structure hierarchy was ignored in all scenarios.

Translation Scenario. Firstly, we tested a very simple scenario in which it is only possible to direct the wheelchair to certain destinations with limited precision, using low-level translation commands. The only relevant semantic feature is the Relatum, i.e. the place the wheelchair should go to.

Pickup Scenario. If a wheelchair can distinguish whether the user is on board or not and receives a command such as “Bring mich in die Küche” when the user is not on board, it should first pick up the user before bringing him or her to the requested destination, making both the Relata and the Goals (items that should be moved) relevant.

Precision Scenario. In this scenario, we assume a wheelchair that can use all ideational features, i.e. Process-Type, Relative-Position, Place-Type, etc. Interpersonal features such as Politeness and Mood-Type were not taken into account.


Holistic Scenario. Finally, for a wheelchair that can adjust its behavior and correct its mistakes by noticing diachronic fluctuations in politeness, interpersonal features also become relevant. In this scenario, high-level goals as well as low-level adjustments, different speech functions such as questions, and clarification dialogues can be articulated, covering a broad range of tasks.
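The scenario-dependent evaluation can be pictured as comparing only the feature subset relevant to each scenario. In the sketch below, the feature names (Relatum, Goal, Process-Type, etc.) come from the paper, but the flat dictionary layout and the exact per-scenario feature sets are our assumptions (feature structure hierarchy was ignored in all scenarios):

```python
# Sketch: a recognition counts as correct in a scenario iff all features
# relevant to that scenario match the human-transcription features.
SCENARIOS = {
    "translation": {"Relatum"},
    "pickup": {"Relatum", "Goal"},
    "precision": {"Relatum", "Goal", "Process-Type",
                  "Relative-Position", "Place-Type"},
    "holistic": {"Relatum", "Goal", "Process-Type", "Relative-Position",
                 "Place-Type", "Politeness", "Mood-Type"},
}

def correct_for(scenario, recognized, reference):
    relevant = SCENARIOS[scenario]
    # Missing features compare as equal only if missing on both sides.
    return all(recognized.get(f) == reference.get(f) for f in relevant)

ref = {"Relatum": "Küche", "Goal": "Sprecher", "Politeness": "high"}
rec = {"Relatum": "Küche", "Goal": "Sprecher", "Politeness": "low"}
print(correct_for("pickup", rec, ref))    # the Politeness mismatch is ignored
print(correct_for("holistic", rec, ref))  # the Politeness mismatch now counts
```

This makes explicit why the same recognizer output can score well in the translation and pickup scenarios yet poorly in the precision and holistic ones.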

4 Results

With respect to the surface structure, the foot-syllable restriction grammar achieved 67 total matches (49.63%), the foot-word grammar 32 (23.70%), and the phrase-word grammar 16 (11.85%). Table 3 shows the results of the different grammars.

Table 3. Correct matches of the three restriction grammars for surface structures

     Correct   Wrong
FS   67        68
FW   32        103
PW   16        119

Using Pearson's chi-squared test, this difference was shown to be highly significant (χ² = 49.57, df = 2, p < 0.0001). The standardized residuals (Table 4) show that for the foot-syllable grammar, the number of correctly recognized utterances was significantly higher than average, whereas for the phrase-word grammar, it was significantly lower.

Table 4. Standardized residuals for the surface structure results

     Correct   Wrong
FS   6.70      −6.70
FW   −1.48     1.48
PW   −5.22     5.22
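The test statistic and residuals can be reproduced from the counts in Table 3 alone. The following self-contained computation (pure Python, no statistics library) recovers χ² = 49.57 and the adjusted standardized residuals reported in Table 4:

```python
import math

# Observed counts from Table 3: rows FS, FW, PW; columns correct, wrong.
obs = [[67, 68], [32, 103], [16, 119]]

n = sum(sum(row) for row in obs)
row_tot = [sum(row) for row in obs]
col_tot = [sum(col) for col in zip(*obs)]

chi2 = 0.0
resid = [[0.0, 0.0] for _ in obs]
for i, row in enumerate(obs):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / n          # expected count
        chi2 += (o - e) ** 2 / e
        # Adjusted standardized residual, as reported in Table 4.
        resid[i][j] = (o - e) / math.sqrt(
            e * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))

print(round(chi2, 2))          # 49.57
print(round(resid[0][0], 2))   # 6.7   (FS, correct)
print(round(resid[2][0], 2))   # -5.22 (PW, correct)
```

The same computation applied to the per-scenario counts of Table 5 yields the statistics quoted for the translation, pickup, precision, and holistic scenarios.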

For the translation and pickup scenarios, Table 5 shows that all grammars reach a fairly high rate of correct feature extraction. The foot-word grammar receives the highest scores, allowing correct extraction of 119 feature structures (88.15%) in the translation scenario and 104 (77.04%) in the pickup scenario, whereas the phrase-word grammar performs worst, merely allowing 83 (61.48%) and 71 (52.60%) correct feature structure extractions respectively. The foot-syllable grammar reaches values between these extremes with 108 (80.00%) and 99 (73.33%) correct recognitions. The difference is highly significant for both scenarios (translation: χ² = 28.08, df = 2, p < 0.0001; pickup: χ² = 21.42, df = 2, p < 0.0001), and the standardized residuals in Table 6 show that the foot-word grammar performs significantly above average, while the phrase-word grammar performs significantly below average for both scenarios.

Table 5. Correct matches

     Translation       Pickup            Precision         Holistic
     Correct  Wrong    Correct  Wrong    Correct  Wrong    Correct  Wrong
FS   108      27       99       36       74       61       71       64
FW   119      16       104      31       36       99       34       101
PW   83       52       71       64       23       112      21       114

Table 6. Standardized residuals

     Translation       Pickup            Precision         Holistic
     Correct  Wrong    Correct  Wrong    Correct  Wrong    Correct  Wrong
FS   1.16     −1.16    1.73     −1.73    6.66     −6.66    6.60     −6.60
FW   3.90     −3.90    2.85     −2.85    −1.87    1.87     −1.82    1.82
PW   −5.06    5.06     −4.58    4.58     −4.79    4.79     −4.78    4.78

In the precision and holistic scenarios, the results are slightly different. The foot-syllable restriction grammar retains fairly good scores: 74 correct feature structure extractions (54.81%) in the precision scenario and 71 (52.60%) in the holistic scenario. The results of both the foot-word and the phrase-word grammar degrade to a correct feature extraction rate that is unacceptable for practical applications: the foot-word grammar only reaches 36 (26.67%) and 34 (25.19%) correct feature extractions respectively, and the phrase-word grammar 23 (17.04%) and 21 (15.56%). The difference is again highly significant in both scenarios (precision: χ² = 47.18, df = 2, p < 0.0001; holistic: χ² = 46.52, df = 2, p < 0.0001), and the standardized residuals in Table 6 show that the foot-syllable grammar performs significantly above average, while the phrase-word grammar performs significantly below.

5 Discussion

The results point towards the superiority of the foot-syllable grammar for recognizing speech. This is particularly the case for surface structure recognition and for the two most complex scenarios, where all features need to be extracted correctly. For simpler scenarios, on the other hand, the foot-word grammar performed better than the foot-syllable grammar, and the disadvantage of the phrase-word grammar was less drastic than in complex scenarios. In addition, the foot-syllable grammar was the one that best discarded the ten clauses which were not covered by the grammar, thus avoiding false recognition. It must be noted that the good results are due not only to paradigmatic constraints on text, but also to the right choice of grammatical unit type. The foot, for instance, contributed to our results in several ways. Firstly, it allowed creating a flatter structure that keeps information about occurring verb complements such as actors and goals (realized as pronouns), relative positions to the relatum (realized as prepositions), and place type (realized as cases or prepositions). It also allowed keeping ideational features such as five levels of politeness in a foot class, all of this while keeping the restriction grammar context-free and relying neither on typologies of nouns or processes nor on delicate agreement. Finally, it helped make VoCon output only strings that could be successfully analyzed by OpenCCG. The syllable enables us to systematically increase phonological variability for unstressed syllables while enforcing exact matching for stressed ones, which usually carry the most delicate semantic content such as process types and thing types. The syllable rank also helps us allow variation while keeping the size of the grammar constant, as syllable units can be reused more often than words.
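This controlled loosening can be sketched as a generator of pronunciation alternatives: stressed syllables keep a single exact phonetic form, while unstressed syllables also receive a schwa-reduced variant (full vowel replaced by "$", as in the Foot-Syllable Grammar extract in Section 3). The reduction rule and the vowel inventory below are illustrative, not the authors' actual rule set:

```python
# Sketch: emit one BNF rule per syllable; stressed syllables ("'"-prefixed,
# an ASCII stand-in for the stress mark) get exactly one pronunciation,
# unstressed ones also get a schwa-reduced variant (vowels -> "$").
FULL_VOWELS = "aeiouyAEIOUY"

def variants(syllable):
    if syllable.startswith("'"):        # stressed: exact match only
        return [syllable]
    reduced = "".join("$" if c in FULL_VOWELS else c for c in syllable)
    return [syllable] if reduced == syllable else [syllable, reduced]

def syllable_rule(name, syllable):
    alts = " | ".join('%s !pronounce("%s")' % (v, v) for v in variants(syllable))
    return "<%s> : %s ;" % (name, alts)

print(syllable_rule("SS_kue", "'kY"))   # one exact alternative
print(syllable_rule("US_di", "dI"))     # dI plus reduced variant d$
```

Because each syllable rule is reused wherever that syllable occurs, adding a variant enlarges the recognizer's tolerance without enlarging the grammar.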

6 Conclusion

We have used three alternative BNF grammars to restrict VoCon's speech recognition, parsed the recognized strings with CCG grammars, and evaluated both the recognized surface structure and the relevant semantic features for four concrete application scenarios. Our results show that by using corpus data to systematically add phonological variants and restrict syntactic structures, speech recognition results can be improved significantly. They have also shown that, for simple scenarios, an approach that limits syntactic variability without systematically controlling the phonological stratum performs best, while for complex scenarios with situated spoken interaction, a restriction grammar that additionally exercises fine control over phonological variation vastly outperforms all other approaches. We conclude that the syllable and the foot are better suited for restricting speech recognition in complex scenarios than the word and the phrase. Especially for interaction in AAL environments, where language usage is specific and large corpora are unavailable and very costly to build, a foot-syllable corpus-based context-free grammar seems to be a real alternative in terms of both cost and quality of user utterance recognition. It remains to be shown whether this corpus-based approach can compete with corpus-driven methods in complex scenarios where large specific corpora are available.

Acknowledgements. We gratefully acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) through the Collaborative Research Center SFB/TR8 Spatial Cognition.

References

1. Anastasiou, D.: A Speech and Gesture Spatial Corpus in Assisted Living. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul (2012)
2. Augusto, J.C., McCullagh, P.: Ambient intelligence: Concepts and applications. Computer Science and Information Systems 4, 1–28 (2007)
3. Baldridge, J., Kruijff, G.-J.: Multi-Modal Combinatory Categorial Grammar. In: Proceedings of EACL (2003)
4. Halliday, M.A.K., Matthiessen, C.M.I.M.: An Introduction to Functional Grammar, 3rd edn. Edward Arnold, London (2004)
5. Krieg-Brückner, B., Gersdorf, B., Döhle, M., Schill, K.: Technik für Senioren in spe im Bremen Ambient Assisted Living Lab (Technology for seniors-to-be in the Bremen Ambient Assisted Living Lab). In: Ambient Assisted Living – AAL: 2. Deutscher AAL-Kongress. VDE-Verlag, Berlin (2009)


6. Krieg-Brückner, B., Röfer, T., Shi, H., Gersdorf, B.: Mobility Assistance in the Bremen Ambient Assisted Living Lab. GeroPsych 23, 121–130 (2010)
7. Krieg-Brückner, B., Shi, H., Fischer, C., Röfer, T., Cui, J., Schill, K.: Welche Sicherheitsassistenz brauchen Rollstuhlfahrer? (What kind of safety assistance do wheelchair users need?). In: Ambient Assisted Living – AAL: 2. Deutscher AAL-Kongress. VDE-Verlag, Berlin (2009)
8. Mandel, C., Huebner, K., Vierhuff, T.: Toward an Autonomous Wheelchair: Cognitive Aspects in Service Robotics. In: Proceedings of Toward Autonomous Robotic Systems (TAROS 2005), pp. 165–172 (2005)
9. Martin, P.: The “Casual Cashmere Diaper Bag”: Constraining Speech Recognition Using Examples. In: Interactive Spoken Dialog Systems: Bringing Speech and NLP Together in Real Applications, pp. 61–65 (1997)
10. Nuance Communications, Inc.: Nuance VoCon 3200 Embedded Development System Developer's Guide (2006)
11. Otto Bock Mobility Solutions, http://www.ottobock.de
12. Ross, R.J., Bateman, J.: Daisie: Information State Dialogues for Situated Systems. In: Matoušek, V., Mautner, P. (eds.) TSD 2009. LNCS, vol. 5729, pp. 379–386. Springer, Heidelberg (2009)
13. Shi, H., Jian, C., Rachuy, C.: Evaluation of a Unified Dialogue Model for Human-Computer Interaction. International Journal of Computational Linguistics and Applications 2 (2011)
14. Shi, H., Tenbrink, T.: Telling Rolland where to go: HRI dialogues on route navigation. In: Coventry, K., Tenbrink, T., Bateman, J. (eds.) Spatial Language and Dialogue, pp. 117–216. Oxford University Press, Oxford (2009)
