Artif Life Robotics (2009) 13:455–459 DOI 10.1007/s10015-008-0603-8

© ISAROB 2009

ORIGINAL ARTICLE

Tetsushi Oka · Toyokazu Abe · Kaoru Sugita · Masao Yokota

RUNA: a multimodal command language for home robot users

Received and accepted: July 9, 2008

Abstract This article describes a multimodal command language for home robot users, and a robot system which interprets users' messages in the language through microphones, visual and tactile sensors, and control buttons. The command language comprises a set of grammar rules, a lexicon, and nonverbal events detected in hand gestures, in readings of tactile sensors attached to the robots, and from buttons on the controllers in the users' hands. Prototype humanoid systems which immediately execute commands in the language are also presented, along with preliminary experiments on face-to-face interaction and teleoperation. Subjects unfamiliar with the language were able to command humanoids and complete their tasks with brief documents at hand, given a short demonstration beforehand. The command-understanding system, running on PCs, responded to multimodal commands without significant delay.

Key words Multimodal · Command language · Home robot · Speech · Nonverbal

T. Oka (*) · T. Abe · K. Sugita · M. Yokota
Fukuoka Institute of Technology, 3-30-1 Wajiro-higashi, Higashi-ku, Fukuoka 811-0295, Japan
e-mail: [email protected]
This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008

1 Introduction

Home robots that help ordinary people are expected to understand what their users want them to do as soon as commands are given. For people who need help, even computer graphical user interfaces (GUIs) or TV remote controls are not ideal interfaces. Some users may stop using a robot before learning which commands its buttons, sliders, and levers are linked to. In addition, if robots are to execute many kinds of commands, their users have to learn long sequences of operations (e.g., selecting menu items).

A spoken language interface can be a good alternative, since users do not need to learn a new language. However, difficult problems in natural language understanding, speech processing, and related areas must be solved before such robots come into practical use.1,2 Utterances in natural languages are ambiguous and must be interpreted differently in different contexts, based on reasoning over various knowledge sources. Recently, a number of robot systems implementing automatic speech recognition have been realized.2 To reduce computational cost, many of these systems interpret what the user means by keyword spotting, i.e., they create semantic representations of user utterances from spotted keywords. In such systems, it is not clear to users what kinds of utterance the robot understands, since the interface is not based on a well-defined language. It is also difficult to identify the intended meaning by this method alone when a variety of commands are given in different situations.

It is believed that nonverbal interfaces have advantages in communicating quantitative and spatial information, and that they help make interaction between humans and robots natural. In human face-to-face communication, nonverbal messages convey many kinds of information. Robots usually have sensors and actuators over their bodies which can be thought of as devices for nonverbal communication. In recent years, multimodal interfaces for nonexperts integrating speech, gesture, and other modes have been developed for human–robot communication,3 interactive robot programming,4 human–humanoid collaboration,5 and so on. However, multimodal interaction between humans and robots is still limited compared with human–human communication because of various technical issues; we cannot expect robots to respond to us exactly as humans do.

For these reasons, we proposed to design and use a multimodal language for directing home robots with unambiguous commands, integrating a simple but well-defined spoken language with nonverbal messages.6


Such a language should be easy for humans to learn and should reduce the computation needed for command understanding. This article presents the first version of such a multimodal command language, RUNA (robot users' natural command language), which enables users to command home robots in a natural way by speaking to them, using gestures, touching them, or pressing control buttons. A robot system which executes commands in the language and preliminary experiments with small humanoids are also described. Subjects unfamiliar with the language were able to command humanoids after they were given a brief demonstration and a short document.

2 RUNA: the multimodal language

The multimodal language RUNA comprises a set of grammar rules and a lexicon for spoken commands, together with a set of nonverbal events detected through visual and tactile sensors on the robot and control buttons in the users' hands. The spoken language alone enables users to command home robots in Japanese utterances that completely specify an action to be executed. Commands in the spoken language can be modified by nonverbal messages, including gestures and touch.

2.1 Commands and actions

In the current version of RUNA, there are two types of command: action commands and repetition commands. An action command consists of an action type, such as walk, turn, or move, and action parameters, such as speed, direction, and angle. Table 1 shows examples of action types and commands in RUNA. The 23 action types of RUNA are categorized into 11 classes based on the way action parameters are specified in Japanese (Table 2); in other words, actions of different classes are commanded with different modifiers.

2.2 Syntax of spoken commands

Table 3 shows some of the 171 generative rules for spoken commands in RUNA. These rules allow Japanese speakers to command a robot's actions in a natural way by speech alone, even though there are no recursive rules. In RUNA, a spoken action command is an imperative utterance containing a verb that determines the action type and other words that specify action parameters. For instance, the spoken command "Yukkuri 2 metoru aruke! (Walk 2 m slowly!)" indicates the action type walk and a distance of 2 m (see Fig. 1 for the parse tree). The third rule in Table 3 generates an action command of class 2 (AC2), which has speed and distance (SD) as parameters. There are 166 words, each of which has its own pronunciation; these are categorized into 60 groups (or parts of speech) identified by nonterminal symbols (Table 4). Note that users can use different words for the same action type or parameter value. Because the language is simple, well defined, and based on Japanese, Japanese speakers would not need long training to learn it.

Fig. 1. Parse tree for a spoken command, “Walk 2 m slowly!”
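To make the mapping from grammar rules to action representations concrete, the following is a minimal sketch of how a class-2 command such as "Yukkuri 2 metoru aruke!" could be reduced to an abbreviated action string. It is illustrative only, not the grammar machinery of the prototype: the lexicon fragment, the class name SpokenCommandParser, and the exact action string format are assumptions made for this example.

```java
// Minimal sketch: reducing a romanized class-2 RUNA utterance to an action
// string such as "walk_s_2m". Hypothetical code, not the authors' parser;
// only a tiny fragment of the lexicon (cf. Table 4) is shown.
import java.util.LinkedHashMap;
import java.util.Map;

public class SpokenCommandParser {

    // Surface word -> nonterminal group and value (cf. Table 4).
    private static final Map<String, String> LEXICON = new LinkedHashMap<>();
    static {
        LEXICON.put("yukkuri", "SPEED:slow");
        LEXICON.put("hayaku", "SPEED:fast");
        LEXICON.put("metoru", "LUNIT:m");
        LEXICON.put("aruke", "AT_WALK:walk");
    }

    // Parse "yukkuri 2 metoru aruke" into an action string.
    public static String parse(String utterance) {
        String speed = "s";              // placeholder default (cf. Table 6)
        String distance = null;
        String actionType = null;
        String number = null;

        for (String token : utterance.toLowerCase().split("\\s+")) {
            if (token.matches("\\d+")) {             // Number
                number = token;
            } else if (LEXICON.containsKey(token)) {
                String[] cat = LEXICON.get(token).split(":");
                switch (cat[0]) {
                    case "SPEED":   speed = cat[1].substring(0, 1); break;
                    case "LUNIT":   // DIST -> Number LUNIT (rule 9 in Table 3)
                        if (number != null) distance = number + cat[1];
                        break;
                    case "AT_WALK": actionType = cat[1]; break;
                }
            }
        }
        if (actionType == null) return null;         // no verb, not a command
        return actionType + "_" + speed + (distance != null ? "_" + distance : "");
    }

    public static void main(String[] args) {
        // "Walk 2 m slowly!" -> walk_s_2m
        System.out.println(parse("yukkuri 2 metoru aruke"));
    }
}
```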

Table 1. Action representation of RUNA

Type    Representation    English utterance
Walk    Walk_s_3steps     Walk 3 steps slowly!
Turn    Turn_f_l_30deg    Turn 30° left quickly!
Punch   Punch_s_lh_str    Punch straight with the left hand!
Wave    Wave_s_lh         Wave your left hand slowly!
Hug     Hug_s             Give me a hug slowly!
Move    Move_m_b_steps    Move two steps backward!
Look    Look_f_l          Look left quickly!

Table 2. Action classes and types in RUNA

Class   Type                                  Parameters
1       Stand, lie, squat, crouch, hug        Speed
2       Moveforward, movebackward, walk       Speed, distance
3       Look, turnto, lookAround              Speed, target
4       Turn                                  Speed, direction_lr_ni, angle
5       Sidestep                              Speed, direction_lr_ni, distance
6       Move                                  Speed, direction_ni, distance
7       Handshake, highfive                   Speed, hand_de
8       Punch                                 Speed, hand_de, direction_ni
9       Kick                                  Speed, foot_de, direction_ni
10      Turnbp, wavebp, raisebp, lowerbp      Speed, bodypart_wo, direction_ni
11      Dropbp                                Bodypart_wo

Table 3. Grammar rules of RUNA

No   Rule                     Description
1    S → Action               Action command
2    S → Repetition           Repetition command
3    Action → SD AC2          Class 2 command
4    AC2 → AT_WALK            Action type walk
5    SD → SPEED               Speed
6    SD → DIST SPEED          Distance + speed
7    SD → SPEED DIST          Speed + distance
8    SD → DIST                Distance
9    DIST → Number LUNIT      Number + length unit
10   Repetition → REPEAT      Repeat last action

Table 6. Default action parameter values

Type     Speed   Direction   Angle   Distance   Body part
Walk     Slow    NA          NA      3 steps    NA
Look     Slow    Straight    NA      NA         NA
Turnto   Slow    Right       90      NA         NA
Turn     Slow    Right       30      NA         NA
Wave     Slow    Straight    NA      NA         Right hand
Kick     Slow    Straight    NA      NA         Right foot

Table 4. Part of RUNA's lexicon

Nonterminal   Terminal                Pronunciation
AT_WALK       at_walk_aruke           aruke
              at_walk_hokou           h o k o:
              at_walk_hokoushiro      h o k o: sh i r o
REPEAT        md_repeat_moikkai       m o: i q k a i
SPEED         sp_fast_hayaku          hayaku
              sp_fast_isoide          isoide
              sp_slowly_yukkuri       yuqkuri
LUNIT         lu_cm_cm                s e N ch i
              lu_m_m                  m e: t o r u
              lu_mm_mm                miri
DIR_LR        dir_left_hidari         hidari
              dir_left_hidarigawa     hidarigawa
NI            joshi_ni_ni             ni

Table 5. Nonverbal events of RUNA

Type      Event                          Description
Gesture   hw_left_slow_long_3cycles      Hand waving
Gesture   hw_right_fast_short_4cycles    Hand waving
Gesture   hm_up_moderate_10cm            Hand motion
Tactile   tc_rightwrist_long             Touching
Tactile   tc_leftshoulder_150ms          Touching
Button    btn_left_1000ms                Button press
Button    btn_3_563ms                    Button press

2.3 Nonverbal events

In RUNA, nonverbal events modify the meaning of spoken commands. They convey information about action parameters rather than action types. Table 5 shows some representative nonverbal events in RUNA, each distinguished by its own type and parameters.

2.4 Semantics of RUNA

In order to execute commands in RUNA, the intermediate representations of actions shown in Table 1 are employed. Since the spoken language is simple and syntactically unambiguous, it is straightforward to identify action types and parameters in spoken commands. RUNA allows users to omit action parameters, although the action type cannot be left out when a new action is commanded. If some parameters of the specified type are missing in a spoken command, default values are assigned to those parameters. For example, "Kick!" is interpreted as "Kick slowly straight with your right foot!" in our command language (Table 6). Gesture, tactile, and button events carry parameters such as speed, duration, and position, which are related to action parameters. For instance, the spoken command "Walk 2 meters!" is interpreted as walk_m_2m in RUNA if no recent nonverbal event has been detected, whereas a hand-waving gesture event such as hw_left_fast_40cm_3cycles would create a modified command, walk_f_2m. Such natural mappings between the parameters of nonverbal events and actions allow users to specify action parameters in a natural way.

2.5 Pragmatics

In RUNA, each action is identified by its type and parameters (see Table 1). In order to execute an action, a robot needs a motor system which generates a time series of motor commands from sensory input in real time. In our prototype robot systems, we prepare time series of target joint angles, or motion patterns, to generate robot motions. Each action is converted to the index of a motion pattern by means of a pattern-matching method. Each actuator is driven according to the current target joint angle and the readings of angular sensors and a gyroscope.
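As a concrete illustration of this interpretation step, the sketch below assigns a default speed when the speaker omits it and lets a recent hand-waving event override that speed, reproducing the walk_m_2m / walk_f_2m example above. It is a simplified, hypothetical rendering, not the interpreter agent of the prototype system; the class names and the default values are invented for the example.

```java
// Hypothetical sketch of command interpretation: parameters missing from a
// spoken command receive placeholder default values, and a recent nonverbal
// event may override one of them (here, the speed).
import java.util.HashMap;
import java.util.Map;

public class CommandInterpreter {

    // Placeholder defaults standing in for the action database (cf. Table 6).
    private static final Map<String, String> DEFAULT_SPEED = new HashMap<>();
    static {
        DEFAULT_SPEED.put("walk", "m");   // moderate
        DEFAULT_SPEED.put("kick", "s");   // slow
    }

    /** A nonverbal event such as hw_left_fast_40cm_3cycles, reduced to the fields used here. */
    static class GestureEvent {
        final String kind;    // "hw" = hand waving
        final String speed;   // "f" (fast) or "s" (slow), or null if not conveyed
        GestureEvent(String kind, String speed) { this.kind = kind; this.speed = speed; }
    }

    /**
     * @param actionType action type recognized from the verb (e.g., "walk")
     * @param speed      spoken speed, or null if the user omitted it
     * @param distance   spoken distance, or null if the user omitted it
     * @param recent     most recent nonverbal event, or null if none was detected
     */
    static String interpret(String actionType, String speed, String distance,
                            GestureEvent recent) {
        if (speed == null) speed = DEFAULT_SPEED.getOrDefault(actionType, "s");
        if (recent != null && recent.speed != null) speed = recent.speed;   // nonverbal override
        String action = actionType + "_" + speed;
        return distance == null ? action : action + "_" + distance;
    }

    public static void main(String[] args) {
        // "Walk 2 meters!" with no recent nonverbal event
        System.out.println(interpret("walk", null, "2m", null));                        // walk_m_2m
        // the same utterance preceded by a fast hand-waving gesture
        System.out.println(interpret("walk", null, "2m", new GestureEvent("hw", "f"))); // walk_f_2m
    }
}
```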

3 Prototype command interpretation system

A prototype command interpretation system based on the Open Agent Architecture (OAA, http://www.ai.sri.com/~oaa/)7 has been developed and tested in preliminary experiments. It runs on a desktop PC with a 2.4 GHz Core2Duo processor and 2 GB of memory, together with other PCs of similar specifications (Fig. 2), and comprises eight agents written in Java and C++. Julian, a grammar-based version of the Julius speech recognition engine8 (http://julius.sourceforge.jp/), is employed for automatic speech recognition (ASR). The event detectors monitor sensor readings, detect nonverbal events, and send them to the interpreter agent. To detect hand gestures, an agent implemented with Intel's Open Source Computer Vision Library (http://www.intel.com/) detects the direction, speed, duration, amplitude, number of strokes, and frequency of hand-waving motions. Tactile and control button events are detected by other nonverbal event detectors.
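The gesture detector can be pictured as turning a tracked hand trajectory into one of the event strings of Table 5. The sketch below assumes the hand has already been located in each camera frame (the step the OpenCV-based agent performs) and summarizes the horizontal positions into an event string; the thresholds and the class name are invented for illustration.

```java
// Hypothetical sketch of a hand-waving event detector. It assumes hand positions
// have already been tracked in the camera images and assembles an event string
// such as "hw_left_fast_40cm_3cycles". All thresholds are invented.
public class HandWaveDetector {

    /**
     * @param side        "left" or "right", i.e., which side the waving hand is on
     * @param xCm         horizontal hand positions in cm, one sample per frame
     * @param frameRateHz camera frame rate
     * @return an event string, or null if the motion does not look like waving
     */
    static String detect(String side, double[] xCm, double frameRateHz) {
        double min = xCm[0], max = xCm[0];
        int turnArounds = 0;
        double lastDelta = 0;
        for (int i = 1; i < xCm.length; i++) {
            min = Math.min(min, xCm[i]);
            max = Math.max(max, xCm[i]);
            double delta = xCm[i] - xCm[i - 1];
            if (delta * lastDelta < 0) turnArounds++;   // direction change
            if (delta != 0) lastDelta = delta;
        }
        int cycles = (turnArounds + 1) / 2;             // two turn-arounds per full cycle
        double amplitudeCm = max - min;
        double durationS = xCm.length / frameRateHz;
        if (cycles < 2 || amplitudeCm < 10.0) return null;   // too small or too few strokes

        String speed = (cycles / durationS) > 1.5 ? "fast" : "slow";
        return String.format("hw_%s_%s_%dcm_%dcycles",
                side, speed, Math.round(amplitudeCm), cycles);
    }

    public static void main(String[] args) {
        // a waving motion roughly 40 cm wide, sampled at 10 Hz
        double[] wave = {0, 20, 40, 20, 0, 20, 40, 20, 0, 20, 40, 20, 0};
        System.out.println(detect("left", wave, 10.0));      // hw_left_fast_40cm_3cycles
    }
}
```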

The interpreter agent determines the action type and parameter values of each command based on speech and nonverbal events. It looks for recent nonverbal events in its event memory and modifies a spoken command whenever one arrives from the ASR agent. An action database is consulted for default parameter values (see Table 6). The motion selector agent consults a motion database to select robot motion patterns that match the action command strings created by the interpreter agent. In the prototype system, 70 motion patterns are registered in the database, each of which can match more than one action string; e.g., both move_slowly_forward_3steps and walk_slowly_3steps match the same motion pattern. Finally, the speech output agent plays a sound file as a response to a command: "I am walking!" (depending on the selected motion), "OK!", "I cannot do it!" (if the command cannot be executed), or "I don't understand!"
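A minimal sketch of the motion selection step is given below: several action strings may be registered as aliases of one stored motion pattern, and an unknown string yields no pattern, which would trigger the "I cannot do it!" response. The class name, pattern indices, and table contents are hypothetical.

```java
// Hypothetical sketch of motion selection: more than one action string can map
// to the same registered motion pattern (a stored time series of joint angles).
import java.util.HashMap;
import java.util.Map;

public class MotionSelector {

    // action string -> index of a stored motion pattern
    private final Map<String, Integer> motionIndex = new HashMap<>();

    void register(int patternIndex, String... actionStrings) {
        for (String a : actionStrings) motionIndex.put(a, patternIndex);
    }

    /** @return the motion pattern index, or -1 if the action cannot be executed */
    int select(String actionString) {
        return motionIndex.getOrDefault(actionString, -1);
    }

    public static void main(String[] args) {
        MotionSelector selector = new MotionSelector();
        selector.register(12, "move_slowly_forward_3steps", "walk_slowly_3steps");
        System.out.println(selector.select("walk_slowly_3steps"));   // 12
        System.out.println(selector.select("fly_slowly_3steps"));    // -1 -> "I cannot do it!"
    }
}
```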

Fig. 2. OAA-based prototype of the RUNA system

4 Applications

Two small humanoids (Kyosho Manoi PF-01) were used in preliminary experiments with subjects unfamiliar with the command language. One of the humanoids has a wireless camera and can be operated by users who monitor what it sees on a remote PC equipped with a microphone and a numeric keypad. After an early evaluation phase, it was tested with 27 subjects who had never operated the robot. Given a short document, they were able to remotely direct the robot to explore a room, find objects, and correctly answer questions about the room. The other humanoid has eight tactile sensors on its shoulders, arms, and wrists, and can be operated by speech, gestures, and touch. It was tested with 30 subjects, taking into account the results of the earlier tests. After watching a brief demonstration movie, they were able to make the robot punch, kick, walk, turn, and shake hands using multimodal commands, and to complete the given tasks.

In all these experiments, examples of spoken and multimodal commands helped the users find appropriate, grammatically correct commands in RUNA. The subjects understood that the robots can be commanded with action types and parameters, although they needed a few trials to learn to specify action parameters accurately in nonverbal messages. It is noteworthy that the novice subjects rated the humanoids' operability 5–6 on average on a 7-point scale. The speech recognizer performed better when the grammar was reduced for the specific tasks by removing particular words and rules. Tactile and controller button events helped the system correctly identify the type and parameters of the commanded action, because shorter utterances are easier for the speech recognizer to recognize.

5 Discussion

The results of the preliminary experiments imply that our humanoids can be commanded by novice users without much frustration, and that our multimodal language can be a good means of operating home robots, including humanoids, particularly because it is simple and limited. Grammatically speaking, the framework of RUNA does not include anaphoric expressions (including zero pronouns) or compound and complex sentences, which are quite common in natural languages, including Japanese. However, the authors presume that default action parameters and pointing gestures can prevent lengthy commands for practical purposes. Further studies are needed to make RUNA more effective and easier to use by extending the spoken language and the set of nonverbal events mapped to action parameters. Although human users can adapt to the language, it should be refined to make more effective use of gestures, body touching, remote controllers, and other nonverbal modes of communication. The robots' responses were not delayed, thanks to RUNA's simplicity and well-defined nature, and the authors believe that promptness is a key to natural interaction. In order to prove the advantages of multimodal commands, statistical tests on larger samples are necessary. The percentages of successful commands, differences in speech recognition word error rates, and the recognition rates of action types and individual parameters in larger samples are of great interest.


6 Conclusion

A multimodal language based on a simple spoken language and nonverbal event detection was designed for commanding home robots. A prototype robot system which interprets and executes multimodal commands in the language was tested, and subjects with no knowledge of the language were able to direct small humanoids. The command language is still under development, with the aim of finding a more generic command language for multipurpose home robots that will be given a diversity of tasks, including taking care of home devices, Internet access, and physical tasks.6 Our future work includes thorough studies of the current system, usability tests on a wider variety of people and on tasks more closely related to practical applications, refinement of the multimodal language, and the development of a system with a larger repertoire of actions and speech responses.

References

1. Bos J, Oka T (2007) A spoken language interface with a mobile robot. Artif Life Robotics 11:42–47
2. Prasad R, Saruwatari H, Shikano K (2004) Robots that can hear, understand and talk. Adv Robotics 18:533–564
3. Perzanowski D, Schultz AC, Adams W, et al (2001) Building a multimodal human–robot interface. IEEE Intell Syst 16:16–21
4. Iba S, Paredis CJJ, Khosla PK (2004) Interactive multi-modal robot programming. Proceedings of the 9th International Symposium on Experimental Robotics (ISER '04), Singapore
5. Breazeal C, Brooks A, Gray J, et al (2004) Tutelage and collaboration for humanoid robots. Int J Humanoid Robots 1:315–348
6. Oka T, Yokota M (2007) Designing a multi-modal language for directing multipurpose home robots. Proceedings of the 12th International Symposium on Artificial Life and Robotics (AROB '07), Beppu, Japan
7. Cheyer A, Martin D (2001) The Open Agent Architecture. J Auton Agents Multi-Agent Syst 4:143–148
8. Lee A, Kawahara T, Shikano K (2001) Julius: an open source real-time large vocabulary recognition engine. Proceedings of the 7th European Conference on Speech Communication and Technology, Aalborg, Denmark

Acknowledgment This work was supported by a KAKENHI Grant-in-Aid for Scientific Research (C) (19500171).
