A Map-based System Using Speech and 3D Gestures for Pervasive Computing

Andrea Corradini, Richard M. Wesson, Philip R. Cohen
Center for Human-Computer Communication
Department of Computer Science and Engineering
Oregon Health & Science University, Portland, OR, USA
{andrea,wesson,pcohen}@cse.ogi.edu

Abstract

We describe an augmentation of QuickSet, a multimodal voice/pen system that allows users to create and control map-based, collaborative, interactive simulations. In this paper, we report on our extension of the graphical pen input mode from stylus/mouse to 3D hand movements. To do this, the map is projected onto a virtual plane in space, specified by the operator before the start of the interactive session. We then use our geometric model to compute the intersection of hand movements with the virtual plane, translating these into map coordinates on the appropriate system. The goal of this research is the creation of a body-centered, multimodal architecture employing both speech and 3D hand gestures, which seamlessly and unobtrusively supports distributed interaction. The augmented system, built on top of an existing architecture, also provides improved visualization, management and awareness of a shared understanding. Potential applications of this work include tele-medicine, battlefield management and any kind of collaborative decision-making during which users may wish to be mobile.

1. Introduction

Over the last decade, computers have moved from being simple data storage and calculating machines to becoming everyday assistants in the lives of many people. Advances in wireless communications, coupled with more powerful, portable and cheaper embedded devices, have resulted in an increasing demand for human-centered computer architectures that can pervasively sense [22] and recognize human interactions. In such architectures, interaction with computers will extend beyond the confines of the desktop model. Users will not be required to go to a special place to interact with a computer; rather, interactive computational capabilities will be available everywhere through cameras, microphones and other sensing devices. To be reliable in support of user activities, while remaining persistent and transparent to users, sensing and computing devices will have to be able to work together within a distributed infrastructure. Simultaneously,

computing systems will need to give feedback to users in appropriate, unobtrusive ways—through speakers, headphones, wall screens, and other specialized output devices. In order that future access to computing be natural, the environment itself will need to maintain an awareness and perception of the users with whom it is interacting, and be capable of intelligent interaction in response to both gestural and voice commands.

To that end, novel human-computer interaction (HCI) metaphors need to be investigated. Computers will have to interact and engage in dialogue with the real world surrounding them in ways similar to the way people do. The interfaces for such computers will not be restricted to menus or haptic interaction by keyboard and mouse, but will include a combination of speech, gestures, context and emotions. Human beings will not have to adapt to technology but vice versa [7].

2. Related Work

In recent years, smart environments have emerged as a key target area for pervasive and ubiquitous computing research within an HCI context. Much of this research is based on context-aware rooms where natural forms of interaction allow a group of people to control shared devices or collaborate more effectively.

The KidsRoom [5] is a fully automated bedroom, which integrates non-encumbering vision and audio sensing techniques to guide children through a storybook adventure. The interactive play space responds to actions with images, sound, light, music, and video and transforms itself into an imaginative narrative world. During the narration, children naturally interact with virtual creatures projected onto the walls, with objects within the room and with each other.

The Easy Living Project [4] is concentrating on building context-aware visual adaptive interfaces, which combine multiple sensor modalities. Currently, the user needs to go through a fingerprint reader to get into the living room. Upon identification, the system loads a set of user preferences that are then used to control various devices. 3-D cameras are deployed for tracking while flat wall screens are used as display devices. In a home environment, music might change and lighting be

automatically adjusted, based on a set of preferences and musical tastes for the current occupant. The user controls the system from a laptop or other mobile device.

The Intelligent Room [6] is equipped with an array of computer-controlled devices like video cameras, LCD projectors, VCRs, and audio stereo systems that support speech and gesture understanding systems for interaction with its inhabitants. Camera systems track people and identify their gestures and activities, while audio systems decide whether occupants are talking to each other or to the room itself. Together those systems generate a real-time description of the activities in the room that is used to support specific kinds of functionality. Currently, the room is implemented for a command and control situation as well as an interactive space for virtual tours.

The Adaptive House [17] automatically controls basic residential comfort systems, such as ventilation and heating, by learning behavioral patterns of its occupants. It is unobtrusive and does not require any special interactions. Occupants operate the Adaptive House just as they do an ordinary house, using the types of switches and thermostats they are accustomed to. The house monitors these adjustments and infers the inhabitants' desires from their actions and behavior.

The Aware Home project [3] aims at the development of a prototype home that will provide assisted living for elderly or sick occupants through a seamless interactive environment, using audio and video sensors to understand voice and gestures. To date, the architecture is under construction with only vision-based sensors to track multiple individuals and an underlying software infrastructure for context-aware applications.

In this work, we describe a prototype architecture in furtherance of a pervasive computing environment for collaborative map-based simulations. Within such a context, conventional media such as large paper maps with overlays and boards for pencil annotation are still standard practice. Migration to electronic media [15] has been slow for social, practical and technical reasons. New operative modes need time to become accepted. The use of individual computer screens restricts the representation of the global picture of the environment. Intuitive, robust, unencumbered HCI techniques are still limited. Moreover, in such complex settings, controlling, displaying and manipulating real-time streams of information arriving from a variety of sensors and sources is a critical task. While data fusion and correlation are necessary requirements for creating a global, situationally aware picture of the simulation, collaboration and interaction are also needed to support the participants. Our approach allows users to be mobile while giving input that is independent of the mechanisms attached to a particular display.

3. Setting

3.1. The QuickSet System

QuickSet is a collaborative, distributed system that runs on several devices ranging from desktops and wireless hand-held PCs to wall-screen displays and tablets. Users are provided with a shared map of the area or region on which an activity is to take place. By using a pen to draw and simultaneously issuing speech commands, each user can place entities on the map at any desired location and give them various qualities. Digital ink, map-views and tools are shared among the participants in the session, and ink can be combined with spoken input to produce multimodal constituents. Collaborative capabilities [14] are ensured by a component-based software architecture that is able to accept input for distributed tasks from disparate users. It consists of a collection of agents, which communicate through the Open Agent Architecture [8] and the Adaptive Agent Architecture [13].

3.2. The Idea behind the Design

Users should be able to use speech and 3D gestures to input multimodal commands while being able to move about and concentrate on different aspects or displays of the interface at will, without having to concern themselves with proximity to the host system's mouse or stylus. For example, in a military command and control environment, operators can use their hand gestures and speech to add multimodal annotations to shared maps describing a battlefield. In the same way, inhabitants of smart rooms could manipulate on/off switches, volume dials or other controls by pointing at them while speaking a command from anywhere within sensor range.

To pursue our vision, we have chosen initially to extend the existing pen recognition agent into a human gesture recognition agent that recognizes certain 3D hand movements. With the support of our extended gestural agent, the system can provide an intuitive input interface for speech and human hand gesture, relying on our geometrical model to represent the physical relationships between users and devices.

4. The System

There is empirical evidence of both user preference for and task performance advantages of multimodal interfaces compared to single-mode interfaces, in both general [16] and map-based applications [18]. Speech is the foremost modality, yet gestures often support it in interpersonal communication. Typically, when spatial content is involved, information can be conveyed more easily and precisely using gesture rather than speech. Electronic pens convey a wide range of meaningful information with a few

simple strokes. Gestural drawings are particularly important when abbreviations, symbols or non-grammatical patterns are required to supplement the meaning of spoken language input. However, the user must always know to which interface the pen device is connected, and switch pens when changing interfaces. This does not allow for pervasive and transparent interaction as the user becomes mobile.

We have extended QuickSet to accept "ink" input from 3D hand movements independent of any pen device. We make use of a magnetic field tracker called Flock of Birds (FOB) [1] as the input device for our hand drawing system. To give input to the system, the user attaches one FOB sensor to the top of the dominant hand to be tracked. The hand's 3D position is then given by the sensor's spatial position, which the FOB provides at a frequency of approximately 50 Hz. For orientation information, we place the sensor near the back of the index finger with its relative x-coordinate axis directed toward the index fingertip. In this way, using the quaternion values reported by the sensor, we can apply mathematical transformations within quaternion algebra to determine the unit vector which unambiguously defines the direction of the sensor and therefore that of the intended pointing gesture.

The wired hand tracker (Figure 1) is, of course, inconsistent with our intention of freeing the user from corded input devices. However, the cords we wish to cut are those that tether the user to a particular input device wired to a particular computing device. We have provided the user with just one universal input device, leaving to the system the task of determining with which computing unit the user is interacting. In our system, that universal input device happens to be a wired sensor, which is the source of this apparent inconsistency. Using other devices, such as wireless acoustic/inertial sensors or cameras, would eliminate all wires.
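To make the orientation handling concrete, the following sketch (Python with NumPy; the function names are ours, not part of QuickSet) shows how a unit quaternion reported by the sensor can be applied to the sensor's local x-axis to obtain the pointing-direction unit vector described above.

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate 3-vector v by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = np.array([x, y, z])
    t = 2.0 * np.cross(u, v)        # v' = v + w*t + u x t
    return np.asarray(v) + w * t + np.cross(u, t)

def pointing_direction(q):
    """Pointing-direction unit vector in the FOB transmitter frame.

    Assumes the sensor is mounted with its local x-axis directed
    toward the index fingertip, as described in the text.
    """
    d = quat_rotate(q, np.array([1.0, 0.0, 0.0]))
    return d / np.linalg.norm(d)
```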

4.1. Geometry Model for Pervasive Computing

To facilitate our extended interface, we have created a geometrical system to deal with relationships between entities in our setting, namely screen positions (for drawing and visualization) and the perceptually-sensed current hand position and orientation (for recognition of certain meaningful 3D hand movements). This geometrical system is responsible for representing appropriately the information conveyed through hand interaction and automatically selecting the display to which the user's gesture pertains in the multi-display setting (see Figure 1).

The system needs to know to which display regions the users' 3D gestures can pertain. Therefore, before the system is started for the first time, the system developers have to manually set the regions the users will be able to

paint in. The chosen regions will typically be a wall screen, a tablet or a computer screen on which the shared map or maps have been projected. The developer accomplishes this by pointing at three of the vertices of the chosen rectangle for each painting region. However, since this procedure must be done in 3D space, the developer has to gesture at each of the vertices from two different positions. The two pointing vectors are then triangulated to select a point as the vertex: since two lines in 3D space will generally not intersect, we use the point of minimum distance from both lines.
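The vertex triangulation can be sketched as follows, assuming each pointing ray is given by a position p and a direction d in the FOB transmitter frame (a minimal illustration; the function name and the parallel-ray tolerance are our own).

```python
import numpy as np

def closest_point_between_rays(p1, d1, p2, d2):
    """Midpoint of the shortest segment between two 3D lines.

    Each line is given by a point p and a direction d. Used to
    triangulate a screen-corner vertex from two pointing rays taken
    from different positions; returns None for near-parallel rays.
    """
    p1, d1, p2, d2 = (np.asarray(v, dtype=float) for v in (p1, d1, p2, d2))
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:               # rays are (almost) parallel
        return None
    t = (b * e - c * d) / denom         # parameter on the first ray
    s = (a * e - b * d) / denom         # parameter on the second ray
    return 0.5 * ((p1 + t * d1) + (p2 + s * d2))
```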

Figure 1: A user operating a tablet from a distance of about 1.5 m. An array microphone placed on top of the screen captures audio signals, while gestures are sensed by a FOB sensor on the back of the PinchGlove.

4.2. The Gesture Recognition Agent

Currently, the system supports the recognition of two kinds of gestures: pointing and hand twisting about the index finger. In natural human pointing behavior, the hand and index finger are used to define a line in space, roughly passing through the base and the tip of the index finger. Normally, this line does not lie in any target plane, but it may intersect one at some point. It is this point of intersection that we aim to recover within the FOB's transmitter coordinate system. Based on empirical study and analysis of collected gestural data, we characterize a pointing gesture as: 1) a hand movement between two stationary states, 2) whose temporal length is up to 2.4 seconds, and 3) whose dynamic patterns are characterized by a smooth and steady increase/decrease in the spatial component values.
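One way to recover that intersection point and express it in the 2D coordinates of a calibrated display region is sketched below, assuming the region is stored as three corners v0, v1, v2 with v1 and v2 adjacent to v0 (our own parameterization, not necessarily the system's internal one).

```python
import numpy as np

def ray_to_region_coords(p, d, v0, v1, v2):
    """Intersect the pointing ray (p, d) with a calibrated display region.

    v0, v1, v2 are three corners of the rectangular region, with v1 and
    v2 adjacent to v0. Returns normalized (u, v) coordinates in
    [0, 1] x [0, 1] when the ray hits the region, otherwise None.
    """
    p, d, v0, v1, v2 = (np.asarray(x, dtype=float) for x in (p, d, v0, v1, v2))
    eu, ev = v1 - v0, v2 - v0           # edges spanning the rectangle
    n = np.cross(eu, ev)                # plane normal
    denom = n @ d
    if abs(denom) < 1e-9:               # ray parallel to the plane
        return None
    t = (n @ (v0 - p)) / denom
    if t < 0:                           # region lies behind the user
        return None
    q = p + t * d                       # 3D intersection point
    u = (q - v0) @ eu / (eu @ eu)       # project onto the region's axes
    v = (q - v0) @ ev / (ev @ ev)
    if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0:
        return u, v                     # scale by the map size for map coordinates
    return None
```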

We consider the FOB data stream as static any time the sensor attached to the user's hand remains stationary for at least one second (approximately 50 consecutive FOB reports); in this case, we also refer to it as being in a stationary state. A simple motion detector over the data stream determines any stationary state and reports to the gesture recognizer whenever a transition from or into a stationary state occurs. If the transition is from static into dynamic, the motion detector forwards the input stream to a finite state machine so that analysis can start. From that point on, if conditions (2) and (3) above also hold until the next transition from dynamic into stationary state is detected, then a pointing gesture is recognized. In this way, the motion detector provides the recognizer with explicit start and end points for classification without the need for any specific user-defined positions. Pointing recognition is based on a finite state machine whose inputs are the signs of the deltas over the spatial components associated with the hand location.

Whenever the imaginary line described by the sensor in space intersects the target region (the virtual paper/plane) and a pointing gesture is simultaneously recognized, the system enters an interactive mode with the corresponding map projected onto that target region. Then, using a PinchGlove [2], the user can trigger pen-down (pen-up) events to start (end) free-hand gestural input (see Figure 1), placing digital ink at the points of intersection. When a pen-down event is triggered, the drawing is passed on to the ink symbol recognizer agent. Future research is investigating other forms of gesture that do not require instrumented gloves [11].
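A minimal sketch of the motion-detector idea follows; the threshold values are our own illustrative assumptions rather than the system's actual parameters.

```python
import numpy as np

class MotionDetector:
    """Flags transitions between stationary and moving hand states.

    The stream is called stationary when the last `window` reports
    (about one second at ~50 Hz) all lie within `radius` metres of
    their mean; the thresholds are illustrative assumptions.
    """
    def __init__(self, window=50, radius=0.01):
        self.window, self.radius = window, radius
        self.history = []
        self.stationary = True

    def update(self, position):
        """Feed one FOB position report; return 'start', 'stop' or None."""
        self.history.append(np.asarray(position, dtype=float))
        self.history = self.history[-self.window:]
        if len(self.history) < self.window:
            return None
        center = np.mean(self.history, axis=0)
        still = all(np.linalg.norm(p - center) < self.radius for p in self.history)
        if self.stationary and not still:
            self.stationary = False
            return 'start'      # dynamic segment begins: start the FSM analysis
        if not self.stationary and still:
            self.stationary = True
            return 'stop'       # segment ends: classify the candidate gesture
        return None
```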

A hand twisting gesture is used to signal the user's wish to pan over the map. In order to recognize such a gesture, we analyze the hand rotation information using the quaternion components provided by the sensor. Unlike other common representations for rotation transformations (such as direction cosines, XYZ fixed angles and XYZ Euler angles), quaternions avoid the problem of gimbal lock while allowing for the implementation of smooth and continuous rotation [21]. They compactly encode a unit vector and a scalar value to represent a rotation through an angle defined by the scalar about the axis defined by the unit vector. We characterize a hand twisting as a hand rotation for which: 1) the unit vector along the pointing direction (see Figure 3) is constant over the movement, 2) the rotation about that direction is at least 45 degrees, and 3) the rotation takes place between two stationary states.
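As an illustration, the following sketch (our own decomposition, not the system's code) extracts the axis and angle from a unit quaternion and applies the two quaternion-based checks to the net rotation between the two stationary states; the alignment threshold is an assumed value.

```python
import numpy as np

def quat_axis_angle(q):
    """Decompose a unit quaternion q = (w, x, y, z) into (axis, angle)."""
    w, x, y, z = q
    angle = 2.0 * np.arccos(np.clip(w, -1.0, 1.0))
    s = np.sqrt(max(1.0 - w * w, 1e-12))
    return np.array([x, y, z]) / s, angle

def is_twist(q_rel, pointing_dir, min_angle=np.pi / 4, align=0.95):
    """Check the net rotation between two stationary states.

    q_rel is the relative rotation from the start to the end of the
    segment; the gesture counts as a twist when its axis stays
    (anti)parallel to the pointing direction and the angle reaches
    45 degrees. The alignment threshold is an assumed value.
    """
    axis, angle = quat_axis_angle(q_rel)
    return abs(axis @ np.asarray(pointing_dir)) > align and angle >= min_angle
```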

We examine the vector and scalar elements encoded in the quaternion to check the first and second conditions, respectively (see Figures 2 and 3). Once a twist is detected, the pointing direction with respect to the center of the virtual paper determines the direction of the panning; an additional twisting gesture causes the system to exit panning mode. Graphical rendering on remote machines is currently implemented with OpenGL, utilizing the Virtual Reality Peripheral Network [20] driver for the FOB.

Figure 2: (left) the four quaternion components while performing four consecutive hand twistings over approximately 4.5 seconds; (right) the corresponding rotation angles in radians over the same period of time. For a hand twisting to be recognized, the angular difference between two consecutive rotational direction-change points must be at least π/4 radians.

4.3. The Speech Recognition Agent

For speech recognition, we use a commercial off-the-shelf product known as Dragon 4.0, a Microsoft SAPI 4.0 compliant speech engine. A very important capability of Dragon is that it is speaker independent: any user can immediately interact via voice without having to train the system on his/her voice. However, the performance of different microphones, as well as noisy and varying environments, remains an issue for mobile users.

Spoken utterances are sensed by either room-mounted array microphones or (wireless) microphones worn by the user. Sensed utterances are sent to the speech recognition engine, which receives the audio stream and produces an n-best list of textual transcripts, constrained by a grammar supplied to the speech engine upon startup. These are parsed into meaningful constituents by the natural language processing agent, yielding an n-best list of alternative interpretations ranked by their associated speech recognition probabilities. Speech recognition operates in a click-to-speak mode, i.e. the microphone is activated when a pen-down event is triggered.
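The speech path just described can be summarized by the following sketch; the recognizer and parser objects, their method names and the n-best size are placeholders of our own, standing in for the SAPI engine and the natural language agent.

```python
def interpret_utterance(audio, recognizer, parser, n=5):
    """Sketch of the speech path: audio -> n-best transcripts -> parses.

    `recognizer` and `parser` are placeholders for the SAPI engine and
    the natural language agent; the recognizer is assumed to return up
    to n (transcript, probability) pairs constrained by the grammar
    loaded at startup.
    """
    interpretations = []
    for text, prob in recognizer.recognize(audio, n_best=n):
        for meaning in parser.parse(text):      # a transcript may yield several parses
            interpretations.append((prob, meaning))
    # Rank alternative interpretations by their recognition probability.
    return sorted(interpretations, key=lambda x: x[0], reverse=True)
```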

Figure 3: A hand twisting occurs by rotating the hand about the x-axis of the local frame of reference of the sensor.

4.4. The Fusion Agent

The original version of QuickSet [9] provides an integrator agent to determine and rank potential (multimodal or unimodal) interpretations. This agent labels both speech and gesture as either partial or complete. A modality tagged as partial needs to be integrated with another mode to provide a useful meaning; if tagged as complete, one mode already provides a full specification and therefore needs no further integration. The integration agent also examines the time stamps of the single modes to assess their temporal compatibility. Based on an empirical study [18], temporal integration occurs whenever either speech and gesture have overlapping time intervals or the onset of the speech signal occurs within a time window of up to 4 seconds

following the end of the gesture. For a detailed description, the interested reader should refer to [9,12].
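The temporal-compatibility rule can be expressed as a small predicate (a sketch with timestamps in seconds; the 4-second lag window is the empirically derived value cited above).

```python
def temporally_compatible(speech_start, speech_end,
                          gesture_start, gesture_end, lag_window=4.0):
    """True when speech and gesture may be fused under the rule above.

    Either the two time intervals overlap, or the speech onset falls
    within `lag_window` seconds after the end of the gesture. Times are
    in seconds on a common clock.
    """
    overlap = speech_start <= gesture_end and gesture_start <= speech_end
    follows = 0.0 <= speech_start - gesture_end <= lag_window
    return overlap or follows
```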

5. Discussion and Future Work

The architecture described in this work is flexible, open and easily extensible to provide more specific functionality. It exploits the capabilities of QuickSet for recognition and fusion of input modalities while maintaining a history of the collaboration, so that the work can be conveniently set aside and continued later. As more devices are added, the capabilities of the environment will increase automatically and the geometric model will need to be updated to appropriately reflect these changes.

While drawing using free hand movements allows for non-proximity and transparency to the interface, creating detailed drawings is not easy because human pointing is not very accurate [10]. Executing detailed drawings by pointing takes training and practice on the user's part, and relies on a precise calibration of both the tracking device and the geometrical model. The accuracy of drawings decreases with increasing symbol complexity, and this might cause poorer recognition. This is analogous to speech dictation and handwriting recognition systems, which in general work better in simpler application domains. Our associating an arbitrary movement with a specific meaning could limit the extensibility of the system; for example, one might want to grab and twist an object in a virtual construction, as opposed to the current interpretation of that gesture, which indicates a desire to pan over the map. In that regard, we are currently conducting a study to determine and understand natural gestures for virtual manipulation [11].

The FOB is capable of tracking multiple users with little effort; however, a more desirable scenario would be to substitute a wireless tracking system in place of the FOB. Wireless electromagnetic, acoustic, and vision-based tracking units remove the mechanical link, but they still have only a limited operational range. Since individual tracking systems limit the user to a small physical space, one further improvement would be the use of a battery of trackers over larger spaces. It could be argued that the best way to perform 3D gesture recognition is via computer vision techniques, for even if we adopt wireless sensors the user may still be somewhat encumbered by having to wear additional devices. Given the limited accuracy of current computer vision techniques, we decided to compromise by adopting a sensor-based solution that least encumbers the user.

When logging into QuickSet, the user is assigned a unique speech and gesture ID. Ideally, for a ubiquitous computing environment, we would like to assign a speech ID via automatic speaker identification. For now, we use the PinchGlove to signal pen-up/pen-down. As speech recognition in non click-to-speak

mode in QuickSet becomes more reliable, such pen-up/pen-down gestures could also be entered by voice. Alternatively, PinchGlove signaling could be replaced by defining a specific hand shape to trigger these events, as an extension of our current 3D gesture recognition system.

An interesting extension of our system would allow users to exchange, rather than just share, information from one display to another. For instance, in [19] users are allowed to drag-and-drop objects across various displays using an electronic pen to record information. In our system, the user would be able to manipulate entities by pointing at them while simultaneously pinching the PinchGlove or issuing a voice command. The user would then point at a location on a different display and, by pinching the glove once more or issuing another voice command, move the object to this new place. Across displays, such hand motions would not require the proximity necessary for a touch pen. In addition, people naturally move objects by picking them up and then putting them down, rather than sliding them across display surfaces as in current computer user interfaces. Within a single display, however, a pushing motion may be more natural than a grabbing and dropping motion (e.g., for objects whose shapes lend themselves to rolling). In the context of a painting application, we could have a dedicated display for a control palette and one or more drawing surfaces. The user could then select brush, color type and other attributes for the drawings simply by pointing at the corresponding location on the control panel. Our underlying system would be responsible for sharing and exchanging drawings among users in a collaborative, multimodal, ubiquitous manner.

6. Conclusions

This report describes a working implementation of a 3D hand gesture recognizer to extend the existing digital-ink input capabilities of a real-time, fully functional multimodal speech/pen architecture. The system presented here is a first step toward a ubiquitous multimodal speech and human gesture infrastructure for cooperative tasks. We have created a geometric model to represent the physical relationships between users and maps within the shared setting. Given our goal of cutting the cord that ties users of a multimodal system to common gestural input devices, such as a keyboard, pen or mouse attached to the interface device, we believe the system we have presented takes a significant step toward pervasive computing.

7. Acknowledgments

This research has been supported by ONR Grants N000149910377, N000149910380 and N000140210038.

8. References

[1] http://www.ascension-tech.com/

[2] http://www.fakespace.com/

[3] http://www.cc.gatech.edu/fce/house/

[4] http://www.research.microsoft.com/easyliving/

[5] Bobick, A.F., et al., "The KidsRoom: A perceptually-based interactive and immersive story environment", Presence: Teleoperators and Virtual Environments, 8(4):367-391, 1999.

[6] Coen, M.H., "Design Principles for Intelligent Environments", Proc. of the Conf. on Artificial Intelligence (AAAI), pp. 547-554, 1998.

[7] Coen, M.H., "The Future of Human-Computer Interaction or How I learned to stop worrying and love My Intelligent Room", IEEE Intelligent Systems, March/April 1999.

[8] Cohen, P.R., et al., "An Open Agent Architecture", Working Notes of the AAAI Spring Symposium on Software Agents, 1994.

[9] Cohen, P.R., et al., "QuickSet: Multimodal Interaction for Distributed Applications", Proceedings of the 5th International Multimedia Conference, ACM Press, pp. 31-40, 1997.

[10] Corradini, A., Cohen, P.R., "Multimodal speech-gesture interface for hands-free painting on virtual paper using partial recurrent neural networks for gesture recognition", Proc. of the Int'l Joint Conf. on Neural Networks (IJCNN), Vol. III, pp. 2293-2298, 2002.

[11] Corradini, A., Cohen, P.R., "On the Relationships Among Speech, Gestures, and Object Manipulation in Virtual Environments: Initial Evidence", Proc. of the Int'l CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, 2002.

[12] Johnston, M., et al., "Unification-based Multimodal Integration", Proc. of the 35th Annual Meeting of the Association for Computational Linguistics, 1997.

[13] Kumar, S., et al., "The Adaptive Agent Architecture: Achieving Fault-Tolerance Using Persistent Broker Teams", Proc. of the 4th Int'l Conference on Multi-Agent Systems (ICMAS), pp. 159-166, 2000.

[14] McGee, D.R., Cohen, P.R., "Exploring Handheld, Agent-based Multimodal Collaboration", Proceedings of the Workshop on Handheld Collaboration at the Conference on CSCW, 1998.

[15] McGee, D.R., et al., "A Visual Modality for the Augmentation of Paper", Proc. of the Workshop on Perceptive User Interfaces (PUI), 2001.

[16] Mellor, B.A., et al., "Evaluating Automatic Speech Recognition as a Component of Multi-Input Human-Computer Interface", Proc. of the Int'l Conf. on Spoken Language Processing (ICSLP), 1996.

[17] Mozer, M., "The Neural Network House: An Environment that Adapts to its Inhabitants", Proc. of the AAAI Spring Symposium on Intelligent Environments, AAAI Press, pp. 110-114, 1998.

[18] Oviatt, S.L., "Multimodal Interfaces for Dynamic Interactive Maps", Proceedings of the Conference on Human Factors in Computing Systems, pp. 95-102, 1996.

[19] Rekimoto, J., "Pick-and-Drop: A Direct Manipulation Technique for Multiple Computer Environments", Proc. of the 10th Symp. on User Interface Software and Technology (UIST), pp. 31-39, 1997.

[20] Taylor, R.M., "VRPN: A Device-Independent, Network-Transparent VR Peripheral System", Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2001.

[21] Vince, J., "Virtual Reality Systems", Addison-Wesley, 1995.

[22] Weiser, M., "The Computer for the 21st Century", Scientific American, 265(3):94-104, 1991.
