A Literature Survey on the Design of Speech Interface to 3D Applications

Seonho Kim ([email protected])

1. Introduction

More and more virtual reality applications are using speech interfaces because they have obvious advantages over other user interface architectures. The first advantage is that speech is the most natural and familiar interface for humans, which also makes it the easiest user interface to learn and use. The second advantage is that it leaves the user's hands and eyes free while commands are issued: users can issue commands while traveling, moving, or manipulating objects. It is also a comfortable interface, since it requires no special devices such as a mouse, wand, or data gloves. Speech interfaces are therefore highly accessible to diverse users, including the elderly and people with disabilities [13]. The third advantage is that speech promotes new forms of computing, because it has more expressive power and is more efficient than other modalities. However, speech interfaces have many disadvantages as well. Their error rates are significantly higher than those of conventional interfaces, their feedback is often invisible and very slow, and their stability and robustness are poor [7]. For these reasons, speech interfaces are rarely used alone; most 3D virtual reality applications that use speech adopt a multimodal architecture to make the interface perform efficiently. This literature survey introduces current research trends in speech interfaces for 3D virtual reality applications and discusses the possibilities of speech interfaces for the future.

2. Speech interface to 3D applications

According to the survey carried out for this report, most 3D VR applications that employ a speech interface also use interfaces of other modalities, for example hand gesture, posture, gaze, and wands. Multimodal architectures are chosen not only for the synergy effect of combining multiple modalities but also to reduce ambiguities in the recognition process, which can be serious when any single modality is used alone. Several combinations of modalities have been preferred in the 3D VR applications developed so far.

Speech and Wand

One user interface technique for 3D applications fuses speech with a wand or body posture. Ciger's magic wand system [12] uses a wand alongside speech recognition. The wand plays a role in selecting and navigating within the VR environment, and it helps the user issue commands that are difficult to express through speech alone or through conventional input devices, such as "fly to there", "move it to there", and "select this". The interface implements posture recognition by tracking the wand. Compared with a similar interface based on data gloves [10], it was perceived as more intuitive and easier to use, and no learning was needed. Above all, users preferred this interface because it does not restrict the user's motion the way an HMD does. A sketch of how such deictic commands can be resolved follows.
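
The key operation in such a wand-plus-speech interface is resolving a deictic word like "there" against the wand's tracked pose. The Python sketch below is a minimal illustration written for this survey, not Ciger's implementation [12]; the WandPose class and resolve_there function are hypothetical, and a ground-plane intersection stands in for whatever picking geometry the real system uses.

```python
import numpy as np

class WandPose:
    """Hypothetical tracked wand state: position and unit pointing direction."""
    def __init__(self, position, direction):
        self.position = np.asarray(position, dtype=float)
        d = np.asarray(direction, dtype=float)
        self.direction = d / np.linalg.norm(d)

def resolve_there(wand: WandPose, ground_height: float = 0.0):
    """Resolve the deictic 'there' in 'fly to there' by intersecting the
    wand's pointing ray with the horizontal ground plane y = ground_height.
    Returns the 3D target point, or None if the wand points away from it."""
    dy = wand.direction[1]
    if abs(dy) < 1e-9:                      # ray parallel to the ground plane
        return None
    t = (ground_height - wand.position[1]) / dy
    if t <= 0:                              # intersection behind the user
        return None
    return wand.position + t * wand.direction

# Example: the speech recognizer yields "fly to there"; the wand pose
# recorded at the moment of utterance supplies the missing referent.
target = resolve_there(WandPose([0.0, 1.6, 0.0], [0.0, -0.4, -1.0]))
print(target)   # approximately [ 0.  0. -4.]: a point 4 m in front of the user
```
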
Speech and Eye gaze

One of the most promising types of multimodal user interface for 3D applications combines speech and eye gaze, and many researchers are focusing on this combination [1][6][14]. Schapira [14] carried out an interesting experiment comparing the performance of multimodal interfaces that pair gaze with other modalities such as speech and gesture. He compared "gaze + speech", "gaze + gesture", and "gaze only" interfaces on the task of selecting an object in a VR environment, measuring how accurately and precisely the system recognized the user's selection. The "gaze only" technique showed the worst performance, as expected, while "gaze + gesture" and "gaze + speech" performed similarly and clearly better than "gaze only". This experiment is similar to Zhang's [1], which demonstrated the synergy effect of using speech alongside gaze: Zhang showed how the two modalities can reduce the ambiguities each causes on its own and formulated the disambiguation process. Schapira's experiment, however, did not consider the ambiguity problem; instead, it considered the distance and size of the objects to be selected and the performance of the selection tasks. Tan's experiment [6] showed the speech modality performing worst; "gaze only" and the multimodal interface performed similarly, though the multimodal interface slightly improved on "gaze only". Whereas Tan [6] focused on comparing different multimodal combinations, such as gaze and speech, gaze and spelling, and gaze plus spelling plus speech, Zhang's experiment [1] is a more detailed version of Tan's, focused on the parameters of a particular speech-gaze architecture, namely the level of detail of the speech commands and the radius of the eye operation region. The results showed that, up to a certain degree, a larger eye operation region radius is better, and that more detailed speech commands yield better performance. The sketch below illustrates this style of gaze-plus-speech disambiguation.
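
To make the disambiguation idea concrete, the following Python sketch shows one way gaze and speech can be fused for selection. It is an assumed reconstruction for this survey, not Zhang's published algorithm [1]: gaze defines a circular "eye operation region" that yields candidate objects, and attributes parsed from the speech command filter the candidates. The scene objects and attribute names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    x: float          # screen-space position of the object (pixels)
    y: float
    color: str
    shape: str

def select(objects, gaze_x, gaze_y, radius, spoken_attrs):
    """Gaze-plus-speech selection: gaze restricts candidates to an 'eye
    operation region' of the given radius; spoken attributes (e.g. color
    and shape parsed from 'select the red cube') disambiguate among them.
    Returns the single matching object, or None if still ambiguous."""
    in_region = [o for o in objects
                 if (o.x - gaze_x) ** 2 + (o.y - gaze_y) ** 2 <= radius ** 2]
    matches = [o for o in in_region
               if all(getattr(o, k) == v for k, v in spoken_attrs.items())]
    return matches[0] if len(matches) == 1 else None

scene = [SceneObject("cube1", 100, 100, "red", "cube"),
         SceneObject("cube2", 120, 110, "blue", "cube"),
         SceneObject("ball1", 400, 300, "red", "sphere")]

# Gaze near the two cubes is ambiguous on its own; the words 'red cube'
# resolve it. A larger radius tolerates more eye-tracker noise but admits
# more candidates, which is the trade-off Zhang's experiment measured.
print(select(scene, 105, 105, 50, {"color": "red", "shape": "cube"}))
```
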
Speech and Gesture

Like "speech and eye gaze" interfaces, speech and gesture are chosen by many 3D applications for the same reasons: both modalities are expressive, flexible, and freely usable. However, because speech recognition and gesture recognition technologies are still developing and inevitably produce many ambiguous results, in addition to recognition delays, they have rarely been applied to real application systems. Krum's system [10] applied a "speech and gesture" interface to navigating an Earth 3D VR environment. Even though Krum's system is simple, using a limited set of speech commands such as "move up", "move down", "turn left", and "turn right" and a small number of hand gestures such as "palm up", "palm down", "move left", and "move up" in the restricted domain of map navigation, it performed worse than a normal desktop interface using keyboard and mouse. The main reason was errors and delays in voice command recognition, which are more disruptive during detailed movement. Schapira's experiment [14] showed gesture performing slightly better than speech, though the difference was minor; this is because Schapira's experiment was designed around relatively simple and easy tasks, such as selecting a specific object among many similar ones. A sketch of this style of command fusion follows.
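
A Krum-style navigation interface [10] essentially maps a small vocabulary of recognized speech and gesture tokens onto a shared set of navigation actions. The Python sketch below is an assumed reconstruction, not Krum's code; the tokens come from the commands quoted above, while the fusion rule (apply recognizer events in time order) and the step sizes are simplifications of my own.

```python
# Map recognized tokens from either modality onto shared navigation actions.
SPEECH_COMMANDS = {
    "move up":    ("translate", (0, +1)),
    "move down":  ("translate", (0, -1)),
    "turn left":  ("rotate", -15),     # degrees of yaw, an assumed step size
    "turn right": ("rotate", +15),
}
GESTURE_COMMANDS = {
    "palm up":   ("translate", (0, +1)),
    "palm down": ("translate", (0, -1)),
    "move left": ("translate", (-1, 0)),
    "move up":   ("translate", (0, +1)),
}

def fuse(events):
    """Naive late fusion: interpret each timestamped recognizer event
    independently and apply them in time order. events is a list of
    (timestamp, modality, token) tuples from the two recognizers."""
    actions = []
    for _, modality, token in sorted(events):
        table = SPEECH_COMMANDS if modality == "speech" else GESTURE_COMMANDS
        if token in table:             # unrecognized tokens are dropped
            actions.append(table[token])
    return actions

events = [(0.1, "speech", "turn left"),
          (0.4, "gesture", "palm up"),
          (0.9, "speech", "move down")]
print(fuse(events))
# [('rotate', -15), ('translate', (0, 1)), ('translate', (0, -1))]
```
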
Speech and Multimodal

Several research groups [2][10][14] have tried to apply more than two modalities, including speech, to their 3D applications. Krum [10] and Althoff [2] combined conventional haptic devices, such as the keyboard, mouse, and joystick, with higher-level input such as speech and gesture in their navigation systems. Krum found that conventional devices outperform speech and gesture in fine control and navigation tasks. Althoff likewise found that the accuracy of speech and gesture is good in simple domains but not for detailed movement.

Speech only

Only a few 3D applications employ a "speech only" interface, and mostly for experimental purposes. Zhang [1] conducted experiments comparing three interfaces: "speech only", "gaze only", and "gaze and speech". Igarashi [11] used non-verbal voice for interactive control, exploiting pitch, volume, continuous voicing, and tonguing. Unlike most speech interfaces, which act indirectly and give non-real-time feedback, this interface can manipulate objects directly with immediate, real-time feedback. It is well suited to continuous control input such as moving scrollbars, controlling volume, and finely rotating objects. It is also language independent and can easily be adapted for discrete input. A sketch of this idea appears below.
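
The appeal of Igarashi's "voice as sound" approach [11] is that continuous acoustic features can drive continuous parameters directly. The Python sketch below loosely illustrates that mapping under assumptions of my own, rather than reproducing the published system: the energy (volume) of hypothetical microphone frames acts as a scrollbar velocity for as long as voicing continues.

```python
import numpy as np

def frame_energy(frame):
    """Root-mean-square energy of one audio frame: a crude 'volume' feature."""
    return float(np.sqrt(np.mean(np.square(frame))))

def voice_scroll(frames, threshold=0.02, gain=500.0):
    """Map continuous voicing to continuous scrolling, in the spirit of
    'voice as sound' [11]: while frame energy stays above a voicing
    threshold, scroll at a speed proportional to the volume; silence stops
    the motion immediately, giving real-time feedback. Yields the scrollbar
    position after each frame."""
    position = 0.0
    for frame in frames:
        energy = frame_energy(frame)
        if energy > threshold:               # user is vocalizing ("aaah...")
            position += gain * (energy - threshold)
        yield position

# Simulated 10 ms frames: quiet, then a sustained loud vowel, then quiet.
rng = np.random.default_rng(0)
frames = [rng.normal(0, amp, 160) for amp in [0.005] * 3 + [0.1] * 5 + [0.005] * 2]
print([round(p, 1) for p in voice_scroll(frames)])
```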

Speech with Agent

A new trend in speech interfaces for VR environments is the conversation-based embodied agent interface [4][5][8]. An agent interface recognizes and responds to verbal and non-verbal input, generates verbal and non-verbal output, and handles conversational functions such as turn taking, giving feedback during the dialogue, and repairing the conversation; beyond this, the agent controls the discourse. The Rea system of Cassell's study [4] provides information to users in a 3D VR environment in the real estate domain; Rea controls the dialogue and gives appropriate feedback during the conversation. Traum's system [5] is a military educational agent system in a VR environment: users are trained while they converse with the agent interface. Traum's agent is capable of dialogue planning, so it can lead the conversation with the user into a controlled dialogue. A sketch of the turn-taking bookkeeping such agents perform follows.
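
The conversational functions listed above (turn taking, feedback, repair) are typically implemented as explicit dialogue-state bookkeeping. The Python sketch below is a deliberately minimal illustration assumed for this survey; it is not the architecture of Rea [4] or Traum's system [5]. It tracks who holds the turn, emits backchannel feedback during pauses, and asks a repair question when recognition confidence is low.

```python
class DialogueManager:
    """Minimal conversational-agent bookkeeping: turn taking, feedback,
    and repair. A toy sketch, not the architecture of Rea [4] or [5]."""

    def __init__(self, repair_threshold=0.5):
        self.turn = "user"               # whose turn it is to speak
        self.repair_threshold = repair_threshold

    def on_user_utterance(self, text, confidence):
        """Called when the speech recognizer produces a hypothesis."""
        self.turn = "agent"              # user yielded the turn
        if confidence < self.repair_threshold:
            reply = f"Sorry, did you say '{text}'?"   # conversational repair
        else:
            reply = self.respond(text)
        self.turn = "user"               # agent yields the turn back
        return reply

    def on_user_pause(self):
        """Short pause mid-utterance: give backchannel feedback ('uh-huh',
        a nod) without taking the turn."""
        return "uh-huh" if self.turn == "user" else None

    def respond(self, text):
        """Domain response; a real estate agent like Rea would plan the
        discourse here. Stubbed for the sketch."""
        return f"(agent answers: {text})"

dm = DialogueManager()
print(dm.on_user_utterance("show me the bedroom", confidence=0.9))
print(dm.on_user_utterance("show me the bedroom", confidence=0.3))
print(dm.on_user_pause())
```
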
Other efforts using speech interfaces

Speech interfaces are also a good preparation for invisible and ubiquitous computing environments [9]. In these environments users will no longer interact with computers directly as we do today, and speech will be a natural interface. For collaborative VR environments, speech can be a good channel of communication between the humans in the system. Bowers [3] designed a virtual meeting system that makes it possible for people to meet at a distance; the main problem for this system was controlling speaking turns, because people had difficulty getting feedback from one another. Other research has emphasized the potential of speech interfaces for the elderly and for people with disabilities: Ressler [13] made a VRML computing environment more accessible by extending it with a speech interface.

3. Conclusions

Speech interface design has a long way to go before it can be applied routinely in 3D applications. Besides raising the accuracy of speech recognition, several problems remain to be solved: how to use a speech interface without losing privacy, how to synchronize speech with other recognition techniques, and how to define the relation between the commander and the dialogue agent in a 3D environment. Conversational intelligent agents will be the new interface trend for the next couple of decades, and we need to study more of the basic issues of human conversational mechanisms.

4. References

[1] Resolving Ambiguities of a Gaze and Speech Interface, Qiaohui Zhang, Atsumi Imamiya, Kentaro Go, Xiaoyang Mao, Proceedings of the Eye Tracking Research & Applications Symposium, 2004, pp. 85-92.
[2] A Generic Approach for Interfacing VRML Browsers to Various Input Devices and Creating Customizable 3D Applications, Frank Althoff, Herbert Stocker, Gregor McGlaun, Manfred K. Lang, Proceedings of the Seventh International Conference on 3D Web Technology, 2002, pp. 67-74.
[3] Talk and Embodiment in Collaborative Virtual Environments, John Bowers et al., Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Common Ground, 1996, pp. 58-65.
[4] Embodied Conversational Interface Agents, Justine Cassell, Communications of the ACM, Vol. 43, No. 4, 2000, pp. 70-78.
[5] Embodied Agents for Multi-party Dialogue in Immersive Virtual Worlds, David Traum, Jeff Rickel, Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 2, 2002, pp. 766-773.
[6] Error Recovery in a Blended Style Eye Gaze and Speech Interface, Yeow Kee Tan, Nasser Sherkat, Tony Allen, Proceedings of the 5th International Conference on Multimodal Interfaces, 2003, pp. 196-202.
[7] Multimodal Interfaces That Process What Comes Naturally, Sharon Oviatt, Philip Cohen, Communications of the ACM, Vol. 43, No. 3, 2000, pp. 45-48.

[8] Toward Adaptive Conversational Interfaces: Modeling Speech Convergence with Animated Personas, Sharon Oviatt, Courtney Darves, Rachel Coulston, ACM Transactions on Computer-Human Interaction, Vol. 11, No. 3, 2004, pp. 300-328.
[9] Interfacing with the Invisible Computer, Kasim Rehman, Frank Stajano, George Coulouris, Proceedings of the Second Nordic Conference on Human-Computer Interaction, 2002, pp. 213-216.
[10] Speech and Gesture Multimodal Control of a Whole Earth 3D Visualization Environment, David M. Krum, Olugbenga Omoteso, William Ribarsky, Thad Starner, Larry F. Hodges, Proceedings of the Symposium on Data Visualization 2002, 2002, pp. 195-200.
[11] Voice as Sound: Using Non-verbal Voice Input for Interactive Control, Takeo Igarashi, John F. Hughes, Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, 2001, pp. 155-156.
[12] The Magic Wand, Jan Ciger, Mario Gutierrez, Frederic Vexo, Daniel Thalmann, Proceedings of the 19th Spring Conference on Computer Graphics, 2003, pp. 119-124.
[13] Making VRML Accessible for People with Disabilities, Sandy Ressler, Qiming Wang, Proceedings of ACM ASSETS '98, 1998.
[14] Experimental Evaluation of Vision and Speech Based Multimodal Interfaces, Emilio Schapira, Rajeev Sharma, Proceedings of the 2001 Workshop on Perceptive User Interfaces, 2001, pp. 1-9.
[15] A Multiple-Application Conversational Agent, Steven Ross, Elizabeth Brownholtz, Robert Armes, Proceedings of the 9th International Conference on Intelligent User Interfaces, 2004, pp. 319-321.
[16] The Future of Speech and Audio in the Interface, Barry Arons, Elizabeth D. Mynatt, Conference Companion on Human Factors in Computing Systems, 1994, p. 465.