9th IEEE International Workshop on Managing Ubiquitous Communications and Services 2012, Lugano (19 March 2012)

i*Chameleon: A Unified Web Service Framework for Integrating Multimodal Interaction Devices

Kenneth W.K. Lo, Wai Wa Tang, Hong Va Leong
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
e-mail: {cskenneth, cswwtang, cshleong}@comp.polyu.edu.hk

Alvin Chan, Stephen Chan, Grace Ngai
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
e-mail: {csschan, cstschan, csgngai}@comp.polyu.edu.hk


Abstract— Multimodal inputs are becoming increasingly popular in supporting pervasive applications, due to the demand for highly responsive and intuitive human control interfaces beyond the traditional keyboard and mouse. However, the heterogeneous nature of novel multimodal input devices and the tight coupling between input devices and applications complicate their deployment, rendering their dynamic integration into the intended applications rather difficult. i*Chameleon exploits device abstraction in a web services-based framework to alleviate these problems. Developers can dynamically register new devices with the i*Chameleon framework. They can also map specific device inputs to keyboard and mouse events efficiently. A number of input modalities such as tangible devices, speech, and finger gestures have been implemented to validate the feasibility of the i*Chameleon framework in supporting multimodal input for pervasive applications.

Keywords: multimodal; human computer interaction; middleware; web services

I. INTRODUCTION

Mark Weiser coined the term pervasive computing, the third wave in computing, describing the creation of environments with communication capability, yet gracefully integrated with human users [1]. Riding on the success of mature distributed systems and well-developed hardware technologies, users nowadays consider usability and functionality to be equally important. The proliferation of smart and natural-interaction devices, such as motion capture cameras and speech recognition engines, renders the integration of different modalities into one single application a concrete reality. Users prefer a combination of different modalities rather than just a single one when interacting with computer systems. With multimodality, the reliability of human computer interaction has been found to improve [2].

The term "multimodal" has been used in many contexts across several research areas [3, 4]. The purpose of multimodal human computer interaction is to study how computer technology can be made more user-friendly, through the interplay among three aspects: user, system and interaction [5]. Exploiting a diversity of networking technologies and multimodal interface techniques, cooperative interaction on pervasive computing systems has become a new trend in human computer interaction. Many well-known multimodal systems like ICARE [6] and OpenInterface [7] have been developed. OpenInterface is a component-based tool for rapidly deploying multimodal input for different kinds of applications. The framework includes a kernel, a builder and a design tool. After coding their own multimodal application and device drivers, developers port and deploy them on the platform. However, it does not support plug-and-play and cannot readily support collaborative interaction over the Internet. Promising and intuitive interaction methods with heterogeneous input devices, as well as interaction through the web browser, have been developed by the W3C. However, there remains much room for the development of multimodal interaction for pervasive systems, and a unified multimodal framework is much needed.

This paper presents the i*Chameleon framework, which allows users to customize their interaction and is capable of adapting to changes in the environment as well as to user requirements, according to the features of applications or the needs of different users, like a chameleon. Since this type of dynamic adaptation tremendously increases the complexity of the interaction design, a middleware that mitigates the tight coupling between applications and specific input devices offers a practical solution. It enables the abstraction of hardware devices and communication technologies, and provides a well-defined interface to interact with other applications or widgets [8]. i*Chameleon focuses on the design considerations of multimodality for pervasive computing. It aims to provide a web-service framework with a separate analytic co-processor for collaborative multimodal interaction, offering a standard and semantic interface that facilitates the integration of new widgets with a wide variety of computer applications.

This paper is organized as follows: Section II highlights the desirable system features and requirements. Section III presents the framework workflow and architecture, followed by the system implementation and evaluation in Section IV. Brief concluding remarks are given in Section V.


II. SYSTEM REQUIREMENTS

The system design for supporting non-traditional interfaces, such as a multimodal interaction framework, is much more complex than that for traditional interfaces. Design efforts tend to be focused on implementing particular interface devices to fulfill specific requirements. Hence, the product interface is usually not flexible enough to be reused on new applications and new widgets, or to replace existing widgets, even if the interface modality remains unchanged. When a new interaction is added to an existing application, it is difficult to adopt that interaction without modifying or adding functionality on the controller side [9], thus dictating the recoding of the application when the input hardware is changed. The main cause of such difficulties is the tight coupling between input devices and the underlying applications. The problem of tight coupling can be eased by designing and implementing generic interfaces based on the abstraction of devices. In addition, the growing maturity of web services technology provides a way to dynamically discover and bind devices of different modalities. We highlight the desirable system features and requirements for a web service multimodal framework. Based on these requirements, we propose our solution.
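As a minimal illustration of what "generic interfaces based on the abstraction of devices" can mean in practice, an application would program against an abstract input source rather than a concrete driver. The interfaces below are our own sketch, not part of i*Chameleon's published API.

// Illustrative device abstraction: the application consumes normalized input events
// and never touches a concrete driver API.
interface InputSource {
    String modality();                        // e.g. "VOICE", "HAND_GESTURE", "TILT"
    void setListener(InputListener listener); // deliver normalized events to the application
}

interface InputListener {
    void onInput(String modality, Object normalizedData);
}

// Swapping a mouse for a gesture camera then means registering a different InputSource,
// not recoding the application.

Under this reading, the middleware's job is to supply such abstractions and to bind concrete devices to them at run time.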

F. Limited Computing Resources Consumption
Analyzing input signals for a real-time multimodal application often requires much computational power. In addition, combining inputs of different modalities may impose a certain degree of hardware requirements. For the user interaction to be feasible and usable, the framework should not exert too much demand on processor or memory resources.

III. FRAMEWORK WORKFLOW AND ARCHITECTURE

Figure 1 shows the framework architecture. We developed a co-processor, manifested as the Cerebrum, to offload processing tasks from the application server, the Cerebellum. Linked with Cerebellum, Cerebrum receives and processes visual and auditory information: it receives raw data from Cerebellum, analyzes and recognizes events, and notifies Cerebellum of the intended action, i.e., the command for the target application. Communication between Cerebellum and Cerebrum is handled by a Nerve System implemented via object serialization, which allows us to create reusable objects and transfer them through standard sockets with a common modeling definition. The framework manages sensory input and application input separately in order to allow each widget to be reused and extended [13]. A modeling and definition language has been developed for i*Chameleon to support the sharing of information among the different components.
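The paper does not give the exact layout of the Nerve Signal Object or the socket protocol. The following is a minimal sketch of how such a serializable object could be pushed through the Nerve System channel; all field and class names are illustrative assumptions, not the authors' actual definitions.

import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

// Hypothetical generalized data object carried by the Nerve System.
public class NerveSignal implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String widgetId;  // identifier returned when the widget was registered
    public final String modality;  // e.g. "MULTI_POINT", "VOICE", "KEY", "TANGIBLE"
    public final byte[] payload;   // device-dependent raw data (points, audio frame, key code, ...)
    public final long timestamp;   // capture time in milliseconds

    public NerveSignal(String widgetId, String modality, byte[] payload, long timestamp) {
        this.widgetId = widgetId;
        this.modality = modality;
        this.payload = payload;
        this.timestamp = timestamp;
    }

    // Serialize the object and push it through the Nerve System channel (standard Java serialization).
    public static void send(NerveSignal signal, Socket nerveChannel) throws java.io.IOException {
        ObjectOutputStream out = new ObjectOutputStream(nerveChannel.getOutputStream());
        out.writeObject(signal);
        out.flush();
    }
}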

A. Heterogeneity
One key theme in multimodal human computer interaction is the integration of different modalities into a single system [10]. A multimodal framework should provide a standardized protocol to utilize a variety of widgets, ranging from the traditional keyboard and mouse, game controllers and motion capture cameras to future devices.

B. Pervasion
As pervasive computing and ubiquitous devices continue to permeate and integrate into daily life, human-computer interaction is no longer tethered by wires. Ubiquitous devices could interact with other devices to form novel control interfaces, creating a complex interaction network that can be seamlessly integrated and coordinated. Riding on advances in network technologies, this dramatically extends the possibilities of collaborative multimodal interaction in pervasive environments at home, at work or on the go. Dynamic adaptability to new widgets, a low level of interruption, low system downtime, and the transfer of signals in standard formats become important.

C. Scalability
As pervasive computing continues to gain prominence, the variety in types and number of widgets to be supported will continue to grow. The framework should be able to support a large number of widgets of the same modality and/or different modalities, to properly support distributed and ubiquitous components and user collaboration.

D. Extensibility and Flexibility
Regional factors, time constraints, personal preferences, etc., affect the way users interact with applications. The framework should allow developers to utilize the interaction implementation that best fits. It should also be domain-independent, minimizing the changes required to adapt to another domain [11].

Figure 1. i*Chameleon includes two core sub-systems: a web services application server and a co-processor. Using the brain as a design metaphor, the Application Server plays the role of the Cerebellum, which is mainly responsible for motor control. It receives signals from the Sensory Input Widgets, communicates with the Cerebrum and coordinates with the Motor Neurons.

A. Framework Workflow
In i*Chameleon, sensory input devices are represented by widgets, with each widget corresponding to a particular input in a web service. Figure 2 illustrates how an event is triggered and recognized and an action executed. An application input is fired by one or more events, while an event is triggered by an input widget. A Sensory Input Widget captures raw data and notifies Cerebellum. The data, together with the device information, is passed to the Sensory System, which translates the raw data into a generalized data object (a Nerve Signal Object storing information related to the event) and sends the object to Cerebrum for analysis.
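As a concrete illustration of this capture-and-forward step, the sketch below shows a hypothetical finger-tracking widget wrapping captured coordinates and handing them to the Sensory System. The SensorySystemPort interface stands in for whatever SOAP client stub a deployment actually generates; its method name and the encoding are assumptions.

import java.util.List;

// Stand-in for the Sensory System's web service client stub (assumed API).
interface SensorySystemPort {
    void reportEvent(String widgetId, String modality, byte[] payload);
}

// Hypothetical sensory input widget: a camera-based finger tracker.
class FingerTrackerWidget {
    private final String widgetId;            // assigned by Cerebellum at registration time
    private final SensorySystemPort sensory;  // web service proxy

    FingerTrackerWidget(String widgetId, SensorySystemPort sensory) {
        this.widgetId = widgetId;
        this.sensory = sensory;
    }

    // Called by the capture loop with the latest 2D coordinates.
    void onFrame(List<float[]> points) {
        byte[] payload = encode(points);                       // device-dependent encoding
        sensory.reportEvent(widgetId, "MULTI_POINT", payload); // Sensory System forwards it to Cerebrum
    }

    private byte[] encode(List<float[]> points) {
        // trivial placeholder encoding: "x,y;x,y;..."
        StringBuilder sb = new StringBuilder();
        for (float[] p : points) sb.append(p[0]).append(',').append(p[1]).append(';');
        return sb.toString().getBytes();
    }
}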

E. Deployment and Configuration
The framework should be based on industry-standard and hardware-independent protocols, so that developers can deploy the framework to the application server easily. The environment should allow even non-programmers to create widgets and customize the configuration, without prior knowledge of the background process, through the use of a drag-and-drop graphical user interface [12].




Figure 2. i*Chameleon workflow. Hand gesture processing starts with the camera capturing coordinates. The coordinates are wrapped in a generalized point object and sent to the Sensory System via web services. Cerebrum translates the raw data into gestures and notifies the Motor System to trigger the corresponding application.

Figure 3. Two sensory receptors associated with Cerebellum, each with a specific identifier; each receptor hosts widgets of different modalities.

Cerebrum is responsible for listening to the Nerve Signal event handler, receiving a signal, and extracting the raw data from the signal. The representation of the data is device-dependent. The translated object is passed through a set of data pre-processing filters, such as a noise filter or a normalizing filter, to ensure that the data is valid and accurate. The pre-processed data is then processed by the proper algorithm for the modality, e.g., applying a gesture recognition algorithm to a set of points. An analyzer eventually parses the input data and maps it to a command based on a set of grammar rules, paving the way to trigger the corresponding action (the command for the target application). Cerebrum notifies Cerebellum of the action intended by the user. The Motor System in Cerebellum gathers the intended actions and notifies the Application Input Widgets to execute the commands by invoking the corresponding web services.
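The paper does not show the filter or analyzer interfaces; a minimal sketch of the described pre-processing chain, assuming hypothetical Filter and Analyzer types, could look like this.

import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing filter: consumes and returns modality-specific data.
interface Filter<T> {
    T apply(T data);
}

// Hypothetical analyzer: maps cleaned data to a recognized event name (null if nothing matched).
interface Analyzer<T> {
    String recognize(T data);
}

// Sketch of Cerebrum's per-modality processing pipeline.
class ModalityPipeline<T> {
    private final List<Filter<T>> filters = new ArrayList<>(); // e.g. noise filter, normalizing filter
    private final Analyzer<T> analyzer;                        // e.g. gesture recognizer over a point set

    ModalityPipeline(Analyzer<T> analyzer) {
        this.analyzer = analyzer;
    }

    void addFilter(Filter<T> f) {
        filters.add(f);
    }

    // Run the raw, device-dependent data through every filter, then recognize the event.
    String process(T raw) {
        T data = raw;
        for (Filter<T> f : filters) {
            data = f.apply(data);
        }
        return analyzer.recognize(data);
    }
}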

The Sensory System is responsible for handling all input signals from input widgets and all communication using SOAP. It provides two important functions. First, it manages all Sensory Input Widgets at the workstation level. Before adding any input widgets, each workstation declares itself as a receptor for the incoming hardware signals. A receptor is a client of the Sensory System. After declaration, Sensory Input Widgets can be added to i*Chameleon, which returns a widget ID to the client. Hence, Cerebellum manages the list of connected receptors, which enables i*Chameleon to distinguish the senders of events. Second, all input events from the Sensory Receptors are handled by the Sensory System. Each Sensory Input Widget can be associated with the Sensory System dynamically by invoking web services, and is given an identifier as shown in Figure 3. This approach has the advantage of allowing new devices to be added without stopping the multimodal application. For example, after a receptor is initiated, a developer can append a new speech recognizer. Thereafter, once incoming verbal input is detected and recognized, it can trigger an event to notify the Sensory System.

One of the biggest problems faced with multimodal interaction is that hardware devices often come with their own language-dependent or platform-dependent libraries. This makes it difficult to integrate multiple modes of interaction into the same application. Using web services solves the problem of incompatibility between the programming languages required by the various hardware devices. Freed from programming language constraints, developers only need to call the corresponding web services. A new input widget can be added as follows:

Step 1: Declare a workstation as a sensory receptor.
Step 2: Append the new sensory input widget.
Step 3: Notify the Sensory System upon event arrival.
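The three registration steps above map naturally onto three web service operations. The sketch below assumes a hypothetical ReceptorRegistrationPort stub; the actual WSDL operation names are not given in the paper.

// Assumed client-side view of the Sensory System registration operations.
interface ReceptorRegistrationPort {
    String declareReceptor(String workstationName);                // Step 1: returns a receptor ID
    String addSensoryWidget(String receptorId, String widgetType); // Step 2: returns a widget ID
    void notifyEvent(String widgetId, byte[] rawData);             // Step 3: raise an event on arrival
}

class SpeechWidgetBootstrap {
    // Example of adding a speech recognizer to a running i*Chameleon deployment.
    static String register(ReceptorRegistrationPort port) {
        String receptorId = port.declareReceptor("lab-workstation-01"); // Step 1
        String widgetId = port.addSensoryWidget(receptorId, "SPEECH");  // Step 2
        return widgetId; // later, recognized words are pushed via notifyEvent(widgetId, ...) (Step 3)
    }
}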

B. Cerebellum (Application Server)
The i*Chameleon framework is implemented based on web services in order to satisfy the requirements, especially flexibility and heterogeneity. A web service is a self-contained, self-describing and modular application that can be published, located, and invoked across the web [14]. Together with contemporary networking infrastructure, web services provide the highest compatibility with different widgets. Once i*Chameleon is deployed, Sensory Input Widgets or Application Input Widgets can discover and invoke the deployed service dynamically.

Web services protocols such as UDDI, WSDL and SOAP are platform-independent, conducive to heterogeneity, and supported by all up-to-date IT infrastructures, with libraries for nearly all major programming languages. Developers interact with SOAP, a protocol specification for exchanging structured information by passing XML-encoded data, bound to HTTP as the underlying communication protocol, from one endpoint to another. It uses XML messaging over plain HTTP, thus avoiding networking issues such as firewall traversal and allowing remote procedure calls via simple request/reply exchanges. These properties make web services one of the best communication mechanisms between heterogeneous devices and frameworks. Figure 3 shows the detailed view of the i*Chameleon Cerebellum, consisting of three modules: the Sensory System, the Nerve System and the Motor System.
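The paper does not state which SOAP stack backs the Cerebellum. As one possible reading, the sketch below publishes a Sensory System endpoint with the standard JAX-WS annotations; the service name, operations and address are placeholders.

import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.xml.ws.Endpoint;

// One way the Sensory System could be exposed as a SOAP web service (illustrative only).
@WebService(serviceName = "SensorySystemService")
public class SensorySystemService {

    @WebMethod
    public String declareReceptor(String workstationName) {
        // register the workstation and hand back an identifier
        return java.util.UUID.randomUUID().toString();
    }

    @WebMethod
    public String addSensoryWidget(String receptorId, String widgetType) {
        // associate a new Sensory Input Widget with the receptor at run time
        return receptorId + "/" + widgetType + "/" + System.nanoTime();
    }

    public static void main(String[] args) {
        // publish over plain HTTP so heterogeneous widgets can discover and invoke it
        Endpoint.publish("http://localhost:8080/ichameleon/sensory", new SensorySystemService());
    }
}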


C. Cerebrum (Co-processor)
The co-processor performs functions analogous to those of the cerebrum in the brain, i.e., processing sensory data. It handles all the computation tasks involved in translating data from receptors into meaningful commands, such as mouse moves or key clicks. Cerebrum is organized into three layers: the Hardware Abstraction Layer, the Modal Layer and the Command Layer. Each layer is independent of the other two; hence, changing one layer will not affect the others.

The Hardware Abstraction Layer is responsible for handling the data communication between Cerebellum and Cerebrum, including object creation and passing objects to the other layers. Based on the Nerve Signal Object received, this layer creates the related devices or notifies the corresponding listeners. In addition, if the received signal object describes a newly arrived event, this layer notifies the appropriate device. The device then creates specific data objects and transfers them to the Modal Layer for analysis. Within this layer, data integrity and accuracy are ensured by passing the data through appropriate filters, such as a noise filter, a point-tracking filter or a transformation filter. i*Chameleon defines a set of widget classes that provide the structure of the hardware configuration. A widget in i*Chameleon is a hardware abstraction or category. According to the nature of the sensory input, widgets are classified into four categories, as shown in Figure 4, with associated examples depicted in Table 1.
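A minimal sketch of a widget class hierarchy mirroring the four categories of Figure 4 and Table 1 is given below. The paper names only the categories; the class names and fields here are assumptions made for illustration.

// Hypothetical widget hierarchy mirroring the four categories in Figure 4 / Table 1.
abstract class AbstractWidget {
    final String id;       // identifier assigned by the Sensory System
    final String modality; // modality triggered in the Modal Layer (see Table 1)
    AbstractWidget(String id, String modality) {
        this.id = id;
        this.modality = modality;
    }
}

class MultiPointWidget extends AbstractWidget {        // motion capture camera, multi-touch surface
    MultiPointWidget(String id) { super(id, "Hand Gesture"); }
}

class SpeechRecognitionWidget extends AbstractWidget { // e.g. a SAPI-based recognizer
    SpeechRecognitionWidget(String id) { super(id, "Voice"); }
}

class KeyWidget extends AbstractWidget {               // physical or virtual key presses
    KeyWidget(String id) { super(id, "Key"); }
}

class TangibleWidget extends AbstractWidget {          // orientation/tilt devices such as the WiiMote
    TangibleWidget(String id) { super(id, "Tilt"); }
}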


The Nerve System is responsible for transmitting the Nerve Signal Object between Cerebellum and Cerebrum. It is implemented with object serialization, allowing objects to be transmitted over the network. Cerebrum connects by making a connection request to Cerebellum, establishing a communication channel for the transfer of Nerve Signals.

The Motor System is responsible for handling the analyzed events from Cerebrum and forwarding the commands to the corresponding Application Input Widgets for execution. The process orchestration is similar to that for Sensory Input Widgets. The Motor System thus acts as the communicator between Cerebellum and the command executors (motor neurons). The design is inspired by an observation from biology, where motor neurons directly or indirectly control muscles. Since different muscles function differently, different motor neurons are needed. i*Chameleon is designed in a similar fashion. When the Motor System receives a command from Cerebrum, it triggers a particular Application Input Widget on a specific workstation to execute the command. Command executors are platform-dependent, and each workstation owns its executors. When an executor receives a notification from Cerebellum, it triggers the Application Input Widgets to perform the action.

Using the web services declared in the Motor System, developers can customize or create commands dynamically. A command is responsible for storing the execution conditions, which include events, actions (which are inputs to user programs, e.g., games) and time constraints. An event, which acts as the invoker in the command design pattern, is created by the end user and describes when the associated action should be triggered. Actions are the instructions that need to be executed when a command is triggered. For example, an action can be "Open File Explorer", "Make the WiiMote Vibrate" or "Right Click the Mouse". New Application Input Widgets can be added as follows:

Step 1: Declare a workstation as a motor neuron.
Step 2: Append the new output widgets to the motor neuron.
Step 3: Define commands and add them to the motor neuron.
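The dispatch path from the Motor System to a platform-dependent executor can be sketched as follows; the interface and method names are our own assumptions, not i*Chameleon's published API.

// Platform-dependent command executor owned by a workstation ("motor neuron").
interface ApplicationInputWidget {
    void perform(String action); // e.g. "Open File Explorer", "Right Click the Mouse"
}

// Sketch of the Motor System forwarding an analyzed command to the right executor.
class MotorSystem {
    private final java.util.Map<String, ApplicationInputWidget> executors = new java.util.HashMap<>();

    // Steps 1 and 2: a workstation declares itself as a motor neuron and appends an output widget.
    void registerExecutor(String workstationId, ApplicationInputWidget widget) {
        executors.put(workstationId, widget);
    }

    // Called when Cerebrum reports the action intended by the user.
    void dispatch(String workstationId, String action) {
        ApplicationInputWidget widget = executors.get(workstationId);
        if (widget != null) {
            widget.perform(action); // the command defined in Step 3 is executed on that workstation
        }
    }
}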

Figure 4. Devices defined under the Hardware Abstraction Layer.

Widget               | Nerve Signal Object | Modality
Speech Recognition   | Recognized Word     | Voice
Multi-Point          | A Set of Points     | Hand Gesture
Key                  | Pressed Key         | Key
Tangible             | Orientation         | Tilt

Table 1: Devices and associated modality

The most complex part of Cerebrum dwells in the Modal Layer, which accepts packaged data from the Hardware Abstraction Layer, analyzes the input and recognizes events. These events are then passed to the Command Layer, which maps them to corresponding actions. The Modal Layer can be subdivided into two major components. The first component consists of a number of analyzers, which handle data processing and computation in order to translate the raw resources into tokens according to pre-defined rules. The second component is a parser that parses the tokens into parse trees of different modalities, such as gesture, voice or others, using an interaction grammar, and finally maps the parse trees to events. Events are then passed to the Command Layer to be executed. Due to the layered design, developers can attach or detach interaction devices without any modification to the other layers. When data is received from the Hardware Abstraction Layer, the relevant modality is triggered. For example, a series of 2D points triggers the Gesture modality. As we defined four categories of widgets, four corresponding modalities are implemented, as illustrated in Table 1 and Figure 5.
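The analyzers and the interaction grammar are not specified in the paper. The sketch below shows one plausible shape for the Gesture modality, where an analyzer turns a point series into direction tokens and a tiny grammar maps token sequences to events; all rule contents are invented for illustration.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Modal Layer for the Gesture modality.
class GestureModality {

    // Analyzer: translate raw 2D points into direction tokens by pre-defined rules.
    static List<String> tokenize(List<float[]> points) {
        java.util.ArrayList<String> tokens = new java.util.ArrayList<>();
        for (int i = 1; i < points.size(); i++) {
            float dx = points.get(i)[0] - points.get(i - 1)[0];
            float dy = points.get(i)[1] - points.get(i - 1)[1];
            tokens.add(Math.abs(dx) > Math.abs(dy) ? (dx > 0 ? "RIGHT" : "LEFT")
                                                   : (dy > 0 ? "DOWN" : "UP"));
        }
        return tokens;
    }

    // Parser: an invented interaction grammar mapping token sequences to events.
    static final Map<List<String>, String> GRAMMAR = new HashMap<>();
    static {
        GRAMMAR.put(Arrays.asList("RIGHT", "RIGHT"), "SWIPE_RIGHT");
        GRAMMAR.put(Arrays.asList("DOWN", "UP"), "CHECK_MARK");
    }

    static String parse(List<String> tokens) {
        String event = GRAMMAR.get(tokens);
        return event != null ? event : "UNRECOGNIZED"; // the event is then passed to the Command Layer
    }
}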

Figure 5. The Modal Layer.

The final layer is the Command Layer. As the name implies, this layer is responsible for determining which command needs to be executed. After the events are recognized by the Modal Layer, these events still appear independent of one another. It is only known that at a certain moment, say t1, two events, e1 and e2, were triggered, but it is unclear which command needs to be executed. The Command Layer gathers the execution conditions of all application inputs. Once an event is triggered, the Command Layer consolidates it with all previously triggered events to determine whether an associated command should be executed. It then sends the interpreted signal for the associated command back to the Motor System in Cerebellum.

To illustrate the operation of i*Chameleon, suppose a multi-touch table-top screen is used to display and manipulate a digital mapping application. A user puts four fingers on the table and moves them together to indicate a zoom-in action. The machine managing the table sends the coordinates of the corresponding points to i*Chameleon via web services. The data are handed over to the Hardware Abstraction Layer for preprocessing and then passed to the Modal Layer for analysis. The Modal Layer interprets the incoming points as a Gesture, which is passed to the Command Layer, where it is mapped to the appropriate command for notification to the output widgets through the Motor System.
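A minimal sketch of the consolidation step follows, assuming hypothetical rule and event types; the real execution-condition format is not given in the paper.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Sketch of the Command Layer: consolidate recently triggered events and
// decide whether a command's execution condition is now satisfied.
class CommandLayer {

    // Execution condition: a set of event names that must co-occur within a time window.
    static class CommandRule {
        final String command;            // e.g. "ZOOM_IN"
        final Set<String> requiredEvents;
        final long windowMillis;         // time constraint
        CommandRule(String command, Set<String> requiredEvents, long windowMillis) {
            this.command = command;
            this.requiredEvents = requiredEvents;
            this.windowMillis = windowMillis;
        }
    }

    static class TriggeredEvent {
        final String name;
        final long at;
        TriggeredEvent(String name, long at) { this.name = name; this.at = at; }
    }

    private final Deque<TriggeredEvent> recent = new ArrayDeque<>();

    // Returns the command to forward to the Motor System, or null if the condition is not met yet.
    String onEvent(TriggeredEvent e, CommandRule rule) {
        recent.addLast(e);
        // drop events that fell out of the rule's time window
        while (!recent.isEmpty() && e.at - recent.peekFirst().at > rule.windowMillis) {
            recent.removeFirst();
        }
        java.util.Set<String> seen = new java.util.HashSet<>();
        for (TriggeredEvent t : recent) seen.add(t.name);
        return seen.containsAll(rule.requiredEvents) ? rule.command : null;
    }
}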

IV. IMPLEMENTATION AND EVALUATION

Powered by the i*Chameleon framework described in the previous sections, we evaluate the performance of i*Chameleon and verify its usability. Performance-wise, we first overload i*Chameleon to measure the maximum throughput [15]. Second, we simulate the generic interaction process to measure the response time from sensory input to application input (i.e., command). Third, we measure the system resource utilization of the client and the co-processor. Usability-wise, we evaluate the system in terms of the actual implementation effort required to add a new multi-touch surface, deployed in two different applications, namely Robot Car and Google Earth, as illustrated in Figures 6 and 7. In our evaluation, the web services and co-processor portions of the framework were deployed on a desktop with an Intel i3 3.07GHz processor and 4GB RAM, while the client ran on a laptop with an Intel Core 2 Duo 1.83GHz processor and 3GB RAM.

A. Performance Evaluation
The measurement of the maximum throughput examines the ability of i*Chameleon to handle significantly heavier workloads than normally required [15]. In this experiment, we simulated one sensory input widget (a tangible widget). Since a normal tangible widget can trigger five input signals (up, down, left, right and face up), we randomly and continuously generated one of these five input signals to the application server (Cerebellum). As a result, 35,218 events were created within 60 seconds and the throughput reached a maximum of 587 fps. Considering that the normal rate of a sensory input widget is 30 fps, this implies that the application server can support approximately 20 widgets simultaneously. In multimodal interaction, each person typically uses no more than three widgets, so the framework can support a collaborative working environment accommodating six to seven different users.

Second, we simulated the interaction in the same way as in the previous experiment, at the normal frame rate (30 fps), and measured the response time between the generation of the raw sensory input and the receipt of the executable application input. Although the response time fluctuates, it stays within 120 ms, which is an acceptable feedback time for real-time interaction. The mean response time is around an excellent value of 50 ms.

Finally, system resource utilization was evaluated. The client simulates the generation of tangible events at a rate of 300 fps, which translates into a frame rate of 30 for 10 different widgets. Our experience shows that a user usually uses around three modalities, so this frame rate simulates the load generated by three concurrent users. The memory usage of the client machine rises gradually from 5 MB to 7.5 MB, and the CPU usage stays steadily between 2% and 8%. Therefore, the system resources used by i*Chameleon cause only a slight increase in loading on the client system, which is not significant, especially when compared to other commonly used applications. For the server machine, the memory usage stays constant at around 2 to 3 MB, while the CPU usage also stays between 2% and 8%.

B. Usability Evaluation
As i*Chameleon was designed to facilitate the development and adoption of multimodal interaction, its usability is equally important. We evaluate usability in terms of its effectiveness at integrating various devices with different interaction modalities.

The i*Chameleon framework makes the development of the multi-touch surface much easier by standardizing the captured points coming from a webcam into the same format as those coming from a motion capture camera. This allows many of the semantics already developed for the gesture-recognition system to be reused for the multi-touch surface. The mapping of these semantics to system input events is also simplified, which readily supports the mapping of these events to application inputs. The development of the multi-touch surface interface was essentially reduced to the addition of another input device to the system. This was achieved using 30 extra lines of programming code, a large reduction of the effort that it would have taken otherwise. Without i*Chameleon, developers would be required to write approximately 500 lines of code to achieve the same result; in addition, if the libraries of the hardware devices are in different programming languages, even more effort is needed. This scenario also demonstrates the extensibility, flexibility and reusability of our i*Chameleon framework.

Table 2 presents a list of devices and interaction modalities that have so far been integrated into i*Chameleon. These deployments include gesture controls (Figure 6), voice controls and tangible controls (Figure 7).

Table 2: Devices currently integrated into i*Chameleon
Multi-Point Devices
1. Optitrack infra-red motion capture camera for gesture control
2. Webcam with ITU Gaze Track image processing for retina tracking
3. Infra-red webcam and TUIO library for multi-touch surface
4. Apple iPhone and iPad multi-touch surface for gesture control
5. Nintendo Wii system for position tracking
Voice Devices
1. Microsoft SAPI voice recognition
Tangible Devices
1. Nintendo Wii controller for tilt and motion tracking
2. Apple iPhone and iPad for tilt and motion tracking

Figure 6. Tangible interaction with i*Chameleon: controlling robots in a competition with a Nintendo Wii controller (left, circled) and an iPhone (right, circled).

Figure 7. Controlling a map-viewing application using gestures and the Nintendo Wii. i*Chameleon is used to map gestures captured by a motion capture camera to existing keyboard and mouse commands.


V. CONCLUSION

We have presented i*Chameleon, a framework designed to allow developers to create and integrate different types of sensory input widgets from distributed systems with little effort, in order to create multimodal interaction for pervasive applications. Preliminary performance evaluation verifies its effectiveness. It enables remote control and collaborative work among a group of people. We anticipate that our work will spark further research in multidisciplinary areas such as software engineering, pervasive computing and social computing, with a better ethnographic understanding of human computer interaction concepts.


VI. ACKNOWLEDGEMENT

The work reported in this paper was partly supported by The Hong Kong Polytechnic University Research Grant G-U757.



REFERENCES
[1] M. Satyanarayanan, "Pervasive computing: vision and challenges," IEEE Personal Communications, 8(4):10-17, 2001.
[2] A.G. Hauptmann, "Speech and gestures for graphic image manipulation," in Proc. ACM CHI '89, pp. 241-245, 1989.
[3] N.O. Bernsen, "Foundations of multimodal representations: a taxonomy of representational modalities," Interacting with Computers, 6(4):347-371, 1994.
[4] N.O. Bernsen, "Defining a taxonomy of output modalities from an HCI perspective," Computer Standards & Interfaces, 18(6-7):537-553, 1997.
[5] A. Jaimes and N. Sebe, "Multimodal human-computer interaction: a survey," Computer Vision and Image Understanding, 108(1-2):116-134, 2007.
[6] S. Oviatt, "Advances in robust multimodal interface design," IEEE Computer Graphics and Applications, 23(5):62-68, 2003.
[7] J.L. Lawson, A.-A. Al-Akkad, J. Vanderdonckt, and B.M. Macq, "An open source workbench for prototyping multimodal interactions based on off-the-shelf heterogeneous components," in Proc. ACM SIGCHI Symposium on Engineering Interactive Computing Systems, pp. 245-254, 2009.
[8] A. Herzog, D. Jacobi, and A. Buchmann, "A3ME - an agent-based middleware approach for mixed mode environments," in Proc. Mobile Ubiquitous Computing, Systems, Services and Technologies, pp. 191-196, 2008.
[9] N. Kobayashi, E. Tokunaga, H. Kimura, Y. Hirakawa, M. Ayabe, and T. Nakajima, "An input widget framework for multi-modal and multi-device environments," in Proc. Workshop on Software Technologies for Future Embedded and Ubiquitous Systems, pp. 63-70, 2005.
[10] S.L. Oviatt, A. De Angeli, and K. Kuhn, "Integration and synchronization of input modes during multimodal human-computer interaction," in Proc. ACM CHI '97, pp. 415-422, 1997.
[11] A. Costa Pereira, F. Hartmann, and K. Kadner, "A distributed staged architecture for multimodal applications," in Proc. European Conference on Software Architecture, pp. 195-206, 2007.
[12] K. Henricksen, J. Indulska, T. McFadden, and S. Balasubramaniam, "Middleware for distributed context-aware systems," in Proc. CoopIS, DOA, and ODBASE, LNCS, pp. 846-863, 2005.
[13] V. Fernandes, T. Guerreiro, B. Araújo, J.A. Jorge, and J. Pereira, "Extensible middleware framework for multimodal interfaces in distributed environments," in Proc. International Conference on Multimodal Interfaces, ACM, pp. 216-219, 2007.
[14] P. Avgeriou and U. Zdun, "Architectural patterns revisited – a pattern language," in Proc. European Conference on Pattern Languages of Programs, pp. 431-470, 2005.
[15] E.J. Weyuker and F.I. Vokolos, "Experience with performance testing of software systems: issues, an approach, and case study," IEEE Transactions on Software Engineering, 26(12):1147-1156, 2000.