An Evaluation Framework for Assessing and Optimizing Multimodal Fusion Engines Performance

Pedro Feiteira and Carlos Duarte
LaSIGE, University of Lisbon
Campo Grande, Edifício C6, Piso 3, 1749-016 Lisboa, Portugal
[email protected], [email protected]
Abstract—The development of interactive systems is increasingly focused on adding new features and capabilities, encompassing, for example, new input devices and ways of interacting. Some applications make use of different modalities for both input and output, which adds considerable complexity to such systems. Because a high number of input modalities and devices may be available, the task of combining all the information conveyed by users through these sources becomes critical and troublesome. This process is commonly referred to as multimodal fusion, and is performed by fusion engines, components of multimodal systems whose purpose is to receive multiple streams of input, combine them, and reach an interpretation of user intent. Evaluating the efficiency of this process requires a considerable amount of effort, due to all the variables involved. In this paper we present an evaluation framework and methodology aimed at the assessment and optimization of fusion engine performance. We begin by relating our work to project GUIDE (www.guide-project.eu), explaining how fusion is achieved and how information about user and context is used for adaptation. Subsequent sections discuss the proposed framework and how it enables developers to assess and improve the performance of their components. The article concludes by showing results that demonstrate the benefits of using such an approach and the impact it can have on the development of fusion engines.

Keywords: performance evaluation; multimodal fusion; sensor fusion; fusion engine; optimization
I. INTRODUCTION

Nowadays, multimodal systems are becoming increasingly common, enabling the use of modalities such as speech, gestures, or even recognition of full-body motion. One group of users that can benefit from such a wide number of available choices is the elderly, who for the most part have some kind of physical or cognitive impairment [1], [2]. By having different means of providing input at their disposal, elderly users can pick whichever modality is most suitable or comfortable for them, creating an enjoyable interaction experience. Combining input data is a delicate task, not only because of the distinct modality recognizers that may be available, but also due to the quantity of information about user and context that can be used to change the system's outcomes.
On one hand, users can have specific particularities concerning their limitations and preferences. On the other hand, their surrounding context can also severely impact the results of the interaction, for instance by compromising the efficiency of a speech recognizer due to loud background noise while the user is trying to speak. These types of situations can become critical when dealing with adaptive systems that have to constantly capture and process new events. GUIDE is a European project that aims to deliver a comprehensive approach for the development and dissemination of multimodal user interfaces capable of intelligently adapting to the individual needs of users with different kinds of mild impairments. Users benefit from natural modes of interaction (e.g. speech and pointing), not having to learn how to use a new interaction device, and having presentations adapted to their characteristics and preferences. Developers benefit from reaching a wider population base, without having to change their development process and with only a limited overhead. To enable this, GUIDE proposes a run-time software framework integrating various user-interface technologies, ranging from remote control, free-hand gestures, and speech control to avatars and second-screen devices. Taking all of this into account, in our approach to designing and implementing a fusion engine for the GUIDE framework, we strongly considered ways to imprint adaptive behaviour on this component, in order to cope with variables such as user and context data and to reach the most trustworthy results. Alongside this implementation, our work also focused on creating an evaluation framework capable of assessing the quality of fusion engines according to certain metrics and supporting their continuous optimization. In this paper, after a presentation of related work, we discuss in more detail how information fusion is currently implemented in GUIDE, and how that process is optimized to deal with constraints related to users, the performance of input recognizers, and environmental factors. We complement this approach with an evaluation framework for fusion engines, capable of testing and validating the aforementioned implementation. We conclude the paper by showing some results based on these tests and the corresponding validation.
II. RELATED WORK

A. Fusion of Input

Performing multimodal fusion in a multimodal system is a goal that can be achieved at different levels of fusion (e.g. data, feature, decision, opinion) [8] and with different algorithms, such as frame-based, statistical, or hybrid approaches [9]. Decision- and opinion-level fusion are high-level types of fusion, oriented towards cases where loosely-coupled modalities are to be employed, and were therefore the options we considered to fuse data. Being usually among the first components of a system that deal with information from the “outside”, fusion engines carry a great responsibility: in a given application context, they must correctly interpret the incoming data and pass the most trustworthy interpretation to a human-machine dialogue manager, so the interaction between user and machine can proceed to the next step.

B. Fusion Adaptation

Fusion classifiers can be distinguished not only by the type of fusion or architecture they possess, but also by whether they are adaptive or non-adaptive [10]. The basic concept behind adaptive fusion (also called quality fusion) is to assign a different weight value to each modality. Take, for instance, the example of a biometric authentication system that uses facial and voice recognition. Because lighting conditions can change and influence the results produced by the system (in this case, the face recognition), the visual modality may get a lower weight value while speech input gets a higher one, and is thus considered more trustworthy in the recognition process. Sanderson and Paliwal [8] present some simple approaches to combine the opinions of n experts and produce a final score, or weight. When developing the GUIDE fusion core, a different approach was conceived and implemented, mainly because most adaptation techniques referred to in the literature focus either on biometric systems or on few variables (e.g. making use of modality recognizer output only) [11].

C. Evaluation

When users interact with multimodal systems, their actions are captured by modality recognizers that then forward the data to the fusion module. This information can arrive delayed, corrupted, or incorrect, or not arrive at all. In addition, users can also be a source of errors, providing input commands that are not acceptable for a given context. In order to make an efficient and correct evaluation of fusion engines, it is necessary to create a testing environment that is fully controllable and replicable. This task becomes more difficult as the number and complexity of system components grows. To ensure an efficient evaluation of multimodal interfaces, several authors [11], [12] propose a “divide and conquer” method, which consists of adopting a step-by-step process of evaluating each component on its
own and basing later evaluations on the previous ones. The design of advanced algorithms for the fusion of multimodal data is a subject on which research has remained sparse for quite some time [11], and consequently so has their evaluation, making it an important topic for research [3].

D. Discussion

Combining user input is a task of fusion engines which can be tackled at different levels and produce disparate results, especially when adaptation needs to be accounted for. Thus, evaluation becomes a critical complement to the development phase of fusion engines. Details about how fusion and adaptation are performed in the GUIDE fusion core are explained to some extent in the subsequent sections. But our main contribution is neither how we combine user input and generate interpretations, nor the way we implement adaptive behaviour in the fusion module to cope with user and context-sensitive information. Our goal is to demonstrate the usefulness of our evaluation framework and methodology to fusion engine developers through two ends: assessment of performance through the simulation of interaction scenarios, and support for the optimization of adaptation parameters defined by developers.

III. FUSION IN GUIDE

The GUIDE framework puts different modalities and devices for input at the user's disposal, such as speech, gestures, or a remote control. For the implementation of the fusion module we chose an opinion-level architecture, first because it is one of the two choices (the other being decision-level) suggested for situations involving loosely-coupled modalities, and second because it embraces the possibility of uncertainty, which is ideal for a system that strives to maximize its efficiency by mitigating the error proneness of components and users. One important constraint to keep in mind about the development of the GUIDE fusion engine is the platform on which the framework will be running, a TV set-top box with limited processing power and memory. This is the main reason behind the decision to choose simple and fast fusion algorithms and adaptation rules.

A. A frame-based approach

As previously mentioned, there are many options available when designing and implementing a fusion engine. The one picked for the GUIDE fusion engine was a frame-based architecture, inspired by existing work [13], [14], but augmented with additional capabilities in order to support adaptation. The frame structure consists of two main sets. The first one is a set of slots, which can contain either triggers or sub-frames. Triggers are associated with one and only one modality (such as speech or pointing) and are essentially conditions that must be met in order for the slot to be activated, while a sub-frame is a regular frame contained
inside another frame, allowing the representation of more complex interaction scenarios. The second set, the results, contains actions or commands recognizable by the system that are sent to the dialogue manager once a frame is activated, indicating which action the user performed. Besides these two data structures within the frame, there are also two attributes that play an important role in the frame activation process. The “mode” attribute defines how the slots are related in time in order to be activated (i.e. the necessary synchronization of input events). Parallel and sequential triggers are distinguished, as well as coupled (and) and exclusive (or) triggers. Based on these properties, there are four possible values for this attribute:

• par-and: used when multiple triggers are to be fused together. The activation of all triggers (i.e. receiving input events that match each trigger condition) is necessary for the overall activation of the frame. The order of the received events does not matter, as long as they all arrive within a defined time window.
• seq-and: works the same way as par-and. The major difference is that the events that validate the slots must not only fall inside the designated time window but also arrive in a pre-defined order for the frame to be validated and the results sent.
• par-or: describes redundant multimodal triggers with similar meanings. Each one is sufficient for the correct meaning to be extracted, but they can all be expressed at the same time by the user, thereby increasing robustness and recognition rate (e.g. a user issuing a “play” vocal command while simultaneously pushing a play button on the remote control).
• seq-or: to be used when multiple triggers can lead to the same result, but only one of them is to be provided.
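To make this structure concrete, the sketch below shows one possible way of representing such frames in code. It is only an illustration written for this paper: the class and field names (Trigger, Frame, mode, lead_time, results) are our own and do not necessarily match the actual GUIDE implementation; the lead_time field corresponds to the “leadtime” attribute discussed next.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Trigger:
    """A condition tied to exactly one modality (e.g. speech or pointing)."""
    modality: str          # e.g. "speech" or "pointing"
    condition: str         # semantic content to match, e.g. "select"
    activated: bool = False
    confidence: float = 0.0

@dataclass
class Frame:
    """A fusion frame: slots (triggers or sub-frames), a mode, a time window,
    and the results sent to the dialogue manager upon activation."""
    name: str
    mode: str                                        # "par-and", "seq-and", "par-or" or "seq-or"
    lead_time: float                                 # time window (seconds) allowed for activation
    slots: List[Union["Frame", Trigger]] = field(default_factory=list)
    results: List[str] = field(default_factory=list)  # e.g. ["select_button_1"]

# Example: "point at a button and say 'select'", in any order, within 2 seconds
point_and_select = Frame(
    name="point+speech selection",
    mode="par-and",
    lead_time=2.0,
    slots=[Trigger("pointing", "button_1"), Trigger("speech", "select")],
    results=["select_button_1"],
)
```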
The “leadtime” attribute defines the duration of the time window allowed for frame activation. For instance, if the conditions represented in the slots have to occur in parallel, then no action involved in that scenario can be separated in time from another action by more than the established duration. If the actions must happen in sequence, then this attribute defines the maximum time interval between those actions. At any given point in time during an application's lifecycle, the GUIDE fusion engine must be aware of the current UI being displayed to the user, namely the interactive elements available (e.g. buttons). This information is extracted from concrete representations of the interface (expressed in UIML [15]) and is the main source of knowledge used to create the appropriate frames for a given situation and to anticipate user input. GUIDE is not only a system that intends to adapt the interaction between users and applications; it also has other functionalities (e.g. raising the TV volume, disambiguating commands given to the application that are not clear). Therefore the fusion engine also has to be ready to receive this
application-independent type of input and to create specific frames for it. When input events generated by the user start to flow in, the engine, according to adaptation rules, starts to gradually activate slots and frames, forwarding interpretations to other system components.

B. Adaptation

Although the standard frame-based algorithm previously discussed is enough to represent situations where various modalities have to be employed to produce a certain result, it does not account for the need to adapt to the user or the execution context, disregarding precious data that could be used to maximize the overall performance of the fusion, and consequently of the entire framework. The User Model and the Context Model are two components of the GUIDE framework responsible for providing data about the user and the context, respectively, to several other components. These models constantly update this data, because users and context evolve over time. For the fusion, the relevant information to be fetched from them is how proficient users are with the modalities available to them, and how the current environmental context is affecting the use of those same modalities. In addition, the modality recognizers that work together with the fusion engine to convey user input are able to report the uncertainty of their recognitions. This is translated into a confidence level that accompanies every input event sent to the fusion. In this way, in addition to user and context confidence levels, it is also possible to account for the high or low reliability of the recognizers. All of these adaptation variables take part in the slot activation process. Whenever an event is received and its semantic content matched with a certain slot, the slot is assigned a confidence value SC using:

SC = a·Ci + b·Ui + c·Ei

where Ci, Ui and Ei are the confidence scores assigned to, respectively, the input event, the user, and the context, while a, b and c are the weights assigned to each of those variables. Once a frame becomes a candidate for activation, according to its own rules, the confidences assigned to its slots come into play and classification criteria are used to set the overall confidence of the frame. If the frame requires all slots to be activated (an “and” mode) a MIN operator is used, otherwise a MAX operator is chosen. In the end, the fusion checks which frame has the highest score, and if that value meets an established threshold, the results are sent. These approaches to activating frames and slots are somewhat straightforward and simple, but again, this is decisive given the limitations imposed by the available hardware. However, despite its apparent simplicity there is obvious room for improvement. It can be stated with certainty that one of the deciding factors in the final decision made by the fusion engine in this frame activation process is the set of weights assigned to input, user and context (the a, b and c).
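As an illustration of this activation rule, the sketch below computes slot confidences with the weighted sum above and combines them with MIN or MAX depending on the frame mode. It is a simplified rendering written for this paper, not the actual GUIDE code; the function names and the assumption that all scores lie in [0, 1] are ours.

```python
from typing import Dict, List, Optional, Tuple

def slot_confidence(event_conf: float, user_conf: float, context_conf: float,
                    a: float, b: float, c: float) -> float:
    """SC = a*Ci + b*Ui + c*Ei; all scores assumed to lie in [0, 1]."""
    return a * event_conf + b * user_conf + c * context_conf

def frame_confidence(slot_scores: List[float], mode: str) -> float:
    """MIN for the 'and' modes (all slots required), MAX for the 'or' modes."""
    return min(slot_scores) if mode in ("par-and", "seq-and") else max(slot_scores)

def winning_frame(candidates: Dict[str, float], threshold: float) -> Optional[Tuple[str, float]]:
    """Pick the highest-scoring candidate frame, if it meets the threshold."""
    name, score = max(candidates.items(), key=lambda item: item[1])
    return (name, score) if score >= threshold else None

# Example: a par-and frame with a pointing slot and a speech slot
scores = [slot_confidence(0.9, 0.8, 1.0, a=0.6, b=0.2, c=0.2),
          slot_confidence(0.7, 0.9, 1.0, a=0.6, b=0.2, c=0.2)]
print(winning_frame({"point+speech selection": frame_confidence(scores, "par-and")},
                    threshold=0.5))
```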
Defining optimal values for these constants is not a trivial task. Much like in this situation, other fusion engines may use adaptation-geared algorithms that are difficult to perfect. All these factors suggest a very strong need for ways to easily evaluate and hone fusion engines.

IV. EVALUATION OF THE FUSION MODULE

One of the issues regarding the evaluation of multimodal systems is the vast number of possible scenarios that can be created by users, which makes testing a burdensome and intricate goal. Furthermore, performing tests with real users can become not only expensive and lengthy, but also very hard to control and replicate, because of random interaction patterns from users or unexpected behaviour and performance from system components such as modality recognizers. Despite all these issues, for the particular case of the GUIDE fusion engine it is essential to ensure a certain standard of quality and reliability in the delivered results, since all of the interaction adaptation is in favour of users with impairments, who in normal circumstances cannot interact with applications in an efficient way. To guarantee this goal, alongside the fusion module we developed an evaluation framework that effectively met our expectations for tackling the stated shortcomings, and allowed us to adopt an evaluation methodology oriented towards enhancing the capabilities of the fusion engine, optimizing adaptation rules, and simulating the behaviour of users and other components.

A. Framework overview/description

The key concept behind our evaluation framework is the interaction scenario, which is composed, in part, of a series of events related to different modalities and has the potential to represent a real-life scenario, as if an actual user were interacting with an application. For instance, a set of these events could represent a user pointing to a certain location on screen and, at the same time, issuing a voice command. These scenarios are stored within the framework and can be easily customized and augmented in order to establish a testing environment suited to a given application or algorithm. The format used for their representation is EMMA (Extensible MultiModal Annotation markup language) [16], a device-independent, multimodal-oriented, XML-based markup language developed by the W3C Multimodal Interaction Working Group and used for containing and annotating the interpretation of user input by multimodal recognizers or fusion engines. The aim of this standardization is to allow any fusion engine to receive and process the scenario events held by the evaluation framework, and to send interpretations of the received input back to the framework in the same format.
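To give an idea of what such a scenario event could look like, the snippet below builds a single EMMA-style interpretation for a spoken command. It is an illustrative sketch based on our reading of the EMMA 1.0 recommendation; the exact attributes and document structure used by the framework's scenario files may differ, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"  # namespace from the EMMA 1.0 recommendation

def speech_event(token: str, start_ms: int, end_ms: int, confidence: float) -> str:
    """Build a minimal EMMA-style document for one spoken input event."""
    ET.register_namespace("emma", EMMA_NS)
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(root, f"{{{EMMA_NS}}}interpretation", {
        f"{{{EMMA_NS}}}medium": "acoustic",
        f"{{{EMMA_NS}}}mode": "voice",
        f"{{{EMMA_NS}}}start": str(start_ms),
        f"{{{EMMA_NS}}}end": str(end_ms),
        f"{{{EMMA_NS}}}confidence": str(confidence),
    })
    # Application-specific semantics: here, simply the recognized command token.
    ET.SubElement(interp, "command").text = token
    return ET.tostring(root, encoding="unicode")

# Example: the user says "select" between t=1000 ms and t=1600 ms, recognized with 0.9 confidence
print(speech_event("select", 1000, 1600, 0.9))
```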
The most relevant attributes of a typical input event can be controlled by the framework, such as the type of modality, the semantic content (e.g. what the user is saying, where he is pointing), the beginning and ending instants of the action, the confidence score, and its arrival time at the fusion engine. This last attribute is particularly useful, because it allows the simulation of processing or network delays that can occur before the event arrives at the fusion module. Another advantage is the ability to simulate the speed at which actions are performed, which is important when dealing with users who, due to motor or cognitive impairments, cannot react as fast as an unimpaired person would. Besides input events, each scenario also includes a set of ground truths. These are the results expected to be produced by the fusion engine. After a scenario simulation, they are compared against the actual results obtained from the fusion process in order to measure its performance. Figure 1 illustrates this concept to some extent, along with the overall architecture of the framework. For developers of fusion engines, understanding whether their component is correctly processing input events and producing the expected results is definitely beneficial. However, this is only one of the two goals of our approach to evaluation. The other goal is to support the continuous optimization of fusion engines. Achieving optimal algorithms and methods for fusion or adaptation can be troublesome for many reasons, especially when heavy processing and many variables are at play. Relying on tests with real users can bring more reliable results; however, such situations are nearly impossible to replicate, which prevents the establishment of control scenarios useful for comparing different implementations. So, in a sense, developers can use this evaluation framework and its interaction scenarios not only to validate the reliability of their fusion engines in delivering trustworthy results, but also as a tool for the analysis and refinement of adaptive behaviour or of algorithms in general. This last aspect consists of the developers' option to establish which variables of their implementations they wish to put to the test, and to define a range of values to be tested during the simulation process. This ability is particularly useful because it allows scenarios to be replicated while keeping some variables constant and changing others, determining how the fusion behaves in those new situations. For instance, altering data related to environmental factors, the surrounding context, or the user can show whether the parameters that were efficient before are still able to deliver the same performance. In the end, an evaluation of all scenarios is produced, stating all the outcomes of running the interaction scenarios through the fusion engine for each set of parameters. These are reported based on the following metrics:

• Efficiency: the percentage of scenarios that produced the expected results (i.e. a comparison between the received interpretations and the previously established ground truths), together with an explicit indication of which ones failed.
• Response time: the time elapsed between the moment the first event of a scenario is sent to the fusion engine and the moment the corresponding interpretation is received.
• Confidence: the confidence score of the interpretation (supported by EMMA annotation).

By analysing the results, developers can perceive which set of parameters is more advantageous or disadvantageous for their objectives. If necessary, the replication feature of the framework can be exploited to re-run the same testing session, in order to experiment with new possibilities concerning the evaluation (e.g. types of parameters, values) or the implementation of the fusion engine (e.g. new architectural approaches, different algorithms).
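A minimal sketch of how the per-scenario outcomes could be aggregated into such a report is shown below. The ScenarioOutcome structure and its field names are invented for this illustration and are not the framework's actual report format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

@dataclass
class ScenarioOutcome:
    """Outcome of one scenario run: what the engine answered, when, and how confident it was."""
    expected: str                  # ground truth, e.g. "select_button_1"
    produced: Optional[str]        # interpretation returned by the fusion engine (None if none)
    response_time_s: float         # first event sent -> interpretation received
    confidence: float              # EMMA-annotated confidence of the interpretation

def summarize(outcomes: List[ScenarioOutcome]) -> dict:
    """Aggregate one run into the three metrics used in the evaluation reports."""
    failed = [i for i, o in enumerate(outcomes) if o.produced != o.expected]
    return {
        "efficiency": 100.0 * (len(outcomes) - len(failed)) / len(outcomes),
        "failed_scenarios": failed,
        "avg_response_time_s": mean(o.response_time_s for o in outcomes),
        "avg_confidence": mean(o.confidence for o in outcomes),
    }
```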
B. Evaluation how-to

As suggested before, developers are able to use the evaluation framework at different levels. At the most basic level, the framework can be used to measure a fusion engine's performance. This is done by preparing the engine to receive events (expressed in the EMMA standard), starting the simulation process, and finally analysing the resulting comparison between results and ground truths, the execution times, and the confidence scores. If the application context and the corresponding tests require a specific type of semantic content (which can also be embedded inside EMMA), then appropriate scenarios and ground truths may be manually added to the framework and thereby included in the simulation. The second level of framework use works as an extension of the first, and consists of including adaptation parameters in the evaluation. To that end, developers should define beforehand which variables and respective values should be tested, which can be done in two different ways. One option is to manually define values for each parameter, for each run of the evaluation process (where a “run” is the processing of all scenarios for a particular set of parameters). The other possibility also consists of stating the parameters that should be tested, but instead of writing down all the possible values, a range is defined for each variable (with a minimum, a maximum, and a step value). In this way, every possible combination is used once, over all established scenarios, as sketched below. It is important to note that these variables do not have to be directly related to adaptation mechanisms. Any parameter is a candidate for the simulations, as long as the fusion engine is prepared to receive that information from the framework alongside the input events and to apply it in the working of the component. For each possible combination of these parameters, the evaluation framework runs all of the established scenarios and presents the corresponding outcomes of each one. The main advantage of this approach is that the framework takes care of all the data and synchronization issues, and fusion engines simply have to adapt their parameters according to what they receive and process the events as they would in a real-life scenario.
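As a sketch of this second level of use, the loop below enumerates every combination of the declared parameter ranges and runs all scenarios for each one. The ParameterRange structure and the run_scenarios callback are hypothetical names introduced for this example; they do not correspond to the framework's actual API.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List

@dataclass
class ParameterRange:
    name: str
    minimum: float
    maximum: float
    step: float

    def values(self) -> List[float]:
        """Enumerate minimum, minimum+step, ... up to and including maximum."""
        count = int(round((self.maximum - self.minimum) / self.step)) + 1
        return [round(self.minimum + i * self.step, 10) for i in range(count)]

def sweep(ranges: List[ParameterRange],
          run_scenarios: Callable[[Dict[str, float]], dict]) -> List[dict]:
    """Run all scenarios once for every combination of parameter values."""
    reports = []
    for combo in product(*(r.values() for r in ranges)):
        params = dict(zip((r.name for r in ranges), combo))
        reports.append({"params": params, "report": run_scenarios(params)})
    return reports

# Example matching Table I: a, b and c from 0 to 1 in steps of 0.2 (6 values each, 216 combinations)
ranges = [ParameterRange(n, 0.0, 1.0, 0.2) for n in ("a", "b", "c")]
```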
For instance, taking the aforementioned adaptation rule used in the GUIDE fusion core, we can use the evaluation framework to run a set of scenarios over and over, changing parameters such as the threshold for sending frame results or the weights assigned to input, user and context, in order to understand which choices yield the most trustworthy results. The next section provides details on these concepts and shows some results obtained by submitting our fusion engine to the evaluation framework.

C. Results

Being a user-centred design project, GUIDE has already gone through several phases of testing with real subjects, namely the elderly population. The data collected from these tests provided insightful information [17], not only validating the need for such a project, but also educating the component developers in several respects. One of those was understanding how the elderly react to multimodal interaction and how they act when confronted with a system that employs several distinct modalities. These interaction scenarios were the starting point for the definition of our own test cases, which became the main source of knowledge for the evaluation framework when assessing the performance of our own fusion engine. Besides adding authenticity to the evaluation, since they are related to the project context and based on observations of real interactions, these tests can also better prepare the GUIDE fusion engine for new phases of testing with real users. This should also help mitigate the issues expected to arise from the integration of the remaining framework components and modality recognizers when the whole system is used in real-life situations. With this in mind, we used the proposed evaluation framework to test the implementation described in Section III. Fifteen scenarios were built, comprising different devices and modalities, such as speech, remote control, and pointing gestures. Despite all of these choices for interaction, for the application context of the simulation we considered a simple interface with five buttons representing an EPG (Electronic Program Guide) for channel selection. This type of interface is very similar to what was used in the GUIDE user trials. Even for such a visually elementary interface, the number of ways to interact with it is high. For this specific interface the fusion engine allows different ways of selecting buttons, which include: saying the button name; pointing at the button and saying “select”; pointing at a button and saying the button name; pointing at a button for three seconds; saying “select” and the button name (in this order). As mentioned before, there are also commands that are independent from applications and must be taken into account at all times. Some of the commands
Table I. Adaptation parameters defined for evaluation

                   a     b     c
  Minimum value    0     0     0
  Maximum value    1     1     1
  Step             0.2   0.2   0.2

Table II. Evaluation results for the first phase of testing
(P = pointing, S = speech, R = remote control)

                          P+P      P+S      S+S      S+R
  Avg. Efficiency         97.68%   64.47%   82.97%   83.33%
  Avg. Response Time (s)  7.67     8.03     6.13     2.50
  Avg. Confidence         0.92     0.92     0.69     0.76

Table III. Evaluation results for the second phase of testing
(P = pointing, S = speech, R = remote control)

                          P+P      P+S      S+S      S+R
  Avg. Efficiency         93.75%   59.88%   76.12%   77.31%
  Avg. Response Time (s)  7.50     8.00     6.13     2.50
  Avg. Confidence         0.80     0.79     0.60     0.76
that were included in the tests were: turning the TV volume up or down with speech commands, and remote control key presses. Since a fusion engine is also expected to handle chains of events that are not meant to be fused (but should nevertheless produce a response), the scenarios also represented such situations (e.g. saying “Weather channel” and pressing the right key on the remote control). In order to summarize the collected data and facilitate its analysis, scenarios involving the same combination of modalities were bundled into the same groups (i.e. pointing + pointing, speech + pointing, speech + speech, speech + remote control), stating for each one the average values of the obtained efficiency, response time, and confidence in the results. For all of the simulations behind the results discussed below, the existence of perfect recognizers was assumed (i.e. the maximum confidence score was assigned to the reliability of input events). The first phase of testing consisted of using the evaluation framework to assess the performance delivered by the architecture and adaptation rule described in Section III. For this particular case, we defined the constants of that rule (a, b, c) as the parameters to be tested (see Table I). For the established ranges and steps, each parameter takes 6 values, so the number of possible combinations of the three variables was 216, resulting in a total of 3240 simulated scenarios (216 × 15). Alongside the assumption of perfect recognition of input events, for this phase we also assumed a perfect user and an unaffected context (i.e. no user impairments or environmental factors affecting the use or performance of any modality or device). Table II shows the results obtained for this first testing phase. As can be seen, the efficiency of the fusion approach diminishes when different modalities have to be employed together (e.g. pointing and speech). By analysing the outcome of each scenario, it was possible to confirm which combinations of parameters delivered efficiency, confidence, and response times that were above average and best suited for our implementation.
For the second phase of testing, the same scenarios, adaptation parameters, and values were maintained. However, this time there were changes to the variables related to the user's ability with the pointing modality and to the context's impact on the speech modality. We set these to lower values, in order to assess the difference in the fusion engine's performance and to confirm whether such changes could be tracked by the evaluation framework. The results from this phase are shown in Table III. As expected, introducing changes to the environment did affect the results produced by the fusion engine. Since all the combinations of modalities used in the tests involved either pointing or speech, we observed a considerable decrease in efficiency in all cases. Differences in response time were not as large, mainly because of the simplicity of the algorithm involved. Furthermore, by analysing the reports generated by the framework, we were able to understand which combinations of those parameters could be more beneficial to the fusion engine's performance in contexts affected by user disabilities or environmental interference that diminish the quality of the modality recognizers. This is an example of information that could be used to implement improved dynamic, adaptive behaviour in the fusion engine. These testing sessions, besides helping to assess the performance of our fusion approach when faced with certain constraints and settings, also validated the purpose and goals of our evaluation framework. In a relatively short amount of time, we were able to simulate a total of 6480 scenarios, encompassing different input modalities and portraying real-life situations. In addition, the replication aspect of the framework allowed scenarios to be tested over and over, varying several parameters over time, to understand how a fusion engine would react to different contexts and uses of modalities. Also, by separating modalities in the performed tests, the framework allowed us to establish a basis for refining each specific combination of modalities. Even though only three adaptation parameters were submitted to the framework, we could see substantial differences in the results when comparing the two testing phases.

V. CONCLUSION

This paper presented how multimodal fusion is achieved in the framework of project GUIDE, and how its fusion engine uses information about user abilities, context data, and modality recognizer output to adapt the interaction between users and applications. It demonstrated the pressing need for evaluation solutions regarding
the performance of fusion engines. To avoid the effort needed to run tests involving several users, we proposed an evaluation framework capable not only of assessing the performance of multimodal fusion in terms of generating correct interpretations of user input, but also of being used by developers to optimize their components and algorithms. To this end, we also presented how parameters related to adaptation (specific to each fusion engine) and parameters related to interaction scenarios (e.g. data about user and context) can be used to analyse and optimize the adaptive behaviour of developers' fusion engines. This was demonstrated by the results obtained by submitting our own fusion engine to the evaluation framework, confirming that even for a small set of adaptation parameters and values we could see a significant impact on the delivered results. Even for a simple adaptation approach such as the one employed by the GUIDE fusion engine, it is possible to execute a large number of distinct simulations. Therefore, we can assume that for a more complex fusion engine or algorithm, the value of using our evaluation framework would be even greater.

REFERENCES

[1] A. D. Fisk, W. A. Rogers, N. Charness, S. J. Czaja, and J. Sharit, Designing for Older Adults, 2009, vol. 133, no. 4. [Online]. Available: http://www.jstor.org/stable/1794534?origin=crossref
[2] J. L. Fozard and S. Gordon-Salant, Changes in vision and hearing with aging. Academic Press, 2001, pp. 241–266.
[3] D. Lalanne, L. Nigay, P. Palanque, P. Robinson, and J. Vanderdonckt, “Fusion engines for multimodal input: a survey,” Interfaces, pp. 153–160, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?id=1647314.1647343
[4] P. Dragicevic and J. D. Fekete, “Input device selection and interaction configuration with ICon,” People and Computers, pp. 543–448, 2001.
[5] J. Bouchet and L. Nigay, ICARE: a component-based approach for the design and development of multimodal interfaces. ACM Press, 2004, pp. 1325–1328. [Online]. Available: http://portal.acm.org/citation.cfm?id=985921.986055
[6] B. Hartmann, L. Abdulla, M. Mittal, and S. R. Klemmer, “Authoring sensor-based interactions by demonstration with direct manipulation and pattern recognition,” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '07), p. 145, 2007. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1240624.1240646
[7] J.-Y. L. Lawson, A.-A. Al-Akkad, J. Vanderdonckt, and B. Macq, “An open source workbench for prototyping multimodal interactions based on off-the-shelf heterogeneous components,” Proceedings of the 1st ACM SIGCHI Symposium on Engineering Interactive Computing Systems (EICS '09), p. 245, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1570433.1570480
[8] C. Sanderson and K. K. Paliwal, “Information fusion and person verification using speech and face information,” IDIAP Martigny Research Report, no. 33, pp. 02–33, 2002.
[9] B. Dumas, D. Lalanne, and S. Oviatt, “Multimodal interfaces: A survey of principles, models and frameworks,” Human Machine Interaction, vol. 5440, no. 2, pp. 3–26, 2009. [Online]. Available: http://www.springerlink.com/index/65J39M5P56341N49.pdf
[10] N. Poh, T. Bourlai, and J. Kittler, Multimodal Information Fusion. Academic Press, 2010, pp. 153–165. [Online]. Available: http://eprints.pascal-network.org/archive/00006857/
[11] B. Dumas, R. Ingold, and D. Lalanne, “Benchmarking fusion engines of multimodal interactive systems,” Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI '09), p. 169, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1647314.1647345
[12] A. Paramythis, S. Weibelzahl, and J. Masthoff, “Layered evaluation of interactive adaptive systems: framework and formative methods,” User Modeling and User-Adapted Interaction, vol. 20, no. 5, pp. 383–453, 2010. [Online]. Available: http://www.springerlink.com/index/10.1007/s11257010-9082-4
[13] M. T. Vo and C. Wood, “Building an application framework for speech and pen input integration in multimodal learning interfaces,” 1996 IEEE International Conference on Acoustics, Speech and Signal Processing Conference Proceedings, vol. 6, pp. 3545–3548, 1996.
[14] B. Dumas, D. Lalanne, and R. Ingold, “HephaisTK: a toolkit for rapid prototyping of multimodal interfaces,” in Proceedings of the 2009 International Conference on Multimodal Interfaces, ser. ICMI-MLMI '09. New York, NY, USA: ACM, 2009, pp. 231–232. [Online]. Available: http://doi.acm.org/10.1145/1647314.1647360
[15] M. Abrams, C. Phanouriou, A. L. Batongbacal, S. M. Williams, and J. E. Shuster, “UIML: an appliance-independent XML user interface language,” Computer Networks, vol. 31, no. 11–16, pp. 1695–1708, 1999. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S1389128699000444
[16] M. Johnston, “Building multimodal applications with EMMA,” Proceedings of the 2009 International Conference on Multimodal Interfaces (ICMI-MLMI '09), pp. 47–54, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1647314.1647325
[17] J. Coelho, C. Duarte, P. Biswas, and P. Langdon, “Developing accessible tv applications,” in The Proceedings of the 13th International ACM SIGACCESS Conference on Computers and Accessibility, ser. ASSETS '11. New York, NY, USA: ACM, 2011, pp. 131–138. [Online]. Available: http://doi.acm.org/10.1145/2049536.2049561