Interactive Multimodal User Interfaces for Mobile Devices

Wolfgang Mueller, Robbie Schaefer, Steffen Bleul
[email protected], [email protected], [email protected]
Paderborn University / C-LAB
Fuerstenallee 11, Paderborn, Germany

Abstract

Portable devices come with different individual limitations in user interaction, like limited display size, small keyboards, and different sorts of input and output channels. With the advancement of speech recognition and speech synthesis technologies, speech-based options and alternatives become feasible and attractive for mobile devices in order to implement real multimodal user interaction. However, current systems and formats do not sufficiently integrate advanced multimodal interaction. We introduce a system for Multimodal Interaction and Rendering (MIRS) dedicated to mobile devices. The MIRS system incorporates efficient processing of XML specification languages on limited, mobile devices. It comes with the XML-based Dialog and Interface Specification Language (DISL). DISL can be considered a UIML subset, enhanced by means of state-oriented dialog specification. That dialog specification is based on DSN (Dialog Specification Notation), which was introduced to describe the control model of a user interface as a cross product of several interaction states together with transition rules forming new cross products.

1. Introduction

With the wide availability of considerably powerful mobile computing devices, the design of portable interactive User Interfaces (UIs) faces new challenges, as each device may have different capabilities and modalities for rendering UIs. The increasing use and growing variety of different mobile devices to access information on the Internet has induced the introduction of special-purpose content presentation languages like WML [17] and Compact HTML [11]. However, their application on limited devices is cumbersome and most often requires advanced skills. We expect that user interaction, in particular on mobile devices, will radically change with the availability of advanced speech recognition and synthesis technologies. However, those technologies will not replace existing interaction; rather, they will coexist, yielding user-, hardware-, and situation-dependent multimodal interaction with embedded and mobile devices. First visions are given under the keywords ubiquitous/calm/pervasive computing and Ambient Intelligence (AmI) [18, 1]. Unfortunately, we can identify only few activities addressing those issues.

In the area of graphical user interface description languages, the User Interface Markup Language (UIML) [4] has been established and is currently available as UIML 3.0. UIML mainly addresses the description of static user interfaces (structures) and their properties (styles), which leads to descriptions of user interfaces that may be too closely tied to a target platform. The behavioral part of UIML is not well developed and does not give sufficient means to specify truly interactive, state-oriented user interfaces. VoiceXML [5] is widely recognized as a standard for the specification of speech-based dialogs. In addition to both, InkXML [15] has been defined to support interaction with handwriting interfaces. However, UIML, VoiceXML, and InkXML only cover their individual domains and do not integrate with other modalities. As one of the few XML languages covering multiple media and modalities, the W3C has defined the Synchronized Multimedia Integration Language (SMIL) [6], which enables simple authoring of interactive audiovisual presentations. SMIL is typically used for rich media/multimedia presentations, which integrate streaming audio and video with images, text, or any other media type. Considering those languages, only UIML and VoiceXML provide basic means for user interaction. Nevertheless, they are still rather limited for the specification of more complex state-based dialogs as they easily appear in the control of mobile devices and in remote control via mobile devices.

Though currently no activities are known for an XML-based multimodal dialog and interface specification language, the W3C has established activities and has published white papers for a general multimodal interaction architecture. That Multimodal Interaction (MMI) Framework [9] defines an architecture in which each component provides a set of properties (e.g., presentation parameters or input constraints), a set of methods (e.g., begin playback or recognition), and a set of events raised by the component (e.g., mouse clicks, speech events). The MMI framework covers multiple input modes, such as audio, speech, handwriting, and keyboarding, as well as multiple output modes, such as speech, text, graphics, audio files, and animation. The MMI framework considers human users interacting with a so-called interaction manager. The human user enters input into the system and observes and hears information presented by the system. The interaction manager is the logical component that coordinates data and manages execution flow from various input and output modalities; it maintains the interaction state and context of the application and responds to inputs from component interface objects as well as to changes in the system and environment.

This paper introduces an instance of such an MMI framework. We present the architecture of our Multimodal Interaction and Rendering System (MIRS). For the specification of advanced multimodal interaction and the corresponding interfaces, we introduce the XML-based Dialog and Interface Specification Language (DISL). DISL is based on a UIML subset, which is extended by rule-based descriptions of state-oriented dialogs, describing the state of the UI as a graph and operations on UI elements triggering transitions to new UI states. The dialog extension is based on DSN (Dialog Specification Notation), which was introduced to describe the control model of a user interface as a cross product of several interaction states together with transition rules forming new cross products. Additionally, DISL gives means for a generic description of interactive user dialogs, so that each dialog can be easily tailored to individual input/output device properties, e.g., graphical display or voice. In combination with DISL, we additionally introduce S-DISL (Sequential DISL), a sequentialized representation of DISL that copes with the currently limited processing capabilities of mobile devices.

The remainder of this paper is structured as follows. The next two sections introduce approaches and means for dialog and user interface description languages. Thereafter, we present MIRS and DISL in Section 4. Section 5 illustrates DISL by the example of an interactive user interface for the remote control of the Winamp MP3 player via a Siemens S55 mobile phone. Finally, the paper closes with a conclusion and outlook.

2. User Dialog Specification

There exist several classical approaches which focus on the description of user dialogs and interactions for Graphical User Interfaces, like Dialogue-Nets [10], Petri-Nets [2], UAN (User Action Notation) [8], and ODSN (Object-oriented Dialog Specification Notation) [14]. They all refer to the same concepts, are based on variants of parallel Finite State Machines, and mainly differ in their description means and hierarchical decomposition into components. They define the user dialog by means of states and state transitions, which are triggered by events from the user interface elements. In our approach, we apply ODSN, which has been developed to model complex state spaces for advanced human-computer interaction. ODSN models the user interaction as different objects, which communicate by exchanging events. Each object is described by the definition of hierarchical states, user events, and transition rules. Each rule has a condition and a body, where the condition may range over sets of states and sets of user events. The body is executed when the specified events occur and the object is in the specified state. When the condition becomes true, the execution of the body may change a state. As an example, consider the following rule.

    USER INPUT EVENTS
        switches (iAdapt, iReset)
    SYSTEM STATES
        brightness (#dark #normal #bright)
        color (#black #white)
    RULES
        #dark #black iAdapt --> #white


It defines two events and two hierarchical states. The defined rule fires when the event iAdapt occurs, brightness equals #dark, and color is #black. After firing, the rule sets color to #white.

Though they were originally developed for graphical user interfaces, all of those approaches apply to the specification of voice-based dialogs without further modification. However, current activities in the domain of voice browsers do not focus on the explicit specification of the dialog but rather integrate the dialog specification implicitly. The most prominent of all recent approaches is the definition of VoiceXML [5] and concurrent activities like SALT [7]. VoiceXML is defined by the W3C as an XML-based language. VoiceXML is a language for writing applications for "voice browsers", which one interacts with by listening to spoken prompts and jingles and controls by means of spoken input and keypad strokes. VoiceXML defines applications as a set of named dialog states and covers interaction based on spoken prompts (synthetic speech), output of audio files and streams, recognition of spoken words and phrases, recognition of touch-tone key presses, recording of spoken input, control of dialog flow, and telephony control (call transfer and hangup). VoiceXML definitions are composed into so-called forms, which represent states of interaction. The goto tag defines state transitions. The following gives a short example of a form definition along these lines (the form and field names are schematic):

    <form id="color_choice">
      <field name="color">
        <prompt>Which do you like better, black or white?</prompt>
        <option>black</option>
        <option>white</option>
        <filled>
          <if cond="color == 'black'">
            <goto next="#black_form"/>
          <else/>
            <goto next="#white_form"/>
          </if>
        </filled>
        <nomatch>
          I'm sorry, I didn't understand.
          <reprompt/>
        </nomatch>
      </field>
    </form>

In the first part, a question is raised. Depending on the answer, two different state transitions are performed. In the case of an error, the state is kept and the prompt is repeated.

3. User Interface Description

Portable user interface description mostly means applying HTML and related languages. For advanced user interaction, JavaScript and Java Applets are applied. However, such definitions sometimes lack portability and are not suited for mobile devices. Even complete HTML 4.0 (without JavaScript) cannot be processed on every device. For example, PDAs and mobile phones are limited in their screen resolution, while most HTML descriptions are designed for large displays and contain elements like framesets and tables, which require the space of a PC monitor for display. So, other markup languages were developed to enable the use of Internet services on limited mobile devices: C-HTML (Compact HTML), WML, and MML (Mobile Markup Language). Compact HTML is a subset of HTML 4.0, which is dedicated to the limitations of mobile devices. Elements that do not meet the mobile device requirements, like small display, low memory, and slow CPU, are excluded. Thus, C-HTML excludes image maps, tables, and framesets. Since Compact HTML is a subset of HTML, each


C-HTML page can be viewed with an HTML browser, but a C-HTML browser cannot render every HTML element. So the art of transcoding HTML to C-HTML lies not just in replacing or deleting unsupported elements but in restructuring the document so that no information gets lost.

In addition, WML was defined by the WAP Forum (Wireless Application Protocol) for micro browsers on mobile devices. WML is based on a combination of an HTML 4.0 subset and HDML (Handheld Device Markup Language) [12] plus some extensions. In addition to HTML structures, WML is composed of a set of cards, which basically correspond to states in the user interaction dialog. Specific hyperlinks provide means to navigate between cards of one WML specification. Compared with C-HTML, WML differs in several details, like the support of bold character sets and subheadings (h1, h2, ...).

In addition to C-HTML and WML, the Japanese provider J-Phone has defined the MML (Mobile Markup Language) family of languages for their J-Sky service: S-MML (Small MML), M-MML (Medium MML), and F-MML (Full MML). S-MML is compliant with an HTML 4.0 subset defined for small displays with 4 lines of 12 characters each. M-MML is defined for mobile phones with display sizes of 160x100 pixels. F-MML stands for full compatibility with HTML 4.0.

Since HTML and its subsets or derivatives provide extremely poor capabilities for the definition of graphical user interfaces, Harmonia has defined UIML (User Interface Markup Language). A UIML description consists of an interface part, describing the structure and behavior of the interface, and a peers section, describing how the used elements are to be translated to target-format-specific elements. The peers section also describes the API of the backend application, so that methods in this API can be invoked from within the code. An interface consists of parts, which can contain subparts. Properties are associated with the parts, and parts can have an associated class. Behavior is added by means of rules, which are invoked by events. Actions in a rule include setting property values, throwing events, or invoking (backend) methods. Thus, the interface reflects a simplified model of a general GUI developed with, e.g., Java. An example of the description of a UIML widget, here a label showing the text "composer", may look as follows (schematically):

    <part id="ComposerLabel" class="Label"/>
    ...
    <style>
      <property part-name="ComposerLabel" name="text">composer</property>
    </style>

Though rules support the description of basic interaction, advanced action specification and the notion of a state-oriented description are missing to specify advanced user dialogs in a comprehensive way. UIML allows switches to select the parts and properties that are desired in a particular rendering. The UIML developers intended this to be used for describing interfaces geared towards different output formats. Thus, one would provide, e.g., an HTML section and a VoiceXML section for generating HTML and VoiceXML output, each using a dedicated vocabulary. The peers section includes a description of the API calls of the backend in the logic section and directions for the conversion of elements to the target format, of which there may be several. API calls in the logic section contain the method name and the parameters. Optionally, a script may be added, in which case the API call may also be implemented locally.

4. Multimodal Dialogs for Mobile Devices

In order to implement an XML-based framework for multimodal interaction, we have defined a Multimodal Interaction and Rendering System (MIRS), which is based on the specification, exchange, and rendering of user interfaces and dialogs by means of descriptions given in the Dialog and Interface Specification Language (DISL). Before introducing DISL, we first present the MIRS architecture.

4.1. Multimodal Interaction and Rendering System (MIRS)

MIRS is based on a client-server architecture as given in Fig. 1. An application server provides a user interface and dialog description by means of DISL. In order to ease processing of the DISL description on the client, it is first transformed to S-DISL, a sequential intermediate format of DISL. The transformation is performed by an XSLT translator and given by an XSLT description. After transformation, the S-DISL file is transmitted to the mobile client and interpreted. The interpreter separates the description of the dialog from the abstract widgets, where the latter are abstract representations of user interface objects which may be represented as text, graphics, or voice. Depending on the individual user and hardware profile, a renderer is spawned for each abstract widget or set of abstract widgets. Thus, for a device with poor loudspeaker quality, it makes little sense to provide audio output when a sufficiently large display is available, whereas for visually impaired users, graphical output with small characters should be avoided.

Figure 1. MIRS System Architecture

After spawning, the renderer continuously interacts with the interpreter, sending user interface events and receiving user interface descriptions for rendering, i.e., for generating user interface output. The communication between both is controlled by the finite state machine which defines the user dialog as part of the DISL description. So, on the receipt of an event, a state transition is performed, an action is executed, and user interface information is sent to the renderer. The user interface information and the specification of the dialog by means of rule-based descriptions of finite state machines are specified by DISL and S-DISL, respectively, which are both introduced hereafter.

4.2. DISL

Before discussing the syntax of the Dialog and Interface Specification Language (DISL) in detail, we outline the concepts that have driven the design of this language. With DISL, efforts have been made to provide a language which is not only able to describe the appearance of a user interface but moreover the interaction with the user. Therefore, it incorporates the previously discussed dialog model and control model and combines them into one markup language. The description of the dialog model is oriented towards UIML [4]. However, the focus is set on an approach as generic as possible, in order to allow the description of multimodal UIs and easy conversion of UI descriptions to a variety of different target platforms. UIML itself proved to be too tightly connected to the target platform for which a UIML description is written. Therefore, we tried to define the smallest common set of UI elements needed to perform any task. However, this does not imply a limited representation of the UI, as the renderer can obtain additional information on the UI or construct a more complex UI on the basis of inherited knowledge on how to perform some specific tasks (see Subsection 4.4 for more information). The smallest common set is, however, needed to ensure that a UI can be rendered on the most limited device or be used with a completely different modality. For example, if a UI description mandates a graphical element like a button, it automatically excludes voice applications. If a generic trigger replaces the button, a voice UI renderer could also


interpret that description. Therefore, reducing the set of widgets is not a real constraint but rather enables a broader field of applications. DISL divides the minimal set of UI elements into two groups. The first one only conveys information to the user, whereas the second one supports interaction by the user or by the system. We have identified four informative elements:

- variable field, which shows the status of a variable
- text field, for showing textual information; note that it can also be spoken
- generic field, which allows arbitrary extensions
- widget list, which is used for logical grouping of several widgets

All elements can also be set to invisible when they are only of interest for the processing system. The second group provides widgets for two different kinds of interaction, namely commands, issued automatically or as confirmations by users, and the possibility to enter values. Those elements are:

- built-in commands
- generic commands, as an extension mechanism
- user confirmations
- variable boxes, to provide values to variables
- text boxes, to enter textual input
- generic boxes, used for arbitrary input
- choice groups, allowing the selection of a set element

Even if the aforementioned parts are oriented towards UIML and the syntax of the dialog model resembles the UIML syntax, DISL is a different language and cannot be processed by UIML processors. The collection of widgets and the additional vocabulary could have been modeled in UIML, but when it comes to the definition of a control model, we have no other option than designing a new language. Thus, we will focus on the behavioral part of DISL in more detail next. As already mentioned, the DISL control model incorporates the concepts of the DSN notation [14], which means that it describes a finite state machine where state transitions are fired by user events or internal events. Using the cross product of states, as given in ODSN, reduces the number of transitions to be described. Describing the behavior in DISL requires three steps.

1. First, variables have to be defined to model the state of UI parts. For example, a Boolean variable "power" is used to determine whether an application is powered on or off.

2. In a second step, rules describe the user dialog. A rule consists of a condition, which returns either true or false, and an action part. Details will be provided later in this section.

3. Based on the rules, transitions may fire. For that, one or more rules are evaluated, and if all of them are true, the transition can take place. If the transition fires, actions may be performed, which can execute communication with the backend application, like providing new values or sending commands, and define new states by assigning new values to variables.

DISL is an XML application, which allows easy transformation with standard tools. A DISL description consists of an optional head element with metadata, followed by a set of templates or interfaces. We first focus on the interfaces; they are also applied as reusable templates of interfaces. A set of interfaces applies when an application switches between interfaces. As an example, consider a file requester, which is used frequently when a user wants to load or save data. The global structure may look like the following (schematic) definition:


    <DISL>
      <head> ... </head>
      <interface id="FileRequester" state="start">
        ...
      </interface>
      <interface id="FileBrowser">
        ...
      </interface>
    </DISL>

An interface has to be identified with a name and a specific state, which is by default set to "other". By means of the "state" attribute, one can define whether the interface is executed at the start of the application or at the end, or whether it is the default parent of a subinterface. The interface itself is composed of the three parts "structure", "style", and "behavior", where "structure" and "style" define the dialog model and "behavior" the control model.

The "structure" lays out which widgets are to be used in this interface. Therefore, it only consists of a list of widgets, whereby each widget can contain other widgets and must have an identifier as well as a widget type, which is one of the following: variablefield, textfield, genericfield, command, confirmation, genericcommand, choicegroup, widgetlist, variablebox, textbox. The attribute "where" can be used to specify the relative position of the widget (before, after, first, last). By default, widgets are listed one after another. The "style" part helps to specify the widget representation in more detail. A "style" element consists of several parts, each referencing one widget through the required "generic-widget" attribute. A "part" element describes the properties of a widget, which can be freely set. In order to be upward compatible with UIML, the DISL "structure" and "style" elements are subsets of the corresponding UIML elements. Also inherited from UIML is the concept of templates to reuse parts of the UI.

The DISL "behavior" part required major revisions with respect to UIML and is thus outlined in more detail. As UIML has no real control model, communication with the backend application has to be performed via the peers section and mapped to scripts or other UI languages. Those mappings, though quite straightforward, are rather inflexible and not very suited for the description and transformation of generic UIs, as examined in [13]. Therefore, we introduce an event mechanism which can trigger the application and change the state of the UI dialog. DISL rules can be connected through Boolean operations, which allows DSN-oriented cross-product processing and enables powerful modeling of user interface behavior.

The behavior specification consists of an ordered set of optional variables, rules, transitions, and events. Elements describing variables need to have a unique name and contain parsed character data, which yields the value of the variable. The allowable values depend on the type of the variable, which can be set in the "type" attribute. Currently supported types are integer, string, boolean, and pointers to widgets. Further attributes can be used to specify whether a variable is used internally or is a constant, which interface it belongs to, and which widget it refers to. Variables are automatically evaluated when referenced by their names. Rules are mainly used to evaluate a condition and to return a Boolean value; however, rules can also be used without a condition, providing a fixed value. A "set" attribute is used to check whether a variable has already been initialized. There are four types of conditions. The first one checks two values for equality, and the second compares two values. The "equal" element evaluates to true if a property, value, or call is executed. Finally, the "op" element supports different n-ary operations combined by Boolean operators.
Conditions are referenced in the transitions within "if-true" statements, where all rules, combined by Boolean operators, are evaluated. When the "if-true" statement evaluates to true, the "action" part is executed. An action part may consist of several statements performing actions and changing the state of the UI. The following fragment illustrates a DISL rule specifying the volume control of a music device (the element names in this and the following fragments are schematic):

    <behavior>
      <variable name="Volume" type="integer">128</variable>
      <variable name="IncVolumeValue" type="integer">20</variable>
      <rule name="IncVolume">
        <condition>
          <equal>
            <property generic-widget="IncVolume" name="selected"/>
            <value>yes</value>
          </equal>
        </condition>
      </rule>


      <transition>
        <if-true rule="IncVolume"/>
        <action>
          <statement>
            <variable name="Volume" op="add">IncVolumeValue</variable>
          </statement>
          <statement>
            <property generic-widget="Apply" name="visible">yes</property>
          </statement>
          <statement>
            <property generic-widget="Cancel" name="visible">yes</property>
          </statement>
        </action>
      </transition>
    </behavior>

First, variables for the current volume and a value for increasing the volume are assigned. The rule "IncVolume" implements the condition that evaluates to true if the widget "IncVolume" is selected. If true, a set of statements is processed in the action part. There, the "IncVolumeValue" is added to the previously set volume, and further statements update the UI, e.g., setting an "Apply" and a "Cancel" control.

The DISL event mechanism introduces a new concept, which is derived from the concept of timed transitions in ODSN. Events support advanced reactive UIs on remote clients, since they provide the basis for, e.g., timers. DISL events contain an action part like transitions. However, this action is not triggered by a set of rules evaluating to true but depends on a timer, which is set as an attribute. An event may fire only once after the predefined time, or it may fire periodically. It is also possible to activate or deactivate an event. The following example shows how the event mechanism is used to periodically check which song is currently playing in a remote music player; additionally, it outlines how external calls can be applied (attribute names and timer values are illustrative):

    <event name="CheckSong" timer="5000" periodic="yes">
      <action>
        <call id="getplaypos" source="http://server/winamp?cmd=getplaypos" timeout="3000"/>
        ...
      </action>
    </event>

A call consists of a source, which is typically an HTTP request, but other protocols can be supported as well. The call establishes the communication with the real application. The call id is used as a pointer to the return value of the application, which can also be an exception in case of an error. The timeout parameter is used to catch unexpected errors, e.g., when an application is not responding due to a network failure. It is up to the interface designer to model the behavior after the timeout. The timer-based event mechanism also allows a client-based synchronization with the backend application, since internal UI states can be modified by querying external resources.

4.3. S-DISL

Since DISL is designed with mobile and limited devices in mind, we developed a serialized form of DISL that allows faster processing and a smaller memory footprint, namely S-DISL. The purpose behind S-DISL is that an interpreter just has to process a list of elements, which on the one hand saves processing time on the limited device where the interpreter resides.


On the other hand, it allows a smaller interpreter program size, which saves memory that is better utilized for rendering the UI. To achieve a serialized form, a transformer applies a multi-pass XSLT transformation to the DISL file. The first two passes are used to flatten the tree structure. To avoid information loss, new attributes providing links, like "nextoperation", "nextrule", etc., have to be introduced. Through that, the 42 elements of the DISL DTD can be reduced to 10 basic elements. For example, all action elements are reduced to one with a mode attribute defining the type. The next transformation step sorts the ten element types into ten lists. IDs are replaced by references, and empty attributes are deleted in order to get leaner serialized documents. The final output is a stream of not very readable lists of serialized elements. Although the output is longer than the original document, the saved processing time outweighs the disadvantage of a slightly longer transmission time. The size, however, can be additionally reduced by using the binary XML format introduced in the WAP standard [16].
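As a rough sketch of this flattening, the volume-control rule and transition of Subsection 4.2 could be serialized into flat list entries along the following lines; apart from "nextoperation" and "nextrule", the element and attribute names here are hypothetical:

    <rule id="IncVolume" operation="op1" nextrule="r2"/>
    <operation id="op1" mode="equal" left="IncVolume.selected" right="yes" nextoperation="op2"/>
    <transition id="t1" if-true="IncVolume" action="a1" nexttransition="t2"/>
    <action id="a1" mode="add" target="Volume" value="IncVolumeValue" nextaction="a2"/>
    <action id="a2" mode="property" target="Apply.visible" value="yes" nextaction="a3"/>

Each former child element becomes an entry in one of the ten lists, and the link attributes preserve the original nesting, so the interpreter can traverse the description without building a tree.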

4.4. Tweaking UIs

Generated UIs naturally do not have the same appeal as handcrafted UIs. Therefore, we introduce a concept which combines automated UI generation with annotations to improve the look and feel of the UI. This is denoted as "tweaking" and was introduced in [3]. We make use of the fact that the style of any widget can be extended through a set of newly defined properties, which are added to the basic properties. The new properties are used to describe the look and feel of the widget in more detail. With the introduction of widget classes, we are in a position to provide a tree from a most basic widget down to descendants inheriting all properties of the parent widgets plus introducing new properties. The following figure shows such a tree from a generic widget down to specialized widgets.

    Generic Widgets
      Audio UI
        VoiceXML
      Graphic UI
        Mobile Application
          Java MIDP
        Desktop Application
          Visual C++
          Java Swing
Figure 2. Tree structure of multimodal UI

Applying the original UIML concepts to DISL, we would have to introduce different vocabularies for different modalities. To overcome this, we introduce path-oriented properties, which reflect object-oriented class relationships, e.g., Graphic-UI.mobile-application.Java-MIDP and Graphic-UI.desktop-application.Java-Swing. Having several vocabularies, which can be easily accessed through path elements, allows straightforward design of multimodal UIs, since it is no problem to switch between different vocabularies within transitions or events. The renderer just has to check from which vocabulary the properties were derived and trigger the right input and output channels. A brief illustration of such path-oriented properties is sketched below; the rendering of a GUI for a mobile device will be outlined in more detail by the example in the next section.
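In the following sketch, a single style part carries properties for several modalities side by side; only the class paths follow the scheme of Figure 2, while the trailing property names and values are hypothetical:

    <part generic-widget="IncVolume">
      <property name="Graphic-UI.mobile-application.Java-MIDP.label">Vol+</property>
      <property name="Graphic-UI.desktop-application.Java-Swing.icon">volume_up.png</property>
      <property name="Audio-UI.VoiceXML.prompt">Increase volume</property>
    </part>

A renderer would pick the properties whose path matches its own position in the widget-class tree and ignore the rest.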

5. Example

In order to demonstrate the working architecture centered around the DISL language, we provide an example which is already completely implemented and in use. The idea is to control home entertainment equipment through mobile devices. More specifically, we control the playback of MP3 files on a PC by a J2ME-MIDP-enabled mobile phone. (1)

(1) In order to become attractive, consider cost-free, short-range Bluetooth communication of a mobile phone, so that it can be used as a universal remote control within the home environment. However, the current implementation applies bundled GSM-transmission-based communication (GPRS) with the server.


On a PC, a user is able to use a full-fledged graphical user interface as it comes, e.g., with Winamp (see Fig. 3). However, that UI cannot be rendered on a mobile phone with a tiny display. Therefore, we have applied the aforementioned concepts in developing a generic user interface which enables control of the MP3 player. This generic UI can be implemented as a service, which can be downloaded and used by the mobile phone.

Figure 3. GUI of Windows-based MP3 Player

The generic UI, in DISL notation, mainly describes the control model together with rendering hints. It is transformed in a very memory- and space-efficient manner to the intermediate S-DISL format through several transformation steps and finally transmitted to the mobile device, which runs the interpreter and renderer given as a Java MIDlet. The UI for our music player consists of controls to switch the player on or off, to start playback, to stop playback, to mute or to pause the sound, and to jump to the next or the previous title; volume control is also possible. The collection of these controls is provided as a list of widget elements in the DISL description, which also describes the state transitions as well as their binding to commands of the backend application, i.e., the Winamp player. The following (schematic) DISL code fragment gives the widget list for volume control:

    <structure>
      <widget id="VolumeControl" type="widgetlist">
        <widget id="IncVolume" type="command"/>
        <widget id="DecVolume" type="command"/>
        <widget id="Apply" type="confirmation"/>
        <widget id="Cancel" type="confirmation"/>
      </widget>
    </structure>

The structural part of the interface description is followed by a style description for each supported widget. The style elements provide information for the renderer; for example, they define whether the widget is visible or not. The following code fragment shows the style component for one widget (the property names are schematic):

    <style>
      <part generic-widget="IncVolume">
        <property name="title">Increase Volume</property>
        <property name="description">Increases Volume by 10</property>
        <property name="help">Every time this command is activated the volume will be increased by 10%</property>
        <property name="visible">no</property>
        ...
      </part>
    </style>


DISL structure and style specifications are quite similar to UIML. The following behavioral part largely differs from UIML and extends UIML towards state-oriented DSN. The specification consists of rules and transitions, as introduced in Subsection 4.2. We only show one transition illustrating the action of the "increase volume" command. The transition fires after the "IncVolume" rule becomes true. Then, the value of the variable "IncVolumeValue" is added to the variable "Volume". The following actions then switch the "Apply" and "Cancel" widgets to visible (2):

    <transition>
      <if-true rule="IncVolume"/>
      <action>
        <statement>
          <variable name="Volume" op="add">IncVolumeValue</variable>
        </statement>
        <statement>
          <property generic-widget="Apply" name="visible">yes</property>
        </statement>
        <statement>
          <property generic-widget="Cancel" name="visible">yes</property>
        </statement>
      </action>
    </transition>

(2) "visible" is interpreted as "audible" for voice rendering.

Commands to the backend application are provided as HTTP requests, which are handled by the Interaction Manager, which is responsible for passing the commands to the application. The Interaction Manager can employ the functionality of a web server, since WAP-enabled phones and PDAs typically support HTTP. Therefore, we propose a servlet-based approach to enable the communication part of this system. For a first test environment, the player software to be triggered can reside on the same machine as the web server, but this can easily be changed to a distributed system, e.g., with the OSGi Framework (http://www.osgi.org/). That would allow controlling applications on multiple target devices, for example, TV, VCR, or radio.
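For instance, the "increase volume" command of the example above might reach the Interaction Manager's servlet as an ordinary HTTP GET request; the URL layout here is hypothetical:

    http://<server>/mirs/control?cmd=setvolume&value=148

The servlet would then map such a request to the corresponding command of the Winamp player.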

6. Conclusion and Future Work

This paper introduced a generic MMI framework. The architecture of our Multimodal Interaction and Rendering System (MIRS) was introduced in combination with the XML-based Dialog and Interface Specification Language (DISL). DISL is based on an extended UIML subset; the extension itself is based on DSN (Dialog Specification Notation). A servlet-based implementation of the UI rendering pipeline is planned for the transformation steps on the server side, as well as Java-MIDP-based renderers on the client side, in order to demonstrate the feasibility for limited mobile devices. In order to explore multimodality with DISL, we also aim for a voice renderer, which will very likely be established in a PC-based testbed, considering the low processing power of current PDAs and mobile phones. How the switching between modalities can be done at runtime is still an open problem to be researched.

References

[1] E. Aarts. Ambient Intelligence in HomeLab. Royal Philips Electronics, 2002.
[2] B. d'Ausbourg, G. Durrieu, and P. Roche. Deriving a formal model from UIL description in order to verify and test its behavior. In Proc. of the 3rd Eurographics Workshop on Design, Specification, and Verification of Interactive Systems, 1999.
[3] L.D. Bergman, G. Banavar, D. Soroker, and J. Sussman. Combining handcrafting and automatic generation of user-interfaces for pervasive devices. In CADUI'2002, 4th International Conference on Computer-Aided Design of User Interfaces, Valenciennes, France, 2002.


[4] M. Abrams et al. UIML: An appliance-independent XML user interface language. In Computer Networks 31, Elsevier Science, 1999.
[5] S. McGlashan et al. Voice Extensible Markup Language (VoiceXML) Version 2.0, 2003.
[6] J. Ayars et al. (eds.). Synchronized Multimedia Integration Language (SMIL 2.0), 2001.
[7] SALT Forum. Speech Application Language Tags (SALT) 1.0 Specification, 2002.
[8] H.R. Hartson et al. UAN: A user-oriented representation for direct manipulation interface designs. In ACM Transactions on Information Systems, vol. 8, no. 3, July 1990.
[9] J.A. Larson, T.V. Raman, and D. Raggett (eds.). W3C Multimodal Interaction Framework. W3C NOTE, 06 May 2003.
[10] C. Janssen. Dialogue nets for the description of dialogue flows in graphical interactive systems. In Proceedings of Software-Ergonomie '93, Teubner, Stuttgart, 1993.
[11] T. Kamada. Compact HTML for Small Information Appliances. W3C Note, World Wide Web Consortium, February 1998. URL: http://www.w3.org/TR/1998/NOTE-compactHTML-19980209.
[12] P. King and T. Hyland. Handheld Device Markup Language Specification, May 1997. URL: http://www.w3.org/TR/NOTE-SubmissionHDML-spec.html.
[13] J. Plomp, R. Schaefer, W. Mueller, and H. Yli-Nikkola. Comparing transcoding tools for use with a generic user interface format. In Extreme Markup Languages 2002, Montreal, Canada, August 4-9, 2002.
[14] G. Szwillus. Object-oriented dialogue specification with ODSN. In Proceedings of Software-Ergonomie '97, Teubner, Stuttgart, 1997.
[15] Z. Trabelsi, S.-H. Cha, D. Desai, and Ch. Tappert. A voice and ink XML multimodal architecture for mobile e-commerce systems. In Proceedings of the Second International Workshop on Mobile Commerce, Atlanta, Georgia, USA, 2002.
[16] WAP Forum. WAP Binary XML Content Format, April 1998. URL: http://www.wapforum.org.
[17] WAP Forum. Wireless Markup Language Specification Version 1.1, June 1999. URL: http://www.wapforum.org.
[18] M. Weiser. The computer for the 21st century. Scientific American, 265(3):94-104, 1991.
