XML Based Multimodal Interfaces on Mobile Devices in an Ambient Assisted Living Scenario

Bálint Tóth
Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics, Magyar tudósok körútja 2., Budapest, 1117 HUNGARY
E-mail: [email protected]

Géza Németh
Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics, Magyar tudósok körútja 2., Budapest, 1117 HUNGARY
E-mail: [email protected]
ABSTRACT
As information technology is more and more included in everyday life, the design of the user interface becomes a particularly relevant question. New technologies are used not only by computer experts, but by everyday users and by people with special needs as well. Elderly people are a special target group, as their requirements are different, often higher than those of average users. If the usage of a system is not straightforward enough for them, they can easily lose their motivation to use it. Traditional computer based user interfaces may be too difficult for them, so it is reasonable to complement the graphical user interface with other modalities. In the current paper the combination of the graphical and speech modalities is investigated, the authors' approach to describing multimodal interfaces on mobile devices is presented, and a possible ambient assisted living scenario is introduced.

Author Keywords
Multimodality, Assisted Living, Ambient Intelligence, Speech User Interface, SUI, Graphical User Interface, GUI, Text-To-Speech (TTS), Automatic Speech Recognition (ASR)

ACM Classification Keywords
H5.2. User Interfaces: Graphical user interfaces (GUI), Voice I/O

INTRODUCTION
The latest mobile devices, including PDAs and smartphones, possess numerous favorable features: portability, a well designed user interface, good performance and an affordable price. Furthermore, 3rd party developers are able to create complex applications with enhanced features, like speech recognition and speech synthesis, on these devices.

Compared to desktop computers, user interaction is limited on mobile devices. To make mobile devices easier and faster to use, the speech modality can complement the usual graphical user interface: speech recognition can be used as user input and speech synthesis as user output. Multimodal interfaces on mobile devices can be very beneficial in ambient intelligence scenarios. The current paper focuses on the ambient assisted living of elderly people. They may have many sensors and actuators in their home, which they can control via a mobile device. In this scenario, combining the graphical user interface with the speech user interface is very important for several reasons:

1. A remarkable percentage of elderly people are not familiar with traditional computer interfaces (keyboard, screen, pointing devices). For them speech can be a modality which is more natural than graphics on a display panel.

2. Elderly people often have impaired vision and/or motor impairments (e.g. shaking hands). Because of these impairments it is even more difficult for them to read display panels and to use the input devices (e.g. keyboard, touch-screen). For them it is also favorable to have speech input and output for controlling and supervising the assisted living system.

3. When elderly people move around, they either have to carry the control device of the assisted living system with them or stand up to use a fixed control device. Here the speech modality can also be a solution: the moving user can get information with the help of text-to-speech (TTS) and can request information via automatic speech recognition (ASR) while the mobile control device remains in the same position.

Although the speech user interface can substitute other modalities, the regular modality (graphical output, input via keys / touch-screen) is still required because of the limitations of current speech technologies (e.g. when the environment is too noisy for speech recognition, or when the output of speech synthesis is not intelligible because of the noise). If the graphical modality is realized on a mobile device, then users can always have it in their pocket. For the reasons introduced above, mobile devices were chosen as the target platform. Possible solutions for creating multimodal interfaces for the ambient assisted living scenario are described in the following sections, and an example of realizing a task is given.
PROBLEM STATEMENT
Creating multimodal interfaces on mobile devices for elderly people is a challenging task. In this section the difficulties of creating the graphical and the speech user interface are analyzed, including the problems raised by combining these two modalities.

Graphical User Interface and Regular Input Methods
Applications on mobile devices must have an easy and intuitive way of usage to be appealing to the target user group. In the case of the current scenario the demands of the users are much higher than in a traditional case. If the usage of the system is not straightforward enough, then users can easily lose their motivation to use it. Designing and realizing a Graphical User Interface (GUI) for PDAs or mobile phones is greatly different from doing so for desktop computers. Developers must be careful when porting an application from a desktop computer to mobile devices, or when developing a completely new application for mobile devices. There is no general paradigm for creating a GUI for mobiles, but there are guidelines which can be followed [1]. The following main steps should be considered during the design process:
1. Port the user interface from the desktop version or design a new one.

2. Go over the interface elements and remove the unnecessary ones.

3. Go over modal and modeless dialogs and remove popup dialogs.

4. Remove the unnecessary menus, submenus and toolbar icons to reduce their number to the minimum.

5. Test the application on as many devices from different vendors as available. There are minor and sometimes major differences between them even if the same operating system is running (e.g. in the structure of the file system, in the registry, in the performance, in the built-in speakers and in the display panel).
If the application is to be implemented on different platforms, then even more questions are raised. Mobile devices may be quite different from each other; consequently the user interface must be designed for the selected devices' features. Today's mobile devices have rather small display panels and rather slow input methods compared to desktop computers. Some devices have a touch-screen (e.g. PDAs, the Sony Ericsson P800 or the Nokia 7700), others have QWERTY keyboards (e.g. the Nokia 9200 Series Communicator, the Nokia 6810 or the HTC Kaiser; Windows Mobile based PDAs also have a soft input panel and optional external keyboards) and/or a numeric keypad. The size of the display panel also varies, together with the resolution. Therefore an application that can run on many platforms may not be usable on all of them. Consequently, first the type of the device must be defined and then the design of the graphical user interface should be performed. Let us have a short look at the currently most commonly used mobile devices.

PDAs (Personal Digital Assistants)
From the wide variety of mobile devices, designing a user interface for a PDA is the most similar to designing a UI for desktop computers. The menu and the taskbar are in the lower part of the screen. Almost all UI elements that exist on desktop computers can be used. The behaviour of the touch screen simulates the functionality of the mouse: simple click, double click, tap-and-hold for drag-and-drop, and a long press of the screen simulates the left mouse key. Consequently porting a graphical user interface from desktop computers to a PDA seems to be quite an easy task, but one must always keep the steps given above in mind.

Symbian and Microsoft Mobile Phones with Keypad
The penetration of these mobile phones is much greater than that of PDAs. They have the traditional 12 telephone keys, the dial and hang-up keys and some additional buttons. These devices are optimized for one-hand operation, and only a reduced number of user interface elements is available. Designing the graphical user interface of a complex application can be a very hard task on these devices. Applications on mobile phones are often run by unskilled users; accordingly the usage of the software must be easy, intuitive and comfortable. Generally there are two main fields of interface design that require special care in the case of smartphones: the control of the application and the user interface elements.
• Control: most smartphones have two function keys at the bottom of the screen. It is recommended to assign a single functionality to the left soft key, and to have a menu with not more than 9 menu items for the right soft key. In the case of binary type questions the left soft key stands for the positive selection, while the right soft key stands for the negative selection. Additional functionality can be added to the numeric keys.
• User interface elements: usually a reduced number of UI element types is applied in mobile phones, although most of the tasks can be performed with these elements. For example, a rich edit control is rarely necessary; a textbox is usually sufficient. The limited text input size should also be kept in mind – usually users do not write texts longer than 1000 characters even if a predictive input method is supplied. Furthermore, the label of a UI element should always be put above the element – not to its left or right side, as is done on desktop computers.
Mobile Phones with a Touchscreen
This type of device is rather new on the market, and only a few of them are currently available (e.g. the Apple iPhone). It is very similar to the PDA, but with a smaller screen. Everything written above is applicable to these devices as well, excluding the soft keys part.

Speech User Interface
Designing speech user interfaces for mobile devices is a difficult task. If new tasks can be performed with the speech modality, or old ones can be solved more comfortably than with the graphical modality, then it is reasonable to use SUIs to improve the usability of small devices. Unfortunately SUIs can also decrease usability if the speech modality is more difficult to use than the graphical one. In this case speech only frustrates the user, and the user will choose the graphical modality instead.

From the technology point of view, the lack of a standardized speech interface is a general difficulty. On mobile devices there is no such standard as the Speech API (SAPI) [2] on Windows systems or the Java Speech API (JSAPI) [3]; consequently all the speech related engines, including TTS and ASR engines, have different programming interfaces. If a 3rd party application developer would like to create a SUI, the developer has to buy a TTS and an ASR engine. And in this case multilingual support is still not solved: most engines support only the languages of the most significant markets (i.e. English, German, French, etc.). Other, 'smaller' languages may have TTS and ASR engines which cannot be attached to off-the-shelf products, as long as there is no standardized speech interface. Although server side speech processing [4] could be a solution for multilingual support, in this case a data connection is required, which is still expensive, especially abroad.

To achieve multilingual support and to make the design of SUIs easier, a description language should be used instead of hard coding the SUI. There are existing solutions for speech modality descriptions (e.g. VoiceXML [5], SSML [6]), and there are also solutions for combining GUI (e.g. XIML [7], XUL [8]) and SUI (e.g. XHTML+Voice Profile 1.0 [9], Mobile X+V 1.2 [10]), but these are primarily intended for dialogue systems and websites, not for thick client mobile applications. A more detailed investigation of creating speech user interfaces on mobile phones can be found in [11].

Multimodal User Interface
The main question about combining the speech and the graphical user interface is how much correlation there should be between them. Generally there are three possible solutions:

1. The SUI and the GUI are completely separated. For example, the GUI remains the same and the SUI is a traditional dialogue system.

2. There is some correlation between the SUI and the GUI. For example, the architecture of the GUI and the SUI is similar, but parts of the SUI work like a dialogue system.

3. The GUI and the SUI are almost the same, but in different modalities. For example, the application works like a screen-reader [12], [13].

In the current scenario we would like to choose the second solution. For more details see the last section.

XML BASED MULTIMODAL DESCRIPTION
The authors work on creating a scalable, multimodal, XML based description language and the corresponding engine, which creates and supervises the user interfaces on different platforms.
[Figure 1 is a block diagram: a 3rd party application calls the main module (an XML interpreter, processor and interface realized as a dynamically linked library), which reads the XML user interface description file and drives the different modalities: the GUI and the SUI, the latter connected to the TTS and ASR engines.]

Figure 1. The architecture of the XML based multimodal description language.
Figure 1 shows the high level architecture of our approach. The GUI and the SUI are described in an XML file. The main module is a dynamically linked library, which is called by the 3rd party application. The main module reads the XML user interface description file, interprets it, and based on it creates the graphical and speech modalities.
The TTS and ASR engines are attached to the SUI via a standardized interface; consequently additional engines can be attached to the system. In the following the main features of the multimodal XML description language are investigated; a more detailed description can be found in [14].

Multimodality
Below, the main aspects of describing GUIs and SUIs, and their relationship, are described in short. As introduced in the previous section, we consider it a basic requirement to have a record for every user interface element. Unfortunately different platforms have different user control elements, and even the same type of element rarely operates the same way. To solve this problem XSLT [15] is applied (see the 'Scalability' subsection below). The user controls always have a common and a specific part. The common part contains general features, like position and color; the specific part contains unique features only, like the maximum number of characters in a textbox. An example of the GUI description is shown in Figure 2.
Figure 2. The architecture of the GUI's XML description. At the top level the type of the platform is given; the child elements ('objects') are the user controls; user controls have common and specific features.
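Since the XML listing of Figure 2 cannot be reproduced here, the fragment below is only a minimal sketch of how such a GUI description could look. The platform / object structure and the separation into common and specific parts follow the caption of Figure 2; all concrete element names, attribute names and values are illustrative assumptions and do not necessarily match the actual schema of [14].

<platform type="windows_mobile_pda">
  <object id="btn_sleep" type="button">
    <common>
      <!-- general features shared by every user control -->
      <position x="10" y="40" width="220" height="30"/>
      <color background="#FFFFFF" foreground="#000000"/>
      <label>I am going to bed</label>
    </common>
    <specific>
      <!-- features that only this control type has -->
      <default_state>enabled</default_state>
    </specific>
  </object>
  <object id="txt_status" type="textbox">
    <common>
      <position x="10" y="80" width="220" height="60"/>
    </common>
    <specific>
      <max_characters>1000</max_characters>
    </specific>
  </object>
</platform>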
The main goal of the Speech User Interface may be either to be used alone or to complement the graphical modality. Unfortunately the previously described solutions are not suitable, as they are too complex for mobile devices (the resulting XML description files are too large, and it is impossible to process them with good response times), they generally realize speech dialogues (not well suited for thick client applications), and there is no API for handling them on mobile devices (3rd party interpreters must be used – if there are any).

In the BelAmi framework an XML description language for describing SUIs was defined [14]. (BelAmi stands for Bilateral German-Hungarian Research Collaboration on Ambient Intelligence Systems; Fraunhofer IESE, the Bay Zoltán Foundation of Applied Research, TU Kaiserslautern, the Budapest University of Technology and Economics and the University of Szeged are involved. See http://www.belamiproject.org/ for more information.) Generally it contains 3-level dialogues (TTS part – ASR part – TTS part); the sequence of the dialogues can be defined. The possible answers are stored in parameters or in a separate file, and conditional (if-else) command execution is also possible. This description was implemented on PDAs in Java MIDP [16], and speech I/O was the only modality. Our approach is to create one or more dialogues for each user control. To keep the system downwards compatible, the existing SUI description is kept and encapsulated in the 'object'. Empty user controls, which have no graphical representation, are also allowed; this way traditional speech dialogue systems can be realized with the model as a part of the multimodal interface. An example of the SUI description can be seen in Figure 3.
Figure 3. The architecture of the SUI's XML description. At the top level the type of the platform is given; the child elements are interconnected with user controls ('objects'). The TTS and ASR settings are generally defined in SUI_settings on the top level; they can be modified within the dialogues.
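Similarly, the fragment below is only an illustrative sketch of a possible SUI description, based on the caption of Figure 3 and on the dialogue model described above (a TTS part – ASR part – TTS part sequence, answers stored in parameters, conditional execution). Apart from SUI_settings and next_dialogue, which are mentioned in the text, the element names, attributes and condition syntax are hypothetical.

<platform type="windows_mobile_pda">
  <SUI_settings>
    <!-- global TTS and ASR settings; they can be overridden inside a dialogue -->
    <tts language="hu-HU" voice="default" volume="80"/>
    <asr language="hu-HU" mode="keyword_spotting"/>
  </SUI_settings>
  <object ref="btn_sleep">
    <dialogue id="confirm_sleep">
      <tts>Are you going to bed now?</tts>
      <asr keywords="yes;go to bed;sleep" result_param="answer"/>
      <tts condition="answer!=''">Good night, I will check the flat.</tts>
      <!-- if nothing was recognized, ask again -->
      <next_dialogue condition="answer==''">confirm_sleep</next_dialogue>
    </dialogue>
  </object>
</platform>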
The interconnection of the different modalities is an important question, too. As the description of the user controls is object-based, it is possible to join or separate the GUI and the SUI in different files. It is even possible to have one file for the GUI and another file for the SUI of each user control. More can be read about the concept in [14].

Scalability
The speech user interface may remain the same on different platforms, but the graphical modality has to be modified if the platforms are not compatible with each other. It is suggested to use XSL Transformations (XSLT) [15] to transform the GUI description from one platform to another. More can be read about the concept in [14].
In the BelAmi4 framework an XML description language for describing SUIs was defined [14]. Generally it contains
4
BelAmi stands for Bilateral German-Hungarian Research Collaboration on Ambient Intelligence Systems. It is a German-Hungarian cooperation. Fraunhofer IESE, Bay
Zoltán Foundation of Applied Research, TU Kaiserslautern, Budapest University of Technology and Economics, University of Szeged are involved. See http://www.belamiproject.org/ for more information.
the user’s utterance about going to bed. It must be kept in mind, that it can be said in several ways. It is reasonable to set the recognizer to recognize keywords in the continuous utterance. If the utterance contains the keywords (e.g. go to bed / sleep), then the recognition is successful.
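The stylesheet below is a minimal sketch of how such a transformation could look, reusing the hypothetical element names of the earlier GUI sketch: it copies the description unchanged, retargets the platform type and halves the pixel coordinates for a device with a smaller screen. The transformations actually used in [14] may differ.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- copy everything by default -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- retarget the description from a PDA to a keypad smartphone -->
  <xsl:template match="platform/@type">
    <xsl:attribute name="type">symbian_smartphone</xsl:attribute>
  </xsl:template>

  <!-- halve the pixel coordinates for the smaller screen -->
  <xsl:template match="position/@x | position/@y | position/@width | position/@height">
    <xsl:attribute name="{name()}"><xsl:value-of select="floor(. div 2)"/></xsl:attribute>
  </xsl:template>
</xsl:stylesheet>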
A POSSIBLE AMBIENT ASSISTED LIVING SCENARIO
We would like to apply the middleware introduced above in an assisted living scenario for elderly people. As a first step a simple task will be solved with the help of multimodal interface design. If the users (elderly people) find the usage of the interface easy and intuitive, then further tasks can be implemented, based on the XML description language for multimodal interfaces.
In this very special field it is beneficial to choose a well defined, simple, but new task. If the users are already familiar with the chosen task (e.g. controlling a video recorder), then they can easily lose their motivation to use the new modalities (in this case speech recognition and synthesis).
To avoid this situation, a new kind of service was proposed that can be beneficial for elderly people and in which the task is straightforward. The suggested scenario can be seen in Figure 4.
[Figure 4 is a use-case diagram: the elderly person is the actor, connected to the use cases 'User goes to bed', 'Is every door and window closed?', 'Is every light switched off?' and 'Can the user go to sleep?'.]
Figure 4. The Use-Case diagram of the proposed scenario.
When the elderly person goes to bed, s/he has to tell the system, via voice or via the graphical interface, that s/he is going to sleep. Then the system checks whether all the windows and doors are closed, all the lights are switched off, etc. If the result is negative, the system gives an alert both via voice and via the graphical user interface, telling the user, for example, to close the door or window or to switch off the light. If every light is switched off and every door and window is closed, then the graphical display panel goes to standby mode and a voice message informs the user that everything is in order. In the first approach the lights, windows and doors are controlled manually, later automatically.
To realize the scenario three user controls with speech user interfaces are required:

• Button: with the button object the user can tell the system that s/he would like to go to bed. The SUI part of the button is responsible only for recognizing the user's utterance about going to bed. It must be kept in mind that this can be said in several ways, so it is reasonable to set the recognizer to spot keywords in the continuous utterance. If the utterance contains the keywords (e.g. go to bed / sleep), then the recognition is successful. (A possible SUI description of this dialogue is sketched after this list.)
During the whole task there must be another button object, with which the user can tell the system that s/he has changed her/his mind and is not going to sleep. This object also has a graphical and a speech modality, and it is active only if the first button object was activated before.

• Label: with the label object it is possible to tell the user that somewhere a door or window is not closed, or a light is not switched off. In the case of the GUI the label displays a text with the necessary information. In the case of the SUI the TTS engine reads the alert and, at the end of the utterance, tells the user that the alert can be repeated by saying some keywords (e.g. please repeat, can you repeat it, etc.). Consequently there is a conditional part in the SUI description: after the TTS reads the alert, if the repetition of the alert is requested (i.e. the keywords / sentence is recognized), the next_dialogue parameter is set to the same dialogue, so the alert is read again.

• Image: in order to display the place of the alert on the ground plan, it is favorable to display an image. The image has two layers: the first layer is the ground plan of the flat; the second contains the alert points. Both layers can be realized by image objects. In this case the only modality is the GUI.
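To illustrate the above, the fragment below sketches a possible SUI description of the button and label dialogues in the notation of the earlier sketches; the keyword lists, element names and condition syntax are illustrative assumptions only.

<platform type="windows_mobile_pda">
  <object ref="btn_sleep">
    <dialogue id="going_to_bed">
      <!-- keyword spotting in a continuous utterance -->
      <asr keywords="go to bed;going to sleep;sleep" result_param="intent"/>
      <tts condition="intent!=''">Good night. I am checking the doors, windows and lights.</tts>
    </dialogue>
  </object>
  <object ref="lbl_alert">
    <dialogue id="read_alert">
      <tts>The kitchen window is open. Say 'please repeat' to hear this again.</tts>
      <asr keywords="please repeat;can you repeat it;repeat" result_param="again"/>
      <!-- if the repetition keywords are recognized, read the same alert again -->
      <next_dialogue condition="again!=''">read_alert</next_dialogue>
    </dialogue>
  </object>
</platform>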
CONCLUSION
In the current paper the possibilities and difficulties of creating speech user interfaces on mobile devices were investigated, our approach of an XML based multimodal user interface description language was introduced, and a possible solution was described for creating multimodal interfaces in an ambient assisted living scenario for elderly people.

After realizing the scenario, user tests will show whether this approach is suitable for the target user group. If the result is positive, then more scenarios will be implemented. The project is evaluated in the BelAmi framework. A Hungarian student works in Kaiserslautern and realizes the multimodal interface, with the help of the German and Hungarian supervisors, as a part of his master's thesis.

ACKNOWLEDGMENTS
The research presented in the paper was partly supported by the Hungarian National Office for Research and Technology (NAP project no. OMFB-00736/2005 and GVOP project no. 3.1.1-2004-05-0485/3.0).

REFERENCES
1. C. Bloch, A. Wagner: MIDP 2.0 Style Guide for the Java 2 Platform, Micro Edition, Addison-Wesley, 2003, 262 pp.
2. Parmod, G.: MS SAPI 5 Developer's Guide, InSync Software Inc., 2001.
3. Java Speech API. Available: http://java.sun.com/products/java-media/speech/
4. G. Németh, G. Kiss, Cs. Zainkó, G. Olaszy, B. Tóth: Speech Generation in Mobile Phones, in: D. Gardner-Bonneau, H. Blanchard (eds.): Human Factors and Interactive Voice Response Systems, 2nd ed., Springer, forthcoming.
5. Voice Extensible Markup Language (VoiceXML) Version 2.1, 2006. Available: http://www.w3.org/TR/voicexml21/
6. Speech Synthesis Markup Language (SSML) Version 1.0, 2004. Available: http://www.w3.org/TR/speech-synthesis/
7. J. Eisenstein, J. Vanderdonckt, A. Puerta: Applying Model-Based Techniques to the Development of UIs for Mobile Computers, Fifth International Conference on Intelligent User Interfaces, ACM Press, 2001, pp. 69-76.
8. XML User Interface Language (XUL) Project. Available: http://www.mozilla.org/projects/xul/
9. XHTML+Voice Profile 1.0, 2001. Available: http://www.w3.org/TR/xhtml+voice/
10. Mobile X+V 1.2, 2005. Available: http://www.voicexml.com/specs/multimodal/x+v/mobile/12/
11. B. Tóth, G. Németh: Challenges of Creating Multimodal Interfaces on Mobile Devices, Proc. of the 49th International Symposium ELMAR-2007 focused on Mobile Multimedia, 12-14 Sep. 2007, Zadar, Croatia, pp. 171-174.
12. Nuance - Nuance Talks. Available: http://www.nuance.com/talks/
13. JAWS for Windows. Available: http://www.nanopac.com/jaws.htm
14. B. Tóth, G. Németh: Creating XML Based Scalable Multimodal Interfaces for Mobile Devices, 16th IST Mobile and Wireless Communications Summit, Budapest, Hungary, July 2007.
15. XSL Transformations (XSLT) Version 1.0, 1999. Available: http://www.w3.org/TR/xslt
16. R. Riggs, A. Taivalsaari, J. V. Peursem, J. Huopaniemi, M. Patel, A. Uotila: Programming Wireless Devices with the Java 2 Platform Micro Edition, Second Edition, Addison-Wesley, 2003, 434 pp.