Creating XML Based Scalable Multimodal Interfaces for Mobile Devices

B. Tóth, G. Németh

Abstract—Portable smart devices are getting more and more popular. Many people prefer smart solutions over old devices or pen-and-paper. But developing applications for mobile devices is a challenging task. First, it is difficult to create an intuitive, easy-to-use interface; second, as current mobile platforms are not compatible, the application must be rewritten for all of them. Furthermore, it is not trivial to design multimodal interfaces (including speech and graphical I/O) which really improve software usability. For these reasons our goal is to create an XML based, scalable, cross-platform, multimodal user interface description format and the corresponding interpreter software for different platforms. This technology makes development for mobile devices much faster and easier.

Index Terms—Mobile Device, XML, Speech User Interface (SUI), Graphical User Interface (GUI), Text-To-Speech (TTS), Automatic Speech Recognition (ASR)

B. Tóth, Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics, Magyar tudósok körútja 2., Budapest, 1117 HUNGARY (e-mail: [email protected])
G. Németh, Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics, Magyar tudósok körútja 2., Budapest, 1117 HUNGARY (e-mail: [email protected])

I. INTRODUCTION

The penetration of smartphones and PDAs (Personal Digital Assistants) is rapidly increasing. As their price is getting lower, people buy smarter devices instead of low-end ones. Generally, the most important difference between smart devices and low-end phones is that on the former complex 3rd party applications can be installed, while on the latter only Java ME [1] based applications with limited features. Different types of 'smart' platforms have distinct programming paradigms, so not only the binaries but even the source code itself is not compatible. Currently there are three mainstream classes of high-end mobile devices: the Symbian based (primarily smartphones), the Windows Mobile based (including PDAs and smartphones) and the Linux based devices. Generally, Symbian based mobile devices are more common in Europe. Developers create applications in Symbian C++ (similar to C++ with platform specific conventions [2]) for Symbian devices, and in the .NET Compact Framework or in embedded Visual C++ (both similar to the .NET Framework [3] and Win32 C++ [4], but limited to a subset of functions and APIs) for Windows Mobile devices. In this paper the Linux based mobiles are not investigated, as they do not share a remarkable part of the mobile market in Europe, and these devices are rarely available in the authors' region. Unfortunately, the two platforms are not compatible, neither at the binary nor at the source code level. Many sources, e.g. [1], claim that Java ME is a solution for platform independent mobile development, but the speed of the KVM (the mobile version of the Java Virtual Machine, JVM) still cannot match that of native code, nor even the performance of the built-in CLR (Common Language Runtime) [3] of the .NET Compact Framework. Consequently, complex calculations such as speech recognition and synthesis cannot be realized in Java in real time at the time of writing. For this reason we do not deal with Java ME technology in this paper. Unfortunately, other cross-platform solutions (e.g. AppForge's Crossfire) cannot be applied either, as they cannot access low-level functions, especially on Symbian.

II. PROBLEM STATEMENT

As a consequence of the facts described above, when an idea is to be implemented on different mobile devices, it must be rewritten for all the platforms. Furthermore, there are occasions when only a small part of the interface should be modified by groups other than the developers (see section VI. for some examples), but without a user interface description the source code must be modified and the application must be rebuilt. Furthermore, creating Speech User Interfaces (SUIs) on mobile devices is a difficult task. Apart from the design issues (see section IV.), there is no standardized speech I/O on mobile devices (like the Speech API on Windows based desktop computers); consequently the same SUI presumably will not work with different Text-To-Speech (TTS) and Automatic Speech Recognition (ASR) systems. With a well designed SUI, including ASR and TTS, a dialog-like system can complete or substitute the GUI. This is favorable in scenarios where the usage of mobile devices is difficult or impossible (e.g. in-car environment), or in the case of impaired users (e.g. visually or vocally impaired). The goal of the present study is to create a specification for an active, scalable user interface and its interpreter module on different platforms. The user interface should be defined in an XML definition file. In this context, active means that the interface reacts to the users' interactions (e.g. a button is pressed, text has changed in a textbox, the ASR recognized a unit, etc.) by calling functions implemented by the 3rd party developers (see subsection V./D.), and scalable means that, according to the definition file, the user interface may have
different layouts and features in each modality on each platform (e.g. Pocket PC devices, MS Smartphones, Symbian OS based smartphones, etc.). Furthermore, we would like to provide a simple, standardized programming interface for TTS and ASR engines, so that the applied speech technologies can easily be modified or extended (see Figure 1: the TTS and the ASR share a common interface with the SUI).

III. GRAPHICAL USER INTERFACE ON MOBILE DEVICES

Designing a Graphical User Interface (GUI) for mobile devices is a challenging task. There are some quite well defined guidelines (e.g. [5]) that describe the major issues to keep in mind, but there is no principle that can tell how to create the optimal user interface for a given task. The different layouts of devices also raise many problems in the design period. Obviously, scaling down the GUI of a PDA to a smartphone with a small display panel (e.g. Nokia 5500) while keeping usability is not possible. Scaling up the layout of a mobile phone application to larger display panels is solvable, but the advantage of the larger size will not be exploited. Therefore it is highly recommended to design different GUIs for different platforms. We must also consider the possible input methods: buttons, touch screen, or both. Furthermore, different platforms have different types of user controls, and even if the same functionality can be achieved, the way the controls work can still be different. For the reasons described briefly above, it is favorable to use an interface description and a corresponding interpreter on mobile devices.

IV. SPEECH I/O ON MOBILE DEVICES

Designing a Speech User Interface is a challenging task, too. Speech modality combined with an intuitive GUI can greatly improve the usability of small devices and allow people with disabilities (e.g. impaired vision) to use technologies (e.g. SMS) which they cannot access without a proper solution. There are applications of speech technology available for mobile phones [6], but speech I/O is still applied in hardly any part of mobile applications.

A. Design

The design of a SUI is a difficult question on mobile devices. It is often not easy to decide how much correlation there should be between the GUI (if it exists) and the SUI. In some cases (a) they should be completely separated (e.g. the GUI remains the same and the SUI is a dialog system); in other cases (b) there is some correlation (e.g. the architecture of the GUI and the SUI are similar, but parts of the SUI work like a dialog system); and in case (c) the GUI and the SUI are almost the same, only in different modalities (e.g. when an application works like a screen-reader [6], [7]).

B. Technology

From the technological point of view, the main obstacle is the lack of a standardized speech interface. There is nothing like the Speech API (SAPI) [8] of desktop Windows systems; consequently every TTS/ASR engine has a different programming interface.
If a 3rd party application developer would like to realize a SUI in an application, then they must commit themselves to one technology, and after the final product is released, the TTS/ASR engines cannot be changed. (Remark: on desktop computers there is standardized speech I/O; the user can install as many engines as s/he wants, select at the operating system level which speech technology to use, and all applications that employ a SUI through the standardized interface will use the selected engines. This is very favorable, e.g. when a TTS/ASR engine does not support a given language and another engine has to be used.) There are existing solutions to this problem [9] (the processing is done on a server machine, and the audio and the responses are sent/received over any communication channel), but they create data traffic, which is still expensive (especially abroad). There are existing solutions for SUI descriptions (e.g. VoiceXML [10], SSML [11]), and there are also some solutions for combining GUI and SUI (e.g. XHTML+Voice Profile 1.0 [12], Mobile X+V 1.2 [13]), but these are primarily intended for dialog systems and home pages, not for thick-client mobile applications.

V. THE AUTHORS' APPROACH

The authors' goal is to create a standardized XML description for both GUIs and SUIs, and to create libraries on different platforms which generate and handle the multimodal interface. Figure 1 shows the basic architecture of the system. 3rd party developers' applications use our module. Our module consists of several DLLs (Dynamically Linked Libraries), but the main interface, which is used by 3rd party developers, is realized in one DLL (Main Module). The XML interpreter, which is connected to the Main Module (Processor and interface), reads the XML description file(s) and returns the relevant information to the Main Module. The different modalities, namely the GUI and the SUI, are also connected to the Main Module. Currently the size of the XML description is in the 10 KB range; later, if the size increases in some application scenarios, compression may be considered. The rest of this section investigates the main challenges and briefly summarizes our solutions for them.
Fig. 1. The architecture of the XML based multimodal interface. (Block diagram: the 3rd party application uses the Main Module: Processor and interface (Dynamically Linked Library), which is connected to the XML Interpreter and to the modalities, the GUI and the SUI; the TTS and ASR engines interface with the SUI.)
A. GUI XML definition

The description of the Graphical User Interface should contain everything that can be rendered on the platform. Consequently, there might be different user control descriptions e.g. on the Symbian platform than on Windows Mobile. Fortunately this kind of incompatibility can be solved (see subsection V./E. for more details). Furthermore, the description of the user controls must be in a hierarchical structure: all user controls are on the same level, and the properties of the user controls are child elements. As all user controls have some common features (e.g. size), it is favorable to handle the common and the specific features separately. This is beneficial if the type of a user control is modified (only the specific features must be changed) or if we want to clone the user control. For the reasons given above, we define the platform on the top level. The user controls are child elements of this top level. The definition of a user control is realized in an 'object' tag, which is uniquely identified by an auto-increment identifier. There are at least two child elements of an 'object' if we are talking about GUI descriptions (additional children are introduced in subsection V./B.). One of them is related to the common features of a user control (e.g. position, size, color, etc.), the other one refers to the user control's specific features (e.g. the properties of the text in a textbox). Additionally, the SUI description can be encapsulated in the 'object' element. Figure 2 shows an example of the GUI description.
Fig. 2. The main architecture of the GUI’s XML description. On the top level, the platform is defined, and child elements are the user controls (‘objects’). Every user control has a common and a specific feature description part.
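Since the original XML listing of Figure 2 is not reproduced here, the fragment below is only a minimal sketch of what such a description could look like, following the structure described above. Every element and attribute name (platform, object, common, specific, etc.) is an illustrative assumption, not the authors' actual schema.

<!-- Hypothetical GUI description fragment; all element/attribute names are assumed -->
<platform name="WindowsMobile.PocketPC">
  <object id="1" type="textbox">           <!-- uniquely identified by an auto-increment id -->
    <common>                               <!-- features shared by all user controls -->
      <position x="10" y="24"/>
      <size width="220" height="28"/>
      <color background="#FFFFFF" foreground="#000000"/>
    </common>
    <specific>                             <!-- features specific to this control type -->
      <text value="Enter recipient" maxlength="64"/>
    </specific>
    <!-- an optional SUI part (dialogs) may also be encapsulated here, see Fig. 3 -->
  </object>
</platform>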
B. SUI XML definition

The main goal of the Speech User Interface is to be used either alone or as another modality besides the GUI. There are existing SUI description languages, as mentioned above, but the main problems of these solutions are that (1) they are too complex for mobile devices, (2) they generally realize speech dialogs (not well suited for thick-client applications), and (3) there is no API for handling them on mobile devices. Last year, in the BelAmi1 framework, an XML description language for creating SUIs was defined [14]. Briefly, it contains dialogs (a dialog optionally starts with a TTS output, then it may optionally wait for an ASR input, and may have a second TTS output as well, e.g. an acknowledgement), and the dialogs can refer to each other to define a sequence. The possible answers are stored in parameters, and conditional (if-else) command execution is also possible. This description was implemented on PDAs in Java, and speech I/O was the only modality. Our main idea is to have one or more dialogs for each user control. To keep the systems compatible, we kept the existing SUI description, which is encapsulated in the 'object' (see Figure 3). Dummy user controls are allowed when an 'object' has no graphical part, only speech part(s). This way traditional speech dialog systems can still be realized with our model, but one can also use the SUI as an extension of the GUI (or vice versa). Implementing the SUI interpreter is a challenging task. Apart from the problems introduced in subsection IV./B., the limitations of mobile devices must be considered. While the performance of concatenative TTS engines [15] is satisfactory, different speakers and languages require large amounts of storage space. The situation is worse in the case of the ASR. Generally the problem is not only the storage size but the performance as well. The Hungarian HMM based, speaker independent recognizer works with acceptable response times2 only if the fixed vocabulary contains no more than 180-200 units. This is quite a strict restriction, but considering that a separate vocabulary can be given for each 'object', most tasks can be solved.

1 BelAmi stands for Bilateral German-Hungarian Research Collaboration on Ambient Intelligence Systems, a German-Hungarian cooperation. Fraunhofer IESE, the Bay Zoltán Foundation of Applied Research, the University of Kaiserslautern, the Budapest University of Technology and Economics and the University of Szeged are involved. See http://www.belami-project.org/ for more information.
Fig. 3. The main architecture of the SUI's XML description. On the top level, the platform is defined, and the child elements are associated with user controls ('objects'). The way the TTS and ASR work is defined globally in the SUI_settings and can be modified locally in the dialogs.
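As with Figure 2, the original listing of Figure 3 is not reproduced, so the following is only a sketch of a possible SUI description consistent with the text above: global SUI_settings, a dialog with an optional TTS output, an ASR input whose result is stored in a parameter, and an acknowledgement. Apart from SUI_settings, which is named in the caption, all element and attribute names are illustrative assumptions.

<!-- Hypothetical SUI description fragment; element/attribute names are assumed -->
<platform name="WindowsMobile.PocketPC">
  <SUI_settings>                          <!-- global TTS/ASR behaviour -->
    <tts voice="default" rate="0"/>
    <asr timeout="5000"/>
  </SUI_settings>
  <object id="2" type="combobox">
    <dialog id="d2" next="d3">            <!-- dialogs refer to each other by identifier -->
      <settings>                          <!-- optional local overrides of SUI_settings -->
        <tts rate="-2"/>
      </settings>
      <tts_out>Please choose a recipient.</tts_out>
      <asr_in param="recipient">          <!-- the recognized unit is stored in a parameter -->
        <vocabulary>                      <!-- per-object fixed vocabulary -->
          <unit>John</unit>
          <unit>Mary</unit>
        </vocabulary>
      </asr_in>
      <tts_out>You selected $recipient.</tts_out>  <!-- acknowledgement -->
    </dialog>
  </object>
</platform>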
C. Connection of the GUI and the SUI

In our approach, as introduced above, we allow traditional speech dialogs in order to make the SUI scalable. There are application areas where only the SUI (e.g. in-car environment) or only the GUI is present. Apart from these special cases, SUI elements (one or more dialogs) generally belong to GUI user controls. The identifiers of the dialogs (see Figure 3) allow us to predefine the sequence of the possible 'audible user controls' (introductions, questions, answers, acknowledgements), as the positions of the GUI elements cannot unambiguously define the order of the SUI dialogs (a small sketch of such a chain is given below). With this idea all three variations introduced in subsection IV./A. can be realized, but the structure of our XML description file encourages the developers to follow the IV./A/(b) solution.

2 E.g. on a recent top-category PDA: Qtek 2020i (520 MHz, 64 MB RAM).
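To illustrate the chaining mentioned above, here is a minimal sketch, reusing the assumed element names of the earlier sketches (not the authors' actual schema), in which the dialog of a dummy, speech-only 'object' refers to the dialog of the next user control by its identifier:

<!-- Hypothetical dialog sequencing; element names are assumed -->
<object id="1" type="dummy">              <!-- speech-only 'object', no graphical part -->
  <dialog id="d1" next="d2">
    <tts_out>Welcome. You can compose a message by voice.</tts_out>
  </dialog>
</object>
<object id="2" type="combobox">
  <dialog id="d2" next="d3">
    <tts_out>Please choose a recipient.</tts_out>
  </dialog>
</object>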
Certainly, sometimes there is more or less correlation between the SUI and the GUI. But as a first step it is beneficial to generate the SUI from the GUI description. Therefore we applied XSL (the stylesheet language for XML) transformation (XSLT) [16], which transforms the GUI description into a standard SUI (e.g. in the case of a combobox the TTS reads the labels, they are put into the fixed vocabulary, then the ASR is started, and when something is recognized a final acknowledgement is requested); a sketch of such a transformation is given after Figure 4. We must also admit that this type of SUI design is similar to using a screen-reader [7] with software that was not designed to be accessible for vision impaired persons. For an easy-to-use, intuitive SUI the dialogs must be designed carefully. As all the 'objects' are uniquely identified, the GUI and the SUI description can be stored either in one or in multiple file(s). Putting every definition together is favorable for keeping the data in one block, which can e.g. be compressed easily. Separating some parts of the interface makes the interface portable and scalable. The separation can be done in two ways: by content and/or by modality (see Figure 4). Separation by content means that one or more user controls are separated from the others in a file. Separation by modality means that the GUI and the SUI are stored in different files. Obviously, the description of the user interface can be separated both by content and by modality. In an extreme case each user control could even be defined in a separate file, and the GUI and SUI descriptions of every 'object' could be further split in two.
Fig. 4. The possible separation of the definition files: by content (left) and by modality (right)
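As a rough illustration of the GUI-to-SUI generation described above, the stylesheet below sketches how an XSLT 1.0 transformation could turn a hypothetical combobox 'object' (assumed to list its entries as item elements with label attributes in its specific part) into a default dialog: the TTS reads the labels, the labels form the fixed ASR vocabulary, and an acknowledgement is produced. All element names outside the XSLT namespace are assumptions; the authors' actual schema and stylesheet may differ.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Keep the platform element and process only its combobox objects -->
  <xsl:template match="/platform">
    <platform name="{@name}">
      <xsl:apply-templates select="object[@type='combobox']"/>
    </platform>
  </xsl:template>

  <!-- Generate one default dialog per combobox 'object' -->
  <xsl:template match="object">
    <object id="{@id}" type="{@type}">
      <dialog id="d{@id}">
        <tts_out>
          <xsl:text>Please choose one of the following: </xsl:text>
          <xsl:for-each select="specific/item">
            <xsl:value-of select="@label"/>
            <xsl:if test="position() != last()">, </xsl:if>
          </xsl:for-each>
        </tts_out>
        <asr_in param="choice">
          <vocabulary>
            <xsl:for-each select="specific/item">
              <unit><xsl:value-of select="@label"/></unit>
            </xsl:for-each>
          </vocabulary>
        </asr_in>
        <tts_out>You selected $choice.</tts_out>  <!-- final acknowledgement -->
      </dialog>
    </object>
  </xsl:template>
</xsl:stylesheet>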
Furthermore, the inputs of the modalities must be connected. For example, if an input is given through the SUI, it must appear on the GUI; inputs entered through the GUI may affect the SUI's parameters as well (for more details see subsection V./D./2.).

D. Active user interface

The interaction between the user and the user interfaces, and between the modalities, must be ensured. Let us investigate both cases briefly:
1) Interaction between the user and the UIs: as the user interface is generated from the XML description file, the possible user interactions must be handled by the 3rd party developer (see Figure 1). In the case of the GUI, virtual functions are defined that handle the possible events (e.g. on Windows Mobile the common events like OnClick, OnMouseOver, OnTextChanged, etc.). The 3rd party developer defines what should happen: s/he overrides these virtual functions
and writes the code that is executed when a user control fires an event. There is a programming difficulty in this solution. As the user controls are rendered at runtime, they are created with the help of a single dynamic variable (in contrast with designing a UI in the source code, where usually every user control has its own variable, so it can be referred to later). This is not a problem when the user controls are generated at runtime, but it is when one would like to handle the events of a user control or change its content. To overcome the problem, the unique ID is used to identify which user control is in use.
2) Interaction between the modalities: it is important to keep the modalities transparent. As introduced in subsection V./C., an input in any modality should affect the data in all the other modalities. In our case, when two modalities are present (GUI, SUI), there are basically two options:
a) The input is given through the Speech User Interface. In this case the recognized unit (which may be acknowledged by the user) should affect the Graphical User Interface.
b) The input is given through the Graphical User Interface. Sometimes the content of the input must be propagated to a parameter of the SUI, which is used later by the TTS or the ASR.
Both cases are solved on the 3rd party developer's side. This is because it cannot be defined in general which part of the SUI (which dialog(s)) should affect the GUI, and in which cases the GUI should propagate data to the parameters of the SUI. The authors are working on automating one or both cases.

E. Platform independence

It is important to achieve platform independence. Generally we focus on three platforms: Windows Mobile (WM) based PDAs, WM based smartphones and Symbian OS based smartphones. Currently the first and the second are under implementation, and the third is to be realized soon. In the case of the SUI, achieving platform independence requires the TTS and ASR engines to be ported to the given operating system. This is a difficult task, but mainly from the programming point of view; the actual platform does not influence the design of the Speech User Interface. The problem is more complex in the case of the GUI. As introduced in section III., different platforms have different user controls. This causes problems even among Windows Mobile platforms: WM based smartphones have fewer user controls than PDAs do. The difference is even bigger between WM and Symbian based devices. Therefore, in the XML description file we define the platform on the top level. As with the user controls, the UI descriptions of the platforms can be stored in one file or separately, for the same reasons as discussed in subsection V./C. Certainly, the GUI does not need to be designed and built from scratch for all the platforms. XSL transformation can also be applied here, as in subsection V./C. With the help of XSLT we can generate the GUI for a new platform from an existing design. The XSLT transforms the user controls and removes the user
controls which cannot be implemented on the destination platform. These parts of the GUI must be realized with user controls that are available on the desired operating system.

F. Security

In the development stage the authors deal with code security issues only. All the XML description files are verified with XML schemas, and if an anomaly is found, it is reported to the 3rd party application via exceptions. Furthermore, we use exceptions to signal any error. The basic exceptions are propagated to the 3rd party application, and we have also defined new exceptions to handle additional hazards (e.g. a user control extending beyond the display panel, TTS or ASR issues). On the Symbian platform exception handling will be solved by the Symbian-specific 'Leave' and 'Trap' constructs.

VI. POSSIBLE SCENARIOS

This section gives some examples where the use of an XML based, multimodal, scalable user interface is beneficial. It can be very useful in augmentative communication aids for speech, hearing and vision impaired people. For example, applications for speech impaired users give them the possibility to make and receive phone calls [17], and with the help of screen-reader-like applications [7] visually impaired users can use mobile devices. In this special field the interface design requires much more attention, as the user can easily lose her/his motivation to use the application if s/he finds it too difficult. Often the UI must be adapted to the impaired user's needs. It is much more effective if this modification is done by the impaired user, or by his/her doctor or caretaker (who are in everyday contact with the user), than by the developers of the application. When the system is ready, the authors wish to demonstrate its effectiveness by implementing SayIt [17] and VoxAid [18] with the XML description method. Another application field can be the in-car environment. To drive safely it is important to keep the eyes on the road, so a SUI might be a better solution than a GUI. While the vehicle is parked, the GUI could extend the features of the SUI. The different modalities can be easily integrated with our approach, and further interface elements (in case of new features, see [19]) can be easily added.

VII. FUTURE PLANS

The system is under development on Windows Mobile based devices (both Smartphones and PDAs). The GUI and the SUI parts are done; as a next step we would like to re-implement existing complex applications, which were developed earlier by our laboratory, to discover bugs and to demonstrate the effectiveness of our approach. After the technology is implemented on Windows Mobile we will port it to Symbian OS based devices. With this step, development for different types of mobile devices will become easier than ever before. The main challenges of porting to Symbian are the mapping of the user controls of the different platforms (see subsection V./E.) and the audio interface implementation. Furthermore, the authors would like to create a development environment for designing mobile GUIs and SUIs on desktop
computers. With the help of such an application, 3rd party developers would not have to create their UIs in text format, but could do it in a more effective, intuitive way.

VIII. SUMMARY

A possible approach to creating scalable, multimodal user interfaces for mobile devices was introduced. We focused on graphical and speech user interfaces, as these are the main modalities of PDAs and smartphones. We investigated the major problems and challenges and suggested solutions to them, which are realized in our system. Finally, we gave two example application scenarios and outlined our future plans. The greatest benefit of the technology is that mobile application development for different platforms becomes easier and faster with the XML based multimodal user interface description. The authors hope that in the future multimodal mobile interfaces will not only be a research topic but will also appear in off-the-shelf products.

ACKNOWLEDGEMENTS

The research presented in the paper was partly supported by the Hungarian National Office for Research and Technology (NAP project no. OMFB-00736/2005 and GVOP project no. 3.1.1-2004-05-0485/3.0).

REFERENCES
[1] R. Riggs, A. Taivalsaari, J. V. Peursem, J. Huopaniemi, M. Patel, A. Uotila: Programming Wireless Devices with the Java 2 Platform Micro Edition, Second Edition, Addison-Wesley, 2003, 434 p.
[2] Programming for the Series 60 Platform and Symbian OS, DIGIA Inc. (Editor), 2003, 521 p.
[3] J. Prosise: Programming Microsoft .NET, Microsoft Press, 2002, 773 p.
[4] M. J. Young: Mastering Visual C++ 6, Sybex Inc., 1998, 1397 p.
[5] C. Bloch, A. Wagner: MIDP 2.0 Style Guide for the Java 2 Platform, Micro Edition, Addison-Wesley, 2003, 262 p.
[6] Nuance - Nuance Talks, http://www.nuance.com/talks/
[7] JAWS for Windows, http://www.nanopac.com/jaws.htm
[8] Parmod, G.: MS SAPI 5 Developer's Guide, InSync Software Inc., 2001.
[9] G. Németh, G. Kiss, Cs. Zainkó, G. Olaszy, B. Tóth: Speech Generation in Mobile Phones, in D. Gardner-Bonneau, H. Blanchard (Eds.): Human Factors and Interactive Voice Response Systems, 2nd ed., Springer, forthcoming.
[10] Voice Extensible Markup Language (VoiceXML) Version 2.1, 2006. Available: http://www.w3.org/TR/voicexml21/
[11] Speech Synthesis Markup Language (SSML) Version 1.0, 2004. Available: http://www.w3.org/TR/speech-synthesis/
[12] XHTML+Voice Profile 1.0, 2001. Available: http://www.w3.org/TR/xhtml+voice/
[13] Mobile X+V 1.2, 2005. Available: http://www.voicexml.com/specs/multimodal/x+v/mobile/12/
[14] C. Hauck: Conception and Integration of Speech Services into an Ambient Intelligence Platform, Project Thesis, 1st edition, 2006, 55 p.
[15] A. Syrdal, R. Bennett, S. Greenspan: Applied Speech Technology, CRC Press, 1995, pp. 79-85.
[16] XSL Transformations (XSLT) Version 1.0, 1999. Available: http://www.w3.org/TR/xslt
[17] B. Tóth, G. Németh, G. Kiss: Mobile Devices Converted into a Speaking Communication Aid, Proc. of Computers Helping People with Special Needs, 9th ICCHP, 2004, Springer, pp. 1016-1023.
[18] B. Tóth, G. Németh: "VoxAid 2006: Telephone Communication for Hearing and/or Vocally Impaired People", Proc. of Computers Helping People with Special Needs, 10th ICCHP, July 2006, Linz, Austria, Springer, ISSN 0302-9743, pp. 651-658.
[19] G. Németh, G. Kiss, B. Tóth: "Cross Platform Solution of Communication and Voice / Graphical User Interface for Mobile Devices in Vehicles", Proc. of Biennial on DSP for in-Vehicle and Mobile Systems, 2-3 Sep. 2005, Sesimbra, Portugal.