Development of Speech Input Method for Interactive VoiceWeb Systems
Ryuichi Nisimura1, Jumpei Miyake2, Hideki Kawahara1, and Toshio Irino1
1 Wakayama University, 930 Sakaedani, Wakayama-shi, Wakayama, Japan
2 Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara, Japan
[email protected]
Abstract. We have developed a speech input method called “w3voice” to build practical and handy voice-enabled Web applications. It is constructed using a simple Java applet and CGI programs comprising free software. On our website (http://w3voice.jp/), we have released automatic speech recognition and spoken dialogue applications that are suitable for practical use. The mechanism of voice-based interaction is built on raw audio signal transmission via the HTTP POST method and the HTTP redirection response. The system also aims at organizing a voice database collected from home and office environments over the Internet. The purpose of the work is to observe actual human-machine and human-human voice interactions. We have succeeded in acquiring 8,412 inputs (47.9 inputs per day) captured by using normal PCs over a period of seven months. The experiments confirmed the user-friendliness of our system in human-machine dialogues with trial users.
Keywords: Voice-enabled Web, Spoken interface, Voice collection.
1 Introduction
We have developed a speech input method for interactive voice-enabled Web systems. The proposed system is constructed using a simple Java applet and CGI programs. It serves as a practical speech interface system comprising only free software.

The World Wide Web has become an increasingly popular system for finding information on the Internet. At present, nearly every computer-based communication service is developed on the Web framework. Further, Internet citizens have begun to benefit significantly from transmitting multimedia data over Web 2.0, which refers to second-generation Internet-based Web services such as video upload sites, social networking sites, and Wiki tools. To increase the use of speech interface technologies in everyday life, a speech input method for Web systems is necessary.

A large-scale voice database is indispensable for the study of voice interfaces. We have been collecting voices in cases such as simulated scenarios, WOZ (Wizard of Oz) dialogue tests[1], telephone-based systems[2,3], and public space tests[4]. However, the evaluation of voices captured by normal PCs is still insufficient to organize a voice database. To develop and utilize the speech interface further, it is important to know the recording conditions prevalent in home and office environments [5]. Further, it is also important to know, through field tests, the configurations of the PCs used for recording voices. The proposed system is designed to collect actual voices in homes and offices through the Internet.

J.A. Jacko (Ed.): Human-Computer Interaction, Part II, HCII 2009, LNCS 5611, pp. 710–719, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Related Works
Advanced Web systems that enable voice-based interactions have been proposed, such as the MIT WebGalaxy system[6][7]. Voice Extensible Markup Language (VoiceXML)1, which is the W3C2 standard XML format, can be used to develop a spoken dialogue interface with a voice browser[8]. SALT (Speech Application Language Tags) is another extension of HTML (Hypertext Markup Language) that adds a speech interface to Web systems[9]. These approaches require special add-on programs or customized Web browsers; therefore, users must preinstall a speech recognition and synthesis engine and prepare nonstandard programs for voice recording. Such complicated setups make it difficult for users to access voice-enabled websites.

Adobe Flash, which is a popular animation player for the Web environment, can be used for voice recording using generic Web browsers. However, a rich development tool that is not freely available is required to produce Flash objects containing a voice recording function. Moreover, the system might require a custom-built Web server to serve websites with a voice recording facility. Hence, developing a voice-enabled Web system with Flash is not necessarily an easy process.
3 Overview
Figure 1 shows a screen shot of the proposed Web system. The website features an online fruit shop. On this site, users can use our spoken dialogue interface. The Web browser displays normal HTML documents, Flash movies, and a recording panel. An HTML document that contains text, images, etc., provides instructions on how to use the website. One of the HTML documents also presents the results of an order placed by a user. The Flash movie provides an animated agent to realize significant visual interactions between the system and the user. The synthetic voice of the agent produced by dialogue processing is played through the embedded Flash movie.

Fig. 1. Screen shot of voice-enabled website (labeled parts: Flash Movie, Recording Panel)

1 http://www.voicexml.org/
2 http://www.w3.org/

Figure 2 shows a close-up of the recording panel in the Web interface. The recording panel is the most important component of our voice-enabled Web framework. It is a pure Java applet (using the Java Sound API of the standard Java APIs) that records the voice of a user and transmits it to a Web server.

Fig. 2. Recording panel (shown on Windows and Mac OS X): (1) On mouse, (2) Recording, (3) Data Transmitting and Processing

When the recording panel is activated by pressing the button, the voice of the user is recorded through the microphone of the client PC. After the button is released, the panel begins to upload the recorded signals to the Web server. The recording panel also provides visual feedback to the user, as shown in Figure 2. During voice recording, a bar of the panel acts as a power-level meter. Users can monitor their voice signal level by observing the red bar (Figure 2-2). The message “under processing now” is displayed on the bar while the recorded signals are transmitted to the Web server and processed in a dialogue sequence (Figure 2-3). Subsequently, the browser shows the processing result after the transmission is completed.

We designed the proposed w3voice system to actualize the following concepts:
1. Users can use our voice-enabled Web system with easy setup procedures, without preinstalling any add-on program on the client PC.
2. Web developers can develop their original voice-enabled Web pages by using our free development tool kits with existing technologies (CGI programs and standard Web protocols) and open-source software. We have released the development tool kits on our website, http://w3voice.jp/skeleton/.
3. Our system can run identically on all supported platforms, including all major operating systems such as Windows (XP and Vista), Mac OS X, and Linux. Users can use their preferred Web browsers such as Internet Explorer, Mozilla, Firefox, and Safari.
4. Our system can provide a practicable and flexible framework that enables a wide variety of speech applications.

To realize these concepts, we have adopted a server-side architecture in which the major processing sequences (speech recognition, speech synthesis, contents output) are performed on the Web server.
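The power-level meter shown on the recording panel can be illustrated with a short sketch. The paper does not describe how the applet computes the bar length, so the following Python code is only one plausible approach, under stated assumptions: the recording is 16-bit little-endian mono PCM, the level is the RMS value in dB relative to full scale, and the bar is scaled linearly over a hypothetical 60 dB range. The function names `level_db` and `meter_fraction` are illustrative and do not come from the w3voice tool kit.

```python
import math
import struct


def level_db(pcm_bytes: bytes) -> float:
    """Return the RMS level in dB relative to full scale.

    Assumes 16-bit little-endian mono PCM (an assumption, not the
    applet's documented format). Empty or silent input gives -inf.
    """
    n = len(pcm_bytes) // 2
    if n == 0:
        return float("-inf")
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    rms = math.sqrt(sum(s * s for s in samples) / n)
    if rms == 0:
        return float("-inf")
    return 20.0 * math.log10(rms / 32768.0)


def meter_fraction(db: float, floor_db: float = -60.0) -> float:
    """Map a dB level to a bar length between 0.0 and 1.0.

    floor_db is a hypothetical lower display limit; levels at or
    below it draw an empty bar, and 0 dB draws a full bar.
    """
    if db <= floor_db:
        return 0.0
    return min(1.0, (db - floor_db) / -floor_db)
```

A recording panel would call `level_db` on each short buffer read from the microphone and redraw the bar with `meter_fraction` of its width.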
4 Program Modules
Figure 3 shows an outline of the architecture of the w3voice system. The system consists of two program modules: a client PC module and a Web server module.
4.1 Client PC Module
As mentioned earlier, the client PC module is a Java applet that is launched automatically by HTML code as follows.

Fig. 3.
The applet runs on the Java VM of the Web browser; it records the user’s voice and transmits the signals to the Web server. The POST method of HTTP is used as the signal transmission protocol. The applet has been implemented such that it functions as an upload routine similar to a Web browser program. Thus, our system can operate on broadband networks developed for Web browsing because the POST method is a standard protocol used to upload images and movie files. In this HTML code, the website developer can specify the sampling rate for recording voice by assigning a value to the parameter “SamplingRate.” The URL to which the signals are uploaded must be defined in “UploadURL.”

4.2 Web Server Module
We have designed a server-side architecture in which the processing sequence of the system is performed on the Web server. As shown in Figure 4, the CGI program forked from a Web server program (httpd) comprises two parts. One part receives the signals from the applet and stores them in a file. The other part consists of the main program that processes the signals, recognizes voices, and produces outputs.

Fig. 4. CGI processes (on the Web server: upload.cgi (Receiver) and shop.cgi (Main), linked by an HTTP redirect with the data ID)

The recorded signals transmitted by the applet are captured in “upload.cgi,” which is assigned in UploadURL (Section 4.1). It creates a unique ID number for the stored data. Then, the main program, e.g., “shop.cgi” in the case of the online fruit shop, is called by the Web browser. The address (URL) of the main program is provided to the browser through the HTTP redirect response output by upload.cgi as follows.

HTTP/1.1 302
Date: Sun, 18 Mar 2007 13:15:42 GMT
Server: Apache/1.3.33 (Debian GNU/Linux)
Location: http://w3voice.jp/shop/shop.cgi?q=ID
Connection: close
Content-Type: application/x-perl
The “Location:” field contains the URL of the main program, which identifies the stored data by its ID number. Thus, our system does not require any special protocol to realize a practical voice-enabled Web interface. The CGI programs can be written in any programming language such as C/C++, Perl, PHP, Ruby, and Shell script.
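The receive, store, and redirect sequence described above can be sketched in a few lines. The following Python code is not the authors' upload.cgi; it is a minimal illustration under stated assumptions: the POST body is treated as opaque raw audio bytes, the unique ID is a random UUID, the data are stored as a hypothetical `<ID>.raw` file, and the redirect is emitted with the standard CGI `Status:` and `Location:` headers. The function names `store_upload`, `redirect_response`, and `load_upload` are invented for this example; only the query parameter name `q` and the example URL come from the response shown in Section 4.2.

```python
import os
import tempfile
import uuid
from urllib.parse import urlencode

HEX_DIGITS = "0123456789abcdef"


def store_upload(body: bytes, store_dir: str) -> str:
    """Receiver part: save the raw POST body and return a unique data ID."""
    data_id = uuid.uuid4().hex  # assumption: any collision-free ID scheme works
    with open(os.path.join(store_dir, data_id + ".raw"), "wb") as f:
        f.write(body)
    return data_id


def redirect_response(main_url: str, data_id: str) -> str:
    """Build CGI output asking the Web server to send a 302 redirect."""
    location = main_url + "?" + urlencode({"q": data_id})
    return "Status: 302 Found\r\nLocation: " + location + "\r\n\r\n"


def load_upload(data_id: str, store_dir: str) -> bytes:
    """Main part: fetch the stored signal named by the q parameter."""
    # Reject anything that is not a plain hex ID before touching the disk.
    if not data_id or any(c not in HEX_DIGITS for c in data_id):
        raise ValueError("invalid data ID")
    with open(os.path.join(store_dir, data_id + ".raw"), "rb") as f:
        return f.read()


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    new_id = store_upload(b"\x00\x01" * 8000, store)
    print(redirect_response("http://w3voice.jp/shop/shop.cgi", new_id), end="")
    print(len(load_upload(new_id, store)), "bytes available to the main program")
```

Because the browser only follows an ordinary redirect, the main program can be swapped for any other CGI program without changing the recording applet.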
5 Applications
Original and novel voice-enabled Web applications using the w3voice system have been released. In this section, we briefly present these applications. They are readily available at the following website3:

http://w3voice.jp/

5.1 Speech Recognition and Dialogue (Figure 5)
As shown in Section 3 (online fruit shop), the w3voice system can provide speech recognition and dialogue services.

Web Julius is a Web-based front-end interface for Julius4, which is an open source speech recognition engine[10]. This application realizes a quick-access interface for a dictation system without requiring the installation of any speech recognition programs on the client PC. The HTML text shows the recognized results output by Julius for an input voice. Users can quickly change the decoding parameters of N-best and a beam search through a Web-based interface. We have prepared many built-in language and acoustic models to adapt to the linguistic and acoustic parameters of the input voice.

Fig. 5. Left: Speech Recognizer Web Julius. Right: Web Takemaru-kun, Japanese spoken dialogue system.

The Web Takemaru-kun system is our Japanese spoken dialogue system, which has been developed on the basis of an existing system[4, 11] with the w3voice system. The original Takemaru-kun system has been permanently installed in the entrance hall of the Ikoma Community Center in order to provide visitors with information on the center and Ikoma City. The authors have been examining the spoken dialogue interface with a friendly animated agent through a long-term field test that commenced in November 2002. Our analysis of user voices recorded during this test shows that many citizens approve of the Takemaru-kun agent. The w3voice system allows users to readily interact with the Takemaru-kun agent through the Internet and also makes the interaction a pleasant experience.
3 It should be noted that the content on our website is currently in Japanese because our speech recognition program can accept inputs only in Japanese. However, the proposed framework of the w3voice system does not depend on the language of the speaker.
4 http://julius.sourceforge.jp/
5.2 Speech Analysis and Synthesis
Because the w3voice system processes raw audio signals transmitted by the client PC, any signal post-processing technique can be adopted.

The spectrogram analyzer (Figure 6) is a simple signal analyzer that shows a sound spectrogram produced by MATLAB, which is a numerical computing environment distributed by The MathWorks5. The analyzer is beneficial to students learning about signal processing, although this content is very simple.

Fig. 6. Web-based sound spectrogram analyzer

w3voice.jp provides a voice-changer application to ensure user privacy (Figure 7). An uploaded signal is decomposed into source information and resonator information by STRAIGHT[12]. In addition, the users can also reset the vocal parameters. By applying dynamic Web programming techniques using Ajax (Asynchronous JavaScript and XML), a synthesized voice is reproduced by a single click in real time. The vocal parameters are preserved as HTTP cookies by the Web browser, and they are referred to by the voice conversion routines of other applications.

Fig. 7. Screen shot of the STRAIGHT voice changer. A synthesized voice is played by clicking a point on the grid. The user can control the pitch stretch of the changed voice with the x-coordinate of the mouse cursor. The spectrum stretch is represented by the y-coordinate.

5.3 Web Communication
Our system can be used to build new Web-based communication tools with which uploaded voices can be shared by many users. As an example, we have introduced a speech-oriented interface onto a Web-based online forum system6. Anchor tags linking to transmitted voice files are provided by the forum system to users who enjoy chatting using natural voices. A conventional online forum system allows users to use text-based information to communicate with each other. The source code of the voice-enabled forum system, which consists of only 235 lines of Perl script, demonstrates the ease with which voice applications based on the w3voice system can be developed.

Voice Photo7 can be used to create speaking photo albums on the Web. It is a free service that integrates uploaded voice recordings with JPEG files and generates an object file in Flash. When the user moves the mouse cursor over the Flash file, the embedded voice is played. The file generated by this service can be downloaded as a simple Flash file. Thus, webmasters can easily place it on their blogs or homepages.

5 http://www.mathworks.com/
6 http://w3voice.jp/keijiban/
6 Experiments
6.1 Data Collection
The operation of “w3voice.jp” began on March 9, 2007. Internet users can browse to our site and try out our voice-enabled Web at any time and from anywhere. All voices transmitted from the client PC are stored on our Web server. We are investigating access logs containing recorded user voices and their PC configurations.

Fig. 8. Number of inputs per day without zero length inputs (March 9, 2007 – September 24). Axes: # of inputs (0 to 1200) versus month (3/1 to 9/1). Annotated news releases: April 18, 2007 and July 9, 2007 (labels: http://w3voice.jp/, w3voiceIM.js).
[Figure plot: y-axis “# of inputs” (0 to 500), x-axis ticks 0 to 15]