Voice-commanded Scripting Language for Programming Navigation Strategies On-the-fly

Michael Nichols, Gopal Gupta, Qian Wang
Department of Computer Science, University of Texas at Dallas
Abstract. We present a voice-based scripting language called ALVIN (Aural Language for VoiceXML Interpretation and Navigation) that allows users to define navigation strategies on-the-fly while browsing VoiceXML pages. The language is intended to be completely voice/audio-based, so that it can be used with voice/audio-only communication devices, such as telephones (land-line or wireless). This paper discusses the various challenges that must be overcome in designing a voice/audio-based language, and describes how these challenges are addressed in the design of ALVIN. In addition, this paper discusses a model implementation of ALVIN that leverages the existing capabilities of VoiceXML to provide a readily deployable solution. The language design is, strictly speaking, a special-purpose programming language for navigating VoiceXML pages, but the language includes features that would be representative of a proposed class of spoken general-purpose programming languages. A prototype implementation of the language is in progress.
1 Introduction
The phenomenal success of the World-Wide Web (WWW) demonstrates, in a profound way, the ability of computer technology to deliver a cornucopia of diverse services and information with unprecedented speed and breadth of content. As users rely more on Web-based services and information sources, the technology is moving in a natural direction toward increased user mobility and ease of access. Traditional visual-interface computing devices place significant constraints on portability and accessibility by requiring the user to possess and see a viewing screen. Voice-based computing has the potential to overcome these constraints by making it unnecessary for a user to view a screen or to occupy his or her hands with a keyboard or stylus. Moreover, a voice-based interface makes it possible for a user to communicate with a computing device via (mobile) telephone, thus obviating the need for the user to actually have the computing device in his or her possession.

Recent technology standards, such as VoiceXML, are allowing voice-based applications to be developed on a large scale. VoiceXML is a standard mark-up language created by the VoiceXML Forum and the W3C to make Internet content and information accessible via voice and audio. VoiceXML documents can be browsed aurally on a desktop computer with a microphone and speakers, or over the phone. Just as a web browser renders HTML documents visually, a VoiceXML interpreter (or voice browser) renders VoiceXML documents aurally.

Voice-based computing, however, is not without its limitations. For one thing, speech-recognition technology is imperfect, although rapid advances in speech recognition and signal processing are significantly reducing this limitation. A greater challenge is the serial nature of spoken communication. While a user is free to visually scan a web page displayed on a screen, a user must listen to the text of a voice-based page as it is spoken in sequence, which can be very inefficient, especially if the page was originally intended to be read visually.
One way of attacking this problem is to associate a navigation strategy with a particular page, so that only certain features of the page are recited to the user, and in a particular order selected by the user. We propose the use of a scripting language that allows a user to dynamically create such navigation strategies for VoiceXML pages while aurally browsing them. Such a language should itself be spoken by the user, so as to allow for completely voice-based operation. This paper provides a design for such a language (called ALVIN) and discusses the various challenges overcome by the design.

The idea of using a scripting language for dynamically programming navigation strategies rests on the notion of voice anchors [1]. Voice anchors allow listeners to attach speech labels to dialogs of a VoiceXML document. When such a label is later uttered by the user (during the same browsing session), browsing returns to the dialog to which that speech label is attached. Thus, voice anchors allow a user to move around the various parts of the document via simple voice utterances and voice commands. One can be more ambitious and design a navigation language that specifies in advance the order in which the various dialogs labeled with voice anchors should be visited and heard. This is precisely the motivation behind the design of ALVIN. This paper also discusses a model implementation of the ALVIN language using voice anchors that leverages the existing capabilities of VoiceXML to provide a readily deployable solution. The language design is, strictly speaking, a special-purpose programming language for navigating VoiceXML pages, but the language includes features that would be representative of a proposed class of spoken general-purpose programming languages.

The research described in this paper makes the following contributions: (i) it presents a viable way of navigating complex aural documents in a meaningful and structured way by using spoken scripting languages; (ii) it presents a concrete spoken language, and an outline of its implementation, for aural documents written in VoiceXML. To the best of our knowledge ours is the first effort in this area, since most spoken scripting languages are limited to making selections from a menu of items that is read out at various points in the aural document. In contrast, our spoken scripting language allows complex navigation strategies to be programmed orally.
2 Language Design
Designing a computer language that is intended only to be spoken poses a number of unusual problems that must be addressed in order to arrive at a workable language. Many of the assumptions about the programmer that are implicitly made in the design of written computer languages do not apply. Moreover, many arguably desirable aspects of written computer languages are either less important or an actual hindrance to the design of a spoken computer language. These problems, and our proposed solutions to them, as exhibited by our model spoken scripting language, are addressed in the following sections. A partial grammar of a scripting language incorporating our ideas, which we call ALVIN (Aural Language for VoiceXML Interpretation and Navigation), is depicted in Fig. 1. Example sessions illustrating the use of the language are provided in Figs. 2 and 3.
2.1 Overall Design Philosophy
Our language design is based on three fundamental principles or goals that we feel can significantly enhance the usability of a spoken scripting language. The first is that the ideal spoken scripting language should be as close to natural language as possible. Since we anticipate that spoken scripting languages will be used by lay end-users as much as by trained programmers, this is an important design goal. The second principle is related to the first, but perhaps more controversial: the ideal spoken scripting language should give its users more ways to write correct code than to generate errors. We expect some variation in users' code from user to user and from time to time, particularly in the case of spoken code, which is likely to be more "spontaneous" than written code. We therefore think it is important that the scripting language, or its interpreter/compiler, be as forgiving as possible. The third principle is related to the second: all reasonable efforts should be made to prevent the user/programmer from having to revisit and/or edit existing code. The rationale behind this is twofold. First, the kind of "full text" editing most programmers are used to can only be partially emulated over a voice interface, and not very efficiently at that. Second, revisiting existing code over audio is undesirable because it requires the programmer to commit more information to memory.

2.2 Punctuation/Delimiters and General Syntax
Written programming languages rely heavily on delimiters to separate statements and blocks of code, such as semicolons (as in C, Pascal, and other Algol-family languages), line breaks (as in Fortran), parentheses (as in Lisp), periods and commas (as in Prolog), and whitespace (as in Python). It is very awkward, however, to attempt to reproduce these types of punctuation in speech.¹ Our language design therefore uses a consistent command syntax (or "sentence structure") that allows individual statements to be distinguished on the basis of their content, rather than on additional punctuation. Since each command in our language is structured in the form of an imperative sentence, each command begins with a verb. As in natural language, each verb may be transitive or intransitive. If a verb is transitive, its direct object (usually either some kind of identifier or a value) immediately follows the verb. Other information needed for performing the command can be supplied by a set of zero or more modifiers, which are generally expressed as prepositional phrases (e.g., "at anchor foo," "as type number," etc.). The modifiers in a given command may be supplied in any order. We feel that this grammatical structure is natural to the user and highly expressive, yet regular enough to be readily parsed by a computer. A brief inspection of the grammar in Fig. 1 reveals that this structure is a fair approximation of the way commands are typically spoken in English.

¹ Unless you are Victor Borge [2], that is.

Although it was not apparent to us at the time, our grammar is structurally very similar to that of the Japanese language, where the verb has a fixed position in the sentence and the other sentence elements may be arranged more flexibly, because they are accompanied by "particles" that denote the roles the various words play in the sentence. The most visible difference between our basic grammatical design and that of Japanese is that Japanese prefers postfix expressions over our more English-like prefix expressions: the verb in a Japanese sentence goes at the end, and particles follow the words they are associated with.²

² For example, in the Japanese sentence "Watashi-wa Tanaka-san-no bengoshi desu," the particle "wa" placed after watashi (I) indicates that watashi (I) is the subject of the sentence. The "no" following Tanaka-san is a possessive particle (analogous to an apostrophe-s in English). The final two words, bengoshi (attorney) and desu (to be), are the direct object and verb, respectively. Thus, the complete sentence translates as "I am Mr. Tanaka's attorney."

The language employs a number of strategies to allow a user some degree of license in using varying forms of the same commands. For example, the commands "skip," "jump," and "go" are synonyms and can be used interchangeably to move from location to location within a VoiceXML document. Also, any of a number of "filler words," including the articles "the," "a," and "an," may be inserted anywhere within a command and will be ignored. This allows a user to speak more naturally to the computer, with no loss of meaning.

As shown in Fig. 1, the two I/O commands, "ask" and "say," have grammar rules that end in periods (full stops). This denotes that the two commands record audio clips at compile-time for later playback at run-time. The programmer signifies completion of the recording by pressing a key on his or her touch-tone phone or by waiting for a time-out period to expire.
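To make this verb-first, order-free-modifier structure concrete, the following Prolog DCG fragment is a minimal sketch, not the actual ALVIN grammar, of how a tokenized command such as "set the anchor called foo at the next sentence" could be parsed, with filler words skipped and modifiers accepted in any order. All predicate and token names here are our own illustrative assumptions.

    % Minimal, illustrative sketch of ALVIN's verb-first command structure.
    % Predicate names are assumptions, not the implemented ALVIN grammar.
    :- use_module(library(lists)).

    command(cmd(set_anchor, Mods)) --> verb(set_anchor), modifiers(Mods).

    verb(set_anchor) --> fillers, [set], fillers, [anchor].

    % Zero or more modifiers, each filling a named slot, in any order.
    modifiers([M|Ms]) --> modifier(M), modifiers(Ms).
    modifiers([])     --> [].

    modifier(name(N))     --> fillers, ([called] ; [named]), fillers, [N].
    modifier(location(L)) --> fillers, [at], fillers, location(L).

    location(next_sentence) --> [next], [sentence].

    % Filler words ("the", "a", "an") may appear anywhere and are ignored.
    fillers --> [F], { member(F, [the, a, an]) }, fillers.
    fillers --> [].

With this sketch, the query phrase(command(C), [set, the, anchor, called, foo, at, the, next, sentence]) and its modifier-reversed variant both parse successfully and produce the same set of slot fillers, which is exactly the flexibility the design calls for.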
2.3 Tolerated Ambiguity and Interactive Interpretation/Compilation
One of the major differences between natural languages and typical computer languages is that the latter are generally designed to be inherently unambiguous. While inherently unambiguous computer languages are certainly far from being an obsolete concept, it should be recognized that they are a product of the old batch-processing model of computation and are not strictly required for interactive computing. In a batch process, the computer's instructions must be entirely specified in advance; thus, no ambiguity in program code can be tolerated, since unpredictable (or, at least, undesirable) results could follow. With interactive computing, however, the computer can simply ask the user for additional information needed to solve the problem at hand. Our language model therefore assumes that an interactive interpreter or compiler will be used to interpret or compile the language. Hence, if an ambiguity arises (either because of difficulty in speech recognition or because of an ambiguous grammar), the interactive interpreter/compiler immediately notifies the user/programmer and requests the information needed to resolve the ambiguity. This approach to interaction, where both sides take turns initiating actions, is known as "mixed-initiative interaction." Ambiguity detection may be elegantly implemented in a logic language such as Prolog [3], which supports non-deterministic parsing through the facility of definite clause grammars (DCGs). If the DCG finds two or more plausible interpretations for a given command, the interpreter/compiler asks the user for some distinguishing piece of information that allows it to resolve the ambiguity.
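As a rough illustration of this mixed-initiative behavior (the predicate names are assumptions, and command//1 refers to the DCG sketch in the previous section, not the real interpreter), the parser can collect every reading of a token list and fall back to a clarification step only when more than one reading survives:

    % Sketch: interpret a tokenized command, detecting ambiguity via findall/3
    % over all DCG parses.  Names are illustrative assumptions.
    interpret_tokens(Tokens, Meaning) :-
        findall(M, phrase(command(M), Tokens), Parses),
        resolve(Parses, Meaning).

    resolve([Meaning], Meaning).                    % exactly one parse: accept it
    resolve([P1, P2|Rest], Meaning) :-              % two or more parses: ask the user
        ask_user_to_choose([P1, P2|Rest], Meaning).
    resolve([], _) :-                               % no parse: report and fail
        format("I did not understand that command.~n"),
        fail.

    % Stand-in for a spoken clarification dialog rendered through VoiceXML.
    ask_user_to_choose(Parses, Meaning) :-
        length(Parses, N),
        format("I heard ~w possible readings; please say it another way.~n", [N]),
        Parses = [Meaning|_].                       % placeholder: keep the first reading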
2.4 Modifiers and Slots
Modifiers are used to assign values to "slots," which are, essentially, attributes of a given command. Certain recurring types of slots are "locative," "nominative," and "type" slots. Locative slots refer to a location, in this case a location within the VoiceXML file being browsed (e.g., "next to last sentence"). Nominative slots refer to a name given to the direct object of the command (e.g., in the command "Store 5 as FOO," the nominative slot is filled with the value "FOO," representing a variable name or label). Some slots are required to be filled in order for the command to make sense; those slots are labeled with an overbar in Fig. 1. If the programmer/user completes the command without filling in all of the required slots (by stating the appropriate modifiers), the interpreter/compiler interrupts the programmer/user and asks a question to resolve the issue. For example, the interpreter or compiler might ask, "What name was 5 supposed to be stored as?" At that time, the programmer/user could simply supply the requested answer, "foo," then resume programming.
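The required-slot check could be sketched as follows; the slot table (required_slots/2), the modifier terms, and ask_for_slot/2 are all illustrative assumptions rather than the actual ALVIN interpreter. After parsing, the interpreter compares the modifiers supplied with the slots the command requires and asks one question per missing required slot.

    % Sketch: interactive completion of required slots.  Names are assumptions.
    :- use_module(library(lists)).

    required_slots(store,      [value, name]).     % e.g., "store 5 as foo"
    required_slots(set_anchor, [name]).            % e.g., "set anchor called foo"

    complete_command(cmd(Verb, Mods0), cmd(Verb, Mods)) :-
        required_slots(Verb, Required),
        fill_missing(Required, Mods0, Mods).

    fill_missing([], Mods, Mods).
    fill_missing([Slot|Rest], Mods0, Mods) :-
        (   member(M, Mods0), functor(M, Slot, 1)  % slot already filled by a modifier
        ->  Mods1 = Mods0
        ;   ask_for_slot(Slot, Value),             % interrupt the user with one question
            Filled =.. [Slot, Value],
            Mods1 = [Filled|Mods0]
        ),
        fill_missing(Rest, Mods1, Mods).

    % Stand-in for a spoken prompt such as "What name was 5 supposed to be stored as?"
    ask_for_slot(Slot, Value) :-
        format("Please supply the ~w for this command.~n", [Slot]),
        read(Value).                               % a deployed system would listen instead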
2.5 Serial Programming
Arguably, the primary obstacle faced by the designer of a spoken computer language is the strictly serial nature of voice-based I/O. Because only one language element or word may be spoken at any one time, there is no way for a programmer to refer back to earlier parts of the program, even recent code, while programming. The impact this has on the programmer should not be underestimated. Many language constructs in written computer languages implicitly assume that the programmer will have visual access to the source code during the programming process. Any form of nested or hierarchical language structure makes this implicit assumption; parentheses in the LISP programming language are a prime example. Without ready visual confirmation of previously entered lines of code, it is difficult (and in some cases practically impossible) to determine where one part of a routine ends and another begins or resumes.

The problem with the traditional approach taken in written computer languages is that it puts too many demands on the programmer's memory [4]. Programmers are generally not consciously aware of the degree to which they rely on the information that a screen display provides. Because it provides constant, immediately perceptible feedback about the context in which a given line of code is being written, the screen display frees the programmer from having to keep careful track of the precise point in the program logic at which he or she is writing code. Visual cues, such as brackets, delimiters, and whitespace/indenting, all aid the programmer in determining this, and make it possible for a programmer to interrupt a central sequence of instructions (and train of thought) to address some auxiliary sequence of instructions before returning to the main sequence. For example, the C-like loop "do { instruction 1; . . . instruction n; } while (condition);" involves two such sequences. The primary sequence is the one in which the do-while loop instruction resides; the secondary sequence is the loop body itself. To program this loop in a purely serial fashion using C would require that the programmer first begin the do-loop instruction, temporarily suspend his or her train of thought about the loop condition in order to program the loop body, then return to the loop's condition for iteration. As a program becomes more complicated, with many such structures nested within one another, it is not difficult to see the confusion that can arise when a programmer must determine where his or her original train of thought left off.

Moreover, the traditionally favored top-down style of design is difficult to achieve using traditional structured written programming languages, such as C. When employing top-down design techniques, a programmer may write a conditional statement (an "if") and leave a placeholder for the code conditionally executed in response to it. At a later time, the programmer comes back to the placeholder and replaces it with the conditionally executed code. This scheme, of course, anticipates that the programmer will be able to return to the placeholder in a random-access fashion, delete the now-unneeded placeholder, and insert new code at that location. Thus, traditional top-down design using structured programming languages naturally assumes that the programmer will be able to view and edit his or her code in a more-or-less random-access fashion. In the rigidly sequential world of audio-based programming, however, this kind of random-access programming is not practical.

What the above discussion reveals, however, is that traditional structured programming languages are not really "top-down" in the sense that the language expresses programming concepts in a top-down fashion. What traditional structured languages express is not "top-down" code but, rather, a depth-first traversal of a top-down design. True top-down programming would be more akin to a breadth-first search. In a breadth-first search of a graph, one traverses all of the nodes at a particular level before proceeding to the next level; there is no need to return to a higher level before continuing the search, since all of the higher-level nodes have already been visited. Thus, a programmer writing code in a breadth-first manner need not revisit previous levels either, and has less contextual information to remember.

Our language design allows for breadth-first, or top-down, writing of code by encouraging the use of forward references. For example, in our spoken language, one might write the previous do-while loop using a forward reference in this form: "do FOO while condition where FOO is instruction1 instruction2 . . . end." This encourages top-down design by allowing the programmer to "write" code in top-down order without having to revisit previously defined pieces of code.

Because users are allowed to create their own labels, such as "FOO," for forward references to code blocks, the language must provide some mechanism for defining these labels. Where such labels may be defined in our grammar (Fig. 1), we have enclosed the labels in boxes. This means that the user may either define the label name at that point or simply recite the label name itself (if the label has already been defined). To define a new label, the user says "new label," spells the name letter-by-letter (preferably using a standard phonetic spelling alphabet, such as the NATO alphabet), then completes the new label by saying "end label" (alternatively, the user could simply begin spelling the label using the phonetic alphabet, e.g., "foxtrot, oscar, oscar, bravo, alpha, romeo" for "foobar"). After having spelled the new label, the user should be able to simply recite the word itself, without having to spell it again [1].³
³ The initial spelling of the label is necessary because standard VoiceXML does not provide a convenient way to recognize words that it does not expect to hear (i.e., words that are not part of the grammar specified by the writer of the VoiceXML page); this is one of the limitations of current state-of-the-art voice-recognition software. Once a word has been spelled, however, it may be incorporated into a VoiceXML grammar so that the VoiceXML browser can recognize it.
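As a small, hedged sketch of the spelling step (the table below covers only the letters used in the example, and the predicate names are ours), the spoken phonetic-alphabet words can be mapped to letters and assembled into a label atom:

    % Sketch: turn a spelled sequence of phonetic-alphabet words into a label,
    % e.g., [foxtrot, oscar, oscar] becomes the atom foo.  Names are assumptions.
    phonetic(alpha,   a).
    phonetic(bravo,   b).
    phonetic(foxtrot, f).
    phonetic(oscar,   o).
    phonetic(romeo,   r).
    % ... remaining letters of the alphabet would be listed here

    spelled_label(Words, Label) :-
        maplist(phonetic, Words, Letters),
        atom_chars(Label, Letters).

    % ?- spelled_label([foxtrot, oscar, oscar], L).
    % L = foo.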
2.6 Instructions with Compile-time Semantics
Three commands, "list," "current label," and "pending labels," have "compile-time semantics." Compile-time semantics is a concept borrowed from the Forth language, in which certain commands, when entered in the course of writing compiled code, are not compiled into the program but execute immediately and return a result. These three instructions perform tasks that aid the programmer and cannot be compiled into a program. The list command has the interpreter/compiler recite the source code of a particular named block of code. The current label command recites the name of the block currently being written by the programmer. The pending labels command recites the names of the code blocks that have been used in a forward reference but have not yet been defined. This helps facilitate top-down programming by reminding the programmer of the code blocks that still need to be defined.
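A minimal sketch of the bookkeeping that could sit behind "current label" and "pending labels" (the dynamic predicates and their names are our assumptions): the interpreter records which blocks have been referenced and which have been defined, and reports the difference.

    % Sketch: compile-time bookkeeping for forward references.  Names are assumptions.
    :- dynamic defined_block/1, referenced_block/1, current_block/1.

    % Called when the user forward-references a block (e.g., "do FOO while ...").
    note_reference(Name) :-
        ( referenced_block(Name) -> true ; assertz(referenced_block(Name)) ).

    % Called when the user starts defining a block ("define FOO as ...").
    begin_block(Name) :-
        retractall(current_block(_)),
        assertz(current_block(Name)).

    % Called when the user says "end" for the block being defined.
    end_block :-
        current_block(Name),
        assertz(defined_block(Name)),
        retractall(current_block(_)).

    % "pending labels": blocks referenced but not yet defined.
    pending_labels(Pending) :-
        findall(N, (referenced_block(N), \+ defined_block(N)), Pending).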
2.7 Arithmetic Instructions Based on "Current Value"
Arithmetic and memory operations are performed in our language with reference to a "current value," which may be thought of as a kind of accumulator register, as was common in early computer hardware designs. A "take" instruction loads an immediate or stored value as the current value. Each arithmetic command performs an operation on the current value and some other immediate or stored value, and stores the result as the new current value. This strategy, although admittedly primitive when applied to written arithmetic expressions, is more appropriate for spoken code, because it eliminates the complications associated with reciting algebraic expressions verbally. The current value is not limited to numerical values; it may also assume values of other types, such as audio clips. In this way, our current-value concept is not unlike the "$_" variable in Perl, which is used to perform various operations (generally for string pattern-matching purposes) without specifying an explicit variable name.
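The current-value model can be illustrated with a small evaluator; eval/3 and the command terms below are our own sketch, not the ALVIN implementation, and only numeric values are handled.

    % Sketch: thread the "current value" (an accumulator) through spoken commands.
    eval([], V, V).
    eval([take(X)     | Cs], _,  V) :- eval(Cs, X, V).          % load a new current value
    eval([add(X)      | Cs], V0, V) :- V1 is V0 + X, eval(Cs, V1, V).
    eval([subtract(X) | Cs], V0, V) :- V1 is V0 - X, eval(Cs, V1, V).
    eval([multiply(X) | Cs], V0, V) :- V1 is V0 * X, eval(Cs, V1, V).
    eval([divide(X)   | Cs], V0, V) :- V1 is V0 / X, eval(Cs, V1, V).

    % "Take five.  Add three.  Multiply by two."
    % ?- eval([take(5), add(3), multiply(2)], 0, V).
    % V = 16.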
3 Model Implementation
In our model implementation of the ALVIN language, we have chosen to use a Prolog-based CGI (Common Gateway Interface) program to serve VoiceXML pages augmented with additional tags and grammar information that allow a user to speak ALVIN commands through a VoiceXML browser. The grammar included in the augmented VoiceXML pages is sufficiently detailed to enable the VoiceXML browser to recognize and tokenize the user's speech. Once a complete ALVIN command has been tokenized by the VoiceXML browser, the tokenized command is submitted to the Prolog-based CGI program, which may reside behind a standard HTTP (HyperText Transfer Protocol) server. The CGI program interprets the command and returns a result to the VoiceXML browser in the form of an ALVIN-augmented VoiceXML page. As an additional optimization, the ALVIN-augmented VoiceXML page may include JavaScript or other client-side code that can offload some of the work of the CGI program (for simple commands) to the VoiceXML browser. The system may be combined with an HTML-to-VoiceXML transcoder for sophisticated voice/audio-based web browsing, as in [5].
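The round trip between the VoiceXML browser and the CGI program might look roughly like the sketch below (SWI-Prolog syntax; interpret_tokens/2 is the ambiguity-handling sketch from Section 2.3, and the predicate names and reply format are assumptions, not the actual implementation): the tokenized command is interpreted and the reply is written out as a small VoiceXML document.

    % Sketch of the CGI reply path: interpret a tokenized ALVIN command and
    % answer with a minimal VoiceXML page.  Names and format are assumptions.
    handle_request(Tokens) :-
        (   interpret_tokens(Tokens, Meaning)
        ->  format(atom(Reply), "Okay: ~w", [Meaning])
        ;   Reply = 'Sorry, I did not understand that command.'
        ),
        emit_voicexml(Reply).

    % Emit a minimal VoiceXML document on standard output, as a CGI program would.
    % A real ALVIN-augmented page would also embed the grammar for the next command.
    emit_voicexml(Text) :-
        format("Content-type: application/voicexml+xml~n~n"),
        format("<?xml version='1.0'?>~n"),
        format("<vxml version='2.0'><form><block><prompt>~w</prompt></block></form></vxml>~n",
               [Text]).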
Command → stop                                               /* interrupt/stop reading VoiceXML page */
Command → continue                                           /* resume reading VoiceXML page */
Command → load Name                                          /* load a pre-defined set of code blocks */
Command → save Name                                          /* save the currently defined set of code blocks to disk */
Command → echo Name                                          /* play back last command to user */
Command → current block                                      /* identify current code block being defined */
Command → pending blocks                                     /* identify undefined code blocks */
Command → list Name                                          /* recite code listing of named code block */
Command → call Name                                          /* call named code block */
Command → set anchor Name                                    /* set a voice anchor */
Command → set anchor (here|at Location) (called|named) Name  /* set a voice anchor (alt. version) */
Command → set anchor (called|named) Name (here|at Location)  /* set a voice anchor (alt. version) */
Command → (go|skip|jump) to? Location                        /* jump to location in VoiceXML document */
Command → sidetrack to? Location                             /* jump to location in VoiceXML document with return */
Command → repeat Command Times                               /* loop */
Command → ask (as|for) Type Question.                        /* prompt user for input */
Command → say Statement.                                     /* say statement */
Command → recite Value                                       /* recite value to user */
Command → (take|result is) Value                             /* get a new "current value" */
Command → add Value                                          /* add value to "current value" */
Command → subtract Value                                     /* subtract value from "current value" */
Command → multiply (by)? Value                               /* multiply "current value" by value */
Command → divide (by)? Value                                 /* divide "current value" by value */
Command → store (Value|result|it|that)? (as|at|to|in) Name   /* store current value as a variable */
Command → where Name is? (defined (as|by)?)?                 /* define new named code block */
Command → define Name as                                     /* define new named code block */
Command → end                                                /* end compilation of a code block */
Command → if Condition then? Command                         /* execute Command if Condition satisfied */
Command → unless Condition then? Command                     /* execute Command unless Condition satisfied */
Location → anchor Name
Location → (next to|Ordinal (to|from))? (Ordinal|next|last) LUnit
LUnit → (sentence|paragraph|section) Desc?
Desc → (after|before) Location
Desc → (containing|about) Keyword
Type → number|clip|script                                    /* types include numbers, audio clips, and scripts */
Condition → C2 (or Condition)?
C2 → not? C3 (and C2)?
C3 → Value Relation Value
Relation → equals|is? R2|doesn't equal|does not equal R2
R2 → not? equal to
R2 → (no|not)? (greater|less) than (or equal to)?
Event → Condition
Event → user (says|utters) Statement
Times → Number times
Times → indefinitely|forever
Times → while Condition
Times → until Event
Value → Number
Value → Name
Fig. 1. Partially-defined grammar containing core rules for the ALVIN voice-based scripting language
COMPUTER (reading VoiceXML page): Consequently, the reason that the . . .
USER: Stop
COMPUTER: Okay
USER: Set an anchor called foxtrot oscar oscar at the next sentence
COMPUTER: anchor foo set
USER: echo
COMPUTER: Set anchor named foo at next sentence

Fig. 2. Example session

USER: Define bravo alpha romeo as Ask for a number "Give me a number"
(USER pauses)
COMPUTER: Go on
USER: Store it as x-ray. Store one in foxtrot. Repeat the following until x-ray equals one. Take foxtrot. Multiply by x-ray. Store the result in foxtrot. Take x-ray. Subtract one. Store that in x-ray. End.
COMPUTER: Okay
USER: Current block?
COMPUTER: bar
USER: The result is foxtrot.
COMPUTER: Okay
USER: End.
COMPUTER: Bar defined.
Fig. 3. A more elaborate session
4 Related Work
Voice-based programming is a relatively new field, largely because voice-recognition technology with sufficient accuracy to permit voice-based programming has only become available in recent years. Arnold et al. of Drexel University and Georgia Tech proposed a system for programming by voice to allow computer users to avoid using a keyboard (in the event of repetitive stress injuries and the like) [6]. Unlike our research, which addresses programming in purely audio environments by defining a new spoken computer language, their voice-programming system is a syntax-directed editor intended to be used in front of a computer screen for programming in existing written computer languages, such as C and Java. Alvin Surkan of the University of Nebraska-Lincoln has proposed a voice-directed interface in the APL programming language to permit an agent-based program-synthesis system to be controlled via voice [7]. Ramakrishnan et al. of Virginia Tech have published an interesting paper relating mixed-initiative interaction (including voice-based interaction) to mixed computation (i.e., partial evaluation) [8]. Finally, note that purely speech-based commercial systems that take speech commands (such as IBM's ViaVoice) are available as well. These speech commands are very simple, however, and such systems certainly do not allow users to program navigation strategies on-the-fly.
5 Conclusion
In this paper we presented a framework for designing a voice-based scripting language that can be used to program navigation strategies on-the-fly for browsing VoiceXML pages. We used this design framework to develop the ALVIN language, which is representative of a larger potential class, or family, of voice-based programming languages. We discussed some of the challenges associated with designing a voice-based scripting language and proposed solutions to those challenges, which were incorporated into the design of our experimental language. We also outlined techniques for implementing ALVIN; a prototype implementation is in progress.

We anticipate that, as mobile telecommunications devices and portable computing devices become more prevalent, the need for remote and/or hands-free programmability of computing devices will increase. Spoken programming languages like ALVIN will enable users to "write" scripts to program such devices without requiring a keyboard or display. We plan to build on the basic concepts of ALVIN to develop more general-purpose languages for interactive, voice-based programming to fulfill this need.
References

1. Reddy, H., Annamalai, N., Gupta, G.: Listener-controlled dynamic navigation of VoiceXML documents. In: International Conference on Computers Helping People (ICCHP), Lecture Notes in Computer Science, Springer Verlag (2004) 337–354
2. Knudsen, W.: Victor Borge sound clips. http://www.kor.dk/borge/b-mus-1.htm (1997)
3. Shapiro, E., Sterling, L.: The Art of Prolog (1994)
4. Fry, C.: Programming on an already full brain. Commun. ACM 40 (1997) 55–64
5. Gupta, G., Raman, S.S., Nichols, M.M., Reddy, H., Annamalai, N.: DAWN: Dynamic Aural Web Navigation. HCI 2005 (this proceedings) (2005)
6. Arnold, S.C., Mark, L., Goldthwaite, J.: Programming by voice, VocalProgramming. In: ASSETS '00: Proceedings of the Fourth International ACM Conference on Assistive Technologies, ACM Press (2000) 149–155
7. Surkan, A.J.: Spoken-word direction of computer program synthesis. In: APL '00: Proceedings of the International Conference on APL-Berlin-2000, ACM Press (2000) 219–227
8. Ramakrishnan, N., Capra, R.G., Pérez-Quiñones, M.A.: Mixed-initiative interaction = mixed computation. In: PEPM '02: Proceedings of the 2002 ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program Manipulation, ACM Press (2002) 119–130