A User-Centred Approach to Voice Interface Design: Obtaining Unbiased Command Vocabulary Using Storyboards

Robert Book (1) and Mikael Goldstein
Ericsson Radio Systems AB, Applications Research, ERA/T/K, Kista, SE-164 80 Stockholm, Sweden
(1) Now at Logica Svenska AB

E-mail: [email protected], [email protected]
M.G.: Phone +46 8 757 3679, Fax +46 8 757 3100
R.B.: Phone +46 8 705 77 70, Fax +46 8 730 3855

Abstract

This study focuses on providing designers of speech-controlled services with a method, the Storyboard method, for generating an unbiased command vocabulary. A Storyboard is a graphical illustration that depicts a function or a process without using text information that primes the subject on what to say. Subjects interpret the Storyboards and suggest 'unbiased' Operator and Object commands. Eight naive subjects were exposed to 17 Storyboards. Each Storyboard described a different function in a prototype of a voice-controlled Personal Assistant service developed at Ericsson Radio Systems in Kista, Sweden. Subjects used, with a few exceptions, an Operator+Object mental model when communicating with the service. Among the different ways of eliciting commands, three categories were identified. For two Storyboards, a Uni-modal command distribution was obtained, with the Operator+Object being named using one single command. For thirteen Storyboards, the expected, Positively skewed command distribution was obtained, where subjects gave 2-3 different voice commands to denote the Operator and/or Object. In the third category, Priming without Operator, the subjects refrained from using the Operator altogether. Instead, priming occurred, where the end result (the outcome after a state in the system had been changed) was elicited.

Keywords: Usability, design techniques, speech control, command vocabulary, voice control, priming, operator, object

Introduction

A computer application recognising typed or spoken phrases or commands depends on the user's ability to learn the vocabulary that the system accepts as valid input (MacDermid and Goldstein 1996). The larger the command vocabulary, the more learning time and effort the novice user has to spend before being able to master the interaction adequately. The ability of a system to cope with naïve users is considered to be of prime importance according to ETSI (the European Telecommunication Standards Institute). ETSI (1995) suggests that, in order for a system to fulfil the criterion of usability, at least 75% of the intended target users should be able to control and master the system (service) successfully at their first attempt. It is therefore of great importance,
when designing a voice-controlled application, that the command vocabulary predicts, to a satisfactory degree, the naïve user's spontaneous vocabulary for denoting objects and functions.

The design of a service/system is usually left entirely to the designer, whose impact is therefore immense. The designer's influence on the chosen command vocabulary, as well as on the interaction structure, is two-fold. Most applications of this type are designed using a designer-selected vocabulary, known as the "armchair approach" (Furnas et al. 1987), as well as the designer's "conceptual model" of the interaction structure (Norman 1983). The armchair approach refers to the designer's particular use of technical words that may seem natural to him/her, but which may be obscure or even meaningless/misleading to the novice target user of the application. Studies of typed input to applications designed by the armchair approach showed that only 10-20 per cent of novice users' first attempts at using their own command vocabulary matched the designer's intended command (Furnas et al. 1987). Users tend to use a surprisingly large number of synonymous terms to denote the same object or function, not the one unique term that is typical of the armchair approach. According to Zipf's law (Zipf 1949), one of the basic principles of language is that, when giving names to Operators/Objects, a few (usually 2-3) synonymous alternatives are used with high frequency, whereas a large number of alternatives obtain low frequencies, thus producing a positively skewed distribution. Furthermore, the probability that the designer picks a low-frequency command is high, given designers' everyday use of technical terms that are not spontaneously used by the average user. Thus, with the designer's armchair approach, 8-9 out of every 10 initial user attempts fail to identify the designer's intended command. This implies that the learning time for a new system/service is prolonged, since the novice user has to learn and use the designer's way of denoting objects and functions. The alternative is the empirical user-centred design approach, where potential users of the service are asked to define the command vocabulary themselves. By using the 2-3 most frequently suggested user commands as synonyms in the application's vocabulary to denote the same process or function, Furnas et al. (1987) and MacDermid and Goldstein (1996) showed that this method was up to five times as successful as the armchair approach. This figure refers to system control using both typed and spoken input (for voice-controlled applications).
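The selection step of the user-centred approach can be sketched in a few lines of Python (a hypothetical illustration only; the data and function names are invented and the procedure is not taken from the cited studies): tally the spontaneously elicited first attempts for a function, keep the 2-3 most frequent ones as accepted synonyms, and check the resulting first-attempt hit rate against, for example, the 75% criterion suggested by ETSI (1995).

```python
from collections import Counter

def select_vocabulary(elicited, max_synonyms=3):
    """Keep the 2-3 most frequently elicited commands as accepted synonyms."""
    counts = Counter(cmd.lower().strip() for cmd in elicited)
    return [cmd for cmd, _ in counts.most_common(max_synonyms)]

def first_attempt_hit_rate(elicited, accepted):
    """Share of users whose spontaneous first command is already accepted."""
    accepted = set(accepted)
    return sum(cmd.lower().strip() in accepted for cmd in elicited) / len(elicited)

# Hypothetical first attempts for a 'delete e-mail' function (cf. Storyboard 4).
attempts = ["Delete e-mail", "Erase e-mail", "Delete e-mail", "Discard message",
            "Remove e-mail", "Delete e-mail", "Erase e-mail", "Delete e-mail"]

vocabulary = select_vocabulary(attempts)
print(vocabulary)                                    # the top 2-3 synonyms by frequency
print(first_attempt_hit_rate(attempts, vocabulary))  # 0.875 here, whereas a single
                                                     # armchair choice such as 'remove
                                                     # e-mail' would score only 0.125
```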
Unbiased command vocabulary

Arriving at an unbiased command vocabulary poses certain problems. In the studies of typed input commands cited in Furnas et al. (1987) this was not taken into account, since written words were used to describe processes and functions and the users were asked to suggest synonyms. It is thus possible that these written starters primed the subjects. When describing complicated processes occurring in an application, it is, however, difficult to describe the process to the subject without using potential command words denoting the process or function itself. Thus, a certain degree of experimental priming is induced, where the naïve subject's choice of commands may be coloured by the wording of the experimental stimuli themselves (MacDermid and Goldstein 1996).

In order to minimise linguistic bias from written or verbal descriptors in (voice) command elicitation studies, the Storyboard method was used (MacDermid and Goldstein 1996). The Storyboard method is based on a series of graphical illustrations, each illustrating a process or function in the application that is to be activated by a voice command or phrase. The Storyboard method does not verbally prime the subjects to the designer's own preferences.
The picture format is used as input instead of the linguistic modality. Some words may, however, be included in the storyboards where required to indicate the context. The subject's task is then to name the intended Operator/Object (process or function), using a single voice command or a short phrase. On the navigation level, Norman (1983) distinguishes between the designer's conceptual model and the user's/subject's mental model of the same system. The conceptual model and the mental model are intended to coincide, which is usually not the case for the novice user.

Procedure

The target service was a voice-controlled interface for a Personal Assistant prototype service developed at Ericsson Radio Systems, Kista. Seventeen different scenarios were depicted, each as a Storyboard containing 2-6 pictures. The Storyboards were created using a drawing program, and each illustrated a typical scenario for the service. The service was named ESEK (Electronic SECretary in Swedish) in the Storyboards, and this was also the metaphor used during the short introduction given to each subject. The general layout of the storyboards was the following: a well-dressed man wearing a suit with a white shirt and a tie is sitting in front of his desk, holding the receiver in one hand. Messages or information emanating from the service were presented as a talk bubble emanating from the receiver, appearing to the right of the person. Intended voice commands (Operators/Objects) to be elicited by the user/subject were presented as an empty pink/shaded voice bubble emanating from the person's mouth. On practically all storyboards, ESEK was depicted as a PC (Personal Computer).

Scenarios

The following seventeen scenarios were depicted as Storyboards (2-6 pictures for each scenario):

1. Check ESEK for any messages
2. Make a reminder after listening to a voice mail from Johan
3. Print a received e-mail message
4. Delete an old e-mail (see Figure 2)
5. Save an e-mail address received in a voice mail
6. Check for new e-mail messages
7. Change the outgoing message from "Meeting" to "Gone for the day"
8. Change the outgoing message from "At lunch" to "Back from meeting at 17.05" (see Figure 3)
9. Return a call on an incoming voice mail
10. Delete one voice mail
11. Make a reminder based on information from a received voice mail
12. Repeat play-back of a voice mail
13. Delete a voice mail after listening to it
14. Store a phone number from an incoming voice mail
15. Check for new voice mail messages
16. Phone Karin Nilsson from the Address Book
17. Phone 08-4042441 (see Figure 1)

Subjects

Eight subjects, three females and five males, with no prior knowledge of voice-controlled services participated in the study.
The presentation order of the seventeen scenarios was reversed for half of the subjects. When looking at each Storyboard, the subject was encouraged to spontaneously speak out (not write) the voice command(s) (Operator(s)/Object(s)) that first came to his or her mind. Since the subjects were to express themselves in English, a foreign language to them (they all had Swedish as their native language), the text priming effect could be minimised by writing the necessary text in the Storyboards in Swedish.

Results

The analysis of the elicited Operator and Object voice commands shows that they could be grouped into three different categories: Uni-modal (U), Positively skewed (Ps) and Priming without Operator (PwO); see Table 1. The Uni-modal distribution refers to the fact that only one typical Operator command and one typical Object command were elicited. The Positively skewed distribution refers to a distribution of elicited responses according to Zipf (1949), where 2-3 Operators and/or 2-3 Objects were used. The Priming without Operator category refers to elicited responses that constituted a translation (priming) of the end state of a scenario, written in Swedish, into English, omitting the Operator altogether.

In the Uni-modal category (see Figure 1, Storyboard 17) we find two Storyboards (numbers 16 and 17). Most of the elicited responses included only one Operator and one Object: Call Karin Nilsson (16) and Call zero eight four-oh-four twenty-four forty-one (17).

Thirteen Storyboards (numbers 1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 13, 14 and 15) elicited the expected Positively skewed (Zipf 1949) command distribution (see Figure 2, Storyboard 4), i.e., 2-3 different Operator and/or 2-3 different Object commands: Any (new) messages/notes/reminders/bookings? (1), Remind me/Make a note (2), Print e-mail/message (3), Delete/Erase/Discard message/e-mail (4), Store/Save (Maria's) e-mail, Make a note (5), Is there (any) (new) e-mail/messages? (6), Call message one back/Please give/connect me.., Return call (9), (Please) Delete/Erase/Listen/Discard message number one (10), Store/Save/Add/File message/Maria (11), Repeat (last) message, Listen to (12), (Play/Read/Listen to) last message (13), Save/Store/Note/Write number/name (14) and Any messages for me?/Have I got.., How many..? (15).

Two Storyboards (numbers 7 and 8) were classified as belonging to the Priming without Operator category (see Figure 3, Storyboard 8). The outcome is characterised by the fact that subjects were primed by the end result of the scenario, "Johan är på möte och kommer tillbaks kl. 17.05" (Johan is at a meeting and will be back at 17.05), eliciting responses such as At a meeting, will be back at five-oh-five/Meeting until 17.05 and Gone/go home/Back tomorrow. Whereas the designer's conceptual model and the user's mental model match an Operator+Object structure in the Uni-modal and Positively skewed categories, the subjects' mental model for scenarios 7 and 8 does not. The subjects appear to use the system in a much more direct way than they would have if the technical limitations had been considered. They did not use an Operator in order to change the state of the machine (see Figure 3, Picture 4) from Lunch to Möte (Meeting); they simply elicited the end result of the scenario. Instead of using the Operator Change or Cancel (according to the designer's conceptual model and previous user models) in order to change the state of the machine from Lunch to Meeting ("Möte"), they simply elicited the end state (At a meeting, will be back five-oh-five, Gone/go home/Back tomorrow), omitting the Operator altogether.
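Operationally, the three categories could be characterised roughly as follows (a minimal sketch, not part of the original study; it assumes the elicited responses have already been hand-coded into (Operator, Object) pairs, with the Operator set to None when only an end state was named):

```python
def classify_distribution(responses):
    """Classify hand-coded (operator, object) responses for one Storyboard."""
    operators = {op for op, _ in responses if op is not None}
    objects = {ob for _, ob in responses if ob is not None}
    if not operators:
        return "Priming without Operator (PwO)"   # only end states were named
    if len(operators) == 1 and len(objects) <= 1:
        return "Uni-modal (U)"                    # one Operator, one Object
    return "Positively skewed (Ps)"               # 2-3 synonymous Operators/Objects

# Hypothetical coding of responses to Storyboard 4 (Delete an old e-mail):
responses = [("delete", "e-mail"), ("erase", "message"),
             ("discard", "message"), ("delete", "e-mail")]
print(classify_distribution(responses))           # 'Positively skewed (Ps)'
```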
Table 1. Elicited typical (I and II) and rare commands (Operator+Object) for the 17 Storyboards from 8 subjects, based on the 1st and 2nd elicited response from each subject. Responses are grouped into three categories: Uni-modal (U), Positively skewed (Ps) and Priming without Operator (PwO).

Designer's scenario description | Category | Typical command (I) | Typical command (II) | Rare command(s)
1. Check ESEK for incoming voice/e-mail | Ps | Any (new) messages? | | Notes/Reminders/Bookings?
2. Make a reminder after listening to a voice mail | Ps | Remind me | Make a note | Call Johan
3. Print a received e-mail message | Ps | (Please) Print e-mail | .. message |
4. Delete an old e-mail | Ps | (Please) Delete message/e-mail | Erase.. | Discard
5. Save e-mail address received in a voice mail | Ps | (Please) Store (Maria's) e-mail | Save | Write.., Make a note
6. Check for new e-mail messages | Ps | Any (new) messages? | Is there any.. | Have I.., Number of..
7. Change outgoing message from "Meeting" to "Gone for the day" | PwO | Gone (go) home | Back tomorrow | Left the office
8. Change outgoing message from "At lunch" to "Back from meeting at 17.05" | PwO | Meeting until.. (to.., ..will be back..) | |
9. Return call on incoming voice mail | Ps | Call message one/person | Please give me/Connect me | Return call, Who called?, Discard
10. Delete one voice mail | Ps | (Please) Delete message number one.. | Erase.. | Next, Listen to..
11. Reminder based on information from received voice mail | Ps | Store/Save message | Add Maria/File.. | Make a note
12. Repeat play back of a voice mail | Ps | (Please) Repeat (last) message | | Listen to..
13. Delete a voice mail after listening to it | Ps | (Play/Read) message | Play next message | Erase, Next
14. Store phone number in incoming voice mail | Ps | Store number/name | Save | Make a note
15. Check for new voice mail messages | Ps | Any (new) messages for me?/Have I got? | | How many messages..?
16. Phone Karin Nilsson from Address Book | U | Call Karin Nilsson | | Connect with ..
17. Phone 08-4042441 | U | Call 08-4042441 | | Connect me
Figure 1. Uni-modal Operator+Object command distribution. Scenario 17: Phone 08-4042441. Operator+Object command: Call 08-4042441.
[Figure 2 consists of a four-picture Storyboard. ESEK announces: "Du har 3 nya e-mail. E-mail 1: 'Hej Kalle! Välkommen till oss på middag ikväll'" (You have 3 new e-mails. E-mail 1: "Hi Kalle! Welcome to dinner with us tonight"). The user thinks: "Det e-mailet är en vecka gammalt, så det behöver jag inte ha kvar" (That e-mail is a week old, so I do not need to keep it) and utters the elicited command. ESEK then reports: "Du har 2 nya e-mail" (You have 2 new e-mails).]
Figure 2. Positively skewed Operator+Object command distribution. Scenario 4: Delete an old e-mail. Typical Operator+Object commands: Delete/Erase/Discard e-mail/message.
[Figure 3 consists of a six-picture Storyboard. Johan's outgoing message initially reads "Johan är på lunch, och kommer tillbaka kl. 13.00" (Johan is at lunch and will be back at 13.00). ESEK displays the status options Möte (Meeting), Lunch, Semester (Vacation) and Gått för dagen (Gone for the day). In the final picture the outgoing message reads "Johan är på möte, och kommer tillbaka kl. 17.05" (Johan is at a meeting and will be back at 17.05).]
Figure 3. Priming without Operator. Scenario 8: Change outgoing message from "At lunch" to "Back from meeting at 17.05". Subjects are primed by the written Swedish text in Picture 6, "Johan är på möte, och kommer tillbaka kl. 17.05" (Johan is at a meeting and will be back at 17.05). No Operator+Object structure is elicited, only the end state of the operation: At a meeting, will be back at five-oh-five/Meeting until 17.05.

The subjects did not try to be personal with the system, despite the fact that the service had a name, ESEK, and that they were using their voice as input medium. Only occasionally was the word Please used as a sign of a personal relationship. One reason for this might be that the service was depicted as a computer in the storyboards.

Discussion

The absence of the Operator+Object structure for Storyboards 7 and 8 is interesting. Subjects omit the Operator altogether, which is contrary to previous results (MacDermid, Eklund and Goldstein 1997). They simply elicit the end state, and not the transition of the underlying state in the machine. This is a sign of treating the machine not as a machine, but more like a human.
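To make the mismatch concrete, the sketch below is a hypothetical Python illustration (the synonym sets and the interpret function are invented for this example and are not the ESEK implementation): a recogniser built around an Operator+Object grammar, with the 2-3 most frequently elicited synonyms accepted per term, resolves typical commands of the kind listed in Table 1, whereas a pure end-state utterance such as "Back from meeting at 17.05" contains neither an accepted Operator nor an accepted Object.

```python
# Hypothetical synonym sets of the kind suggested by the data in Table 1.
OPERATOR_SYNONYMS = {
    "DELETE": {"delete", "erase", "discard"},
    "PRINT":  {"print"},
    "CALL":   {"call", "phone", "connect me"},
    "REPEAT": {"repeat", "play", "listen to"},
}
OBJECT_SYNONYMS = {
    "MESSAGE": {"message", "e-mail", "voice mail", "mail"},
    "NUMBER":  {"number"},
}

def interpret(utterance):
    """Map a spoken phrase onto a canonical (Operator, Object) pair, if possible."""
    text = utterance.lower()
    operator = next((op for op, syns in OPERATOR_SYNONYMS.items()
                     if any(s in text for s in syns)), None)
    obj = next((ob for ob, syns in OBJECT_SYNONYMS.items()
                if any(s in text for s in syns)), None)
    return operator, obj

print(interpret("Please erase that e-mail"))    # ('DELETE', 'MESSAGE')
print(interpret("Back from meeting at 17.05"))  # (None, None): end state only
```

Accepting 2-3 synonyms per Operator and Object covers the Uni-modal and Positively skewed cases, whereas the Priming without Operator case calls for the additional instructions discussed below.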
Despite this "human treatment" of the machine, and the fact that the service had a name, ESEK, and that the subjects were using their voice as input medium, they did not try to be personal with the system. One reason for not using the name ESEK might have been that it is not an ordinary name. Only occasionally did they use the word Please as a sign of a personal relationship. One reason for this might be that the service was depicted as a computer in all the storyboards, and a computer is not treated the same way as a human.

For simple scenarios, like "Print a message", the method is adequate for finding the voice command, but more complex scenarios (like scenarios 7 and 8) demand more from the designer of the storyboards. In order to obtain a Uni-modal or a Positively skewed distribution, the interpretation of the storyboards must be clear. This means that the storyboards might need to go through an additional pre-test before they are used in the evaluation proper. Despite the use of the Swedish language in the Storyboards, we suspect that priming does occur. The outcome of this experiment is only valid for non-native speakers of English, i.e., we do not know what commands a native English-speaking sample would elicit.

For Storyboards not generating any Operator+Object command distribution, it is of vital importance that instructions are provided that prime the subject to use the appropriate commands accepted by the application. When the most probable distribution, the Positively skewed one, occurs, it is important that 2-3 Operator or Object synonyms are accepted as valid spoken entries by the voice recogniser to access a function, so that a novice subject can score a direct hit when using the system for the first time.

References

ETSI (1995). Human Factors (HF): Minimum Man-machine Interface (MMI) to public network based supplementary services. Document DE/HF.01017, Version 4.5. European Telecommunication Standards Institute.

Furnas, G.W., Landauer, T.K., Gomez, L.M. and Dumais, S.T. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), pp. 964-971.

MacDermid, C. and Goldstein, M. (1996). The 'Storyboard' method: Establishing an unbiased vocabulary for keyword and voice command applications. In A. Blandford and H. Thimbleby (eds.), Proceedings of HCI'96 Industry Day & Adjunct Proceedings, British Computer Society Conference on Human-Computer Interaction, pp. 104-109.

MacDermid, C., Eklund, C. and Goldstein, M. (1997). Conflicts between the conceptual model and the mental model in two voice dialling services. In HCI'97 International Abridged Proceedings, San Francisco, USA, p. 93.

Norman, D.A. (1983). Some observations on mental models. In D. Gentner and A. Stevens (eds.), Mental Models, Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 7-14.

Zipf, G.K. (1949). Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press, Cambridge, Massachusetts.