quill: A Gesture Design Tool for Pen-based User Interfaces

A. Chris Long*
HCI Institute
Carnegie Mellon University
Pittsburgh, PA 15213
+1 412 268 7565
[email protected]

James A. Landay and Lawrence A. Rowe
Computer Science Division, EECS Department
University of California, Berkeley
Berkeley, CA 94720-1776
+1 510 643 3043
{landay,rowe}@cs.berkeley.edu

* Based on work at U.C. Berkeley.

ABSTRACT

Pen-based user interfaces are becoming more prevalent. Gestures are a powerful and convenient way to interact using a pen, but they are difficult to design well. This paper describes the first intelligent gesture design tool, quill, which assists designers in creating and improving gestures for pen-based interfaces, both in terms of recognizability by the computer and of the likelihood that people will confuse gestures with one another. quill helps designers improve their gestures by warning them about potential problems and offering advice about how to fix them. An evaluation with professional designers showed this type of advice to be promising, but also that it requires care to implement well.

Keywords

Pen-based user interface, gestures, design tool

INTRODUCTION

Interest in pen-based computers is growing rapidly. In 2000, 3.5 million handheld computers were sold, a substantial increase over the 1.3 million sold in 1999 [10]. Pens are also being used more on the desktop. As pen-based devices proliferate, pen-based user interfaces become increasingly important.

Many authors list the benefits of pen-based computer interfaces [1, 2, 3, 11, 12, 21]. In particular, one benefit of pen-based interfaces is the ease of using gestures—that is, marks made to execute commands (see Figure 1). Gestures are useful on displays ranging from small screens, where space is at a premium, to large screens, where controls, such as menus, can be out of reach [14]. In this work we concentrate on gestures in the spirit of copy editing [5, 18] rather than marking menus [20], because we believe that traditional marks are more useful in many circumstances. For example, they can specify operands at the same time as the operation, and they can be iconic (i.e., memorable because the shape of the gesture corresponds with its operation).

[Figure 1: Example of a delete gesture, drawn over the words "delete me".]

Perhaps because of these advantages, gestures are becoming more prevalent. For example, the Opera web browser supports gestures for common web browsing operations [13], and Sensiva [19] and KGesture [15] allow gestures to be used with desktop applications that are not gesture-aware.

Unfortunately, gestures also have drawbacks. Users of current systems sometimes find that gestures are hard to remember and are often misrecognized [7]. Misrecognition errors are undesirable for any recognition system, but for gestures they are especially problematic because gestures perform actions. When a gesture is misrecognized, the user must notice that an error has occurred and take special steps to undo the changes.

People may find gestures difficult to learn and remember in part because of their perceptual similarity or dissimilarity to one another. We contend that similar operations with a clear spatial mapping, such as scroll up and scroll down, should be assigned similar gestures. Conversely, gestures for dissimilar abstract operations, such as cut and scroll down, may easily be confused if they are visually similar.

Long et al. created a prototype gesture design tool, gdt, and studied the gesture design process. They found that it is difficult to design gestures that will be recognized well by the computer [8]. Naive gesture designers often created gestures that the computer viewed as similar and that were thus difficult for the computer to recognize.

To help designers improve their gestures, we have developed an intelligent gesture design tool, called quill. quill continuously analyzes gestures that the designer enters. When quill predicts that two gestures might be confused by people or by the computer, it warns the designer and provides advice about how to fix the problem. To detect when people will perceive gestures as similar, quill includes an experimentally derived model of human-perceived gesture similarity [6, 9]. The similarity model can also be used to give advice about how to make gestures more similar, but we believe it is more difficult for designers to make gestures more different than more similar.

The remainder of this paper is structured as follows. The first section defines terminology. The next describes how quill is used. The following discusses issues in the design, evolution, and implementation of quill, followed by a user study of quill. The paper concludes with future work and a summary.

TERMINOLOGY

Before discussing details of quill, we must define our terminology. Gestures are the actual objects that quill users want to manipulate, so it is important that the scheme used to organize them be easy for users to understand. We created the following 5-level hierarchy:

• Gesture. A single mark or glyph.
• Gesture category. A collection of gestures that define a type of gesture. For example, a collection of left-to-right straight lines might define the "scroll right" gesture category.
• Gesture group. A collection of gesture categories, typically ones that perform related operations. For example, one might have an "Edit" group that contains the "cut," "copy," and "paste" gesture categories.
• Gesture set. A collection of gesture categories and gesture groups, typically all the categories and groups for a particular application.
• Gesture package. A combination of a training set (i.e., a gesture set used to train the recognizer) and zero or more test sets (i.e., gesture sets used to test the recognition of the training set).

Gesture groups and gesture packages are not strictly necessary, but experience suggests they are useful for designers of gesture sets with many categories. Gesture packages were introduced to allow tight coupling of test sets with a training set, which was suggested in the gdt evaluation. Collectively, gesture packages, gesture sets, gesture groups, gesture categories, and gestures are called "gesture objects." Gesture objects that may have children (i.e., all except gestures) are "gesture containers." A sketch of this hierarchy in code follows.
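The hierarchy maps naturally onto a small family of classes. The following is a minimal sketch in Java (the language quill is written in), using modern Java for brevity; the class and method names are illustrative assumptions, not quill's actual API.

    import java.util.ArrayList;
    import java.util.List;

    // Anything in the hierarchy is a "gesture object".
    abstract class GestureObject {
        final String name;
        GestureObject(String name) { this.name = name; }
    }

    // Gesture objects that may have children are "gesture containers".
    abstract class GestureContainer<C extends GestureObject> extends GestureObject {
        final List<C> children = new ArrayList<>();
        GestureContainer(String name) { super(name); }
        void add(C child) { children.add(child); }
    }

    // A single mark or glyph, stored as a sequence of sampled points.
    class Gesture extends GestureObject {
        final List<double[]> points = new ArrayList<>(); // each entry is {x, y}
        Gesture() { super("gesture"); }
    }

    // Examples defining one type of gesture, e.g. "scroll right".
    class GestureCategory extends GestureContainer<Gesture> {
        GestureCategory(String name) { super(name); }
    }

    // Related categories, e.g. an "Edit" group holding cut, copy, and paste.
    class GestureGroup extends GestureContainer<GestureCategory> {
        GestureGroup(String name) { super(name); }
    }

    // All the categories and groups for one application. Children may be
    // categories or groups, so the child type is the loose GestureObject.
    class GestureSet extends GestureContainer<GestureObject> {
        GestureSet(String name) { super(name); }
    }

    // One training set plus zero or more test sets, kept together.
    class GesturePackage {
        final GestureSet trainingSet;
        final List<GestureSet> testSets = new ArrayList<>();
        GesturePackage(GestureSet trainingSet) { this.trainingSet = trainingSet; }
    }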

[Figure 2: quill user interface overview. Gestures and the gesture hierarchy are shown in the Tree view and in Windows. Gestures are entered in the Drawing area. The "Training/Test selector" switches between the training gesture set and test gesture sets. Messages and advice are displayed in the Log. Callouts label the Goodness metrics, Menu bar, Windows, Training/Test selector, Tree view, Drawing area, and Log.]

QUILL USAGE

This section describes how quill is used. An overview of the quill interface is given in Figure 2. The first step in using quill is creating some gesture categories, and optionally some gesture groups to organize the categories. Then the designer enters training gestures by drawing them in the "drawing area." The designer can view the categories and training gestures simultaneously in quill subwindows on the quill desktop. For example, Figure 2 shows the training gestures for the delete category in the lower right window.

Once training gestures have been entered, recognition can be tested. When the designer selects the training set and draws a gesture, quill recognizes it and displays the result, as shown in Figure 3.

[Figure 3: quill recognizes a gesture and shows the result several different ways: the recognition result is reported, the gesture that was drawn is displayed, and the recognized gesture is highlighted and drawn in green.]

Throughout this process, quill monitors the gestures the designer enters and analyzes them for problems. If it discovers a problem with human similarity, it reports the problem immediately. Problems with recognition are not reported until recognition is tested, because the relationships among gestures in the recognizer change every time a new gesture is added. When the designer tests recognition, the designer is probably evaluating the gestures, so it is an appropriate time to give recognition-related feedback.

In addition to warnings about specific problems, quill also provides the designer with an indication of the quality of the gesture set in terms of two "goodness metrics." One metric measures how well the recognizer will be able to disambiguate the gestures, and the other indicates how good the gestures are for humans. Both metrics have a perfect score of 1000. When quill detects a problem, it subtracts from the appropriate metric a penalty whose magnitude is based on the severity of the problem. Currently, the human goodness metric is based solely on whether there are gestures that people are likely to perceive as similar.
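As a rough illustration of this scoring scheme, the following sketch starts a metric at the perfect score of 1000 and subtracts one penalty per detected problem. The penalty values, the clamping at zero, and the names are assumptions for illustration; the paper does not give quill's actual constants.

    import java.util.List;

    class GoodnessMetric {
        static final int PERFECT_SCORE = 1000;

        // One penalty per detected problem; its magnitude reflects severity,
        // e.g. a likely misrecognition might cost more than a single outlier.
        static int score(List<Integer> penalties) {
            int score = PERFECT_SCORE;
            for (int p : penalties) {
                score -= p;
            }
            return Math.max(0, score); // clamp rather than report a negative score
        }
    }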

QUILL DESIGN AND IMPLEMENTATION

The study with gdt found that designers could, by trial and error, improve the recognition of their gestures. However, we believe users of pen-based user interfaces will demand better recognition than most participants in that study were able to achieve. (Conventional wisdom holds that even a 98% recognition rate is inadequate [1].) We also wanted a design tool that could advise designers about human-perceived similarity of gestures in addition to recognition of gestures.

quill was prototyped on paper and in Photoshop, and implemented in Java using JDK 1.3 and the Swing user interface (UI) toolkit. It consists of approximately 26,000 lines of code spread across 93 classes. It uses the gesture recognition algorithm by Rubine [18].
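To give a feel for the recognizer, the sketch below shows the flavor of Rubine's algorithm: each gesture is reduced to a vector of geometric features (thirteen in [18]; only a few are shown), and each category is scored by a linear function of the features, with weights estimated from the training examples. This is a simplified reconstruction, not quill's code, and it assumes a reasonably sampled stroke.

    import java.util.List;

    class RubineSketch {
        // Compute a few of Rubine's features from a stroke of {x, y} samples.
        static double[] features(List<double[]> pts) {
            double[] first = pts.get(0);
            double[] third = pts.get(Math.min(2, pts.size() - 1));
            double[] last = pts.get(pts.size() - 1);

            // f1, f2: cosine and sine of the stroke's initial angle
            double d = Math.hypot(third[0] - first[0], third[1] - first[1]);
            double f1 = (third[0] - first[0]) / d;
            double f2 = (third[1] - first[1]) / d;

            // f3: length of the bounding-box diagonal
            double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
            double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
            for (double[] p : pts) {
                minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
                minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
            }
            double f3 = Math.hypot(maxX - minX, maxY - minY);

            // f5: distance between the first and last points
            double f5 = Math.hypot(last[0] - first[0], last[1] - first[1]);

            return new double[] { f1, f2, f3, f5 };
        }

        // Linear classifier: the score for category c is
        // weights[c][0] + sum_i weights[c][i+1] * f[i]; the highest score wins.
        static int classify(double[] f, double[][] weights) {
            int best = 0;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < weights.length; c++) {
                double s = weights[c][0];
                for (int i = 0; i < f.length; i++) {
                    s += weights[c][i + 1] * f[i];
                }
                if (s > bestScore) { bestScore = s; best = c; }
            }
            return best;
        }
    }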

The following two subsections discuss interesting issues in the design of quill: active feedback and test vs. training sets. The next describes how the quill interface evolved. The last discusses implementation issues.

Active Feedback

Active feedback is the most important and most novel feature in quill. In the gdt evaluation, many participants had difficulty discovering that recognition problems existed in the gesture set they were designing. gdt included tables, based on how the gestures would be recognized by the computer, to help designers discover recognition problems. However, many participants did not consult these tables, and the tables were confusing to use. Active feedback helps designers learn that their gestures have problems, both in terms of recognition and human perception. It also addresses the problem designers had of not knowing how to fix recognition problems once they were discovered: when quill gives the designer feedback about a problem, it also gives advice about how the problem can be fixed.

quill continuously monitors the gesture set for the following potential problems:

• Recognizer similarity. If two gesture categories are similar to the recognizer, it is more likely that they will be misrecognized. The recognizer does not see gestures the same way people do, so designers may not realize this type of problem exists.
• Outlying category. Sometimes two gesture categories may be similar to the recognizer because a third, outlying, gesture category is very different. Only very experienced gesture designers understand this problem, so quill can significantly assist less experienced designers by detecting it and offering assistance to fix it.
• Misrecognized training example. Normally, all training gestures should be recognized as the category of which they are a member. Most misdrawn gestures are easily caught by this automatic monitoring.
• Outlying gesture. Sometimes a gesture is very different from others in its category, yet still correctly recognized. Often, such gestures are not drawn as the designer intended. Misdrawn gestures give the recognizer false data about what the gesture category is supposed to be like, which may result in recognition problems.
• Human similarity. The previous warnings relate to how the recognizer sees the gestures. Another important factor in gesture usage is how easily gestures can be learned and remembered by people. To make gestures easier to learn and remember, the designer may want gestures that invoke unrelated actions to appear dissimilar. If gesture categories in different groups are predicted to be similar to people, quill reports this similarity to the user and suggests how to make the gestures less similar. If the gestures are in the same group, there is no warning, because gestures in the same group likely perform similar functions, and it may be helpful for gestures with similar functions to be perceptually similar. For example, the operations "scroll up" and "scroll down" should have similar gestures.

[Figure 4: quill warns the designer using active feedback. Light bulb icons indicate which gesture objects have notices. Arrows show how the designer has used hyperlinks to obtain additional information.]

When quill finds a problem, it warns the designer and provides advice about how to fix it. An example of feedback is shown in Figure 4. Deciding how, where, and when to provide feedback raised difficult interface design issues. Since people in the gdt experiment had difficulty understanding its tables about recognition, we decided the feedback in quill should be easily intelligible, so quill uses plain English where possible. When unfamiliar terms are presented, they are hyperlinked to descriptions in the on-line reference manual, which uses diagrams to explain terms.

At any given time the system may have several different feedback items for the designer. For this reason, quill has a log at the bottom of the main window that displays feedback in summary form. We wanted to keep the messages in the log terse, both so experts would not have to read many words to understand a message and to conserve screen space. A sketch of the information a notice might carry follows.
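Based on the behavior described above (a terse log summary, a severity icon, hyperlinks to fuller explanations, and the flagged gesture objects), a notice might carry roughly the following information. This is a sketch only; the field names are assumptions, and GestureObject refers to the hierarchy sketched earlier.

    import java.util.List;

    class Notice {
        enum Severity { INFO, WARNING, ERROR } // determines which icon is shown

        final Severity severity;
        final String logSummary;            // terse one-line message for the log
        final String detailLink;            // hyperlink to a fuller explanation
        final List<GestureObject> subjects; // objects flagged in the tree/desktop

        Notice(Severity severity, String logSummary, String detailLink,
               List<GestureObject> subjects) {
            this.severity = severity;
            this.logSummary = logSummary;
            this.detailLink = detailLink;
            this.subjects = subjects;
        }
    }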

However, most feedback in quill is not immediately intelligible from the message in the log, so notices (we call feedback about a specific problem a notice) also include hyperlinks that the user can click to obtain additional information. This additional information is usually shown in new windows in the desktop area, although two types of information, "how to" and "reference," are shown in dedicated top-level windows (i.e., independent and outside of the quill main window).

Active feedback should be noticeable by the designer, yet not obtrusive. The log is insufficient by itself, because it is difficult to see whether there are any active notices and, if so, the gesture objects to which they apply. quill makes notices more noticeable by showing warning icons in the tree view and on the desktop next to gesture objects with which it detects problems (see Figure 4). The designer may click on these icons to scroll the log to a relevant notice. Different icons are used to indicate notice severity.

There are several options for when to provide feedback to users. One is to give feedback as soon as possible. This approach has the advantage that the user knows what action caused the feedback. However, for feedback related to recognition, this strategy is likely to produce premature and incorrect feedback, because the way the recognizer views the gestures will change over time, as gestures are added to the set. We decided to treat feedback differently depending on whether or not it is related to recognition. Feedback not related to recognition is shown immediately, so that users know what caused it. Feedback related to recognition is delayed until the user draws a gesture to be recognized, because we assume that at that time the user has stopped entering gestures and is starting to test recognition.

Equally important as showing feedback at the right time is removing feedback at the right time. Feedback should be removed when the user tells the program to ignore the problem or when the problem is "fixed." One problem is determining what "fixed" means. One strategy is, when a notice is created, to remember the current state of the relevant gesture objects and to remove the notice when the state changes by more than some amount. A problem with this strategy is deciding how much change is enough. Another strategy is to evaluate the whole gesture set every time it changes to determine whether notices are still valid.

Initially, quill took the latter approach: it evaluated the entire gesture set any time any part changed, and reported all notices that applied to the new state of the set. We soon realized that this design was not workable, because users were deluged with the same notices over and over again and ignored them all. We changed quill so that it only displays notices in the log that have not been displayed before. Warning icons in the tree and desktop view are shown for all applicable notices, not just new ones.

In our study with quill, this approach seemed to work well. Most users discovered that the computer updated its advice if they were idle for a few seconds. A few users did not discover this; possibly the feedback that analysis is occurring should be less subtle, at least for novice users.

Test vs. Training Gesture Sets

The gdt experiment showed that explicit support for test gesture sets was important. This section explains why test and training sets should be tightly coupled and where in the gesture hierarchy test gestures should be placed.

In gdt, the recognizability of one gesture set could be tested by using it as a training set and using another gesture set as a test set. However, the system did not tie test sets to a training set. The original paper prototype of quill had this same design. In an evaluation of the paper prototype, we found that users strongly preferred the program to support linking test sets with the training set for which they were made. For this reason, in quill we decided to have every test set bound to a training set inside a gesture package.

We considered two possible places for test gestures in the gesture hierarchy: 1) in gesture categories, so that each gesture category would have training gestures and possibly test gestures as well; and 2) in a separate hierarchy parallel to the training set. We discussed how test sets would be used, and thought that designers would at any given time be working primarily on training gestures or on test gestures and would not closely interleave the two. Therefore, we decided to have a separate test set hierarchy, where each test set has its own gesture groups and categories and those categories contain the test gestures.

User Interface Evolution

The first phase of the quill interface was a low-fidelity prototype consisting of a set of sketches. Figure 5 shows a prototyping scenario, in which quill has told the designer that two categories are too similar for the recognizer and the designer has clicked the "More info" button.

[Figure 5: quill low-fi prototype scenario. The designer saw the warning at the bottom and has clicked on the "More info" button. "angle of the bounding box" is a blue hyperlink.]

Next, we performed an informal user study using the paper prototype [17]. The paper prototype used printouts of the online sketches with hand drawings for additional elements (e.g., pull-down menus, dialog boxes, and subwindows).

Four graduate students from our research group participated in the study. During and after the study, participants made a number of suggestions about how to improve the interface on topics such as labeling and visual feedback. They also raised difficult issues, the most significant of which was the question of when warning messages should disappear, discussed above.

We decided to build the next version of the interface in software, because a large component of the interface is the interaction with the tree and desktop views, which is cumbersome to capture in a paper prototype.

Implementation Issues

While implementing advice in quill, we encountered several issues that may apply to other applications that use advice.

Advice for hierarchies

Many interfaces structure complex information with hierarchies, such as the gesture set, group, category, and gesture in quill. In quill, all notices (i.e., pieces of advice) that apply to an object are stored in a list property of the object. Ancestors show a warning icon if any of their descendants have notices, so parent objects in the hierarchy need to know about inherited notices (notices of their descendants) as well as intrinsic notices (notices about themselves). (Normally in a hierarchy, children inherit from parents; here, parents inherit notices from their children.) This requirement raised the issue of how to store inherited notices, which may apply to other interfaces that display hierarchies.

The question is what information each parent (i.e., gesture category, group, or set) should store about its inherited notices. Since notices nearly always refer to multiple gesture objects, whereas gesture objects will have multiple notices less often, we decided to have each parent keep a list of all descendants that have one or more notices. A sketch of this bookkeeping is shown below.
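The following sketch, reusing the types from the earlier sketches, stores each container's own (intrinsic) notices plus the set of descendants that currently have notices, which is enough to decide whether to show a warning icon. Names are illustrative assumptions.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class NoticeIndex {
        // Notices about this object itself.
        final List<Notice> intrinsic = new ArrayList<>();

        // Descendants that currently have one or more notices of their own;
        // cheaper than copying every inherited notice upward, since a notice
        // usually mentions several objects but an object rarely has many notices.
        final Set<GestureObject> flaggedDescendants = new HashSet<>();

        // An ancestor shows a warning icon if it or any descendant has a notice.
        boolean shouldShowWarningIcon() {
            return !intrinsic.isEmpty() || !flaggedDescendants.isEmpty();
        }

        // Called when a descendant gains its first notice or loses its last one.
        void descendantFlagged(GestureObject d)   { flaggedDescendants.add(d); }
        void descendantUnflagged(GestureObject d) { flaggedDescendants.remove(d); }
    }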

Java implementation

Background analysis

Long-running operations such as complex analysis, database access, and network communication are becoming more common in end-user applications, and these tasks are often run in background threads. For example, the analyses in quill are run in the background to avoid freezing the interface. We evaluated a number of approaches to background analysis, aiming to maximize user freedom and minimize user confusion. The options we considered were:

1. Lock all user actions during advice computation. This option would be fairly easy to implement and simple for the user to understand, but the delay would annoy the user.
2. Disable all user actions that would change any state during advice computation. The ability to examine the gesture set would be beneficial, but the inability to make changes at some times but not others might confuse users.
3. Disable user actions that would change state in use by the current analysis. This option would be efficient in that it allows all non-conflicting operations to occur. However, it would be difficult to implement and would almost certainly confuse users, because different actions on different gesture objects would be enabled and disabled as different analyses were performed.
4. Allow any action, but if a change happens that affects analysis, cancel the analysis. This option would allow freedom for users, but they may not realize that some actions will cancel analysis. Also, it raises the question of whether canceled analyses should automatically restart.
5. Allow all actions. This option would be easy to implement and would allow users complete freedom. However, users might modify their gestures during analysis, so the advice might not be consistent with the modified state of the gestures, which would likely confuse users.

Analyses in quill may be started automatically by the system or manually by the user. We opted for strategy #2 for user-initiated analyses, because we thought disabling some actions in response to a user action would not be confusing. For system-initiated analyses, we decided to use strategy #4 and automatically restart canceled analyses. This option allows users to act freely without having to be concerned about, or even know about, the background analyses. We also provide feedback to the designer during analysis by putting a question mark after the goodness metrics to indicate that they may not be valid. A sketch of the cancel-and-restart strategy is shown below.
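The cancel-and-restart strategy (#4) for system-initiated analyses might look roughly like the following. quill itself was written for JDK 1.3; this sketch uses the later java.util.concurrent API for brevity, and the names are illustrative.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class BackgroundAnalyzer {
        private final ExecutorService worker = Executors.newSingleThreadExecutor();
        private Future<?> current;

        // Called whenever the gesture set changes.
        synchronized void gestureSetChanged() {
            if (current != null) {
                current.cancel(true);            // cancel analysis made stale by the edit
            }
            current = worker.submit(this::analyze); // ...and automatically restart it
        }

        private void analyze() {
            // Recompute goodness metrics and notices; check for interruption
            // periodically so a canceled analysis stops promptly.
            if (Thread.currentThread().isInterrupted()) return;
            // ... long-running similarity and recognition analysis ...
        }
    }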

QUILL EVALUATION

A human factors experiment was performed to evaluate the usability of quill and to determine whether quill helps designers produce measurably better gestures. The experiment had two goals: 1) to observe designers using quill and qualitatively assess their usage and experiences; and 2) to determine whether gesture sets designed with the aid of quill's active feedback are measurably better than gesture sets designed without it. We judged gesture set quality in terms of how confusable the gestures would be for the recognizer and how similar people would perceive them to be, based on our human similarity metrics.

Participants

We wanted to recruit only professional designers. We chose to carry out the experiment at the Microsoft Usability Lab because they were able to recruit the participants we wanted. We had 11 participants, 10 of whom were professional user interface designers and one of whom was a professional web designer. Each participant was given a $50 gift certificate for participating. (Participants chose between a gift certificate from Eddie Bauer or from Tower Records and Video; this type of compensation is standard at the Microsoft Usability Lab.)

Equipment

Participants used quill primarily with a Wacom PL-300 display tablet, although they also used the keyboard. A scan converter and VCR were used to record the screen, and a Hi-8 camera on a tripod recorded the participants.

Procedure

The experimental procedure consisted of four parts. The first part was to read the quill tutorial and perform the tasks listed in it. Participants retained the tutorial for the remainder of the experiment.

Next were the two experimental tasks: a long gesture design task and a short gesture design task. For each participant, the active feedback in quill was enabled for one task and disabled for the other. The order of tasks and the feedback condition were randomized so that participants were evenly divided among the conditions (i.e., a Latin Square design). Participants were given 1.5 hours for the long task and 1 hour for the short task. Although each participant performed both tasks, this was not a true within-subjects design, because the tasks were not of equal difficulty or length, so performance across them could not be compared.

After the second experimental task, participants filled out an online questionnaire using a different computer than the one used for the other tasks (to avoid the negative effects suggested by [16]). At the end, participants were given a post-experiment handout that gave more details about quill and the experiment, and the experimenter answered any remaining questions they had.

Quantitative Results and Analysis

The quantitative data from one participant, #4, had to be removed from consideration because he did not follow instructions. He discovered a flaw in the human goodness metric that inflated its value when a very small number of training examples was entered (3–4 per gesture category). The experimenter instructed him to enter the same number of examples as the other participants for consistency, but he did not.

Three dependent variables were measured for each task for each participant: 1) the human goodness of the gesture set; 2) the recognizer goodness of the set; and 3) the time to finish the task. A multivariate ANOVA was performed to find significant differences due to a) feedback, b) order, and/or c) task (long vs. short).

Few statistically significant effects were found. The long task took participants significantly longer to perform than the short task (51.1 vs. 22.7 minutes, p < 0.0092) and had significantly lower human goodness (690 vs. 940, p < 0.011), with or without active feedback. Also, when the long and short tasks were considered together, there was a strong interaction between task order and the feedback condition: for tasks performed first, human goodness was higher with feedback than without, but when the second task had feedback, it had slightly worse human goodness than when it did not.

The effect of feedback alone was not statistically significant, and was positive in some cases and negative in others. In the long task, it had a positive effect on human goodness overall, but not for those who had feedback on their second task. In the short task it had a positive effect on human goodness for the first task, and no effect for the second (because human goodness was at the maximum regardless of feedback). Recognizer goodness was slightly worse with feedback in the long task, but slightly better with feedback in the short task.

Some answers from the post-experiment questionnaire were correlated with performance (i.e., human goodness, recognizer goodness, and time). On a scale of "terrible" (1) to "wonderful" (9), participants rated quill 5.9, and this rating correlates with the human goodness of the produced gesture sets (r = 0.77, p < 0.010). On a scale of "inadequate power" (1) to "adequate power" (9), participants rated quill 6.4, which correlates with human goodness (r = 0.902, p < 0.00036). Recognizer goodness correlates significantly with several questions about the long task (r = 0.76 or better, p < 0.011 or better), but with only one about the short task (r = 0.75, p < 0.033). This disparity is probably due to the nearly uniform high scores on the short task.

Interestingly, younger people created gestures with significantly higher human goodness than older people (r = –0.83, p < 0.003). Human goodness is based on similarity metrics whose data was mostly acquired from college students. The negative correlation of age with human goodness in the current experiment may indicate an age effect in the similarity metric.

Qualitative Results

This section describes common themes derived from comments participants made during the experiment and on the post-experiment questionnaire.

Participants had many comments about the suggestions quill gave about how to improve their gestures. Some participants read, understood, and used the feedback productively. However, some participants disagreed with some of the feedback, especially for gestures that were letters or contained letters, probably because the human similarity metric is derived from non-letter data. Other participants did not understand what quill was trying to tell them. In describing what was wrong with the gestures, quill used some mathematical terms, which prompted comments from some participants that the feedback was "too technical" and "That's not too good for humans; [it] might be good for mathematicians." quill used diagrams to explain what the technical terms meant, but a few participants said that it would have been more useful had it been more visual. One participant suggested explaining the geometric features (e.g., size of bounding box and sine of initial angle) using the actual gestures the user was working on, rather than stock images: "The error/suggestion text could potentially show my problem gesture with the error highlighted or overlaid."

The naming of objects in quill (i.e., gesture set, gesture group, gesture category, gesture) confused a few participants. One said, "Having 'gesture' in all the names makes the names difficult to distinguish." Some participants seemed unclear on the difference between training sets and test sets. The tutorial discusses test sets and gives an example of their use, but this explanation was clearly inadequate for some participants. One participant commented, "Test Set and Training Set are too similar—probably because there are always tests in training courses and tests are a part of training."

Participants had several suggestions about how to improve the quill interface:

• When something is selected in the tree, display it on the desktop automatically. Currently, the user creates new views of objects explicitly with the menu item "View/New View". We discussed this feature during the design of quill, but decided that explicitly creating views was better, which may not have been the right decision in light of the experimental results.
• Add right-button context menus to operate on objects.
• Support undo/redo.
• Synchronize selection between the tree and the desktop. (Currently, there is an active selection in the tree or the desktop, but not both.)
• Add a time stamp to the suggestions, so that it is easier to tell which are current.

A few participants wanted more information on or support for a larger context. For example, one participant wanted more information about the application in which the gestures would be used. Another wanted to be able to run the application for which the gestures were being designed and try out the gestures in context. She believed it would help her discover whether the gestures would be memorable and whether they made sense.

Discussion

gdt provided a small amount of help to designers in figuring out how to make their gestures easier for the computer to recognize, but its tables of numbers were incomprehensible to most users. The suggestions in quill were designed to be more accessible by putting the information in English and supplementing it with pictures. In spite of this effort, many participants did not understand the suggestions, and the suggestions did not consistently help participants make better gesture sets. We believe the suggestions can be made more accessible by making the language less technical, by using more diagrams, and by using dynamically generated diagrams based on the actual gestures entered by the designer.

We also believe that more training would help designers get the most value from quill. The tutorial used for the experiment did not cover all the features in quill, and those it did include were not covered in depth. We kept the tutorial short to keep the length of the experiment as short as possible. (The long task took some participants nearly ninety minutes, and a few participants took over 2.5 hours for the whole experiment.) In retrospect, participants may have been able to use quill more effectively if they had gone through longer training. In particular, some participants clearly could have benefited from more time spent on testing vs. training sets and on a step-by-step example of how to use quill's suggestions. A professional technical writer helped edit an early version of the tutorial; nevertheless, most participants did not read it carefully or absorb it well, so it is unclear how much a longer tutorial would have helped.

Overall, the effect of feedback was mixed and in no case statistically significant. For the first task, feedback had a positive effect, but when feedback was provided in the second task, it had no effect or a negative effect. There was great variance among participants in their understanding of quill's feedback. Although it seemed more approachable than the tables and graph in gdt, it was still too technical for many participants.

Several subjective judgements about quill that were collected in the post-experiment questionnaire correlated with performance measures. For the long task, higher recognizer goodness correlated with thinking the task was clear, with finding it easy to enter new gesture categories, and with finding it easy to test the recognizability of gestures. Also, people with a higher overall reaction to quill and people who thought quill had adequate power created gestures with higher human goodness. These correlations are not surprising, since we would expect people who do better on a task to think it is easier. What is surprising is that people who took longer to perform the long task also tended to think that finding recognition problems was easy; we might expect people who easily find recognition problems to finish the task more quickly.

A problem with the human similarity suggestions is that they were not always right. The models used in quill to predict human similarity are not perfect, and participants rightly disagreed with them at times. quill seemed especially prone to overestimating similarity when a gesture was or contained a letter. This flaw is probably due to the similarity models being based entirely on non-letter gestures. It seems likely that people perceive letter and non-letter shapes differently: a small difference between two non-letter shapes might be perceived by people as a large difference between two letters. Ideally, quill would have separate similarity models for letter and non-letter gestures. Future experiments to measure the similarity of letters are required to provide models for letter similarity.

Naming has been an issue with quill since the first version of the tutorial. There was much discussion with the professional technical writer who helped edit the tutorial about how to name the different gesture containers. The present scheme was the best we could create, but it still confused some participants. A diagram of the relationships between the different containers might be helpful, although it would lengthen the tutorial.

There is a long list of features that quill was designed to have, many of which have not yet been implemented. Most of the suggestions for improvement that participants made were already on the list of desired features, but there were a few new ones, such as timestamping notices. Also, some issues were illuminated by participants' comments. For example, we were unsure how best to handle selection across the tree and desktop, but based on this evaluation it seems clear that the two should be synchronized.

It is difficult to determine the ideal length for an evaluation of quill. On one hand, quill is not intended to be walk-up-and-use, but rather to be a tool like Photoshop, which designers use extensively and with which they become familiar. From that standpoint, an experiment such as ours, in which participants spend approximately twenty minutes learning the tool, is unrealistic. On the other hand, it is difficult to perform an experiment with expert users of a research system. Ideally, the experiment would include a longer training session so that participants could learn the tool better. Unfortunately, the longer the experiment is, the more difficult it is to recruit participants, especially when the best participants are highly trained professionals.

Experiment length and the difficulty of recruiting participants were also the primary reasons we chose tasks of differing length. We made one task shorter than the other because we did not want to increase the length of the experiment, and we did not make both tasks short because we were concerned that the short task was not long enough. Participants scored extremely high on the short task, so we believe it was too easy. In contrast, the scores on the long task did vary, so it is probably an appropriate length. Unfortunately, by making the tasks different lengths, and therefore different difficulties, we precluded within-subjects analyses.

Overall, active feedback was helpful to some participants, but not to others. The informal feedback we received from participants suggests that this approach is valuable, but hard to implement well.

FUTURE WORK

We have many ideas for ways to extend quill. A few of them are:

• Support a database of known gestures (e.g., copy-edit marks) against which quill can compare new gestures.
• Collect more data about human-perceived similarity to improve the similarity models, especially for letters.
• Partially automate the repair of human similarity and recognition problems by morphing gestures.
• Add features to the recognizer that were discovered to be important for human-perceived gesture similarity. These features could make the recognizer behave more intuitively.
• Use a different recognition technology with quill, for example to support multi-stroke gestures or to improve recognition accuracy.
• Show designers advice with drawings that include their own gestures, rather than static ones.

For more discussion of future directions, see [6].

CONCLUSIONS

We built the first intelligent tool for gesture design, quill, which helps designers improve the computer recognition of their gestures. Our evaluation of quill showed that advice can help designers improve their gestures. This work enables designers to create better gestures for pen-based user interfaces, and it allows a wider group of designers to create good gestures for pen-based UIs. Improving gestures and making gesture design more widely accessible is important because gestures are a powerful interaction technique, especially for pen-based user interfaces.

ACKNOWLEDGEMENTS

The authors thank the other members of the Group for User Interface Research for their feedback on the design of quill and the user study, the Microsoft Usability Lab for their help running the study, and the study participants.

REFERENCES

[1] Briggs, R., Dennis, A., Beck, B., and Nunamaker, J. Whither the pen-based interface? Journal of Management Information Systems, 9(3):71–90, 1992–1993.
[2] Frankish, C., Hull, R., and Morgan, P. Recognition accuracy and user acceptance of pen interfaces. In Human Factors in Computing Systems (SIGCHI Proceedings), pages 503–510. ACM, Addison-Wesley, April 1995.
[3] Hanne, K. and Bullinger, H. Multimedia Interface Design, chapter 8, pages 127–138. ACM Press, 1992.
[4] Kruskal, J. B. and Wish, M. Multidimensional Scaling. Number 07-011 in Quantitative Applications in the Social Sciences. Sage Publications, Beverly Hills, CA, 1978.
[5] Lipscomb, J. A trainable gesture recognizer. Pattern Recognition, 24(9):895–907, September 1991.
[6] Long, Jr., A. C. Quill: A Gesture Design Tool for Pen-based User Interfaces. PhD dissertation, University of California at Berkeley, Berkeley, CA, December 2001. Available at http://guir.berkeley.edu/pubs/quill/dissertation.pdf.
[7] Long, Jr., A. C., Landay, J. A., and Rowe, L. A. PDA and gesture use in practice: Insights for designers of pen-based user interfaces. Technical Report UCB//CSD-97-976, U.C. Berkeley, 1997. Available at http://bmrc.berkeley.edu/papers/1997/142/142.html.
[8] Long, Jr., A. C., Landay, J. A., and Rowe, L. A. Implications for a gesture design tool. CHI 1999, ACM Conference on Human Factors in Computing Systems, CHI Letters, 1(1):40–47.
[9] Long, Jr., A. C., Landay, J. A., Rowe, L. A., and Michiels, J. Visual similarity of pen gestures. CHI 2000, ACM Conference on Human Factors in Computing Systems, CHI Letters, 2(1):360–367.
[10] Luening, E. Study charts sharp rise in handheld sales. CNET, January 2001. http://news.cnet.com/news/0-1006-200-4601431.html.
[11] Meyer, A. Pen computing. SIGCHI Bulletin, 27(3):46–90, July 1995.
[12] Morrel-Samuels, P. Clarifying the distinction between lexical and gestural commands. International Journal of Man-Machine Studies, 32:581–590, 1990.
[13] Opera Software. Mouse gestures in Opera, 2001. Available at http://www.opera.com/windows/mouse.html.
[14] Pier, K. and Landay, J. A. Issues for location-independent interfaces. Technical Report ISTL92-4, Xerox Palo Alto Research Center, December 1992.
[15] Pilone, M. KGesture, 2001. Available at http://www.slac.com/%7Empilone/projects/.
[16] Reeves, B. and Nass, C. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. CSLI Publications and Cambridge University Press, 1996.
[17] Rettig, M. Prototyping for tiny fingers. Communications of the ACM, 37(4):21–27, April 1994.
[18] Rubine, D. Specifying gestures by example. In Computer Graphics (SIGGRAPH), pages 329–337. ACM SIGGRAPH, Addison-Wesley, July 1991.
[19] Sensiva, Inc. Sensiva product brochure, 2001. Available at http://www.sensiva.com/.
[20] Tapia, M. and Kurtenbach, G. Some design refinements and principles on the appearance and behavior of marking menus. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), pages 189–195. ACM, November 1995.
[21] Wolf, C., Rhyne, J., and Ellozy, H. The paper-like interface. In Gavriel Salvendy and Michael Smith, editors, Designing and Using Human-Computer Interfaces and Knowledge Based Systems, volume 12B of Advances in Human Factors/Ergonomics, pages 494–501. Elsevier, September 1989.
