SenseWeb: Collaborative Image Classification in a Multi-User Interaction Environment

Roberto Lopez-Gulliver, Hiroko Tochigi, Tomohiro Sato, Masami Suzuki, Norihiro Hagita
ATR Media Information Science Research Labs
2-2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, 619-0288, JAPAN
+81-774-95-1401
{gulliver,tochigi,tomohiro,msuzuki,hagita}@atr.jp
ABSTRACT
The SenseWeb system is a multi-user interactive information environment aimed at supporting the sharing of experiences and collaboration among multiple users, allowing them to simultaneously interact with digital multimedia elements using their bare hands. By sharing a common information space on a large screen, users can cooperatively search, filter, classify, and interact with multimedia data in a natural and intuitive way. This paper introduces the system as well as preliminary experiment results assessing its effectiveness as a simultaneous multi-user interaction environment in collaborative image classification tasks. The results confirm our expectations of improvements in task completion time, ease of use, and user satisfaction over multi-user but one-at-a-time interaction scenarios.
Figure 1: SenseWeb: A hands-free multi-user interaction environment
Categories and Subject Descriptors H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—artificial, augmented, and virtual realities; H.5.2 [Information Interfaces and Presentation]: User Interfaces— Interaction styles; I.3.6 [Computer Graphics]: Methodology and Techniques—Interaction techniques
General Terms Human Factors, Algorithms, Theory
Keywords multi-user, collaborative interaction, touch-based interface, large screen display
1. INTRODUCTION
In recent years, our lives have changed drastically with the amount of data we need to deal with. We now face the problem of classifying, filtering, visualizing, and making sense of all this data in a sensible manner.
Most of this data is organized in databases for easy query and retrieval, allowing us to browse and interact with its contents[1, 2]. However, discussion, brainstorming, or collaboration situations require multiple users to query, retrieve, classify, and filter data from large databases. In most current scenarios, only one person at a time is in charge of directly interacting with the data presented. Taking turns, or allowing only one person to interact with the data, often constrains the generation of ideas, thus disturbing the discussion at hand[3].
This paper introduces a large hands-free multi-user interaction environment, named SenseWeb, aimed at supporting discussions and the sharing of experiences among users in a collaborative way. It enables users to simultaneously interact with digital data for browsing, discussing, filtering, and classifying information. These features provide intuitive and easy-to-use collaboration capabilities that could not be achieved with traditional single-user, mouse-and-keyboard based interaction. The following sections describe the proposed multi-user interaction environment as well as preliminary results on the effectiveness of its multi-user functionality.
2. SENSEWEB

Figure 2: Browsing images from the Internet: Multiple users can zoom, drag, bookmark, and "AND" image searches by touching them with their hands.

Figure 3: Screen shot of the image classification application used

The SenseWeb system[4] was conceived and designed to support multiple users in sharing experiences by discussing information of common interest, while allowing each of them to interact with the multimedia data at any time, regardless of other users' interactions. Sharing the same data space, ease of use, and intuitiveness were also essential parts of the design. To this end, a large hands-free multi-user interactive environment was prototyped, as shown in Fig. 1. Users can use their bare hands to touch and interact with digital images and video. The system consists of a large rear-projection screen made interactive by tracking the users' hands: while they touch the screen, their infrared "shadow" video image is analyzed. A black-and-white camera with an infrared filter placed behind the screen, together with an image processing algorithm, serves this purpose.
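The paper does not detail the tracking algorithm itself. As a minimal illustrative sketch, assuming OpenCV and a generic IR-filtered camera, shadow-based touch detection could amount to thresholding each infrared frame and treating sufficiently large blobs as touch points. Whether touching hands appear brighter or darker than the background depends on the lighting setup; the sketch below assumes brighter. All thresholds and the camera index are assumptions, not values from the paper.

```python
# Illustrative sketch of infrared "shadow" touch detection with OpenCV.
# Thresholds and camera index are assumptions, not values from the paper.
import cv2

def detect_touches(frame, threshold=200, min_area=150):
    """Return (x, y) centroids of bright IR blobs, i.e. candidate touches."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Assumption: hands touching the screen reflect more IR light
    # (from the halogen lamps) than the background does.
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    touches = []
    for c in contours:
        if cv2.contourArea(c) < min_area:   # ignore small noise blobs
            continue
        m = cv2.moments(c)
        touches.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return touches

cap = cv2.VideoCapture(0)          # IR-filtered camera behind the screen
while True:
    ok, frame = cap.read()
    if not ok:
        break
    for x, y in detect_touches(frame):
        print(f"touch at ({x:.0f}, {y:.0f})")  # feed into the interaction layer
```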
2.1 Application and main features
The very first application of the SenseWeb system was to allow multiple users to discuss common topics of interest by browsing and interacting with images and sounds found on the Internet. Mouse and keyboard were replaced with hands and voice. A typical interaction scenario is as follows: any of the users says a keyword related to the topic of discussion, and the utterance is recognized by the system via a voice recognition module. This keyword is then used to query existing image and sound search engines on the Internet for related data. The keyword then appears as an interactive keyword icon floating upwards on the screen. Users can touch these keyword icons, triggering the download and display of more related images, as well as sounds. These related images are displayed in a fireworks fashion, flying outwards from the keyword icon at their center. Users can then interact with these images in different ways. For example, by touching an icon with two hands, users can zoom and drag it to any other part of the screen to show it to other users. Users can also discard uninteresting images by "sweeping" them away with a single hand. Users can select and bookmark images of common interest by dragging them to the bookmark column on the right of the screen. Furthermore, they can narrow the image search by means of a logical "AND" operation, by simply bringing two keyword icons together, as shown at the bottom of Fig. 2. In this way, users can support their topic of discussion by simultaneously bringing in and interacting with data found on the Internet. It is worth noting that the system can use any specific database of images, sounds, videos, etc., instead of the Internet, thus customizing the information presented.
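The paper does not specify how the "AND" narrowing is implemented; intersecting the two keywords' result sets is one plausible reading. The sketch below illustrates that reading; `search_images` is a hypothetical stand-in for the search-engine query module and returns canned data here.

```python
# Hypothetical sketch of the logical "AND" narrowing: when two keyword
# icons are brought together, both keywords are queried and only results
# common to both are kept. Set intersection is an assumed interpretation.
def search_images(keyword):
    """Placeholder for an image search engine query: keyword -> set of URLs."""
    canned = {
        "kyoto":  {"img/kyoto1.jpg", "img/temple2.jpg", "img/garden3.jpg"},
        "temple": {"img/temple2.jpg", "img/temple9.jpg"},
    }
    return canned.get(keyword, set())

def and_search(keyword_a, keyword_b):
    """Narrow the search: keep only images returned for both keywords."""
    return search_images(keyword_a) & search_images(keyword_b)

print(and_search("kyoto", "temple"))   # -> {'img/temple2.jpg'}
```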
2.2 Related work
Concerning the multi-user capabilities of SenseWeb, some definitions are necessary. In this paper, by "multi-user" we mean any system where multiple users can take part in the interaction, passively or actively. By "single-point" we mean a multi-user system where only one user at a time can actively interact with the data, possibly taking turns. By "multi-point", on the other hand, we mean a multi-user system where multiple users can actively interact with the data simultaneously. It is in this regard, "multi-point", that we evaluated the system. Similar multi-point systems include the following. SmartSkin[5] is an interactive table capable of tracking the users' hands and fingers using capacitive sensors embedded in the table, with overhead projection onto the table for display. Even though it allows a wide range of new interaction techniques, it does not scale well in size or ease of deployment to other environments. With similar advantages and disadvantages, there is another table-based multi-point system, called DiamondTouch[6]. It adds the capability of identifying users, and can use this information to present different views of the data according to the users currently touching it. A wall-based system, described in [7], deployed on the wall of a busy street, acts as a luminous advertisement and reacts to people approaching it, changing the displayed information in a rather artistic way. Although interesting, that system is not aimed at serious applications supporting discussion or collaboration. The main difference between our approach and the systems described above is how well ours scales in size and its ease of deployment. All that is needed is an extra camera, common halogen lamps as an infrared light source, and a piece of software to make any rear-projection screen interactive and multi-user capable.
3. MULTI-POINT EVALUATION
3.1 Experiment setup
3.1.1 Goal, subjects and application
The goal of the experiment is to assess the effectiveness of the multi-point capabilities of the SenseWeb system. To this end, a collaborative image classification task was designed in two variants, both using the SenseWeb system. The only difference between the variants is that one was single-point while the other was multi-point capable. We purposely modified the original SenseWeb system to constrain it to single-point interaction. The subjects were 20 university students, male and female, grouped into pairs; the members of a pair did not know each other in advance. An image classification application was set up, in which users use their hands to classify images, displayed on the SenseWeb screen, from previously assigned image databases of 100 images each. To encourage users to touch the screen as much as possible, we prepared a monochrome version of each original color image. Image icons endlessly fall from the top of the screen; while a user touches an icon, it is zoomed and shown in color (see Fig. 3). Users can drag or throw the image icons for discussion, or place them in the "selection row" at the bottom of the screen. Icons in the selection row can be dragged out and exchanged with others if the need arises. A sketch of this icon behavior follows below.
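As a concrete illustration of the per-frame icon behavior just described, here is a hedged sketch: icons drift down the screen in monochrome and turn color and zoom only while touched. All names and constants are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of the classification application's icon behavior:
# fall when untouched; show in color and zoomed while touched.
from dataclasses import dataclass

FALL_SPEED = 40.0    # pixels per second (assumed)
ZOOM_FACTOR = 2.0    # zoom applied while an icon is touched (assumed)

@dataclass
class ImageIcon:
    x: float
    y: float
    touched: bool = False

    def update(self, dt, active_touches, touch_radius=50.0):
        """Advance one frame: react to nearby touch points, else keep falling."""
        self.touched = any((self.x - tx) ** 2 + (self.y - ty) ** 2
                           < touch_radius ** 2
                           for tx, ty in active_touches)
        if not self.touched:
            self.y += FALL_SPEED * dt    # drift down while untouched

    def render_params(self):
        """Color and zoomed while touched; monochrome otherwise."""
        return {"color": self.touched,
                "scale": ZOOM_FACTOR if self.touched else 1.0}
```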
3.1.2 Method and tasks
Each pair of subjects is presented with the application described above to collaboratively classify the images. Concretely, for each assigned task, the two users must jointly agree on and select 8 images from the 100 in that task's database. There are 3 tasks with two variants each, single-point and multi-point, for a total of 6. Once the users have jointly agreed upon the 8 selected images, the task is considered finished. Tasks taking longer than 5 minutes were stopped even without completion. All tasks were video recorded, and each task's completion time and the complete hand-interaction data were logged to a file for later processing. As soon as the users finish all 3 tasks, they are asked to answer a questionnaire with 1-5 ratings comparing the two modes, single-point and multi-point, in terms of usability and user satisfaction. In the multi-point mode, both users can simultaneously manipulate the images, even using both hands if needed. In the single-point mode, only one user at a time can manipulate the images, taking turns if needed: whoever touches the screen first takes precedence over any other touches (a sketch of this arbitration rule follows below). All three tasks required the users to select 8 images from different databases and put them in the selection row at the bottom. The tasks and their respective databases (in parentheses) are as follows: 1) select 2 images for each season, expressing "The Change of the Seasons" (Seasons); 2) select 4 pairs of frogs, each pair with a different color (Frogs and Lizards); 3) select two groups of 4 herbivorous and 4 carnivorous animals to make a poster expressing "The most representative animals in our Zoo" (Animals).
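The single-point precedence rule lends itself to a small sketch: the first touch locks interaction until it is released, and all other touches are ignored in the meantime. This is an illustrative reconstruction of that rule, not the paper's actual code; everything beyond the stated rule is an assumption.

```python
# Sketch of the single-point arbitration rule used in the constrained
# variant: the first touch takes precedence; others are ignored until
# it lifts. Data structures here are assumptions.
class SinglePointArbiter:
    def __init__(self):
        self.owner = None    # id of the touch currently holding the lock

    def filter(self, touches):
        """Given {touch_id: (x, y)} in arrival order, keep only the owner."""
        if self.owner not in touches:
            # Previous owner lifted (or no owner yet): earliest touch wins.
            self.owner = next(iter(touches), None)
        if self.owner is None:
            return {}
        return {self.owner: touches[self.owner]}

# Usage: active = SinglePointArbiter().filter({7: (120.0, 340.0)})
```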
3.2 Experiment results and discussion
The graph at the top of Fig. 4 shows the subjects' answers to the question "Which system do you feel is easier to use?", on a scale from 1 (single-point) to 5 (multi-point). 19 out of 20 users regarded the multi-point system as easier to use. In free-form comments explaining their decision, most of the users highly appreciated the dual capability of the multi-point mode to work either in parallel or synchronously. The comments of the only user who preferred the single-point mode are interesting. For example: "Being both able to select at the same time created more confusion than it helped in selecting more images in less time. In single-point mode we could calmly discuss which image to select and only then proceed to the selection." However, this user's partner preferred the multi-point mode, with comments similar to those of most users in the experiment.
Figure 4: Comparing single- and multi-point results: ease of use (top) and average task completion times (bottom)
This user's comments, though, gave us some indication that for some people the multi-point mode could be more confusing than helpful. The graph at the bottom of Fig. 4 shows the average completion times over all users in the two modes. In this respect, multi-point completion times were slightly shorter than in single-point mode. The standard deviations from the averages, shown as black vertical bars in the graph, can be attributed to each pair's different levels of manipulation skill, as well as to how comfortable each user feels collaborating with somebody he/she meets for the first time. In order to evaluate user satisfaction, users were asked to answer 8 questions comparing the single-point and multi-point modes, on a scale from "NO" (1 point) to "YES" (5 points). Average results and Student's t-test probabilities comparing the two means are shown in Fig. 5. That multi-point is more effective can be drawn from Q1, Q2 and Q8 (p