In Proceedings of SPIE Electronic Imaging 2000: Internet Imaging, San Jose, CA, USA, January 26-28, 2000, pp. 48-56.
Browsing Through Rapid-Fire Imaging: Requirements and Industry Initiatives

Kent Wittenburg, Carlos Chiyoda, Michael Heinrichs, Tom Lanning
GTE Laboratories, Waltham, MA 02451
{kentw, cchiyoda, mheinrichs, tlanning}@gte.com

ABSTRACT

It is well established that humans possess cognitive abilities to process images extremely rapidly. At GTE Laboratories we have been experimenting with Web-based browsing interfaces that take advantage of this human facility. We have prototyped a number of browsing applications in different domains that offer the advantages of high interactivity and visual engagement. Our hypothesis, confirmed by user evaluations and a pilot experiment, is that many users will be drawn to interfaces that provide rapid presentation of images for browsing tasks in many contexts, among them online shopping, multimedia title selection, and people directories. In this paper we present our application prototypes using a system called PolyNav™ and discuss the imaging requirements for applications like these. We also raise the suggestion that if the Web industry at large standardized on an XML format for meta-content that included images, then the possibility exists that rapid-fire image browsing could become a standard part of the Web experience for content selection in a variety of domains.

Keywords: visualization, navigation, cognition, browsing, images, XML
1. INTRODUCTION

It is well established that humans possess cognitive abilities to process images extremely rapidly. Prior experiments have shown that certain tasks can be accomplished when images are flashed to users for only 200 milliseconds or less. Healey et al. summarize prior work in cognitive science in the area of pre-attentive processing.1 The human visual field is organized on the basis of cognitive operations that are rapid, automatic, and parallel. Certain information that does not require focused attention can be extracted in a single glimpse of 200 milliseconds or less. Features that can be processed preattentively include line orientation, 3D depth cues, lighting direction, and others. In the area of video browsing, Marchionini and colleagues have conducted a variety of experiments that confirm that subjects can extract gist information from key video frames being flashed to them in a slide-show format at the rate of up to 8 frames a second.2 Object identification is also possible.

At GTE Laboratories we have been experimenting with Web-based browsing interfaces that take advantage of this human facility. We have prototyped a number of browsing applications that offer the advantages of high interactivity and visual engagement. These prototypes follow earlier work that examined image-based browsing in the Web context more generally.3,4 We have also done some evaluation of our prototypes, which we overview here.

Our position is that rapid-fire image-based browsing is in effect a new medium for content publishing. Existing media players do not satisfy its requirements. Nor do conventional HTML designs. However, there are several implementation platforms on which rapid-fire image browsers may be implemented such as Java, VRML, and Flash. We present here a common set of requirements for image-based browsing after overviewing prototypes in three content areas: Web-based shopping, video selection, and people directories.
2. POLYNAV™ PROTOTYPES PolyNav™ is our name for a set of tools developed at GTE Laboratories that provide a platform for building rapid-fire image-based browsing applications. PolyNav™ has been used to build browsing prototypes in a number of domains in which images play a role in revealing meta-information about content. The applications have ranged from those in which the content itself is images (such as family picture albums and other image collections) to those in which images may be one of a number of surrogates for the visual content (video) to those in which images may be a less direct but still useful adjunct to content (products in catalogs, people directories). We were actually inspired originally by the interesting hypothesis put forth by J. Helfman that image-based browsing might apply to Web pages in general.5 In what follows we will overview prototypes from three more restricted domains: video title selection, Web-based “window-shopping”, and people directories.
2.1 Video title selection One domain in which we have prototyped rapid-fire image browsing is entertainment video. As is commonly recognized in the video browsing research community, key frames extracted from video content provide one important source for supporting video browsing and title selection. In an IP video entertainment prototype (SeeHere) that includes selection, delivery, and billing, we have incorporated a technique called dynamic key-frame collages for browsing candidate video titles.6 Prior work had uncovered some user problems with fixed-speed slide-shows as well as with storyboards. Dynamic key-frame collages built with PolyNav™ were designed to address these problems. The technique combines temporal and spatial resources by flashing a sequence of extracted key frames in a circular, partially overlapping layout. The images belonging to a particular title remain visible until either they are occluded by others from the same title or another title is introduced in the sequence. We hypothesize that using spatial and temporal resources in this way can help contribute to visual gist comprehension through grouping a set of related images. When compared to slideshow presentations, each image is visible longer, and a set of images visible at the same time can reinforce a particular cinematographic style. Users can focus on one general area of the screen and do not have to spend time and cognitive energy scrolling.
Figure 1: Title-skimming in a video entertainment prototype.

Our prototype also addresses the issue of when to use such rapid-fire image browsing techniques in the context of a broader multimedia information-seeking dialog. Our design rationale suggests that dynamic key frame collages are most appropriate when the size of a set of candidate videos is on the order of 15 titles. Thus it makes sense to use the technique as a way of browsing result sets that may be constrained by earlier query or category selection actions on the part of the user. SeeHere thus includes three “modes”: query/search, title skimming, and detail. Figure 1 shows SeeHere in title skimming mode, in which the user has narrowed down a working set of video titles for further consideration. The list of working titles is visible on the left. On the right is a Java PolyNav™ client that allows users to quickly skim the title list through a dynamic presentation of key frames. When a user places the mouse pointer above one of the control buttons at the bottom of the image display area, a presentation begins in which a title graphic for each movie is shown in the center. Key frames from that title then appear around the title in rapid succession. Speed is increased by moving the mouse cursor away from the horizontal center. Buttons closest to the center direct the player to the slowest speed; those furthest away, to the fastest. Awareness of the current title being shown is reinforced by highlighting the name of the corresponding title in the list on the left. Users may randomly access different titles in the display sequence by clicking on the text or one of the key frames. That puts the user into detail mode, in which further text is available on that particular title as well as the option to play a trailer.
2.2 Window-shopping One of the domains of greatest interest to us has been Web shopping. Indications are that this industry is skyrocketing, and yet our view is that shopping as it is known on the Web today provides little of the pleasurable experiences associated with browsing through brick-and-mortar stores or their environs. An element that is singularly lacking is the sensory stimulation that shoppers experience when, say, window-shopping. (Other elements are of course missing as well, such as social interactions.) A strong motivation of our efforts to prototype new shopping applications is to introduce some elements of visual sensory engagement and enjoyment into the experience. Our hypothesis is that one way the visual senses can be engaged in a Web shopping experience is to let users closely control the pacing and direction of a series of images that are presented to them. The particular sequence would be selected through other means such as search or category selection in the application at large. Of course the quality of the images themselves and their design will determine largely whether users find the experience engaging and interesting. But there is also something inherently intriguing, in our view, about watching a rapid series of images being flashed to you whose pacing is under your control. You can see elements of such designs in much of the MTV-generation video content. Figure 2 shows a screen shot of one of our implementations of Web-based window-shopping. The user has selected a product category to peruse, in this case books. Images from a set of products that are showcased for this category are then presented under the user's control. The products within the category are grouped by Web merchant. The particular image shown in Figure 2 represents a Web store whose featured products would follow on successive screens. The speed and direction are controlled by mouse rollover on the controls at the bottom-right area.
Here there are just two speeds in each direction. As the presentation proceeds, text at the right side of the display presents some context information for the products. This implementation was done using Macromedia Flash.
Figure 2: Flash implementation of window-shopping. Figure 3 shows another variant of Web window-shopping implemented in Java. The layout option chosen here makes use of the same dynamic collage technique discussed above in the context of video browsing. There may be some overlap and occlusion in this spatial-temporal layout. Such a design variant may provide extra visual interest and engagement, but we have also found that it may risk visual clutter. Our client may optionally show just one image at a time slide-show style by changing an applet parameter. In this example the user has selected a holiday occasion as a category and is perusing featured specials across a set of stores that have been targeted by merchants for that occasion. The speed controls at the bottom of the display are again controlled by mouse rollover. In this design variant, the speed controls feature a continuous range between some minimum and maximum that is also settable with an applet parameter.
Figure 3: Java applet implementation of window-shopping. Another PolyNav™ client was implemented in VRML (see Figure 4). As with the other PolyNav™ clients, rolling the mouse cursor over or off the control arrows causes the presentation of images to start or stop. Placement of the mouse cursor towards the tips of the arrows increases the speed; placement towards the tails of the arrows decreases speed. However, the VRML model of traversing a set of images is significantly different from the other examples we have discussed here. Instead of the images being painted on a two-dimensional surface, and possibly overwritten with succeeding images, the VRML presentation model incorporates movement of the images. The controls cause the images to move back or forward in a three-dimensional space as if the images were flowing toward or away from the viewer. Images always remain perpendicular to the viewing plane. Grouping of related images is done by placing them in the same vertical plane. These controls are highly constrained compared to the usual VRML designs that allow the user to “fly through” a three-dimensional space with as many as eight degrees of freedom. Conceptually, we relate our approach to 3D information navigation to an aspect of real-world navigation practiced in Micronesia and Polynesia.3,4 Classic Micronesian and Polynesian sea-going navigation incorporates a mental model of traversal in which the traveler remains stationary and the “world” moves past. We have sought to implement a design in this abstract information space in which the user also feels that they are stationary and that they control the movement of the surrounding environment towards or away from them.
Figure 4: VRML implementation of window-shopping.
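The stationary-viewer traversal model can be sketched as a simple depth update: each image sits at a fixed position in its plane, and the controls change only a shared velocity along the viewing axis, so the "world" of images flows past a viewer who never moves. The class and method names below are our own illustration, not the shipped VRML client.

```java
// Sketch of the stationary-viewer traversal model used in the VRML client
// (an illustration under assumed names, not the actual implementation).
// Images keep fixed x/y positions; the rollover controls set only a shared
// z-velocity, so all images flow toward or away from the viewer in lockstep,
// always remaining perpendicular to the viewing plane.
public class FlowModel {
    private final double[] z;      // depth of each image, in scene units
    private double velocity = 0.0; // set by the controls; sign gives direction

    public FlowModel(int imageCount, double spacing) {
        z = new double[imageCount];
        for (int i = 0; i < imageCount; i++) {
            z[i] = -i * spacing;   // images recede away from the viewer
        }
    }

    public void setVelocity(double v) { velocity = v; }

    // Advance the scene by dt seconds; every image moves by the same amount.
    public void step(double dt) {
        for (int i = 0; i < z.length; i++) {
            z[i] += velocity * dt;
        }
    }

    public double depthOf(int i) { return z[i]; }
}
```

Reversing the sign of the velocity (the "tail" arrows) makes the same loop carry the images away from the viewer again, which is what keeps the control model symmetric in both directions.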
2.3 People finding and organizational browsing Recently we have deployed a pilot on the GTE Intranet that utilizes rapid-fire image-based browsing in the context of an organizational directory. Images of people in the GTE organization that are based in Waltham, Massachusetts, serve as the content, along with their place in the organizational chart. Again, the implementation shown here is our Java applet. We call the application PeopleNav. It exists as one tool in a larger context of people-finding lookup services, in which individuals can be searched through a variety of fields including name, location, organization, etc. But what if users only know a person’s face? Conventional directory lookup offers no hope of finding a person based on facial recognition alone. With PeopleNav, users can rapidly browse a set of people by organization and use their extremely rapid facial recognition cognitive abilities to pick out a face they have seen before. PeopleNav offers the speed to scan through a large number of images rapidly. The application is designed to support a perusal task as well. If, as we expect, many users find such visual browsing a pleasurable experience, then they will be motivated to peruse the faces of people in their own or related organizations. Being able to attach faces to names would be a positive step in improving the basic social interactions necessary for developing effective business organizations. A screen shot of PeopleNav is shown in Figure 5 (with fictional names other than the authors’).
Figure 5: An application of PolyNav™ for organizational people browsing. It is important to note that the data format for the PolyNav™ clients shown in Figures 1-5 is independent from any of these clients. Any of the clients could be used with the same data file. The data format consists of a simple hierarchical organization of elements, each of which is associated with an image, a text label, and an HTTP link. We have implemented a servlet-based architecture that can process the identical data file and produce the client applications shown (with the exception of the Flash implementation, which as the most recent PolyNav™ client, has not yet been integrated into the generalized architecture).
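The hierarchical data format shared by the clients might look something like the following. The paper does not publish the actual element names, so the tags below are purely our own illustration of a format in which each element carries an image, a text label, and an HTTP link, and elements nest to form groups:

```xml
<!-- Hypothetical rendering of the PolyNav data format; the element and
     attribute names here are our own invention for illustration only. -->
<collection label="Holiday Specials">
  <group label="Acme Books" href="http://www.example.com/acmebooks/">
    <item label="Cookbook of the Month"
          image="http://www.example.com/acmebooks/img/cookbook.jpg"
          href="http://www.example.com/acmebooks/cookbook.html"/>
    <item label="Travel Guide"
          image="http://www.example.com/acmebooks/img/travel.jpg"
          href="http://www.example.com/acmebooks/travel.html"/>
  </group>
</collection>
```

Because every client consumes the same structure, a servlet can emit this one file and let the Java, VRML, or slide-show client render it without any per-client content preparation.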
3. USER EVALUATIONS AND EXPERIMENTS Over the course of this work we have conducted systematic user evaluations on three occasions. They have all investigated the use of PolyNav™ in a Web shopping context. The first two were relatively informal sessions in which users were instructed to browse a set of images culled from shopping sites. The goal was to present meta-data from each site as a collection of images. The URL target for the image set was the home page of that site. User feedback was solicited, and the sessions were videotaped. Users were asked whether they liked the application; whether they had any suggestions for improving the interface; and whether they would use an application like this if they were shopping on the Internet. The first set of user evaluation sessions (approximately 12 subjects, all non-GTE employees) led to a revision of the design of the controls. Initially we used conventional buttons for the controls as are typically found on media players (stop, pause, forward, fast-forward, reverse, fast-reverse). A significant problem was that users could not stop the sequence quickly enough to allow them to click through on images that caught their attention. The typical scenario was that they would recognize an image that they wanted to click on and then begin hunting for the stop button with their mouse cursor. By the time they manipulated their cursor over the button and pressed, the image they were interested in had disappeared or been overwritten on the screen. A design revision led us to use rollover instead of click as an action to control the buttons. This design allowed users to stop and start instantaneously as well as change speed or direction. They could continue to attend to the image presentation area and still successfully utilize the controls. The graphic design of the control buttons was also revised. In a second round of videotaped user evaluations, it was confirmed that these revisions had corrected the earlier problems.
Our assessment of user comments was that the majority of users encountering rapid-fire image browsing were receptive. Many were enthusiastic. A few were not interested; they preferred “hard” information, and had no use for images at all in their Web browsing. More recently we have begun a more systematic experiment to determine whether users prefer PolyNav™ style browsing over conventional HTML alternatives. An initial pilot was run with 18 subjects in summer 1999. Three conditions were prepared, all with content that consisted of images of particular products in a variety of categories ranging from home furnishings to electronics. The first condition was PolyNav™ using the dynamic collage layout and controls evident in Figure 3. The second was a “slide-show” layout consisting of a temporal display of just one image at a time but with these same controls. The third was a set of HTML pages intended to represent the conventional design for browsing products on the Web today. It consisted of a set of linked pages with up to five products per page. “Back” and “Forward” links appeared on each page, as well as “Home” links, which returned the user to the root page for the category in question. After some discussion, we decided not to include any outline navigational help in the PolyNav™ or the HTML conditions, since we were trying to focus on image-based browsing and the possible presentations and controls for navigating through a set of images. Users were asked to do two types of tasks. The first was to spend 80 seconds browsing over a set of content. This task type was intended to represent browsing in the sense that the word is used in real-world shopping contexts. That is, users were asked to get an overview of the content. They did not have a specific target item they were seeking.
Rather, they were seeking to get a sense of what selections might be available, much as dining guests peruse a menu from start to finish to get an overview before studying a subset of individual items in more detail.7 The second type of task was scanning. They were asked to carry out such tasks as the following:
1. Please search the browser and tell me when you find the Uno cards.
2. Please search the browser and tell me when you find the basketball.
The particular items that subjects were asked to look for may or may not have been present in the samples. This task type was intended to represent one class of activity that users undertake while shopping, namely, determining whether a particular item or class of items they are seeking is or is NOT present in the collection. We also believe that users often find themselves undertaking a similar subtask when they need to go back to an item that they may have seen already in a perusal mode. At the conclusion, users were asked to rank the three browsing conditions in order of preference. A score of 1 represents a most favored ranking; 3 a least favored. Statistical analysis revealed a significant difference in ranking between all three browsers. The dynamic-collage condition had an average rank of 2.58; the slide-show condition, 1.37; and the HTML condition, 2.05. The slide-show condition was ranked significantly more favorably than the dynamic collage, t(36) = 6.20, p = 0.001, as well as than the HTML, t(36) = 3.04, p = 0.004. The HTML condition was ranked significantly more favorably than the dynamic collage, t(36) = 2.32, p = 0.03. We caution that these results are preliminary. We found many factors needing refinement in this experimental pilot. In particular, the poor showing of the dynamic-collage condition is perhaps due to the fact that we included text labels under each image that resulted in poor readability during the presentation. Nonetheless, we treat this as preliminary evidence that users in general prefer certain types of rapid-fire image browsing over conventional HTML-based page flipping for some combination of perusal and scanning tasks. While observing the browsing behavior of the participants, the experimenter also noted their use of the speed controls. The controls for the experiment had 20 speed increments available in each direction (see Figure 3).
That is, there were 20 arrows that moved the images backward, and another 20 that advanced the images. Speed of presentation ranged from approximately .75 images per second at the low end to 4 images per second at the high end. Each increment was distributed evenly between these two extremes. There seemed to be three discrete tasks that correlated with particular speed settings. During the 80 seconds allotted to participants to peruse the content of the site, they highlighted an average of 4.89 arrows, a speed setting we estimate to be 1 image per second. During the scanning tasks, they averaged 7.24 arrows, which is in the mid range, between 1 and 2 images per second. Once the participants had familiarized themselves with the browser, i.e., could recall the general order of the images, they began performing what we might call a fast go-to behavior; if the image they were seeking was located towards the end of the content, participants learned to roll over the entire length of the controls, 20 units, to “fast-forward” to the image they were seeking at roughly 4 images per second. These observations suggest a basis for future experiments to confirm user behavior and preference with respect to speed and direction controls. However, we do take it as significant that at least three distinct speed settings emerged as a result of users’ behavior when they were given a fairly wide range of speeds to choose from. This suggests to us that commercial refinements of our prototypes of rapid-fire image browsing should include at least three settings for speed in each direction and/or offer finer-grained incremental speed selection across the range we have been discussing.
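The even distribution of increments can be read as a linear interpolation: arrow k of n maps onto the range from the minimum to the maximum rate. The exact interpolation used in the prototype was not specified, so the sketch below is one plausible reading, under names of our own choosing:

```java
// Sketch of the even speed distribution described above: arrow k (1..n)
// maps linearly onto [minRate, maxRate] images per second, with the first
// arrow at the minimum and the last at the maximum. How the prototype
// actually interpolated was not published; this is one reasonable reading.
public class SpeedScale {
    public static double speedFor(int arrow, int n, double minRate, double maxRate) {
        if (arrow < 1 || arrow > n) {
            throw new IllegalArgumentException("arrow index out of range");
        }
        return minRate + (arrow - 1) * (maxRate - minRate) / (n - 1);
    }
}
```

With n = 20, minRate = 0.75, and maxRate = 4.0, the first arrow yields 0.75 images per second and the twentieth yields 4.0, matching the extremes reported for the experiment; intermediate arrows fall evenly between.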
4. REQUIREMENTS SUMMARY Based on our preliminary experiment along with many months of design refinements and informal observations, we suggest that rapid-fire image-based browsing needs to support the tasks of perusing, scanning, and fast go-to. Therefore, the controls must support multiple speeds and changes of direction with instant response times. In general, this requires a new type of player. The commercial media players currently in wide circulation typically (1) require an initial buffering time delay when the playing point of the sequence is altered; (2) fail to offer the necessary speed adjustments; (3) fail to offer symmetric bidirectional controls; and/or (4) fail to support instantaneous stop and start. We have implemented rapid-fire image-based browsers in Java, Javascript, VRML, and now in Macromedia Flash. Each of them has trade-offs, of course. We do not seek to impose yet another plug-in on the Internet browsing world, although that too could be another vehicle for getting rapid-fire image-based browsing to the marketplace. The imaging requirements are as follows. Our maximum target presentation rate is four images per second. Our minimum is roughly one image per second, though that depends on the domain; images that contain text generally will require longer to absorb. The main factors that affect an application’s ability to achieve the higher presentation rate include bandwidth, image size, image compression, CPU speed, and initial buffering time. Any new compression techniques that apply to standalone images would have a positive impact in lowering the bandwidth required for the content in question. However, many of the compression techniques in use for streaming media do not apply because they assume a pre-specified sequence and a known rate of presentation. With high bandwidth and throughput, no special image compression is needed to meet the requirements.
With incremental loading built into the client player, we have been able to achieve our target rates on the GTE Intranet, for example, without any special effort. Content providers planning for publishing on the consumer Internet, however, in which 28.8 kbps connection speeds are still prevalent, must be prepared to reduce the quality and/or size of their images and use the best compression techniques available. Commercial avenues such as Macromedia’s Flash, which allows incremental loading along with some image compression, may be feasible if they can be easily generalized to dynamic content lists. We also hypothesize that rapid-fire image-based browsing is improved when there is a text-based index that is synchronized with the image presentation. Although users cannot be expected to attend simultaneously to the elements of a text outline when they are focusing on a rapidly changing image space, it is nevertheless important to offer some orientation to the user as to where they stand in the sequence. We believe it is also appropriate to allow alternative means of navigating to an image previously shown (either by “playback” or random access through outline selection). We do not yet have experimental evidence that supports these hypotheses, but would encourage further experiments that might do so. Our recommendation at this point is to include text labels for each element in any standardized data format.
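The bandwidth constraint above reduces to simple arithmetic: at a sustained connection speed and a target presentation rate, each image has a fixed byte budget if the stream is to keep up without pre-buffering the whole sequence. A small back-of-envelope helper (the class name is ours):

```java
// Back-of-envelope byte budget implied by the requirements above: at a
// sustained connection speed of bitsPerSecond and a presentation rate of
// imagesPerSecond, each image may occupy at most this many bytes if
// delivery is to keep pace with presentation.
public class ImageBudget {
    public static long maxBytesPerImage(long bitsPerSecond, double imagesPerSecond) {
        return (long) (bitsPerSecond / 8.0 / imagesPerSecond);
    }
}
```

At 28.8 kbps and the four-image-per-second target, the budget is only 900 bytes per image, which makes concrete why consumer-Internet publishers must shrink and aggressively compress their images, while a high-bandwidth Intranet needs no special effort.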
5. INDUSTRY INITIATIVES What are the steps that might be taken in the industry to facilitate the use of image content to feed broad-based information browsing services analogous to the text-based search services we find on the Web today? An XML document format could potentially let providers publish their meta-content with images when appropriate such that it could be harvested by Web portals and used with rapid-fire image browsers in conjunction with text-based search and other broad-based indexing techniques. Our own efforts have included designing an XML format that includes a hierarchical structure with text, images, and links that is of general utility. While many challenges remain to achieve a standard format on an industry-wide basis, we share the view that adding imagery to the standard Web experience of users who are seeking information would be a major step forward. Such a standardization effort would be well worth undertaking, particularly if it began with content that is time-sensitive and in obvious need of wide dissemination. The W3C ICE (Information and Content Exchange) protocol initiative is an example vehicle for the necessary standardization effort.8 A starting point is to look for industry niches where images are a well-established form of content. Two examples from the shopping arena are advertising and consumer-based catalogs. From the entertainment world there is video content, as there is in the electronic libraries world as well. From the directories world there are people images. There are also consumer-based applications such as family digital photo collections and video collections. We encourage industry-level discussions of meta-content standards that would include image specification as an enabler of rapid-fire image-based browsing in all these domains. Probably a necessary precursor, however, is to get
compelling examples of rapid-fire image-based browsing into the Internet consumer marketplace such that it could galvanize the effort to publish more widely to this new medium.
ACKNOWLEDGMENTS We thank Christina Fyock and Glenn Li for their contributions to recent PolyNav™ applications; Demetrios Karis, John Huitema, and Regina Sutton for the informal user evaluation sessions; Joel Angiolillo for help in designing and supervising the pilot experiment; and Emily Bonham for carrying out the pilot experiment as well as conducting the analysis.9
REFERENCES
1. C. G. Healey, K. S. Booth, and J. T. Enns, “High-speed visual estimation using pre-attentive processing,” ACM Transactions on Computer-Human Interaction, 3, pp. 107-135, 1996.
2. T. Tse, G. Marchionini, W. Ding, L. Slaughter, and A. Komlodi, “Dynamic Key Frame Presentation Techniques for Augmenting Video Browsing,” in Proceedings of the Working Conference on Advanced Visual Interfaces (AVI ’98), L'Aquila, Italy, May 25-27, 1998, pp. 185-194.
3. K. Wittenburg, W. Ali-Ahmad, D. LaLiberte, and T. Lanning, “Rapid-Fire Image Previews for Information Navigation,” in Proceedings of the Working Conference on Advanced Visual Interfaces (AVI ’98), L'Aquila, Italy, May 25-27, 1998, pp. 76-82.
4. K. Wittenburg, W. Ali-Ahmad, D. LaLiberte, and T. Lanning, “Polynesian Navigation in Information Spaces,” in Proceedings of CHI ’98, Extended Abstracts, April 1998, pp. 317-318.
5. J. D. Hollan, B. B. Bederson, and J. I. Helfman, “Information Visualization,” in M. G. Helander, T. K. Landauer, and P. V. Prabhu (eds.), Handbook of Human-Computer Interaction, North Holland, pp. 33-48, 1997.
6. K. Wittenburg, J. Nicol, J. Paschetto, and C. Martin, “Browsing with Dynamic Key Frame Collages in Web-Based Entertainment Video Services,” in Proceedings of the IEEE International Conference on Multimedia Computing and Systems, Florence, Italy, June 7-11, 1999.
7. R. Spence, “Image Browsing,” video presented at Interact ’97, Imperial College London, 1996.
8. The Information and Content Exchange (ICE) Protocol, W3C Note, 26 October 1998, http://www.w3.org/TR/NOTE-ice.
9. GTE and the GTE logo are trademarks of GTE. PolyNav™ is a trademark of GTE Laboratories, Incorporated. All other marks are the property of their respective owners.