Human-Centered Interaction with Documents

Andreas Dengel, Stefan Agne, Bertin Klein
Knowledge Management Lab, DFKI GmbH, Kaiserslautern, Germany
+49 631 205 3216
{dengel,agne,klein}@dfki.de

Achim Ebert, Matthias Deller
Intelligent Visualization Lab, DFKI GmbH, Kaiserslautern, Germany
+49 631 205 3424
{ebert,deller}@dfki.de

ABSTRACT
In this paper, we discuss a new user interface, a complementary environment for the work with personal document archives, i.e., for document filing and retrieval. We introduce our implementation of a spatial medium for document interaction, explorative search, and active navigation, which exploits and further stimulates the human strengths of visual information processing. Our system achieves a high degree of immersion, so that the user forgets the artificiality of his/her environment. This is accomplished by a tripartite ensemble: allowing users to interact naturally with gestures and postures (which, as an option, can be individually taught to the system by users), exploiting 3D technology, and supporting users in maintaining the structures they discover as well as providing computer-calculated semantic structures. Our ongoing evaluation shows that even non-expert users can efficiently work with the information in a document collection, and have fun doing so.

Categories and Subject Descriptors
H.1.2 [User/Machine Systems]: Human factors, Human information processing; H.5.2 [User Interfaces]: Graphical user interfaces (GUI), Haptic I/O, Input devices and strategies (e.g., mouse, touchscreen), Interaction styles (e.g., commands, menus, forms, direct manipulation), User-centered design; I.3.6 [Methodology and Techniques]: Interaction techniques

General Terms
Multimodal interaction, Interactive search, Human-Centered Design

Keywords
Immersion, 3D user interface, 3D displays, data glove, gesture recognition

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HCM'06, October 27, 2006, Santa Barbara, California, USA. Copyright 2006 ACM 1-59593-500-2/06/0010...$5.00.

1. INTRODUCTION
Visual processing and association is an important capacity in human communication and intellectual behavior. Visual information addresses patterns of understanding as well as spatial assemblies. This also holds for office environments, where specialists are seeking the best possible information assistance for improved processes and decision making. However, in the last decades the paradigm of document management and storage has changed radically. In daily work, this has led to electronic, non-tangible document processing and virtual instead of physical storage. As a result, the spatial clues of document filing and storage are lost. Furthermore, documents are not only a means of information storage but an instrument of communication which has been adapted to human perception over the centuries. Reading order, logical objects, and presentation are combined in order to express the intentions of a document's author. Different combinations lead to individual document classes, such as business letters, newspapers, or scientific papers. Thus, it is not only the text which captures the message of a document but also the inherent meaning of the layout and the logical structure.

Figure 1. Our demonstration setup: 2D display, stereoscopic display and data glove

When documents are stored in a computer, they become invisible to human beings. The only way to retrieve them is to use search engines, which give the user a keyhole perspective on the contents, where all the inherent strengths of document structure for reading and understanding are disregarded. For this reason, we need better working environments which, while we work with documents, consider and stimulate our strengths in visual information processing. One of the main advantages of such a virtual environment is described by the term "immersion", standing for the lowering of barriers between human and computer. The user gets the impression of being part of the virtual scene and can ideally manipulate it as he would his real surroundings, without devoting conscious attention to using an interface. One reason why virtual environments are not yet as common as their advantages would suggest might be the lack of adequate hardware and interfaces for interacting with immersive environments, as well as of methods and paradigms for intuitive interaction in three-dimensional settings. 3D applications can be controlled by a mapping to a combination of mouse and keyboard, but the task of selecting and placing objects in a three-dimensional space with 2D interaction devices is cumbersome, requires effort, and demands the conscious attention of the user. More complex tasks, e.g., opening a document, require even more abstract mappings that have to be memorized by the user. In the following, we would like to discuss a new perceptual user interface as a complementary working environment for document filing and retrieval. Using 3D display technology, our long-term goal is to provide a spatial medium for document interaction and, at the same time, to consider and further stimulate the strengths of visual document processing. The approach supports explorative search and active navigation in document collections.

2. STATE OF THE ART
Many researchers have already addressed this problem from different perspectives. In [1], Welch et al. proposed new desktops where people can "spread" papers out in order to look at them spatially. High-resolution projected imagery should be used as a ubiquitous aid to display documents not only on the desk but also on walls or even on the floor. People at distant places would be able to collaborate on 3D displayed objects onto which graphics and text can be projected. Krohn [2] developed a method to structure and visualize large information collections, allowing the user to quickly recognize whether the found information meets his or her expectations. Furthermore, the user can give feedback through graphical modification of the query. Shaw et al. [3] describe an immersive 3D volumetric information visualization system for the management and analysis of document corpora. Based on glyph-based volume rendering, the system enables the 3D visualization of information attributes and complex relationships. The combination of two-handed interaction via three-space magnetic trackers and stereoscopic viewing enhances the user's 3D perception of the information space. The following two sections treat the two core aspects, visualization and interaction, in more detail.

2.1 Visualization
The information cube introduced by Rekimoto et al. [4] can be used to visualize a file system hierarchy. The nested-box metaphor is a natural way of representing containment. One problem with this approach is the difficulty of gaining a global overview of the structure, since boxes contained in more than three parent boxes or placed behind boxes of the same tree level are hard to observe. Card et al. [5] present a hierarchical workspace called Web Forager to organize documents with different degrees of interest at different distances to the user. One drawback of this system, however, is that the user gets (apart from the possibility of one search query) no computer assistance in organizing the documents in space or in mental categories. 3D NIRVE, the 3D information visualization presented by Sebrechts et al. [6], organizes documents that result from a previous search query depending on the categories they belong to. Nevertheless, placing the documents around a 3D sphere proved to be less intuitive than simple text output. Robertson et al. [7] developed the Task Gallery, a 3D window manager that can be regarded as a simple conversion of the conventional 2D metaphors to 3D. The 3D space is used to attach tasks to the walls and to switch between different tasks by moving them onto a platform. The only advantage of this approach over the 2D windows metaphor is the possibility to quickly relocate tasks using the user's spatial memory. Tactile 3D [8] is a commercial 3D user interface for the exploration and organization of documents; it is still in development. The file system tree structure is visualized in 3D space using semitransparent spheres that represent folders and that contain documents and other folders. The attributes of documents are at least partly visualized by different shapes and textures. Objects within a container can be placed in a sorting box and thus be sorted by various sorting keys in the conventional way, forming 3D configurations like a double helix, pyramid, or cylinder. The objects that are not in the sorting box can be organized by the user. The (still ongoing) discussion on the usefulness of 3D visualization in information visualization is very controversial (e.g., [6], [9], [10], [11]). Nevertheless, Ware's results [9] show that building a mental model of general graph structures is improved by a factor of 3 compared to 2D visualization. The studies also show that 3D visualization supports the use of spatial memory and that user enjoyment is generally better with 3D than with 2D visualizations, which has recently been rediscovered as a decisive factor for efficient working.

2.2 Interaction
At the moment, research on interaction is focused mainly on the visual capturing and interpretation of gestures. Either the user or his hands are captured by cameras so that their position or the posture of the hands can be determined with appropriate methods. To achieve this goal, there are several different strategies. For the user, the most natural and most comfortable way to interact is the use of non-invasive techniques. Here, the user is not required to wear any special equipment or clothing. However, the application has to solve the problem of interpreting the cameras' live video streams in order to identify the user's hands and the gestures made. Some approaches aim to solve this segmentation problem by assuring a special uniform background against which the user's hand can be distinguished [12,13]; others do not need a specially prepared, but still a static, background [14]; still others try to determine the hand's position and posture by feature recognition methods [15]. Newer approaches use a combination of these methods to enhance the segmentation process and find the user's fingers in front of varying backgrounds [17]. Other authors simplify the segmentation process by introducing restrictions, often by requiring the user to wear marked gloves [18,19], using specialized camera hardware [16], or by restricting the capturing process to a single, accordingly prepared setting [20].

Although promising, all of these approaches have the common drawback that they place special demands on the surroundings in which they are used. They require uniform, steady lighting conditions and high contrast in the captured pictures, and they have difficulties when the user's motions are so fast that his hands are blurred in the captured images. Apart from that, these procedures demand a lot of computing power as well as special and often costly hardware. In addition, the cameras for capturing the user have to be firmly installed and adjusted, so these devices are bound to one place and the user has to stay within a predefined area to allow gesture recognition. Often, a separate room has to be used to enable the recognition of the user's gestures. Another possibility to capture gestures is the use of special interface devices, e.g., data gloves [21,22]. The handicap of professional data gloves, however, is the fact that they are not per se equipped with positioning sensors. This limits the range of detectable gestures to static postures, unless further hardware is applied. The user has to wear additional gear to enable the determination of the position and orientation of his hand, often with electromagnetic tracking devices like the Ascension Flock of Birds [23]. These devices allow a relatively exact determination of the hand's position. The problem with electromagnetic tracking, however, is that it requires the user to wear at least one extra sensor attached to the system by cable, which makes the equipment uncomfortable to wear and restricts its use to the vicinity of the transmitter unit. Additionally, electromagnetic tracking devices have to be firmly installed and calibrated, and they are very prone to errors if there are metallic objects in the vicinity of the tracking system.

3. VISUALIZATION OF DOCUMENT SPACES
A collection of documents can be regarded as an information space. The documents as well as the relations between them carry information. Information is an asset that improves and grows when used. Information visualization techniques, together with the ability to generate realistic, real-time, interactive applications, can be leveraged to create a new generation of document explorers. An application environment that resembles real environments can be used more intuitively by persons without any prior computer knowledge. The users perceive more information about the documents, such as their size, location, and relation to other documents, visually, using their natural capabilities to remember spatial layouts and to navigate in 3D environments, which moreover frees cognitive capacities by shifting part of the information-finding load to the visual system. This leads to a more efficient combination of human and computer capabilities: Computers have the ability to quickly search through documents, compute similarities, calculate and render document layouts, and provide other tools not available in paper document archives. Humans, on the other hand, can visually perceive irregularities and intuitively interact with 3D environments. Last but not least, a well-designed virtual-reality-like graphical document explorer is more fun for the average user than a conventional one and thus more motivating.

3.1 Visualization of Documents

The opening screen of our implemented prototype is a visualization of all documents in their stored state, as books standing in a bookcase at the back of the room. The user can pre-select documents of this collection by typing a search query into the gray search query panel. The search panel is invoked by a gesture, which moves it to the bottom of the screen with the gray search field on the left side. The pre-selected documents can be thought of as belonging to a higher semantic zoom level than those in the bookcase and are therefore displayed with more detail. These documents are moved out of the bookcase (slow-in/slow-out animation), rotated so that their textured front page faces the user, and moved to their place in the start configuration. While the pre-selection search query has to be typed into the left search query panel, additional search queries typed into the color-coded right search query panels can be used for more detailed structuring requests.

Figure 2. PlaneMode

Figure 2 shows documents in the so-called PlaneMode. The results of the color-coded search queries are visualized by small score bars in front of the documents; moreover, documents which match more search queries are moved closer to the user than others, bringing the most relevant documents to the front and thus into the user's focus. Important documents show an animated pulsing behavior and thus instantly catch the user's eye. The color of a document representation encodes the document category, and the front page is a thumbnail of the document's first page. The thickness of a document indicates how many pages it contains, and its yellowness encodes a scalar value such as date (yellowing pages). When the mouse pointer is moved over a document, a label with the file name of the document is displayed directly in front of it, and an enlarged preview texture of the first page is displayed in the upper right corner of the screen. There is also the possibility to mark documents with red crosses in order to define user- or task-specific interests and not to lose sight of them when they move to different places in different modes. To get a closer view of single documents, they can be moved to the front (the reading position). Our prototype allows the user to save interesting document configurations and restore them later.
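To make the mapping from query matches and document attributes to the PlaneMode presentation concrete, the following Python sketch illustrates one way such an encoding could be computed. It is only an illustration under assumed scales; the names (Document, plane_layout) and all constants are ours, not the prototype's.

```python
# Illustrative sketch (not the original implementation) of how PlaneMode-style
# layout and visual encoding could be derived from query matches and metadata.
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    name: str
    category: str              # mapped to the representation's base color
    num_pages: int             # mapped to the thickness of the 3D book
    age_days: float            # mapped to the "yellowness" of the pages
    query_scores: List[float]  # one relevance score per color-coded search query

def plane_layout(doc, max_pages=300, max_age_days=3650):
    """Derive visual parameters of a document's 3D representation (illustrative)."""
    matches = sum(1 for s in doc.query_scores if s > 0.0)
    return {
        "z_offset": -1.0 * matches,                      # more matches -> closer to the user
        "thickness": 0.1 + 0.9 * min(doc.num_pages / max_pages, 1.0),
        "yellowness": min(doc.age_days / max_age_days, 1.0),
        "pulsing": matches == len(doc.query_scores),     # "important" documents pulse
        "score_bars": doc.query_scores,                  # one small bar per query
    }

doc = Document("paper.pdf", "science", num_pages=12, age_days=400,
               query_scores=[0.8, 0.0, 0.3, 0.0])
print(plane_layout(doc))
```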

Figure 3. ClusterMode

ClusterMode, a variation of PlaneMode, is shown in figure 3. Here, the most relevant documents are not only moved to the front but also to the center of the plane, which results in a pyramid shape. Depending on which of the four color-coded search queries match a document, the document is moved to the cluster with the colors of these search queries. For example, documents which match the blue and the white search query, but not the black and the brown one, are placed in the cluster with the blue-white label.

Figure 4. ClusterMode (variation)

A variation of ClusterMode can be seen in figure 4: The clusters are semitransparent rings in which the documents rotate. (In this figure the wall structure has been removed, the document space and thus also the clusters are rotated, and a different color pattern has been used.) There are two different semantics for connecting the clusters to each other: The first is to connect each cluster that has only one color to all clusters that also contain that color, using line segments in that color. The second possibility is to connect clusters that have the same color codes except for one additional color by a strut in this dividing color. Another feature of ClusterMode is that the user can switch between a static mode, in which the positions of the clusters are optimized in order to avoid occlusion, and a dynamic mode, in which the positions of the clusters reorganize dynamically and a click on a cluster moves it to the focus position.

3.2 Visualizations of Relations

Users enter into the information of a document space and work with it by discovering and maintaining relations between documents. For example, a new scientific article might be discovered to be relevant to a project and interesting for a co-worker. Thus, users need to maintain relations that they personally discover. Beyond these relations, state-of-the-art technology implemented in our system also allows us to support users with a semantic engine, which calculates similarity relations between documents.

Figure 5. Visualization of relations

Relations between documents can intuitively be represented by connecting the document under the mouse pointer by curves to the documents to which it is related (see figure 5). These curves create a mental model like that of thought flashes moving the user's attention from the current document to related documents. This mental model is supported not only by the way the curves are rendered with shining particles, but also by the fact that the curves are animated, starting at the document under the mouse pointer and propagating to the related documents. The advantage of this way of representing relations is that documents are visually strongly connected by the curves, which moreover create an impressive 3D structure. The disadvantage of literally connecting documents is that the thought flashes occlude the documents behind them, which sometimes makes it hard to tell which documents the curves are pointing to.

Figure 6. Relations (variation)

To overcome this disadvantage, documents which are related to the document under the mouse pointer can simply be visualized with semitransparent green boxes around them (see figure 6). Thus, they can be perceived preattentively and do not occlude anything, as the boxes are not much larger than the documents themselves. Also, the box representation can be used without complications in PlaneMode and in the bookcase. The connection between the document under the mouse pointer and the ones in green boxes is, however, not as intuitively clear as in the case of the connecting thought flashes. One possible solution to this dilemma might be to combine both relation visualization types and their advantages in the visualization of a single relation type, i.e., to use thought flashes that end at documents in green boxes.

Figure 7. Relations (variation)

The visualization of relations shown in figure 7 is different from the previous ones. When enabled, the user can trigger the display of relations by clicking on a document. This displays the documents the selected document is related to in the form of ghost documents hovering in front of the main document space. The user now has the possibility to select one of these ghosts by clicking on it, which triggers an animation that seems to collapse all of the ghost documents onto the real document the selected ghost stands for, and further visually emphasizes this document by displaying a semitransparent red box around it at the moment of collapse. With this visualization technique for relations, the user's attention is automatically moved from the selected document to all related documents and finally to the related document that seems most useful to him or her. From this document he or she can start the process again to move further to more related documents.

4. INTERACTING WITH DOCUMENTS

The most natural way for humans to manipulate their surroundings, including for example the documents on their desktop, is of course by using their hands. Hands are used to grab and move objects or manipulate them in other ways. They are used to point at, indicate, or mark objects of interest. Finally, hands can be used to communicate with others and state intentions by making postures or gestures. In most cases, this is done without having to think about it, and so without interrupting other tasks the person may be involved with at the same time. Therefore, the most promising approach to minimize the cognitive load required for learning and using a user interface in a virtual environment is to employ a gesture recognition engine that lets the user interact with the application in a natural way, simply by using his hands in ways he is already used to. Consequently, there is a need for gesture recognition that is flexible enough to be adapted to varying conditions such as alternating users, different hardware, or even portable devices, yet fast and powerful enough to enable reliable recognition of a variety of gestures without hampering the performance of the actual application. Similar to the introduction of the mouse as an adequate interaction device for graphical user interfaces, gesture recognition interfaces should be easy to define and integrate, either for interaction in three-dimensional settings or as a means to interact with the computer without having to use an abstract interface. This might sound easy to achieve, but it took us some effort.

4.1 Applied Hardware

The glove hardware we used to realize our gesture recognition engine was a P5 Glove from Essential Reality [24], shown in Figure 8. The P5 is a consumer data glove originally designed as a game controller. It features five bend sensors to track the flexion of the wearer's fingers as well as an infrared-based optical tracking system, allowing computation of the glove's position and orientation without the need for additional hardware. The P5 consists of a stationary base station housing the infrared receptors enabling the spatial tracking. Position and orientation data are obtained with the help of reflectors mounted at prominent positions on the glove housing. Depending on how many of these reflectors are visible to the base station and at which positions they are registered, the glove's driver is able to calculate the orientation and position of the glove.

Figure 8. Essential Reality P5 Data Glove

During our work with the P5, we learned that the calculated values for the flexion of the fingers were quite accurate, while the spatial tracking data was, as expected, much less reliable. The estimated position information was fairly dependable, whereas the values for yaw, pitch, and roll of the glove were, depending on lighting conditions, very unstable, with sudden jumps in the calculated data. Because of this, additional adequate filtering mechanisms had to be applied to obtain sufficiently reliable values. Of special note is the very low price of the P5. It costs about 50 €, compared to about 4,000 € for a professional data glove, which of course provides much more accurate data but, on the other hand, does not come with integrated, transportable position tracking. Indeed, the low price was one reason we chose the P5 for our gesture recognition, because it shows that serviceable interaction hardware for virtual environments can be realized at a cost that makes it an option for the normal consumer market. The other reason for our choice was to show that our recognition engine is powerful and flexible enough to enable reliable gesture recognition even when used with inexpensive gamer hardware.

4.2 Posture and Gesture Recognition and Learning

A major problem for the recognition of gestures, especially when using visual tracking, is the high amount of computational power required to determine the most likely gesture carried out by the user. This makes it very difficult to accomplish reliable recognition in real time. Especially when gesture recognition is to be integrated into running applications that at the same time have to render a virtual environment and manipulate this environment according to the recognized gestures, this is a task that cannot be accomplished on a single average consumer PC. We aim to achieve a reliable real-time recognition that is capable of running on any fairly up-to-date workplace PC and can easily be integrated into normal applications without using too much of the system's processing power. Like Bimber's fuzzy-logic approach [25], we use a set of gestures that have been learned by performing them in order to determine the most likely match. However, for our system we do not define gestures as motion over a certain period of time, but as a sequence of postures made at specific positions with specific orientations of the user's hand.

Thus, the relevant data for each posture is mainly given by the flexions of the individual fingers. However, for some postures the orientation of the hand may be more or less significant. For example, for a pointing gesture with a stretched index finger, the orientation and position of the hand may be required to determine what the user is pointing at, but the gesture itself is the same whether he is pointing at something to his near left or his far right. On the other hand, for some gestures the orientation data is much more relevant: for example, the meaning of a fist with outstretched thumb can differ significantly depending on whether the thumb points upward or downward. In other cases, the importance of the orientation data can vary; for instance, a gesture for dropping an object may require the user to open his hand with the palm pointing downwards, but it is not necessary to hold his hand completely flat. Due to this fact, the postures for our recognition engine are composed of the flexion values of the fingers, the orientation data of the hand, and an additional value indicating the relevance of the orientation for the posture.

As mentioned before, the required postures are learned by the system simply by performing them. This approach makes it extremely easy to teach the system new postures that may be required for specific applications. The user performs the posture, captures the posture data by hitting a key, names the posture, and sets the orientation quota for the posture. Of course, the posture name can also be given by the application, enabling the user to define individual gestures to invoke specific functionality. Alternatively, existing postures can be adapted for specific users. To do so, the posture in question is selected and performed several times by the user. The system captures the different variations of the posture and determines the resulting averaged posture definition. In this manner, it is possible to create a flexible collection of different postures, termed a posture library, with little expenditure of time. This library can be saved and loaded in the form of a gesture definition file, making it possible for the same application to have different posture definitions for different users and allowing an on-line change of the user context.
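To make this description concrete, the following sketch shows one way such a posture definition and the training-by-averaging step could look in code. It is a minimal illustration, not the system's actual data model; the names (Posture, train_posture) and all sample values are assumptions.

```python
# Illustrative sketch of a posture definition as described in the text:
# five finger flexion values, a hand orientation, and an "orientation quota"
# stating how much the orientation matters. Names and values are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Posture:
    name: str
    flexion: List[float]        # bend value per finger: 0.0 = straight, 1.0 = fully flexed
    orientation: List[float]    # yaw, pitch, roll of the hand
    orientation_quota: float    # 0.0 = ignore orientation, 1.0 = orientation is decisive

def train_posture(name, samples, orientation_quota):
    """Average several performances of the same posture into one definition."""
    n = len(samples)
    flexion = [sum(s[0][i] for s in samples) / n for i in range(5)]
    orientation = [sum(s[1][i] for s in samples) / n for i in range(3)]
    return Posture(name, flexion, orientation, orientation_quota)

# Example: a "point" posture (thumb and index extended) captured three times,
# each sample given as (flexion values, orientation values).
samples = [([0.1, 0.1, 0.9, 0.9, 0.9], [0.0, -10.0, 0.0]),
           ([0.0, 0.1, 1.0, 0.9, 0.8], [5.0, -12.0, 2.0]),
           ([0.1, 0.0, 0.9, 1.0, 0.9], [-3.0, -8.0, 1.0])]
point = train_posture("point", samples, orientation_quota=0.2)
print(point)
```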

4.3 Recognition Process

Our recognition engine consists of two components: the data acquisition and the gesture manager. The data acquisition runs as a separate thread and constantly checks the data received from the glove for possible matches against the postures known to the gesture manager. As mentioned before, position and especially orientation data received from the P5 can be very noisy, so they have to be appropriately filtered and smoothed to enable a sufficiently reliable matching to the known postures. First, the tracking data is piped through a deadband filter to reduce the chance of jumping error values in the tracked data. Alterations in the position or orientation data that exceed a given deadband limit are discarded as improbable and replaced with their previous values, eliminating changes in position and orientation that can only be considered erroneous calculations of the glove's position. The resulting data is then straightened out by a dynamically adjusting average filter. Depending on the variation of the acquired data, the size of the averaging window is altered within a defined range. If the data fluctuates within a small region, the filter size is increased to compensate for jitter. If the values show larger changes, the filter size is reduced to lower the latency of the resulting position and orientation.
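A minimal sketch of this two-stage filtering is given below, assuming a single one-dimensional tracking channel and illustrative threshold values; the actual engine filters full position and orientation data, and its exact parameters are not given here.

```python
# Sketch of the two-stage filtering described above (illustrative values,
# not the original implementation): a deadband filter that rejects sudden
# jumps, followed by an averaging filter whose window adapts to the signal.
from collections import deque

class TrackingFilter:
    def __init__(self, deadband=40.0, min_window=2, max_window=12):
        self.deadband = deadband
        self.min_window = min_window
        self.max_window = max_window
        self.window = deque(maxlen=max_window)
        self.last = None

    def update(self, value):
        # Deadband: discard implausible jumps and keep the previous value instead.
        if self.last is not None and abs(value - self.last) > self.deadband:
            value = self.last
        self.last = value
        self.window.append(value)
        # Adapt the window: small fluctuations -> long window (more smoothing),
        # larger changes -> short window (less latency).
        recent = list(self.window)
        spread = max(recent) - min(recent)
        size = self.max_window if spread < self.deadband * 0.1 else self.min_window
        recent = recent[-size:]
        return sum(recent) / len(recent)

f = TrackingFilter()
for raw in [10.0, 10.5, 11.0, 95.0, 11.5, 12.0]:   # 95.0 is a tracking glitch
    print(round(f.update(raw), 2))
```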

Figure 9. Our tool for training new postures

The resulting data is sufficiently accurate to provide a good basis for the matching process of the gesture manager. If the gesture manager finds that the provided data matches a known posture, this posture is marked as a candidate. To lower the possibility of misrecognition, a posture is only accepted as recognized when held for an adjustable minimum time span. During our tests, values between 300 and 800 milliseconds turned out to be suitable to allow reliable recognition without forcing the user to hold the posture for too long. Once a posture is recognized, a PostureChanged event is sent to the application that started the acquisition thread. To enable the application to use the recognized posture for further processing, additional data is sent with the event.

Apart from the timestamp, the string identifier of the recognized posture as well as the identifier of the previous posture is provided, to facilitate the sequencing of postures into a more complex gesture. Furthermore, the position and orientation of the glove at the moment the posture was performed are provided. In addition to the polling of recognized postures from the gesture manager, the acquisition thread keeps track of the glove's movement. If the changes in the position or orientation data of the glove exceed an adjustable threshold, a GloveMove event is fired. This event is similar to common MouseMove events, providing both the start and end values of the position and orientation data of the movement. Finally, to take into account hardware that, like the P5, possesses additional buttons, the data acquisition thread also monitors the state of these buttons and generates corresponding ButtonPressed and ButtonReleased events, providing the number of the button.
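The event payloads described above could be modeled roughly as follows; the class and field names are our illustration and may differ from the engine's actual interface.

```python
# Illustrative event payloads mirroring the description in the text.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PostureChangedEvent:
    timestamp: float
    posture: str                              # identifier of the recognized posture
    previous_posture: str                     # helps sequence postures into gestures
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float]   # yaw, pitch, roll

@dataclass
class GloveMoveEvent:
    start_position: Tuple[float, float, float]
    end_position: Tuple[float, float, float]
    start_orientation: Tuple[float, float, float]
    end_orientation: Tuple[float, float, float]

@dataclass
class ButtonEvent:
    button: int                               # designated number of the button
    pressed: bool                             # True for ButtonPressed, False for ButtonReleased
```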

It is important to note that although the data acquisition we implemented was fitted to the Essential Reality P5, it can easily be adapted to any other data glove, either for mere posture recognition or in combination with an additional 6-degrees-of-freedom tracking device like the Ascension Flock of Birds [23] to achieve full gestural interaction.

4.4 The Gesture Manager

The gesture manager is the principal part of the recognition engine, maintaining the list of known postures and providing multiple functions to manage the posture library. As soon as the first posture is added to the library or an existing library is loaded, the gesture manager begins matching the data received from the data acquisition thread against the stored datasets. This is done by first looking for the best matching finger constellation. In this first step, the bend values of the fingers are interpreted as five-dimensional vectors, and for each posture definition the distance to the current data is calculated. If this distance does not lie within an adjustable minimum recognition distance, the posture is discarded as a candidate. If a posture matches the data to a relevant degree, the orientation data is compared to the actual values in a similar manner. Depending on whether this distance exceeds another adjustable limit, the likelihood of a match is lowered or raised according to the orientation quota associated with the corresponding posture dataset. This procedure has proved very reliable, providing both very fast matching of postures and very consistent recognition of the performed posture. Apart from determining the most probable posture, the gesture manager provides several means to adjust parameters at run time. New postures can be added, existing postures adapted, or new posture libraries loaded. In addition, the recognition boundaries can be adjusted on the fly, so it is possible to start with a wide recognition range, enabling correct recognition of the user's postures before the posture definitions have been adapted to this specific person, and then narrow it down as the postures are customized to the user.
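The two-step matching can be sketched as follows, with the bend values treated as a five-dimensional vector and the orientation weighted by the posture's orientation quota. The thresholds and the scoring formula are assumptions for illustration, not the engine's actual values.

```python
# Sketch of the two-step matching: finger flexions are compared as
# five-dimensional vectors, then hand orientation is weighted by the posture's
# orientation quota. Thresholds and scoring are illustrative only.
import math
from collections import namedtuple

# Lightweight stand-in for a stored posture definition (see the earlier sketch).
Posture = namedtuple("Posture", "name flexion orientation orientation_quota")

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_posture(flexion, orientation, postures,
                  max_flexion_dist=0.6, max_orientation_dist=45.0):
    """Return the best matching posture definition, or None."""
    best, best_score = None, 0.0
    for p in postures:
        d_flex = distance(flexion, p.flexion)
        if d_flex > max_flexion_dist:
            continue                          # too far off: discarded as a candidate
        flex_score = 1.0 - d_flex / max_flexion_dist
        d_orient = distance(orientation, p.orientation)
        orient_score = 1.0 - min(d_orient / max_orientation_dist, 1.0)
        # Blend the two scores according to how much orientation matters.
        score = (1.0 - p.orientation_quota) * flex_score + p.orientation_quota * orient_score
        if score > best_score:
            best, best_score = p, score
    return best

fist = Posture("fist", [1.0, 1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0)
point = Posture("point", [0.1, 0.1, 0.9, 0.9, 0.9], [0.0, -10.0, 0.0], 0.2)
print(match_posture([0.15, 0.05, 0.95, 0.9, 0.88], [2.0, -8.0, 1.0], [fist, point]).name)
```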

4.5 Recognition of Gestures

As mentioned before, we see actual gestures as a sequence of successive postures. With the help of the PostureChanged events, our recognition engine provides an extremely flexible way to track gestures performed by the user. The recognition of single postures, like letters of the American Sign Language (ASL), is as easily possible as the recognition of more complex, dynamic gestures. This is done by tracking the sequence of performed postures as a finite state machine. For example, let us consider the detection of a "click" on an object in a virtual environment. Tests with different users showed that an intuitive gesture for this task is pointing at the object and then "tapping" at it with the index finger. To accomplish the detection of this gesture, one defines a pointing posture with outstretched index finger and thumb and the other fingers flexed, then a tapping posture with a half-bent index finger. All that remains to do in the application is to check for a PostureChanged event indicating a change from the pointing to the tapping posture. If, within a certain amount of time, the recognized posture reverts from tapping back to pointing, a clicking gesture is registered at the position provided by the PostureChanged event. In this manner, almost any desired gesture can quickly be implemented and recognized.
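A minimal sketch of such a posture-sequence state machine for the click gesture described above might look as follows; the posture names, the timing value, and the event interface are illustrative assumptions.

```python
# Minimal sketch of detecting the "click" gesture described above as a small
# state machine over PostureChanged events (names and timing are illustrative).
import time

class ClickDetector:
    def __init__(self, max_interval=0.8):
        self.max_interval = max_interval    # seconds allowed between tap and release
        self.tap_started = None

    def on_posture_changed(self, previous, current, position):
        now = time.time()
        if previous == "point" and current == "tap":
            self.tap_started = now          # index finger bent: tap begins
        elif previous == "tap" and current == "point":
            if self.tap_started and now - self.tap_started <= self.max_interval:
                self.tap_started = None
                return ("click", position)  # full point -> tap -> point sequence
            self.tap_started = None
        else:
            self.tap_started = None         # any other transition resets the machine
        return None

detector = ClickDetector()
detector.on_posture_changed("point", "tap", (0.1, 0.2, 0.5))
print(detector.on_posture_changed("tap", "point", (0.1, 0.2, 0.5)))  # -> ('click', ...)
```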

Figure 10. Immersive drag’n’drop operation: picking up, moving and placing documents / document stacks

4.6 Implementation and Results

We have evaluated our gesture recognition engine in several demo applications representing a virtual document space. Due to the thread-based architecture of our engine, it was easily integrated by adding the recognition thread and reacting to the received events, making it about as straightforward as adding mouse functionality. In the implemented virtual environments, the user can manipulate various objects representing documents and trigger specific actions by performing a corresponding intuitive gesture. The next implementation step will be the integration of the developed functionalities into a single prototype. In order to enhance the degree of immersion for the user, we used a particular demonstration setup as shown in Figure 1. To allow the user a stereoscopic view of the scene, we used a specialized 3D display device, the SeeReal C-I [27]. This monitor creates a real three-dimensional impression of the scene by showing one perspective view for each eye and separating them through a prism layer. To compensate for the resulting loss in resolution, especially while displaying text, we used an additional TFT display to also show a high-resolution view of the scene. A testament to the speed of our recognition engine is the fact that we were able to run the application logic, including the rendering of three different perspectives (one for each eye and another one for the non-stereoscopic display) and the tracking and recognition of gestures, on a normal consumer-grade computer in real time. In our environment we used two different kinds of gestures [26], namely semiotic and ergotic gestures. Semiotic gestures are used to communicate information (in this case pointing out objects to mark them for manipulation), while ergotic gestures are used to manipulate a person's surroundings.

Our demo scenario, shown in Figure 10, consists of a virtual desk on which different documents are arranged randomly. In the background of the scene, a wall containing a pin board and a calendar can be seen. Additionally, the user's hand is represented by a hand avatar, showing its location in the scene as well as the hand's orientation and the flexion of the fingers, so that the user can get a better impression of whether the glove captures the data correctly. The user was given multiple means to interact with this environment. First, he could rearrange the documents on the table. This was done by simply moving his hand avatar over a chosen document, then grabbing it by making a fist. He could then move the selected document around and drop it at the desired location by opening his fist, releasing his grip on the document. Another interaction possibility was to have a closer look at either the calendar or the pin board. To do this, the user had to move his hand in front of the object and point at it. In the implementation, this was realized by simply checking whether the hand was in a bounding box in front of the object when a PostureChanged event indicated a pointing posture. Once this happened, the calendar or the pin board, respectively, was brought to the front of the scene and remained there. The user could return it to its original location by making a dropping gesture, performed by making a fist and then spreading the fingers with the palm pointing downward. Additionally, there were several possibilities to interact with specific documents. For this, one of the documents first had to be selected. To select a document, the user had to move his hand over it and then tap on it in the way described earlier in this paper. Originally, the gesture for selecting a document had been defined as pointing at it with the thumb spread, then tapping the thumb to the side of the middle finger, but users felt that tapping on the document was the more intuitive way to do this. As a measure of the flexibility of our gesture recognition interface, replacing one gesture with the other took less than five minutes, because all we had to do was replace the thumb-at-middle-finger posture with the index-finger-half-flexed posture. Once a document was selected, it moved to the front of the scene, allowing a closer look at the cover page. The user then had the choice between putting the document back in its place, performing the same dropping gesture used for returning the calendar and pin board, or opening the document. To open it, he had to "grab" it in the same way (by making a fist), then turn his hand around and open it, spreading his fingers with the palm facing upward. This replaced the desktop scene with a view of the document itself, represented as a cube whose sides display the pages of the document. Turning the cube to the left or right enabled the user to browse through the document. This could be done in two ways. The user could either turn single pages by moving his hand to the left or right side of the cube and tapping on it, in the same manner as described before, or he could browse rapidly through the contents of the document. To do the latter, the user had to make a "thumbs up" gesture with his thumb pointing straight up. He could then indicate the desired browsing direction by tilting his hand to the left or right, triggering an automatic, fast turning of the cube. To stop the browsing, he just had to either tilt his thumb back up to a vertical position or completely cancel the spread-thumb posture.
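Reacting to these gestures on the application side essentially amounts to a lookup over the posture transitions reported by PostureChanged events. The sketch below illustrates that wiring; the posture names and action descriptions are assumptions for illustration, not the demo's actual code.

```python
# Sketch of the demo's gesture-to-action wiring as a simple lookup over posture
# transitions (posture and action names are hypothetical).
TRANSITION_ACTIONS = {
    ("flat", "fist"): "grab document under hand",
    ("fist", "flat_palm_down"): "drop grabbed document / return calendar or pin board",
    ("point", "tap"): "begin click on document under hand",
    ("tap", "point"): "complete click: select document",
    ("fist", "flat_palm_up"): "open selected document as page cube",
    ("flat", "thumb_up"): "start fast browsing; hand tilt selects direction",
}

def handle_posture_changed(previous, current):
    """Map a posture transition to the demo action it should trigger (if any)."""
    return TRANSITION_ACTIONS.get((previous, current))

print(handle_posture_changed("flat", "fist"))          # -> 'grab document under hand'
print(handle_posture_changed("fist", "flat_palm_up"))  # -> 'open selected document as page cube'
```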
We had several users (e.g., knowledge workers, secretaries, students, and pupils) test our demonstration environment, moving documents and browsing through them. Apart from initial difficulties, especially due to unfamiliarity with the glove hardware, after a short while most users were able to use the different gestures in a natural way, with only few adaptations of the posture definitions to the individual users. However, during the evaluation the browsing gesture turned out to be too cumbersome for leafing through larger documents. Therefore, we are currently extending our interaction concept with additional interaction possibilities. For example, a force-feedback joystick seems to be a well-fitting additional device, supporting continuous browsing-speed selection and haptic feedback.

5. CONCLUSION

Human thinking and knowledge work are heavily dependent on sensing the outside world. One important part of this perception-oriented sensing is the human visual system. It is well known that our visual knowledge disclosure, that is, our ability to think, abstract, remember, and understand visually, and our skills to organize visually are extremely powerful. Our overall vision is to realize an individually customizable virtual world which inspires the user's thinking, enables the economical usage of his perceptual power, and adheres to a multiplicity of personal details with respect to his thought process and knowledge work. We have made major steps towards this vision, created the necessary framework and a couple of modules, and continue to get good feedback. The logical conclusion is that by creating a framework that emphasizes the strengths of both humans and machines in an immersive virtual environment, we can achieve great improvements in the effectiveness of knowledge workers and analysts. We strive to complete our vision by further extending our methods to present and visualize data in a way that integrates the user into his artificial surroundings seamlessly and gives him/her the opportunity to interact with them in a natural way. In this connection, a holistic context- and content-sensitive approach for information retrieval, visualization, and navigation in manipulative virtual environments was introduced. We address this promising and comprehensive vision of efficient man-machine interaction in manipulative virtual environments by the term "immersion": a frictionless sequence of operations and a smooth operational flow, integrated with multi-sensory interaction possibilities, which allows an integral interaction of human work activities and machine support. When implemented to perfection, this approach enables a powerful immersion experience: the user has the illusion that he is actually situated in the artificial surroundings, the barrier between human activities and their technical reflection vanishes, and the communication with the artificial environment is seamless and homogeneous. As a result, not only are visually driven thinking, understanding, and organizing promoted, but the identification and recognition of new relations and knowledge are facilitated.

We have dedicated our studies on virtual environments to personal (virtual) information spaces which are, to a high degree, based on documents, i.e., personal document-based information spaces. Our continuing usability tests have delivered promising results for the developed visualization and interaction metaphors. Thus, the first claims implied by our theory have proven true: it is natural and easy to handle one's documents, and to get new ideas from computer-calculated clues (like similarity relations and clusters) about one's documents.

Figure 11. Virtual desktop demo

6. ACKNOWLEDGMENTS

This research is supported by the German Federal Ministry of Education and Research (BMBF) and is part of the project @VISOR.

7. REFERENCES

[1] Welch, G., Fuchs, H., Raskar, R., Towles, H. and Brown, M.S. Projected imagery in your "Office of the Future". IEEE Computer Graphics and Applications, vol. 20, no. 4, Jul/Aug 2000, 62-67.
[2] Krohn, U. Visualization for Retrieval of Scientific and Technical Information. Dissertation, Techn. Univ. Clausthal, ISBN 3-931986-29-2, 1996.
[3] Shaw, C.D., Kukla, J.M., Soboro, I., Ebert, D.S., Nicholas, C.K., Zwa, A., Miller, E.L., and Roberts, D.A. Interactive volumetric information visualization for document corpus management. International Journal on Digital Libraries, 1999, 144-156.
[4] Rekimoto, J. and Green, M. The information cube: Using transparency in 3D information visualization. Workshop on Information Technologies & Systems, 1993, 125-132.
[5] Card, S. and Robertson, G. The WebBook and the Web Forager: An information workspace for the World Wide Web. Proceedings of the Conference on Human Factors in Computing Systems CHI'96, 1996.
[6] Sebrechts, M. and Cugini, J. Visualization of search results: A comparative evaluation of text, 2D, and 3D interfaces. Research and Development in Information Retrieval, 1999, 3-10.
[7] Robertson, G. and van Dantzich, M. The Task Gallery: A 3D window manager. CHI Conference, 2000, 494-501.
[8] TACTILE: Tactile 3D user interface, 2005.
[9] Ware, C. and Franck, G. Evaluating stereo and motion cues for visualizing information nets in three dimensions. ACM Transactions on Graphics 15, 2, 1996, 121-140.
[10] Van Ham, F. and van Wijk, J. Beamtrees: Compact visualization of large hierarchies. 2002.

[11] Wiss, U. and Carr, D. An empirical study of task support in 3D information visualizations. IV Conference Proceedings, 1999, 392-399.
[12] Quek, F., Mysliwiec, T. and Zhao, M. Finger mouse: A freehand pointing interface. International Workshop on Automatic Face- and Gesture-Recognition, Zürich, 1995.
[13] Lien, C. and Huang, C. Model-based articulated hand motion tracking for gesture recognition. Image and Vision Computing, vol. 16, February 1998.
[14] Appenzeller, G., Lee, J. and Hashimoto, H. Building topological maps by looking at people: An example of cooperation between intelligent spaces and robots. Proceedings of the IEEE-RSJ International Conference on Intelligent Robots and Systems.
[15] Rehg, J. and Kanade, T. Visual tracking of high DOF articulated structures: An application to human hand tracking. Proc. ECCV, Vol. 2, 1994.
[16] Rehg, J. and Kanade, T. DigitEyes: Vision-based human hand tracking. Technical Report CMU-CS-93-220, School of Computer Science, Carnegie Mellon University, 1993.
[17] Von Hardenberg, C. and Bérard, F. Bare-hand human-computer interaction. Proceedings of the ACM Workshop on Perceptive User Interfaces, Orlando, Florida, 2001.
[18] Starner, T., Weaver, J. and Pentland, A. A wearable computer based American Sign Language recognizer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[19] Hienz, H., Groebel, K. and Offner, G. Real-time hand-arm motion analysis using a single video camera. Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1996.
[20] Crowley, J., Bérard, F. and Coutaz, J. Finger tracking as an input device for augmented reality. Automatic Face and Gesture Recognition, Zürich, 1995.
[21] Takahashi, T. and Kishino, F. Hand gesture coding based on experiments using a hand gesture interface device. SIGCHI Bulletin, 1991.
[22] Huang, T.S. and Pavlovic, V.I. Hand gesture modeling, analysis, and synthesis. Proc. of the International Workshop on Automatic Face- and Gesture-Recognition (IWAFGR), Zurich, Switzerland, 1995.
[23] Ascension Products - Flock of Birds. URL: www.ascension-tech.com/products/flockofbirds.php
[24] The P5 Glove. URL: www.essentialreality.com/VGA/video_game/P5.php
[25] Bimber, O. Continuous 6DOF gesture recognition: A fuzzy-logic approach. Proceedings of the 7th International Conference in Central Europe on Computer Graphics, Visualization and Interactive Digital Media (WSCG'99), 1999.
[26] Sandberg, A. Gesture Recognition using Neural Networks. Master thesis, 1997.
[27] SeeReal Technologies. URL: http://www.seereal.de
