Pers Ubiquit Comput DOI 10.1007/s00779-008-0209-0
ORIGINAL ARTICLE
Embodied interaction with a 3D versus 2D mobile map
Antti Oulasvirta · Sara Estlander · Antti Nurminen
Received: 17 September 2007 / Accepted: 29 February 2008 © Springer-Verlag London Limited 2008
Abstract In comparison to 2D maps, 3D mobile maps involve volumetric instead of flat representation of space, realistic instead of symbolic representation of objects, more variable views that are directional and bound to a first-person perspective, more degrees of freedom in movement, and dynamically changing object details. We conducted a field experiment to understand the influence of these qualities on a mobile spatial task in which buildings shown on the map were to be localized in the real world. The representational differences were reflected in how often users interact with the physical environment and in when they are more likely to physically turn and move the device instead of using virtual commands. 2D maps direct users into using reliable and ubiquitous environmental cues like street names and crossings, and 2D better affords the use of pre-knowledge and bodily action to reduce cognitive workload. Both acclaimed virtues of 3D mobile maps—rapid identification of objects and ego-centric alignment—worked poorly, due to reasons we discuss. However, with practice, some 3D users learned to shift to 2D-like strategies and could thereby improve performance. We conclude with a discussion of how representational differences in mobile maps affect strategies of embodied interaction.

Keywords 3D graphics · Field experiment · Mobile map · Pointing task · Spatial cognition · Virtual environment

Electronic supplementary material The online version of this article (doi:10.1007/s00779-008-0209-0) contains supplementary material, which is available to authorized users.

A. Oulasvirta (corresponding author) · S. Estlander · A. Nurminen
Helsinki Institute for Information Technology HIIT, Helsinki University of Technology and University of Helsinki, Helsinki, Finland
A. Oulasvirta, e-mail: [email protected]
S. Estlander, e-mail: [email protected]
A. Nurminen, e-mail: [email protected]
1 Introduction Interactive maps have become popular as interfaces to georeferenced data in mobile applications like navigators, route planners, guides, and location-based media. A recent market forecast [5] predicted that 42 million users in Europe and North America will use mobiles for navigation in 2012. Due to their popularity, mobile maps are an important topic for the theme of this special issue: mobile spatial interaction [12]. The fundamental and unique question that mobile maps pose for researchers working in this area is how different interface solutions support the user in achieving actionable understanding of the referential relationship between the virtual and the real world. For example, given a point-of-interest (POI) found on a tourist map, how does a tourist find the corresponding object in the surrounding real world? This question is particularly pronounced in mobile maps, because current mobile map solutions do not leverage augmentation or mixed reality technologies like the magic lens [33]. We claim that the ability to understand referential relationships underlies two important spatial tasks carried out with mobile maps: (1) identifying locations of objects in relation to one’s own physical location, and (2) navigation to given targets. Even navigation involves splitting the route to landmarks that are
navigated to, a task which necessarily involves solution of referential relationships. Figure 1 presents an abstraction of this problem. The present paper examines how representational differences between 3D and 2D maps are reflected in users' strategies in accomplishing these tasks. Particularly, the results will show differences in the use of body and gaze for the benefit of cognition and task performance. The design space of mobile maps is roughly divided into two solutions, which are so radically different that it would be a surprise if they were not related to different user performance and strategies. First, two-dimensional (2D) representations, the present-day standard, exploit users' familiarity with the cartographic conventions of paper maps. Second, technological progress has made room for speculations about 3D maps becoming more popular and replacing 2D maps [5]. Indeed, 3D maps have many qualities that appear promising. They support (1) representation of volumetric concepts, an important dimension of any cityscape; (2) multiple views to spatial data, all views nevertheless being directional and bound to the first-person perspective; (3) rich and realistic visual details that can support direct recognition of objects based on matching of perceptual features; and (4) degrees of freedom in movement not possible with 2D maps. In principle, these qualities should support ego-centric alignment: matching what is seen in the physical environment (PE) to what is seen in the viewport of the virtual environment (VE). Furthermore, there is evidence that route-finding performance on the basis of information from a 3D desktop VE can be as efficient as on the basis of information from the PE [34]. One can ask if this could also be achieved with a mobile 3D map. Moreover, the small display size of mobile devices does not have to be a problem; some findings hint that presenting a desktop VE sequentially in fragments, rather than all at once, does not degrade performance in tasks of spatial cognition [43] (cf. [1]).
Therefore, one can surmise that if a 3D mobile map was to follow the known principles of good VE design, it could be at least as good as a 2D map, or even better. However, the technology is so new that empirical studies that would provide a serious test for these claims are scarce. Kray et al. [18] found that both orientation and walking to target were slower with a 3D map than with a traditional paper map and that absence of GPS data caused navigational problems in 3D. Burigat and Chittaro [8] noted that low resolution when the viewpoint is near a building hampers visual recognition. However, they reported that users think that 3D maps improve recognition of buildings. Other studies have also yielded positive results for 3D. Rakkolainen and Vainio’s [32] user interface featured a 2D map and a 3D model view simultaneously shown on the screen. The results of their pilot study indicated that users preferred to use a combined view rather than either 2D or 3D alone, corroborating findings from studies with a desktop VE system [33]. Rakkolainen and Vainio speculated that 3D allowed users to better recognize their own position and landmarks than the 2D map did. Subsequently, Vainio and Kotala [38] improved their system by adding a symbol that showed the user’s location and viewing direction. On the basis of a small study, they concluded that the benefit of this augmented 3D model is that it illustrates motion more clearly than their 2D map alone. However, these previous studies have been mainly couched in terms of evaluation of design solutions, and their results may thus be of limited value for designers in need of more principled understanding of how users interact with 3D mobile maps in the real world. Due to excessive focus on outcome variables and not the process of interaction itself, these studies are not indicative of how the different qualities of a 3D map affect interaction with the map. The goal of the present paper is to provide a careful and rigorous experimental study of the matter. 1.1 Representational differences between 2D and 3D maps
Fig. 1 The mapping problem: when using a mobile map, the user must understand the referential relationship between two points in two spaces
Maps in 3D have much to offer beyond their 2D counterparts. They differ both in medium—the way of presenting information—and in informational content. Among the former differences is the basic one of representing the 3D real world in those same three dimensions, instead of flattening one dimension and thus creating an impoverished representation of space. This third dimension allows them to provide variable viewpoints, so that a scene may be seen in first-person perspective, while 2D maps generally provide a fixed top–down view. Variable perspectives, in turn, entail special principles of navigation in the space [11].
In terms of informational content, whereas 2D maps are typically simplified to contain only a specific set of a specific type of cue, 3D supports the representation of real-world objects in a more comprehensive manner. This often entails photographic detail, supporting direct visual recognition at different levels of abstraction rather than forcing the user to decode the static symbols and shapes of structures found in traditional maps. According to Crampton [10], the reader of a map must mentally visualize the information in the map as a physical environment. In 2D maps, this may at times require creating a 3D visualization out of 2D information. However, in a 3D model with realistic textures, that visualization is rendered unnecessary—the map user need only visually recognize that the image in the VE corresponds to the view in the physical environment.

1.2 Understanding referential spatial relationships

Lobben [22] identifies several cognitive processes involved in map-reading. In order to understand a map, the user must first understand the basic referential relationship between the map and the world it represents. This is a prerequisite for all decoding of information from a map. Levine [19] has characterized more closely how this referential understanding forms the basis for navigation. First, by the process of structure matching, specific cues in the map must be related to the corresponding specific cues in the world. However, one such match is not enough; a correspondence between one pair of points (A and A′) in two congruent spaces (e.g., a physical area, X, and a map, X′, representing that area) will not determine a correspondence between the entire spaces, as the rotation between them is unknown. According to the two-item theorem, another point of correspondence (B and B′) needs to be established in order to make the two spaces match completely. It is also possible to match one pair of points and one pair of directions, such as a landmark and the cardinal direction north. This results in what Bluestein and Acredolo [7] call a projection or superimposition of one space on the other. Based on the result from their study on children's ability to superimpose two spaces, it is a separate process from understanding the referential relationship between map and world, as it develops later. The on-line nature of navigational map-reading involves much more than the abilities studied in the laboratory, however. Liben and Downs [21] note that in addition to understanding the space–map relation described above, self-location requires two other relationships to be understood: the person–space relation (i.e., the person's position in space) and the person–map relation (i.e., the person's position in the map). They also note that self-location
differs from location of other points of interest in two significant ways. First, the current position on the map changes as one moves around in the environment. Second, human beings are so constituted that they face in one direction, and thus self-location necessarily involves orientation as well. Thus, in addition to locating the spot where one is, one needs to know the direction on the map in which one is facing in the physical environment. Much research has focused on this orientation of maps—the degree of rotation, both horizontal and vertical, between a map and the area it represents. Levine [19] put forth the idea that you-are-here maps are misleading if not aligned with the terrain, that is, if ‘‘up’’ on a vertical map does not correspond to ‘‘forward’’ in the terrain (the forward–up principle). This idea was confirmed by Levine, Marchon, and Hanley [20], who found that misaligned maps led to striking errors. In a notable study, Aretz and Wickens [1] established that misalignment between map and world correlates with response time, and suggested that not only must a horizontally misaligned map be mentally rotated into forward–up alignment, the 2D ‘‘up’’ of a vertical map must also be rotated into the 3D ‘‘forward’’ of the horizontal physical world. They also noted that when a stimulus is more difficult to rotate, such as when it is complex or not presented simultaneously with the other space, other procedures, like rotating the viewpoint instead of the object or analytically reasoning about the problem, take over. In line with this, Liben and Downs [21] suggest that physical rotation of the map is preferred to mental rotation. These results are particularly interesting for the present study, because 3D mobile maps are directional, while 2D maps are not (for 2D mobile maps, see also [2]). Self-location is most crucial when one does not know the map well enough to immediately recognize the current location, or when one does not know the environment well enough to understand where one is. The acquisition and use of such knowledge in the form of spatial representations is termed cognitive mapping. Learning about an area by physically being in it is referred to as environmental mapping, whereas learning that is based on other representations of the area, such as maps or aerial photographs, is termed survey mapping [22]. Visual detail and directionality of the 3D mobile map may help or prevent the use of such mental representations of space. 1.3 Research questions Our main question is: Question 1 (Q1). What is the role of the third dimension in helping users to localize objects in the real world? Thus, instead of asking whether users recognize 3D objects or remember spatial relations better after 3D
interaction, we ask how action on these maps relates to the type of representation. In the present study, three types of tasks will be used: pointing at a proximate object, pointing in the direction of a remote object, and navigating to a remote object. At the heart of the problem lies the fact that 3D and 2D cartographic representations are, inevitably, selective and incomplete. Paper maps use symbolic conventions that tend to call attention to street names, landmarks, and crossings. However, there are no conventions for 3D maps. For our study, we have adopted an approach of realism; we provide a model that is as accurate as possible and let the user select which way to utilize it. We have taken pains to build a reasonably accurate, semi-photorealistic 3D VE that includes facades, building geometry, street topography and names, statues, landmarks, and logos. Our system mLOMA is one of the first 3D systems working on a mobile phone [26]. Given this approach, it remains an empirical question how users attend to such models in their attempts to find matching features between the virtual and the physical environments. There is a multiplicity of ways to perceive such VE objects. We call any cluster of visual features that a user perceives in VE or PE a cue, and we ask: Q2. Which cues do users attend to when using 3D mobile maps, and how do they find the targets? Both map types also transform the viewport differentially as a response to the user’s commands. Prototypical 2D mobile maps, and the one we use in the trial, allow zooming and scrolling. Generally, a transformation may or may not be effective; the newly acquired view does not necessarily provide cues matching those visible in the PE. Interestingly, the nature of these transformations may be linked to the user’s bodily behavior. A field evaluation of rotation techniques revealed that users of 2D mobile maps prefer physical rotation of device, in hands, over automatic compass-based rotation. Rotation was claimed to be necessary for understanding the alignment of VE and PE, and to decrease cognitive load [35]. We therefore ask: Q3. Are 3D and 2D mobile maps associated with different patterns of bodily conduct or deployment of gaze? Bessa et al. [6] studied, in a laboratory setting, what features are utilized in orientation with a 3D map. The users were first shown photographs of a certain place and asked to identify the locations and verbalize the key features they used for identification. After this, they were brought to the same physical location (from a different direction than the photo) and asked to identify the same spot and specify the features that they used for identification. The key features
were divided into urban furniture (lamps, seats, etc.), buildings (including doors, windows), publicity (advertisements) and "other" (cars, trees, temporary elements). They concluded that only the overall geometry and a few key features were required for recognition and, further, that the selected features were not necessarily those most salient to the human visual system. Nothegger et al. [25] evaluated the feature salience of landmarks (selected automatically by their algorithm). The task was to look at panoramic images of intersections and name the most prominent facade. They found that subjects' criteria for selection varied notably (facade, size, shape, shape deviation, color, visibility, cultural importance, and certain explicit marks). It seems that many features are attended to, but no systematic studies exist to see how selection of a feature is linked to interactive or bodily behavior. As there are no prior studies with real 3D mobile maps on this matter, we decided to explore the question in an open-ended manner. Since 3D and 2D mobile maps differ not only in terms of representation, but also in terms of movement schemes, we had to take a stance toward what movement aids 3D maps should entail. Since, in principle, no one movement scheme can be expected to be optimal in all situations [41] and for all users [39], we decided to enable a larger array of methods of moving in the 3D VE. We instantiated many principles of support for orientation and navigation in desktop VEs [11, 15, 17, 27, 37, 41], such as direction indicators, landmarks, restricted movement, assisted camera schemes, position projections, street names, and animated viewpoint elevation. The idea is to let the user choose how to move in the VE and observe how this affects bodily conduct in the PE. Nevertheless, as we argue at length elsewhere, burdening users with all six degrees of freedom is not wise [27]; therefore, some restrictions have been made. The benefit of this approach is that we can learn how a given map feature is associated with unique search patterns. The drawback is that the multiplicity of movement schemes makes it more difficult to generalize from the results to 3D in general. To address this problem, we use a range of first-person (verbal reports, workload ratings) and third-person (video tapes, interaction logs) measures. A threat to validity of comparison is posed by the fact that people have much more experience with 2D maps than 3D maps. To ensure that our results do not trivially reflect the learning of 3D map functionalities, which would happen if the experimental subjects saw the 3D map for the first time when entering the trial, we recruited users experienced in 3D games (and familiar with paper maps), and trained them to use the UI on their desktop PCs, following ideas presented in [34]. We also trained them with the mobile version just before starting the trial. Ideally, we should have encouraged the users to use the 3D map also in
real localization and navigation tasks well before the actual trial, in order to even out the benefit 2D maps get from the years of real-world use experience users have with them. However, at the present time, this was impossible due to practical constraints. A potential limitation to the generalizability of our results is posed by the fact that we recruited only males for the study. This scoping was due to our inability to find a matching number (8) of female subjects who would be experienced with 3D games, particularly with first-person shooters, and would be otherwise (demographically) comparable to the male subjects. Previous literature has shown notable individual differences in spatial behavior attributable to gender [39], and we assessed that it is better to recruit only males than go forward with an unbalanced experimental design that would not afford a valid comparison between the genders. This scoping must be taken into account in the interpretation of the results. In this trial, we opted not to use GPS for real-time positioning of the user, for two reasons. First, most present-day phones do not carry GPS and it is unlikely that the majority will have GPS in the next couple of years. Second, previous studies and our own experiments have revealed that GPS errors in built environments (so-called urban canyons) introduce problems that are unacceptable and undermine the achieved benefits [9]. This is particularly true in pedestrian applications, where the user may stand still in a GPS shadow for longer periods. Moreover, due to the interactive and very rapid nature of mobile map interaction, we deemed it impossible to use a Wizard-of-Oz approach where a moderator would have plotted the user on the map in real time. However, in the Sect. "Discussion", we estimate the effect of the lack of positioning technology. To sum up, the present paper should not be interpreted as an evaluation of a 3D mobile map, but as an exploratory empirical study of how its properties and affordances feature in mobile spatial interaction. We will explore users' strategies through the use of video recording, performance measures, verbal protocols, and high-fidelity logging of interactions. A backdrop for the present study is a pilot study (n = 8) conducted with a previous version of m-LOMA that we published as a technical report [29]. In that paper, we also compared 2D to 3D, but at that point we did not have at our disposal a real street map or navigation assistance for 3D. The 2D view was basically a top–down 3D view that did not have street names, and the 3D was a street-level view without navigation assistance like tracks. Unlike in the present experiment, where the tasks start in front of the to-be-pointed-at building, all pointing tasks in the pilot started from the same map location as where the user was standing, providing an advantage for the 3D. Given that 3D was compared to an impoverished 2D without street names, it is understandable that 3D was found to be better in performance. In the present experiment, we try to construct conditions that create a more realistic comparison between the two representations. Moreover, in 2004, we did not have the video recording system we use in the present experiment, and we were therefore not able to carry out our analysis of gaze or bodily conduct, which are the main topics of the present paper.
2 Mobile map design

m-LOMA, a mobile map prototype, was custom-tailored for the N93—one of the first Nokia phones with 3D-accelerated hardware. The N93 features a 240 × 320 pixel display in a clamshell design (see Fig. 2f). Details on m-LOMA's implementation are given elsewhere [26, 27]; the following focuses solely on features used in the present study. There are a few characteristics that distinguish m-LOMA and are worth keeping in mind when reading the paper:
1. The core of m-LOMA is a semi-photorealistic model of the Helsinki city center based on photography of building facades and simple geometry.
2. The 3D interface provides two perspectives to the model (street-level and bird's eye), basic maneuvering, and several optional navigation aids.
3. The 2D map is an official (raster) map of the city, intended for navigation, with emphasis on street names.
4. To ensure rapid use, all controls were mapped to the keys of the N93. Invoking commands from a menu was not needed.
5. A satisfactory frame rate for 3D rendering was achieved, which is necessary to ensure that results do not reflect solely the effects of lag.
2.1 3D model and graphics

m-LOMA models an area of approximately 2 × 2 km in the center of Helsinki. In practice, an area of 1 km2 is fully modeled and textured. This area's eastern part includes the Senate Square, a church, and a district with 19th century two-storey houses and newer government buildings. The western part covers a shopping street, the northern part of a campus, and the southern part of a park. The virtual model of the cityscape includes facades, statues, parks, streets, and rooftop logos. The 300 modeled buildings are low-polygon models, generally simple geometrical boxes with walls and a realistically shaped roof. Prominent buildings are more accurately modeled. Statues are implemented as textured billboards. While the model is lightweight in geometry, all buildings are individually textured based on digital photographs of the facades (Fig. 2c). One pixel in a texture corresponds to about 10–20 cm in a real object. Several steps have been taken to achieve an acceptable frame rate (see [26]). Frame rates during real use with the N93 are high (30–60 fps) and even faster at street level (up to 200 fps).

Fig. 2 The user interface of m-LOMA

2.2 3D interaction

2.2.1 Basic maneuvering and views
Scripted actions cycle the viewpoint, in an animated fashion, between two pre-selected views [17]. In the street-level view, the view is tilted slightly upwards to reveal the details of the buildings. In the rooftop view, it is tilted downward, but keeping the horizon in the view (Fig. 2c, d).

2.2.2 Assisting functionalities

Five assisting functionalities were provided for 3D:
1. The simple, unrestricted basic maneuvering mode, the default in both views, follows the flying metaphor. The viewpoint can be moved backward/forward and left/right, and rotated vertically/laterally. Angular velocity is always constant, but the motion is faster at a higher elevation [37]. Ascend/descend raises/lowers the viewpoint rapidly. (Please note that the use of a deprecated Symbian S60v2 clock function caused a slight instability in basic maneuvering, but this did not markedly affect task completion times.)
2. With the Tracks function, movement is constrained to pre-defined paths [15]. When Tracks mode is initiated, the viewpoint animates to the nearest street and the view direction is set along it. An arrow points in the direction of movement. Following the idea of maximizing orientation value, rotation is allowed, but it automatically increases the distance to the opposing facade to provide a better view (Fig. 2g, h). Street names become visible (Fig. 2e). The name of the current street is placed in the upper part of the display and the next choice points (i.e., crossings) in the lower part.
3. The Landmark view function turns the viewpoint towards the closest visible landmark, marks both that and the initial start point with Marker Arrows, and allows the user to move the viewpoint around freely. Further presses advance the view to the next landmark, and a separate button returns to the normal view.
4. Orbiting is a mode of cylindrical movement either towards/away from the target, or around it, but holding the view always towards the target (see [37]).
5. We implemented a Fly-to-target functionality as a scripted action. It flies the viewpoint to a target, which in the experiment was always the to-be-located POI.
An orange Marker Arrow displays direction and distance to the task POI. It appears solid when the target is in sight, and outlined when the target is occluded (see Fig. 5).
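The elevation-dependent speed of basic maneuvering (constant angular velocity, but faster translation at a higher elevation, following [37]) can be sketched roughly as below. This is a minimal illustration of the speed-coupling idea, not m-LOMA's implementation; the reference elevation, base speed, and frame time are assumed values.

```python
def translation_step(speed_at_street, elevation, reference_elevation=2.0, dt=0.033):
    """Distance moved in one frame: roughly constant near street level and
    proportionally faster when the viewpoint is raised, so that the apparent
    flow of the ground below the camera stays comparable across elevations."""
    scale = max(1.0, elevation / reference_elevation)
    return speed_at_street * scale * dt

# Walking-pace base speed (1.4 m/s) at street level (2 m) vs. a rooftop view (60 m):
print(translation_step(1.4, 2.0))   # ~0.05 m per frame
print(translation_step(1.4, 60.0))  # ~1.4 m per frame
```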
2.3 2D graphics and interaction The 2D map is an official raster-format street map from the Helsinki City Survey Department. Streets, public buildings, building blocks, and parks are distinctively colored. The street network is exaggerated to increase readability. Street numbers are provided, as are labels of well-known areas and landmarks. The map is scrollable and zoomable, but the quality suffers at extremes (Fig. 2b). 2.4 Controls for 2D and 3D All controls were bound to the keys of the N93. In addition to the normal phone buttons, the N93 contains a joypad with a center button, which were used for movement. We allowed simultaneous use of more than one button to support two-handed use. A special button was assigned the end-of-test functionality. Small white stickers on the keys explained functions, while unused keys were masked with black stickers (Fig. 2f). In the experiment, it was not possible to shift from 3D view to 2D map view, or vice versa.
3 Method

In general, the method follows principles of quasi-experimentation in HCI as presented in [30]. The pointing paradigm was modified for the purposes of the present paper, based on experiences from the pilot study [29]. This paradigm has been used for studying the orientation of spatial knowledge to the physical environment. In it, a target is indicated on a map and the subject is asked to point out the corresponding location in the physical environment (e.g., [36]). There is a converse version, in which the target is indicated in the physical environment and subjects are asked to locate it on a map (e.g., [28]), but this was not used in the present experiment. Both proximate (visible from the current standing position) and remote buildings were utilized as targets. Moreover, in navigation studies, it is common to ask subjects to move to a target. In this experiment, navigation was the second task type we utilized. Other characteristics of the method include the following:
1. The participants were young males experienced in 3D PC games. They had familiarized themselves with a desktop version of m-LOMA before starting the trial.
2. The experimental design featured carefully chosen measures to counter all known nuisance variables.
3. Spatial skills were measured with two tests.
4. Multiple methods of data collection were employed. Video data, verbal protocols, and interaction logs were analyzed in parallel.
5. Several indices of bodily behavior and gaze deployment were manually coded from the video data with an accuracy of 1 s.
3.1 Participants Nineteen subjects were recruited, all of whom played 3D PC games and used mobile phones regularly (three had to be excluded from analysis due to technical problems; see Sect. ‘‘Recording apparatus’’). None of the subjects knew the site of the study well. Their mean age was 23.3 years (range 18–35 years). For abovementioned reasons, all subjects were males. They were rewarded three cinema tickets for their efforts. 3.2 Tasks In all tasks, the to-be-localized POI was marked in the VE: in 2D with a filled circle, and in 3D with a Marker Arrow on the building facade, viewing it on street-level view from a distance of about 15 m. The instruction was to indicate the corresponding building in the physical world to the experimenter as quickly and accurately as possible. There were three subcategories of tasks (Task): (P) Proximate pointing: The target, in view from the current position in PE, was to be pointed at with one hand. (R) Remote pointing: As in P, but the target is not in view from the current position. The instructions for R and P tasks were the same, so that the user did not know whether the target was proximate or remote. (N) Navigation: As in R, the target is not in view. The task is not to point at but to walk to the site of the target,
stopping on the pavement on the target’s side of the street. 3.3 Route, sites and targets There were four objectives in creating a route: to spend no more than 2 h in the field, to minimize the possibility of easily guessing the answer, to minimize the effect of learning from one visited site to another, and to eliminate learning of the area used in one map condition (2D vs. 3D) to another. To satisfy these objectives, we divided the route, 2.46 km in total, into four loops—A, B, C and D. Each site of a task was removed at least 50 m from the previous along the same loop. Each loop contained two tasks of each type—P, R, and N—yielding altogether 24 tasks per subject. Since four loops do not allow for completely balancing the task sequences, two loops (A and C) featured the task sequence P–N–R–P–N–R, while two (B and D) featured the reverse R–N–P–R–N–P (Fig. 3). All targets were buildings or statues randomly selected from a list of possible targets. All POIs were presented from a viewing distance of about 15 m in the VE. R targets were all in the same map half (northerly-westerly or southerly-easterly) as the site, but not visible from any point along any of the loops. N targets were all around at least one corner along the loop, so that they would not be visible from the starting point. N targets also had to be at least 50 m from the starting point of both the current and the next task, again in order to minimize learning effects. No major landmarks were picked as targets.
3.4 Procedure

Before the trial, participants were asked to use a version of m-LOMA on their PCs. The package contained both a 3D model and a 2D map of a fictitious town, "Fakeham City". They were asked to familiarize themselves with all 2D and 3D features of m-LOMA, which were thoroughly explained in a manual accompanying the package. The participant was brought to a room outside the route to complete a Corsi test for assessing visuo-spatial working memory (WM) span [3] and an adaptation of the Manikin task for assessing spatial rotation skills [4]. Then, (1) the think-aloud procedure, (2) the pointing task, (3) the NASA TLX scale (translated into Finnish [29]), and (4) the use of the N93 were practiced.
Fifteen practice tasks with the Fakeham map were completed on the N93 while thinking aloud. The tasks covered all map functionalities. Lastly, the procedure was described and the camera equipment set up. On a site, the participant was told where to stand, in what direction to face and which number to choose in a task menu. The initial facing direction was randomized according to a pre-made table. If the participant was silent for more than 10 s during the trial, he was reminded to think aloud. When the participant had found the target, he pointed at it with his hand, pressed the assigned task-end key, and started post-trial debriefing, recounting what he remembered of the task. He then stated the certainty of his answer (0–10) and rated the workload verbally. At no point was he told whether he was correct or not. While moving to the next site, the experimenter engaged the participant in light conversation, in order to distract him from memorizing his surroundings. During the trial, the experimenter answered only questions concerning task instructions or the N93's keys. After the last task, a background information form was completed.

3.5 Experimental design

The experiment was a 2 × 3 within-subject design, the factors being:
1. Map type: 2D/3D.
2. Task type: P/R/N.

Fig. 3 The route, the four loops, task sites (filled circles), inter-site transitions (regular arrows), and navigation paths (arrows marked with the letter N)
With 16 participants and 4 loops with 2 of each task type per loop, a full counterbalancing could not be achieved. Furthermore, the loops could not be reversed, since navigation from point X to Y is not the opposite equivalent of navigation from point Y to X. Consequently, the tasks were counterbalanced using the following manipulations:
1. Half of the participants performed 2D tasks in loops A and B and 3D in loops C and D, the other half vice versa.
2. Half of the participants performed 2D tasks first, half 3D.
3. Half of the participants walked loop A before B, half B before A, and within these two groups half walked C before D and half D before C.
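To make the scheme concrete, the sketch below simply enumerates the cells implied by these manipulations. The participant numbering and grouping are illustrative only, not the actual assignment table used in the study.

```python
from itertools import product

# The manipulations above, written out as a 2 x 2 x 2 x 2 grid of cells.
loop_assignment = ("2D on A+B, 3D on C+D", "2D on C+D, 3D on A+B")
map_order = ("2D block first", "3D block first")
ab_order = ("A before B", "B before A")
cd_order = ("C before D", "D before C")

for i, cell in enumerate(product(loop_assignment, map_order, ab_order, cd_order), 1):
    print(f"participant {i}: " + "; ".join(cell))
```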
Other manipulations, like randomization of initial facing direction, are explained in Sect. "Tasks".

3.6 Recording apparatus

Two methods were used to record interaction. First, m-LOMA created full logs of interaction with the mobile map. Second, video data was recorded from one moderator-controlled camera and three user-worn cameras: one showing the display of the phone, one pointing at the face of the user, and one hanging around the user's neck. A small microphone was contained within the necklace camera. All A/V output was combined on the fly by a video hub to a single output. The wireless receiver, the video hub, recorder, and necessary batteries were placed in a backpack worn by the subject, except for one subject, who wore them on a belt (Figs. 4, 5).
3.6.1 Technical problems

Due to critical technical problems, three trials had to be rerun with new subjects. There were other non-critical problems stemming from only handycam data being available (31 tasks), image quality problems (10), sun-glare problems (4), and the experimenter not being able to shadow the user quickly enough when moving (22). Nevertheless, because of the redundant use of cameras, all codings could be completed with these data, although the certainty of some categorizations, like Facing, suffered. To examine this problem, we ran all statistical tests twice; the effects reported below hold with and without the problematic data included. In addition, altogether three tasks from one trial were excluded due to faulty task instructions. Finally, since only the four-cam recording included the sound from the microphone around the user's neck, verbal protocols are missing from these 31 tasks.

3.7 Video analysis

We analyzed a total of 127 P tasks, 127 R tasks and 127 N tasks, altogether 12.2 h of video material of task performance. For the analysis, we integrated all video data with full reconstructions of interaction in the VE (see our video figure, part of the Electronic Supplementary Material). The resulting videos were ca. 2,000 kbps MPEG-1 streams, with a resolution of 520 × 320, a frame rate of 25 fps and an MPEG-1 Layer 3 64 kbps mono sound track.

3.7.1 Coding of events and behavior

The following taxonomy was the basis for coding of the data:
1. Facing. The user looks: at the device, forward, left, right, straight up, up and to the left, or up and to the right.
2. Body turning. The user turns his body using his feet, left or right.
3. Device turning. The user rotates the device in his hands, clockwise or counterclockwise.
4. Head tilting. The user brings one ear toward the shoulder: to the left, right, or upright (back to the normal position).
5. Walking. The user moves his feet more than 50 cm.
The time code of each event referred as closely as possible to the start of that event. A precision of 1 s was deemed to be the maximum attainable, yet enough for the purposes of the study. InqScribe version 2.0.2 was used. In addition to moment-per-moment action, the accuracy of answers (directional pointing errors) was coded as (1) correct within a 10° margin, (2) correct within a 22.5° margin, (3) incorrect, with the target off to the left or right of the subject (45° margin), or (4) incorrect, with the target behind.
Fig. 4 The recording equipment. The user carries one camera on his chest (1), and two attached to the mobile device (2 and 3). The moderator follows one step behind, shooting with a wide angle lens (4). All A/V data is integrated on the fly by a recorder the user carries in a backpack (5; weight 1.5 kg)
3.7.2 Coding procedure and reliability

The material was evenly divided between two coders. They coded the same selected tasks of a subject, discussed discrepancies and elaborated the coding scheme. They then moved on to another subject's data. When one coder encountered difficult sections in the video material, the two coded that section together. This was continued until consensus on all definitions was settled. Eventually, all
Fig. 5 A frame from an integrated video used for analysis
data was coded or re-coded with the final, agreed-on coding scheme. The main problems arose from the trials in which technical problems with the cameras occurred (reported above), but which were nevertheless coded in their entirety. Finally, to assess reliability, the two coders coded, independently of each other, a subset of the data. For this sample (238 events), κ (Cohen's kappa) was 0.75 for the overall coding (all variables included). This κ can be interpreted to indicate a good level of agreement.
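For reference, Cohen's kappa is the observed agreement corrected for the agreement expected by chance from the two coders' marginal label frequencies. The sketch below is a minimal illustration, not the analysis code used in the study; the category labels in the example are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same events."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # Agreement expected if each coder labelled independently at their own rates.
    p_expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

coder_a = ["facing_device", "walking", "body_turn", "walking"]
coder_b = ["facing_device", "walking", "device_turn", "walking"]
print(round(cohens_kappa(coder_a, coder_b), 2))  # -> 0.64
```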
3.8 Qualitative analysis

Word-for-word transcriptions of concurrent and post-trial verbal protocols were examined together with the integrated videos. The analysis proceeded one task type and one map type at a time, focusing on two issues: (1) enumerating the cues mentioned in a random sample of 75 tasks and (2) the user's logic of how these cues help to find the target.

4 Results

Before continuing, we encourage the reader to view the video figure (part of the Electronic Supplementary Material), which shows several extracts from the data. In this section, we recount the main findings first in the qualitative and then in the quantitative data.

4.1 Qualitative analysis: general form of search

Since a 3D task started with the POI's facade in view, users often wanted to get more details of the target, and thus started the task with a scan around the target, at times progressively expanding it to include more buildings and a larger area. The faster way of locating targets, possible only with a 3D map, was by means of direct matching: if the to-be-sought-for POI has salient visual features, one can horizontally scan the PE in the hope of finding matching features. This tactic failed often, partly because the participants did not know beforehand whether the target was visible in the PE, partly because the targets included non-salient buildings. Predominantly, both 2D and 3D tasks were solved indirectly, by finding a point that maps a location in the VE to one in the PE (hereafter: a reference point). One can infer the target's direction by estimating the angle difference D between the target and the reference point; the target can then be pointed at D° to the left or right of the reference point, assuming that one knows one's own position in relation to the reference point. One can achieve this also without locating oneself on the map, by "sandwiching" the target between two external reference points. We observed three methods of searching for reference points:
1. Cue scan: actively looking for a salient cue in one environment and then searching for the corresponding cue in the other.
2. Primed search: using pre-knowledge of the VE or PE to elect a candidate reference point, and then searching for the corresponding site.
3. Ego-centric alignment: aligning the position and view in the VE with the position and view in the PE.
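As an illustration of the indirect, reference-point-based inference described above, the sketch below computes the angle D by which a user would point to the left or right of a reference point. It assumes planar map coordinates taken from the VE model; the coordinates in the example are invented for illustration.

```python
import math

def bearing_deg(from_xy, to_xy):
    """Compass-style bearing from one planar point to another, in degrees."""
    dx, dy = to_xy[0] - from_xy[0], to_xy[1] - from_xy[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0

def pointing_offset(own_pos, reference_pos, target_pos):
    """Angle D to turn from the direction of the reference point toward the
    target: negative = to the left, positive = to the right."""
    d = bearing_deg(own_pos, target_pos) - bearing_deg(own_pos, reference_pos)
    return (d + 180.0) % 360.0 - 180.0

# Example: reference building due north of the user, target to the north-east.
print(pointing_offset((0, 0), (0, 100), (70, 70)))  # ~45 degrees to the right
```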
Over all of these, the sets of cues to which users orient are non-overlapping for 3D and 2D (see Table 1). In 3D use, cue scan proceeded mostly by scanning buildings that surround the target in the VE, for example to spot a yellow building in the midst of gray ones. While street-level scan was typically inefficient, many participants learned to ascend to the rooftop view and rotate there, in search of statues, parks, recognizable buildings and rooftop logos of companies. If this did not work, they "flew" around in the area above the target. Many times a user would turn on the Tracks feature to see street names and use them as cues. Cue scan does not have to happen from VE to PE; cues seen in the PE can also be primary. Our 3D users rarely took the effort to walk around in the area to find better cues, although there were cases of walking to a crossing in the PE in the belief that it would be easy to spot in the VE. In contrast to 3D use, where we saw plenty of strategy changes, cue scan in 2D use was remarkably more straightforward. Users typically read the name of the street by the target. They also used some cue types, like street numbers, that were not available in 3D (Table 1). The use of street numbers turned out to be an inferior strategy, however, because the corresponding information in the PE was difficult to obtain and because not all street numbers are marked on the map. Primed search was based on users' pre-knowledge about the direction of a street, landmark, or area. To "hunt" for known landmarks, 3D users often rotated or flew around in the rooftop view, or used the Landmark view function repeatedly to view nearby landmarks. In 2D use, the hunt for a specific street often took place simply by scrolling the 2D map. In comparison to 3D, where searching for a known street often involved movement along Tracks, street search in 2D seemed to be less random, as if guided by some intuition as to the area where the street is.

Table 1 A catalogue of cues used and their frequencies, based on analysis of think-aloud data of 75 tasks

Cue type                      3D            2D
Known landmarks               Very often    Often
Building shapes               Often         –
Facade details                Often         –
Facades (whole)               Often         –
Relative directions           Often         Sometimes
Street names                  Often         Very often
Street crossings              Sometimes     Often
Blocks, or parts of blocks    Rarely        Sometimes
Parallelism of streets        Rarely        Rarely
Cardinal directions           Very rarely   Sometimes
Store/office names            –             Rarely
Street numbers                –             Rarely
The advantage of ego-centric alignment is that it enables pointing towards the target directly, without first mentally inferring its direction. In 2D use, this was typically done by turning the device to make the streets in the VE and PE parallel. To this end, some users walked a few steps from the initial standing position in the PE to read a street sign. In 3D use, ego-centric alignment was performed by placing the camera at the spot in the VE corresponding to the current position in the PE. Although preferred, this strategy was not as easy in 3D use, because it demands localizing one's own position and viewing direction, which in turn requires triangulation with the buildings surrounding one's current location. Interestingly, 2D users at times based their solution on memory for the route traveled in the experiment. They "traced" their movements from there to the current position and could then use the end result for ego-centric alignment. This tactic worked quite well, as if 2D users were able to do spatial updating while walking from one site to another.

In navigation (N) tasks, we saw that both 2D and 3D use involved attempts to plan the whole route in advance. Zoom out was used in 2D, and perspective change in 3D. If this did not work, users sometimes used "divide and conquer" in both 2D and 3D; that is, they determined only one leg of the route, stopping on the way to plan the next leg. In 3D this was achieved by using Tracks mode to benefit from street name and crossing data. In an ego-centric alignment strategy operating in N tasks, some users, uncertain of their current position, first walked towards a known landmark, this serving as a way of "putting oneself on the map". 3D users also used the camera as a "scout" by moving it further ahead along the route to preview the area. Many times the user did not stop immediately when the target direction was inferred, but gathered confirmatory evidence for his conclusion, typically by trying to establish yet another reference point such as a nearby street corner. In many tasks, this also served the micro-level inference of which side of the street a (non-visible) target is on; e.g., "on the Cathedral's side", or "on the north-east corner".

The above description portrays interaction as straightforward action, as if a course of action was planned and, step by step, its execution led to the solution. In fact, 2D use appears to fit this characterization well, whereas our data on 3D use is swarming with opportunistic behaviors. 3D users frequently changed their strategies, trivially in the face of a dead end, but more often due to new opportunities perceived as a result of movement or perspective change.

4.1.1 The importance of self-location

Self-location has been claimed to be a central element of map tasks [21]. The present results support that claim, as the participants did explicitly look for the "I am here" (IAH) point in about of the studied cases and in both of the map types. The pervasiveness with which self-location was performed suggests that participants not only tried to use it in solving tasks, but that they relied on it as an effective step towards completing the task. Although it was not always performed right at the start of a task, it was usually performed also after the participant had located some known cue near the target on the map, that is, after he had established the general area of the target. This can be taken to imply that knowledge of the IAH point is important for precise and confident performance even when there is already some information on the whereabouts of the target. Thus, the data suggest that self-location is a crucial procedure in solving real-world map tasks. When the participants did not locate their own position on the map, it was because they did not feel that they needed to know it exactly. The reason for this was either that they realized that the target was very close or that knowledge of the PE guided them in the right general direction. It should be noted, however, that at least two erroneous performances can be attributed to neglecting to locate the IAH point. Hence, although it is possible to bypass self-location on the basis of knowledge of the PE, doing so may cause decreases in efficiency and accuracy in completing the task. When self-location has been completed, it is used in combination with alignment to infer the direction to a target. While walking to a navigational target, the map user keeps track of the current location by generating expectations from the information in the map and checking these expectations against the environment, or conversely by choosing cues in the environment and checking those against the map, especially at street corners. These results add further support to the claim that locating the current position on the map provides important information for solving tasks. Furthermore, they show that in navigation, it is equally important to continually update that position as one's location changes in the physical environment.

4.2 Quantitative data: performance, action, experience

We now turn to quantitative measures, focusing mostly on two dimensions: differences between the representations and differences among the three task types. The results show not only dramatic differences between the map types in favor of 2D, but also reveal explanations that are indicative of how strategies differ between the two representational formats. In the following, unless otherwise mentioned, all F statistics are from General Linear Models (GLMs) with Map (2D/3D) and Task (N/P/R) as two within-subject independent variables (IVs). The problem with running several statistical tests entailing multiple families of interrelated dependent variables (e.g., "task performance" is a DV family of speed and accuracy) is that the probability of experiment-wise Type I error is elevated. We hence chose to employ a conservative α level of 0.01.

4.2.1 Task performance

The first task in analyzing data from an experiment utilizing the pointing paradigm is to see if task conditions are associated with differential pointing performance. If this is the case, speed–accuracy trade-offs must be taken into account in subsequent analyses, which complicates matters. Luckily, directional pointing error was at the same level in the two Map conditions, χ²(3) = 6.2, p = 0.12. There was a borderline-significant effect of Task, χ²(6) = 14.6, p = 0.02, reflecting a somewhat lower accuracy in R tasks in comparison to P and N tasks. This was expected, because the latter task types both end with the target in sight in the PE, and are thus easier to confirm than R tasks. Other accuracy measures echo this pattern. There was no effect of Map on reported certainty of the answer, F(1,15) = 0.47. The absence of effects of Map on error and certainty is a positive finding, because it implies that the effects we report in the following do not trivially reflect user errors. Task completion times were significantly better for 2D than for 3D (M = 86.7 vs. 142.7 s), F(1,15) = 51.0. As expected, there was a significant effect of Task, F(2,30) = 51.3, as well as a significant interaction effect of Map × Task, F(2,30) = 9.1. Figure 6 portrays the situation. It shows that the handicap of 3D is not brought about by a single task type but appears to be proportionate to task completion time. This makes it likely that the handicap of 3D was not due to a single time-consuming operation of approximately constant length, such as finding one's own position could be.

Fig. 6 Task completion times. Vertical bars denote 99% confidence intervals (CIs)

Fig. 7 Distance traveled in the mobile map. Vertical bars denote 99% CIs

4.2.2 Cognitive load

Measures of felt workload echo the findings on task completion times. To capture workload, we summed three NASA TLX ratings: mental load, physical load, and temporal load. There were significant effects of Map, F(1,15) = 59.6, and Task, F(2,30) = 22.8, but their interaction was non-significant, F(2,30) = 2.8, p = 0.07. The pattern is similar to that of task completion times: 2D was better, as it was associated with significantly lower workload (M = 4.9) than 3D (M = 7.0). Not surprisingly, N and R tasks were more demanding than P tasks (Ms 7.4, 6.0, and 4.6, respectively).
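For readers who want to run an analysis of the same shape as the GLMs reported in this section (Map and Task as within-subject factors), the sketch below shows one way to do so with a repeated-measures ANOVA. It is illustrative only: the data frame is randomly generated, the column names are assumptions, and the original analyses were run as GLMs rather than with this particular library routine.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per participant x Map x Task cell; cell means here are fabricated
# around the reported grand means (87 s for 2D, 140 s for 3D).
rng = np.random.default_rng(0)
rows = [
    {"subject": s, "map": m, "task": t,
     "time_s": rng.normal(140 if m == "3D" else 87, 20)}
    for s in range(1, 17) for m in ("2D", "3D") for t in ("P", "R", "N")
]
model = AnovaRM(pd.DataFrame(rows), depvar="time_s",
                subject="subject", within=["map", "task"])
print(model.fit())
```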
4.2.4 Deployment of gaze
4.2.3 Interaction with the mobile map Measures of interaction with the mobile map reveal that whereas 2D and 3D users traveled about the same distance when accomplishing a task, 3D users had to spend much more time traveling the same distance because movement is slow particularly in the street-level view and because movement in the bird’s eye view is directional and requires continuous rotation and use of navigation aids. Map did not have a significant effect on total distance traveled in the VE, F1,15 = 3.5, p = 0.08. As expected, Task did have an effect, F2,30 = 22.5, as did the Map 9 Task interaction, F2,30 = 13.3. Figure 7 conveys that there were virtually no differences between 3D and 2D in other task types than N. In N tasks, 3D was poorer than 2D. The 3D functionalities were used as follows (unit: times on average per task): 1. 2. 3. 4. 5.
The analysis of deployment of gaze is the other of key analyses in this paper. The statistical testing was based on comparisons of various measures of user ‘‘time-sharing’’; that is, measures of sharing of attention between the PE and the VE [23]. Map had a significant effect on the average length of a gaze at the PE, F1,15 = 22.7, as did Task F2,30 = 19.9, and their interaction, F2,30 = 9.0. Map also had a significant effect on proportion of time per task spent looking at the PE, F1,15 = 79, as did Task, F2,30 = 111.3, and their interaction, F2,30 = 19.9. Figure 8 depicts this pattern. Users looked more at the PE when using a 2D map, and this difference is emphasized in N tasks. When analyzed more closely, it turns out that these differences do not stem from looking up or to the sides: for side-looking F1,15 = 2.3, p = 0.14, for up-looking
Switch perspective 1.8. Tracks 1.3. Show landmark 0.7. Orbit 0.3. Fly-to-target 0.2. For the 2D map,
1. 2. 3.
move north/south was used on average 2.2 move east/west 3.0, zoom in 2.0 zoom out 5.8
Later on in analysis of practice effects, we will report that, with practice, 3D users learned to use the switch perspective functionality and used that to shift to a strategy that resembles the use of a 2D map.
Fig. 8 Proportion of attention-sharing to PE. Vertical bars denote 99% CIs
123
Pers Ubiquit Comput
F1,15 = 4.7, p = 0.046. 3D users did not look at the device more often than 2D users, as measured by the frequency of looking at the device, F1,15 = 0.27. There was an expected effect of Task, F2,30 = 20.8, but no interaction, F2,30 = 0.49. As a possible explanation, a stronger difference between the 3D and 2D conditions was found for the forward-looking measure, on which Map had a significant effect, F1,15 = 14.7, as did Task, F2,30 = 61.6, whereas their interaction was not significant, F2,30 = 0.23. There was proportionately more forward-looking in 2D than 3D. For the Task IV, N [ P [ R. Taken together, 3D maps are associated with lengthier gazes at the PE, but the frequency of switching is at the same level with 2D maps. 2D users must be able to extract more information in less time. They seem to do so particularly by raising their gaze from the device to look forward. As will be noted below, this is associated with tactics to use the body to move the field of vision. 4.2.5 Bodily action Several measures of bodily action were analyzed. The representational differences are reflected in how often users interact with the physical environment, and when they are more likely to physically turn and move the device, instead of using virtual commands. There were significant effects of Map and Task on the amount of walking, and a significant interaction effect thereof; F1,15 = 7.1, F2,30 = 211.9, F2,30 = 6.0, respectively. In absolute numbers, 3D users spent more time walking, this difference stemming from N tasks where the difference to 2D was 14.8 s. Proportionately, however, users actually did not walk as long in 3D as in 2D tasks: F1,15 = 45.7. Given that 2D users were faster in their tasks, this means that they walked more efficiently, which may explain also why 2D generally outperformed 3D in measures of task performance. Moreover, there was an interaction effect Map 9 Task, F2,30 = 21.5. To compare 2D and 3D for Task, we ran a Scheffe’s test, which yielded p \ 0.001 only for N tasks. Finally, the rest of the indicators strongly indicate that 2D users utilized as tactics more (1) tilting of the head, (2) turning of the device, and (3) turning of the body. Actually, turning of device and tilting of head took place almost entirely in 2D tasks. Moreover, overall, there was more body turning in 2D tasks. These three effect of Map were all significant, all F2,30 [ 8.7 (Fig. 9). 4.2.6 Practice effects Although the users were trained in the desktop version of m-LOMA, they were novices in using the mobile version for real-world tasks. We therefore needed to analyze
4.2.6 Practice effects

Although the users were trained in the desktop version of m-LOMA, they were novices in using the mobile version for real-world tasks. We therefore needed to analyze practice effects. If practice effects were found, it would be interesting to know whether performance approached the level of 2D performance and whether they were associated with qualitative shifts in search strategies. We studied practice effects by dividing each subject's tasks, 2D and 3D separately, into four "Phases". We then ran GLMs with Map and Phase as within-subject IVs on several DVs of interest. For task completion time, there was a clear effect of Phase, F3,45 = 8.5, but no interaction with Map, F3,45 = 2.4, p = 0.08. Cognitive load also decreased significantly with practice, F3,45 = 10.3. Despite the absence of a reliable interaction effect, F3,45 = 2.1, p = 0.11, the trend suggests that cognitive load decreased somewhat faster in 3D use, from a difference of 36% in Phase 1 to a difference of 22% in the last Phase, when compared to 2D. A practice effect was also visible in the distance traveled in the VE, F3,45 = 11.0, but again the interaction effect was not significant, F3,45 = 0.47. Taken together, practice effects appear to be universal and not specific to either of the two map types.

It is of further interest to know whether the users changed their use of map features with practice or whether the practice effects reflect learning of the map content. If we adopt an (admittedly arbitrary) criterion of a 5% change in the frequency of use, and compare Phase 1 to Phases 3–4 (the two periods between which the largest improvements were observed), the following pattern emerges: for most functionalities, frequency of use does not change from Phase 1 to the last three Phases. The only functionalities for which frequency of use did change were:
1. Orbiting, which fell by 19.8%
2. Show landmark, which fell by 14.9%
3. Fly-to-target, which fell by 8.1%.
In other words, users learned to avoid these sophisticated assistive functionalities. By contrast, the use of Tracks increased by 7.4%. When it comes to changes in bodily behavior, 3D users learned to look up less in the VE (6.0%) but more in the PE (6.7%). This echoes findings from studies of desktop VE use: with practice, 3D VE users converge on a narrower set of strategies [31]. Moreover, these results add to this pattern that bodily interaction becomes preferred over virtual interaction. By comparison, 2D shows no changes in how its features are used. Only the proportion of time spent looking up in the PE changed, with a 6.8% increase from Phase 1 to the last three Phases. Nevertheless, in statistical testing, although the trend is visible, none of the abovementioned differences satisfied the strict alpha criterion of 0.01. The effect on the amount of Orbiting came closest, F3,45 = 3.2, p = 0.03.

4.2.7 Influence of spatial abilities: a post hoc analysis

Finally, we examined whether spatial ability measures were associated with task performance or strategies. Since spatial abilities were not originally part of the experimental design, we are forced to rely on a post hoc analysis. The results tentatively suggest that 3D performance is more dependent on spatial skills than 2D performance, most likely because the narrower viewport places more demands on memorizing spatial features. A GLM with Map as the within-subject IV was run, first with the score in the Corsi task and then with the score in the Manikin task as a continuous predictor for task completion time. For the former, the test score came quite close to significance, F1,15 = 6.2, p = 0.03. For the latter, the effect was not significant, F1,15 = 0.23. For other DVs, we found no effects. We therefore investigated the first result by dividing subjects into two groups according to a median split of their Corsi scores. The results indicate that spatial visual working memory span may be associated with high performance in 3D but not in 2D. The high-span group was 16% quicker in 3D than the low-span group, and they used Switch perspective 26% more. They gazed at the environment 11.8% less per task, although the two groups' frequency of gazing at the environment was at the same level.
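As an illustration of the post hoc analysis described above, the following sketch performs a median split on the Corsi score and compares mean 3D and 2D completion times between the resulting span groups. The data source and column names are hypothetical; this is not the study's actual analysis code.

```python
import pandas as pd

# Hypothetical per-subject summary: Corsi span score plus mean task completion
# times (seconds) for the 3D and 2D conditions. Column names are illustrative.
subjects = pd.read_csv("subject_summary.csv")  # subject, corsi, time_3d_s, time_2d_s

# Median split on the Corsi score into low- and high-span groups.
median_corsi = subjects["corsi"].median()
subjects["span_group"] = (subjects["corsi"] > median_corsi).map(
    {True: "high", False: "low"})

# Mean completion times per span group and map condition.
summary = subjects.groupby("span_group")[["time_3d_s", "time_2d_s"]].mean()
print(summary)

# Relative 3D advantage of the high-span group over the low-span group.
advantage = 1 - summary.loc["high", "time_3d_s"] / summary.loc["low", "time_3d_s"]
print(f"High-span group faster in 3D by {advantage:.0%}")
```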
5 Discussion

The results show quite forcefully that a professional street map can outperform a 3D mobile map, even though the users are 3D gamers and the design instantiates known good principles of VE design. This overall result is hardly surprising; 2D city maps are products of decades of work. More interestingly, the representational differences are reflected as differences in users' strategies of embodied interaction: for example, how often users interact with the physical environment, what cues they attend to, and when they are more likely to physically turn and move the device in their hands instead of using virtual commands. We found that 2D maps direct users into using reliable and ubiquitous cues like street names and street topology, and they better afford the use of pre-knowledge and bodily action to reduce cognitive workload. The findings point out two prominent problems of 3D in this task: the uninformativeness of the street-level perspective and the ambiguity of photorealistic cues on a small mobile display. The directionality of the viewport complicates ego-centric alignment in the 3D view, too. However, with practice, some 3D users learned to prefer 2D-like strategies and could thereby significantly improve performance.

To conclude the paper, we first revisit the three questions set out in the Introduction. Through these answers, we summarize and discuss the observations as well as provide tentative explanations for them. In drawing conclusions from the results, one must keep in mind that we deliberately selected young male 3D gamers as subjects, which may undermine the generalizability of our results. Previous research has shown individual differences in spatial behavior pertaining to age, experience, and gender [39]. However, the lack of results regarding the effects of gender, particularly on bodily spatial action, prevents us from putting forward hypotheses on this matter.

Q1. What is the role of the third dimension in helping users to localize objects in the real world?

We learned that, task-analytically speaking, performance is analogous between 3D and 2D mobile maps. To simplify, information about the POI is gathered first, a reference point is searched for, and the direction to the target is inferred. The most time-consuming subtask for both map types is finding a reference point. Strategies with both map types rely heavily on self-location. Therefore, to understand what distinguishes 3D, we must examine more closely how these subtasks were actually carried out, as this is where the differences between the two become visible.

Q2. Which cues do users attend to when using 3D mobile maps, and how do they find them?

The two representations, not unexpectedly, drew attention to different visual features. 3D and 2D use exhibited partially non-overlapping sets of cue types. 3D users relied on known landmarks more than 2D users did, and, in contrast to 2D users, they used building shapes, facade details, whole facades, and relative directions.
2D users relied on street names most often, and on street crossings more often than 3D users. Searching for such cues in the 3D model involved horizontal movement. However, users often chose cues that proved to be undiagnostic, or for which it was difficult to find matching counterparts. Visual features of the built environment are not designed to be salient or diagnostic. Rather, there are many similarities among buildings in almost any scene, and it requires some visual skill to pick up the most informative features. One may, however, speculate that with higher granularity (one pixel in m-LOMA's texture corresponded to about 15–20 cm of the real building), users might have been more successful in finding diagnostic features.

Unhappy with the street-level view, and wishing to get access to more informative cue types, 3D users frequently used the Perspective change and Tracks functionalities. One can then ask why users of the bird's eye view in 3D, which provides many of the same cues as the 2D city map view, were not equally fast. One explanation is that, in the bird's eye view in 3D, the small scale of transformations [14] may lower the informativeness and recognizability of objects. Only recognizable objects can be utilized as landmarks that help the tasks of mapping and orientation. In this respect, 2D differs from 3D in its use of symbols that are not so much affected by transformations of the viewport (zooming particularly). By contrast, in 3D, perspective changes from street level to top-down view bring about radically different views and require changes in the use of cues as well, for example from façades to street names or topology. As a result of moving back and forth between views, 3D users spent proportionately more time focusing on (gazing at) the device than 2D users.

While we expected that 2D use would rely mainly on street topology and street names as cues, 2D users actually exploited a much richer repertoire of cues. Moreover, they focused more on extracting cues from the PE, instead of from the VE, and were able to do so faster than with the 3D map. Generally speaking, the results of the experiment consistently speak in favor of preferring information from the PE over that from the VE. The conclusion is that street maps direct users to operate on cues that are pervasive and reliable. Users are less familiar with using visual features of the cityscape, like facades, to find reference points. 2D users also moved in the VE more efficiently. A telling difference is that the total distance traveled in the VE was at the same level for both map types; yet, in 2D, traveling the same distance requires significantly less time and fewer commands. Interestingly, it appears that pre-knowledge helped 2D users more than 3D users in finding cues.

Because the use of mobile maps takes place when the user is moving, the
map's support for spatial updating, that is, the mechanisms involved in locating positions in space relative to oneself after a given spatial transformation, is emphasized [42]. Wang and Brockmole [40] have examined the influence of the environment being divided into nested structures (e.g., a city consisting of districts consisting of blocks) and noted that spatial updating is not carried out in all structures simultaneously, nor with the same accuracy. When switching to a new environment, for example when walking around a corner or to a new block, one often loses track of one's position relative to old environments. From this perspective, the 2D map could provide a more stable spatial representation, and one that, in comparison to street-level 3D, allows users to do updating at more levels of topographical span. Indeed, we saw that 2D users benefited from quickly being able to operate on distant landmarks that they knew of. By contrast, 3D users had to invoke the Landmark view command repeatedly, or navigate and rotate the camera view, to achieve the same. Verbal protocols also provide evidence for the idea that 2D users may update spatial representations during walking and use this knowledge to infer their current location by tracing their movements from the last known site in the PE. However, even the 2D mobile map is actually quite narrow in comparison to paper maps and does not necessarily encompass more than a few blocks at most.

The finding that pre-knowledge-based tactics are rarer in the use of 3D maps is partly explained by our method. In our pilot study [29], where pointing performance was better in 3D use than here, the starting point in the VE was always the current position; only the viewing directions were misaligned. Thus, ego-centric alignment was partially done at the outset of the task, and the user only had to scan around horizontally to align the viewing directions. In the present study, the task started at the POI, so as to mimic the situation of a user who has found a POI and now wants to locate it in the PE. In real-world use, the user would not have "teleported" to the POI, but would have wandered in the area, which would have reduced disorientation by supporting spatial representations (pre-knowledge) of the area. A comparison of the present results to our pilot study [29] tentatively suggests that the advantage of starting the pointing task from the current location entails about one-fourth faster task completion times, but not necessarily more than that. Directional support with an electronic compass may improve performance further. We believe that 3D maps rely more than 2D maps on automatic orientation support that can be achieved with GPS and an electronic compass.

Despite these problems with the 3D map, performance actually improved with practice. We saw simultaneous decreases in the use of Orbiting and Fly-to-target and increases in the use of Tracks. Thus, the data tentatively
suggests that those 3D users who improved performance started to use more 2D-like search strategies. The benefit was about 16% in task completion times. Similar analysis of 2D map use indicates that practice with 2D improved learning of the map contents, but did not change the interaction tactics for that map type. Q3. Are 3D and 2D mobile maps associated with different patterns of bodily conduct or deployment of gaze? We are surprised by the general tone of the results: 2D rather than 3D tasks involved more efficient use of body and gaze. When using 2D, users turned their upper bodies more in search of cues like street names and crossings. They deployed gaze significantly more effectively to find cues like parallel streets. They tilted their heads and rotated their devices in their hands [35] more than 3D users. Their walking was significantly more efficient. 2D users performed ego-centric alignment more, even when the target was remote and not visible in PE, by relying on the zoom out function to see the current position and the target POI’s position in the same view. The crux of these strategies may be that they allow 2D users to avoid mental manipulation and rely more on perception in solving the task. By moving their bodies, they change their position in the environment, and can thereby eliminate costly mental operations and change them to perceptual ones. We see a parallel to experienced Tetris players tactic of rotating a piece in order to see, rather than mentally simulate, its fit to the landscape [16]. By tilting the device in hand, the mobile map user can enforce alignment between the two viewports and make reference point candidates appear parallel to each other, thus obviating the need to mentally rotate representations to be able to match reference point candidates. Device turning and head tilting were practically absent in 3D use, and 3D users turned their upper bodies significantly less. Their significantly longer lengths of gaze at PE indicate that 3D users were searching in PE, but were often unable to extract or match information. Problems in utilizing the strategy of ego-centric alignment constitute one reason for the lack of bodily strategies in 3D use. The alignment of the representation and the represented space is important in general, because some central types of human spatial knowledge are known to be viewpoint-dependent, even representations of dynamic scenes [13]. Hence, when objects in a map do not correspond to stored representations, the user has to transform or rotate the representation, which implies mental or physical effort [19]. Mou and McNamara [24] elaborate on this view by arguing that spatial memories are defined with respect to intrinsic frames of reference, which are selected on the basis of egocentric experience and environmental cues. In
the present experiment, ego-centric alignment was difficult due to problems in first finding one's current location and then keeping both one's own location and the POI in view in the VE. The difference between 3D and 2D may arise from the fact that the 3D viewport is directional and needs to be rotated manually (with virtual commands), whereas 2D is non-directional and can be rotated in the user's hands. Perhaps reflecting this, our results suggest that 3D performance, but not 2D performance, may be dependent on visuo-spatial working memory span. Due to the more fragmented viewing of the VE, and less help from bodily strategies, 3D users have to keep more locations of interest in memory in order to solve the problem.

Acknowledgments This work utilizes the results from the EU InterregIIIA project m-LOMA. The study was funded by the EU FP6 ICT projects ROBOSWARM (045255) for mobile graphical user interface development, IPCity (FP-2004-IST-4-27571), and the Academy of Finland project ContextCues. We thank Vikram Batra, Tuomo Nyyssönen, Giulio Jacucci, Katherine Wallis, Esko Lehtonen, Peter Fröhlich, Nikolaj Tatti, Petteri Torvinen, Antti Salonen, Heikki Vuolteenaho, Sauli Tiitta, and Tommi Ilmonen for comments and help.
References

1. Aretz AJ, Wickens CD (1992) The mental rotation of map displays. Hum Perf 5(4):303–328
2. Aslan I, Schwalm M, Baus J, Krüger A, Schwartz T (2006) Acquisition of spatial knowledge in location aware mobile pedestrian navigation systems. In: Proc. MobileHCI 2006, ACM Press, New York, pp 105–108
3. Berch DB, Krikorian R, Huha EM (1998) The Corsi block-tapping task: methodological and theoretical considerations. Brain Cogn 38(3):317–338
4. Benson AJ, Gedye JL (1963) Logical process in the resolution of orientation conflict. RAF Institute of Aviation Medicine Report 259, Ministry of Defence
5. Berg Insight AB (2007) Summary of a market report "Mobile maps and navigation", released in September 2007
6. Bessa M, Coelho A, Chalmers A (2005) Alternate feature location for rapid navigation using a 3D map on a mobile device. In: Proc. MUM 2005, ACM Press, New York, pp 5–9
7. Bluestein N, Acredolo L (1979) Developmental changes in map-reading skills. Child Dev 50(3):691–697
8. Burigat S, Chittaro L (2004) Location-aware visualization of a 3D world to select tourist information on a mobile device. In: Proc. MobileHCI 2004, ACM Press, New York
9. Chittaro L, Burigat S (2004) Location-aware visualization of a 3D-world to select tourist information on a mobile device. In: Proc. MobileHCI 2004, ACM Press, New York
10. Crampton J (1992) A cognitive analysis of wayfinding expertise. Cartographica 29(3):46–65
11. Darken RP (1996) Wayfinding in large-scale virtual worlds. Dissertation, The George Washington University
12. Fröhlich P, Simon R, Baillie L, Roberts J, Murray-Smith R (2007) Mobile spatial interaction. In: Ext. Abst. CHI 2007, ACM Press, New York
13. Garsoffky B, Schwan S, Hesse FW (2002) Viewpoint dependency in the recognition of dynamic scenes. J Exp Psychol Learn Mem Cogn 28(6):1035–1050
14. Golledge RG (1999) Human wayfinding and cognitive maps. In: Golledge RG (ed) Wayfinding behavior: cognitive mapping and other spatial processes. Johns Hopkins University Press, Baltimore, pp 5–45
15. Hanson A, Wernert E (1997) Constrained 3D navigation with 2D controllers. In: Proc. IEEE Visualization 1997, pp 175–182
16. Kirsh D, Maglio P (1994) On distinguishing epistemic from pragmatic action. Cogn Sci 18:513–549
17. Kiss S, Nijholt A (2003) Viewpoint adaptation during navigation based on stimuli from the virtual environment. In: Proc. Web3D 2003, ACM Press, New York, pp 19–26
18. Kray C, Elting C, Laakso K, Coors V (2003) Presenting route instructions on mobile devices. In: Proc. IUI 2003, ACM Press, New York, pp 117–124
19. Levine M (1982) You-are-here maps: psychological considerations. Environ Behav 14(2):221–237
20. Levine M, Marchon I, Hanley GL (1984) The placement and misplacement of you-are-here maps. Environ Behav 16(2):139–157
21. Liben LS, Downs RM (1993) Understanding person-space-map relations: cartographic and developmental perspectives. Dev Psychol 29(4):739–752
22. Lobben AK (2004) Tasks, strategies, and cognitive processes associated with navigational map reading: a review perspective. Prof Geogr 56(2):270–281
23. Miettinen M, Oulasvirta A (2007) Predicting time-sharing in mobile interaction. User Model User-Adapt Interact 17(5):475–510
24. Mou W, McNamara TP (2002) Intrinsic frames of reference in spatial memory. J Exp Psychol Learn Mem Cogn 28:162–170
25. Nothegger C, Winter S, Raubal M (2004) Selection of salient features for route directions. Spat Cogn Comput 4(2):113–136
26. Nurminen A (2006) m-LOMA: a mobile 3D city map. In: Proc. Web3D 2006, ACM Press, New York, pp 7–18
27. Nurminen A, Oulasvirta A (2008) Designing interactions for navigation in 3D mobile maps. To appear in: Meng L, Zipf A, Winter S (eds) Map-based mobile services: interactivity and usability. Springer
28. Ottosson T (1987) Map-reading and wayfinding. Acta Universitatis Gothoburgensis, Göteborg
29. Oulasvirta A, Nurminen A, Nivala AM (2007) Interacting with 3D and 2D mobile maps: an exploratory study. HIIT Technical Report 2007-1
30. Oulasvirta A (2008) Field experiments in HCI: promises and challenges. To appear in: Saariluoma P, Roast C, Punamäki HK (eds) Future Interaction Design: Part 2. Springer
31. Parush A, Berman D (2004) Navigation and orientation in 3D user interfaces: the impact of navigation aids and landmarks. Int J Hum Comput Stud 61:375–395
32. Rakkolainen I, Vainio T (2001) A 3D city info for mobile users. Comput Graph 25(4):619–625
33. Rohs M, Oulasvirta A (2008) Target acquisition with camera phones when used as magic lenses. To appear in: Proc. CHI 2008, ACM Press, New York
34. Ruddle RA, Payne SJ, Jones DM (1997) Navigating buildings in "desk-top" virtual environments: experimental investigations using extended navigational experience. J Exp Psychol Appl 3(2):143–159
35. Seager W, Fraser DS (2007) Comparing physical, automatic and manual map rotation for pedestrian navigation. In: Proc. CHI 2007, ACM Press, New York, pp 767–775
36. Sholl MJ (1987) Cognitive maps as orienting schemata. J Exp Psychol Learn Mem Cogn 13(4):615–628
37. Tan D, Robertson G, Czerwinski M (2001) Exploring 3D navigation: combining speed-coupled flying with orbiting. In: Proc. CHI 2001, ACM Press, New York, pp 418–425
38. Vainio T, Kotala O (2002) Developing 3D information systems for mobile users: some usability issues. In: Proc. NordiCHI 2002, ACM Press, New York
39. Waller D (2000) Individual differences in spatial learning from computer-simulated environments. J Exp Psychol Appl 6(4):307–321
40. Wang RF, Brockmole JR (2003) Human navigation in nested environments. J Exp Psychol Learn Mem Cogn 29(3):398–404
41. Ware C, Osborne S (1990) Exploration and virtual camera control in virtual three dimensional environments. In: Proc. 1990 Symposium on Interactive 3D Graphics, pp 175–193
42. Wraga M (2003) Thinking outside the body: an advantage for spatial updating during imagined versus physical self-rotation. J Exp Psychol Learn Mem Cogn 29(5):993–1005
43. Zimmer H (2004) The construction of mental maps based on a fragmentary view of physical maps. J Educ Psychol 96(3):603–610