CLUSTERING LANDMARK IMAGE ANNOTATIONS BASED ON TAG LOCATION AND CONTENT
Phil Bartie (University of Stirling), William Mackaness (University of Edinburgh), Philipp Petrenz (University of Edinburgh), Anna Dickinson (University of Edinburgh)
Abstract
When exploring a new urban region you build a mental model of the space, first recognising landmarks, then joining these together into sequences to form routes, and finally building survey knowledge in which you can begin to link up places and the spaces between them (Hirtle & Jonides, 1985). Landmarks form a significant building block in this process: they are used to recognise places (Hirtle & Heidorn, 1993; Tversky, 1993) and to form wayfinding instructions that describe routes (Caduff & Timpf, 2008; Duckham, Winter, & Robinson, 2010; Lovelace, Hegarty, & Montello, 1999). Landmarks are defined as identifiable features in an environment, whose saliency may be calculated by comparing scores for particular attributes (e.g. their size) and identifying those which deviate from the mean (Elias, 2003; Raubal & Winter, 2002). These are the objects unlikely to be confused with others, either because they appear different from their surroundings (e.g. churches, statues) or because they are well-known international brands (e.g. Starbucks, McDonald's) easily identifiable by their shop front design, logo, and shop name label.
To gain a better understanding of landmark types and the terminology used to identify those objects, a web-based experiment was conducted in which human subjects were asked to identify features in a number of urban scenes which could be used in forming navigational instructions. The aim was then to group the tags automatically at an object level (e.g. tags relating to a particular café) so that the identified features in each scene could be ranked. This information could then be used to compare and refine the weights of the various inputs to a landmark saliency model used in a virtual city guide location-based service (Bartie, Clementini, & Reitsma, 2013; Janarthanam et al., 2012; Mackaness et al., 2014). The experiment was publicised through social media, attracting 185 participants.
Users were assigned images randomly from a set of 37. They were able to leave the experiment at any time, but were encouraged to complete as many images as possible by receiving an additional entry into a prize draw for each completed set. For each task the participant saw an image of part of the city of Edinburgh (Scotland), and was asked to identify what they considered to be features of interest by tagging the item on the image and supplying a text annotation. This resulted in a set of tag locations and annotations for each scene. Refining these into a set of tags for each landmark required a new grouping method able to link together related tags from all participants. Text-based clustering alone was not suitable, as a single scene may contain several objects of a similar type (e.g. a small church, and a large church with a spire), and it would not be possible to identify which tags related to which instance, as shown in the scene depicted in Figure 1.
Figure 1: Tag Locations and Corresponding Word Cloud Based on Text Frequency
Spatial clustering based on tag location only was also unsuitable, as landmark boundaries in each image were undefined, and clustering techniques based on proximity may group tags from unrelated landmarks. This is shown in Figure 2, where the single landmark (A) contains multiple tag clusters, while (B) appears as a single landmark but, as shown in the more detailed view below, features (i) and (ii) are separated by distance. This separation cannot be inferred from the tag locations, and therefore the spatial clustering fails to differentiate these groups. Similar problems occur in (C), where two features are incorrectly combined, and (D), where a single feature has two clusters.
[Figure 2 image labels: (A), (B), (C), (D); detailed view labels: (i), (ii), (iii)]
Figure 2: Spatial Clustering of Tag Locations Using Kernel Density Estimation
Therefore a method which combined the location and the annotation components of the tags was developed. This found similar phrases within a specified distance to build a network of related words. Where similar words were found nearby they were considered to have a stronger relationship. From this a set of word lists was generated, which was used in a second pass of the data to expand the query flexibility, enabling tag lists to merge, as shown in Figure 3. For example, a collection of ‘church’ and ‘church tower’ tags may merge with ‘clock’ and ‘clock tower’ tags through a common relationship with the term ‘tower’.
[Figure 3 image: tag networks after the First Pass and the Second Pass; tag labels shown include ‘clock’, ‘clock tower’, ‘church’, ‘big church’, ‘church tower’, ‘sign’, ‘no entry’, and ‘no entry sign’]
Figure 3: Query Expansion Using Tag Location and Content to Expand Word Association Lists
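The two-pass idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: exact shared-word matching stands in for phrase similarity, and the tag records, the two radii (`first_radius`, `second_radius`), and the function name `cluster_tags` are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import hypot

# Hypothetical tag records: (x, y, annotation) in image pixel coordinates.
TAGS = [
    (100, 50, "clock"),
    (104, 58, "clock tower"),
    (108, 90, "church tower"),
    (110, 120, "church"),
    (112, 126, "big church"),
    (400, 300, "no entry sign"),
    (405, 304, "sign"),
    (700, 110, "church"),  # a second church elsewhere in the scene
]


def find(parent, i):
    """Return the representative of i's group (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, i, j):
    """Merge the groups containing i and j."""
    parent[find(parent, i)] = find(parent, j)


def cluster_tags(tags, first_radius=15.0, second_radius=60.0):
    """Group tags by location and content; both radii are assumed values."""
    words = [set(text.lower().split()) for _, _, text in tags]
    parent = list(range(len(tags)))

    def dist(i, j):
        return hypot(tags[i][0] - tags[j][0], tags[i][1] - tags[j][1])

    # First pass: link tags that lie close together and share a word.
    for i, j in combinations(range(len(tags)), 2):
        if dist(i, j) <= first_radius and words[i] & words[j]:
            union(parent, i, j)

    # Build one word list per first-pass group and snapshot it per tag.
    vocab = defaultdict(set)
    for i in range(len(tags)):
        vocab[find(parent, i)] |= words[i]
    group_words = {i: set(vocab[find(parent, i)]) for i in range(len(tags))}

    # Second pass: compare the expanded word lists over a wider radius,
    # so that groups sharing a term such as 'tower' can merge.
    for i, j in combinations(range(len(tags)), 2):
        if dist(i, j) <= second_radius and group_words[i] & group_words[j]:
            union(parent, i, j)

    groups = defaultdict(list)
    for i in range(len(tags)):
        groups[find(parent, i)].append(tags[i][2])
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in cluster_tags(TAGS):
        print(group)
```

In this toy scene the ‘clock’ and ‘church’ tags merge through the shared term ‘tower’, while the distant second ‘church’ tag stays in its own group, which text-only clustering could not achieve. A practical version would probably also need stop-word filtering and a softer similarity measure than exact word overlap.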
The process resulted in a set of landmark definitions, both as spatial networks on the image and as a set of word collections relating to each landmark feature. The results exhibit a partonomy through the spatial location and connectedness of the tags in each landmark group. The output for a number of examples is shown in Figure 4, with landmark popularity, extracted automatically from tag content and location, displayed at feature level. As seen in these examples, the issues which occurred in Figure 2 for clusters (A), (B), (C) and (D) no longer exist, as the algorithm has correctly recognised related tags at the feature level. The relative importance of the landmark in each scene is displayed by the size of the weighted centroid.
Figure 4: Examples of Spatial and Content Based Clustering Networks and Weighted Centroids at Feature Level
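As a rough illustration of how a centroid and a popularity score could be derived for each landmark group, the following Python sketch assumes that popularity is simply the number of tags in a group, and that the centroid is the mean of the group's tag locations. The source does not specify the weighting used, so both choices, along with the names `summarise_groups` and `rank_landmarks`, are assumptions.

```python
from statistics import fmean


def summarise_groups(groups):
    """Summarise each landmark group.

    groups: dict mapping a landmark label to a list of (x, y) tag locations.
    Returns, per label, the mean tag location, the tag count, and the
    group's share of all tags in the scene (an assumed popularity measure).
    """
    total = sum(len(points) for points in groups.values())
    summary = {}
    for label, points in groups.items():
        cx = fmean(x for x, _ in points)
        cy = fmean(y for _, y in points)
        summary[label] = {
            "centroid": (cx, cy),
            "tags": len(points),
            "share": len(points) / total,
        }
    return summary


def rank_landmarks(summary):
    """Order landmark labels from most to least tagged."""
    return sorted(summary, key=lambda label: summary[label]["tags"], reverse=True)


if __name__ == "__main__":
    # Hypothetical grouped tag locations for one scene.
    scene = {
        "church": [(100, 50), (104, 58), (108, 90), (110, 120)],
        "sign": [(400, 300), (404, 304)],
    }
    summary = summarise_groups(scene)
    for label in rank_landmarks(summary):
        print(label, summary[label])
```

The `share` value could then drive the displayed size of each centroid marker, so that more frequently tagged landmarks appear larger.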
The popularity and range of terminology output by this approach can be useful in determining suitable landmark candidates in a speech-enabled location-based system, and for improving our understanding of the relative importance of the various metrics used in landmark-based saliency modelling.
References:
Bartie, P., Clementini, E., & Reitsma, F. (2013). A qualitative model for describing the arrangement of visible cityscape objects from an egocentric viewpoint. Computers, Environment and Urban Systems, 38(C), 21-34.
Caduff, D., & Timpf, S. (2008). On the assessment of landmark salience for human navigation. Cognitive Processing, 9(4), 249-267.
Duckham, M., Winter, S., & Robinson, M. (2010). Including landmarks in routing instructions. Journal of Location-Based Services, 4(1), 28-52.
Elias, B. (2003). Determination of landmarks and reliability criteria for landmarks. Paper presented at the 5th Workshop on Progress in Automated Map Generalization, Paris, France.
Hirtle, S. C., & Heidorn, P. B. (1993). The structure of cognitive maps: Representations and processes. Behavior and Environment: Psychological and Geographical Approaches, 170-192.
Hirtle, S. C., & Jonides, J. (1985). Evidence of hierarchies in cognitive maps. Memory and Cognition, 13(3), 208-217.
Janarthanam, S., Lemon, O., Liu, X., Bartie, P., Mackaness, W., Dalmas, T., et al. (2012, 5-6 July). Integrating location, visibility, and Question-Answering in a spoken dialogue system for Pedestrian City Exploration. Paper presented at the SIGDIAL, South Korea.
Lovelace, K. L., Hegarty, M., & Montello, D. R. (1999). Elements of good route directions in familiar and unfamiliar environments. In C. Freksa & D. Mark (Eds.), Spatial Information Theory: Cognitive and Computational Foundations of Geographic Information Science (Vol. 1661, pp. 751): Springer Berlin / Heidelberg.
Mackaness, W. A., Bartie, P., Dalmas, T., Janarthanam, S., Lemon, O., Liu, X., et al. (2014, 7-13 April). Talk the Walk and Walk the talk: Design, Implementation and Evaluation of a Spoken Dialogue System for Route Following and City Learning. Paper presented at the Annual Conference of the Association of American Geographers, Tampa, Florida.
Raubal, M., & Winter, S. (2002). Enriching wayfinding instructions with local landmarks. In M. J. Egenhofer & D. M. Mark (Eds.), Second International Conference GIScience (Vol. 2478, pp. 243-259). Boulder, USA: Springer.
Tversky, B. (1993). Cognitive maps, cognitive collages, and spatial mental models. Paper presented at the Spatial Information Theory: A Theoretical Basis for GIS, Italy.