SPATIAL AND TEMPORAL ATTRACTIVENESS ANALYSIS THROUGH GEO-REFERENCED PHOTO ALIGNMENT

Paul Chippendale, Michele Zanin, Claudio Andreatta
Fondazione Bruno Kessler

ABSTRACT

This paper presents a system that creates a spatiotemporal attractiveness GIS layer for mountainous areas through the implementation of novel image processing and pattern matching algorithms. We utilize the freely available Digital Terrain Model of the planet provided by NASA [1] to generate a three-dimensional synthetic model around a viewer's location. Using an array of image processing algorithms, we then align photographs to this model. We demonstrate the accuracy of the resulting system by overlaying geo-referenced content, such as mountain names, and we suggest ways in which visitors and photographers can exploit the results of this research, for example by recommending temporally appropriate photo-hotspots close to their current location.

Index Terms - geo-referenced, semantic labelling, augmented reality, image processing

1. INTRODUCTION

In recent years portable consumer technology has advanced at a phenomenal rate, furnishing everyone with megapixel digital cameras and powerful mobile computing devices with gigabytes of storage capacity; GPS is now commonplace. In fact, technology convergence is beginning to bring all of these elements together into single affordable devices, such as the Nokia N96 and the Apple iPhone. The near-ubiquitous ownership of digital cameras and the speed at which an image can be taken and shared with the world promise a new paradigm in environmental observation. Every day millions of geo-referenced, high-oblique images are captured and shared via Internet websites such as Flickr [2] and Panoramio [3], often within minutes of capture.
For example, the Flickr API tells us that over half a million geo-referenced photos were taken in the Alps in 2007, of which a high proportion contain mountains viewed from many locations and aspects. There is a clear trend: geo-tagging photos for the sole purpose of placing one's photos 'on the map', either automatically (via GPS) or manually (via GUIs), is becoming increasingly popular. It is this ever-expanding source of freely available ground-level, up-to-date, high resolution imagery and associated metadata that the research within this paper proposes to exploit.

From a database of geo-referenced images it is straightforward to create a spatiotemporal GIS layer, like the ones seen on Flickr. However, once a dataset becomes too large, a summary framework is required to effectively abridge the information [4]. Although at present only around 5% of images on Flickr are geo-tagged and made publicly accessible, observations over time can reveal clues about where photographs are most often taken. Furthermore, if the same photographer takes several photos in the same day, their path can be extrapolated by intelligently joining these snapshots together in time, revealing common visitor routes.

From the location field within the metadata packaged along with each photo, we can only know (at best to a few metres) where a photographer was standing. Ideally, we would also like to know where the camera was pointing. We can take an educated guess by comparing a photo's tags with those posted by others in the vicinity. Research conducted in the ZoneTag project [5] nicely demonstrates how a new photo can have its tags auto-generated through proximity to other photos alone, based on social annotation. In Microsoft's Photosynth [6], highly recognizable man-made features, such as the architecture of Notre Dame de Paris, are detected in order to align and assemble huge collections of photos. Using this tool, the exact orientation of a photo can be calculated thanks to the unique and unchanging nature of the calibration object. However, neither of these approaches scales well to the natural environment, where the landscape is less structured, its visual appearance changes over time, and the objects which constitute the scene may be far in the distance. In Fotowoosh [7] the authors combined a large number of visual cues to estimate depth perception in 2D photos.
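Geo-referenced photos such as those counted above can be retrieved through Flickr's public flickr.photos.search API method. The sketch below only assembles the request parameters for geo-tagged Alpine photos; the bounding box, date range and API key are illustrative assumptions, and no network request is performed.

```python
# A minimal sketch of querying geo-referenced photos through the public
# Flickr API (method flickr.photos.search). The bounding box roughly covers
# the Alps and the dates match the 2007 statistic cited above; both are
# illustrative, and a real API key would be needed to execute the query.
def build_flickr_query(api_key, bbox, min_taken, max_taken):
    """Assemble parameters for flickr.photos.search, restricted to
    geo-tagged photos inside bbox = (min_lon, min_lat, max_lon, max_lat)."""
    return {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "bbox": ",".join(str(v) for v in bbox),
        "min_taken_date": min_taken,
        "max_taken_date": max_taken,
        "has_geo": 1,                # only geo-referenced photos
        "extras": "geo,tags,views",  # return coordinates and tags inline
        "format": "json",
        "nojsoncallback": 1,
    }

params = build_flickr_query("YOUR_API_KEY", (5.0, 43.5, 16.5, 48.0),
                            "2007-01-01", "2007-12-31")
```

These parameters would then be sent to the REST endpoint https://api.flickr.com/services/rest/ with any HTTP client.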
Their approach provided very impressive results when man-made objects were visible, offering valuable cues to estimate vanishing points and the presence of occluding objects. Unfortunately, straight lines rarely exist in the natural environment, so we have to rely on other visual cues to estimate depth. Behringer [8] used an edge-detector-based approach to align the horizon silhouette extracted from Digital Terrain Model (DTM) data with video images, in order to estimate the orientation of a mobile Augmented Reality (AR) system. They demonstrated that well-structured terrain could provide information to help other sensors accurately estimate registration, but their solution had problems with lighting and visibility conditions. Our approach, however, incorporates an enriched set of feature points and a more accurately rendered terrain model enhanced by additional digital content such as lake contours, road or footpath profiles, etc. Extracting feature points from outdoor environments is nevertheless challenging, as disturbances such as clouds on mountain tops, foreground objects, large variations in lighting and generally bad viewing conditions, such as haze, all inhibit accurate recognition. As a result, great care needs to be taken to overcome such inherent limitations, and we achieve this by combining different yet complementary methods.

2. GEO-REFERENCED PHOTO ALIGNMENT

We register photos by correlating their content against a rendered 3D spherical panorama generated about the photo's geo-location. To do this we systematically load, scale and re-project DTM data onto the inside of a sphere using ray-tracing techniques. In preparation for photo alignment, the sphere is 'unwrapped' into a 360° by 180° rectangular window (see Figure 1); each synthetic pixel has its own latitude, longitude, altitude and depth.

Figure 1: Synthetic panorama generated about the photo's location

Placing a photograph into this unwrapped space requires a deformation so that photo pixels are correctly mapped onto synthetic ones according to estimated camera parameters, such as pan, tilt and lens distortion. Scaling is estimated from the focal length parameters contained within the EXIF metadata of a JPEG photo. Accurate photo alignment is the main crux of our research, and remains an ongoing challenge. The general approach is structured into four phases: 1) extract salient points from the synthetic rendering, 2) extract salient points from the photo, 3) search for correspondences, aiming to select significant synthetic/photo point pairs, 4) apply an optimization technique to derive the alignment parameters that minimize reprojection error. In our current implementation, we extract mountain profiles in both the synthetic panorama and the photo. While in the first case it is just a matter of detecting distance discontinuities (see Figure 2), in photos we apply a sequence of machine-vision algorithms: region segmentation, edge extraction, region classification (sky, rock, vegetation, water) and profile extraction; profiles are generated at region borders (Figure 3 and Figure 4).

Figure 2: Salient features extracted from synthetic image

Figure 3: Image segmented into regions

We then analyse both profiles with a multi-scale saliency algorithm that aims to extract important points at different scales. These are shown as horizontal bands in Figures 2 and 4; white vertical lines represent maximum points, black ones minimums, and their length is proportional to importance.

Figure 4: Sky-land profile with saliency map above

At each scale, a compact representation (relative positions and type of sequence) of salient photo points is derived and then systematically compared to the salient synthetic points until one or more approximate matches are found. Each hypothesis is then considered independently, using the synthetic profiles to guide a custom edge extraction algorithm through the original image. If there is sufficient correlation between the photo profile and the model, the hypothesis is accepted and a list of corresponding points is generated. A set of calibration parameters is then extracted by minimizing the angular distances between synthetic points and their corresponding photo points (after reprojection into the spherical space). Parameter tuning is currently performed using the Downhill Simplex method [9].

Once a best match has been found, we project each non-sky pixel as a four-sided plane onto the synthetic landscape, enabling us to create a 3D view-shed for each auto-aligned photo. We then cross-reference the '3D' photo with a geo-referenced names database, such as Geonames [10]. In this way, not only can we make a better guess at what features can be seen inside a photo, but we can also annotate them precisely within the images themselves. Such semantic labelling can include features like mountain peaks, lakes, huts, etc. (see Figure 5 for an example output).

In addition to estimating photo content, we explore the question 'Could a similar, more beautiful photo have been taken from somewhere else nearby?' To answer this, we generate a 3D 'attractiveness' layer, where each surface voxel of a region can be assigned an updateable value according to socially derived aesthetic metrics. Over time, the accumulation of many aligned photos taken at other times and from other perspectives creates an attractiveness heat-map.

There are several ways to assess the attractiveness or popularity of a photo. Within Flickr, an 'interestingness' metric is created for each photo shared, using factors such as the number of comments made, how many times it has been viewed, how often it has been marked as a favourite by others, the number of tags, etc. Similarly, in Panoramio the 'popularity' of a photo can be assessed from the number of views and comments. Due to restrictions imposed by their APIs, non-owners are not permitted to access all of a photo's statistics.
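The angular-distance minimization via the Downhill Simplex method might be sketched as below. The camera model (pure pan/tilt plus a linear degrees-per-pixel scale) and the matched point pairs are deliberately simplified assumptions for illustration; SciPy's `fmin` implements the same Nelder-Mead downhill simplex algorithm.

```python
import numpy as np
from scipy.optimize import fmin  # Nelder-Mead downhill simplex

# Hypothetical matched pairs: photo pixel offsets from the image centre (px)
# and the corresponding synthetic points' azimuth/elevation (degrees).
# These correspondences were generated with pan=180, tilt=5, 0.015 deg/px.
photo_px = np.array([[-500.0, 120.0], [300.0, -80.0], [800.0, 40.0]])
synth_ae = np.array([[172.5, 3.2], [184.5, 6.2], [192.0, 4.4]])

def project(params, px):
    """Map pixel offsets to azimuth/elevation under a toy camera model:
    params = (pan_deg, tilt_deg, deg_per_px)."""
    pan, tilt, scale = params
    az = pan + px[:, 0] * scale
    el = tilt - px[:, 1] * scale
    return np.column_stack([az, el])

def angular_error(params):
    """Sum of squared angular residuals between reprojected photo points
    and their synthetic correspondences."""
    diff = project(params, photo_px) - synth_ae
    return np.sum(diff ** 2)

# Downhill simplex search from a rough initial guess (pan, tilt, deg/px).
best = fmin(angular_error, x0=[170.0, 2.0, 0.02], disp=False)
```

A production version would also optimize roll and lens distortion, and would measure true great-circle angular distances rather than simple azimuth/elevation differences.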
However, sufficient information can be gleaned from the HTML, XML and EXIF metadata to generate a normalised attractiveness metric, A, which we calculate using the following formula: A = 5fα + 2cβ + vγ, where f is the number of times a photo has been made a favourite; α is a normalizing factor based upon the average number of favourites taken from a random selection of 1000 images from the same website; c is the number of comments made about a photo and β is the corresponding normalizing factor for comments; v is the number of views an image has had and γ is the normalizing factor for views.
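As a worked example, the formula above is a weighted sum of the three normalised popularity signals. The sample values below are invented for illustration; the exact derivation of the normalizing factors from the 1000-image sample is not specified here, so they are passed in as plain numbers.

```python
def attractiveness(favourites, comments, views, alpha, beta, gamma):
    """Normalised attractiveness A = 5*f*alpha + 2*c*beta + v*gamma.
    alpha, beta and gamma are normalizing factors derived from a random
    sample of 1000 images taken from the same website."""
    return 5 * favourites * alpha + 2 * comments * beta + views * gamma

# Illustrative numbers: 10 favourites, 4 comments, 200 views, with
# hypothetical normalizing factors alpha=0.5, beta=0.25, gamma=0.01.
A = attractiveness(10, 4, 200, 0.5, 0.25, 0.01)
# A = 5*10*0.5 + 2*4*0.25 + 200*0.01 = 25 + 2 + 2 = 29
```

The 5/2/1 weighting encodes the intuition that marking a photo as a favourite is a stronger aesthetic signal than a comment, which in turn is stronger than a mere view.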

Figure 5: Aligned photo with mountain peaks labelled

Figure 6: Depth map of photo in Figure 5

Using the algorithms from this research, a heat-map can be generated for any area. Figure 7 shows a map that was generated by our system and overlaid onto Google Earth. An area around the Matterhorn was selected and one hundred photos were downloaded (with their associated metadata) from Flickr and then aligned. The map clearly shows that, for photographers, the appeal of the peak far exceeds that of its surroundings. This outcome is of course expected for such an iconic mountain. A similar analysis of less well-known areas may help emerging tourist authorities capitalise on their assets, perhaps helping to site picnic areas, panoramic information points, etc.
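The heat-map accumulation step can be sketched as follows: each aligned photo's attractiveness score A is deposited onto every terrain cell visible in its view-shed. The cell size and the flat (lat, lon) point representation are simplifying assumptions; the real system works with surface voxels recovered from the depth maps.

```python
from collections import defaultdict

def accumulate_heat(photos, cell_deg=0.001):
    """Accumulate each aligned photo's attractiveness score A onto every
    terrain cell visible in its view-shed, yielding a sparse heat-map.
    `photos` is a list of (A, visible_points) pairs, where visible_points
    are (lat, lon) tuples recovered from the photo's depth map."""
    heat = defaultdict(float)
    for A, pts in photos:
        for lat, lon in pts:
            # quantize coordinates onto a regular grid of cell_deg degrees
            key = (round(lat / cell_deg), round(lon / cell_deg))
            heat[key] += A
    return heat
```

Rendering the resulting cells as coloured polygons in a KML file is then enough to drape the layer over Google Earth, as in Figure 7.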

3. RESULTS

Figure 5 demonstrates how our system can identify and label mountains visible inside a photo. In addition to the names of peaks, we can add their heights above sea level and their distances from the camera. We use the Geonames online database of names. Figure 6 illustrates the depth information contained within aligned photos. The depth map clearly shows that Mont Blanc lies considerably farther away than the Matterhorn.
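Placing a label such as those in Figure 5 reduces to converting a peak's latitude, longitude and altitude (as returned by Geonames) into an azimuth, elevation and distance relative to the camera. The sketch below uses a spherical-earth, locally flat approximation; the function name and the approximation itself are our own illustrative choices, adequate only for distant mountains.

```python
import math

EARTH_R = 6371000.0  # mean earth radius in metres (spherical approximation)

def peak_bearing(cam_lat, cam_lon, cam_alt, pk_lat, pk_lon, pk_alt):
    """Return (azimuth_deg, elevation_deg, distance_m) of a peak as seen
    from the camera, using an equirectangular local approximation that is
    adequate for labelling distant mountains."""
    dlat = math.radians(pk_lat - cam_lat)
    dlon = math.radians(pk_lon - cam_lon) * math.cos(math.radians(cam_lat))
    north = dlat * EARTH_R          # metres towards true north
    east = dlon * EARTH_R           # metres towards east
    dist = math.hypot(north, east)  # ground distance
    az = math.degrees(math.atan2(east, north)) % 360.0
    el = math.degrees(math.atan2(pk_alt - cam_alt, dist))
    return az, el, dist
```

Once (azimuth, elevation) is known, the label lands on the photo pixel that the alignment maps to that direction, and `dist` supplies the distance annotation.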

Figure 7: Photo attractiveness heat-map for The Matterhorn

4. CONCLUSIONS

In this paper we have presented a system that creates a spatiotemporal attractiveness GIS layer for mountainous areas through the implementation of novel image processing and pattern matching algorithms. We have shown how in-image labelling can be achieved in a fully automated manner and suggested ways in which visitors and photographers can exploit the results of this research, giving them a wealth of interesting information and suggesting temporally appropriate photo-hotspots close to their current location.

A quantitative evaluation of algorithm performance is underway. Qualitatively, the method is successful whenever the landscape exhibits sufficient geographical features (i.e. is not completely flat) and there are only a few occlusions created by foreground objects, like trees or buildings. To increase robustness we are developing a close-object detector. The sky detection component is quite robust in the presence of clouds; however, in extremely misty or dense cloud conditions, when a large proportion of the visible space is obscured, profile extraction sometimes provides incorrect results and alignment often fails.

Currently, the system is sensitive to GPS positional errors, more so if geographical features are very close. A misplacement of 200 m is not a problem if the mountains we are looking at are 50 km away. However, such an error could have a strong impact if the mountains are closer than a kilometre. To address this, we are developing an improved version of our algorithm which can also take positional errors into consideration and correct them. Other ideas we are pursuing attempt to cross-reference the tags from a current photo with tags from previously computed photos whose depth maps partially overlap. We are also starting to investigate the system's uses in environmental monitoring.
As we know 'when' digital photos are taken, a spatiotemporal appearance layer for draping onto a DTM can easily be created by synthesising the multitude of geo-referenced photos taken daily. Images taken on the ground provide us with a unique perspective for monitoring erosion on near-vertical rock faces or terrain under partial tree cover, for example. Such a layer would automatically evolve, helping us to understand snow and glacier coverage or the onset of spring in various geographical locations.

For more examples of our research please refer to http://tev.fbk.eu/marmota/.

5. REFERENCES

[1] The Shuttle Radar Topography Mission (SRTM), http://www2.jpl.nasa.gov/srtm/
[2] Flickr, http://www.flickr.com/
[3] Panoramio, http://www.panoramio.com/
[4] A. Jaffe, M. Naaman, T. Tassa, M. Davis, "Generating Summaries and Visualization for Large Collections of Geo-Referenced Photographs", in Proc. 8th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, Santa Barbara, USA
[5] S. Ahern, M. Davis, D. Eckles, et al., "ZoneTag: Designing Context-Aware Mobile Media Capture to Increase Participation", in Proc. Pervasive Image Capture and Sharing, 8th Int. Conf. on Ubiquitous Computing, California, 2006
[6] N. Snavely, S. M. Seitz, R. Szeliski, "Photo Tourism: Exploring Photo Collections in 3D", ACM Transactions on Graphics (SIGGRAPH Proceedings), 25(3), 2006, pp. 835-846
[7] D. Hoiem, "Seeing the World Behind the Image: Spatial Layout for 3D Scene Understanding", Doctoral dissertation, Tech. Report CMU-RI-TR-07-28, Robotics Institute, Carnegie Mellon University, August 2007
[8] R. Behringer, "Registration for Outdoor Augmented Reality Applications Using Computer Vision Techniques and Hybrid Sensors", in Proc. IEEE VR'99, Houston, Texas, USA, March 13-17, 1999
[9] W. H. Press et al., "Numerical Recipes in C++", 2002, pp. 413-417
[10] Geonames, http://www.geonames.org/