WIKIREALITY: AUGMENTING REALITY WITH COMMUNITY DRIVEN WEBSITES

Douglas Gray†
University of California, Santa Cruz
1156 High St., Santa Cruz, CA 95060, USA
[email protected]

Igor Kozintsev, Yi Wu, Horst Haussecker
Intel Corporation
2200 Mission College Blvd., Santa Clara, CA 95052, USA
{Igor.V.Kozintsev, Yi.Y.Wu, Horst.Haussecker}@intel.com

ABSTRACT

We present a system for making community driven websites easily accessible from the latest mobile devices. Many of these new devices contain an ensemble of sensors such as cameras, GPS, and inertial sensors. We demonstrate how these new sensors can be used to bring the information contained in sites like Wikipedia to users in a much more immersive manner than text or maps. We have collected a large database of images and articles from Wikipedia and show how a user can query this database by simply snapping a photo. Our system uses the location sensors to assist with image matching and the inertial sensors to provide a unique and intuitive user interface for browsing results.

Index Terms— Mobile Augmented Reality, Image Search, Image Indexing, Multimedia Databases, Multimedia HCI

1. INTRODUCTION

Mobile devices such as cell phones and Mobile Internet Devices (MIDs) are becoming ubiquitous in the modern world, and as such there is growing demand for software and tools that allow users to create and access multimedia and other information on these devices. However, the small form factor presents application designers with many challenges. The lack of a keyboard and mouse limits user input, while a small screen makes browsing even small quantities of multimedia challenging. Future devices will have to compensate for these limitations by adding new forms of input such as cameras, accelerometers, and location sensors such as the Global Positioning System (GPS). In this paper we present a system for making the popular community driven website Wikipedia easily accessible on the mobile devices of the future using these new sensors.

† This work was performed while the author was at Intel Corporation.

Fig. 1. Our system uses GPS data and live camera input to recognize a landmark and augments the live video with a Wikipedia icon pinned to the query object using orientation sensor data. The corresponding Wikipedia page is automatically opened in the background. Here, the top 5 matching images are displayed to allow the user to select an alternative.

2. SYSTEM OVERVIEW

Our proposed system uses a query-by-example paradigm, enhanced with location-based filtering to improve matching results when possible. Size and battery life are major concerns on mobile devices, and as such all major processing tasks in our proposed system are offloaded to a server through a wireless network connection. This server periodically collects new information from Wikipedia and indexes it in a database, making it easily available to mobile users.

Our system is certainly not the first attempt to present internet content to users on mobile devices. In 2001, Pradhan et al. described a system called Websigns for presenting location-aware content to users on PDAs using various nonvisual sensors [1]. In 2004, Lim et al. presented SnapToTell [2], an early system for providing a directory service to tourists in Singapore based on camera phone photos. Around the same

time, Yeh et al. presented a method of identifying camera phone photos using a hybrid image/text search of the internet [13]. Later, Zhou et al. built a system for identifying books and buildings using a camera phone which sent pictures to a server for matching [3]. More recently, Takacs et al. presented a system for performing location-based image matching using local features which are sent to a smart phone in clusters called loxels [4] for rapid matching. Perhaps the closest work to this paper is by Quack et al., which seeks to build a world-scale dataset of geotagged and clustered photos and associate each one with a Wikipedia entry [5].

Our work is unique in several ways. Our system uses a query-by-example-and-location paradigm to switch between two state-of-the-art matching techniques depending on the location. Our server periodically queries Wikipedia to obtain additional articles and photos to match. Our client software uses a novel user interface that allows the user to interact with queries and results as if they were objects in real space. In summary, we have built a complete system which makes the information in these large scale information repositories easily available on mobile devices.

3. SERVER DESIGN

3.1. Gathering information

Our server periodically crawls Wikipedia to find additional photos and metadata to index. However, images gathered from Wikipedia present a problem because spatial coordinates are typically available for articles, but not for individual images. Furthermore, an article may contain many images and the same image may be present in multiple articles. For this reason we collect references in both directions for use in a TF-IDF weighting scheme [6] during the matching process discussed later. At the time of publication we have collected 70k+ images from Wikipedia, found on 50k+ geotagged articles.

3.2. Indexing and Matching

Our system performs image matching using a hybrid approach that considers both image similarity and location. The primary mode of operation assumes that the location is known and that there are nearby images in the server database. The system first uses the location information to reduce the number of possible matches. If the number of nearby candidates is reasonable (roughly 5–1000), we perform local feature descriptor matching with a best bin first (BBF) algorithm similar to that proposed by Lowe in [7]. One key difference is that we use extended SURF (128-dimensional) features [8] instead of SIFT because they are faster to compute. Furthermore, our system uses the approximate nearest neighbor (ANN) software library released by Arya and Mount [9] to quickly find similar keypoints.
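As a rough illustration of this matching stage, the sketch below filters candidate images by GPS distance and then matches query descriptors against each candidate with an approximate k-d tree lookup and a Lowe-style ratio test. It uses SciPy's cKDTree as a stand-in for the BBF/ANN library used by our server; the CandidateImage record, the 25 km radius, and the 0.7 ratio threshold are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch only: GPS pre-filter followed by approximate
# nearest-neighbor descriptor matching. Data structures and thresholds
# are assumptions, not the system's actual implementation.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt
import numpy as np
from scipy.spatial import cKDTree

@dataclass
class CandidateImage:              # hypothetical database record
    image_id: str
    lat: float
    lon: float
    descriptors: np.ndarray        # (num_keypoints, 128) SURF-like vectors

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two GPS fixes."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def match_query(query_desc, query_lat, query_lon, database,
                radius_km=25.0, ratio=0.7, top_k=5):
    """Return the top_k candidate image_ids ranked by good keypoint matches.

    query_desc is a (num_keypoints, 128) array of query descriptors.
    """
    # 1. Location filter: keep only database images near the query's GPS fix.
    nearby = [c for c in database
              if haversine_km(query_lat, query_lon, c.lat, c.lon) < radius_km]
    scores = []
    for cand in nearby:
        # 2. Approximate nearest-neighbor search over the candidate's descriptors.
        tree = cKDTree(cand.descriptors)
        dists, _ = tree.query(query_desc, k=2)        # two nearest neighbors each
        # 3. Ratio test: count only distinctive matches.
        good = int(np.sum(dists[:, 0] < ratio * dists[:, 1]))
        scores.append((good, cand.image_id))
    return [img_id for _, img_id in sorted(scores, reverse=True)[:top_k]]
```

In the real system the index for a region would be built offline rather than per query; the per-candidate tree above is only for brevity.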

Fig. 2. Some example images from our Flickr/Wikipedia benchmark. These tourist images taken from Flickr are matched against the images found in Wikipedia to identify relevant pages to display to a user. GPS is used to filter the results.

If no location information is available (e.g., if the user is indoors) or the density of nearby images is outside the range discussed above, we use a less accurate but more scalable search method proposed by Nister & Stewenius [10]. This method requires more offline computation time to build a scalable vocabulary tree and add every feature to a special data structure. However, we have found that it scales much better than the ANN data structure for image collections with more than a few thousand images while maintaining comparable accuracy and runtime.

4. CLIENT DESIGN

Our client is currently an Intel Atom-powered MID, but our client software will run on any comparable device with a camera and a network connection. When the program is started, it displays live video from the camera and waits for the user to press the shutter button. As soon as a picture is taken, sensor data from the GPS unit, inertial sensors, and compass are written to the image's EXIF fields and the image is uploaded to the server. Typically the top five matching images are returned to the client, along with a list of pages which include those images and some frequency information that is used to determine page relevance.
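To make this client-server round trip concrete, a minimal sketch of such a query is shown below. The server URL, field names, and response layout are hypothetical; in our actual client the sensor readings are embedded in the photo's EXIF fields rather than sent as separate form fields.

```python
# Hypothetical client-side query: upload a snapshot plus sensor readings
# and receive the top matching images and candidate Wikipedia pages.
# The endpoint, field names, and JSON layout are illustrative only.
import requests

def query_server(photo_path, lat, lon, heading_deg,
                 server_url="http://example.com/wikireality/query"):
    with open(photo_path, "rb") as f:
        response = requests.post(
            server_url,
            files={"photo": f},
            data={"lat": lat, "lon": lon, "heading": heading_deg},
            timeout=10,
        )
    response.raise_for_status()
    result = response.json()
    # Assumed response shape: {"matches": [...], "pages": [...]}
    return result["matches"], result["pages"]

# Example usage with a photo of the Golden Gate Bridge:
# matches, pages = query_server("query.jpg", 37.8199, -122.4783, 290.0)
```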

Page relevance is determined with a TF-IDF scoring mechanism as follows:

\text{Relevance} = \#\text{Matches} \times \frac{1}{\#\text{Images in Page}} \qquad (1)

where #Matches is the number of references to that page found among the top n results (n is usually 5), and #Images in Page is the total number of images on that page. This scoring mechanism generally favors more specific pages. For example, a query image of the Brandenburg Gate in Berlin might produce one match for the page Berlin out of its 39 images (score ≈ 0.026), whereas the Brandenburg Gate page might match six of fifteen (score = 0.4).
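As a small illustration of this scoring rule, the sketch below recomputes scores with toy numbers mirroring the Brandenburg Gate example; the page/image sets are made up and stand in for the references collected by the server.

```python
# Minimal sketch of the page relevance score in Eq. (1).
# "pages" maps a page title to the set of images it contains (toy data).
from collections import Counter

def rank_pages(top_matches, pages):
    """Score each page by (#matching images in top results) / (#images on the page)."""
    hits = Counter()
    for image_id in top_matches:
        for title, images in pages.items():
            if image_id in images:
                hits[title] += 1
    return sorted(((hits[t] / len(pages[t]), t) for t in hits), reverse=True)

# Toy numbers mirroring the example in the text:
pages = {
    "Berlin": {f"berlin_{i}" for i in range(38)} | {"gate_1"},   # 39 images total
    "Brandenburg Gate": {f"gate_{i}" for i in range(1, 16)},     # 15 images total
}
top_matches = ["gate_1", "gate_2", "gate_3", "gate_4", "gate_5", "gate_6"]
print(rank_pages(top_matches, pages))
# -> Brandenburg Gate scores 6/15 = 0.4, Berlin scores 1/39 ≈ 0.026
```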

4.1. User Interface

User interfaces on mobile devices are usually scaled-down versions of their desktop counterparts: text, lists, maps, etc. We present a novel alternative which uses the inertial sensors and is in essence a more convenient version of the early augmented reality systems [11]. When a query is made, its position is recorded using the orientation sensors on the device. As the user moves the device around, an icon representing the query stays pinned to the position at which it was made. This way, multiple queries can be interacted with separately by simply pointing the camera at different locations. For example, in Figure 1, a Wikipedia icon appears to be pinned to the Golden Gate Bridge. This pinned icon appears to hover in the same location while the user moves the camera around. This gives the user the impression that their query exists at that location, like a bookmark which they can come back to later. Clicking the icon opens the page, so a user could edit the page to update its content or upload a new photo, and then return to the camera view to open another nearby page. In addition to placing icons to represent image queries and results, nearby information may be proactively placed for easy access.
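The pinning itself only needs the device orientation recorded at query time and the current orientation. Below is a rough sketch of that screen-space mapping; the field-of-view values, screen size, and linear projection are simplifying assumptions rather than the actual rendering code on the device.

```python
# Rough sketch: place a pinned icon on screen from orientation sensor readings.
# Assumes small angles and a simple linear angle-to-pixel mapping;
# FOV values and screen size are illustrative, not the device's real parameters.

def icon_position(query_yaw, query_pitch, device_yaw, device_pitch,
                  screen_w=1024, screen_h=600, h_fov=60.0, v_fov=40.0):
    """Return (x, y) pixel coordinates of the pinned icon, or None if off-screen."""
    # Angular offset between where the query was made and where the camera points now.
    d_yaw = (query_yaw - device_yaw + 180.0) % 360.0 - 180.0   # wrap to [-180, 180)
    d_pitch = query_pitch - device_pitch
    if abs(d_yaw) > h_fov / 2 or abs(d_pitch) > v_fov / 2:
        return None                                            # icon is outside the view
    x = screen_w / 2 + (d_yaw / h_fov) * screen_w
    y = screen_h / 2 - (d_pitch / v_fov) * screen_h
    return int(x), int(y)

# If the user pans 10 degrees to the right of the query, the icon drifts left:
# icon_position(query_yaw=0, query_pitch=0, device_yaw=10, device_pitch=0)
```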

5. BENCHMARK RESULTS

We have two benchmarks for testing our system: the Zurich Buildings Database (ZuBuD) [12], a standard test set which contains 1005 gallery images and 115 query images, and a smaller qualitative benchmark of images of iconic landmarks that we have collected from Flickr and matched against our Wikipedia dataset. The latter test consists of only 100 geotagged images of 10 major landmarks. Several example images can be found in Figure 2. We do not yet have a quantitative measure of performance for the latter benchmark because our Wikipedia dataset may contain many relevant pages for these images, some of which we may not have anticipated, but we do have extensive qualitative experimental results which demonstrate that the system is indeed capable of performing the desired task.

For the ZuBuD database we report performance using average precision. We use the standard definition of precision for n returned images:

\text{Precision}_n = P(n) = \frac{|\{\text{matched images}\}|}{|\{\text{relevant images}\}|} \qquad (2)

The average precision is then given by:

\text{Average Precision} = \frac{\sum_{r=1}^{N} P(r) \times \text{relevant}(r)}{|\{\text{relevant images}\}|} \qquad (3)

where r is the rank and relevant(r) is a binary function which is 1 if the image at rank r is relevant.

We have used the ZuBuD database to experiment with parameter settings and explore the various performance trade-offs. Figure 3 shows a scatter plot of various parameter settings and how they affect query time and average precision. We can see that the size of the vocabulary in the method proposed by Nister & Stewenius [10] is the key parameter, which confirms what was reported in their paper. On this benchmark, the ANN approach gives more accurate results but is much slower than the other configurations. We also show a brute force pairwise matching of every image. This provides the most accurate results, but the runtime is two orders of magnitude greater.

For our Flickr dataset, we present five example queries and their top five matching results in Figure 4. In each of these tests, we positioned a small camera in front of a printed copy of the image and faked the GPS coordinates to correspond to the actual location. This GPS filtering greatly improves results and leads to some interesting matches, such as the sphinx statues in the last example. These matches are possible because they appear on the same page as an image of the Great Sphinx of Giza. Average query time for these examples was approximately 200 ms.
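To make Equations (2) and (3) concrete, the short sketch below computes the average precision for a single query using the definitions above; the ranked list and relevance labels are toy values, not benchmark data.

```python
# Sketch of the average-precision computation in Eqs. (2)-(3) for one query.
# The ranked list and ground-truth labels are toy data.

def average_precision(ranked_ids, relevant_ids):
    """Average precision over a ranked result list for one query."""
    relevant_ids = set(relevant_ids)
    matched = 0
    ap = 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant_ids:
            matched += 1
            # P(rank): matched images so far over the number of relevant images,
            # following the definition used in the text.
            ap += matched / len(relevant_ids)
    return ap / len(relevant_ids)

# Toy example: 3 relevant images, two of them retrieved at ranks 1 and 3.
print(average_precision(["a", "x", "b", "y", "z"], ["a", "b", "c"]))  # ≈ 0.333
```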

Fig. 3. Total Query Time vs. Average Precision on the ZuBuD benchmark. Brute force is a canonical best bin first keypoint matching; ANN is a fast approximate version. The remaining results use the TF-IDF scoring method discussed in [10] with scalable vocabulary trees of the given size (branch factor × tree depth). The threshold parameter of the keypoint detector is varied to obtain each curve.

6. CONCLUSION

We have presented a system for making community driven websites such as Wikipedia and Flickr easily accessible, through a novel user interface, on a new class of mobile devices.

Fig. 4. Some example query results for images of major landmarks. From left to right: Golden Gate Bridge, Brandenburg Gate, Empire State Building, Taj Mahal, and Great Sphinx of Giza.

Our approach leverages the new sensors that are available on these new devices, but can still function without them. There are many other similar efforts being made in this direction [3, 4, 5, 13]. We have built a complete system with a dataset that is large enough to be useful in many tourism scenarios and a user interface which presents results in a unique and intuitive manner. Furthermore, the indexing of Wikipedia is completely automated, and as such the utility of our system will grow with the number of geotagged articles in Wikipedia. In the future we will be expanding this method to include other sites and usage models. A quantitative benchmark will be needed in order to fully optimize the performance of such a system; this is the subject of our future work.

7. REFERENCES

[1] Salil Pradhan, Cyril Brignone, Jun-Hong Cui, Alan McReynolds, and Mark T. Smith, "Websigns: Hyperlinking physical locations to the web," Computer, vol. 34, no. 8, pp. 42–48, 2001.

[2] J. Lim, J. Chevallet, and S. N. Merah, "SnapToTell: Ubiquitous Information Access from Cameras," in Mobile & Ubiquitous Information Access (MUIA04) Workshop, 2004.

[3] Y. Zhou, X. Fan, X. Xie, Y. Gong, and W.Y. Ma, "Inquiring of the Sights from the Web via Camera Mobiles," in IEEE International Conference on Multimedia and Expo, 2006, pp. 661–664.

[4] G. Takacs, V. Chandrasekhar, N. Gelfand, Y. Xiong, W.C. Chen, T. Bismpigiannis, R. Grzeszczuk, K. Pulli, and B. Girod, "Outdoors augmented reality on mobile phone using loxel-based visual feature organization," in ACM International Conference on Multimedia Information Retrieval (MIR), 2008.

[5] T. Quack, B. Leibe, and L. Van Gool, "World-scale mining of objects and events from community photo collections," in Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, ACM, New York, NY, USA, 2008, pp. 47–56.

[6] K.S. Jones et al., "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 60, pp. 493–502, 2004.

[7] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[8] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Lecture Notes in Computer Science, vol. 3951, pp. 404, 2006.

[9] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman, and A.Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," Journal of the ACM (JACM), vol. 45, no. 6, pp. 891–923, 1998.

[10] D. Nister and H. Stewenius, "Scalable Recognition with a Vocabulary Tree," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006.

[11] S. Feiner, B. MacIntyre, T. Höllerer, and A. Webster, "A touring machine: Prototyping 3D mobile augmented reality systems for exploring the urban environment," Personal and Ubiquitous Computing, vol. 1, no. 4, pp. 208–217, 1997.

[12] H. Shao, T. Svoboda, and L. Van Gool, "ZuBuD: Zurich Buildings Database for Image Based Recognition," Technical Report No. 260, Swiss Federal Institute of Technology, 2003.

[13] T. Yeh, K. Tollmar, and T. Darrell, "Searching the Web with Mobile Images for Location Recognition," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.