BEIRA: A geo-semantic clustering method for area summary
Osamu Masutani, Hirotoshi Iwasaki
Research and Development Group, Denso IT Laboratory, Inc.
3-12-22 Shibuya Shibuya-ku, Tokyo, 150-0002, Japan
{omasutani, hiwasaki}@d-itlab.co.jp
Abstract. This paper introduces a new map browser for location-based contents (LBC) that summarizes area characteristics. Web map services have recently become widely used to search web contents. As the volume of LBC increases, browsing a large number of contents displayed as POIs (points of interest) on a geographical map becomes inefficient. We tackle this issue by using AOIs (areas of interest) instead of POIs. With an AOI, a user can instantly grasp area characteristics without viewing the content of each POI. We assume that semantically homogeneous and geographically distinguishable areas are suitable as AOIs. An AOI is formed by geo-semantic clustering, a co-clustering that takes into account both the geographical and the semantic aspects of POI information. In an experiment using real LBC from the web, we confirmed that our method has the potential to extract good AOIs.

Key words. Map interface, web mining, co-clustering
Introduction

Various web map services (WMSs) have recently come into wide use. A WMS is more suitable than an ordinary search engine for finding location-based contents (LBCs), which are contents associated with some location. For example, finding a favorite restaurant in a specific area is much easier with a WMS than with a search engine. A WMS can display the distribution of search results on a geographic map and lets the user specify a geographic area to exclude results in irrelevant areas. WMSs are now used not only on desktop PCs but also on mobile devices.

In a driving situation, which is our main target, the user interface for map viewing needs to be simplified. For example, a restaurant search in Tokyo returns thousands of candidates. Within only 500 m of Shibuya station there are about 1,000 restaurants (Fig. 1). Choosing among such candidates may be feasible on a PC, but it is impractical on a mobile device because of its limited user interface. Moreover, users on a mobile device often have less time to find a destination than users on a PC. Filtering by restaurant genre can of course reduce the number of candidates, but in a driving situation filtering causes some problems. For example, a user may select Italian restaurants, a favorite genre, but there might be no Italian restaurant near the current location. Even if there are some Italian restaurants, the user might miss restaurants of another genre that is popular and found only in that area. Such situations often occur in an unfamiliar area.

These issues of a limited user interface and improper information retrieval are caused by the lack of an overview of the entire contents. Clustering and summarization techniques are often used to obtain an overview of data, so we apply clustering and summarization to POI (point of interest) data. Some studies propose methods to browse and retrieve POIs by linking neighboring contents with geographic similarity [1] or by showing a textual area summary for each predefined area (e.g. a city) [2]. Using predefined areas is not the best solution for area summaries, because users would like to know exactly which area corresponds to a summary. In [3], the area is automatically extracted by the geographic association of POIs. However, that work does not take the semantic (textual) aspect into account. We assume that the area partitioning should be refined by taking the semantic aspect into account.

We introduce the AOI (area of interest) instead of the POI to represent such an area summary (Fig. 1). An AOI consists of an area surrounded by an irregularly shaped boundary and some summarization labels. We use clustering of POI contents to define the area boundary. The clustering is performed by “geo-semantic” co-clustering, which takes both geographic and semantic aspects into account. To summarize an AOI, we propose “location aware” summarizing, which emphasizes local terms. We think the AOI view efficiently shortens the search process, as long as users prefer representative and unique POIs in a certain area. We developed a map-based contents browser, BEIRA, that extracts and displays AOIs. We evaluate our geo-semantic clustering and location-aware summarization using real restaurant contents on the web.
Fig. 1. The map user interfaces with POI (left) and with AOI (right)
Related Works

Major web search engines have recently begun to introduce geographical or local search services, for example Google Maps 1. Only the POI address in the contents is used to link the contents to a map: geo-coding converts the address to geographical coordinates (latitude and longitude). Other features of the contents are used only as accompanying information for the POI.

1 http://maps.google.com/
Handling more of the content information associated with geographic features has also been studied in the web mining and geographic information system/science (GIS) fields. First, there have been a number of geographical web search studies. Some of them are based on geo-coding of general location names with information extraction techniques [4][5]. Studies such as [2] propose a comprehensive map-based digital library for location-based contents using general name geo-coding. Some research addresses geographical query categorization or result ranking [6,7,8,9], that is, geographically aware search engines. Their purpose is to eliminate unrelated query results according to the geographical scope of a user request. Another type of study partitions web crawling targets according to the geographical location of web pages or web sites [10]. Its focus is to enhance web search performance.

There are also studies that use GIS to deal with location-based contents [1,3,11]. They apply various geographic manipulations to POI datasets extracted from the web. In [1], GIS is used to introduce the “generic link”, which links geographically close contents. Using the generic link, users can find geographically similar contents without an actual web link. The authors of [11] construct a versatile contents management system, GeoWorlds, on a GIS. It handles not only ordinary geographic data (such as remote sensing data) but also location-based contents. The user can analyze textual data in a certain area with some NLP techniques. However, textual data is handled almost separately from geographic data. Geographic features of contents are used only to filter texts within a specified area. The authors of [3] address the extraction of spatial knowledge from location-based contents. Their system finds an optimal region from the spatial distribution of POIs and then extracts a textual summary within the region. In [12], regional co-occurrence summation is used to rank place names in order to highlight place names on the map.
They employ a major POI as the area summary in each area. These studies still use geographic data and semantic data in separate processes. Our main contribution is to extract area summaries by combining them seamlessly.

Combining semantic data with other types of data has also been studied. One major field of such study is multimedia information retrieval, such as image retrieval. Image retrieval searches a database for images that match a textual query. Most commercial services, such as Google Image 2, use only the textual information surrounding the image on the web page. Some advanced systems use both textual and image features to characterize each image on the web. Vector space models of text and image features are combined by linear combination [13] or by a joint distribution model [14]. Furthermore, there are web mining approaches that use semantic features together with other types of features, such as user characteristics or query logs. Various types of co-clustering that can handle heterogeneous data are summarized in [15]. One approach is a tensor-based representation of heterogeneous data [16], in which the features of the data compose a product space of heterogeneous dimensions. The other approach is based on the unified relationship matrix (URM) [17,18], which integrates pairwise relationship matrices into a unified matrix. We use both of these approaches to combine geographic and semantic features.

2 http://images.google.co.jp/
BEIRA – Bird’s Eye Information Retrieval Application

Our application BEIRA (Bird’s Eye Information Retrieval Application) is a prototype of a WMS. BEIRA has two panes for browsing POI information (Fig. 2). The left pane is a geographic map. A user can use the same functions as in typical WMSs, such as searching POIs by name or keyword, filtering candidates by attributes such as genre, or narrowing candidates according to their locations. In addition, a user can see AOIs on this pane. An AOI is drawn as a colored contour with some labels. By using AOIs a user can understand area characteristics at a glance, and AOIs help to narrow down POI candidates. In a driving situation, AOIs reduce the operational load of car navigation when the user must specify a destination from a number of candidates (an AOI can also be used as a destination).

The right pane is a semantic map, which shows the distribution of POIs in the semantic space. The semantic space is a multidimensional space of aspects described later in this paper. The map helps the user compare candidate POIs on a (selected) two-dimensional aspect space. Because a location on the semantic map indicates the characteristics of a POI relative to other POIs, a user is able to pick a favorite POI without reading POI contents. The semantic map also has (purely semantic) AOIs.
Fig. 2. BEIRA user interface
The location-based contents of POIs are extracted from web sites. The contents are processed through geographic and semantic preprocessing (Fig. 3). Both the geographic and the semantic representations of the POIs are used to form clusters by geo-semantic co-clustering. The clusters, or AOIs, are then drawn as contour graphics on the geographic and semantic maps. The label of an AOI is a summarization of its cluster.
Fig. 3. BEIRA data process flow
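The contour step of the flow above can be illustrated with a short sketch. The paper later states that AOIs are drawn as convex hulls by the GIS component; the code below is only an illustration under stated assumptions, not the GIS component's implementation. It uses Andrew's monotone chain (our choice of algorithm), omits the smoothing step, and the function names `convex_hull` and `aoi_contours` are hypothetical.

```python
def convex_hull(points):
    """Convex hull of 2D points in counter-clockwise order.

    Andrew's monotone chain; collinear boundary points are dropped.
    This is an illustrative stand-in for a GIS hull function.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def aoi_contours(poi_coords, cluster_labels):
    """Group POI coordinates by cluster label and return one hull per cluster."""
    clusters = {}
    for xy, label in zip(poi_coords, cluster_labels):
        clusters.setdefault(label, []).append(xy)
    return {label: convex_hull(members) for label, members in clusters.items()}
```

Here `cluster_labels` would come from whichever clustering step is used; any smoothing of the polygon is left to the drawing layer.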
Preprocessing

We use the vector space model (VSM) to represent the semantic space. The terms used for our VSM are the adjectives, nouns and verbs in the location-based text. We employ tf-idf weighting and then perform dimension reduction by latent semantic indexing (LSI). We call the LSI space the semantic (aspect) space. The semantic map browser on the right side of BEIRA simply presents a selected two-dimensional view of this space. The AOIs on the semantic map view are obtained by k-means clustering on the two-dimensional aspect space. The concept and methodology of the semantic map are almost the same as those in [19]. The geographic space is simply a space of geographic coordinates. The addresses in the contents are geo-coded into geographic coordinates.

Labeling and drawing of AOI

The label that summarizes an AOI is the list of top-ranked terms in the POI texts of the AOI cluster. The terms are ranked by tf-idf weighting or by its refined version described later in this paper. We employ a convex hull to draw an AOI: the POI members of the AOI are enclosed by the convex hull. Additionally, the resulting convex hull polygon is smoothed by a GIS smoothing technique.

Geo-semantic co-clustering

In this section we explain why co-clustering on the geo-semantic space is necessary and describe the co-clustering methods applied in BEIRA.
Fig. 4. Explanation of geo-semantic clustering
Fig. 4 shows how geo-semantic blending extracts better clusters than either geographic clustering or semantic clustering alone. The geo-semantic space is represented here as a two-dimensional space. Each POI is mapped onto the space, and two semantic characteristics are represented as black and white, for example. The projections onto the semantic and geographic dimensions are shown as one-dimensional distributions of POIs. Three types of clustering are demonstrated on the POI distribution. Each cluster is labeled according to the density of each type of POI.

With semantic clustering (left side), the two clusters are purely homogeneous in the semantic aspect, but their distributions on the geographic dimension overlap heavily. The purpose of clustering is to extract visually distinguishable areas, so this result is not appropriate. With geographic clustering (bottom), the two clusters form very distinguishable areas. However, neither cluster is homogeneous. Furthermore, the black characteristic is hidden behind the white one. With geo-semantic clustering (circles), moderately homogeneous and geographically distinguishable clusters can be extracted. The result looks like a compromise in both aspects, but it meets the user requirement of finding a “rough” summary of each area.

Several variations of co-clustering of two heterogeneous data sets have been proposed [15]. A simple one is tensor-based co-clustering, which combines two heterogeneous data spaces into one product space. The concept of tensor-based geo-semantic co-clustering is almost the same as in Fig. 4. The combined data is represented as a matrix $L_{tensor}$:
$$L_{tensor} = [\, \lambda_G D_G \;\; \lambda_S D_S \,] \qquad (1)$$
where $D_G$ is a geographic data matrix and $D_S$ is a semantic data matrix. We apply simple k-means clustering on the geo-semantic tensor space $L_{tensor}$.

Another type of co-clustering is URM (Unified Relationship Matrix) based co-clustering, which combines multiple relational data into one matrix. For URM-based clustering we employ M-LSA (Multi-type Latent Semantic Analysis), which is designed for multiple types of interrelated data [18]. M-LSA uses eigenvalue decomposition of the URM to cluster the objects. The URM $L_{URM}$ is defined as:
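$$L_{URM} = \begin{bmatrix} \lambda_{11} M_1 & \lambda_{12} M_{12} & \cdots & \lambda_{1N} M_{1N} \\ \lambda_{21} M_{21} & \lambda_{22} M_2 & \cdots & \lambda_{2N} M_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{N1} M_{N1} & \lambda_{N2} M_{N2} & \cdots & \lambda_{NN} M_N \end{bmatrix} \qquad (2)$$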
where $M_i$ is an intra-type adjacency matrix and $M_{ij}$ is an inter-type adjacency matrix. We use two types of data: geographic and semantic. The geographic data defines an intra-type relationship among POIs through a POI similarity matrix $M_G$ based on the geographic closeness between two POIs. The other relationship matrix, $M_{GS}$, is the semantic relationship between POIs and semantic concepts, which is the result of LSI. The geo-semantic URM $L_{GS}$ is as follows:
$$L_{GS} = \begin{bmatrix} \lambda_{GG} M_G & \lambda_{GS} M_{GS} \\ \lambda_{SG} M_{SG} & 0 \end{bmatrix} \qquad (3)$$
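As a rough illustration of how the block matrix in Eq. (3) could be assembled, the sketch below builds it from plain Python lists. Several details are assumptions that the text does not specify: the Gaussian kernel used for geographic closeness, its `scale_km` parameter, taking $M_{SG}$ as the transpose of $M_{GS}$, and defaulting $\lambda_{SG}$ to $\lambda_{GS}$. The function names are hypothetical, and the eigenvalue decomposition performed by M-LSA is not shown.

```python
import math


def geo_similarity(coords, scale_km=1.0):
    """POI-by-POI similarity M_G from geographic closeness.

    The Gaussian kernel and scale_km are assumptions made for this sketch.
    """
    n = len(coords)
    return [[math.exp(-math.dist(coords[i], coords[j]) ** 2 / (2 * scale_km ** 2))
             for j in range(n)] for i in range(n)]


def build_geo_semantic_urm(m_g, m_gs, lam_gg, lam_gs, lam_sg=None):
    """Assemble [[lam_gg * M_G, lam_gs * M_GS], [lam_sg * M_SG, 0]].

    m_g:  n x n POI similarity matrix
    m_gs: n x k POI-to-concept weights (for example LSI coordinates)
    M_SG is taken to be the transpose of M_GS (assumption).
    """
    if lam_sg is None:
        lam_sg = lam_gs  # assumption: symmetric weighting
    n, k = len(m_gs), len(m_gs[0])
    top = [[lam_gg * m_g[i][j] for j in range(n)] +
           [lam_gs * m_gs[i][c] for c in range(k)] for i in range(n)]
    bottom = [[lam_sg * m_gs[i][c] for i in range(n)] + [0.0] * k
              for c in range(k)]
    return top + bottom
```

With symmetric weights the result is a symmetric (n + k) x (n + k) matrix, which is the form an eigenvalue decomposition would operate on.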
The bias parameter $R = \lambda_{GG} / \lambda_{GS}$ remains to be chosen. It is a geo-semantic ratio that balances the geographic and semantic aspects of clustering. Tensor-based clustering has a similar definition of the geo-semantic ratio, $R = \lambda_G / \lambda_S$.
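The tensor-based variant of Eq. (1) can be sketched in a few lines: each POI row is the geographic coordinates scaled by the ratio R, concatenated with the semantic coordinates, and the rows are clustered by k-means. This is a minimal sketch under stated assumptions rather than the tool used in BEIRA. Fixing $\lambda_S = 1$ so that $\lambda_G = R$, initializing centroids from the first k rows, and the names `geo_semantic_rows` and `kmeans` are all choices made here for illustration.

```python
import math


def geo_semantic_rows(geo, sem, ratio_r):
    """Rows of [lam_G * D_G, lam_S * D_S] with lam_S = 1 and lam_G = R."""
    return [[ratio_r * g for g in g_row] + list(s_row)
            for g_row, s_row in zip(geo, sem)]


def kmeans(rows, k, iters=50):
    """Plain Lloyd's k-means.

    Centroids start at the first k rows so the sketch is deterministic
    (an assumption, not the initialization used in the paper).
    """
    centroids = [list(r) for r in rows[:k]]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Four POIs: two geographic groups, with two semantic types alternating.
geo = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
sem = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
geographic = kmeans(geo_semantic_rows(geo, sem, 100.0), k=2)
semantic = kmeans(geo_semantic_rows(geo, sem, 0.001), k=2)
```

In this toy data a large R groups the POIs by location and a small R groups them by semantic type, which is the trade-off the ratio is meant to control.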
Location aware weighting of terms

We use term weighting both to construct the VSM and to label AOIs. Normally TF/IDF is used to represent term significance. However, we would like term significance to reflect not only the rarity of a term but also its locality. The geographical distribution of a term is defined by the term's occurrences in POI documents. Fig. 5 shows the distributions of some terms. The general term “onion” is distributed widely and uniformly. On the other hand, the location name “Dogenzaka” has a heavily clustered distribution. Another term with a similarly clustered distribution is “wedding”. Wedding parties tend to be held in calm, sophisticated areas, which are often separated from the congested areas of a city. Therefore the term “wedding” is more clustered than general terms. A term with high locality (the value of K here) is regarded as a more suitable label for an AOI than a term with low locality, because users would like to know local and unique characteristics.
Fig. 5. Some types of noun distributions. Panels: general term (“onion”), IDF = 3.08, K = 4.41; location name (“Dogenzaka”), IDF = 3.51, K = 54.0; biased name (“wedding”), IDF = 3.04, K = 9.93.
We employ location-aware TF/IDF (L-TF/IDF) to calculate term importance. The L-TF/IDF of term t in document d, written $d_t$, is defined as follows:

$$d_t = L(t)\,\mathrm{TF}(d,t)\,\mathrm{IDF}(t) \qquad (4)$$
where L(t) is the value of Ripley’s K-function, an index used in point distribution analysis. The K-function reflects the expected density of data points in a circle around a randomly selected point. Its estimate is as follows:
$$L(t) = K(t, dist) = \frac{A}{N^2} \sum_{i \neq j} I\left( \lVert x_i - x_j \rVert < dist \right) \qquad (5)$$
where dist is the distance parameter, A is the area of the target region, N is the number of points in the area, and I() is an indicator function. In Fig. 5 we also give the IDF and K (at 1 km) values of each term. We chose three terms with similar IDF values. The K value of “wedding” is higher than that of “onion”, so “wedding” is regarded as more important than “onion”. Other terms with highly condensed distributions are, for example, “brand” (K=26.1), “department store” (K=26.1), “foreigner” (K=30.0), and “underground” (K=28.6).

System Architecture

BEIRA is a .NET application with the GIS component SIS 3 (see Fig. 6). AOIs are drawn by SIS’s convex hull drawing function with smoothing. Most of the back-end processes are written in Java with some commercial or open-source libraries. We use NQL 4 to extract web pages from POI information sites. We use GATE 5, Weka 6, and Sen 7 for basic Japanese language analysis. Note that most of our techniques are language-independent, although we evaluated them only on data in Japanese. We employed TCT 8 as a high-performance Java-based co-clustering tool, with an enhancement to enable higher-order co-clustering. TCT uses ARPACK for sparse matrix computation, which is suitable for natural language processing. We also use CUDA 9 to accelerate matrix computation for rapid response of the user interface.
Fig. 6. System architecture of BEIRA

3 SIS / Informatix Inc. / Cadcorp Ltd. (http://www.cadcorp.com/)
4 NQL / NQL tech (http://www.nqltech.com/)
5 GATE (A General Architecture for Text Engineering) / Sheffield University (http://gate.ac.uk/)
6 Weka / The University of Waikato (http://www.cs.waikato.ac.nz/ml/weka/)
7 Sen (Japanese morphological analyzer) (http://ultimania.org/sen/)
8 TCT (Text Clustering Toolkit) / University College Dublin (http://mlg.ucd.ie/)
9 CUDA (Compute Unified Device Architecture) / NVidia (http://developer.nvidia.com/)
Evaluation

To evaluate our application we use real-world data. One of the most popular kinds of POI is the restaurant. There are massive numbers of restaurants in Tokyo, and finding a favorite restaurant is very time consuming. We use the reputation texts on the restaurant reputation site asku 10. Each article on the web site has an address, so geo-coding is not a problem here. Over 30,000 restaurants in the Tokyo area are registered and commented on at the site. We chose 289 cafés in Shibuya ward, because cafés seem to reflect area characteristics better than restaurants in general, and Shibuya ward has sub-areas with various characteristics.

We prepared correct (ground-truth) data for the dataset. We carefully chose cluster members by reading each café's reputation text. Only semantically and geographically condensed clusters were chosen. The dimension of the resulting LSI semantic space is 200. LSI is performed on a VSM of about 30,000 restaurant texts. Geographical coordinates are converted into a rectangular coordinate system whose unit is the kilometer. The K-function for L-TF/IDF is calculated at a distance of 1 km.

Geo-semantic ratio

We examined the clustering results using a ratio R between 1.0E-04 (semantic) and 1.0E+06 (geographic). Fig. 7 shows how the AOIs are drawn for each R. With semantic clustering (R=1.0E-04), contour drawing fails because the members of a cluster are spread over the whole area. With geographic clustering (R=1.0E+06), each AOI becomes a circular area around a POI cluster. Geo-semantic clustering extracts some non-circular areas whose members are semantically homogeneous (e.g. the contour in the center area).

Fig. 7. AOI by each geo-semantic ratio R. Panels: semantic (R=1.0E-04), geo-semantic (R=1.0E-02), geographic (R=1.0E+06).
We also performed a sensitivity analysis over R (Fig. 8). The evaluation index is the F-measure of precision and recall on the correct data. We confirmed that there is an optimal R between the geographic and semantic extremes. The optimal R was around 0.01 for both clustering methods. This tells us that geo-semantic clustering is better than purely geographic or purely semantic clustering, although the optimal ratio found here might not be a general value. However, no significant differences were found between M-LSA and tensor-based clustering in this examination.

10 http://www.asku.com

Fig. 8. F-value for each R for M-LSA and tensor-based co-clustering (R ranges from 1.0E-04 to 1.0E+06; F-value axis from 0 to 0.6; series: MLSA, Tensor-Kmeans)
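The evaluation index can be made concrete with a small sketch. The F-measure itself is the standard harmonic mean of precision and recall. How predicted clusters are matched to the hand-made correct clusters is not stated in the text, so the best-match averaging in `best_match_f` is an assumption made only for illustration, as are the function names.

```python
def f_measure(predicted, gold):
    """Harmonic mean of precision and recall between two sets of POI ids."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_match_f(pred_labels, gold_clusters):
    """Average over gold clusters of the best F-measure of any predicted cluster.

    pred_labels[i] is the predicted cluster of POI i. The best-match
    pairing is an assumption; the paper does not describe the pairing.
    """
    clusters = {}
    for poi_id, label in enumerate(pred_labels):
        clusters.setdefault(label, set()).add(poi_id)
    scores = [max(f_measure(c, g) for c in clusters.values())
              for g in gold_clusters]
    return sum(scores) / len(scores)
```

Sweeping R and recording this score for each clustering result would produce a curve of the kind plotted in Fig. 8.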
Furthermore, the result suggests that the geo-semantic ratio R can be calibrated by some method. This will be our next target.

Location-aware summarizing

Next we evaluated our location-aware summarizing, which is based on location-aware weighting of terms. We evaluate a term weighting method by how highly it ranks location names. A location name is regarded as a good proxy for a locally important term. Of course, part-of-speech or term category information is not used to extract them. We measure the performance of the term weighting methods by the density of location names in the top 1,000 ranked terms under each weighting method.
Fig. 9. Density of location name in top 1,000 rank of each term weighting method (x-axis: rank; y-axis: density of location name [%]; series: IDF, K, IDF*K)
The result in Fig. 9 shows that L-IDF (the K-function value multiplied by IDF) is the best method for weighting term importance. High-occurrence terms tend to have high K-function values, so location names are not extracted efficiently by the K-function alone. The list of “local terms” extracted by this method might also be a good estimator of dialect.
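The location-aware weighting evaluated above combines Eq. (4) and Eq. (5). The following is a minimal sketch of that computation, not the authors' implementation. The IDF variant log(N/df), the representation of documents as token lists, and the function names are assumptions; coordinates are taken to be planar and in kilometers, as in the evaluation setup.

```python
import math


def ripley_k(points, dist, area):
    """Eq. (5): A / N^2 times the number of ordered pairs (i != j)
    whose distance is below dist."""
    n = len(points)
    if n < 2:
        return 0.0
    close = sum(1 for i in range(n) for j in range(n)
                if i != j and math.dist(points[i], points[j]) < dist)
    return area * close / (n * n)


def idf(term, docs):
    """log(N / df); this particular IDF variant is an assumption."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0


def l_tf_idf(term, doc, docs, coords, dist=1.0, area=1.0):
    """Eq. (4): L(t) * TF(d, t) * IDF(t).

    L(t) is the K value of the locations of all documents containing
    the term. docs are token lists and coords[i] is the location of docs[i].
    """
    term_points = [xy for d, xy in zip(docs, coords) if term in d]
    return ripley_k(term_points, dist, area) * doc.count(term) * idf(term, docs)
```

A term whose documents sit close together receives a larger L(t) than an equally rare term spread over the whole region, which matches the “wedding” versus “onion” comparison in Fig. 5.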
Conclusion

In this paper we proposed a new map-based browser of web contents. The core concept is the AOI, a semantically homogeneous and geographically distinguishable cluster of POIs. Geo-semantic clustering is assumed to be able to extract AOIs from POI data. The geo-semantic ratio R is the only free parameter, and we evaluated it on real restaurant reputation web pages. The result indicates that geo-semantic clustering extracts more suitable AOI borders than geographic clustering or semantic clustering alone. Additionally, we proposed a location-aware summarizing method that takes term locality into account. We confirmed that our location-aware term weighting is a good index for location-aware summarizing.

For future work, we will evaluate our method with other types of location-based contents, such as restaurants, shops, and sightseeing spots. We will also attempt a usability evaluation to confirm what the best AOI for users is. An optimal ratio R for AOIs should be estimated for every type of region or genre.
References

1. Hiramatsu, K., Ishida, T., An Augmented Web Space for Digital Cities, Proceedings of the International Symposium on Applications and the Internet SAINT’01 (2001) 105
2. Lim, E., Goh, D., Wee-Keong, N., Khoo, C., Higgins, S., G-Portal: a map-based digital library for distributed geospatial and georeferenced resources, Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries (2002) 351-358
3. Morimoto, Y., Aono, M., Houle, M., McCurley, K., Extracting Spatial Knowledge from the Web, Proceedings of the International Symposium on Applications and the Internet SAINT’03 (2003) 326-333
4. Kanada, Y., A Method of Geographical Name Extraction from Japanese Text for Thematic Geographical Search, 18th International Conference on Information and Knowledge Management CIKM’99 (1999) 46-54
5. Amitay, E., Har'El, N., Sivan, R., Soffer, A., Web-a-where: geotagging web content, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (2004) 273-280
6. Buyukokkten, O., Cho, J., Garcia-Molina, H., Gravano, L., Shivakumar, N., Exploiting Geographical Location Information of Web Pages, Proceedings of Workshop on Web Databases WebDB’99 (1999)
7. Ding, J., Gravano, L., Shivakumar, N., Computing Geographical Scopes of Web Resources, 26th International Conference on Very Large Databases (2000)
8. Silva, M. J., Martins, B., Chaves, M., Cardoso, N., Afonso, A., Adding geographic scopes to web resources, ACM SIGIR 2004 Workshop on Geographic Information Retrieval (2004)
9. Asadi, S., Xu, J., Shi, Y., Diederich, J., Zhou, X., Calculation of Target Locations for Web Resources, Proceedings of the 7th International Conference on Web Information Systems Engineering WISE 2006, LNCS 4255 (2006) 277-288
10. Exposto, J., Macedo, J., Pina, A., Alves, A., Rufino, J., Geographical partition for distributed web crawling, Proceedings of the 2005 workshop on Geographic information retrieval (2005) 55-60
11. Neches, R., Yao, K., Ko, I., Bugacov, A., Kumar, V., Eleish, R., GeoWorlds: Integrating GIS and Digital Libraries for Situation Understanding and Management, The New Review of Hypermedia and Multimedia NRHM, Volume 7 (2001) 127-152
12. Tezuka, T., Kurashima, T., Tanaka, K., Toward tighter integration of web search with a geographic information system, Proceedings of the 15th International Conference on World Wide Web WWW’06, ACM Press (2006) 277-286
13. Sclaroff, S., Cascia, M., Sethi, S., Unifying textual and visual cues for content-based image retrieval on the World Wide Web, Computer Vision and Image Understanding, Volume 75, Issue 1-2 (1999) 86-98
14. Barnard, K., Forsyth, D., Learning the Semantics of Words and Pictures, International Conference on Computer Vision, vol 2 (2001) 408-415
15. Liu, T. Y., High-order Heterogeneous Object Co-clustering, The 4th Chinese workshop on Machine Learning and Application (2006)
16. Sun, J.-T., Zeng, H.-J., Liu, H., Lu, Y., Chen, Z., CubeSVD: A Novel Approach to Personalized Web Search, Proceedings of the 15th International Conference on World Wide Web WWW’05 (2005) 382-390
17. Xi, W., Zhang, B., Chen, Z., Lu, Y., Yan, S., Ma, W., Fox, EA., Link fusion: a unified link analysis framework for multi-type interrelated data objects, Proceedings of the 13th international conference on World Wide Web WWW '04, ACM Press (2004) 319-327
18. Wang, X., Sun, J., Chen, Z., Zhai, C., Latent Semantic Analysis for Multiple-Type Interrelated Data Objects, SIGIR’06 (2006) 236-243
19. Yoshida, N., Zushi, T., Kiyoki, Y., Kitagawa, T., Context Dependent Dynamic Clustering and Semantic Data Mining Method for Document Data, IPSJ, Vol41. No SIG1 (2000) 127-139