Web Information Retrieval Based on the Localness Degree

Chiyako Matsumoto¹, Qiang Ma², and Katsumi Tanaka²

¹ Department of Computer and Systems Engineering, Graduate School of Science and Technology, Kobe University. Rokkodai, Nada, Kobe, 657-8501, Japan
chiyako@db.cs.scitec.kobe-u.ac.jp
http://www.db.cs.scitec.kobe-u.ac.jp/

² Department of Social Informatics, Graduate School of Informatics, Kyoto University. Yoshida Honmachi, Sakyo, Kyoto, 606-8501, Japan
{qiang,tanaka}@dl.kuis.kyoto-u.ac.jp
http://www.dl.kuis.kyoto-u.ac.jp/

Abstract. A vast amount of information is available on the WWW. There are a lot of Web pages whose content is 'local' and interesting only to people in a very narrow regional area. Users usually search for information with search engines, but finding or excluding local information remains difficult. In this paper, we propose a new information retrieval method that is based on the localness degree for discovering or excluding local information from the WWW. Localness degree is a new notion for estimating the local dependence and ubiquitous nature of Web pages. The localness degree is computed by 1) a content analysis of the Web page itself, to determine the frequency of occurrence of geographical words and the geographical area (i.e., latitude and longitude) covered by the location information given on the page, and 2) a comparison of the Web page with other pages with respect to daily (ubiquitous) content. We also show some results of preliminary experiments with the retrieval method based on the localness degree.

1 Introduction

The World Wide Web (WWW) is rapidly progressing and spreading. In the year 2000, about 2.1 billion pages were available on the Web, and about 7 million new pages appeared every day [1]. Everyone, both novice and expert, can access the vast amount of information available on the WWW and find what they are looking for. Conventionally, users input keywords or keyword-based user profiles to search for information of interest. Unfortunately, it is not always easy to specify keywords for the information a user is interested in, especially for recently posted information such as new pages and news articles. Ma [2,3] and Miyazaki [4] focused on the time-series features of information and proposed measures, called freshness and popularity, for fetching new, valuable information.

R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 172–181, 2002. © Springer-Verlag Berlin Heidelberg 2002


However, although a great deal of locally bound information exists on the Web, it is not always easy to find or eliminate local information via conventional methods. Some portal Web sites [5,6] provide regional information clustered by keywords. On such portal sites, users are required to specify their interests clearly by keywords. Moreover, since the regional information is entered by users, the searchable information may be limited, and some valuable local information may be missed.

In this paper, we propose a non-traditional information retrieval method that is based on the localness degree. Localness degree is a new notion for discovering or excluding local information from the WWW. In this paper, we assume that local information includes:

– information that interests only users in some particular region or organization (we call this regional interest information), and
– information that describes something (an event, etc.) about some particular region or organization (we call this regional content).

From these two aspects, regional interest and regional content, we propose two types of localness degree of a Web page: 1) localness of content, and 2) localness of community.

Localness of Content. Localness of content means how much the content is concerned with a specific region, from the aspect of regional content. We analyze the content of a Web page to estimate its region dependence. If a page has high region dependence, it should be regional content, and its probability of being local information may therefore be high. We call this type of localness the localness of content. First, we compute the frequency and the level of detail of the geographical words appearing on a Web page. In this paper, we assume that geographical words are region names, organization names, and so on. If geographical words appear frequently and in detail, the localness degree is high.
Therefore, the page could be regional content and its localness degree of content may be high. We also compute the coverage of these geographical words with respect to location information (that is, latitude and longitude). If the geographical coverage is small and there are many detailed geographical words (e.g., regional names), the localness degree of the Web page is considered high. In other words, we compute the density of geographical words over the geographical coverage to estimate the localness degree. If the density is high, the localness degree of content of that page could be high.

Localness of Community. Localness of community¹ means the localness of the interests, that is, to what extent the interests are bound to a particular region, where the interests are a group of people interested in a particular topic (content). Some events occur anywhere, and some events occur only in some particular region. Information about the latter is a scoop and may interest all users (of all regions). In other words, it may be global interest information rather than regional interest information. Therefore, if a page describes such a scoop event (the latter type), it may not be local information. On the other hand, if a page describes a ubiquitous event (the former type, with high similarity between such events apart from location or time), it may interest only users of some particular region. For instance, summer festivals are held everywhere in Japan. These events are similar to one another, although the location and time may differ. People may be interested only in the summer festival of their own town. In other words, a summer festival may interest only local residents. We call this kind of information local information of community.

We compare the page being estimated with other Web pages to determine whether it represents a ubiquitous topic or event. If it does, its localness degree of community may be high. The first step is to exclude geographical words and proper nouns from the Web page. Then, if the page contains many words that often appear in other pages, it is regarded as ubiquitous information. Ubiquitous information means information that can be found anywhere and anytime. This information could be an event, or information given daily. The localness degree of such a page is considered to be high. We also note that there may be many pages covering the same event. Such pages should not be local but global or popular [2,3,4]. To distinguish this case, we just need to check the location information of the events reported by the page being estimated and by the comparison pages. If the locations are different, we can say that these pages represent different daily events, and the localness degree of these pages may be high.

The remainder of this paper is organized as follows. In Section 2, we give an overview of related work. In Section 3, we present a mathematical definition of the localness degree of Web pages. We show some preliminary evaluation results in Section 4. Finally, we conclude this paper with some discussion and a summary in Section 5.

¹ In this paper, the term 'localness' means physical localness, not logical localness. For this reason, we do not consider logical (virtual) communities in this paper.

2 Related Work

Mobile Info Search (MIS) [7] is a project that proposes a mobile-computing methodology and services for utilizing local information from the Internet. The goal of MIS is to collect, structure, and integrate distributed and diverse local information from the Internet into a practical form, and to make it available through a simple interface to mobile users in various situations or contexts. The researchers in this project do this in a location-oriented way, and have produced a MIS prototype system [8]. The prototype system exchanges information bi-directionally between the virtual Web world and the real world. Our research differs in that we define a new concept, the localness degree of a Web page, for discovering 'local' information from the Web.

Buyukkokten et al. [9] discussed how to map a Web site to a geographical location, and studied the use of several geographical keys for the purpose of assigning site-level geographical context. By analyzing "whois" records, they built a database that correlates IP addresses and hostnames to approximate physical locations. By combining this information with the hyperlink structure of the Web, they were able to make inferences about the geography of the Web at the granularity of a site. They also presented a tool to visualize geographical Web data.

The Digital City work [10] proposes an augmented Web space and its query language to support geographical querying and sequential plan creation utilizing a digital city, that is, a city-based information space on the Internet. The augmented Web space consists of home pages, hyperlinks, and generic links that represent geographical relations between home pages. The authors have applied the proposed augmented Web space to Digital City Kyoto, a city information service system that is accessed through a 3D walk-through implementation and a map-based interface. In contrast, our work focuses on how to discover local information from the Web contextually, using a page's content and its correlation with other pages.

Geographic Search [11] adds the ability to search for Web pages within a particular geographic locale to traditional keyword searching. To accomplish this, Egnor converted street addresses found within a large corpus of documents to latitude-longitude-based coordinates using the freely available TIGER and FIPS data sources, and built a two-dimensional index of these coordinates. Egnor's system provides an interface that allows the user to augment a keyword search with the ability to restrict matches to within a certain radius of a specified address. Our research differs in that we consider how strongly a page is tied to an area.

The Clever (Client-side Eigenvector-Enhanced Retrieval) [12,13] system uses algorithmic methods to examine the sites linking to and from a Web site. Clever attempts to ensure that the information it retrieves is useful by pointing users toward one of two classes of sites, called authorities and hubs. Because Clever concentrates on hyperlinks, it can discover communities interested in the same topics. Regional interest information, which is a kind of local information in our work, can also be considered information of interest to a 'physical' regional community.

3 Localness Degree

In this section, we show the approaches for discovering local information from the aspect of regional content and the aspect of regional interest, respectively. As mentioned above, the former is called localness of content and the latter localness of community.

3.1 Localness of Content

Regional content, which describes something (an event, etc.) about some particular region or organization, may have high region dependence: many geographical words, many organization names, and so on. To compute the localness degree, the frequency of geographical words in the Web page content should be estimated. Usually, if a Web page has many detailed geographical words, its localness degree is considered to be high. More specifically, if the frequency of geographical words is high, and if these words represent very detailed location (i.e., place) information, we say that its localness degree might be high. Moreover, with the location data (i.e., latitude and longitude), we can estimate the geographical coverage of the Web page, based on the MBR (Minimum Bounding Rectangle) [14,15]. When the density of geographical words over the covered area is high, the localness degree of the page is considered to be high.

Density of Geographical Words over a Web Page's MBR. Usually, if a page covers a narrow geographical area, it may be a type of regional content and its localness degree is considered to be high. In other words, if the content coverage of a page is wide, its localness degree should be lower.


We use latitude and longitude to estimate page coverage. First, we transform the location information of each geographical word into a two-dimensional point (latitude, longitude). We plot all of these points on a map whose x-axis and y-axis are latitude and longitude, respectively. Page coverage then corresponds to the MBR (Minimum Bounding Rectangle) [14,15] of these points. When page coverage is narrow, the localness degree should be high. The number of points plotted in the MBR is also taken into account when estimating the localness degree in our approach. In other words, not only the area of the MBR but also the number of plotted points affects the localness degree.

Fig. 1. Example for Influence of Density on Localness (two MBR panels; axes: latitude and longitude)

For example, in Figure 1, four points are plotted in the left MBR, and eight points in the right MBR. The point size indicates the level of detail of the geographical words and organization names: the bigger the point, the higher the level of detail. Even if the areas of the two MBRs are equal, their localness degrees should be different. The level of detail and the frequency of geographical words are discussed more fully later. We can compute the density of geographical words over the MBR to estimate the localness degree. The higher the density is, the higher the localness degree is considered to be. The formula is defined as follows.

$$Lcl_d(page_x) = \frac{\sum_{i=1}^{n} weight(geoword_i)}{MBR_{page_x}} \qquad (1)$$

where $Lcl_d$ is the localness degree, this time with respect to density; $weight(geoword_i)$ is the function that we use to estimate the level of detail of the $i$-th geographical word or organization name in $page_x$; $n$ is the number of geographical words and organization names in $page_x$; and $MBR_{page_x}$ is the area of the MBR for $page_x$, which is calculated with Formula (2) and will be described later.
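As a minimal sketch of Formula (1), the density can be computed by summing per-word weights and dividing by the MBR area. The numeric values in `DETAIL_WEIGHT` are hypothetical, since the paper fixes only their order (country < organization < state < city < town), and the function and variable names are assumptions made for illustration.

```python
# Hypothetical weights: only the ordering
# country < organization < state < city < town comes from the paper.
DETAIL_WEIGHT = {
    "country": 1.0,
    "organization": 2.0,
    "state": 3.0,
    "city": 4.0,
    "town": 5.0,
}


def localness_of_content(geoword_levels, mbr_area):
    """Formula (1): sum of geoword weights divided by the MBR area.

    geoword_levels: one administrative level per occurrence of a
        geographical word or organization name in the page.
    mbr_area: area of the page's MBR (Formula (2)); must be positive.
    """
    if mbr_area <= 0:
        raise ValueError("mbr_area must be positive")
    total_weight = sum(DETAIL_WEIGHT[level] for level in geoword_levels)
    return total_weight / mbr_area


# Two pages with the same MBR area but different numbers of geowords,
# mirroring the left and right panels of Figure 1.
sparse_page = ["city", "city", "town", "state"]
dense_page = sparse_page * 2
print(localness_of_content(sparse_page, mbr_area=2.0))
print(localness_of_content(dense_page, mbr_area=2.0))
```

With equal areas, the page with twice as many weighted geographical words receives twice the localness degree, which is the effect Figure 1 illustrates.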


Frequency and Detail of Geographical Words. Generally, a Web page with a high dependence on an area includes a lot of geographical information, such as the names of the country, state (prefecture), city, and so on. Because many organizations are related to a specific location, we can also use the frequency of organization names to estimate the localness degree. In short, if the frequency of geographical words and organization names is high, the page's localness degree is considered to be high. The level of detail of geographical words and organization names is also considered when we compute the localness degree of a Web page. The level of detail of a geographical word depends on its administrative level. For example, a city is at a lower administrative level than a state, so we say that a city is more detailed than a state. Therefore, when computing the frequency, we assign weight values to geographical words² and organization names according to their level of detail. The simplest rule is to set up weight values in the following order: country name < organization name < state (prefecture) < city < town.

Location Information Based on Geographical Words. In our current working phase, we use a Japanese location information database [16], which includes latitude and longitude data for the 3,252 cities, towns, and villages in Japan. We match geographical words with this location data. We noticed that some detailed place names have no match in the database. In this case, we downgrade the detail level of the place name to find an approximate location. For example, if the location data of "S street, B city, A state" is not found, we could use the location data of "B city, A state" as an approximate match. Different places may share the same name. To avoid a mismatch, we could analyze the page's context to identify the location.
For example, to match "Futyu city" to its correct location data, we can examine the page content for either "Hiroshima prefecture" or "Tokyo metropolitan". If "Hiroshima prefecture" appears, we could use the location data of "Futyu city, Hiroshima prefecture" for the "Futyu city" on this page. Another example is "Paris", which may mean the city of Paris in the USA or the capital city of France.

Once the locations are determined, we can compute the area of the MBR, which is based on the geographical words in $page_x$, to estimate the localness degree of $page_x$. If the area is large, the localness degree of $page_x$ is small.

$$MBR_{page_x} = (lat_{max} - lat_{min})(long_{max} - long_{min}) \qquad (2)$$

where $MBR_{page_x}$ is the area of the MBR for $page_x$. The maximum latitude and longitude are $lat_{max}$ and $long_{max}$, respectively. The minimum latitude and longitude are $lat_{min}$ and $long_{min}$, respectively. When a page contains only one point, so that $MBR_{page_x}$ would be 0, we treat this as an exception and set $MBR_{page_x} = 1$.
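The lookup with detail downgrading and the MBR area of Formula (2) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `LOCATION_DB` is a hypothetical stand-in for the location database [16] with made-up coordinates, place names are assumed to be comma-separated from most to least detailed, and treating every zero-area MBR like the single-point case goes beyond what the paper specifies.

```python
# Hypothetical stand-in for the location database [16]:
# place name -> (latitude, longitude). Coordinates are illustrative.
LOCATION_DB = {
    "B city, A state": (34.5, 135.0),
    "C city, A state": (35.0, 135.5),
}


def approximate_location(place_name, db=LOCATION_DB):
    """Look up a place; on a miss, drop the most detailed component.

    "S street, B city, A state" falls back to "B city, A state".
    Returns None when no level of the name matches.
    """
    parts = [part.strip() for part in place_name.split(",")]
    while parts:
        key = ", ".join(parts)
        if key in db:
            return db[key]
        parts = parts[1:]  # downgrade the detail level by one step
    return None


def mbr_area(points):
    """Formula (2): (lat_max - lat_min) * (long_max - long_min).

    The paper sets the area to 1 when a page has a single point. This
    sketch applies the same value to any zero-area MBR (for example,
    points sharing one latitude), which is an assumption.
    """
    if not points:
        raise ValueError("at least one point is required")
    lats = [lat for lat, _ in points]
    longs = [lng for _, lng in points]
    area = (max(lats) - min(lats)) * (max(longs) - min(longs))
    return area if area > 0 else 1.0


names = ["S street, B city, A state", "C city, A state"]
points = [approximate_location(name) for name in names]
print(points)
print(mbr_area(points))
```

Resolving ambiguous names such as "Futyu city" from page context is not covered here; it would need an additional step that inspects co-occurring prefecture names.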

² In our current work, we consider the level of detail of geographical words based only on administrative level. We observe that the population ratio is also important. We will discuss this issue in our future work.

3.2 Localness of Community

Some events occur anywhere, and some events occur only in some particular region. Information about the latter is a scoop and may interest all users, independent of their region or organization. In other words, it may be global interest information rather than regional interest information. Conversely, a ubiquitous occurrence may have a high localness degree regardless of location. For example, a summer festival, an athletic meet, and so on, are ubiquitous events that happen wherever people are. A ubiquitous occurrence may be a normal part of daily life. Therefore, if a page describes a scoop event (the latter type), it may not be local information. On the other hand, if a page describes a ubiquitous event (the former type), it may be local information.

We define the localness degree by investigating the degree of similarity between pages and the locations where events happen. If the pages show a high degree of similarity but differ in event location, the localness degree is considered high, as long as the event locations are irrelevant, meaning that the events could be held anywhere. The degree of similarity is calculated after excluding the geographical words. The formula for calculating the degree of similarity $sim(A, B)$ between page $A$ and page $B$ is as follows.

$$sim(A, B) = \frac{v(A) \cdot v(B)}{|v(A)|\,|v(B)|} \qquad (3)$$

where $v(A)$ and $v(B)$ are the keyword vectors of pages $A$ and $B$. The formula for the localness degree $Lcl_r$ of $page_x$, with respect to the comparison with other pages, is as follows.

$$Lcl_r(page_x) = \begin{cases} \dfrac{m}{N} & \text{(different location)} \\ 1 - \dfrac{m}{N} & \text{(same location)} \end{cases} \qquad (4)$$

where $m$ is the number of pages similar to $page_x$, with similarity computed after excluding geographical words, organization names, and proper nouns, and $N$ is the number of pages compared.
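Formulas (3) and (4) can be sketched as below, under several assumptions that the paper does not specify: keyword vectors are plain term-frequency counts, two pages count as similar when their cosine similarity reaches `sim_threshold` (0.5 is a placeholder), and the caller supplies a single `same_location` flag for the whole comparison. All function names are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(words, excluded):
    """Term-frequency vector of a page after dropping excluded words
    (geographical words, organization names, proper nouns)."""
    return Counter(word for word in words if word not in excluded)


def cosine_similarity(vec_a, vec_b):
    """Formula (3): dot product over the product of vector norms."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def localness_of_community(page, others, excluded, same_location,
                           sim_threshold=0.5):
    """Formula (4): m/N for different locations, 1 - m/N for the same.

    sim_threshold decides when two pages count as "similar"; the paper
    does not give a value, so 0.5 is a placeholder assumption.
    """
    if not others:
        raise ValueError("at least one comparison page is required")
    target = keyword_vector(page, excluded)
    m = sum(
        1
        for other in others
        if cosine_similarity(target, keyword_vector(other, excluded))
        >= sim_threshold
    )
    ratio = m / len(others)
    return 1.0 - ratio if same_location else ratio


excluded = {"kobe", "kyoto"}
page = ["summer", "festival", "fireworks", "kobe"]
others = [
    ["summer", "festival", "fireworks", "kyoto"],
    ["election", "results", "kyoto"],
]
print(localness_of_community(page, others, excluded, same_location=False))
```

In this toy example one of the two comparison pages is similar once place names are removed, so the degree is 0.5 for different locations; with the same location the formula would instead treat the overlap as evidence of a widely covered event.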

4 Preliminary Experiments

In this section, we show the results of two preliminary experiments, which estimate localness of content and localness of community, respectively. In these experiments, we excluded all of the structural information (e.g., HTML tags) and the advertisement content from the HTML source pages. We used 400 Web pages (written in Japanese) from the Web site ASAHI.COM (http://www.asahi.com/), which is a well-known news Web site in Japan.

4.1 Preliminary Experiment 1: Localness of Content

In preliminary experiment 1, we computed the localness degree of content of each Web page with Formula (1), which follows the regional content aspect and computes the localness based on the density of geographical words over the Web page's MBR. For comparison, all of the estimated Web pages (400 pages) are organized as regional information by the editors of ASAHI.COM. There are 224 pages whose localness degree is greater than the threshold of 10, and these 224 pages are considered local pages under our regional content aspect. The recall ratio is 0.624. As our evaluation is a limited one, further improvements are needed. Nevertheless, these results indicate that localness of content is suitable for organizing local information.

Fig. 2. Results of Preliminary Experiment 2: Localness of Community (Venn diagram over all pages: 5 selected by the system only, 39 selected by both the system and the user, 243 selected by the user only, 113 selected by neither)

4.2 Preliminary Experiment 2: Localness of Community

In preliminary experiment 2, we estimated the localness of community with Formula (4), which follows the regional interest aspect. As shown in Figure 2, 44 pages have a community localness value greater than the threshold of 0.0875. These 44 pages are selected as local pages. On the other hand, the number of human-selected local pages is 282. The recall and precision ratios are 0.138 and 0.886, respectively. The recall ratio is very poor. One possible reason is that we used news articles as the example pages for the preliminary experiments. Because news articles are more formal and unique than other Web pages, we may fail to compute the similarity between pages needed to discover ubiquitous information. Other kinds of Web pages need to be evaluated. The number of estimated pages is another possible reason.
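The reported ratios can be checked against the counts in Figure 2 with a short calculation. The function name and argument names below are illustrative only; the counts (39 pages selected by both, 5 by the system only, 243 by the user only) are the ones shown in the figure.

```python
def precision_recall(selected_by_both, system_only, user_only):
    """Precision and recall of the system against human selection."""
    system_selected = selected_by_both + system_only
    user_selected = selected_by_both + user_only
    precision = selected_by_both / system_selected
    recall = selected_by_both / user_selected
    return precision, recall


# Counts read from Figure 2: 39 pages chosen by both, 5 by the system
# only, and 243 by the user only (44 system-selected, 282 user-selected).
precision, recall = precision_recall(39, 5, 243)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints "precision=0.886 recall=0.138"
```

The low recall follows directly from the small overlap: only 39 of the 282 human-selected pages pass the threshold.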

5 Conclusion

There are a lot of Web pages whose content is 'local' and of interest only to people in a very narrow region. Usually, users search for information with search engines. However, since it is not easy to specify a query for acquiring information of interest via conventional keyword-based technologies, the efficiency and ability to acquire or exclude local information are limited. In this paper, we proposed a new notion, the localness degree, for discovering or excluding local information from the WWW. We consider that the following two kinds of information may be local: 1) regional content, and 2) regional interest information. According to this assumption, we defined two types of localness for Web pages:

1. localness of content, which is based on content analysis of the page (according to the aspect of regional content);
2. localness of community, which is based on comparison with related pages to estimate the page's ubiquitousness (according to the aspect of regional interest).

Our mechanism for discovering local information is useful to

– acquire local information that is not easy to specify clearly in keywords;
– exclude local information;
– discover local information over multiple regions.

The definition of the localness degree will be improved and evaluated in our future work. For example, for localness of community, we could also analyze the user access history of Web pages. If a Web page is often accessed by users from the same region, and there are few accesses from other regions, we could say that this page is of regional interest and its localness of community should be high. A hub/authority-style link analysis approach for discovering local information will be discussed in the near future. We also plan to develop applications for local information dissemination based on the localness degree of information.

Acknowledgement

This research is partly supported by the Japanese Ministry of Education, Culture, Sports, Science and Technology under Grant-in-Aid for Scientific Research on "New Web Retrieval Services Based on Discovery of Web Semantic Structures", No. 14019048, and "Multimodal Information Retrieval, Presentation, and Generation of Broadcast Contents for Mobile Environments", No. 14208036.

References

1. Cyveillance. http://www.cyveillance.com/web/us/newsroom/releases/2000/2000-07-10.htm.
2. Qiang Ma, Kazutoshi Sumiya, and Katsumi Tanaka. Information filtering based on time-series features for data dissemination systems. TOD7, 41(SIG6):46–57, 8 2000.
3. Qiang Ma, Shinya Miyazaki, and Katsumi Tanaka. Webscan: Discovering and notifying important changes of web sites. In DEXA, volume 2113 of Lecture Notes in Computer Science, pages 587–598. Springer, 9 2001.
4. Shinya Miyazaki, Qiang Ma, and Katsumi Tanaka. Webscan: Content-based change discovery and broadcast-notification of web sites. TOD10, 42(SIG8):96–107, 7 2001.
5. Yahoo!regional. http://local.yahoo.co.jp.
6. MACHIgoo. http://machi.goo.ne.jp.
7. Nobuyuki Miura, Katsumi Takahashi, Seiji Yokoji, and Kenichi Shima. Location oriented information integration ~mobile info search 2 experiment~. The 57th National Convention of IPSJ, 3:637–638, 10 1998.
8. MIS2. http://www.kokono.net/.
9. Orkut Buyukkokten, Junghoo Cho, Hector Garcia-Molina, Luis Gravano, and Narayanan Shivakumar. Exploiting geographical location information of web pages. In WebDB (Informal Proceedings), pages 91–96, 1999.
10. Kaoru Hiramatsu and Toru Ishida. An augmented web space for digital cities. In SAINT, pages 105–112, 2001.
11. Daniel Egnor. Geographic search. Technical report, Google Programming Contest, 2002.
12. Soumen Chakrabarti, Byron Dom, David Gibson, Jon M. Kleinberg, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Hypersearching the web. Scientific American, 1999.
13. Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
14. Antonin Guttman. R-trees: A dynamic index structure for spatial searching. Proc. ACM SIGMOD Conference on Management of Data, 14(2):47–57, 1984.
15. Carlo Zaniolo, Stefano Ceri, Christos Faloutsos, Richard T. Snodgrass, V. S. Subrahmanian, and Roberto Zicari. Advanced Database Systems. Morgan Kaufmann, 1997.
16. T. Takeda. The latitude / longitude position database of all-prefectures cities, towns and villages in Japan, 2000.
