Submission - Eighth International WWW Conference

GeoViser: Geo-Spatial Clustering and Visualization of Search Engine Results

Jayesh Govindarajan and Matthew Ward

Computer Science Department, Worcester Polytechnic Institute
100 Institute Road, Worcester, MA 01609
email: jayeshg/[email protected]

November 30, 1998

Abstract

With the explosive growth of the information available on the Internet, navigating the web successfully in order to obtain information has become an important issue. This typically involves searching for information using keyword and subject-based searches. To the best of our knowledge, most tools that pursue this goal are not designed to find the geographic location of information sources. In many instances such information could prove extremely useful for answering queries with certain geo-spatial elements. Our work focuses on the problem of finding and visually presenting the geographic locations associated with the web documents retrieved via a search engine. The research aims to aggregate the geographic locations of the retrieved documents to find clusters and outliers. We then look at ways of spatially visualizing these results on a map using glyphs and other techniques. The main contribution of this paper is the idea of discovering the geographic location of webpages obtained via a search engine query and visualizing the results by placing them on a map, in an attempt to convey spatial attributes such as location and distance. We also look into issues of presenting such data depending on the relevance of the hits, where the relevance depends on factors such as the ranking provided by the search engine or the number of relevant pages retrieved from a website. Current search engines ignore the existence of such geographic metadata and therefore are unable to process queries which are inherently geo-spatial in nature. Our system (GeoViser) integrates this functionality to supplement the current capabilities of a search engine.

1 Introduction

The Internet continues to grow, along with the information available from it. The immense popularity of this medium stems from the existence of techniques which provide effortless access to information from distributed sources. There are several search tools today, such as Lycos [35], AltaVista [32] and SavvySearch [36], which enable users to find relevant information from the web. All these search engines use programs called Web robots, agents or wanderers for efficient and fast resource discovery. These robots create index databases which enable the search engines to retrieve documents containing user-specified keywords. In addition there exists another popular method of searching for information, namely subject-based directories, which provide a categorical organization of information. Yahoo [40] is one such engine which exploits the manual classification of documents.

Most search engines just provide the user with a scrolled list of documents that match the user-specified keywords (hits). Although some search engines, such as AltaVista and Infoseek, provide some concept of clustering for query refinement, they do not provide support which would help the user find existing relations among the several hits which result from their query. The goal of our system (named GeoViser) is to help users obtain the geo-spatial relationships between the various hits. The system attempts to mine for the geographic location of each URL that is obtained as a result of a query to a search engine. This information is then used to place glyphs representing hits onto a map. Glyphs are "abstract graphical entities whose attributes - position, size, shape, color, orientation, etc. - are bound to data" [29]. GeoViser uses glyphs to encode information such as the relevance of the hit, the number of occurrences of the keyword, or the number of relevant pages retrieved from a given site. Thus the glyph position indicates the location of the relevant document(s) while the glyph size encodes its relevance.

The overall objective of our research is to develop a retrieval engine that obtains the geographic location of each URL that results from a search and to use this information to discern geo-spatial attributes such as clusters, outliers and distances among the resulting documents. In a certain sense this amounts to building a generic distributed geographic information system (GIS) which works for any relevant keyword search. An obvious advantage is that this does not require maintaining a database, since the Web itself is used as a distributed information source. Our system integrates a unique combination of search engine and geographic information system technology as a technique to resolve queries with latent geo-spatial information. Specifics of this query category are examined in the following section.

Typically, due to the abundance of information in a web site, a search is expected to culminate in large amounts of data, which are difficult to interpret when taken together. Therefore a challenge in retrieving relevant data is coping with the data volume. In addition there is a need to find ways of presenting users with the obtained geo-spatial metadata along with its attributes such as relevance (ranking). Most search engines to date do not provide any data visualization mechanism, though there exist several systems for independently visualizing the retrieved information. One example of such a system is the Navigational View Builder [22]. These systems demonstrate the usefulness of visualization as a strategy for understanding large amounts of data retrieved from the WWW. We look at several possible visualizations that effectively present the retrieved geo-spatial information.

2 Motivation

In perhaps the largest study to date of actual web searches, Jansen et al. analyzed the transaction logs of a large number of users of search engines in an attempt to quantitatively study the searching strategies employed by current users [15]. This study addresses questions such as: What do users search for on the web? Are they satisfied with the results they receive? In order to ascertain some broad subjects of searching, Jansen et al. classify the 64 top terms used into a set of common themes. The categories ranged from economics (terms such as employment, company, business) to sports (ncaa, basketball). One interesting conclusion from the study is that there exists a great deal of interest in categories dealing with places (e.g., state, America). Search engines in their current capacity are not equipped with facilities to address these interests in physical locations. For instance, current search engines cannot answer queries such as:

Where is most work in XYZ carried out? or Where do people interested in XYZ live?

An answer to these questions in turn helps answer questions such as:

Where should I sell my product?
Where should I hold a conference?
Where should I go for my sabbatical?
Is the work in the specified area increasing or decreasing?

Henceforth we refer to this class of queries with geo-spatial attributes as GSQs (Geo-Spatial Queries). The organization of hits from a search engine (as a scrolled list) does little to answer such queries. Therefore, a medium of presentation which is different from what is currently in use is necessary. We present a mapping and presentation scheme which proposes to answer GSQs. The technique we provide allows the user to see the hits as a scrolled list as well as in a graphical form. In this sense, the user is provided with an unobtrusive alternative when searching for relevant answers to GSQs. Another possible advantage of such a system is in allowing the user to filter the number of hits based on the geo-spatial information that he/she might have. As an example, let us assume that a user is searching for `John Doe'. The search engine returns a list of hits corresponding to pages containing those words. A map identifying the locations of the documents resulting from the query then allows the user to filter further based on geographic information. The user might be interested only in John Does in the California area, since the user might be equipped with some a priori, related, geographic information (e.g., John Doe works in Silicon Valley). Thus there is a need for a system which lets the user interactively filter the hits without the need to textually refine the query. Current search engines do not provide the user with a way to use this knowledge. In addition, one would need complex queries to mine for the location of relevant documents. Subsequently, the user has to manually aggregate the list into clusters indexed on their geographic location. This is not a feasible solution to the problem, since studies [16] indicate that web queries tend to be short and users tend to interact minimally with the system. Our work attempts to solve this by presenting the user with a clickable map of hits to retrieve data with geo-spatial attributes.

3 Related Work

There has been considerable work in the field of knowledge discovery and information retrieval over the web [8]. The emergence of search engines as powerful tools for resource discovery is indeed the result of several such efforts [26, 5]. Tools for knowledge discovery over the Internet can be categorized into one of the following three categories: resource directory services, special-purpose agents and web search engines [13]. Resource directory services such as Yahoo [40] and Lycos [35] collect resources from various hosts and classify them into predefined subjects. Special-purpose agents are applications designed to find specific pieces of information; examples of such systems are the FAQ Finder agent at the InfoLab of the University of Chicago [13] and DejaNews [33], which searches Usenet newsgroups. The third, and probably most widely used of all, is the search engine. These index large amounts of information which are retrieved at the time of a query. With the web growing at such a phenomenal rate, one might wonder if search engines are scalable. It is very easy for users to become lost in the deluge of hits that result from their query. Some form of visual presentation of search engine hits could reduce this problem to a great extent. Our system incorporates a visual, interactive interface to visualize data resulting from a GSQ.

Although a lot of work has been done in the field of Internet visualization in general, not much has been done on visualizing the results of web search engines [24, 7, 21]. McCrickard and Kehoe [21] have built a system which provides users with the ability to explore common topics within the set of search results. This system provides an interactive visualization of the search results, allowing users to visually obtain the relevance of the results to different key terms. The WebQuery system [7] built by Carrière and Kazman, on the other hand, offers a powerful new method for searching the Web based on connectivity and content. With this system, they assist the user in finding "hot spots" on the Web that contain information germane to a query. This system uses visualization as a strategy to deal with the volume of resulting information. Our system helps the user find similar geographic "hot spots". The GeoViser visualization strategy incorporates geography-specific information in addition to encoding relevance and ranking.

The subject of network visualization has gained considerable importance in the past few years. A lot of effort has gone into understanding the data associated with networks and into visualizing the structure of the network itself. Eick et al. discuss various ways of using node and link maps to answer several questions concerning network capacity and traffic flows. These display an appropriate subset of the data, based on network geography or topology [3, 12]. In addition these systems support zoom-ins to accomplish geographic restriction. Our work supports similar zoom-in views for presenting details.


A considerable amount of research has also been carried out in mapping WWW usage [2, 9]. This information is particularly useful in analyzing the geographic distribution of Web usage statistics, such as hits to a given website. Performing such a mapping interactively is tedious and time-consuming for a large number of Web hits; this is where a GIS is needed for mapping and analysis. The development of spatial databases to be utilized in a GIS is usually discipline-specific, such as for planning, ecosystem analysis and conservation, civil engineering, archaeology and the social sciences. The objective of distributed GIS research is to support the use of such underlying databases from heterogeneous sources. In contrast, our goal is to build a generic, discipline-independent GIS utilizing the WWW as a data repository, with mapping of the URLs as a mechanism to obtain spatial attributes of the data. Admittedly, given the nature of the underlying data on the web, this method does not provide results which are as accurate as those provided by discipline-specific GISs. Our work represents a first step towards achieving the power of a special-purpose GIS in a domain-independent manner.

Other related work has been carried out by several large groups of researchers (e.g., Pitkow et al. [27]) in analyzing WWW access patterns and demographics. Lamm and Reed have investigated real-time geographic visualization of World Wide Web traffic [19], using geographic mapping to find temporal and geographic patterns of WWW server access. Visualization strategies have also been investigated with a view to solving the `lost in hyperspace' problem [22, 17, 14, 1, 23]. We apply a subset of these visualization techniques for our purpose of visualizing the geo-spatial attributes of the data resulting from a query to a search engine. In addition, there exist commercial systems such as WebPlot [39], an application that maps the location of "hits" on a Web server (from the access log file) onto a global map base, and Umap [38], which turns results from web search engines into two-dimensional information maps which one can interactively explore. Our system provides an environment for interactive visual exploration of search engine results based on the relevance of the retrieved documents. As in WebPlot, we also attempt to map URLs in cyberspace to their geographic locations.

4 System Details

One of the main reasons a website is created is that a person or an organization wishes to provide information about the kind of work they are doing or the products they are making. GeoViser takes advantage of this by providing the user with a way to find location-specific clusters in the field in which he is interested. For instance, imagine a person planning a sabbatical, or a user who would like to visit places where work related to his interests is carried out. One of the main things he needs to know is where the pertinent research or work is being undertaken. Another example would be a salesman wanting to know the geographic locations where a particular product of his interest is being sold or manufactured. For such purposes, the services provided by current search engines are inadequate. In other words, one cannot submit a query to a search engine and expect geography-specific information as the output. As suggested by several researchers [5], the information provided by search engines and other tools for knowledge discovery over the Internet has the following characteristics:

Figure 1: Information gathering layers, derived from [13], p. 2.



- It is incomplete. The results obtained are only a limited snapshot of the overall Internet information resources.
- It is inconsistent and dynamic. The index created quickly becomes inconsistent with the underlying data because of its dynamic nature. Many new information resources emerge every day, while thousands disappear.
- It is heterogeneous. The information is maintained by different hosts and there is no easy and quick method to integrate the results from these sources.

It is crucial that any system that attempts to serve as a tool for knowledge discovery take these characteristics of the underlying data into account. Given the nature of the data, information gathering becomes a difficult problem in this context. The whole idea of knowledge navigation and discovery systems is to provide a consistent, organized view of information and to provide for its easy access [13]. However, most existing services are still in a primitive state of development. Consider, for example, the GSQs described earlier. There may be several possible approaches to such queries, e.g., taking the results (URLs) from a search engine, looking each one up in the whois directory (whois is an Internet username directory service used to look up user, host and organization names and server locations in the Network Information Center (NIC) database), remembering the location, and repeating this procedure for all retrieved URLs. The problem here is that users may not know of the existence of these services, and even if they do, they need to go back and forth trying to remember the data obtained and integrate the facts themselves. We argue that this is not a feasible solution to the problem and that there must be a high-level tool that collects, verifies and integrates this information with its geographic attributes.

The whole Internet information organization can be layered in a pyramidal structure as shown in Figure 1 [13]. The lowest level consists of the hosts, which are the original information producers.


Figure 2: The four-tier GeoViser architecture.

At level 1 are the primitive information-gathering services which index the information distributed over remote hosts. Search engines and other such services fall at this level. A fundamental inefficiency of the tools at this level is that they do not work at organizing the obtained information, and as a result, users are unable to locate the information they need. In our research we are concerned with the second level of information gathering and presentation, which uses the data provided by the services at the lower level. Built on the underlying primitive services, the level 2 tools focus on applying knowledge discovery techniques to the pre-indexed data obtained from the lower services. The aim is to categorize and cluster data from the search engine based on their geo-spatial attribute (location of origin). This essentially provides users with data indexed on keyword (indexing performed at level 1) as well as location (indexing carried out at level 2), which can then be used to answer GSQs.

With the GeoViser system (at level 2), we hope to deliver a useful starting point for designing an advanced geographic information-seeking system based on Shneiderman's mantra in the context of Visual Information Seeking, i.e., "overview first, zoom and filter, then details on demand" [30]. Our main contribution remains in providing this starting point from which to begin geographic information seeking. To explain this with an example, consider a research group trying to organize a conference. Information such as the density of people working in the field (related to the conference) over different regions of the country would be useful in deciding the location at which to hold the conference. Such information, as stated before, can be found by our system, and in addition it can be used again to make more specific queries by combining the previous query with a subspace selected from the resulting relevant geographic space. In a sense this mimics the zoom-and-filter concept introduced by Shneiderman in the context of Visual Information Seeking.

GeoViser, as currently designed, is organized in a four-tier architecture as shown in Figure 2. The lowest level takes the query and feeds it to the search engine. The results obtained by the search engine consist of a list of hits arranged on the basis of the ranking algorithm used by the chosen engine [31]. These results are then filtered to eliminate duplicate hits and are grouped by website. This essentially amounts to clustering based on individual websites. At this point we have data indexed (clustered) on keyword and website. The refined set of URLs resulting from the filtering operation is processed by the GeoLocator module, which uses a database to map the URLs to their respective geographic locations [28]. The user can then click on regions of the map to get lists of URLs specific to the selected area. The system interactions are shown in detail in Figure 3, and each of the modules is described below.
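
To make the flow concrete, the following sketch (in Java, the language of the current implementation) outlines the four tiers as interfaces. The module names follow the paper, but every type and method signature here is an illustrative assumption, not the actual GeoViser code.

```java
// Minimal sketch of the four-tier flow; all signatures are assumptions.
import java.util.List;
import java.util.Map;

interface SearchEngine {
    // Tier 1: ranked list of result URLs for a keyword query.
    List<String> query(String keywords);
}

interface Filter {
    // Tier 2: drop duplicate URLs and group the rest by website,
    // keeping a per-site count of relevant pages.
    Map<String, Integer> clusterBySite(List<String> urls);
}

interface GeoLocator {
    // Tier 3: map a website (host name) to an approximate {lat, lon},
    // or null if the host cannot be located.
    double[] locate(String host);
}

interface VisRenderer {
    // Tier 4: place one glyph per site; size encodes the per-site page count.
    void render(Map<String, Integer> siteCounts, GeoLocator locator);
}

public class GeoViserPipeline {
    public static void run(String keywords, SearchEngine engine,
                           Filter filter, GeoLocator locator, VisRenderer renderer) {
        List<String> hits = engine.query(keywords);               // raw ranked hits
        Map<String, Integer> sites = filter.clusterBySite(hits);  // dedupe + cluster
        renderer.render(sites, locator);                          // map + glyphs
    }
}
```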

4.1 Search Engine Details

Due to the tremendous expansion of the web, resource discovery has become a crucial problem to solve. From the start of the WWW project to now, many methods for resource discovery have evolved. In the beginning, there were only a few servers, a listing of which was available at CERN. Resource discovery at that time was therefore simply confined to browsing through the documents present on those servers. Obviously this method was not scalable and failed when the number of servers, and consequently the number of documents, increased. As a solution to this problem, a number of people started maintaining lists of references to web-related resources. This led to the idea of indexing, where an index of the documents on a server was maintained, so that users could browse through the index rather than through the pages themselves [18]. This solution too was unscalable, as both resource indexing and browsing became time-consuming as resources increased. Since the index itself was large, it was turned into a searchable database, thus providing an alternative to browsing. Unfortunately this database still suffers from the problem of manual maintenance; it is time-consuming, and the information quickly becomes out of date. This problem is countered using Web spiders (sometimes referred to as robots, walkers, wanderers or worms), which are programs that inspect Web documents and take some action upon them, usually adding them to a pool of searchable documents or constructing a graph from the hyperlinks they identify within the documents [11]. Most search engines use such methods to build indices.

We argue that the search engines in the form they exist today do not provide a solution that is scalable. In other words, as the resources on the Internet increase globally, so do the results from the search engines. With the information sources getting bulkier, there is a need for mechanisms that effectively display the results. In addition to resolving GSQs, our system also addresses this scalability issue by providing an interactive interface to further explore the hits.

The current system uses the indexing mechanism provided by Infoseek [34]. The search engine returns a list of URLs as output, ranked in order of relevance. For our application, in addition to the geo-spatial attributes of the queried element it is useful to show the ranking of the document. For this purpose we use visualization mechanisms such as glyphs, which encode relevance information in their size. In our case, in order to rank the websites on the content of relevant information, we look at the documents supplied by the same website and rank it as highly relevant if the number of documents retrieved from the website is large. In the future we shall investigate better heuristics for relevance, such as taking into account the size of the pages and the ranking provided by the search engines. The current heuristic requires that duplicate hits be eliminated, because if the search engine retrieves the same document multiple times for a given query, it may cause the relevance of the website (under our heuristic) to increase erroneously.

Figure 3: GeoViser system interactions.

4.2 Filtering

The functionality of eliminating duplicate hits is built into the filter module. It simply amounts to parsing the returned list of URLs corresponding to the hits, rejecting the duplicates and retaining the unique URLs. In addition to the duplicates, due to the working of the search engines and their indexing schemes, it is observed that there are often several hits which hold no relation or relevance to the query. The most obvious way around this problem would be to build a separate indexing scheme based on geographic location and the user's interests; clearly, building such indices for each set of queries is a highly expensive alternative. Another approach lies in combining other metrics with the ranking provided by the search engines to ascertain the relevance of the websites. We choose this approach of defining new metrics. The idea is to use the count of the number of webpages from a website as a decision parameter. For this purpose the filter module needs the ability to cluster websites. With each website GeoViser associates a count which shows the number of relevant pages returned for that site. This reflects the relevance of the site. We believe this information, in conjunction with the ranking provided by the search engines, provides a reasonable metric.

Using count as a decision parameter has certain disadvantages. Some websites prefer to present information by breaking it up into many documents. Others take the exact opposite approach, i.e., they are designed as a small number of long pages, each containing a lot of information. The heuristic of using count as one of the decision parameters will always rank websites with the former design higher than those with the latter. In its current state, GeoViser suffers from this deficiency. In the future, along with the count of the pages retrieved, we shall also investigate ways of factoring information such as the size of the pages into the metric. In addition to serving our purpose as explained, we believe that such a metric would prove useful in enhancing the capabilities of current search techniques in general.

Summarizing, the filter module eliminates duplicate hits and indexes the hits by website. Therefore, passing the hits through the filter module gives us several sets of hits clustered on keyword and website. The choice of Infoseek as the search engine simplified the display of hits clustered on website, since it provides an easy mechanism to obtain these clustered documents.
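
As an illustration of the filtering step just described, the sketch below removes duplicate URLs and counts the unique pages per host, under the assumption that a "website" is identified by the host part of a URL; the class and method names are ours, not GeoViser's.

```java
// Sketch of the filter step: dedupe, cluster by host, count pages per site.
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HitFilter {
    /** Remove duplicate URLs, then count the unique pages retrieved per host. */
    public static Map<String, Integer> clusterBySite(List<String> urls) {
        Set<String> unique = new LinkedHashSet<>(urls);        // drop exact duplicates
        Map<String, Integer> pagesPerSite = new LinkedHashMap<>();
        for (String url : unique) {
            try {
                String host = URI.create(url).getHost();
                if (host == null) continue;                    // skip malformed URLs
                pagesPerSite.merge(host, 1, Integer::sum);     // per-site page count
            } catch (IllegalArgumentException e) {
                // Ignore URLs the parser rejects.
            }
        }
        return pagesPerSite;                                   // per-site count, used as the relevance score
    }
}
```

The resulting map serves two purposes downstream: its keys are the unique websites handed to the GeoLocator, and its values are the counts the renderer encodes as glyph size.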

5 Mapping

This section explains the various mapping and visualization techniques integrated in GeoViser. A lot of research has been carried out in the field of mapping the geography of cyberspace, but most of these systems aim to solve the problem of users becoming lost in hyperspace, or to present the topology of the network itself. In spite of the dissimilarity in goals, we have found a wealth of information in this research. Pesce et al. [25] refer to cyberspace as a unified conceptualization of space spanning the entire Internet. In other words, cyberspace is a spatial equivalent of the WWW. IP addresses are unique 32-bit numbers which specify the location of a host in cyberspace, just as the triplet (x, y, z) describes the location of a point in three-dimensional space. Our work necessitates the exact opposite mapping: we need to map the websites in cyberspace to locations on a map,

    f(cyber location) = geospatial location = (Lat, Lon)

where f is a mapping function. Mapping amounts to simply placing dots corresponding to the relevant websites. As an example, Figure 4 shows the geospatial mapping of all the domains with websites in the US. Our system aims to show only the subset of these sites which results from a query to a search engine. This mapping is carried out by the GeoLocator module in GeoViser. It is a one-to-one mapping, where a single server address, host or website is mapped onto a unique location in geospace. We do not take into consideration, at present, that an organization might be geographically dispersed, in which case one would have to use a one-to-many mapping. The mapping functionality is embedded in the GeoLocator module, which uses the IP to latitude/longitude converter developed by the Virtual Reality research group at UIUC [19]. The heuristics incorporated in the converter for mapping an IP address to latitude and longitude rely on the InterNIC whois database. The whois database contains information on domain names, hosts, networks and other Internet administrators. The information usually includes a postal address.

Figure 4: Map of domains with websites in the US.

To map IP addresses to geographic locations, the domain name is determined first. For domains in the US, the converter queries the whois database to obtain the data associated with the IP address. This data is parsed for a city and country name, which is mapped to a latitude and longitude using another database of locations of cities and countries. The IP to latitude/longitude converter is publicly available at http://cello.cs.uiuc.edu/cgi-bin/slamm/ip2ll/. The GeoLocator module uses this publicly available converter to map the domain names to their latitude and longitude. It is important to remember that several documents on the same website may result from a query to a search engine. Once this mapping has been determined, visualization techniques are needed to show multiple hits from the same website. This is essential, since conceptually the website with the most relevant documents is the most pertinent of all. Thus the GeoLocator module takes the result from the filter module and maps each of the websites (unique due to clustering) to its location.
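
The following sketch illustrates how a GeoLocator with the local caching mentioned in Section 8 might call such a converter. The query-string format and the response format (two whitespace-separated numbers) are assumptions made for illustration; the real CGI interface may differ.

```java
// Sketch of the GeoLocator step: host -> {lat, lon} via an external converter,
// with a local cache. Request and response formats are assumed, not documented.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;
import java.util.Map;

public class GeoLocator {
    private static final String CONVERTER = "http://cello.cs.uiuc.edu/cgi-bin/slamm/ip2ll/";
    private final HttpClient client = HttpClient.newHttpClient();
    private final Map<String, double[]> cache = new HashMap<>(); // host -> {lat, lon}

    /** Map a host name to an approximate latitude/longitude, caching results. */
    public double[] locate(String host) {
        return cache.computeIfAbsent(host, h -> {
            try {
                // Hypothetical query string; the real converter may expect another form.
                HttpRequest request = HttpRequest
                        .newBuilder(URI.create(CONVERTER + "?host=" + h))
                        .GET().build();
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                String[] parts = body.trim().split("\\s+");    // assumed "lat lon" reply
                return new double[] { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]) };
            } catch (Exception e) {
                return null;                                   // unknown location (e.g. firewalled host)
            }
        });
    }
}
```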

6 Rendering Visualizations

The system exploits the US Census TIGER Map Server to display the location of the IP address corresponding to each clustered website on a simple two-dimensional map of the United States. The goal of the TIGER (Topologically Integrated Geographic Encoding and Referencing) Mapping Service (TMS) [37] is to provide a public resource for generating high-quality, detailed maps of anywhere in the United States, using public geographic data. The TMS project was undertaken to serve the many users and developers on the World Wide Web with easily accessible street-level and regional maps for places in the United States, for general viewing, research and analysis, use in interactive map-based services, or inclusion as illustrations in documents. The service is freely accessible to the public and is based on an open architecture that allows other Web developers and publishers to use the public domain maps generated by this service in their own applications and documents. The Renderer essentially obtains the results from the GeoLocator and gathers all the data into a form that is suitable for use with the TMS. This is then used to generate the maps. The Renderer module then provides interactivity to this otherwise static map. Figure 5 illustrates the mapping of the WPI domain using the TMS.

The advantage of the open architecture is that it allows developers to specify the locations to be marked in a file which can then be relayed to the TMS via an HTTP GET request. The result is a map as shown in Figure 5. The VisRenderer module constructs this file and performs the request to the TMS server. In Figure 6, the first 200 results for the query "multivariate visualization" are filtered for duplicates, clustered and subsequently mapped onto the map of the US. Notice that the glyph size is not constant on the map. These mapped results (websites) have a relevance associated with them, obtained as a result of clustering the hits by website; the website with the most hits is ranked most relevant. This information is encoded in the size of the glyph placed at that location. We are currently investigating other visualization techniques. Future versions of GeoViser will use color to encode additional information, and more information can be conveyed via a histogram corresponding to each glyph.
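
The sketch below shows one way the VisRenderer could encode per-site relevance in glyph size and hand the markers to a map service in a single GET request. The CGI path, parameter names and marker syntax used here are illustrative placeholders, not the documented TMS interface.

```java
// Sketch of glyph sizing and map-request construction; all TMS details assumed.
import java.util.Map;

public class VisRenderer {
    private static final String TMS_BASE = "http://tiger.census.gov/cgi-bin/mapgen"; // assumed path

    /**
     * Build a single map-request URL. locations maps host -> {lat, lon};
     * glyph radius grows with the page count of each site.
     */
    public static String buildMapUrl(Map<String, Integer> siteCounts,
                                     Map<String, double[]> locations) {
        int maxCount = siteCounts.values().stream().max(Integer::compare).orElse(1);
        StringBuilder marks = new StringBuilder();
        for (Map.Entry<String, Integer> e : siteCounts.entrySet()) {
            double[] ll = locations.get(e.getKey());
            if (ll == null) continue;                         // host could not be mapped
            int radius = 2 + 8 * e.getValue() / maxCount;     // size encodes relevance
            if (marks.length() > 0) marks.append(';');
            marks.append(ll[1]).append(',').append(ll[0])     // lon,lat order assumed here
                 .append(",reddot").append(radius);           // "reddot" symbol is a placeholder
        }
        // Continental-US view; all parameter names and values are placeholders.
        return TMS_BASE + "?lat=39.0&lon=-97.0&wid=60.0&ht=30.0&iwd=640&iht=320&mark=" + marks;
    }
}
```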

Figure 6: Mapping hits for the query: multivariate visualization.

Figure 5: Mapping the WPI Domain.

The current visualization technique allows users to perceive which of the sites are the most relevant. GeoViser also provides navigational capabilities which help the user interactively manipulate the visualized results to reveal details. For instance (see the Results section), on the map shown, the user can identify an area of interest by dragging the mouse over that region. This results in the display of a list of locations corresponding to the matching websites in that area. Multiple entries of a location in the list box indicate more than one server at the same location, since locations are mapped once for each server (e.g., cs.wpi.edu and www.wpi.edu, if both were retrieved, would be listed as Worcester MA, Worcester MA). Double-clicking on any of these locations in the list box pops up a page with links to all the relevant pages retrieved from the corresponding site.
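
A minimal sketch of the drag-to-select interaction follows: given the latitude/longitude rectangle the user dragged out, it returns one list-box entry per server whose glyph falls inside it, so repeated place names remain visible as in the Worcester example above. All class and method names are illustrative.

```java
// Sketch of region selection for the list box; one entry per matching server.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RegionSelector {
    /** locations maps host -> {lat, lon}; placeNames maps host -> "City ST". */
    public static List<String> select(double latMin, double latMax,
                                      double lonMin, double lonMax,
                                      Map<String, double[]> locations,
                                      Map<String, String> placeNames) {
        List<String> listBoxEntries = new ArrayList<>();
        for (Map.Entry<String, double[]> e : locations.entrySet()) {
            double lat = e.getValue()[0], lon = e.getValue()[1];
            if (lat >= latMin && lat <= latMax && lon >= lonMin && lon <= lonMax) {
                // Keep one entry per matching server so repeated locations stay visible.
                listBoxEntries.add(placeNames.getOrDefault(e.getKey(), e.getKey()));
            }
        }
        return listBoxEntries;
    }
}
```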

7 Results and Interactivity

GeoViser has been implemented as a Java applet using JDK 1.1.6 on an SGI Octane machine running IRIX. Figure 7 shows the graphical interface to the system. The interface is designed to be similar to current search engines, except for the part which displays the map. The results can be interactively explored by selecting the locations of interest. Figures 8-10 demonstrate this interactive exploration. The clusters seen in the figures provide a solution to GSQs. In addition, the interactive interface provides a mechanism for dynamic filtering of the results. Repeated queries over time provide a way to observe a change (increase or decrease) in the field of interest; a substantial change would display new clusters or outliers.


Figure 7: The GeoViser graphical user interface (query box, unprocessed Infoseek results, list box displaying the selected areas, and click-and-drag selection on the map).

Figure 8: Results of the query: information retrieval on the web.

Figure 9: Selected area: Massachusetts. Note that the list box populates itself with the locations in the selected area.

Figure 10: Hits from the subselected domains.

8 Conclusion and Limitations

Essentially, GeoViser provides an approach to answering questions of the form "where is most work in area XYZ carried out?", which in turn answers questions of the form "where should I go for a sabbatical?" or "where should this conference be held?", which are specific to academics. More broadly, it can also help answer questions such as "where should I sell my ideas/product?" or "where should I place my Burger King food stall?", simply by using the information available on the Internet. Results were presented showing the system being used to mine for such information. This approach is in contrast to creating separate application-specific databases, which is the common approach taken by the GIS community. In addition, the system could also be used to identify geographic trends over time.

The basic limitation of not having an application-specific database lies in the inaccuracy of the results. This problem occurs because the WWW is used as a repository, with the search engine and its indexing technique as a tool. For instance, there might be a document which contains several instances of the query terms but is totally unrelated. As explained before, to counter this problem we rank the resulting websites with a relevance metric which uses the ranking provided by the search engine and combines it with a count of relevant pages from that particular site. Despite the high success rate, network firewalls and national online services (such as AOL) limit the accuracy of the mapping system that obtains the location of a given IP address. Another limitation of the system is speed. The system relies on the services of the IP to lat/lon converter to provide the location-specific information for the webpages; this typically takes a while, depending on the network traffic. Therefore mapping a stream of web hits resulting from a query can take a substantial amount of time. In order to solve this problem we are investigating options such as local caching of URL/location correspondences. There is also the issue of mirror sites that might be retrieved as a result of a query. One cannot distinguish between the real sites and the mirrors, and this can result in several false mappings.

To the best of our knowledge, GeoViser is the first attempt at building a search engine that proposes to answer GSQs. We believe this functionality would prove very useful to users in general.

9 Future Work

An interesting enhancement to the GeoViser system will be to incorporate aspects of collaborative filtering to identify potentially relevant sites. If the system builds a profile of the kinds of people who frequent various websites, this lends additional information which can be used to decide the relevance of a hit, and would help in reducing the number of false hits. We are also looking at ways of using GeoViser to study the performance of Web robots and other indexing mechanisms. In addition, we are investigating several other visualization mechanisms to enhance the system's capabilities. XML supports associating metadata with a given web document; this metadata could include a short description of the document's contents along with information such as its geographic location. We are currently looking at using this information to enhance the working of the GeoLocator module in the system.

10 Acknowledgments

We would like to thank Bhupesh Kothari for his input on the implementation of GeoViser. We would also like to acknowledge Chris Stuber, who made the TMS system publicly available, and the CAVE group at UIUC for making the IP to latitude/longitude converter available. Thanks are also due to the Infoseek group.

References

[1] Andrews, K. Visualizing Cyberspace: Information Visualization in the Harmony Internet Browser. In Proceedings of InfoVis '95, pages 97-104, IEEE Press, Atlanta, 1995.
[2] Batty, M., Barr, B. The Electronic Frontier: Exploring and Mapping Cyberspace. In Futures, Volume 26, Number 7, pages 699-712, 1994.
[3] Becker, R., Eick, S., Wilks, A. Visualizing Network Data. In IEEE Transactions on Visualization and Computer Graphics, Volume 1, Number 1, March 1995.
[4] Berners-Lee, T.J., Cailliau, R., Groff, J.F., Pollermann, B. World Wide Web: The Information Universe. In Electronic Networking: Research, Applications, and Policy, 2(1), pages 52-58, Spring 1992, Westport, CT: Meckler.
[5] Bowman, C.M., et al. The Harvest Information Discovery and Access System. In Proceedings of the Second International World Wide Web Conference, October 1994.
[6] Buchanan, M., Zellweger, P. Automatically Generating Consistent Schedules for Multimedia Documents. In Multimedia Systems, 1(2), pages 55-67, 1993.
[7] Carrière, J., Kazman, R. WebQuery: Searching and Visualizing the Web through Connectivity. In Proceedings of the Sixth International World Wide Web Conference, April 1997.
[8] DeBra, P., Post, R. Information Retrieval in the World Wide Web: Making Client-based Searching Feasible. In Proceedings of the First International World Wide Web Conference, May 1994.
[9] Dodge, M. Mapping the World Wide Web. In GIS Europe '96, pages 22-24, 1996.
[10] Doemel, P. WebMap - A Graphical Hypertext Navigation Tool. In Proceedings of the Second World Wide Web Conference, 1994.
[11] Eichmann, D. The RBSE Spider - Balancing Effective Search Against Web Load. In Proceedings of the First International Conference on World Wide Web, pages 113-120, Geneva, Switzerland, May 1994.
[12] Eick, S., Wills, G. Navigating Large Networks with Hierarchies. In Proceedings of Visualization '93, pages 204-210, San Jose, CA, October 1993.

[13] Fu, X., Hammond, K., Burke, R. ECHO: An Information Gathering Agent. Technical Report, University of Chicago, July 1996.
[14] Hasan, M., Mendelzon, A., Vista, D. Visual Web Surfing with Hy+. In Proceedings of CASCON '95, 1995.
[15] Jansen, B.J., Spink, A., Bateman, J., Saracevic, T. Searchers, the Subjects They Search, and Sufficiency: A Study of a Large Sample of EXCITE Searches. In Proceedings of WebNet '98, November 1998.
[16] Jansen, B.J., Spink, A., Bateman, J., Saracevic, T. Real Life Information Retrieval: A Study of User Queries on the Web. In SIGIR Forum, Volume 32, Number 1, pages 5-17, 1998.
[17] Jerding, D.F., Stasko, J.T. The Information Mural: A Technique for Displaying and Navigating Large Information Spaces. In Proceedings of InfoVis '95, pages 43-50, IEEE Press, Atlanta, 1995.
[18] Koster, M. ALIWEB - Archie-like Indexing in the WEB. In Computer Networks and ISDN Systems, 27(2), pages 175-182, 1994.
[19] Lamm, S., Reed, D., Scullin, W. Real-Time Geographic Visualization of World Wide Web Traffic. In Proceedings of the Fifth World Wide Web Conference, May 1996.
[20] Balabanovic, M., Shoham, Y. Content-Based, Collaborative Recommendation. In Communications of the ACM, 40(3), pages 66-72, 1997.
[21] McCrickard, S.D., Kehoe, C.M. Visualizing Search Results using SQWID. In Proceedings of the Sixth International World Wide Web Conference, April 1997.
[22] Mukherjea, S., Foley, J. Visualizing the World-Wide Web with the Navigational View Builder. In Proceedings of the Third World Wide Web Conference, Darmstadt, Germany, April 1995.
[23] Mukherjea, S., Foley, J.D., Hudson, S. Visualizing Complex Hypermedia Networks through Multiple Hierarchical Views. In Proceedings of CHI '95, pages 331-337, ACM Press, Denver, CO, 1995.
[24] Mukherjea, S., Hirata, K., Hara, Y. Visualizing the Results of Multimedia Web Search Engines. In Proceedings of the IEEE Symposium on Information Visualization '96, pages 64-65, San Francisco, CA, October 1996.
[25] Pesce, M., Kennard, P., Parisi, A. Cyberspace. In Proceedings of the First International Conference on World Wide Web, Geneva, Switzerland, May 1994.
[26] Pinkerton, B. Finding What People Want: Experiences with the WebCrawler. In Proceedings of the Second International World Wide Web Conference, October 1994.
[27] Pitkow, J.E., Kehoe, C.M. Results from the Third WWW User Survey. In Proceedings of the Fourth International World Wide Web Conference, Boston, MA, December 1995.
[28] Putz, S. Interactive Information Services Using World-Wide Web Hypertext. In Proceedings of the First International Conference on World Wide Web, Geneva, Switzerland, May 1994.

[29] Ribarsky, W., Ayers, E., Eble, J., Mukherjea, S. Using Glyphmaker to Create Customized Visualizations of Complex Data. In IEEE Computer, Volume 27, Number 7, pages 57-64, July 1994.
[30] Shneiderman, B. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of IEEE Visual Languages, pages 336-343, Boulder, CO, September 1996.
[31] Yuwono, B., Lee, D.L. Searching and Ranking Algorithms for Locating Resources on the World Wide Web. In Proceedings of ICDE '96, pages 164-171.
[32] AltaVista homepage, http://altavista.digital.com.
[33] DejaNews homepage, http://www.dejanews.com.
[34] Infoseek homepage, http://www.infoseek.com.
[35] Lycos homepage, http://www.lycos.com.
[36] SavvySearch homepage, http://www.cs.colostate.edu/~dreiling/smartform.html.
[37] TMS homepage, http://tiger.census.gov/.
[38] UMap homepage, http://www.umap.com.
[39] WebPlot homepage, http://www.eit.com/goodies/software/webplot/webplot.html.
[40] Yahoo homepage, http://www.yahoo.com.

