VSEARCH: a Dynamic Visualization Tool for Web Local Searching M. Angelaccio, B. Buttarazzi Dipartimento di Informatica, Sistemi e Produzione - Università degli Studi di Roma "Tor Vergata" Via di Tor Vergata, 110 - 00133 - ROMA (Italy) {angelaccio, buttarazzi }@info.uniroma2.it
Abstract In this paper, we present VSEARCH, a prototype of a web local searching tool that visualises hyperlink structure and page rank. VSEARCH is designed to answer keyword-based queries starting from an initial URL, with additional control parameters that prevent too many results from being fetched. The prototype is available at http://websearch.info.uniroma2.it/vsearch/. The VSEARCH syntax and semantics are given in terms of the WebSQL language, with some additional extensions introduced for describing path restrictions. Experimental results show that relevant output does not require paths longer than a small number of hops (about 8). Keywords: World Wide Web, Search Engines, Information Retrieval, Page Rank,
1 Introduction
A number of tools for searching the Web have been developed on an essentially ad hoc basis with respect to both indexing and query support, with a consequent lack of interface consistency. Search engines build their own indexes of Web pages. These indexes normally identify all the pages containing a given word. Users can then submit a word, or a set of words (i.e., a query), to the search engine, which responds with a set of page links. An analysis of the major current search engines shows that all of them present the user with the same type of interface: a text entry box plus options. In addition, the results of searches are displayed as a list of links. Some search engines list the links in order of relevance, although the way in which ‘relevance’ is determined is usually not known a priori. The latest improvements in Web search technologies (Google [4], Clever [7]) have introduced methods to extract structure information at a global level. In practice, after the user has submitted a keyword, a global analysis is carried out on the graph to reduce the number of links returned by the search engine.
However, a global approach does not allow the user to exploit the structure and topology of the document network. This has motivated the introduction of a second class of web search tools called local searching tools (in contrast with global searching tools). Basically, their search scope is restricted to the documents of a site (site searching) or to a neighbourhood of a site (neighbour searching). The problem with local search is that, in most cases, the semantics of the supported queries are not clear. Database techniques have been used to cope with this problem by introducing a data model for the document network and a query language able to ask for both contents and structure. Among the web query languages, WebSQL was the first example based on a relational model of the hyperlink structure and able to define both content and structure queries ([14]). This capability has been used to characterize a wide class of queries that users encounter when managing the web (see [3] for a description). Other approaches and systems that perform local searches include WebGlimpse ([13]), Fettuccino ([5]), HyPursuit and WebQuery.
In this paper we present a prototype implementation of a novel local search tool named VSEARCH (Visual SEARCH), implemented at the “Dipartimento di Informatica, Sistemi e Produzione - Università degli Studi di Roma Tor Vergata”. VSEARCH supports keyword-based queries on the web and enables the user to easily capture and view the structure of the answers. Its main properties are:
• Local Searching Model: the search is executed at run-time, starts from a given site (starting URL) and is limited to a neighbourhood of that site. Query characteristics are described following WebSQL syntax plus additional graph parameters.
• Dynamic Visualisation system: the results are visualised following a graph-based approach and a colouring model.
• Dynamic Output, i.e. the result window is updated during the search process.
These features have the following advantages:
• In many cases a local approach is more convenient than a global search carried out on the whole web. For instance, the existence and reachability of documents may be ascertained.
• The dynamic search can follow outgoing links, which allows it to discover pages that are relevant to the query but did not appear in the results of the static search.
• Because a dynamic window shows results as soon as they are available, the time-consuming process typical of dynamic local searching systems is mitigated.
However, local search may be time consuming for the user because of document fetching delays. Furthermore, an advantage of such a dynamic and visual approach is that the query map to be shown on the screen may be limited in size, hence the layout and interaction times may no longer be critical. Limiting the number of visual elements to be displayed both improves clarity and increases layout performance. Despite the enormous number of web search tools, VSEARCH uniquely combines several characteristics that are not simultaneously present in the other tools. Only Fettuccino supports local search with a visualisation system, but it lacks a dynamic output of the answers (see [5]).
The paper is organized as follows. Section 2 illustrates the main characteristics of VSEARCH with usage examples. The software architecture is discussed in Section 3 at the level of software components. Section 4 presents a set of experimental results that evaluate graph-based limits for the underlying query model. Some conclusions and further directions are discussed in Section 5.
2 System Features
VSEARCH stands for Visual SEARCH and combines dynamic crawling (downloading of web pages) with a visual presentation of the results. As introduced in Section 1, it supports keyword-based queries on the web and has the following properties:
• Local and Dynamic Search Model, i.e. the search is restricted to a neighbourhood of a given site (starting URL) and the results are fetched at query submission time instead of being obtained from an external search engine. A description in terms of WebSQL syntax, additional bounding parameters and Query Graph semantics is given in more detail below.
• Page rank function, i.e. a relevance value is computed and displayed for each document fetched by the crawler, as done by external search engines.
• Dynamic and graph-based visualisation, i.e. the Query Graph is rendered in a 2D map format (query map) in which each node also shows the corresponding page rank by means of a colouring schema. The query map thus yields a compact visualisation of both the link structure and the relevance returned by the dynamic local search engine.
In order to control the number of documents that are fetched and returned as results, the VSEARCH query graph is more restricted than the query graph returned by a basic WebSQL query. This restriction is obtained by using two bounding parameters:
• max depth, which corresponds to the maximum length of a path traversed by the crawler;
• max width, which gives an upper bound on the number of outgoing links visited for each document.
In the next subsections, after an example query that shows the User Interface, we describe the Query Model and the corresponding Visualisation System in more detail.
2.1 VSEARCH User Interface
Figure 1 shows the client interface of VSEARCH with an example query with
• keywords {Ingegneria, Informatica, etc}
• starting URL www.uniroma2.it (home page of our university) and
• default parameters (width = depth = 5 and timeout = 10 min.)
The interface is activated when the user enters the VSEARCH URL in a standard browser: http://websearch.info.uniroma2.it/vsearch/query.htm
Figure 1 VSEARCH Query Page
The user enters keywords describing the information sought, using the form provided by the Query Page. Once the "Search and Map!" button has been clicked, VSEARCH forwards the query to the search agent. It then pushes the results back to the user's browser "as is" and at the same time feeds them to the dedicated crawler component, which starts the exploration within the time limit specified in the form. Note that during the exploration process the user can already scan through the regular search results, which are provided in a separate browser window. A map of the results is sent back to the user dynamically. The map view shows the search results page as the root and the top-ranked candidates as first-level descendants. The colour of each node label indicates the degree of relevance of the candidate (the darker, the more relevant) according to the VSEARCH built-in search engine. Subsequent levels in the tree show the pages fetched by VSEARCH and how they are related to each other. Once the time limit is reached, the process ends. Figure 2 shows an example of a query map returned by VSEARCH in reply to the query shown in Figure 1.
Figure 2 Query Map
2.2 Local Search Model
To better understand the main features of the VSEARCH system, we give in this section a description of the Query Syntax and Semantics in terms of the WebSQL Query Language.
This has the advantage of furnishing an exact definition of all the elements contained in the VSEARCH query interface.
2.2.1 Query Syntax
Let us assume for simplicity that k = k1,.., kn are the keywords used for the search and that i and (w, l) are, respectively, the starting URL and the bounding parameters used in a VSEARCH query. We adopt the following syntax:
Definition 1 A VSEARCH query with parameters i, k, l is given by the WebSQL query Qbls(i, k, l), which denotes a “bounded” local search query with arguments i, k and l. For simplicity, we do not take into account the other bounding parameter w.
The definition of Qbls(i, k, l) is given as follows. Let us first introduce the regular path expressions of WebSQL.
Definition 2 A hypertext link is said to be:
• interior if the destination is within the source document;
• local if the destination and source documents are different but located on the same server;
• global if the destination and the source documents are located on different servers.
This distinction is important both from the point of view of expressive power and from the point of view of query cost analysis. In the syntax of path expressions, arrow-like symbols denote each link type in the following way: α denotes an interior link, → a local link and ⇒ a global link. Also, let = denote the empty path.
Definition 3 Path regular expressions are built from these symbols using concatenation, alternation (|) and repetition (*). For example, = | ⇒ →* is a regular expression that represents the set of paths containing the zero-length path and all paths that start with a global link and continue with zero or more local links.
Moreover, to illustrate bounded paths we make use of the following abbreviation.
Definition 4 For any path regular expression R without occurrences of the * operator, $(R)^{\le l} = R \mid (R)^2 \mid \dots \mid (R)^l$
Definition 5 (Local Queries) The generic local query "Start from i and find all documents containing the keyword k in a neighbourhood of i, or that are linked to i through paths expressed by a regular path expression R" is defined in WebSQL as
Qneigh(i, k, R):
SELECT x.url
FROM Document x SUCH THAT i R x
WHERE x.text CONTAINS "k"
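The link types of Definition 2 and the bounded abbreviation of Definition 4 can be illustrated with a minimal Python sketch. The function names `classify_link` and `bounded_expansion` are hypothetical and are not part of VSEARCH or WebSQL. The sketch also assumes that "same server" can be approximated by comparing the network location of the two URLs, and that R is a plain alternation of single link symbols.

```python
from itertools import product
from urllib.parse import urldefrag, urlsplit

INTERIOR, LOCAL, GLOBAL = "interior", "local", "global"


def classify_link(source: str, destination: str) -> str:
    """Classify a hypertext link following Definition 2.

    Assumption: two URLs are "on the same server" when their network
    locations (host[:port]) are equal, ignoring case.
    """
    src_doc, _ = urldefrag(source)        # drop any #fragment
    dst_doc, _ = urldefrag(destination)
    if src_doc == dst_doc:
        return INTERIOR                   # destination inside the source document
    if urlsplit(src_doc).netloc.lower() == urlsplit(dst_doc).netloc.lower():
        return LOCAL                      # different documents, same server
    return GLOBAL                         # documents on different servers


def bounded_expansion(alternatives, l):
    """Enumerate the link-type sequences denoted by (R)^{<=l}.

    R is assumed to be an alternation a1 | a2 | ... of single link symbols,
    so (R)^n is every sequence of n symbols and (R)^{<=l} is the union of
    (R)^1 ... (R)^l.
    """
    return [seq
            for n in range(1, l + 1)
            for seq in product(alternatives, repeat=n)]


if __name__ == "__main__":
    print(classify_link("http://a.example/x.html", "http://a.example/x.html#s2"))
    print(classify_link("http://a.example/x.html", "http://a.example/y.html"))
    print(classify_link("http://a.example/x.html", "http://b.example/"))
    for seq in bounded_expansion((LOCAL, GLOBAL), 2):
        print(seq)
```

For R = → | ⇒ and l = 2 the expansion contains 2 + 4 = 6 sequences, which shows how quickly the set of admissible paths grows with the bound l.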
General local queries defined by Qneigh thus return documents that belong to the neighbourhood of a site. Such documents often have a rank value smaller than that of home pages held by global index servers, which makes them hard to retrieve with global queries. For instance, if we consider only local links, we obtain the query characterization of the search tools known as site search:
Qsite:
SELECT x.url
FROM Document x SUCH THAT i →* x
WHERE x.text CONTAINS "k"
The VSEARCH crawler, instead, also explores external links. Unfortunately, queries like Qneigh incur a transmission cost for evaluating regular path expressions. For this reason, many practical local queries contain expressions like “… two or less links”, which can be expressed via the notation $(\,\cdot\,)^{\le 2}$ instead of the operator (*). This leads to the following class of local WebSQL queries, which is the one implemented in VSEARCH.
Definition 5 (l-bounded Local Queries) For each pair (i, k) of starting URL and keyword list, we denote the corresponding l-bounded Local Query by the following WebSQL query: $Q_{bls}(i, k, l) = Q_{neigh}(i, k, (\rightarrow \mid \Rightarrow)^{\le l})$
This corresponds to the query "Start from i and find all documents containing the keyword k in an l-bounded neighbourhood of i, i.e. documents that are linked to i through paths of at most l hops that follow local or global links".
In the next section we formally describe the semantics of a query Qneigh as a special case of the general semantics of a WebSQL query (query graphs).
2.2.2 Query Semantics
Let us assume that WWW documents are mapped to Node objects having the attribute id: url, and the hypertext links between them to Link objects having the attributes from: url and to: url. Here the only simple type is the url type, which has the role of object identifier.
Hence we refer to Node, Link and Url as the corresponding object domains (or sets of object instances). To give semantics to Qneigh we need to model the corresponding web graph as a function of the query parameters (the string to be searched and the path regular expression).
Definition 6 A web graph is defined as a graph W = (ρNode, ρLink) where:
• ρNode : Url → Node models the web document loading function and is defined by ρNode(url) = d, where d.id = url;
• ρLink : Url → Link is the anchor function and is defined by ρLink(url) = A, where A is the set of all anchors in the document ρNode(url).
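As an illustration of Definition 6 and of the l-bounded query Qbls, the following Python sketch evaluates a bounded local search over a small in-memory web graph. It is not the VSEARCH implementation: the names `WEB`, `rho_node`, `rho_link` and `q_bls`, the example URLs and the page texts are all hypothetical, CONTAINS is assumed to be a case-insensitive substring test, and the optional parameter `w` is an assumed reading of the width bound that keeps only the first w outgoing links of each document.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str                                       # the URL is the object identifier
    text: str = ""
    anchors: list = field(default_factory=list)   # URLs of outgoing links


@dataclass
class Link:
    src: str   # the "from" attribute ("from" is a reserved word in Python)
    dst: str   # the "to" attribute


A = "http://site.example/a"
B = "http://site.example/b"
C = "http://site.example/c"
D = "http://site.example/d"
E = "http://other.example/e"

# Toy web space standing in for the computable functions of Definition 6.
WEB = {n.id: n for n in [
    Node(A, "Ingegneria home", [B, C]),
    Node(B, "Informatica", [D]),
    Node(C, "Ingegneria Informatica", [A]),
    Node(D, "Ingegneria", [E]),
    Node(E, "Ingegneria"),
]}


def rho_node(url):
    """Document loading function: url -> Node."""
    return WEB[url]


def rho_link(url):
    """Anchor function: url -> all links leaving the document."""
    return [Link(url, target) for target in rho_node(url).anchors]


def q_bls(i, k, l, w=None):
    """Documents containing k that are reachable from i in 1..l hops.

    Interior links (dst == src) are skipped, so only local and global
    links are followed, as in (-> | =>)^{<=l}.
    """
    reached, frontier = set(), [i]
    for _ in range(l):                      # one iteration per hop
        next_frontier = []
        for url in frontier:
            links = [ln for ln in rho_link(url)
                     if ln.dst != ln.src and ln.dst in WEB]
            for ln in links[:w]:            # w=None keeps every link
                if ln.dst not in reached:
                    reached.add(ln.dst)
                    next_frontier.append(ln.dst)
        frontier = next_frontier
    return sorted(u for u in reached if k.lower() in rho_node(u).text.lower())


if __name__ == "__main__":
    print(q_bls(A, "ingegneria", 1))
    print(q_bls(A, "ingegneria", 2))
    print(q_bls(A, "ingegneria", 2, w=1))
```

Because paths in the bounded expression have length at least one, the starting document appears in the answer only when some link leads back to it, as happens for l = 2 in the toy graph.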
W is a graph model of the global web space, under the assumption that the functions ρNode and ρLink are computable. For each link e we denote by P(e) the link property function, which returns the link property symbol of e taken from the set {α, →, ⇒}.
Definition 7 A sequence of links p = (e1, e2, …, em) is a path if and only if, for every index j, ej ∈ Link and ej+1.from = ej.to (1 ≤ j < m). […] where Lrank is the node labelling function defined by Lrank(d) = rank(d, k) for each node d of W(i, k, R) (i.e. d ∈ Cν(k)).
2.4 Visualisation System
The graph-based definition of the semantics makes it possible to enhance the visualisation system with respect to the text-based interfaces that are common in local search tools. In particular, VSEARCH introduces a high-level definition of the query graph in terms of query maps, which describe both graph structure and document information. Relevance is displayed directly by means of the node colouring schema described in Figure 4. The red colour is associated with the highest rank value, and the remaining colours are associated with rank values following a decreasing brightness order (hexagonal HSV colouring schema).
[Figure 4: node colouring schema. Labels recovered from the figure: hue; Green; Yellow; Cyan; rank intervals > 0.5, [0.4, 0.5], [0.3, 0.4), [0.2, 0.3), [0.1, 0.2).]