PathRank: Web Page Retrieval with Navigation Path Jianqiang Li and Yu Zhao NEC Laboratories China 11F, Bldg.A, Innovation Plaza, Tsinghua Science Park Haidian District, Beijing 100084, China {lijianqiang,zhaoyu}@research.nec.com.cn
Abstract. This paper describes a path-based method to use the multi-step navigation information discovered from website structures for web page ranking. Use of hyperlinks to enhance page ranking has been widely studied. The underlying assumption is that hyperlinks convey recommendations. Although this technique has been used successfully in global web search, it produces poor results for website search, because the majority of the hyperlinks in local websites are used to organize information and convey no recommendations. This paper defines the Hierarchical Navigation Path (HNP) as a new resource to exploit these hyperlinks for improved web search. HNP is composed of multi-step hyperlinks in visitors’ website navigation. It provides indications of the content of the destination page. The HierPathExt algorithm is given to extract HNPs in local websites. Then, the PathRank algorithm is created to use HNPs for web page retrieval. The experiments show that our approach results in significant improvements over existing solutions. Keywords: Navigation Path, Web Search, Web information retrieval.
1 Introduction
Hyperlinks are critical for successful web search. Based on the assumption that hyperlinks convey human recommendations, many researchers have studied hyperlink analysis [15, 11] to capture the relative importance of a web page. This approach has significantly improved the quality of global web search compared with text-only techniques [20]. However, this recommendation assumption is “close enough” to the truth only in the global Web [17]. Generally, it does not hold at the local web level (an intranet or a publicly accessible website). The main reason is that a large number of hyperlinks in local websites are used for organizing web pages. Such hyperlinks are “well-structured” and carry little recommendation semantics. As a result, small web search systems [8] that adopt the same techniques as global search engines fail [9]. [21] also suggested that PageRank computed on an intranet cannot provide effective discriminative information among pages. Using hyperlinks with recommendation semantics for web search has been widely studied. However, to the best of our knowledge, little work has been done on using the hyperlinks that primarily serve page organization to improve web page retrieval.
M. Boughanem et al. (Eds.): ECIR 2009, LNCS 5478, pp. 350–361, 2009. © Springer-Verlag Berlin Heidelberg 2009
This paper describes our research in exploiting the “well-structured” hyperlinks to improve web retrieval. Unlike PageRank, which assumes that hyperlinks convey recommendations and computes a query-independent feature, our research assumes that “well-structured” hyperlinks propagate semantics and context among web pages, and it focuses on a query-dependent measure. It originates from the following observations: the website builder generally employs a hyperlink hierarchy to help organize the collection of pages; readers generally use the information in multiple steps of such hyperlinks as guides for website navigation; and, taken as a whole, the anchor texts, URLs, page titles, etc. associated with the hyperlinks in a reader’s navigation path often give a clear indication of the nature or purpose of the destination page. We surmise that by using the information inherent in such Hierarchical Navigation Paths (HNPs), we can improve web search accuracy. To show the usefulness of HNPs in web search, suppose a user wants to obtain the alumni list of the Department of Computer Science at Stanford University. He might submit the query “computer science alumni” to the stanford.edu website search engine. However, the top-100 hits (queried on Sept. 23, 2008) do not contain links to any of the requested alumni pages. Instead, he can find three alumni pages, for “Undergrads”, “Masters” and “PhDs” respectively, by manual navigation in the cs.stanford.edu website. None of these pages contains any clue about “computer science”; however, such information can be deduced implicitly from the navigational context, i.e., the department’s homepage. From the user’s perspective, because there are navigation paths from the department’s homepage to the alumni pages, the alumni pages are virtually tagged with “computer science”. We define the HNP as a sequence of hyperlinks from the website homepage to a destination page. HNPs can give an important indication of the content of the destination page.
We propose a novel approach to use HNPs for web retrieval. Section 2 gives the terminology and definitions. Section 3 describes the HierPathExt and PathRank algorithms to construct and consume the HNPs for web page ranking. The experiments are presented in Section 4. Related work is discussed in Section 5. Section 6 concludes the paper.
2 Terminology and Definition
A website W can be represented as W=⟨P, L, h⟩: P denotes the web page set in W, L is the set of hyperlinks contained in the web pages of P, and h∈P is the root page (homepage) of W. A page p∈P has two tags: url(p) is p’s URL and title(p) denotes p’s title. A hyperlink can be represented as a triple l=⟨ps, at, pd⟩, where ps=s(l) is the source page of l, at is the anchor text displayed for l, and pd=d(l) is the destination page. Based on the roles they play in the Web graph [21, 24, 25], hyperlinks can be classified roughly into three categories: structural, reference, and pure navigational hyperlinks. Structural hyperlinks, which exist largely in local websites, are mainly used by website builders for organizing the web page collection. Basically, they are intra-site and created administratively. A website’s structural hyperlinks collectively reflect a unified view of the organization that the website serves [21]. The semantics implied by this kind of hyperlink can propagate across multiple hops. Since in most cases they embody a certain hierarchical relation (e.g., whole-part or parent-child) between pages, we also call them Hierarchical hyperLinks (HLs). Reference hyperlinks (RLs) basically represent citations and are implicitly used by the web page author for web page recommendation. They are created based on a relevance or importance judgment by the author of the source page. In general, for a reference hyperlink l=⟨ps, at, pd⟩, pd and ps are peers of each other. The fact that they result from personal activities (i.e., no practical control) gives the global Web its democratic nature [21]. Unlike an HL, the semantics inherent in an RL holds for only one hop. In most cases, an RL is a cross-website hyperlink. Pure navigational hyperlinks (PNLs) are created for neither citation nor information organization purposes. Their only role is to provide shortcuts that let readers jump from one page to another, e.g., the hyperlinks connecting sibling pages, or the upward hyperlink from a lower-level page to a higher-level page in the underlying website architecture (e.g., “back to Home”). Basically, PNLs are intra-site. With the hyperlink roles differentiated, HLs are identified as our target resource for web search improvement. An HNP is derived from multiple consecutive steps of intra-site HLs starting at the website home page. It can be formally represented as HNP=⟨ps, LL, pd⟩, where LL is a list of hyperlinks, i.e., LL={li}, i=1, 2, …, n; ps is the source page of the HNP and of l1, and basically it is the homepage; the source page of li+1 is identical to the destination page of li; pd is the destination page of the HNP and of ln. Since HNPs are associated with their destination pages and will be used to discover semantic indications of those pages, the linguistic contents within an HNP, including the URLs, anchor texts, and page titles along it, are collected. We reformulate it as an index model HNP=⟨TN, C⟩.
TN is a list of text nodes, i.e., TN ={tni}, i=0,1,…,n, where tn0=[title(h), url(h)], and tni=[at(li), title(d(li)), url(d(li))], i=1, 2, …, n. C is the context of HNP, represented by the domain name of the website. If a HNP contains n text nodes, we say its length is n. There might be multiple HNPs for a given web page. Each HNP represents a potential entrance path from the home page of the website to the destination page.
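The index model above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class and function names (`TextNode`, `HNPIndex`, `build_hnp_index`) and the example.org URLs are hypothetical, chosen only to mirror the definitions of tn0, tni and C.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TextNode:
    """One text node tn_i: anchor text, page title and URL (hypothetical layout)."""
    title: str
    url: str
    anchor_text: str = ""  # empty for tn_0: the home page has no link leading to it

    def text(self) -> str:
        # Concatenate the non-empty linguistic parts of the node.
        return " ".join(part for part in (self.anchor_text, self.title, self.url) if part)


@dataclass(frozen=True)
class HNPIndex:
    """Index model of an HNP: the text nodes TN plus the context C (site domain)."""
    text_nodes: Tuple[TextNode, ...]
    context: str


def build_hnp_index(home: Tuple[str, str],
                    steps: List[Tuple[str, str, str]],
                    context: str) -> HNPIndex:
    """home = (title, url); each step = (anchor text, destination title, destination url)."""
    nodes = [TextNode(title=home[0], url=home[1])]
    nodes += [TextNode(title=t, url=u, anchor_text=a) for a, t, u in steps]
    return HNPIndex(tuple(nodes), context)


# Hypothetical two-step path: home page -> "People" -> "Alumni".
example = build_hnp_index(
    ("Department Home", "http://example.org/"),
    [("People", "People of the department", "http://example.org/people/"),
     ("Alumni", "Alumni list", "http://example.org/people/alumni.html")],
    "example.org",
)
```

Keeping the nodes in path order preserves the position information that the ranking step later relies on.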
3 HNPs for Web Page Retrieval
3.1 HierPathExt for HNP Extraction
Here, we describe our algorithm, HierPathExt, which extracts HNPs as the new resource for web page retrieval. As mentioned above, an HNP in website W=⟨P, L, h⟩ consists of multiple steps of intra-site HLs. Thus, in our HierPathExt algorithm, we first discover the HLs from L by removing the (inter-site) RLs and (intra-site) PNLs, and then the HNP for each web page p∈P is constructed by concatenating the pages along a sequence of adjacent HLs starting from the home page h of the corresponding website.
3.1.1 HL Discovery
Detection of (inter-site) RLs is simple: for a hyperlink l∈L, if its destination page and source page belong to different domains, it is judged to be an (inter-site) RL. Assuming
the identified RL set in website W is Lr, the set of intra-site hyperlinks can be obtained as L-Lr. Our strategy for detecting the (intra-site) PNLs in L-Lr consists of two phases of analysis: syntactical URL analysis and semantic linkage analysis. Syntactical URL analysis is easy. It mainly uses the directory information of the URLs to discover whether there is a hierarchical relation between the source and destination pages. However, the directory information in URLs is limited. Even worse, in many cases, the URLs do not reflect the page hierarchy. So we propose semantic linkage analysis for detecting HLs. It is based on the concept of a Link Collection (LC). An LC is a semantic block [5] in a web page that contains only a set of hyperlinks. For simplicity, we call the destination pages of the outbound hyperlinks in page p the outbound pages of p. Naturally, if these hyperlinks are in an LC lc hosted by p, we say that these destination pages are the outbound pages of lc, and together they form lc’s outbound page set. For a set of pages, if each page in the set has an outbound hyperlink pointing to a destination page q, then q is a common outbound page of this page set. All the common outbound pages of this page set constitute its common outbound page set. Examples are shown in Fig. 1. P1 is an LC’s outbound page set. Since l1, l2, l3, and l4 each come from a page in P1 and point to the same destination page, this destination page is a common outbound page of P1. All such common outbound pages form P2, the common outbound page set of P1.
[Figure 1 labels: page sets P1 and P2; hyperlinks l1–l8; a Link Collection; the common page set (e.g., homepage)]
Fig. 1. The intuition for semantic linkage analysis
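The inter-site RL test and the syntactical URL analysis described above can be sketched as follows. This is a rough illustration under stated assumptions: "domain" is taken to mean the URL host name, and the directory comparison in `url_relation` is one plausible reading of "directory information", not the authors' exact procedure. The function names and example.org URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def is_inter_site(src_url: str, dst_url: str) -> bool:
    """Treat a link as an inter-site RL when the two host names differ (assumption)."""
    return urlsplit(src_url).hostname != urlsplit(dst_url).hostname


def url_directory(url: str) -> str:
    """Directory part of the URL path, always ending with '/'."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        return path
    return posixpath.dirname(path).rstrip("/") + "/"


def url_relation(src_url: str, dst_url: str) -> str:
    """Classify the destination relative to the source using directory prefixes."""
    src_dir, dst_dir = url_directory(src_url), url_directory(dst_url)
    if src_dir == dst_dir:
        return "same-directory"
    if dst_dir.startswith(src_dir):
        return "descendant"   # candidate hierarchical (downward) link
    if src_dir.startswith(dst_dir):
        return "ancestor"     # candidate upward navigational link
    return "unrelated"
```

As the text notes, many URLs carry no usable directory hierarchy, so a test like this can only be a first filter before the semantic linkage analysis.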
The basic intuition underlying the semantic linkage analysis originates from the following assumption: the destination pages of the hyperlinks from the same link collection are siblings of each other, and they should share the same parent page and have different child pages. This means that, if the outbound page set OPlc of an LC has a common outbound page set Clc, then generally the web pages in OPlc are siblings at the same level of the underlying website hierarchy, and the hyperlinks from the pages in OPlc pointing to the pages in Clc are PNLs. As shown in Fig. 1, since the web pages in P1 are outbound pages of an LC (the rectangle outlined in red, hosted by a web page), and their outbound hyperlinks {l1, l2, l3, l4} share the same destination page in P2, {l1, l2, l3, l4} are identified as PNLs. Similarly, {l5, l6, l7, l8} can also be identified as PNLs.
However, the assumption does not hold in some real websites; e.g., in a company website, a list of products might share a common child page on “terms of use”. Therefore, an additional constraint is used to improve the validity of the assumption. If the sibling pages in a set Ps share a common outbound page p, there are only two possibilities: p is either directly relevant to the topic presented by the pages in Ps or it is not. If it is not directly relevant, p is an important page in the whole website, and the host page of the corresponding LC of Ps, denoted by Plc, should also have a hyperlink pointing to p. If it is directly relevant, p might be just an important page for the topic presented by the pages in Ps, and Plc should not have a hyperlink pointing to p. So the constraint is that, if the pages in set Ps share a common outbound page p, the hyperlinks pointing to p are regarded as PNLs only if the host page of the corresponding LC of Ps also has a hyperlink pointing to p. Assuming the set of PNLs identified in the phase of syntactical URL analysis is L', for each page p∈P, the set of its outbound hyperlinks is OLp={l|l∈L-Lr-L', s(l)=p}, and the outbound page set of p is OPp={d(l)|l∈OLp}. LCp is a partition of OLp by the LCs hosted in p. We define OPlc as the set of destination pages of the hyperlinks in link collection lc, i.e., OPlc={d(l)|l∈lc}. The common outbound page set of OPp and the common outbound page set of OPlc can be obtained through the following formulas, respectively (Φ denotes the empty set):

$$C_p = \begin{cases} \bigcap_{q \in OP_p} \left( OP_q \cup \{q\} \right), & |OP_p| > 1 \\ \Phi, & |OP_p| \le 1 \end{cases}$$

$$C_{lc} = \begin{cases} \bigcap_{q \in OP_{lc}} \left( OP_q \cup \{q\} \right), & |OP_{lc}| > 1 \\ \Phi, & |OP_{lc}| \le 1 \end{cases}$$
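The common outbound page set and the host-page constraint can be sketched in a few lines. This is an illustrative simplification rather than the full HierPathExt rule: it works with one link collection at a time, uses only Clc, and treats links back to the host page itself as navigational. The names `common_outbound_set` and `detect_pnls` and the toy page identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def common_outbound_set(pages: Iterable[str],
                        outbound: Dict[str, Set[str]]) -> Set[str]:
    """Intersection over q in pages of (OP_q ∪ {q}); empty when there are <= 1 pages."""
    page_set = set(pages)
    if len(page_set) <= 1:
        return set()
    return set.intersection(*(set(outbound.get(q, ())) | {q} for q in page_set))


def detect_pnls(host: str,
                lc_pages: Iterable[str],
                outbound: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Links from sibling pages to common targets that the LC's host page also
    links to (or to the host itself) are flagged as pure navigational links."""
    lc_pages = set(lc_pages)
    common = common_outbound_set(lc_pages, outbound)
    allowed = set(outbound.get(host, ())) | {host}
    targets = common & allowed
    return {(s, d) for s in lc_pages for d in outbound.get(s, ()) if d in targets}


# Toy site: "home" hosts a link collection pointing to siblings a, b and c.
outbound = {
    "home": {"a", "b", "c"},
    "a": {"home", "a1", "policy"},
    "b": {"home", "b1", "policy", "a"},
    "c": {"home", "c1", "policy", "a"},
}
pnls = detect_pnls("home", {"a", "b", "c"}, outbound)
```

In this toy site the links to "policy" are kept, because the host page does not link to "policy"; this mirrors the "terms of use" case where the shared page is directly relevant to the siblings' topic.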
Based on the constrained assumption, the rule for detecting PNLs can be expressed as: s(l)∈OPlc, d(l)∈(Cp∪Clc)∩(OPp∪{p}) ⇒ l is a PNL. Then, the set of PNLs identified in this phase is L"={l|l∈L-Lr-L', s(l)∈OPlc, d(l)∈(Cp∪Clc)∩(OPp∪{p})}.
3.1.2 HNP Construction
The RLs and PNLs are noise with respect to our task of discovering HLs from W=⟨P, L, h⟩. By removing them from L, the set HL=L-Lr-L'-L" is obtained. In many cases, two or more hyperlinks share the same source and destination pages. We merge such hyperlinks into one: the source and destination pages stay the same, and the anchor texts of these hyperlinks are concatenated into one (the redundancy of anchor texts from multiple hyperlinks carries multiple types of descriptions of the destination page). This merging operation transforms HL into a merged hyperlink set, which we continue to denote by HL. To build an HNP from HL, we only need to concatenate the pages along a sequence of adjacent HLs starting from the home page h and ending with the page with which the HNP is associated. No cycle is allowed in an HNP. To facilitate the use of HNPs for web page retrieval, we translate each HNP into its index model HNP=⟨TN, C⟩. In addition, unexpected errors from crawling, page parsing, etc., might leave some pages with no HLs, and hence no HNPs, pointing to them. For such cases, a complementation step is adopted to make sure each web page has at least one HNP associated with it: first, for the pages without HLs pointing to them, we append their inbound
hyperlinks in L-Lr-L' as extended HLs; then, iteratively, if the source page p of an extended HL l has HNPs associated with it, the HNPs of l’s destination page can be constructed from the HNPs of p. Our current algorithm adopts a relatively “loose” way of discovering HLs and HNPs. One reason is that it is difficult to differentiate clearly between HLs and PNLs. In addition, since the goal is to use HNPs for web retrieval, this “loose” method already generates results satisfactory enough to verify our ideas, as shown in the experiments.
3.2 PathRank for Web Page Ranking
Although the HNPs are extracted at the local website level, they can be used directly for page ranking not only within the scope of the small Web but also within the scope of the whole Web. For a given query, our PathRank algorithm takes three steps to rank the web pages, where the HNPs serve as the intermediary between the query and the web pages:
1. Estimate the rank value RW of each website W in the global Web;
2. Compute the rank value Rpath of each HNP path according to the input query;
3. Determine the PathRank value Rpage of each page from all the HNPs pointing to it.
In step 1, RW is the site rank, which can be generated by applying the PageRank algorithm to the graph of websites. All HNPs in the same website have the same RW. Step 2 ranks the relevance between each HNP and the query. For a text node tni (i=0, 1, …, n-1) in an n-length HNP path (n is the number of nodes in path) pointing to page p, we define the weight of tni as s(tni)=1/(n-i). This means that the shorter the distance between tni and p, the higher the weight of the node. For the destination node tnn-1 of path, the weight is 1. Then, the formula for computing Rpath(q), i.e., the weight of an HNP path relative to a query q, is given below:

$$R_{path}(q) = \frac{\sum_{i=0}^{n_{path}-1} s(tn_i)\, O_i}{n_{path}}$$
where npath is the length of path, i.e., the total number of nodes in path; Oi is a function of the text node tni and the query q, i.e., Oi=f(tni, q), that measures their similarity. For simplicity, in this paper we define it as the percentage of tni’s words that occur in q. This formula reflects that the indication an HNP gives about the content of the corresponding web page comes not only from its texts but also from the positions of those texts in the HNP. Assume that HNP={pathi}, i=1, 2, …, npage, is the set of HNPs pointing to web page page. Step 3 employs the following heuristic to combine the results of the above two steps and compute the PathRank value of page for a query q:

$$R_{page}(q) = \frac{\sum_{i=1}^{n_{page}} O'_i(q)\, R_{W_i}\, R_{path_i}(q)}{n_{page}}$$
where O’i(q) refers to the percentage of query q’s terms that occur in pathi; npage is the number of HNPs pointing to page; and RWi is the rank value of the corresponding website. This combination mechanism enables each HNP of a web page to contribute to its final rank value. Multiple HNPs represent multiple viewpoints on, or descriptions of, the content of the destination page, so more descriptive information can be exploited for web page ranking.
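Steps 2 and 3 can be sketched as below, using node weights 1/(n-i) for i = 0, …, n-1, the word-overlap measures for Oi and O’i, and an average over all HNPs of the page. This is a minimal sketch under assumptions: each text node is reduced to a single string, tokenization is a simple lowercase alphanumeric split, and the function names and sample node texts are hypothetical.

```python
import re
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase alphanumeric tokens; the paper does not fix a tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def node_overlap(node_text: str, query: str) -> float:
    """O_i: fraction of the text node's words that occur in the query."""
    words = tokenize(node_text)
    if not words:
        return 0.0
    query_terms = set(tokenize(query))
    return sum(w in query_terms for w in words) / len(words)


def path_score(node_texts: Sequence[str], query: str) -> float:
    """R_path(q): position-weighted overlap, averaged over the n nodes of the path."""
    n = len(node_texts)
    if n == 0:
        return 0.0
    total = sum(node_overlap(t, query) / (n - i) for i, t in enumerate(node_texts))
    return total / n


def query_coverage(node_texts: Sequence[str], query: str) -> float:
    """O'_i(q): fraction of query terms that occur anywhere in the path."""
    terms = set(tokenize(query))
    if not terms:
        return 0.0
    path_words = {w for t in node_texts for w in tokenize(t)}
    return len(terms & path_words) / len(terms)


def page_score(paths: Sequence[Tuple[Sequence[str], float]], query: str) -> float:
    """R_page(q): paths is a list of (node texts, site rank R_W) pairs."""
    if not paths:
        return 0.0
    total = sum(query_coverage(nodes, query) * site_rank * path_score(nodes, query)
                for nodes, site_rank in paths)
    return total / len(paths)


# Two hypothetical HNPs pointing to the same alumni page, with site rank 1.
paths = [(["Stanford University", "Computer Science", "Alumni"], 1.0),
         (["Stanford University", "Alumni"], 1.0)]
score = page_score(paths, "computer science alumni")
```

The first path mentions every query term and so contributes more than the second, which reaches the same page without passing through a "Computer Science" node.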
4 Experiment and Evaluation
4.1 Experiments on Website Search
This study includes two experiments, conducted for informational and navigational queries [2] [28], respectively. The evaluation of the HierPathExt algorithm is embedded.
4.1.1 Experiment Setup
We studied dozens of publicly accessible websites and selected stanford.edu as one of the most representative websites to simulate web search, where the inter-subdomain hyperlinks are used to rank the subdomain sites within stanford.edu. Our crawler is configured with the breadth-first crawling strategy [16]. The maximum number of hyperlink hops is set to 15. We collected about 2 million web pages from 2,980 subdomains of stanford.edu. After pruning duplicate web pages and web pages from subdomains containing fewer than 20 pages, about 1.4 million unique pages from 768 distinct subdomains remain. To obtain objective evaluation results from well-controlled experiments, we hired a group of students to help us create the query set. They were divided into two groups, one for navigational and one for informational queries. For navigational queries, we asked the students to browse the stanford.edu website and each select 10-20 home pages of subjects they were interested in (e.g., persons, services, projects, etc.). The resulting pages were ranked by voting based on the subject’s common acceptability, and the top 50 pages were selected. For each page, the students collectively created a descriptive phrase such that they could imagine someone using that phrase as a query to find the corresponding page. In fact, the query “computer science alumni” mentioned in Section 1 is taken from the resulting navigational query set. For the informational queries, we first asked the students to specify topics they were interested in, e.g., topics in their disciplines or personal interests.
Then, these topics were ranked based on a judgment (voting was used when the students’ decisions conflicted) of how likely it was that relevant descriptions or answers for the topics could be found in the stanford.edu website. For each of the top 50 topics, the possible keywords that could be used as a query were created collectively by the students. The following criteria are used for the evaluation of HierPathExt: Precision, Recall and F-measure for Entrance Paths (EPs), and Precision for Useful Paths (UPs). EPs are navigation paths starting from the homepage of the stanford.edu site; UPs are the paths containing useful indications of the destination pages. The ground truth for EPs and UPs is identified manually. We randomly sampled 500 pages, and all possible EPs for these 500 pages were collected by hand to serve as the ground truth. This resulted in 2,105 EPs.
For the web search experiment, the following criteria are used for precision evaluation:
S@5 (S@10) for navigational queries: the proportion of queries for which one of the correct answers is ranked in the top 5 (10) of the ranked list returned for the query;
P@10 (P@20) for informational queries: the proportion of relevant web pages among the top 10 (20) web pages in the ranked list returned for the query;
SP: used to evaluate the overall quality of the approach for website search:

$$SP = \frac{\gamma\,(S@5) + (1-\gamma)\,(P@10)}{2},$$

where γ reflects the weights of navigational vs. informational queries (it is 0.5 here).
4.1.2 Experiment Results
The HNP extraction is conducted within the scope of the whole stanford.edu site. The maximum length of HNPs can be controlled to trade off result quality against execution time. In this experiment, we ran the extraction six times, with the maximum length set to 4, 5, 6, 7, 8 and unlimited, respectively. The resulting HNPs of the 500 sample pages were selected; they number 2,253, 2,310, 2,671, 3,225, 4,009 and 5,516, respectively. Figure 2 shows the evaluation results of HierPathExt.
Fig. 2. The evaluation of HierPathExt
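The length-bounded extraction can be illustrated with a small path enumeration over the hierarchical links: paths start at the home page, contain no cycles, and longer paths are grown from shorter ones. This is a sketch under assumptions, not the HierPathExt implementation: the name `extract_paths` is hypothetical, path length is counted in pages (text nodes), and the complementation step for pages without HLs is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def extract_paths(home: str,
                  hier_links: Iterable[Tuple[str, str]],
                  max_len: Optional[int] = None) -> Dict[str, List[Tuple[str, ...]]]:
    """Map each reachable page to the cycle-free paths from home that end at it.

    max_len bounds the number of pages on a path; None means unlimited.
    """
    children = defaultdict(list)
    for src, dst in hier_links:
        children[src].append(dst)

    paths = defaultdict(list)
    stack = [(home,)]
    while stack:
        path = stack.pop()
        paths[path[-1]].append(path)
        if max_len is not None and len(path) >= max_len:
            continue  # do not grow paths beyond the configured maximum length
        for nxt in children.get(path[-1], ()):
            if nxt not in path:  # no cycle is allowed in an HNP
                stack.append(path + (nxt,))
    return dict(paths)


# Toy link set: page "c" is reachable through "a" and through "b".
links = [("h", "a"), ("h", "b"), ("a", "c"), ("b", "c"), ("c", "h")]
all_paths = extract_paths("h", links)
short_paths = extract_paths("h", links, max_len=2)
```

Because every longer path extends a shorter one, an erroneous short path is inherited by all of its extensions, which is one way to read the drop in quality at larger maximum lengths.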
Overall, the results show that the accuracy of the HierPathExt algorithm is satisfactory. Furthermore, the quality is related to the controlled maximum length of HNPs. The recall for EPs is higher when the maximum HNP length is set larger. On the other hand, because the longer HNPs are generated from the shorter ones in HierPathExt, errors in the shorter HNPs might be propagated to the longer ones. We can observe that when the maximum length is greater than 7, the quality decreases. In the following experiments, we set the maximum length to 7 for HNP extraction. To make sure each page has at least one HNP, the complementation operation of the HierPathExt algorithm is applied. This means that even though the maximum HNP length is set to 7 when running the algorithm, some pages’ HNPs are longer than 7. In our experiment, we compare the following web page retrieval methods:
(1) BM25: The similarity between queries and web pages is calculated based on the BM25 formula [22]. Given a page, we extract information and store the result in two fields: content and metadata. A page’s content is all the texts in the tags. The anchor texts
of the hyperlinks pointing to this page and the page title constitute its metadata. Their BM25 scores are combined as BM25 = 0.7 × BM25_content + 0.3 × BM25_metadata.
(2) BM25 + PageRank: 0.8 × BM25 + 0.2 × PageRank.
(3) PathRank1/HNP-FullText: The whole of stanford.edu is considered as one website, so all the site rank values RW are set to 1 when calculating the PathRank values. HNP-FullText serves as a comparison method for our proposed HNP-based ranking algorithm: the collected HNPs of a web page constitute a full-text document, and the BM25 measure between the query and this representative document is used to rank the web page.
(4) PathRank1 (HNP-FullText) + BM25_content: 0.5 × PathRank1 + 0.5 × BM25_content (0.4 × HNP-FullText + 0.6 × BM25_content).
(5) PathRank2/HNP-FullText: Each subdomain (e.g., cs.stanford.edu) of stanford.edu is treated as a distinct website, so the HNPs are extracted within each subdomain, and the site rank values RW are calculated from the hyperlinks across the subdomains.
(6) PathRank2 (HNP-FullText) + BM25_content: 0.5 × PathRank2 + 0.5 × BM25_content (0.4 × HNP-FullText + 0.6 × BM25_content).
The above linear combination parameters are the best tuned results based on WT10G [27] and were also validated on the stanford.edu dataset. Due to space limitations, the details are not listed here. The website search engine of stanford.edu serves as a real system for result comparison.

Table 1. The evaluation of stanford.edu website search (values in parentheses are for HNP-FullText)

                                      Navigational queries    Informational queries   Overall
Method                                S@5         S@10        P@10        P@20        SP
BM25                                  0.43        0.52        0.79        0.71        0.61
BM25 + PageRank                       0.59        0.68        0.80        0.72        0.70
stanford.edu website search           0.64        0.74        0.82        0.79        0.73
PathRank1 (HNP-FullText)              0.78(0.73)  0.86(0.77)  0.75(0.71)  0.69(0.64)  0.76(0.72)
PathRank1 (HNP-FullText) + Content    0.81(0.75)  0.90(0.79)  0.83(0.79)  0.72(0.69)  0.82(0.77)
PathRank2 (HNP-FullText)              0.85(0.79)  0.91(0.82)  0.77(0.73)  0.71(0.69)  0.81(0.76)
PathRank2 (HNP-FullText) + Content    0.91(0.85)  0.92(0.87)  0.89(0.81)  0.79(0.72)  0.90(0.83)
For the stanford.edu site, 8.6M and 7.8M HNPs are extracted by PathRank1 and PathRank2, respectively. On average, there are about 6 and 5.5 HNPs per page. The results for web page retrieval are summarized in Table 1. The figures in shaded rows are the only query-dependent measures. For others, the query-independent features are adopted. Among the three baselines (i.e., BM25, BM25 + PageRank, and the website search engine), the website search engine has the best results. The performance figures illustrate that our PathRank approach (PathRank2 + BM25_content) improves the search quality significantly (two-tailed t-test, p-value = 0.01) compared with the website search engine, especially for navigational queries. In fact, for the query “computer science alumni”, the web pages on “Undergraduate Alumni”, “Masters Alumni”, and “Ph.D Alumni” from cs.stanford.edu are ranked in the top 5 by our approach. Since the subdomains of stanford.edu are relatively independent of each other and PathRank2 incorporates the website rank RW, it performs better than PathRank1. The results in Table 1 demonstrate that the contextual information propagated across multiple steps of HLs has great potential to improve web search quality.
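The criteria reported in Table 1 can be computed along the following lines. This is a hedged sketch: the function names are hypothetical, and per-query values would still need to be averaged over the query set. One tentative observation: the printed SP formula carries a divisor of 2, yet the SP column of Table 1 matches the plain weighted sum at γ = 0.5 (for BM25, 0.5 × 0.43 + 0.5 × 0.79 = 0.61), so the sketch makes the division optional.

```python
from typing import Sequence, Set


def success_at_k(ranked: Sequence[str], correct: Set[str], k: int) -> float:
    """1.0 if any correct answer appears in the top k results, else 0.0."""
    return 1.0 if any(doc in correct for doc in ranked[:k]) else 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Proportion of relevant pages among the top k results."""
    return sum(doc in relevant for doc in ranked[:k]) / k


def sp_score(s_at_5: float, p_at_10: float,
             gamma: float = 0.5, halve: bool = True) -> float:
    """Overall SP; halve=True follows the printed formula, halve=False the plain
    weighted sum (an assumption based on the values in Table 1)."""
    weighted = gamma * s_at_5 + (1.0 - gamma) * p_at_10
    return weighted / 2.0 if halve else weighted


# BM25 row of Table 1: S@5 = 0.43, P@10 = 0.79.
bm25_sp = sp_score(0.43, 0.79, halve=False)
```

Averaging `success_at_k` over the navigational queries gives S@k, and averaging `precision_at_k` over the informational queries gives P@k.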
4.2 Experiments on TREC WT10G
Considering that the above experiments might be biased by the well-defined structure of the Stanford website, this section uses WT10G [27] to evaluate our PathRank approach.
4.2.1 Tasks and Evaluation Criteria
Two tasks were conducted. The first is to retrieve topic-relevant pages (similar to informational queries): the queries used were the 50 topics (topics 451-500) of the TREC-9 main web track and the 50 topics (topics 501-550) of the TREC-2001 Topic Relevance Task. The other task is to retrieve the home/entry page of a specific topic (i.e., navigational queries). The topics used were the 145 topics of the TREC-2001 Homepage Finding Task. We selected standard TREC measures as the evaluation criteria for our algorithms: AveP (Average Precision), P@5, P@10, and P@20 for the topic relevance tasks; MRR, %top10, and %fail for the homepage finding task.
4.2.2 Experiment Results
The linear combination 0.5 × PathRank + 0.5 × BM25_content is used here. The maximum length of HNPs is set to 7. By definition, an HNP node consists of three elements: page title (denoted by t), URL string (denoted by u) and anchor text (denoted by a). To investigate their influence on the search results, we conducted three experiments, each of which uses only one kind of element (t, u or a) to represent HNP nodes. The TREC best results and the PageRank method (0.8 × BM25 + 0.2 × PageRank) are selected as baselines.

Table 2. The evaluation on WT10G

TREC-9 topic relevance task
Method             AveP     P@5      P@10     P@20
PathRank(t)        0.1721   0.3840   0.3540   0.3460
PathRank(u)        0.3075   0.5002   0.4813   0.4574
PathRank(a)        0.4529   0.5812   0.5723   0.5569
PathRank(t+u+a)    0.4784   0.5651   0.5919   0.5881
TREC best result   0.3519   /        0.5180   /
PageRank           0.1862   0.4012   0.3769   0.3251

TREC-2001 topic relevance task
Method             AveP     P@5      P@10     P@20
PathRank(t)        0.1894   0.2320   0.2348   0.2343
PathRank(u)        0.2938   0.2936   0.2775   0.2517
PathRank(a)        0.4027   0.4710   0.4503   0.4459
PathRank(t+u+a)    0.4152   0.4066   0.4799   0.4703
TREC best result   0.3324   0.4320   0.3620   0.3130
PageRank           0.1510   0.2302   0.2239   0.2175

TREC-2001 homepage finding task
Method             MRR      %top10   %fail
PathRank(t)        0.522    66.4     23.9
PathRank(u)        0.515    80.4     7.2
PathRank(a)        0.797    89.2     5.1
PathRank(t+u+a)    0.826    91.6     4.3
TREC best result   0.774    88.3     4.8
PageRank           0.732    85.2     6.8
The results of the topic relevance tasks and the homepage finding task are shown in Table 2. Our PathRank method performs better than the TREC best results and PageRank on both tasks. We also observe that anchor texts in HNPs play the most important role in these tasks (this result is consistent with the observation in [18]), while page titles and URL strings also have a positive influence when combined with them. This experiment shows again that the two kinds of hyperlinks, those for recommendation and those for information organization, should be distinguished when they are used for web page ranking. One limitation of the PathRank approach is that it is not appropriate for ranking pages from the “deep Web”. As the path length increases, the efficiency of HNP extraction becomes the bottleneck that blocks its application in such a scenario.
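For readers unfamiliar with the homepage finding measures in Table 2, the following sketch shows one common way to compute them. The definitions are assumptions, since the text does not spell them out: MRR is the mean reciprocal rank of the first correct page, %top10 is the share of topics with a correct page in the top 10, and %fail is the share of topics with no correct page within a cutoff (100 here, exposed as the hypothetical parameter `fail_cutoff`).

```python
from typing import List, Optional, Sequence, Set, Tuple


def first_hit_rank(ranked: Sequence[str], correct: Set[str]) -> Optional[int]:
    """1-based rank of the first correct page, or None if none is present."""
    for pos, doc in enumerate(ranked, start=1):
        if doc in correct:
            return pos
    return None


def homepage_measures(runs: List[Tuple[Sequence[str], Set[str]]],
                      fail_cutoff: int = 100) -> Tuple[float, float, float]:
    """Return (MRR, %top10, %fail) over a list of (ranked list, correct pages)."""
    ranks = [first_hit_rank(ranked[:fail_cutoff], correct) for ranked, correct in runs]
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    top10 = 100.0 * sum(r is not None and r <= 10 for r in ranks) / n
    fail = 100.0 * sum(r is None for r in ranks) / n
    return mrr, top10, fail


# Four toy topics: first correct page at ranks 1, 3, none, and 4.
runs = [(["a", "b"], {"a"}), (["x", "y", "b"], {"b"}),
        (["x"], {"z"}), (["p", "q", "r", "s"], {"s"})]
mrr, top10, fail = homepage_measures(runs)
```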
5 Related Work
Current popular models for web information retrieval are mainly combinations of content-based and hyperlink-based approaches. The content-based approach employs the internal information of web pages for ranking. The semantic blocks [5], structural characteristics [14], and page titles [23] inside a page have been investigated for web search. Also, anchor text [18] and extended anchor text [6] are extensions of the internal information. These approaches treat each page as an independent document. However, the internal content of a web page, even including anchor texts from other pages, is often not self-contained. The hyperlink-based approach uses the location of a page in the Web's graph structure to determine its popularity. PageRank [15] is one of the representative algorithms. [5] and [26] extended PageRank to block-level and object-level analysis. HITS [11] uses content information to enhance the hyperlink analysis. Explicit hyperlinks have been extended to implicit ones for web page classification [4, 8]. These studies assume that hyperlinks convey recommendation. However, this does not hold in general in local websites, since a large number of hyperlinks are created not for recommendation but for web page organization. In fact, the global Web structure is totally different from its local structure, and the hyperlinks of the local Web are more regular than those in the global Web [1]. This paper focuses on exploiting the structural hyperlinks created in the local Web as a new resource for web page ranking. Several studies on such hyperlinks have been reported. [3] uses the shortest paths in the intranet to organize web search results. [19] proposed a navigation-aided retrieval model. Assuming that each page has at most one entrance path, an entrance path extraction algorithm is presented in [24]. A stochastic model is given in [7] to compute the probability that a user navigates along some path from one web page to another.
Regarding our proposed ideas for differentiating HLs and PNLs, [25] proposed an algorithm to differentiate navigational and semantic links for web thesaurus building. The directory information in URLs and the navigation list [12] are used to identify navigational links. However, in many cases, there is no directory information in URLs. Also, the hyperlinks in a navigation list might be used not for highlighting news/products but for organizing local information about a specific object. Basically, these studies differ from our work in that they have not addressed how structural hyperlinks can be exploited to improve web page retrieval.
6 Conclusion

Based on the observation that a large number of structural hyperlinks in local websites are created for web page organization and have no recommendation semantics, this paper identifies structural hyperlinks as a new type of information for improving web page retrieval. Through our path-based technique, multi-step hyperlinks, together with the co-occurring textual information (e.g., anchor texts, page titles, and URLs) in the local web, are exploited for high-quality web page retrieval. The evaluation showed that the proposed approach can significantly improve the accuracy of web page retrieval.
References

1. Broder, Kumar, R., Maghoul, F., et al.: Graph structure in the web. In: Proc. of WWW 2000 (2000)
2. Broder: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)
3. Chen, M., et al.: A System for Organizing Intranet Search Results. In: Proc. of USENIX USITS (1999)
4. Shen, Sun, J.-T., Yang, Q., Chen, Z.: A comparison of implicit and explicit links for web page classification. In: Proc. of WWW 2006, pp. 643–650 (2006)
5. Cai, D., He, X., et al.: Block-Level Link Analysis. In: Proc. of SIGIR 2004, pp. 440–447 (2004)
6. Glover, J., Tsioutsiouliklis, K., Lawrence, S., Pennock, D.M., Flake, G.W.: Using web structure for classifying and describing web pages. In: Proc. of WWW 2002, pp. 562–569 (2002)
7. Chi, E.H., et al.: Using Information Scent to Model User Information Needs and Actions on the Web. In: Proc. of SIGCHI (2001)
8. Xue, Zeng, H., et al.: Implicit Link Analysis to Small Web Search. In: Proc. of SIGIR 2003, pp. 56–63 (2003)
9. Hagen, P., Manning, H., Paul, Y.: Must search stink? The Forrester report (June 2000)
10. Hawking, D., Voorhees, E., Bailey, P., Craswell, N.: Overview of TREC-8 web track. In: Proceeding of TREC-8, pp. 131–150 (1999)
11. Kleinberg: Authoritative source in a hyperlinked environment. J. of ACM 46, 604–622 (1999)
12. Chen, L., Baoyao, Y., et al.: Function-based object model towards Website Adaptation. In: Proc. WWW 2001 (2001)
13. Sepandar, H., Taher, M., Christopher, G.: Gene. Exploiting the Block Structure of the Web for Computing PageRank, Stanford University Technical Report (2003)
14. Matsuda, Fukushima, T.: Task-Oriented World Wide Web Retrieval by Document Type Classification. In: Proc. of CIKM 1999, pp. 109–113 (1999)
15. Page, Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web, Technical Report, Stanford University (1998)
16. Najork, Wiener, J.: Breadth-First Search Crawling Yields High-Quality Pages. In: Proc. of WWW 2000, pp. 114–118 (2000)
17. Henzinger, M.: Hyperlink analysis on the world wide web. In: Proc. of ACM Hypertext 2005 (2005)
18. Eiron, McCurley, K.: Analysis of anchor text for web search. In: Proc. of SIGIR 2003, pp. 459–460 (2003)
19. Pandit, S., Olston, C.: Source, Navigation-Aided Retrieval. In: Proc. of WWW 2007, pp. 391–400 (2007)
20. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
21. Fagin, R., Kumar, R., McCurley, K.S., Novak, J., Sivakumar, D., Tomlin, J.A., Williamson, D.P.: Searching the workplace web. In: Proc. of WWW 2003, pp. 366–375 (2003)
22. Robertson, S.E., Walker, S., et al.: Okapi at TREC. In: Text REtrieval Conference (1992)
23. Hu, Y., Xin, G., Song, R., Hu, G., et al.: Title Extraction from Bodies of HTML Documents and Its Application to Web Page Retrieval. In: Proceeding of SIGIR 2005, pp. 250–257 (2005)
24. Mizuuchi, Y., Tajima, K.: Finding context path for web pages. In: Proc. of ACM Hypertext (1999)
25. Chen, Z., Liu, S.: Building Web Thesaurus from Web Link Structure. In: Proc. of SIGIR 2003 (2003)
26. Nie, Z., Zhang, Y., Wen, J.R., et al.: Object-level ranking: bringing order to objects. In: Proc. of WWW 2005 (2005)
27. http://ir.dcs.gla.ac.uk/test_collections/access_to_data.html
28. Rose, Levinson: Understanding User Goals in Web Search. In: Proc. of WWW 2004, pp. 13–19 (2004)