Exploring URL Hit Priors for Web Search

Ruihua Song, Guomao Xin, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma
Microsoft Research Asia
5F, Sigma Center, No.49 Zhichun Road, 100080 Beijing, P. R. China
{rsong, guomxin, shumings, jrwen, wyma}@microsoft.com
Abstract. A URL usually contains meaningful information for measuring the relevance of a Web page to a query in Web search. Some existing works utilize URL depth priors (i.e., the probability of being a good page given the length and depth of a URL) to improve some types of Web search tasks. This paper suggests using the location where query terms occur in a URL to measure how well a Web page matches a user's information need. First, we define and estimate URL hit priors, i.e., the prior probability of being a good answer given the type of query term hits in the URL. The main advantage of URL hit priors over depth priors is that they achieve stable improvement for both informational and navigational queries. Second, an obstacle to exploiting such priors is that shortening and concatenation are frequently used in URLs. Our investigation shows that only about 30% of URL hits are recognized by an ordinary word breaking approach, so we combine three methods to improve matching. Finally, the priors are integrated into the probabilistic model to enhance Web document retrieval. Our experiments, conducted on seven query sets of TREC2002, TREC2003 and TREC2004, show that the proposed approach is stable and improves retrieval effectiveness by 4% to 11% for navigational queries and by 10% for informational queries.
1 Introduction

When searching the World Wide Web, "end users want to achieve their goals with a minimum of cognitive load and a maximum of enjoyment." [11] Some recent studies [4][17][10] found that the goal of a user can be classified into at least two categories: navigational and informational. A user issues a navigational query to reach a particular Web page he or she has in mind, whereas an informational query is usually short and broad, and the user intends to visit multiple pages to learn about a topic. In practice, real Web search must deal with a mixed query stream. Therefore, finding robust evidence that works well for various types of queries has been a challenging interest of the Web IR community.

As a workshop that provides the infrastructure necessary for large-scale evaluation of text retrieval methodologies, the Text REtrieval Conference (TREC) has set up three tasks in its Web track, namely homepage finding, named page finding and topic distillation, to
encourage research on Web information retrieval. Homepage finding (HP) and named page finding (NP) model two types of navigational queries. The difference is that a homepage finding query is the name of a site, while a named page finding query is the name of a non-homepage that the user wishes to reach. Topic distillation (TD), on the other hand, models informational queries. It was first proposed by Bharat and Henzinger [3] to refer to the process of finding quality documents on a query topic. They argued that it is more practical to return quality documents related to the topic than to exactly satisfy the user's information need, since most short queries do not express the need unambiguously. In TREC, a topic distillation query describes a general topic and requires retrieval systems to return homepages of relevant sites. These three types of queries are now widely acknowledged, and TREC has accumulated valuable data over the years for related research.

A URL, as a Uniform Resource Locator [19] for each Web page, usually contains meaningful information for measuring the relevance of the Web page to a query. Related works can be roughly grouped into three categories: one uses the length or depth of a URL as query-independent evidence in ranking [9][21][12][6]; another uses URL-based sitemaps to enhance topic distillation [20][18]; the third addresses the issue of word breaking in URLs [5][12].

Kraaij et al. [9] found that the probability of being an entry page, i.e., a homepage, seems to be inversely related to the depth of the path in the corresponding URL. They classified URLs into four types in terms of depth, estimated a prior relevance probability for each type, and integrated the priors in the language model. Their experimental results verified that depth is a strong indicator for homepages. With some extensions, Ogilvie and Callan [12] reported improvements on the mixed homepage/named-page finding task. However, by closely observing the URL priors in [12], we found that the priors for homepage finding queries are quite different from those for named page finding queries (see Section 2 for details). Thus the priors may hurt named-page finding while improving homepage finding.

In this paper, we aim to find a kind of stable prior that enhances retrieval performance for various kinds of queries. We observe that the occurrence location of query terms in a URL is an effective indicator of the quality and relevance of a Web page. In particular, a URL with some query term appearing near its tail promises to be a relevant domain, directory or file. Our statistics on queries of past TREC experiments verify this observation. Therefore, we treat the occurrence location of the query terms in a URL as a good prior for the relevance of a page. We call these priors URL hit priors, as a hit refers to a query term occurrence.

The effectiveness of URL hit priors relies on the capability of detecting the hits of query terms in URLs. To increase the hit rates of query terms, we explore three successive methods to recognize terms in URLs. First, a simple rule is used to recognize most of the acronyms in URLs. Second, the recognition of concatenations is formulated as a search problem with constraints. Third, prefix matching is used to recognize other fuzzily matched words. With this 3-step approach, the recall of URL hits on the TREC data is doubled from 33% to 66% while the precision stays close to 99%. We integrate the URL hit priors into the probabilistic model.
Experimental results on seven TREC Web track query sets indicate that, with the URL hit priors and the URL hit recognition methods, the performance is consistently improved across various types of queries.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 gives the details of the URL hit priors, the URL hit recognition methods, and how the URL hit priors are combined into the probabilistic model. Experiments verifying the proposed methods are presented in Section 4. Conclusions and future work are given in Section 5.
2 Related Work

As mentioned in the introduction, several URL-related approaches have been proposed to enhance Web search or to recognize more query terms. In this section, we briefly review four recent and representative works.

Kraaij et al. found that URL depth is a good predictor for entry page search [9]. Four types of URLs are defined in their work [21] as follows:

"ROOT: a domain name, optionally followed by 'index.html'.
SUBROOT: a domain name, followed by a single directory, optionally followed by 'index.html'.
PATH: a domain name, followed by an arbitrarily deep path, but not ending with a file name other than 'index.html'.
FILE: any other URL ending with a filename other than 'index.html'."

The prior probability of being an entry page is elegantly integrated in the language model. As a result, the performance is improved by over 100%, and about 70% of entry pages are ranked at No. 1. The TREC2001 evaluation confirmed several successful exploitations of URL depth in entry page search [7][13].

Ogilvie and Callan extended the usage of URLs in TREC2003 [12]. A character-based trigram generative probability is computed for each URL; a shortened word or a concatenation of words is handled by treating both the URL and the query term as character sequences. Another extension is that they include named pages in the estimation of URL depth priors.

Based on the TREC2003 data, we computed the distributions of URL depth types for the different retrieval tasks. The results are shown in Table 1. It is clear that most of the relevant documents for HP queries have ROOT type URLs, while the majority of NP queries have FILE type URLs for their relevant documents. For TD queries, more than half of the relevant documents' URLs are of the FILE type, whereas the distribution over the other three URL types is quite even. Therefore, priors computed from URL depth are unlikely to benefit all query types.

Craswell et al. [6] use URL length in characters as query-independent evidence and propose a function to transform it for effective combination. Their results show a significant improvement on a mixed query set, and their finding is that the average URL length of relevant pages is shorter than that of the whole collection.

Chi et al. [5] reported that over 70% of URL words are "compound words", meaning that multiple words are concatenated to form one word. This phenomenon is caused by the
special syntax of URLs: some of the most frequent delimiters in documents, such as spaces, are not allowed to appear in URLs [19]. Consequently, webmasters have to concatenate multiple words when creating a URL. Such compound words cannot be found in ordinary dictionaries. Chi et al. therefore proposed to exploit maximal matching, a Chinese word segmentation mechanism, to segment a "compound word". An interesting idea is that the title, anchor text, file names and alternate text of embedded objects are used as a reference base to help disambiguate segmentation candidates. Although the authors aim to recover the content hierarchy of Web documents in terms of URLs, the approach is also a good solution for recognizing URL hits. We have not implemented their approach because this paper focuses on the effectiveness of URL hit priors for search, and their approach does not handle individually shortened words. In addition, our recognition methods do not use any dictionary but the query itself.

Another solution worth mentioning was proposed by Ogilvie and Callan [12]. They treat a URL and a query term as character sequences and compute a character-based trigram generative probability for each URL.

Table 1. Distributions of URL depth types (TREC2003)
URL Depth Type    HP    NP    TD
ROOT             103     1    79
SUBROOT           33     8    65
PATH              13    11    77
FILE              45   138   295

3 Our Approach
In this section, we first define a new classification of URL types and the related URL priors, called URL hit priors. Then three methods are described to recognize URL hits. Finally, we introduce how to combine the URL hit priors into the probabilistic model to improve retrieval performance.

3.1 URL Hit Priors
A query term occurrence in a URL is called a URL hit. We assume that the location of a URL hit may be a good hint for distinguishing a good answer from other pages. For example, when a user queries "wireless communication" and the two URLs below are returned, U2 is more likely to be the better answer because it seems to be a good entry point, neither too general nor too narrow.

U1: cio.doe.gov/wireless/3g/3g_index.htm
U2: cio.doe.gov/wireless/
When "ADA Enforcement" is queried, U3 looks like a perfect answer, as a URL hit occurs in the file name.

U3: http://www.usdoj.gov/crt/ada/enforce.htm

Given the query "NIST CTCMS", U4 easily beats other pages like U5, and again the URL hits appear in a good position.

U4: http://www.ctcms.nist.gov/
U5: http://www.ctcms.nist.gov/people/

Given a URL, slashes can easily split it into several segments (a trailing slash is removed if no character follows it). U2, U3 and U4 are similar in that the last URL hit occurs in the last segment. Therefore, we define four kinds of URL hit types:

Hit-Last: a URL in which the last URL hit occurs in the last segment;
Hit-Second-Last: a URL in which the last URL hit occurs in the second-to-last segment;
Hit-Other: a URL in which all URL hits occur in segments other than the last two;
Hit-None: a URL in which no URL hit is found.

In our examples, U2, U3 and U4 belong to the type Hit-Last, U5 is of the type Hit-Second-Last, and U1 is of the type Hit-Other.

We performed a statistical analysis based on the TREC2003 data. The distribution of URL hit types is shown in Table 2. There are two important observations from the statistics. First, a large portion of good answers have query term hits in their URLs. Second, the distributions of good answers over the hit types are quite consistent across query types. Except for the Hit-None type, most of the good answers fall into the type Hit-Last for all three query types HP, NP and TD. Also, the type Hit-Second-Last has more good answers than the type Hit-Other. Thus, we expect to find a stable prior relevance probability for the URL hit types, which can be used uniformly in various tasks.

Table 2. Distribution of URL hit types (TREC2003)
URL Hit Type       HP    NP    TD
Hit-Last          136    86   129
Hit-Second-Last    21    17    21
Hit-Other           8    12     6
Hit-None           29    43   360
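To make the segment-based definition concrete, the following is a minimal sketch (not the authors' implementation) of classifying a URL into the four hit types. It assumes a predicate is_hit(word) that decides whether a URL word matches some query term, for which the 3-step recognizer of Section 3.2 would be plugged in; here a naive set-membership stub is used for illustration.

```python
import re
from urllib.parse import urlsplit

def url_hit_type(url: str, is_hit) -> str:
    """Classify a URL as Hit-Last / Hit-Second-Last / Hit-Other / Hit-None."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    path = parts.path.rstrip("/")          # drop a trailing slash, as in the paper
    segments = [parts.netloc] + [s for s in path.split("/") if s]
    # scan segments from the tail, looking for the segment with the last hit
    for i in range(len(segments) - 1, -1, -1):
        words = re.split(r"[.\-_~]+", segments[i].lower())
        if any(is_hit(w) for w in words if w):
            if i == len(segments) - 1:
                return "Hit-Last"
            if i == len(segments) - 2:
                return "Hit-Second-Last"
            return "Hit-Other"
    return "Hit-None"

# Example with the query "wireless communication" (stub hit detector):
is_hit = lambda w: w in {"wireless", "communication"}
print(url_hit_type("cio.doe.gov/wireless/3g/3g_index.htm", is_hit))  # Hit-Other
print(url_hit_type("cio.doe.gov/wireless/", is_hit))                 # Hit-Last
```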
Based on the above observations, we aim to assign each URL a prior relevance probability. Given a hit type $t$, this prior is used consistently for HP, NP and TD queries. Given a query $q$ and a page with URL $u$, we denote by $P(t)$ the probability of URL $u$ having hit type $t$ for the query, and by $P(R)$ the probability of $u$ being relevant to query $q$. $P(TD)$, $P(HP)$ and $P(NP)$ denote the probabilities of query $q$ being a TD, HP and NP query, respectively. Since NP, HP and TD are disjoint, we can estimate the prior for hit type $t$ by the following formula,

$$P(R \mid t) = P(R, TD \vee HP \vee NP \mid t) = P(R, TD \mid t) + P(R, HP \mid t) + P(R, NP \mid t)$$
By applying Bayes' formula [2], we get

$$P(R \mid t) = \frac{P(R, t \mid TD) \cdot P(TD)}{P(t)} + \frac{P(R, t \mid HP) \cdot P(HP)}{P(t)} + \frac{P(R, t \mid NP) \cdot P(NP)}{P(t)}$$

As $P(t) = P(t, TD \vee HP \vee NP) = P(t, TD) + P(t, HP) + P(t, NP)$, we have

$$P(R \mid t) = \frac{P(R, t \mid TD) \cdot P(TD) + P(R, t \mid HP) \cdot P(HP) + P(R, t \mid NP) \cdot P(NP)}{P(t \mid TD) \cdot P(TD) + P(t \mid HP) \cdot P(HP) + P(t \mid NP) \cdot P(NP)}$$
Given a training query set, the values of $P(R, t \mid TD)$, $P(t \mid TD)$ and $P(TD)$ can be roughly estimated by maximum likelihood estimation (the probabilities for HP and NP can be estimated in a similar way) as follows,

$$P(R, t \mid TD) \approx \frac{c_r(t, TD)}{n_{td} \cdot K}, \qquad P(t \mid TD) \approx \frac{c(t, TD)}{n_{td} \cdot K}, \qquad P(TD) \approx \frac{n_{td}}{n}$$
where $n_{td}$ and $n$ are the numbers of TD queries and of all queries, respectively. We denote by $c_r(t, TD)$ the total number of relevant pages in the top $K$ with hit type $t$ over all TD queries in the training data, and by $c(t, TD)$ the number of all pages (relevant or irrelevant) in the top $K$ for all TD queries. Please note that only the top $K$ result pages of each query are considered in counting Web pages. The estimated priors for the different URL hit types are shown in Table 3.
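Before the table of resulting priors, here is a rough sketch of the estimation above under assumed inputs: rel[task][t] plays the role of $c_r(t, task)$, all_[task][t] plays the role of $c(t, task)$, and n_q[task] is the number of training queries of each task. The container names are illustrative, not from the paper.

```python
def estimate_hit_prior(t, rel, all_, n_q, K):
    """Maximum likelihood estimate of P(R|t) from training counts."""
    n = sum(n_q.values())                       # total number of training queries
    num = den = 0.0
    for task in ("TD", "HP", "NP"):
        p_task = n_q[task] / n                  # P(task) ~ n_task / n
        # P(R,t|task)*P(task) and P(t|task)*P(task), both over top-K results
        num += rel[task].get(t, 0) / (n_q[task] * K) * p_task
        den += all_[task].get(t, 0) / (n_q[task] * K) * p_task
    return num / den if den else 0.0            # P(R|t)

# Toy usage with made-up counts (K = 100, query counts as in Section 4.1):
rel = {"TD": {"Hit-Last": 129}, "HP": {"Hit-Last": 136}, "NP": {"Hit-Last": 86}}
all_ = {"TD": {"Hit-Last": 900}, "HP": {"Hit-Last": 700}, "NP": {"Hit-Last": 800}}
print(estimate_hit_prior("Hit-Last", rel, all_, {"TD": 50, "HP": 150, "NP": 150}, 100))
```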
Table 3. Estimated URL hit priors

Type              Prior
Hit-Last          0.03273
Hit-Second-Last   0.00382
Hit-Other         0.00056
Hit-None          0.00349

3.2 URL Hit Recognition
The key to estimating and applying URL hit priors is to correctly identify URL hits. However, word usage in forming a URL is very different from word usage in composing a document. Our investigation shows that only about 30% of URL hits are
recognized by an ordinary word breaking method (see Section 4.2 for details). Therefore, we apply three URL hit recognition methods sequentially, to detect acronym hits, concatenation hits and fuzzy hits.

Step 1: Acronym Hit Recognition. Similar to [5], this method recognizes acronyms. The assumption is that an acronym is often the concatenation of the first character of each word in the full name. For example, "fdic" is the acronym of "Federal Deposit Insurance Corporation" in the URL http://www.fdic.gov/. Given an ordered list of query terms $Q = \langle q_1, \ldots, q_n \rangle$, after eliminating function words such as "of", "and" and "the" in $Q$, we get $Q' = \langle q'_1, \ldots, q'_m \rangle$. The first characters of all words in $Q$ are concatenated as $s$, and the first characters of all words in $Q'$ are concatenated as $s'$. Then $s'$ or $s$ is matched against the URL to see if any URL word is a substring of $s'$ or $s$. If matched, the URL word is mapped to the set of query terms.

Step 2: Concatenation Hit Recognition. This method recognizes a URL word that is a concatenation of whole query terms or their prefixes. For example, query 185 in the known-item task of TREC2003 is "Parent's Guide to Internet Safety" and the target URL is http://www.fbi.gov/publications/pguide/pguide.htm, where "pguide" concatenates the first character "p" of "Parent's" and the word "guide". The concatenated query terms are required to appear contiguously and in the same order as in $Q'$ or $Q$. A dynamic programming algorithm is used in this step (a sketch follows after the fuzzy-hit conditions below).

Step 3: Fuzzy Hit Recognition. In some other URL words, only parts of them match the whole or parts of query terms. We call such a hit a fuzzy hit. For example, given the query "FDA Human Gene Therapy" and the target document URL http://www.fda.gov/cberlinfosheets/genezn.htm, "gene" is a partial match of the URL word "genezn", which is a fuzzy hit. Given strings $a$ and $b$, the operation $|a|$ returns the number of characters in $a$, and the prefix match $a \cap b$ is defined as the longest prefix of $a$ that is also a substring of $b$. A URL word $u$ is recognized as a fuzzy hit for a query term $q$ if it satisfies the two conditions below.
$$1) \quad |q \cap u| > Threshold_1$$

$$2) \quad \frac{\sum_{q_j \in Q} |q_j \cap u|}{|u|} > Threshold_2$$
In our later experiments, $Threshold_1$ is set to 3 and $Threshold_2$ is set to 0.6. A more complex way of abbreviating may omit characters in the middle; for example, "standards" may be shortened to "stds". We do not address this complex case in this paper, as it occurs less often in our investigations.
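As an illustration of Steps 2 and 3, here is a sketch under the definitions above, not the authors' implementation: a simple recursive check of whether a URL word can be split into prefixes of consecutive query terms (the paper uses dynamic programming), and the two fuzzy-hit conditions with the stated thresholds.

```python
def prefix_match_len(q: str, u: str) -> int:
    """|q ∩ u|: length of the longest prefix of q that is a substring of u."""
    for k in range(len(q), 0, -1):
        if q[:k] in u:
            return k
    return 0

def is_concatenation_hit(u: str, terms) -> bool:
    """Step 2 sketch: u splits into non-empty prefixes of consecutive query
    terms, in query order, e.g. "pguide" = "p" ("parents") + "guide"."""
    def cover(pos, ti):
        if pos == len(u):
            return True
        if ti == len(terms):
            return False
        t = terms[ti]
        for k in range(len(t), 0, -1):      # try the longest prefix first
            if u.startswith(t[:k], pos) and cover(pos + k, ti + 1):
                return True
        return False
    # the run of consecutive terms may start anywhere in the query
    return len(u) > 0 and any(cover(0, i) for i in range(len(terms)))

def is_fuzzy_hit(u: str, terms, threshold1=3, threshold2=0.6) -> bool:
    """Step 3: conditions 1) and 2) above with Threshold1=3, Threshold2=0.6."""
    cond1 = any(prefix_match_len(q, u) > threshold1 for q in terms)
    cond2 = sum(prefix_match_len(q, u) for q in terms) / len(u) > threshold2
    return cond1 and cond2

print(is_concatenation_hit("pguide", ["parents", "guide", "internet", "safety"]))  # True
print(is_fuzzy_hit("genezn", ["fda", "human", "gene", "therapy"]))                 # True
```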
3.3 Combining URL Hit Priors into Retrieval Models
A classic probabilistic model is the binary independence retrieval (BIR) model introduced by Robertson and Sparck Jones [15]; please refer to [1] for more details. The ranking function well known as BM25 is derived from this model and has shown its power in TREC [14]. For our experiments, we choose BM25 as our basic ranking function, in which the retrieval status value (RSV) is computed as follows.

$$RSV(D, Q) = \sum_{i \in Q} \log\frac{N - df_i + 0.5}{df_i + 0.5} \cdot \frac{(k_1 + 1)\, tf_i}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf_i}$$
where $i$ denotes a word in the query $Q$, $tf_i$ and $df_i$ are the term frequency and document frequency of word $i$ respectively, $N$ is the total number of documents in the collection, $dl$ is the document length, $avdl$ is the average document length, and $k_1$, $b$ are parameters.

In our experiments, a document $D$ is represented by all of the text in the title, body and anchor (i.e., the anchor text of its incoming links), while the URL is treated as a special field labeled $U$. We linearly interpolate two scores based on $D$ and $U$ to get the final score:

$$S_{combi} = S_D + w \cdot S_U$$
Here, $w$ is the combination weight for the URL score. To make the combination easy, it is necessary to transform the original scores on $D$ and $U$ to the same scale and to eliminate the query-dependent factors. The original score on $D$ is $RSV(D, Q)$; we divide $RSV(D, Q)$ by the query-dependent factor below to get $S_D$, as Hu et al. did in [8]:

$$\sum_{i \in Q} (k_1 + 1) \log\frac{N - df_i + 0.5}{df_i + 0.5}$$

$S_U$ is the URL hit prior that we estimated in Section 3.1. Such a score is a probability and thus needs no transformation.
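A minimal sketch of this final scoring step, assuming per-term statistics are available from the index; the weight w = 0.15 is an assumed illustrative value inside the tuning range of Figure 1, not a value reported by the paper.

```python
import math

def s_d(term_stats, dl, avdl, N, k1=1.1, b=0.7):
    """Normalized BM25 score S_D over title/body/anchor text.

    term_stats: list of (tf, df) pairs for the query terms in document D.
    The raw RSV is divided by the query-dependent factor
    sum_i (k1+1)*log((N-df_i+0.5)/(df_i+0.5)) so scores are comparable
    across queries, following Section 3.3.
    """
    rsv = norm = 0.0
    for tf, df in term_stats:
        idf = math.log((N - df + 0.5) / (df + 0.5))
        rsv += idf * (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)
        norm += (k1 + 1) * idf              # query-dependent normalizer
    return rsv / norm if norm else 0.0

def s_combi(s_d_score, url_hit_prior, w=0.15):
    """S_combi = S_D + w * S_U, where S_U is the estimated URL hit prior."""
    return s_d_score + w * url_hit_prior
```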
4 Experiments
In this section, we report the results of three kinds of experiments: a) how many new query term hits can be found by using the URL hit recognition methods; b) the effectiveness of using URL hit priors in the probabilistic model; and c) the performance comparison between using the URL hit recognition methods and not using them.

4.1 Experimental Settings
Our experiments are conducted on the Web track datasets of TREC2002, TREC2003 and TREC2004. All of them use the ".GOV" Web page set, which was crawled in 2002. The topic distillation task of TREC2002 is not used because its relevance judgments are not consistent with the guidelines of TREC2003 and TREC2004 [10]. In order to evaluate the performance on different types of queries, we separate the queries of the known-item finding task of TREC2003 into HP and NP queries, and the queries of the mixed query task of TREC2004 into HP, NP and TD queries. In total, seven query sets are used, containing 300, 150, 150, 50, 75, 75 and 75 queries, respectively. In our retrieval experiments, the mean reciprocal rank (MRR) is the main measure for the named page finding and homepage finding tasks, while mean average precision (MAP) is the main measure for the topic distillation tasks.
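For reference, the two measures can be computed as in the sketch below; these are the standard definitions, not tied to the paper's evaluation scripts.

```python
def mrr(first_rel_ranks):
    """first_rel_ranks: for each query, the 1-based rank of the first
    relevant result, or None if no relevant result is retrieved."""
    return sum(1.0 / r for r in first_rel_ranks if r) / len(first_rel_ranks)

def average_precision(is_rel, n_rel):
    """is_rel: relevance flags of the ranked results for one query;
    n_rel: total number of relevant documents for that query."""
    hits, ap = 0, 0.0
    for i, rel in enumerate(is_rel, start=1):
        if rel:
            hits += 1
            ap += hits / i                  # precision at each relevant hit
    return ap / n_rel if n_rel else 0.0

def mean_average_precision(per_query):
    """per_query: list of (is_rel, n_rel) pairs, one per query."""
    return sum(average_precision(f, n) for f, n in per_query) / len(per_query)
```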
4.2 Experiments on URL Hit Recognition
The experiments on URL hit recognition are conducted on the URLs judged relevant in the TREC data. Two volunteers labeled all pairs in which a query term occurs in a URL word. We then applied the ordinary method and our 3-step method to automatically recognize URL hits and output such pairs, and calculated precision and recall. The ordinary method breaks words in a URL based on delimiters and then stems the words with the Porter stemmer; it achieves 100% precision but low recall, only about 33.2%. Our 3-step method doubles the recall while keeping precision high, at about 98.5%.
4.3 Experiments on Retrieval Performance
As described in Section 3.3, we use the score computed on all the text in the title, body and anchor by the BM25 formula as the baseline. In our experiments, parameters are trained on TREC2003. The optimized parameters of the BM25 formula are $k_1 = 1.1$, $b = 0.7$. Figure 1 shows the tuning curves on the training set; the starting point at the left is the baseline. The improvements are significant, and it is easy to find a common, stable and wide range of optimal parameter values for the three types of queries.
[Figure: two line charts plotting retrieval quality against the combination weight for the URL score (0 to 0.3); (a) MRR for 2003HP and 2003NP, (b) MAP for 2003TD]

Fig. 1. Tuning the combination weight on TREC2003 data. (a) shows the results for the HP and NP tasks in terms of MRR and (b) shows the result for TD in terms of MAP
On the test sets, the URL hit priors improve MRR by about 4% for named page finding queries and by about 11% for homepage finding queries, and improve MAP by about 10% for topic distillation queries (see Table 4). Therefore, it is safe to conclude that the improvement from the URL hit priors is stable across different types of queries. The improvement for the NP tasks is smaller than those for the HP and TD tasks, which may be caused by the relatively rare occurrence of query terms in file names.
4.4 Experiments on Using the 3-Step Recognition Method vs. Not Using It
It is necessary to evaluate how URL hit recognition affects the URL hit priors and the retrieval performance. Therefore, we use the ordinary word breaking method to recognize URL hits, apply the same approach to estimate the URL hit priors, and redo the retrieval experiments combining the priors with the basic content score. Figure 2 shows the results on the HP task of TREC2003. There is a big gap between the priors based on the different recognition methods. The same gap is also found for the other query sets and datasets; we omit the figures due to space limitations. In summary, the URL hit recognition methods are essential for fully exploiting the URL hit priors. If insufficient URL hits are detected, the URL hit priors are less useful for improving retrieval performance.

Table 4. Integrating URL hit priors in the probabilistic model
Query    S_D      S_combi   Improvement
2002NP   0.6294   0.6529    +3.73%
2004NP   0.5570   0.5818    +4.45%
2004HP   0.5404   0.6002    +11.07%
2004TD   0.1300   0.1436    +10.46%
[Figure: MRR on the TREC2003 HP task versus the combination weight for the URL priors (0 to 0.3), comparing the ordinary word break method and the 3-step method]

Fig. 2. Comparison of priors based on the ordinary word break method and our 3-step method
5 Conclusion and Future Work
Through observation and statistics, we found that the location at which a query term appears in a URL is closely related to whether the corresponding document is a good answer for homepage finding, named page finding and topic distillation queries. However, shortening and concatenation make it difficult to match a URL word with query terms. We proposed a 3-step method to recognize URL hits, which improves the recall of URL hits from 33% to 66% on the relevant URLs of three years of TREC data. Based on the recognized URL hits, URL hit priors are estimated and integrated into the probabilistic model. Experiments conducted on the TREC datasets show that the URL hit priors achieve stable improvement across various types of queries.

In the current implementation, URL hits are detected when a query is submitted to the search engine. This requires additional query processing time, which could be an issue when the approach is used in a real large-scale search engine. We leave offline URL hit recognition as future work. Our current experiments are based on TREC datasets, which contain little spam. As a next step, more experiments can be done on current real Web data to further test the effectiveness of our approach.
References

1. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
2. J. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, 1985.
3. K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proceedings of the 21st Annual International ACM SIGIR Conference, pages 104-111, Melbourne, Australia, August 1998.
4. A. Broder. A taxonomy of Web search. SIGIR Forum, 36(2), 2002.
5. C.-H. Chi, C. Ding and A. Lim. Word segmentation and recognition for web document framework. In Proceedings of CIKM'99, 1999.
6. N. Craswell, S. Robertson, H. Zaragoza and M. Taylor. Relevance weighting for query independent evidence. In Proceedings of ACM SIGIR'05, Salvador, Brazil, 2005.
7. D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the TREC-8 web track. In The Eighth Text REtrieval Conference (TREC-8), NIST, 2001.
8. Y. Hu, G. Xin, R. Song, G. Hu, S. Shi, Y. Cao and H. Li. Title extraction from bodies of HTML documents and its application to Web page retrieval. In Proceedings of SIGIR'05, Salvador, Brazil, 2005.
9. W. Kraaij, T. Westerveld and D. Hiemstra. The importance of prior probabilities for entry page search. In Proceedings of SIGIR'02, 2002.
10. U. Lee, Z. Liu and J. Cho. Automatic identification of user goals in Web search. In Proceedings of the Fourteenth International World Wide Web Conference (WWW2005), Chiba, Japan, 2005.
11. G. Marchionini. Interfaces for end-user information seeking. Journal of the American Society for Information Science, 43(2):156-163, 1992.
12. P. Ogilvie and J. Callan. Combining structural information and the use of priors in mixed named-page and homepage finding. In TREC2003, 2003.
13. D.-Y. Ra, E.-K. Park, and J.-S. Jang. Yonsei/ETRI at TREC-10: utilizing web document properties. In The Tenth Text REtrieval Conference (TREC-2001), NIST, 2002.
14. S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In The Eighth Text REtrieval Conference (TREC-8), 1999, pp. 151-162.
15. S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, 1976.
16. TREC-2004 Web Track Guidelines. http://es.csiro.au/TRECWeb/guidelines_2004.html
17. D. E. Rose and D. Levinson. Understanding user goals in Web search. In Proceedings of the Thirteenth International World Wide Web Conference (WWW2004), New York, USA, 2004.
18. T. Qin, T.-Y. Liu, X.-D. Zhang, Z. Chen and W.-Y. Ma. A study on relevance propagation for Web search. In Proceedings of the 28th Annual International ACM SIGIR Conference (SIGIR 2005), Salvador, Brazil, 2005.
19. Universal Resource Identifiers. http://www.w3.org/Addressing/URL/URI_Overview.html
20. J.-R. Wen, R. Song, D. Cai, K. Zhu, S. Yu, S. Ye and W.-Y. Ma. Microsoft Research Asia at the Web Track of TREC 2003. In The Twelfth Text REtrieval Conference, 2003.
21. T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs and anchors. In TREC2001, 2001.