PageRate: Counting Web Users’ Votes
Jianhan Zhu, Jun Hong, John G. Hughes
School of Information and Software Engineering, University of Ulster at Jordanstown
Newtownabbey, Co. Antrim BT37 0QB, United Kingdom
Phone: +44 (0)28 9036 8197
Fax: +44 (0)28 9036 6859
E-mail: {jh.zhu, j.hong, jg.hughes}@ulst.ac.uk
ABSTRACT
The Web link structure is a directed graph composed of Web pages as nodes and hyperlinks as links between the nodes. In our approach, the link structure is extracted from Web log files. Each record in Web log files of the Extended Log File Format [5] contains the URI (Uniform Resource Identifier) of a requested page and the URI of the referrer, that is, the page which the request came from and which usually has a hyperlink to the requested page. We use a set of link pairs extracted from Web log files to construct the Web link structure. Each link pair consists of the URI of a requested page, the URI of the referrer, and the link from the referrer to the requested page. The traversals of each link in the link structure are then aggregated, and the aggregates are assigned as weights to the corresponding links. The link structure is thus transformed into a directed weighted graph.
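The aggregation of link pairs into weights can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the log records have already been parsed into (referrer URI, requested URI) tuples, that a missing referrer is written as "-", and the function name `build_weighted_links` and the sample URIs are hypothetical.

```python
from collections import Counter

def build_weighted_links(records):
    """Aggregate (referrer, requested) pairs into link weights W(v, u).

    Each key is a link pair (referrer v, requested page u); each value is
    the number of recorded traversals of that link.
    """
    weights = Counter()
    for referrer, requested in records:
        # Assumption: requests without a referrer (for example, typed URLs)
        # are logged as "-" or empty and define no link.
        if referrer and referrer != "-":
            weights[(referrer, requested)] += 1
    return weights

# Hypothetical, already-parsed log records: (referrer URI, requested URI).
records = [
    ("/index.html", "/research.html"),
    ("/index.html", "/research.html"),
    ("/index.html", "/staff.html"),
    ("-", "/index.html"),
    ("/research.html", "/staff.html"),
]
weights = build_weighted_links(records)
```

The resulting mapping is the directed weighted graph described above: its keys are the links and its values are the aggregated traversal counts.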
We propose the PageRate method, which rates the Web pages on a Web site based on the Web link structure and user usage data, both of which are recorded in the Web log files. The method is an improvement over PageRank [1, 6]. PageRate can be used to objectively evaluate the importance of pages. A PageClustering algorithm is proposed to cluster Web pages with similar incoming links and ratings. The results are integrated with the search results returned by search engines.
KEYWORDS: Web link structures, Web log files, clustering, rating
INTRODUCTION
A Web site usually contains a considerable number of Web pages, and these pages are highly heterogeneous in both type and content. It is useful to evaluate the importance of Web pages from the users’ point of view for both recommendation and site administration. Brin and Page [1, 6] proposed the PageRank method, which ranks Web pages based on the link structure of Web pages and integrates the rankings with keyword-based search results. The PageRank method has been implemented in Google [4] to sort search results. Web log files, on the other hand, record every page request of past users and can be used as usage data to rate Web pages in a more objective and user-centric way. Our proposed PageRate method has two phases. First, the Web link structure and the Web users’ visiting behaviors are used for page rating. Second, Web pages are compared with each other on the similarity of their incoming links, and Web pages with similar incoming links and ratings are clustered together to enhance the page rating results.
We calculate the rating, $R(u, Q)$, of a Web page $u$ for a query $Q$ as follows.
$$R(u, Q) = a \times RL(u, Q) + b \times RC(u, Q)$$
$$RL(u, Q) = \sum_{v \in B(u)} RL(v, Q) \times r(v, u)$$
$$r(v, u) = \frac{W(v, u)}{IN(v)}$$
$$IN(v) = \sum_{i \in B(v)} W(i, v)$$
where $RL(u, Q)$ is the link structure-based (LS-based) rating of page $u$; $RC(u, Q)$ is the content-based rating [3] of $u$, which measures the relevancy of page $u$ to the query $Q$; $a$ and $b$ are the contributing factors for the two parts of the rating value respectively [3]; $B(u)$ is the set of pages that point to $u$; $RL(v, Q)$ is the LS-based rating of a page $v$ in $B(u)$; $B(v)$ is the set of pages that point to $v$; $r(v, u)$ is the normalized weight on link $v \to u$; $W(v, u)$ is the number of traversals on link $v \to u$; and $IN(v)$ is the sum of weights on all incoming links of $v$.
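The definitions of $IN(v)$, $r(v, u)$ and the combined rating can be worked through on a small example. The sketch below is illustrative only: the function names, the tiny link structure and the values chosen for the contributing factors are assumptions, and links leaving a page with no recorded incoming traversals are simply skipped because $r(v, u)$ is undefined for them.

```python
from collections import defaultdict

def normalized_weights(W):
    """Return IN(v) and r(v, u) = W(v, u) / IN(v) for a dict W[(v, u)]."""
    incoming = defaultdict(float)
    for (i, v), w in W.items():
        incoming[v] += w          # IN(v): sum of weights on links i -> v
    r = {}
    for (v, u), w in W.items():
        # Assumption: skip links whose source has no recorded incoming weight.
        if incoming[v] > 0:
            r[(v, u)] = w / incoming[v]
    return dict(incoming), r

def combined_rating(rl, rc, a, b):
    """R(u, Q) = a * RL(u, Q) + b * RC(u, Q)."""
    return a * rl + b * rc

# Hypothetical traversal counts W(v, u) on a three-page site.
W = {
    ("home", "a"): 10,
    ("a", "b"): 6,
    ("a", "home"): 4,
    ("b", "home"): 6,
}
incoming, r = normalized_weights(W)
# Hypothetical LS-based and content-based ratings with a = 0.7, b = 0.3.
rating = combined_rating(rl=0.5, rc=0.2, a=0.7, b=0.3)
```

In this example page `a` receives 10 traversals, so the links `a -> b` and `a -> home` carry normalized weights 0.6 and 0.4.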
PAGERATE METHOD
The Web link structure is a directed graph, which is
Unlike PageRank, in which the LS-based ranking of page u is evenly divided among its forward links to contribute to the LS-based rankings of the pages u points to, PageRate distributes the LS-based rating of page u based on past users’ visiting behaviors. The normalized weights on the links from u to the pages it points to are used to distribute the LS-based rating of u. Taking both links and users’ preferences into consideration is a more comprehensive way to rate pages. Pages that are linked from a highly rated page do not necessarily have the same level of importance within that page. It is obvious that a link in
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HT'01 8/01 Aarhus, Denmark © 2001 ACM ISBN 1-59113-420-7/01/0008…$5.00
$$\sum_{v \in B(u)} RL(v, Q) \times r(v, u)$$
a prominent position, in large font and in bold face is viewed by the Web designer as more important and will normally receive many more hits from Web users than a link in an unnoticeable corner, in small font and in normal face. The Web log files faithfully reflect this. The visits of Web users will therefore more or less reflect the importance of a Web page, which is what the page ratings given by the PageRate method mean. In the PageRate method, the more past users have followed a certain link, the more important the page it points to is regarded. Rating is thus biased towards the pages that have been followed the most, whereas in PageRank [1, 6] all links from a page are counted equally.
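The contrast between the two ways of distributing a page's rating can be illustrated with a few lines of Python. This is only a sketch: the traversal counts and page names are made up, PageRank's damping factor is ignored, and the usage-based shares are normalized by the total outgoing traversals of the page, which matches the normalized weights only when incoming and outgoing weights are balanced.

```python
def even_split(rating, forward_links):
    """PageRank-style: each forward link receives the same share."""
    share = rating / len(forward_links)
    return {target: share for target in forward_links}

def usage_split(rating, traversals):
    """PageRate-style: shares follow the recorded traversal counts."""
    total = sum(traversals.values())
    return {target: rating * count / total
            for target, count in traversals.items()}

# Hypothetical traversal counts on the three forward links of one page.
traversals = {"/news.html": 80, "/about.html": 15, "/legal.html": 5}
pagerank_shares = even_split(1.0, list(traversals))
pagerate_shares = usage_split(1.0, traversals)
```

With these counts every link receives one third of the rating under the even split, while the usage-based split passes most of the rating to the heavily followed link.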
Web sites.
RELATED WORK
Brin and Page et al. [1, 6] proposed ranking Web pages based on the Web link structure. Chakrabarti et al. [2] proposed detecting authorities and hubs from the Web link structure and using the authoritative pages to improve page ranking in PageRank [1, 6]. Ding and Chi [3] discussed an adaptive and task-specific ranking mechanism which takes into consideration the contents of pages, the link structure, and link usage data. Perkowitz and Etzioni [7] proposed a conceptual clustering mining method, based on co-occurrences of Web pages in Web users’ visits recorded in the Web log files, to generate index pages for Web site adaptation.
PAGERATE COMPUTATION
Since Web users can choose to end their visits on any Web page, we add an exit node to the link structure. Due to influences such as caching, the total weight on all incoming links of a page $v$, $IN(v)$, is not always equal to the total weight on all its outgoing links, $OUT(v)$. We can either assign the extra incoming weight to the link to the exit node or distribute the extra outgoing weight among the incoming links so that $IN(v) = OUT(v)$. We define a probability transition matrix $P$ over the link structure, with rows and columns corresponding to Web pages: $P_{u,v} = r(u, v)$ if there is a link from $u$ to $v$, and $P_{u,v} = 0$ otherwise. Page ratings can be calculated using an iterative algorithm [2, 6].
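One possible form of such an iterative computation is sketched below. It is not the authors' algorithm, and it rests on several assumptions that the text does not state: the exit-node balancing is omitted, each link weight is normalized by the total outgoing weight of its source page (which equals $IN(v)$ once the weights are balanced), a damping factor is borrowed from PageRank [6] to guarantee convergence, and rating held by pages without outgoing links is spread evenly. The function name `pagerate` and the sample data are hypothetical.

```python
def pagerate(W, damping=0.85, iterations=100, tol=1e-10):
    """Usage-weighted power iteration over link weights W[(v, u)]."""
    pages = sorted({p for link in W for p in link})
    n = len(pages)
    out_total = {p: 0.0 for p in pages}
    for (v, u), w in W.items():
        out_total[v] += w

    rating = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rating of pages with no outgoing weight is spread over all pages.
        dangling = sum(rating[p] for p in pages if out_total[p] == 0)
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for (v, u), w in W.items():
            if out_total[v] > 0:
                # Page v passes its rating along v -> u in proportion to usage.
                new[u] += damping * rating[v] * w / out_total[v]
        delta = sum(abs(new[p] - rating[p]) for p in pages)
        rating = new
        if delta < tol:
            break
    return rating

# Hypothetical traversal counts: "home" links to "a" and "b", but users
# follow the link to "a" nine times as often as the link to "b".
W = {
    ("home", "a"): 9,
    ("home", "b"): 1,
    ("a", "home"): 9,
    ("b", "home"): 1,
}
ratings = pagerate(W)
```

In this example pages `a` and `b` have identical link structure, so an even split would rate them equally; the usage-weighted iteration rates `a` higher because its incoming link is followed more often.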
CONCLUSION
In this paper, we propose a PageRate method that takes into consideration both the link information and users’ visits to a Web site. A PageClustering algorithm based on incoming link similarity is then used to cluster Web pages with similar ratings together to form conceptually described clusters of Web pages. The results are integrated with search results to present organized, easy-to-use results to Web users. A prototype system that integrates search results with PageRate-based ratings has been set up, and the initial results show that the system can adapt the ordering of search results to Web users’ changing interests and provide better performance than traditional ranking systems.
PAGECLUSTERING ALGORITHM
After page ratings are calculated, Web pages with similar ratings still do not necessarily have similar contents or navigational functions. By taking into consideration the incoming links and the transition probabilities on them, we cluster Web pages that have similar incoming links and ratings, so that the clusters can be integrated with search results and give them more semantic meaning. We define the incoming link similarity of two Web pages as the accumulated difference of the transition probabilities on their incoming links. By setting a threshold, Web pages are clustered together based on both incoming links and ratings. The clustering algorithm reflects the observation that Web pages that are linked from a similar set of pages and receive a similar number of hits from these pages tend to have similar contents or navigational functions. Each cluster of pages can be given a description based on concept learning [7].
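The clustering idea can be sketched as follows. The text does not specify the procedure, so everything beyond the accumulated-difference measure is an assumption: the difference is taken as a sum of absolute differences over all source pages, two separate thresholds are used for links and ratings, and pages are merged transitively with a union-find structure. All names and values are hypothetical.

```python
def incoming_distance(r, p, q):
    """Accumulated difference of transition probabilities on the incoming
    links of pages p and q; r maps (source, target) to a probability."""
    sources = {v for (v, u) in r if u in (p, q)}
    return sum(abs(r.get((v, p), 0.0) - r.get((v, q), 0.0)) for v in sources)

def cluster_pages(r, ratings, link_threshold, rating_threshold):
    """Group pages whose incoming links and ratings are both similar."""
    pages = sorted(ratings)
    parent = {p: p for p in pages}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            similar_links = incoming_distance(r, p, q) <= link_threshold
            similar_rating = abs(ratings[p] - ratings[q]) <= rating_threshold
            if similar_links and similar_rating:
                parent[find(p)] = find(q)   # merge the two clusters

    clusters = {}
    for p in pages:
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())

# Hypothetical transition probabilities and ratings.
r = {
    ("home", "a"): 0.50,
    ("home", "b"): 0.45,
    ("home", "c"): 0.05,
    ("news", "c"): 1.00,
}
ratings = {"a": 0.30, "b": 0.28, "c": 0.10}
clusters = cluster_pages(r, ratings, link_threshold=0.1, rating_threshold=0.05)
```

Here `a` and `b` are reached from the same page with nearly the same probability and have close ratings, so they fall into one cluster, while `c` stays on its own.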
REFERENCES
1. S. Brin and L. Page (1998). The anatomy of a large-scale hypertextual Web search engine. In Proc. of WWW7, pages 107-117, Brisbane, Australia.
IMPLEMENTATION
Page ratings obtained from PageRate can be used to rank search results that match the search text and satisfy the other specified search conditions [4]. The clusters generated from PageClustering can be used to group search results. Generally, we order the clusters by the accumulated ratings of all the Web pages in each cluster and order the pages within each cluster by their individual ratings. PageRate and PageClustering can be easily used for searching within a Web site. If searching with a global search engine like Google [4], some form of integration of ratings and search results needs to be done across multiple Web sites.
2. S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg (1999). Mining the web's link structure. Computer, 32:60-67.
3. Chen Ding, Chi-Hung Chi (2000). Towards an adaptive and task-specific ranking mechanism in web searching. In Proc. of ACM SIGIR’00, pages 375-376, Athens, Greece.
4. Google Search Engine. http://www.google.com.
5. Phillip M. Hallam-Baker, Brian Behlendorf (1996). Extended Log File Format. W3C Working Draft WD-logfile-960323. http://www.w3.org/TR/WD-logfile.
6. L. Page, S. Brin, R. Motwani, and T. Winograd (1998). The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford University.
7. Mike Perkowitz, Oren Etzioni (1999). Adaptive Web Sites: Conceptual Clustering Mining. In Proc. of IJCAI99