2009 Fourth International Conference on Computer Sciences and Convergence Information Technology
PageRank vs. Katz Status Index, a Theoretical Approach
Nguyen Quang Phuoc
Sung-Ryul Kim*
Han-Ku Lee
Department of Advanced Technology Fusion Konkuk University Seoul, Korea
[email protected] [email protected]
HyungSeok Kim
Department of Internet & Multimedia Engineering Konkuk University Seoul, Korea {hlee,hyuskim}@konkuk.ac.kr
* contact author
Abstract: In World Wide Web search engines, it is important to have a good ranking system. One of the most famous ranking components is the PageRank system by Google. However, PageRank is protected by patents and it is impossible for other companies to use it in their search engines. There is an old model, called Katz status index, that is reported to work very similarly to PageRank. If the quality of Katz status index turns out to be similar to or better than that of PageRank, it could become a patent-free alternative to PageRank. In this paper we consider the problem of comparing Katz status index to PageRank, present some preliminary results on the theoretical comparison, and give a proposal for a practical comparison of the two models.
Keywords: World Wide Web; search engines; PageRank; Katz status index
I. INTRODUCTION
The tremendous size of the World Wide Web (or Web for short) makes the use of search engines an absolute necessity. Typically, the query to a search engine consists of just a few keywords and the search engine finds the Web pages that contain all of the given keywords. Because the number of keywords in the query is small, there is a tremendous number of results unless the query contains a very specific combination of keywords. In many cases, what the user wants is a small set of pages that are relevant to what he or she has in mind, not just any page that contains all of the keywords that he or she has given to the search engine. For example, when the query is "apple computer," what the user intends is most likely the Apple Computer site. However, many other pages also contain both keywords and become legitimate (but irrelevant) results. If the results are given without any ordering, then the results become useless to the user. So the issue for the search engine is to find the relevant pages and show the relevant ones first.

Many heuristics are used to compute the relevance of a page. One is the use of the content of a page and its anchor text, i.e., the text that appears close to the link to the page from some other page. Some examples are the relative frequency of the keywords, the location of keywords such as being in the title or appearing close to the start of a page, and the proximity of keywords, i.e., how close together the keywords appear in a page [8].

978-0-7695-3896-9/09 $26.00 © 2009 IEEE DOI 10.1109/ICCIT.2009.272
There are models that use the link structure of the Web to determine the relative importance (or popularity) of the pages and use the score as a factor in the ranking. A simple (but not very useful) example is the method of counting the backlinks that come into a page. A backlink is a link from some other page that points to the page. Other examples with better results include the hub and authority model by Kleinberg [6], the PageRank model by Page et al. [2, 7], and the status index method by Katz [4], which is a generalization of the backlink counting method. There are also a few similar methods available in the literature [1, 3].

We focus on PageRank and the status index method in this paper. Our main contribution is to propose Katz status index as an alternative to patent-protected PageRank. There is a report [5] which implies that Katz status index may work better than PageRank in some cases. Also, as we show in this paper, there are hints that PageRank and Katz status index may turn out to be very similar, practically indistinguishable, or even the same, albeit with some modifications, provided that the modified models stay well within the ideas behind the original models.

The paper is structured as follows. In Section II, we give a broad idea of the methodology of the comparison. In Section III, we define the models in some detail. In Section IV, we present a few modifications to the models and give some results from a theoretical comparison of the modified models. In Section V, we give a proposal for a practical comparison of the models, and in Section VI, we conclude.

II. NECESSITY AND DIFFICULTY OF COMPARISON
Ranking in search engines depends on quite a number of factors, and it is not easy to measure the contribution of an individual factor. Also, the quality of a ranking system is distinctly subjective, and thus it is in itself a difficult problem to measure the overall ranking quality of a search engine. To some extent, the relative numbers of users that search engines attract can be used as an indirect measure of the quality of the ranking system and, in turn, of the quality of PageRank or other factors with similar kinds of contributions to the overall ranking. There seems to be some, although not absolute, consensus among professionals that PageRank gives,
arguably, the best results, even though the grounds for this opinion cannot be backed by convincing theories or exact numbers. The drawback, for anyone outside Google, is that PageRank is patent-protected. Thus it is not possible for search engines other than Google to use PageRank in their ranking systems. However, there is a report [5] that Katz status index may work very similarly to PageRank, or even better than PageRank in some circumstances. If it can be verified, theoretically and practically, that Katz status index is very similar to or better than PageRank, then Katz status index will provide a nice alternative to patent-protected PageRank.

The problem is that it is very difficult to define what it means for the two models to be the same or equivalent. In a strictly theoretical sense, we may conclude that the two models are different if there is just one case where they give different results, but that would be too simplistic. In fact, the models themselves are just approximations to the real popularity of the pages, and thus slightly different results are not enough to conclude that the models are different. Further, the models can be modified in several ways, and it is still possible for some of the modified versions to be exactly the same. In the following we make several attempts to see if and how the comparison can be done in a way that makes practical sense.

III. BASIC DEFINITIONS

We model the Web as a directed graph G = (V, E) where V is the set of all pages in the Web and E is the set of directed links. In the PageRank model the rank PR(vi) of a page vi is defined as

$$PR(v_i) = \frac{1-d}{N} + d \sum_{v_j \in M(v_i)} \frac{PR(v_j)}{L(v_j)} \tag{1}$$

where the sum ranges over the set M(vi) of all pages vj that have a link to vi, N is the total number of pages, L(vj) is the number of outgoing links from vj, and d is the damping factor. The PageRank model can be considered to be a random walk model. That is, the PageRank of a page vi is the probability that a random walker (which continues to follow arbitrary links to move from page to page) will be at vi at any given time. The damping factor d corresponds to the probability that the random walker follows a link; with probability 1 - d it jumps to an arbitrary page on the Web instead. The jump is required to reduce the effects of loops and dangling links in the Web on the PageRank computation.

In the status index method, which is a generalization of the backlink-counting method, the status index of a page is determined by the number of directed paths that end in the page, where the influence of longer paths is attenuated by a decay factor. The length of a path is defined to be the number of edges it contains. The status index I(vi) of a page vi is formally defined as

$$I(v_i) = \sum_{k=0}^{\infty} \alpha^k N(v_i, k)$$

where N(vi, k) is the number of paths of length k that start at any page and end at vi, and α is the decay factor. Solutions for all the pages are guaranteed to exist as long as α is smaller than λ < 1, where 1 / λ is the maximum in-degree of any page.

For both methods above, it is not practical to compute the exact solution for a large database. Instead, various schemes are used to compute an approximate solution.

IV. THEORETICAL COMPARISON

As we can see in the definitions above, PageRank and Katz status index look quite different, and it seems impossible for them to give exactly the same results, even for a very restricted set of graphs. However, there is a way to modify the definition of Katz status index so that the modified definition looks remarkably similar to that of PageRank. See the following figure and lemma.
Figure 1. Paths of lengths k and k + 1
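The relation between paths of lengths k and k + 1 also gives a direct way to compute the path counts N(vi, k) that appear in the definition of the status index. The following Python sketch is illustrative only: the function names, the edge-list representation, the three-page example graph, the convention N(vi, 0) = 1, and the truncation depth are all assumptions made for the example and are not part of the original model.

```python
from collections import defaultdict


def path_counts(nodes, edges, max_len):
    """Return counts[k][v] = number of directed paths of length k ending at v.

    A path of length 0 is taken to be a single node, so counts[0][v] = 1
    (an assumed convention). Every path of length k + 1 ending at v is a
    path of length k ending at some predecessor u, extended by edge (u, v).
    """
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    counts = [{v: 1 for v in nodes}]
    for _ in range(max_len):
        prev = counts[-1]
        counts.append({v: sum(prev[u] for u in preds[v]) for v in nodes})
    return counts


def katz_truncated(nodes, edges, alpha, max_len):
    """Approximate I(v) by summing alpha**k * N(v, k) for k = 0..max_len."""
    counts = path_counts(nodes, edges, max_len)
    return {v: sum(alpha ** k * counts[k][v] for k in range(max_len + 1))
            for v in nodes}


# Hypothetical three-page graph: a -> b, a -> c, b -> c.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("b", "c")]
counts = path_counts(nodes, edges, 2)
status = katz_truncated(nodes, edges, alpha=0.5, max_len=2)
```

Because this example graph has no cycles and no path longer than two edges, the truncated sum already equals the full series; on graphs with cycles the truncation depth controls the approximation error.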
Lemma 1. If M(vi) is the set of nodes that have edges to vi, then

$$I(v_i) = \beta + \sum_{v_j \in M(v_i)} \alpha \cdot I(v_j) \tag{2}$$
where β = N(vi, 0).

Proof. The last edge of any path of length k + 1 from any node s to vi is an edge from some node vj in M(vi) to vi. By eliminating the last edge from the path, we find a path of length k from s to vj. Conversely, if we have a path of length k from any node s to some vj in M(vi), we find a path of length k + 1 from s to vi by adding (vj, vi) at the end of the path. Thus, we have found a one-to-one correspondence, so N(vi, k + 1) is the sum of N(vj, k) over all vj in M(vi). Substituting this into the definition of I(vi) and separating the term for k = 0 gives (2). □

Using the definitions (1) and (2), it is now possible to compare the two models more directly. First, we find a sufficient, and possibly necessary, condition on the graph G under which the two models give exactly the same results. As mentioned earlier, this condition makes sense only in the strictest theoretical sense and may have no practical meaning. However, this restriction is not a bad thing, because under a less restrictive notion of equivalence, more cases of equivalence might be deduced.

Comparing (1) and (2), one can easily see that the two definitions become exactly the same when β = (1 - d) / N and α = d / L(vj). The first condition can be easily satisfied because β is a customizable parameter in Katz status index (the number of paths of length zero is a matter of convention) and d is also a customizable parameter in PageRank. We have to mention that the fact that the first condition can be satisfied may have no meaning in practical terms, because the values of β and/or d with which the first condition is satisfied may not give practically good results when actual computation is performed on real graphs.

That being said, the second condition is unlikely to be satisfied by a real graph. Because L(vj) is the number of outgoing links from node vj and α is a constant, the number of outgoing links has to be the same for every page in the graph, which is highly unlikely in a graph modeled from the real Web. However, we can conclude that there exists a set of graphs, albeit a very restricted one, on which PageRank and Katz status index give exactly the same result.
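The coincidence condition can be checked numerically on a small graph in which every page has the same number of outgoing links. The Python sketch below is only an illustration under stated assumptions: the function names, the fixed-point iteration scheme, the iteration count, the four-page example graph, and the value d = 0.5 are all hypothetical choices, not taken from the original models. The value d = 0.5 is picked so that α = d / 2 = 0.25 stays below the reciprocal of the maximum in-degree (1/3 in this graph), which is the existence condition stated with the definition of the status index.

```python
def pagerank(nodes, edges, d, iters=200):
    """Iterate definition (1): PR(v) = (1 - d)/N + d * sum of PR(u)/L(u)."""
    n = len(nodes)
    out_deg = {v: 0 for v in nodes}
    for u, _ in edges:
        out_deg[u] += 1
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - d) / n for v in nodes}
        for u, v in edges:
            nxt[v] += d * pr[u] / out_deg[u]
        pr = nxt
    return pr


def katz(nodes, edges, alpha, beta, iters=200):
    """Iterate relation (2): I(v) = beta + sum over predecessors of alpha * I(u)."""
    idx = {v: beta for v in nodes}
    for _ in range(iters):
        nxt = {v: beta for v in nodes}
        for u, v in edges:
            nxt[v] += alpha * idx[u]
        idx = nxt
    return idx


# Hypothetical graph: every page has exactly two outgoing links,
# while the in-degrees differ (3, 1, 3, 1).
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 3), (3, 0), (3, 2)]
d, out_links = 0.5, 2

pr = pagerank(nodes, edges, d)
ks = katz(nodes, edges, alpha=d / out_links, beta=(1.0 - d) / len(nodes))

# Largest difference between the two score vectors; it should be
# negligible when beta = (1 - d)/N and alpha = d/L hold for all pages.
max_gap = max(abs(pr[v] - ks[v]) for v in nodes)
print(max_gap)
```

If one of the pages is given a different number of outgoing links, no single constant α can equal d / L(vj) for every page, and the two score vectors should in general diverge.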
The above result shows that, theoretically, the original definitions of PageRank and Katz status index give incompatible results, because the set of graphs on which the two models give the same result is too restrictive. However, both PageRank and Katz status index can be, and are being, modified in nontrivial ways to give better practical results and to be applied to specific cases of search engines. These possibilities of modification give us new ways of showing the equivalence of the models in the following sense: modify one or both of the models so that they become mutually equivalent. If the modifications stay within the principles of the original definitions and give reasonable results when applied to real graphs, the equivalence of the two models can still be practically established. There are many ways in which we may modify the models, and the difficulty lies in the fact that there seems to be no theoretically satisfying way of classifying them. In the following, we give some possible modifications to one or both of the models and discuss their implications for the equivalence of the models.

A. Modify Katz Status Index to equal PageRank

One obvious modification to Katz status index is to introduce a new factor in (2) so that αI(vj) becomes αD(vj)I(vj), where D(vj) is a new per-node parameter. This new factor can be justified within the Katz status index idea as the rate at which each node propagates its incoming flow to adjacent nodes. If this new factor is set so that αD(vj) = d / L(vj) for every node, then PageRank and Katz status index become completely equivalent for any possible graph. One might say from this result that PageRank is a special case of Katz status index. We have to note that it is also possible to modify PageRank to become completely equivalent to Katz status index. However, in that case, the modified PageRank is no longer a random walk model because it can no longer be modeled from a probabilistic standpoint.

B. Application of Selective Source

Selective source is an idea where the random jump does not reach every node with equal probability as in the original definition of PageRank. Instead, each node has a different value for the parameter (1 - d) / N. Usually, however, the sum of this parameter over all nodes amounts to (1 - d), so that PageRank can still be modeled from a probabilistic standpoint. This idea can be directly applied to Katz status index by modifying the definition of β to be different for each node. Thus, even within this idea, Katz status index may be modified to be equivalent to PageRank.

C. Application of Weighted Links

Weighted links is an idea where the contribution of each link differs across the graph. In PageRank, this idea is implemented by changing the value of L(vj) for each node and each link. The same idea is readily applicable to Katz status index as well.

D. Application of Selective Source and Weighted Links

It is also possible to apply both selective source and weighted links to both PageRank and Katz status index. In this case also, Katz status index turns out to be modifiable to be equivalent to PageRank.

V. PROPOSAL FOR PRACTICAL COMPARISON
The discussions up to this point have shown that, in some sense, Katz status index may be considered a more general form of PageRank because it can be modified, within a reasonable range, to be equivalent to PageRank. This comparison was made under a very strict definition of equivalence, namely that the two models give exactly the same result. Considering the fact that both PageRank and Katz status index are approximations of the real popularity or authority of Web pages in real graphs, we can see that the definition of equivalence we have been using is far too restrictive. Thus we may relax the condition of equivalence so that, for example, values within 10% of each other are considered to be equivalent. Note, however, that theoretical analysis under such definitions of equivalence may become quite difficult.

Under such a relaxed definition of equivalence, it may turn out that even the original definitions of PageRank and Katz status index are equivalent for a large group of graphs which may, in turn, include a large group of practically possible Web graphs. If that is the case, then we may safely say that PageRank and Katz status index are practically equivalent to each other. A lot more analysis has to be performed to confirm that this speculation is actually true. The experiments with random graphs in [5] give some credibility to this speculation.

VI. CONCLUSIONS
We have considered the problem of equivalence between PageRank and Katz status index and found some promising theoretical results that might point to the ultimate equivalence between the two models even under a very strict definition of equivalence. We have also discussed that, under a more practical definition of equivalence, the two models might indeed turn out to be equivalent. Further work is to extend the analysis so that it is feasible under a practical definition of equivalence.

ACKNOWLEDGMENT

This work was supported by NAP of Korea Research Council of Fundamental Science & Technology.
REFERENCES

[1] P. Bonacich and P. Lloyd, "Eigenvector-like measures of centrality for asymmetric relations," manuscript.
[2] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks and ISDN Systems, 30(1-7), 107-117, 1998.
[3] C. H. Hubbell, "An input-output approach to clique identification," Sociometry, 28, 377-399, 1965.
[4] L. Katz, "A new status index derived from sociometric analysis," Psychometrika, 18, 39-43, 1953.
[5] S.-R. Kim, "Efficient sequential and parallel algorithms for popularity computation on the World Wide Web with applications against spamming," LNCS 3045, 367-375, 2004.
[6] J. Kleinberg, "Authoritative sources in a hyperlinked environment," Journal of the ACM, 46(5), 604-632, 1999.
[7] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the Web," Technical report, Stanford University, 1998.
[8] K. Sadakane and H. Imai, "Fast algorithms for k-word proximity search," IEICE Trans. Fundamentals, Vol. E84-A, No. 9, 312-319, Sep. 2001.