Applied Mathematics and Computation 179 (2006) 799–806

A note on the PageRank algorithm

Huan Sun 1, Yimin Wei *

School of Mathematical Sciences, Fudan University and Key Laboratory of Mathematics for Nonlinear Sciences, Ministry of Education, Shanghai 200433, PR China
Abstract

In this paper we present some notes on the PageRank algorithm, including its L1 condition number and some observations from numerical tests of two variant algorithms based on the extrapolation method.

© 2005 Elsevier Inc. All rights reserved.

Keywords: PageRank; Condition number; Eigenvalue problem
1. Introduction

With the booming development of the Internet, web search engines [2–5,10,12,14,16–18] have become the most important tools for retrieving information online. Among the thousands of search engines built on various algorithms that have emerged in recent years (there were more than 3500 different search engines by the year 2000 according to [13,15]), Google has become one of the most popular and successful. Google's triumph is largely attributable to its simple but elegant algorithm: PageRank.

PageRank is a link-structure-based algorithm, which ranks the importance of all the pages crawled by Google's web crawler. Computing PageRank amounts to computing the stationary distribution of a transition matrix, also called the Google matrix, which is built from the web graph structure. The overwhelming size of the Google matrix rules out many popular and intricate algorithms that perform excellently at ordinary scales. By comparison, the simple power method stands out for its stable and reliable performance in spite of its low efficiency, so variants have been introduced to improve its convergence speed; studying such variants is part of our work in this paper. We begin with a brief review of the PageRank algorithm and then present our work in the following sections.
This project is supported by the National Natural Science Foundation of China under Grant 10471027 and the Shanghai Education Committee.
* Corresponding author. E-mail addresses: [email protected] (H. Sun), [email protected] (Y. Wei).
1 Present address: Department of Mathematics, The Pennsylvania State University, USA.
doi:10.1016/j.amc.2005.11.120
2. Notations and preliminaries

We consider a directed graph based on the hyperlink structure of the global web [4,5], in which every page is a node and there is an edge from node u to node v if page u links to page v. We define the matrix P elementwise as follows:

$$p_{i,j} = \begin{cases} 1/|P_i|, & \text{if page } P_i \text{ links to page } P_j, \\ 0, & \text{otherwise}, \end{cases}$$

where |P_i| denotes the outdegree of page P_i. To make P a transition probability matrix, we modify it as

$$\bar{P} = P + dv^T, \tag{1}$$

where d is an n-dimensional vector (n is the total number of pages in the web, i.e., the order of P) satisfying

$$d_i = \begin{cases} 1, & \deg(i) = 0, \\ 0, & \text{otherwise}, \end{cases}$$

and v is a distribution vector. Google usually sets it to the uniform distribution, i.e., $v = (1, 1, \ldots, 1)^T/n$. In order to force irreducibility, we make a further modification of $\bar{P}$:

$$\bar{\bar{P}} = c\bar{P} + (1 - c)ev^T, \tag{2}$$

where $e = (1, 1, \ldots, 1)^T$ and c is a parameter ranging from 0 to 1. For convenience, we usually use the transpose of $\bar{\bar{P}}$ in computation and call it the Google matrix, denoted by G:

$$G = \bar{\bar{P}}^T = [c\bar{P} + (1 - c)ev^T]^T, \tag{3}$$

and the famous PageRank vector is the stationary distribution of G, i.e.,

$$Gp = p. \tag{4}$$
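As a concrete illustration of Eqs. (1)–(4), here is a minimal dense sketch in Python (our own, not from the paper; the adjacency array `links` and all function names are hypothetical, and v is taken uniform):

```python
import numpy as np

def google_matrix(links, c=0.85):
    """Build the dense Google matrix G = [c*P_bar + (1-c)*e*v^T]^T from an
    adjacency matrix `links` (links[i, j] = 1 if page i links to page j)."""
    n = links.shape[0]
    v = np.full(n, 1.0 / n)                  # uniform personalization vector
    out = links.sum(axis=1)                  # outdegrees |P_i|
    P = np.where(out[:, None] > 0,
                 links / np.maximum(out, 1)[:, None], 0.0)
    d = (out == 0).astype(float)             # dangling-node indicator, Eq. (1)
    P_bar = P + np.outer(d, v)               # P_bar = P + d v^T
    return (c * P_bar + (1 - c) * np.outer(np.ones(n), v)).T  # Eqs. (2)-(3)

def power_method(G, tol=1e-10, max_iter=1000):
    """Plain power iteration for Gp = p, normalized in the 1-norm."""
    n = G.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_new = G @ p
        p_new /= p_new.sum()                 # keep e^T p = 1
        if np.abs(p_new - p).sum() < tol:    # L1 residual
            return p_new
        p = p_new
    return p
```

For a real web-scale matrix the computation would of course have to be sparse; the dense version only illustrates the algebra.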
3. The L1 condition number of the PageRank problem

We convert the eigenproblem for the Google matrix into the following linear system by transforming Eq. (4):

$$e^T p = \|p\|_1 = 1, \tag{5}$$

$$p = \bar{\bar{P}}^T p = c\bar{P}^T p + (1 - c)v, \tag{6}$$

$$(I - c\bar{P}^T)p = (1 - c)v. \tag{7}$$
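Eq. (7) shows that PageRank can also be obtained from a single linear solve rather than an eigen-iteration. A brief sparse sketch under the same assumptions as above (the function name and the use of SciPy are our choices, not the paper's):

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def pagerank_linear(P_bar, v, c=0.85):
    """Solve (I - c * P_bar^T) p = (1 - c) v, cf. Eq. (7).
    `P_bar` is the sparse row-stochastic matrix of Eq. (1)."""
    n = P_bar.shape[0]
    A = sp.identity(n, format="csc") - c * P_bar.T.tocsc()
    p = spla.spsolve(A, (1 - c) * v)
    return p / p.sum()   # enforce e^T p = 1, Eq. (5)
```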
Then the following proposition holds:

Proposition 1. If c ≠ 1, the L1 condition number of $I - c\bar{P}^T$ equals $\frac{1+c}{1-c}$, i.e.,

$$\kappa_1(I - c\bar{P}^T) = \frac{1 + c}{1 - c}.$$

Proof. According to the definition of $\bar{P}$, it is easily verified that

$$\|I - c\bar{P}^T\|_1 = \|I\|_1 + c\|\bar{P}^T\|_1 = 1 + c.$$

On the other hand, the Gerschgorin theorem [6] assures the existence of $(I - c\bar{P}^T)^{-1}$, so from

$$(I - c\bar{P}^T)p = (1 - c)v$$

we have

$$p = (1 - c)(I - c\bar{P}^T)^{-1}v$$

and

$$\|p\|_1 \le (1 - c)\|(I - c\bar{P}^T)^{-1}\|_1 \|v\|_1,$$

which, since $\|p\|_1 = \|v\|_1 = 1$, gives

$$\|(I - c\bar{P}^T)^{-1}\|_1 \ge \frac{1}{1 - c},$$
and because $\|c\bar{P}^T\|_1 < 1$, the following holds:

$$\|(I - c\bar{P}^T)^{-1}\|_1 \le \frac{1}{1 - \|c\bar{P}^T\|_1} = \frac{1}{1 - c}.$$

Therefore

$$\|(I - c\bar{P}^T)^{-1}\|_1 = \frac{1}{1 - c}$$

and

$$\kappa_1(I - c\bar{P}^T) = \|I - c\bar{P}^T\|_1 \, \|(I - c\bar{P}^T)^{-1}\|_1 = \frac{1 + c}{1 - c}.$$
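Proposition 1 is easy to check numerically on a small random example. The following sketch is ours; it assumes $\bar{P}$ has zero diagonal (no self-links), so that $\|I - c\bar{P}^T\|_1 = 1 + c$ holds with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 50, 0.85
# random row-stochastic P_bar with zero diagonal (no self-links)
M = rng.random((n, n))
np.fill_diagonal(M, 0.0)
P_bar = M / M.sum(axis=1, keepdims=True)

A = np.eye(n) - c * P_bar.T
kappa = np.linalg.norm(A, 1) * np.linalg.norm(np.linalg.inv(A), 1)
print(kappa, (1 + c) / (1 - c))   # both print ~12.333, i.e., (1+c)/(1-c)
```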
Note. S. Kamvar and T. Haveliwala give the same equation in [11], but their conclusion is based on the presupposition that all pages have an outdegree larger than 0. □

4. Extrapolation methods for accelerating the PageRank computation

Here we quote two extrapolation methods, Aitken extrapolation and quadratic extrapolation [21,12], both of which try to approximate the stationary distribution. Suppose we obtain $p^{(k)}$ after k iterations, and approximate it using the first two eigenvectors p, $u_2$:

$$p^{(k)} = p + \alpha u_2. \tag{8}$$

Then, since $Gu_2 = \lambda_2 u_2$ with $\lambda_2$ the second eigenvalue of G,

$$p^{(k+1)} = Gp^{(k)} = p + \alpha\lambda_2 u_2, \qquad p^{(k+2)} = Gp^{(k+1)} = p + \alpha\lambda_2^2 u_2. \tag{9}$$
Define g and h elementwise as

$$g_i = \left(p_i^{(k+1)} - p_i^{(k)}\right)^2, \tag{10}$$

$$h_i = p_i^{(k+2)} - 2p_i^{(k+1)} + p_i^{(k)}. \tag{11}$$

So we get

$$g_i = \alpha^2(\lambda_2 - 1)^2 (u_2(i))^2, \tag{12}$$

$$h_i = \alpha(\lambda_2 - 1)^2 u_2(i). \tag{13}$$

If $h_i \ne 0$, then define the vector f by

$$f_i = \frac{g_i}{h_i} = \alpha u_2(i),$$

i.e., $f = \alpha u_2$. So the dominating eigenvector can be expressed as

$$p = p^{(k)} - f. \tag{14}$$
Then we use a combination of the previous iterates to get a new approximation to the true eigenvector. Since we regard $p^{(k+2)}$ as closer to the true value than $p^{(k)}$, we replace the latter with the former in practice. Besides, we have to re-normalize the computed result, setting to zero any negative entries caused by the subtraction.

Algorithm 1. Aitken extrapolation

function y = Aitken_extrapolation(p^(k), p^(k+1), p^(k+2))
    g = (p^(k+1) − p^(k)).^2;
    h = p^(k+2) − 2p^(k+1) + p^(k);
    f = g ./ h;
    y = p^(k+2) − f;
    renormalize y
return

We apply this extrapolation within the power method to accelerate it. The numerical results are illustrated in Fig. 1. The test is based on the database STANFORD.EDU, which is available at http://www.stanford.edu/~sdkamvar [6–8].

[Fig. 1. Convergence of standard power method and Aitken extrapolation at different iterations, with different damping factor c. Panels: c = 0.85 and c = 0.99; axes: residual versus iteration number. Curves: power method without extrapolation; Aitken method applied at 10th iteration; Aitken method applied at 20th iteration.]

The figure illustrates that in the iteration immediately after Aitken extrapolation is applied, the residual increases sharply, causing a big spike, but after the spike it drops more quickly than with the standard power method. Though such an improvement is not guaranteed, as the case c = 0.99 illustrates, Aitken extrapolation is still valuable considering its small cost in flops.

In a similar manner, we approximate the iterates with the first three eigenvectors of the Google matrix, which is called quadratic extrapolation. Suppose the Google matrix G has only three linearly independent eigenvectors p, $u_2$, $u_3$, and approximate the kth iterate by the three of them:

$$p^{(k)} = p + \alpha_1 u_2 + \alpha_2 u_3.$$

Since we assume G has three independent eigenvectors, the minimal polynomial of G is cubic:

$$p_G(\lambda) = c_0 + c_1\lambda + c_2\lambda^2 + c_3\lambda^3.$$
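A direct NumPy transcription of Algorithm 1 (our sketch; the guard against tiny $h_i$ and the final clipping implement the re-normalization conventions described above):

```python
import numpy as np

def aitken_extrapolation(p_k, p_k1, p_k2, eps=1e-12):
    """Aitken extrapolation, Eqs. (10)-(14): y ~ p_k2 - g/h."""
    g = (p_k1 - p_k) ** 2                 # Eq. (10)
    h = p_k2 - 2.0 * p_k1 + p_k           # Eq. (11)
    f = np.zeros_like(g)
    mask = np.abs(h) > eps                # skip entries with h_i ~ 0
    f[mask] = g[mask] / h[mask]
    y = p_k2 - f                          # use p_k2, the latest iterate
    y = np.maximum(y, 0.0)                # zero out negative entries
    return y / y.sum()                    # renormalize to ||y||_1 = 1
```

Inside the power method one would call this once at a chosen step (e.g. the 10th or 20th iteration, as in Fig. 1) and continue iterating from the returned vector.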
Since the polynomial has the root 1 (1 is an eigenvalue of G),

$$c_0 + c_1 + c_2 + c_3 = 0.$$

Because $p_G(G) = 0$, we have $p_G(G)p^{(k)} = 0$, i.e.,

$$c_0 p^{(k)} + c_1 p^{(k+1)} + c_2 p^{(k+2)} + c_3 p^{(k+3)} = 0,$$

and hence

$$(-c_1 - c_2 - c_3)p^{(k)} + c_1 p^{(k+1)} + c_2 p^{(k+2)} + c_3 p^{(k+3)} = 0. \tag{15}$$

We define the vectors

$$y^{(k+3)} = p^{(k+3)} - p^{(k)}, \quad y^{(k+2)} = p^{(k+2)} - p^{(k)}, \quad y^{(k+1)} = p^{(k+1)} - p^{(k)}$$

and $c = (c_1, c_2, c_3)^T$. So we have

$$(y^{(k+1)}, y^{(k+2)}, y^{(k+3)})c = 0. \tag{16}$$

We can set $c_3 = 1$ for convenience and get

$$(y^{(k+1)}, y^{(k+2)})\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = Y\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = -y^{(k+3)}. \tag{17}$$

This is an overdetermined system, and we can approximate the solution by the Moore–Penrose generalized inverse [9,20]:

$$\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = -Y^{\dagger}y^{(k+3)}.$$

Then we define

$$q_G(\lambda) = \frac{p_G(\lambda)}{\lambda - 1} = \beta_0 + \beta_1\lambda + \beta_2\lambda^2,$$

and according to the definition of the minimal polynomial, we have

$$q_G(G)u_i = 0 \quad (i = 2, 3),$$

where $u_2$ and $u_3$ are the second and third eigenvectors. Then we can approximate the true eigenvector using the three iterates:

$$p = Gp = \frac{Gq_G(G)p^{(k)}}{\beta_0 + \beta_1 + \beta_2} = \frac{q_G(G)p^{(k+1)}}{\beta_0 + \beta_1 + \beta_2} = \frac{\beta_0 p^{(k+1)} + \beta_1 p^{(k+2)} + \beta_2 p^{(k+3)}}{\beta_0 + \beta_1 + \beta_2}.$$

It is easy to determine the coefficients $\beta_i$ from the relation between $p_G$ and $q_G$:

$$\beta_0 = c_1 + c_2 + c_3, \quad \beta_1 = c_2 + c_3, \quad \beta_2 = c_3. \tag{18}$$
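As a quick sanity check of Eq. (18) (a one-line expansion of ours, not in the original), multiply out $(\lambda - 1)q_G(\lambda)$ and compare coefficients with $p_G(\lambda)$:

$$(\lambda - 1)(\beta_0 + \beta_1\lambda + \beta_2\lambda^2) = -\beta_0 + (\beta_0 - \beta_1)\lambda + (\beta_1 - \beta_2)\lambda^2 + \beta_2\lambda^3,$$

so matching with $c_0 + c_1\lambda + c_2\lambda^2 + c_3\lambda^3$ gives $\beta_2 = c_3$, $\beta_1 - \beta_2 = c_2$ and $\beta_0 - \beta_1 = c_1$, which is exactly Eq. (18).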
[Fig. 2. Convergence of standard power method and two extrapolations at different iterations, with different damping factor c. Panels: c = 0.85 and c = 0.99; axes: residual versus iteration number. Curves: standard power method without extrapolation; Aitken method applied at 10th iteration; quadratic method applied at 10th iteration.]
Algorithm 2. Quadratic extrapolation

function y = Quadratic_extrapolation(p^(k), p^(k+1), p^(k+2), p^(k+3))
    for j = k + 1 to k + 3
        y^(j) = p^(j) − p^(k);
    end
    Y = (y^(k+1), y^(k+2));
    c_3 = 1;
    (c_1; c_2) = −Y† y^(k+3);
    β_0 = c_1 + c_2 + c_3;
    β_1 = c_2 + c_3;
    β_2 = c_3;
    y = β_0 p^(k+1) + β_1 p^(k+2) + β_2 p^(k+3);
    renormalize y
return

We note that though both extrapolation methods suffer spikes in the residual, quadratic extrapolation has a smaller one, because it approximates the iterates with three eigenvectors while Aitken extrapolation uses only two (see Fig. 2).
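A NumPy transcription of Algorithm 2 (again our sketch; np.linalg.lstsq plays the role of the Moore–Penrose inverse Y† in Eq. (17)):

```python
import numpy as np

def quadratic_extrapolation(p_k, p_k1, p_k2, p_k3):
    """Quadratic extrapolation, Eqs. (15)-(18)."""
    Y = np.column_stack((p_k1 - p_k, p_k2 - p_k))       # (y^(k+1), y^(k+2))
    rhs = -(p_k3 - p_k)                                 # -y^(k+3), with c3 = 1
    (c1, c2), *_ = np.linalg.lstsq(Y, rhs, rcond=None)  # least-squares solve
    c3 = 1.0
    b0, b1, b2 = c1 + c2 + c3, c2 + c3, c3              # Eq. (18)
    y = b0 * p_k1 + b1 * p_k2 + b2 * p_k3
    y = np.maximum(y, 0.0)                              # renormalize as before
    return y / y.sum()
```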
5. Restarting power method

Efficiently updating the stationary distribution of a Markov chain [19] has been studied for a long time, and there are various methods. Restarting the power method with the previous stationary distribution as the initial iterate, simple as it is, has proved quite efficient in our numerical experiments.
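The warm-restart idea requires no new machinery: after the web graph is updated, simply seed the power iteration of Section 2 with the old PageRank vector, padded with zeros for the newly added pages (a sketch with hypothetical names, in the style of the earlier ones):

```python
import numpy as np

def restarted_pagerank(G_new, p_old, tol=1e-10, max_iter=1000):
    """Power iteration on the updated Google matrix G_new,
    warm-started from the previous PageRank p_old."""
    n = G_new.shape[0]
    p = np.zeros(n)
    p[: p_old.size] = p_old     # keep old mass; new pages start at zero
    p /= p.sum()                # G_new > 0 elementwise, so zeros fill in
    for _ in range(max_iter):
        p_new = G_new @ p
        p_new /= p_new.sum()
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p
```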
[Fig. 3. Convergence of restarting power method with different initial vector.]

Table 1
The comparison of the iteration steps of the power method and its variants

Algorithm                                     c = 0.85    c = 0.9
Standard power method                             91        138
Aitken extrapolation at 10th iteration           100        145
Aitken extrapolation at 20th iteration            90        134
Quadratic extrapolation at 10th iteration         81        124
Quadratic extrapolation at 20th iteration         77        119
In our experiment, we update the web by adding some new pages whose outdegrees follow the normal distribution fitted to the outdegrees of the whole web. We then compare restarting with the old distribution against restarting with the uniform distribution and with a randomized distribution as the initial vector. The result is illustrated in Fig. 3. It is obvious that restarting with the old PageRank converges more quickly than with the other two initial vectors, but this advantage diminishes as the number of new pages added becomes larger.

6. Conclusion and areas of future work

This paper gives some propositions, algorithms and numerical tests on the PageRank problem. The L1 condition number of the PageRank problem shows that it is well-posed when c is not close to 1; thus computing PageRank will not incur large round-off errors. The numerical tests show that the extrapolation method is a promising variant for computing PageRank, though not yet stable. Further improvements are needed to give the algorithm stable performance [1] (see Table 1).

References

[1] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst (Eds.), Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, Philadelphia, 2000.
[2] Albert-Laszlo Barabasi, Reka Albert, Hawoong Jeong, Scale-free characteristics of random networks: the topology of the world-wide web, Physica A 281 (2000) 69–77.
[3] Krishna Bharat, George A. Mihaila, When experts agree: using non-affiliated experts to rank popular topics, ACM Transactions on Information Systems 20 (1) (2002) 47–58.
[4] Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems 33 (1998) 107–117.
[5] Chris H.Q. Ding, Hongyuan Zha, Xiaofeng He, Parry Husbands, Horst D. Simon, Link analysis: hubs and authorities on the World Wide Web, SIAM Review 46 (2) (2004) 256–268.
[6] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins University Press, Baltimore, 1996.
[7] Taher H. Haveliwala, Sepandar D. Kamvar, Glen Jeh, An analytical comparison of approaches to personalizing PageRank, Preprint, June 2003.
[8] Taher H. Haveliwala, Sepandar D. Kamvar, The second eigenvalue of the Google matrix, Preprint, March 2003.
[9] Xiao-qing Jin, Yimin Wei, Numerical Linear Algebra and Its Applications, Science Press, Beijing/New York, 2004.
[10] Sepandar D. Kamvar, Taher H. Haveliwala, Gene H. Golub, Adaptive methods for the computation of PageRank, Linear Algebra and its Applications 386 (2004) 51–65.
[11] Sepandar D. Kamvar, Taher H. Haveliwala, The condition number of the PageRank problem, Preprint, June 2003.
[12] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, Gene H. Golub, Extrapolation methods for accelerating PageRank computations, in: Proceedings of the Twelfth International World Wide Web Conference, May 2003.
[13] Amy N. Langville, C.D. Meyer, The use of linear algebra by Web search engines, IMAGE Newsletter 33 (2004) 2–6.
[14] Amy N. Langville, Carl D. Meyer, Updating the stationary vector of an irreducible Markov chain with an eye on Google's PageRank, SIAM Journal on Matrix Analysis and Applications (to appear).
[15] Amy N. Langville, Carl D. Meyer, Deeper inside PageRank, Internet Mathematics 1 (3) (2005) 335–380.
[16] Amy N. Langville, Carl D. Meyer, A survey of eigenvector methods of Web information retrieval, SIAM Review 47 (1) (2005) 135–161.
[17] Cleve Moler, The world's largest matrix computation, MATLAB News and Notes, October 2002, pp. 12–13.
[18] S. Serra-Capizzano, Jordan canonical form of the Google matrix and extrapolation algorithms for the PageRank computation, SIAM Journal on Matrix Analysis and Applications 27 (2005) 305–312.
[19] William J. Stewart, Introduction to the Numerical Solution of Markov Chains, Princeton University Press, 1994.
[20] Guorong Wang, Yimin Wei, Sanzheng Qiao, Generalized Inverses: Theory and Computation, Science Press, Beijing/New York, 2004.
[21] J.H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.