Applied Mathematics and Computation 179 (2006) 799–806

A note on the PageRank algorithm

Huan Sun 1, Yimin Wei *

School of Mathematical Sciences, Fudan University and Key Laboratory of Mathematics for Nonlinear Sciences, Ministry of Education, Shanghai 200433, PR China
Abstract

In this paper we present some notes on the PageRank algorithm, including its L1 condition number and some observations from numerical tests of two variant algorithms based on the extrapolation method.

© 2005 Elsevier Inc. All rights reserved.

Keywords: PageRank; Condition number; Eigenvalue problem
1. Introduction

With the booming development of the Internet, web search engines [2–5,10,12,14,16–18] have become the most important tools for retrieving information online. Among the thousands of search engines built on various algorithms that have emerged in recent years (there were more than 3500 different search engines by the year 2000 according to [13,15]), Google has become one of the most popular and successful. Google's triumph is largely attributable to its simple but elegant algorithm: PageRank.

PageRank is a link-structure-based algorithm, which ranks the importance of all the pages crawled by Google's web crawler. Computing PageRank amounts to computing the stationary distribution of a transition matrix, also called the Google matrix, which is built from the web graph structure. The overwhelming size of the Google matrix rules out many popular and intricate algorithms that perform excellently at ordinary scales. By comparison, the simple power method stands out for its stable and reliable performance in spite of its low efficiency, so variants have been introduced to improve its convergence speed; studying such variants is part of our work in this paper. We begin with a brief review of the PageRank algorithm and then present our work in the following sections.
This project is supported by the National Natural Science Foundation of China under Grant 10471027 and the Shanghai Education Committee.
* Corresponding author. E-mail addresses: [email protected] (H. Sun), [email protected] (Y. Wei).
1 Present address: Department of Mathematics, The Pennsylvania State University, USA.
doi:10.1016/j.amc.2005.11.120
2. Notations and preliminaries

We consider a directed graph based on the hyperlink structure of the global web [4,5], in which every page is a node and there is an edge from node u to node v if page u links to page v. We define the matrix P elementwise as follows:

$$p_{i,j} = \begin{cases} 1/|P_i|, & \text{if page } P_i \text{ links to page } P_j, \\ 0, & \text{otherwise}, \end{cases}$$

where |P_i| denotes the outdegree of page P_i. To make P a transition probability matrix, we modify it as

$$\bar{P} = P + dv^T, \tag{1}$$

where d is an n-dimensional vector (n is the total number of pages in the web, i.e., the order of P) satisfying

$$d_i = \begin{cases} 1, & \deg(i) = 0, \\ 0, & \text{otherwise}, \end{cases}$$

and v is a distribution vector. Google usually sets it to the uniform distribution, i.e., $v = (1, 1, \ldots, 1)^T/n$. In order to force irreducibility, we make a further modification of $\bar{P}$:

$$\bar{\bar{P}} = c\bar{P} + (1 - c)ev^T, \tag{2}$$

where $e = (1, 1, \ldots, 1)^T$ and c is a parameter ranging from 0 to 1. For convenience, we usually use the transpose of $\bar{\bar{P}}$ in computation and call it the Google matrix, denoted by G:

$$G = \bar{\bar{P}}^T = [c\bar{P} + (1 - c)ev^T]^T, \tag{3}$$

and the famous PageRank vector is the stationary distribution of G, i.e.,

$$Gp = p. \tag{4}$$
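As a concrete illustration of Eqs. (1)–(4), here is a minimal dense sketch in Python (our own, not from the paper; the adjacency array `links` and all function names are hypothetical, and v is taken uniform):

```python
import numpy as np

def google_matrix(links, c=0.85):
    """Build the dense Google matrix G = [c*P_bar + (1-c)*e*v^T]^T from an
    adjacency matrix `links` (links[i, j] = 1 if page i links to page j)."""
    n = links.shape[0]
    v = np.full(n, 1.0 / n)                  # uniform personalization vector
    out = links.sum(axis=1)                  # outdegrees |P_i|
    P = np.where(out[:, None] > 0,
                 links / np.maximum(out, 1)[:, None], 0.0)
    d = (out == 0).astype(float)             # dangling-node indicator, Eq. (1)
    P_bar = P + np.outer(d, v)               # P_bar = P + d v^T
    return (c * P_bar + (1 - c) * np.outer(np.ones(n), v)).T  # Eqs. (2)-(3)

def power_method(G, tol=1e-10, max_iter=1000):
    """Plain power iteration for Gp = p, normalized in the 1-norm."""
    n = G.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        p_new = G @ p
        p_new /= p_new.sum()                 # keep e^T p = 1
        if np.abs(p_new - p).sum() < tol:    # L1 residual
            return p_new
        p = p_new
    return p
```

For a real web-scale matrix the computation would of course have to be sparse; the dense version only illustrates the algebra.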
3. The L1 condition number of the PageRank problem

We convert the eigenproblem for the Google matrix into the following linear system by transforming Eq. (4):

$$e^T p = \|p\|_1 = 1, \tag{5}$$

$$p = \bar{\bar{P}}^T p = c\bar{P}^T p + (1 - c)v, \tag{6}$$

$$(I - c\bar{P}^T)p = (1 - c)v. \tag{7}$$
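Eq. (7) shows that PageRank can also be obtained from a single linear solve rather than an eigen-iteration. A brief sparse sketch under the same assumptions as above (the function name and the use of SciPy are our choices, not the paper's):

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def pagerank_linear(P_bar, v, c=0.85):
    """Solve (I - c * P_bar^T) p = (1 - c) v, cf. Eq. (7).
    `P_bar` is the sparse row-stochastic matrix of Eq. (1)."""
    n = P_bar.shape[0]
    A = sp.identity(n, format="csc") - c * P_bar.T.tocsc()
    p = spla.spsolve(A, (1 - c) * v)
    return p / p.sum()   # enforce e^T p = 1, Eq. (5)
```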
Then the following proposition holds:

Proposition 1. If c ≠ 1, the L1 condition number of $I - c\bar{P}^T$ equals $\frac{1+c}{1-c}$, i.e.,

$$\kappa_1(I - c\bar{P}^T) = \frac{1 + c}{1 - c}.$$

Proof. According to the definition of $\bar{P}$, it is easily verified that

$$\|I - c\bar{P}^T\|_1 = \|I\|_1 + c\|\bar{P}^T\|_1 = 1 + c.$$

On the other hand, the Gerschgorin theorem [6] assures the existence of $(I - c\bar{P}^T)^{-1}$, so from

$$(I - c\bar{P}^T)p = (1 - c)v$$

we have

$$p = (1 - c)(I - c\bar{P}^T)^{-1}v$$

and

$$\|p\|_1 \le (1 - c)\|(I - c\bar{P}^T)^{-1}\|_1 \|v\|_1,$$

which, since $\|p\|_1 = \|v\|_1 = 1$, gives

$$\|(I - c\bar{P}^T)^{-1}\|_1 \ge \frac{1}{1 - c},$$
and because $\|c\bar{P}^T\|_1 < 1$, the following holds:

$$\|(I - c\bar{P}^T)^{-1}\|_1 \le \frac{1}{1 - \|c\bar{P}^T\|_1} = \frac{1}{1 - c}.$$

Therefore

$$\|(I - c\bar{P}^T)^{-1}\|_1 = \frac{1}{1 - c}$$

and

$$\kappa_1(I - c\bar{P}^T) = \|I - c\bar{P}^T\|_1 \, \|(I - c\bar{P}^T)^{-1}\|_1 = \frac{1 + c}{1 - c}.$$
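Proposition 1 is easy to check numerically on a small random example. The following sketch is ours; it assumes $\bar{P}$ has zero diagonal (no self-links), so that $\|I - c\bar{P}^T\|_1 = 1 + c$ holds with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 50, 0.85
# random row-stochastic P_bar with zero diagonal (no self-links)
M = rng.random((n, n))
np.fill_diagonal(M, 0.0)
P_bar = M / M.sum(axis=1, keepdims=True)

A = np.eye(n) - c * P_bar.T
kappa = np.linalg.norm(A, 1) * np.linalg.norm(np.linalg.inv(A), 1)
print(kappa, (1 + c) / (1 - c))   # both print ~12.333, i.e., (1+c)/(1-c)
```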
Note. S. Kamvar and T. Haveliwala give the same equation in [11], but their conclusion is based on the presupposition that all pages have an outdegree larger than 0. □

4. Extrapolation methods for accelerating the PageRank computation

Here we quote two extrapolation methods, Aitken extrapolation and quadratic extrapolation [21,12], both of which try to approximate the stationary distribution. Suppose we obtain $p^{(k)}$ after k iterations, and approximate it using the first two eigenvectors p, $u_2$:

$$p^{(k)} = p + \alpha u_2. \tag{8}$$

Then, since $Gu_2 = \lambda_2 u_2$ with $\lambda_2$ the second eigenvalue of G,

$$p^{(k+1)} = Gp^{(k)} = p + \alpha\lambda_2 u_2, \qquad p^{(k+2)} = Gp^{(k+1)} = p + \alpha\lambda_2^2 u_2. \tag{9}$$
Define g and h elementwise as

$$g_i = \left(p_i^{(k+1)} - p_i^{(k)}\right)^2, \tag{10}$$

$$h_i = p_i^{(k+2)} - 2p_i^{(k+1)} + p_i^{(k)}. \tag{11}$$

So we get

$$g_i = \alpha^2(\lambda_2 - 1)^2 (u_2(i))^2, \tag{12}$$

$$h_i = \alpha(\lambda_2 - 1)^2 u_2(i). \tag{13}$$

If $h_i \ne 0$, then define the vector f by

$$f_i = \frac{g_i}{h_i} = \alpha u_2(i),$$

i.e., $f = \alpha u_2$. So the dominating eigenvector can be expressed as

$$p = p^{(k)} - f. \tag{14}$$
Then we use a combination of the previous iterates to get a new approximation to the true eigenvector. Since we regard $p^{(k+2)}$ as closer to the true value than $p^{(k)}$, we replace the latter with the former in practice. Besides, we have to re-normalize the computed result, setting to zero any negative entries caused by the subtraction.

Algorithm 1. Aitken extrapolation

function y = Aitken_extrapolation(p^(k), p^(k+1), p^(k+2))
    g = (p^(k+1) − p^(k)).^2;
    h = p^(k+2) − 2p^(k+1) + p^(k);
    f = g ./ h;
    y = p^(k+2) − f;
    renormalize y
return

We apply this extrapolation within the power method to accelerate it. The numerical results are illustrated in Fig. 1. The test is based on the database STANFORD.EDU, which is available at http://www.stanford.edu/~sdkamvar [6–8].

[Fig. 1. Convergence of standard power method and Aitken extrapolation at different iterations, with different damping factor c. Panels: c = 0.85 and c = 0.99; axes: residual versus iteration number. Curves: power method without extrapolation; Aitken method applied at 10th iteration; Aitken method applied at 20th iteration.]

The figure illustrates that in the iteration immediately after Aitken extrapolation is applied, the residual increases sharply, causing a big spike, but after the spike it drops more quickly than with the standard power method. Though such an improvement is not guaranteed, as the case c = 0.99 illustrates, Aitken extrapolation is still valuable considering its small cost in flops.

In a similar manner, we approximate the iterates with the first three eigenvectors of the Google matrix, which is called quadratic extrapolation. Suppose the Google matrix G has only three linearly independent eigenvectors p, $u_2$, $u_3$, and approximate the kth iterate by the three of them:

$$p^{(k)} = p + \alpha_1 u_2 + \alpha_2 u_3.$$

Since we assume G has three independent eigenvectors, the minimal polynomial of G is cubic:

$$p_G(\lambda) = c_0 + c_1\lambda + c_2\lambda^2 + c_3\lambda^3.$$
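A direct NumPy transcription of Algorithm 1 (our sketch; the guard against tiny $h_i$ and the final clipping implement the re-normalization conventions described above):

```python
import numpy as np

def aitken_extrapolation(p_k, p_k1, p_k2, eps=1e-12):
    """Aitken extrapolation, Eqs. (10)-(14): y ~ p_k2 - g/h."""
    g = (p_k1 - p_k) ** 2                 # Eq. (10)
    h = p_k2 - 2.0 * p_k1 + p_k           # Eq. (11)
    f = np.zeros_like(g)
    mask = np.abs(h) > eps                # skip entries with h_i ~ 0
    f[mask] = g[mask] / h[mask]
    y = p_k2 - f                          # use p_k2, the latest iterate
    y = np.maximum(y, 0.0)                # zero out negative entries
    return y / y.sum()                    # renormalize to ||y||_1 = 1
```

Inside the power method one would call this once at a chosen step (e.g. the 10th or 20th iteration, as in Fig. 1) and continue iterating from the returned vector.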
Since the polynomial has the root 1 (1 is an eigenvalue of G),

$$c_0 + c_1 + c_2 + c_3 = 0.$$

Because $p_G(G) = 0$, we have $p_G(G)p^{(k)} = 0$, i.e.,

$$c_0 p^{(k)} + c_1 p^{(k+1)} + c_2 p^{(k+2)} + c_3 p^{(k+3)} = 0,$$

and hence

$$(-c_1 - c_2 - c_3)p^{(k)} + c_1 p^{(k+1)} + c_2 p^{(k+2)} + c_3 p^{(k+3)} = 0. \tag{15}$$

We define the vectors

$$y^{(k+3)} = p^{(k+3)} - p^{(k)}, \quad y^{(k+2)} = p^{(k+2)} - p^{(k)}, \quad y^{(k+1)} = p^{(k+1)} - p^{(k)}$$

and $c = (c_1, c_2, c_3)^T$. So we have

$$(y^{(k+1)}, y^{(k+2)}, y^{(k+3)})c = 0. \tag{16}$$

We can set $c_3 = 1$ for convenience and get

$$(y^{(k+1)}, y^{(k+2)})\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = Y\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = -y^{(k+3)}. \tag{17}$$

This is an overdetermined system, and we can approximate the solution by the Moore–Penrose generalized inverse [9,20]:

$$\begin{pmatrix} c_1 \\ c_2 \end{pmatrix} = -Y^{\dagger}y^{(k+3)}.$$

Then we define

$$q_G(\lambda) = \frac{p_G(\lambda)}{\lambda - 1} = \beta_0 + \beta_1\lambda + \beta_2\lambda^2,$$

and according to the definition of the minimal polynomial, we have

$$q_G(G)u_i = 0 \quad (i = 2, 3),$$

where $u_2$ and $u_3$ are the second and third eigenvectors. Then we can approximate the true eigenvector using the three iterates:

$$p = Gp = \frac{Gq_G(G)p^{(k)}}{\beta_0 + \beta_1 + \beta_2} = \frac{q_G(G)p^{(k+1)}}{\beta_0 + \beta_1 + \beta_2} = \frac{\beta_0 p^{(k+1)} + \beta_1 p^{(k+2)} + \beta_2 p^{(k+3)}}{\beta_0 + \beta_1 + \beta_2}.$$

It is easy to determine the coefficients $\beta_i$ from the relation between $p_G$ and $q_G$:

$$\beta_0 = c_1 + c_2 + c_3, \quad \beta_1 = c_2 + c_3, \quad \beta_2 = c_3. \tag{18}$$
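As a quick sanity check of Eq. (18) (a one-line expansion of ours, not in the original), multiply out $(\lambda - 1)q_G(\lambda)$ and compare coefficients with $p_G(\lambda)$:

$$(\lambda - 1)(\beta_0 + \beta_1\lambda + \beta_2\lambda^2) = -\beta_0 + (\beta_0 - \beta_1)\lambda + (\beta_1 - \beta_2)\lambda^2 + \beta_2\lambda^3,$$

so matching with $c_0 + c_1\lambda + c_2\lambda^2 + c_3\lambda^3$ gives $\beta_2 = c_3$, $\beta_1 - \beta_2 = c_2$ and $\beta_0 - \beta_1 = c_1$, which is exactly Eq. (18).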
[Fig. 2. Convergence of standard power method and two extrapolations at different iterations, with different damping factor c. Panels: c = 0.85 and c = 0.99; axes: residual versus iteration number. Curves: standard power method without extrapolation; Aitken method applied at 10th iteration; quadratic method applied at 10th iteration.]
Algorithm 2. Quadratic extrapolation

function y = Quadratic_extrapolation(p^(k), p^(k+1), p^(k+2), p^(k+3))
    for j = k + 1 to k + 3
        y^(j) = p^(j) − p^(k);
    end
    Y = (y^(k+1), y^(k+2));
    c_3 = 1;
    (c_1; c_2) = −Y† y^(k+3);
    β_0 = c_1 + c_2 + c_3;
    β_1 = c_2 + c_3;
    β_2 = c_3;
    y = β_0 p^(k+1) + β_1 p^(k+2) + β_2 p^(k+3);
    renormalize y
return

We note that though both extrapolation methods suffer spikes in the residual, quadratic extrapolation has a smaller one, because it approximates the iterates with three eigenvectors while Aitken extrapolation uses only two (see Fig. 2).
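A NumPy transcription of Algorithm 2 (again our sketch; np.linalg.lstsq plays the role of the Moore–Penrose inverse Y† in Eq. (17)):

```python
import numpy as np

def quadratic_extrapolation(p_k, p_k1, p_k2, p_k3):
    """Quadratic extrapolation, Eqs. (15)-(18)."""
    Y = np.column_stack((p_k1 - p_k, p_k2 - p_k))       # (y^(k+1), y^(k+2))
    rhs = -(p_k3 - p_k)                                 # -y^(k+3), with c3 = 1
    (c1, c2), *_ = np.linalg.lstsq(Y, rhs, rcond=None)  # least-squares solve
    c3 = 1.0
    b0, b1, b2 = c1 + c2 + c3, c2 + c3, c3              # Eq. (18)
    y = b0 * p_k1 + b1 * p_k2 + b2 * p_k3
    y = np.maximum(y, 0.0)                              # renormalize as before
    return y / y.sum()
```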
5. Restarting power method

Efficiently updating the stationary distribution of a Markov chain [19] has been studied for a long time, and there are various methods. Restarting the power method with the previous stationary distribution as the initial iterate, simple as it is, has proved quite efficient in our numerical experiments.
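The warm-restart idea requires no new machinery: after the web graph is updated, simply seed the power iteration of Section 2 with the old PageRank vector, padded with zeros for the newly added pages (a sketch with hypothetical names, in the style of the earlier ones):

```python
import numpy as np

def restarted_pagerank(G_new, p_old, tol=1e-10, max_iter=1000):
    """Power iteration on the updated Google matrix G_new,
    warm-started from the previous PageRank p_old."""
    n = G_new.shape[0]
    p = np.zeros(n)
    p[: p_old.size] = p_old     # keep old mass; new pages start at zero
    p /= p.sum()                # G_new > 0 elementwise, so zeros fill in
    for _ in range(max_iter):
        p_new = G_new @ p
        p_new /= p_new.sum()
        if np.abs(p_new - p).sum() < tol:
            return p_new
        p = p_new
    return p
```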
[Fig. 3. Convergence of restarting power method with different initial vector.]

Table 1
The comparison of the iteration steps of the power method and its variants

Algorithm                                     c = 0.85    c = 0.9
Standard power method                             91        138
Aitken extrapolation at 10th iteration           100        145
Aitken extrapolation at 20th iteration            90        134
Quadratic extrapolation at 10th iteration         81        124
Quadratic extrapolation at 20th iteration         77        119
In our experiment, we update the web by adding some new pages whose outdegrees follow the normal distribution fitted to the outdegrees of the whole web. We then compare restarting with the old distribution against restarting with the uniform distribution and with a randomized distribution as the initial vector. The result is illustrated in Fig. 3. It is obvious that restarting with the old PageRank converges more quickly than with the other two initial vectors, but this advantage diminishes as the number of new pages added becomes larger.

6. Conclusion and areas of future work

This paper gives some propositions, algorithms and numerical tests on the PageRank problem. The L1 condition number of the PageRank problem shows that it is well-posed when c is not close to 1; thus computing PageRank will not incur large round-off errors. The numerical tests show that the extrapolation method is a promising variant for computing PageRank, though not yet stable. Further improvements are needed to give the algorithm stable performance [1] (see Table 1).

References

[1] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, H. van der Vorst (Eds.), Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, Philadelphia, 2000.
[2] Albert-Laszlo Barabasi, Reka Albert, Hawoong Jeong, Scale-free characteristics of random networks: the topology of the world-wide web, Physica A 281 (2000) 69–77.
[3] Krishna Bharat, George A. Mihaila, When experts agree: using non-affiliated experts to rank popular topics, ACM Transactions on Information Systems 20 (1) (2002) 47–58.
[4] Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems 33 (1998) 107–117.
[5] Chris H.Q. Ding, Hongyuan Zha, Xiaofeng He, Parry Husbands, Horst D. Simon, Link analysis: hubs and authorities on the World Wide Web, SIAM Review 46 (2) (2004) 256–268.
[6] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins University Press, Baltimore, 1996.
[7] Taher H. Haveliwala, Sepandar D. Kamvar, Glen Jeh, An analytical comparison of approaches to personalizing PageRank, Preprint, June 2003.
[8] Taher H. Haveliwala, Sepandar D. Kamvar, The second eigenvalue of the Google matrix, Preprint, March 2003.
[9] Xiao-qing Jin, Yimin Wei, Numerical Linear Algebra and Its Applications, Science Press, Beijing/New York, 2004.
[10] Sepandar D. Kamvar, Taher H. Haveliwala, Gene H. Golub, Adaptive methods for the computation of PageRank, Linear Algebra and its Applications 386 (2004) 51–65.
[11] Sepandar D. Kamvar, Taher H. Haveliwala, The condition number of the PageRank problem, Preprint, June 2003.
[12] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, Gene H. Golub, Extrapolation methods for accelerating PageRank computations, in: Proceedings of the Twelfth International World Wide Web Conference, May 2003.
[13] Amy N. Langville, C.D. Meyer, The use of linear algebra by Web search engines, IMAGE Newsletter 33 (2004) 2–6.
[14] Amy N. Langville, Carl D. Meyer, Updating the stationary vector of an irreducible Markov chain with an eye on Google's PageRank, SIAM Journal on Matrix Analysis and Applications (to appear).
[15] Amy N. Langville, Carl D. Meyer, Deeper inside PageRank, Internet Mathematics 1 (3) (2005) 335–380.
[16] Amy N. Langville, Carl D. Meyer, A survey of eigenvector methods of Web information retrieval, SIAM Review 47 (1) (2005) 135–161.
[17] Cleve Moler, The world's largest matrix computation, MATLAB News and Notes, October 2002, pp. 12–13.
[18] S. Serra-Capizzano, Jordan canonical form of the Google matrix and extrapolation algorithms for the PageRank computation, SIAM Journal on Matrix Analysis and Applications 27 (2005) 305–312.
[19] William J. Stewart, Introduction to the Numerical Solution of Markov Chains, Princeton University Press, 1994.
[20] Guorong Wang, Yimin Wei, Sanzheng Qiao, Generalized Inverses: Theory and Computation, Science Press, Beijing/New York, 2004.
[21] J.H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.