ISSN 1064–5624, Doklady Mathematics, 2009, Vol. 79, No. 3, pp. 424–427. © Pleiades Publishing, Ltd., 2009. Published in Russian in Doklady Akademii Nauk, 2009, Vol. 426, No. 6, pp. 734–737.
MATHEMATICS
The Randomized Algorithm for Finding an Eigenvector of the Stochastic Matrix with Application to PageRank¶
A. V. Nazin and B. T. Polyak
Presented by Academician S. N. Vasil’ev February 9, 2009
Received February 24, 2009
Abstract—The problem of finding the eigenvector corresponding to the largest eigenvalue of a stochastic matrix has numerous applications in ranking search results, multi-agent consensus, networked control, and data mining. The power method is a typical tool for its solution. However, randomized methods can compete with the standard ones: they require much less computation per iteration and are well suited to distributed implementation. We propose a new randomized algorithm and provide an upper bound for its rate of convergence, which is O(√(ln N / n)), where N is the dimension and n is the number of iterations. The bound looks promising because ln N is not large even for very high dimensions. The algorithm is based on the mirror descent method for convex stochastic optimization. Applications to the PageRank problem are discussed. DOI: 10.1134/S1064562409030338
Let A be a stochastic N × N matrix, i.e., its columns are contained in the standard simplex Θ_N ⊂ ℝ^N. By the Perron–Frobenius theorem, the matrix has an eigenvector x* corresponding to the largest in absolute value eigenvalue, which equals 1; that is, Ax* = x*. Finding this eigenvector is a basic problem in numerical analysis and in numerous applications arising in ranking search results, multi-agent consensus, networked control, and data mining. It suffices to mention the famous PageRank problem of ranking web pages, which is the basis of Google’s search engine rankings; see the original paper [1] and the recent monograph [2], where numerous references can be found. What is typical for such applications is their high dimension. For instance, for web page ranking, N equals several billion. The standard technique for solving the problem is the power method. Starting from an initial approximation x_0, the method iterates x_{n+1} = Ax_n, and under some assumptions (e.g., if A has all positive entries) lim_{n→∞} x_n = x*, which is unique in this case. The method is very simple and converges at a geometric rate; see [2].
¶ The article was translated by the authors.
Trapeznikov Institute of Control Sciences RAS, ul. Profsoyuznaya 65, 117997 Moscow; e-mail: [email protected], [email protected]
Calculation of the vectors Ax_n can be performed easily because in typical applications A is sparse. Nevertheless, for huge-scale N the method requires serious computational effort; it is reported that web PageRank computation on supercomputers takes about a week. With this in mind, several approaches have recently been proposed. Some of them exploit the block structure of the matrix; others use distributed calculations. One line of research is based on randomized algorithms; see the recent work [3] and the references therein. Such an approach has several advantages. First, each iteration is much cheaper computationally than an iteration of the power method. Second, these algorithms easily admit distributed versions. In general, randomized algorithms play a significant role in modern control and optimization; cf. [4]. In the present paper we follow this line of research. We propose an iterative randomized method for minimization of the squared Euclidean norm ‖Ax − x‖_2^2 of the residual. At each step it calculates a stochastic gradient of this function, which exploits one column and one or two rows of the matrix A, chosen at random according to probability distributions on {1, 2, …, N} determined by the current iterate. The method is a version of the general stochastic mirror descent method (in other words, the primal-dual method of convex optimization), whose idea goes back to [5] and was developed in [6–8]. We obtain the bound E‖Ax̂_n − x̂_n‖_2^2 ≤ 8√((n + 1) ln N)/n, where the estimate x̂_n is the result obtained at the nth iteration (see also [10]). The dependence on the dimension N is highly promising—
even for N = 10^9 we have (ln N)^{1/2} < 5. Moreover, the bound is valid for all stochastic matrices; it does not depend on the properties of a particular A (for instance, it does not depend on the second eigenvalue of the matrix, as is the case for the power method). It is of interest to compare our bound with (30) in [3], which reads, in our notation, E‖x̂_n − x*‖_2^2 = O(N^{3/2}/n). There the dependence on n is better; however, the dependence on the dimension looks hopeless: O(N^{3/2}). It should be emphasized that our bound holds under very mild assumptions; for instance, we do not assume the uniqueness of x* ∈ Θ_N.
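For a rough numerical illustration (our arithmetic, not a result from the paper; note also that the two bounds control different quantities, the residual ‖Ax̂_n − x̂_n‖_2^2 in our case and the distance ‖x̂_n − x*‖_2^2 in [3]), take N = 10^9 and n = 10^8. Then

  8 (ln N)^{1/2} √(n + 1)/n ≈ 8 · 4.6 · 10^{−4} ≈ 3.7 · 10^{−3},

whereas

  N^{3/2}/n ≈ 3 · 10^{13} / 10^8 ≈ 3 · 10^5,

so the latter bound is vacuous at this scale unless n is taken enormously larger.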
PROBLEM STATEMENT

Let A = ‖a_ij‖_{N×N} be a stochastic matrix, i.e., its columns represent the stochastic vectors A^(j) = (a_1j, a_2j, …, a_Nj)^T:

  ∑_{i=1}^N a_ij = 1,  a_ij ≥ 0.

Denote by 𝒜_N the set of all stochastic N × N matrices A. Also define the matrix rows A_(i) = (a_i1, a_i2, …, a_iN) and the standard simplex

  Θ_N = { x ∈ ℝ^N : x_i ≥ 0, ∑_{i=1}^N x_i = 1 }.

Consider the system of linear equations

  Ax = x,  x ∈ Θ_N.   (1)

By the Perron–Frobenius theorem, there exists a solution x* ∈ Θ_N of system (1), a stationary distribution of the related Markov chain. The set of all such solutions is a convex compact set

  X* = Arg min_{x ∈ Θ_N} ‖Ax − x‖_2 = { x ∈ Θ_N : Ax = x }.

If the matrix A of the Markov chain is strongly connected (equivalently, irreducible), the set X* is a singleton; that is, the eigenvector x* corresponding to the largest in absolute value eigenvalue (equal to one) is unique. Under the stronger assumption that all entries of A are positive, the convergence of the power method is guaranteed. For instance, the power method does not converge for N = 2, a_11 = a_22 = 0, a_21 = a_12 = 1. Below we consider the general situation where A ∈ 𝒜_N and the matrix A is not assumed to be irreducible or positive. Our goal is to find a point in X*, that is, to minimize

  Q(x) = (1/2) ‖Ax − x‖_2^2 → min_{x ∈ Θ_N}.   (2)

STOCHASTIC GRADIENT

The idea of the iterative algorithm for minimizing Q(x) is to construct a stochastic gradient of this function by using a random choice of rows and columns of A and then to apply the mirror descent method for minimization over Θ_N.

Let the estimate x_k = (x_k^(1), x_k^(2), …, x_k^(N))^T ∈ Θ_N be obtained at iteration k. Then the vector

  Ax_k = ∑_{j=1}^N A^(j) x_k^(j)

may be treated as a conditional expectation of the column A^(η_k) under a random index η_k ∈ {1, 2, …, N} with the conditional probability distribution (x_k^(1), x_k^(2), …, x_k^(N)), i.e.,

  P(η_k = j | x_k) = x_k^(j),  j = 1, 2, …, N.   (3)

Observe that the idea of such a random realization of the vector Ax was first proposed in [8]. Further, the gradient in (2) is ∇Q(x) = (A − I)^T(A − I)x = A^T Ax − A^T x − Ax + x. Choose the second index ξ_k ∈ {1, 2, …, N} at random with the conditional probability distribution (a_{1η_k}, a_{2η_k}, …, a_{Nη_k}), i.e., using the stochastic vector A^(η_k):

  P(ξ_k = i | x_k, η_k) = a_{iη_k},  i = 1, 2, …, N.   (4)

Thus, by the sequential random choice of the two indexes η_k and ξ_k, we form a realization of the stochastic gradient at the current iteration, namely

  ζ_k ≜ (A_(ξ_k))^T − (A_(η_k))^T − A^(η_k) + x_k.   (5)

Then E(ζ_k | x_k) = ∇Q(x_k), since

  E(ζ_k | x_k) = E{ E(ζ_k | x_k, η_k) | x_k }
             = ∑_{i,j=1}^N x_k^(j) a_ij (A_(i))^T − A^T x_k − Ax_k + x_k
             = (A^T A − A^T − A + I) x_k = ∇Q(x_k).

Note a significant property of ζ_k in (5): its ∞-norm is bounded by two,

  ‖ζ_k‖_∞ ≤ ‖(A_(ξ_k))^T − (A_(η_k))^T‖_∞ + ‖x_k − A^(η_k)‖_∞ ≤ 2.
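The sampling scheme (3)–(5) is straightforward to express in code. Below is a minimal NumPy sketch of one draw of ζ_k; the function name and the dense-matrix representation are our illustrative assumptions (in applications A is sparse, and only one column and two rows of it are actually touched per call).

```python
import numpy as np

def stochastic_gradient(A, x, rng):
    """One realization of the stochastic gradient zeta_k from Eqs. (3)-(5).

    A : (N, N) column-stochastic matrix (each column sums to 1).
    x : current iterate x_k, a probability vector on {0, ..., N-1}.
    """
    N = A.shape[0]
    eta = rng.choice(N, p=x)           # Eq. (3): column index drawn from x_k
    xi = rng.choice(N, p=A[:, eta])    # Eq. (4): row index drawn from column A^(eta)
    # Eq. (5): zeta_k = A_(xi)^T - A_(eta)^T - A^(eta) + x_k
    return A[xi, :] - A[eta, :] - A[:, eta] + x
```

By construction, E(ζ_k | x_k) = ∇Q(x_k) and ‖ζ_k‖_∞ ≤ 2, which is exactly what the analysis above uses.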
THE GENERAL FORM OF THE RANDOMIZED ALGORITHM

The algorithm contains two kinds of variables: (a) ψ_k, the dual variable, which is driven by the stochastic gradients ζ_k as the result of descent in the dual space; (b) x_k, the primal variable, representing the “mirror image” of ψ_k in the primal space. In order to properly adjust the algorithm, one should fix two positive sequences: (γ_k)_{k≥0} (a step gain) and (β_k)_{k≥0} (“a temperature”), the latter non-decreasing, i.e., β_k ≥ β_{k−1} for all k ≥ 1.

Let a stochastic matrix A ∈ 𝒜_N be given. The mirror descent algorithm relies on the entropy proxy function

  V(θ) = ln N + ∑_{j=1}^N θ^(j) ln θ^(j),

the β-conjugate Legendre–Fenchel function W_β : ℝ^N → ℝ, and the related Gibbs potential −∇W_β : ℝ^N → Θ_N:

  W_β(z) = β ln( (1/N) ∑_{k=1}^N e^{−z^(k)/β} ),   (6)

  ∂W_β(z)/∂z^(j) = −e^{−z^(j)/β} ( ∑_{k=1}^N e^{−z^(k)/β} )^{−1},  j = 1, 2, …, N.   (7)

The algorithm reads as follows. Fix the initial values x_0 ∈ Θ_N and ψ_0 = 0 ∈ ℝ^N. Specify the positive sequences (γ_k)_{k≥1} and (β_k)_{k≥1}, and define the horizon n > 1. At each k = 0, 1, …, n − 1, when x_k and ψ_k are defined, generate the two random indexes η_k and ξ_k according to the conditional distributions (3) and (4) and calculate the realization of the stochastic gradient ζ_k from (5). Then the variables are updated recursively:

  ψ_{k+1} = ψ_k + γ_k ζ_k,   x_{k+1} = −∇W_{β_k}(ψ_{k+1}),   (8)

where ∇W_β is defined by (7). The nth iteration defines the basic estimate as the convex combination

  x̂_n = ( ∑_{k=0}^n γ_k )^{−1} ∑_{k=0}^n γ_k x_k.   (9)

RECURSIVE ALGORITHM AND MAIN RESULT

Specify the sequences (γ_k)_{k≥0} and (β_k)_{k≥0} as

  γ_k ≡ 1,  β_k = β_0 √(k + 1),  k = 0, 1, …;   (10)

  β_0 = 2 (ln N)^{−1/2}.   (11)

Thus the algorithm takes the form: for all k = 0, 1, …, n,

  ψ_{k+1} = ψ_k + ζ_k,   (12)

  x_{k+1} = −∇W_{β_k}(ψ_{k+1}),   (13)

  x̂_{k+1} = x̂_k − (1/(k + 1)) (x̂_k − x_k),   (14)

with the initial values ψ_0 = 0, x_0 ∈ Θ_N and with the parameters (β_k)_{k≥0} from (10), (11); the map ∇W_β in (13) is given by (7), and the vector ζ_k by (5). Note that the recursive form (14) does not use the horizon n in advance, so the algorithm becomes completely recursive.

Theorem 1. Let N ≥ 2 and let the estimate x̂_n be defined by the randomized algorithm (8)–(14) with the stochastic gradient (5). Then, for any n ≥ 1,

  E ‖A x̂_n − x̂_n‖_2^2 ≤ 8 (ln N)^{1/2} √(n + 1) / n.   (15)

The proof exploits the ideas of [7].

DISCUSSION AND COMMENTS

There are several ways to modify the algorithm in order to improve its convergence.

(1) At each iteration k we can generate n_k ∈ ℕ independent realizations ζ_k(t), t = 1, 2, …, n_k, of the stochastic gradient (5) and calculate

  ζ̄_k ≜ (1/n_k) ∑_{t=1}^{n_k} ζ_k(t)

in order to use the arithmetic means ζ̄_k as more precise estimates of the gradients ∇Q(x_k). This can be a source of more effective algorithms and a basis for distributed versions of the method.

(2) There are versions of the stochastic gradient other than (5). For instance, the quadratic function Q(x) can be replaced with other functions.

(3) The choice of the parameters suggested in (10), (11) is not the only possible one. A more flexible strategy can lead to faster convergence. The rate of convergence also depends on the choice of the initial point x_0.

Another problem of interest is the rate of convergence to the eigenvector x* under the assumption of its uniqueness. Indeed, we have a bound for E‖A x̂_n − x̂_n‖_2^2, but what can one say about ‖x̂_n − x*‖_2^2? The following result clarifies the situation.

Proposition 1. Let a stochastic matrix A be irreducible. Then the matrix has a unique eigenvector x* ∈ Θ_N, and there exists c = c(A) > 0 such that

  ‖Ax − x‖_2^2 ≥ c ‖x − x*‖_2^2,  ∀ x ∈ Θ_N.   (16)

However, the constant c depends on A (in particular, on how close the second largest in absolute value eigenvalue of A is to one).
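For concreteness, here is a hedged sketch of the recursive algorithm (12)–(14) with the parameter choice (10), (11); all names are our illustrative assumptions, the stochastic gradient draw (3)–(5) is repeated inline, and the mirror step (13) is implemented as the softmax of −ψ/β_k, which is exactly −∇W_{β_k} from (7).

```python
import numpy as np

def mirror_descent_estimate(A, n, rng=None):
    """Sketch of the recursive algorithm (12)-(14) with parameters (10), (11).

    A : (N, N) column-stochastic matrix; returns the averaged estimate x_hat.
    """
    rng = rng or np.random.default_rng()
    N = A.shape[0]
    beta0 = 2.0 / np.sqrt(np.log(N))        # Eq. (11)
    x = np.full(N, 1.0 / N)                 # x_0: uniform point of the simplex
    psi = np.zeros(N)                       # dual variable, psi_0 = 0
    x_hat = x.copy()                        # running average estimate
    for k in range(n):
        x_hat = x_hat - (x_hat - x) / (k + 1)        # Eq. (14): average of x_0..x_k
        eta = rng.choice(N, p=x)                     # Eq. (3)
        xi = rng.choice(N, p=A[:, eta])              # Eq. (4)
        zeta = A[xi, :] - A[eta, :] - A[:, eta] + x  # Eq. (5)
        psi = psi + zeta                             # Eq. (12), gamma_k = 1
        beta_k = beta0 * np.sqrt(k + 1)              # Eq. (10)
        w = np.exp(-(psi - psi.min()) / beta_k)      # Eq. (13): softmax of -psi/beta_k
        x = w / w.sum()                              # (shifted by psi.min() for stability)
    return x_hat
```

Each pass of the loop touches only one column and two rows of A, so with a sparse representation of A the per-iteration cost is small compared with the full matrix-vector product of the power method.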
APPLICATION TO PAGERANK

The famous PageRank problem [1–3] can be treated within the above problem statement; however, some details are worth mentioning. The initial link
matrix A represents the web graph with N vertices (pages). Its entries are a_ij = 1/n_j if page j has an outgoing link to page i (n_j being the total number of outgoing links of page j) and a_ij = 0 otherwise. The desired ranks x^(j) of the pages satisfy the equation Ax = x. The matrix A may fail to be stochastic because of dangling nodes, i.e., pages having no outgoing links. To avoid this difficulty, the matrix is redefined by setting a_ij = 1/N, i = 1, 2, …, N, for every dangling node j. Now the matrix A is stochastic, and an eigenvector x* ∈ Θ_N with Ax* = x* always exists; however, it may be non-unique. Traditionally, to overcome this difficulty, researchers deal with the transformed matrix

  M = (1 − m) A + (m/N) S,

where m ∈ (0, 1) is a parameter and S is the matrix with all entries equal to 1. The resulting matrix M has all positive entries, the vector x* is unique, and the power method can be applied; its rate of convergence is ‖x_n − x*‖_2 ≤ c (1 − m)^n [2]. As proposed in [1], m = 0.15 is the standard value. Research confirms that the dependence on the parameter m can be rather strong [2], and it is not obvious what meaning the solution x* has for m = 0.15, which is not small. The benefit of the approach presented in our paper is the opportunity to deal with small m > 0, or even with m = 0, when the power method converges slowly or diverges.

The matrix A in PageRank problems is very sparse; moreover, only the link graph has to be stored. This allows the proposed method to be implemented efficiently. The results of numerical simulation are still preliminary. We tested the method on models of PageRank problems with N of order 1000–10 000; such calculations can be performed on a standard PC. It was hard to obtain high accuracy of the solution; however, in real-life PageRank only the pages with relatively high rank are of interest, and these ranks could be reconstructed rather well.
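As an illustration of the construction just described, here is a minimal sketch with hypothetical names (out_links, pagerank_matrices); it builds dense matrices only for clarity, whereas a real implementation would store just the sparse link graph.

```python
import numpy as np

def pagerank_matrices(out_links, N, m=0.15):
    """Assemble the column-stochastic link matrix A (with the dangling-node
    fix a_ij = 1/N) and the matrix M = (1 - m) A + (m/N) S.

    out_links : dict mapping page j to the list of pages it links to.
    """
    A = np.zeros((N, N))
    for j in range(N):
        targets = out_links.get(j, [])
        if targets:
            A[targets, j] = 1.0 / len(targets)  # a_ij = 1/n_j if j links to i
        else:
            A[:, j] = 1.0 / N                   # dangling node j
    M = (1.0 - m) * A + m / N                   # S is the all-ones matrix
    return A, M

# Example: a 3-page graph; with m = 0 the algorithm above can be run on A directly.
A, M = pagerank_matrices({0: [1, 2], 1: [2]}, N=3)
```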
CONCLUSIONS

A novel randomized method for finding an eigenvector corresponding to the eigenvalue 1 of a stochastic matrix has been proposed. The obtained bound on its rate of convergence is non-asymptotic and has an explicit numerical factor. Moreover, the bound is valid for the whole class of stochastic matrices and does not depend on the properties of the individual matrix. The method can be applied to ranking problems, especially to PageRank problems with a small parameter m. Further work on acceleration of the method is desirable.

ACKNOWLEDGMENTS

The authors are grateful to Roberto Tempo, who attracted their attention to the PageRank problem, to Arkadii Nemirovski for important ideas on randomized methods, and to Elena Gryazina, who performed the calculations.

REFERENCES

1. S. Brin and L. Page, Comput. Netw. ISDN Syst. 30 (1/7), 107–117 (1998).
2. A. N. Langville and C. D. Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings (Princeton Univ. Press, Princeton, 2006).
3. H. Ishii and R. Tempo, Proceedings of the XLVII IEEE Conference on Decision and Control (Mexico, 2008), pp. 3523–3528; pp. 3529–3534.
4. R. Tempo, G. Calafiore, and F. Dabbene, Randomized Algorithms for Analysis and Control of Uncertain Systems (Springer-Verlag, London, 2005).
5. A. S. Nemirovskii and D. B. Yudin, Complexity of Problems and Efficiency of Optimization Methods (Nauka, Moscow, 1979) [in Russian].
6. Yu. Nesterov, Math. Program. 103 (1), 127–152 (2005).
7. A. B. Yuditskii, A. V. Nazin, A. B. Tsybakov, and N. Vayatis, Probl. Peredachi Inf. 41 (4), 78–96 (2005).
8. A. Juditsky, G. Lan, A. Nemirovski, and A. Shapiro, SIAM J. Optim. http://www.optimization-online.org/DB-HTML/2007/09/1787.html
9. A. Juditsky, A. Nemirovski, and C. Tauvel, SIAM J. Contr. Optim. http://arxiv.org/PS-cache/arxiv/pdf/0809/0809.0815v1.pdf
10. A. V. Nazin, VIII Meeting on Mathematical Statistics (CIRM, Luminy, 2008); http://www.cirm.univ-mrs.fr
11. R. T. Rockafellar and R. J. B. Wets, Variational Analysis (Springer-Verlag, New York, 1998).
12. A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications (SIAM, Philadelphia, 2001).