Interdisciplinary Information Sciences, Vol. 15, No. 2 (2009), 291–299. Graduate School of Information Sciences, Tohoku University. ISSN 1340-9050 (print), 1347-6157 (online). DOI: 10.4036/iis.2009.291
Convergence Rates to Optimal Distribution of the Boltzmann Machine on Simulated Annealing

Hajime URAKAWA

Division of Mathematics, Graduate School of Information Sciences, Tohoku University, Aoba 6-3-09, Sendai 980-8579, Japan

Received November 17, 2008; final version accepted March 2, 2009

In this paper, explicit estimates of the convergence rates of the Boltzmann machine on simulated annealing are given. As applications, quantitative bounds on the temperature and the transition times of the Boltzmann machine to the optimal distribution are given.

KEYWORDS: Boltzmann machine, Markov chain, Gibbs matrix, total variation distance, convergence rates

2000 Mathematics Subject Classification: Primary 60J05; Secondary 62L20.
Supported by the Grant-in-Aid for Scientific Research, (A)19204004 and (C)21540207, Japan Society for the Promotion of Science.
Corresponding author. E-mail: [email protected]
1. Introduction

Since the nineteen eighties, many studies of simulated annealing, the so-called Boltzmann machine searching for the minimum of a given function, have been carried out. However, effective estimates of its convergence rates seem to be still unknown, even though the convergence rates of Markov chains have been studied widely (cf. [2], [3], [4], [5], [7], [10], [11], [12], [13], [14], [15], [16], [18]). In this paper, applying the result of Diaconis and Stroock [3] on estimates of the convergence rates of Markov chains, we give an explicit estimate of the convergence rate of the Boltzmann machine to the optimal distribution (cf. Theorem 1.2). Moreover, we give concrete estimates of the temperature and transition time needed for the Gibbs-Boltzmann distributions of simulated annealing to approximate the optimal distribution in the simplest cases, in which the number of states is four or eight (cf. Theorem 1.3 and Proposition 4.1). Even though these are the simplest cases, our situation is completely general, and our estimates explain the character of the Boltzmann machine very well (see Remark 1.4).

Let us explain our situation more precisely. We consider a function $F$ on a finite set $S$ attaining its minimum $F_{\min}$ on a subset $S_0$ of $S$, and the optimal distribution $q_0$ given by

$$ q_0(x) := \begin{cases} \dfrac{1}{|S_0|}, & x \in S_0, \\[4pt] 0, & x \in S \setminus S_0. \end{cases} $$

It is well known (cf. [1], for example) that (1) the Gibbs distribution $q_T$ with temperature $T > 0$ converges to $q_0$ as $T$ tends to $0$, and (2) the distribution $r\,G_T^m$ converges to the Gibbs distribution $q_T$ as $m$ tends to infinity, for every initial distribution $r$. But it has so far been unknown, rigorously, how small $T$ and how large $m$ one should take in order that $q_T$ and $r\,G_T^m$ approximate $q_0$ closely enough. In this paper, we answer this question.

Let us recall several notations which are necessary for later use. For every distribution $r$ on $S$, the distribution $r\,G_T^m$ on $S$ is given by

$$ (r\,G_T^m)(x) := \sum_{y \in S} r(y)\,G_T^m(y, x), $$
where $G_T^m$ is the $m$-th power of the Gibbs matrix $G_T$ with temperature $T$, that is,

$$ G_T^m(x, y) := \sum_{z_1, \dots, z_{m-1} \in S} G_T(x, z_1) \cdots G_T(z_{m-1}, y). $$
The Gibbs matrix $G_T = (G_T(x, y))_{x, y \in S}$ is defined by
$$ G_T(x, y) = \begin{cases} P(x, y)\,A_T(x, y) & \text{if } x \neq y, \\[4pt] 1 - \displaystyle\sum_{z \neq x} P(x, z)\,A_T(x, z) & \text{if } x = y, \end{cases} \qquad (1.1) $$

where
$P = (P(x, y))_{x, y \in S}$ is the generating probability matrix and $A_T = (A_T(x, y))_{x, y \in S}$ is the accepting matrix, which is given by

$$ A_T(x, y) = g\!\left( \frac{q_T(y)}{q_T(x)} \right), \qquad (1.2) $$

where $g(u)$ is the accepting function, mapping the open interval $(0, \infty)$ into the interval $(0, 1]$ and satisfying

$$ g(u) = u\, g\!\left( \frac{1}{u} \right), \qquad 0 < u < \infty. $$
Furthermore, let us recall the definition of the Gibbs distribution $q_T$:

$$ q_T(x) = \frac{e^{-\frac{1}{T} F(x)}}{Z}, \qquad x \in S, \qquad (1.3) $$

where $Z$ is the normalizing constant (the partition function), defined by

$$ Z = \sum_{y \in S} e^{-\frac{1}{T} F(y)}. \qquad (1.4) $$
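As a computational illustration (not part of the argument above), the Gibbs distribution (1.3)-(1.4) can be evaluated directly; in the following Python sketch, the state set, the values of $F$, and the temperatures are assumptions chosen only for illustration.

    import math

    def gibbs_distribution(F, states, T):
        # q_T(x) = exp(-F(x)/T) / Z, with Z the normalizing constant (1.4)
        weights = {x: math.exp(-F[x] / T) for x in states}
        Z = sum(weights.values())
        return {x: wt / Z for x, wt in weights.items()}

    # Four states with the values of (1.11): 0 > -v1 > -v2 > -w12 - v1 - v2
    v1, v2, w12 = 0.5, 0.8, 1.0
    F = {"a": 0.0, "b": -v1, "c": -v2, "d": -w12 - v1 - v2}
    for T in (2.0, 0.5, 0.1):
        print(T, gibbs_distribution(F, F.keys(), T))
    # As T -> 0, q_T concentrates on the minimizer d, that is, q_T -> q_0.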
Then, $G_T$ is reversible with respect to $q_T$, that is,

$$ q_T(x)\,G_T(x, y) = q_T(y)\,G_T(y, x), \qquad \forall\, x, y \in S, \qquad (1.5) $$

and $q_T$ is the equilibrium distribution, that is,

$$ \sum_{x \in S} q_T(x)\,G_T(x, y) = q_T(y), \qquad \forall\, y \in S. \qquad (1.6) $$
It is well known that $G_T$ is irreducible, that is, for all $x$ and $y$ in $S$ there exists a positive integer $k$ such that $G_T^k(x, y) > 0$. Furthermore, $G_T$ is aperiodic, that is, $\gcd\{k;\ G_T^k(x, x) > 0\} = 1$. By the ergodic theorem (cf. [1]), the following holds:

Theorem 1.1. For every initial distribution $r$,

$$ \lim_{m \to \infty} r\,G_T^m = q_T. \qquad (1.7) $$
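The construction (1.1)-(1.2) and the convergence (1.7) are easy to check numerically. The following sketch assumes a uniform proposal matrix $P$ on four states and the accepting function $\varphi$ of (1.12); these choices, and the values of $F$ and $T$, are illustrative assumptions only.

    import numpy as np

    def gibbs_matrix(P, qT, g):
        # Gibbs matrix (1.1)-(1.2): off-diagonal P(x,y) g(qT(y)/qT(x)),
        # diagonal chosen so that each row sums to one
        G = np.where(~np.eye(len(qT), dtype=bool), P * g(qT / qT[:, None]), 0.0)
        np.fill_diagonal(G, 1.0 - G.sum(axis=1))
        return G

    g = lambda u: u / (1.0 + u)                   # accepting function (1.12)
    F = np.array([0.0, -1.0, -2.0, -4.0]); T = 1.0
    qT = np.exp(-F / T); qT /= qT.sum()           # Gibbs distribution (1.3)-(1.4)
    P = np.full((4, 4), 1/3); np.fill_diagonal(P, 0.0)
    G = gibbs_matrix(P, qT, g)
    M = qT[:, None] * G
    assert np.allclose(M, M.T)                    # reversibility (1.5)
    r = np.array([1.0, 0.0, 0.0, 0.0])            # an initial distribution
    assert np.allclose(r @ np.linalg.matrix_power(G, 1000), qT, atol=1e-8)  # (1.7)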
Now, let us recall the notion of the total variation distance between two probability distributions $\mu$ and $\mu'$ on $S$:

$$ \|\mu - \mu'\|_{TV} := \sup\{|\mu(A) - \mu'(A)| : A \in \mathcal{B}\} = \frac{1}{2} \sum_{x \in S} |\mu(x) - \mu'(x)|, \qquad (1.8) $$

where $\mathcal{B}$ is the $\sigma$-algebra of all Borel sets in $S$. Diaconis and Stroock [3] estimated the total variation distance between any Markov chain $G$ reversible with respect to $\pi$ and the stationary distribution $\pi$ in terms of the eigenvalues of $G$ (see the next section). Applying their result to the Boltzmann machine, we have:

Theorem 1.2. For every $T > 0$ and $m = 1, 2, \dots$,

$$ \|r\,G_T^m - q_0\|_{TV} \le \frac{\sqrt{|S|}}{2} \left( |S_0| + \sum_{y \in S \setminus S_0} e^{\frac{1}{2T}(F(y) - F_{\min})} \right) \beta^m + \frac{\displaystyle\sum_{x \in S \setminus S_0} e^{-\frac{1}{T}(F(x) - F_{\min})}}{|S_0| + \displaystyle\sum_{x \in S \setminus S_0} e^{-\frac{1}{T}(F(x) - F_{\min})}}, \qquad (1.9) $$

where $\beta := \max\{\beta_1, |\beta_{a-1}|\} < 1$. Here, we write the eigenvalues of the Gibbs matrix $G_T$ as

$$ 1 = \beta_0 > \beta_1 \ge \cdots \ge \beta_{a-1} \ge -1 \qquad (\text{where } a := |S|). $$
Applying this estimate (1.9) to the simplest set $S$ of states with $|S| = 4$, we can concretely estimate $T$ and $m$ such that

$$ \|r\,G_T^m - q_0\|_{TV} < \varepsilon = 0.1. \qquad (1.10) $$

For the case $|S| = 8$, see Section 4. In the case $|S| = 4$, more precisely, let us consider a function $F$ on $S = \{a, b, c, d\}$ with the four values

$$ 0 > -v_1 > -v_2 > -w_{12} - v_1 - v_2, \qquad (1.11) $$

without loss of generality. We choose the accepting function $g$ as
$$ g(u) = \frac{u}{1 + u} = \frac{1}{1 + u^{-1}} =: \varphi(u). \qquad (1.12) $$
How small a temperature $0 < T < \infty$ and how large an $m$ should one take in order to find, by simulated annealing, the point of $S = \{a, b, c, d\}$ at which $F$ attains its minimum value $-w_{12} - v_1 - v_2$? Our answer is the following:

Theorem 1.3. If

$$ 0 < T \le \frac{w_{12} + v_1}{2.70806} $$

and

$$ m \ge 31.2415\,\frac{w_{12} + v_1 + v_2}{T} + 169.207 \qquad (1.13) $$

in the case $w_{12} > 0$, or

$$ m \ge 1.06382\,\frac{w_{12} + v_1 + v_2}{T} + 5.76176 \qquad (1.14) $$

in the case $w_{12} < 0$, then $\|r\,G_T^m - q_0\|_{TV} < \varepsilon = 0.1$.
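In practice, Theorem 1.3 is used as follows; the parameter values in this short sketch are illustrative assumptions satisfying $0 < v_1 < v_2$ and $0 < w_{12} + v_1$.

    import math

    v1, v2, w12 = 0.5, 0.8, 1.0          # illustrative, with w12 > 0
    T_max = (w12 + v1) / 2.70806         # admissible temperatures: 0 < T <= T_max
    T = T_max
    m_min = 31.2415 * (w12 + v1 + v2) / T + 169.207   # bound (1.13)
    print(f"take T <= {T_max:.4f} and m >= {math.ceil(m_min)} steps")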
Let us recall the estimate of Diaconis and Stroock [3]. Let $G$ be a Markov chain on $S$, reversible with respect to a distribution $\pi$, and write its eigenvalues as

$$ 1 = \beta_0 > \beta_1 \ge \cdots \ge \beta_{a-1} \ge -1, $$

where $a = |S|$. Then, it holds (cf. [3]) that

$$ 2\,\|G^m(x, \cdot) - \pi(\cdot)\|_{TV} \le \left( \frac{1 - \pi(x)}{\pi(x)} \right)^{1/2} \beta^m, \qquad (2.9) $$

where $\beta := \max\{\beta_1, |\beta_{a-1}|\}$. By applying (2.9) to $G_T$, we have

$$ 2\,\|G_T^m(x, \cdot) - q_T(\cdot)\|_{TV} \le \left( \frac{1 - q_T(x)}{q_T(x)} \right)^{1/2} \beta^m. \qquad (2.10) $$
(Proof of Theorem 2.2.) By definition and (1.8), we have

\begin{align*}
\|r\,G_T^m - q_T\|_{TV} &= \Big\| \sum_{y \in S} r(y)\,G_T^m(y, \cdot) - q_T(\cdot) \Big\|_{TV} \\
&= \frac{1}{2} \sum_{x \in S} \Big| \sum_{y \in S} r(y)\,G_T^m(y, x) - q_T(x) \Big| \\
&= \frac{1}{2} \sum_{x \in S} \Big| \sum_{y \in S} r(y)\,\big(G_T^m(y, x) - q_T(x)\big) \Big| \\
&\le \frac{1}{2} \sum_{x \in S} \sum_{y \in S} r(y)\,\big|G_T^m(y, x) - q_T(x)\big| \\
&= \frac{1}{2} \sum_{y \in S} r(y) \sum_{x \in S} \big|G_T^m(y, x) - q_T(x)\big| \\
&= \sum_{y \in S} r(y)\,\|G_T^m(y, \cdot) - q_T(\cdot)\|_{TV} \\
&\le \sum_{y \in S} r(y)\,\frac{1}{2} \left( \frac{1 - q_T(y)}{q_T(y)} \right)^{1/2} \beta^m, \qquad (2.11)
\end{align*}

where the last inequality follows from (2.10). This is the desired estimate. $\square$
To estimate the right hand side of (2.8), we have:

Proposition 2.3. (1) For the upper bound,

$$ \sum_{y \in S} r(y) \left( \frac{1 - q_T(y)}{q_T(y)} \right)^{1/2} \le |S|^{1/2} \left( |S_0| + \sum_{y \in S \setminus S_0} e^{\frac{1}{2T}(F(y) - F_{\min})} \right). \qquad (2.12) $$

(2) For the lower bound,

$$ \sum_{y \in S} r(y) \left( \frac{1 - q_T(y)}{q_T(y)} \right)^{1/2} \ge |S_0|^{1/2} \sum_{y \in S \setminus S_0} r(y)\, e^{\frac{1}{2T}(F(y) - F_{\min})}. \qquad (2.13) $$
Proof. (1) Since $0 \le r(y) \le 1$ and $q_T(y) \le 1$, we have

$$ \sum_{y \in S} r(y) \left( \frac{1 - q_T(y)}{q_T(y)} \right)^{1/2} \le \sum_{y \in S} q_T(y)^{-1/2} = Z^{1/2} \sum_{y \in S} e^{\frac{1}{2T} F(y)}. \qquad (2.14) $$

Here, by the definition of $Z$, we have

$$ e^{\frac{1}{T} F_{\min}}\, Z = |S_0| + \sum_{z \in S \setminus S_0} e^{-\frac{1}{T}(F(z) - F_{\min})} \le |S_0| + |S \setminus S_0| = |S|. \qquad (2.15) $$

Then, by (2.15), we have

\begin{align*}
Z^{1/2} \sum_{y \in S} e^{\frac{1}{2T} F(y)} &\le |S|^{1/2}\, e^{-\frac{1}{2T} F_{\min}} \sum_{y \in S} e^{\frac{1}{2T} F(y)} = |S|^{1/2} \sum_{y \in S} e^{\frac{1}{2T}(F(y) - F_{\min})} \\
&= |S|^{1/2} \left( |S_0| + \sum_{y \in S \setminus S_0} e^{\frac{1}{2T}(F(y) - F_{\min})} \right),
\end{align*}

which is (2.12).

(2) For (2), by the definition of $q_T$, we have, for every $y_0 \in S \setminus S_0$,

$$ \frac{1 - q_T(y_0)}{q_T(y_0)} = \frac{1 - e^{-\frac{1}{T} F(y_0)}/Z}{e^{-\frac{1}{T} F(y_0)}/Z} = e^{\frac{1}{T} F(y_0)}\, Z - 1. \qquad (2.16) $$

Here, we have

\begin{align*}
e^{\frac{1}{T} F(y_0)}\, Z &= e^{\frac{1}{T} F(y_0)} \left( |S_0|\, e^{-\frac{1}{T} F_{\min}} + \sum_{y \in S \setminus S_0} e^{-\frac{1}{T} F(y)} \right) \\
&= |S_0|\, e^{\frac{1}{T}(F(y_0) - F_{\min})} + \sum_{y \in S \setminus S_0} e^{\frac{1}{T}(F(y_0) - F(y))} \\
&= |S_0|\, e^{\frac{1}{T}(F(y_0) - F_{\min})} + 1 + \sum_{y_0 \neq y \in S \setminus S_0} e^{\frac{1}{T}(F(y_0) - F(y))} \\
&\ge |S_0|\, e^{\frac{1}{T}(F(y_0) - F_{\min})} + 1. \qquad (2.17)
\end{align*}

Thus, we have, for every $y_0 \in S \setminus S_0$,

$$ \frac{1 - q_T(y_0)}{q_T(y_0)} \ge |S_0|\, e^{\frac{1}{T}(F(y_0) - F_{\min})}, \qquad (2.18) $$

which yields (2.13). $\square$

By Theorems 2.1, 2.2 and Proposition 2.3, we have Theorem 1.2.
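Both bounds of Proposition 2.3 can be sanity-checked numerically; in the following sketch, the energy values, the temperature, and the initial distribution $r$ are illustrative assumptions.

    import numpy as np

    F = np.array([0.0, -1.0, -2.0, -4.0]); T = 0.7
    qT = np.exp(-F / T); qT /= qT.sum()
    r = np.full(4, 0.25)                              # an initial distribution
    S0 = F == F.min()                                 # here |S_0| = 1
    gaps = (F - F.min())[~S0]
    lhs = (r * np.sqrt((1 - qT) / qT)).sum()
    upper = np.sqrt(len(F)) * (S0.sum() + np.exp(gaps / (2 * T)).sum())  # (2.12)
    lower = np.sqrt(S0.sum()) * (r[~S0] * np.exp(gaps / (2 * T))).sum()  # (2.13)
    assert lower <= lhs <= upper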
3. The Boltzmann Machine of Neural Networks

Now we apply our estimates to the Boltzmann machine coming from neural networks. The set $S$ of states is given by $S = \{(x_1, x_2, \dots, x_n);\ x_i = 0 \text{ or } 1\ (i = 1, \dots, n)\}$, so that $|S| = 2^n$, and our function $F$ on $S$ is given by

$$ F(x) = -\frac{1}{2} \sum_{i,j=1}^{n} w_{ij}\, x_i x_j - \sum_{i=1}^{n} v_i\, x_i, \qquad (3.1) $$

where $w_{ii} = 0$ and $w_{ij} = w_{ji}$ $(i, j = 1, \dots, n)$. For two states $x$ and $y$ in $S$ with $x \neq y$, the transition probability from $x$ to $y$ is given by $G_T(x, y) = P(x, y)\,\varphi(q_T(y)/q_T(x))$, where the generating probability is $P(x, y) = \frac{1}{n}$ if $x$ and $y$ differ in exactly one component, and $P(x, y) = 0$ otherwise (cf. (4.1) below). For the case $|S| = 4$ $(n = 2)$, writing the relevant acceptance probabilities of $G_T$ as $A$, $B$, $C$, $D$, a direct computation of the eigenvalues of $G_T$ shows that $-1 < (A - C)(B - D) < 0$. Thus, we obtain

$$ \beta = \frac{1}{2} \left( 1 + \sqrt{1 + (A - C)(B - D)} \right). \qquad (3.6) $$
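The energy (3.1) and the single-flip generating matrix are straightforward to realize in code. In the following sketch, the weights and thresholds are illustrative assumptions; for $n = 2$ they reproduce the four values of (1.11).

    import itertools
    import numpy as np

    def energy(x, w, v):
        # F(x) = -(1/2) sum_{ij} w_ij x_i x_j - sum_i v_i x_i, as in (3.1)
        x = np.asarray(x, dtype=float)
        return -0.5 * x @ w @ x - v @ x

    def generating_matrix(n):
        # P(x,y) = 1/n iff x and y differ in exactly one component
        states = list(itertools.product((0, 1), repeat=n))
        P = np.zeros((2 ** n, 2 ** n))
        for i, x in enumerate(states):
            for j, y in enumerate(states):
                if sum(a != b for a, b in zip(x, y)) == 1:
                    P[i, j] = 1.0 / n
        return states, P

    v1, v2, w12 = 0.5, 0.8, 1.0
    w = np.array([[0.0, w12], [w12, 0.0]])            # w_ii = 0, w_ij = w_ji
    v = np.array([v1, v2])
    states, P = generating_matrix(2)
    print({x: energy(x, w, v) for x in states})
    # {(0,0): 0, (1,0): -v1, (0,1): -v2, (1,1): -w12 - v1 - v2}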
Now, we take $\varepsilon = \frac{1}{10}$, $|S| = 4$, $|S_0| = 1$ and $|S \setminus S_0| = 3$. By Theorems 1.2 or 2.1, if $0 < T \le T_0$, then

$$ \|q_T - q_0\|_{TV} \le \frac{1}{20}, \qquad (3.7) $$

where

$$ T_0 = \frac{F_1 - F_{\min}}{\log \frac{1}{\varepsilon} + \log 2 - \log |S| + \log |S \setminus S_0|} = \frac{w_{12} + v_1}{2.70806}. \qquad (3.8) $$
Here, we put

$$ F_{\max} = 0 > F_2 = -v_1 > F_1 = -v_2 > F_{\min} = -w_{12} - v_1 - v_2, \qquad (3.9) $$

satisfying

$$ 0 < v_1 < v_2 \qquad \text{and} \qquad 0 < w_{12} + v_1. \qquad (3.10) $$
On the other hand, for the above $T > 0$, by Theorems 1.1 or 2.2 and Proposition 2.3, to satisfy $\|r\,G_T^m - q_T\|_{TV} < \frac{\varepsilon}{2}$, it suffices to take any $m$ satisfying

$$ \frac{|S|^{3/2}}{2}\, \beta^m\, e^{\frac{1}{2T}(F_{\max} - F_{\min})} < \frac{\varepsilon}{2}, \qquad (3.11) $$

which is equivalent to the following:

\begin{align*}
m &> \frac{1}{\log \frac{1}{\beta}} \left( \log \frac{1}{\varepsilon} + \frac{3}{2} \log |S| + \frac{1}{2T}(F_{\max} - F_{\min}) \right) \\
&= \frac{1}{\log \frac{1}{\beta}} \left( \log 10 + 3 \log 2 + \frac{1}{2T}(w_{12} + v_1 + v_2) \right) \\
&= \frac{1}{\log \frac{1}{\beta}} \left( 4.38203 + \frac{1}{2T}(w_{12} + v_1 + v_2) \right), \qquad (3.12)
\end{align*}

since $\varepsilon = \frac{1}{10}$, $|S| = 4$, $F_{\max} = 0$, and $F_{\min} = -w_{12} - v_1 - v_2$. Here, $\beta = \frac{1}{2}(1 + x)$ with $0 < x < 1$, where $x$ is given by

$$ x = \left\{ \frac{e^{-\frac{v_1}{T_0}}}{1 + e^{-\frac{v_1}{T_0}}} \right\}^{1/2} \left\{ \frac{e^{-\frac{v_2}{T_0}}}{1 + e^{-\frac{v_2}{T_0}}} \right\}^{1/2} \left\{ \frac{1}{1 + e^{-\frac{w_{12}+v_1}{T_0}}} \right\}^{1/2} \left\{ \frac{1}{1 + e^{-\frac{w_{12}+v_2}{T_0}}} \right\}^{1/2} \left| e^{-\frac{w_{12}}{T_0}} - 1 \right| \qquad (3.13) $$

in the case $w_{12} \ge 0$, and by

$$ x = \left\{ \frac{e^{-\frac{v_1}{T_0}}}{1 + e^{-\frac{v_1}{T_0}}} \right\}^{1/2} \left\{ \frac{e^{-\frac{v_2}{T_0}}}{1 + e^{-\frac{v_2}{T_0}}} \right\}^{1/2} \left\{ \frac{1}{1 + e^{\frac{w_{12}+v_1}{T_0}}} \right\}^{1/2} \left\{ \frac{1}{1 + e^{\frac{w_{12}+v_2}{T_0}}} \right\}^{1/2} \left| e^{\frac{w_{12}}{T_0}} - 1 \right| \qquad (3.14) $$

in the case $w_{12} < 0$. If $w_{12} \ge 0$, by (3.13), we have

$$ \frac{1}{\log \frac{1}{\beta}} \le 62.9842, \qquad (3.15) $$

and if $w_{12} < 0$, by (3.14), we have

$$ \frac{1}{\log \frac{1}{\beta}} \le 2.12764. \qquad (3.16) $$

We obtain the desired lower bound of $m$.
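Numerically, (3.8) and (3.12) can be combined as follows; the parameter values, and the computation of $\beta$ directly from the spectrum of $G_T$ rather than from (3.13)-(3.16), are assumptions of this sketch.

    import numpy as np

    v1, v2, w12 = 0.5, 0.8, 1.0; eps = 0.1
    T0 = (w12 + v1) / (np.log(1 / eps) + np.log(2) - np.log(4) + np.log(3))  # (3.8)

    # beta from the spectrum of G_T at T = T0, built as in Sections 1 and 3,
    # with states ordered (0,0), (1,0), (0,1), (1,1)
    g = lambda u: u / (1.0 + u)
    F = np.array([0.0, -v1, -v2, -w12 - v1 - v2])
    P = np.array([[0, .5, .5, 0], [.5, 0, 0, .5],
                  [.5, 0, 0, .5], [0, .5, .5, 0]])    # single-flip proposal, n = 2
    qT = np.exp(-F / T0); qT /= qT.sum()
    G = np.where(P > 0, P * g(qT / qT[:, None]), 0.0)
    np.fill_diagonal(G, 1.0 - G.sum(axis=1))
    eigs = np.sort(np.linalg.eigvals(G).real)
    beta = max(eigs[-2], abs(eigs[0]))

    m_min = (np.log(1 / eps) + 1.5 * np.log(4)
             + (w12 + v1 + v2) / (2 * T0)) / np.log(1 / beta)                # (3.12)
    print(T0, beta, int(np.ceil(m_min)))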
4. The Boltzmann Machine for the Case $|S| = 8$

In this section, we retain the general setting of the Boltzmann machine for neural networks and set $|S| = 8$ $(n = 3)$. For simplicity, we assume that

$$ v_1 = v_2 = v_3 = v > 0, \qquad w_{12} = w_{13} = w_{23} = w > 0. $$

Then, the matrices $P$, $A_T$, and $G_T$ can be calculated as follows. With respect to the ordering of states $(0,0,0)$, $(1,0,0)$, $(0,1,0)$, $(1,1,0)$, $(0,0,1)$, $(1,0,1)$, $(0,1,1)$, $(1,1,1)$,

$$ P = \begin{pmatrix}
0 & \frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & 0 \\
\frac{1}{3} & 0 & 0 & \frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 \\
\frac{1}{3} & 0 & 0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & 0 \\
0 & \frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & 0 & \frac{1}{3} \\
\frac{1}{3} & 0 & 0 & 0 & 0 & \frac{1}{3} & \frac{1}{3} & 0 \\
0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} \\
0 & 0 & \frac{1}{3} & 0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} \\
0 & 0 & 0 & \frac{1}{3} & 0 & \frac{1}{3} & \frac{1}{3} & 0
\end{pmatrix}, \qquad (4.1) $$
Fig. 1. Neural network for $n = 3$.
and $G_T$ is equal to

$$ G_T = \begin{pmatrix}
1 - 3a & a & a & 0 & a & 0 & 0 & 0 \\
\frac{1}{3} - a & \frac{2}{3} + a - 2b & 0 & b & 0 & b & 0 & 0 \\
\frac{1}{3} - a & 0 & \frac{2}{3} + a - 2b & b & 0 & 0 & b & 0 \\
0 & \frac{1}{3} - b & \frac{1}{3} - b & \frac{1}{3} + 2b - c & 0 & 0 & 0 & c \\
\frac{1}{3} - a & 0 & 0 & 0 & \frac{2}{3} + a - 2b & b & b & 0 \\
0 & \frac{1}{3} - b & 0 & 0 & \frac{1}{3} - b & \frac{1}{3} + 2b - c & 0 & c \\
0 & 0 & \frac{1}{3} - b & 0 & \frac{1}{3} - b & 0 & \frac{1}{3} + 2b - c & c \\
0 & 0 & 0 & \frac{1}{3} - c & 0 & \frac{1}{3} - c & \frac{1}{3} - c & 3c
\end{pmatrix}, $$

where $a = \frac{1}{3}\varphi\big(e^{\frac{v}{T}}\big)$, $b = \frac{1}{3}\varphi\big(e^{\frac{w+v}{T}}\big)$, and $c = \frac{1}{3}\varphi\big(e^{\frac{2w+v}{T}}\big)$. By using Mathematica, the eigenvalues of $G_T$ are given as follows:

$$ 1, \quad 0, \quad \frac{1}{2}(1 + a - c \pm \eta) \ (\text{each with multiplicity } 2), \quad \frac{1}{2}(1 - a + 2c \pm \zeta), $$

where

\begin{align*}
\eta &= \frac{1}{3} \sqrt{1 + 6a + 9a^2 - 12b - 72ab + 108b^2 + 6c + 18ac - 72bc + 9c^2}, \\
\zeta &= \frac{1}{3} \sqrt{1 - 12a + 36a^2 + 24b - 72ab - 12c + 72ac - 72bc + 36c^2}.
\end{align*}

Then, we obtain that the $\beta$ in Theorem 1.2 is given by

$$ \beta = \max\left\{ \frac{1}{2}(1 + a - c + \eta),\ \frac{1}{2}(1 - a + 2c + \zeta) \right\}. \qquad (4.2) $$

Note that $0 < \beta < 1$. The set of values of the function $F$ is given as
$$ F_{\max} = 0 > F_2 = -v > F_1 = -w - 2v > F_{\min} = -3w - 3v, \qquad (4.3) $$

and $S_0 = \{x \in S;\ F(x) = F_{\min}\} = \{(1,1,1)\}$, so we have $|S_0| = 1$ and $|S \setminus S_0| = 7$. Then, the right hand side of (1.9) of Theorem 1.2 can be calculated as follows. The first term of the right hand side of (1.9) coincides with

$$ \frac{\sqrt{8}}{2} \left( 1 + e^{\frac{1}{2T}(3w + 3v)} + 3\, e^{\frac{1}{2T}(3w + 2v)} + 3\, e^{\frac{1}{2T}(2w + v)} \right) \beta^m \le 8\sqrt{2}\, \beta^m\, e^{\frac{1}{2T}(3w + 3v)}, \qquad (4.4) $$

and the second term of the right hand side of (1.9) coincides with

$$ \frac{e^{-\frac{1}{T}(3w + 3v)} + 3\, e^{-\frac{1}{T}(3w + 2v)} + 3\, e^{-\frac{1}{T}(2w + v)}}{1 + e^{-\frac{1}{T}(3w + 3v)} + 3\, e^{-\frac{1}{T}(3w + 2v)} + 3\, e^{-\frac{1}{T}(2w + v)}} \le 7\, e^{-\frac{1}{T}(2w + v)}. \qquad (4.5) $$

Therefore, the right hand side of (1.9) is smaller than or equal to

$$ 8\sqrt{2}\, \beta^m\, e^{\frac{1}{2T}(3w + 3v)} + 7\, e^{-\frac{1}{T}(2w + v)}. \qquad (4.6) $$
Thus, we obtain:

Proposition 4.1. For every $T > 0$ and $m = 1, 2, \dots$,

$$ \|r\,G_T^m - q_0\|_{TV} \le 8\sqrt{2}\, \beta^m\, e^{\frac{1}{2T}(3w + 3v)} + 7\, e^{-\frac{1}{T}(2w + v)}, \qquad (4.7) $$

where $\beta$ is given by (4.2).
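Proposition 4.1 admits a direct numerical check; in the following sketch, the values of $w$, $v$, $T$ and the initial state are illustrative assumptions.

    import itertools
    import numpy as np

    w, v, T = 1.0, 0.5, 1.0
    states = list(itertools.product((0, 1), repeat=3))
    W = w * (np.ones((3, 3)) - np.eye(3))             # w_12 = w_13 = w_23 = w
    F = np.array([-0.5 * np.array(x) @ W @ np.array(x) - v * sum(x) for x in states])
    g = lambda u: u / (1.0 + u)
    P = np.zeros((8, 8))
    for i, x in enumerate(states):
        for j, y in enumerate(states):
            if sum(a != b for a, b in zip(x, y)) == 1:
                P[i, j] = 1 / 3                       # the matrix (4.1)
    qT = np.exp(-F / T); qT /= qT.sum()
    G = np.where(P > 0, P * g(qT / qT[:, None]), 0.0)
    np.fill_diagonal(G, 1.0 - G.sum(axis=1))
    eigs = np.sort(np.linalg.eigvals(G).real)
    beta = max(eigs[-2], abs(eigs[0]))                # the beta of (4.2)
    q0 = (F == F.min()).astype(float)                 # concentrated on (1,1,1)
    for m in (50, 200, 800):
        tv = 0.5 * np.abs(np.linalg.matrix_power(G, m)[0] - q0).sum()
        rhs = 8 * np.sqrt(2) * beta ** m * np.exp((3*w + 3*v) / (2*T)) \
              + 7 * np.exp(-(2*w + v) / T)
        print(m, tv, rhs)                             # (4.7): tv <= rhs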
REFERENCES
[1] Aarts, E., and Korst, J., Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, Wiley Series in Discrete Mathematics and Optimization, Wiley, New York (1989).
[2] Catoni, O., "Rates of convergence for sequential annealing: a large deviation approach," in Simulated Annealing, Wiley Series in Discrete Mathematics and Optimization, Wiley, New York (1992), 25–35.
[3] Diaconis, P., and Stroock, D., "Geometric bounds for eigenvalues of Markov chains," Ann. Appl. Probability, 1: 36–61 (1991).
[4] Douc, R., Moulines, E., and Rosenthal, J. S., "Quantitative bounds on convergence of time-inhomogeneous Markov chains," Ann. Appl. Probability, 14: 1643–1665 (2004).
[5] Fill, J. A., "Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process," Ann. Appl. Probability, 1: 62–87 (1991).
[6] Jerrum, M., Counting, Sampling and Integrating: Algorithms and Complexity, Lectures in Mathematics, ETH Zürich, Birkhäuser (2003).
[7] Jones, G. L., and Hobert, J. P., "Honest exploration of intractable probability distributions via Markov chain Monte Carlo," Statistical Science, 16: 312–334 (2001).
[8] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., "Optimization by simulated annealing," Science, 220: 671–680 (1983).
[9] Kirkpatrick, S., "Optimization by simulated annealing: quantitative studies," J. Statist. Physics, 34: 975–986 (1984).
[10] Márquez, D., "Convergence rates for annealing diffusion processes," Ann. Appl. Probability, 7: 1118–1139 (1997).
[11] Meyn, S. P., and Tweedie, R. L., "Computable bounds for geometric convergence rates of Markov chains," Ann. Appl. Probability, 4: 981–1011 (1994).
[12] Mengersen, K. L., and Tweedie, R. L., "Rates of convergence of the Hastings and Metropolis algorithms," Ann. Statistics, 24: 101–121 (1996).
[13] Roberts, G. O., and Tweedie, R. L., "Bounds on regeneration times and convergence rates for Markov chains," Stochastic Process. Appl., 80: 211–229 (1999); Corrigendum, 91: 337–338 (2001).
[14] Rosenthal, J. S., "Minorization conditions and convergence rates for Markov chain Monte Carlo," J. Amer. Statist. Assoc., 90: 558–566 (1995).
[15] Rosenthal, J. S., "Quantitative convergence rates of Markov chains: a simple account," Electron. Comm. Probab., 7: 123–128 (2002).
[16] Saloff-Coste, L., Lectures on Finite Markov Chains, Lectures on Probability Theory and Statistics, Springer-Verlag, Berlin Heidelberg (1997).
[17] Saloff-Coste, L., "Probability on groups: random walks and invariant diffusions," Notices Amer. Math. Soc., 48: 968–977 (2001).
[18] Saloff-Coste, L., Random Walks on Finite Groups, Encyclopaedia Math. Sci., Springer-Verlag, Berlin Heidelberg (2004), 263–346.
[19] Sinclair, A., and Jerrum, M., "Approximate counting, uniform generation and rapidly mixing Markov chains," Information and Computation, 82: 93–133 (1989).
[20] Urakawa, H., "Convergence rates to equilibrium of the heat kernels on compact Riemannian manifolds," Indiana Univ. Math. J., 55: 259–288 (2006).