SELF-AVERAGING IN A CLASS OF GENERALIZED HOPFIELD MODELS#

Anton Bovier 1

Weierstraß-Institut für Angewandte Analysis und Stochastik, Mohrenstraße 39, D-10117 Berlin, Germany

Abstract: We prove the almost sure convergence to zero of the fluctuations of the free energy, resp. local free energies, in a class of disordered mean-field spin systems that generalize the Hopfield model in two ways: 1) multi-spin interactions are permitted, and 2) the random variables ξ_i^μ describing the 'patterns' can have arbitrary distributions with mean zero and finite 4+ε-th moments. The number of patterns, M, is allowed to be an arbitrary multiple of the system size. This generalizes a previous result of Bovier, Gayrard, and Picco [BGP3] for the standard Hopfield model, and improves a result of Feng and Tirozzi [FT] that required M to be a finite constant. Note that the convergence of the mean of the free energy is not proven.

Keywords: Hopfield model, neural networks, self-averaging, law of large numbers.
PACS Classification Numbers: 75.10.Nr, 64.60.Cn, 89.70.+c

# Work partially supported by the Commission of the European Communities under contract CHRX-CT93-0411
1 e-mail: [email protected]
Preprint, 18 November 1995.

I. Introduction

Over the last years some interesting properties of 'self-averaging' have been observed in two classes of 'spin-glass' type models of the mean-field type, the Sherrington-Kirkpatrick model [SK] and the Hopfield model [FP, Ho]. The latter, widely used in the context of neural networks, is perhaps particularly interesting, as it contains a parameter, the number M of stored patterns as a function of the size N of the system, which can be adjusted to alter the properties of the model. In a paper of Pastur and Shcherbina [PS], it was observed that the variance of the free energy of a finite system of size N in the SK-model tends to zero like 1/N, implying the convergence to zero in probability of the difference between the free energy and its mean. This result was later generalized to the Hopfield model by Shcherbina and Tirozzi [ST] under the assumption that the ratio α = M/N remains bounded as N ↑ ∞. Further results of this type can be found in an interesting paper by Pastur, Shcherbina and Tirozzi [PST]. Self-averaging properties of the large deviation rate function as a function of the macroscopic parameters of the model (the so-called 'overlap parameters', see below) were used crucially in two papers by Bovier, Gayrard and Picco [BGP2, BGP3]. There, estimates sharper than variance estimates were needed, and as a consequence [BGP3] contains in particular a proof of the almost sure convergence to zero of the difference between the free energy and its mean, both in the Hopfield model under the assumption that M/N be bounded, and in the SK-model. Independently, Feng and Tirozzi [FT] have recently proven such a result in a class of generalized Hopfield models, however under the very restrictive assumption that M itself be a bounded function of N. The purpose of the present note is to show that such a condition is in fact unnecessary.

Let us describe the class of models we will consider. We denote by S_N ≡ {−1,1}^{Λ_N} the space of functions σ: Λ_N → {−1,1}, where Λ_N ≡ {1,...,N}. We call σ a spin configuration on Λ_N. S ≡ {−1,1}^ℕ denotes the space of half-infinite sequences equipped with the product topology of the discrete topology on {−1,1}. We denote by B_N and B the corresponding Borel sigma-algebras. We will define a random Hamiltonian function on the spaces S_N as follows. Let (Ω, F, IP) be an abstract probability space. Let ξ ≡ {ξ_i^μ}_{i,μ∈ℕ} be a two-parameter family of independent random variables on this space. We will specify our assumptions on their distribution later. In the context of neural networks, one usually assumes that IP(ξ_i^μ = 1) = IP(ξ_i^μ = −1) = 1/2, but here we aim for more general distributions. We consider Hamiltonians of the form

H_N(σ) ≡ − (1/N^{r−1}) Σ_{μ=1}^{M(N)} Σ_{i_1,...,i_r=1}^{N} ξ_{i_1}^μ ··· ξ_{i_r}^μ σ_{i_1} ··· σ_{i_r}    (1.1)

Here r ≥ 2 is some chosen integer. The case r = 2 corresponds to the usual Hopfield model, and models with general r were introduced by Lee et al. [Lee] and Peretto and Niez [PN]. Feng and Tirozzi [FT] also studied these models, but removed the terms in the sum where two or more indices coincide, which actually amounts to adding a term of the order of a constant to H which does not alter the free energy. One may actually consider more general models in which the Hamiltonian is given as a linear combination of terms of the type (1.1) with different values of r. This only complicates but does not really alter the proofs, and our results can easily be extended to this situation.

Let us introduce the so-called 'overlap parameters'. This is the M-dimensional vector m_N(σ) whose components are given by

m_N^μ(σ) ≡ (1/N) Σ_{i=1}^{N} ξ_i^μ σ_i    (1.2)

In terms of these quantities, the Hamiltonian can be written in the very pleasant form

H_N(σ) = −N ‖ m_N(σ) ‖_r^r    (1.3)

We define the partition function

Z_N ≡ 2^{−N} Σ_{σ∈S_N} e^{−β H_N(σ)}    (1.4)

and the free energy

F_N(β) ≡ − (1/βN) ln Z_N(β)    (1.5)

It will be important to realize that

‖ m_N(σ) ‖_2^2 ≤ ‖ A(N) ‖    (1.6)

where A(N) is the N × N matrix whose elements are

A_ij(N) ≡ (1/N) Σ_{μ=1}^{M} ξ_i^μ ξ_j^μ    (1.7)
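The definitions above are easy to check numerically. The following Python sketch (an illustration only, not part of the paper's argument; the small parameters and the Bernoulli ±1 patterns are arbitrary choices) verifies that the multi-spin sum (1.1) factorizes into the overlap form (1.3), and that the inequality (1.6) holds:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, r = 12, 4, 2                          # system size, number of patterns, interaction order
xi = rng.choice([-1.0, 1.0], size=(M, N))   # patterns xi[mu, i]; the classical Bernoulli case
sigma = rng.choice([-1.0, 1.0], size=N)     # one spin configuration

# overlap parameters (1.2): m^mu(sigma) = (1/N) sum_i xi^mu_i sigma_i
m = xi @ sigma / N

# the multi-spin sum (1.1) factorizes as a power of xi.sigma, giving (1.3) for even r
H_direct = -np.sum((xi @ sigma) ** r) / N ** (r - 1)   # form (1.1)
H_overlap = -N * np.sum(m ** r)                        # form (1.3)
assert np.isclose(H_direct, H_overlap)

# the matrix A(N) of (1.7), and the inequality (1.6): ||m||_2^2 <= ||A(N)||
A = xi.T @ xi / N
assert np.sum(m ** 2) <= np.linalg.norm(A, 2) + 1e-12
```

Inequality (1.6) is just the observation that ‖m_N(σ)‖_2² = σᵀA(N)σ/N is a Rayleigh quotient of A(N).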

Properties of the maximal eigenvalue of this matrix will be crucial for us. The eigenvalue distribution of this matrix was first analyzed by Marchenko and Pastur [MP]. Girko [Gi] proved that, under the hypothesis of Theorem 1, the maximal eigenvalue of A(N) converges to (1 + √α)^2 in probability. Adding the ideas used by Bai and Yin [BY], one can easily show that this convergence also takes place almost surely, and even in the case where only the 4-th moment of ξ_i^μ is finite. We will need additional estimates on the moments of ‖A(N)‖ which we are only able to prove if we have a little more than four moments. The relevant estimate is formulated in the following lemma.

Lemma 1.1: Assume that IE ξ_i^μ = 0, IE (ξ_i^μ)^2 = 1 and IE (ξ_i^μ)^{4+ε} ≤ c < ∞, for some ε > 0. Then, for any κ ≥ 6 and any z > 0, if N is sufficiently large,

IP[ ‖A(N)‖ ≥ (1+√α)^2 (1+z) ] ≤ √N (1+z)^{−N^δ/2} + N^{−κδ/2+1} c^{κ/(4+ε)}    (1.8)

where α = M/N and δ = ε/(4(4+ε)).
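The convergence of the maximal eigenvalue to (1 + √α)² is easy to observe numerically. The following sketch (an illustration only; Bernoulli ±1 patterns and the value α = 1/4 are arbitrary choices) prints the largest eigenvalue of A(N) next to this limiting value:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.25                                  # ratio M/N
for N in (200, 800):
    M = int(alpha * N)
    xi = rng.choice([-1.0, 1.0], size=(M, N))
    A = xi.T @ xi / N                         # the matrix (1.7)
    lam_max = np.linalg.eigvalsh(A).max()
    # compare with the limiting value (1 + sqrt(alpha))^2 of Girko / Bai-Yin type
    print(N, lam_max, (1 + np.sqrt(alpha)) ** 2)
```

For α = 1/4 the limiting value is 2.25, and already at moderate N the empirical maximal eigenvalue is close to it.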

Remark: The proof of Lemma 1.1 is in fact an adaptation of the truncation idea in [BY] and fairly standard estimates on the traces of powers of A, as in [BY] (but see also [BGP1]). We will therefore not give the details of the proof, but only mention that the second term is a bound on the probability that any of the ξ_i^μ exceeds the value √N, while the first is a bound one would obtain if all ξ_i^μ satisfied this condition.

With this in mind we define

f̃_N(β) ≡ − β^{−1} ln Z_N(β) 1I_{{‖A‖ ≤ 2(1+√α)^2}}    (1.9)

We will prove

Theorem 1: Assume that lim_{N↑∞} M(N)/N = α < ∞ and that ξ satisfies the assumptions of Lemma 1.1. Then

(i) If r = 2, for all n < ∞ there exists z_n < ∞, such that for all z ≥ z_n, and for N large enough,

IP[ | f̃_N(β) − IE f̃_N(β) | ≥ z (ln N)^{3/2} N^{1/2} ] ≤ N^{−n}    (1.10)

(ii) If r ≥ 3, then there exist constants C, c, c′ > 0 s.t.

IP[ | f̃_N(β) − IE f̃_N(β) | ≥ zN ] ≤ e^{−cNz^2}, if 0 ≤ z < C;  ≤ e^{−c′Nz}, if z ≥ C    (1.11)

We prove Theorem 1 in the next section. Before doing that, we will show that it implies

Theorem 2: Under the assumptions of Theorem 1,

lim_{N↑∞} | F_N(β) − IE F_N(β) | = 0,  a.s.    (1.12)
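The self-averaging asserted in Theorem 2 can be illustrated by exact enumeration for very small systems (an illustration only, not a substitute for the proof; the sizes N = 8, 14, the temperature β = 1/2, the ratio M ≈ N/4 and the choice r = 2 are arbitrary): the empirical spread of F_N(β) over disorder samples shrinks as N grows.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 0.5

def all_configs(N):
    # all 2^N spin configurations as rows of +-1
    s = np.arange(2 ** N)[:, None]
    return 1.0 - 2.0 * ((s >> np.arange(N)) & 1)

def free_energy(N, M, configs):
    xi = rng.choice([-1.0, 1.0], size=(M, N))   # one disorder realization
    m = configs @ xi.T / N                      # overlaps (1.2), one row per configuration
    H = -N * np.sum(m ** 2, axis=1)             # Hamiltonian (1.3) with r = 2
    Z = np.mean(np.exp(-beta * H))              # partition function (1.4), i.e. 2^{-N} sum
    return -np.log(Z) / (beta * N)              # free energy (1.5)

stds = {}
for N in (8, 14):
    configs = all_configs(N)
    samples = [free_energy(N, max(1, N // 4), configs) for _ in range(60)]
    stds[N] = float(np.std(samples))
print(stds)
```

The disorder-induced standard deviation of F_N(β) visibly decreases with N, consistent with fluctuations of order 1/√N.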

Remark: Theorem 2 was proven under the additional assumption that ξ_i^μ = ±1 for the case r = 2 in [BGP3]. In [FT], Theorem 2 was proven under the hypotheses that M(N) ≤ M_0 < ∞ and that IE (ξ_i^μ)^4 < ∞.

Remark: Theorem 2 may in some way be regarded as a strong law of large numbers. We are, however, reluctant to employ this term, because the convergence of IE F_N(β) to a limit is in general not proven. In the standard Hopfield model this was proven under the assumption lim_{N↑∞} M(N)/N = 0 by Koch [K] (see also [BG]).

We conclude the introduction by giving the proof of Theorem 2, assuming Theorem 1.

Proof (of Theorem 2): Obviously,

F_N(β) = (1/N) f̃_N(β) + F_N(β) 1I_{{‖A(N)‖ > 2(1+√α)^2}}    (1.13)

By Theorem 1 and the first Borel-Cantelli lemma it follows that

lim_{N↑∞} (1/N) | f̃_N(β) − IE f̃_N(β) | = 0,  a.s.    (1.14)

Thus Theorem 2 will be proven if we can show that F_N(β) 1I_{{‖A(N)‖ > 2(1+√α)^2}} ↓ 0 both almost surely and in mean. The almost sure convergence follows easily, since

IP[ F_N(β) 1I_{{‖A(N)‖ > 2(1+√α)^2}} ≠ 0  i.o. ] ≤ IP[ ‖A(N)‖ > 2(1+√α)^2  i.o. ] = 0    (1.15)

where the last equality follows from applying Lemma 3.1 from Bai and Yin [BY]. Finally, to prove convergence of the mean, we use that first of all

| H_N(σ) | ≤ N ‖ m_N(σ) ‖_2^2 ‖ m_N(σ) ‖_∞^{r−2} ≤ N ‖ m_N(σ) ‖_2^r ≤ N ‖ A(N) ‖^{r/2}    (1.16)

and therefore

| F_N(β) | 1I_{{‖A(N)‖ > 2(1+√α)^2}} ≤ (1/βN) | ln Z_N | 1I_{{‖A(N)‖ > 2(1+√α)^2}} ≤ ‖ A(N) ‖^{r/2} 1I_{{‖A(N)‖ > 2(1+√α)^2}}    (1.17)

But

IE[ ‖A(N)‖^x 1I_{{‖A(N)‖ > ζ_α}} ] = ζ_α^x IP[ ‖A(N)‖ > ζ_α ] + ∫_{ζ_α}^∞ x y^{x−1} IP[ ‖A(N)‖ > y ] dy
 = 2^x (1+√α)^{2x} IP[ ‖A(N)‖ > ζ_α ] + (1+√α)^{2x} ∫_1^∞ x (1+y)^{x−1} IP[ ‖A(N)‖ > (1+√α)^2 (1+y) ] dy    (1.18)

where ζ_α ≡ 2(1+√α)^2. To bound the last expression we used Lemma 1.1, with the parameter κ chosen as κ(x) = max(6, x/8) in the first term and as κ = κ(y,x) = yx/4 inside the integral. Obviously, the right-hand side of (1.18) then tends to zero as N ↑ ∞, as desired.

This concludes the proof of Theorem 2, assuming Theorem 1. ◊

Remark: Note that the estimate in (1.18) implies in particular that

IE ‖A(N)‖ ≤ C (1+√α)^2    (1.19)

for some constant C depending only on α. This is relevant for proving Theorem 1 in the case r = 2 (see [BGP3]).

2. Proof of Theorem 1

The basic idea of the proof is the same as in [BGP3], where the case r = 2 has been considered, but some modifications are necessary, in particular to avoid any restrictions on the value of α. The fact that in the case r ≥ 3 sharper estimates can be obtained may justify the presentation of the details of the proof in that case. We first introduce the decreasing sequence of sigma-algebras F_k that are generated by the random variables {ξ_i^μ}_{i≥k, μ∈ℕ}. Since the variables f̃_N(β) are nonzero only if ‖A‖ ≤ 2(1+√α)^2, we may introduce the event

A ≡ { ‖A‖ ≤ 2(1+√α)^2 } ∈ F    (2.1)

and the corresponding trace sigma-algebras F̃_k ≡ F_k ∩ A. This allows us to introduce the corresponding martingale difference sequence [Yu]

f̃_N^{(k)}(β) ≡ IE[ f̃_N(β) | F̃_k ] − IE[ f̃_N(β) | F̃_{k+1} ]    (2.2)

Notice that we have the identity

f̃_N(β) − IE f̃_N(β) = Σ_{k=1}^{N} f̃_N^{(k)}(β) · IP[A]    (2.3)

by the definition of conditional expectations. The factor IP[A] tends to 1 as N ↑ ∞, so that we just have to control the sum of the f̃_N^{(k)}(β). To get the sharpest possible estimates, we want to use an exponential inequality. To this end we observe that [BGP3]

IP[ | Σ_{k=1}^N f̃_N^{(k)}(β) | ≥ Nz ]
 ≤ 2 inf_{t∈IR} e^{−|t|Nz} IE exp( t Σ_{k=1}^N f̃_N^{(k)}(β) )
 = 2 inf_{t∈IR} e^{−|t|Nz} IE[ IE[ ⋯ IE[ IE[ e^{t f̃_N^{(1)}(β)} | F̃_2 ] e^{t f̃_N^{(2)}(β)} | F̃_3 ] ⋯ e^{t f̃_N^{(N)}(β)} | F̃_{N+1} ] ]    (2.4)

To make use of this inequality, we need bounds on the conditional Laplace transforms; namely, if we can show that, for some function L^{(k)}(t), ln IE[ e^{t f̃_N^{(k)}(β)} | F̃_{k+1} ] ≤ L^{(k)}(t), uniformly in F̃_{k+1}, then we obtain that

IP[ | Σ_{k=1}^N f̃_N^{(k)}(β) | ≥ Nz ] ≤ 2 inf_{t∈IR} e^{ −|t|Nz + Σ_{k=1}^N L^{(k)}(t) }    (2.5)
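The telescoping identity (2.3) and the martingale-difference property exploited in (2.4) can be made concrete in a toy setting (an illustration only; the quadratic function f below is arbitrary): for a function f of n independent ±1 variables, with F_k generated by ξ_k, …, ξ_n, the conditional expectations are finite averages that can be computed exactly.

```python
import itertools
import numpy as np

n = 4
rng = np.random.default_rng(3)
w = rng.normal(size=(n, n))

def f(xs):
    # an arbitrary function of the n binary variables xi_1, ..., xi_n
    x = np.asarray(xs, dtype=float)
    return float(x @ w @ x)

def cond_exp(xs, k):
    # E[f | F_k], where F_k is generated by xi_k, ..., xi_n:
    # average f over all values of the integrated-out variables xi_1, ..., xi_{k-1}
    vals = [f(p + tuple(xs[k - 1:])) for p in itertools.product((-1, 1), repeat=k - 1)]
    return float(np.mean(vals))

xs = tuple(rng.choice([-1, 1], size=n))
diffs = [cond_exp(xs, k) - cond_exp(xs, k + 1) for k in range(1, n + 1)]

# telescoping identity behind (2.3): the increments sum to f - E[f]
assert np.isclose(sum(diffs), f(xs) - cond_exp(xs, n + 1))

# martingale-difference property used in (2.4): averaging the k-th increment
# over xi_k (i.e. conditioning on F_{k+1}) gives exactly zero
k = 2
inc = lambda ys: cond_exp(ys, k) - cond_exp(ys, k + 1)
assert abs(np.mean([inc(xs[:k - 1] + (s,) + xs[k:]) for s in (-1, 1)])) < 1e-12
```

The peeling of the nested conditional expectations in (2.4) rests on exactly this mean-zero property of each increment.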

Note that this construction, so far, is completely model-independent. In the estimation of the conditional Laplace transforms, a conventional trick [PS] is to introduce a continuous family of Hamiltonians, H̃_N^{(k)}(σ,u), that are equal to the original one for u = 1 and are independent of ξ_k for u = 0. We first introduce the M(N)-dimensional vectors m_N^{(k)}(σ,u) with components

m_N^{(k),μ}(σ,u) ≡ (1/N) ( Σ_{i≠k} ξ_i^μ σ_i + u ξ_k^μ σ_k )    (2.6)

and then define

H̃_N^{(k)}(σ,u) = −N ‖ m_N^{(k)}(σ,u) ‖_r^r    (2.7)

Note that this procedure can of course be used in all cases where the Hamiltonian is a function of the macroscopic order parameters. Naturally, we set

Z_N^{(k)}(β,u) ≡ 2^{−N} Σ_{σ∈S_N} e^{ −β H̃_N^{(k)}(σ,u) }    (2.8)

and finally

f_N^{(k)}(β,u) ≡ −β^{−1} ( ln Z_N^{(k)}(β,u) − ln Z_N^{(k)}(β,0) )    (2.9)

Since for the remainder of the proof β as well as N will be fixed, to simplify our notation we will write f_k(u) ≡ f_N^{(k)}(β,u). It is related to f̃_N^{(k)}(β) via

f̃_N^{(k)}(β) = IE[ f_k(1) | F̃_k ] − IE[ f_k(1) | F̃_{k+1} ]    (2.10)

To bound the Laplace transform, we use that, for all x ∈ IR,

e^x ≤ 1 + x + (1/2) x^2 e^{|x|}    (2.11)

so that, since IE[ f̃_N^{(k)}(β) | F̃_{k+1} ] = 0,

IE[ e^{t f̃_N^{(k)}(β)} | F̃_{k+1} ] ≤ 1 + (1/2) t^2 IE[ ( f̃_N^{(k)}(β) )^2 e^{ | t f̃_N^{(k)}(β) | } | F̃_{k+1} ]    (2.12)
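Both inequality (2.11) and the disappearance of the linear term in (2.12) for mean-zero increments are elementary to verify numerically (an illustration only; the grid and the two-point distribution below are arbitrary):

```python
import numpy as np

# pointwise check of (2.11): e^x <= 1 + x + (1/2) x^2 e^{|x|}
x = np.linspace(-10.0, 10.0, 20001)
assert np.all(np.exp(x) <= 1.0 + x + 0.5 * x ** 2 * np.exp(np.abs(x)) + 1e-12)

# for a mean-zero random variable the linear term drops out after taking
# expectations -- the step from (2.11) to (2.12); a two-point example:
t = 0.7
vals = np.array([-1.0, 3.0])
p = np.array([0.75, 0.25])          # weights chosen so that IE X = 0
lhs = np.sum(p * np.exp(t * vals))
rhs = 1.0 + 0.5 * t ** 2 * np.sum(p * vals ** 2 * np.exp(np.abs(t * vals)))
assert lhs <= rhs
```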

Our strategy in [BGP3] was to use a rather poor uniform bound on f̃_N^{(k)}(β) in the exponent, but to prove a better estimate on the remaining conditioned expectation of the square. Here we will do the same. Notice that f_k(u) is convex, f_k(0) = 0, and therefore |f_k(1)| ≤ max( |f_k′(0)|, |f_k′(1)| ). But

f_k′(u) = E_{k,u}[ ∂H̃_N^{(k)}(σ,u)/∂u ]    (2.13)

where E_{k,u} denotes the expectation w.r.t. the probability measure

( 2^{−N} / Z_N^{(k)}(β,u) ) e^{ −β H̃_N^{(k)}(σ,u) }    (2.14)

One easily verifies that f_k′(0) = 0, so that we can use in the sequel that |f_k(1)| ≤ |f_k′(1)|. Obviously,

| f_k′(1) | ≤ E_{k,1}[ | ∂H̃_N^{(k)}(σ,u)/∂u |_{u=1} | ]    (2.15)

Computing the derivative we get

| ∂H̃_N^{(k)}(σ,u)/∂u |_{u=1} | = | r σ_k Σ_{μ=1}^{M(N)} ξ_k^μ [ m_N^μ(σ) ]^{r−1} |
 ≤ r Σ_{μ=1}^{M(N)} | ξ_k^μ | | m_N^μ(σ) |^{r−1}
 ≤ r ‖ m_N(σ) ‖_∞^{r−3} Σ_{μ=1}^{M(N)} | ξ_k^μ | [ m_N^μ(σ) ]^2    (2.16)

In the usual case, where |ξ_i^μ| ≤ 1, we can bound the sup-norms appearing in (2.16) by ‖m_N(σ)‖_∞ ≤ 1; in the case of unbounded ξ, we can still use that ‖m_N(σ)‖_∞ ≤ ‖m_N(σ)‖_2. On A, the latter is bounded by (2(1+√α)^2)^{1/2}. Of course here we assumed that r ≥ 3. In this case, therefore, on A,

E_{k,1}[ | ∂H̃_N^{(k)}(σ,u)/∂u |_{u=1} | ] ≤ r E_{k,1} ‖ m_N(σ) ‖_2^2 ≤ 2r (1+√α)^2    (2.17)

if |ξ_i^μ| ≤ 1, and

E_{k,1}[ | ∂H̃_N^{(k)}(σ,u)/∂u |_{u=1} | ] ≤ r E_{k,1} ‖ m_N(σ) ‖_2^{r−1} ≤ r ( 2(1+√α)^2 )^{(r−1)/2}    (2.18)

in general. Using these estimates, we see that on A we have

| f̃_N^{(k)}(β) | ≤ C    (2.19)

where C ≡ C(α) is some finite constant depending on M/N. Using this bound in (2.12), we see that

L^{(k)}(t) ≤ (C^2/2) t^2 e^{C|t|}    (2.20)

To obtain (1.11), we insert this bound into (2.5) and bound the infimum over t by its value for t = z/C^2, if z < C ln 2, and by its value for t = C^{−1}, if z ≥ C ln 2. This proves (ii) of Theorem 1.

This leaves us with the case r = 2. The new difficulties arising here have been treated in [BGP3] and we just recall the main steps. Instead of (2.16) we have here that

| ∂H_N^{(k)}(σ,u)/∂u | ≤ 2 Σ_{μ=1}^{M} | ξ_k^μ m_N^{(k),μ}(σ,u) | ≤ 2 √M ( Σ_{μ=1}^{M} ( ξ_k^μ m_N^{(k),μ}(σ,u) )^2 )^{1/2}    (2.21)

Using this, we get

IE[ e^{t f̃_N^{(k)}(β)} | F̃_{k+1} ] ≤ 1 + (1/2) t^2 e^{ 4(1+√α)^2 |t| √M } IE[ ( f̃_N^{(k)}(β) )^2 | F̃_{k+1} ]
 ≤ exp( (1/2) t^2 e^{ 4(1+√α)^2 |t| √M } IE[ ( f̃_N^{(k)}(β) )^2 | F̃_{k+1} ] )    (2.22)

Then, just as in [BGP3], one easily verifies (using the convexity of f_k(u)) that

IE[ ( f̃_N^{(k)}(β) )^2 | F̃_{k+1} ] ≤ IE[ ( f_k′(1) )^2 | F̃_{k+1} ]    (2.23)

Symmetrizing with respect to the ξ_i that are integrated over in (2.23), one obtains from here [BGP3]

IE[ ( f_k′(1) )^2 | F̃_{k+1} ] ≤ 2 (1+√α)^2 IE[ ‖B^{(k)}‖ | A ]    (2.24)

where B^{(k)} is the random matrix with elements

B_{μν}^{(k)} ≡ (1/k) Σ_{j=1}^{k} ξ_j^μ ξ_j^ν    (2.25)
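The matrix B^{(k)} is a Gram matrix, and its operator norm coincides with that of the k × k matrix A(k) of type (1.7), since the two Gram matrices share their nonzero spectrum. A quick numerical check (an illustration only; the sizes and the ±1 patterns are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
k, M = 30, 12
xi = rng.choice([-1.0, 1.0], size=(k, M))   # entries xi[j, mu] for j = 1..k, mu = 1..M
B = xi.T @ xi / k                           # the M x M matrix (2.25)
A = xi @ xi.T / k                           # the k x k matrix A(k) of type (1.7)
# Xi^T Xi and Xi Xi^T share their nonzero eigenvalues, so the operator norms agree
assert np.isclose(np.linalg.norm(B, 2), np.linalg.norm(A, 2))
```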

Note that the conditioning on A in (2.24) is essentially irrelevant here, since the probability of A is close to 1. Now it is easy to see that ‖B^{(k)}‖ = ‖A(k)‖, and so, by the estimate (1.18), we have that

IE ‖B^{(k)}‖ ≤ c ( 1 + √(M/k) )^2    (2.26)

Therefore we have

IE exp( t Σ_{k=1}^N f̃_N^{(k)}(β) ) ≤ exp( c t^2 e^{C|t|√M} N (1 + 4√α + α ln N) )    (2.27)

Inserting this in (2.4) and choosing t = z (ln N/N)^{1/2} then gives estimate (i) of Theorem 1. ◊

Let me conclude this paper with some final remarks. We have shown in this paper how sharp estimates on the fluctuations of the free energy (as a function of the 'overlap parameters') can be obtained in a very wide class of disordered mean-field models generalizing the Hopfield model (note that the case of the Sherrington-Kirkpatrick model can be treated in much the same way, see [BGP3]). In particular, we have shown that these fluctuations are of order 1/√N and converge to zero almost surely. I should stress again that this does not imply, however, that the free energy itself converges almost surely to some value in the thermodynamic limit. The problem here is the average of the free energy. Note that in most of the literature on disordered systems, one tries to compute this average, tacitly assuming self-averaging. But although heuristic techniques, and in particular the replica trick, allow one to do this to some extent, there is in general no rigorous argument that would assure even that the average of the finite volume free energy converges. For the models considered in this paper, the average of the free energy is uniformly bounded between two constants; this follows from the bound (1.19) (for this we need to assume 4+ε-th moments of the variables ξ_i^μ; with some extra sophistication it would otherwise be possible to prove Theorem 2 under just the assumption that 4-th moments are finite). But nothing, in principle, excludes that the free energy is a very irregular, oscillating function of the volume. Currently, only in the case α(N) ↓ 0, or at high temperatures, can this be rigorously excluded.

References

[BG] A. Bovier and V. Gayrard, "Rigorous results on the thermodynamics of the dilute Hopfield model", J. Stat. Phys. 69: 597-627 (1993).

[BGP1] A. Bovier, V. Gayrard, and P. Picco, "Gibbs states of the Hopfield model in the regime of perfect memory", to appear in Prob. Theor. Rel. Fields 99 (1994).
[BGP2] A. Bovier, V. Gayrard, and P. Picco, "Large deviation principles for the Hopfield model and the Kac-Hopfield model", IAAS-preprint No. 95, to appear in Prob. Theor. Rel. Fields (1994).
[BGP3] A. Bovier, V. Gayrard, and P. Picco, "Gibbs states of the Hopfield model with extensively many patterns", IAAS-preprint No. 97, to appear in J. Stat. Phys. (1994).
[BY] Z.D. Bai and Y.Q. Yin, "Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a Wigner matrix", Ann. Probab. 16: 1729-1741 (1988).
[FP] L.A. Pastur and A.L. Figotin, "Exactly soluble model of a spin glass", Sov. J. Low Temp. Phys. 3(6): 378-383 (1977); "On the theory of disordered spin systems", Theor. Math. Phys. 35: 403-414 (1978).
[FT] J. Feng and B. Tirozzi, "The SLLN for the free energy in the Hopfield type model and Little model with finite number of patterns", preprint (1994).
[Gi] V.L. Girko, "Limit theorems for maximal and minimal eigenvalues of random matrices", Theor. Prob. Appl. 35: 680-695 (1989).
[Ho] J.J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proc. Natl. Acad. Sci. USA 79: 2554-2558 (1982).
[K] H. Koch, "A free energy bound for the Hopfield model", J. Phys. A: Math. Gen. 26: L353-L355 (1993).
[Lee] Y.C. Lee, G. Doolen, H.H. Chen, G.Z. Sun, T. Maxwell, H.Y. Lee, and C.L. Giles, "Machine learning using higher order correlation networks", Physica D 22: 276-306 (1986).
[MP] V.A. Marchenko and L.A. Pastur, "Eigenvalue distribution in some ensembles of random matrices", Math. USSR Sbornik 72: 536-567 (1967).
[PN] P. Peretto and J.J. Niez, "Long term memory storage capacity of multiconnected neural networks", Biological Cybernetics 54: 53-63 (1986).
[PS] L. Pastur and M. Shcherbina, "Absence of self-averaging of the order parameter in the Sherrington-Kirkpatrick model", J. Stat. Phys. 62: 1-19 (1991).
[PST] L. Pastur, M. Shcherbina, and B. Tirozzi, "The replica symmetric solution without the replica trick for the Hopfield model", J. Stat. Phys. 74: 1161-1183 (1994).
[SK] D. Sherrington and S. Kirkpatrick, "Solvable model of a spin glass", Phys. Rev. Lett. 35: 1792-1796 (1975).
[ST] M. Shcherbina and B. Tirozzi, "The free energy for a class of Hopfield models", J. Stat. Phys. 72: 113-125 (1992).
[Yu] V.V. Yurinskii, "Exponential inequalities for sums of random vectors", J. Multivariate Anal. 6: 473-499 (1976).
