SIAM J. CONTROL AND OPTIMIZATION Vol. 30, No. 4, pp. 838-855, July 1992
© 1992 Society for Industrial and Applied Mathematics
ACCELERATION OF STOCHASTIC APPROXIMATION BY AVERAGING*

B. T. POLYAK† AND A. B. JUDITSKY‡
Abstract. A new recursive algorithm of stochastic approximation type with the averaging of trajectories is investigated. Convergence with probability one is proved for a variety of classical optimization and identification problems. It is also demonstrated for these problems that the proposed algorithm achieves the highest possible rate of convergence.
Key words. stochastic approximation, recursive estimation, stochastic optimization, optimal algorithms
AMS(MOS) subject classifications. 62L20, 93E10, 93E12
1. Introduction. The methods of stochastic approximation originate in the works
[29], [12] and are currently well studied [5], [21], [14], [16], [40]. These methods are widely applied in problems of adaptation, identification, estimation, and stochastic optimization [36], [37], [1], [8]-[10], [18]. The optimal versions (algorithms having the highest rate of convergence) of these methods have been developed as well [34], [38], [6], [27], [28]. However, the application of these optimal methods requires a large amount of a priori information. For example, the matrix $\nabla^2 f(x^*)$ must be known in the problem of stochastic optimization (here $x^*$ is the minimum point of $f(x)$). The new way of developing optimal algorithms that does not require such information is based on the idea of averaging the trajectories. It was proposed independently by Polyak [24] and Ruppert [32]. In the latter work, the linear algorithm for the one-dimensional case was considered, and asymptotic normality of the procedure was proved. Polyak [24] studies multidimensional problems and nonlinear algorithms; he demonstrated mean square convergence for these methods. In this paper we consider the same framework as in [24], but we demonstrate the asymptotic normality of the estimates. The use of essentially new techniques in the proofs allows us to weaken the conditions of the theorems substantially. Moreover, we prove statements on almost sure convergence. The idea of using averaging to accelerate stochastic approximation algorithms appeared in the 1960s (see [36] and the references therein). Afterward it turned out that the hopes associated with this method could not be realized; see, for instance, [23], where it was proved that the usual averaging methods are not optimal for linear problems. Nevertheless, processes with averaging were proposed and studied in a vast variety of papers [11], [14], [20], [13], [33], [4].
The essential advancement of [24], [32] was reached on the basis of a paradoxical idea: a slow algorithm, having a less than optimal convergence rate, must be averaged.
The paper is organized as follows. In §2 the linear case is discussed (i.e., linear equation and linear algorithm); the formulation of the result and the proofs are clearest for that problem. Then in §3 the general problem of stochastic approximation is studied. The general result obtained is then applied to the unconstrained stochastic optimization problem and to the problem of estimation of linear regression parameters.

2. Linear problem. We want to find $x^*$, which solves the following equation:

(1)  $Ax = b$.

Here $b \in R^N$, $x \in R^N$, and $A \in R^{N \times N}$. The sequence $(y_t)_{t \ge 1}$ is observed, where $y_t = Ax_{t-1} - b + \xi_t$. Here $Ax_{t-1} - b$ is a prediction residual and $\xi_t$ is a random disturbance.
* Received by the editors July 30, 1990; accepted for publication (in revised form) June 24, 1991.
† Institute for Control Sciences, Profsoyuznaya 65, 117806 Moscow, Russia.
‡ Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), 35042 Rennes, France.
To obtain the sequence of estimates $(\bar x_t)_{t \ge 1}$ of the solution $x^*$ of (1), the following recursive algorithm will be used:

(2)  $x_t = x_{t-1} - \gamma_t y_t$,  $y_t = Ax_{t-1} - b + \xi_t$,  $\bar x_t = \frac{1}{t}\sum_{i=0}^{t-1} x_i$;

$x_0$ is an arbitrary (nonrandom) point in $R^N$. Let us suppose that the following assumptions hold.
Assumption 2.1. The matrix $-A$ is Hurwitz, i.e., $\operatorname{Re}\lambda_i(A) > 0$. (Here $\lambda_i(A)$ are the eigenvalues of the matrix $A$.)
Assumption 2.2. The coefficients $\gamma_t > 0$ satisfy either

(3)  $\gamma_t \equiv \gamma$,  $0 < \gamma < 2\min_i \operatorname{Re}\lambda_i(A)/|\lambda_i(A)|^2$,

or

(4)  $\gamma_t \to 0$,  $\sum_{t=1}^{\infty}\gamma_t = \infty$,  $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$.
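Condition (4) is easy to probe numerically. The following small check (an illustration, not from the paper) evaluates the normalized decrement $r_t = ((\gamma_t - \gamma_{t+1})/\gamma_t)/\gamma_t$ for $\gamma_t = t^{-\alpha}$; condition (4) needs $r_t \to 0$, which holds for $\alpha < 1$ and fails for $\alpha = 1$:

```python
# Probe the slow-decrease condition (gamma_t - gamma_{t+1})/gamma_t = o(gamma_t)
# for the power sequences gamma_t = t^(-alpha).
def ratio(alpha, t):
    g = lambda s: s ** -alpha
    # normalized decrement r_t; condition (4) requires r_t -> 0 as t -> infinity
    return ((g(t) - g(t + 1)) / g(t)) / g(t)

for alpha in (0.5, 0.9, 1.0):
    # r_t ~ alpha * t^(alpha - 1): vanishes iff alpha < 1
    print(alpha, ratio(alpha, 10**6))
```

For $\alpha = 1$ the ratio stays near 1 however large $t$ is, which is why $\gamma_t = \gamma t^{-1}$ is excluded.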
Commentary. Condition (4) for $\gamma_t \to 0$ is the requirement that $\gamma_t$ decrease sufficiently slowly. For example, the sequences $\gamma_t = \gamma t^{-\alpha}$ with $0 < \alpha < 1$ satisfy this restriction, but the sequence $\gamma_t = \gamma t^{-1}$ does not.
We assume a probability space $(\Omega, \mathcal{F}, P)$ with an increasing family of Borel $\sigma$-fields $(\mathcal{F}_t)_{t \ge 0}$. Suppose that $\xi_t$ is a random variable adapted to $\mathcal{F}_t$.
Assumption 2.3. $(\xi_t)$ is a martingale-difference process, i.e., $E(\xi_t \mid \mathcal{F}_{t-1}) = 0$; moreover, $\sup_t E(|\xi_t|^2 \mid \mathcal{F}_{t-1}) < \infty$ a.s. (Here $|\cdot|$ is the Euclidean norm in $R^N$.)
Assumption 2.4. It holds that

$\lim_{C \to \infty} \overline{\lim}_{t \to \infty} E(|\xi_t|^2 I(|\xi_t| > C) \mid \mathcal{F}_{t-1}) = 0$.

(Here $I(A)$ is the characteristic function of a set $A$.)
Assumption 2.5. The following hold:
(a)  $\operatorname{P-lim}_{t \to \infty} E(\xi_t \xi_t^T \mid \mathcal{F}_{t-1}) = S > 0$;
(b)  $\lim_{t \to \infty} E\,\xi_t \xi_t^T = S > 0$.
The notation $S > 0$ means that the matrix $S$ is symmetric and positive definite.
THEOREM 1. (a) Let Assumptions 2.1-2.4, 2.5(a) be satisfied. Then

$\sqrt{t}(\bar x_t - x^*) \Rightarrow N(0, V)$;

i.e., the distribution of the normalized error $\sqrt{t}(\bar x_t - x^*)$ is asymptotically normal with zero mean and the covariance matrix

(5)  $V = A^{-1} S (A^{-1})^T$.

(b) If Assumptions 2.1-2.3, 2.5(b) are satisfied, then $\lim_{t \to \infty} t\,E(\bar x_t - x^*)(\bar x_t - x^*)^T = V$.
(c) Let Assumptions 2.1-2.3 be satisfied and let $(\xi_t)_{t \ge 1}$ be mutually independent and identically distributed. Then

$\bar x_t - x^* \to 0$  a.s.
The proofs of the theorems in this paper are in the Appendix. Part (b) of the theorem was developed in [24] for the case of independent disturbances.
Note that Assumption 2.2 on $\gamma_t$ is significant. If the sequence $\gamma_t = \gamma t^{-1}$ is chosen for algorithm (2) (as is often done for methods using averaging), then the rate of convergence decreases [23]. It was shown in [26] that in the case of independent noises,

$E(\hat x_t - x^*)(\hat x_t - x^*)^T \ge t^{-1} V + o(t^{-1})$

for all linear recursive estimates $\hat x_t$. This asymptotic rate of convergence is achieved by the algorithm

(6)  $x_t = x_{t-1} - t^{-1} A^{-1} y_t$.

Method (2) provides the same rate of convergence as the optimal linear algorithm. The advantage of this method is that it does not require any knowledge about $A$ and does not use matrix-valued $\gamma_t$. Several versions of algorithm (6) use an estimate of the matrix $A$ instead of its true value [22], [31]. The significant advantage of these procedures is that they require only the nonsingularity of $A$ (compare with the rather restrictive Assumption 2.1).
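As a numerical sketch of iteration (2) (an illustration of ours; the matrix $A$, the noise level, and the admissible choice $\gamma_t = 0.5\,t^{-1/2}$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric A with eigenvalues in [0.5, 2], so -A is Hurwitz
# (Assumption 2.1).
N = 4
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
A = Q @ np.diag(rng.uniform(0.5, 2.0, N)) @ Q.T
x_star = rng.standard_normal(N)
b = A @ x_star

# Algorithm (2): slow steps gamma_t = 0.5 * t^(-1/2), plus trajectory averaging.
T = 20000
x = np.zeros(N)                  # x_0: an arbitrary nonrandom point
x_sum = np.zeros(N)
for t in range(1, T + 1):
    gamma_t = 0.5 / t**0.5
    y = A @ x - b + 0.1 * rng.standard_normal(N)   # observation with noise xi_t
    x_sum += x                   # accumulate x_0, ..., x_{t-1} before updating
    x = x - gamma_t * y
x_bar = x_sum / T                # averaged estimate

print(np.linalg.norm(x - x_star), np.linalg.norm(x_bar - x_star))
```

Both errors are small after $T$ steps, with the averaged iterate typically the more accurate one, consistent with Theorem 1.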
3. Nonlinear problem. For nonlinear problems, consider the classical problem of stochastic approximation [21]. Let $R(x): R^N \to R^N$ be some unknown function. Observations $y_t$ of the function are available at any point $x_{t-1} \in R^N$ and contain random disturbances $\xi_t$:

$y_t = R(x_{t-1}) + \xi_t$.

The problem is to find the solution $x^*$ of the equation $R(x) = 0$ by using the observations $y_t$, under the assumption that a unique solution exists. To solve the problem, we use the following modification of algorithm (2):

(7)  $x_t = x_{t-1} - \gamma_t y_t$,  $y_t = R(x_{t-1}) + \xi_t$,  $\bar x_t = \frac{1}{t}\sum_{i=0}^{t-1} x_i$,  $x_0 \in R^N$.
The first equation in (7) defines the standard stochastic approximation process. Let the following assumptions be fulfilled.
Assumption 3.1. There exists a function $V(x): R^N \to R$ such that for some $\lambda > 0$, $\alpha > 0$, $\varepsilon > 0$, $L > 0$, and all $x, y \in R^N$ the conditions $V(x) \ge \alpha|x|^2$, $|\nabla V(x) - \nabla V(y)| \le L|x - y|$, and $\nabla V(x - x^*)^T R(x) > 0$ for $x \ne x^*$ hold true. Moreover, $\nabla V(x - x^*)^T R(x) \ge \lambda V(x - x^*)$ for all $|x - x^*| \ge \varepsilon$.
Assumption 3.2. There exist $K_1 > 0$ and $0 < \lambda \le 1$ such that

(8)  $|R(x) - G(x - x^*)| \le K_1 |x - x^*|^{1+\lambda}$

for all $|x - x^*| \le \varepsilon$, and $\operatorname{Re}\lambda_i(G) > 0$, $i = 1, \ldots, N$.
Assumption 3.3. $(\xi_t)_{t \ge 1}$ is a martingale-difference process defined on a probability space $(\Omega, \mathcal{F}, P)$, i.e., $E(\xi_t \mid \mathcal{F}_{t-1}) = 0$ almost surely, and for some $K_2$

$E(|\xi_t|^2 \mid \mathcal{F}_{t-1}) + |R(x_{t-1})|^2 \le K_2(1 + |x_{t-1} - x^*|^2)$.

The following decomposition takes place:

(9)  $\xi_t = \xi_t(0) + \zeta_t(\Delta_{t-1})$,  $\Delta_{t-1} = x_{t-1} - x^*$,

where $E(\xi_t(0) \mid \mathcal{F}_{t-1}) = 0$ a.s.,

$E(\xi_t(0)\xi_t(0)^T \mid \mathcal{F}_{t-1}) \xrightarrow{P} S$ as $t \to \infty$;  $S > 0$,

$\sup_t E(|\xi_t(0)|^2 I(|\xi_t(0)| > C) \mid \mathcal{F}_{t-1}) \to 0$ as $C \to \infty$,

and, for all $t$ large enough,

$E(|\zeta_t(x)|^2 \mid \mathcal{F}_{t-1}) \to 0$ as $x \to 0$.

Assumption 3.4. It holds that $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$, $\gamma_t > 0$ for all $t$;

(10)  $\sum_{t=1}^{\infty} \gamma_t^{(1+\lambda)/2} t^{-1/2} < \infty$.

THEOREM 2. Let Assumptions 3.1-3.4 be satisfied. Then $\bar x_t - x^* \to 0$ almost surely, and $\sqrt{t}(\bar x_t - x^*) \Rightarrow N(0, V)$ with $V = G^{-1} S (G^{-1})^T$.
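A toy instance of (7), with illustrative choices of ours (the map $R$, its root, the noise level, and $\gamma_t = t^{-0.7}$ are assumptions, not taken from the paper); here $R(x) = \tanh(x - x^*)$ componentwise, so the Jacobian at the root is $G = I$ and $\operatorname{Re}\lambda_i(G) > 0$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy root-finding problem R(x) = 0 with a nonlinear R whose Jacobian
# at the root is G = I (so the eigenvalue condition of Assumption 3.2 holds).
x_star = np.array([1.0, -2.0])

def R(x):
    return np.tanh(x - x_star)

# Algorithm (7): Robbins-Monro iteration with slow steps and averaging.
T = 50000
x = np.zeros(2)
x_sum = np.zeros(2)
for t in range(1, T + 1):
    gamma_t = 1.0 / t**0.7                     # gamma_t = t^(-alpha), alpha = 0.7
    y = R(x) + 0.2 * rng.standard_normal(2)    # noisy observation of R(x_{t-1})
    x_sum += x
    x = x - gamma_t * y
x_bar = x_sum / T

print(np.abs(x - x_star).max(), np.abs(x_bar - x_star).max())
```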
4. Stochastic optimization. Consider the problem of searching for the minimum $x^*$ of the smooth function $f(x)$, $x \in R^N$. The values of the gradient, $y_t = \nabla f(x_{t-1}) + \xi_t$, containing random noise $\xi_t$, are available at an arbitrary point $x_{t-1}$ of $R^N$. To solve this problem, we use the following algorithm of the form (7):

(12)  $x_t = x_{t-1} - \gamma_t \varphi(y_t)$,  $y_t = \nabla f(x_{t-1}) + \xi_t$,  $\bar x_t = \frac{1}{t}\sum_{i=0}^{t-1} x_i$,  $x_0 \in R^N$.
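A sketch of (12) under assumptions of our own choosing (quadratic $f$, Student-$t$ gradient noise, and componentwise clipping as the nonlinearity $\varphi$ — one function satisfying the growth bound of Assumption 4.3; since the noise is symmetric, $\psi(0) = E\varphi(\xi_1) = 0$ as Assumption 4.4 requires):

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic objective f(x) = 0.5 (x - x*)^T H (x - x*) with l I <= H <= L I
# (Assumption 4.1); gradients observed under heavy-tailed symmetric noise.
x_star = np.array([0.5, -1.0, 2.0])
H = np.diag([0.5, 1.0, 2.0])

def grad_obs(x):
    # y_t = grad f(x_{t-1}) + xi_t, with Student-t (df = 3) noise
    return H @ (x - x_star) + 0.5 * rng.standard_t(df=3, size=3)

# Algorithm (12): phi = componentwise clipping tempers the heavy tails.
T = 50000
x = np.zeros(3)
x_sum = np.zeros(3)
for t in range(1, T + 1):
    gamma_t = 1.0 / t**0.6
    x_sum += x
    x = x - gamma_t * np.clip(grad_obs(x), -5.0, 5.0)
x_bar = x_sum / T

print(np.abs(x_bar - x_star).max())
```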
Let the following assumptions be fulfilled.
Assumption 4.1. Let $f(x)$ be a twice continuously differentiable function and $lI \le \nabla^2 f(x) \le LI$ for all $x$ and some $l > 0$ and $L > 0$; here $I$ is the identity matrix.
Assumption 4.2. $(\xi_t)_{t \ge 1}$ is a sequence of mutually independent and identically distributed random variables, $E\xi_1 = 0$.
Assumption 4.3. It holds that $|\varphi(x)| \le K_1(1 + |x|)$.
Assumption 4.4. The function $\psi(x) = E\varphi(x + \xi_1)$ is defined and has a derivative at zero, $\psi(0) = 0$, and $x^T\psi(x) > 0$ for all $x \ne 0$. Moreover, there exist $\varepsilon, K_2 > 0$ and $0 < \lambda \le 1$ such that

$|\psi'(0)x - \psi(x)| \le K_2|x|^{1+\lambda}$  for $|x| \le \varepsilon$.
Assumption 4.5. The matrix function $\chi(x) = E\varphi(x + \xi_1)\varphi(x + \xi_1)^T$ is defined and is continuous at zero.
Assumption 4.6. The matrix $-G = -\psi'(0)\nabla^2 f(x^*)$ is Hurwitz, i.e., $\operatorname{Re}\lambda_i(G) > 0$, $i = 1, \ldots, N$.
Assumption 4.7. It holds that $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$, $\gamma_t > 0$ for all $t$; $\sum_{t=1}^{\infty} \gamma_t^{(1+\lambda)/2} t^{-1/2} < \infty$.
THEOREM 3. Assume that Assumptions 4.1-4.7 hold. Then, for algorithm (12), $\bar x_t - x^* \to 0$ almost surely and $\sqrt{t}(\bar x_t - x^*) \Rightarrow N(0, V)$, where

$V = (\psi'(0)\nabla^2 f(x^*))^{-1} \chi(0) (\psi'(0)\nabla^2 f(x^*))^{-1}$.

5. Linear regression. Consider the problem of estimating the parameters of a linear regression from the observations $y_t = \theta^T x_t + \xi_t$, where $x_t \in R^N$ are random inputs and $\xi_t$ are random disturbances. We use the following algorithm of the form (7) to obtain the sequence of estimates $(\bar\theta_t)$ of the parameter $\theta$:
(16)  $\theta_t = \theta_{t-1} + \gamma_t \varphi(y_t - \theta_{t-1}^T x_t) x_t$,  $\bar\theta_t = \frac{1}{t}\sum_{i=0}^{t-1} \theta_i$,  $\theta_0 \in R^N$.
Suppose the following assumptions hold true.
Assumption 5.1. Let $(\xi_t)_{t \ge 1}$ be a sequence of mutually independent and identically distributed random variables, $E\xi_1 = 0$, $E\xi_1^2 < \infty$.
Assumption 5.2. Let $(x_t)_{t \ge 1}$ be a sequence of mutually independent and identically distributed random vectors, $E|x_1|^4 < \infty$, $E x_1 x_1^T = B$, $B > 0$. The sequences $(\xi_t)_{t \ge 1}$ and $(x_t)_{t \ge 1}$ are mutually independent.
Assumption 5.3. There exists $K_1$ such that $|\varphi(x)| \le K_1(1 + |x|)$ for all $x \in R$.
The functions $\psi(x) = E\varphi(x + \xi_1)$ and $\chi(x) = E\varphi^2(x + \xi_1)$ are defined under Assumptions 5.1-5.3. Now we state restrictions on $\psi$, $\chi$.
Assumption 5.4. It holds that $\psi(0) = 0$, $x\psi(x) > 0$ for all $x \ne 0$, $\psi(x)$ has a derivative at zero, and $\psi'(0) > 0$. Moreover, there exist $K_2 < \infty$ and $0 < \lambda \le 1$ such that

$|\psi(x) - \psi'(0)x| \le K_2|x|^{1+\lambda}$.

Assumption 5.5. The function $\chi(x)$ is continuous at zero.
Assumption 5.6. It holds that $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$, $\gamma_t > 0$ for all $t$;

$\sum_{t=1}^{\infty} \gamma_t^{(1+\lambda)/2} t^{-1/2} < \infty$.
THEOREM 4. Assume that Assumptions 5.1-5.6 hold. Then, for algorithm (16), the following properties hold true: $\bar\theta_t - \theta \to 0$ almost surely and $\sqrt{t}(\bar\theta_t - \theta) \Rightarrow N(0, V)$, where

$V = \frac{\chi(0)}{\psi'(0)^2} B^{-1}$.
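A sketch of algorithm (16) with the linear choice $\varphi(x) = x$ (then $\psi'(0) = 1$ and $\chi(0) = E\xi_1^2$, so Theorem 4 gives $V = E\xi_1^2 \cdot B^{-1}$); the dimension, noise level, and step sequence below are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear regression y_t = theta^T x_t + xi_t with i.i.d. standard normal
# inputs (so B = E x x^T = I) and noise of variance 0.09.
N = 3
theta = np.array([1.0, -0.5, 2.0])

# Algorithm (16) with phi(x) = x: slow steps plus averaging.
T = 100000
th = np.zeros(N)
th_sum = np.zeros(N)
for t in range(1, T + 1):
    x_t = rng.standard_normal(N)
    y_t = theta @ x_t + 0.3 * rng.standard_normal()
    gamma_t = 0.1 / t**0.6
    th_sum += th
    th = th + gamma_t * (y_t - th @ x_t) * x_t
th_bar = th_sum / T

print(np.abs(th_bar - theta).max())
```

With $\varphi$ linear and Gaussian noise this is the situation in which the averaged estimate attains the same $t^{-1}V$ rate as recursive least squares.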
The problem of the mean square convergence of method (16) is discussed in [24]. Note that the conditions of Theorem 4 are similar to the conditions of standard results for this problem [27]. If $\xi_1$ possesses a continuously differentiable density function $p(x)$, then the optimal algorithm proposed in the latter paper has the following form:

(17)  $\theta_t = \theta_{t-1} + \Gamma_t \varphi(y_t - \theta_{t-1}^T x_t) x_t$,  $\Gamma_t = t^{-1} B^{-1}$,  $\varphi(x) = -J(p)^{-1} p'(x)/p(x)$,

where $J(p)$ denotes the Fisher information of the density $p$.
For method (17),

$\sqrt{t}(\theta_t - \theta) \Rightarrow N(0, V)$,  $V = J(p)^{-1} B^{-1}$.
Since the matrix $B$ is unknown, algorithm (17) is unimplementable. Nevertheless, it is possible to use, instead of $B$, its estimate

(18)  $\hat B_t = \frac{1}{t}\sum_{k=1}^{t} x_k x_k^T$.
In particular, for linear $\varphi$ (i.e., for Gaussian noises) methods (17), (18) coincide with the recursive least-squares algorithm. It follows from Theorem 4 that if we choose

$\varphi(x) = -J(p)^{-1} p'(x)/p(x)$

for algorithm (16), then the asymptotic covariance is $V = J(p)^{-1} B^{-1}$. So the asymptotic rates of convergence of (16) and (17) coincide.
Appendix. The proofs of the theorems consist of a sequence of propositions followed by their proofs. Everywhere in the following, we use the notation $\Delta_t = x_t - x^*$ for the error of the first equation of the algorithm, and $\bar\Delta_t = \bar x_t - x^*$ for the estimation error. Nonrandom constants that are unimportant will be denoted by the symbols $K$ and $a$. All relations between random variables are supposed to hold almost surely (unless declared otherwise).
The two matrix lemmas below will be useful in later developments. Let $(X_t^j)_{t \ge j}$ and $(\bar X_t^j)_{t \ge j}$ be the sequences of matrices, $X_t^j, \bar X_t^j \in R^{N \times N}$, determined by the following recursive relations:
(A1)  $X_{t+1}^j = X_t^j - \gamma_t A X_t^j$,  $X_j^j = I$;  $\bar X_t^j = \sum_{i=j}^{t} \gamma_i X_i^j$.

LEMMA 1. Let the following hold: (i) Assumption 2.2 of Theorem 1 holds; (ii) $\operatorname{Re}\lambda_i(A) > 0$, $i = 1, \ldots, N$. Then there is a constant $K < \infty$ such that for all $j$ and $t \ge j$

(A2)  $|\bar X_t^j| \le K$,

(A3)  $\lim_{t \to \infty} \frac{1}{t}\sum_{j=0}^{t-1} |\bar X_t^j - A^{-1}| = 0$.

Proof of Lemma 1.
Part 1. The proposition of the lemma is true for $\gamma_t \equiv \gamma$.
Proof. We obtain from (A1) that $X_t^j = (I - \gamma A)^{t-j}$; so

$\bar X_t^j = \gamma(I + (I - \gamma A) + \cdots + (I - \gamma A)^{t-j}) = A^{-1} - (I - \gamma A)^{t-j+1} A^{-1}$.

The eigenvalues of the matrix $I - \gamma A$ are $\lambda_i(I - \gamma A) = 1 - \gamma\lambda_i(A)$, and $|\lambda_i(I - \gamma A)| < 1$. So

$\lim_{t \to \infty} (I - \gamma A)^t = 0$;
hence (A2) holds, and

$\frac{1}{t}\sum_{j=0}^{t-1} |\bar X_t^j - A^{-1}| = \frac{1}{t}\sum_{j=0}^{t-1} |(I - \gamma A)^{t-j+1} A^{-1}| = \frac{1}{t}\sum_{k=2}^{t+1} |(I - \gamma A)^k A^{-1}| \to 0$  as $t \to \infty$.
Part 2. It holds that $t\gamma_t \to \infty$.
Proof. Define $\alpha_t = 1/(t\gamma_t)$. Then

$\alpha_{t+1} = \frac{1}{t+1}(\gamma_t^{-1} + o(1)) = \alpha_t \frac{t}{t+1} + \frac{o(1)}{t+1}$,

where $o(1) \to 0$ as $t \to \infty$. Since $\sum_{t=1}^{\infty} 1/(t+1) = \infty$, we obtain that $\alpha_t \to 0$. □
Part 3. There are $c > 0$ and $K < \infty$ such that for all $j$ and $t \ge j$

$|X_t^j| \le K \exp\Big(-c\sum_{i=j}^{t}\gamma_i\Big)$.

Proof. From assumption (ii) of the lemma and from the Lyapunov theorem, we have that there exists a solution $V = V^T > 0$ of the Lyapunov equation $A^T V + VA = I$. Define $L = \max \lambda_i(V)$, $l = \min \lambda_i(V)$, $U_t = (X_t^j)^T V X_t^j$. Then

(A4)  $U_{t+1} = (X_t^j)^T (I - \gamma_t A)^T V (I - \gamma_t A) X_t^j = U_t - \gamma_t (X_t^j)^T (A^T V + VA) X_t^j + \gamma_t^2 (X_t^j)^T A^T V A X_t^j$.

Note that $(X_t^j)^T X_t^j \ge (1/L)(X_t^j)^T V X_t^j$ and $(X_t^j)^T A^T V A X_t^j \le \bar c\,(X_t^j)^T V X_t^j$, where $\bar c = \|A\|^2 L / l$. Then, for sufficiently large $t$ and some $\Lambda > 0$, we get from (A4) that
$U_{t+1} \le U_t(1 - \Lambda\gamma_t)$ …

… we have from Part 1 of the proof that $P(\sup_t |\bar\Delta_t| < \infty) = 1$. Thus, for every $\varepsilon > 0$, there exists some $R < \infty$ such that
(A12)  $P(\sup_t |\bar\Delta_t| \le R) \ge 1 - \varepsilon$.
Define the stopping time $\tau_R = \inf\{t \ge 1 : |\bar\Delta_t| > R\}$.
Part 2. It holds that $E|\Delta_t|^2 I(\tau_R > t) \le K\gamma_t$.
Proof. On $\{\tau_R > t\}$ we have from (A11) that

(A13)  $E(V_t I(\tau_R > t) \mid \mathcal{F}_{t-1}) \le V_{t-1}(1 + \gamma_t^2 K - a\gamma_t) I(\tau_R > t-1) + K\gamma_t^2$
for some $K, a > 0$. Taking the expectation, we obtain from (A13) that

$E V_t I(\tau_R > t) \le E V_{t-1} I(\tau_R > t-1)(1 - \gamma_t a + K\gamma_t^2) + K\gamma_t^2$.

Finally, by Lemma 2.1.26 of [3], we obtain that

(A14)  $E V_t I(\tau_R > t) \le K\gamma_t$.
$E(|\xi_t|^2 I(|\xi_t| > C) \mid \mathcal{F}_{t-1}) \to 0$.

Proof. By decomposition (9), we have that … Then $I_1 \to 0$ as $t \to \infty$ and $C \to \infty$ by Assumption 2.3; $I_2 \to 0$, since $\Delta_t$ converges to zero. Therefore all the conditions of proposition (a) of Theorem 1 hold for the process … We demonstrate the proximity of the processes $\bar\Delta_t$ and …. Set $\delta_t = \ldots$; then for $\delta_t$ we obtain the equation (compare with (A9))

$\delta_t = \frac{1}{t}\bar X_t^0 \Delta_0 + \frac{1}{t}\sum_{j=1}^{t-1} (G^{-1} + W_t^j)(R(\Delta_j) - G\Delta_j) + \ldots$
Part 4. It holds that $\delta_t \to 0$ as $t \to \infty$.
Proof. From Lemma 2 we immediately get that … Next, due to Assumption 2.2 and Lemma 2, we get that

$\frac{1}{t}\sum_{i=0}^{t-1} \ldots \to 0$  as $t \to \infty$.
Thus we obtain from (A14) and Assumption 2.4 that

$E \sum |\Delta_i|^{1+\lambda} I(\tau_R > t) \ldots$

… $x^T\psi(x) > 0$ for all $x \ne 0$. Let $f(x^*) = 0$ for the sake of simplicity. It follows from Assumptions 4.1 and 4.4 that there exist $a > 0$, $a' > 0$, $\varepsilon > 0$ such that

$R^T(x)\nabla f(x) \ge a|\nabla f(x)|^2 \ge a' f(x)$

for all $|x - x^*| \le \varepsilon$; hence $f(x)$ is a Lyapunov function for (A15), and all corresponding conditions of Theorem 2 are fulfilled.
So we obtain by Assumption 4.4 that

$|R(x) - G(x - x^*)| = |\psi(\nabla f(x)) - \psi'(0)\nabla^2 f(x^*)(x - x^*)| \le |\psi(\nabla f(x)) - \psi'(0)\nabla f(x)| + |\psi'(0)\nabla f(x) - \psi'(0)\nabla^2 f(x^*)(x - x^*)| \le K|x - x^*|^{1+\lambda}$.

From the definition (A16), by Assumption 3.3, we get that

$I(|\zeta_t(\Delta_{t-1})| > C) \le I(|\Delta_{t-1}| > KC) + I(|\xi_t| > KC)$;

so

$E(|\zeta_t(\Delta_{t-1})|^2 I(|\zeta_t(\Delta_{t-1})| > C) \mid \mathcal{F}_{t-1}) \le \ldots \to 0$  as $t \to \infty$, $C \to \infty$.
(Here $o(1) \to 0$ as $C \to \infty$.) This means that Assumption 3.3 of Theorem 2 holds. Therefore all conditions of the proposition of Theorem 2 are fulfilled, and the matrix $V$ is defined by the equation

$V = G^{-1}\chi(0)G^{-1} = (\psi'(0)\nabla^2 f(x^*))^{-1}\chi(0)(\psi'(0)\nabla^2 f(x^*))^{-1}$. □

Proof of Theorem 4. Let $\Delta_t = \theta_t - \theta$ be the error of the first equation in (16). Denote by $\mathcal{F}_t$ the minimal $\sigma$-algebra generated by the disturbances and inputs until the time $t$: $\mathcal{F}_t = \sigma(\xi_1, x_1, \ldots, \xi_t, x_t)$. Let $R(\Delta) = -E\varphi(\xi_1 - \Delta^T x_1)x_1$. We obtain the following equation for $\Delta_t$:
(A18)  $\Delta_t = \Delta_{t-1} - \gamma_t R(\Delta_{t-1}) + \gamma_t \varepsilon_t$,

where $\varepsilon_t = R(\Delta_{t-1}) + \varphi(\xi_t - \Delta_{t-1}^T x_t)x_t$. We check the fulfillment of the assumptions of Theorem 2 in this case. Assumption 5.4 implies that $\Delta^T R(\Delta) > 0$ for all $\Delta \ne 0$ and $R(\Delta) = 0$ for $\Delta = 0$; so $V(\Delta) = |\Delta|^2$ is the
Lyapunov function for (A18). Hence Assumption 3.1 of Theorem 2 is satisfied. Next, for some $K$,

$|R(\Delta) - \psi'(0)B\Delta| \le K E|\Delta^T x_1|^{1+\lambda} \le K|\Delta|^{1+\lambda}$;

hence Assumption 3.2 of Theorem 2 holds with $G = \psi'(0)B$. Further,

$I(|\varepsilon_t| > C) \le I(\ldots) + I(\ldots) + \ldots = I_1 + I_2 + I_3$.

So we obtain that
$E(|\varepsilon_t|^2 I(|\varepsilon_t| > C) \mid \mathcal{F}_{t-1}) \le K E\big((|\xi_t|^2|x_t|^2 + |\Delta_{t-1}|^2|x_t|^4)(I(\ldots) + \ldots) \mid \mathcal{F}_{t-1}\big) + \ldots = I_1 + I_2 + I_3 + I_4$.

From Assumption 5.1 we get that $I_1 \to 0$ and $I_2 \to 0$ as $C \to \infty$. Since $\gamma_t \to 0$, $I_4 \to 0$. By the Chebyshev inequality, we get that

$I_3 \le K(E|x_t|^4)^{1/2} P^{1/2}(|x_t|^2 > C) \le \frac{K}{C} E|x_t|^4 \to 0$  as $C \to \infty$.
So Assumption 3.3 of Theorem 2 is fulfilled. Therefore all the conditions of Theorem 2 are fulfilled under the assumptions of Theorem 4. Finally, we obtain for the matrix $V$ that

$V = (\psi'(0)B)^{-1}\chi(0)B(\psi'(0)B)^{-1} = \frac{\chi(0)}{\psi'(0)^2}B^{-1}$. □
REFERENCES
[1] M. A. AIZERMAN, E. M. BRAVERMAN, AND L. I. ROZONOER, The Method of Potential Functions in the Machine Learning Theory, Nauka, Moscow, 1970. (In Russian.)
[2] A. M. BENDERSKIJ AND M. B. NEVEL'SON, Multidimensional asymptotically optimal stochastic approximation procedure, Problems Inform. Transmission, 17 (1982), pp. 423-434.
[3] A. BENVENISTE, M. MÉTIVIER, AND P. PRIOURET, Algorithmes Adaptatifs et Approximations Stochastiques (Théorie et Applications), Masson, Paris, 1987.
[4] A. BERMAN, A. FEUER, AND E. WAHNON, Convergence analysis of smoothed stochastic gradient-type algorithm, Internat. J. Systems Sci., 18 (1987), pp. 1061-1078.
[5] YU. M. ERMOL'EV, Stochastic Programming Methods, Nauka, Moscow, 1976. (In Russian.)
[6] V. FABIAN, Asymptotically efficient stochastic approximation: The RM case, Ann. Statist., 1 (1973), pp. 486-495.
[7] ——, On asymptotically efficient recursive estimation, Ann. Statist., 6 (1978), pp. 854-866.
[8] V. N. FOMIN, Recursive Estimation and Adaptive Filtering, Nauka, Moscow, 1984. (In Russian.)
[9] K. S. FU, Sequential Methods in Pattern Recognition and Machine Learning, Academic Press, New York, London, 1968.
[10] G. S. GOODWIN AND K. S. SIN, Adaptive Filtering, Prediction and Control, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[11] A. M. GUPAL AND L. G. BAJENOV, Stochastic analog of the conjugate gradients method, Cybernetics, N1 (1972), pp. 125-126. (In Russian.)
[12] J. KIEFER AND J. WOLFOWITZ, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist., 23 (1952), pp. 462-466.
[13] I. P. KORNFEL'D AND SH. E. SHTEINBERG, Estimation of the parameters of linear and nonlinear systems using the method of averaged residuals, Automat. Remote Control, 46 (1986), pp. 966-974.
[14] A. P. KOROSTELEV, Stochastic Recurrent Procedures, Nauka, Moscow, 1981. (In Russian.)
[15] ——, On multi-step stochastic optimization procedures, Automat. Remote Control, 43 (1982), pp. 606-611.
[16] H. J. KUSHNER AND D. S. CLARK, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer, New York, 1978.
[17] R. SH. LIPTZER AND A. N. SHIRYAEV, Martingale Theory, Nauka, Moscow, 1986. (In Russian.)
[18] L. LJUNG AND T. SÖDERSTRÖM, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.
[19] A. V. NAZIN, Informational bounds for gradient stochastic optimization and optimal implemented algorithms, Automat. Remote Control, 50 (1989), pp. 520-531.
[20] A. S. NEMIROVSKIJ AND D. B. YUDIN, Complexity of Problems and Effectiveness of Optimization Methods, Nauka, Moscow, 1980. (In Russian.)
[21] M. B. NEVEL'SON AND R. Z. KHAS'MINSKIJ, Stochastic Approximation and Recursive Estimation, American Mathematical Society, Providence, RI, 1973.
[22] ——, Adaptive Robbins-Monro procedure, Automat. Remote Control, 34 (1974), pp. 1594-1607.
[23] B. T. POLYAK, Comparison of convergence rate for single-step and multi-step optimization algorithms in the presence of noise, Engrg. Cybernet., 15 (1977), pp. 6-10.
[24] ——, New stochastic approximation type procedures, Avtomat. i Telemekh., N7 (1990), pp. 98-107. (In Russian.) Translated in Automat. Remote Control, 51 (1991), to appear.
[25] ——, Introduction to Optimization, Optimization Software, New York, 1987.
[26] B. T. POLYAK AND YA. Z. TSYPKIN, Attainable accuracy of adaptation algorithms, in Problems of Cybernetics. Adaptive Systems, Nauka, Moscow, 1976, pp. 6-19. (In Russian.)
[27] ——, Adaptive estimation algorithms (convergence, optimality, stability), Automat. Remote Control, 40 (1980), pp. 378-389.
[28] ——, Optimal pseudogradient adaptation algorithms, Automat. Remote Control, 41 (1981), pp. 1101-1110.
[29] H. ROBBINS AND S. MONRO, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400-407.
[30] H. ROBBINS AND D. SIEGMUND, A convergence theorem for nonnegative almost supermartingales and some applications, in Optimizing Methods in Statistics, J. S. Rustagi, ed., Academic Press, New York, 1971, pp. 233-257.
[31] D. RUPPERT, A Newton-Raphson version of the multivariate Robbins-Monro procedure, Ann. Statist., 13 (1985), pp. 236-245.
[32] ——, Efficient estimators from a slowly convergent Robbins-Monro process, Tech. Report 781, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY, 1988.
[33] A. RUSZCZYNSKI AND W. SYSKI, Stochastic approximation method with gradient averaging for unconstrained problems, IEEE Trans. Automat. Control, AC-28 (1983), pp. 1097-1105.
[34] D. J. SAKRISON, Stochastic approximation: A recursive method for solving regression problems, in Advances in Communication Theory and Applications, 2, A. V. Balakrishnan, ed., Academic Press, New York, London, 1966, pp. 51-106.
[35] A. N. SHIRYAEV, Probability, Nauka, Moscow, 1980. (In Russian.)
[36] YA. Z. TSYPKIN, Adaptation and Learning in Automatic Systems, Academic Press, New York, London, 1971.
[37] ——, Foundations of Informational Theory of Identification, Nauka, Moscow, 1984. (In Russian.)
[38] J. H. VENTER, An extension of the Robbins-Monro procedure, Ann. Math. Statist., 38 (1967), pp. 181-190.
[39] M. L. VIL'K AND S. V. SHIL'MAN, Convergence and optimality of implementable adaptation algorithms (informational approach), Problems Inform. Transmission, 20 (1985), pp. 314-326.
[40] M. WASAN, Stochastic Approximation, Cambridge University Press, London, 1970.