SIAM J. CONTROL AND OPTIMIZATION Vol. 30, No. 4, pp. 838-855, July 1992
© 1992 Society for Industrial and Applied Mathematics
ACCELERATION OF STOCHASTIC APPROXIMATION BY AVERAGING*

B. T. POLYAK† AND A. B. JUDITSKY‡
Abstract. A new recursive algorithm of stochastic approximation type with the averaging of trajectories is investigated. Convergence with probability one is proved for a variety of classical optimization and identification problems. It is also demonstrated for these problems that the proposed algorithm achieves the highest possible rate of convergence.
Key words. stochastic approximation, recursive estimation, stochastic optimization, optimal algorithms
AMS(MOS) subject classifications. 62L20, 93E10, 93E12
1. Introduction. The methods of stochastic approximation originate in the works
[29], [12] and are currently well studied [5], [21], [14], [16], [40]. These methods are widely applied in problems of adaptation, identification, estimation, and stochastic optimization [36], [37], [1], [8]-[10], [18]. The optimal versions (algorithms having the highest rate of convergence) of these methods have been developed as well [34], [38], [6], [27], [28]. However, the application of these optimal methods requires a large amount of a priori information. For example, the matrix $\nabla^2 f(x^*)$ must be known in the problem of stochastic optimization (here $x^*$ is the minimum point of $f(x)$). The new way of developing optimal algorithms that does not require such information is based on the idea of averaging the trajectories. It was proposed independently by Polyak [24] and Ruppert [32]. In the latter work, the linear algorithm for the one-dimensional case was considered, and asymptotic normality of the procedure was proved. Polyak [24] studies multidimensional problems and nonlinear algorithms; he demonstrated mean square convergence for these methods. In this paper we consider the same framework as in [24], but we demonstrate the asymptotic normality of the estimates. The use of essentially new techniques in the proofs allows us to weaken the conditions of the theorems substantially. Moreover, we prove statements on almost sure convergence. The idea of using averaging to accelerate stochastic approximation algorithms appeared in the 1960s (see [36] and the references therein). Afterward it turned out that the hopes associated with this method could not be realized; see, for instance, [23], where it was proved that the usual averaging methods are not optimal for linear problems. Nevertheless, processes with averaging were proposed and studied in a vast variety of papers [11], [14], [20], [13], [33], [4].
The essential advancement of [24], [32] was reached on the basis of a paradoxical idea: a slow algorithm, having a less than optimal convergence rate, must be averaged.
The paper is organized as follows. In §2 the linear case is discussed (i.e., linear equation and linear algorithm); the formulation of the result and the proofs are clearest for that problem. Then in §3 the general problem of stochastic approximation is studied. The general result obtained is then applied to the unconstrained stochastic optimization problem and to the problem of estimation of linear regression parameters.

2. Linear problem. We want to find $x^*$, which solves the following equation:

(1)  $Ax = b$.

Here $b \in R^N$, $x \in R^N$, and $A \in R^{N \times N}$. The sequence $(y_t)_{t \ge 1}$ is observed, where $y_t = Ax_{t-1} - b + \xi_t$. Here $Ax_{t-1} - b$ is a prediction residual and $\xi_t$ is a random disturbance.
* Received by the editors July 30, 1990; accepted for publication (in revised form) June 24, 1991.
† Institute for Control Sciences, Profsoyuznaya 65, 117806 Moscow, Russia.
‡ Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), 35042 Rennes, France.
To obtain the sequence of estimates $(\bar x_t)_{t \ge 1}$ of the solution $x^*$ of (1), the following recursive algorithm will be used:

(2)  $x_t = x_{t-1} - \gamma_t y_t$,  $y_t = Ax_{t-1} - b + \xi_t$,  $\bar x_t = \frac{1}{t}\sum_{i=0}^{t-1} x_i$;

$x_0$ is an arbitrary (nonrandom) point in $R^N$. Let us suppose that the following assumptions hold.
Assumption 2.1. The matrix $-A$ is Hurwitz, i.e., $\operatorname{Re}\lambda_i(A) > 0$. (Here $\lambda_i(A)$ are the eigenvalues of the matrix $A$.)
Assumption 2.2. The coefficients $\gamma_t > 0$ satisfy either

(3)  $\gamma_t \equiv \gamma$,  $0 < \gamma < 2\min_i \operatorname{Re}\lambda_i(A)/|\lambda_i(A)|^2$,

or

(4)  $\gamma_t \to 0$,  $\sum_{t=1}^{\infty}\gamma_t = \infty$,  $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$.
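Condition (4) is easy to probe numerically. The following small check (an illustration, not from the paper) evaluates the normalized decrement $r_t = ((\gamma_t - \gamma_{t+1})/\gamma_t)/\gamma_t$ for $\gamma_t = t^{-\alpha}$; condition (4) needs $r_t \to 0$, which holds for $\alpha < 1$ and fails for $\alpha = 1$:

```python
# Probe the slow-decrease condition (gamma_t - gamma_{t+1})/gamma_t = o(gamma_t)
# for the power sequences gamma_t = t^(-alpha).
def ratio(alpha, t):
    g = lambda s: s ** -alpha
    # normalized decrement r_t; condition (4) requires r_t -> 0 as t -> infinity
    return ((g(t) - g(t + 1)) / g(t)) / g(t)

for alpha in (0.5, 0.9, 1.0):
    # r_t ~ alpha * t^(alpha - 1): vanishes iff alpha < 1
    print(alpha, ratio(alpha, 10**6))
```

For $\alpha = 1$ the ratio stays near 1 however large $t$ is, which is why $\gamma_t = \gamma t^{-1}$ is excluded.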
Commentary. Condition (4) for $\gamma_t \to 0$ is the requirement that $\gamma_t$ decrease sufficiently slowly. For example, the sequences $\gamma_t = \gamma t^{-\alpha}$ with $0 < \alpha < 1$ satisfy this restriction, but the sequence $\gamma_t = \gamma t^{-1}$ does not.
We assume a probability space $(\Omega, \mathcal{F}, P)$ with an increasing family of Borel $\sigma$-fields $(\mathcal{F}_t)_{t \ge 0}$. Suppose that $\xi_t$ is a random variable adapted to $\mathcal{F}_t$.
Assumption 2.3. $(\xi_t)$ is a martingale-difference process, i.e., $E(\xi_t \mid \mathcal{F}_{t-1}) = 0$; moreover, $\sup_t E(|\xi_t|^2 \mid \mathcal{F}_{t-1}) < \infty$ a.s. (Here $|\cdot|$ is the Euclidean norm in $R^N$.)
Assumption 2.4. It holds that

$\lim_{C \to \infty} \overline{\lim}_{t \to \infty} E(|\xi_t|^2 I(|\xi_t| > C) \mid \mathcal{F}_{t-1}) = 0$.

(Here $I(A)$ is the characteristic function of a set $A$.)
Assumption 2.5. The following hold:
(a)  $\operatorname{P-lim}_{t \to \infty} E(\xi_t \xi_t^T \mid \mathcal{F}_{t-1}) = S > 0$;
(b)  $\lim_{t \to \infty} E\,\xi_t \xi_t^T = S > 0$.
The notation $S > 0$ means that the matrix $S$ is symmetric and positive definite.
THEOREM 1. (a) Let Assumptions 2.1-2.4, 2.5(a) be satisfied. Then

$\sqrt{t}(\bar x_t - x^*) \Rightarrow N(0, V)$;

i.e., the distribution of the normalized error $\sqrt{t}(\bar x_t - x^*)$ is asymptotically normal with zero mean and the covariance matrix

(5)  $V = A^{-1} S (A^{-1})^T$.

(b) If Assumptions 2.1-2.3, 2.5(b) are satisfied, then $\lim_{t \to \infty} t\,E(\bar x_t - x^*)(\bar x_t - x^*)^T = V$.
(c) Let Assumptions 2.1-2.3 be satisfied and let $(\xi_t)_{t \ge 1}$ be mutually independent and identically distributed. Then

$\bar x_t - x^* \to 0$  a.s.
The proofs of the theorems in this paper are in the Appendix. Part (b) of the theorem was developed in [24] for the case of independent disturbances.
Note that Assumption 2.2 on $\gamma_t$ is significant. If the sequence $\gamma_t = \gamma t^{-1}$ is chosen for algorithm (2) (as is often done for methods using averaging), then the rate of convergence decreases [23]. It was shown in [26] that in the case of independent noises,

$E(\hat x_t - x^*)(\hat x_t - x^*)^T \ge t^{-1} V + o(t^{-1})$

for all linear recursive estimates $\hat x_t$. This asymptotic rate of convergence is achieved by the algorithm

(6)  $x_t = x_{t-1} - t^{-1} A^{-1} y_t$.

Method (2) provides the same rate of convergence as the optimal linear algorithm. The advantage of this method is that it does not require any knowledge about $A$ and does not use matrix-valued $\gamma_t$. Several versions of algorithm (6) use an estimate of the matrix $A$ instead of its true value [22], [31]. The significant advantage of these procedures is that they require only the nonsingularity of $A$ (compare with the rather restrictive Assumption 2.1).
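As a numerical sketch of iteration (2) (an illustration of ours; the matrix $A$, the noise level, and the admissible choice $\gamma_t = 0.5\,t^{-1/2}$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric A with eigenvalues in [0.5, 2], so -A is Hurwitz
# (Assumption 2.1).
N = 4
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
A = Q @ np.diag(rng.uniform(0.5, 2.0, N)) @ Q.T
x_star = rng.standard_normal(N)
b = A @ x_star

# Algorithm (2): slow steps gamma_t = 0.5 * t^(-1/2), plus trajectory averaging.
T = 20000
x = np.zeros(N)                  # x_0: an arbitrary nonrandom point
x_sum = np.zeros(N)
for t in range(1, T + 1):
    gamma_t = 0.5 / t**0.5
    y = A @ x - b + 0.1 * rng.standard_normal(N)   # observation with noise xi_t
    x_sum += x                   # accumulate x_0, ..., x_{t-1} before updating
    x = x - gamma_t * y
x_bar = x_sum / T                # averaged estimate

print(np.linalg.norm(x - x_star), np.linalg.norm(x_bar - x_star))
```

Both errors are small after $T$ steps, with the averaged iterate typically the more accurate one, consistent with Theorem 1.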
3. Nonlinear problem. For nonlinear problems, consider the classical problem of stochastic approximation [21]. Let $R(x): R^N \to R^N$ be some unknown function. Observations $y_t$ of the function are available at any point $x_{t-1} \in R^N$ and contain random disturbances $\xi_t$:

$y_t = R(x_{t-1}) + \xi_t$.

The problem is to find the solution $x^*$ of the equation $R(x) = 0$ by using the observations $y_t$, under the assumption that a unique solution exists. To solve the problem, we use the following modification of algorithm (2):

(7)  $x_t = x_{t-1} - \gamma_t y_t$,  $y_t = R(x_{t-1}) + \xi_t$,  $\bar x_t = \frac{1}{t}\sum_{i=0}^{t-1} x_i$,  $x_0 \in R^N$.
The first equation in (7) defines the standard stochastic approximation process. Let the following assumptions be fulfilled.
Assumption 3.1. There exists a function $V(x): R^N \to R$ such that for some $\lambda > 0$, $\alpha > 0$, $\varepsilon > 0$, $L > 0$, and all $x, y \in R^N$ the conditions $V(x) \ge \alpha|x|^2$, $|\nabla V(x) - \nabla V(y)| \le L|x - y|$, and $\nabla V(x - x^*)^T R(x) > 0$ for $x \ne x^*$ hold true. Moreover, $\nabla V(x - x^*)^T R(x) \ge \lambda V(x - x^*)$ for all $|x - x^*| \ge \varepsilon$.
Assumption 3.2. There exist $K_1 > 0$ and $0 < \lambda \le 1$ such that

(8)  $|R(x) - G(x - x^*)| \le K_1 |x - x^*|^{1+\lambda}$

for all $|x - x^*| \le \varepsilon$, and $\operatorname{Re}\lambda_i(G) > 0$, $i = 1, \ldots, N$.
Assumption 3.3. $(\xi_t)_{t \ge 1}$ is a martingale-difference process defined on a probability space $(\Omega, \mathcal{F}, P)$, i.e., $E(\xi_t \mid \mathcal{F}_{t-1}) = 0$ almost surely, and for some $K_2$

$E(|\xi_t|^2 \mid \mathcal{F}_{t-1}) + |R(x_{t-1})|^2 \le K_2(1 + |x_{t-1} - x^*|^2)$.

The following decomposition takes place:

(9)  $\xi_t = \xi_t(0) + \zeta_t(\Delta_{t-1})$,  $\Delta_{t-1} = x_{t-1} - x^*$,

where $E(\xi_t(0) \mid \mathcal{F}_{t-1}) = 0$ a.s.,

$E(\xi_t(0)\xi_t(0)^T \mid \mathcal{F}_{t-1}) \xrightarrow{P} S$ as $t \to \infty$;  $S > 0$,

$\sup_t E(|\xi_t(0)|^2 I(|\xi_t(0)| > C) \mid \mathcal{F}_{t-1}) \to 0$ as $C \to \infty$,

and, for all $t$ large enough,

$E(|\zeta_t(x)|^2 \mid \mathcal{F}_{t-1}) \to 0$ as $x \to 0$.

Assumption 3.4. It holds that $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$, $\gamma_t > 0$ for all $t$;

(10)  $\sum_{t=1}^{\infty} \gamma_t^{(1+\lambda)/2} t^{-1/2} < \infty$.

THEOREM 2. Let Assumptions 3.1-3.4 be satisfied. Then $\bar x_t - x^* \to 0$ almost surely, and $\sqrt{t}(\bar x_t - x^*) \Rightarrow N(0, V)$ with $V = G^{-1} S (G^{-1})^T$.
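A toy instance of (7), with illustrative choices of ours (the map $R$, its root, the noise level, and $\gamma_t = t^{-0.7}$ are assumptions, not taken from the paper); here $R(x) = \tanh(x - x^*)$ componentwise, so the Jacobian at the root is $G = I$ and $\operatorname{Re}\lambda_i(G) > 0$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy root-finding problem R(x) = 0 with a nonlinear R whose Jacobian
# at the root is G = I (so the eigenvalue condition of Assumption 3.2 holds).
x_star = np.array([1.0, -2.0])

def R(x):
    return np.tanh(x - x_star)

# Algorithm (7): Robbins-Monro iteration with slow steps and averaging.
T = 50000
x = np.zeros(2)
x_sum = np.zeros(2)
for t in range(1, T + 1):
    gamma_t = 1.0 / t**0.7                     # gamma_t = t^(-alpha), alpha = 0.7
    y = R(x) + 0.2 * rng.standard_normal(2)    # noisy observation of R(x_{t-1})
    x_sum += x
    x = x - gamma_t * y
x_bar = x_sum / T

print(np.abs(x - x_star).max(), np.abs(x_bar - x_star).max())
```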
4. Stochastic optimization. Consider the problem of searching for the minimum $x^*$ of the smooth function $f(x)$, $x \in R^N$. The values of the gradient, $y_t = \nabla f(x_{t-1}) + \xi_t$, containing random noise $\xi_t$, are available at an arbitrary point $x_{t-1}$ of $R^N$. To solve this problem, we use the following algorithm of the form (7):

(12)  $x_t = x_{t-1} - \gamma_t \varphi(y_t)$,  $y_t = \nabla f(x_{t-1}) + \xi_t$,  $\bar x_t = \frac{1}{t}\sum_{i=0}^{t-1} x_i$,  $x_0 \in R^N$.
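A sketch of (12) under assumptions of our own choosing (quadratic $f$, Student-$t$ gradient noise, and componentwise clipping as the nonlinearity $\varphi$ — one function satisfying the growth bound of Assumption 4.3; since the noise is symmetric, $\psi(0) = E\varphi(\xi_1) = 0$ as Assumption 4.4 requires):

```python
import numpy as np

rng = np.random.default_rng(2)

# Quadratic objective f(x) = 0.5 (x - x*)^T H (x - x*) with l I <= H <= L I
# (Assumption 4.1); gradients observed under heavy-tailed symmetric noise.
x_star = np.array([0.5, -1.0, 2.0])
H = np.diag([0.5, 1.0, 2.0])

def grad_obs(x):
    # y_t = grad f(x_{t-1}) + xi_t, with Student-t (df = 3) noise
    return H @ (x - x_star) + 0.5 * rng.standard_t(df=3, size=3)

# Algorithm (12): phi = componentwise clipping tempers the heavy tails.
T = 50000
x = np.zeros(3)
x_sum = np.zeros(3)
for t in range(1, T + 1):
    gamma_t = 1.0 / t**0.6
    x_sum += x
    x = x - gamma_t * np.clip(grad_obs(x), -5.0, 5.0)
x_bar = x_sum / T

print(np.abs(x_bar - x_star).max())
```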
Let the following assumptions be fulfilled.
Assumption 4.1. Let $f(x)$ be a twice continuously differentiable function and $lI \le \nabla^2 f(x) \le LI$ for all $x$ and some $l > 0$ and $L > 0$; here $I$ is the identity matrix.
Assumption 4.2. $(\xi_t)_{t \ge 1}$ is a sequence of mutually independent and identically distributed random variables, $E\xi_1 = 0$.
Assumption 4.3. It holds that $|\varphi(x)| \le K_1(1 + |x|)$.
Assumption 4.4. The function $\psi(x) = E\varphi(x + \xi_1)$ is defined and has a derivative at zero, $\psi(0) = 0$, and $x^T\psi(x) > 0$ for all $x \ne 0$. Moreover, there exist $\varepsilon, K_2 > 0$ and $0 < \lambda \le 1$ such that

$|\psi'(0)x - \psi(x)| \le K_2|x|^{1+\lambda}$  for $|x| \le \varepsilon$.
Assumption 4.5. The matrix function $\chi(x) = E\varphi(x + \xi_1)\varphi(x + \xi_1)^T$ is defined and is continuous at zero.
Assumption 4.6. The matrix $-G = -\psi'(0)\nabla^2 f(x^*)$ is Hurwitz, i.e., $\operatorname{Re}\lambda_i(G) > 0$, $i = 1, \ldots, N$.
Assumption 4.7. It holds that $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$, $\gamma_t > 0$ for all $t$; $\sum_{t=1}^{\infty} \gamma_t^{(1+\lambda)/2} t^{-1/2} < \infty$.
THEOREM 3. Assume that Assumptions 4.1-4.7 hold. Then, for algorithm (12), $\bar x_t - x^* \to 0$ almost surely and $\sqrt{t}(\bar x_t - x^*) \Rightarrow N(0, V)$, where

$V = (\psi'(0)\nabla^2 f(x^*))^{-1} \chi(0) (\psi'(0)\nabla^2 f(x^*))^{-1}$.

5. Linear regression. Consider the problem of estimating the parameters of a linear regression from the observations $y_t = \theta^T x_t + \xi_t$, where $x_t \in R^N$ are random inputs and $\xi_t$ are random disturbances. We use the following algorithm of the form (7) to obtain the sequence of estimates $(\bar\theta_t)$ of the parameter $\theta$:
(16)  $\theta_t = \theta_{t-1} + \gamma_t \varphi(y_t - \theta_{t-1}^T x_t) x_t$,  $\bar\theta_t = \frac{1}{t}\sum_{i=0}^{t-1} \theta_i$,  $\theta_0 \in R^N$.
Suppose the following assumptions hold true.
Assumption 5.1. Let $(\xi_t)_{t \ge 1}$ be a sequence of mutually independent and identically distributed random variables, $E\xi_1 = 0$, $E\xi_1^2 < \infty$.
Assumption 5.2. Let $(x_t)_{t \ge 1}$ be a sequence of mutually independent and identically distributed random vectors, $E|x_1|^4 < \infty$, $E x_1 x_1^T = B$, $B > 0$. The sequences $(\xi_t)_{t \ge 1}$ and $(x_t)_{t \ge 1}$ are mutually independent.
Assumption 5.3. There exists $K_1$ such that $|\varphi(x)| \le K_1(1 + |x|)$ for all $x \in R$.
The functions $\psi(x) = E\varphi(x + \xi_1)$ and $\chi(x) = E\varphi^2(x + \xi_1)$ are defined under Assumptions 5.1-5.3. Now we state restrictions on $\psi$, $\chi$.
Assumption 5.4. It holds that $\psi(0) = 0$, $x\psi(x) > 0$ for all $x \ne 0$, $\psi(x)$ has a derivative at zero, and $\psi'(0) > 0$. Moreover, there exist $K_2 < \infty$ and $0 < \lambda \le 1$ such that

$|\psi(x) - \psi'(0)x| \le K_2|x|^{1+\lambda}$.

Assumption 5.5. The function $\chi(x)$ is continuous at zero.
Assumption 5.6. It holds that $(\gamma_t - \gamma_{t+1})/\gamma_t = o(\gamma_t)$, $\gamma_t > 0$ for all $t$;

$\sum_{t=1}^{\infty} \gamma_t^{(1+\lambda)/2} t^{-1/2} < \infty$.
THEOREM 4. Assume that Assumptions 5.1-5.6 hold. Then, for algorithm (16), the following properties hold true: $\bar\theta_t - \theta \to 0$ almost surely and $\sqrt{t}(\bar\theta_t - \theta) \Rightarrow N(0, V)$, where

$V = \frac{\chi(0)}{\psi'(0)^2} B^{-1}$.
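A sketch of algorithm (16) with the linear choice $\varphi(x) = x$ (then $\psi'(0) = 1$ and $\chi(0) = E\xi_1^2$, so Theorem 4 gives $V = E\xi_1^2 \cdot B^{-1}$); the dimension, noise level, and step sequence below are illustrative assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear regression y_t = theta^T x_t + xi_t with i.i.d. standard normal
# inputs (so B = E x x^T = I) and noise of variance 0.09.
N = 3
theta = np.array([1.0, -0.5, 2.0])

# Algorithm (16) with phi(x) = x: slow steps plus averaging.
T = 100000
th = np.zeros(N)
th_sum = np.zeros(N)
for t in range(1, T + 1):
    x_t = rng.standard_normal(N)
    y_t = theta @ x_t + 0.3 * rng.standard_normal()
    gamma_t = 0.1 / t**0.6
    th_sum += th
    th = th + gamma_t * (y_t - th @ x_t) * x_t
th_bar = th_sum / T

print(np.abs(th_bar - theta).max())
```

With $\varphi$ linear and Gaussian noise this is the situation in which the averaged estimate attains the same $t^{-1}V$ rate as recursive least squares.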
The problem of the mean square convergence of method (16) is discussed in [24]. Note that the conditions of Theorem 4 are similar to the conditions of standard results for this problem [27]. If $\xi_1$ possesses a continuously differentiable density function $p(x)$, then the optimal algorithm proposed in the latter paper has the following form:

(17)  $\theta_t = \theta_{t-1} + \Gamma_t \varphi(y_t - \theta_{t-1}^T x_t) x_t$,  $\Gamma_t = t^{-1} B^{-1}$,  $\varphi(x) = -J(p)^{-1} p'(x)/p(x)$,

where $J(p)$ denotes the Fisher information of the density $p$.
For method (17),

$\sqrt{t}(\theta_t - \theta) \Rightarrow N(0, V)$,  $V = J(p)^{-1} B^{-1}$.
Since the matrix $B$ is unknown, algorithm (17) is unimplementable. Nevertheless, it is possible to use, instead of $B$, its estimate

(18)  $\hat B_t = \frac{1}{t}\sum_{k=1}^{t} x_k x_k^T$.
In particular, for linear $\varphi$ (i.e., for Gaussian noises) methods (17), (18) coincide with the recursive least-squares algorithm. It follows from Theorem 4 that if we choose

$\varphi(x) = -J(p)^{-1} p'(x)/p(x)$

for algorithm (16), then the asymptotic covariance is $V = J(p)^{-1} B^{-1}$. So the asymptotic rates of convergence of (16) and (17) coincide.
Appendix. The proofs of the theorems consist of a sequence of propositions followed by their proofs. Everywhere in the following, we use the notation $\Delta_t = x_t - x^*$ for the error of the first equation of the algorithm, and $\bar\Delta_t = \bar x_t - x^*$ for the estimation error. Nonrandom constants that are unimportant will be denoted by the symbols $K$ and $a$. All relations between random variables are supposed to hold almost surely (unless declared otherwise).
The two matrix lemmas below will be useful in later developments. Let $(X_t^j)_{t \ge j}$ and $(\bar X_t^j)_{t \ge j}$ be the sequences of matrices, $X_t^j, \bar X_t^j \in R^{N \times N}$, determined by the following recursive relations:
(A1)  $X_{t+1}^j = X_t^j - \gamma_t A X_t^j$,  $X_j^j = I$;  $\bar X_t^j = \sum_{i=j}^{t} \gamma_i X_i^j$.

LEMMA 1. Let the following hold: (i) Assumption 2.2 of Theorem 1 holds; (ii) $\operatorname{Re}\lambda_i(A) > 0$, $i = 1, \ldots, N$. Then there is a constant $K < \infty$ such that for all $j$ and $t \ge j$

(A2)  $|\bar X_t^j| \le K$,

(A3)  $\lim_{t \to \infty} \frac{1}{t}\sum_{j=0}^{t-1} |\bar X_t^j - A^{-1}| = 0$.

Proof of Lemma 1.
Part 1. The proposition of the lemma is true for $\gamma_t \equiv \gamma$.
Proof. We obtain from (A1) that $X_t^j = (I - \gamma A)^{t-j}$; so

$\bar X_t^j = \gamma(I + (I - \gamma A) + \cdots + (I - \gamma A)^{t-j}) = A^{-1} - (I - \gamma A)^{t-j+1} A^{-1}$.

The eigenvalues of the matrix $I - \gamma A$ are $\lambda_i(I - \gamma A) = 1 - \gamma\lambda_i(A)$, and $|\lambda_i(I - \gamma A)| < 1$. So

$\lim_{t \to \infty} (I - \gamma A)^t = 0$;
hence (A2) holds, and

$\frac{1}{t}\sum_{j=0}^{t-1} |\bar X_t^j - A^{-1}| = \frac{1}{t}\sum_{j=0}^{t-1} |(I - \gamma A)^{t-j+1} A^{-1}| = \frac{1}{t}\sum_{k=2}^{t+1} |(I - \gamma A)^k A^{-1}| \to 0$  as $t \to \infty$.
Part 2. It holds that $t\gamma_t \to \infty$.
Proof. Define $\alpha_t = 1/(t\gamma_t)$. Then

$\alpha_{t+1} = \frac{1}{t+1}(\gamma_t^{-1} + o(1)) = \alpha_t \frac{t}{t+1} + \frac{o(1)}{t+1}$,

where $o(1) \to 0$ as $t \to \infty$. Since $\sum_{t=1}^{\infty} 1/(t+1) = \infty$, we obtain that $\alpha_t \to 0$. □
Part 3. There are $c > 0$ and $K < \infty$ such that for all $j$ and $t \ge j$

$|X_t^j| \le K \exp\Big(-c\sum_{i=j}^{t}\gamma_i\Big)$.

Proof. From assumption (ii) of the lemma and from the Lyapunov theorem, we have that there exists a solution $V = V^T > 0$ of the Lyapunov equation $A^T V + VA = I$. Define $L = \max \lambda_i(V)$, $l = \min \lambda_i(V)$, $U_t = (X_t^j)^T V X_t^j$. Then

(A4)  $U_{t+1} = (X_t^j)^T (I - \gamma_t A)^T V (I - \gamma_t A) X_t^j = U_t - \gamma_t (X_t^j)^T (A^T V + VA) X_t^j + \gamma_t^2 (X_t^j)^T A^T V A X_t^j$.

Note that $(X_t^j)^T X_t^j \ge (1/L)(X_t^j)^T V X_t^j$ and $(X_t^j)^T A^T V A X_t^j \le \bar c\,(X_t^j)^T V X_t^j$, where $\bar c = \|A\|^2 L / l$. Then, for sufficiently large $t$ and some $\Lambda > 0$, we get from (A4) that
$U_{t+1} \le U_t(1 - \Lambda\gamma_t)$ …

… we have from Part 1 of the proof that $P(\sup_t |\bar\Delta_t| < \infty) = 1$. Thus, for every $\varepsilon > 0$, there exists some $R < \infty$ such that
(A12)  $P(\sup_t |\bar\Delta_t| \le R) \ge 1 - \varepsilon$.
Define the stopping time $\tau_R = \inf\{t \ge 1 : |\bar\Delta_t| > R\}$.
Part 2. It holds that $E|\Delta_t|^2 I(\tau_R > t) \le K\gamma_t$.
Proof. On $\{\tau_R > t\}$ we have from (A11) that

(A13)  $E(V_t I(\tau_R > t) \mid \mathcal{F}_{t-1}) \le V_{t-1}(1 + \gamma_t^2 K - a\gamma_t) I(\tau_R > t-1) + K\gamma_t^2$
for some $K, a > 0$. Taking the expectation, we obtain from (A13) that

$E V_t I(\tau_R > t) \le E V_{t-1} I(\tau_R > t-1)(1 - \gamma_t a + K\gamma_t^2) + K\gamma_t^2$.

Finally, by Lemma 2.1.26 of [3], we obtain that

(A14)  $E V_t I(\tau_R > t) \le K\gamma_t$.
$E(|\xi_t|^2 I(|\xi_t| > C) \mid \mathcal{F}_{t-1}) \to 0$.

Proof. By decomposition (9), we have that … Then $I_1 \to 0$ as $t \to \infty$ and $C \to \infty$ by Assumption 2.3; $I_2 \to 0$, since $\Delta_t$ converges to zero. Therefore all the conditions of proposition (a) of Theorem 1 hold for the process … We demonstrate the proximity of the processes $\bar\Delta_t$ and …. Set $\delta_t = \ldots$; then for $\delta_t$ we obtain the equation (compare with (A9))

$\delta_t = \frac{1}{t}\bar X_t^0 \Delta_0 + \frac{1}{t}\sum_{j=1}^{t-1} (G^{-1} + W_t^j)(R(\Delta_j) - G\Delta_j) + \ldots$
Part 4. It holds that $\delta_t \to 0$ as $t \to \infty$.
Proof. From Lemma 2 we immediately get that … Next, due to Assumption 2.2 and Lemma 2, we get that

$\frac{1}{t}\sum_{i=0}^{t-1} \ldots \to 0$  as $t \to \infty$.
Thus we obtain from (A14) and Assumption 2.4 that

$E \sum |\Delta_i|^{1+\lambda} I(\tau_R > t) \ldots$

… $x^T\psi(x) > 0$ for all $x \ne 0$. Let $f(x^*) = 0$ for the sake of simplicity. It follows from Assumptions 4.1 and 4.4 that there exist $a > 0$, $a' > 0$, $\varepsilon > 0$ such that

$R^T(x)\nabla f(x) \ge a|\nabla f(x)|^2 \ge a' f(x)$

for all $|x - x^*| \le \varepsilon$; hence $f(x)$ is a Lyapunov function for (A15), and all corresponding conditions of Theorem 2 are fulfilled.
So we obtain by Assumption 4.4 that

$|R(x) - G(x - x^*)| = |\psi(\nabla f(x)) - \psi'(0)\nabla^2 f(x^*)(x - x^*)| \le |\psi(\nabla f(x)) - \psi'(0)\nabla f(x)| + |\psi'(0)\nabla f(x) - \psi'(0)\nabla^2 f(x^*)(x - x^*)| \le K|x - x^*|^{1+\lambda}$.

From the definition (A16), by Assumption 3.3, we get that

$I(|\zeta_t(\Delta_{t-1})| > C) \le I(|\Delta_{t-1}| > KC) + I(|\xi_t| > KC)$;

so

$E(|\zeta_t(\Delta_{t-1})|^2 I(|\zeta_t(\Delta_{t-1})| > C) \mid \mathcal{F}_{t-1}) \le \ldots \to 0$  as $t \to \infty$, $C \to \infty$.
(Here $o(1) \to 0$ as $C \to \infty$.) This means that Assumption 3.3 of Theorem 2 holds. Therefore all conditions of the proposition of Theorem 2 are fulfilled, and the matrix $V$ is defined by the equation

$V = G^{-1}\chi(0)G^{-1} = (\psi'(0)\nabla^2 f(x^*))^{-1}\chi(0)(\psi'(0)\nabla^2 f(x^*))^{-1}$. □

Proof of Theorem 4. Let $\Delta_t = \theta_t - \theta$ be the error of the first equation in (16). Denote by $\mathcal{F}_t$ the minimal $\sigma$-algebra generated by the disturbances and inputs until the time $t$: $\mathcal{F}_t = \sigma(\xi_1, x_1, \ldots, \xi_t, x_t)$. Let $R(\Delta) = -E\varphi(\xi_1 - \Delta^T x_1)x_1$. We obtain the following equation for $\Delta_t$:
(A18)  $\Delta_t = \Delta_{t-1} - \gamma_t R(\Delta_{t-1}) + \gamma_t \varepsilon_t$,

where $\varepsilon_t = R(\Delta_{t-1}) + \varphi(\xi_t - \Delta_{t-1}^T x_t)x_t$. We check the fulfillment of the assumptions of Theorem 2 in this case. Assumption 5.4 implies that $\Delta^T R(\Delta) > 0$ for all $\Delta \ne 0$ and $R(\Delta) = 0$ for $\Delta = 0$; so $V(\Delta) = |\Delta|^2$ is the
Lyapunov function for (A18). Hence Assumption 3.1 of Theorem 2 is satisfied. Next, for some $K$,

$|R(\Delta) - \psi'(0)B\Delta| \le K E|\Delta^T x_1|^{1+\lambda} \le K|\Delta|^{1+\lambda}$;

hence Assumption 3.2 of Theorem 2 holds with $G = \psi'(0)B$. Further,

$I(|\varepsilon_t| > C) \le I(\ldots) + I(\ldots) + \ldots = I_1 + I_2 + I_3$.

So we obtain that
$E(|\varepsilon_t|^2 I(|\varepsilon_t| > C) \mid \mathcal{F}_{t-1}) \le K E\big((|\xi_t|^2|x_t|^2 + |\Delta_{t-1}|^2|x_t|^4)(I(\ldots) + \ldots) \mid \mathcal{F}_{t-1}\big) + \ldots = I_1 + I_2 + I_3 + I_4$.

From Assumption 5.1 we get that $I_1 \to 0$ and $I_2 \to 0$ as $C \to \infty$. Since $\gamma_t \to 0$, $I_4 \to 0$. By the Chebyshev inequality, we get that

$I_3 \le K(E|x_t|^4)^{1/2} P^{1/2}(|x_t|^2 > C) \le \frac{K}{C} E|x_t|^4 \to 0$  as $C \to \infty$.
So Assumption 3.3 of Theorem 2 is fulfilled. Therefore all the conditions of Theorem 2 are fulfilled under the assumptions of Theorem 4. Finally, we obtain for the matrix $V$ that

$V = (\psi'(0)B)^{-1}\chi(0)B(\psi'(0)B)^{-1} = \frac{\chi(0)}{\psi'(0)^2}B^{-1}$. □
REFERENCES
[1] M. A. AIZERMAN, E. M. BRAVERMAN, AND L. I. ROZONOER, The Method of Potential Functions in the Machine Learning Theory, Nauka, Moscow, 1970. (In Russian.)
[2] A. M. BENDERSKIJ AND M. B. NEVEL'SON, Multidimensional asymptotically optimal stochastic approximation procedure, Problems Inform. Transmission, 17 (1982), pp. 423-434.
[3] A. BENVENISTE, M. MÉTIVIER, AND P. PRIOURET, Algorithmes Adaptatifs et Approximations Stochastiques (Théorie et Applications), Masson, Paris, 1987.
[4] A. BERMAN, A. FEUER, AND E. WAHNON, Convergence analysis of smoothed stochastic gradient-type algorithm, Internat. J. Systems Sci., 18 (1987), pp. 1061-1078.
[5] YU. M. ERMOL'EV, Stochastic Programming Methods, Nauka, Moscow, 1976. (In Russian.)
[6] V. FABIAN, Asymptotically efficient stochastic approximation: The RM case, Ann. Statist., 1 (1973), pp. 486-495.
[7] ——, On asymptotically efficient recursive estimation, Ann. Statist., 6 (1978), pp. 854-866.
[8] V. N. FOMIN, Recursive Estimation and Adaptive Filtering, Nauka, Moscow, 1984. (In Russian.)
[9] K. S. FU, Sequential Methods in Pattern Recognition and Machine Learning, Academic Press, New York, London, 1968.
[10] G. S. GOODWIN AND K. S. SIN, Adaptive Filtering, Prediction and Control, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[11] A. M. GUPAL AND L. G. BAJENOV, Stochastic analog of the conjugate gradients method, Cybernetics, N1 (1972), pp. 125-126. (In Russian.)
[12] J. KIEFER AND J. WOLFOWITZ, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist., 23 (1952), pp. 462-466.
[13] I. P. KORNFEL'D AND SH. E. SHTEINBERG, Estimation of the parameters of linear and nonlinear systems using the method of averaged residuals, Automat. Remote Control, 46 (1986), pp. 966-974.
[14] A. P. KOROSTELEV, Stochastic Recurrent Procedures, Nauka, Moscow, 1981. (In Russian.)
[15] ——, On multi-step stochastic optimization procedures, Automat. Remote Control, 43 (1982), pp. 606-611.
[16] H. J. KUSHNER AND D. S. CLARK, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer, New York, 1978.
[17] R. SH. LIPTZER AND A. N. SHIRYAEV, Martingale Theory, Nauka, Moscow, 1986. (In Russian.)
[18] L. LJUNG AND T. SÖDERSTRÖM, Theory and Practice of Recursive Identification, MIT Press, Cambridge, MA, 1983.
[19] A. V. NAZIN, Informational bounds for gradient stochastic optimization and optimal implemented algorithms, Automat. Remote Control, 50 (1989), pp. 520-531.
[20] A. S. NEMIROVSKIJ AND D. B. YUDIN, Complexity of Problems and Effectiveness of Optimization Methods, Nauka, Moscow, 1980. (In Russian.)
[21] M. B. NEVEL'SON AND R. Z. KHAS'MINSKIJ, Stochastic Approximation and Recursive Estimation, American Mathematical Society, Providence, RI, 1973.
[22] ——, Adaptive Robbins-Monro procedure, Automat. Remote Control, 34 (1974), pp. 1594-1607.
[23] B. T. POLYAK, Comparison of convergence rate for single-step and multi-step optimization algorithms in the presence of noise, Engrg. Cybernet., 15 (1977), pp. 6-10.
[24] ——, New stochastic approximation type procedures, Avtomat. i Telemekh., N7 (1990), pp. 98-107. (In Russian.) Translated in Automat. Remote Control, 51 (1991), to appear.
[25] ——, Introduction to Optimization, Optimization Software, New York, 1987.
[26] B. T. POLYAK AND YA. Z. TSYPKIN, Attainable accuracy of adaptation algorithms, in Problems of Cybernetics. Adaptive Systems, Nauka, Moscow, 1976, pp. 6-19. (In Russian.)
[27] ——, Adaptive estimation algorithms (convergence, optimality, stability), Automat. Remote Control, 40 (1980), pp. 378-389.
[28] ——, Optimal pseudogradient adaptation algorithms, Automat. Remote Control, 41 (1981), pp. 1101-1110.
[29] H. ROBBINS AND S. MONRO, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400-407.
[30] H. ROBBINS AND D. SIEGMUND, A convergence theorem for nonnegative almost supermartingales and some applications, in Optimizing Methods in Statistics, J. S. Rustagi, ed., Academic Press, New York, 1971, pp. 233-257.
[31] D. RUPPERT, A Newton-Raphson version of the multivariate Robbins-Monro procedure, Ann. Statist., 13 (1985), pp. 236-245.
[32] ——, Efficient estimators from a slowly convergent Robbins-Monro process, Tech. Report 781, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY, 1988.
[33] A. RUSZCZYNSKI AND W. SYSKI, Stochastic approximation method with gradient averaging for unconstrained problems, IEEE Trans. Automat. Control, AC-28 (1983), pp. 1097-1105.
[34] D. J. SAKRISON, Stochastic approximation: A recursive method for solving regression problems, in Advances in Communication Theory and Applications, 2, A. V. Balakrishnan, ed., Academic Press, New York, London, 1966, pp. 51-106.
[35] A. N. SHIRYAEV, Probability, Nauka, Moscow, 1980. (In Russian.)
[36] YA. Z. TSYPKIN, Adaptation and Learning in Automatic Systems, Academic Press, New York, London, 1971.
[37] ——, Foundations of Informational Theory of Identification, Nauka, Moscow, 1984. (In Russian.)
[38] J. H. VENTER, An extension of the Robbins-Monro procedure, Ann. Math. Statist., 38 (1967), pp. 181-190.
[39] M. L. VIL'K AND S. V. SHIL'MAN, Convergence and optimality of implementable adaptation algorithms (informational approach), Problems Inform. Transmission, 20 (1985), pp. 314-326.
[40] M. WASAN, Stochastic Approximation, Cambridge University Press, London, 1970.