RESEARCH ARTICLE
Sequential linear regression with online standardized data
Kévin Duarte1,2,3*, Jean-Marie Monnez1,2,3,4, Eliane Albuisson1,5,6
1 Université de Lorraine, Institut Elie Cartan de Lorraine, UMR 7502, Vandoeuvre-lès-Nancy, F-54506, France, 2 Project-Team BIGS, INRIA, Villers-lès-Nancy, F-54600, France, 3 INSERM U1116, Centre d'Investigations Cliniques-Plurithématique 1433, Université de Lorraine, Nancy, France, 4 Université de Lorraine, Institut Universitaire de Technologie Nancy-Charlemagne, Nancy, F-54052, France, 5 BIOBASE, Pôle S2R, CHRU de Nancy, Vandoeuvre-lès-Nancy, France, 6 Faculté de Médecine, InSciDenS, Vandoeuvre-lès-Nancy, France
OPEN ACCESS Citation: Duarte K, Monnez J-M, Albuisson E (2018) Sequential linear regression with online standardized data. PLoS ONE 13(1): e0191186. https://doi.org/10.1371/journal.pone.0191186 Editor: Chenping Hou, National University of Defense Technology, CHINA Received: April 1, 2017 Accepted: December 31, 2017 Published: January 18, 2018
* [email protected]
Abstract

The present study addresses the problem of sequential least square multidimensional linear regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the phenomenon of numerical explosion which can be encountered and to reduce the computing time in order to take into account a maximum of arriving data, we propose using a process with online standardized data instead of raw data, together with the use of several observations per step or of all observations until the current step. Herein, we define and study the almost sure convergence of three processes with online standardized data: a classical process with a variable step-size and use of a varying number of observations per step, an averaged process with a constant step-size and use of a varying number of observations per step, and a process with a variable or constant step-size and use of all observations until the current step. Their convergence is obtained under more general assumptions than the classical ones. These processes are compared to classical processes on 11 datasets, first for a fixed total number of observations used and thereafter for a fixed processing time. Analyses indicate that the third-defined process typically yields the best results.
Copyright: © 2018 Duarte et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All datasets used in our experiments except those derived from the EPHESUS study are available online, and links to download these data appear in Table 2 of our article. Due to legal restrictions, data from the EPHESUS study are only available upon request. Interested researchers may request access to data upon approval from the EPHESUS Executive Steering Committee of the study. This committee can be reached through Pr Faiez Zannad (f.[email protected]), who is a member of this board.
1 Introduction

In the present analysis, $A'$ denotes the transposed matrix of $A$, while the abbreviation "a.s." signifies almost surely. Let $R = (R^1, \ldots, R^p)$ and $S = (S^1, \ldots, S^q)$ be random vectors in $\mathbb{R}^p$ and $\mathbb{R}^q$ respectively. Considering the least square multidimensional linear regression of S with respect to R, the (p, q) matrix θ and the (q, 1) matrix η are estimated such that $E\big[\|S - \theta' R - \eta\|^2\big]$ is minimal.
Funding: This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the second "Investissements d'Avenir" programme (reference: ANR-15-RHU0004). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Denote the covariance matrices
$$B = \mathrm{Covar}[R] = E\big[(R - E[R])(R - E[R])'\big], \qquad F = \mathrm{Covar}[R, S] = E\big[(R - E[R])(S - E[S])'\big].$$
If we assume B is positive definite, i.e. there is no affine relation between the components of R, then
$$\theta = B^{-1}F, \qquad \eta = E[S] - \theta' E[R].$$
Note that, denoting by $R^1$ the random vector in $\mathbb{R}^{p+1}$ such that $R^1 = (R'\ 1)'$, by $\theta^1$ the $(p+1, q)$ matrix such that $\theta^1 = (\theta'\ \eta)'$, $B^1 = E[R^1 R^{1\prime}]$ and $F^1 = E[R^1 S']$, we obtain $\theta^1 = (B^1)^{-1}F^1$. In order to estimate θ (or $\theta^1$), a stochastic approximation process $(X_n)$ in $\mathbb{R}^{p\times q}$ (or $\mathbb{R}^{(p+1)\times q}$) is recursively defined such that
$$X_{n+1} = X_n - a_n (B_n X_n - F_n),$$
where $(a_n)$ is a sequence of positive real numbers, possibly constant, called step-sizes (or gains). Matrices $B_n$ and $F_n$ have the same dimensions as B and F, respectively. The convergence of $(X_n)$ towards θ is studied under appropriate definitions and assumptions on $B_n$ and $F_n$. Suppose that $((R^1_n, S_n), n \ge 1)$ is an i.i.d. sample of $(R^1, S)$. In the case where $q = 1$, $B_n = R^1_n R^{1\prime}_n$ and $F_n = R^1_n S_n'$, several studies have been devoted to this stochastic gradient process (see for example Monnez [1], Ljung [2] and references therein). In order to accelerate general stochastic approximation procedures, Polyak [3] and Polyak and Juditsky [4] introduced the averaging technique. In the case of linear regression, Györfi and Walk [5] studied an averaged stochastic approximation process with a constant step-size. With the same type of process, Bach and Moulines [6] proved that the optimal convergence rate is achieved without a strong convexity assumption on the loss function. However, this type of process may be subject to the risk of numerical explosion when components of R or S exhibit large variances and may take very high values. For the datasets used as test sets by Bach and Moulines [6], all sample points whose norm of R is fivefold greater than the average norm are removed. Moreover, generally only one observation of (R, S) is introduced at each step of the process. This may not be convenient for a large amount of data generated, for example, by a data stream. Two modifications of this type of process are thus proposed in this article.

The first change, made in order to avoid numerical explosion, is the use of standardized, i.e. zero-mean and unit-variance, components of R and S. In fact, the expectation and the variance of the components are usually unknown and will be estimated online. The parameter θ can be computed from the standardized components as follows. Let $\sigma^j$ be the standard deviation of $R^j$ for $j = 1, \ldots, p$ and $\sigma^{1k}$ the standard deviation of $S^k$ for $k = 1, \ldots, q$. Define
the following diagonal matrices
$$\Gamma = \begin{pmatrix} \dfrac{1}{\sigma^1} & & 0 \\ & \ddots & \\ 0 & & \dfrac{1}{\sigma^p} \end{pmatrix}, \qquad \Gamma^1 = \begin{pmatrix} \dfrac{1}{\sigma^{11}} & & 0 \\ & \ddots & \\ 0 & & \dfrac{1}{\sigma^{1q}} \end{pmatrix}.$$
Let $S^c = \Gamma^1(S - E[S])$ and $R^c = \Gamma(R - E[R])$. The least square linear regression of $S^c$ with respect to $R^c$ is achieved by estimating the (p, q) matrix $\theta_c$ such that $E\big[\|S^c - \theta_c' R^c\|^2\big]$ is minimal. Then $\theta_c = \Gamma^{-1}(B^{-1}F)\Gamma^1$ and $\theta = B^{-1}F = \Gamma\theta_c(\Gamma^1)^{-1}$.

The second change is to use, at each step of the process, several observations of (R, S), or an estimation of B and F computed recursively from all observations until the current step without storing them.

More precisely, the convergence of three processes with online standardized data is studied in sections 2, 3 and 4 respectively. First, in section 2, a process with a variable step-size $a_n$ and use of several online standardized observations at each step is studied; note that the number of observations at each step may vary with n. Secondly, in section 3, an averaged process with a constant step-size and use of a varying number of online standardized observations at each step is studied. Thirdly, in section 4, a process with a constant or variable step-size and use of all online standardized observations until the current step to estimate B and F is studied. These three processes are tested on several datasets when q = 1, S being a continuous or binary variable, and compared to existing processes in section 5. Note that when S is a binary variable, linear regression is equivalent to a linear discriminant analysis. It appears that the third-defined process most often yields the best results for the same number of observations used or for the same computing time.

These processes belong to the family of stochastic gradient processes and are adapted to data streams. Batch gradient and stochastic gradient methods are presented and compared in [7] and reviewed in [8], including noise reduction methods such as dynamic sample size methods, stochastic variance reduced gradient (also studied in [9]), second-order methods, ADAGRAD [10] and other methods. This work makes the following contributions to variance reduction methods:

• In [9], the authors proposed a modification of the classical stochastic gradient algorithm to directly reduce the variance of the gradient of the function to be optimized, in order to obtain faster convergence. It is proposed in this article to reduce this variance by an online standardization of the data.

• Gradient clipping [11] is another method to avoid a numerical explosion. The idea is to limit the norm of the gradient to a maximum number called the threshold. This number must be chosen, and a bad choice of threshold can affect the computing speed; moreover, it is then necessary to compare the norm of the gradient to this threshold at each step. In our approach the limitation of the gradient is implicitly obtained by the online standardization of the data.

• If the expectation and the variance of the components of R and S were known, standardization of these variables could be made directly and convergence of the processes obtained using existing theorems. But these moments are unknown in the case of a data stream and
are estimated online in this study. Thus the assumptions of the theorems of almost sure (a.s.) convergence of the processes studied in sections 2 and 3, and the corresponding proofs, are more general than the classical ones in the linear regression case [1–5].

• The process defined in section 4 is not a classical batch method. Indeed, in this type of method (gradient descent), the whole set of data is known a priori and is used at each step of the process. In the present study, new data are supposed to arrive at each step, as in a data stream, and are added to the preceding set of data, thus reducing the variance by averaging. This process can be considered as a dynamic batch method.

• A suitable choice of step-size is often crucial for obtaining good performance of a stochastic gradient process. If the step-size is too small, the convergence will be slower. Conversely, if the step-size is too large, a numerical explosion may occur during the first iterations. Following [6], a very simple choice of the step-size is proposed for the methods with a constant step-size.

• Another objective is to reduce the computing time in order to take into account a maximum of data in the case of a data stream. It appears in the experiments that the use of all observations until the current step without storing them, several observations being introduced at each step, generally yields the best improvement of the convergence speed of the process. Moreover, this can reduce the influence of outliers.

As a whole, the major contributions of this work are to reduce the gradient variance by online standardization of the data or by the use of a "dynamic" batch process, to avoid numerical explosions, to reduce the computing time and, consequently, to better adapt the stochastic approximation processes used to the case of a data stream.
2 Convergence of a process with a variable step-size

Let $(B_n, n \ge 1)$ and $(F_n, n \ge 1)$ be two sequences of random matrices in $\mathbb{R}^{p\times p}$ and $\mathbb{R}^{p\times q}$ respectively. In this section, the convergence of the process $(X_n, n \ge 1)$ in $\mathbb{R}^{p\times q}$ recursively defined by
$$X_{n+1} = X_n - a_n (B_n X_n - F_n)$$
and its application to sequential linear regression are studied.
2.1 Theorem

Let $X_1$ be a random variable in $\mathbb{R}^{p\times q}$ independent from the sequence of random variables $((B_n, F_n), n \ge 1)$ in $\mathbb{R}^{p\times p} \times \mathbb{R}^{p\times q}$. Denote $T_n$ the σ-field generated by $X_1$ and $(B_1, F_1), \ldots, (B_{n-1}, F_{n-1})$; $X_1, X_2, \ldots, X_n$ are $T_n$-measurable. Let $(a_n)$ be a sequence of positive numbers. Make the following assumptions:

(H1a) There exists a positive definite symmetrical matrix B such that a.s.
1) $\sum_{n=1}^{\infty} a_n \, \|E[B_n \mid T_n] - B\| < \infty$
2) $\sum_{n=1}^{\infty} a_n^2 \, E\big[\|B_n - B\|^2 \mid T_n\big] < \infty$.

(H2a) There exists a matrix F such that a.s.
1) $\sum_{n=1}^{\infty} a_n \, \|E[F_n \mid T_n] - F\| < \infty$
2) $\sum_{n=1}^{\infty} a_n^2 \, E\big[\|F_n - F\|^2 \mid T_n\big] < \infty$.

(H3a) $\sum_{n=1}^{\infty} a_n = \infty$, $\sum_{n=1}^{\infty} a_n^2 < \infty$.
Theorem 1. Suppose H1a, H2a and H3a hold. Then $X_n$ converges to $\theta = B^{-1}F$ a.s.

State the Robbins–Siegmund lemma [12] used in the proof.

Lemma 2. Let $(\Omega, \mathcal{A}, P)$ be a probability space and $(T_n)$ a non-decreasing sequence of sub-σ-fields of $\mathcal{A}$. Suppose for all n, $z_n$, $\alpha_n$, $\beta_n$ and $\gamma_n$ are four integrable non-negative $T_n$-measurable random variables defined on $(\Omega, \mathcal{A}, P)$ such that
$$E[z_{n+1} \mid T_n] \le z_n(1 + \alpha_n) + \beta_n - \gamma_n \quad a.s.$$
Then, in the set $\Big\{\sum_{n=1}^{\infty}\alpha_n < \infty, \ \sum_{n=1}^{\infty}\beta_n < \infty\Big\}$, $(z_n)$ converges to a finite random variable and $\sum_{n=1}^{\infty}\gamma_n < \infty$ a.s.
Proof of Theorem 1. The Frobenius norm $\|A\|$ for a matrix A is used. Recall that, if $\|A\|_2$ denotes the spectral norm of A, $\|AB\| \le \|A\|_2\|B\|$.
$$X_{n+1} - \theta = X_n - \theta - a_n(B_n X_n - F_n) = (I - a_n B)(X_n - \theta) - a_n\big((B_n - B)X_n - (F_n - F)\big).$$
Denote $Z_n = (B_n - B)X_n - (F_n - F) = (B_n - B)(X_n - \theta) + (B_n - B)\theta - (F_n - F)$ and $X_n^1 = X_n - \theta$. Then:
$$X_{n+1}^1 = (I - a_n B)X_n^1 - a_n Z_n,$$
$$\|X_{n+1}^1\|^2 = \|(I - a_n B)X_n^1\|^2 - 2a_n\big\langle (I - a_n B)X_n^1, Z_n\big\rangle + a_n^2\|Z_n\|^2.$$
Denote λ the smallest eigenvalue of B. As $a_n \to 0$, we have for n sufficiently large
$$\|I - a_n B\|_2 = 1 - a_n\lambda < 1.$$
Then, taking the conditional expectation with respect to $T_n$ yields almost surely:
$$E\big[\|X_{n+1}^1\|^2 \mid T_n\big] \le (1 - a_n\lambda)^2\|X_n^1\|^2 + 2a_n\big|\big\langle (I - a_n B)X_n^1, E[Z_n \mid T_n]\big\rangle\big| + a_n^2 E\big[\|Z_n\|^2 \mid T_n\big],$$
$$E[Z_n \mid T_n] = \big(E[B_n \mid T_n] - B\big)X_n^1 + \big(E[B_n \mid T_n] - B\big)\theta - \big(E[F_n \mid T_n] - F\big).$$
Denoting
$$\beta_n = \|E[B_n \mid T_n] - B\|, \quad \delta_n = \|E[F_n \mid T_n] - F\|, \quad b_n = E\big[\|B_n - B\|^2 \mid T_n\big], \quad d_n = E\big[\|F_n - F\|^2 \mid T_n\big],$$
we obtain, as $\|X_n^1\| \le 1 + \|X_n^1\|^2$:
$$\big|\big\langle (I - a_n B)X_n^1, E[Z_n \mid T_n]\big\rangle\big| \le \|X_n^1\|\,\|E[Z_n \mid T_n]\| \le \|X_n^1\|^2\big(\beta_n(1 + \|\theta\|) + \delta_n\big) + \beta_n\|\theta\| + \delta_n,$$
$$E\big[\|Z_n\|^2 \mid T_n\big] \le 3b_n\|X_n^1\|^2 + 3b_n\|\theta\|^2 + 3d_n,$$
$$E\big[\|X_{n+1}^1\|^2 \mid T_n\big] \le \big(1 + a_n^2\lambda^2 + 2(1 + \|\theta\|)a_n\beta_n + 2a_n\delta_n + 3a_n^2 b_n\big)\|X_n^1\|^2 + 2\|\theta\|a_n\beta_n + 2a_n\delta_n + 3\|\theta\|^2 a_n^2 b_n + 3a_n^2 d_n - 2a_n\lambda\|X_n^1\|^2.$$
Applying the Robbins–Siegmund lemma under assumptions H1a, H2a and H3a implies that there exists a non-negative random variable T such that a.s.
$$\|X_n^1\|^2 \to T, \qquad \sum_{n=1}^{\infty} a_n\|X_n^1\|^2 < \infty.$$
As $\sum_{n=1}^{\infty} a_n = \infty$, T = 0 a.s. ∎
A particular case with the following assumptions is now studied.

(H1a') There exist a positive definite symmetrical matrix B and a positive real number b such that a.s.
1) for all n, $E[B_n \mid T_n] = B$
2) $\sup_n E\big[\|B_n - B\|^2 \mid T_n\big] < b$.

(H2a') There exist a matrix F and a positive real number d such that a.s.
1) for all n, $E[F_n \mid T_n] = F$
2) $\sup_n E\big[\|F_n - F\|^2 \mid T_n\big] < d$.

(H3a') Denoting λ the smallest eigenvalue of B, $a_n = \dfrac{a}{n^{\alpha}}$, $a > 0$, $\dfrac{1}{2} < \alpha < 1$, or $a_n = \dfrac{a}{n}$, $a > \dfrac{1}{2\lambda}$.

Theorem 3. Suppose H1a', H2a' and H3a' hold. Then $X_n$ converges to θ almost surely and in quadratic mean. Moreover $\lim \dfrac{1}{a_n} E\big[\|X_n - \theta\|^2\big] < \infty$.
Proof of Theorem 3. In the proof of Theorem 1, take $\beta_n = 0$, $\delta_n = 0$, $b_n < b$, $d_n < d$; then a.s.:
$$E\big[\|X_{n+1}^1\|^2 \mid T_n\big] \le \big(1 + \lambda^2 a_n^2 + 3ba_n^2\big)\|X_n^1\|^2 + 3\big(b\|\theta\|^2 + d\big)a_n^2 - 2a_n\lambda\|X_n^1\|^2.$$
Taking the mathematical expectation yields:
$$E\big[\|X_{n+1}^1\|^2\big] \le \big(1 + (\lambda^2 + 3b)a_n^2\big)E\big[\|X_n^1\|^2\big] + 3\big(b\|\theta\|^2 + d\big)a_n^2 - 2a_n\lambda E\big[\|X_n^1\|^2\big].$$
By the Robbins–Siegmund lemma:
$$\exists\, t \ge 0 : E\big[\|X_n^1\|^2\big] \to t, \qquad \sum_{n=1}^{\infty} a_n E\big[\|X_n^1\|^2\big] < \infty.$$
As $\sum_{n=1}^{\infty} a_n = \infty$, t = 0. Therefore, there exist $N \in \mathbb{N}$ and $f > 0$ such that for $n > N$:
$$E\big[\|X_{n+1}^1\|^2\big] \le (1 - 2a_n\lambda)E\big[\|X_n^1\|^2\big] + f a_n^2.$$
Applying a lemma of Schmetterer [13] for $a_n = \dfrac{a}{n^{\alpha}}$ with $\dfrac{1}{2} < \alpha < 1$ yields $\lim n^{\alpha}\, E\big[\|X_n^1\|^2\big] < \infty$. Applying a lemma of Venter [14] for $a_n = \dfrac{a}{n}$ with $a > \dfrac{1}{2\lambda}$ yields $\lim n\, E\big[\|X_n^1\|^2\big] < \infty$. ∎
2.2 Application to linear regression with online standardized data

Let $(R_1, S_1), \ldots, (R_n, S_n), \ldots$ be an i.i.d. sample of a random vector (R, S) in $\mathbb{R}^p \times \mathbb{R}^q$. Let Γ (respectively $\Gamma^1$) be the diagonal matrix of order p (respectively q) of the inverses of the standard deviations of the components of R (respectively S). Define the correlation matrices
$$B = \Gamma\, E\big[(R - E[R])(R - E[R])'\big]\,\Gamma, \qquad F = \Gamma\, E\big[(R - E[R])(S - E[S])'\big]\,\Gamma^1.$$
Suppose that $B^{-1}$ exists. Let $\theta = B^{-1}F$.

Denote $\bar{R}_n$ (respectively $\bar{S}_n$) the mean of the n-sample $(R_1, R_2, \ldots, R_n)$ of R (respectively $(S_1, S_2, \ldots, S_n)$ of S). Denote $(V_n^j)^2$ the variance of the n-sample $(R_1^j, R_2^j, \ldots, R_n^j)$ of the jth component $R^j$ of R, and $(V_n^{1k})^2$ the variance of the n-sample $(S_1^k, S_2^k, \ldots, S_n^k)$ of the kth component $S^k$ of S. Denote $\Gamma_n$ (respectively $\Gamma_n^1$) the diagonal matrix of order p (respectively q) whose element (j, j) (respectively (k, k)) is the inverse of $\sqrt{\tfrac{n}{n-1}}\, V_n^j$ (respectively $\sqrt{\tfrac{n}{n-1}}\, V_n^{1k}$).

Let $(m_n, n \ge 1)$ be a sequence of integers. Denote $M_n = \sum_{k=1}^{n} m_k$ for $n \ge 1$, $M_0 = 0$ and $I_n = \{M_{n-1}+1, \ldots, M_n\}$. Define
$$B_n = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R_j - \bar{R}_{M_{n-1}}\big)\big(R_j - \bar{R}_{M_{n-1}}\big)'\right)\Gamma_{M_{n-1}},$$
$$F_n = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R_j - \bar{R}_{M_{n-1}}\big)\big(S_j - \bar{S}_{M_{n-1}}\big)'\right)\Gamma^1_{M_{n-1}}.$$
Define recursively the process $(X_n, n \ge 1)$ in $\mathbb{R}^{p\times q}$ by
$$X_{n+1} = X_n - a_n (B_n X_n - F_n).$$
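To make the recursion concrete, here is a minimal Python sketch of this process for q = 1 (one dependent variable). It is only an illustration under our own naming and parameter choices (the function name `process_s1`, the batch interface and the schedule constants are ours, not part of the article); as in the definition above, a new batch is standardized with the means and standard deviations estimated from all previously seen observations.

```python
import numpy as np

def process_s1(stream, p, a=1.0, alpha=2/3):
    """Sketch of X_{n+1} = X_n - a_n (B_n X_n - F_n) with online standardized data
    (q = 1). `stream` yields batches of (r, s) pairs; a_n = a / n**alpha."""
    X = np.zeros(p)                       # estimate of theta_c (standardized regression vector)
    count, mean_r, mean_s = 0, np.zeros(p), 0.0
    ssq_r, ssq_s = np.zeros(p), 0.0       # running sums of squared deviations
    for n, batch in enumerate(stream, start=1):
        R = np.array([r for r, _ in batch], dtype=float)   # shape (m_n, p)
        S = np.array([s for _, s in batch], dtype=float)   # shape (m_n,)
        if count > 1:
            std_r = np.sqrt(ssq_r / (count - 1))           # sample std of the PREVIOUS data
            std_s = np.sqrt(ssq_s / (count - 1))
            Rc = (R - mean_r) / std_r                      # online standardized regressors
            Sc = (S - mean_s) / std_s
            Bn = Rc.T @ Rc / len(batch)
            Fn = Rc.T @ Sc / len(batch)
            X = X - (a / n ** alpha) * (Bn @ X - Fn)       # variable step-size update
        for r, s in batch:                                 # Welford updates of means/variances
            count += 1
            d_r, d_s = r - mean_r, s - mean_s
            mean_r += d_r / count
            mean_s += d_s / count
            ssq_r += d_r * (r - mean_r)
            ssq_s += d_s * (s - mean_s)
    # back to the raw scale: theta = Gamma theta_c (Gamma^1)^{-1}, eta = E[S] - theta' E[R]
    theta = X / np.sqrt(ssq_r / (count - 1)) * np.sqrt(ssq_s / (count - 1))
    eta = mean_s - theta @ mean_r
    return theta, eta
```

In the experiments of section 5, a preliminary phase (1000 observations) is used to initialize the means and standard deviations before the updates start; the guard `count > 1` above plays that role only in a rudimentary way.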
Corollary 4. Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose moreover that assumption H3a'' holds:
(H3a'') $a_n > 0$, $\sum_{n=1}^{\infty} \dfrac{a_n}{\sqrt{n}} < \infty$, $\sum_{n=1}^{\infty} a_n^2 < \infty$.
Then $X_n$ converges to θ a.s.

This process was tested on several datasets and some results are given in section 5 (process S11 for $m_n = 1$ and S12 for $m_n = 10$). The following lemma is first proved.
Lemma 5. Suppose the moments of order 4 of R exist and $a_n > 0$, $\sum_{n=1}^{\infty} \dfrac{a_n}{\sqrt{n}} < \infty$. Then $\sum_{n=1}^{\infty} a_n \|\bar{R}_{M_{n-1}} - E[R]\| < \infty$ and $\sum_{n=1}^{\infty} a_n \|\Gamma_{M_{n-1}} - \Gamma\| < \infty$ a.s.

Proof of Lemma 5. The usual Euclidean norm for vectors and the spectral norm for matrices are used in the proof.

Step 1. Denote $\mathrm{Var}[R] = E\big[\|R - E[R]\|^2\big] = \sum_{j=1}^{p}\mathrm{Var}[R^j]$.
$$E\big[\|\bar{R}_{M_{n-1}} - E[R]\|^2\big] = \sum_{j=1}^{p}\mathrm{Var}\big[\bar{R}^j_{M_{n-1}}\big] = \sum_{j=1}^{p}\frac{\mathrm{Var}[R^j]}{M_{n-1}} = \frac{\mathrm{Var}[R]}{M_{n-1}}.$$
Then:
$$\sum_{n=1}^{\infty} a_n\, E\big[\|\bar{R}_{M_{n-1}} - E[R]\|\big] \le \sqrt{\mathrm{Var}[R]}\;\sum_{n=1}^{\infty}\frac{a_n}{\sqrt{n-1}} < \infty \quad \text{by H3a''}.$$
It follows that $\sum_{n=1}^{\infty} a_n\|\bar{R}_{M_{n-1}} - E[R]\| < \infty$ a.s. Likewise $\sum_{n=1}^{\infty} a_n\|\bar{S}_{M_{n-1}} - E[S]\| < \infty$ a.s.

Step 2.
$$\|\Gamma_{M_{n-1}} - \Gamma\| = \max_{j=1,\ldots,p}\left|\frac{1}{\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}}} - \frac{1}{\sqrt{\mathrm{Var}[R^j]}}\right| \le \sum_{j=1}^{p}\left|\frac{1}{\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}}} - \frac{1}{\sqrt{\mathrm{Var}[R^j]}}\right| = \sum_{j=1}^{p}\frac{\Big|\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2 - \mathrm{Var}[R^j]\Big|}{\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}}\,\sqrt{\mathrm{Var}[R^j]}\left(\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}} + \sqrt{\mathrm{Var}[R^j]}\right)}.$$
Denote $\mu^j_4$ the centered moment of order 4 of $R^j$. We have:
$$E\left[\left|\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2 - \mathrm{Var}[R^j]\right|\right] \le \sqrt{\mathrm{Var}\!\left[\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2\right]} = O\!\left(\sqrt{\frac{\mu^j_4 - (\mathrm{Var}[R^j])^2}{M_{n-1}}}\right).$$
Then by H3a'', as $M_{n-1} \ge n-1$:
$$\sum_{n=1}^{\infty} a_n \sum_{j=1}^{p} E\left[\left|\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2 - \mathrm{Var}[R^j]\right|\right] < \infty \;\Rightarrow\; \sum_{n=1}^{\infty} a_n \sum_{j=1}^{p}\left|\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2 - \mathrm{Var}[R^j]\right| < \infty \ \text{a.s.}$$
As $\big(V^j_{M_{n-1}}\big)^2 \to \mathrm{Var}[R^j]$ a.s., $j = 1, \ldots, p$, this implies:
$$\sum_{n=1}^{\infty} a_n\,\|\Gamma_{M_{n-1}} - \Gamma\| < \infty \ \text{a.s.} \ ∎$$
Proof of Corollary 4.
Step 1: prove that assumption H1a1 of Theorem 1 is verified. Denote $R^c = R - E[R]$, $R^c_j = R_j - E[R]$, $\bar{R}^c_{M_{n-1}} = \bar{R}_{M_{n-1}} - E[R]$.
$$B_n = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)'\right)\Gamma_{M_{n-1}} = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\Big(R^c_j R^{c\prime}_j - \bar{R}^c_{M_{n-1}}R^{c\prime}_j - R^c_j\bar{R}^{c\prime}_{M_{n-1}} + \bar{R}^c_{M_{n-1}}\big(\bar{R}^c_{M_{n-1}}\big)'\Big)\right)\Gamma_{M_{n-1}},$$
$$B = \Gamma\, E\big[R^c R^{c\prime}\big]\,\Gamma.$$
As $\Gamma_{M_{n-1}}$ and $\bar{R}_{M_{n-1}}$ are $T_n$-measurable and $R^c_j$, $j \in I_n$, is independent of $T_n$, with $E[R^c_j] = 0$:
$$E[B_n \mid T_n] - B = \Gamma_{M_{n-1}}\Big(E\big[R^c R^{c\prime}\big] + \bar{R}^c_{M_{n-1}}\big(\bar{R}^c_{M_{n-1}}\big)'\Big)\Gamma_{M_{n-1}} - \Gamma E\big[R^c R^{c\prime}\big]\Gamma = \big(\Gamma_{M_{n-1}} - \Gamma\big)E\big[R^c R^{c\prime}\big]\Gamma_{M_{n-1}} + \Gamma E\big[R^c R^{c\prime}\big]\big(\Gamma_{M_{n-1}} - \Gamma\big) + \Gamma_{M_{n-1}}\bar{R}^c_{M_{n-1}}\big(\bar{R}^c_{M_{n-1}}\big)'\Gamma_{M_{n-1}} \quad a.s.$$
As $\Gamma_{M_{n-1}}$ and $\bar{R}^c_{M_{n-1}}$ converge respectively to Γ and 0 a.s. and, by Lemma 5, $\sum_{n=1}^{\infty} a_n\|\Gamma_{M_{n-1}} - \Gamma\| < \infty$ and $\sum_{n=1}^{\infty} a_n\|\bar{R}^c_{M_{n-1}}\| < \infty$ a.s., it follows that
$$\sum_{n=1}^{\infty} a_n\,\big\|E[B_n \mid T_n] - B\big\| < \infty \ \text{a.s.}$$

Step 2: prove that assumption H1a2 of Theorem 1 is verified.
$$\|B_n - B\|^2 \le 2\left\|\Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)'\right)\Gamma_{M_{n-1}}\right\|^2 + 2\big\|\Gamma E[R^c R^{c\prime}]\Gamma\big\|^2 \le 2\|\Gamma_{M_{n-1}}\|^4\,\frac{1}{m_n}\sum_{j\in I_n}\big\|R^c_j - \bar{R}^c_{M_{n-1}}\big\|^4 + 2\big\|\Gamma E[R^c R^{c\prime}]\Gamma\big\|^2 \le 2\|\Gamma_{M_{n-1}}\|^4\,\frac{1}{m_n}\sum_{j\in I_n} 2^3\Big(\|R^c_j\|^4 + \big\|\bar{R}^c_{M_{n-1}}\big\|^4\Big) + 2\big\|\Gamma E[R^c R^{c\prime}]\Gamma\big\|^2,$$
$$E\big[\|B_n - B\|^2 \mid T_n\big] \le 2^4\,\|\Gamma_{M_{n-1}}\|^4\Big(E\big[\|R^c\|^4\big] + \big\|\bar{R}^c_{M_{n-1}}\big\|^4\Big) + 2\big\|\Gamma E[R^c R^{c\prime}]\Gamma\big\|^2 \quad a.s.$$
As $\Gamma_{M_{n-1}}$ and $\bar{R}^c_{M_{n-1}}$ converge respectively to Γ and 0 a.s., and $\sum_{n=1}^{\infty} a_n^2 < \infty$, it follows that
$$\sum_{n=1}^{\infty} a_n^2\, E\big[\|B_n - B\|^2 \mid T_n\big] < \infty \ \text{a.s.}$$

Step 3: the proofs of the verification of assumptions H2a1 and H2a2 of Theorem 1 are similar to the previous ones, $B_n$ and B being respectively replaced by
$$F_n = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)\big(S^c_j - \bar{S}^c_{M_{n-1}}\big)'\right)\Gamma^1_{M_{n-1}}, \qquad F = \Gamma\, E\big[R^c S^{c\prime}\big]\,\Gamma^1. \ ∎$$
3 Convergence of an averaged process with a constant step-size

In this section, the process $(X_n, n \ge 1)$ with a constant step-size a and the averaged process $(Y_n, n \ge 1)$ in $\mathbb{R}^{p\times q}$ are recursively defined by
$$X_{n+1} = X_n - a\,(B_n X_n - F_n),$$
$$Y_{n+1} = \frac{1}{n+1}\sum_{j=1}^{n+1} X_j = Y_n - \frac{1}{n+1}\,Y_n + \frac{1}{n+1}\,X_{n+1}.$$
The a.s. convergence of (Yn, n 1) and its application to sequential linear regression are studied.
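As an illustration, one step of this pair of recursions could be sketched as follows in Python (the function name and arguments are ours; $B_n$ and $F_n$ would be supplied by the online standardized estimates defined in section 3.3). The running average is updated recursively, so the past iterates never need to be stored.

```python
import numpy as np

def averaged_step(X, Y, n, Bn, Fn, a):
    """One step of the constant-step process and of its running average:
    X_{n+1} = X_n - a (B_n X_n - F_n),  Y_{n+1} = Y_n + (X_{n+1} - Y_n) / (n + 1)."""
    X_next = X - a * (Bn @ X - Fn)
    Y_next = Y + (X_next - Y) / (n + 1)   # recursive form of the average of X_1, ..., X_{n+1}
    return X_next, Y_next
```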
3.1 Lemma

Lemma 6. Let three real sequences $(u_n)$, $(v_n)$ and $(a_n)$, with $u_n > 0$ and $a_n > 0$ for all n, and a real positive number λ be such that, for $n \ge 1$,
$$u_{n+1} \le (1 - a_n\lambda)u_n + a_n v_n.$$
Suppose:
1) $v_n \to 0$;
2) $a_n = a < \dfrac{1}{\lambda}$, or $\big(a_n \to 0,\ \sum_{n=1}^{\infty} a_n = \infty\big)$.
Under assumptions 1 and 2, $u_n \to 0$.

Proof of Lemma 6. In the case $a_n$ depending on n, as $a_n \to 0$, we can suppose without loss of generality that $1 - a_n\lambda > 0$ for $n \ge 1$. We have:
$$u_{n+1} \le \prod_{i=1}^{n}(1 - a_i\lambda)\,u_1 + \sum_{i=1}^{n} a_i\,\prod_{l=i+1}^{n}(1 - a_l\lambda)\,v_i, \qquad \text{with } \prod_{l=n+1}^{n}(\cdot) = 1.$$
Now, for $n_1 \le n_2 \le n$ and $0 < c_i < 1$ with $c_i = a_i\lambda$ for all i, we have:
$$\sum_{i=n_1}^{n_2} c_i\prod_{l=i+1}^{n}(1 - c_l) = \sum_{i=n_1}^{n_2}\big(1 - (1 - c_i)\big)\prod_{l=i+1}^{n}(1 - c_l) = \sum_{i=n_1}^{n_2}\left(\prod_{l=i+1}^{n}(1 - c_l) - \prod_{l=i}^{n}(1 - c_l)\right) = \prod_{l=n_2+1}^{n}(1 - c_l) - \prod_{l=n_1}^{n}(1 - c_l) \le \prod_{l=n_2+1}^{n}(1 - c_l) \le 1.$$
Let $\epsilon > 0$. There exists N such that for $i > N$, $|v_i| < \dfrac{\epsilon\lambda}{3}$. Then for $n \ge N$, applying the previous inequality with $c_i = a_i\lambda$, $n_1 = 1$, $n_2 = N$, yields:
$$u_{n+1} \le \prod_{i=1}^{n}(1 - a_i\lambda)\,u_1 + \frac{1}{\lambda}\max_{1\le i\le N}|v_i|\sum_{i=1}^{N} a_i\lambda\prod_{l=i+1}^{n}(1 - a_l\lambda) + \sum_{i=N+1}^{n} a_i\lambda\prod_{l=i+1}^{n}(1 - a_l\lambda)\,\frac{|v_i|}{\lambda} \le \prod_{i=1}^{n}(1 - a_i\lambda)\,u_1 + \frac{1}{\lambda}\max_{1\le i\le N}|v_i|\prod_{l=N+1}^{n}(1 - a_l\lambda) + \frac{\epsilon}{3}.$$
In the case $a_n$ depending on n, $\ln(1 - a_i\lambda) \sim -a_i\lambda$ as $a_i \to 0$ $(i \to \infty)$; then, as $\sum_{n=1}^{\infty} a_n = \infty$, $\prod_{l=N+1}^{n}(1 - a_l\lambda) \to 0$ $(n \to \infty)$. In the case $a_n = a$, $\prod_{l=N+1}^{n}(1 - a\lambda) = (1 - a\lambda)^{n-N} \to 0$ $(n \to \infty)$ as $0 < 1 - a\lambda < 1$. Thus there exists $N_1$ such that $u_{n+1} < \epsilon$ for $n > N_1$. ∎
3.2 Theorem

Make the following assumptions:

(H1b) There exist a positive definite symmetrical matrix B in $\mathbb{R}^{p\times p}$ and a positive real number b such that a.s.
1) $\lim_{n\to\infty}\big(E[B_n \mid T_n] - B\big) = 0$
2) $\sum_{n=1}^{\infty}\dfrac{1}{n}\Big(E\big[\|E[B_n \mid T_n] - B\|^2\big]\Big)^{\frac{1}{2}} < \infty$
3) $\sup_n E\big[\|B_n - B\|^2 \mid T_n\big] \le b$.

(H2b) There exist a matrix F in $\mathbb{R}^{p\times q}$ and a positive real number d such that a.s.
1) $\lim_{n\to\infty}\big(E[F_n \mid T_n] - F\big) = 0$
2) $\sup_n E\big[\|F_n - F\|^2 \mid T_n\big] \le d$.

(H3b) λ and $\lambda_{max}$ being respectively the smallest and the largest eigenvalue of B,
$$0 < a < \min\left(\frac{1}{\lambda_{max}},\ \frac{2\lambda}{\lambda^2 + b}\right).$$

Theorem 7. Suppose H1b, H2b and H3b hold. Then $Y_n$ converges to $\theta = B^{-1}F$ a.s.

Remark 1. Györfi and Walk [5] proved that $Y_n$ converges to θ a.s. and in quadratic mean under the assumptions $E[B_n \mid T_n] = B$, $E[F_n \mid T_n] = F$, H1b2 and H2b2. Theorem 7 is an extension of their a.s. convergence result when $E[B_n \mid T_n] \to B$ and $E[F_n \mid T_n] \to F$ a.s.

Remark 2. Define $R^1 = (R'\ 1)'$, $B = E\big[R^1 R^{1\prime}\big]$, $F = E\big[R^1 S'\big]$. If $((R^1_n, S_n), n \ge 1)$ is an i.i.d. sample of $(R^1, S)$ whose moments of order 4 exist, assumptions H1b and H2b are verified for $B_n = R^1_n R^{1\prime}_n$ and $F_n = R^1_n S_n'$, as $E\big[R^1_n R^{1\prime}_n \mid T_n\big] = E\big[R^1 R^{1\prime}\big] = B$ and $E\big[R^1_n S_n' \mid T_n\big] = F$.

Proof of Theorem 7. Denote
$$Z_n = (B_n - B)(X_n - \theta) + (B_n - B)\theta - (F_n - F), \qquad X_n^1 = X_n - \theta, \qquad Y_n^1 = Y_n - \theta = \frac{1}{n}\sum_{j=1}^{n}X_j^1.$$

Step 1: give a sufficient condition to have $Y_n^1 \to 0$ a.s. We have (cf. proof of Theorem 1):
$$X_{n+1}^1 = (I - aB)X_n^1 - aZ_n,$$
$$Y_{n+1}^1 = \frac{1}{n+1}X_1^1 + \frac{1}{n+1}\sum_{j=2}^{n+1}X_j^1 = \frac{1}{n+1}X_1^1 + \frac{1}{n+1}\sum_{j=2}^{n+1}\big((I - aB)X_{j-1}^1 - aZ_{j-1}\big) = \frac{1}{n+1}X_1^1 + (I - aB)\frac{n}{n+1}Y_n^1 - a\,\frac{1}{n+1}\sum_{j=1}^{n}Z_j.$$
Take now the Frobenius norm of $Y_{n+1}^1$:
$$\|Y_{n+1}^1\| \le \|(I - aB)Y_n^1\| + a\left\|\frac{1}{n+1}\sum_{j=1}^{n}Z_j + \frac{1}{n+1}\frac{1}{a}X_1^1\right\|.$$
Under H3b, all the eigenvalues of $I - aB$ are positive and the spectral norm of $I - aB$ is equal to $1 - a\lambda$. Then:
$$\|Y_{n+1}^1\| \le (1 - a\lambda)\|Y_n^1\| + a\left\|\frac{1}{n+1}\sum_{j=1}^{n}Z_j + \frac{1}{n+1}\frac{1}{a}X_1^1\right\|.$$
By Lemma 6, it suffices to prove $\frac{1}{n}\sum_{j=1}^{n}Z_j \to 0$ a.s. to conclude $Y_n^1 \to 0$ a.s.

Step 2: prove that assumptions H1b and H2b imply respectively $\frac{1}{n}\sum_{j=1}^{n}B_j \to B$ and $\frac{1}{n}\sum_{j=1}^{n}F_j \to F$ a.s.
The proof is only given for $(B_n)$, the other one being similar. Assumption H1b3 implies $\sup_n E\big[\|B_n - B\|^2\big] < \infty$. It follows that, for each element $B_n^{kl}$ and $B^{kl}$ of $B_n$ and B respectively, $\sum_{n=1}^{\infty}\dfrac{\mathrm{Var}\big[B_n^{kl} - B^{kl}\big]}{n^2} < \infty$. Therefore:
$$\frac{1}{n}\sum_{j=1}^{n}\Big(B_j^{kl} - B^{kl} - E\big[B_j^{kl} - B^{kl} \mid T_j\big]\Big) \to 0 \ \text{a.s.}$$
As $E\big[B_j^{kl} - B^{kl} \mid T_j\big] \to 0$ a.s. by H1b1, we have for each (k, l)
$$\frac{1}{n}\sum_{j=1}^{n}B_j^{kl} - B^{kl} \to 0 \ \text{a.s.}$$
Then $\frac{1}{n}\sum_{j=1}^{n}(B_j - B) \to 0$ a.s.

Step 3: prove now that $\frac{1}{n}\sum_{j=1}^{n}(B_j - B)X_j^1 \to 0$ a.s.
Denote $\beta_n = \|E[B_n \mid T_n] - B\|$ and $\gamma_n = \|E[F_n \mid T_n] - F\|$; $\beta_n \to 0$ and $\gamma_n \to 0$ a.s. under H1b1 and H2b1. Then: $\forall\delta > 0$, $\forall\varepsilon > 0$, $\exists N(\delta, \varepsilon)$ such that $\forall n \ge N(\delta, \varepsilon)$,
$$P\Big(\big\{\sup_{j>n}\beta_j \le \delta\big\} \cap \big\{\sup_{j>n}\gamma_j \le \delta\big\}\Big) > 1 - \varepsilon.$$
On this event it follows that $\frac{1}{n}\sum_{j=1}^{n}(B_j - B)X_j^1 \to 0$, hence with probability greater than $1 - \varepsilon$. This is true for every $\varepsilon > 0$. Thus:
$$\frac{1}{n}\sum_{j=1}^{n}(B_j - B)X_j^1 \to 0 \ \text{a.s.}$$
Therefore, by step 2 and step 1, we conclude that $\frac{1}{n}\sum_{j=1}^{n}Z_j \to 0$ and $Y_n^1 \to 0$ a.s. ∎
3.3 Application to linear regression with online standardized data

Define as in section 2:
$$B_n = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R_j - \bar{R}_{M_{n-1}}\big)\big(R_j - \bar{R}_{M_{n-1}}\big)'\right)\Gamma_{M_{n-1}}, \qquad F_n = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R_j - \bar{R}_{M_{n-1}}\big)\big(S_j - \bar{S}_{M_{n-1}}\big)'\right)\Gamma^1_{M_{n-1}}.$$
Denote $U = (R - E[R])(R - E[R])'$, $B = \Gamma E[U]\Gamma$ the correlation matrix of R, λ and $\lambda_{max}$ respectively the smallest and the largest eigenvalue of B, $b_1 = E\big[\|\Gamma U\Gamma - B\|^2\big]$, $F = \Gamma E\big[(R - E[R])(S - E[S])'\big]\Gamma^1$.

Corollary 8. Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose H3b1 holds:
(H3b1) $0 < a < \min\left(\dfrac{1}{\lambda_{max}},\ \dfrac{2\lambda}{\lambda^2 + b_1}\right)$.
Then $Y_n$ converges to $\theta = B^{-1}F$ a.s.

This process was tested on several datasets and some results are given in section 5 (process S21 for $m_n = 1$ and S22 for $m_n = 10$).

Proof of Corollary 8.
Step 1: introduction. Using the decomposition of $E[B_n \mid T_n] - B$ established in the proof of Corollary 4, as $\bar{R}_{M_{n-1}} \to E[R]$ and $\Gamma_{M_{n-1}} \to \Gamma$ a.s., it is obvious that $E[B_n \mid T_n] - B \to 0$ a.s. Likewise $E[F_n \mid T_n] - F \to 0$ a.s. Thus assumptions H1b1 and H2b1 are verified.
Suppose that $Y_n$ does not converge to θ almost surely. Then there exists a set of probability $\varepsilon_1 > 0$ in which $Y_n$ does not converge to θ.
Denote $\sigma^j = \sqrt{\mathrm{Var}[R^j]}$, $j = 1, \ldots, p$. As $\bar{R}_{M_{n-1}} - E[R] \to 0$, $\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}} - \sigma^j \to 0$, $j = 1, \ldots, p$, and $\Gamma_{M_{n-1}} - \Gamma \to 0$ almost surely, there exists a set G of probability greater than $1 - \tfrac{\varepsilon_1}{2}$ in which these sequences of random variables converge uniformly to 0.

Step 2: prove that $\sum_{n=1}^{\infty}\dfrac{1}{n}\big(E\big[\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G\big]\big)^{\frac{1}{2}} < \infty$.
By step 2 of the proof of Lemma 5, we have for $n > N$:
$$\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G \le \sum_{j=1}^{p}\frac{\Big|\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2 - (\sigma^j)^2\Big|}{\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}}\,\sigma^j\left(\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}} + \sigma^j\right)}\,I_G.$$
As in G, $\sqrt{\tfrac{M_{n-1}}{M_{n-1}-1}}\,V^j_{M_{n-1}}$ converges uniformly to $\sigma^j$ for $j = 1, \ldots, p$, there exists $c > 0$ such that
$$\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G \le c\sum_{j=1}^{p}\left|\tfrac{M_{n-1}}{M_{n-1}-1}\big(V^j_{M_{n-1}}\big)^2 - (\sigma^j)^2\right|.$$
Then there exists $d > 0$ such that
$$E\big[\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G\big] \le \frac{d}{\sqrt{M_{n-1}}} \le \frac{d}{\sqrt{n-1}}.$$
Therefore $\sum_{n=1}^{\infty}\dfrac{1}{n}\big(E\big[\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G\big]\big)^{\frac{1}{2}} < \infty$.

Step 3: prove that assumption H1b2 is verified in G.
Using the decomposition of $E[B_n \mid T_n] - B$ given in step 1 of the proof of Corollary 4, with $R^c = R - E[R]$ and $\bar{R}^c_{M_{n-1}} = \bar{R}_{M_{n-1}} - E[R]$, yields a.s.:
$$\big(E[B_n \mid T_n] - B\big)I_G = \Big(\big(\Gamma_{M_{n-1}} - \Gamma\big)E\big[R^c R^{c\prime}\big]\Gamma_{M_{n-1}} + \Gamma E\big[R^c R^{c\prime}\big]\big(\Gamma_{M_{n-1}} - \Gamma\big) + \Gamma_{M_{n-1}}\bar{R}^c_{M_{n-1}}\big(\bar{R}^c_{M_{n-1}}\big)'\Gamma_{M_{n-1}}\Big)I_G.$$
As in G, $\Gamma_{M_{n-1}} - \Gamma$ and $\bar{R}^c_{M_{n-1}}$ converge uniformly to 0, $E[B_n \mid T_n] - B$ converges uniformly to 0. Moreover there exists $c_1 > 0$ such that
$$\big\|E[B_n \mid T_n] - B\big\|\,I_G \le c_1\Big(\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G + \big\|\bar{R}^c_{M_{n-1}}\big\|\Big) \quad a.s.$$
By the proof of Lemma 5, $E\big[\|\bar{R}^c_{M_{n-1}}\|^2\big] \le \dfrac{\mathrm{Var}[R]}{n-1}$; then $\sum_{n=1}^{\infty}\dfrac{1}{n}\big(E\big[\|\bar{R}^c_{M_{n-1}}\|\big]\big)^{\frac{1}{2}} < \infty$. By step 2, $\sum_{n=1}^{\infty}\dfrac{1}{n}\big(E\big[\|\Gamma_{M_{n-1}} - \Gamma\|\,I_G\big]\big)^{\frac{1}{2}} < \infty$. Then:
$$\sum_{n=1}^{\infty}\frac{1}{n}\Big(E\big[\|E[B_n \mid T_n] - B\|\,I_G\big]\Big)^{\frac{1}{2}} < \infty.$$
As $E[B_n \mid T_n] - B$ converges uniformly to 0 on G, we obtain:
$$\sum_{n=1}^{\infty}\frac{1}{n}\Big(E\big[\|E[B_n \mid T_n] - B\|^2\,I_G\big]\Big)^{\frac{1}{2}} < \infty.$$
Thus assumption H1b2 is verified in G.

Step 4: prove that assumption H1b3 is verified in G. Denote $R^c = R - E[R]$, $R^c_j = R_j - E[R]$, $\bar{R}^c_{M_{n-1}} = \bar{R}_{M_{n-1}} - E[R]$. Consider the decomposition:
$$B_n - B = \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)\big(R^c_j - \bar{R}^c_{M_{n-1}}\big)'\right)\Gamma_{M_{n-1}} - \Gamma E\big[R^c R^{c\prime}\big]\Gamma = \alpha_n + \beta_n$$
with
$$\alpha_n = \big(\Gamma_{M_{n-1}} - \Gamma\big)\left(\frac{1}{m_n}\sum_{j\in I_n}R^c_j R^{c\prime}_j\right)\Gamma_{M_{n-1}} + \Gamma\left(\frac{1}{m_n}\sum_{j\in I_n}R^c_j R^{c\prime}_j\right)\big(\Gamma_{M_{n-1}} - \Gamma\big) - \Gamma_{M_{n-1}}\bar{R}^c_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}R^{c\prime}_j\right)\Gamma_{M_{n-1}} - \Gamma_{M_{n-1}}\left(\frac{1}{m_n}\sum_{j\in I_n}R^c_j\right)\big(\bar{R}^c_{M_{n-1}}\big)'\Gamma_{M_{n-1}} + \Gamma_{M_{n-1}}\bar{R}^c_{M_{n-1}}\big(\bar{R}^c_{M_{n-1}}\big)'\Gamma_{M_{n-1}},$$
$$\beta_n = \Gamma\left(\frac{1}{m_n}\sum_{j\in I_n}R^c_j R^{c\prime}_j - E\big[R^c R^{c\prime}\big]\right)\Gamma.$$
Let $\eta > 0$.
$$E\big[\|B_n - B\|^2 I_G \mid T_n\big] = E\big[\|\alpha_n + \beta_n\|^2 I_G \mid T_n\big] \le \left(1 + \frac{1}{\eta}\right)E\big[\|\alpha_n\|^2 I_G \mid T_n\big] + (1 + \eta)\,E\big[\|\beta_n\|^2 I_G \mid T_n\big] \quad a.s.$$
As the random variables $R^c_j$, $j \in I_n$, are independent of $T_n$, and as $\Gamma_{M_{n-1}}$ and $\bar{R}^c_{M_{n-1}}$ are $T_n$-measurable and converge uniformly respectively to Γ and 0 on G, $E\big[\|\alpha_n\|^2 I_G \mid T_n\big]$ converges uniformly to 0. Then, for $\delta > 0$, there exists $N_1$ such that for $n > N_1$, $E\big[\|\alpha_n\|^2 I_G \mid T_n\big] \le \delta$ a.s. Moreover, denoting $U = R^c R^{c\prime}$ and $U_j = R^c_j R^{c\prime}_j$, we have, as the random variables $U_j$ form an i.i.d. sample of U:
$$E\big[\|\beta_n\|^2 \mid T_n\big] = E\left[\left\|\Gamma\left(\frac{1}{m_n}\sum_{j\in I_n}U_j - E[U]\right)\Gamma\right\|^2 \Bigg|\ T_n\right] \le E\big[\|\Gamma(U - E[U])\Gamma\|^2\big] = E\big[\|\Gamma U\Gamma - E[\Gamma U\Gamma]\|^2\big] = b_1 \quad a.s.$$
Then:
$$E\big[\|B_n - B\|^2 I_G \mid T_n\big] \le \left(1 + \frac{1}{\eta}\right)\delta + (1 + \eta)b_1 = b \quad a.s.$$
Thus assumption H1b3 is verified in G. As $\bar{S}_{M_{n-1}} - E[S] \to 0$ and $\Gamma^1_{M_{n-1}} - \Gamma^1 \to 0$ almost surely, it can be proved likewise that there exist a set H of probability greater than $1 - \tfrac{\varepsilon_1}{2}$ and $d > 0$ such that $E\big[\|F_n - F\|^2 I_H \mid T_n\big] \le d$ a.s. Thus assumption H2b2 is verified in H.

Step 5: conclusion. As $a < \min\left(\dfrac{1}{\lambda_{max}},\ \dfrac{2\lambda}{\lambda^2 + b_1}\right)$, $b_1 < \dfrac{2\lambda}{a} - \lambda^2$. Choose $\eta > 0$ such that $(1 + \eta)b_1 < \dfrac{2\lambda}{a} - \lambda^2$ and $0 < \delta < \dfrac{\frac{2\lambda}{a} - \lambda^2 - (1 + \eta)b_1}{1 + \frac{1}{\eta}}$, so that
$$b = \left(1 + \frac{1}{\eta}\right)\delta + (1 + \eta)b_1 < \frac{2\lambda}{a} - \lambda^2 \iff a < \frac{2\lambda}{\lambda^2 + b}.$$
Applying Theorem 7 on $G \cap H$, whose probability is greater than $1 - \varepsilon_1$, then yields that $Y_n$ converges to θ with probability greater than $1 - \varepsilon_1$. This is in contradiction with $P(Y_n \nrightarrow \theta) = \varepsilon_1$. Thus $Y_n$ converges to θ a.s. ∎
4 Convergence of a process with a variable or constant step-size and use of all observations until the current step

In this section, the convergence of the process $(X_n, n \ge 1)$ in $\mathbb{R}^{p\times q}$ recursively defined by
$$X_{n+1} = X_n - a_n (B_n X_n - F_n)$$
and its application to sequential linear regression are studied.
4.1 Theorem

Make the following assumptions:
(H1c) There exists a positive definite symmetrical matrix B such that $B_n \to B$ a.s.
(H2c) There exists a matrix F such that $F_n \to F$ a.s.
(H3c) $\lambda_{max}$ denoting the largest eigenvalue of B, $a_n = a < \dfrac{1}{\lambda_{max}}$, or $\big(a_n \to 0,\ \sum_{n=1}^{\infty} a_n = \infty\big)$.

Theorem 9. Suppose H1c, H2c and H3c hold. Then $X_n$ converges to $B^{-1}F$ a.s.

Proof of Theorem 9. Denote $\theta = B^{-1}F$, $X_n^1 = X_n - \theta$, $Z_n = (B_n - B)\theta - (F_n - F)$. Then:
$$X_{n+1}^1 = (I - a_n B_n)X_n^1 - a_n Z_n.$$
Let ω be fixed, belonging to the intersection of the convergence sets $\{B_n \to B\}$ and $\{F_n \to F\}$. The writing of ω is omitted in the following. Denote $\|A\|$ the spectral norm of a matrix A and λ the smallest eigenvalue of B. In the case $a_n$ depending on n, as $a_n \to 0$, we can suppose without loss of generality $a_n < \dfrac{1}{\lambda_{max}}$ for all n. Then all the eigenvalues of $I - a_n B$ are positive and $\|I - a_n B\| = 1 - a_n\lambda$. Let $0 < \epsilon < \lambda$. As $B_n - B \to 0$, we obtain for n sufficiently large:
$$\|I - a_n B_n\| \le \|I - a_n B\| + a_n\|B_n - B\| \le 1 - a_n\lambda + a_n\epsilon = 1 - a_n(\lambda - \epsilon), \quad \text{with } a_n(\lambda - \epsilon) < 1,$$
$$\|X_{n+1}^1\| \le \big(1 - a_n(\lambda - \epsilon)\big)\|X_n^1\| + a_n\|Z_n\|.$$
As $Z_n \to 0$, applying Lemma 6 yields $\|X_n^1\| \to 0$. Therefore $X_n \to B^{-1}F$ a.s. ∎
4.2 Application to linear regression with online standardized data

Let $(m_n, n \ge 1)$ be a sequence of integers. Denote $M_n = \sum_{k=1}^{n} m_k$ for $n \ge 1$, $M_0 = 0$ and $I_n = \{M_{n-1}+1, \ldots, M_n\}$. Define
$$B_n = \Gamma_{M_n}\left(\frac{1}{M_n}\sum_{i=1}^{n}\sum_{j\in I_i} R_j R_j' - \bar{R}_{M_n}\bar{R}_{M_n}'\right)\Gamma_{M_n}, \qquad F_n = \Gamma_{M_n}\left(\frac{1}{M_n}\sum_{i=1}^{n}\sum_{j\in I_i} R_j S_j' - \bar{R}_{M_n}\bar{S}_{M_n}'\right)\Gamma^1_{M_n}.$$
As $((R_n, S_n), n \ge 1)$ is an i.i.d. sample of (R, S), assumptions H1c and H2c are obviously verified with $B = \Gamma E\big[(R - E[R])(R - E[R])'\big]\Gamma$ and $F = \Gamma E\big[(R - E[R])(S - E[S])'\big]\Gamma^1$. Then:

Corollary 10. Suppose there is no affine relation between the components of R and the moments of order 4 of (R, S) exist. Suppose H3c holds. Then $X_n$ converges to $B^{-1}F$ a.s.

Remark 3. B is the correlation matrix of R, of dimension p. Then $\lambda_{max} < \mathrm{Trace}(B) = p$. In the case of a constant step-size a, it suffices to take $a \le \dfrac{1}{p}$ to verify H3c.
Remark 4. In the definition of $B_n$ and $F_n$, the $R_j$ and the $S_j$ are not directly pseudo-centered with respect to $\bar{R}_{M_n}$ and $\bar{S}_{M_n}$ respectively. Another equivalent definition of $B_n$ and $F_n$ can be used. It consists of replacing $R_j$ by $R_j - m$, $\bar{R}_{M_n}$ by $\bar{R}_{M_n} - m$, $S_j$ by $S_j - m^1$, $\bar{S}_{M_n}$ by $\bar{S}_{M_n} - m^1$, $m$ and $m^1$ being respectively an estimation of E[R] and E[S] computed in a preliminary phase with a small number of observations. For example, at step n, $\sum_{j\in I_n}\Gamma_{M_n}(R_j - m)\,\big(\Gamma_{M_n}(R_j - m)\big)'$ is computed instead of $\sum_{j\in I_n}\Gamma_{M_n}R_j\,\big(\Gamma_{M_n}R_j\big)'$. This limits the risk of numerical explosion.
This process was tested on several datasets and some results are given in section 5 (with a variable step-size: process S13 for mn = 1 and S14 for mn = 10; with a constant step-size: process S31 for mn = 1 and S32 for mn = 10).
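A minimal Python sketch of this process for q = 1, under our own naming and assuming the constant step-size a = 1/p of Remark 3, could read as follows; all past observations are summarized by cumulative sums, so nothing needs to be stored (the pseudo-centering of Remark 4 is not shown).

```python
import numpy as np

def process_s3(stream, p, a=None):
    """Sketch of the section-4 process (q = 1): B_n and F_n are computed from
    cumulative sums over all observations seen so far, with constant step a = 1/p."""
    a = 1.0 / p if a is None else a           # Remark 3: lambda_max < trace(B) = p
    X = np.zeros(p)                           # estimate of theta_c
    M = 0                                     # M_n = number of observations used so far
    sum_r, sum_s, sum_ss = np.zeros(p), 0.0, 0.0
    sum_rr, sum_rs = np.zeros((p, p)), np.zeros(p)
    for batch in stream:                      # batch holds the m_n new observations (r, s)
        for r, s in batch:                    # update the cumulative sums only
            M += 1
            sum_r += r
            sum_s += s
            sum_ss += s * s
            sum_rr += np.outer(r, r)
            sum_rs += r * s
        if M > 1:                             # assumes nonzero empirical variances once M > 1
            mean_r, mean_s = sum_r / M, sum_s / M
            cov_rr = sum_rr / M - np.outer(mean_r, mean_r)     # empirical covariance of R
            cov_rs = sum_rs / M - mean_r * mean_s              # empirical covariance of (R, S)
            g = 1.0 / np.sqrt(M / (M - 1) * np.diag(cov_rr))   # diagonal of Gamma_{M_n}
            g1 = 1.0 / np.sqrt(M / (M - 1) * (sum_ss / M - mean_s ** 2))
            Bn = g[:, None] * cov_rr * g[None, :]
            Fn = g * cov_rs * g1
            X = X - a * (Bn @ X - Fn)
    return X
```

With a variable step-size the same sketch applies, replacing the constant `a` by a schedule such as the one used in section 5.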
5 Experiments

The three previously-defined processes of stochastic approximation with online standardized data were compared with the classical stochastic approximation and averaged stochastic approximation (or averaged stochastic gradient descent, denoted ASGD) processes with constant step-size studied in [5] and [6]. A description of the methods along with the abbreviations and parameters used is given in Table 1.

With the variable S set at dimension 1, 11 datasets were considered, some of which are freely available on the Internet, while others were derived from the EPHESUS study [15]: 6 in regression (continuous dependent variable) and 5 in linear discriminant analysis (binary dependent variable). All datasets used in our experiments are presented in detail in Table 2, along with their download links. An a priori selection of variables was performed on each dataset using a stepwise procedure based on Fisher's test with p-to-enter and p-to-remove fixed at 5 percent.

Let $D = \{(r_i, s_i), i = 1, 2, \ldots, N\}$ be the set of data in $\mathbb{R}^p \times \mathbb{R}$. Assuming that it represents the set of realizations of a random vector (R, S) uniformly distributed in D, minimizing $E[(S - \theta'R - \eta)^2]$ is equivalent to minimizing $\frac{1}{N}\sum_{i=1}^{N}(s_i - \theta' r_i - \eta)^2$. One element of D (or several, according to the process) is randomly drawn at each step to iterate the process.

Table 1. Description of the methods.

Method type | Abbreviation | Type of data | Number of observations used at each step of the process | Use of all the observations until the current step | Step-size | Use of the averaged process
Classic | C1 | Raw data | 1 | No | variable | No
Classic | C2 | Raw data | 10 | No | variable | No
Classic | C3 | Raw data | 1 | Yes | variable | No
Classic | C4 | Raw data | 10 | Yes | variable | No
ASGD | A1 | Raw data | 1 | No | constant | Yes
ASGD | A2 | Raw data | 1 | No | constant | Yes
Standardization 1 | S11 | Online standardized data | 1 | No | variable | No
Standardization 1 | S12 | Online standardized data | 10 | No | variable | No
Standardization 1 | S13 | Online standardized data | 1 | Yes | variable | No
Standardization 1 | S14 | Online standardized data | 10 | Yes | variable | No
Standardization 2 | S21 | Online standardized data | 1 | No | constant | Yes
Standardization 2 | S22 | Online standardized data | 10 | No | constant | Yes
Standardization 3 | S31 | Online standardized data | 1 | Yes | constant | No
Standardization 3 | S32 | Online standardized data | 10 | Yes | constant | No

https://doi.org/10.1371/journal.pone.0191186.t001
Table 2. Datasets used in our experiments.

Dataset name | N | pa | p | Type of dependent variable | T2 | Number of outliers | Download link / source
CADATA | 20640 | 8 | 8 | Continuous | 1.6x10^6 | 122 | www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
AILERONS | 7154 | 40 | 9 | Continuous | 247.1 | 0 | www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
ELEVATORS | 8752 | 18 | 10 | Continuous | 7.7x10^4 | 0 | www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
POLY | 5000 | 48 | 12 | Continuous | 4.1x10^4 | 0 | www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
eGFR | 21382 | 31 | 15 | Continuous | 2.9x10^4 | 0 | derived from EPHESUS study [15]
HEMG | 21382 | 31 | 17 | Continuous | 6.0x10^4 | 0 | derived from EPHESUS study [15]
QUANTUM | 50000 | 78 | 14 | Binary | 22.5 | 1068 | www.osmot.cs.cornell.edu/kddcup
ADULT | 45222 | 97 | 95 | Binary | 4.7x10^10 | 20 | www.cs.toronto.edu/~delve/data/datasets.html
RINGNORM | 7400 | 20 | 20 | Binary | 52.8 | 0 | www.cs.toronto.edu/~delve/data/datasets.html
TWONORM | 7400 | 20 | 20 | Binary | 24.9 | 0 | www.cs.toronto.edu/~delve/data/datasets.html
HOSPHF30D | 21382 | 32 | 15 | Binary | 8.1x10^5 | 0 | derived from EPHESUS study [15]

N denotes the size of the global sample, pa the number of parameters available, p the number of parameters selected and T2 the trace of E[RR']. An outlier is defined as an observation whose L2 norm is greater than five times the average norm.
https://doi.org/10.1371/journal.pone.0191186.t002
To compare the methods, two different studies were performed: one by setting the total number of observations used, the other by setting the computing time. The choice of step-size, the initialization of each method and the convergence criterion used are respectively presented and commented on below.

Choice of step-size. In all methods of stochastic approximation, a suitable choice of step-size is often crucial for obtaining good performance of the process. If the step-size is too small, the convergence rate will be slower. Conversely, if the step-size is too large, a numerical explosion phenomenon may occur during the first iterations. For the processes with a variable step-size (processes C1 to C4 and S11 to S14), we chose to use $a_n$ of the following type:
$$a_n = \frac{c_g}{(b + n)^{\alpha}}.$$
The constant $\alpha = \frac{2}{3}$ was fixed, as suggested by Xu [16] in the case of stochastic approximation in linear regression, and $b = 1$. The results obtained for the choice $c_g = \frac{1}{p}$ are presented, although the latter does not correspond to the best choice for a classical method.

For the ASGD method (A1, A2), two different constant step-sizes a as used in [6] were tested: $a = \frac{1}{T^2}$ and $a = \frac{1}{2T^2}$, $T^2$ denoting the trace of $E[RR']$. Note that this choice of constant step-size assumes knowing the dataset a priori and is not suitable for a data stream.

For the methods with standardization and a constant step-size a (S21, S22, S31, S32), $a = \frac{1}{p}$ was chosen: the matrix $E[RR']$ is then the correlation matrix of R, whose trace is equal to p, so that this choice corresponds to that of [6].

Initialization of processes. All processes $(X_n)$ were initialized by $X_1 = 0$, the null vector. For the processes with standardization, a small number of observations (n = 1000) were taken into account in order to calculate an initial estimate of the means and standard deviations.
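For illustration, the step-size settings described above could be coded as follows. This is only a sketch with our own function names; `R_sample` stands for a preliminary sample used to estimate $T^2$, which is only possible when the dataset is known a priori.

```python
import numpy as np

def variable_step(n, p, b=1.0, alpha=2/3):
    """a_n = c_g / (b + n)**alpha with c_g = 1/p, as used for C1-C4 and S11-S14."""
    return (1.0 / p) / (b + n) ** alpha

def constant_steps(R_sample, p):
    """Constant step-sizes: 1/T^2 and 1/(2 T^2) for ASGD (T^2 = trace of E[RR']),
    and 1/p for the standardized methods S21, S22, S31, S32."""
    T2 = np.mean(np.sum(np.asarray(R_sample, dtype=float) ** 2, axis=1))  # empirical T^2
    return {"A1": 1.0 / T2, "A2": 1.0 / (2.0 * T2), "S2x_S3x": 1.0 / p}
```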
Convergence criterion. The "theoretical vector" $\theta^1$ is assigned as that obtained by the least square method in D, such that $\theta^{1\prime} = (\theta'\ \eta)$. Let $\Theta^1_{n+1}$ be the estimator of $\theta^1$ obtained by stochastic approximation after n iterations. In the case of a process $(X_n)$ with standardized data, which yields an estimation of the vector denoted $\theta_c$ in section 1, as $\theta = \Gamma\theta_c(\Gamma^1)^{-1}$ and $\eta = E[S] - \theta' E[R]$, we can define $\Theta^{1\prime}_{n+1} = (Y'_{n+1}\ H_{n+1})$ with
$$Y_{n+1} = \Gamma_{M_n}\,X_{n+1}\,\big(\Gamma^1_{M_n}\big)^{-1}, \qquad H_{n+1} = \bar{S}_{M_n} - Y'_{n+1}\,\bar{R}_{M_n}.$$
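A small Python sketch of this back-transformation for q = 1, with our own function names (`std_r`, `std_s`, `mean_r`, `mean_s` standing for the current online estimates of the standard deviations and means of R and S), together with the cosine criterion used below:

```python
import numpy as np

def destandardize(X, mean_r, std_r, mean_s, std_s):
    """Back-transform the estimate X of theta_c into theta^1 = (theta' eta)' on the raw
    scale (q = 1): theta = Gamma theta_c (Gamma^1)^{-1}, eta = mean(S) - theta' mean(R)."""
    theta = X / std_r * std_s
    eta = mean_s - theta @ mean_r
    return np.append(theta, eta)

def cosine_criterion(theta1_hat, theta1_ref):
    """Cosine of the angle between the estimated vector and the reference theta^1."""
    return float(theta1_hat @ theta1_ref /
                 (np.linalg.norm(theta1_hat) * np.linalg.norm(theta1_ref)))
```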
To judge the convergence of the method, the cosine of the angle formed by the exact $\theta^1$ and its estimation $\Theta^1_{n+1}$ was used as criterion:
$$\cos\big(\theta^1, \Theta^1_{n+1}\big) = \frac{\theta^{1\prime}\,\Theta^1_{n+1}}{\|\theta^1\|_2\,\|\Theta^1_{n+1}\|_2}.$$
Other criteria, such as $\dfrac{\|\theta^1 - \Theta^1_{n+1}\|_2}{\|\theta^1\|_2}$ or $\dfrac{f(\Theta^1_{n+1}) - f(\theta^1)}{f(\theta^1)}$, f being the loss function, were also tested, although the results are not presented in this article.

5.1 Study for a fixed total number of observations used

For every N observations used by the algorithm (N being the size of D), up to a maximum of 100N observations, the criterion value associated with each method and for each dataset was recorded. The results obtained after using 10N observations are provided in Table 3. As can be seen in Table 3, a numerical explosion occurred in most datasets when using the classical methods with raw data and a variable step-size (C1 to C4). As noted in Table 2, these datasets had a high $T^2 = \mathrm{Tr}(E[RR'])$. The corresponding methods S11 to S14, using the same variable step-size but with online standardized data, quickly converged in most cases.

Table 3. Results after using 10N observations.

Method | CADATA | AILERONS | ELEVATORS | POLY | EGFR | HEMG | QUANTUM | ADULT | RINGNORM | TWONORM | HOSPHF30D | Mean rank
C1 | Expl. | -0.0385 | Expl. | Expl. | Expl. | Expl. | 0.9252 | Expl. | 0.9998 | 1.0000 | Expl. | 11.6
C2 | Expl. | 0.0680 | Expl. | Expl. | Expl. | Expl. | 0.8551 | Expl. | 0.9976 | 0.9996 | Expl. | 12.2
C3 | Expl. | 0.0223 | Expl. | Expl. | Expl. | Expl. | 0.9262 | Expl. | 0.9999 | 1.0000 | Expl. | 9.9
C4 | Expl. | -0.0100 | Expl. | Expl. | Expl. | Expl. | 0.8575 | Expl. | 0.9981 | 0.9996 | Expl. | 12.3
A1 | -0.0013 | 0.4174 | 0.0005 | 0.3361 | 0.2786 | 0.2005 | Expl. | 0.0027 | 0.9998 | 1.0000 | 0.0264 | 9.2
A2 | 0.0039 | 0.2526 | 0.0004 | 0.1875 | 0.2375 | 0.1846 | 0.0000 | 0.0022 | 0.9999 | 1.0000 | 0.2047 | 8.8
S11 | 1.0000 | 0.9516 | 0.9298 | 1.0000 | 1.0000 | 0.9996 | 0.9999 | 0.7599 | 0.9999 | 1.0000 | 0.7723 | 5.2
S12 | 0.9999 | 0.9579 | 0.9311 | 1.0000 | 0.9999 | 0.9994 | 0.9991 | 0.6842 | 0.9999 | 1.0000 | 0.4566 | 6.1
S13 | 1.0000 | 0.9802 | 0.9306 | 1.0000 | 1.0000 | 0.9998 | 1.0000 | 0.7142 | 0.9999 | 1.0000 | 0.7754 | 3.7
S14 | 0.9999 | 0.9732 | 0.9303 | 1.0000 | 0.9999 | 0.9994 | 0.9991 | 0.6225 | 0.9998 | 1.0000 | 0.4551 | 6.9
S21 | 0.9993 | 0.6261 | 0.9935 | Expl. | Expl. | Expl. | Expl. | Expl. | 0.9998 | 1.0000 | Expl. | 10.5
S22 | 1.0000 | 0.9977 | 0.9900 | 1.0000 | 1.0000 | 0.9989 | 0.9999 | -0.0094 | 0.9999 | 1.0000 | 0.9454 | 4.1
S31 | 1.0000 | 0.9988 | 0.9999 | 1.0000 | 1.0000 | 0.9992 | 0.9999 | 0.9907 | 0.9999 | 1.0000 | 0.9788 | 2.3
S32 | 1.0000 | 0.9991 | 0.9998 | 1.0000 | 1.0000 | 0.9992 | 0.9999 | 0.9867 | 0.9999 | 1.0000 | 0.9806 | 2.2

Expl. means numerical explosion.
https://doi.org/10.1371/journal.pone.0191186.t003
Fig 1. Results obtained for dataset POLY using 10N and 100N observations: A/ process C1 with variable step-size $a_n = \frac{1/p}{(b + n)^{2/3}}$ by varying b, B/ process C1 with variable step-size $a_n = \frac{1}{(b + n)^{2/3}}$ by varying b, C/ process S21 by varying constant step-size a.
https://doi.org/10.1371/journal.pone.0191186.g001
However, classical methods with raw data can yield good results for a suitable choice of step-size, as demonstrated by the results obtained for the POLY dataset in Fig 1. The numerical explosion can arise from too large a step-size when n is small. This phenomenon can be avoided if the step-size is reduced, although if the latter is too small, the convergence rate will be slowed. Hence, the right balance
must be found between step-size and convergence rate. Furthermore, the choice of this step-size generally depends on the dataset, which is not known a priori in the case of a data stream. In conclusion, methods with standardized data appear to be more robust to the choice of step-size.

The ASGD method (A1 with constant step-size $a = \frac{1}{T^2}$ and A2 with $a = \frac{1}{2T^2}$) did not yield good results except for the RINGNORM and TWONORM datasets, which were obtained by simulation (note that all methods functioned very well for these two datasets). Of note, A1 exploded for the QUANTUM dataset containing 1068 observations (2.1%) whose L2 norm was fivefold greater than the average norm (Table 2). The corresponding method S21 with online standardized data yielded several numerical explosions with the $a = \frac{1}{p}$ step-size; however, these explosions disappeared when using a smaller step-size (see Fig 1). Of note, it is assumed in Corollary 8 that $0 < a < \min\left(\frac{1}{\lambda_{max}}, \frac{2\lambda}{\lambda^2 + b_1}\right)$; in the case of $a = \frac{1}{p}$, only $a < \frac{1}{\lambda_{max}}$ is certain.

Finally, for methods S31 and S32 with standardized data, the use of all observations until the current step and the very simple choice of the constant step-size $a = \frac{1}{p}$ uniformly yielded good results.

Thereafter, for each fixed number of observations used and for each dataset, the 14 methods were ranked from the best (the highest cosine) to the worst (the lowest cosine) by assigning each of them a rank from 1 to 14 respectively, after which the mean rank over all 11 datasets was calculated for each method. A total of 100 mean rank values were calculated for a number of observations used varying from N to 100N. The graph depicting the change in mean rank based on the number of observations used and the boxplot of the mean rank are shown in Fig 2.
Fig 2. Results for a fixed total number of observations used: A/ change in the mean rank based on the number of observations used, B/ boxplot of the mean rank by method. https://doi.org/10.1371/journal.pone.0191186.g002
Overall, for these 11 datasets, a method with standardized data, a constant step-size and use of all observations until the current step (S31, S32) represented the best method when the total number of observations used was fixed.
5.2 Study for a fixed processing time

For every second up to a maximum of 2 minutes, the criterion value associated with each method and each dataset was recorded. The results obtained after a processing time of 1 minute are provided in Table 4. The same conclusions can be drawn as those described in section 5.1 for the classical methods and the ASGD method. The methods with online standardized data typically fared better.

As in the previous study in section 5.1, the 14 methods were ranked from the best to the worst on the basis of the mean rank for a fixed processing time. The graph depicting the change in mean rank based on the processing time, varying from 1 second to 2 minutes, as well as the boxplot of the mean rank, are shown in Fig 3. As can be seen, the methods with online standardized data using more than one observation per step yielded the best results (S32, S22). One explanation may be that the total number of observations used in a fixed processing time is higher when several observations are used per step rather than one observation per step. This can be verified in Table 5, in which the total number of observations used per second for each method and for each dataset during a processing time of 2 minutes is given. Of note, the number of observations used per second in a process with standardized data and one observation per step (S11, S13, S21, S31) was found to be generally lower than in a process with raw data and one observation per step (C1, C3, A1, A2), since a method with standardization requires the recursive estimation of means and variances at each step. Of note, for the ADULT dataset with a large number of parameters selected (95), the only method yielding sufficiently adequate results after a processing time of one minute was S32, and methods S31 and S32 when 10N observations were used.

Table 4. Results obtained after a fixed time of 1 minute.

Method | CADATA | AILERONS | ELEVATORS | POLY | EGFR | HEMG | QUANTUM | ADULT | RINGNORM | TWONORM | HOSPHF30D | Mean rank
C1 | Expl. | -0.2486 | Expl. | Expl. | Expl. | Expl. | 0.9561 | Expl. | 1.0000 | 1.0000 | Expl. | 12.2
C2 | Expl. | 0.7719 | Expl. | Expl. | Expl. | Expl. | 0.9519 | Expl. | 1.0000 | 1.0000 | Expl. | 9.9
C3 | Expl. | 0.4206 | Expl. | Expl. | Expl. | Expl. | 0.9547 | Expl. | 1.0000 | 1.0000 | Expl. | 10.6
C4 | Expl. | 0.0504 | Expl. | Expl. | Expl. | Expl. | 0.9439 | Expl. | 1.0000 | 1.0000 | Expl. | 10.1
A1 | -0.0067 | 0.8323 | 0.0022 | 0.9974 | 0.7049 | 0.2964 | Expl. | 0.0036 | 1.0000 | 1.0000 | Expl. | 9.0
A2 | 0.0131 | 0.8269 | 0.0015 | 0.9893 | 0.5100 | 0.2648 | Expl. | 0.0027 | 1.0000 | 1.0000 | 0.2521 | 8.6
S11 | 1.0000 | 0.9858 | 0.9305 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.6786 | 1.0000 | 1.0000 | 0.9686 | 5.8
S12 | 1.0000 | 0.9767 | 0.9276 | 1.0000 | 1.0000 | 0.9999 | 1.0000 | 0.6644 | 1.0000 | 1.0000 | 0.9112 | 5.8
S13 | 1.0000 | 0.9814 | 0.9299 | 1.0000 | 1.0000 | 0.9999 | 1.0000 | 0.4538 | 1.0000 | 1.0000 | 0.9329 | 6.1
S14 | 1.0000 | 0.9760 | 0.9274 | 1.0000 | 1.0000 | 1.0000 | 0.9999 | 0.5932 | 1.0000 | 1.0000 | 0.8801 | 6.1
S21 | -0.9998 | 0.2424 | 0.6665 | Expl. | Expl. | Expl. | Expl. | 0.0000 | 1.0000 | 1.0000 | Expl. | 11.5
S22 | 1.0000 | 0.9999 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | -0.0159 | 1.0000 | 1.0000 | 0.9995 | 3.1
S31 | 1.0000 | 0.9995 | 1.0000 | 1.0000 | 1.0000 | 0.9999 | 1.0000 | 0.9533 | 1.0000 | 1.0000 | 0.9997 | 4.5
S32 | 1.0000 | 0.9999 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 0.9820 | 1.0000 | 1.0000 | 0.9999 | 1.5

Expl. means numerical explosion.
https://doi.org/10.1371/journal.pone.0191186.t004
Fig 3. Results for a fixed processing time: A/ change in the mean rank based on the processing time, B/ boxplot of the mean rank by method. https://doi.org/10.1371/journal.pone.0191186.g003
Table 5. Number of observations used after 2 minutes (expressed in number of observations per second).

Method | CADATA | AILERONS | ELEVATORS | POLY | EGFR | HEMG | QUANTUM | ADULT | RINGNORM | TWONORM | HOSPHF30D
C1 | 19843 | 33170 | 17133 | 14300 | 10979 | 9243 | 33021 | 476 | 31843 | 31677 | 10922
C2 | 166473 | 291558 | 159134 | 134249 | 104152 | 89485 | 281384 | 4565 | 262847 | 261881 | 102563
C3 | 17206 | 28985 | 16036 | 13449 | 10383 | 8878 | 28707 | 462 | 28123 | 28472 | 10404
C4 | 132088 | 194031 | 125880 | 106259 | 87844 | 76128 | 184386 | 4252 | 171711 | 166878 | 86895
A1 | 33622 | 35388 | 36540 | 35800 | 35280 | 34494 | 11815 | 15390 | 34898 | 34216 | 14049
A2 | 33317 | 32807 | 36271 | 35628 | 35314 | 34454 | 15439 | 16349 | 34401 | 34205 | 34890
S11 | 17174 | 17133 | 17166 | 16783 | 15648 | 14764 | 16296 | 1122 | 14067 | 13836 | 14334
S12 | 45717 | 47209 | 45893 | 43470 | 39937 | 37376 | 40943 | 4554 | 34799 | 34507 | 36389
S13 | 12062 | 12731 | 11888 | 12057 | 11211 | 10369 | 11466 | 620 | 9687 | 9526 | 10137
S14 | 43674 | 46080 | 43068 | 42123 | 38350 | 35338 | 39170 | 4512 | 33594 | 31333 | 32701
S21 | 15396 | 17997 | 16772 | 10265 | 8404 | 7238 | 9166 | 996 | 13942 | 13274 | 7672
S22 | 47156 | 47865 | 46318 | 43899 | 40325 | 37467 | 41320 | 4577 | 34478 | 31758 | 37418
S31 | 12495 | 12859 | 12775 | 12350 | 11495 | 10619 | 11608 | 621 | 9890 | 9694 | 10863
S32 | 44827 | 47035 | 45123 | 42398 | 38932 | 36288 | 39362 | 4532 | 33435 | 33385 | 35556

https://doi.org/10.1371/journal.pone.0191186.t005
6 Conclusion

In the present study, three processes with online standardized data were defined and their a.s. convergence was proven. A stochastic approximation method with standardized data appears to be advantageous compared to a method with raw data. First, it is easier to choose the step-size: for processes S31 and S32, for example, the definition of a constant step-size only requires knowing the number of parameters p. Secondly, the standardization usually avoids the phenomenon of numerical explosion often observed in the examples when a classical method is used.
The use of all observations until the current step can reduce the influence of outliers and increase the convergence rate of a process. Moreover, this approach is particularly adapted to the case of a data stream. Finally, among all processes tested on 11 different datasets (linear regression or linear discriminant analysis), the best was a method using standardization, a constant step-size equal to $\frac{1}{p}$ and all observations until the current step; the use of several new observations at each step improved the convergence rate.
Author Contributions
Conceptualization: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
Formal analysis: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
Writing – original draft: Kévin Duarte, Jean-Marie Monnez, Eliane Albuisson.
References
1. Monnez JM. Le processus d'approximation stochastique de Robbins-Monro: résultats théoriques; estimation séquentielle d'une espérance conditionnelle. Statistique et Analyse des Données. 1979; 4(2):11–29.
2. Ljung L. Analysis of stochastic gradient algorithms for linear regression problems. IEEE Transactions on Information Theory. 1984; 30(2):151–160. https://doi.org/10.1109/TIT.1984.1056895
3. Polyak BT. New method of stochastic approximation type. Automation and Remote Control. 1990; 51(7):937–946.
4. Polyak BT, Juditsky AB. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization. 1992; 30(4):838–855. https://doi.org/10.1137/0330046
5. Györfi L, Walk H. On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization. 1996; 34(1):31–61. https://doi.org/10.1137/S0363012992226661
6. Bach F, Moulines E. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems. 2013; 773–781.
7. Bottou L, Le Cun Y. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry. 2005; 21(2):137–151. https://doi.org/10.1002/asmb.538
8. Bottou L, Curtis FE, Nocedal J. Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838v2. 2017.
9. Johnson R, Zhang T. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. Advances in Neural Information Processing Systems. 2013; 315–323.
10. Duchi J, Hazan E, Singer Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research. 2011; 12:2121–2159.
11. Pascanu R, Mikolov T, Bengio Y. Understanding the exploding gradient problem. arXiv:1211.5063v1. 2012.
12. Robbins H, Siegmund D. A convergence theorem for nonnegative almost supermartingales and some applications. In: Rustagi JS, editor. Optimizing Methods in Statistics. New York: Academic Press; 1971. p. 233–257.
13. Schmetterer L. Multidimensional stochastic approximation. In: Multivariate Analysis II, Proc. 2nd Int. Symp., Dayton, Ohio. Academic Press; 1969. p. 443–460.
14. Venter JH. On Dvoretzky stochastic approximation theorems. The Annals of Mathematical Statistics. 1966; 37:1534–1544. https://doi.org/10.1214/aoms/1177699145
15. Pitt B, Remme W, Zannad F, et al. Eplerenone, a selective aldosterone blocker, in patients with left ventricular dysfunction after myocardial infarction. New England Journal of Medicine. 2003; 348(14):1309–1321. https://doi.org/10.1056/NEJMoa030207 PMID: 12668699
16. Xu W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv:1107.2490v2. 2011.