Gradient Descent with Random Initialization: Fast Global Convergence for Nonconvex Phase Retrieval

Yuxin Chen∗    Yuejie Chi†    Jianqing Fan‡    Cong Ma‡

March 2018

Abstract

This paper considers the problem of solving systems of quadratic equations, namely, recovering an object of interest $x^\natural \in \mathbb{R}^n$ from $m$ quadratic equations / samples $y_i = (a_i^\top x^\natural)^2$, $1 \le i \le m$. This problem, also dubbed as phase retrieval, spans multiple domains including physical sciences and machine learning. We investigate the efficiency of gradient descent (or Wirtinger flow) designed for the nonconvex least squares problem. We prove that under Gaussian designs, gradient descent — when randomly initialized — yields an $\epsilon$-accurate solution in $O\big(\log n + \log(1/\epsilon)\big)$ iterations given nearly minimal samples, thus achieving near-optimal computational and sample complexities at once. This provides the first global convergence guarantee concerning vanilla gradient descent for phase retrieval, without the need of (i) carefully-designed initialization, (ii) sample splitting, or (iii) sophisticated saddle-point escaping schemes. All of these are achieved by exploiting the statistical models in analyzing optimization algorithms, via a leave-one-out approach that enables the decoupling of certain statistical dependency between the gradient descent iterates and the data.

1 Introduction

Suppose we are interested in learning an unknown object $x^\natural \in \mathbb{R}^n$, but only have access to a few quadratic equations of the form
$$y_i = \big(a_i^\top x^\natural\big)^2, \qquad 1 \le i \le m, \tag{1}$$
where $y_i$ is the sample we collect and $a_i$ is the design vector known a priori. Is it feasible to reconstruct $x^\natural$ in an accurate and efficient manner?

The problem of solving systems of quadratic equations (1) is of fundamental importance and finds applications in numerous contexts. Perhaps one of the best-known applications is the so-called phase retrieval problem arising in physical sciences [CESV13, SEC+15]. In X-ray crystallography, due to the ultra-high frequency of the X-rays, the optical sensors and detectors are incapable of recording the phases of the diffractive waves; rather, only intensity measurements are collected. The phase retrieval problem comes down to reconstructing the specimen of interest given intensity-only measurements. If one thinks of $x^\natural$ as the specimen under study and uses $\{y_i\}$ to represent the intensity measurements, then phase retrieval is precisely about inverting the quadratic system (1).

Moving beyond physical sciences, the above problem also spans various machine learning applications. One example is mixed linear regression, where one wishes to estimate two unknown vectors $\beta_1$ and $\beta_2$ from unlabeled linear measurements [CYC14, BWY14]. The acquired data $\{a_i, b_i\}_{1 \le i \le m}$ take the form of either $b_i \approx a_i^\top \beta_1$ or $b_i \approx a_i^\top \beta_2$, without knowing which of the two vectors generates the data. In a simple symmetric case with $\beta_1 = -\beta_2 = x^\natural$ (so that $b_i \approx \pm a_i^\top x^\natural$), the squared measurements $y_i = b_i^2 \approx (a_i^\top x^\natural)^2$ become the sufficient statistics, and hence mixed linear regression can be converted to learning $x^\natural$ from $\{a_i, y_i\}$.

Author names are sorted alphabetically.
∗Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA; Email: [email protected].
†Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Email: [email protected].
‡Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; Email: {jqfan, congm}@princeton.edu.


Furthermore, the quadratic measurement model in (1) allows one to represent a single neuron associated with a quadratic activation function, where $\{a_i, y_i\}$ are the data and $x^\natural$ encodes the parameters to be learned. As described in [SJL17, LMZ17], learning neural nets with quadratic activations involves solving systems of quadratic equations.

1.1 Nonconvex optimization via gradient descent

A natural strategy for inverting the system of quadratic equations (1) is to solve the following nonconvex least squares estimation problem
$$\text{minimize}_{x \in \mathbb{R}^n} \quad f(x) := \frac{1}{4m} \sum_{i=1}^{m} \Big[\big(a_i^\top x\big)^2 - y_i\Big]^2. \tag{2}$$
Under Gaussian designs where $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$, the solution to (2) is known to be exact — up to some global sign — with high probability, as soon as the number $m$ of equations (samples) exceeds the order of the number $n$ of unknowns [BCMN14]. However, the loss function in (2) is highly nonconvex, thus resulting in severe computational challenges. With this issue in mind, can we still hope to find the global minimizer of (2) via low-complexity algorithms which, ideally, run in time proportional to that taken to read the data?

Fortunately, in spite of nonconvexity, a variety of optimization-based methods are shown to be effective in the presence of proper statistical models. Arguably, one of the simplest algorithms for solving (2) is vanilla gradient descent (GD), which attempts recovery via the update rule
$$x^{t+1} = x^t - \eta_t \nabla f(x^t), \qquad t = 0, 1, \cdots \tag{3}$$
with $\eta_t$ being the stepsize / learning rate. The above iterative procedure is also dubbed Wirtinger flow for phase retrieval, which can accommodate the complex-valued case as well [CLS15]. This simple algorithm is remarkably efficient under Gaussian designs: in conjunction with carefully-designed initialization and stepsize rules, GD provably converges to the truth $x^\natural$ at a linear rate¹, provided that the ratio $m/n$ of the number of equations to the number of unknowns exceeds some logarithmic factor [CLS15, Sol14, MWCC17].

One crucial element in prior convergence analysis is initialization. In order to guarantee linear convergence, prior works typically recommend spectral initialization or its variants [CLS15, CC17, WGE17, ZZLC17, MWCC17, LL17, MM17]. Specifically, the spectral method forms an initial estimate $x^0$ using the (properly scaled) leading eigenvector of a certain data matrix. Two important features are worth emphasizing:

• $x^0$ falls within a local $\ell_2$-ball surrounding $x^\natural$ with a reasonably small radius, where $f(\cdot)$ enjoys strong convexity;

• $x^0$ is incoherent with all the design vectors $\{a_i\}$ — in the sense that $|a_i^\top x^0|$ is reasonably small for all $1 \le i \le m$ — and hence $x^0$ falls within a region where $f(\cdot)$ enjoys desired smoothness conditions.

These two properties taken collectively allow gradient descent to converge rapidly from the very beginning.
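To make the estimator (2) and the update rule (3) concrete, the following sketch (our own illustration, not code from the paper) implements the least-squares loss, its gradient, and the vanilla GD / Wirtinger flow iteration under the Gaussian design described above.

```python
import numpy as np

def loss(x, A, y):
    """f(x) = (1/4m) * sum_i [ (a_i^T x)^2 - y_i ]^2, cf. (2)."""
    m = len(y)
    return np.sum(((A @ x) ** 2 - y) ** 2) / (4 * m)

def gradient(x, A, y):
    """grad f(x) = (1/m) * sum_i [ (a_i^T x)^2 - y_i ] (a_i^T x) a_i."""
    m = len(y)
    Ax = A @ x
    return A.T @ ((Ax ** 2 - y) * Ax) / m

def gradient_descent(A, y, x0, eta=0.1, n_iters=200):
    """Vanilla GD (Wirtinger flow) with a constant stepsize, cf. (3)."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x - eta * gradient(x, A, y)
    return x
```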

1.2 Random initialization?

The enormous success of spectral initialization gives rise to a curious question: is carefully-designed initialization necessary for achieving fast convergence? Obviously, vanilla GD cannot start from arbitrary points, since it may get trapped in undesirable stationary points (e.g. saddle points). However, is there any simpler initialization approach that allows one to avoid such stationary points and that works equally well as spectral initialization?

A strategy that practitioners often like to employ is to initialize GD randomly. The advantage is clear: compared with spectral methods, random initialization is model-agnostic and is usually more robust vis-à-vis model mismatch. Despite its wide use in practice, however, GD with random initialization is poorly understood in theory. One way to study this method is through a geometric lens [SQW16]: under Gaussian designs, the loss function $f(\cdot)$ (cf. (2)) does not have any spurious local minima as long as the sample size $m$ is on the order of $n \log^3 n$.

¹An iterative algorithm is said to enjoy linear convergence if the iterates $\{x^t\}$ converge geometrically fast to the minimizer $x^\natural$.


Figure 1: The relative $\ell_2$ error vs. iteration count for GD with random initialization, plotted semilogarithmically. The results are shown for $n = 100, 200, 500, 800, 1000$ with $m = 10n$ and $\eta_t \equiv 0.1$.

Moreover, all saddle points are strict [GHJY15], meaning that the associated Hessian matrices have at least one negative eigenvalue if they are not local minima. Armed with these two conditions, the theory of Lee et al. [LSJR16, LPP+17] implies that vanilla GD converges almost surely to the truth. However, the convergence rate remains unsettled. In fact, we are not aware of any theory that guarantees polynomial-time convergence of vanilla GD for phase retrieval in the absence of carefully-designed initialization.

Motivated by this, we aim to pursue a formal understanding about the convergence properties of GD with random initialization. Before embarking on theoretical analyses, we first assess its practical efficiency through numerical experiments. Generate the true object $x^\natural$ and the initial guess $x^0$ randomly as
$$x^\natural \sim \mathcal{N}\big(0, n^{-1} I_n\big) \qquad \text{and} \qquad x^0 \sim \mathcal{N}\big(0, n^{-1} I_n\big).$$
We vary the number $n$ of unknowns (i.e. $n = 100, 200, 500, 800, 1000$), set $m = 10n$, and take a constant stepsize $\eta_t \equiv 0.1$. Here the measurement vectors are generated from Gaussian distributions, i.e. $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$ for $1 \le i \le m$. The relative $\ell_2$ errors $\mathrm{dist}(x^t, x^\natural)/\|x^\natural\|_2$ of the GD iterates in a random trial are plotted in Figure 1, where
$$\mathrm{dist}\big(x^t, x^\natural\big) := \min\big\{\|x^t - x^\natural\|_2,\; \|x^t + x^\natural\|_2\big\} \tag{4}$$
represents the $\ell_2$ distance between $x^t$ and $x^\natural$ modulo the unrecoverable global sign. In all experiments carried out in Figure 1, we observe two stages for GD: (1) Stage 1: the relative error of $x^t$ stays nearly flat; (2) Stage 2: the relative error of $x^t$ experiences geometric decay. Interestingly, Stage 1 lasts only for a few tens of iterations. These numerical findings taken together reveal appealing numerical efficiency of GD in the presence of random initialization — it attains 5-digit accuracy within about 200 iterations!

To further illustrate this point, we take a closer inspection of the signal component $\langle x^t, x^\natural\rangle x^\natural$ and the perpendicular component $x^t - \langle x^t, x^\natural\rangle x^\natural$, where we normalize $\|x^\natural\|_2 = 1$ for simplicity. Denote by $\|x^t_\perp\|_2$ the $\ell_2$ norm of the perpendicular component. We highlight two important and somewhat surprising observations that allude to why random initialization works.

• Exponential growth of the magnitude ratio of the signal to the perpendicular components. The ratio $|\langle x^t, x^\natural\rangle| / \|x^t_\perp\|_2$ grows exponentially fast throughout the execution of the algorithm, as demonstrated in Figure 2(a). This metric in some sense captures the signal-to-noise ratio of the running iterates.

• Exponential growth of the signal strength in Stage 1. While the $\ell_2$ estimation error of $x^t$ may not drop significantly during Stage 1, the size $|\langle x^t, x^\natural\rangle|$ of the signal component increases exponentially fast and becomes the dominant component within several tens of iterations, as demonstrated in Figure 2(b). This helps explain why Stage 1 lasts only for a short duration.
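The experiment just described is easy to reproduce. The sketch below is our own minimal implementation of the stated setup (m = 10n, constant stepsize 0.1, Gaussian designs, random initialization); it prints the relative error (4) together with the sizes of the signal and perpendicular components.

```python
import numpy as np

def run_trial(n=100, eta=0.1, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    m = 10 * n
    x_star = rng.normal(0.0, 1.0 / np.sqrt(n), n)   # true object  x^natural ~ N(0, n^{-1} I_n)
    x = rng.normal(0.0, 1.0 / np.sqrt(n), n)        # initial guess x^0       ~ N(0, n^{-1} I_n)
    A = rng.normal(size=(m, n))                     # design vectors a_i ~ N(0, I_n)
    y = (A @ x_star) ** 2

    for t in range(1, n_iters + 1):
        Ax = A @ x
        x = x - eta * A.T @ ((Ax ** 2 - y) * Ax) / m          # gradient step on (2)
        err = min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))
        err /= np.linalg.norm(x_star)                         # relative error, cf. (4)
        signal = (x @ x_star) / np.linalg.norm(x_star) ** 2   # signal component <x^t, x^natural>
        perp = np.linalg.norm(x - signal * x_star)            # size of the perpendicular component
        if t % 25 == 0:
            print(f"t={t:3d}  rel.err={err:.2e}  |signal|={abs(signal):.2e}  perp={perp:.2e}")

run_trial()
```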



Figure 2: (a) The ratio $|\langle x^t, x^\natural\rangle| / \|x^t_\perp\|_2$, and (b) the size $|\langle x^t, x^\natural\rangle|$ of the signal component and the $\ell_2$ error vs. iteration count, both plotted on semilogarithmic axes. The results are shown for $n = 100, 200, 500, 800, 1000$ with $m = 10n$, $\eta_t \equiv 0.1$, and $\|x^\natural\|_2 = 1$.

The central question then amounts to whether one can develop a mathematical theory to interpret such intriguing numerical performance. In particular, how many iterations does Stage 1 encompass, and how fast can the algorithm converge in Stage 2?

1.3 Main findings

The objective of the current paper is to demystify the computational efficiency of GD with random initialization, thus bridging the gap between theory and practice. Assuming a tractable random design model in which the $a_i$'s follow Gaussian distributions, our main findings are summarized in the following theorem. Here and throughout, the notation $f(n) \lesssim g(n)$ or $f(n) = O(g(n))$ (resp. $f(n) \gtrsim g(n)$, $f(n) \asymp g(n)$) means that there exist constants $c_1, c_2 > 0$ such that $f(n) \le c_1 g(n)$ (resp. $f(n) \ge c_2 g(n)$, $c_1 g(n) \le f(n) \le c_2 g(n)$).

Theorem 1. Fix $x^\natural \in \mathbb{R}^n$ with $\|x^\natural\|_2 = 1$. Suppose that $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$ for $1 \le i \le m$, $x^0 \sim \mathcal{N}(0, n^{-1} I_n)$, and $\eta_t \equiv \eta = c/\|x^\natural\|_2^2$ for some sufficiently small constant $c > 0$. Then with probability approaching one, there exist some sufficiently small constant $0 < \gamma < 1$ and $T_\gamma \lesssim \log n$ such that the GD iterates (3) obey
$$\mathrm{dist}\big(x^t, x^\natural\big) \le \gamma (1-\rho)^{t - T_\gamma}, \qquad \forall\, t \ge T_\gamma$$
for some absolute constant $0 < \rho < 1$, provided that the sample size $m \gtrsim n\, \mathrm{poly}\log(m)$.

Remark 1. The readers are referred to Theorem 2 for a more general statement.

Here, the stepsize is taken to be a fixed constant throughout all iterations, and we reuse all data across all iterations (i.e. no sample splitting is needed to establish this theorem). The GD trajectory is divided into 2 stages: (1) Stage 1 consists of the first $T_\gamma$ iterations, corresponding to the first tens of iterations discussed in Section 1.2; (2) Stage 2 consists of all remaining iterations, where the estimation error contracts linearly. Several important implications / remarks follow immediately.

• Stage 1 takes $O(\log n)$ iterations. When seeded with a random initial guess, GD is capable of entering a local region surrounding $x^\natural$ within $T_\gamma \lesssim \log n$ iterations, namely,
$$\mathrm{dist}\big(x^{T_\gamma}, x^\natural\big) \le \gamma$$
for some sufficiently small constant $\gamma > 0$. Even though Stage 1 may not enjoy linear convergence in terms of the estimation error, it is of fairly short duration.

• Stage 2 takes $O(\log(1/\epsilon))$ iterations. After entering the local region, GD converges linearly to the ground truth $x^\natural$ with a contraction rate $1 - \rho$. This tells us that GD reaches $\epsilon$-accuracy (in a relative sense) within $O(\log(1/\epsilon))$ iterations.

• Near linear-time computational complexity. Taken collectively, the iteration complexity of GD with random initialization is
$$O\big(\log n + \log\tfrac{1}{\epsilon}\big).$$
Given that the cost of each iteration mainly lies in calculating the gradient $\nabla f(x^t)$, the whole algorithm takes nearly linear time, namely, it enjoys a computational complexity proportional to the time taken to read the data (modulo some logarithmic factor).

• Near-minimal sample complexity. The preceding computational guarantees occur as soon as the sample size exceeds $m \gtrsim n\, \mathrm{poly}\log(m)$. Given that one needs at least $n$ samples to recover $n$ unknowns, the sample complexity of randomly initialized GD is optimal up to some logarithmic factor.

• Saddle points? The GD iterates never hit the saddle points (see Figure 3 for an illustration). In fact, after a constant number of iterations at the very beginning, GD will follow a path that increasingly distances itself from the set of saddle points as the algorithm progresses. There is no need to adopt sophisticated saddle-point escaping schemes developed in generic optimization theory (e.g. perturbed GD [JGN+17]).

• Weak dependency w.r.t. the design vectors. As we will elaborate in Section 4, the statistical dependency between the GD iterates $\{x^t\}$ and certain components of the design vectors $\{a_i\}$ stays at an exceedingly weak level. Consequently, the GD iterates $\{x^t\}$ proceed as if fresh samples were employed in each iteration. This statistical observation plays a crucial role in characterizing the dynamics of the algorithm without the need of sample splitting.

It is worth emphasizing that the entire trajectory of GD is automatically confined within a certain region enjoying favorable geometry. For example, as we shall make precise in Section 4, the GD iterates are always incoherent with the design vectors, stay sufficiently away from any saddle point, and exhibit desired smoothness conditions. Such delicate geometric properties underlying the GD trajectory are not explained by prior works. In light of this, convergence analysis based on global geometry [SQW16] — which provides valuable insights into algorithm designs with arbitrary initialization — results in suboptimal (or even pessimistic) computational guarantees when analyzing a concrete algorithm like GD. In contrast, the current paper establishes near-optimal performance guarantees by paying particular attention to finer dynamics of the algorithm. As will be seen later, this is accomplished by heavily exploiting statistical models in each iterative update.

Figure 3: The trajectory of $(\alpha_t, \beta_t)$, where $\alpha_t = |\langle x^t, x^\natural\rangle|$ and $\beta_t = \|x^t - \langle x^t, x^\natural\rangle x^\natural\|_2$ represent respectively the size of the signal component and that of the perpendicular component of the GD iterates. (a) The results are shown for $n = 1000$ with $m = 10n$, $\eta_t = 0.01, 0.05, 0.1$, and $\|x^\natural\|_2 = 1$, for the same instance as plotted in Figure 1. (b) The results are shown for $n = 1000$ with $m$ approaching infinity, $\eta_t = 0.01, 0.05, 0.1$, and $\|x^\natural\|_2 = 1$. The red dots represent the population-level saddle points.
2 Why random initialization works?

Before diving into the proof of the main theorem, we pause to develop intuitions regarding why gradient descent with random initialization is expected to work. We will build our understanding step by step: (i) we first investigate the dynamics of the population gradient sequence (the case where we have infinite samples); (ii) we then turn to the finite-sample case and present a heuristic argument assuming independence between the iterates and the design vectors; (iii) finally, we argue that the true trajectory is remarkably close to the one heuristically analyzed in the previous step, which arises from a key property concerning the “near-independence” between $\{x^t\}$ and the design vectors $\{a_i\}$.

Without loss of generality, we assume $x^\natural = e_1$ throughout this section, where $e_1$ denotes the first standard basis vector. For notational simplicity, we denote by
$$x^t_\parallel := x^t_1 \qquad \text{and} \qquad x^t_\perp := [x^t_i]_{2 \le i \le n} \tag{5}$$
the first entry and the 2nd through the $n$th entries of $x^t$, respectively. Since $x^\natural = e_1$, it is easily seen that
$$\underbrace{x^t_\parallel\, e_1}_{\text{signal component}} = \langle x^t, x^\natural\rangle x^\natural \qquad \text{and} \qquad \underbrace{\begin{bmatrix} 0 \\ x^t_\perp \end{bmatrix}}_{\text{perpendicular component}} = x^t - \langle x^t, x^\natural\rangle x^\natural \tag{6}$$
represent respectively the components of $x^t$ along and perpendicular to the signal direction. In what follows, we focus our attention on the following two quantities that reflect the sizes of the preceding two components²
$$\alpha_t := x^t_\parallel \qquad \text{and} \qquad \beta_t := \big\|x^t_\perp\big\|_2. \tag{7}$$
Without loss of generality, assume that $\alpha_0 > 0$.

²Here, we do not take the absolute value of $x^t_\parallel$. As we shall see later, the $x^t_\parallel$'s are of the same sign throughout the execution of the algorithm.

2.1 Population dynamics

To start with, we consider the unrealistic case where the iterates $\{x^t\}$ are constructed using the population gradient (or equivalently, the gradient when the sample size $m$ approaches infinity), i.e.
$$x^{t+1} = x^t - \eta \nabla F(x^t).$$
Here, $\nabla F(x)$ represents the population gradient given by
$$\nabla F(x) := \big(3\|x\|_2^2 - 1\big)x - 2\big(x^{\natural\top} x\big)x^\natural,$$
which can be computed by $\nabla F(x) = \mathbb{E}[\nabla f(x)] = \mathbb{E}\big[\{(a_i^\top x)^2 - (a_i^\top x^\natural)^2\}\, a_i a_i^\top x\big]$ assuming that $x$ and the $a_i$'s are independent. Simple algebraic manipulation reveals the dynamics for both the signal and the perpendicular components:
$$x^{t+1}_\parallel = \big[1 + 3\eta\big(1 - \|x^t\|_2^2\big)\big]\, x^t_\parallel; \tag{8a}$$
$$x^{t+1}_\perp = \big[1 + \eta\big(1 - 3\|x^t\|_2^2\big)\big]\, x^t_\perp. \tag{8b}$$
Assuming that $\eta$ is sufficiently small and recognizing that $\|x^t\|_2^2 = \alpha_t^2 + \beta_t^2$, we arrive at the following population-level state evolution for both $\alpha_t$ and $\beta_t$ (cf. (7)):
$$\alpha_{t+1} = \big[1 + 3\eta\big(1 - \big(\alpha_t^2 + \beta_t^2\big)\big)\big]\, \alpha_t; \tag{9a}$$
$$\beta_{t+1} = \big[1 + \eta\big(1 - 3\big(\alpha_t^2 + \beta_t^2\big)\big)\big]\, \beta_t. \tag{9b}$$
This recursive system has three fixed points:
$$(\alpha, \beta) = (1, 0), \qquad (\alpha, \beta) = (0, 0), \qquad \text{and} \qquad (\alpha, \beta) = \big(0, 1/\sqrt{3}\big),$$
which correspond to the global minimizer, the local maximizer, and the saddle points, respectively, of the population objective function. We make note of the following key observations in the presence of a randomly initialized $x^0$, which will be formalized later in Lemma 1:



• the ratio $\alpha_t/\beta_t$ of the size of the signal component to that of the perpendicular component increases exponentially fast;

• the size $\alpha_t$ of the signal component keeps growing until it plateaus around 1;

• the size $\beta_t$ of the perpendicular component eventually drops towards zero.

In other words, when randomly initialized, $(\alpha_t, \beta_t)$ converges to $(1, 0)$ rapidly, thus indicating rapid convergence of $x^t$ to the truth $x^\natural$, without getting stuck at any undesirable saddle points. We also illustrate these phenomena numerically. Set $n = 1000$, $\eta_t \equiv 0.1$ and $x^0 \sim \mathcal{N}(0, n^{-1} I_n)$. Figure 4 displays the dynamics of $\alpha_t/\beta_t$, $\alpha_t$, and $\beta_t$, which are precisely as discussed above.

Figure 4: Population-level state evolution, plotted semilogarithmically: (a) the ratio $\alpha_t/\beta_t$ vs. iteration count, and (b) $\alpha_t$ and $\beta_t$ vs. iteration count. The results are shown for $n = 1000$, $\eta_t \equiv 0.1$, and $x^0 \sim \mathcal{N}(0, n^{-1} I_n)$ (assuming $\alpha_0 > 0$ though).
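The population-level recursion (9) is simple enough to iterate directly. The sketch below (our own illustration; we set $\alpha_0$ to the typical size $1/\sqrt{n}$ of a random initialization and $\beta_0$ to 1) prints $\alpha_t$, $\beta_t$, and their ratio, reproducing the three behaviors listed above.

```python
import numpy as np

def population_state_evolution(alpha, beta, eta=0.1, n_iters=50):
    """Iterate the population-level state evolution (9a)-(9b)."""
    for t in range(1, n_iters + 1):
        s = alpha ** 2 + beta ** 2
        alpha, beta = (1 + 3 * eta * (1 - s)) * alpha, (1 + eta * (1 - 3 * s)) * beta
        if t % 10 == 0:
            print(f"t={t:2d}  alpha={alpha:.3e}  beta={beta:.3e}  alpha/beta={alpha / beta:.3e}")

n = 1000
population_state_evolution(alpha=1 / np.sqrt(n), beta=1.0)
```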

2.2 Finite-sample analysis: a heuristic treatment

We now move on to the finite-sample regime, and examine how many samples are needed in order for the population dynamics to be reasonably accurate. Notably, the arguments in this subsection are heuristic in nature, but they are useful in developing insights into the true dynamics of the GD iterates.

Rewrite the gradient update rule (3) as
$$x^{t+1} = x^t - \eta \nabla f(x^t) = x^t - \eta \nabla F(x^t) - \eta \underbrace{\big(\nabla f(x^t) - \nabla F(x^t)\big)}_{:= r(x^t)}, \tag{10}$$
where $\nabla f(x) = m^{-1}\sum_{i=1}^m \big[(a_i^\top x)^2 - (a_i^\top x^\natural)^2\big]\, a_i a_i^\top x$. Assuming (unreasonably) that the iterate $x^t$ is independent of $\{a_i\}$, the central limit theorem (CLT) allows us to control the size of the fluctuation term $r(x^t)$. Take the signal component as an example: simple calculations give
$$x^{t+1}_\parallel = x^t_\parallel - \eta\big(\nabla F(x^t)\big)_1 - \eta\, r_1(x^t),$$
where
$$r_1(x) := \frac{1}{m}\sum_{i=1}^m \Big\{ \big[\big(a_i^\top x\big)^3 - a_{i,1}^2\, a_i^\top x\big]\, a_{i,1} - \mathbb{E}\Big[\big(\big(a_i^\top x\big)^3 - a_{i,1}^2\, a_i^\top x\big)\, a_{i,1}\Big] \Big\} \tag{11}$$
with $a_{i,1}$ the first entry of $a_i$. Owing to the preceding independence assumption, $r_1$ is the sum of $m$ i.i.d. zero-mean random variables. Assuming that $x^t$ never blows up so that $\|x^t\|_2 = O(1)$, one can apply the CLT to demonstrate that
$$|r_1(x^t)| \lesssim \sqrt{\mathrm{Var}\big(r_1(x^t)\big)\, \mathrm{poly}\log(m)} \lesssim \sqrt{\frac{\mathrm{poly}\log(m)}{m}} \tag{12}$$



with high probability, which is often negligible compared to the other terms. For instance, for the random initial guess $x^0 \sim \mathcal{N}(0, n^{-1} I_n)$ one has $x^0_\parallel \gtrsim 1/\sqrt{n \log n}$ with probability approaching one, telling us that
$$|r_1(x^0)| \lesssim \sqrt{\frac{\mathrm{poly}\log(m)}{m}} \ll |x^0_\parallel| \qquad \text{and} \qquad |r_1(x^0)| \ll \big|x^0_\parallel - \eta\big(\nabla F(x^0)\big)_1\big| \asymp |x^0_\parallel|$$
as long as $m \gtrsim n\, \mathrm{poly}\log(m)$. Similar observations hold true for the perpendicular component $x^t_\perp$.

Figure 5: Illustration of the region satisfying the “near-independence” property. Here, the green arrows represent the directions of $\{a_i\}_{1 \le i \le 20}$, and the blue region consists of all points such that the first entry $r_1(x)$ of the fluctuation $r(x) = \nabla f(x) - \nabla F(x)$ is bounded above in magnitude by $|x^t_\parallel|/5$ (or $|\langle x, x^\natural\rangle|/5$).

In summary, by assuming independence between $x^t$ and $\{a_i\}$, we arrive at an approximate state evolution for the finite-sample regime:
$$\alpha_{t+1} \approx \big[1 + 3\eta\big(1 - \big(\alpha_t^2 + \beta_t^2\big)\big)\big]\, \alpha_t; \tag{13a}$$
$$\beta_{t+1} \approx \big[1 + \eta\big(1 - 3\big(\alpha_t^2 + \beta_t^2\big)\big)\big]\, \beta_t, \tag{13b}$$
with the proviso that $m \gtrsim n\, \mathrm{poly}\log(m)$.
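One way to sanity-check this heuristic is to run finite-sample GD and iterate the approximate state evolution (13) side by side. The sketch below is our own illustration under the stated assumptions ($x^\natural = e_1$, Gaussian designs, $m = 10n$); the empirical pair $(\alpha_t, \beta_t)$ and the prediction from (13) typically track each other closely.

```python
import numpy as np

def compare_with_state_evolution(n=500, eta=0.1, n_iters=60, seed=1):
    rng = np.random.default_rng(seed)
    m = 10 * n
    A = rng.normal(size=(m, n))
    x_star = np.zeros(n)
    x_star[0] = 1.0                                   # x^natural = e_1
    y = (A @ x_star) ** 2
    x = rng.normal(0.0, 1.0 / np.sqrt(n), n)
    if x[0] < 0:                                      # enforce alpha_0 > 0 (the global sign is unrecoverable)
        x = -x
    alpha, beta = x[0], np.linalg.norm(x[1:])         # initialize the predicted trajectory (13)

    for t in range(1, n_iters + 1):
        Ax = A @ x
        x = x - eta * A.T @ ((Ax ** 2 - y) * Ax) / m  # empirical GD iterate
        s = alpha ** 2 + beta ** 2                    # predicted iterate via (13a)-(13b)
        alpha, beta = (1 + 3 * eta * (1 - s)) * alpha, (1 + eta * (1 - 3 * s)) * beta
        if t % 10 == 0:
            print(f"t={t:2d}  empirical=({x[0]:+.3f}, {np.linalg.norm(x[1:]):.3f})"
                  f"  predicted=({alpha:+.3f}, {beta:.3f})")

compare_with_state_evolution()
```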

2.3 Key analysis ingredients: near-independence and leave-one-out tricks

The preceding heuristic argument justifies the approximate validity of the population dynamics, under an independence assumption that never holds unless we use fresh samples in each iteration. On closer inspection, what we essentially need is the fluctuation term $r(x^t)$ (cf. (10)) being well-controlled. For instance, when focusing on the signal component, one needs $|r_1(x^t)| \ll |x^t_\parallel|$ for all $t \ge 0$. In particular, in the beginning iterations, $|x^t_\parallel|$ is as small as $O(1/\sqrt{n})$. Without the independence assumption, CLT-type results fail to hold due to the complicated dependency between $x^t$ and $\{a_i\}$. In fact, one can easily find many points that result in much larger remainder terms (as large as $O(1)$) and that violate the approximate state evolution (13). See Figure 5 for a caricature of the region where the fluctuation term $r(x^t)$ is well-controlled. As can be seen, it only occupies a tiny fraction of the neighborhood of $x^\natural$.

Fortunately, despite the complicated dependency across iterations, one can provably guarantee that $x^t$ always stays within the preceding desirable region in which $r(x^t)$ is well-controlled. The key idea is to exploit a certain “near-independence” property between $\{x^t\}$ and $\{a_i\}$. Towards this, we make use of a leave-one-out trick proposed in [MWCC17] for analyzing nonconvex iterative methods. In particular, we construct auxiliary sequences that are

1. independent of certain components of the design vectors $\{a_i\}$; and

2. extremely close to the original gradient sequence $\{x^t\}_{t \ge 0}$.



Figure 6: Illustration of the leave-one-out and random-sign sequences. (a) $\{x^t\}$ is constructed using all data $\{a_i, y_i\}$; (b) $\{x^{t,(l)}\}$ is constructed by discarding the $l$th sample $\{a_l, y_l\}$; (c) $\{x^{t,\mathrm{sgn}}\}$ is constructed by using auxiliary design vectors $\{a_i^{\mathrm{sgn}}\}$, where $a_i^{\mathrm{sgn}}$ is obtained by randomly flipping the sign of the first entry of $a_i$; (d) $\{x^{t,\mathrm{sgn},(l)}\}$ is constructed by discarding the $l$th sample $\{a_l^{\mathrm{sgn}}, y_l\}$.

As it turns out, we need to construct several auxiliary sequences $\{x^{t,(l)}\}_{t \ge 0}$, $\{x^{t,\mathrm{sgn}}\}_{t \ge 0}$ and $\{x^{t,\mathrm{sgn},(l)}\}_{t \ge 0}$, where $\{x^{t,(l)}\}_{t \ge 0}$ is independent of the $l$th sampling vector $a_l$, $\{x^{t,\mathrm{sgn}}\}_{t \ge 0}$ is independent of the sign information of the first entries of all $a_i$'s, and $\{x^{t,\mathrm{sgn},(l)}\}_{t \ge 0}$ is independent of both. In addition, these auxiliary sequences are constructed by slightly perturbing the original data (see Figure 6 for an illustration), and hence one can expect all of them to stay close to the original sequence throughout the execution of the algorithm. Taking these two properties together, one can propagate the above statistical independence underlying each auxiliary sequence to the true iterates $\{x^t\}$, which in turn allows us to obtain near-optimal control of the fluctuation term $r(x^t)$. The details are postponed to Section 4.

3 Related work

Solving systems of quadratic equations, or phase retrieval, has been studied extensively in the recent literature; see [SEC+15] for an overview. One popular method is convex relaxation (e.g. PhaseLift [CSV13]), which is guaranteed to work as long as $m/n$ exceeds some large enough constant [CL14, DH14, CCG15, CZ15, KRT17]. However, the resulting semidefinite program is computationally prohibitive for solving large-scale problems. To address this issue, [CLS15] proposed the Wirtinger flow algorithm with spectral initialization, which provides the first convergence guarantee for nonconvex methods without sample splitting. Both the sample and computation complexities were further improved by [CC17] with an adaptive truncation strategy. Other nonconvex phase retrieval methods include [NJS13, CLM+16, Sol17, WGE17, ZZLC17, WGSC17, CL16, DR17, GX16, CFL15, Wei15, BEB17, TV17, CLW17, ZWGC17, QZEW17, ZCL16, YYF+17, CWZG17, Zha17, MXM18]. Almost all of these nonconvex methods require carefully-designed initialization to guarantee a sufficiently accurate initial point. One exception is the approximate message passing algorithm proposed in [MXM18], which works as long as the correlation between the truth and the initial signal is bounded away from zero. This, however, does not accommodate the case when the initial signal strength is vanishingly small (like random initialization). Other works [Zha17, LGL15] explored the global convergence of alternating minimization / projection with random initialization which, however, require fresh samples at least in each of the first $O(\log n)$ iterations in order to enter the local basin. In addition, [LMZ17] explored low-rank recovery from quadratic measurements with near-zero initialization. Using a truncated least squares objective, [LMZ17] established approximate (but non-exact) recovery of over-parametrized GD. Notably, if we do not over-parametrize the phase retrieval problem, then GD with near-zero initialization is (nearly) equivalent to running the power method for spectral initialization³, which can be understood using prior theory.

³More specifically, the GD update $x^{t+1} = x^t - m^{-1}\eta_t \sum_{i=1}^m \big[(a_i^\top x^t)^2 - y_i\big]\, a_i a_i^\top x^t \approx \big(I + m^{-1}\eta_t \sum_{i=1}^m y_i a_i a_i^\top\big) x^t$ when $x^t \approx 0$, which is equivalent to a power iteration (without normalization) w.r.t. the data matrix $I + m^{-1}\eta_t \sum_{i=1}^m y_i a_i a_i^\top$.

Another related line of research is the design of generic saddle-point escaping algorithms, where the goal is to locate a second-order stationary point (i.e. the point with a vanishing gradient and a positive-semidefinite Hessian). As mentioned earlier, it has been shown by [SQW16] that as soon as $m \asymp n \log^3 n$, all local


minima are global and all the saddle points are strict. With these two geometric properties in mind, saddle-point escaping algorithms are guaranteed to converge globally for phase retrieval. Existing saddle-point escaping algorithms include but are not limited to Hessian-based methods [SQW16] (see also [AAZB+16, AZ17, JGN+17] for some reviews), noisy stochastic gradient descent [GHJY15], perturbed gradient descent [JGN+17], and normalized gradient descent [MSK17]. On the one hand, the results developed in these works are fairly general: they establish polynomial-time convergence guarantees under a few generic geometric conditions. On the other hand, the iteration complexity derived therein may be pessimistic when specialized to a particular problem.

Take phase retrieval and the perturbed gradient descent algorithm [JGN+17] as an example. It has been shown in [JGN+17, Theorem 5] that for an objective function that is $L$-gradient Lipschitz, $\rho$-Hessian Lipschitz, $(\theta, \gamma, \zeta)$-strict saddle, and also locally $\alpha$-strongly convex and $\beta$-smooth (see definitions in [JGN+17]), it takes⁴
$$O\left( \frac{L}{\big[\min(\theta, \gamma^2/\rho)\big]^2} + \frac{\beta}{\alpha}\log\frac{1}{\epsilon} \right) = O\left( n^3 + n\log\frac{1}{\epsilon} \right)$$
iterations (ignoring logarithmic factors) for perturbed gradient descent to converge to $\epsilon$-accuracy. In fact, even with Nesterov's accelerated scheme [JNJ17], the iteration complexity for entering the local region is at least
$$O\left( \frac{L^{1/2}\rho^{1/4}}{\big[\min(\theta, \gamma^2/\rho)\big]^{7/4}} \right) = O\big( n^{2.5} \big).$$
Both of them are much larger than the $O\big(\log n + \log(1/\epsilon)\big)$ complexity established herein. This is primarily due to the following facts: (i) the Lipschitz constants of both the gradients and the Hessians are quite large, i.e. $L \asymp n$ and $\rho \asymp n$ (ignoring log factors), which are, however, treated as dimension-independent constants in the aforementioned papers; (ii) the local condition number is also large, i.e. $\beta/\alpha \asymp n$. In comparison, as suggested by our theory, the GD iterates with random initialization are always confined within a restricted region enjoying much more benign geometry than the worst-case / global characterization.

Furthermore, the above saddle-escaping first-order methods are often more complicated than vanilla GD. Despite its algorithmic simplicity and wide use in practice, the convergence rate of GD with random initialization remains largely unknown. In fact, Du et al. [DJL+17] demonstrated that there exist non-pathological functions such that GD can take exponential time to escape the saddle points when initialized randomly. In contrast, as we have demonstrated, saddle points are not an issue for phase retrieval; the GD iterates with random initialization never get trapped in the saddle points.

Finally, the leave-one-out arguments have been invoked to analyze other high-dimensional statistical estimation problems including robust M-estimators [EKBB+13, EK15], statistical inference for Lasso [JM15], likelihood ratio tests for logistic regression [SCC17], etc. In addition, [ZB17, CFMW17, AFWZ17] made use of the leave-one-out trick to derive entrywise perturbation bounds for eigenvectors resulting from certain spectral methods. The techniques have also been applied by [MWCC17, LMCC18] to establish local linear convergence of vanilla GD for nonconvex statistical estimation problems in the presence of proper spectral initialization.

⁴When applied to phase retrieval with $m \asymp n\, \mathrm{poly}\log n$, one has $L \asymp n$, $\rho \asymp n$, $\theta \asymp \gamma \asymp 1$ (see [SQW16, Theorem 2.2]), $\alpha \asymp 1$, and $\beta \gtrsim n$ (ignoring logarithmic factors).

4 Analysis

In this section, we first provide a more general version of Theorem 1 as follows. It spells out exactly the conditions on $x^0$ in order for the gradient method with random initialization to succeed.

Theorem 2. Fix $x^\natural \in \mathbb{R}^n$. Suppose $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$ $(1 \le i \le m)$ and $m \ge C n \log^{13} m$ for some sufficiently large constant $C > 0$. Assume that the initialization $x^0$ is independent of $\{a_i\}$ and obeys
$$\frac{\langle x^0, x^\natural\rangle}{\|x^\natural\|_2^2} \ge \frac{1}{\sqrt{n \log n}} \qquad \text{and} \qquad \Big(1 - \frac{1}{\log n}\Big)\|x^\natural\|_2 \le \|x^0\|_2 \le \Big(1 + \frac{1}{\log n}\Big)\|x^\natural\|_2, \tag{14}$$


and that the stepsize satisfies $\eta_t \equiv \eta = c/\|x^\natural\|_2^2$ for some sufficiently small constant $c > 0$. Then there exist a sufficiently small absolute constant $0 < \gamma < 1$ and $T_\gamma \lesssim \log n$ such that with probability at least $1 - O(m^2 e^{-1.5n}) - O(m^{-9})$,

1. the GD iterates (3) converge linearly to $x^\natural$ after $t \ge T_\gamma$, namely,
$$\mathrm{dist}\big(x^t, x^\natural\big) \le \Big(1 - \frac{\eta\, \|x^\natural\|_2^2}{2}\Big)^{t - T_\gamma} \cdot \gamma\, \|x^\natural\|_2, \qquad \forall\, t \ge T_\gamma;$$

2. the magnitude ratio of the signal component $\frac{\langle x^t, x^\natural\rangle}{\|x^\natural\|_2^2}\, x^\natural$ to the perpendicular component $x^t - \frac{\langle x^t, x^\natural\rangle}{\|x^\natural\|_2^2}\, x^\natural$ obeys
$$\frac{\big\|\frac{\langle x^t, x^\natural\rangle}{\|x^\natural\|_2^2}\, x^\natural\big\|_2}{\big\|x^t - \frac{\langle x^t, x^\natural\rangle}{\|x^\natural\|_2^2}\, x^\natural\big\|_2} \gtrsim \frac{1}{\sqrt{n \log n}}\, \big(1 + c_1 \eta^2\big)^t, \qquad t = 0, 1, \cdots \tag{15}$$
for some constant $c_1 > 0$.

Several remarks regarding Theorem 2 are in order.

• The random initialization $x^0 \sim \mathcal{N}\big(0, n^{-1}\|x^\natural\|_2^2\, I_n\big)$ obeys the condition (14) with probability exceeding $1 - O(1/\sqrt{\log n})$, which in turn establishes Theorem 1.

• Our current sample complexity reads $m \gtrsim n \log^{13} m$, which is optimal up to logarithmic factors. It is possible to further reduce the logarithmic factors using more refined probabilistic tools, which we leave for future work.

The remainder of this section is then devoted to proving Theorem 2. Without loss of generality⁵, we will assume throughout that
$$x^\natural = e_1 \qquad \text{and} \qquad x^0_1 > 0. \tag{16}$$
Given this, one can decompose
$$x^t = x^t_\parallel\, e_1 + \begin{bmatrix} 0 \\ x^t_\perp \end{bmatrix}, \tag{17}$$
where $x^t_\parallel = x^t_1$ and $x^t_\perp = [x^t_i]_{2 \le i \le n}$ as introduced in Section 2. For notational simplicity, we define
$$\alpha_t := x^t_\parallel \qquad \text{and} \qquad \beta_t := \|x^t_\perp\|_2. \tag{18}$$
Intuitively, $\alpha_t$ represents the size of the signal component, whereas $\beta_t$ measures the size of the component perpendicular to the signal direction. In view of (16), we have $\alpha_0 > 0$.

⁵This is because of the rotational invariance of Gaussian distributions.

4.1 Outline of the proof

To begin with, it is easily seen that if $\alpha_t$ and $\beta_t$ (cf. (18)) obey $|\alpha_t - 1| \le \gamma/2$ and $\beta_t \le \gamma/2$, then
$$\mathrm{dist}\big(x^t, x^\natural\big) \le \|x^t - x^\natural\|_2 \le |\alpha_t - 1| + \beta_t \le \gamma.$$
Therefore, our first step — which is concerned with proving $\mathrm{dist}(x^t, x^\natural) \le \gamma$ — comes down to the following two steps.

1. Show that if $\alpha_t$ and $\beta_t$ satisfy the approximate state evolution (see (13)), then there exists some $T_\gamma = O(\log n)$ such that
$$|\alpha_{T_\gamma} - 1| \le \gamma/2 \qquad \text{and} \qquad \beta_{T_\gamma} \le \gamma/2, \tag{19}$$
which would immediately imply that
$$\mathrm{dist}\big(x^{T_\gamma}, x^\natural\big) \le \gamma.$$
Along the way, we will also show that the ratio $\alpha_t/\beta_t$ grows exponentially fast.

2. Justify that $\alpha_t$ and $\beta_t$ satisfy the approximate state evolution with high probability, using (some variants of) leave-one-out arguments.

After $t \ge T_\gamma$, we can invoke prior theory [MWCC17] concerning local convergence to show that, with high probability,
$$\mathrm{dist}\big(x^t, x^\natural\big) \le (1 - \rho)^{t - T_\gamma}\, \|x^{T_\gamma} - x^\natural\|_2, \qquad \forall\, t > T_\gamma$$
for some constant $0 < \rho < 1$ independent of $n$ and $m$.

4.2 Dynamics of approximate state evolution

This subsection formalizes our intuition in Section 2: as long as the approximate state evolution holds, then one can find $T_\gamma \lesssim \log n$ obeying condition (19). In particular, the approximate state evolution is given by
$$\alpha_{t+1} = \big[1 + 3\eta\big(1 - \big(\alpha_t^2 + \beta_t^2\big)\big) + \eta\zeta_t\big]\, \alpha_t, \tag{20a}$$
$$\beta_{t+1} = \big[1 + \eta\big(1 - 3\big(\alpha_t^2 + \beta_t^2\big)\big) + \eta\rho_t\big]\, \beta_t, \tag{20b}$$
where $\{\zeta_t\}$ and $\{\rho_t\}$ represent the perturbation terms. Our result is this:

Lemma 1. Let $\gamma > 0$ be some sufficiently small constant, and consider the approximate state evolution (20). Suppose the initial point obeys
$$\alpha_0 \ge \frac{1}{\sqrt{n \log n}} \qquad \text{and} \qquad 1 - \frac{1}{\log n} \le \sqrt{\alpha_0^2 + \beta_0^2} \le 1 + \frac{1}{\log n}, \tag{21}$$
and the perturbation terms satisfy
$$\max\{|\zeta_t|, |\rho_t|\} \le \frac{c_3}{\log n}, \qquad t = 0, 1, \cdots$$
for some sufficiently small constant $c_3 > 0$.

(a) Let
$$T_\gamma := \min\big\{t : |\alpha_t - 1| \le \gamma/2 \text{ and } \beta_t \le \gamma/2\big\}. \tag{22}$$
Then for any sufficiently large $n$ and $m$ and any sufficiently small constant $\eta > 0$, one has
$$T_\gamma \lesssim \log n, \tag{23}$$
and there exist some constants $c_5, c_{10} > 0$ independent of $n$ and $m$ such that
$$\frac{1}{2\sqrt{n \log n}} \le \alpha_t \le 2, \qquad c_5 \le \beta_t \le 1.5 \qquad \text{and} \qquad \frac{\alpha_{t+1}/\alpha_t}{\beta_{t+1}/\beta_t} \ge 1 + c_{10}\eta^2, \qquad 0 \le t \le T_\gamma. \tag{24}$$

(b) If we define
$$T_0 := \min\big\{t : \alpha_{t+1} \ge c_6/\log^5 m\big\}, \tag{25}$$
$$T_1 := \min\big\{t : \alpha_{t+1} > c_4\big\} \tag{26}$$
for some arbitrarily small constants $c_4, c_6 > 0$, then 1) $T_0 \le T_1 \le T_\gamma \lesssim \log n$; $T_1 - T_0 \lesssim \log\log m$; $T_\gamma - T_1 \lesssim 1$; 2) for $T_0 < t \le T_\gamma$, one has $\alpha_t \ge c_6/\log^5 m$.

Proof. See Appendix B.

Remark 2. Recall that $\gamma$ is sufficiently small and $(\alpha, \beta) = (1, 0)$ represents the global minimizer. Since $|\alpha_0 - 1| \approx 1$, one has $T_\gamma > 0$, which denotes the first time when the iterates enter the local region surrounding the global minimizer. In addition, the fact that $\alpha_0 \lesssim 1/\sqrt{n}$ gives $T_0 > 0$ and $T_1 > 0$, both of which indicate the first time when the signal strength is sufficiently large.

Lemma 1 makes precise that under the approximate state evolution, the first stage enjoys a fairly short duration $T_\gamma \lesssim \log n$. Moreover, the size of the signal component grows faster than that of the perpendicular component for any iteration $t < T_\gamma$, thus confirming the exponential growth of $\alpha_t/\beta_t$. In addition, Lemma 1 identifies two midpoints $T_0$ and $T_1$ when the sizes of the signal component $\alpha_t$ become sufficiently large. These are helpful in our subsequent analysis. In what follows, we will divide Stage 1 (which consists of all iterations up to $T_\gamma$) into two phases:

• Phase I: consider the duration $0 \le t \le T_0$;

• Phase II: consider all iterations with $T_0 < t \le T_\gamma$.

We will justify the approximate state evolution (20) for these two phases separately.
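As a numerical check of Lemma 1(a) (our own illustration, not part of the formal proof), one can iterate the perturbed recursion (20) with constant perturbations of the allowed size $c_3/\log n$, starting from the smallest admissible signal size in (21), and record the first time at which condition (19) holds; the hitting time grows roughly logarithmically in $n$.

```python
import numpy as np

def hitting_time(n, eta=0.1, gamma=0.1, c3=0.1, max_iters=100_000):
    """First time at which |alpha_t - 1| <= gamma/2 and beta_t <= gamma/2 (cf. (19))."""
    alpha, beta = 1.0 / np.sqrt(n * np.log(n)), 1.0   # smallest admissible alpha_0 in (21)
    zeta = rho = -c3 / np.log(n)                      # constant perturbations of the allowed size
    for t in range(max_iters):
        if abs(alpha - 1) <= gamma / 2 and beta <= gamma / 2:
            return t
        s = alpha ** 2 + beta ** 2
        alpha = (1 + 3 * eta * (1 - s) + eta * zeta) * alpha   # perturbed update (20a)
        beta = (1 + eta * (1 - 3 * s) + eta * rho) * beta      # perturbed update (20b)
    return None

for n in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5):
    print(f"n={n:>6d}  hitting time={hitting_time(n)}")
```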

4.3 Motivation of the leave-one-out approach

As we have alluded to in Section 2.3, the main difficulty in establishing the approximate state evolution (20) lies in controlling the perturbation terms to the desired orders (i.e. $|\zeta_t|, |\rho_t| \ll 1/\log n$ in Lemma 1). To achieve this, we advocate the use of (some variants of) leave-one-out sequences to help establish certain “near-independence” between $x^t$ and certain components of $\{a_i\}$.

We begin by taking a closer look at the perturbation terms. Regarding the signal component, it is easily seen from (11) that
$$x^{t+1}_\parallel = \big[1 + 3\eta\big(1 - \|x^t\|_2^2\big)\big]\, x^t_\parallel - \eta\, r_1(x^t),$$
where the perturbation term $r_1(x^t)$ obeys

$$r_1(x^t) = \big[1 - \big(x^t_\parallel\big)^2\big]\, x^t_\parallel \underbrace{\Bigg(\frac{1}{m}\sum_{i=1}^m a_{i,1}^4 - 3\Bigg)}_{:= I_1} + \big[1 - 3\big(x^t_\parallel\big)^2\big] \underbrace{\Bigg(\frac{1}{m}\sum_{i=1}^m a_{i,1}^3\, a_{i,\perp}^\top x^t_\perp\Bigg)}_{:= I_2}$$
$$\qquad\qquad - 3 x^t_\parallel \underbrace{\Bigg(\frac{1}{m}\sum_{i=1}^m \big(a_{i,\perp}^\top x^t_\perp\big)^2 a_{i,1}^2 - \big\|x^t_\perp\big\|_2^2\Bigg)}_{:= I_3} - \underbrace{\frac{1}{m}\sum_{i=1}^m \big(a_{i,\perp}^\top x^t_\perp\big)^3 a_{i,1}}_{:= I_4}. \tag{27}$$

Here and throughout the paper, for any vector $v \in \mathbb{R}^n$, $v_\perp \in \mathbb{R}^{n-1}$ denotes the 2nd through the $n$th entries of $v$. Due to the dependency between $x^t$ and $\{a_i\}$, it is challenging to obtain sharp control of some of these terms. In what follows, we use the term $I_4$ to explain and motivate our leave-one-out approach.

As discussed in Section 2.3, $I_4$ needs to be controlled to the level $O\big(1/(\sqrt{n}\, \mathrm{poly}\log(n))\big)$. This precludes us from seeking a uniform bound on the function $h(x) := m^{-1}\sum_{i=1}^m (a_{i,\perp}^\top x_\perp)^3 a_{i,1}$ over all $x$ (or even all $x$ within the set $\mathcal{C}$ incoherent with $\{a_i\}$), since the uniform bound $\sup_{x \in \mathcal{C}} |h(x)|$ can be $O\big(\sqrt{n}/\mathrm{poly}\log(n)\big)$ times larger than the desired order. In order to control $I_4$ to the desirable order, one strategy is to approximate it by a sum of independent variables and then invoke the CLT. Specifically, we first rewrite $I_4$ as

$$I_4 = \frac{1}{m}\sum_{i=1}^m \big(a_{i,\perp}^\top x^t_\perp\big)^3\, |a_{i,1}|\, \xi_i$$
with $\xi_i := \mathrm{sgn}(a_{i,1})$. Here $\mathrm{sgn}(\cdot)$ denotes the usual sign function. To exploit the statistical independence between $\xi_i$ and $\{|a_{i,1}|, a_{i,\perp}\}$, we would like to identify some vector independent of $\xi_i$ that well approximates $x^t$. If this can be done, then one may treat $I_4$ as a weighted independent sum of $\{\xi_i\}$. Viewed in this light, our plan is the following:

1. Construct a sequence $\{x^{t,\mathrm{sgn}}\}$ independent of $\{\xi_i\}$ obeying $x^{t,\mathrm{sgn}} \approx x^t$, so that
$$I_4 \approx \frac{1}{m}\sum_{i=1}^m \underbrace{\big(a_{i,\perp}^\top x^{t,\mathrm{sgn}}_\perp\big)^3\, |a_{i,1}|}_{:= w_i}\, \xi_i.$$


Algorithm 1 The $l$th leave-one-out sequence
Input: $\{a_i\}_{1\le i\le m,\, i\neq l}$, $\{y_i\}_{1\le i\le m,\, i\neq l}$, and $x^0$.
Gradient updates: for $t = 0, 1, 2, \ldots, T-1$ do
\[
  x^{t+1,(l)} = x^{t,(l)} - \eta_t \nabla f^{(l)}\big(x^{t,(l)}\big), \tag{29}
\]
where $x^{0,(l)} = x^0$ and $f^{(l)}(x) = \frac{1}{4m}\sum_{i:\, i\neq l}\big[(a_i^{\top} x)^2 - (a_i^{\top} x^{\natural})^2\big]^2$.
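For concreteness, the following is a minimal NumPy sketch (ours, not the paper's code) of the gradient updates behind the original sequence and the leave-one-out sequence of Algorithm 1, assuming the least-squares loss $f(x) = \frac{1}{4m}\sum_i[(a_i^{\top}x)^2 - y_i]^2$ and a constant stepsize; the helper names, problem sizes and stepsize are illustrative.

```python
import numpy as np

def grad_f(x, A, y):
    """Gradient of f(x) = (1/4m) * sum_i [ (a_i^T x)^2 - y_i ]^2."""
    return A.T @ (((A @ x) ** 2 - y) * (A @ x)) / A.shape[0]

def run_gd(x0, A, y, eta=0.1, T=200):
    """Vanilla gradient descent (Wirtinger flow) with a constant stepsize."""
    x = x0.copy()
    for _ in range(T):
        x = x - eta * grad_f(x, A, y)
    return x

def run_leave_one_out(x0, A, y, l, eta=0.1, T=200):
    """Algorithm 1: drop the l-th sample and run the same gradient updates."""
    keep = np.arange(A.shape[0]) != l
    return run_gd(x0, A[keep], y[keep], eta=eta, T=T)

# Illustrative usage with x^natural = e_1 and Gaussian designs.
rng = np.random.default_rng(0)
n, m = 50, 500
x_star = np.eye(n)[0]
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2
x0 = rng.standard_normal(n) / np.sqrt(n)      # random initialization

x_T = run_gd(x0, A, y)
x_T_l = run_leave_one_out(x0, A, y, l=0)
print(np.linalg.norm(x_T - x_T_l))            # the two sequences stay extremely close
```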

Algorithm 2 The random-sign sequence
Input: $\{|a_{i,1}|\}_{1\le i\le m}$, $\{a_{i,\perp}\}_{1\le i\le m}$, $\{\xi_i^{\mathrm{sgn}}\}_{1\le i\le m}$, $\{y_i\}_{1\le i\le m}$, $x^0$.
Gradient updates: for $t = 0, 1, 2, \ldots, T-1$ do
\[
  x^{t+1,\mathrm{sgn}} = x^{t,\mathrm{sgn}} - \eta_t \nabla f^{\mathrm{sgn}}\big(x^{t,\mathrm{sgn}}\big), \tag{30}
\]
where $x^{0,\mathrm{sgn}} = x^0$ and $f^{\mathrm{sgn}}(x) = \frac{1}{4m}\sum_{i=1}^m\big[(a_i^{\mathrm{sgn}\top} x)^2 - (a_i^{\mathrm{sgn}\top} x^{\natural})^2\big]^2$ with $a_i^{\mathrm{sgn}} := \begin{bmatrix} \xi_i^{\mathrm{sgn}} |a_{i,1}| \\ a_{i,\perp} \end{bmatrix}$.

One can then apply standard concentration results (e.g. the Bernstein inequality) to control $I_4$, as long as none of the weights $w_i$ is exceedingly large.

2. Demonstrate that the weight $w_i$ is well-controlled, or equivalently, that $|a_{i,\perp}^{\top} x_{\perp}^{t,\mathrm{sgn}}|$ ($1 \le i \le m$) is not much larger than its typical size. This can be accomplished by identifying another sequence $\{x^{t,(i)}\}$ independent of $a_i$ such that $x^{t,(i)} \approx x^t \approx x^{t,\mathrm{sgn}}$, followed by the argument
\[
  \big|a_{i,\perp}^{\top} x_{\perp}^{t,\mathrm{sgn}}\big| \approx \big|a_{i,\perp}^{\top} x_{\perp}^{t}\big| \approx \big|a_{i,\perp}^{\top} x_{\perp}^{t,(i)}\big| \lesssim \sqrt{\log m}\, \big\|x_{\perp}^{t,(i)}\big\|_2 \approx \sqrt{\log m}\, \big\|x_{\perp}^{t}\big\|_2. \tag{28}
\]
Here, the inequality follows from standard Gaussian tail bounds and the independence between $a_i$ and $x^{t,(i)}$. This explains why we would like to construct $\{x^{t,(i)}\}$ for each $1 \le i \le m$. As we will detail in the next subsection, such auxiliary sequences are constructed by leaving out a small amount of relevant information from the collected data before running the GD algorithm, which is a variant of the "leave-one-out" approach rooted in probability theory and random matrix theory.
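As a quick sanity check on the rewriting of $I_4$ as a weighted sum of independent signs, the sketch below (illustrative, not from the paper) evaluates $I_4$ both directly and via $I_4 = m^{-1}\sum_i (a_{i,\perp}^{\top} x_{\perp})^3 |a_{i,1}| \xi_i$ with $\xi_i = \mathrm{sgn}(a_{i,1})$, for a vector $x$ drawn independently of the data; in that independent case the term is indeed small (of order $1/\sqrt{m}$ up to constants and logarithmic factors), which is exactly the behavior the leave-one-out construction is designed to recover for the data-dependent iterate $x^t$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5000
A = rng.standard_normal((m, n))
a1, A_perp = A[:, 0], A[:, 1:]                 # a_{i,1} and a_{i,perp}

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                         # a fixed unit vector, independent of the data
x_perp = x[1:]

I4_direct = np.mean((A_perp @ x_perp) ** 3 * a1)
xi = np.sign(a1)                               # xi_i = sgn(a_{i,1})
w = (A_perp @ x_perp) ** 3 * np.abs(a1)        # the weights w_i
I4_signed = np.mean(w * xi)

print(np.isclose(I4_direct, I4_signed))        # True: the two expressions coincide
print(abs(I4_direct), 1 / np.sqrt(m))          # both small: I4 ~ 1/sqrt(m) up to constants/logs
```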

4.4 Leave-one-out and random-sign sequences

We now describe how to design auxiliary sequences to help establish certain independence properties between the gradient iterates $\{x^t\}$ and the design vectors $\{a_i\}$. In the sequel, we formally define the three sets of auxiliary sequences $\{x^{t,(l)}\}$, $\{x^{t,\mathrm{sgn}}\}$, $\{x^{t,\mathrm{sgn},(l)}\}$ as introduced in Section 2.3 and Section 4.3.

• Leave-one-out sequences $\{x^{t,(l)}\}_{t\ge 0}$. For each $1 \le l \le m$, we introduce a sequence $\{x^{t,(l)}\}$, which drops the $l$th sample and runs GD w.r.t. the auxiliary objective function
\[
  f^{(l)}(x) = \frac{1}{4m}\sum_{i:\, i\neq l}\Big[\big(a_i^{\top} x\big)^2 - \big(a_i^{\top} x^{\natural}\big)^2\Big]^2. \tag{32}
\]
See Algorithm 1 for details and also Figure 6(a) for an illustration. One of the most important features of $\{x^{t,(l)}\}$ is that all of its iterates are statistically independent of $(a_l, y_l)$, and hence are incoherent with $a_l$ with high probability, in the sense that $|a_l^{\top} x^{t,(l)}| \lesssim \sqrt{\log m}\, \|x^{t,(l)}\|_2$. Such incoherence properties further allow us to control both $|a_l^{\top} x^t|$ and $|a_l^{\top} x^{t,\mathrm{sgn}}|$ (see (28)), which is crucial for controlling the size of the residual terms (e.g. $r_1(x^t)$ as defined in (11)). Notably, the sequence $\{x^{t,(l)}\}$ has also been applied by [MWCC17] to justify the success of GD with spectral initialization for several nonconvex statistical estimation problems.


Algorithm 3 The $l$th leave-one-out and random-sign sequence
Input: $\{|a_{i,1}|\}_{1\le i\le m,\, i\neq l}$, $\{a_{i,\perp}\}_{1\le i\le m,\, i\neq l}$, $\{\xi_i^{\mathrm{sgn}}\}_{1\le i\le m,\, i\neq l}$, $\{y_i\}_{1\le i\le m,\, i\neq l}$, $x^0$.
Gradient updates: for $t = 0, 1, 2, \ldots, T-1$ do
\[
  x^{t+1,\mathrm{sgn},(l)} = x^{t,\mathrm{sgn},(l)} - \eta_t \nabla f^{\mathrm{sgn},(l)}\big(x^{t,\mathrm{sgn},(l)}\big), \tag{31}
\]
where $x^{0,\mathrm{sgn},(l)} = x^0$ and $f^{\mathrm{sgn},(l)}(x) = \frac{1}{4m}\sum_{i:\, i\neq l}\big[(a_i^{\mathrm{sgn}\top} x)^2 - (a_i^{\mathrm{sgn}\top} x^{\natural})^2\big]^2$ with $a_i^{\mathrm{sgn}} := \begin{bmatrix} \xi_i^{\mathrm{sgn}} |a_{i,1}| \\ a_{i,\perp} \end{bmatrix}$.

• Random-sign sequence $\{x^{t,\mathrm{sgn}}\}_{t\ge 0}$. Introduce a collection of auxiliary design vectors $\{a_i^{\mathrm{sgn}}\}_{1\le i\le m}$ defined as
\[
  a_i^{\mathrm{sgn}} := \begin{bmatrix} \xi_i^{\mathrm{sgn}} |a_{i,1}| \\ a_{i,\perp} \end{bmatrix}, \tag{33}
\]
where $\{\xi_i^{\mathrm{sgn}}\}_{1\le i\le m}$ is a set of Rademacher random variables independent of $\{a_i\}$, i.e.
\[
  \xi_i^{\mathrm{sgn}} \overset{\mathrm{i.i.d.}}{=} \begin{cases} 1, & \text{with probability } 1/2, \\ -1, & \text{else,} \end{cases} \qquad 1 \le i \le m. \tag{34}
\]
In words, $a_i^{\mathrm{sgn}}$ is generated by randomly flipping the sign of the first entry of $a_i$. To simplify the notation hereafter, we also denote
\[
  \xi_i = \mathrm{sgn}(a_{i,1}). \tag{35}
\]
As a result, $a_i$ and $a_i^{\mathrm{sgn}}$ differ only by a single bit of information. With these auxiliary design vectors in place, we generate a sequence $\{x^{t,\mathrm{sgn}}\}$ by running GD w.r.t. the auxiliary loss function
\[
  f^{\mathrm{sgn}}(x) = \frac{1}{4m}\sum_{i=1}^m \Big[\big(a_i^{\mathrm{sgn}\top} x\big)^2 - \big(a_i^{\mathrm{sgn}\top} x^{\natural}\big)^2\Big]^2. \tag{36}
\]

One simple yet important feature of these new design vectors is that they produce the same measurements as $\{a_i\}$:
\[
  \big(a_i^{\top} x^{\natural}\big)^2 = \big(a_i^{\mathrm{sgn}\top} x^{\natural}\big)^2 = |a_{i,1}|^2, \qquad 1 \le i \le m. \tag{37}
\]
See Figure 6(b) for an illustration and Algorithm 2 for the detailed procedure. This sequence is introduced in order to "randomize" certain Gaussian polynomials (e.g. $I_4$ in (27)), which in turn enables optimal control of these quantities. This is particularly crucial at the initial stage of the algorithm.

• Leave-one-out and random-sign sequences $\{x^{t,\mathrm{sgn},(l)}\}_{t\ge 0}$. Furthermore, we also need to introduce another collection of sequences $\{x^{t,\mathrm{sgn},(l)}\}$ by simultaneously employing the new design vectors $\{a_i^{\mathrm{sgn}}\}$ and discarding a single sample $(a_l^{\mathrm{sgn}}, y_l)$. This enables us to propagate the kinds of independence properties across the above two sets of sequences, which is useful in demonstrating that $x^t$ is jointly "nearly-independent" of both $a_l$ and $\{\mathrm{sgn}(a_{i,1})\}$. See Algorithm 3 and Figure 6(c).

As a remark, all of these leave-one-out and random-sign procedures are assumed to start from the same initial point as the original sequence, namely,
\[
  x^0 = x^{0,(l)} = x^{0,\mathrm{sgn}} = x^{0,\mathrm{sgn},(l)}, \qquad 1 \le l \le m. \tag{38}
\]

4.5 Justification of approximate state evolution for Phase I of Stage 1

Recall that Phase I consists of the iterations $0 \le t \le T_0$, where
\[
  T_0 = \min\Big\{t : \alpha_{t+1} \ge \frac{c_6}{\log^5 m}\Big\}. \tag{39}
\]
Our goal here is to show that the approximate state evolution (20) for both the size $\alpha_t$ of the signal component and the size $\beta_t$ of the perpendicular component holds true throughout Phase I. Our proof will be inductive in nature. Specifically, we will first identify a set of induction hypotheses that are helpful in proving the validity of the approximate state evolution (20), and then proceed by establishing these hypotheses via induction.

4.5.1 Induction hypotheses

For the sake of clarity, we first list all the induction hypotheses:
\begin{align}
  \max_{1\le l\le m} \big\|x^t - x^{t,(l)}\big\|_2 &\le \beta_t \Big(1 + \frac{1}{\log m}\Big)^{t} C_1 \frac{\sqrt{n\log^5 m}}{m}, \tag{40a}\\
  \max_{1\le l\le m} \big|x_{\parallel}^t - x_{\parallel}^{t,(l)}\big| &\le \alpha_t \Big(1 + \frac{1}{\log m}\Big)^{t} C_2 \frac{\sqrt{n\log^{12} m}}{m}, \tag{40b}\\
  \big\|x^t - x^{t,\mathrm{sgn}}\big\|_2 &\le \alpha_t \Big(1 + \frac{1}{\log m}\Big)^{t} C_3 \sqrt{\frac{n\log^5 m}{m}}, \tag{40c}\\
  \max_{1\le l\le m} \big\|x^t - x^{t,\mathrm{sgn}} - x^{t,(l)} + x^{t,\mathrm{sgn},(l)}\big\|_2 &\le \alpha_t \Big(1 + \frac{1}{\log m}\Big)^{t} C_4 \frac{\sqrt{n\log^{9} m}}{m}, \tag{40d}\\
  c_5 \le \big\|x_{\perp}^t\big\|_2 &\le \big\|x^t\big\|_2 \le C_5, \tag{40e}\\
  \big\|x^t\big\|_2 &\le 4\alpha_t \sqrt{n\log m}, \tag{40f}
\end{align}

where $C_1, \cdots, C_5$ and $c_5$ are some absolute positive constants.

Now we are ready to prove an immediate consequence of the induction hypotheses (40): if (40) hold for the $t$th iteration, then $\alpha_{t+1}$ and $\beta_{t+1}$ follow the approximate state evolution (see (20)). This is justified in the following lemma.

Lemma 2. Suppose $m \ge Cn\log^{11} m$ for some sufficiently large constant $C > 0$. For any $0 \le t \le T_0$ (cf. (39)), if the $t$th iterates satisfy the induction hypotheses (40), then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\begin{align}
  \alpha_{t+1} &= \big[1 + 3\eta\big(1 - \alpha_t^2 - \beta_t^2\big) + \eta\zeta_t\big]\,\alpha_t; \tag{41a}\\
  \beta_{t+1} &= \big[1 + \eta\big(1 - 3\big(\alpha_t^2 + \beta_t^2\big)\big) + \eta\rho_t\big]\,\beta_t \tag{41b}
\end{align}
hold for some $|\zeta_t| \ll 1/\log m$ and $|\rho_t| \ll 1/\log m$.

Proof. See Appendix C.

It remains to inductively show that the hypotheses hold for all $0 \le t \le T_0$. Before proceeding to this induction step, it is helpful to first develop more understanding about the preceding hypotheses.

1. In words, (40a), (40b), (40c) specify that the leave-one-out sequences $\{x^{t,(l)}\}$ and $\{x^{t,\mathrm{sgn}}\}$ are exceedingly close to the original sequence $\{x^t\}$. Similarly, the difference between $x^t - x^{t,\mathrm{sgn}}$ and $x^{t,(l)} - x^{t,\mathrm{sgn},(l)}$ is extremely small, as asserted in (40d). The hypothesis (40e) says that the norm of the iterates $\{x^t\}$ is always bounded from above and from below in Phase I. The last one (40f) indicates that the size $\alpha_t$ of the signal component is never too small compared with $\|x^t\|_2$.

2. Another property worth mentioning is the growth rate (with respect to $t$) of the quantities appearing in the induction hypotheses (40). For instance, $|x_{\parallel}^t - x_{\parallel}^{t,(l)}|$, $\|x^t - x^{t,\mathrm{sgn}}\|_2$ and $\|x^t - x^{t,\mathrm{sgn}} - x^{t,(l)} + x^{t,\mathrm{sgn},(l)}\|_2$ grow more or less at the same rate as $\alpha_t$ (modulo some $(1 + 1/\log m)^{T_0}$ factor). In contrast, $\|x^t - x^{t,(l)}\|_2$ shares the same growth rate with $\beta_t$ (modulo the $(1 + 1/\log m)^{T_0}$ factor). See Figure 7 for an illustration. The difference in the growth rates turns out to be crucial in establishing the advertised result.

3. Last but not least, we emphasize the sizes of the quantities of interest in (40) for $t = 1$ under the Gaussian initialization. Ignoring all of the $\log m$ terms and recognizing that $\alpha_1 \asymp 1/\sqrt{n}$ and $\beta_1 \asymp 1$, one sees that $\|x^1 - x^{1,(l)}\|_2 \lesssim 1/\sqrt{m}$, $|x_{\parallel}^1 - x_{\parallel}^{1,(l)}| \lesssim 1/m$, $\|x^1 - x^{1,\mathrm{sgn}}\|_2 \lesssim 1/\sqrt{m}$ and $\|x^1 - x^{1,\mathrm{sgn}} - x^{1,(l)} + x^{1,\mathrm{sgn},(l)}\|_2 \lesssim 1/m$. See Figure 7 for an illustration of the trends of the above four quantities.

Several consequences of (40) regarding the incoherence between $\{x^t\}$, $\{x^{t,\mathrm{sgn}}\}$ and $\{a_i\}$, $\{a_i^{\mathrm{sgn}}\}$ are immediate, as summarized in the following lemma.


[Figure 7 graphics omitted in this extraction: two semilogarithmic plots of the four difference quantities vs. iteration count, panel (a) Stage 1 and panel (b) Stage 1 and Stage 2; see the caption below.]

Figure 7: Illustration of the differences among the leave-one-out and original sequences vs. iteration count, plotted semilogarithmically. The results are shown for $n = 1000$ with $m = 10n$, $\eta_t \equiv 0.1$, and $\|x^{\natural}\|_2 = 1$. (a) The four differences increase in Stage 1. From the induction hypotheses (40), our upper bounds on $|x_{\parallel}^t - x_{\parallel}^{t,(l)}|$, $\|x^t - x^{t,\mathrm{sgn}}\|_2$ and $\|x^t - x^{t,\mathrm{sgn}} - x^{t,(l)} + x^{t,\mathrm{sgn},(l)}\|_2$ scale linearly with $\alpha_t$, whereas the upper bound on $\|x^t - x^{t,(l)}\|_2$ is proportional to $\beta_t$. In addition, $\|x^1 - x^{1,(l)}\|_2 \lesssim 1/\sqrt{m}$, $|x_{\parallel}^1 - x_{\parallel}^{1,(l)}| \lesssim 1/m$, $\|x^1 - x^{1,\mathrm{sgn}}\|_2 \lesssim 1/\sqrt{m}$ and $\|x^1 - x^{1,\mathrm{sgn}} - x^{1,(l)} + x^{1,\mathrm{sgn},(l)}\|_2 \lesssim 1/m$. (b) The four differences converge to zero geometrically fast in Stage 2, as all the (variants of) leave-one-out sequences and the original sequence converge to the truth $x^{\natural}$.

Lemma 3. Suppose that $m \ge Cn\log^6 m$ for some sufficiently large constant $C > 0$ and that the $t$th iterates satisfy the induction hypotheses (40) for $t \le T_0$. Then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\begin{align*}
  \max_{1\le l\le m} \big|a_l^{\top} x^{t}\big| &\lesssim \sqrt{\log m}\,\big\|x^{t}\big\|_2;\\
  \max_{1\le l\le m} \big|a_{l,\perp}^{\top} x_{\perp}^{t}\big| &\lesssim \sqrt{\log m}\,\big\|x_{\perp}^{t}\big\|_2;\\
  \max_{1\le l\le m} \big|a_l^{\top} x^{t,\mathrm{sgn}}\big| &\lesssim \sqrt{\log m}\,\big\|x^{t,\mathrm{sgn}}\big\|_2;\\
  \max_{1\le l\le m} \big|a_{l,\perp}^{\top} x_{\perp}^{t,\mathrm{sgn}}\big| &\lesssim \sqrt{\log m}\,\big\|x_{\perp}^{t,\mathrm{sgn}}\big\|_2;\\
  \max_{1\le l\le m} \big|a_l^{\mathrm{sgn}\top} x^{t,\mathrm{sgn}}\big| &\lesssim \sqrt{\log m}\,\big\|x^{t,\mathrm{sgn}}\big\|_2.
\end{align*}

Proof. These incoherence conditions typically arise from the independence between $\{x^{t,(l)}\}$ and $a_l$. For instance, the first line follows since
\[
  \big|a_l^{\top} x^{t}\big| \approx \big|a_l^{\top} x^{t,(l)}\big| \lesssim \sqrt{\log m}\,\big\|x^{t,(l)}\big\|_2 \asymp \sqrt{\log m}\,\big\|x^{t}\big\|_2.
\]
See Appendix M for detailed proofs.
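The Gaussian tail bound underlying these incoherence conditions is also easy to check numerically; the sketch below (illustrative, not from the paper) shows that for a fixed unit vector independent of the design, $\max_l |a_l^{\top} z|$ is of order $\sqrt{\log m}$, whereas a direction constructed from one of the design vectors themselves is of order $\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 20000
A = rng.standard_normal((m, n))

z = rng.standard_normal(n)
z /= np.linalg.norm(z)                           # fixed unit vector, independent of {a_l}
print(np.max(np.abs(A @ z)), np.sqrt(2 * np.log(m)))   # comparable, i.e. of order sqrt(log m)

w = A[0] / np.linalg.norm(A[0])                  # a direction built from a_0 (data dependent)
print(np.abs(A[0] @ w), np.sqrt(n))              # ~ sqrt(n), far exceeding sqrt(log m)
```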

4.5.2 Induction step

We then turn to showing that the induction hypotheses (40) hold throughout Phase I, i.e. for $0 \le t \le T_0$. The base case can be easily verified because of the identical initial points (38). Now we move on to the inductive step, i.e. we aim to show that if the hypotheses (40) are valid up to the $t$th iteration for some $t \le T_0$, then they continue to hold for the $(t+1)$th iteration. The first lemma concerns the difference between the leave-one-out sequence $x^{t+1,(l)}$ and the true sequence $x^{t+1}$ (see (40a)).


Lemma 4. Suppose $m \ge Cn\log^5 m$ for some sufficiently large constant $C > 0$. If the induction hypotheses (40) hold true up to the $t$th iteration for some $t \le T_0$, then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\[
  \max_{1\le l\le m} \big\|x^{t+1} - x^{t+1,(l)}\big\|_2 \le \beta_{t+1}\Big(1 + \frac{1}{\log m}\Big)^{t+1} C_1 \frac{\sqrt{n\log^5 m}}{m} \tag{43}
\]
holds as long as $\eta > 0$ is a sufficiently small constant and $C_1 > 0$ is sufficiently large.

Proof. See Appendix D.

The next lemma characterizes a finer relation between $x^{t+1}$ and $x^{t+1,(l)}$ when projected onto the signal direction (cf. (40b)).

Lemma 5. Suppose $m \ge Cn\log^6 m$ for some sufficiently large constant $C > 0$. If the induction hypotheses (40) hold true up to the $t$th iteration for some $t \le T_0$, then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\[
  \max_{1\le l\le m} \big|x_{\parallel}^{t+1} - x_{\parallel}^{t+1,(l)}\big| \le \alpha_{t+1}\Big(1 + \frac{1}{\log m}\Big)^{t+1} C_2 \frac{\sqrt{n\log^{12} m}}{m} \tag{44}
\]
holds as long as $\eta > 0$ is a sufficiently small constant and $C_2 \gg C_4$.

Proof. See Appendix E.

Regarding the difference between $x^t$ and $x^{t,\mathrm{sgn}}$ (see (40c)), we have the following result.

Lemma 6. Suppose $m \ge Cn\log^5 m$ for some sufficiently large constant $C > 0$. If the induction hypotheses (40) hold true up to the $t$th iteration for some $t \le T_0$, then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\[
  \big\|x^{t+1} - x^{t+1,\mathrm{sgn}}\big\|_2 \le \alpha_{t+1}\Big(1 + \frac{1}{\log m}\Big)^{t+1} C_3 \sqrt{\frac{n\log^5 m}{m}} \tag{45}
\]
holds as long as $\eta > 0$ is a sufficiently small constant and $C_3$ is a sufficiently large positive constant.

Proof. See Appendix F.

We are left with the double difference $x^{t+1} - x^{t+1,\mathrm{sgn}} - x^{t+1,(l)} + x^{t+1,\mathrm{sgn},(l)}$ (cf. (40d)), for which one has the following lemma.

Lemma 7. Suppose $m \ge Cn\log^8 m$ for some sufficiently large constant $C > 0$. If the induction hypotheses (40) hold true up to the $t$th iteration for some $t \le T_0$, then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\[
  \max_{1\le l\le m} \big\|x^{t+1} - x^{t+1,\mathrm{sgn}} - x^{t+1,(l)} + x^{t+1,\mathrm{sgn},(l)}\big\|_2 \le \alpha_{t+1}\Big(1 + \frac{1}{\log m}\Big)^{t+1} C_4 \frac{\sqrt{n\log^{9} m}}{m} \tag{46}
\]
holds as long as $\eta > 0$ is a sufficiently small constant and $C_4 > 0$ is sufficiently large.

Proof. See Appendix G.

Assuming the induction hypotheses (40) hold up to the $t$th iteration for some $t \le T_0$, we know from Lemma 2 that the approximate state evolution for both $\alpha_t$ and $\beta_t$ (see (20)) holds up to $t+1$. As a result, the last two hypotheses (40e) and (40f) for the $(t+1)$th iteration can be easily verified.

4.6 Justification of approximate state evolution for Phase II of Stage 1

Recall from Lemma 1 that Phase II refers to the iterations $T_0 < t \le T_\gamma$ (see the definition of $T_0$ in Lemma 1), for which one has
\[
  \alpha_t \ge \frac{c_6}{\log^5 m} \tag{47}
\]
as long as the approximate state evolution (20) holds. Here $c_6 > 0$ is the same constant as in Lemma 1. Similar to Phase I, we invoke an inductive argument to prove that the approximate state evolution (20) continues to hold for $T_0 < t \le T_\gamma$.

4.6.1 Induction hypotheses

In Phase I, we rely on the leave-one-out sequences and the random-sign sequences $\{x^{t,(l)}\}$, $\{x^{t,\mathrm{sgn}}\}$ and $\{x^{t,\mathrm{sgn},(l)}\}$ to establish certain "near-independence" between $\{x^t\}$ and $\{a_l\}$, which in turn allows us to obtain sharp control of the residual terms $r(x^t)$ (cf. (10)) and $r_1(x^t)$ (cf. (11)). As it turns out, once the size $\alpha_t$ of the signal component obeys $\alpha_t \gtrsim 1/\mathrm{poly}\log(m)$, then $\{x^{t,(l)}\}$ alone is sufficient for establishing the "near-independence" property. More precisely, in Phase II we only need to impose the following induction hypotheses:
\begin{align}
  \max_{1\le l\le m} \big\|x^t - x^{t,(l)}\big\|_2 &\le \alpha_t \Big(1 + \frac{1}{\log m}\Big)^{t} C_6 \frac{\sqrt{n\log^{15} m}}{m}; \tag{48a}\\
  c_5 \le \big\|x_{\perp}^t\big\|_2 &\le \big\|x^t\big\|_2 \le C_5. \tag{48b}
\end{align}
A direct consequence of (48) is the incoherence between $x^t$ and $\{a_l\}$, namely,
\begin{align}
  \max_{1\le l\le m} \big|a_{l,\perp}^{\top} x_{\perp}^{t}\big| &\lesssim \sqrt{\log m}\,\big\|x_{\perp}^{t}\big\|_2; \tag{49a}\\
  \max_{1\le l\le m} \big|a_l^{\top} x^{t}\big| &\lesssim \sqrt{\log m}\,\big\|x^{t}\big\|_2. \tag{49b}
\end{align}

To see this, one can use the triangle inequality to show that
\begin{align*}
  \big|a_{l,\perp}^{\top} x_{\perp}^{t}\big| &\le \big|a_{l,\perp}^{\top} x_{\perp}^{t,(l)}\big| + \big|a_{l,\perp}^{\top}\big(x_{\perp}^{t} - x_{\perp}^{t,(l)}\big)\big|\\
  &\overset{(\mathrm{i})}{\lesssim} \sqrt{\log m}\,\big\|x_{\perp}^{t,(l)}\big\|_2 + \sqrt{n}\,\big\|x^{t} - x^{t,(l)}\big\|_2\\
  &\lesssim \sqrt{\log m}\,\Big(\big\|x_{\perp}^{t}\big\|_2 + \big\|x^{t} - x^{t,(l)}\big\|_2\Big) + \sqrt{n}\,\big\|x^{t} - x^{t,(l)}\big\|_2\\
  &\overset{(\mathrm{ii})}{\lesssim} \sqrt{\log m} + \sqrt{n}\cdot\frac{\sqrt{n\log^{15} m}}{m} \lesssim \sqrt{\log m},
\end{align*}
where (i) follows from the independence between $a_l$ and $x^{t,(l)}$ and the Cauchy-Schwarz inequality, and the last line (ii) arises from $(1 + 1/\log m)^{t} \lesssim 1$ for $t \le T_\gamma \lesssim \log n$ and $m \gg n\log^{15/2} m$. This combined with the fact that $\|x_{\perp}^t\|_2 \ge c_5/2$ results in
\[
  \max_{1\le l\le m} \big|a_{l,\perp}^{\top} x_{\perp}^{t}\big| \lesssim \sqrt{\log m}\,\big\|x_{\perp}^{t}\big\|_2. \tag{50}
\]

The condition (49b) follows using nearly identical arguments, which are omitted here.

As in Phase I, we need to justify the approximate state evolution (20) for both $\alpha_t$ and $\beta_t$, given that the $t$th iterates satisfy the induction hypotheses (48). This is stated in the following lemma.

Lemma 8. Suppose $m \ge Cn\log^{13} m$ for some sufficiently large constant $C > 0$. If the $t$th iterates satisfy the induction hypotheses (48) for $T_0 < t < T_\gamma$, then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\begin{align}
  \alpha_{t+1} &= \big[1 + 3\eta\big(1 - \alpha_t^2 - \beta_t^2\big) + \eta\zeta_t\big]\,\alpha_t; \tag{51a}\\
  \beta_{t+1} &= \big[1 + \eta\big(1 - 3\big(\alpha_t^2 + \beta_t^2\big)\big) + \eta\rho_t\big]\,\beta_t \tag{51b}
\end{align}
for some $|\zeta_t| \ll 1/\log m$ and $|\rho_t| \ll 1/\log m$.

Proof. See Appendix H for the proof of (51a). The proof of (51b) follows exactly the same argument as in proving (41b), and is hence omitted.

4.6.2 Induction step

We proceed to complete the induction argument. Towards this end, one has the following lemma regarding the induction on $\max_{1\le l\le m}\|x^{t+1} - x^{t+1,(l)}\|_2$ (see (48a)).


Lemma 9. Suppose $m \ge Cn\log^5 m$ for some sufficiently large constant $C > 0$, and consider any $T_0 < t < T_\gamma$. If the induction hypotheses (40) are valid throughout Phase I and (48) are valid from the $T_0$th to the $t$th iterations, then with probability at least $1 - O(me^{-1.5n}) - O(m^{-10})$,
\[
  \max_{1\le l\le m} \big\|x^{t+1} - x^{t+1,(l)}\big\|_2 \le \alpha_{t+1}\Big(1 + \frac{1}{\log m}\Big)^{t+1} C_6 \frac{\sqrt{n\log^{13} m}}{m}
\]
holds as long as $\eta > 0$ is sufficiently small and $C_6 > 0$ is sufficiently large.

Proof. See Appendix I.

As in Phase I, since we assume the induction hypotheses (40) (resp. (48)) hold for all iterations up to the $T_0$th iteration (resp. between the $T_0$th and the $t$th iterations), we know from Lemma 8 that the approximate state evolution for both $\alpha_t$ and $\beta_t$ (see (20)) holds up to $t+1$. The last induction hypothesis (48b) for the $(t+1)$th iteration can be easily verified from Lemma 1.

It remains to check the case when $t = T_0 + 1$. It can be seen from the analysis in Phase I that
\begin{align*}
  \max_{1\le l\le m} \big\|x^{T_0+1} - x^{T_0+1,(l)}\big\|_2 &\le \beta_{T_0+1}\Big(1 + \frac{1}{\log m}\Big)^{T_0+1} C_1 \frac{\sqrt{n\log^5 m}}{m}\\
  &\le \alpha_{T_0+1}\Big(1 + \frac{1}{\log m}\Big)^{T_0+1} C_6 \frac{\sqrt{n\log^{15} m}}{m}
\end{align*}
for some constant $C_6 \gg 1$, where the second line holds since $\beta_{T_0+1} \le C_5$ and $\alpha_{T_0+1} \ge c_6/\log^5 m$.

4.7 Analysis for Stage 2

Combining the analyses in Phase I and Phase II, we finish the proof of Theorem 2 for Stage 1, i.e. for $t \le T_\gamma$. In addition to $\mathrm{dist}(x^{T_\gamma}, x^{\natural}) \le \gamma$, we can also see from (49b) that
\[
  \max_{1\le i\le m} \big|a_i^{\top} x^{T_\gamma}\big| \lesssim \sqrt{\log m},
\]
which in turn implies that
\[
  \max_{1\le i\le m} \big|a_i^{\top}\big(x^{T_\gamma} - x^{\natural}\big)\big| \lesssim \sqrt{\log m}.
\]
Armed with these properties, one can apply the arguments in [MWCC17, Section 6] to prove that for $t \ge T_\gamma + 1$,
\[
  \mathrm{dist}\big(x^{t}, x^{\natural}\big) \le \Big(1 - \frac{\eta}{2}\Big)^{t - T_\gamma}\mathrm{dist}\big(x^{T_\gamma}, x^{\natural}\big) \le \Big(1 - \frac{\eta}{2}\Big)^{t - T_\gamma}\cdot\gamma. \tag{52}
\]
Notably, the theorem therein [MWCC17, Theorem 1] works under the stepsize $\eta_t \equiv \eta \asymp c/\log n$ when $m \asymp n\log n$. Nevertheless, as remarked by the authors, when the sample size exceeds the order of $n\log^3 m$, a constant stepsize is allowed.

We are left with proving (15) for Stage 2. Note that we have already shown that the ratio $\alpha_t/\beta_t$ increases exponentially fast in Stage 1. Therefore,
\[
  \frac{\alpha_{T_1}}{\beta_{T_1}} \ge \frac{1}{2\sqrt{n\log n}}\big(1 + c_{10}\eta^2\big)^{T_1} \tag{53}
\]
and, by the definition of $T_1$ (see (26)) and Lemma 1, one has $\alpha_{T_1} \asymp \beta_{T_1} \asymp 1$ and hence $\alpha_{T_1}/\beta_{T_1} \asymp 1$. When it comes to $t > T_\gamma$, in view of (52), one has
\begin{align*}
  \frac{\alpha_t}{\beta_t} &\ge \frac{1 - \mathrm{dist}\big(x^{t}, x^{\natural}\big)}{\mathrm{dist}\big(x^{t}, x^{\natural}\big)} \ge \frac{1-\gamma}{\big(1 - \frac{\eta}{2}\big)^{t - T_\gamma}\cdot\gamma}\\
  &\ge \frac{1-\gamma}{\gamma}\Big(1 + \frac{\eta}{2}\Big)^{t - T_\gamma} \overset{(\mathrm{i})}{\gtrsim} \frac{\alpha_{T_1}}{\beta_{T_1}}\Big(1 + \frac{\eta}{2}\Big)^{t - T_\gamma}\\
  &\gtrsim \frac{1}{\sqrt{n\log n}}\big(1 + c_{10}\eta^2\big)^{T_1}\Big(1 + \frac{\eta}{2}\Big)^{t - T_\gamma}\\
  &\overset{(\mathrm{ii})}{\asymp} \frac{1}{\sqrt{n\log n}}\big(1 + c_{10}\eta^2\big)^{T_\gamma}\Big(1 + \frac{\eta}{2}\Big)^{t - T_\gamma}\\
  &\gtrsim \frac{1}{\sqrt{n\log n}}\big(1 + c_{10}\eta^2\big)^{t},
\end{align*}
where (i) arises from (53) and the fact that $\gamma$ is a constant, (ii) follows since $T_\gamma - T_1 \lesssim 1$ according to Lemma 1, and the last line holds as long as $c_{10} > 0$ and $\eta$ are sufficiently small. This concludes the proof regarding the lower bound on $\alpha_t/\beta_t$.
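The two-stage behavior established above is straightforward to observe empirically. The sketch below (illustrative settings, similar in spirit to those used for Figure 7) runs randomly initialized gradient descent on a Gaussian instance and prints the estimation error $\mathrm{dist}(x^t, x^{\natural}) = \min\|x^t \pm x^{\natural}\|_2$ together with the ratio $|\alpha_t|/\beta_t$, consistent with the two-stage picture above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, eta, T = 100, 1000, 0.1, 300
x_star = np.eye(n)[0]                            # x^natural = e_1
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

x = rng.standard_normal(n) / np.sqrt(n)          # random initialization
for t in range(T + 1):
    alpha, beta = x[0], np.linalg.norm(x[1:])    # signal / perpendicular components
    dist = min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))
    if t % 50 == 0:
        # |alpha| is used since the iterates may converge to -x^natural
        print(f"t={t:3d}  dist={dist:.3e}  |alpha|/beta={abs(alpha) / beta:.3e}")
    x = x - eta * (A.T @ (((A @ x) ** 2 - y) * (A @ x)) / m)
```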

5 Discussions

The current paper justifies the fast global convergence of gradient descent with random initialization for phase retrieval. Specifically, we demonstrate that GD with random initialization takes only $O(\log n + \log(1/\epsilon))$ iterations to achieve a relative $\epsilon$-accuracy in terms of the estimation error. It is likely that such fast global convergence properties also arise in other nonconvex statistical estimation problems. The technical tools developed herein may also prove useful for other settings. We conclude our paper with a few directions worthy of future investigation.

• Sample complexity and phase transition. We have proved in Theorem 2 that GD with random initialization enjoys fast convergence, with the proviso that $m \gtrsim n\log^{13} m$. It is possible to improve the sample complexity via more sophisticated arguments. In addition, it would be interesting to examine the phase transition phenomenon of GD with random initialization.

• Other nonconvex statistical estimation problems. We use the phase retrieval problem to showcase the efficiency of GD with random initialization. It is certainly interesting to investigate whether this fast global convergence carries over to other nonconvex statistical estimation problems including low-rank matrix and tensor recovery [KMO10, SL16, CW15, TBS+16, ZWL15, CL17, HZC18], blind deconvolution [LLSW16, HH17, LLB17] and neural networks [SJL17, LMZ17, ZSJ+17, FCL18]. The leave-one-out sequences and the "near-independence" property introduced / identified in this paper might be useful in proving efficiency of randomly initialized GD for the aforementioned problems.

• Noisy setting and other activation functions. Throughout this paper, our focus is on inverting noiseless quadratic systems. Extensions to the noisy case are definitely worth investigating. Moving beyond quadratic samples, one may also study other activation functions, including but not limited to Rectified Linear Units (ReLU), polynomial functions and sigmoid functions. Such investigations might shed light on the effectiveness of GD with random initialization for training neural networks.

• Other iterative optimization methods. Apart from gradient descent, other iterative procedures have been applied to solve the phase retrieval problem. Partial examples include alternating minimization, the Kaczmarz algorithm, and truncated gradient descent (truncated Wirtinger flow). In conjunction with random initialization, whether the iterative algorithms mentioned above enjoy fast global convergence is an interesting open problem. For example, it has been shown that truncated WF together with truncated spectral initialization achieves optimal sample complexity (i.e. $m \asymp n$) and computational complexity simultaneously [CC17]. Does truncated Wirtinger flow still enjoy optimal sample complexity when initialized randomly?

• Applications of leave-one-out tricks. In this paper, we heavily deploy the leave-one-out trick to demonstrate the "near-independence" between the iterates $x^t$ and the sampling vectors $\{a_i\}$. The basic idea is to construct an auxiliary sequence that is (i) independent of certain components of the design vectors, and (ii) extremely close to the original sequence. These two properties allow us to propagate certain independence properties to $x^t$. As mentioned in Section 3, the leave-one-out trick has served as a very powerful hammer for decoupling the dependency between random vectors in several high-dimensional estimation problems. We expect this powerful trick to be useful in broader settings.

Acknowledgements Y. Chen is supported in part by a Princeton SEAS innovation award. The work of Y. Chi is supported in part by AFOSR under the grant FA9550-15-1-0205, by ONR under the grant N00014-18-1-2142, and by NSF under the grants CAREER ECCS-1818571 and CCF-1806154. J. Fan is supported in part by NSF grants DMS-1662139 and DMS-1712591 and NIH grant 2R01-GM072611-13.

References [AAZB+ 16] N. Agarwal, Z. Allen-Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016. [AFWZ17] E. Abbe, J. Fan, K. Wang, and Y. Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. arXiv preprint arXiv:1709.09565, 2017. [AZ17]

Z. Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.

[BCMN14] A. S. Bandeira, J. Cahill, D. G. Mixon, and A. A. Nelson. Saving phase: Injectivity and stability for phase retrieval. Applied and Computational Harmonic Analysis, 37(1):106–125, 2014. [BEB17]

T. Bendory, Y. C. Eldar, and N. Boumal. Non-convex phase retrieval from STFT measurements. IEEE Transactions on Information Theory, 2017.

[BWY14]

S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.

[CC17]

Y. Chen and E. J. Candès. Solving random quadratic systems of equations is nearly as easy as solving linear systems. Comm. Pure Appl. Math., 70(5):822–883, 2017.

[CCG15]

Y. Chen, Y. Chi, and A. J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034– 4059, 2015.

[CESV13]

E. J. Candès, Y. C. Eldar, T. Strohmer, and V. Voroninski. Phase retrieval via matrix completion. SIAM Journal on Imaging Sciences, 6(1):199–225, 2013.

[CFL15]

P. Chen, A. Fannjiang, and G.-R. Liu. Phase retrieval with one or two diffraction patterns by alternating projections with the null initialization. Journal of Fourier Analysis and Applications, pages 1–40, 2015.

[CFMW17] Y. Chen, J. Fan, C. Ma, and K. Wang. Spectral method and regularized MLE are both optimal for top-k ranking. arXiv preprint arXiv:1707.09971, 2017. [CL14]

E. J. Candès and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Foundations of Computational Mathematics, 14(5):1017–1026, 2014.

[CL16]

Y. Chi and Y. M. Lu. Kaczmarz method for solving quadratic equations. IEEE Signal Processing Letters, 23(9):1183–1187, 2016.

[CL17]

J. Chen and X. Li. Memory-efficient kernel pca via partial matrix sampling and nonconvex optimization: a model-free analysis of local minima. arXiv preprint arXiv:1711.01742, 2017.


[CLM+ 16]

T. T. Cai, X. Li, Z. Ma, et al. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. The Annals of Statistics, 44(5):2221–2251, 2016.

[CLS15]

E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, April 2015.

[CLW17]

J.-F. Cai, H. Liu, and Y. Wang. Fast rank one alternating minimization algorithm for phase retrieval. arXiv preprint arXiv:1708.08751, 2017.

[CSV13]

E. J. Candès, T. Strohmer, and V. Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1017–1026, 2013.

[CW15]

Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.

[CWZG17] J. Chen, L. Wang, X. Zhang, and Q. Gu. Robust wirtinger flow for phase retrieval with arbitrary corruption. arXiv preprint arXiv:1704.06256, 2017. [CYC14]

Y. Chen, X. Yi, and C. Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In Conference on Learning Theory, pages 560–604, 2014.

[CZ15]

T. Cai and A. Zhang. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138, 2015.

[DH14]

L. Demanet and P. Hand. Stable optimizationless recovery from phaseless linear measurements. Journal of Fourier Analysis and Applications, 20(1):199–221, 2014.

[DJL+ 17]

S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.

[DR17]

J. C. Duchi and F. Ruan. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. arXiv preprint arXiv:1705.02356, 2017.

[EK15]

N. El Karoui. On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators. Probability Theory and Related Fields, pages 1–81, 2015.

[EKBB+ 13] N. El Karoui, D. Bean, P. J. Bickel, C. Lim, and B. Yu. On robust regression with highdimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557–14562, 2013. [FCL18]

H. Fu, Y. Chi, and Y. Liang. Local geometry of one-hidden-layer neural networks for logistic regression. arXiv preprint arXiv:1802.06463, 2018.

[GHJY15]

R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.

[GX16]

B. Gao and Z. Xu. Phase retrieval using Gauss-Newton method. arXiv preprint arXiv:1606.08135, 2016.

[HH17]

W. Huang and P. Hand. Blind deconvolution by a steepest descent algorithm on a quotient manifold. arXiv preprint arXiv:1710.03309, 2017.

[HZC18]

B. Hao, A. Zhang, and G. Cheng. Sparse and low-rank tensor estimation via cubic sketchings. arXiv preprint arXiv:1801.09326, 2018.

[JGN+17]

C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

[JM15]

A. Javanmard and A. Montanari. De-biasing the lasso: Optimal sample size for Gaussian designs. arXiv preprint arXiv:1508.02757, 2015.

[JNJ17]

C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.

[KMO10]

R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980 –2998, June 2010.

[KRT17]

R. Kueng, H. Rauhut, and U. Terstiege. Low rank matrix recovery from rank one measurements. Applied and Computational Harmonic Analysis, 42(1):88–116, 2017.

[Lan93]

S. Lang. Real and functional analysis. Springer-Verlag, New York, 10:11–13, 1993.

[LGL15]

G. Li, Y. Gu, and Y. M. Lu. Phase retrieval using iterative projections: Dynamics in the large systems limit. In Allerton Conference on Communication, Control, and Computing, pages 1114–1118. IEEE, 2015.

[LL17]

Y. M. Lu and G. Li. Phase transitions of spectral initialization for high-dimensional nonconvex estimation. arXiv preprint arXiv:1702.06435, 2017.

[LLB17]

Y. Li, K. Lee, and Y. Bresler. Blind gain and phase calibration for low-dimensional or sparse signal sensing via power iteration. In Sampling Theory and Applications (SampTA), 2017 International Conference on, pages 119–123. IEEE, 2017.

[LLSW16]

X. Li, S. Ling, T. Strohmer, and K. Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. CoRR, abs/1606.04933, 2016.

[LMCC18]

Y. Li, C. Ma, Y. Chen, and Y. Chi. Nonconvex matrix factorization from rank-one measurements. arXiv preprint arXiv:1802.06286, 2018.

[LMZ17]

Y. Li, T. Ma, and H. Zhang. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203, 2017.

[LPP+ 17]

J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.

[LSJR16]

J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915, 2016.

[MM17]

M. Mondelli and A. Montanari. Fundamental limits of weak recovery with applications to phase retrieval. arXiv preprint arXiv:1708.05932, 2017.

[MSK17]

R. Murray, B. Swenson, and S. Kar. Revisiting normalized gradient descent: Evasion of saddle points. arXiv preprint arXiv:1711.05224, 2017.

[MWCC17] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv preprint arXiv:1711.10467, 2017. [MXM18]

J. Ma, J. Xu, and A. Maleki. Optimization-based AMP for phase retrieval: The impact of initialization and $\ell_2$-regularization. arXiv preprint arXiv:1801.01170, 2018.

[NJS13]

P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems (NIPS), 2013.

[QZEW17] Q. Qu, Y. Zhang, Y. Eldar, and J. Wright. Convolutional phase retrieval via gradient descent. Neural Information Processing Systems, 2017. [SCC17]

P. Sur, Y. Chen, and E. J. Candès. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. arXiv preprint arXiv:1706.01191, 2017.

[SEC+ 15]

Y. Shechtman, Y. C. Eldar, O. Cohen, H. N. Chapman, J. Miao, and M. Segev. Phase retrieval with application to optical imaging: a contemporary overview. IEEE signal processing magazine, 32(3):87–109, 2015.

[SJL17]

M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

[SL16]

R. Sun and Z.-Q. Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

[Sol14]

M. Soltanolkotabi. Algorithms and Theory for Clustering and Nonconvex Quadratic Programming. PhD thesis, Stanford University, 2014.

[Sol17]

M. Soltanolkotabi. Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization. arXiv preprint arXiv:1702.06175, 2017.

[SQW16]

J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.

[SS12]

W. Schudy and M. Sviridenko. Concentration and moment inequalities for polynomials of independent random variables. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 437–446. ACM, New York, 2012.

[TBS+ 16]

S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht. Low-rank solutions of linear matrix equations via procrustes flow. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48, pages 964–973. JMLR. org, 2016.

[TV17]

Y. S. Tan and R. Vershynin. Phase retrieval via randomized kaczmarz: Theoretical guarantees. arXiv preprint arXiv:1706.09993, 2017.

[Ver12]

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing, Theory and Applications, pages 210 – 268, 2012.

[Wei15]

K. Wei. Solving systems of phaseless equations via Kaczmarz methods: A proof of concept study. Inverse Problems, 31(12):125008, 2015.

[WGE17]

G. Wang, G. B. Giannakis, and Y. C. Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 2017.

[WGSC17] G. Wang, G. B. Giannakis, Y. Saad, and J. Chen. Solving almost all systems of random quadratic equations. arXiv preprint arXiv:1705.10407, 2017. [YYF+ 17]

Z. Yang, L. F. Yang, E. X. Fang, T. Zhao, Z. Wang, and M. Neykov. Misspecified nonconvex statistical optimization for phase retrieval. arXiv preprint arXiv:1712.06245, 2017.

[ZB17]

Y. Zhong and N. Boumal. Near-optimal bounds for phase synchronization. arXiv preprint arXiv:1703.06605, 2017.

[ZCL16]

H. Zhang, Y. Chi, and Y. Liang. Provable non-convex phase retrieval with outliers: Median truncated Wirtinger flow. In International conference on machine learning, pages 1022–1031, 2016.

[Zha17]

T. Zhang. Phase retrieval using alternating minimization in a batch setting. arXiv preprint arXiv:1706.08167, 2017.

[ZSJ+ 17]

K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for onehidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.

[ZWGC17] L. Zhang, G. Wang, G. B. Giannakis, and J. Chen. Compressive phase retrieval via reweighted amplitude flow. arXiv preprint arXiv:1712.02426, 2017.

[ZWL15]

T. Zhao, Z. Wang, and H. Liu. A nonconvex optimization framework for low rank matrix estimation. In Advances in Neural Information Processing Systems, pages 559–567, 2015.

[ZZLC17]

H. Zhang, Y. Zhou, Y. Liang, and Y. Chi. A nonconvex approach for phase retrieval: Reshaped wirtinger flow and incremental algorithms. Journal of Machine Learning Research, 2017.


A Preliminaries

We first gather two standard concentration inequalities used throughout the appendix. The first lemma is the multiplicative form of the Chernoff bound, while the second lemma is a user-friendly version of the Bernstein inequality.

Lemma 10. Suppose $X_1, \cdots, X_m$ are independent random variables taking values in $\{0, 1\}$. Denote $X = \sum_{i=1}^m X_i$ and $\mu = \mathbb{E}[X]$. Then for any $\delta \ge 1$, one has
\[
  \mathbb{P}\big(X \ge (1+\delta)\mu\big) \le e^{-\delta\mu/3}.
\]

Lemma 11. Consider $m$ independent random variables $z_l$ ($1 \le l \le m$), each satisfying $|z_l| \le B$. For any $a \ge 2$, one has
\[
  \bigg|\sum_{l=1}^m z_l - \sum_{l=1}^m \mathbb{E}[z_l]\bigg| \le \sqrt{2a\log m \sum_{l=1}^m \mathbb{E}\big[z_l^2\big]} + \frac{2a}{3}\, B\log m
\]
with probability at least $1 - 2m^{-a}$.

.

Next, we list a few simple facts. The gradient and the Hessian of the nonconvex loss function (2) are given respectively by m  i 1 X h > 2 \ 2 ai x − a> ∇f (x) = ai a> i x i x; m i=1

(54)

m

∇2 f (x) =

2  i 1 Xh \ 2 3 a> − a> ai a> i x i x i . m i=1

(55)

\ In addition, recall that x to be x\ = e1 throughout the proof. For each 1 ≤ i ≤ m, we have  is assumed  ai,1 the decomposition ai = , where ai,⊥ contains the 2nd through the nth entries of ai . The standard ai,⊥ concentration inequality reveals that p \ (56) max |ai,1 | = max a> i x ≤ 5 log m 1≤i≤m

1≤i≤m

 with probability 1 − O m−10 . Additionally, apply the standard concentration inequality to see that max kai k2 ≤

1≤i≤m



6n

(57)

 with probability 1 − O me−1.5n . The next lemma provides concentration bounds regarding polynomial functions of {ai }. i.i.d.

Lemma 12. Consider any  > 3/n. Suppose that ai ∼ N (0, In ) for 1 ≤ i ≤ m. Let   > n−1 S := z ∈ R max ai,⊥ z ≤ β kzk2 , 1≤i≤m

√ where β is any value obeying β ≥ c1 log m for some sufficiently large constant c1 > 0. Then with probability  exceeding 1 − O m−10 , one has P n o 5 1 m 3 > 1 1 2 m ; 1. m a a z n log n, βn log ≤  kzk for all z ∈ S, provided that m ≥ c max 2 0 i=1 i,1 i,⊥ 2   P n 3 1 m 3 1 > 2. m i=1 ai,1 ai,⊥ z ≤  kzk2 for all z ∈ S, provided that m ≥ c0 max 2 n log n,

3 1 3 2  β n log

P 2  1 m 2 2 2 > 3. m − kzk2 ≤  kzk2 for all z ∈ S, provided that m ≥ c0 max 12 n log n, i=1 ai,1 ai,⊥ z 27

o m ;

2 1 2  β n log

m ;

P  2 1 m 6 2 2 > 4. m − 15 kzk2 ≤  kzk2 for all z ∈ S, provided that m ≥ c0 max 12 n log n, i=1 ai,1 ai,⊥ z

4 1 2  β n log

m ;

P 1 6 1 m 2 6 6 > a z − 15 kzk a 5. m i,⊥ 2 ≤  kzk2 for all z ∈ S, provided that m ≥ c0 max 2 n log n, i=1 i,1

2 1 6  β n log

m ;

P 1 4 1 m 2 4 4 > 6. m − 3 kzk a a z i,1 i,⊥ 2 ≤  kzk2 for all z ∈ S, provided that m ≥ c0 max 2 n log n, i=1

2 1 4  β n log

m .

Here, $c_0 > 0$ is some sufficiently large constant.

Proof. See Appendix J.

The next lemmas provide (uniform) matrix concentration inequalities about $a_i a_i^{\top}$.

Lemma 13 ([Ver12, Corollary 5.35]). Suppose that $a_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$ for $1 \le i \le m$. With probability at least $1 - ce^{-\tilde{c}m}$, one has
\[
  \bigg\|\frac{1}{m}\sum_{i=1}^m a_i a_i^{\top}\bigg\| \le 2,
\]
as long as $m \ge c_0 n$ for some sufficiently large constant $c_0 > 0$. Here, $c, \tilde{c} > 0$ are some absolute constants.
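As a quick numerical illustration of Lemma 13 (a sketch only; the constant 2 is far from tight once $m/n$ is large), one can evaluate the spectral norm of the empirical second-moment matrix directly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 100, 2000                       # m >= c0 * n
A = rng.standard_normal((m, n))
M = A.T @ A / m                        # (1/m) * sum_i a_i a_i^T
print(np.linalg.norm(M, 2))           # close to 1, and in particular <= 2
```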

Lemma 14. Fix some $x^{\natural} \in \mathbb{R}^n$. Suppose that $a_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$, $1 \le i \le m$. With probability at least $1 - O(m^{-10})$, one has
\[
  \bigg\|\frac{1}{m}\sum_{i=1}^m \big(a_i^{\top} x^{\natural}\big)^2 a_i a_i^{\top} - \big\|x^{\natural}\big\|_2^2 I_n - 2 x^{\natural} x^{\natural\top}\bigg\| \le c_0 \sqrt{\frac{n\log^3 m}{m}}\, \big\|x^{\natural}\big\|_2^2, \tag{58}
\]
provided that $m > c_1 n\log^3 m$. Here, $c_0, c_1$ are some universal positive constants. Furthermore, fix any $c_2 > 1$ and suppose that $m > c_1 n\log^3 m$ for some sufficiently large constant $c_1 > 0$. Then with probability exceeding $1 - O(m^{-10})$,
\[
  \bigg\|\frac{1}{m}\sum_{i=1}^m \big(a_i^{\top} z\big)^2 a_i a_i^{\top} - \|z\|_2^2 I_n - 2 z z^{\top}\bigg\| \le c_0 \sqrt{\frac{n\log^3 m}{m}}\, \|z\|_2^2 \tag{59}
\]
holds simultaneously for all $z \in \mathbb{R}^n$ obeying $\max_{1\le i\le m} |a_i^{\top} z| \le c_2 \sqrt{\log m}\, \|z\|_2$. On this event, we have
\[
  \bigg\|\frac{1}{m}\sum_{i=1}^m |a_{i,1}|^2\, a_{i,\perp} a_{i,\perp}^{\top}\bigg\| \le \bigg\|\frac{1}{m}\sum_{i=1}^m |a_{i,1}|^2\, a_i a_i^{\top}\bigg\| \le 4. \tag{60}
\]

Proof. See Appendix K.

s

o  n 

2

In − η∇2 f (z) − 1 − 3η kzk2 + η In + 2ηx\ x\> − 6ηzz > .

n o n log3 m 2 max kzk2 , 1 m

2

∇ f (z) ≤ 10kzk22 + 4 √ hold simultaneously for all z obeying max1≤i≤m a> i z ≤ c0 log m kzk2 , provided that 0 < η < for some sufficiently small constant c2 > 0. and

Proof. See Appendix L.

28

c2 max{kzk22 ,1}

Finally, we note that there are a few immediate consequences of the induction hypotheses (40), which we summarize below. These conditions are useful in the subsequent analysis. Note that Lemma 3 is incorporated here. Lemma 16. Suppose that m ≥ Cn log6 m for some sufficiently large constant C > 0. Then under the hypotheses (40) for t . log n, with probability at least 1 − O(me−1.5n ) − O(m−10 ) one has

t,(l)

(61a) c5 /2 ≤ x⊥ 2 ≤ xt,(l) 2 ≤ 2C5 ;

t,sgn

t,sgn

≤ 2C5 ; (61b) c5 /2 ≤ x⊥ 2 ≤ x 2

t,sgn,(l)

t,sgn,(l)

≤ x

≤ 2C5 ; c5 /2 ≤ x⊥ (61c) 2 2 p > t

t max al x . log m x 2 ; (62a) 1≤l≤m

t > t p (62b) max al,⊥ x⊥ . log m x⊥ 2 ; 1≤l≤m p

t,sgn max a> . log m xt,sgn 2 ; (62c) l x 1≤l≤m p

t,sgn

; . log m xt,sgn max a> (62d) l,⊥ x⊥ ⊥ 2 1≤l≤m sgn> t,sgn p

t,sgn

; . log m x max al (62e) x 2 1≤l≤m

1 ; (63a) max xt − xt,(l) 2  1≤l≤m log m

t

x − xt,sgn  1 ; (63b) 2 log m t,(l) max xk ≤ 2αt . (63c) 1≤l≤m

Proof. See Appendix M.

B Proof of Lemma 1

We focus on the case when √

1 log n ≤ α0 ≤ √ n n log n

and

1−

1 1 ≤ β0 ≤ 1 + log n log n

The other cases can be proved using very similar arguments as below, and hence omitted. Let η > 0 and c4 > 0 be some sufficiently small constants independent of n. In the sequel, we divide Stage 1 (iterations up to Tγ ) into several substages. See Figure 8 for an illustration. • Stage 1.1: consider the period when αt is sufficiently small, which consists of all iterations 0 ≤ t ≤ T1 with T1 given in (26). We claim that, throughout this substage, 1 , αt > √ 2 n log n √ √ 0.5 < βt < 1.5.

(64a) (64b)

If this claim holds, then we would have αt2 + βt2 < c24 + 1.5 < 2 as long as c4 is small enough. This immediately reveals that 1 + η 1 − 3αt2 − 3βt2 ≥ 1 − 6η, which further gives   βt+1 ≥ 1 + η 1 − 3αt2 − 3βt2 + ηρt βt   c3 η ≥ 1 − 6η − βt log n ≥ (1 − 7η)βt . In what follows, we further divide this stage into multiple sub-phases. 29

(65)


Figure 8: Illustration of the substages for the proof of Lemma 1. – Stage 1.1.1: consider the iterations 0 ≤ t ≤ T1,1 with o n p T1,1 = min t | βt+1 ≤ 1/3 + η .

(66)

Fact 1. For any sufficiently small η > 0, one has βt+1 ≤ (1 − 2η 2 )βt , αt+1 ≤ (1 + 4η)αt , 3

αt+1 ≥ (1 + 2η )αt ,

0 ≤ t ≤ T1,1 ;

(67)

0 ≤ t ≤ T1,1 ; 1 ≤ t ≤ T1,1 ;

α1 ≥ α0 /2; 1 − 7η βT1,1 +1 ≥ √ ; 3 1 T1,1 . 2 . η

(68)

(69)

Moreover, αT1,1  c4 and hence T1,1 < T1 . From Fact 1, we see that in this substage, αt keeps increasing (at least for t ≥ 1) with c4 > αt ≥

1 α0 ≥ √ , 2 2 n log n

0 ≤ t ≤ T1,1 ,

and βt is strictly decreasing with 1.5 > β0 ≥ βt ≥ βT1,1 +1 ≥

1 − 7η √ , 3

0 ≤ t ≤ T1,1 ,

which justifies (64). In addition, combining (67) with (68), we arrive at the growth rate of αt /βt as αt+1 /αt 1 + 2η 3 ≥ = 1 + O(η 2 ). βt+1 /βt 1 − 2η 2 These demonstrate (24) for this substage. – Stage 1.1.2: this substage contains all iterations obeying T1,1 < t ≤ T1 . We claim the following result. Fact 2. Suppose that η > 0 is sufficiently small. Then for any T1,1 < t ≤ T1 ,   (1 − 7η)2 1 + 30η √ βt ∈ , √ ; 3 3 βt+1 ≤ (1 + 30η 2 )βt . 30

(70) (71)

Furthermore, since αt2 + βt2 ≤ c24 +

(1 + 30η)2 1 < , 3 2

we have, for sufficiently small c3 , that   αt+1 ≥ 1 + 3η 1 − αt2 − βt2 − η|ζt | αt   c3 η ≥ 1 + 1.5η − αt log n ≥ (1 + 1.4η)αt ,

(72) √ 1 , 2 n log n

and hence αt keeps increasing. This means αt ≥ α1 ≥ with (70) for this substage. As a consequence, T1 − T1,1 . T1 − T0 .

log

c4 α0

log(1 + 1.4η) log cc64 log5 m

log (1 + 1.4η)

which justifies the claim (64) together

.

log n ; η

.

log log m . η

Moreover, combining (72) with (71) yields the growth rate of αt /βt as αt+1 /αt 1 + 1.4η ≥ ≥1+η βt+1 /βt 1 + 30η 2 for η > 0 sufficiently small. – Taken collectively, the preceding bounds imply that T1 = T1,1 + (T1 − T1,1 ) .

1 log n log n . 2 . + 2 η η η

• Stage 1.2: in this stage, we consider all iterations T1 < t ≤ T2 , where   αt+1 2 T2 := min t | > . βt+1 γ From the preceding analysis, it is seen that, for η sufficiently small, √ αT1,1 c4 3c4 . ≤ (1−7η)2 ≤ βT1,1 1 − 15η √ 3

In addition, we have: Fact 3. Suppose η > 0 is sufficiently small. Then for any T1 < t ≤ T2 , one has αt2 + βt2 ≤ 2;

(73)

αt+1 /βt+1 ≥ 1 + η; αt /βt αt+1 ≥ {1 − 3.1η} αt ; βt+1 ≥ {1 − 5.1η} βt . In addition, T2 − T1 .

31

1 . η

(74) (75) (76)

With this fact in place, one has αt ≥ (1 − 3.1η)t−T1 αT1 & 1,

T1 < t ≤ T2 .

βt ≥ (1 − 5.1η)t−T1 βT1 & 1,

T1 < t ≤ T2 .

and hence These taken collectively demonstrate (24) for any T1 < t ≤ T2 . Finally, if T2 ≥ Tγ , then we complete the proof as log n Tγ ≤ T2 = T1 + (T2 − T1 ) . 2 . η Otherwise we move to the next stage. • Stage 1.3: this stage is composed of all iterations T2 < t ≤ Tγ . We break the discussion into two cases. – If αT2 +1 > 1 + γ, then αT2 2 +1 + βT22 +1 ≥ αT2 2 +1 > 1 + 2γ. This means that   αT2 +2 ≤ 1 + 3η 1 − αT2 2 +1 − βT22 +1 + η|ζT2 +1 | αT2 +1   ηc3 ≤ 1 − 6ηγ − αT2 +1 log n ≤ {1 − 5ηγ} αT2 +1 when c3 > 0 is sufficiently small. Similarly, one also gets βT2 +2 ≤ (1 − 5ηγ)βT2 +1 . As a result, both αt and βt will decrease. Repeating this argument reveals that αt+1 ≤ (1 − 5ηγ)αt , βt+1 ≤ (1 − 5ηγ)βt until αt ≤ 1 + γ. In addition, applying the same argument as for Stage 1.2 yields αt+1 /αt ≥ 1 + c10 η βt+1 /βt for some constant c10 > 0. Therefore, when αt drops below 1 + γ, one has αt ≥ (1 − 3η)(1 + γ) ≥ 1 − γ and βt ≤

γ αt ≤ γ. 2

This justifies that Tγ − T2 .

log

2 1−γ

− log(1 − 5ηγ)

.

1 . η

– If c4 ≤ αT2 +1 < 1 − γ, take very similar arguments as in Stage 1.2 to reach that αt+1 /αt ≥ 1 + c10 η, βt+1 /βt and

αt & 1,

Tγ − T2 .

βt & 1

1 η

T2 ≤ t ≤ Tγ

for some constant c10 > 0. We omit the details for brevity. In either case, we see that αt is always bounded away from 0. We can also repeat the argument for Stage 1.2 to show that βt & 1.

32

In conclusion, we have established that Tγ = T1 + (T2 − T1 ) + (Tγ − T2 ) . and

αt+1 /αt ≥ 1 + c10 η 2 , βt+1 /βt

log n , η2

c5 ≤ βt ≤ 1.5,

0 ≤ t < Tγ

1 ≤ αt ≤ 2, 2 n log n √

0 ≤ t < Tγ

for some constants c5 , c10 > 0. Proof of Fact 1. The proof proceeds as follows. p • First of all, for any 0 ≤ t ≤ T1,1 , one has βt ≥ 1/3 + η and αt2 + βt2 ≥ 1/3 + η and, as a result,   βt+1 ≤ 1 + η 1 − 3αt2 − 3βt2 + η|ρt | βt   c3 η 2 βt ≤ 1 − 3η + log n ≤ (1 − 2η 2 )βt

(77)

as long as c3 and η are both constants. In other words, βt is strictly decreasing before T1,1 , which also justifies the claim (64b) for this substage. • Moreover, given that the contraction factor of βt is at least 1 − 2η 2 , we have log √ β0 T1,1 .

1/3+η

− log (1 − 2η 2 )



1 . η2

p This upper bound also allows us to conclude that βt will cross the threshold 1/3 + η before αt exceeds c4 , namely, T1,1 < T1 . To see this, we note that the growth rate of {αt } within this substage is upper bounded by   αt+1 ≤ 1 + 3η 1 − αt2 − βt2 + η|ζt | αt   c3 η ≤ 1 + 3η + αt log n ≤ (1 + 4η)αt .

(78)

This leads to an upper bound |αT1,1 | ≤ (1 + 4η)T1,1 |α0 | ≤ (1 + 4η)O(η

−2

) log n



n

 c4 .

• Furthermore, we can also lower bound αt . First of all,   α1 ≥ 1 + 3η 1 − α02 − β02 − η|ζt | α0   c3 η ≥ 1 − 3η − α0 log n 1 ≥ (1 − 4η)α0 ≥ α0 2 for η sufficiently small. For all 1 ≤ t ≤ T1,1 , using (78) we have αt2 + βt2 ≤ (1 + 4η)T1,1 α02 + β12 ≤ o(1) + (1 − 2η 2 )β0 ≤ 1 − η 2 , allowing one to deduce that   αt+1 ≥ 1 + 3η 1 − αt2 − βt2 − η|ζt | αt 33

(79)



  c3 η 1 + 3η 3 − αt log n

≥ (1 + 2η 3 )αt . In other words, αt keeps increasing throughout all 1 ≤ t ≤ T1,1 . This verifies the condition (64a) for this substage. • Finally, we make note of one useful lower bound βT1,1 +1 ≥ (1 − 7η)βT1,1 ≥ which follows by combining (65) and the condition βT1,1 ≥

p

1 − 7η √ , 3

(80)

1/3 + η .

Proof of Fact 2. Clearly, βT1,1 +1 falls within this range according to (66) and (80). We now divide into several cases. • If

1+η √ 3

≤ βt
(1 − 7η)4 /3 > (1 − 29η)/3, one has   βt+1 ≤ 1 + η 1 − 3αt2 − 3βt2 + η|ρt | βt c3 η  βt ≤ (1 + 30η 2 )βt (82) ≤ 1 + 29η 2 + log n 1 + 30η 2 √ < . 3 h i 2 1+30η √ √ Therefore, we have shown that βt+1 ∈ (1−7η) , , which continues to lie within the range (70). 3 3 • Finally, if

1−7η √ 3

< βt
0. In addition, it comes from (80) that βt+1 ≥ (1 − 7η)βt ≥ falls within the range (70). 34

(1−7η)2 √ . 3

This justifies that βt+1

Combining all of the preceding cases establishes the claim (70) for all T1,1 < t ≤ T1 . Proof of Fact 3. We first demonstrate that αt2 + βt2 ≤ 2

(84)

throughout this substage. In fact, if αt2 + βt2 ≤ 1.5, then   αt+1 ≤ 1 + 3η 1 − αt2 − βt2 + η|ζt | αt ≤ (1 + 4η) αt and, similarly, βt+1 ≤ (1 + 4η)βt . These taken together imply that  2 2 2 αt+1 + βt+1 ≤ (1 + 4η) αt2 + βt2 ≤ 1.5(1 + 9η) < 2. Additionally, if 1.5 < αt2 + βt2 ≤ 2, then   αt+1 ≤ 1 + 3η 1 − αt2 − βt2 + η|ζt | αt   c3 η αt ≤ 1 − 1.5η + log n ≤ (1 − η)αt and, similarly, βt+1 ≤ (1 − η)βt . These reveal that 2 2 αt+1 + βt+1 ≤ αt2 + βt2 .

Put together the above argument to establish the claim (84). With the claim (84) in place, we can deduce that   αt+1 ≥ 1 + 3η 1 − αt2 − βt2 − η|ζt | αt   ≥ 1 + 3η 1 − αt2 − βt2 − 0.1η αt

(85)

and   βt+1 ≤ 1 + η 1 − 3αt2 − 3βt2 + η|ρt | βt   ≤ 1 + η 1 − 3αt2 − 3βt2 + 0.1η βt . Consequently,  1 + 3η 1 − αt2 − βt2 − 0.1η αt+1 /αt αt+1 /βt+1 = ≥ αt /βt βt+1 /βt 1 + η (1 − 3αt2 − 3βt2 ) + 0.1η 1.8η =1+ 1 + η (1 − 3αt2 − 3βt2 ) + 0.1η 1.8η ≥1+ ≥1+η 1 + 2η for η > 0 sufficiently small. This immediately implies that   log αT 2/γ /β1,1 1 1,1 T2 − T1 .  . log (1 + η) η Moreover, combine (84) and (85) to arrive at αt+1 ≥ {1 − 3.1η} αt , Similarly, one can show that βt+1 ≥ {1 − 5.1η} βt .

35

(86)

C Proof of Lemma 2

C.1 Proof of (41a)

In view of the gradient update rule (3), we can express the signal component xt+1 as follows || m

xt+1 k

=

xtk

i η X h > t 3 t ai,1 . ai x − a2i,1 a> − i x m i=1

t t > t Expanding this expression using a> i x = xk ai,1 + ai,⊥ x⊥ and rearranging terms, we are left with

xt+1 k

=

xtk

m m h h  i t 1 X  i 1 X t 2 t 2 4 + η 1 − xk xk · · a + η 1 − 3 xk a3 a> xt m i=1 i,1 m i=1 i,1 i,⊥ ⊥ {z } | {z } | :=J1

:=J2

m m 1 X > t 2 2 1 X > t 3 − 3ηxtk · ai,⊥ x⊥ ai,1 − η · a x ai,1 . m i=1 m i=1 i,⊥ ⊥ | {z } | {z } :=J3

:=J4

In the sequel, we control the above four terms J1 , J2 , J3 and J4 separately. • With regard to the first term J1 , it follows from the standard concentration inequality for Gaussian polynomials [SS12, Theorem 1.9] that ! m 1 X 1/4 1/2 4 P ai,1 − 3 ≥ τ ≤ e2 e−c1 m τ m i=1 for some absolute constant c1 > 0. Taking τ 

3 log √ m m

 reveals that with probability exceeding 1−O m−10 ,

! m h 2 i t 1 X 4 ai,1 − 3 η 1 − xtk xk m i=1

h 2 i t J1 = 3η 1 − xtk xk +

h 2 i t = 3η 1 − xtk xk + r1 , where the remainder term r1 obeys  |r1 | = O

(87)

 η log3 m t √ xk . m

Here, the last line also uses the fact that

2 2 1 − xtk ≤ 1 + xt 2 . 1,

(88)

with the last relation coming from the induction hypothesis (40e). • For the third term J3 , it is easy to see that " # m m

t 2 1 X > \ 2 1 X > t 2 2 > t>

a x ai,1 − x⊥ 2 = x⊥ a x ai,⊥ ai,⊥ −In−1 xt⊥ , m i=1 i,⊥ ⊥ m i=1 i | {z }

(89)

:=U

where U − In−1 is a submatrix of the following matrix (obtained by removing its first row and column) m

 1 X > \ 2 > a x ai a> i − In + 2e1 e1 . m i=1 i 36

(90)

This fact combined with Lemma 14 reveals that

s m

1 X

  n log3 m 2

\ > kU − In−1 k ≤ a> ai a> i x i − In + 2e1 e1 .

m

m i=1  with probability at least 1 − O m−10 , provided that m  n log3 m. This further implies

2 J3 = 3η xt⊥ 2 xtk + r2 ,

(91)

where the size of the remaining term r2 satisfies s s

n log3 m t n log3 m t 2 |r2 | . η xk xt⊥ 2 . η xk . m m 2

2

Here, the last inequality holds under the hypothesis (40e) that kxt⊥ k2 ≤ kxt k2 . 1. • When it comes to J2 , our analysis relies on the random-sign sequence {xt,sgn }. Specifically, one can decompose m m m  1 X 3 > t 1 X 3 > t,sgn 1 X 3 > ai,1 ai,⊥ x⊥ = ai,1 ai,⊥ x⊥ + ai,1 ai,⊥ xt⊥ − xt,sgn . (92) ⊥ m i=1 m i=1 m i=1 t,sgn For the first term on the right-hand side of (92), note that |ai,1 |3 a> is statistically independent of i,⊥ x⊥ P m t,sgn 1 3 > a a x as a weighted sum of the ξi ’s and apply the ξi = sgn (ai,1 ). Therefore we can treat m i=1 i,1 i,⊥ ⊥ Bernstein inequality (see Lemma 11) to arrive at m m 1 X 1 X    1 p 3 t,sgn t,sgn a3i,1 a> ξi |ai,1 | a> V1 log m + B1 log m (93) = . i,⊥ x⊥ i,⊥ x⊥ m m m i=1 i=1

 with probability exceeding 1 − O m−10 , where V1 :=

m X

|ai,1 |

6

t,sgn a> i,⊥ x⊥

2

and

3 t,sgn . B1 := max |ai,1 | a> i,⊥ x⊥ 1≤i≤m

i=1

Make use ofLemma 12 and the incoherence condition (62d) to deduce that with probability at least 1 − O m−10 , m

2 1 1 X 6 t,sgn 2

V1 = |ai,1 | a> . xt,sgn i,⊥ x⊥ ⊥ 2 m m i=1

with the proviso that m  n log5 m. Furthermore, the incoherence condition (62d) together with the fact (56) implies that

. B1 . log2 m xt,sgn ⊥ 2 Substitute the bounds on V1 and B1 back to (93) to obtain r r m 1 X 3

t,sgn

log m log m t,sgn t,sgn 3 >

x

+

x

 log m xt,sgn ai,1 ai,⊥ x⊥ . ⊥ ⊥ ⊥ 2 2 2 m m m m i=1

(94)

as long as m & log5 m. Additionally, regarding the second term on the right-hand side of (92), one sees that m m  1 X > \ 2 1 X 3 > t,sgn  t ai,1 ai,⊥ xt⊥ − xt,sgn = a x ai,1 a> , (95) i,⊥ x⊥ − x⊥ ⊥ m i=1 m i=1 i | {z } :=u>

37

where u is the first column of (90) without the first entry. Hence we have s m 1 X

 n log3 m t,sgn t,sgn t

xt⊥ − xt,sgn , a3i,1 a> ≤ kuk2 xt⊥ − x⊥ 2 . i,⊥ x⊥ − x⊥ ⊥ 2 m m i=1

(96)

 with probability exceeding 1 − O m−10 , with the proviso that m  n log3 m. Substituting the above two bounds (94) and (96) back into (92) gives m m m 1 X 1 X 1 X t,sgn t,sgn  3 > t 3 > 3 > t ai,1 ai,⊥ x⊥ ≤ ai,1 ai,⊥ x⊥ + ai,1 ai,⊥ x⊥ − x⊥ m m m i=1 i=1 i=1 s r 3

log m

xt,sgn + n log m xt⊥ − xt,sgn . . ⊥ ⊥ 2 2 m m As a result, we arrive at the following bound on J2 :   s r log m 3

 n log m 2

xt,sgn +

xt⊥ − xt,sgn  |J2 | . η 1 − 3 xtk  ⊥ ⊥ 2 2 m m s r 3 (i)

log m

xt,sgn + η n log m xt⊥ − xt,sgn .η ⊥ ⊥ 2 2 m m s r 3 (ii)

log m

xt⊥ + η n log m xt⊥ − xt,sgn , . η ⊥ 2 2 m m



≤ kxt k + xt − xt,sgn where (i) uses (88) again and (ii) comes from the triangle inequality xt,sgn ⊥ ⊥ ⊥ ⊥ 2 2 2 q q log m n log3 m and the fact that . m ≤ m • It remains to control J4 , towards which we resort to the random-sign sequence {xt,sgn } once again. Write m

m

m

i 1 X > t 3 1 X h > t 3 1 X > t,sgn 3 t,sgn 3 ai,⊥ x⊥ ai,1 = ai,⊥ x⊥ ai,1 + ai,⊥ x⊥ − a> ai,1 . i,⊥ x⊥ m i=1 m i=1 m i=1 t,sgn For the first term in (97), since ξi = sgn (ai,1 ) is statistically independent of a> i,⊥ x⊥ upper bound the first term using the Bernstein inequality (see Lemma 11) as m 1 X  1 p t,sgn 3 > ai,⊥ x⊥ |ai,1 | ξi . V2 log m + B2 log m , m m i=1

3

(97)

|ai,1 |, we can

where the quantities V2 and B2 obey V2 :=

m X

t,sgn a> i,⊥ x⊥

6

2

|ai,1 |

and

i=1

t,sgn 3 B2 := max a> |ai,1 | . i,⊥ x⊥ 1≤i≤m

Using similar arguments as in bounding (93) yields

6

V2 . m xt,sgn ⊥ 2

and

3

B2 . log2 m xt,sgn ⊥ 2

with the proviso that m  n log5 m and r r m 1 X

3 log3 m t,sgn 3

log m t,sgn 3 t,sgn >

x

x

+

 log m xt,sgn 3 , ai,⊥ x⊥ |ai,1 | ξi . ⊥ ⊥ ⊥ 2 2 2 m m m m i=1 38

(98)

with probability exceeding 1 − O(m−10 ) as soon as m & log5 m. Regarding the second term in (97), m h 1 X i   3 3 t,sgn t ai,1 a> − a> i,⊥ x⊥ i,⊥ x⊥ m i=1 m n h   > t,sgn io (i) 1 X t,sgn  t,sgn 2 t > t 2 > > t a = x − x a x + a x + a x a x a> i,1 i,⊥ ⊥ i,⊥ ⊥ i,⊥ ⊥ i,⊥ ⊥ i,⊥ ⊥ ⊥ m i=1 v v u m h m h i2 u (ii) u 1 X u1 X  i   t − xt,sgn t 4 + 5 a> xt,sgn 4 a2 . t (99) ≤ t a> x 5 a> i,1 i,⊥ ⊥ i,⊥ ⊥ ⊥ i,⊥ x⊥ m i=1 m i=1  Here, the first equality (i) utilizes the elementary identity a3 − b3 = (a − b) a2 + b2 + ab , and (ii) follows from the Cauchy-Schwarz inequality as well as the inequality (a2 + b2 + ab)2 ≤ (1.5a2 + 1.5b2 )2 ≤ 5a4 + 5b4 . Use Lemma 13 to reach v v u u m h i u 2 u1 X > t,sgn  > t t ai,⊥ x⊥ − x⊥ = t xt⊥ − xt,sgn ⊥ m i=1

m

1 X ai,⊥ a> i,⊥ m i=1

!

 t

. xt⊥ − xt,sgn . x⊥ − xt,sgn ⊥ ⊥ 2

Additionally, combining Lemma 12 and the incoherence conditions (62b) and (62d), we can obtain v u m h u1 X



  i t 4 + 5 a> xt,sgn 4 a2 . xt 2 + xt,sgn 2 . 1, t 5 a> ⊥ 2 i,1 i,⊥ x⊥ i,⊥ ⊥ ⊥ 2 m i=1 as long as m  n log6 m. Here, the last relation comes from the norm conditions (40e) and (61b). These in turn imply m h 1 X i

 t,sgn 3 > t 3 >

. ai,⊥ x⊥ − ai,⊥ x⊥ ai,1 . xt⊥ − xt,sgn (100) ⊥ 2 m i=1 Combining the above bounds (98) and (100), we get m m h 1 X 1 X i    3 3 3 t,sgn t,sgn t a> ai,1 + η a> − a> |J4 | ≤ η ai,1 i,⊥ x⊥ i,⊥ x⊥ i,⊥ x⊥ m m i=1 i=1 r

log m

xt,sgn 3 + η xt⊥ − xt,sgn .η ⊥ ⊥ 2 2 m r

log m

xt,sgn + η xt⊥ − xt,sgn .η ⊥ ⊥ 2 2 m r

log m

xt⊥ + η xt⊥ − xt,sgn , .η ⊥ 2 2 m where the penultimate inequality arises from the norm condition (61b) and q the last one comes from the

t,sgn

t

t,sgn t

triangle inequality x⊥ ≤ kx⊥ k2 + x⊥ − x⊥ and the fact that logmm ≤ 1. 2 2 • Putting together the above estimates for J1 , J2 , J3 and J4 , we reach xt+1 = xtk + J1 − J3 + J2 − J4 k h

2 2 i t = xtk + 3η 1 − xtk xk − 3η xt⊥ 2 xt|| + R1 n 

2 o = 1 + 3η 1 − xt 2 xtk + R1 , 39

(101)

where R1 is the residual term obeying s r

n log3 m t log m

xt⊥ + η xt − xt,sgn . xk + η |R1 | . η 2 2 m m Substituting the hypotheses (40) into (101) and recalling that αt = hxt , x\ i lead us to conclude that  s  ! r 3 n 

t 2 o n log m log m αt+1 = 1 + 3η 1 − x 2 αt + O η αt  + O η βt m m   s  t 5 n log m  1 C3 + O ηαt 1 + log m m  o n 

2 (102) = 1 + 3η 1 − xt 2 + ηζt αt , for some |ζt | 

1 log m ,

provided that s

n log3 m 1  m log m r 1 log m βt  αt m log m s  t 1 n log5 m 1 1+ C3  . log m m log m

(103a) (103b) (103c)

5 Here, the first condition (103a) naturally holds under the √ sample complexity m  n log m, whereas the t second condition (103b) is true since βt ≤ kx k2 . αt n log m (cf. the induction hypothesis (40f)) and m  n log4 m. For the last condition (103c), observe that for t ≤ T0 = O (log n),

 1+

1 log m

t = O (1) ,

which further implies  1+

1 log m

s

t C3

5

n log m . C3 m

s

n log5 m 1  m log m

as long as the number of samples obeys m  n log7 m. This concludes the proof.

C.2

Proof of (41b)

Given the gradient update rule (3), the perpendicular component xt+1 can be decomposed as ⊥ m

t xt+1 ⊥ = x⊥ −

= xt⊥ +

 i η X h > t 2 \ 2 t ai x − a> x ai,⊥ a> i i x m i=1 m m η X > \ 2 η X > t 3 t ai x ai,⊥ a> x − a x ai,⊥ . i m i=1 m i=1 i | {z } | {z } :=v1

:=v2

In what follows, we bound v1 and v2 in turn.

40

(104)

t t > t • We begin with v1 . Using the identity a> i x = ai,1 xk + ai,⊥ x⊥ , one can further decompose v1 into the following two terms: m

m

1 1 X > \ 2 1 X > \ 2 t v1 = xtk · ai x ai,1 ai,⊥ + a x ai,⊥ a> i,⊥ x⊥ η m i=1 m i=1 i = xtk u + U xt⊥ , where U , u are as defined, respectively, in (89) and (95). Recall that we have shown that s s n log3 m n log3 m and kU − In−1 k . kuk2 . m m  hold with probability exceeding 1 − O m−10 . Consequently, one has v1 = ηxt⊥ + r1 ,

(105)

where the residual term r1 obeys s kr1 k2 . η

n log3 m

xt⊥ + η 2 m

s

n log3 m t xk . m

(106)

• It remains to bound v2 in (104). To this end, we make note of the following fact m

m

m

3 1 X 3 1 X > t 3 1 X > t 3 ai x ai,⊥ = ai,⊥ x⊥ ai,⊥ + xtk a ai,⊥ m i=1 m i=1 m i=1 i,1 +

m 3xtk X

m

t ai,1 a> i,⊥ x⊥

2

ai,⊥ + 3 xtk

i=1

m 2 1 X t a2 ai,⊥ a> i,⊥ x⊥ m i=1 i,1

m m 3xtk X  3 2 1 X > t 3 t 2 ai,⊥ x⊥ ai,⊥ + ai,1 a> ai,⊥ + xtk u + 3 xtk U xt⊥ . (107) = i,⊥ x⊥ m i=1 m i=1

Applying Lemma 14 and using the incoherence condition (62b), we get

s

m

1 X

 n log3 m 2 2

t t t> t >

xt⊥ 2 , x I − 2x x . a> x a a −

n−1 i,⊥ ⊥ 2 ⊥ ⊥ i,⊥ ⊥ i,⊥ 2

m m i=1 s

  >  2 m 

1 X



n log3 m 0 0 0 2

xt⊥ 2 , a> ai a> − xt⊥ 2 In − 2

. t t t i i 2 x⊥ x⊥ x⊥

m

m i=1

as long as m  n log3 m. These two together allow us to derive



( ) m m

1 X

1 X

t 2 t

t 2  2



> t 3 > t > t t> t ai,⊥ x⊥ ai,⊥ − 3 x⊥ 2 x⊥ = ai,⊥ x⊥ ai,⊥ ai,⊥ − x⊥ 2 In−1 − 2x⊥ x⊥ x⊥

m

m

i=1 i=1 2 2

m

1 X

t 2



t t> t t 2

≤ a> ai,⊥ a> i,⊥ x⊥ i,⊥ − x⊥ 2 In−1 − 2x⊥ x⊥ x⊥ 2

m

i=1 s

n log3 m

xt⊥ 3 ; . 2 m and

 2   > m m 

1 X

1 X

t 2  0 0 0



> t 2 > >

ai,1 ai,⊥ x⊥ ai,⊥ ≤ ai ai ai − x⊥ 2 In − 2

t t t x⊥ x⊥ x⊥

m

m

i=1 i=1 2 {z } | :=A

41

s

n log3 m

xt⊥ 2 , 2 m  Pm 1 > t 2 where the second one follows since m i=1 ai,1 ai,⊥ x⊥ ai,⊥ is the first column of A except for the first entry. Substitute the preceding bounds into (107) to arrive at

m

1 X

t 2 t  2 t

> t 3 t ai x ai,⊥ − 3 x⊥ 2 x⊥ − 3 xk x⊥

m i=1 2



m m

1 X 1 X



  3 2

t

> t t t > t 2

≤ ai,⊥ x⊥ ai,⊥ − 3 x⊥ 2 x⊥ + 3 xk ai,1 ai,⊥ x⊥ ai,⊥

m

m

i=1 i=1 2 2





t 3 t 2 t (U − In−1 ) x⊥ 2 + xk u + 3 xk 2 s s 3 





n log3 m 

xt⊥ 3 + xt xt⊥ 2 + xt 3 + xt 2 xt⊥ . n log m xt . k k k 2 2 2 2 m m .

with probability at least 1 − O(m−10 ). Here, the last relation holds owing to the norm condition (40e) and the fact that

t 3 t t 2 t 3 t 2 t





x⊥ + x x⊥ + x + x x⊥  xt 3 . xt . 2

k

2

k

k

2

2

2

This in turn tells us that m

2

2 2 η X > t 3 ai x ai,⊥ = 3η xt⊥ 2 xt⊥ + 3η xtk xt⊥ + r2 = 3η xt 2 xt⊥ + r2 , v2 = m i=1 where the residual term r2 is bounded by s kr2 k2 . η

n log3 m

xt . 2 m

• Putting the above estimates on v1 and v2 together, we conclude that n 

t 2 o t t

x xt+1 = x + v − v = 1 + η 1 − 3 x⊥ + r3 , 1 2 ⊥ ⊥ 2 where r3 = r1 − r2 satisfies s kr3 k2 . η

n log3 m

xt . 2 m

Plug in the definitions of αt and βt to realize that βt+1

for some |ρt | 

1 log m ,

 s  3 n 

t 2 o n log m = 1 + η 1 − 3 x 2 βt + O η (αt + βt ) m n  o

2  = 1 + η 1 − 3 xt 2 + ηρt βt ,

with the proviso that m  n log5 m and s n log3 m 1 αt  βt . m log m

(108)

The last condition holds true since s s 3 n log m n log3 m 1 1 1 αt .   βt , m m log m log m log5 m where we have used the assumption αt . n log

11

1 log5 m

(see definition of T0 ), the sample size condition m 

m and the induction hypothesis βt ≥ c5 (see (40e)). This finishes the proof. 42

D

Proof of Lemma 4

It follows from the gradient update rules (3) and (29) that    xt+1 − xt+1,(l) = xt − η∇f xt − xt,(l) − η∇f (l) xt,(l)      = xt − η∇f xt − xt,(l) − η∇f xt,(l) + η∇f (l) xt,(l) − η∇f xt,(l)   Z 1   i η h > t,(l) 2 t,(l) 2 \ 2 al a> , (109) ∇ f (x (τ )) dτ xt − xt,(l) − − a> = In − η al x l x l x m 0  where we denote x (τ ) := xt + τ xt,(l) − xt . Here, the last identity is due to the fundamental theorem of calculus [Lan93, Chapter XIII, Theorem 4.2]. • Controlling the first term in (109) requires exploring the properties of the Hessian ∇2 f (x). Since x (τ ) lies between xt and xt,(l) for any 0 ≤ τ ≤ 1, it is easy to see from (61) and (62) that p p kx⊥ (τ )k2 ≤ kx (τ )k2 ≤ 2C5 and max a> log m . log m kx (τ )k2 . (110) i x (τ ) . 1≤i≤m

In addition, combining (61) and (63) leads to



kx⊥ (τ )k2 ≥ xt⊥ 2 − xt − xt,(l) 2 ≥ c5 /2 − log−1 m ≥ c5 /4.

(111)

Armed with these bounds, we can readily apply Lemma 15 to obtain

n  o

2 >

In − η∇2 f (x (τ )) − 1 − 3η kx (τ )k2 + η In + 2ηx\ x\> − 6ηx (τ ) x (τ ) s s 3 n o n log m n log3 m 2 max kx(τ )k2 , 1 . η . .η m m This further allows one to derive

 

In − η∇2 f (x (τ )) xt − xt,(l)

2

 s 

n 3  o

 n log m

2 >

xt − xt,(l)  . xt − xt,(l) + O η ≤ 1 − 3η kx (τ )k2 + η In + 2ηx\ x\> − 6ηx (τ ) x (τ ) 2 m 2 Moreover, we can apply the triangle inequality to get

n  o 

> 2 xt − xt,(l)

1 − 3η kx (τ )k2 + η In + 2ηx\ x\> − 6ηx (τ ) x (τ )

n

2  o  

2 > t t,(l) ≤ 1 − 3η kx (τ )k2 + η In − 6ηx (τ ) x (τ ) x −x

+ 2ηx\ x\> xt − xt,(l) 2 2

n

 o  (i)

t,(l) 2 > = 1 − 3η kx (τ )k2 + η In − 6ηx (τ ) x (τ ) xt − xt,(l) + 2η xtk − xk 2

(ii)







t,(l) 2 1 − 3η kx (τ )k2 + η xt − xt,(l) 2 + 2η xtk − xk ,

 t,(l) where (i) holds since x\> xt − xt,(l) = xtk − xk (recall that x\ = e1 ) and (ii) follows from the fact that 

 2 > 1 − 3η kx (τ )k2 + η In − 6ηx (τ ) x (τ )  0,

as long as η ≤ 1/ (18C5 ). This further reveals

 

In − η∇2 f (x (τ )) xt − xt,(l) 2

43

 

  3

n log m  xt − xt,(l) + 2η xtk − xt,(l) ≤ 1+η 1− + O η k 2   m   s   3    (i) 



t 2  n log m  xt − xt,(l) + 2η xtk − xt,(l) ≤ 1 + η 1 − 3 x 2 + O η xt − xt,(l) 2 + O η k 2   m (ii)



 s



2 3 kx (τ )k2



o n 



2  t,(l) 1 + η 1 − 3 xt 2 + ηφ1 xt − xt,(l) 2 + 2η xtk − xk ,

for some |φ1 | 

1 log m ,

(112)

where (i) holds since for every 0 ≤ τ ≤ 1

2

2 2 2 kx (τ )k2 ≥ xt 2 − kx (τ )k2 − xt 2

2

 ≥ xt 2 − x (τ ) − xt 2 kx (τ )k2 + xt 2 



2 ≥ xt 2 − O xt − xt,(l) 2 ,

(113)

and (ii) comes from the fact (63a) and the sample complexity assumption m  n log5 m. • We then move on to the second term of (109). Observing that xt,(l) is statistically independent of al , we have

h



1

  i  i 1 h > t,(l) 2 > t,(l) 2 > \ 2 > t,(l) \ 2 > t,(l)

a x − a x a a x + a> al x kal k2 l l l l l x

m

≤ m al x 2

√ p 1

. · log m · log m xt,(l) · n m 2 p

3 n log m t,(l)  (114)

x , m 2 where the second inequality makes use of the facts (56), (57) and the standard concentration results p

> t,(l) p al x . log m xt,(l) 2 . log m. • Combine the previous two bounds (112) and (114) to reach

t+1

x − xt+1,(l) 2





  Z 1 2

1

  > t,(l) > \ 2 2 t t,(l) > t,(l)

≤ I − η ∇ f (x(τ )) dτ x − x + η a x − a x a a x l l l l

m

0 2 !2 p



3 n  o

t 2 

η n log m

t,(l) t,(l) ≤ 1 + η 1 − 3 x 2 + ηφ1 xt − xt,(l) 2 + 2η xtk − xk + O

x m 2 ! p n  o

2 



η n log3 m

xt + 2η xt − xt,(l) . ≤ 1 + η 1 − 3 xt 2 + ηφ1 xt − xt,(l) 2 + O k k 2 m Here the last relation holds because of the triangle inequality

t,(l)



x ≤ xt + xt − xt,(l) 2 2 2 √ and the fact that

n log3 m m



1 log m .

In view of the inductive hypotheses (40), one has  o 

t+1

(i) n

2 

x − xt+1,(l) 2 ≤ 1 + η 1 − 3 xt 2 + ηφ1 βt 1 + 44

1 log m

t

p n log5 m C1 m

!

p t 1 n log12 m +O (αt + βt ) + 2ηαt 1 + C2 log m m p   t 5  o n  (ii)

2 1 n log m ≤ 1 + η 1 − 3 xt 2 + ηφ2 βt 1 + C1 log m m p   t+1 5 (iii) n log m 1 C1 ≤ βt+1 1 + , log m m η

p

n log3 m m



for some |φ2 |  log1 m , where the inequality (i) uses kxt k2 ≤ |xtk | + kxt⊥ k2 = αt + βt , the inequality (ii) holds true as long as p p  t n log3 m n log5 m 1 1 (αt + βt )  βt 1 + C1 , (115a) m log m log m m p p n log12 m n log5 m 1 αt C2  βt C 1 . (115b) m log m m Here, the first condition (115a) comes from the fact that for t < T0 , p p p n log3 m n log3 m n log3 m (αt + βt )  βt  C 1 βt , m m m as long as C1 > 0 is sufficiently large. The other one (115b) is valid owing to the assumption of Phase I αt  1/ log5 m. Regarding the inequality (iii) above, it is easy to check that for some |φ3 |  log1 m ,   n  o

2  βt+1 1 + η 1 − 3 xt 2 + ηφ2 βt = + ηφ3 βt βt    βt+1 βt+1 + ηO φ3 βt = βt βt   1 , ≤ βt+1 1 + log m where the second equality holds since

βt+1 βt

(116)

 1 in Phase I.

The proof is completed by applying the union bound over all 1 ≤ l ≤ m.

E

Proof of Lemma 5

Use (109) once again to deduce  t+1,(l) t+1 xt+1 − xk = e> − xt+1,(l) 1 x k   Z 1   i > η h > t,(l) 2 > \ 2 t,(l) 2 = e1 In − η ∇ f (x (τ )) dτ xt − xt,(l) − al x − a> e1 al a> l x l x m 0   Z 1   i η h > t,(l) 2 t,(l) 2 t t,(l) > \ 2 t,(l) = xtk − xk − η e> ∇ f (x (τ )) dτ x − x − a x − a x al,1 a> , (117) 1 l l l x m 0  where we recall that x (τ ) := xt + τ xt,(l) − xt . We begin by controlling the second term of (117). Applying similar arguments as in (114) yields h

1 log2 m   i

t,(l) > t,(l) 2 > \ 2 > t,(l) − al x al,1 al x .

x m al x m 2  with probability at least 1 − O m−10 . 45

Regarding the first term in (117), one can use the decomposition  t,(l)  t,(l)  t t,(l) t a> = ai,1 xtk − xk + a> i x −x i,⊥ x⊥ − x⊥ to obtain that m

2 e> 1∇ f

t

t,(l)

(x (τ )) x − x



 i  2 1 Xh \ 2 t t,(l) 3 a> ai,1 a> = − a> i x (τ ) i x i x −x m i=1 =

m  i 2 2 1 Xh t,(l)  \ 2 ai,1 xtk − xk 3 a> − a> i x (τ ) i x m i=1 | {z } :=ω1 (τ )

m 2  i 1 Xh t,(l)  \ 2 t . + 3 a> − a> ai,1 a> i x (τ ) i x i,⊥ x⊥ − x⊥ m i=1 {z } | :=ω2 (τ )

In the sequel, we shall bound ω1 (τ ) and ω2 (τ ) separately. • For ω1 (τ ), Lemma 14 together with the facts (110) tells us that m h 1 X i h 2 2  i 2 2 > > \ 2 3 ai x (τ ) − ai x ai,1 − 3 kx (τ )k2 + 6 xk (τ ) − 3 m i=1 s s n o n log3 m n log3 m 2 . max kx (τ )k2 , 1 . , m m which further implies that   2 t,(l)  2 + r1 ω1 (τ ) = 3 kx (τ )k2 + 6 xk (τ ) − 3 xtk − xk with the residual term r1 obeying s |r1 | = O 

 n log3 m t t,(l) xk − xk  . m

• We proceed to bound ω2 (τ ). Decompose w2 (τ ) into the following: ω2 (τ ) =

m m 2 3 X > 1 X > \ 2 t,(l)  t,(l)  t t ai x (τ ) ai,1 a> a x ai,1 a> − . i,⊥ x⊥ − x⊥ i,⊥ x⊥ − x⊥ m i=1 m i=1 i | {z } | {z } :=ω4

:=ω3 (τ )

\ – The term ω4 is relatively simple to control. Recognizing a> i x m

ω4 =

2

= a2i,1 and ai,1 = ξi |ai,1 |, one has

m

1 X 1 X t,sgn,(l)  t,(l) t,sgn,(l)  3 3 t,sgn t ξi |ai,1 | a> − x⊥ + ξi |ai,1 | a> − xt,sgn + x⊥ . i,⊥ x⊥ i,⊥ x⊥ − x⊥ ⊥ m i=1 m i=1

t,sgn,(l)  3 t,sgn In view of the independence between ξi and |ai,1 | a> − x⊥ , one can thus invoke the i,⊥ x⊥ Bernstein inequality (see Lemma 11) to obtain m 1 X  1 p t,sgn,(l)  3 > t,sgn ξi |ai,1 | ai,⊥ x⊥ − x⊥ V1 log m + B1 log m (118) . m m i=1

46

 with probability at least 1 − O m−10 , where V1 :=

m X

2 t,sgn,(l)  6 t,sgn x − x |ai,1 | a> i,⊥ ⊥ ⊥

and

t,sgn,(l)  3 t,sgn x − x B1 := max |ai,1 | a> . i,⊥ ⊥ ⊥

i=1

1≤i≤m

Regarding V1 , one can combine the fact (56) and Lemma 14 to reach m  > 1 X 1 t,sgn,(l) 2 V1 . log2 m xt,sgn − x |ai,1 | ai,⊥ a> i,⊥ ⊥ ⊥ m m i=1

2

t,sgn,(l) − x . log2 m xt,sgn

. ⊥ ⊥

! t,sgn,(l) 

xt,sgn − x⊥ ⊥

2

For B1 , it is easy to check from (56) and (57) that q

t,sgn,(l) B1 . n log3 m xt,sgn − x⊥

. ⊥ 2

The previous two bounds taken collectively yield s m 1 X

2 pn log5 m

  log3 m

t,sgn

t,sgn t,sgn,(l) t,sgn,(l) t,sgn,(l) 3 > t,sgn ξi |ai,1 | ai,⊥ x⊥ − x⊥

x⊥ − x⊥

+

x⊥ − x⊥

. m m m 2 2 i=1 s

log3 m

t,sgn t,sgn,(l) . (119)

x⊥ − x⊥

, m 2 as long as m & n log2 m. The second term in ω4 can be simply controlled by the Cauchy-Schwarz inequality and Lemma 14. Specifically, we have m 1 X   t,(l) t,sgn,(l) 3 > t,sgn t ξi |ai,1 | ai,⊥ x⊥ − x⊥ − x⊥ + x⊥ m i=1

m

1 X



t t,(l) t,sgn,(l) 3 ≤ ξi |ai,1 | a> − xt,sgn + x⊥

i,⊥ x⊥ − x⊥ ⊥

m

2 i=1 2 s

n log3 m

t t,(l) t,sgn,(l) t,sgn . (120)

x⊥ − x⊥ − x⊥ + x⊥

, m 2 where the second relation holds due to Lemma 14. Take the preceding two bounds (119) and (120) collectively to conclude that s s

log3 m n log3 m

t,sgn

t t,sgn,(l) t,(l) t,sgn,(l) t,sgn |ω4 | .

x⊥ − x⊥

+

x⊥ − x⊥ − x⊥ + x⊥

m m 2 2 s s

log3 m n log3 m

t

t t,(l) t,(l) t,sgn,(l) t,sgn .

x⊥ − x⊥ +

x⊥ − x⊥ − x⊥ + x⊥

, m m 2 2 where the second line follows from the triangle inequality





t,sgn

t,sgn,(l) t,(l) t,(l) t,sgn,(l) t,sgn

x⊥ − x⊥

≤ xt⊥ − x⊥ + xt⊥ − x⊥ − x⊥ + x⊥

2

and the fact that

q

log3 m m



q

2

n log3 m . m

– It remains to bound ω3 (τ ). To this end, one can decompose ω3 (τ ) =

m 2 2 i 3 Xh > t,(l)  t ai,1 a> ai x (τ ) − asgn> x (τ ) i,⊥ x⊥ − x⊥ i m i=1 | {z } :=θ1 (τ )

47

2

m  2  2  3 X  sgn> t,(l)  sgn> sgn t x (τ ) ai,1 a> x (τ ) − ai + ai i,⊥ x⊥ − x⊥ m i=1 | {z } :=θ2 (τ )

+

3 m |

m  X

xsgn (τ ) asgn> i

2

t,sgn,(l) 

t,sgn ai,1 a> − x⊥ i,⊥ x⊥

i=1

{z

}

:=θ3 (τ )

m 2   3 X  sgn> sgn t,(l) t,sgn,(l) t + ai x (τ ) ai,1 a> − xt,sgn + x⊥ , i,⊥ x⊥ − x⊥ ⊥ m i=1 {z } | :=θ4 (τ )

 where we denote xsgn (τ ) = xt,sgn + τ xt,sgn,(l) − xt,sgn . A direct consequence of (61) and (62) is that p sgn> sgn x (τ ) . log m. (121) ai  Recalling that ξi = sgn (ai,1 ) and ξisgn = sgn asgn i,1 , one has sgn> a> x (τ ) = (ξi − ξisgn ) |ai,1 | xk (τ ) , i x (τ ) − ai sgn> a> x (τ ) = (ξi + ξisgn ) |ai,1 | xk (τ ) + 2a> i x (τ ) + ai i,⊥ x⊥ (τ ) ,

which implies that 2     2  sgn> sgn> sgn> a> − ai x (τ ) = a> x (τ ) · a> x (τ ) i x (τ ) i x (τ ) − ai i x (τ ) + ai  = (ξi − ξisgn ) |ai,1 | xk (τ ) (ξi + ξisgn ) |ai,1 | xk (τ ) + 2a> i,⊥ x⊥ (τ ) = 2 (ξi − ξisgn ) |ai,1 | xk (τ ) a> i,⊥ x⊥ (τ )

(122)

2

owing to the identity (ξi − ξisgn ) (ξi + ξisgn ) = ξi2 − (ξisgn ) = 0. In light of (122), we have m   6 X t,(l) > t (ξi − ξisgn ) |ai,1 | xk (τ ) a> i,⊥ x⊥ (τ ) ai,1 ai,⊥ x⊥ − x⊥ m i=1 " # m   1 X t,(l) 2 sgn > > = 6xk (τ ) · x⊥ (τ ) (1 − ξi ξi ) |ai,1 | ai,⊥ ai,⊥ xt⊥ − x⊥ . m i=1

θ1 (τ ) =

First note that



m m

1 X

1 X



2 2 (1 − ξi ξisgn ) |ai,1 | ai,⊥ a> |ai,1 | ai,⊥ a>

i,⊥ ≤ 2 i,⊥ . 1,

m

m i=1

(123)

i=1

where the last relation holds due to Lemma 14. This results in the following upper bound on θ1 (τ )



t,(l) t,(l) |θ1 (τ )| . xk (τ ) kx⊥ (τ )k2 xt⊥ − x⊥ . xk (τ ) xt⊥ − x⊥ , 2

2

where we have used the fact that kx⊥ (τ )k2 . 1 (see (110)). Regarding θ2 (τ ), one obtains m

θ2 (τ ) =

ih i   3 X h sgn> t,(l) t ai (x (τ ) − xsgn (τ )) asgn> (x (τ ) + xsgn (τ )) ai,1 a> . i,⊥ x⊥ − x⊥ i m i=1

Apply the Cauchy-Schwarz inequality to reach v v u m h m i2 h i2 u h  i2 u1 X u1 X 2 sgn> > t − xt,(l) sgn (τ )) sgn (τ )) t |θ2 (τ )| . t (x (τ ) + x asgn> (x (τ ) − x a |a | a x i,1 i ⊥ ⊥ i,⊥ m i=1 i m i=1 48

v u m h

i2 u1 X

t,(l) .t (x (τ ) − xsgn (τ )) log m · xt⊥ − x⊥ asgn> i m i=1 2

p

t,(l) . log m kx (τ ) − xsgn (τ )k2 xt⊥ − x⊥ . 2

Here the second relation comes from Lemma 14 and the fact that p sgn> (x (τ ) + xsgn (τ )) . log m. ai When it comes to θ3 (τ ), we need to exploit the independence between 2 t,sgn,(l)  t,sgn xsgn (τ ) ai,1 a> − x⊥ . {ξi } and asgn> i,⊥ x⊥ i Similar to (118), one can obtain |θ3 (τ )| .

 1 p V2 log m + B2 log m m

 with probability at least 1 − O m−10 , where V2 :=

m  X

asgn> xsgn (τ ) i

4

  2 t,sgn,(l) 2 t,sgn |ai,1 | a> − x⊥ i,⊥ x⊥

i=1

B2 := max

1≤i≤m



2   > t,sgn,(l) t,sgn sgn asgn> x (τ ) |a | x − x a . i,1 i,⊥ i ⊥ ⊥

It is easy to see from Lemma 14, (121), (56) and (57) that q

2



t,sgn t,sgn,(l) t,sgn,(l) 3 V2 . m log2 m xt,sgn − x and B . n log m − x

x

, 2 ⊥ ⊥ ⊥ ⊥ 2

2

which implies s |θ3 (τ )| . 

log3 m + m

 s p

n log5 m  log3 m

t,sgn

t,sgn t,sgn,(l) t,sgn,(l)

x⊥ − x⊥



x⊥ − x⊥

m m 2 2

with the proviso that m & n log2 m. We are left with θ4 (τ ). Invoking Cauchy-Schwarz inequality, v v u m  m h  i2 4 u u1 X u1 X 2 t − xt,(l) − xt,sgn + xt,sgn,(l) |θ4 (τ )| . t aisgn> xsgn (τ ) t |ai,1 | a> x i,⊥ ⊥ ⊥ ⊥ ⊥ m i=1 m i=1 v u m 

2 u1 X

t,(l) t,sgn,(l) .t aisgn> xsgn (τ ) log m · xt⊥ − x⊥ − xt,sgn + x

⊥ ⊥ m i=1 2

p

t,(l) t,sgn,(l) . log m xt⊥ − x⊥ − xt,sgn + x⊥

, ⊥ 2

√ where we have used the fact that aisgn> xsgn (τ ) . log m. In summary, we have obtained

o n p

t,(l) |ω3 (τ )| . xk (τ ) + log m kx (τ ) − xsgn (τ )k2 xt⊥ − x⊥ 2 s



p log3 m t,sgn

t,sgn,(l) t,(l) t,sgn,(l) t,sgn +

x⊥ − x⊥

+ log m xt⊥ − x⊥ − x⊥ + x⊥

m 2 2   s 

p log3 m 

t t,(l) . xk (τ ) + log m kx (τ ) − xsgn (τ )k2 +

x⊥ − x⊥  m  2 49

+

p



t,(l) t,sgn,(l) + x log m xt⊥ − x⊥ − xt,sgn

, ⊥ ⊥ 2

where the last inequality utilizes the triangle inequality





t,sgn t,(l) t,sgn,(l) t,sgn,(l) t,(l) t,sgn

x⊥ − x⊥

≤ xt⊥ − x⊥ + xt⊥ − x⊥ − x⊥ + x⊥ and the fact that

log3 m m





2

2

2

q

log m. This together with the bound for ω4 (τ ) gives

|ω2 (τ )| ≤ |ω3 (τ )| + |ω4 (τ )|   s

  3 p log m

t t,(l) . xk (τ ) + log m kx (τ ) − xsgn (τ )k2 +

x⊥ − x⊥  m  2

p

t,(l) t,sgn,(l) + log m xt⊥ − x⊥ − xt,sgn + x⊥

, ⊥ 2

as long as m  n log2 m. • Combine the bounds to arrive at    s   Z 1   3  n log m 2 t,(l) t+1,(l) 2 t   xk (τ ) + x − x xt+1 − x = 1 + 3η 1 − kx (τ )k dτ + η · O 2 k k k  k  m 0 



  p log2 m

t,(l)

t,(l) t,sgn,(l) t,sgn +O η

x + O η log m xt⊥ − x⊥ − x⊥ + x⊥

m 2 2     s  

3 p log m

t t,(l) xk (τ ) + log m kx (τ ) − xsgn (τ )k + + O η sup

x − x⊥  . 2 m  ⊥ 2 0≤τ ≤1  To simplify the above bound, notice that for the last term, for any t < T0 . log n and 0 ≤ τ ≤ 1, one has t p  n log12 m xk (τ ) ≤ xt + xt,(l) − xt ≤ αt + αt 1 + 1 C . αt , 2 k k k log m m p as long as m  n log12 m. Similarly, one can show that

  p p

log m kx (τ ) − xsgn (τ )k2 ≤ log m xt − xt,sgn 2 + xt − xt,(l) − xt,sgn + xt,sgn,(l) 2 s  p 5 9 p n log m n log m  . αt log m  + . αt , m m with the proviso that m  n log6 m. Therefore, we can further obtain    s  

2 3  

n log m 2 t+1

t+1,(l)  xtk − xt,(l) xk − xk ≤ 1 + 3η 1 − xt 2 + η · O  xt − xt,(l) + xtk + k   m 2  

  p

log2 m

t,(l) t,sgn,(l) t

+O η x 2 + O η log m xt⊥ − x⊥ − xt,sgn + x

⊥ ⊥ m 2

 

t t,(l) + O ηαt x − x 2

 n  o 

t 2 

t,(l)

≤ 1 + 3η 1 − x 2 + ηφ1 xtk − xk + O ηαt xt − xt,(l) 2  

 2  p

log m t

t,(l) t,sgn,(l) +O η x 2 + O η log m xt⊥ − x⊥ − xt,sgn + x⊥

⊥ m 2 50

for some |φ1 |  log1 m . Here the last inequality comes from the sample complexity m  n log5 m, the assumption αt  log15 m and the fact (63a). Given the inductive hypotheses (40), we can conclude t p n o  

t 2  1 n log12 m t+1 t+1,(l)

C2 ≤ 1 + 3η 1 − x 2 + ηφ1 αt 1 + xk − xk log m m ! p    t 2 p n log9 m η log m 1 C4 +O (αt + βt ) + O η log m · αt 1 + m log m m !  t p 5 1 n log m + O ηαt βt 1 + C1 log m m p t   o (i) n

t 2  n log12 m 1

C2 ≤ 1 + 3η 1 − x 2 + ηφ2 αt 1 + log m m p   t+1 12 (ii) 1 n log m C2 ≤ αt+1 1 + log m m for some |φ2 | 

1 log m .

Here, the inequality (i) holds true as long as p log2 m 1 n log12 m (αt + βt )  αt C2 m log m m p p 9 p n log m 1 n log12 m log mC4  C2 m log m m p p 5 n log m 1 n log12 m βt C1  C2 , m log m m

(124a) (124b) (124c)

where the first condition (124a) is satisfied since (according to Lemma 1) p αt + βt . βt . αt n log m. The second condition (124b) holds as long as C2  C4 . The third one (124c) holds trivially. Moreover, the second inequality (ii) follows from the same reasoning as in (116). Specifically, we have for some |φ3 |  log1 m ,   n  o

t 2  αt+1

1 + 3η 1 − x 2 + ηφ2 αt = + ηφ3 αt αt    αt+1 αt+1 + ηO φ3 αt ≤ αt αt   1 , ≤ αt+1 1 + log m as long as

αt+1 αt

 1.

The proof is completed by applying the union bound over all 1 ≤ l ≤ m.

F

Proof of Lemma 6

By similar calculations as in (109), we get the identity   Z 1    t+1 t+1,sgn 2 ˜ x −x = I −η ∇ f (x (τ )) dτ xt − xt,sgn + η ∇f sgn xt,sgn − ∇f xt,sgn , 0

˜ (τ ) := xt + τ (xt,sgn − xt ). The first term satisfies where x

  Z 1

 2 t t,sgn

I −η ˜ ∇ f ( x (τ )) dτ x − x

0

51

(125)



Z 1

t

2

x − xt,sgn ˜ ∇ f ( x (τ )) dτ I − η ≤

2 0   s    Z 1   3

n log m  xt − xt,sgn , ˜ (τ )k22 dτ + O η ≤ 1 + 3η 1 − kx 2   m 0

(126)

where we have invoked Lemma 15. Furthermore, one has for all 0 ≤ τ ≤ 1

2

2 ˜ (τ )k22 ≥ xt 2 − kx ˜ (τ )k22 − xt 2 kx



2 ˜ (τ ) − xt 2 kx ˜ (τ )k2 + xt 2 ≥ xt 2 − x

2

 ˜ (τ )k2 + xt 2 . ≥ xt 2 − xt − xt,sgn 2 kx ˜ (τ )k2 . 1 reveals that This combined with the norm conditions kxt k2 . 1, kx



2 ˜ (τ )k2 ≥ xt + O xt − xt,sgn , min kx 0≤τ ≤1

2

2

2

and hence we can further upper bound (126) as



 Z 1

 2 t t,sgn

I −η ˜ (τ )) dτ x − x ∇ f (x

 0   s   3 

t

t 2 

n log m  xt − xt,sgn ≤ 1 + 3η 1 − x 2 + η · O  x − xt,sgn 2 + 2   m n  o

2 

≤ 1 + 3η 1 − xt 2 + ηφ1 xt − xt,sgn 2 , for some |φ1 |  log1 m , where the last line follows from m  n log5 m and the fact (63b).  The remainder of this subsection is largely devoted to controlling the gradient difference ∇f sgn xt,sgn −  ∇f xt,sgn in (125). By the definition of f sgn (·), one has   ∇f sgn xt,sgn − ∇f xt,sgn m     o 1 X n sgn> t,sgn 3 sgn sgn> t,sgn  sgn \ 2 > t,sgn 3 > \ 2 > t,sgn = ai x ai − asgn> x a x a − a x a + a x a x ai i i i i i i i m i=1 m m   o 1 X 2  sgn sgn> 1 X n sgn> t,sgn 3 sgn > t,sgn 3 ai x ai − ai x ai − = ai,1 ai ai − ai a> xt,sgn . i m i=1 m i=1 | {z } | {z } :=r1

:=r2

 2 \ 2 Here, the last identity holds because of a> = asgn> x\ = a2i,1 (see (37)). i x i sgn sgn • We begin with the second term r2 . By construction, one has asgn |ai,1 | and ai,1 = i,⊥ = ai,⊥ , ai,1 = ξi ξ1 |ai,1 |. These taken together yield   0 a> sgn sgn> sgn > i,⊥ ai ai − ai ai = (ξi − ξi ) |ai,1 | , (127) ai,⊥ 0

and hence r2 can be rewritten as " r2 =

Pm 3 sgn t,sgn 1 − ξi ) |ai,1 | a> i i,⊥ x⊥ i=1 (ξ m P 3 m sgn 1 − ξi ) |ai,1 | ai,⊥ xt,sgn ·m i=1 (ξi k

# .

(128)

For the first entry of r2 , the triangle inequality gives m m m 1 X 1 X 1 X 3 > 3 3 sgn t,sgn t,sgn sgn t (ξi − ξi ) |ai,1 | ai,⊥ x⊥ ≤ |ai,1 | ξi a> |ai,1 | ξi a> + i,⊥ x⊥ i,⊥ x⊥ m m m i=1 i=1 i=1 | {z } | {z } :=φ1

52

:=φ2

m 1 X  3 sgn > t,sgn t + |ai,1 | ξi ai,⊥ x⊥ − x⊥ . m i=1 | {z } :=φ3

3

t,sgn Regarding φ1 , we make use of the independence between ξi and |ai,1 | a> i,⊥ x⊥  and invoke the Bernstein inequality (see Lemma 11) to reach that with probability at least 1 − O m−10 ,

φ1 .

 1 p V1 log m + B1 log m , m

where V1 and B1 are defined to be V1 :=

m X

6 t,sgn 2 |ai,1 | a> i,⊥ x⊥

and

i=1

B1 := max

1≤i≤m

n o 3 t,sgn |ai,1 | a> . i,⊥ x⊥

It is easy to see from Lemma 12 and the incoherence condition (62d) that with probability exceeding

2



and B1 . log2 m xt,sgn , which implies 1 − O m−10 , V1 . m xt,sgn ⊥ ⊥ 2 2 ! r r

log m log3 m log m t,sgn

xt,sgn , + x⊥  φ1 . ⊥ 2 2 m m m as long as m  log5 m. Similarly, one can obtain r φ2 .

log m

xt⊥ . 2 m

The last term φ3 can be bounded through the Cauchy-Schwarz inequality. Specifically, one has s

m



1 X

n log3 m

3 t

xt,sgn − xt⊥ , |ai,1 | ξisgn ai,⊥ xt,sgn − x . φ3 ≤ ⊥ 2 ⊥ ⊥ 2

m m i=1 2

where the second relation arises from Lemma 14. The previous three bounds taken collectively yield s r m 1 X



 log m n log3 m 3 t,sgn t,sgn t

xt,sgn − xt⊥ x + + x (ξisgn − ξi ) |ai,1 | a> x . ⊥ 2 i,⊥ ⊥ ⊥ ⊥ 2 2 m m m i=1 s r 3

log m t

x⊥ + n log m xt,sgn − xt⊥ . . (129) ⊥ 2 2 m m



≤ kxt k + xt,sgn − xt and Here the second inequality results from the triangle inequality xt,sgn ⊥ 2 ⊥ 2 ⊥ ⊥ 2 q q 3m the fact that logmm ≤ n log . In addition, for the second through the nth entries of r2 , one can again m invoke Lemma 14 to obtain



m m m

1 X

1 X

1 X





3 3 sgn 3 sgn |ai,1 | (ξi − ξi ) ai,⊥ ≤ |ai,1 | ξi ai,⊥ + |ai,1 | ξi ai,⊥



m

m

m i=1 i=1 i=1 2 2 2 s 3 n log m . . (130) m This combined with (128) and (129) yields s s r 3 3



log m t n log m t,sgn n log m kr2 k2 . x⊥ 2 + x⊥ − xt⊥ 2 + xt,sgn . k m m m 53

• Moving on to the term r1 , we can also decompose o    Pm n sgn> t,sgn 3 sgn 1 > t,sgn 3 − a x a x a a i,1 i i,1 i i=1 m o . r1 =   Pm n sgn> t,sgn 3 sgn 1 > t,sgn 3 a x a − a ai,⊥ i x i i=1 i,⊥ m For the second through the nth entries, we see that m m n o (i) 1 X  3  o 1 X n sgn> t,sgn 3 sgn > t,sgn 3 t,sgn 3 ai,⊥ − ai x ai,⊥ = ai,⊥ x xt,sgn − a> ai asgn> i x i m i=1 m i=1   m  2     (ii) 1 X sgn> t,sgn sgn> t,sgn t,sgn sgn > t,sgn 2 > t,sgn x + ai x x ai x ai + ai ai,⊥ (ξi − ξi ) |ai,1 | xk = m i=1   m  2   X xt,sgn   k sgn sgn> t,sgn sgn> t,sgn > t,sgn 2 > t,sgn = (ξi − ξi ) |ai,1 | ai x + ai x + ai x ai x ai,⊥ , m i=1

 3 3 2 2 where (i) follows from asgn i,⊥ = ai,⊥ and (ii) relies on the elementary identity a −b = (a − b) a + b + ab . Pm  sgn> t,sgn 2 sgn Pm  sgn> t,sgn 2 sgn sgn> 1 1 a x x Treating m a ai ai , a as the first column (except its first entry) of i i,1 i,⊥ i=1 i=1 ai m by Lemma 14 and the incoherence condition (62e), we have m

m

 2 1 X sgn 1 X  sgn> t,sgn 2 sgn ξi |ai,1 | asgn> xt,sgn ai,⊥ = a x ai,1 ai,⊥ = 2xt,sgn xt,sgn + v1 , i ⊥ k m i=1 m i=1 i where kv1 k2 .

q

n log3 m . m

Similarly, m



where kv2 k2 .

q

n log3 m . m

 1 X t,sgn 2 ξi |ai,1 | a> ai,⊥ = −2xt,sgn xt,sgn + v2 , i x ⊥ k m i=1

Moreover, we have

m

 1 X sgn t,sgn 2 ξi |ai,1 | a> ai,⊥ i x m i=1 m

m

 2 1 X sgn 1 X sgn = ξi |ai,1 | asgn> ξ |ai,1 | xt,sgn ai,⊥ + i m i=1 m i=1 i



 t,sgn 2 a> i x





asgn> xt,sgn i

2 

ai,⊥

= 2xt,sgn xt,sgn + v1 + v3 , ⊥ k where v3 is defined as m

1 X sgn ξ |ai,1 | v3 = m i=1 i



 t,sgn 2 a> i x





asgn> xt,sgn i

2 

ai,⊥

m

= 2xt,sgn k =

1 X 2 t,sgn  sgn (ξi − ξisgn ) a> ξi |ai,1 | ai,⊥ i,⊥ x⊥ m i=1

1 2xt,sgn k m

m X

2

t,sgn (ξi ξisgn − 1) |ai,1 | ai,⊥ a> . i,⊥ x⊥

i=1

Here the second equality comes from the identity (122). Similarly one can get m

 2 1 X − ξi |ai,1 | asgn> xt,sgn ai,⊥ = −2xt,sgn xt,sgn − v2 − v4 , i ⊥ k m i=1 54

(131)

where v4 obeys v4 =

  m 2  1 X t,sgn > t,sgn 2 ξi |ai,1 | asgn> x − a x ai,⊥ i i m i=1 m

= 2xt,sgn k It remains to bound

1 m

Pm

i=1

1 X 2 t,sgn . (ξi ξisgn − 1) |ai,1 | ai,⊥ a> i,⊥ x⊥ m i=1

xt,sgn (ξisgn − ξi ) |ai,1 | asgn> i



 t,sgn ai,⊥ . To this end, we have a> i x

m

  1 X sgn t,sgn xt,sgn a> ai,⊥ ξ |ai,1 | asgn> i x i m i=1 i m

=

m

2 h     sgn> t,sgn i 1 X sgn 1 X sgn sgn> t,sgn t,sgn > t,sgn ai,⊥ x a + |a | a x a x − ai x ξi |ai,1 | asgn> ξ i,⊥ i,1 i i i m i=1 m i=1 i

= 2xt,sgn xt,sgn + v1 + v5 , ⊥ k where

m

v5 = xt,sgn k

1 X 2 (ξi ξisgn − 1) |ai,1 | ai,⊥ asgn> xt,sgn . i m i=1

The same argument yields m



  1 X t,sgn ξi |ai,1 | asgn> xt,sgn a> ai,⊥ = −2xt,sgn xt,sgn − v2 − v6 , i x i ⊥ k m i=1

where

m

v6 = xt,sgn k

1 X 2 (ξi ξisgn − 1) |ai,1 | ai,⊥ asgn> xt,sgn . i m i=1

Combining all of the previous bounds and recognizing that v3 = v4 and v5 = v6 , we arrive at s

m n

1 X o   n log3 m t,sgn

t,sgn 3 sgn > t,sgn 3 asgn> x a − a x a . kv k + kv k .

xk .

i,⊥ 1 2 2 2 i i i,⊥

m

m i=1 2

Regarding the first entry of r1 , one has m n 1 X o   3 3 > t,sgn asgn> xt,sgn asgn ai,1 i i,1 − ai x m i=1  m  1 X 3  3 sgn t,sgn t,sgn sgn t,sgn t,sgn > > = ξi |ai,1 | xk + ai,⊥ x⊥ ξi |ai,1 | − ξi |ai,1 | xk + ai,⊥ x⊥ ξi |ai,1 | m i=1   m 1 X 2 > t,sgn 2 t,sgn 3 > = (ξisgn − ξi ) |ai,1 | 3 |ai,1 | xt,sgn a x + a x . i,⊥ i,⊥ ⊥ ⊥ k m i=1 t,sgn 3 In view of the independence between ξi and |ai,1 | a> , from the Bernstein’s inequality (see Lemma i,⊥ x⊥ 11), we have that m 1 X  1 p t,sgn 3 > ξi |ai,1 | ai,⊥ x⊥ V2 log m + B2 log m . m m i=1  holds with probability exceeding 1 − O m−10 , where

V2 :=

m X

|ai,1 |

2

t,sgn a> i,⊥ x⊥

6

and

i=1

55

t,sgn 3 B2 := max |ai,1 | a> . i,⊥ x⊥ 1≤i≤m



6

and B2 . log2 m xt,sgn 3 , which further implies It is straightforward to check that V2 . m xt,sgn ⊥ ⊥ 2 2 r r m 1 X

3 log3 m t,sgn 3

log m t,sgn t,sgn 3 >

+

 log m xt,sgn 3 ,

x

x ξi |ai,1 | ai,⊥ x⊥ . ⊥ ⊥ ⊥ 2 2 2 m m m m i=1 as long as m  log5 m. For the term involving ξisgn , we have m m m h i   1 X sgn 1 X sgn 1 X sgn t,sgn 3 t,sgn 3 > t 3 t 3 + . ξi |ai,1 | a> x = ξ |a | a x ξi |ai,1 | a> − a> i,1 i,⊥ ⊥ i,⊥ ⊥ i,⊥ x⊥ i,⊥ x⊥ i m i=1 m i=1 m i=1 | {z } | {z } :=θ1

:=θ2

Similarly one can obtain r |θ1 | .

log m

xt⊥ 3 . 2 m

 Expand θ2 using the elementary identity a3 − b3 = (a − b) a2 + ab + b2 to get m

θ2 =

h   > t,sgn i 1 X sgn t,sgn  t,sgn 2 t t 2 t ξi |ai,1 | a> a> + a> + a> ai,⊥ x⊥ i,⊥ x⊥ − x⊥ i,⊥ x⊥ i,⊥ x⊥ i,⊥ x⊥ m i=1 m

=

1 X > t 2 sgn t,sgn  t ai,⊥ x⊥ ξi |ai,1 | a> i,⊥ x⊥ − x⊥ m i=1 m

+

1 X > t,sgn 2 sgn t,sgn  t ai,⊥ x⊥ ξi |ai,1 | a> i,⊥ x⊥ − x⊥ m i=1

+

 1 X > t > t,sgn  t a x a xt,sgn − xt⊥ ξisgn |ai,1 | a> . i,⊥ x⊥ − x⊥ m i=1 i,⊥ ⊥ i,⊥ ⊥

m

Once more, we can apply Lemma 14 with the incoherence conditions (62b) and (62d) to obtain s

m

1 X  n log3 m 2

sgn t a> ξi |ai,1 | a> ;

i,⊥ x⊥ i,⊥ .

m m i=1 2 s

m

1 X

 n log3 m 2

t,sgn sgn > a> x ξ . |a | a .

i,1 i,⊥ ⊥ i,⊥ i

m

m i=1 2

In addition, one can use the Cauchy-Schwarz inequality to deduce that m 1 X  >  sgn t,sgn t,sgn  > t t > t ai,⊥ x⊥ ai,⊥ x⊥ − x⊥ ξi |ai,1 | ai,⊥ x⊥ − x⊥ m i=1 v v u u m  m 2 h h i2 u1 X i2 u 1 X 2 t,sgn t,sgn  t > t t t a> x a x − x |ai,1 | a> ≤t i,⊥ ⊥ i,⊥ ⊥ ⊥ i,⊥ x⊥ − x⊥ m i=1 m i=1 v v

u u m  m



2 X u 1 X

2 u

2 1

t,sgn 2 t,sgn ≤ t a> xt⊥ ai,⊥ a> |ai,1 | ai,⊥ a>

x⊥ − xt⊥ 2 t

x⊥ − xt⊥ 2 i,⊥ i,⊥ i,⊥

m

m

i=1 i=1

t

sgn 2

. x⊥ − x⊥ 2 , where the last inequality comes from Lemma 14. Combine the preceding bounds to reach s

n log3 m

xt⊥ − xsgn + xt⊥ − xsgn 2 . |θ2 | . ⊥ ⊥ 2 2 m 56

Applying the similar arguments as above we get m  t,sgn 2 3 X sgn 3 > t,sgn ξi − ξi |ai,1 | ai,⊥ x⊥ xk m i=1 r  s r 3



log m n log m t,sgn 2  log m

xt,sgn +

xt⊥ +

xt⊥ − xt,sgn  . xk ⊥ ⊥ 2 2 2 m m m   s r 3

n log m t,sgn 2  log m

xt⊥ +

xt⊥ − xt,sgn  , . xk ⊥ 2 2 m m



≤ kxt k + xt − xt,sgn and the fact where the last line follows from the triangle inequality xt,sgn ⊥ 2 ⊥ ⊥ ⊥ 2 2 q q 3m that logmm ≤ n log . Putting the above results together yields m s r



 n log m t,sgn log m t,sgn n log3 m

xt⊥ − xsgn + xt⊥ 2 + kr1 k2 . x⊥ xk + ⊥ 2 2 m m m r  s 3

2 t,sgn 2



+ x  log m xt⊥ + n log m xt⊥ − xsgn  , + xt⊥ − xsgn ⊥ ⊥ k 2 2 2 m m s

3

which can be further simplified to s s r 3



n log m t log m t n log3 m

xt − xsgn + xt − xsgn 2 . kr1 k2 . x⊥ 2 + xk + 2 2 m m m • Combine all of the above estimates to reach



 Z 1

t+1

sgn t,sgn    t+1,sgn 2 t t,sgn

x

˜ )) dτ x − x −x ≤ I −η ∇ f (x(τ x − ∇f xt,sgn 2

+ η ∇f 2 0 2 s ! r n  o

t 2 

log m t n log3 m t x⊥ 2 + η ≤ 1 + 3η 1 − x 2 + ηφ2 xt − xt,sgn 2 + O η xk m m for some |φ2 |  log1 m . Here the second inequality follows from the fact (63b). Substitute the induction hypotheses into this bound to reach s t n  o 

t+1

t 2  1 n log5 m t+1,sgn

x

≤ 1 + 3η 1 − x + ηφ2 αt 1 + −x C3 2 2 log m m s r log m n log3 m +η βt + η αt m m s t  o  (i) n

t 2  1 n log5 m ≤ 1 + 3η 1 − x 2 + ηφ3 αt 1 + C3 log m m s t+1  (ii) 1 n log5 m C3 , ≤ αt+1 1 + log m m for some |φ3 | 

1 log m ,

where (ii) follows the same reasoning as in (116) and (i) holds as long as r



log m 1 1 βt  αt 1 + m log m log m 57

s

t C3

n log5 m , m

(132a)

s

 t n log3 m 1 1 C3 αt  αt 1 + m log m log m

s

n log5 m . m

(132b)

Here the first condition (132a) results from (see Lemma 1) p βt . n log m · αt , and the second one is trivially true with the proviso that C3 > 0 is sufficiently large.

G

Proof of Lemma 7

Consider any l (1 ≤ l ≤ m). According to the gradient update rules (3), (29), (30) and (31), we have xt+1 − xt+1,(l) − xt+1,sgn + xt+1,sgn,(l)  i h    = xt − xt,(l) − xt,sgn + xt,sgn,(l) − η ∇f xt − ∇f (l) xt,(l) − ∇f sgn xt,sgn + ∇f sgn,(l) xt,sgn,(l) .  It then boils down to controlling the gradient difference, i.e. ∇f (xt ) − ∇f (l) xt,(l) − ∇f sgn (xt,sgn ) +  ∇f sgn,(l) xt,sgn,(l) . To this end, we first see that       ∇f xt − ∇f (l) xt,(l) = ∇f xt − ∇f xt,(l) + ∇f xt,(l) − ∇f (l) xt,(l)   Z 1    1  > t,(l) 2 > \ 2 t,(l) 2 t t,(l) al x − al x = ∇ f (x (τ )) dτ x −x + al a> , l x m 0 (133)  where we denote x(τ ) := xt + τ xt,(l) − xt and the last identity results from the fundamental theorem of calculus [Lan93, Chapter XIII, Theorem 4.2]. Similar calculations yield   ∇f sgn xt,sgn − ∇f sgn,(l) xt,sgn,(l)   Z 1   1  sgn> t,sgn,(l) 2  sgn> \ 2 sgn sgn> t,sgn,(l) ˜ (τ )) dτ al x = ∇2 f sgn (x xt,sgn − xt,sgn(l) + − al x al al x m 0 (134)  ˜ ) := xt,sgn + τ xt,sgn,(l) − xt,sgn . Combine (133) and (134) to arrive at with x(τ     ∇f xt − ∇f (l) xt,(l) − ∇f sgn xt,sgn + ∇f sgn,(l) xt,sgn,(l) Z 1  Z 1      2 t t,(l) 2 sgn ˜ ) dτ xt,sgn − xt,sgn,(l) = ∇ f x(τ ) dτ x − x − ∇ f x(τ 0 {z } | 0 :=v1

     1  > t,(l) 2 1  sgn> t,sgn,(l) 2  sgn> \ 2 sgn sgn> t,sgn,(l) \ 2 > t,(l) + al x − a> x a a x − a x − a x al al x . l l l l l m m {z } | :=v2

(135) In what follows, we shall control v1 and v2 separately. \ • We start with the simpler term v2 . In light of the fact that a> l x one can decompose v2 as h  2 i t,(l) 2 t,(l) mv2 = a> − asgn> xt,sgn,(l) al a> l x l x l | {z } :=θ1

58

2

= asgn> x\ l

2

2 = al,1 (see (37)),

+

h

asgn> xt,sgn,(l) l

2

− |al,1 |

2

 i sgn> t,sgn,(l) t,(l) . al a> − asgn x l x l al {z } | :=θ2

First, it is easy to see from (56) and the independence between asgn and xt,sgn,(l) that l 2 sgn> t,sgn,(l) 2 2 2 xt,sgn,(l) + |al,1 | x − |al,1 | ≤ asgn> al l

2

. log m · xt,sgn,(l) 2 + log m . log m

(136)

 with probability at least 1−O m−10 , where the last inequality results from the norm condition xt,sgn,(l) 2 . 1 (see (61c)). Regarding the term θ2 , one has    sgn sgn> sgn> θ2 = al a> − a a xt,(l) + asgn xt,(l) − xt,sgn,(l) , l l l l al which together with the identity (127) gives " # t,(l)  a> x sgn sgn> l,⊥ ⊥ θ2 = (ξl − ξl ) |al,1 | + asgn xt,(l) − xt,sgn,(l) . t,(l) l al xk al,⊥ In view of the independence between al and xt,(l) , and between asgn and xt,(l) − xt,sgn,(l) , one can again l apply standard Gaussian concentration results to obtain that



 p > t,(l) p sgn> t,(l)

t,(l)

and x − xt,sgn,(l) . log m xt,(l) − xt,sgn,(l) al al,⊥ x⊥ . log m x⊥ 2

2

 with probability exceeding 1 − O m−10 . Combining these two with the facts (56) and (57) leads to    sgn> t,(l) t,(l) t,(l) sgn t,sgn,(l) ka k + ka k x − x x + kθ2 k2 ≤ |ξl − ξlsgn | |al,1 | a> a x l,⊥ 2 l,⊥ ⊥ l l k 2



  p p p √ t,(l)

t,(l)

. log m log m x⊥ + n xk + n log m xt,(l) − xt,sgn,(l) 2 2



 p t,(l) t,(l)

t,(l) t,sgn,(l) −x (137) . log m x⊥ + n log m xk + x

. 2

2

We now move on to controlling θ1 . Use the elementary identity a2 − b2 = (a − b) (a + b) to get    sgn> t,sgn,(l) t,(l) t,sgn,(l) > t,(l) t,(l) θ1 = a> − asgn> x a x + a x al a> . l x l l x l l

(138)

The constructions of asgn requires that l t,(l)

t,(l) a> − asgn> xt,sgn,(l) = ξl |al,1 | xk l x l

t,sgn,(l)

− ξlsgn |al,1 | xk t,(l)

t,sgn,(l) 

t,(l) + a> − x⊥ l,⊥ x⊥

.

t,sgn,(l)

Similarly, in view of the independence between al,⊥ and x⊥ − x⊥ , and the fact (56), one can see  that with probability at least 1 − O m−10 > t,(l) t,(l) t,sgn,(l) > t,sgn,(l)  − asgn> xt,sgn,(l) ≤ |ξl | |al,1 | xk + |ξlsgn | |al,1 | xk − x al x + al,⊥ xt,(l) ⊥ l ⊥

  p t,(l) t,sgn,(l) t,(l) t,sgn,(l) . log m xk + xk + x⊥ − x⊥

2

  p t,(l) t,(l) t,sgn,(l) . log m xk + x −x (139)

, 2



t,sgn,(l) ≤ xt,(l) + xt,(l) − xt,sgn,(l) . where the last inequality results from the triangle inequality xk k 2 Substituting (139) into (138) results in > t,(l) sgn> t,sgn,(l) sgn> t,sgn,(l) > t,(l) t,(l) x − a x + a x ka k x kθ1 k2 = a> x a a l 2 l l l l l 59

 p

 p √ p t,(l) log m xk + xt,(l) − xt,sgn,(l) 2 · log m · n · log m q 

 t,(l)  n log3 m xk + xt,(l) − xt,sgn,(l) 2 , .

where the second line comes from the simple facts (57), p > t,(l) + asgn> xt,sgn,(l) ≤ log m al x l

and

(140)

> t,(l) p al x . log m.

Taking the bounds (136), (137) and (140) collectively, we can conclude that    2 1 2 t,sgn,(l) x kθ k kv2 k2 ≤ − |a | kθ1 k2 + asgn> 2 2 l,1 l m p

 log2 m n log3 m  t,(l)

t,(l) .

x⊥ + xk + xt,(l) − xt,sgn,(l) 2 . m m 2 • To bound v1 , one first observes that    ˜ (τ )) xt,sgn − xt,sgn,(l) ∇2 f (x (τ )) xt − xt,(l) − ∇2 f sgn (x      ˜ (τ )) xt,sgn − xt,sgn,(l) = ∇2 f (x (τ )) xt − xt,(l) − xt,sgn + xt,sgn,(l) + ∇2 f (x (τ )) − ∇2 f (x {z } | {z } | :=w2 (τ )

:=w1 (τ )

   ˜ (τ )) − ∇2 f sgn (x ˜ (τ )) xt,sgn − xt,sgn,(l) . + ∇2 f (x {z } | :=w3 (τ )

– The first term w1 (τ ) satisfies

Z

t

x − xt,(l) − xt,sgn + xt,sgn,(l) − η

0

1

w1 (τ )dτ

2

  Z 1 

2 t t,(l) t,sgn t,sgn,(l) I − η = ∇ f (x (τ )) dτ x − x − x + x

0 2



Z 1

t

t,(l) t,sgn t,sgn,(l) 2 −x +x ∇ f (x (τ )) dτ ≤

· x − x

I − η 2 0    



t 2  1

≤ 1 + 3η 1 − x 2 + O η + ηφ1 xt − xt,(l) − xt,sgn + xt,sgn,(l) , log m 2 for some |φ1 | 

1 log m ,

where the last line follows from the same argument as in (112).

– Regarding the second term w2 (τ ), it is seen that

m h

X

2

2 2 i 3

> > > 2

∇ f (x (τ )) − ∇ f (x ˜ ) ˜ (τ )) = ai x(τ ) − ai x(τ ai ai

m i=1

m 3 X

2  2 >

˜ ) ≤ max ai x(τ ) − a> ai a> i x(τ i 1≤i≤m

m

i=1

m

X > > 3

> ˜ (τ )) max ai (x (τ ) + x ˜ (τ )) ai ai ≤ max ai (x (τ ) − x 1≤i≤m 1≤i≤m

m

i=1 > p ˜ (τ )) log m, . max ai (x (τ ) − x (141) 1≤i≤m

where the last line makes use of Lemma 13 as well as the incoherence conditions > p ˜ (τ ) . log m. ˜ (τ )) ≤ max a> max a> i (x (τ ) + x i x (τ ) + max ai x 1≤i≤m

1≤i≤m

60

1≤i≤m

(142)

Note that  h i   ˜ ) = xt + τ xt,(l) − xt − xt,sgn + τ xt,sgn,(l) − xt,sgn x(τ ) − x(τ   = (1 − τ ) xt − xt,sgn + τ xt,(l) − xt,sgn,(l) . This implies for all 0 ≤ τ ≤ 1, >   > t,(l)  t t,sgn ai x(τ ) − x(τ ˜ ) ≤ a> − xt,sgn,(l) . + ai x i x −x Moreover, the triangle inequality together with the Cauchy-Schwarz inequality tells us that   > t  > t,(l) t,sgn t,(l) t,sgn,(l) t t,sgn x − x − x + x + x − x − xt,sgn,(l) ≤ a> a ai x i i

> t  ≤ ai x − xt,sgn + kai k2 xt − xt,sgn − xt,(l) + xt,sgn,(l) 2 and  > t  >  t  t,(i) t,sgn,(i) ai x − xt,sgn ≤ a> x − x + ai x − xt,sgn − xt,(i) + xt,sgn,(i) i

 t,(i) ≤ a> − xt,sgn,(i) + kai k2 xt − xt,sgn − xt,(i) + xt,sgn,(i) 2 . i x Combine the previous three inequalities to obtain > t   > t,(l) t,sgn t,sgn,(l) ˜ max a> (x (τ ) − x (τ )) ≤ max a x − x + max x − x a i i i 1≤i≤m 1≤i≤m 1≤i≤m

 t,(i) − xt,sgn,(i) + 3 max kai k2 max xt − xt,sgn − xt,(l) + xt,sgn,(l) 2 ≤ 2 max a> i x 1≤i≤m 1≤i≤m 1≤l≤m p

t,(i)

t

√ t,sgn,(i)

−x + n max x − xt,sgn − xt,(l) + xt,sgn,(l) 2 , . log m max x 2 1≤i≤m

1≤l≤m

where the last inequality follows from the independence between ai and xt,(i) − xt,sgn,(i) and the fact (57). Substituting the above bound into (141) results in

2

∇ f (x (τ )) − ∇2 f (x ˜ (τ ))

p



. log m max xt,(i) − xt,sgn,(i) + n log m max xt − xt,sgn − xt,(l) + xt,sgn,(l) 2 1≤i≤m 1≤l≤m 2 p

t

t

t,sgn

+ n log m max x − xt,sgn − xt,(l) + xt,sgn,(l) . . log m x − x 2 2 1≤l≤m

Here, we use the triangle inequality





t,(i)

− xt,sgn,(i) ≤ xt − xt,sgn 2 + xt − xt,sgn − xt,(i) + xt,sgn,(i) 2

x 2



and the fact log m ≤ n log m. Consequently, we have the following bound for w2 (τ ):



˜ (τ )) · xt,sgn − xt,sgn,(l) 2 kw2 (τ )k2 ≤ ∇2 f (x (τ )) − ∇2 f (x   p



. log m xt − xt,sgn 2 + n log m max xt − xt,sgn − xt,(l) + xt,sgn,(l) 2 xt,sgn − xt,sgn,(l) 2 . 1≤l≤m

– It remains to control w3 (τ ). To this end, one has w3 (τ ) =

m 2  i  1 Xh > \ 2 t,sgn ˜ 3 a> x (τ ) − a x ai a> − xt,sgn,(l) i i i x m i=1 | {z } :=ρi

m 2 2 i sgn sgn> t,sgn  1 Xh ˜ (τ ) − asgn> − x\ ai ai x − xt,sgn,(l) . 3 asgn> x i i m i=1 | {z } :=ρsgn i

61

We consider the first entry of w3 (τ ), i.e. w3,k (τ ), and the 2nd through the nth entries, w3,⊥ (τ ), separately. For the first entry w3,k (τ ), we obtain m

m

  1 X 1 X sgn sgn t,sgn w3,k (τ ) = ρi ξi |ai,1 | a> − xt,sgn,(l) − ρ ξ |ai,1 | asgn> xt,sgn − xt,sgn,(l) . (143) i x i m i=1 m i=1 i i Use the expansions    t,sgn,(l) t,sgn,(l)  t,sgn t,sgn t,sgn t,sgn,(l) − x + a> − x⊥ a> x − x = ξ |a | x i i,1 i,⊥ x⊥ i k k    t,sgn,(l) t,sgn,(l)  t,sgn asgn> − xk + a> − x⊥ xt,sgn − xt,sgn,(l) = ξisgn |ai,1 | xt,sgn i,⊥ x⊥ i k to further obtain m

w3,k (τ ) =

m

1 X 1 X t,sgn,(l)  t,sgn,(l)  2 t,sgn sgn > − x⊥ xkt,sgn − xk (ρi − ρsgn + (ρi ξi − ρsgn i ) |ai,1 | i ξi ) |ai,1 | ai,⊥ x⊥ m i=1 m i=1 m

1 X t,sgn,(l)  2 = xkt,sgn − xk (ρi − ρsgn i ) |ai,1 | m i=1 {z } | :=θ1 (τ )

+

1 m |

m X

t,sgn,(l) 

t,sgn sgn > (ρi − ρsgn − x⊥ i ) (ξi + ξi ) |ai,1 | ai,⊥ x⊥

i=1

{z

}

:=θ2 (τ ) m

m

1 X sgn 1 X t,sgn,(l)  t,sgn,(l)  t,sgn t,sgn ρi ξi |ai,1 | a> ρi ξisgn |ai,1 | a> − − x⊥ − x⊥ + i,⊥ x⊥ i,⊥ x⊥ m i=1 m i=1 {z } | {z } | :=θ3 (τ )

:=θ4 (τ )

The identity (122) reveals that ˜ ⊥ (τ ) , ρi − ρsgn = 6 (ξi − ξisgn ) |ai,1 | x ˜k (τ ) a> i,⊥ x i and hence

m

  6 X t,sgn,(l) 3 ˜ ⊥ (τ ) xt,sgn (ξi − ξisgn ) |ai,1 | a> − xk , θ1 (τ ) = x ˜k (τ ) · i,⊥ x k m i=1 which together with (130) implies

m

1 X t,sgn

t,sgn,(l) 3 > sgn ˜ ⊥ (τ )k2 |θ1 (τ )| ≤ 6 x ˜k (τ ) xk − xk (ξi − ξi ) |ai,1 | ai,⊥ kx

m i=1 s n log3 m t,sgn,(l) ˜ ⊥ (τ )k2 . x ˜k (τ ) xt,sgn − x kx k k m s n log3 m t,sgn,(l) x ˜k (τ ) xt,sgn − x . , k k m where the penultimate inequality arises from (130) and the last inequality utilizes the fact that



t,sgn,(l)

+ ˜ ⊥ (τ )k2 ≤ xt,sgn kx

x

. 1. ⊥ ⊥ 2 2

Again, we can use (144) and the identity (ξi −

ξisgn ) (ξi

+

θ2 (τ ) = 0. 62

ξisgn )

= 0 to deduce that

(144)

  t,sgn,(l) t,sgn |ai,1 | a> − x⊥ When it comes to θ3 (τ ), we exploit the independence between ξi and ρsgn i i,⊥ x⊥ and apply  the Bernstein inequality (see Lemma 11) to obtain that with probability exceeding 1 − O m−10  1 p |θ3 (τ )| . V1 log m + B1 log m , m where V1 :=

m X

2 2 t,sgn,(l)  2 > t,sgn − x⊥ (ρsgn i ) |ai,1 | ai,⊥ x⊥

> t,sgn,(l)  t,sgn − x⊥ and B1 := max |ρsgn . i | |ai,1 | ai,⊥ x⊥

i=1

1≤i≤m

Combine the fact |ρsgn i | . log m and Lemma 14 to see that  t,sgn,(l)

2 . − x⊥ V1 . m log2 m xt,sgn ⊥ 2 In addition, the facts |ρsgn i | . log m, (56) and (57) tell us that q

t,sgn,(l)

. B1 . n log3 m xt,sgn − x⊥ ⊥ 2 Continue the derivation to reach s  s p 3 5

log m n log m  t,sgn log3 m t,sgn,(l)

xt,sgn − xt,sgn,(l) , |θ3 (τ )| .  . + x⊥ − x⊥ ⊥ ⊥ 2 2 m m m

(145)

provided that m & n log2 m. This further allows us to obtain m h 1 X i    2 2 t,sgn,(l) t,sgn \ ˜ (τ ) − a> |θ4 (τ )| = 3 a> ξisgn |ai,1 | a> − x⊥ i x i x i,⊥ x⊥ m i=1 m n 1 X o 2 t,(l)  2 sgn > t 3 a> x (τ ) − |a | ξ |a | a x − x ≤ i,1 i,1 i i,⊥ ⊥ i ⊥ m i=1 m n 1 X o    2 2 t,sgn,(l) t,sgn ˜ (τ ) − 3 a> + 3 a> ξisgn |ai,1 | a> − x⊥ i x i x (τ ) i,⊥ x⊥ m i=1 m n 1 X o   2 t,(l) t,sgn,(l) 2 t,sgn sgn > > t 3 ai x (τ ) − |ai,1 | ξi |ai,1 | ai,⊥ x⊥ − x⊥ − x⊥ + x⊥ + m i=1 s

p log3 m

t,sgn

t t,(l) t,sgn,(l) ˜ (τ )k2 .

x⊥ − x⊥ + log m x⊥ − x⊥

kx (τ ) − x m 2 2

1

t t,(l) t,sgn(l) t,sgn + (146)

x⊥ − x⊥ − x⊥ + x⊥

. 3/2 2 log m To justify the last inequality, we first use similar bounds as in (145) to show that with probability  exceeding 1 − O m−10 , s m n 1 X

o   log3 m 2

t t,(l) t,(l) 2 t 3 a> − |ai,1 | ξisgn |ai,1 | a> .

x⊥ − x⊥ . i,⊥ x⊥ − x⊥ i x(τ ) m m 2 i=1

In addition, we can invoke the Cauchy-Schwarz inequality to get m n 1 X o     2 2 t,sgn,(l) t,sgn sgn ˜ (τ ) − 3 a> 3 a> ξi |ai,1 | a> − x⊥ i x i x (τ ) i,⊥ x⊥ m i=1

63

v u u ≤t

m  o2 2 1 Xn 2 > x (τ ) 2 ˜ 3 a> x (τ ) − 3 a |ai,1 | i i m i=1

!

m  2 1 X >  t,sgn t,sgn,(l) ai,⊥ x⊥ − x⊥ m i=1

!

v u m n

u1 X  o2 2 t,sgn,(l) 2 t,sgn > x (τ ) 2 ˜ .t − x |a | a> x (τ ) − a

,

x i,1 i i ⊥ ⊥ m i=1 2 where the last line arises from Lemma 13. For the remaining term in the expression above, we have v v u u m n m o2 u1 X u1 X  2  2  2 2 2 2 > > t t ˜ (τ )) ˜ (τ )) a> ˜ (τ ) − ai x (τ ) |ai,1 | a> |ai,1 | = ai x i (x (τ ) + x i (x (τ ) − x m i=1 m i=1 v u m (i) u log m X 2 2 ˜ (τ )) .t |ai,1 | a> i (x (τ ) − x m i=1 (ii)

.

p

˜ (τ )k2 . log m kx (τ ) − x

Here, (i) makes use of the incoherence condition (142), whereas (ii) comes from Lemma 14. Regarding the last line in (146), we have m n 1 X o    2 t,(l) t,sgn,(l) 2 t 3 a> − |ai,1 | ξisgn |ai,1 | a> − xt,sgn + x⊥ i x (τ ) i,⊥ x⊥ − x⊥ ⊥ m i=1

m n



1 X o 2

t,(l) t,sgn,(l) 2 sgn t,sgn > t 3 a> x (τ ) − |a | ξ |a | a − x − x + x ≤

x

. i,1 i,1 i i,⊥ ⊥ i ⊥ ⊥ ⊥

m 2 i=1 2

n o 2 2 Since ξisgn is independent of 3 a> x (τ ) − |a | |ai,1 | a> i,1 i i,⊥ , one can apply the Bernstein inequality (see Lemma 11) to deduce that

m n

1 X o   1 p 2

2 sgn 3 a> − |ai,1 | ξi |ai,1 | a> V2 log m + B2 log m ,

i x (τ ) i,⊥ .

m m i=1 2

where V2 :=

m n o2 X 2 2 2 3 3 a> x (τ ) − |a | |ai,1 | a> i,1 i i,⊥ ai,⊥ . mn log m; i=1

2 √ 2 n log3/2 m. B2 := max 3 a> x (τ ) − |a | i,1 |ai,1 | kai,⊥ k2 . i 1≤i≤m

This further implies s

√ m n

1 X

o  n log4 m n log5/2 m 1 2

2 sgn > 3 a> x (τ ) − |a | ξ |a | a . + . ,

i,1 i,1 i i,⊥ i 3/2

m

m m log m i=1 2 as long as m  n log7 m. Take the previous bounds on θ1 (τ ), θ2 (τ ), θ3 (τ ) and θ4 (τ ) collectively to arrive at s s

3 t,sgn n log m log3 m

t,sgn t,sgn,(l) t,sgn,(l) w3,k (τ ) . x − xk ˜k (τ ) xk +

x⊥ − x⊥

m m 2 s

p log3 m

t

t,sgn t,(l) t,sgn,(l) ˜ (τ )k2 +

x⊥ − x⊥ + log m x⊥ − x⊥

kx (τ ) − x m 2 2 64

+

1 log3/2 m



t t,(l) t,sgn(l) t,sgn

x⊥ − x⊥ − x⊥ + x⊥

2

s

.

n log3 m t,sgn,(l) − xk x ˜k (τ ) xt,sgn k m s

p log3 m

t

t,sgn t,(l) t,sgn,(l) ˜ (τ )k2 +

x⊥ − x⊥ + log m x⊥ − x⊥

kx (τ ) − x m 2 2

1

t t,(l) t,sgn(l) t,sgn + − x − x + x

x

, ⊥ ⊥ ⊥ ⊥ 2 log3/2 m

where the last inequality follows from the triangle inequality





t,sgn t,sgn,(l) t,(l) t,(l) t,sgn(l) t,sgn

x⊥ − x⊥

≤ xt⊥ − x⊥ + xt⊥ − x⊥ − x⊥ + x⊥

2

q

2

2

3

log m 1 and the fact that for m sufficiently large. Similar to (143), we have the following ≤ log3/2 m m identity for the 2nd through the nth entries of w3 (τ ): m m     1 X 1 X sgn t,sgn t,sgn,(l) ρi ai,⊥ a> ρi ai,⊥ asgn> x − x − xt,sgn − xt,sgn,(l) i i m i=1 m i=1   m  2   2 3 X t,sgn,(l) sgn sgn> > ˜ (τ ) ξi − ai ˜ (τ ) ξi ai x x |ai,1 | ai,⊥ xt,sgn − x = k k m i=1

w3,⊥ (τ ) =

m   3 X t,sgn,(l) 2 |ai,1 | (ξi − ξisgn ) |ai,1 | ai,⊥ xt,sgn − x k k m i=1 m  2    2  sgn> 3 X t,sgn,(l) t,sgn ˜ ˜ + a> x (τ ) − a x (τ ) ai,⊥ a> . − x⊥ i i,⊥ x⊥ i m i=1

+

√ ˜ (τ )k2 and ˜ (τ ) . log m kx It by Lemma 14 and the incoherence conditions a> i x is easy to check sgn> √ ˜ (τ ) . log m kx ˜ (τ )k2 that x ai  s m 3 2 1 X > n log m , ˜ (τ ) ξi |ai,1 | ai,⊥ = 2˜ ˜ ⊥ (τ ) + O  a x x1 (τ ) x m i=1 i m and s  m 3 2 n log m 1 X  sgn> . ˜ (τ ) ξisgn |ai,1 | ai,⊥ = 2˜ ˜ ⊥ (τ ) + O  a x x1 (τ ) x m i=1 i m Besides, in view of (130), we have s

m

3 X

n log3 m

2 sgn |ai,1 | (ξi − ξi ) |ai,1 | ai,⊥ . .

m

m i=1

2

  2  

3 Pm 2  sgn> t,sgn,(l) t,sgn >˜ >

. To ˜ We are left with controlling a x (τ ) − a x (τ ) a a x − x i,⊥ i,⊥ i i ⊥ ⊥

m i=1

2 this end, one can see from (144) that

m 

3 X 2    2  sgn>

t,sgn,(l) t,sgn > > ˜ (τ ) ˜ (τ ) − ai ai x x ai,⊥ ai,⊥ x⊥ − x⊥

m

i=1 2

65



m

  6 X

t,sgn,(l) t,sgn sgn > > ˜ ⊥ (τ ) ai,⊥ x⊥ − x⊥ ˜ (τ ) · = x (ξi − ξi ) |ai,1 | ai,⊥ ai,⊥ x

k m i=1

m



1 X

t,sgn t,sgn,(l) ˜ ≤ 12 max |ai,1 | x ˜k (τ ) max a> x (τ ) − x⊥ ai,⊥ a>

i,⊥ ⊥ i,⊥ x⊥ 1≤i≤m 1≤i≤m

m

2 i=1



t,sgn,(l) − x⊥ . log m x ˜k (τ ) xt,sgn

, ⊥ 2

√ ˜ ⊥ (τ ) . log m where the last relation arises from (56), the incoherence condition max1≤i≤m a> i,⊥ x and Lemma 13. Hence the 2nd through the nth entries of w3 (τ ) obey s

n log3 m t,sgn

t,sgn,(l) t,sgn,(l) − x⊥ ˜k (τ ) xt,sgn − xk kw3,⊥ (τ )k2 .

. + log m x xk ⊥ m 2 Combine the above estimates to arrive at kw3 (τ )k2 ≤ w3,k (τ ) + kw3,⊥ (τ )k2 s

n log3 m t,sgn

t,sgn,(l) t,sgn,(l) − x + ˜k (τ ) xt,sgn ≤ log m x − x

x ⊥ ⊥ k k m 2 s

p log3 m

t

t,sgn t,(l) t,sgn,(l) ˜ (τ )k2 +

x⊥ − x⊥ + log m x⊥ − x⊥

kx (τ ) − x m 2 2

1

t t,(l) t,sgn(l) t,sgn +

x⊥ − x⊥ − x⊥ + x⊥

. 3/2 2 log m • Putting together the preceding bounds on v1 and v2 (w1 (τ ), w2 (τ ) and w3 (τ )), we can deduce that

t+1

− xt+1,(l) − xt+1,sgn + xt+1,sgn,(l)

x 2

Z 1  Z 1 Z 1

t

t,(l) t,sgn t,sgn,(l)

x − x − x + x − η = w (τ ) dτ + w (τ ) dτ + w (τ ) dτ − ηv 1 2 3 2

0

0

0

2

Z 1

t

t,(l) t,sgn t,sgn,(l)

−x +x −η sup kw (τ )k2 + η sup kw3 (τ )k2 + η kv2 k2 ≤ x − x w1 (τ ) dτ

+ η 0≤τ ≤1 0≤τ ≤1 0 2

n   o

t 2

t t,(l) t,sgn t,sgn,(l) ≤ 1 + 3η 1 − x 2 + ηφ1 x − x −x +x

2   

 p

t

t



t,sgn t,sgn t,(l) t,sgn,(l) t,sgn t,sgn,(l)

n log m max x − x −x +x + log m x − x −x +O η

x

2 2 1≤l≤m 2  s  

 n log3 m t,sgn

t,sgn,(l)  + O η log m sup x ˜k (τ ) xt,sgn − xt,sgn,(l) + O η − xk xk m 2 0≤τ ≤1  s   



3 p log m

t

t,sgn t,(l) t,sgn,(l) ˜ (τ )k2 . + O η

x⊥ − x⊥  + O η log m x⊥ − x⊥

sup kx (τ ) − x m 2 2 0≤τ ≤1 ! p 



 n log3 m  t,(l) log2 m

t,(l)

t,(l) t,sgn,(l) +O η −x . (147)

x⊥ + O η xk + x

m m 2 2 To simplify the preceding bound, we first make the following claim, whose proof is deferred to the end of this subsection. Claim 1. For t ≤ T0 , the following inequalities hold:

p

n log m xt,sgn − xt,sgn,(l)  2

66

1 ; log m

p n log3 m . αt log m; m 1 . αt log m  log m

p

˜ (τ )k2 + ˜k (τ ) + log m xt − xt,sgn 2 + log m sup kx (τ ) − x log m sup x 0≤τ ≤1

0≤τ ≤1

Armed with Claim 1, one can rearrange terms in (147) to obtain for some |φ2 |, |φ3 |  log1 m



t+1 − xt+1,(l) − xt+1,sgn + xt+1,sgn,(l)

x

2 o n 

t 2 



≤ 1 + 3η 1 − x 2 + ηφ2 max xt − xt,(l) − xt,sgn + xt,sgn,(l) 1≤l≤m 2   s

log3 m log2 m 

t + ηO log m · αt + +

x − xt,(l) m m 2 s  p 3

log2 m n log m n log3 m  t t,(l)

xt⊥ + ηO  + xk − xk + η 2 m m m ! p

 n log3 m  t + ηO xk + xt − xt,sgn 2 m

n  o

2 

≤ 1 + 3η 1 − xt 2 + ηφ3 xt − xt,(l) − xt,sgn + xt,sgn,(l) 2

t t,(l) + O (η log m) · αt x − x 2  s    3

log2 m n log m  t t,(l)

xt⊥ + O η xk − xk + O η 2 m m ! p

 n log3 m  t +O η xk + xt − xt,sgn 2 . m Substituting in the hypotheses (40), we can arrive at

t+1

− xt+1,(l) − xt+1,sgn + xt+1,sgn,(l)

x 2 p t n  o 

t 2  n log9 m 1 ≤ 1 + 3η 1 − x 2 + ηφ3 αt 1 + C4 log m m t p  5 n log m 1 C1 + O (η log m) αt βt 1 + log m m  s  p   t log3 m  1 n log5 m + O η βt 1 + C1 m log m m  s   t p 3 n log m 1 n log12 m  αt 1 + + O η C2 m log m m ! p   log2 m n log3 m +O η βt + O η αt m m s !  p t n log3 m 1 n log5 m +O η αt 1 + C3 m log m m p   t  o (i) n

t 2  1 n log9 m ≤ 1 + 3η 1 − x 2 + ηφ4 αt 1 + C4 log m m 67

(ii)

≤ αt+1 for some |φ4 |  as long as

1 log m .



1 1+ log m

t+1

p C4

n log9 m m

Here, the last relation (ii) follows the same argument as in (116) and (i) holds true

p  t  n log5 m 1 1 C1 (log m) αt βt 1 +  αt 1 + log m m log m s p    t n log3 m 1 n log12 m 1 C2 αt 1 +  αt 1 + m log m m log m  log2 m 1 βt  αt 1 + m log m s p  t  n log3 m 1 n log5 m 1 C3 αt 1 +  αt 1 + m log m m log m p  3 n log m 1 αt  αt 1 + m log m

1 log m

t

1 log m

t

1 log m

t

1 log m

t

1 log m

t

p C4 p C4 p C4 p C4 p C4

n log9 m ; m

(148a)

n log9 m ; m

(148b)

n log9 m ; m

(148c)

n log9 m ; m

(148d)

n log9 m , m

(148e)

where we recall that t ≤ T0 . log n. The first condition (148a) can be checked using βt . 1 and the assumption that C4 > 0 is sufficiently large. The second one is valid if m  n log8 m. In addition, the third condition follows from the relationship (see Lemma 1) p βt . αt n log m. It is also easy to see that the last two are both valid. Proof of Claim 1. For the first claim, it is east to see from the triangle inequality that

p

n log m xt,sgn − xt,sgn,(l)

2

  p

t

t,(l) ≤ n log m x − x + xt − xt,(l) − xt,sgn + xt,sgn,(l) 2 2 p t t p   5 p p n log m n log9 m 1 1 ≤ n log mβt 1 + + n log mαt 1 + C1 C4 log m m log m m .

n log3 m n log5 m 1 +  , m m log m

as long as m  n log6 m. Here, we have invoked the upper bounds on αt and βt provided in Lemma 1. Regarding the second claim, we have t,sgn,(l) t,sgn t,sgn t,sgn,(l) x ˜k (τ ) ≤ xt,sgn + ≤ 2 + − x x x x k k k k k



t,(l) ≤ 2 xtk + 2 xt − xt,sgn 2 + xtk − xk + xt − xt,(l) − xt,sgn + xt,sgn,(l) 2   s p p 5 12 9 n log m n log m n log m  + + . αt , . αt 1 + m m m as long as m  n log5 m. Similar arguments can lead us to conclude that the remaining terms on the left-hand side of the second inequality in the claim are bounded by O(αt ). The third claim is an immediate consequence of the fact αt  log15 m (see Lemma 1).

68

H

Proof of Lemma 8

Recall from Appendix C that    s   3  

n log m 2

xt + O η  xtk + J2 − J4 , xt+1 = 1 + 3η 1 − k 2   m where J2 and J4 are defined respectively as m h 2 i 1 X J2 := η 1 − 3 xtk · a3 a> xt ; m i=1 i,1 i,⊥ ⊥ m

J4 := η ·

1 X > t 3 a x ai,1 . m i=1 i,⊥ ⊥

Instead of resorting to the leave-one-out sequence {xt,sgn } as in Appendix C, we can directly apply Lemma 12 and the incoherence condition (49a) to obtain m  X 1 1 3 > t t 2 1 ai,1 ai,⊥ x⊥  η 6 xt⊥ 2  η αt ; |J2 | ≤ η 1 − 3 xk m log m log m i=1 m 1 X  1 1 3 > t ai,⊥ x⊥ ai,1  η 6 xt⊥ 2  η |J4 | ≤ η αt m log m log m i=1  with probability at least 1 − O m−10 , as long as m  n log13 m. Here, the last relations come from the fact that αt ≥ logc5 m (see Lemma 1). Combining the previous estimates gives n  o

2  αt+1 = 1 + 3η 1 − xt 2 + ηζt αt , with |ζt | 

I

1 log m .

This finishes the proof.

Proof of Lemma 9

In view of Appendix D, one has p

n  o

t 2 

n log3 m

t+1

t t+1,(l) t,(l)

xt −x

x

≤ 1 + 3η 1 − x 2 + ηφ1 x − x + O η 2 m 2 2 for some |φ1 | 

1 log m ,

! ,

where we use the trivial upper bound



t,(l) 2η xtk − xk ≤ 2η xt − xt,(l) . 2

Under the hypotheses (48a), we can obtain

n  o 

t 2 

t+1 t+1,(l)

−x

x

≤ 1 + 3η 1 − x 2 + ηφ1 αt 1 + 2

1 log m

t

n log15 m +O η m

p

n log15 m m

C6

t n  o 

t 2  1

≤ 1 + 3η 1 − x 2 + ηφ2 αt 1 + C6 log m p  t+1 1 n log15 m ≤ αt+1 1 + C6 , log m m

69

p

p

n log3 m (αt + βt ) m

!

for some |φ2 | 

1 log m ,

as long as η is sufficiently small and p

p  t 1 1 n log3 m n log15 m (αt + βt )  αt 1 + C6 . m log m log m m

This is satisfied since, according to Lemma 1, p p p p  t n log3 m n log3 m n log13 m n log15 m 1 1 (αt + βt ) . . αt  αt 1 + C6 , m m m log m log m m as long as C6 > 0 is sufficiently large.

J

Proof of Lemma 12

Without loss of generality, it suffices to consider all the unitPvectors z obeying kzk2 = 1. To begin with, for m 1 any given z, we can express the quantities of interest as m i=1 (gi (z) − G (z)) , where gi (z) depends only on z and ai . Note that θ 2 1 gi (z) = aθi,1 a> i,⊥ z for different θ1 , θ2 ∈ {1, 2, 3, 4, 6} in each of the cases considered herein. It can be easily verified from Gaussianality that in all of these cases, for any fixed unit vector z one has   2 E gi2 (z) . (E [|gi (z)|]) ; (149) E [|gi (z)|]  1; (150) 1 √ (151) − E [gi (z)] ≤ E [|gi (z)|] . i,⊥ z |≤βkzk2 ,|ai,1 |≤5 log m} n √   In addition, on the event max1≤i≤m kai k2 ≤ 6n which has probability at least 1 − O me−1.5n , one has, for any fixed unit vectors z, z0 , that h E gi (z) 1{|a>

i

|gi (z) − gi (z0 )| ≤ nα kz − z0 k2

(152)

forP some parameter α = O (1) in all cases. In light of these properties, we will proceed by controlling m 1 i=1 gi (z) − E [gi (z)] in a unified manner. m We start by looking at any fixed vector z independent of {ai }. Recognizing that m

h i 1 X gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} −E gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} 2 2 i,⊥ i,⊥ m i=1 is a sum of m i.i.d. random variables, one can thus apply the Bernstein inequality to obtain ( ) m 1 X h i gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} −E gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} ≥ τ P 2 2 i,⊥ i,⊥ m i=1   τ 2 /2 ≤ 2 exp − , V + τ B/3 where the two quantities V and B obey m i  1 1 X h 2 1  2 2 √ E g (z) 1 > i {|ai,⊥ z|≤βkzk2 ,|ai,1 |≤5 log m} ≤ m E gi (z) . m (E [|gi (z)|]) ; m2 i=1 n o 1 B := max |gi (z)| 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} . 2 i,⊥ m 1≤i≤m

V :=

70

(153) (154)

Here the penultimate relation of (153) follows from (149). Taking τ =  E [|gi (z)|], we can deduce that m 1 X i h gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} −E gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} ≤  E [|gi (z)|] (155) 2 2 i,⊥ i,⊥ m i=1 n  o  i (z)|] with probability exceeding 1 − 2 min exp −c1 m2 , exp − c2 E[|g for some constants c1 , c2 > 0. In B particular, when m2 /(n log n) and E [|gi (z)|] /(Bn log n) are both sufficiently large, the inequality (155) holds with probability exceeding 1 − 2 exp (−c3 n log n) for some constant c3 > 0 sufficiently large. We then move on to extending this result to a uniform bound. Let Nθ be a θ-net of the unit sphere with n cardinality |Nθ | ≤ 1 + θ2 such that for any z on the unit sphere, one can find a point z0 ∈ Nθ such that kz − z0 k2 ≤ θ. Apply the triangle inequality to obtain m 1 X i h gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} −E gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} 2 2 i,⊥ i,⊥ m i=1 m 1 X h i ≤ gi (z0 ) 1{|a> z0 |≤βkz0 k ,|ai,1 |≤5√log m} −E gi (z0 ) 1{|a> z0 |≤βkz0 k ,|ai,1 |≤5√log m} 2 2 i,⊥ i,⊥ m i=1 | {z } :=I1

m h 1 X i + gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} −gi (z0 ) 1{|a> z0 |≤βkz0 k ,|ai,1 |≤5√log m} , 2 2 i,⊥ i,⊥ m i=1 | {z } :=I2

where the second line arises from the fact that h i h E gi (z) 1{|a> z|≤βkzk ,|ai,1 |≤5√log m} = E gi (z0 ) 1{|a> i,⊥

√ i,⊥ z0 |≤βkz0 k2 ,|ai,1 |≤5 log m}

2

i

.

With regard to the first term $I_1$, by the union bound, with probability at least $1 - 2(1+\frac{2}{\theta})^n\exp(-c_3 n\log n)$ one has
$$ I_1 \leq \epsilon\,\mathbb{E}\big[|g_i(z_0)|\big]. $$
It remains to bound $I_2$. Denoting $S_i := \big\{ z \,\big|\, |a_{i,\perp}^{\top}z| \leq \beta\|z\|_2,\ |a_{i,1}| \leq 5\sqrt{\log m} \big\}$, we have
$$ I_2 = \left|\frac{1}{m}\sum_{i=1}^{m}\Big( g_i(z)\,\mathbb{1}_{\{z\in S_i\}} - g_i(z_0)\,\mathbb{1}_{\{z_0\in S_i\}} \Big)\right| $$
$$ \leq \frac{1}{m}\sum_{i=1}^{m}\big| g_i(z) - g_i(z_0) \big|\,\mathbb{1}_{\{z\in S_i,\,z_0\in S_i\}} + \frac{1}{m}\sum_{i=1}^{m}\big| g_i(z) \big|\,\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} + \frac{1}{m}\sum_{i=1}^{m}\big| g_i(z_0) \big|\,\mathbb{1}_{\{z\notin S_i,\,z_0\in S_i\}} $$
$$ \leq \frac{1}{m}\sum_{i=1}^{m}\big| g_i(z) - g_i(z_0) \big| + \frac{1}{m}\max_{1\leq i\leq m}\big| g_i(z)\,\mathbb{1}_{\{z\in S_i\}} \big| \cdot \sum_{i=1}^{m}\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} + \frac{1}{m}\max_{1\leq i\leq m}\big| g_i(z_0)\,\mathbb{1}_{\{z_0\in S_i\}} \big| \cdot \sum_{i=1}^{m}\mathbb{1}_{\{z\notin S_i,\,z_0\in S_i\}}. \qquad (156) $$
For the first term in (156), it follows from (152) that
$$ \frac{1}{m}\sum_{i=1}^{m}\big| g_i(z) - g_i(z_0) \big| \leq n^{\alpha}\,\|z - z_0\|_2 \leq n^{\alpha}\theta. $$
For the second term of (156), we have
$$ \mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \leq \mathbb{1}_{\{|a_{i,\perp}^{\top}z|\leq\beta,\;|a_{i,\perp}^{\top}z_0|\geq\beta\}} = \mathbb{1}_{\{|a_{i,\perp}^{\top}z|\leq\beta\}}\Big( \mathbb{1}_{\{|a_{i,\perp}^{\top}z_0|\geq\beta+\sqrt{6n}\,\theta\}} + \mathbb{1}_{\{\beta\leq|a_{i,\perp}^{\top}z_0|\leq\beta+\sqrt{6n}\,\theta\}} \Big) \leq \mathbb{1}_{\{\beta\leq|a_{i,\perp}^{\top}z_0|\leq\beta+\sqrt{6n}\,\theta\}}. \qquad (157) $$

Here, the identity in (157) holds due to the fact that
$$ \mathbb{1}_{\{|a_{i,\perp}^{\top}z|\leq\beta\}}\,\mathbb{1}_{\{|a_{i,\perp}^{\top}z_0|\geq\beta+\sqrt{6n}\,\theta\}} = 0; $$
in fact, under the condition $|a_{i,\perp}^{\top}z_0| \geq \beta+\sqrt{6n}\,\theta$ one has
$$ |a_{i,\perp}^{\top}z| \geq |a_{i,\perp}^{\top}z_0| - |a_{i,\perp}^{\top}(z - z_0)| \geq \beta + \sqrt{6n}\,\theta - \|a_{i,\perp}\|_2\|z - z_0\|_2 \geq \beta + \sqrt{6n}\,\theta - \sqrt{6n}\,\theta \geq \beta, $$
which is contradictory to $|a_{i,\perp}^{\top}z| \leq \beta$. As a result, one can obtain
$$ \sum_{i=1}^{m}\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \leq \sum_{i=1}^{m}\mathbb{1}_{\{\beta\leq|a_{i,\perp}^{\top}z_0|\leq\beta+\sqrt{6n}\,\theta\}} \leq 2Cn\log n $$
with probability at least $1 - e^{-\frac{2}{3}Cn\log n}$ for a sufficiently large constant $C > 0$, where the last inequality follows from the Chernoff bound (see Lemma 10). This together with the union bound reveals that, with probability exceeding $1 - (1+\frac{2}{\theta})^n e^{-\frac{2}{3}Cn\log n}$,
$$ \frac{1}{m}\max_{1\leq i\leq m}\big| g_i(z)\,\mathbb{1}_{\{z\in S_i\}} \big| \cdot \sum_{i=1}^{m}\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \leq B \cdot 2Cn\log n, $$
with $B$ defined in (154). Similarly, one can show that
$$ \frac{1}{m}\max_{1\leq i\leq m}\big| g_i(z_0)\,\mathbb{1}_{\{z_0\in S_i\}} \big| \cdot \sum_{i=1}^{m}\mathbb{1}_{\{z\notin S_i,\,z_0\in S_i\}} \leq B \cdot 2Cn\log n. $$
Combine the above bounds to reach
$$ I_1 + I_2 \leq \epsilon\,\mathbb{E}\big[|g_i(z_0)|\big] + n^{\alpha}\theta + 4B\cdot Cn\log n \leq 2\epsilon\,\mathbb{E}\big[|g_i(z)|\big], $$
as long as
$$ n^{\alpha}\theta \leq \frac{\epsilon}{2}\,\mathbb{E}\big[|g_i(z)|\big] \qquad\text{and}\qquad 4B\cdot Cn\log n \leq \frac{\epsilon}{2}\,\mathbb{E}\big[|g_i(z)|\big]. $$
In view of the fact (150), one can take $\theta \asymp \epsilon\, n^{-\alpha}$ to conclude that
$$ \left|\frac{1}{m}\sum_{i=1}^{m}\Big( g_i(z)\,\mathbb{1}_{\{|a_{i,\perp}^{\top}z|\leq\beta\|z\|_2,\;|a_{i,1}|\leq 5\sqrt{\log m}\}} - \mathbb{E}\big[g_i(z)\,\mathbb{1}_{\{\cdots\}}\big] \Big)\right| \leq 2\epsilon\,\mathbb{E}\big[|g_i(z)|\big] \qquad (158) $$

holds for all $z \in \mathbb{R}^{n-1}$ with probability at least $1 - 2\exp(-c_4 n\log n)$ for some constant $c_4 > 0$, with the proviso that $\epsilon \geq 1/n$ and that $\epsilon\,\mathbb{E}[|g_i(z)|]/(Bn\log n)$ is sufficiently large.

Further, we note that the event $\{\max_i |a_{i,1}| \leq 5\sqrt{\log m}\}$ occurs with probability at least $1 - O(m^{-10})$. Therefore, on an event of probability at least $1 - O(m^{-10})$, one has
$$ \frac{1}{m}\sum_{i=1}^{m}g_i(z) = \frac{1}{m}\sum_{i=1}^{m}g_i(z)\,\mathbb{1}_{\{|a_{i,\perp}^{\top}z|\leq\beta\|z\|_2,\;|a_{i,1}|\leq 5\sqrt{\log m}\}} \qquad (159) $$
for all $z \in \mathbb{R}^{n-1}$ obeying $\max_i |a_{i,\perp}^{\top}z| \leq \beta\|z\|_2$. On this event, one can use the triangle inequality to obtain
$$ \left|\frac{1}{m}\sum_{i=1}^{m}g_i(z) - \mathbb{E}\big[g_i(z)\big]\right| = \left|\frac{1}{m}\sum_{i=1}^{m}g_i(z)\,\mathbb{1}_{\{|a_{i,\perp}^{\top}z|\leq\beta\|z\|_2,\;|a_{i,1}|\leq 5\sqrt{\log m}\}} - \mathbb{E}\big[g_i(z)\big]\right| $$
$$ \leq \left|\frac{1}{m}\sum_{i=1}^{m}\Big( g_i(z)\,\mathbb{1}_{\{\cdots\}} - \mathbb{E}\big[g_i(z)\,\mathbb{1}_{\{\cdots\}}\big] \Big)\right| + \Big| \mathbb{E}\big[g_i(z)\,\mathbb{1}_{\{\cdots\}}\big] - \mathbb{E}\big[g_i(z)\big] \Big| $$
$$ \leq 2\epsilon\,\mathbb{E}\big[|g_i(z)|\big] + \frac{1}{n}\,\mathbb{E}\big[|g_i(z)|\big] \leq 3\epsilon\,\mathbb{E}\big[|g_i(z)|\big], $$
as long as $\epsilon > 1/n$, where the penultimate line follows from (158) and (151). This leads to the desired uniform upper bound for $\frac{1}{m}\big|\sum_{i=1}^{m}g_i(z) - \mathbb{E}[g_i(z)]\big|$, namely, with probability at least $1 - O(m^{-10})$,
$$ \left|\frac{1}{m}\sum_{i=1}^{m}g_i(z) - \mathbb{E}\big[g_i(z)\big]\right| \leq 3\epsilon\,\mathbb{E}\big[|g_i(z)|\big] $$

holds uniformly for all $z \in \mathbb{R}^{n-1}$ obeying $\max_i |a_{i,\perp}^{\top}z| \leq \beta\|z\|_2$, provided that $m\epsilon^2/(n\log n)$ and $\epsilon\,\mathbb{E}[|g_i(z)|]/(Bn\log n)$ are both sufficiently large (with $B$ defined in (154)).

To finish up, we provide the bounds on $B$ and the resulting sample complexity conditions for each case as follows.

• For $g_i(z) = a_{i,1}^{3}\,a_{i,\perp}^{\top}z$, one has $B \lesssim \frac{1}{m}\beta\log^{3/2} m$, and hence we need $m \gg \max\big\{ \epsilon^{-2} n\log n,\ \epsilon^{-1}\beta\, n\log^{5/2} m \big\}$;

• For $g_i(z) = a_{i,1}\,(a_{i,\perp}^{\top}z)^{3}$, one has $B \lesssim \frac{1}{m}\beta^{3}\log^{1/2} m$, and hence we need $m \gg \max\big\{ \epsilon^{-2} n\log n,\ \epsilon^{-1}\beta^{3} n\log^{3/2} m \big\}$;

• For $g_i(z) = a_{i,1}^{2}\,(a_{i,\perp}^{\top}z)^{2}$, we have $B \lesssim \frac{1}{m}\beta^{2}\log m$, and hence $m \gg \max\big\{ \epsilon^{-2} n\log n,\ \epsilon^{-1}\beta^{2} n\log^{2} m \big\}$;

• For $g_i(z) = a_{i,1}^{6}\,(a_{i,\perp}^{\top}z)^{2}$, we have $B \lesssim \frac{1}{m}\beta^{2}\log^{3} m$, and hence $m \gg \max\big\{ \epsilon^{-2} n\log n,\ \epsilon^{-1}\beta^{2} n\log^{4} m \big\}$;

• For $g_i(z) = a_{i,1}^{2}\,(a_{i,\perp}^{\top}z)^{6}$, one has $B \lesssim \frac{1}{m}\beta^{6}\log m$, and hence $m \gg \max\big\{ \epsilon^{-2} n\log n,\ \epsilon^{-1}\beta^{6} n\log^{2} m \big\}$;

• For $g_i(z) = a_{i,1}^{2}\,(a_{i,\perp}^{\top}z)^{4}$, one has $B \lesssim \frac{1}{m}\beta^{4}\log m$, and hence $m \gg \max\big\{ \epsilon^{-2} n\log n,\ \epsilon^{-1}\beta^{4} n\log^{2} m \big\}$.

Given that $\epsilon$ can be an arbitrary quantity exceeding $1/n$, we establish the advertised results.
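As an illustration only (not part of the argument above), the following minimal Monte Carlo sketch checks this kind of uniform concentration for the representative case $g_i(z) = a_{i,1}^{3}\,a_{i,\perp}^{\top}z$, whose population mean is zero; the sizes n, m and the number of sampled test directions are arbitrary choices made for this sketch.

import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 20000            # illustrative sizes; the lemma requires m >> n * polylog(m)
num_dirs = 200              # random unit vectors z in R^{n-1}, standing in for an epsilon-net

a1 = rng.standard_normal(m)                  # first coordinates a_{i,1}
A_perp = rng.standard_normal((m, n - 1))     # remaining coordinates a_{i,perp}

Z = rng.standard_normal((num_dirs, n - 1))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# Empirical averages (1/m) * sum_i a_{i,1}^3 (a_{i,perp}^T z); the population mean is 0.
emp = (a1 ** 3) @ (A_perp @ Z.T) / m

print("largest deviation over test directions:", np.max(np.abs(emp)))
print("reference scale sqrt(n log^3 m / m)   :", np.sqrt(n * np.log(m) ** 3 / m))

In such a toy run the empirical averages should stay well below the reference scale, in line with the uniform bound established above.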

K Proof of Lemma 14

Note that if the second claim (59) holds, we can readily use it to justify the first one (58) by observing that
$$ \max_{1\leq i\leq m}\big|a_i^{\top}x^{\natural}\big| \leq 5\sqrt{\log m}\,\big\|x^{\natural}\big\|_2 $$
holds with probability at least $1 - O(m^{-10})$. As a consequence, the proof is devoted to justifying the second claim in the lemma.

First, notice that it suffices to consider all $z$'s with unit norm, i.e. $\|z\|_2 = 1$. We can then apply the triangle inequality to obtain

$$ \left\|\frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2 a_i a_i^{\top} - \big(I_n + 2zz^{\top}\big)\right\| \leq \underbrace{\left\|\frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2 a_i a_i^{\top}\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \big(\beta_1 I_n + \beta_2 zz^{\top}\big)\right\|}_{:=\theta_1} + \underbrace{\Big\|\beta_1 I_n + \beta_2 zz^{\top} - \big(I_n + 2zz^{\top}\big)\Big\|}_{:=\theta_2}, $$
where
$$ \beta_1 := \mathbb{E}\big[\xi^2\,\mathbb{1}_{\{|\xi|\leq c_2\sqrt{\log m}\}}\big] \qquad\text{and}\qquad \beta_2 := \mathbb{E}\big[\xi^4\,\mathbb{1}_{\{|\xi|\leq c_2\sqrt{\log m}\}}\big] - \beta_1, $$
with $\xi \sim \mathcal{N}(0,1)$.
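As a quick numerical sanity check (an illustrative sketch, not part of the proof; the truncation constant c2 = 2 and the sample size are arbitrary choices), the truncated moments beta1 and beta2 can be estimated by Monte Carlo and are seen to approach 1 and 2, respectively, as m grows:

import numpy as np

rng = np.random.default_rng(1)
xi = rng.standard_normal(2_000_000)     # samples of xi ~ N(0, 1)
c2 = 2.0                                # illustrative truncation constant

for m in (1e2, 1e4, 1e6):
    T = c2 * np.sqrt(np.log(m))
    mask = np.abs(xi) <= T
    beta1 = np.mean(xi ** 2 * mask)             # E[xi^2 1{|xi| <= c2 sqrt(log m)}]
    beta2 = np.mean(xi ** 4 * mask) - beta1     # E[xi^4 1{|xi| <= c2 sqrt(log m)}] - beta1
    print(f"m = {m:.0e}:  beta1 = {beta1:.4f},  beta2 = {beta2:.4f}")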

• For the second term $\theta_2$, we can further bound it as follows:
$$ \theta_2 \leq \big\|\beta_1 I_n - I_n\big\| + \big\|\beta_2 zz^{\top} - 2zz^{\top}\big\| \leq |\beta_1 - 1| + |\beta_2 - 2|, $$
which motivates us to bound $|\beta_1 - 1|$ and $|\beta_2 - 2|$. Towards this end, a simple calculation yields
$$ 1 - \beta_1 = \sqrt{\frac{2}{\pi}}\, c_2\sqrt{\log m}\; e^{-\frac{c_2^2\log m}{2}} + \mathrm{erfc}\Big(c_2\sqrt{\tfrac{\log m}{2}}\Big) \overset{(\text{i})}{\leq} \sqrt{\frac{2}{\pi}}\, c_2\sqrt{\log m}\; e^{-\frac{c_2^2\log m}{2}} + \frac{1}{\sqrt{\pi}}\cdot\frac{\sqrt{2}}{c_2\sqrt{\log m}}\; e^{-\frac{c_2^2\log m}{2}} \overset{(\text{ii})}{\leq} \frac{1}{m}, $$
where (i) arises from the fact that for all $x > 0$, $\mathrm{erfc}(x) \leq \frac{1}{\sqrt{\pi}}\frac{1}{x}e^{-x^2}$, and (ii) holds as long as $c_2 > 0$ is sufficiently large. Similarly, for the difference $|\beta_2 - 2|$, one can easily show that
$$ |\beta_2 - 2| \leq \Big|\mathbb{E}\big[\xi^4\,\mathbb{1}_{\{|\xi|\leq c_2\sqrt{\log m}\}}\big] - 3\Big| + |\beta_1 - 1| \leq \frac{2}{m}. \qquad (160) $$
Take the previous two bounds collectively to reach $\theta_2 \leq \frac{3}{m}$.

• With regard to $\theta_1$, we resort to the standard covering argument. First, fix some $x, z \in \mathbb{R}^n$ with $\|x\|_2 = \|z\|_2 = 1$ and notice that
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z^{\top}x\big)^2 $$
is a sum of $m$ i.i.d. random variables with bounded sub-exponential norms. To see this, one has
$$ \Big\| \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} \Big\|_{\psi_1} \leq c_2^2\log m\, \Big\|\big(a_i^{\top}x\big)^2\Big\|_{\psi_1} \lesssim c_2^2\log m, $$
where $\|\cdot\|_{\psi_1}$ denotes the sub-exponential norm [Ver12]. This further implies that
$$ \Big\| \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z^{\top}x\big)^2 \Big\|_{\psi_1} \lesssim c_2^2\log m. $$
Apply the Bernstein inequality to show that for any $0 \leq \epsilon \leq 1$,
$$ \mathbb{P}\left\{ \left| \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z^{\top}x\big)^2 \right| \geq 2\epsilon\, c_2^2\log m \right\} \leq 2\exp\big(-c\,\epsilon^2 m\big), $$
where $c > 0$ is some absolute constant. Taking $\epsilon \asymp \sqrt{\frac{n\log m}{m}}$ reveals that, with probability exceeding $1 - 2\exp(-c_{10}\, n\log m)$ for some $c_{10} > 0$, one has
$$ \left| \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z^{\top}x\big)^2 \right| \lesssim c_2^2\sqrt{\frac{n\log^3 m}{m}}. \qquad (161) $$
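The following minimal sketch (illustrative only; it uses beta1 ~ 1 and beta2 ~ 2 as stand-ins, which is accurate up to the O(1/m) error bounded above, and the sizes are arbitrary) checks the fixed-vector concentration (161) numerically:

import numpy as np

rng = np.random.default_rng(2)
n, m, c2 = 25, 50_000, 2.0
A = rng.standard_normal((m, n))

x = rng.standard_normal(n); x /= np.linalg.norm(x)
z = rng.standard_normal(n); z /= np.linalg.norm(z)

Az, Ax = A @ z, A @ x
trunc = np.abs(Az) <= c2 * np.sqrt(np.log(m))

lhs = np.mean(Az ** 2 * Ax ** 2 * trunc)
rhs = 1.0 + 2.0 * (z @ x) ** 2      # beta1 + beta2 (z^T x)^2 with beta1 ~ 1, beta2 ~ 2
print("deviation |lhs - rhs|              :", abs(lhs - rhs))
print("reference scale sqrt(n log^3 m / m):", np.sqrt(n * np.log(m) ** 3 / m))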


One can then apply the covering argument to extend the above result to all unit vectors $x, z \in \mathbb{R}^n$. Let $\mathcal{N}_\theta$ be a $\theta$-net of the unit sphere, which has cardinality at most $(1+\frac{2}{\theta})^n$. Then for every $x, z \in \mathbb{R}^n$ with unit norm, we can find $x_0, z_0 \in \mathcal{N}_\theta$ such that $\|x - x_0\|_2 \leq \theta$ and $\|z - z_0\|_2 \leq \theta$. The triangle inequality reveals that
$$ \left| \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z^{\top}x\big)^2 \right| \leq I_1 + I_2 + I_3, $$
where
$$ I_1 := \left| \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{|a_i^{\top}z_0|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z_0^{\top}x_0\big)^2 \right|, \qquad I_2 := \beta_2\left| \big(z^{\top}x\big)^2 - \big(z_0^{\top}x_0\big)^2 \right|, $$
$$ I_3 := \left| \frac{1}{m}\sum_{i=1}^{m}\Big[ \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{|a_i^{\top}z_0|\leq c_2\sqrt{\log m}\}} \Big] \right|. $$
Regarding $I_1$, one sees from (161) and the union bound that, with probability at least $1 - 2(1+\frac{2}{\theta})^{2n}\exp(-c_{10}\,n\log m)$, one has
$$ I_1 \lesssim c_2^2\sqrt{\frac{n\log^3 m}{m}}. $$
For the second term $I_2$, we can deduce from (160) that $\beta_2 \leq 3$, and
$$ \left| \big(z^{\top}x\big)^2 - \big(z_0^{\top}x_0\big)^2 \right| = \left| z^{\top}x - z_0^{\top}x_0 \right|\,\left| z^{\top}x + z_0^{\top}x_0 \right| = \left| (z - z_0)^{\top}x + z_0^{\top}(x - x_0) \right|\,\left| z^{\top}x + z_0^{\top}x_0 \right| \leq 2\big( \|z - z_0\|_2 + \|x - x_0\|_2 \big) \leq 4\theta, $$
where the last step arises from the Cauchy-Schwarz inequality and the fact that $x, z, x_0, z_0$ all have unit norm. This further implies $I_2 \leq 12\theta$. Now we move on to control the last term $I_3$. Denoting
$$ S_i := \big\{ u \,\big|\, |a_i^{\top}u| \leq c_2\sqrt{\log m} \big\} $$
allows us to rewrite $I_3$ as
$$ I_3 = \left| \frac{1}{m}\sum_{i=1}^{m}\Big[ \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{z\in S_i\}} - \big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{z_0\in S_i\}} \Big] \right| $$
$$ \leq \frac{1}{m}\sum_{i=1}^{m}\Big| \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2 - \big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2 \Big|\,\mathbb{1}_{\{z\in S_i,\,z_0\in S_i\}} + \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} + \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{z_0\in S_i,\,z\notin S_i\}}. \qquad (162) $$

Here the decomposition is similar to what we have done in (156). For the first term in (162), one has
$$ \frac{1}{m}\sum_{i=1}^{m}\Big| \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2 - \big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2 \Big|\,\mathbb{1}_{\{z\in S_i,\,z_0\in S_i\}} \leq \frac{1}{m}\sum_{i=1}^{m}\Big| \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2 - \big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2 \Big| \leq n^{\alpha}\theta $$
for some $\alpha = O(1)$, where the last inequality follows from the smoothness of the function $g(x, z) = \big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2$. Proceeding to the second term in (162), we see from (157) that
$$ \mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \leq \mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}, $$
which implies that
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \leq \max_{1\leq i\leq m}\Big[\big(a_i^{\top}z\big)^2\,\mathbb{1}_{\{z\in S_i\}}\Big]\cdot \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \leq c_2^2\log m\cdot \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}. $$
With regard to the above quantity, we have the following claim.

Claim 2. With probability at least $1 - c_2 e^{-c_3 n\log m}$ for some constants $c_2, c_3 > 0$, one has
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}} \lesssim \sqrt{\frac{n\log m}{m}} $$
for all $x \in \mathbb{R}^n$ with unit norm and for all $z_0 \in \mathcal{N}_\theta$.

With this claim in place, we arrive at
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{z\in S_i,\,z_0\notin S_i\}} \lesssim c_2^2\sqrt{\frac{n\log^3 m}{m}} $$

with high probability. Similar arguments lead us to conclude that, with high probability,
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z_0\big)^2\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{z_0\in S_i,\,z\notin S_i\}} \lesssim c_2^2\sqrt{\frac{n\log^3 m}{m}}. $$
Taking the above bounds collectively and setting $\theta \asymp m^{-\alpha-1}$ yield, with high probability and for all unit vectors $z$ and $x$,
$$ \left| \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}z\big)^2\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{|a_i^{\top}z|\leq c_2\sqrt{\log m}\}} - \beta_1 - \beta_2\big(z^{\top}x\big)^2 \right| \lesssim c_2^2\sqrt{\frac{n\log^3 m}{m}}, $$
which is equivalent to saying that
$$ \theta_1 \lesssim c_2^2\sqrt{\frac{n\log^3 m}{m}}. $$
The proof is complete by combining the upper bounds on $\theta_1$ and $\theta_2$, and the fact that $\frac{1}{m} = o\Big(\sqrt{\frac{n\log^3 m}{m}}\Big)$.
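Before turning to the proof of Claim 2, here is a small numerical illustration (not part of the argument; the dimensions are arbitrary) of the matrix concentration that the lemma captures: for a fixed unit vector z, the empirical matrix (1/m) * sum_i (a_i^T z)^2 a_i a_i^T is close to its mean ||z||^2 I_n + 2 z z^T in operator norm.

import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 100_000
A = rng.standard_normal((m, n))

z = rng.standard_normal(n); z /= np.linalg.norm(z)
Az = A @ z

M_emp = (A * (Az ** 2)[:, None]).T @ A / m          # (1/m) sum_i (a_i^T z)^2 a_i a_i^T
M_pop = np.eye(n) + 2.0 * np.outer(z, z)            # ||z||^2 I_n + 2 z z^T for a unit z

print("operator-norm error                :", np.linalg.norm(M_emp - M_pop, 2))
print("reference scale sqrt(n log^3 m / m):", np.sqrt(n * np.log(m) ** 3 / m))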

Proof of Claim 2. We first apply the triangle inequality to get
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}} \leq \underbrace{\frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}}_{:=J_1} + \underbrace{\frac{1}{m}\sum_{i=1}^{m}\Big|\big(a_i^{\top}x\big)^2 - \big(a_i^{\top}x_0\big)^2\Big|\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}}_{:=J_2}, $$
where $x_0 \in \mathcal{N}_\theta$ and $\|x - x_0\|_2 \leq \theta$. The second term can be controlled as follows:
$$ J_2 \leq \frac{1}{m}\sum_{i=1}^{m}\Big|\big(a_i^{\top}x\big)^2 - \big(a_i^{\top}x_0\big)^2\Big| \leq n^{O(1)}\theta, $$
where we utilize the smoothness property of the function $h(x) = \big(a_i^{\top}x\big)^2$. It remains to bound $J_1$, for which we first fix $x_0$ and $z_0$. Take the Bernstein inequality to get
$$ \mathbb{P}\left\{ \left| \frac{1}{m}\sum_{i=1}^{m}\Big( \big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}} - \mathbb{E}\Big[\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}\Big] \Big) \right| \geq \tau \right\} \leq 2e^{-cm\tau^2} $$

for some constant $c > 0$ and any sufficiently small $\tau > 0$. Taking $\tau \asymp \sqrt{\frac{n\log m}{m}}$ reveals that, with probability exceeding $1 - 2e^{-Cn\log m}$ for some large enough constant $C > 0$,
$$ J_1 \lesssim \mathbb{E}\Big[\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}\Big] + \sqrt{\frac{n\log m}{m}}. $$
Regarding the expectation term, it follows from the Cauchy-Schwarz inequality that
$$ \mathbb{E}\Big[\big(a_i^{\top}x_0\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}\Big] \leq \sqrt{\mathbb{E}\Big[\big(a_i^{\top}x_0\big)^4\Big]}\;\sqrt{\mathbb{E}\Big[\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}\Big]} \asymp \sqrt{\mathbb{E}\Big[\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}}\Big]} \leq \frac{1}{m}, $$
as long as $\theta$ is sufficiently small. Combining the preceding bounds with the union bound, we can see that, with probability at least $1 - 2(1+\frac{2}{\theta})^{2n}e^{-Cn\log m}$,
$$ J_1 \lesssim \frac{1}{m} + \sqrt{\frac{n\log m}{m}}. $$
Picking $\theta \asymp m^{-c_1}$ for some large enough constant $c_1 > 0$, we arrive at, with probability at least $1 - c_2 e^{-c_3 n\log m}$,
$$ \frac{1}{m}\sum_{i=1}^{m}\big(a_i^{\top}x\big)^2\,\mathbb{1}_{\{c_2\sqrt{\log m}\leq|a_i^{\top}z_0|\leq c_2\sqrt{\log m}+\sqrt{6n}\,\theta\}} \lesssim \sqrt{\frac{n\log m}{m}} $$
for all unit vectors $x$ and for all $z_0 \in \mathcal{N}_\theta$, where $c_2, c_3 > 0$ are some absolute constants.

L Proof of Lemma 15

Recall that the Hessian matrix is given by
$$ \nabla^2 f(z) = \frac{1}{m}\sum_{i=1}^{m}\Big[ 3\big(a_i^{\top}z\big)^2 - \big(a_i^{\top}x^{\natural}\big)^2 \Big] a_i a_i^{\top}. $$
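For intuition, the following sketch (illustrative only; the dimensions, the test point z, and the random seed are arbitrary choices, and x_star stands in for the signal x^natural) forms this empirical Hessian for a Gaussian design and compares it with the population approximation 6 z z^T + 3 ||z||^2 I_n - 2 x^natural x^natural^T - ||x^natural||^2 I_n that underlies (163) below; it also reports the operator norm against the bound 10 ||z||^2 + 4 derived later in this appendix.

import numpy as np

rng = np.random.default_rng(4)
n, m = 20, 100_000
A = rng.standard_normal((m, n))

x_star = rng.standard_normal(n); x_star /= np.linalg.norm(x_star)   # stands in for x^natural
z = x_star + 0.5 * rng.standard_normal(n) / np.sqrt(n)              # an arbitrary test point

w = 3.0 * (A @ z) ** 2 - (A @ x_star) ** 2           # 3 (a_i^T z)^2 - (a_i^T x^natural)^2
hess = (A * w[:, None]).T @ A / m                    # empirical Hessian at z

approx = (6.0 * np.outer(z, z) + 3.0 * (z @ z) * np.eye(n)
          - 2.0 * np.outer(x_star, x_star) - np.eye(n))             # population approximation

print("||Hessian - approximation||:", np.linalg.norm(hess - approx, 2))
print("||Hessian||, 10||z||^2 + 4 :", np.linalg.norm(hess, 2), 10.0 * (z @ z) + 4.0)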

Lemma 14 implies that, with probability at least $1 - O(m^{-10})$,
$$ \left\| \nabla^2 f(z) - \Big( 6zz^{\top} + 3\|z\|_2^2 I_n - 2x^{\natural}x^{\natural\top} - \|x^{\natural}\|_2^2 I_n \Big) \right\| \lesssim \sqrt{\frac{n\log^3 m}{m}}\,\max\big\{ \|z\|_2^2,\ \|x^{\natural}\|_2^2 \big\} \qquad (163) $$
holds simultaneously for all $z$ obeying $\max_{1\leq i\leq m}|a_i^{\top}z| \leq c_0\sqrt{\log m}\,\|z\|_2$, with the proviso that $m \gg n\log^3 m$. This together with the fact $\|x^{\natural}\|_2 = 1$ leads to
$$ -\nabla^2 f(z) \;\succeq\; -6zz^{\top} - \left( 3\|z\|_2^2 - 1 + O\Big( \sqrt{\tfrac{n\log^3 m}{m}}\,\max\big\{\|z\|_2^2,\,1\big\} \Big) \right) I_n \;\succeq\; -\left( 9\|z\|_2^2 - 1 + O\Big( \sqrt{\tfrac{n\log^3 m}{m}}\,\max\big\{\|z\|_2^2,\,1\big\} \Big) \right) I_n. $$
As a consequence, if we pick $\eta > 0$ sufficiently small, then $I_n - \eta\nabla^2 f(z) \succeq 0$. This combined with (163) gives
$$ \left\| I_n - \eta\nabla^2 f(z) - \Big( \big(1 - 3\eta\|z\|_2^2 + \eta\big) I_n + 2\eta\, x^{\natural}x^{\natural\top} - 6\eta\, zz^{\top} \Big) \right\| \lesssim \sqrt{\frac{n\log^3 m}{m}}\,\max\big\{ \|z\|_2^2,\ 1 \big\}. $$

Additionally, it follows from (163) that
$$ \big\|\nabla^2 f(z)\big\| \leq \Big\| 6zz^{\top} + 3\|z\|_2^2 I_n - 2x^{\natural}x^{\natural\top} - \|x^{\natural}\|_2^2 I_n \Big\| + O\Big( \sqrt{\tfrac{n\log^3 m}{m}}\,\max\big\{\|z\|_2^2,\,\|x^{\natural}\|_2^2\big\} \Big) \leq 9\|z\|_2^2 + 3 + O\Big( \sqrt{\tfrac{n\log^3 m}{m}}\,\max\big\{\|z\|_2^2,\,1\big\} \Big) \leq 10\|z\|_2^2 + 4, $$
as long as $m \gg n\log^3 m$.

M Proof of Lemma 16

Note that when $t \lesssim \log n$, one naturally has
$$ \Big(1 + \frac{1}{\log m}\Big)^{t} \lesssim 1, \qquad (164) $$
since $\big(1 + \frac{1}{\log m}\big)^{t} \leq e^{t/\log m} \leq e^{O(1)}$ whenever $t \lesssim \log n \leq \log m$.

Regarding the first set of consequences (61), one sees via the triangle inequality that
$$ \max_{1\leq l\leq m}\big\|x^{t,(l)}\big\|_2 \leq \big\|x^{t}\big\|_2 + \max_{1\leq l\leq m}\big\|x^{t} - x^{t,(l)}\big\|_2 \overset{(\text{i})}{\leq} C_5 + \beta_t\Big(1 + \frac{1}{\log m}\Big)^{t} C_1\eta\,\frac{\sqrt{n\log^5 m}}{m} \overset{(\text{ii})}{\leq} C_5 + O\Big(\frac{\sqrt{n\log^5 m}}{m}\Big) \overset{(\text{iii})}{\leq} 2C_5, $$
where (i) follows from the induction hypotheses (40a) and (40e), the second inequality (ii) holds true since $\beta_t \lesssim 1$ and (164), and the last one (iii) is valid as long as $m \gg \sqrt{n\log^5 m}$. Similarly, for the lower bound, one can show that for each $1 \leq l \leq m$,

$$ \big\|x_{\perp}^{t,(l)}\big\|_2 \geq \big\|x_{\perp}^{t}\big\|_2 - \big\|x_{\perp}^{t} - x_{\perp}^{t,(l)}\big\|_2 \geq \big\|x_{\perp}^{t}\big\|_2 - \max_{1\leq l\leq m}\big\|x^{t} - x^{t,(l)}\big\|_2 \geq c_5 - \beta_t\Big(1 + \frac{1}{\log m}\Big)^{t} C_1\eta\,\frac{\sqrt{n\log^3 m}}{m} \geq \frac{c_5}{2}, $$
as long as $m \gg \sqrt{n\log^5 m}$. Using similar arguments (together with $\alpha_t \lesssim 1$), we can prove the lower and upper bounds for $x^{t,\mathrm{sgn}}$ and $x^{t,\mathrm{sgn},(l)}$.

For the second set of consequences (62), namely the incoherence consequences, first notice that it is sufficient to show that the inner product (for instance $|a_l^{\top}x^{t}|$) is upper bounded by $C_7\sqrt{\log m}$ in magnitude for some absolute constant $C_7 > 0$. To see this, suppose for now that
$$ \max_{1\leq l\leq m}\big|a_l^{\top}x^{t}\big| \leq C_7\sqrt{\log m}. \qquad (165) $$
One can further utilize the lower bound on $\|x^{t}\|_2$ to deduce that
$$ \max_{1\leq l\leq m}\big|a_l^{\top}x^{t}\big| \leq \frac{C_7}{c_5}\sqrt{\log m}\,\big\|x^{t}\big\|_2. $$
This justifies the claim that we only need to obtain bounds as in (165). Once again we can invoke the triangle inequality to deduce that, with probability at least $1 - O(m^{-10})$,
$$ \max_{1\leq l\leq m}\big|a_l^{\top}x^{t}\big| \leq \max_{1\leq l\leq m}\big|a_l^{\top}\big(x^{t} - x^{t,(l)}\big)\big| + \max_{1\leq l\leq m}\big|a_l^{\top}x^{t,(l)}\big| \overset{(\text{i})}{\leq} \max_{1\leq l\leq m}\|a_l\|_2 \cdot \max_{1\leq l\leq m}\big\|x^{t} - x^{t,(l)}\big\|_2 + \max_{1\leq l\leq m}\big|a_l^{\top}x^{t,(l)}\big| $$
$$ \overset{(\text{ii})}{\lesssim} \sqrt{n}\,\beta_t\Big(1 + \frac{1}{\log m}\Big)^{t} C_1\eta\,\frac{\sqrt{n\log^5 m}}{m} + \sqrt{\log m}\,\big\|x^{t,(l)}\big\|_2 \lesssim \frac{n\log^{5/2} m}{m} + C_5\sqrt{\log m} \lesssim C_5\sqrt{\log m}. $$
Here, the first relation (i) results from the Cauchy-Schwarz inequality, and (ii) utilizes the induction hypothesis (40a), the fact (57) and standard Gaussian concentration, namely, $\max_{1\leq l\leq m}|a_l^{\top}x^{t,(l)}| \lesssim \sqrt{\log m}\,\|x^{t,(l)}\|_2$ with probability at least $1 - O(m^{-10})$. The last line is a direct consequence of the fact (61a) established above and (164).

In regard to the incoherence w.r.t. $x^{t,\mathrm{sgn}}$, we resort to the leave-one-out sequence $x^{t,\mathrm{sgn},(l)}$. Specifically, we have
$$ \big|a_l^{\top}x^{t,\mathrm{sgn}}\big| \leq \big|a_l^{\top}x^{t}\big| + \big|a_l^{\top}\big(x^{t,\mathrm{sgn}} - x^{t}\big)\big| \leq \big|a_l^{\top}x^{t}\big| + \big|a_l^{\top}\big(x^{t,\mathrm{sgn}} - x^{t} - x^{t,\mathrm{sgn},(l)} + x^{t,(l)}\big)\big| + \big|a_l^{\top}\big(x^{t,\mathrm{sgn},(l)} - x^{t,(l)}\big)\big| $$
$$ \lesssim \sqrt{\log m} + \sqrt{n}\,\alpha_t\Big(1 + \frac{1}{\log m}\Big)^{t} C_4\,\frac{\sqrt{n\log^9 m}}{m} + \sqrt{\log m} \lesssim \sqrt{\log m}. $$
The remaining incoherence conditions can be obtained through similar arguments. For the sake of conciseness, we omit the proof here.

With regard to the third set of consequences (63), we can directly use the induction hypothesis and obtain

$$ \max_{1\leq l\leq m}\big\|x^{t} - x^{t,(l)}\big\|_2 \leq \beta_t\Big(1 + \frac{1}{\log m}\Big)^{t} C_1\,\frac{\sqrt{n\log^3 m}}{m} \lesssim \frac{\sqrt{n\log^3 m}}{m} \lesssim \frac{1}{\log m}, $$
as long as $m \gg \sqrt{n\log^5 m}$. Apply similar arguments to get the claimed bound on $\|x^{t} - x^{t,\mathrm{sgn}}\|_2$. For the remaining one, we have
$$ \max_{1\leq l\leq m}\big|x_{\parallel}^{t,(l)}\big| \leq \max_{1\leq l\leq m}\big|x_{\parallel}^{t}\big| + \max_{1\leq l\leq m}\big|x_{\parallel}^{t,(l)} - x_{\parallel}^{t}\big| \leq \alpha_t + \alpha_t\Big(1 + \frac{1}{\log m}\Big)^{t} C_2\eta\,\frac{\sqrt{n\log^{12} m}}{m} \leq 2\alpha_t, $$
with the proviso that $m \gg \sqrt{n\log^{12} m}$.
