Understanding Generalization and Optimization Performance of Deep CNNs
Pan Zhou¹  Jiashi Feng¹
Abstract
This work aims to provide understandings on the remarkable success of deep convolutional neural networks (CNNs) by theoretically analyzing their generalization performance and establishing optimization guarantees for gradient descent based training algorithms. Specifically, for a CNN model consisting of $l$ convolutional layers and one fully connected layer, we prove that its generalization error is bounded by $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/n})$, where $\theta$ denotes the freedom degree of the network parameters and $\widetilde{\varrho}=\mathcal{O}(\log(\prod_{i=1}^{l} b_i(k_i-s_i+1)/p)+\log(b_{l+1}))$ encapsulates architecture parameters including the kernel size $k_i$, stride $s_i$, pooling size $p$ and parameter magnitude $b_i$. To our best knowledge, this is the first generalization bound that only depends on $\mathcal{O}(\log(\prod_{i=1}^{l+1} b_i))$, tighter than existing ones that all involve an exponential term like $\mathcal{O}(\prod_{i=1}^{l+1} b_i)$. Besides, we prove that for an arbitrary gradient descent algorithm, the computed approximate stationary point by minimizing empirical risk is also an approximate stationary point to the population risk. This well explains why gradient descent training algorithms usually perform sufficiently well in practice. Furthermore, we prove the one-to-one correspondence and convergence guarantees for the non-degenerate stationary points between the empirical and population risks. It implies that the computed local minimum for the empirical risk is also close to a local minimum for the population risk, thus ensuring the good generalization performance of CNNs.
¹Department of Electrical & Computer Engineering (ECE), National University of Singapore, Singapore. Correspondence to: Pan Zhou, Jiashi Feng.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction
Deep convolutional neural networks (CNNs) have been successfully applied to various fields, such as image classification
(Szegedy et al., 2015; He et al., 2016; Wang et al., 2017), speech recognition (Sainath et al., 2013; Abdel-Hamid et al., 2014), and classic games (Silver et al., 2016; Brown & Sandholm, 2017). However, theoretical analyses and understandings of deep CNNs still largely lag behind their practical applications. Recently, although many works establish theoretical understandings on deep feedforward neural networks (DNNs) from various aspects, e.g. (Neyshabur et al., 2015; Kawaguchi, 2016; Zhou & Feng, 2018; Tian, 2017; Lee et al., 2017), only a few (Sun et al., 2016; Kawaguchi et al., 2017; Du et al., 2017a;b) provide explanations on deep CNNs due to their more complex architectures and operations. Besides, these existing works all suffer certain discrepancy between their theories and practice. For example, the developed generalization error bound either exponentially grows along with the depth of a CNN model (Sun et al., 2016) or is data-dependent (Kawaguchi et al., 2017), and the convergence guarantees for optimization algorithms over CNNs are achieved by assuming an over-simplified CNN model consisting of only one non-overlapping convolutional layer (Du et al., 2017a;b). As an attempt to explain the practical success of deep CNNs and mitigate the gap between theory and practice, this work aims to provide a tighter data-independent generalization error bound and algorithmic optimization guarantees for the deep CNN models commonly used in practice. Specifically, we theoretically analyze deep CNNs from the following two aspects: (1) how their generalization performance varies with different network architecture choices and (2) why gradient descent based algorithms, such as stochastic gradient descent (SGD) (Robbins & Monro, 1951), adam (Kingma & Ba, 2015) and RMSProp (Tieleman & Hinton, 2012), on minimizing the empirical risk usually offer models with satisfactory performance. Moreover, we theoretically demonstrate the benefits of (strided) convolution and pooling operations, which are unique to CNNs, to the generalization performance, compared with feedforward networks. Formally, we consider a CNN model $g(w;D)$ parameterized by $w\in\mathbb{R}^d$, consisting of $l$ convolutional layers and one subsequent fully connected layer. It maps the input $D\in\mathbb{R}^{r_0\times c_0}$ to an output vector $v\in\mathbb{R}^{d_{l+1}}$. Its $i$-th convolutional layer takes $Z_{(i-1)}\in\mathbb{R}^{\tilde{r}_{i-1}\times\tilde{c}_{i-1}\times d_{i-1}}$ as input and outputs $Z_{(i)}\in\mathbb{R}^{\tilde{r}_i\times\tilde{c}_i\times d_i}$ through spatial convolution, non-linear activation and pooling operations sequentially.
Here $\tilde{r}_i\times\tilde{c}_i$ and $d_i$ respectively denote the resolution and the number of feature maps. Specifically, the computation of the $i$-th convolutional layer is described as
$$X_{(i)}(:,:,j) = Z_{(i-1)}\circledast W^j_{(i)} \in \mathbb{R}^{r_i\times c_i},\ \forall j=1,\cdots,d_i,$$
$$Y_{(i)} = \sigma_1(X_{(i)})\in\mathbb{R}^{r_i\times c_i\times d_i},\quad Z_{(i)} = \mathrm{pool}(Y_{(i)})\in\mathbb{R}^{\tilde{r}_i\times\tilde{c}_i\times d_i},$$
where $X_{(i)}(:,:,j)$ denotes the $j$-th feature map output by the $i$-th layer; $W^j_{(i)}\in\mathbb{R}^{k_i\times k_i\times d_{i-1}}$ denotes the $j$-th convolutional kernel of size $k_i\times k_i$ and there are in total $d_i$ kernels in the $i$-th layer; $\circledast$, $\mathrm{pool}(\cdot)$ and $\sigma_1(\cdot)$ respectively denote the convolutional operation with stride $s_i$, the pooling operation with window size $p\times p$ without overlap, and the sigmoid function. In particular, $Z_{(0)}=D$ is the input sample. The last layer is a fully connected one and is formulated as
$$u = W_{(l+1)} z_{(l)} \in \mathbb{R}^{d_{l+1}} \quad\text{and}\quad v = \sigma_2(u)\in\mathbb{R}^{d_{l+1}},$$
where $z_{(l)}\in\mathbb{R}^{\tilde{r}_l\tilde{c}_l d_l}$ is the vectorization of the output $Z_{(l)}$ of the last convolutional layer; $W_{(l+1)}\in\mathbb{R}^{d_{l+1}\times\tilde{r}_l\tilde{c}_l d_l}$ denotes the connection weight matrix; $\sigma_2(\cdot)$ is the softmax activation function (for classification) and $d_{l+1}$ is the class number.
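For concreteness, the following is a minimal NumPy sketch of the forward pass just described. The pooling type is not pinned down above; the upsampling operator $\mathrm{up}(\cdot)$ used in the supplement divides by $p^2$, which matches average pooling, so the sketch assumes non-overlapping average pooling. All function names are illustrative rather than taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def conv2d(Z, W, stride):
    """Valid convolution of an (H, W, C) input with one (k, k, C) kernel at the given stride."""
    H, Wd, _ = Z.shape
    k = W.shape[0]
    r, c = (H - k) // stride + 1, (Wd - k) // stride + 1
    out = np.zeros((r, c))
    for a in range(r):
        for b in range(c):
            out[a, b] = np.sum(Z[a*stride:a*stride+k, b*stride:b*stride+k, :] * W)
    return out

def avg_pool(Y, p):
    """Non-overlapping p x p average pooling applied to each channel."""
    r, c, d = Y.shape
    Y = Y[:(r // p) * p, :(c // p) * p, :]
    return Y.reshape(r // p, p, c // p, p, d).mean(axis=(1, 3))

def cnn_forward(D, kernels, W_fc, strides, p):
    """Forward pass of the l conv layers + 1 fully connected layer; returns v = softmax(u).
    D has shape (r0, c0, d0) (use d0 = 1 for a single-channel input as in the main text);
    kernels[i] is a list of (k_i, k_i, d_{i-1}) arrays."""
    Z = D
    for layer, s in zip(kernels, strides):
        X = np.stack([conv2d(Z, Wj, s) for Wj in layer], axis=-1)   # convolution with stride s
        Y = sigmoid(X)                                              # sigma_1: sigmoid activation
        Z = avg_pool(Y, p)                                          # p x p non-overlapping pooling
    u = W_fc @ Z.reshape(-1)                                        # fully connected layer
    return softmax(u)                                               # sigma_2: softmax output
```

Averaging the squared loss $\frac{1}{2}\|v-y\|_2^2$ of these outputs over the training samples gives the empirical risk defined next.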
In practice, a deep CNN model is trained by minimizing the following empirical risk in terms of the squared loss on the training data pairs $(D^{(i)}, y^{(i)})$ drawn from an unknown distribution $\mathcal{D}$,
$$\widetilde{Q}_n(w) \triangleq \frac{1}{n}\sum_{i=1}^{n} f(g(w;D^{(i)}), y^{(i)}), \qquad (1)$$
where $f(g(w;D),y)=\frac{1}{2}\|v-y\|_2^2$ is the squared loss function. One can obtain the model parameter $\widetilde{w}$ via SGD or its variants like adam and RMSProp. However, this empirical solution is different from the desired optimum $w^*$ that minimizes the population risk: $Q(w)\triangleq \mathbb{E}_{(D,y)\sim\mathcal{D}} f(g(w;D),y)$. This raises an important question: why do CNNs trained by minimizing the empirical risk usually perform well in practice, considering the high model complexity and nonconvexity? This work answers this question by (1) establishing the generalization performance guarantee for CNNs and (2) expounding why the solution $\widetilde{w}$ computed by gradient descent based algorithms for minimizing the empirical risk usually performs sufficiently well in practice. To be specific, we present three new theoretical guarantees for CNNs. First, we prove that the generalization error of deep CNNs decreases at the rate of $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/(2n)})$, where $\theta$ denotes the parameter freedom degree¹ and $\widetilde{\varrho}$ depends on the network architecture parameters including the convolutional kernel size $k_i$, stride $s_i$, pooling size $p$, channel number $d_i$ and parameter magnitudes.
¹We use the terminology "parameter freedom degree" here to characterize the redundancy of parameters. For example, for a rank-$r$ matrix $A\in\mathbb{R}^{m_1\times m_2}$, the parameter freedom degree in this work is $r(m_1+m_2+1)$ instead of the commonly used $r(m_1+m_2-r)$.
It is worth mentioning that our generalization error bound is the first one that does not exponentially grow with depth. Secondly, we prove that for any gradient descent based optimization algorithm, e.g. SGD, RMSProp or adam, if its output $\widetilde{w}$ is an approximate stationary point of the empirical risk $\widetilde{Q}_n(w)$, then $\widetilde{w}$ is also an approximate stationary point of the population risk $Q(w)$. This result is important as it explains why CNNs trained by minimizing the empirical risk have good generalization performance on test samples. We achieve such results by analyzing the convergence behavior of the empirical gradient to its population counterpart. Finally, we go further and quantitatively bound the distance between $\widetilde{w}$ and $w^*$. We prove that when the samples are sufficient, a non-degenerate stationary point $w_n$ of $\widetilde{Q}_n(w)$ uniquely corresponds to a non-degenerate stationary point $w^*$ of the population risk $Q(w)$, with a distance shrinking at the rate of $\mathcal{O}((\beta/\zeta)\sqrt{d\widetilde{\varrho}/n})$, where $\beta$ also depends on the CNN architecture parameters (see Theorem 2). Here $\zeta$ accounts for the geometric topology of non-degenerate stationary points. Besides, the corresponding pair $(w_n, w^*)$ shares the same geometrical property: if one of $(w_n, w^*)$ is a local minimum or saddle point, so is the other one. All these results guarantee that, for an arbitrary algorithm provided with sufficient samples, if the computed $\widetilde{w}$ is close to the stationary point $w_n$, then $\widetilde{w}$ is also close to the optimum $w^*$ and they share the same geometrical property.
To sum up, we make multiple contributions to understand deep CNNs theoretically. To our best knowledge, this work presents the first theoretical guarantees on both generalization error bound without exponential growth over network depth and optimization performance for deep CNNs. We substantially extend prior works on CNNs (Du et al., 2017a;b) from the over-simplified single-layer models to the multi-layer ones, which is of more practical significance. Our generalization error bound is much tighter than the one derived from Rademacher complexity (Sun et al., 2016) and is also independent of data and specific training procedure, which distinguishes it from (Kawaguchi et al., 2017).
2. Related Works
Recently, many works have been devoted to explaining the remarkable success of deep neural networks. However, most works only focus on analyzing fully feedforward networks from aspects like generalization performance (Bartlett & Maass, 2003; Neyshabur et al., 2015), loss surface (Saxe et al., 2014; Dauphin et al., 2014; Choromanska et al., 2015; Kawaguchi, 2016; Nguyen & Hein, 2017; Zhou & Feng, 2018), optimization algorithm convergence (Tian, 2017; Li
& Yuan, 2017) and expression ability (Eldan & Shamir, 2016; Soudry & Hoffer, 2017; Lee et al., 2017).
The literature targeting the analysis of CNNs is very limited, mainly because CNNs have much more complex architectures and computation. Among the few existing works, Du et al. (2017b) presented results for a simple and shallow CNN consisting of only one non-overlapping convolutional layer and ReLU activations, showing that gradient descent (GD) algorithms with weight normalization can converge to the global minimum. Similarly, Du et al. (2017a) also analyzed the optimization performance of GD and SGD with non-Gaussian inputs for CNNs with only one non-overlapping convolutional layer. By utilizing the kernel method, Zhang et al. (2017) transformed a CNN model into a single-layer convex model which has almost the same loss as the original CNN with high probability and proved that the transformed model has higher learning efficiency.
Our work is also critically different from the recent work (Zhou & Feng, 2018), although we adopt a similar analysis road map. Zhou & Feng (2018) analyzed DNNs, while this work focuses on CNNs, whose more complex architectures and operations are more challenging and require different analysis techniques. Besides, this work provides stronger results in the sense of several tighter bounds under much milder assumptions. (1) For nonlinear DNNs, Zhou & Feng (2018) assumed the input data to be Gaussian, while this work gets rid of such a restrictive assumption. (2) The generalization error bound $\mathcal{O}(\hat{r}^{l+1}\sqrt{d/n})$ in (Zhou & Feng, 2018) exponentially depends on the upper magnitude bound $\hat{r}$ of the weight matrix per layer and linearly depends on the total parameter number $d$, while ours is $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/n})$, which only depends on the logarithmic term $\widetilde{\varrho}=\log(\prod_{i=1}^{l+1} b_i)$ and the freedom degree $\theta$ of the network parameters, where $b_i$ and $b_{l+1}$ respectively denote the upper magnitude bounds of each kernel per layer and of the weight matrix of the fully connected layer. Note, the exponential term $\mathcal{O}(\hat{r}^{l+1})$ in (Zhou & Feng, 2018) cannot be further improved due to their proof techniques. The results on the empirical gradient and stationary point pairs in (Zhou & Feng, 2018) rely on $\mathcal{O}(\hat{r}^{2(l+1)})$, while ours is $\mathcal{O}(\prod_{i=1}^{l+1} b_i)$, which only depends on $b_i$ instead of $b_i^2$. (3) This work explores the parameter structures, i.e. the low-rankness property, and derives tighter bounds as the parameter freedom degree $\theta$ is usually smaller than the total parameter number $d$.
Regarding the generalization performance of CNNs, Sun et al. (2016) provided the Rademacher complexity of a deep CNN model, which is then used to establish the generalization error bound. But the Rademacher complexity exponentially depends on the magnitude of the total parameters per layer, leading to loose results. In contrast, the generalization error bound established in this work is much tighter, as discussed in detail in Sec. 3. Kawaguchi et al. (2017) introduced two generalization error bounds for CNNs, but both depend on a specific dataset as they involve the validation error or the intermediate outputs of the network model on a provided dataset. They also presented a dataset-independent generalization error bound, but it requires a specific two-phase training procedure, where the second phase needs to fix the states of the ReLU activation functions. However, such a two-phase training procedure is not used in practice. There are also some works focusing on the convergence behavior of the nonconvex empirical risk of a single-layer model to the population risk. Our proof techniques essentially differ from theirs. For example, Gonen & Shalev-Shwartz (2017) proved that the empirical risk converges to the population risk for those nonconvex problems with no degenerate saddle points. Unfortunately, due to the existence of degenerate saddle points in deep networks (Dauphin et al., 2014; Kawaguchi, 2016), their results are not applicable here. A very recent work (Mei et al., 2017) focuses on single-layer nonconvex problems but requires the gradient and Hessian of the empirical risk to be strongly sub-Gaussian and sub-exponential, respectively. Besides, it assumes a linearity property for the gradient which hardly holds in practice. Comparatively, our assumptions are much milder: we only assume the magnitude of the parameters to be bounded. Furthermore, we also explore the parameter structures of optimized CNNs, i.e. the low-rankness property, and derive bounds matching empirical observations better. Finally, we analyze the convergence rate of the empirical risk and the generalization error of CNNs, which is absent in (Mei et al., 2017).
3. Generalization Performance of Deep CNNs
In this section, we present the generalization error bound for deep CNNs and reveal the effects of different architecture parameters on their generalization performance, providing some principles for model architecture design. We derive these results by establishing uniform convergence of the empirical risk $\widetilde{Q}_n(w)$ to its population counterpart $Q(w)$.
We start with explaining our assumptions. Similar to (Xu & Mannor, 2012; Tian, 2017; Zhou & Feng, 2018), we assume that the parameters of the CNN have bounded magnitude. But we get rid of the Gaussian assumptions on the input data, meaning our assumption is milder than the ones in (Tian, 2017; Soudry & Hoffer, 2017; Zhou & Feng, 2018).

Assumption 1. The magnitudes of the $j$-th kernel $W^j_{(i)}\in\mathbb{R}^{k_i\times k_i\times d_{i-1}}$ in the $i$-th convolutional layer and of the weight matrix $W_{(l+1)}\in\mathbb{R}^{d_{l+1}\times\tilde{r}_l\tilde{c}_l d_l}$ in the fully connected layer are respectively bounded as
$$\|W^j_{(i)}\|_F \le b_i\ (1\le j\le d_i;\ 1\le i\le l), \qquad \|W_{(l+1)}\|_F \le b_{l+1},$$
where $b_i\ (1\le i\le l)$ and $b_{l+1}$ are positive constants. We also assume that the entry value of the target output $y$
always falls in $[0,1]$, which can be achieved by conveniently scaling the entry values in $y$. In this work, we also consider the structure that the learned parameters typically exhibit after training: the parameters usually present redundancy and low-rank structures (Lebedev et al., 2014; Jaderberg et al., 2014) due to the high model complexity. So we incorporate low-rankness of the parameters, or more concretely of the parameter matrix consisting of the kernels per layer, into our analysis. Denoting by $\mathrm{vec}(A)$ the vectorization of a matrix $A$, we have Assumption 2.

Assumption 2. Assume the matrices $\widetilde{W}_{(i)}$ and $W_{(l+1)}$ obey
$$\mathrm{rank}(\widetilde{W}_{(i)}) \le a_i\ (1\le i\le l) \quad\text{and}\quad \mathrm{rank}(W_{(l+1)}) \le a_{l+1},$$
where $\widetilde{W}_{(i)} = [\mathrm{vec}(W^1_{(i)}), \mathrm{vec}(W^2_{(i)}), \cdots, \mathrm{vec}(W^{d_i}_{(i)})] \in \mathbb{R}^{k_i^2 d_{i-1}\times d_i}$ denotes the matrix consisting of all kernels in the $i$-th layer ($1\le i\le l$).
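As a sanity check on a trained model, the following sketch computes the quantities referred to in Assumptions 1 and 2: the per-layer kernel magnitude bounds $b_i$ and the (numerical) ranks $a_i$ of the kernel matrices $\widetilde{W}_{(i)}$. Function and variable names are illustrative only.

```python
import numpy as np

def check_assumptions(kernels, W_fc, tol=1e-6):
    """Report the Assumption 1 / Assumption 2 quantities of a trained model.
    kernels[i] is the list of (k_i, k_i, d_{i-1}) kernels of conv layer i; W_fc is W_(l+1)."""
    for i, layer in enumerate(kernels, start=1):
        b_i = max(np.linalg.norm(Wj) for Wj in layer)              # kernel magnitude bound b_i
        Wmat = np.stack([Wj.reshape(-1) for Wj in layer], axis=1)  # k_i^2 d_{i-1} x d_i matrix
        a_i = np.linalg.matrix_rank(Wmat, tol=tol)                 # numerical rank a_i
        print(f"conv layer {i}: b = {b_i:.3f}, rank = {a_i}")
    print(f"fc layer: b = {np.linalg.norm(W_fc):.3f}, rank = {np.linalg.matrix_rank(W_fc, tol=tol)}")
```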
The parameter low-rankness can also be defined on each kernel individually by using the tensor rank (Tucker, 1966; Zhou et al., 2017; Zhou & Feng, 2017). Our proof techniques are extensible to this case and similar results can be expected.

3.1. Generalization Error Bound for Deep CNNs
We now proceed to establish the generalization error bound for deep CNNs. Let $S=\{(D^{(1)},y^{(1)}),\cdots,(D^{(n)},y^{(n)})\}$ denote the set of training samples drawn i.i.d. from $\mathcal{D}$. When the optimal solution $\widetilde{w}$ to problem (1) is computed by a deterministic algorithm, the generalization error is defined as $\epsilon_g = \widetilde{Q}_n(\widetilde{w}) - Q(\widetilde{w})$. But in practice, a CNN model is usually optimized by randomized algorithms, e.g. SGD. So we adopt the following notion of generalization error in expectation.

Definition 1. (Generalization error) (Shalev-Shwartz et al., 2010) Assume a randomized algorithm $\mathcal{A}$ is employed for optimization over the training samples $S=\{(D^{(1)},y^{(1)}),\cdots,(D^{(n)},y^{(n)})\}\sim\mathcal{D}$ and $\widetilde{w}=\mathrm{argmin}_w\, \widetilde{Q}_n(w)$ is the empirical risk minimizer (ERM). Then if $\mathbb{E}_{S\sim\mathcal{D}}\mathbb{E}_{\mathcal{A}}\big(Q(\widetilde{w})-\widetilde{Q}_n(\widetilde{w})\big) \le \epsilon_k$, the ERM is said to have generalization error with rate $\epsilon_k$ under distribution $\mathcal{D}$.
We bound the generalization error in expectation for deep CNNs by first establishing uniform convergence of the empirical risk to its corresponding population risk, as stated in Lemma 1 with proof in Sec. D.1 in the supplement.

Lemma 1. Assume that in CNNs, $\sigma_1$ and $\sigma_2$ are respectively the sigmoid and softmax activation functions and the loss function $f(g(w;D),y)$ is the squared loss. Under Assumptions 1 and 2, if $n \ge c_{f'}\, l^2 (b_{l+1}+\sum_{i=1}^{l}\sqrt{d_i}\,b_i)^2 \max_i (r_i c_i)/(\theta\varrho\varepsilon^2)$, where $c_{f'}$ is a universal constant, then with probability at least $1-\varepsilon$, we have
$$\sup_{w\in\Omega}\ \big|\widetilde{Q}_n(w)-Q(w)\big| \le \sqrt{\frac{\theta\varrho+\log\frac{4}{\varepsilon}}{2n}}, \qquad (2)$$
where the total freedom degree $\theta$ of the network is $\theta = a_{l+1}(d_{l+1}+\tilde{r}_l\tilde{c}_l d_l+1)+\sum_{i=1}^{l} a_i(k_i^2 d_{i-1}+d_i+1)$ and $\varrho = \sum_{i=1}^{l}\log\big(\frac{\sqrt{d_i}\,b_i(k_i-s_i+1)}{4p}\big)+\log(b_{l+1})+\log\frac{n}{128p^2}$.
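To make the quantities in Lemma 1 concrete, the following numerical sketch evaluates $\theta$, $\varrho$ and the resulting bound for a toy two-layer architecture; the rank and magnitude values are made-up placeholders, and the formulas follow the expressions above.

```python
import numpy as np

# Toy 2-conv-layer architecture: 32x32x3 input, 5x5 and 3x3 kernels with stride 1,
# 2x2 non-overlapping pooling, 16 and 32 feature maps, 10 classes, n training samples.
l, p, n, eps = 2, 2, 50_000, 0.01
k, s, d = [5, 3], [1, 1], [3, 16, 32]      # d[0] = d_0 is the input channel number
a = [10, 10, 10]                           # assumed ranks a_1, a_2, a_3 (placeholders)
b = [1.0, 1.0, 1.0]                        # assumed magnitude bounds b_1, b_2, b_3 (placeholders)
rt, ct, n_cls = 6, 6, 10                   # pooled resolution of the last conv layer, class number

# theta = a_{l+1}(d_{l+1} + r~_l c~_l d_l + 1) + sum_i a_i (k_i^2 d_{i-1} + d_i + 1)
theta = a[l] * (n_cls + rt * ct * d[l] + 1) \
    + sum(a[i] * (k[i] ** 2 * d[i] + d[i + 1] + 1) for i in range(l))
# rho = sum_i log(sqrt(d_i) b_i (k_i - s_i + 1) / (4p)) + log(b_{l+1}) + log(n / (128 p^2))
rho = sum(np.log(np.sqrt(d[i + 1]) * b[i] * (k[i] - s[i] + 1) / (4 * p)) for i in range(l)) \
    + np.log(b[l]) + np.log(n / (128 * p ** 2))

bound = np.sqrt((theta * rho + np.log(4 / eps)) / (2 * n))
print(theta, round(rho, 2), round(bound, 3))
```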
To our best knowledge, this generalization error rate is the first one that grows only linearly (in contrast to exponentially) with the depth $l$ without needing any special training procedure. This can be observed from the fact that our result only depends on $\mathcal{O}(\sum_{i=1}^{l}\log(b_i))$, rather than an exponential factor $\mathcal{O}(\prod_{i=1}^{l+1} b_i)$, which appears in some existing works, e.g. the uniform convergence of the empirical risk in deep CNNs (Sun et al., 2016) and fully feedforward networks (Bartlett & Maass, 2003; Neyshabur et al., 2015; Zhou & Feng, 2018). This faster convergence rate is achieved by adopting a similar analysis technique to (Mei et al., 2017; Zhou & Feng, 2018), but we derive tighter bounds on the related parameters featuring the distributions of the empirical risk and its gradient, with milder assumptions. For instance, both (Zhou & Feng, 2018) and this work show that the empirical risk follows a sub-Gaussian distribution. But Zhou & Feng (2018) used the Gaussian concentration inequality and thus need the Lipschitz constant of the loss, which exponentially depends on the depth. In contrast, we use an $\epsilon$-net to decouple the dependence between the input $D$ and the parameter $w$ and then adopt Hoeffding's inequality, only requiring a constant magnitude bound on the loss and getting rid of the exponential term. Based on Lemma 1, we derive the generalization error of deep CNNs in Theorem 1 with proof in Sec. D.2 in the supplement.

Theorem 1. Assume that in CNNs, $\sigma_1$ and $\sigma_2$ are respectively the sigmoid and softmax functions and the loss function $f(g(w;D),y)$ is the squared loss. Suppose Assumptions 1 and 2 hold. Then with probability at least $1-\varepsilon$, the generalization error of a deep CNN model is bounded as
$$\mathbb{E}_{S\sim\mathcal{D}}\mathbb{E}_{\mathcal{A}}\,\big(Q(\widetilde{w})-\widetilde{Q}_n(\widetilde{w})\big) \le \sqrt{\frac{\theta\varrho+\log\frac{4}{\varepsilon}}{2n}},$$
where $\theta$ and $\varrho$ are given in Lemma 1.
By inspecting Theorem 1, one can find that the generalization error diminishes at the rate of $\mathcal{O}(1/\sqrt{n})$ (up to a log factor). Besides, Theorem 1 explicitly reveals the roles of the network parameters in determining the model generalization performance. Such transparent results form a stark contrast to the works (Sun et al., 2016) and (Kawaguchi et al., 2017) (see more comparison in Sec. 3.2). Notice, our technique also applies to other third-order differentiable activation functions, e.g. tanh, and other losses, e.g. cross entropy, with only slight differences in the results. First, the freedom degree $\theta$ of the network parameters, which depends on the network size and the redundancy in the parameters, plays an important role in the generalization error bound. More specifically, to obtain a smaller generalization
error, more samples are needed to train a deep CNN model with a larger freedom degree $\theta$. As aforementioned, although the results in Theorem 1 are obtained under the low-rankness condition defined on the parameter matrix consisting of the kernels per layer, they are easily extended to the (tensor) low-rankness defined on each kernel individually. The low-rankness captures common parameter redundancy in practice. For instance, (Lebedev et al., 2014; Jaderberg et al., 2014) showed that parameter redundancy exists in a trained network model and can be squeezed out by low-rank tensor decomposition. The classic residual function (He et al., 2016; Zagoruyko & Komodakis, 2016) with a three-layer bottleneck architecture ($1\times 1$, $3\times 3$ and $1\times 1$ convs) has rank 1 in the generalized block term decomposition (Chen et al., 2017; Cohen & Shashua, 2016). Similarly, inception networks (Szegedy et al., 2017) explicitly decompose a convolutional kernel of large size into two separate convolutional kernels of smaller size (e.g. a $7\times 7$ kernel is replaced by two multiplying kernels of size $7\times 1$ and $1\times 7$). Employing these low-rank approximation techniques helps reduce the freedom degree and provides a smaller generalization error. Notice, the low-rankness assumption only affects the freedom degree $\theta$; without this assumption, $\theta$ is replaced by the total parameter number of the network. From the factor $\varrho$, one can observe that the kernel size $k_i$ and its stride $s_i$ determine the generalization error in opposite ways. A larger kernel size $k_i$ leads to a larger generalization error, while a larger stride $s_i$ provides a smaller one. This is because both a larger kernel and a smaller stride increase the model complexity, since a larger kernel means more trainable parameters and a smaller stride implies a larger size of the feature maps in the subsequent layer. Also, the pooling operation in the first $l$ convolutional layers helps reduce the generalization error, as reflected by the factor $1/p$ in $\varrho$. Furthermore, the number of feature maps (i.e. channels) $d_i$, appearing in both $\theta$ and $\varrho$, also affects the generalization error. A wider network with larger $d_i$ requires more samples for training such that it can generalize well. This is because (1) a larger $d_i$ indicates more trainable parameters, which usually increases the freedom degree $\theta$, and (2) a larger $d_i$ also requires larger kernels $W^j_{(i)}$ with more channel-wise dimensions since there are more channels to convolve, leading to a larger magnitude bound $b_i$ for the kernel $W^j_{(i)}$. Therefore, as suggested by Theorem 1, a thin network is preferable to a fat network. Such an observation is consistent with other analysis works on the network expression ability (Eldan & Shamir, 2016; Lu et al., 2017) and with architecture-engineering practice, such as (He et al., 2016; Szegedy et al., 2015). By comparing the contributions of the architecture and the parameter magnitude to the generalization performance, we find that the generalization error usually depends on the network architecture parameters linearly or more heavily, and also on the parameter magnitudes but
with a logarithmic term $\log b_i$. This implies the architecture plays a more important role than the parameter magnitudes. Therefore, for achieving better generalization performance in practice, architecture engineering is indeed essential. Finally, by observing the factor $\varrho$, we find that imposing certain regularization, such as $\|w\|_2^2$, on the trainable parameters is useful. The effectiveness of such regularization will be more significant when imposed on the weight matrix of the fully connected layer due to its large size. Such a regularization technique is well known in the deep learning literature as "weight decay". This conclusion is consistent with other analysis works on deep feedforward networks, such as (Bartlett & Maass, 2003; Neyshabur et al., 2015; Zhou & Feng, 2018).

3.2. Discussions
Sun et al. (2016) also analyzed the generalization error bound of deep CNNs but employed different techniques. They proved that the Rademacher complexity $\mathcal{R}_m(\mathcal{F})$ of a deep CNN model with sigmoid activation functions is $\mathcal{O}\big(\tilde{b}_x(2p\tilde{b})^{l+1}\sqrt{\log(r_0 c_0)/n}\big)$, where $\mathcal{F}$ denotes the function hypothesis set that maps the input data $D$ to $v\in\mathbb{R}^{d_{l+1}}$ by the analyzed CNN model. Here $\tilde{b}_x$ denotes the upper bound of the absolute entry values in the input datum $D$, i.e. $\tilde{b}_x\ge |D_{i,j}|\ (\forall i,j)$, and $\tilde{b}$ obeys $\tilde{b}\ge\max\{\max_i\sum_{j=1}^{d_i}\|W^j_{(i)}\|_1,\ \|W_{(l+1)}\|_1\}$. Sun et al. (2016) showed that with probability at least $1-\varepsilon$, the difference between the empirical margin error $\widetilde{\mathrm{err}}_\gamma(g)$ ($g\in\mathcal{F}$) and the population margin error $\mathrm{err}_\gamma^p(g)$ can be bounded as
$$\mathrm{err}_\gamma^p(g) \le \inf_{\gamma>0}\Big[\widetilde{\mathrm{err}}_\gamma(g) + \frac{8d_{l+1}(2d_{l+1}-1)}{\gamma}\mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\log\log_2(2/\gamma)}{n}} + \sqrt{\frac{\log(2/\varepsilon)}{n}}\Big], \qquad (3)$$
where $\gamma$ controls the error margin since it obeys $\gamma\ge v_y - \max_{k\neq y} v_k$ and $y$ denotes the label of $v$. However, the bound in Eqn. (3) is practically loose, since $\mathcal{R}_m(\mathcal{F})$ involves the exponential factor $(2\tilde{b})^{l+1}$, which is usually very large. In this case, $\mathcal{R}_m(\mathcal{F})$ is extremely large. By comparison, the bound provided in our Theorem 1 only depends on $\sum_{i=1}^{l+1}\log(b_i)$, which avoids the exponential growth along with the depth $l$, giving a much tighter and more practically meaningful bound. The generalization error bounds in (Kawaguchi et al., 2017) either depend on a specific dataset or rely on a restrictive and rarely used training procedure, while our Theorem 1 is independent of any specific dataset or training procedure, rendering itself more general. More importantly, the results in Theorem 1 make the roles of the network parameters transparent, which could benefit the understanding and architecture design of CNNs.
4. Optimization Guarantees for Deep CNNs
Although deep CNNs are highly non-convex, gradient descent based algorithms usually perform quite well on optimizing the models in practice. After characterizing the roles of different network parameters in the generalization performance, here we present optimization guarantees for gradient descent based algorithms in training CNNs. Specifically, in practice one usually adopts SGD or its variants, such as adam and RMSProp, to optimize the CNN models. Such algorithms usually terminate when the gradient magnitude decreases to a low level and the training hardly proceeds. This implies that the algorithms in fact compute an $\epsilon$-approximate stationary point $\widetilde{w}$ of the loss function $\widetilde{Q}_n(w)$, i.e. $\|\nabla_w\widetilde{Q}_n(\widetilde{w})\|_2^2\le\epsilon$. Here we explore such a problem: by computing an $\epsilon$-stationary point $\widetilde{w}$ of the empirical risk $\widetilde{Q}_n(w)$, can we also expect $\widetilde{w}$ to be sufficiently good for generalization, or in other words, expect that it is also an approximate stationary point for the population risk $Q(w)$? To answer this question, we first analyze the relationship between the empirical gradient $\nabla_w\widetilde{Q}_n(w)$ and its population counterpart $\nabla_w Q(w)$. Founded on this, we further establish convergence of the empirical gradient of the computed solution to its corresponding population gradient. Finally, we present the bounded distance between the computed solution $\widetilde{w}$ and the optimum $w^*$. To our best knowledge, this work is the first one that analyzes the optimization behavior of gradient descent based algorithms for training multi-layer CNN models with the commonly used convolutional and pooling operations.

4.1. Convergence Guarantees on Gradients
Here we present guarantees on the convergence of the empirical gradient to the population one in Theorem 2. As aforementioned, such results imply good generalization performance of the computed solution $\widetilde{w}$ to the empirical risk $\widetilde{Q}_n(w)$.
Theorem 2. Assume that in CNNs, $\sigma_1$ and $\sigma_2$ are respectively the sigmoid and softmax functions and the loss function $f(g(w;D),y)$ is the squared loss. Suppose Assumptions 1 and 2 hold. Then the empirical gradient uniformly converges to the population gradient in Euclidean norm. More specifically, there exist universal constants $c_{g'}$ and $c_g$ such that if
$$n \ge c_{g'}\,\frac{l^2 b_{l+1}^2\big(b_{l+1}+\sum_{i=1}^{l} d_i b_i\big)^2 (r_0 c_0 d_0)^4 \max_i (r_i c_i)}{d_0^4\, b_1^8\, \big(d\log(6)+\theta\varrho\big)\,\varepsilon^2},$$
then
$$\sup_{w\in\Omega}\ \big\|\nabla_w\widetilde{Q}_n(w)-\nabla_w Q(w)\big\|_2 \le c_g\,\beta\,\sqrt{\frac{2d+\theta\varrho+\log\frac{4}{\varepsilon}}{2n}}$$
holds with probability at least $1-\varepsilon$, where $\varrho$ is provided in Lemma 1. Here $\beta$ and $d$ are defined as
$$\beta=\Big[\frac{b_{l+1}^2}{8p^2}\sum_{i=1}^{l}\frac{d_{i-1} r_{i-1} c_{i-1}}{b_i^2 d_i}\prod_{j=i}^{l}\frac{d_j b_j^2 (k_j-s_j+1)^2}{8p^2}+\frac{r_l c_l d_l}{16p^2}\Big]^{1/2}$$
and $d=\tilde{r}_l\tilde{c}_l d_l d_{l+1}+\sum_{i=1}^{l}k_i^2 d_{i-1} d_i$, respectively.
Its proof is given in Sec. D.3 in the supplement. From Theorem 2, the empirical gradient converges to the population one at the rate of $\mathcal{O}(1/\sqrt{n})$ (up to a log factor). In Sec. 3.1, we have discussed the roles of the network architecture parameters in $\varrho$. Here we further analyze the effects of the network parameters on the optimization behavior through the factor $\beta$. The roles of the kernel size $k_i$, the stride $s_i$, the pooling size $p$ and the channel number $d_i$ in $\beta$ are consistent with those in Theorem 1. The extra factor $r_i c_i$ advocates not building CNNs with extremely large feature map sizes. The total number of parameters $d$ is involved here instead of the degree of freedom because the gradient $\nabla_w\widetilde{Q}_n(w)$ may not have low-rank structures.
Based on Theorem 2, we can further conclude that if the computed solution $\widetilde{w}$ is an $\epsilon$-approximate stationary point of the empirical risk, then it is also a $4\epsilon$-approximate stationary point of the population risk. We state this result in Corollary 1 with proof in Sec. D.4 in the supplement.

Corollary 1. Suppose the assumptions in Theorem 2 hold and $n \ge (d\varrho+\log(4/\varepsilon))\beta^2/\epsilon$. Then if the solution $\widetilde{w}$ computed by minimizing the empirical risk obeys $\|\nabla\widetilde{Q}_n(\widetilde{w})\|_2^2\le\epsilon$, we have $\|\nabla Q(\widetilde{w})\|_2^2\le 4\epsilon$ with probability at least $1-\varepsilon$.
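In practice, this licenses the common stopping rule of terminating once the squared empirical gradient norm drops below $\epsilon$. A minimal sketch, assuming a user-supplied per-sample gradient function grad_fn (an illustrative name, not from the paper):

```python
import numpy as np

def grad_norm_sq(grad_fn, w, data):
    """Squared l2 norm of the empirical gradient (1/n) sum_i grad f(w; D_i, y_i)."""
    g = np.zeros_like(w)
    for D, y in data:
        g += grad_fn(w, D, y)
    g /= len(data)
    return float(g @ g)

def sgd_until_stationary(grad_fn, w, data, lr=0.1, eps=1e-3, max_epochs=100, rng=None):
    """Run SGD until w is an eps-approximate stationary point of the empirical risk;
    by Corollary 1 it is then (with high probability, given enough samples) a
    4*eps-approximate stationary point of the population risk."""
    rng = rng or np.random.default_rng(0)
    for _ in range(max_epochs):
        for idx in rng.permutation(len(data)):
            D, y = data[idx]
            w = w - lr * grad_fn(w, D, y)
        if grad_norm_sq(grad_fn, w, data) <= eps:
            break
    return w
```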
Corollary 1 shows that by using full gradient descent algorithms to minimize the empirical risk, the computed approximate stationary point $\widetilde{w}$ is also close to the desired stationary point $w^*$ of the population risk. This guarantee is also applicable to other stochastic gradient descent based algorithms, like SGD, adam and RMSProp, by applying recent results on obtaining an $\epsilon$-approximate stationary point for nonconvex problems (Ghadimi & Lan, 2013; Tieleman & Hinton, 2012; Kingma & Ba, 2015). Accordingly, the computed solution $\widetilde{w}$ has guaranteed generalization performance on new data. It partially explains the success of gradient descent based optimization algorithms for CNNs.

4.2. Convergence of Stationary Points
Here we go further and directly characterize the distance between the stationary points of the empirical risk $\widetilde{Q}_n(w)$ and of its population counterpart $Q(w)$. Compared with the results for the risk and the gradient, the results on stationary points give more direct performance guarantees for CNNs. Here we only analyze the non-degenerate stationary points, including local minima/maxima and non-degenerate saddle points, as they are geometrically isolated and thus are unique in local regions. We first introduce some necessary definitions.

Definition 2. (Non-degenerate stationary points and saddle points) (Gromoll & Meyer, 1969) A stationary point $w$ is said to be a non-degenerate stationary point of $Q(w)$ if
$$\inf_i\ \big|\lambda_i\big(\nabla^2 Q(w)\big)\big| \ge \zeta,$$
where $\lambda_i(\nabla^2 Q(w))$ is the $i$-th eigenvalue of the Hessian
$\nabla^2 Q(w)$ and $\zeta$ is a positive constant. A stationary point is said to be a saddle point if the smallest eigenvalue of its Hessian $\nabla^2 Q(w)$ has a negative value. Suppose $Q(w)$ has $m$ non-degenerate stationary points, denoted $\{w^{(1)},w^{(2)},\cdots,w^{(m)}\}$. We have the following results on the geometry of these stationary points in Theorem 3. The proof is given in Sec. D.5 in the supplement.

Theorem 3. Assume that in CNNs, $\sigma_1$ and $\sigma_2$ are respectively the sigmoid and softmax activation functions and the loss $f(g(w;D),y)$ is the squared loss. Suppose Assumptions 1 and 2 hold. Then if
$$n \ge c_h\,\max\Big(\frac{d+\theta\varrho}{\zeta^2},\ \frac{l^2 b_{l+1}^2\big(b_{l+1}+\sum_{i=1}^{l} d_i b_i\big)^2 (r_0 c_0 d_0)^4 \max_i(r_i c_i)}{d_0^4\, b_1^8\, d\varrho\,\varepsilon^2}\Big),$$
where $c_h$ is a constant, then for $k\in\{1,\cdots,m\}$, there exists a non-degenerate stationary point $w_n^{(k)}$ of $\widetilde{Q}_n(w)$ which uniquely corresponds to the non-degenerate stationary point $w^{(k)}$ of $Q(w)$ with probability at least $1-\varepsilon$. Moreover, with the same probability, the distance between $w_n^{(k)}$ and $w^{(k)}$ is bounded as
$$\big\|w_n^{(k)}-w^{(k)}\big\|_2 \le \frac{2c_g\beta}{\zeta}\sqrt{\frac{2d+\theta\varrho+\log\frac{4}{\varepsilon}}{2n}},\qquad (1\le k\le m),$$
where $\varrho$ and $\beta$ are given in Lemma 1 and Theorem 2, respectively. Theorem 3 shows that there exists an exact one-to-one correspondence between the non-degenerate stationary points of the empirical risk $\widetilde{Q}_n(w)$ and of the population risk $Q(w)$, if the sample size $n$ is sufficiently large. Moreover, the non-degenerate stationary point $w_n^{(k)}$ of $\widetilde{Q}_n(w)$ is very close to its corresponding non-degenerate stationary point $w^{(k)}$ of $Q(w)$. More importantly, their distance shrinks at the rate of $\mathcal{O}(1/\sqrt{n})$ (up to a log factor). The network parameters have a similar influence on the distance bound as explained in the above subsection. Compared with the gradient convergence rate in Theorem 2, the convergence rate of the corresponding stationary point pairs in Theorem 3 has an extra factor $1/\zeta$, which accounts for the geometric topology of the non-degenerate stationary points, similar to other works like stochastic optimization analysis (Duchi & Ruan, 2016). For degenerate stationary points, at which the Hessian matrix has zero eigenvalues, one cannot expect to establish a unique correspondence between stationary points of the empirical and population risks, since they are not isolated anymore and may reside in flat regions. But Theorem 2 guarantees that the gradients of $\widetilde{Q}_n(w)$ and $Q(w)$ at these points are close. This implies a degenerate stationary point of $Q(w)$ will also give a near-zero gradient for $\widetilde{Q}_n(w)$, indicating it is also a stationary point for $\widetilde{Q}_n(w)$. Du et al. (2017a;b) showed that for a simple and shallow CNN consisting of only one non-overlapping convolutional layer, (stochastic) gradient descent algorithms with weight
normalization can converge to the global minimum. In contrast to their simplified models, we analyze complex multi-layer CNNs with the commonly used convolutional and pooling operations. Besides, we provide results on both the gradient and the distance between the computed solution and the desired stationary points, which are applicable to arbitrary gradient descent based algorithms. Next, based on Theorem 3, we show that the corresponding pair $(w_n^{(k)},w^{(k)})$ in the empirical and population risks shares the same geometrical property, as stated in Corollary 2.

Corollary 2. Suppose the assumptions in Theorem 3 hold. If any one in the pair $(w_n^{(k)},w^{(k)})$ in Theorem 3 is a local minimum or saddle point, so is the other one.

See the proof of Corollary 2 in Sec. D.6 in the supplement. Corollary 2 tells us that the corresponding pair, $w_n^{(k)}$ and $w^{(k)}$, has the same geometric property. Namely, if either one in the pair is a local minimum or saddle point, so is the other one. This result is important for optimization. If the computed solution $\widetilde{w}$ obtained by minimizing the empirical risk $\widetilde{Q}_n(w)$ is a local minimum, then it is also a local minimum of the population risk $Q(w)$. Thus it partially explains why the computed solution $\widetilde{w}$ can generalize well on new data. This property also benefits designing new optimization algorithms. For example, Saxe et al. (2014) and Kawaguchi (2016) pointed out that degenerate stationary points indeed exist for deep linear neural networks, and Dauphin et al. (2014) empirically validated that saddle points are usually surrounded by high error plateaus in deep feedforward neural networks. So it is necessary to avoid the saddle points and find the local minima of the population risk. From Theorem 3, it is clear that one only needs to find a local minimum of the empirical risk by using algorithms that escape saddle points, e.g. (Ge et al., 2015; Jin et al., 2017; Agarwal et al., 2017).
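The geometric notions of Definition 2 are straightforward to check numerically once the Hessian at a (near-)stationary point is available; a minimal sketch with an illustrative threshold $\zeta$:

```python
import numpy as np

def classify_stationary_point(hess, zeta=1e-3):
    """Classify a stationary point from the eigenvalues of its (symmetric) Hessian:
    non-degenerate if min |lambda_i| >= zeta (Definition 2); a saddle has a negative
    smallest eigenvalue while other eigenvalues are positive."""
    lam = np.linalg.eigvalsh(hess)
    if np.min(np.abs(lam)) < zeta:
        return "degenerate"
    if lam.min() > 0:
        return "local minimum"
    return "local maximum" if lam.max() < 0 else "saddle"
```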
5. Comparison of DNNs and CNNs
Here we compare deep feedforward neural networks (DNNs) with deep CNNs in terms of their generalization error and optimization guarantees, to theoretically explain, to some extent, why CNNs are more preferable than DNNs. By assuming the input to be standard Gaussian $\mathcal{N}(0,\tau^2)$, Zhou & Feng (2018) proved that if $n\ge 18\hat{r}^2/(d\tau^2\varepsilon^2\log(l+1))$, then with probability $1-\varepsilon$, the generalization error of an $(l+1)$-layer DNN model with sigmoid activation functions is bounded by
$$\epsilon_n \triangleq c_n\,\tau\,(1+c_r l)\,\sqrt{\max_i d_i}\,\sqrt{\frac{d\log(n(l+1))+\log(4/\varepsilon)}{n}},$$
where $c_n$ is a universal constant; $d_i$ denotes the width of the $i$-th layer; $d$ is the total parameter number of the network; and $c_r=\max\big(\hat{r}^2/16,\ \hat{r}^{2l}/16^{l}\big)$, where $\hat{r}$ upper bounds the Frobenius norm of the weight matrix in each
layer. Recall that the generalization bound of CNNs provided in this work is $\mathcal{O}(\sqrt{\theta\widetilde{\varrho}/(2n)})$, where $\widetilde{\varrho}=\sum_{i=1}^{l}\log\big(\sqrt{d_i}\,b_i(k_i-s_i+1)/(4p)\big)+\log(b_{l+1})$.
By observing the above two generalization bounds, one can see that when the layer number is fixed, a CNN usually has a smaller generalization error than a DNN because: (1) A CNN usually has much fewer parameters, i.e. smaller $d$, than a DNN due to the parameter sharing mechanism of convolutions. (2) The generalization error of a CNN has a smaller factor than that of a DNN in the network parameter magnitudes. The generalization error bound of a CNN depends on a logarithmic term $\mathcal{O}(\log\prod_{i=1}^{l+1} b_i)$ of the magnitude $b_i$ of each kernel/weight matrix, while the bound for a DNN depends on $\mathcal{O}(\hat{r}^{l+1})$. Since a kernel is much smaller than the weight matrix obtained by unfolding the convolutional layer into a fully connected one, the upper magnitude bound $b_i\ (i=1,\cdots,l)$ of each kernel is usually much smaller than $\hat{r}$. (3) The pooling operation and the stride in the convolutional operation of a CNN also benefit its generalization performance. This is because the factor $\widetilde{\varrho}$ involves $\mathcal{O}(2(l+1)\log(1/p))$ and $(k_i-s_i+1)$, which also reduce the generalization error. Notice, by applying our analysis technique, it might be possible to remove the exponential term for DNNs. But as mentioned above, the unique operations, e.g. convolution, pooling and striding, still benefit a CNN, making it generalize better than a DNN. Because of the above factors, the empirical gradient of a CNN converges to its population counterpart faster, as do the paired non-degenerate stationary points of the empirical and population risks. All these results guarantee that for an arbitrary gradient descent based algorithm, it is faster to compute an approximate stationary point or a local minimum of the population risk for a CNN than for a DNN. The quick parameter count below illustrates point (1).
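A back-of-the-envelope count of trainable parameters for one convolutional layer versus a fully connected layer acting on feature maps of the same size (the numbers are illustrative, not taken from the paper):

```python
# Parameters of one 3x3 convolutional layer vs. a dense layer between the same feature maps.
k, d_in, d_out = 3, 64, 64                 # kernel size and channel numbers
r_in, c_in, r_out, c_out = 32, 32, 32, 32  # spatial resolutions of input / output feature maps
conv_params = k * k * d_in * d_out                           # kernels are shared across locations
fc_params = (r_in * c_in * d_in) * (r_out * c_out * d_out)   # one dense weight matrix
print(conv_params, fc_params)              # 36864 vs. 4294967296
```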
6. Proof Roadmap
Here we briefly introduce the proof roadmap. Due to space limitations, all the proofs of our theoretical results are deferred to the supplement. Firstly, our analysis relies on bounding the gradient magnitude and the spectral norm of the Hessian of the loss $f(g(w;D),y)$. By considering the multi-layer architecture of CNNs, we establish a recursive relation between their magnitudes at the $k$-th and $(k+1)$-th layers (Lemmas 9-14 in the supplement) and obtain their overall magnitude upper bounds. For the uniform convergence $\sup_{w\in\Omega}|\widetilde{Q}_n(w)-Q(w)|$ in Lemma 1, we resort to bounding three distances: $A_1=\sup_{w\in\Omega}|\widetilde{Q}_n(w)-\widetilde{Q}_n(w_{k_w})|$, $A_2=\sup_{w_{k_w}\in\Theta}|\widetilde{Q}_n(w_{k_w})-\mathbb{E}\widetilde{Q}_n(w_{k_w})|$ and $A_3=\sup_{w\in\Omega}|\mathbb{E}\widetilde{Q}_n(w_{k_w})-\mathbb{E}Q(w)|$, where $w_{k_w}$ belongs to the $\epsilon$-net $\Theta$ of the parameter domain $\Omega$. Using the Markov inequality and the Lipschitz property of the loss, we can bound $A_1$ and $A_3$. To bound $A_2$, we prove that the empirical risk is sub-Gaussian. Since the element $w_{k_w}$ in the $\epsilon$-net $\Theta$ is independent of the input $D$, we use
ing’s inequality to prove empirical risk at point wkw to be sub-Gaussian for any wkw in Θ. By this decoupling of -net, our bound on A2 depends on the constant magnitude of loss and gets rid of exponential term. Combining these bounds together, we obtain the uniform convergence of empirical risk and can derive the generalization bound. We use a similar decomposition and decoupling strategy mentioned above to bound gradient uniform convergence e n (w)−∇w Q(w)k2 in Theorem 2. But here supw∈Ω k∇w Q we need to bound gradient and spectral norm of Hessian.
To prove the correspondence and bounded distance of stationary points, we define a set $G=\{w\in\Omega:\|\nabla\widetilde{Q}_n(w)\|_2\le\epsilon\ \text{and}\ \inf_i|\lambda_i(\nabla^2\widetilde{Q}_n(w))|\ge\zeta\}$, where $\lambda_i$ is the $i$-th eigenvalue of $\nabla^2\widetilde{Q}_n(w)$. Then $G$ is decomposed into countably many components, each of which contains one or zero non-degenerate stationary points. Next we prove the uniform convergence between the empirical and population Hessians by using a similar strategy as above. Combining the uniform convergence of the gradient and Hessian with results from differential topology (Lemmas 4 & 5 in the supplement), we obtain that, for each component of $G$, if there is a unique non-degenerate stationary point of $Q(w)$, then $\widetilde{Q}_n(w)$ also has a unique one, and vice versa. This gives the one-to-one correspondence relation. Finally, the uniform convergence of the gradient and Hessian bounds the distance between the corresponding points.
7. Conclusion
In this work, we theoretically analyzed why deep CNNs can achieve remarkable success, from the perspectives of their generalization performance and the optimization guarantees of (stochastic) gradient descent based algorithms. We proved that the generalization error of deep CNNs can be bounded by a factor which depends on the network parameters. Moreover, we analyzed the relationship between the computed solution obtained by minimizing the empirical risk and the optimal solution of the population risk, in terms of their gradients and their Euclidean distance. All these results show that with sufficient training samples, the generalization performance of deep CNN models can be guaranteed. Besides, these results also reveal that the kernel size $k_i$, the stride $s_i$, the pooling size $p$, the channel number $d_i$ and the freedom degree $\theta$ of the network parameters are critical to the generalization performance of deep CNNs. We also showed that the weight parameter magnitude is important. These suggestions on network design accord with the widely used network architectures.
Acknowledgements
Jiashi Feng was partially supported by NUS startup R-263000-C08-133, MOE Tier-I R-263-000-C21-112, NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.
References Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., and Yu, D. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
Gonen, A. and Shalev-Shwartz, S. Fast rates for empirical risk minimization of strict saddle problems. In COLT, pp. 1043–1063, 2017. Gromoll, D. and Meyer, W. On differentiable functions with isolated critical points. Topology, 8(4):361–369, 1969.
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima faster than gradient descent. In STOC, pp. 1195–1199, 2017.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
Bartlett, P. and Maass, W. Vapnik-chervonenkis dimension of neural nets. The handbook of brain theory and neural networks, pp. 1188–1192, 2003.
Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
Brown, N. and Sandholm, T. Safe and nested subgame solving for imperfect-information games. In NIPS, 2017. Chen, Y., Jin, X., Kang, B., Feng, J., and Yan, S. Sharing residual units through collective tensor factorization in deep neural networks. arXiv preprint arXiv:1703.02180, 2017. Choromanska, A., Henaff, M., Mathieu, M., Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. In AISTATS, pp. 192–204, 2015. Cohen, N. and Shashua, A. Convolutional rectifier networks as generalized tensor decompositions. In ICML, pp. 955– 963, 2016. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, pp. 2933–2941, 2014. Du, S., Lee, J., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a. Du, S., Lee, J., Tian, Y., Poczos, B., and Singh, A. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b. Duchi, J. and Ruan, F. Local asymptotics for some stochastic optimization problems: Optimality, constraint identification, and dual averaging. arXiv preprint arXiv:1612.05612, 2016. Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In COLT, pp. 907–940, 2016. Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points—online stochastic gradient for tensor decomposition. In COLT, pp. 797–842, 2015. Ghadimi, S. and Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Jin, C., Ge, R., Netrapalli, P., Kakade, S., and Jordan, M. How to escape saddle points efficiently. In ICML, 2017. Kawaguchi, K. Deep learning without poor local minima. In NIPS, pp. 1097–1105, 2016. Kawaguchi, K., Kaelbling, L., and Bengio, Y. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017. Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, pp. 1–13, 2015. Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014. Lee, H., Ge, R., Risteski, A., Ma, T., and Arora, S. On the ability of neural nets to express distributions. In COLT, pp. 1–26, 2017. Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In NIPS, 2017. Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. In NIPS, pp. 6232–6240, 2017. Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for non-convex losses. Annals of Statistics, 2017. Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In COLT, pp. 1376– 1401, 2015. Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In ICML, 2017. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3): 400–407, 1951.
Sainath, T., Mohamed, A., Kingsbury, B., and Ramabhadran, B. Deep convolutional neural networks for LVCSR. In ICASSP, pp. 8614–8618. IEEE, 2013. Saxe, A., McClelland, J., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ICLR, 2014. Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. JMLR, 11:2635–2670, 2010. Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G. Van Den, Schrittwieser, J., Antonoglou, I., Panneershelvam, V., and Lanctot, M. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. Soudry, D. and Hoffer, E. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017. Sun, S., Chen, W., Wang, L., Liu, X., and Liu, T. On the depth of deep neural networks: A theoretical view. In AAAI, pp. 2066–2072, 2016. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, pp. 1–9, 2015. Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, pp. 4278–4284, 2017. Tian, Y. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. ICML, 2017. Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26–31, 2012. Tucker, L. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966. Wang, Y., Xu, C., Xu, C., and Tao, D. Beyond filters: Compact feature map for portable deep model. In ICML, pp. 3703–3711, 2017. Xu, H. and Mannor, S. Robustness and generalization. Machine Learning, 86(3):391–423, 2012. Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016. Zhang, Y., Liang, P., and Wainwright, M. Convexified convolutional neural networks. ICML, 2017.
Zhou, P. and Feng, J. Outlier-robust tensor PCA. In CVPR, pp. 1–9, 2017. Zhou, P. and Feng, J. Empirical risk landscape analysis for understanding deep neural networks. In ICLR, 2018. Zhou, P., Lu, C., Lin, Z., and Zhang, C. Tensor factorization for low-rank tensor completion. IEEE TIP, 27(3):1152 – 1163, 2017.
Understanding Generalization and Optimization Performance for Deep CNNs (Supplementary File)
Pan Zhou∗  Jiashi Feng∗
National University of Singapore, Singapore
[email protected] [email protected] ∗
A Structure of This Document
This document gives some other necessary notations and preliminaries for our analysis in Sec. B.1 and provides auxiliary lemmas in Sec. B.2. Then in Sec. C, we present the technical lemmas for proving our final results and their proofs. Next, in Sec. D, we utilize these technical lemmas to prove our desired results. Finally, we give the proofs of other auxiliary lemmas in Sec. E. As for the results in manuscript, the proofs of Lemma 1 and Theorem 1 in Sec. 3.1 in the manuscript are respectively provided in Sec. D.1 and Sec. D.2. As for the results in Sec. 4 in the manuscript, Sec. D.3 and D.4 present the proofs of Theorem 2 and Corollary 1, respectively. Finally, we respectively introduce the proofs of Theorem 3 and Corollary 2 in Sec. D.5 and D.6.
B Notations and Preliminary Tools
Beyond the notations introduced in the manuscript, we need some other notations used in this document. Then we introduce several lemmas that will be used later.

B.1 Notations
Throughout this document, we use $\langle\cdot,\cdot\rangle$ to denote the inner product and use $\widetilde{\circledast}$ to denote the convolution operation with stride 1. $A\otimes C$ denotes the Kronecker product between $A$ and $C$. Note that $A$ and $C$ in $A\otimes C$ can be matrices or vectors. For a matrix $A\in\mathbb{R}^{n_1\times n_2}$, we use $\|A\|_F=\sqrt{\sum_{i,j}A_{ij}^2}$ to denote its Frobenius norm, where $A_{ij}$ is the $(i,j)$-th entry of $A$. We use $\|A\|_{op}=\max_i|\lambda_i(A)|$ to denote the operation norm of a matrix $A\in\mathbb{R}^{n_1\times n_1}$, where $\lambda_i(A)$ denotes the $i$-th eigenvalue of the matrix $A$. For a 3-way tensor $A\in\mathbb{R}^{s\times t\times q}$, its operation norm is computed as
$$\|A\|_{op}=\sup_{\|\lambda\|_2\le 1}\ \big\langle\lambda^{\otimes 3},A\big\rangle=\sum_{i,j,k}A_{ijk}\lambda_i\lambda_j\lambda_k,$$
where $A_{ijk}$ denotes the $(i,j,k)$-th entry of $A$. For brevity, in this document we use $f(w,D)$ to denote $f(g(w;D),y)$ in the manuscript. Let $w_{(i)}=(w^1_{(i)};\cdots;w^{d_i}_{(i)})\in\mathbb{R}^{k_i^2 d_{i-1}d_i}$ ($i=1,\cdots,l$) be the parameter of the $i$-th layer, where $w^k_{(i)}=\mathrm{vec}(W^k_{(i)})\in\mathbb{R}^{k_i^2 d_{i-1}}$ is the vectorization of $W^k_{(i)}$. Similarly, let $w_{(l+1)}=\mathrm{vec}(W_{(l+1)})\in\mathbb{R}^{r_l c_l d_l d_{l+1}}$. Then, we further define $w=(w_{(1)},\cdots,w_{(l)},w_{(l+1)})\in\mathbb{R}^{r_l c_l d_l d_{l+1}+\sum_{i=1}^{l}k_i^2 d_{i-1}d_i}$, which contains all the parameters of the network. Here we use $W^k_{(i)}$ to denote the $k$-th kernel in the $i$-th convolutional layer. For brevity, let $W^{k,j}_{(i)}$ denote the $j$-th slice of $W^k_{(i)}$, i.e. $W^{k,j}_{(i)}=W^k_{(i)}(:,:,j)$.
For a matrix $M\in\mathbb{R}^{s\times t}$, $\widehat{M}$ denotes the matrix obtained by rotating $M$ by 180 degrees. Then we use $\delta_i$ to denote the gradient of $f(w,D)$ w.r.t. $X_{(i)}$:
$$\delta_i=\nabla_{X_{(i)}}f(w,D)\in\mathbb{R}^{r_i\times c_i\times d_i},\quad (i=1,\cdots,l).$$
Based on $\delta_i$, we further define $\widetilde{\delta}_i\in\mathbb{R}^{(\tilde{r}_{i-1}-k_i+1)\times(\tilde{c}_{i-1}-k_i+1)\times d_i}$. Each slice $\widetilde{\delta}^k_{i+1}$ can be computed as follows. Firstly, let $\widetilde{\delta}^k_{i+1}=\delta^k_{i+1}$. Then, we pad zeros of $s_i-1$ rows between the neighboring rows in $\widetilde{\delta}^k_{i+1}$ and similarly we pad zeros of $s_i-1$ columns between the neighboring columns in $\widetilde{\delta}^k_{i+1}$. Accordingly, the size of $\widetilde{\delta}^k_{i+1}$ becomes $(s_i(r_i-1)+1)\times(s_i(c_i-1)+1)$. Finally, we pad zeros of width $k_i-1$ around $\widetilde{\delta}^k_{i+1}$ to obtain the new $\widetilde{\delta}^k_{i+1}\in\mathbb{R}^{(s_i(r_i-1)+2k_i-1)\times(s_i(c_i-1)+2k_i-1)}$. Note that since $r_{i+1}=(\tilde{r}_i-k_{i+1})/s_{i+1}+1$ and $c_{i+1}=(\tilde{c}_i-k_{i+1})/s_{i+1}+1$, we have $s_i(r_i-1)+2k_i-1=\tilde{r}_{i-1}-k_i+1$ and $s_i(c_i-1)+2k_i-1=\tilde{c}_{i-1}-k_i+1$.
Then we define four operators G (·), Q (·), up (·) and vec(·), which are useful for explaining the following analysis.
Operation $G(\cdot)$: it maps an arbitrary vector $z\in\mathbb{R}^d$ into a diagonal matrix $G(z)\in\mathbb{R}^{d\times d}$ whose $i$-th diagonal entry equals $\sigma_1(z_i)(1-\sigma_1(z_i))$, in which $z_i$ denotes the $i$-th entry of $z$.

Operation $Q(\cdot)$: it maps a vector $z\in\mathbb{R}^d$ into a matrix of size $d^2\times d$ whose $((i-1)d+i,i)$-th entries ($i=1,\cdots,d$) equal $\sigma_1(z_i)(1-\sigma_1(z_i))(1-2\sigma_1(z_i))$ and whose remaining entries are all 0.

Operation $\mathrm{up}(\cdot)$: $\mathrm{up}(M)$ represents conducting upsampling on $M\in\mathbb{R}^{s\times t\times q}$. Let $N=\mathrm{up}(M)\in\mathbb{R}^{ps\times pt\times q}$. Specifically, for each slice $N(:,:,i)$ ($i=1,\cdots,q$), we have $N(:,:,i)=\mathrm{up}(M(:,:,i))$; it upsamples each entry $M(g,h,i)$ into a $p\times p$ block of identical entries $\frac{1}{p^2}M(g,h,i)$.

Operation $\mathrm{vec}(\cdot)$: for a matrix $A\in\mathbb{R}^{s\times t}$, $\mathrm{vec}(A)$ is defined as $\mathrm{vec}(A)=(A(:,1);\cdots;A(:,t))\in\mathbb{R}^{st}$, which vectorizes $A\in\mathbb{R}^{s\times t}$ along its columns. If $A\in\mathbb{R}^{t\times s\times q}$ is a 3-way tensor, then $\mathrm{vec}(A)=[\mathrm{vec}(A(:,:,1));\mathrm{vec}(A(:,:,2));\cdots;\mathrm{vec}(A(:,:,q))]\in\mathbb{R}^{stq}$.
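A small NumPy sketch of the two operators used most often in the sequel, $G(\cdot)$ and $\mathrm{up}(\cdot)$, with illustrative function names:

```python
import numpy as np

def G(z):
    """G(z): diagonal matrix with i-th diagonal entry sigma1(z_i) * (1 - sigma1(z_i))."""
    s = 1.0 / (1.0 + np.exp(-z))            # sigma1 is the sigmoid function
    return np.diag(s * (1.0 - s))

def up(M, p):
    """up(M): upsample each entry M[g, h, i] into a p x p block of identical entries M[g, h, i] / p^2."""
    return np.repeat(np.repeat(M, p, axis=0), p, axis=1) / p**2
```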
B.2 Auxiliary Lemmas
We first introduce Lemmas 2 and 3, which are respectively used for bounding the $\ell_2$-norm of a vector and the operation norm of a matrix. Then we introduce Lemmas 4 and 5, which discuss the topology of functions. In Lemma 6, we introduce Hoeffding's inequality, which provides an upper bound on the probability that the sum of independent random variables deviates from its expected value. In Lemma 7, we provide the covering number of an $\epsilon$-net for a low-rank matrix. Finally, several commonly used inequalities are presented in Lemma 8 for our analysis.

Lemma 2. (Vershynin, 2012) For any vector $x\in\mathbb{R}^d$, its $\ell_2$-norm can be bounded as
$$\|x\|_2 \le \frac{1}{1-\epsilon}\sup_{\lambda\in\lambda_\epsilon}\langle\lambda,x\rangle,$$
where $\lambda_\epsilon=\{\lambda_1,\ldots,\lambda_{k_w}\}$ is an $\epsilon$-covering net of $\mathbb{B}^d(1)$.

Lemma 3. (Vershynin, 2012) For any symmetric matrix $X\in\mathbb{R}^{d\times d}$, its operator norm can be bounded as
$$\|X\|_{op} \le \frac{1}{1-2\epsilon}\sup_{\lambda\in\lambda_\epsilon}|\langle\lambda,X\lambda\rangle|,$$
where $\lambda_\epsilon=\{\lambda_1,\ldots,\lambda_{k_w}\}$ is an $\epsilon$-covering net of $\mathbb{B}^d(1)$.

Lemma 4. (Mei et al., 2017) Let $\mathcal{D}\subseteq\mathbb{R}^d$ be a compact set with a $C^2$ boundary $\partial\mathcal{D}$, and let $f,g:\mathcal{A}\to\mathbb{R}$ be $C^2$ functions defined on an open set $\mathcal{A}$ with $\mathcal{D}\subseteq\mathcal{A}$. Assume that for all $w\in\partial\mathcal{D}$ and all $t\in[0,1]$, $t\nabla f(w)+(1-t)\nabla g(w)\neq 0$. Finally, assume that the Hessian $\nabla^2 f(w)$ is non-degenerate and has index equal to $r$ for all $w\in\mathcal{D}$. Then the following properties hold: (1) If $g$ has no critical point in $\mathcal{D}$, then $f$ has no critical point in $\mathcal{D}$. (2) If $g$ has a unique critical point $w$ in $\mathcal{D}$ that is non-degenerate with an index of $r$, then $f$ also has a unique critical point $w'$ in $\mathcal{D}$ with the index equal to $r$.

Lemma 5. (Mei et al., 2017) Suppose that $F(w):\Theta\to\mathbb{R}$ is a $C^2$ function where $w\in\Theta$. Assume that $\{w^{(1)},\ldots,w^{(m)}\}$ are its non-degenerate critical points and let $\mathcal{D}=\{w\in\Theta:\|\nabla F(w)\|_2<\epsilon\ \text{and}\ \inf_i|\lambda_i(\nabla^2 F(w))|\ge\zeta\}$. Then $\mathcal{D}$ can be decomposed into (at most) countably many components, with each component containing either exactly one critical point or no critical point. Concretely, there exist disjoint open sets $\{\mathcal{D}_k\}_{k\in\mathbb{N}}$, with $\mathcal{D}_k$ possibly empty for $k\ge m+1$, such that
$$\mathcal{D}=\cup_{k=1}^{\infty}\mathcal{D}_k.$$
Furthermore, $w^{(k)}\in\mathcal{D}_k$ for $1\le k\le m$, and each $\mathcal{D}_k$, $k\ge m+1$, contains no stationary points.
Understanding Generalization and Optimization Performance for Deep CNNs
Lemma 6. (Hoeffding, Pn 1963) Let x1 , · · · , xn be independent random variables where xi is bounded by the interval [ai , bi ]. Suppose sn = n1 i=1 xi , then the following properties hold: 2n2 t2 P (sn − Esn ≥ t) ≤ exp − Pn . 2 i=1 (bi − ai )
where t is an arbitrary positive value.
Lemma 7. (Candes & Plan, 2009) Let Sr = {X ∈ Rn1 ×n2 : rank (X) ≤ r, kXkF = b}. Then there exists an -net e Sr for the Frobenius norm obeying r(n1 +n2 +1) 9b e |Sr | ≤ .
Then we give a lemma to summarize the properties of G (·), Q (·) and up (·) defined in Section B.1, the convolutional operation ~ defined in manuscript. Lemma 8. For G (·), Q (·) and up (·) defined in Section B.1, the convolutional operation ~ defined in manuscript, we have the following properties: (1) For arbitrary vector z, and arbitrary matrices M and N of proper sizes, we have kG (z)M k2F ≤
1 kM k2F 16
and kN G (z)k2F ≤
1 kN k2F . 16
(2) For arbitrary vector z, and arbitrary matrices M and N of proper sizes, we have kQ (z)M k2F ≤
26 kM k2F 38
and kN Q (z)k2F ≤
26 kN k2F . 38
(3) For any tensor M ∈ Rs×t×q , we have kup (M )k2F ≤
1 kM k2F . p2
(4) For any kernel W ∈ Rki ×ki ×di and δei+1 ∈ R(˜ri−1 −ki +1)×(˜ci−1 −ki +1)×di defined in Sec. B.1, then we have e k2F ≤ (ki − si + 1)2 kW k2F kδei+1 k2F . kδei+1 ~W
(5) For softmax activation function σ2 , we can bound the norm of difference between output v and its corresponding ont-hot label as 0 ≤ kv − yk22 ≤ 2. It should be pointed out that we defer the proof of Lemma 8 to Sec. E.
C
Technical Lemmas and Their Proofs
Here we present the key lemmas and theorems for proving our desired results. For brevity, in this document we use f (w, D) to denote f (g(w; D), y) in the manuscript. Lemma 9. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Then the gradient of f (w, D) with respect to w(j) can be written as T ∇W(l+1) f (w, D) =S(v − y)z(l) ∈ Rdl+1 טrl c˜l dl ,
j e δeik ∈ Rki ×ki , (j = 1, · · · , di−1 ; k = 1, · · · , di ; i = 1, · · · , l), ∇W k,j f (w, D) =Z(i−1) ~ (i)
Understanding Generalization and Optimization Performance for Deep CNNs
where δik is the k-slice (i.e. δi (:, :, k)) of δi which is defined as δi = ∇X(i) f (w, D) ∈ Rri ×ci ×di ,
(i = 1, · · · , l),
and further satisfies
di+1
δij =up
X
k=1
j k e c k,j δei+1 ~W(i+1) σ10 (X(i) ) ∈ Rri ×ci ,
(j = 1, · · · , di ; i = 1, · · · , l − 1),
c k,j ∈ Rki+1 ×ki+1 is obtained by rotating W k,j with 180 degrees. Also, δ is computed as follows: where the matrix W l (i+1) (i+1) ∇u f (w, D) = S(v − y) ∈ Rdl+1 ,
∇z(l) f (w, D) = (W(l+1) )T S(v − y) ∈ Rr˜l c˜l dl , ∂f (w, D) r˜l ט cl ×dl 0 ∈ Rri ×ci ×di . ∇Z(l) f (w, D) = reshape ∇z(l) f (w, D) ∈ R , δl = σ1 (X(l) ) up ∂Z(l)
where S = G (u).
Proof. We use chain rule to compute the gradient of f (w, D) with respect to Z(i) . We first compute several basis gradient. According to the relationship between X(i) , Y(i) , Z(i) and f (w, D), we have ∇u f (w, D) = S(v − y) ∈ Rdl+1 ,
∇z(l) f (w, D) = (W(l+1) )T S(v − y) ∈ Rr˜l c˜l dl , ∇Z(l) f (w, D) = reshape ∇z(l) f (w, D) ∈ Rr˜l טcl ×dl ,
∂Y(l) ∂Z(l) ∂f (w, D) ∇X(l) f (w, D) = = σ10 (X(l) ) up ∂X(l) ∂Y(l) ∂Z(l) Then we can further obtain di+1 X j k e c k,j δij =up δei+1 ~W(i+1) σ10 (X(i) ) ∈ Rri ×ci , k=1
∂f (w, D) ∂Z(l)
, δl ∈ Rri ×ci ×di .
(j = 1, · · · , di ; i = 1, · · · , l − 1).
c k,j denotes the j-th slice W c k (:, : j) of W c k . Note, we clockwise rotate the matrix W k,j by 180 degrees to where W (i) (i) (i) (i+1) c k,j . Finally, we can compute the gradient w.r.t. W(l+1) and W (i) (i = 1, · · · , l): obtain W (i+1) T ∇W(l+1) f (w, D) =S(v − y)z(l) ∈ Rdl+1 טrl c˜l dl ,
j e δeik ∈ Rki ×ki , (j = 1, · · · , di−1 ; k = 1, · · · , di ; i = 1, · · · , l). ∇W k,j f (w, D) =X(i−1) ~ (i)
The proof is completed.
Lemma 10. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Then the gradient of f (w, D) with respect to w(j) can be written as 2
kδl kF ≤
ϑ bl+1 2 , 16p2
where ϑ = 1/8.
2
kδi kF ≤
di+1 bi+1 2 (ki+1 − si+1 + 1)2
δi+1 2 , 2 F 16p
2
kδi kF ≤
l ϑbl+1 2 Y ds bs 2 (ks − ss + 1)2 , 16p2 s=i+1 16p2
Proof. We first bound δl . By Lemma 9, we have
2 2
0 ¬ 1 ∂f (w, D) ∂f (w, D) 2
= 1
≤ kδl kF = σ1 (X(l) ) up
2 ∂Z(l) 16p ∂Z(l) F 16p2 F
1
(W(l+1) )T S(v − y) 2 ≤ ϑ bl+1 2 , ≤ 2 2 16p 16p2
∂f (w, D) 2
∂z
(l) 2
Understanding Generalization and Optimization Performance for Deep CNNs
where ¬ holds since the values of the entries in the tensor σ10 (X(l) ) ∈ Rrl ×cl ×dl belong to [0, 1/4], and the up (·) operation does repeat one entry into p2 entries but the entry value becomes 1/p2 of the original entry value. holds since we have kS(v − y)k22 ≤ 1/8 in Lemma 8. Also by Lemma 9, we can further bound kδij k2F as follows:
2
2
di+1 di dX di di i+1 X X X X
1
j 2 2 k,j k,j j k ec k ec 0 e e
δi+1 ~W(i+1) kδi kF = δi+1 ~W(i+1) σ1 (X(i) ) ≤
δi =
up 2 16p F
j=1 k=1 j=1 j=1 k=1
F
F
di dX di dX i+1 i+1
di+1 X di+1 (ki+1 − si+1 + 1)2 X
ek e c k,j 2
ek 2 ≤ ≤ δ ~ W
i+1
δi+1 (i+1) 2 2 16p j=1 16p F F j=1 k=1
k=1
c k,j 2
W(i+1)
F
di+1
k
di+1 bi+1 2 (ki+1 − si+1 + 1)2 di+1 (ki+1 − si+1 + 1)2 X
2 k 2
δi+1
δi+1 2 , ≤ =
W (i+1) 2 F F 16p2 16p F ¬
k=1
c k (:, :, j) of the tensor W c k . ¬ holds since we rotate the matrix W k,j denotes the j-th slice W (i+1) (i+1) (i+1)
Pdi
k,j 2
k 2 k,j k,j k 2 2 c c by 180 degrees to obtain W(i+1) , indicating kW(i+1) kF = kW(i+1) kF and j=1 W(i+1) = W(i+1) ≤ bi+1 2 . F F Accordingly, the above inequality gives where
c k,j W (i+1)
2 kδi kF
The proof is completed.
l Y
2 di+1 bi+1 2 (ki+1 − si+1 + 1)2 ds bs 2 (ks − ss + 1)2 2
≤ δ ≤ · · · ≤ kδ k i+1 l F F 16p2 16p2 s=i+1 l ϑbl+1 2 Y ds bs 2 (ks − ss + 1)2 ≤ . 16p2 s=i+1 16p2
Lemma 11. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Then the gradient of f (w, D) with respect to W (l+1) and w can be respectively bounded as follows:
2 2
∇W f (w, D) F ≤ ϑ˜ rl c˜l dl and k∇w f (w, D)k2 ≤ β 2 , (l+1) h Pl where ϑ = 1/8 and β , ϑ˜ rl c˜l dl + i=1
2 2 Ql ϑbl+1 2 di−1 s −ss +1) ri−1 ci−1 s=i ds bs (k16p 2 p2 bi 2 di
Proof. By utilizing Lemma 10, we can bound
i1/2
.
di dX di dX l l X l X i−1 i−1
2
2 X X X
j k
∇w f (w, D) 2 = e = ~δ k,j f (w, D)
∇
Z
i (i) (i−1) W F i=1
(i)
i=1 k=1 j=1
¬
≤
≤ ≤ ≤ =
di dX l X i−1 X i=1 k=1 j=1
di dX l X i−1 X i=1 k=1 j=1
l X i=1
l X i=1
i=1 k=1 j=1
F
2 2
j
k (ki − si + 1)2 Z(i−1)
δi F F
2 r˜i−1 c˜i−1 (ki − si + 1)2 δik F 2
r˜i−1 c˜i−1 di−1 (ki − si + 1)2 kδi kF r˜i−1 c˜i−1 di−1 (ki − si + 1)2
l X ϑbl+1 2 di−1 i=1
F
p 2 bi 2 d i
l ϑbl+1 2 Y ds bs 2 (ks − ss + 1)2 16p2 s=i+1 16p2
l Y ds bs 2 (ks − ss + 1)2 ri−1 ci−1 , 16p2 s=i
Understanding Generalization and Optimization Performance for Deep CNNs
2
2 2
j
j
j e ik ~δ where ¬ holds since Z(i−1)
≤ (ki − si + 1)2 Z(i−1) δik F ; and holds since the values of entries in Z(i−1) F
belong to [0, 1].
F
On the other hand, we can bound
2
2
T
∇W f (w, D) F = S(v − y)z(l) rl c˜l dl .
≤ ϑ˜ (l+1) F
So we can bound the `2 norm of the gradient as follows:
l
2 X
2
∇w f (w, D) 2 k∇w f (w, D)kF = ∇W(l+1) f (w, x) F + (i) F i=1
≤ϑ˜ rl c˜l dl +
l X ϑbl+1 2 di−1 i=1
p2 bi 2 di
l Y ds bs 2 (ks − ss + 1)2
ri−1 ci−1
16p2
s=i
.
The proof is completed. Lemma 12. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Then for both cases, the gradient of x(i) with respect to w(j) can be bounded as follows: 2
∂vec X i Y
∂x 2
(i) ds bs 2 (ks − ss + 1)2
= (i) ≤ di ri ci r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2
∂w
∂w(j) 16p2 (j) F
s=j+1 F
and
∂vec X s 2
s 2 i Y
∂x(i) (i) ds bs 2 (ks − ss + 1)2 2
≤ ri ci r˜j−1 c˜j−1 dj−1 (kj − sj + 1) max = max .
s s ∂w(j) ∂w(j) F 16p2
s=j+1 F
k k Proof. For brevity, let X(i) (s, t) denotes the (s, t)-th entry in the matrix X(i) ∈ Rri ×ci . We also let φ(i,m) =
Rrm ×cm ×dm . So similar to Lemma 9, we have dm+1 X q ek e c k,q σ10 (X(m) φq(i,m) =up φ ) ∈ Rrm ×cm , (i,m+1) ~W(m+1) k=1
k ∂X(i) (s,t) ∂X(m)
∈
(q = 1, · · · , dm ),
k,q km+1 ×km+1 c k,q where the matrix W is obtained by rotating W(m+1) with 180 degrees. Then according to the (m+1) ∈ R relationship between X(j) and W(j) , we can compute k ∂X(i) (s, t) g,h ∂W(j)
h e g(i,j) ∈ Rkj ×kj , (h = 1, · · · , dj−1 ; g = 1, · · · , dj ). = Z(j−1) ~φ
Therefore, we can further obtain
2
2 dj dj−1 dj dj−1 dj dj−1 k k
∂X k (s, t) 2 X
2 X ∂X(i) X X ∂X(i) X X (s, t) (s, t) (i)
h e g(i,j) ~φ
=
=
=
Z(j−1)
g,h g,h
∂w(j)
∂W
∂W F g=1 h=1 g=1 h=1 g=1 h=1 (j) (j) F F
F
2
2
2 XX
h
g 2
≤ (kj − sj + 1)2 Z(j−1)
φ(i,j) ≤ (kj − sj + 1)2 Z(j−1) φ(i,j) dj dj−1
g=1 h=1
F
F
2
≤˜ rj−1 c˜j−1 dj−1 (kj − sj + 1)2 φ(i,j) . F
F
F
Understanding Generalization and Optimization Performance for Deep CNNs
On the other hand, by Lemma 9, we can further bound kφ(i,j) k2F as follows:
2
dm+1 dm dm
2
2 X
X X
q
k,q q k 0 e c
e up φ(i,m+1) ~W(m+1) σ1 (X(m) )
φ(i,m) =
φ(i,m) =
F F
q=1 q=1 k=1 F
2
dm+1 dm+1 d d m m
2 X
1 X dm+1 X X ek ek c k,q
e c k,q eW ≤ φ
φ(i,m+1) ~
(i,m+1) ~W(m+1) ≤ (m+1)
2 2 16p q=1 16p q=1 F
k=1
≤ ¬
= ≤
k=1
F
dm dX m+1
2 dm+1 (km+1 − sm+1 + 1)2 X
ek
c k,q 2 W φ
(i,m+1) (m+1) 16p2 F F q=1 k=1
dm+1 (km+1 − sm+1 + 1) 16p2
m+1 2 dX
k=1
2
ek
φ(i,m+1)
F
2
ck
W(m+1)
F
2 dm+1 bm+1 2 (km+1 − sm+1 + 1)2
,
φ (i,m+1) 16p2 F
k,q c k,q k2 . It further yields where ¬ holds since kW(m+1) k2F = kW (m+1) F
2
2
φ(i,m) ≤ φ(i,i) F
F
i Y
i ds bs 2 (ks − ss + 1)2 ¬ Y ds bs 2 (ks − ss + 1)2 = . 16p2 16p2 s=m+1 s=m+1
2
2 k
∂X(i) (s,t)
where ¬ holds since we have φ(i,i) =
∂X(i) = 1. F F
Therefore, we have
2 ri X ci ri X ci k
∂xk 2 X
2
X
(i)
∂X(i) (s, t)
r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2 φ(i,j)
=
≤
∂w(j)
∂w(j) F s=1 t=1 s=1 t=1 F F
2
=ri ci r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2 φ(i,j) F
i Y ds bs 2 (ks − ss + 1)2 ≤ri ci r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2 . 16p2 s=j+1
It further gives
di i Y
∂x(i) 2 X
∂xs(i) 2 ds bs 2 (ks − ss + 1)2
=
≤ di ri ci r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2 .
∂w
∂w 16p2 (j) F (j) F s=1 s=j+1
The proof is completed.
Lemma 13. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Then the gradient of δls with respect to w(j) can be bounded as follows:
and
where ϑ˜ =
∂δls
∂w
2 l
2 2 Y ˜
ds bs 2 (ks − ss + 1)2
s 2
≤ ϑbl+1 d r c r ˜ c ˜ d (k − s + 1)
W
l l l j−1 j−1 j−1 j j (l+1)
16p2 16p2 F (j) F s=j+1
3 64 .
2 l 4 Y ˜
ds bs 2 (ks − ss + 1)2
≤ ϑbl+1 dl rl cl r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2 ,
16p2 16p2 (j) F s=j+1
∂δl
∂w
Understanding Generalization and Optimization Performance for Deep CNNs dl 1 2 i Proof. Assume that W(l+1) = [W(l+1) , W(l+1) , · · · , W(l+1) ] where W(l+1) ∈ Rdl+1 טrl c˜l is a submatrix in W(l+1) . Then Pdl k k we have v = σ2 ( k=1 W(l+1) z(l) ). For brevity, we further define a matrix G(k) as follows:
h i G(k)= σ10 x(k) , σ10 x(k) , · · · , σ10 x(k) ∈ Rrk ck dk ×rk ck , {z } | rk ck columns
Then we have ∂ ∂z(l)
∂f (w, D) s ∂z(l)
!
i h s s )T G (u)G (u)W(l+1) , = (v − y)T ⊗ (W(l+1) )T Q (u)W(l+1) + (W(l+1)
where Q (u) is a matrix of size d2l+1 × dl+1 whose (s, (s − 1)dl+1 + s) entry equal to σ1 (us )(1 − σ1 (us ))(1 − 2σ1 (us )) and rest entries are all 0. Accordingly, we have
∂
∂z(l)
∂f (w, D) s ∂z(l)
! 2 h
2
2
i
s s )T G (u)G (u)W(l+1) )T Q (u)W(l+1) + (W(l+1)
≤2 (v − y)T ⊗ (W(l+1)
F F F 6
2
2 2 1 s 2
2 2 s ≤2 kv − ykF W(l+1)
W(l+1) F + 2 W(l+1)
W(l+1) F 8 3 16 F F 7
2 ¬ 2 1
s ≤2 + 2 bl+1 2 W(l+1)
38 16 F
2 3
s ≤ bl+1 2 W(l+1)
, 64 F 2
where we have kv − ykF ≤ 2 by Lemma 8. Then by similar way, we can have
2
∂
=
∂z(l) (j) F
∂δls
∂w
∂f (w, D) s ∂z(l)
!
2
2 ∂z(l) ∂yl ∂x(l) 3
2 s b
≤
W
l+1 (l+1) 2
∂yl ∂x(l) ∂w(j) 64 ∗ 16p F F
∂x(l) 2
∂w (j) F
l
2 Y 3 ds bs 2 (ks − ss + 1)2
2 s 2 ≤ b d r c r ˜ c ˜ d (k − s + 1) .
W
l+1 l l l j−1 j−1 j−1 j j (l+1) 64 ∗ 16p2 16p2 F s=j+1
Therefore, we can further obtain:
∂δl
∂w
(j)
2 dl+1 X ∂δ s
l
=
∂w F
s=1
The proof is completed.
2 l Y
3 ds bs 2 (ks − ss + 1)2 4 2
≤ b d r c r ˜ c ˜ d (k − s + 1) . l+1 l l l j−1 j−1 j−1 j j
64 ∗ 16p2 16p2 (j) F s=j+1
Lemma 14. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Then the Hessian of f (w, x) with respect to w can be bounded as follows:
where γ =
ϑbl+1 2 d20 2 2 2 l r0 c0 b1 4 d21
∇3w f (w, D).
hQ l
2
∇w f (w, D) 2 ≤ O γ 2 , F
ds bs 2 (ks −ss +1)2 √ s=1 8 2p2
i2 1/2
. With the same condition, we can bound the operation norm of
That is, there exists a universal constant ν such that ∇3w f (w, D) op ≤ ν. 2
k Proof. From Lemma 9, we can further compute the Hessian matrix ∇2w f (w, D). Recall that w(i) ∈ Rki di−1 (k = h i di k k 1 1, · · · , dl ) is the vectorization of W(i) ∈ Rki ×ki ×di−1 , i.e. w(i) = vec W(i) (:, : 1) ; · · · ; vec W(i) (:, :, di−1 ) .
Understanding Generalization and Optimization Performance for Deep CNNs
h
i 2 di 1 Let w(i) = w(i) ; · · · ; w(i) ∈ Rki di−1 ×di (i = 1, · · · , l). Also, w(l+1) ∈ Rr˜l c˜l dl dl+1 is the vectorization of the weight matrix W(l+1) . Then if 1 ≤ i, j ≤ l, we can have
∂ f (w, D) = k ∂w(j) ∂w(i) 2
∂ 2 f (w,D) k,1 ∂w(j) ∂w(i) ∂ 2 f (w,D) k,2 ∂w(j) ∂w(i)
.. .
∂ 2 f (w,D) k,d
∂w(j) ∂w(i) i−1
1 ek )) eδ ∂ (vec(Z(i−1) ~ i ∂w(j) 2 ek )) eδ ∂ (vec(Z ~
i (i−1) ∂w(j) = .. . ∂ vec Z di−1 ~ k e eδ (i−1)
i
∂w(j)
∂ vec Z 1 ∂ vec δek ( ( (i−1) )) ( ( i )) 1 P1 δeik + P Z 2 (i−1) ∂w(j) ∂w ∂ vec (j) 2 ( (δeik )) 2 P1 δeik ∂ (vec(Z(i−1) )) + P2 Z(i−1) 2 2 ∂w(j) ∂w(j) = ∈ Rki di−1 ×kj dj dj−1 , . .. ∂ vecZ di−1 ∂ vec δek ( ( )) di−1 (i−1) i P1 δeik + P2 Z(i−1) ∂w(j) ∂w(j)
(4)
2 2 di−1 where P1 δeik ∈ Rki טri−1 c˜i−1 di−1 and P2 Z(i−1) ∈ Rki ×(˜ri −ki +1)(˜ci −ki +1) satisfy: each row in P1 δeik contains di−1 the vectorization of (δeik )T at the right position and the remaining entries are 0s, and each row in P2 Z(i−1) is the d
i−1 submatrix in Z(i−1) that need to conduct inner product with δeik in turn. Note that there are si − 1 rows and columns between each neighboring nonzero entries in N which is decided by the definition of δe in Sec. B.1. Accordingly, we have
i+1
2
∂ vec Z di−1
(i−1)
P1 δeik
≤ (ki − si + 1)2 kδeik k2F
∂w(j)
F
2
∂ vec Z di−1
(i−1)
= (ki − si + 1)2 kδik k2F
∂w(j)
F
2
∂ vec Z di−1
(i−1)
∂w(j)
F
and 2
∂ vec δeik
P2 Z di−1
≤ (ki − si + 1)2 kZ di−1 k2F (i−1) (i−1)
∂w (j)
F
∂ vec δek 2
i
= (ki − si + 1)2 kZ di−1 k2F
(i−1)
∂w (j)
F
∂ vec δ k 2
i
.
∂w(j)
Then in order to bound
k∇2w f (w, D)k2F
l+1 X l+1 2 X
∂ f (w, D) 2
=
∂w ∂w , i=1 j=1
(j)
(i)
F
we try to bound each term separately. So we consider the following five cases: l ≥ i ≥ j, i ≤ j ≤ l, l + 1 = i > j, l + 1 = j > i and l + 1 = i = j. Case 1: l ≥ i ≥ j
F
Understanding Generalization and Optimization Performance for Deep CNNs
In the following, we first consider the first case, i.e. i ≥ j, and bound
2
∂ vec δek 2
di+1
∂ vec δ k 2 X
i
i es ~ c s,k σ 0 (X k )
∂ vec up
= e δ = W
1 i+1 (i) (i+1)
∂w(j) ∂w(j)
∂w(j)
s=1 F F F
2 2
2
di+1 di+1 k X X )
∂σ10 (X(i)
¬ 2 ∂ es ~ es ~ c s,k c s,k + 2 up up e eW δ W ≤ vec δ
i+1 i+1 (i+1) (i+1)
16 ∂w(j) ∂w(j)
s=1 s=1 F F F
2
2 2
∂vec σ 0 (xk )
dX di+1 k i+1 X ∂x
(i) 2 2 ∂ (i) s e c s,k s e c s,k
= δei+1 ~W(i+1) vec δei+1 ~W(i+1) + 2
k 2 16p ∂w(j) p ∂w ∂x (j)
(i) s=1 s=1 F F F
2
2
2
d di+1 k i+1 X
∂X(i)
2 2 · 26 X es
s e c s,k c s,k
∂ vec eW ≤ δi+1 ~ δei+1 ~W(i+1) + 8 2
(i+1)
2
16p ∂w(j) 3 p ∂w (j)
s=1
F
s=1
F
F
di+1 di+1
∂ δes 2 2 · 26
X X s,k ® 2di+1
2 es 2 s,k i+1 2 2 2 c c ≤ + (k −s +1) k W k d (k −s +1) W δ
i+1 i+1 i+1 i+1 i+1 i+1 (i+1) F ∂w (i+1) 16p2 38 p2 F F (j) s=1 s=1
∂X k 2 (i)
∂w(j) F F
s 2 di+1 di+1 k 2
6 X X 2 ∂X
∂δi+1 2·2 ¯ 2di+1 (i)
s,k s 2 s,k 2 = (ki+1 −si+1 +1)2 kW(i+1) k2F
.
W(i+1) δi+1 F
∂w + 38 p2 di+1 (ki+1 −si+1 +1)
16p2 ∂w F (j) F (j) s=1
s=1
(5)
F
k k is independent on w(j) and the values of entries in σ10 (X(i) ) is not larger than 1/4 since for any constant ¬ holds since X(i) 0 a, σ (a) = σ(a)(1 − σ(a)) ≤ 1/4. holds since for arbitrary tensor M , we have kup (M )k2F ≤ kM k2F /p2 in Lemma 8, and we also have
2
2
2
∂vec σ 0 (xk ) k
∂xk 2 6 ∂xk 6 ∂X k ∂x
(i) 2 2 (i) (i) (i) (i)
= Q xk
≤ 8
= 8
. (i)
k
∂w ∂w 3 ∂w 3 ∂w ∂x (j) (j) (j) (j)
(i) F
F
F
F
c s,k and the conclusion in Lemma 8; ¯ holds ® holds since we can just adopt similar strategy in Eqn. (4) to separate W (i+1) since the difference between δes and δ s is that we pad 0 around δ s to obtain δes , indicating kδ s k2 = kδes k2 . i+1
i+1
i+1
i+1
i+1 F
i+1 F
Accordingly, we can further bound
2 di−1
∂ δe 2 X ∂ vec δik
i
=
∂w(j)
∂w(j) F
k=1
F
s 2 di−1 di+1 k 2 6 X X s,k 2 ∂X(i)
∂δi+1
2 2 · 2
2 s
W(i+1) δi+1 F
∂w + 38 p2 di+1 (ki+1 −si+1 +1)
∂w F (j) (j) F k=1 k=1 s=1 F
2
di+1 k 2
s 6 X
∂δi+1
2di+1
2
∂X(i) s s
+ 2 · 2 di+1 (ki+1 −si+1 +1)2 δi+1 2 max (ki+1 −si+1 +1)2 kW(i+1) k2F ≤
W(i+1)
max
2 8 2 F s 16p ∂w(j) F 3 p F k ∂w(j) s=1 F
2
2 k
6 ∂X(i)
∂δi+1 ¬ 2di+1
+ 2 · 2 di+1 (ki+1 −si+1 +1)2 bi+1 2 δi+1 2 max ≤ (ki+1 −si+1 +1)2 bi+1 2
2 8 2 F k ∂w(j) 16p ∂w(j) F 3 p F
2 l 2 2 2 Y
2di+1 ∂δ d b (k − s + 1) ϑb d s s s s l+1 j−1 i+1 (ki+1 −si+1 +1)2 bi+1 2 , ≤
∂w + 3p2 b 2 d ri ci rj−1 cj−1 2 16p2 16p (j) F j j s=j di−1 di+1
XX 2di+1 s,k ≤ (ki+1 −si+1 +1)2 kW(i+1) k2F 2 16p s=1
where ¬ holds since we have and 12.
s kW(i+1) kF
2 k
2
∂X(i)
≤ rw ; holds due to the bounds of δi+1 F and ∂w(j)
in Lemma 10 F
Understanding Generalization and Optimization Performance for Deep CNNs
Then, we can use the above recursion inequality to further obtain
∂δi 2
∂w (j) F " i+1 #
2 l 2 2 2 Y 2ds Y
∂δ 2d ϑb d d b (k −s +1) i+2 l+1 j−1 s s s s i+2 ≤ (ks −ss +1)2 bs 2 (ki+2 −si+2 +1)2 bi+2 2
∂w + 3p2 b 2 d rj−1 cj−1 ri+1 ci+1 2 2 2 16p 16p 16p (j) j j F s=i+1 s=j
l Y ϑbl+1 2 dj−1 ds bs 2 (ks − ss + 1)2 r c r c i i j−1 j−1 2 16p2 3p2 bj dj s=j " l # # " " i+1
2 l 2 2 2 Y 2ds Y 2ds Y
∂δ ϑb d d b (k − s + 1) l+1 j−1 s s s s 2 2 l ≤ (ks −ss +1)2 bs ri ci + ri+1 ci+1 (ks −ss +1)2 bs
∂w + 3p2 b 2 d rj−1 cj−1 16p2 16p2 16p2 (j) F j j s=i+1 s=i+1 s=j " i+2 # ## " l Y 2ds Y 2ds +ri+2 ci+2 (ks −ss +1)2 bs 2 + · · · + rl cl (ks −ss +1)2 bs 2 . 2 2 16p 16p s=i+1 s=i+1
+
By Lemma 13, we have
where ϑ˜ =
3 64 .
2 l 4 Y ˜
ds bs 2 (ks − ss + 1)2
≤ ϑbl+1 dj−1 dl rl cl rj−1 cj−1 ,
16p2 p2 bj 2 dj (j) F s=j
∂δl
∂w
Thus, we can establish
# l "
2 l 2 Y ˜
ds bs 2 (ks − ss + 1)2 Y ds bs 2 (ks − ss + 1)2
≤ ϑbl+1 dj−1 rj−1 cj−1 τ + bl+1 2 dl rl cl .
3 16p2 16p2 p2 bj 2 dj (j) F s=j s=i+1
∂δi
∂w
hQ i hQ i+1 l 2ds 2 2 where τ = ri ci + ri+1 ci+1 s=i+116p + · · · + rl cl 2 (ks −ss +1) bs s=i+1
2
∂ f (w,x) 2 bound of ∂w(j) ∂w(i) as follows:
2ds 2 2 16p2 (ks −ss +1) bs
i . It further gives the
F
2
2
2 s di−1 di i−1
∂ vec δ k ∂ vec X 2 X X
∂ f (w, D) 2 dX (i−1) ∂ f (w, D)
i s
P1 δik
+ P2 X(i−1)
=
∂w ∂w = k
∂w(j) ∂w(i) ∂w ∂w (j) (i) F (j) (j)
k=1 k=1 s=1 F F 2
2
s di−1 di
∂ vec δ k X X ∂ vec X(i−1)
i k s
+ P δ ≤2 P X
1 2 i (i−1)
∂w(j) ∂w(j)
k=1 s=1 F F 2
s di−1 di
2 ∂ vec δ k 2 ∂ vec X(i−1) X X 2
i s k
+ ≤2(ki −si +1)2
X(i−1)
δi F
∂w ∂w F (j) (j)
k=1 s=1 F F 2
∂ vec X
2
2
(i−1) ∂ (vec (δi ))
2 2
≤2(ki −si +1) kδi kF
+ X(i−1) F ∂w
∂w(j) (j)
F F
∂ (vec (δi )) 2 ds bs (ks − ss + 1)2 32ϑbl+1 di−1
r c r c + r c d i−1 i−1 j−1 j−1 i−1 i−1 i−1 16p2 ∂w(j) F bi 2 bj 2 di dj s=i+1 " # l l 2 2 2 2 2 Y Y ϑb d d d b (k − s + 1) 2d b (k − s + 1) l+1 i−1 j−1 s s s s s s s s . ≤O ri−1 ci−1 rj−1 cj−1 2 2 16p 16p bi 2 bj 2 di dj s=j s=i ¬
≤
2
l Y
2
Understanding Generalization and Optimization Performance for Deep CNNs
where ¬ holds because of Lemma 12, while holds due to 13. Case 2: i ≤ j ≤ l Since
∂ 2 f (w,D) ∂w∂wT
is symmetrical, we have
∂ 2 f (w,D) T ∂w ∂w(i) (j)
=
∂ 2 f (w,D) T ∂w ∂w(j) (i)
(1 ≤ i, j ≤ l). Thus, it yields
∂ 2 f (w, D) 2
∂ 2 f (w, D) 2
=
.
T ∂w T ∂w
∂w(i) ∂w (j) (i) (j) F
Case 3: l + 1 = i > j
T
F
In the following, we first consider the first case, i.e. cross entropy and softmax activation, and bound
2
2 T
∂(v − y)z T 2
∂z(l)
∂v ∂z(l) ∂x(l) ∂x (l) (l)
= + z(l) ⊗ Idl+1
= [Ir˜l c˜l dl ⊗ (v − y)]
∂w(j) ∂x(l) ∂w(j) ∂z(l) ∂x(l) ∂w(j) (l+1) F F F
2
∂x (l) = p [Ir˜l c˜l dl ⊗ (v − y)] + z(l) ⊗ Idl+1 diag (σ20 (u)) W(l+1) G(l) ,
uf ∂w(j) F
2
∂ f (w, D)
∂w ∂w (j)
where G(l) is defined as
h i G(l)= σ10 x(l) , σ10 x(l) , · · · , σ10 x(l) ∈ Rrl cl dl ×rl cl . | {z } rl cl columns
Thus, we can further obtain
2
∂ f (w, D) 2
2 ∂x(l) 2 0
≤ 1 [Ir˜ c˜ d ⊗ (v − y)] + z(l) ⊗ Id
diag (σ2 (u)) W(l+1) F l l l l+1
∂w ∂w
16p2 ∂w(j) F (j) (l+1) F
2 ∂x(l) 2 2 2 0
kI ⊗ (v − y)k + z ⊗ I diag (σ (u)) W ≤ r˜l c˜l dl dl+1 (l) (l+1) F 2 F 16p2 ∂w(j) F
¬ 2 2 ∂x(l) 2 2 0
≤ z ⊗ diag (σ (u)) W kI ⊗ (v − y)k + r˜l c˜l dl (l) (l+1) F 2 F 16p2 ∂w(j) F l Y 1 1 ds bs 2 (ks − ss + 1)2 2 ≤ 2 r˜l c˜l dl 2 + bl+1 dl rl cl r˜j−1 c˜j−1 dj−1 (kj − sj + 1)2 8p 16 16p2 s=j+1 Y l 2dj−1 2 2 2 ds bs 2 (ks − ss + 1)2 1 2 = 4 2 rl cl dl rj−1 cj−1 2 + bl+1 16 16p2 p bj dj s=j
where ¬ holds since for an arbitrary vector u ∈ Rk and an arbitrary matrix M ∈ Rk×k , we have (u ⊗ Ik ) M = u ⊗ M ; holds since we use Lemma 12 and the assumption that kW(l+1) k2F ≤ bl+1 2 .
Now we consider the least square loss and softmax activation function. In such a case, we can further obtain:
2
∂(v − y)G (u)z T 2
∂ f (w, D) 2 (l)
=
∂w ∂w
∂w (j) (l+1) F (j) F
2 T
∂z ∂vec (G (u)) ∂u ∂z(l) ∂x(l) ∂v ∂z(l) ∂x(l) (l) ∂x(l)
= [Ir˜l c˜l dl ⊗ (v − y)] + z(l) ⊗ (v − y) + z(l) ⊗ Idl+1
∂x(l) ∂w(j) ∂u ∂z(l) ∂x(l) ∂w(j) ∂z(l) ∂x(l) ∂w(j) F
2
∂x (l)
. = p [Ir˜l c˜l dl ⊗ (v − y)] + z(l) ⊗ (v − y) Q (u)W(l+1) + z(l) ⊗ Idl+1 Q (u)G (u)W(l+1) G(l)
uf ∂w(j) F
Understanding Generalization and Optimization Performance for Deep CNNs
Thus, we can further obtain
2
∂ f (w, D)
∂w ∂w (j)
(l+1)
2
F
2 ∂x(l) 2 1
≤ [Ir˜l c˜l dl ⊗ (v − y)] + z(l) ⊗ (v − y) Q (u)W(l+1) + z(l) ⊗ Idl+1 Q (u)G (u)W(l+1) F 16p2 ∂w(j) F
2
2 ∂x(l) 2 3 2
≤ kIr˜l c˜l dl ⊗ (v − y)kF + z(l) ⊗ (v − y) Q (u)W(l+1) F + z(l) ⊗ Idl+1 Q (u)G (u)W(l+1) F 16p2 ∂w(j) F l Y ¬ 3 3 ds bs 2 (ks − ss + 1)2 2 2 ≤ r ˜ c ˜ d 2 + b d r c r ˜ c ˜ d (k − s + 1) l l l l+1 l l l j−1 j−1 j−1 j j 16p2 100 16p2 s=j+1 =
Y l 3dj−1 26 27 ds bs 2 (ks − ss + 1)2 2 2 2 2 2 r c d b + b , r c 2 + j−1 j−1 l+1 l+1 l l l 2 8 8 3 16 · 3 16p2 p4 bj dj s=j
where ¬ holds since we use Lemma 12 and the fact that kW(l+1) k2F ≤ bl+1 2 . Case 4: i < j = l + 1
Similar to the Case 2, we also can have
∂ 2 f (w, D) 2
∂ 2 f (w, D) 2
=
.
T ∂w T ∂w
∂w(j)
∂w(i) (j) (i) F
F
∂ 2 f (w,D) So in this case, we can just directly use the bound in case 3 to bound
∂wT ∂w
2
. (j)
(i)
Case 5: i = j = l + 1
In the following, we first consider the first case, i.e. i = l + 1, and bound
F
2
∂(v − y)z T 2
∂v 2 (l)
=
z(l) ⊗ Id
=
l+1
∂w(l+1) ∂w(l+1) F (l+1) F F
T
2 = z(l) ⊗ Idl+1 G (u) z(l) ⊗ Idl+1 F
h
T iT 2 ¬
=
z(l) z(l) ⊗ G (u)
∂ 2 f (w, D)
∂w ∂w (l+1)
F
2 1 2 2 2 ≤ z(l) k4F kG (u) F ≤ r˜ c˜ d dl+1 , 16 l l l
where ¬ holds since for an arbitrary vector u ∈ Rk and an arbitrary matrix M ∈ Rk×k , we have (u ⊗ Ik ) M = u ⊗ M .
2
f (w,D) 2 Now we can bound ∂ ∂w∂w
as follows: F
2
l l X l 2 X X
∂ f (w, D) 2 ∂ 2 f (w, D) 2
∂ 2 f (w, D) 2
∂ f (w, D) 2
=
+2
+2
∂w∂w
∂w
∂w ∂w
∂w ∂w (l+1) ∂w(l+1) F (j) (l+1) F (j) (i) F F j=1 j=1 i=j ! 2l−2 Y l bl+1 4 rw 2 l2 2 2 √ ≤O max r˜i c˜i rj cj (ds ks ) 16p2 8 2p2 k1 4 1≤i,j≤l s=1 #2 " l 2 2 Y ds bs 2 (ks − ss + 1)2 ϑb d l+1 0 2 2 2 . √ l r0 c0 ≤O 2 b1 4 d21 8 2p s=1
Understanding Generalization and Optimization Performance for Deep CNNs
On the other hand, if the activation functions σ1 and σ2 are respectively sigmoid function and softmax function, f (w, D) is infinitely differentiable. Also σ(a), σ 0 (a), σ 00 (a) and σ 000 (a) are all bounded. This means that ∇3w f (w, D) exists. Also since input D and the parameter w are bounded, we can always find a universal constant ν such that D 3 E X k∇3w f (w, D)kop = sup λ⊗ , ∇3w f (w, D) = [∇3w f (w, D)]ijk λi λj λk ≤ ν < +∞. kλk2 ≤1
i,j,k
The proof is completed. Lemma 15. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Suppose Assumption 1 on the input data D holds. Then for any t > 0, the objective f (w, x) obeys ! n 2nt2 1 X (i) (i) f (w, D )−E(f (w, D )) > t ≤ 2 exp − 2 , P n i=1 α where α = 1. Proof. Since the input D (i) (i = 1, · · · , n) are independent from each other, then the output f (w, D (i) ) (i = 1, · · · , n) are also independent. Meanwhile, when the loss is the square loss, we can easily bound 0 ≤ f (w, D (i) ) = 21 kv − yk22 ≤ 1, since the value of entries in v belongs to [0, 1] and y is a one-hot vector label of v. Besides, for arbitrary random variable x, |x − Ex| ≤ |x|. So by Hoeffding’s inequality in Lemma 6, we have ! n 1 X 2nt2 (i) (i) P f (w, D )−E(f (w, D )) > t ≤ exp − 2 , n i=1 α where α = 1. This means that
1 n
Pn
i=1
f (w, D (i) )−E(f (w, D (i) )) has exponential tails.
Lemma 16. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Suppose Assumption 1 on the input data D holds. Then for any t > 0 and arbitrary unit vector λ ∈ Sd−1 , the gradient ∇f (w, x) obeys ! n E 1 X D nt2 (i) (i) P λ, ∇w f (w, D ) −ED∼D ∇w f (w, D ) > t ≤ exp − 2 . n i=1 2β h Pl where β , ϑ˜ rl c˜l dl + i=1
2 2 Ql ϑbl+1 2 di−1 s −ss +1) ri−1 ci−1 s=i ds bs (k16p 2 p2 bi 2 di
i1/2
in which ϑ = 1/8.
Proof. Since the input D (i) (i = 1, · · · , n) are independent from each other, then the output ∇w f (w, D (i) ) (i = 1, · · · , n) are also independent. Furthermore, for arbitrary vector x, kx − Exk22 ≤ kxk22 . Hence, for an arbitrary unit vector λ ∈ Sd−1 Pl where d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1 di , we have hλ, ∇w f (w, D (i) ) − ED∼D ∇w f (w, D (i) )i ≤kλk2 k∇w f (w, D (i) ) − ED∼D ∇w f (w, D (i) )k2 ¬
≤kλk2 k∇w f (w, D (i) )k2 ≤ β, where ¬ holds since kλk2 = 1 (λ ∈ Sd−1 ) and by Lemma 11, we have k∇w f (w, D (i) )k ≤ β where β , h 2 Pl Ql ds bs 2 (ks −ss +1)2 i1/2 di−1 ϑ˜ rl c˜l dl + i=1 ϑbpl+1 r c in which ϑ = 1/8. i−1 i−1 2b 2d s=i 16p2 i i Thus, we can use Hoeffding’s inequality in Lemma 6 to bound
n E 1 X D P λ, ∇w f (w, D (i) ) −ED∼D ∇w f (w, D (i) ) > t n i=1
The proof is completed.
!
nt2 ≤ exp − 2 2β
.
Understanding Generalization and Optimization Performance for Deep CNNs
Lemma 17. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Suppose that Assumption 1 on the input data D and the parameter w holds. Then for any t > 0 and arbitrary unit vector λ ∈ Sd−1 , the Hessian ∇2 f (w, D) obeys P
where γ =
! n E 1 X D nt2 2 (i) 2 (i) λ, (∇w f (w, D ) − ED∼D ∇w f (w, D ))λ > t ≤ 2 exp − 2 . n i=1 2γ
ϑbl+1 2 d20 2 2 2 l r0 c0 b1 4 d21
hQ l
ds bs 2 (ks −ss +1)2 √ s=1 8 2p2
i2 1/2
.
Proof. Since the input D (i) (i = 1, · · · , n) are independent from each other, then the output ∇w f (w, D (i) ) (i = 1, · · · , n) are also independent. On the other hand, for arbitrary random matrix X, kX − EXk2F ≤ kXk2F . Thus, for an arbitrary Pl unit vector λ ∈ Sd−1 where d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1 di , we have hλ, (∇2w f (w, D (i) ) − ED∼D ∇2w f (w, D (i) ))λi ≤kλk2 k(∇2w f (w, D (i) ) − ED∼D ∇2w f (w, D (i) ))λk2 ≤k∇2w f (w, D (i) ) − ED∼D ∇2w f (w, D (i) )kop kλk2 ≤k∇2w f (w, D (i) ) − ED∼D ∇2w f (w, D (i) )kF ≤k∇2w f (w, D (i) )kF ¬
≤γ, where ¬ holds since kλk2 = 1 (λ ∈ Sd−1 ) and by Lemma 14, we have k∇2w f (w, D (i) )k ≤ γ where γ = h i2 1/2 ϑbl+1 2 d20 2 2 2 Ql ds bs 2 (ks −ss +1)2 √ l r c . 2 4 0 0 s=1 b1 d 8 2p2 1
Thus, we can use Hoeffding’s inequality in Lemma 6 to bound
! n E 1 X D nt2 2 (i) 2 (i) λ, (∇w f (w, D ) − ED∼D ∇w f (w, D ))λ > t ≤ exp − 2 . n i=1 2γ
P
The proof is completed. Lemma 18. Suppose that the activation function σ1 is sigmoid and σ2 is softmax, and the loss function f (w, D) is squared loss. Suppose that Assumption 1 on the input data D and the parameter w holds. Then the empirical Hessian converges uniformly to the population Hessian in operator norm. Specifically, there exit two universal constants cv0 and cv such that if hQ i−1 l ds bs 2 (ks −ss +1)2 ν2 √ n ≥ cv0 d%ε , then with probability at least 1 − 2 s=1 8 2p2
e n (w)−∇2 Q(w) sup ∇2 Q
≤ cv γ op w∈Ω
s
2d + θ% + log 2n
4 ε
,
Pl holds with probability at least 1 − ε, where d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1di , θ = al+1 (dl+1+ r˜l c ˜l dl − √ Pl Pl di bi (ki −si +1) 2 n 2al+1 + 1) + i=1 ai (ki di + di−1 − 2ai + 1), % = + log(bl+1 ) + log 128p2 , and i=1 log 4p 1/2 h i 2 2 ϑbl+1 2 d20 2 2 2 Ql −ss +1)2 √s γ= . l r0 c0 s=1 ds bs (k b1 4 d2 8 2p2 1
k Proof. Recall that the weight of each kernel and the feature maps has magnitude bound separately, i.e. w(i) ∈ ki 2 di−1 r˜l c˜l dl dl+1 1 2 f(i) = [vec(W ), vec(W ), · · · , B (rw ) (i = 1, · · · , l; k = 1, · · · , di ) and w ∈B (bl+1 ). Since W (l+1)
d
2
f(i) kF ≤ di bi . vec(W(i)i−1 )] ∈ Rki di ×di−1 , we have kW
(i)
(i)
Understanding Generalization and Optimization Performance for Deep CNNs
f(i,) is the di bi /(bl+1 + Pl di bi )-covering net of the matrix W f(i) which is the set of all parameters So here we assume W i=1 in the i-th layer. Then by Lemma 7, we have the covering number n i ≤
9(bl+1 +
Pl
i=1 di bi )
!ai (ki 2 di +di−1 −2ai +1)
,
f(i) ) ≤ ai for 1 ≤ i ≤ l. For the last layer, we also can construct an bl+1 /(bl+1 + f(i) obeys rank(W since the rank of W Pl i=1 di bi )-covering net for the weight matrix W(l+1) . Here we have n
l+1
≤
9(bl+1 +
Pl
i=1
di bi )
!al+1 (dl+1 +˜rl c˜l dl −2al+1 +1)
,
since the rank of W(l+1) obeys rank(W(l+1) ) ≤ al+1 . Finally, we arrange them together to construct a set Θ and claim that there is always an -covering net w in Θ for any parameter w. Accordingly, we have |Θ| ≤
l+1 Y
i=1
i
n =
9(bl+1 +
Pl
i=1
di bi )
!al+1 (dl+1 +˜rl c˜l dl −2al+1 +1)+Pli=1 ai (ki 2 di +di−1 −2ai +1)
=
9(bl+1 +
Pl
i=1
di bi )
!θ
Pl where θ = al+1 (dl+1 + r˜l c˜l dl − 2al+1 + 1) + i=1 ai (ki 2 di + di−1 − 2ai + 1) which is the total freedom degree of the network. So we can always find a vector wkw ∈ Θ such that kw − wkw k2 ≤ . Now we use the decomposition strategy to bound our goal:
2e
∇ Qn (w) − ∇2 Q(w) op
n
1 X
= ∇2 f (w, D (i) ) − ED∼D (∇2 f (w, D))
n
i=1 op
n n
1 X 1X
= ∇2 f (w, D (i) ) − ∇2 f (wkw , D (i) ) + ∇2 f (wkw , D (i) ) − E(∇2 f (wkw , D))
n n i=1 i=1
2 2 + ED∼D (∇ f (wkw , D)) − ED∼D (∇ f (w, D))
op
n
n
1 X
1 X
∇2 f (w, D (i) ) − ∇2 f (wkw , D (i) ) + ∇2 f (wkw , D (i) ) − ED∼D (∇2 f (wkw , D)) ≤
n
n i=1 i=1 op op
+ ED∼D (∇2 f (wkw , D)) − ED∼D (∇2 f (w, D)) .
op
Here we also define four events E0 , E1 , E2 and E3 as
e n (w) − ∇2 Q(w) E0 = sup ∇2 Q
≥t , op w∈Ω
n
1 X t
∇2 f (w, D (i) ) − ∇2 f (wkw , D (i) ) ≥ , E1 = sup w∈Ω n
3 i=1 op
n
1 X
t
E2 = sup ∇2 f (wkw , D (i) ) − ED∼D (∇2 f (wkw , D)) ≥ ,
wkw ∈Θ n 3 i=1 op
t E3 = sup ED∼D (∇2 f (wkw , D)) − ED∼D (∇2 f (w, D)) op ≥ . 3 w∈Ω
,
Understanding Generalization and Optimization Performance for Deep CNNs
Accordingly, we have P (E0 ) ≤ P (E1 ) + P (E2 ) + P (E3 ) .
So we can respectively bound P (E1 ), P (E2 ) and P (E3 ) to bound P (E0 ).
Step 1. Bound P (E1 ): We first bound P (E1 ) as follows:
! n
1 X t
2 (i) 2 (i) P (E1 ) =P sup ∇ f (w, D ) − ∇ f (wkw , D ) ≥
3 w∈Ω n i=1 2
!
n
1 X ¬3
2 (i) 2 (i) ≤ ED∼D sup ∇ f (w, D ) − ∇ f (wkw , D )
t w∈Ω n i=1 2
2
3 2 ≤ ED∼D sup ∇ f (w, D) − ∇ f (wkw , D) 2 t w∈Ω !
1 Pn 2 (i) 2 (i)
3 i=1 ∇ f (w, D ) − ∇ f (wkw , D ) 2 n sup kw − wkw k2 ≤ ED∼D sup t kw − wkw k2 w∈Ω w∈Ω
≤
3ν , t
where ¬ holds since by Markov inequality and holds because of Lemma 14. Therefore, we can set t≥
6ν . ε
Then we can bound P(E1 ): P(E1 ) ≤
ε . 2
Step 2. Bound P (E2 ): By Lemma 3, we know that for any matrix X ∈ Rd×d , its operator norm can be computed as kXkop ≤
1 sup |hλ, Xλi| . 1 − 2 λ∈λ
where λ = {λ1 , . . . , λkw } be an -covering net of Bd (1).
Pl Let λ1/4 be the 14 -covering net of Bd (1), where d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1 di . Recall that we use Θ to denote the -net P θ Ql+1 3(bl+1 + li=1 di bi ) of wkw and we have |Θ| ≤ i=1 n i = . Then we can bound P (E2 ) as follows:
n !
1 X t
2 (i) 2 P (E2 ) =P ∇ f (wkw , D ) − ED∼D (∇ f (wkw , D)) ≥ sup
3 wkw ∈Θ n i=1 2 * ! + ! n 1X 2 t ≤P sup 2 λ, ∇ f (wkw , D (i) ) − ED∼D ∇2 f (wkw , D) λ ≥ 3 n i=1 wkw ∈Θ,λ∈λ1/4 ! θ Pl X 1 n t 3(bl+1 + i=1 di bi ) d 2 (i) 2 ≤12 sup P λ, ∇ f (wkw , D ) − ED∼D ∇ f (wkw , D) λ ≥ n i=1 6 wkw ∈Θ,λ∈λ1/4 !θ Pl ¬ 3(bl+1 + i=1 di bi ) nt2 d ≤12 2 exp − , 72γ 2
where ¬ holds since by Lemma 17, we have P
! n E 1 X D nt2 2 (i) 2 (i) λ, (∇w f (w, D ) − ED∼D ∇w f (w, D ))λ > t ≤ exp − 2 . n i=1 2γ
Understanding Generalization and Optimization Performance for Deep CNNs
where γ =
ϑbl+1 2 d20 2 2 2 l r0 c0 b1 4 d21
Thus, if we set
hQ
l ds bs 2 (ks −ss +1)2 √ s=1 8 2p2
t≥ then we have
i2 1/2
.
v P u u 72γ 2 d log(12) + θ log 3(bl+1 + li=1 di bi ) + log t n
P (E2 ) ≤
4 ε
,
ε . 2
Step 3. Bound P (E3 ): We first bound P (E3 ) as follows:
t P (E3 ) =P sup ED∼D (∇2 f (wkw , D)) − ED∼D (∇2 f (w, D)) 2 ≥ 3 w∈Ω
2
t ≤P ED∼D sup (∇ f (wkw , D) − ∇2 f (w, D) 2 ≥ 3 w∈Ω ! 1 Pn 2 (i) 2 (i) t i=1 ∇ f (w, D ) − ∇ f (wkw , D ) n ≤P sup sup kw − wkw k2 ≥ kw − wkw k2 3 w∈Ω w∈Ω ¬ t , ≤P ν ≥ 3 where ¬ holds because of Lemma 14. We set enough small such that ν < t/3 always holds. Then it yields P (E3 ) = 0. i− 12 h P 36(bl+1 + li=1 di bi ) Ql ds bs 2 (ks −ss +1)2 √ Step 4. Final result: To ensure P(E0 ) ≤ ε, we just set = . Note that s=1 ϑ2 nbl+1 8 2p2 6ν ε
> 3ν. Thus we can obtain
2
v P u 3(bl+1 + li=1 di bi ) u + log 6ν t 72γ 2 d log(12) + θ log , t ≥ max ε n
ν Thus, if n ≥ cv0 d%ε 2
hQ
l ds bs 2 (ks −ss +1)2 √ s=1 8 2p2
e n (w)−∇2 Q(w) sup ∇2 Q cv γ
≤ˆ op w∈Ω
=cv γ
i−1
4 ε
.
where cv0 is a constant, there exists a universal constant cv such that
v √ u ds bs (ks −ss +1) u d log(12) + θ Pl log n + log(b ) + log + log 2 l+1 t i=1 4p 128p s
n
2d + θ% + log 2n
4 ε
Pl holds with probability at least 1 − ε, where d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1di , θ = al+1 (dl+1 + r˜l c˜l dl − √ Pl Pl di bi (ki −si +1) n 2al+1 + 1) + i=1 ai (ki 2 di + di−1 − 2ai + 1), % = + log(bl+1 ) + log 128p 2 , and i=1 log 4p 1/2 h i2 2 ϑbl+1 2 d20 2 2 2 Ql −ss +1)2 √s γ= l r0 c0 s=1 ds bs (k . The proof is completed. 2 b1 4 d2 8 2p 1
D D.1
Proofs of Main Theorems Proof of Lemma 1
k Proof. Recall that the weight of each kernel and the feature maps has magnitude bound separately, i.e. w(i) ∈ ki 2 di−1 r˜l c˜l dl dl+1 1 2 f(i) = [vec(W ), vec(W ), · · · , B (rw ) (i = 1, · · · , l; k = 1, · · · , di ) and w ∈B (bl+1 ). Since W (l+1)
d
2
f(i) kF ≤ di bi . vec(W(i)i−1 )] ∈ Rki di ×di−1 , we have kW
(i)
(i)
4 ε
Understanding Generalization and Optimization Performance for Deep CNNs
f(i,) is the di bi /(bl+1 + Pl di bi )-covering net of the matrix W f(i) which is the set of all parameters So here we assume W i=1 in the i-th layer. Then by Lemma 7, we have the covering number i
n ≤
9(bl+1 +
Pl
i=1
di bi )
!ai (ki 2 di +di−1 −2ai +1)
,
f(i) ) ≤ ai for 1 ≤ i ≤ l. For the last layer, we also can construct an bl+1 /(bl+1 + f(i) obeys rank(W since the rank of W Pl i=1 di bi )-covering net for the weight matrix W(l+1) . Here we have n
l+1
≤
9(bl+1 +
Pl
i=1
di bi )
!al+1 (dl+1 +˜rl c˜l dl −2al+1 +1)
,
since the rank of W(l+1) obeys rank(W(l+1) ) ≤ al+1 . Finally, we arrange them together to construct a set Θ and claim that there is always an -covering net w in Θ for any parameter w. Accordingly, we have |Θ| ≤
l+1 Y
i
n =
9(bl+1 +
i=1
Pl
i=1 di bi )
!al+1 (dl+1 +˜rl c˜l dl −2al+1 +1)+Pli=1 ai (ki 2 di +di−1 −2ai +1)
=
9(bl+1 +
Pl
i=1
di bi )
!θ
Pl where θ = al+1 (dl+1 + r˜l c˜l dl − 2al+1 + 1) + i=1 ai (ki 2 di + di−1 − 2ai + 1) which is the total freedom degree of the network. So we can always find a vector wkw ∈ Θ such that kw − wkw k2 ≤ . Now we use the decomposition strategy to bound our goal: n 1 X e (i) f (w, D ) − ED∼D (f (w, D)) Qn (w) − Q(w) = n i=1 n n 1X 1X (i) (i) (i) = f (wkw , D )−Ef (wkw , D)+ED∼D f (wkw , D)−ED∼D f (w, D) f (w, D )−f (wkw , D ) + n n i=1 i=1 n n 1 X 1X f (wkw , D (i) )−ED∼D f (wkw , D) f (w, D (i) )−f (wkw , D (i) ) + ≤ n n i=1 i=1 + ED∼D f (wkw , D)−ED∼D f (w, D) .
Then, we define four events E0 , E1 , E2 and E3 as e E0 = sup Qn (w) − Q(w) ≥ t , w∈Ω ( ) n 1 X t (i) E1 = sup f (w, D ) − f (wkw , x(i) ) ≥ , 3 w∈Ω n i=1 n ( ) 1X t (i) E2 = sup f (wkw , D )−ED∼D (f (wkw , D)) ≥ , 3 wkw ∈Θ n i=1 ( ) t E3 = sup ED∼D (f (wkw , D))−ED∼D (f (w, D)) ≥ . 3 w∈Ω
Accordingly, we have
P (E0 ) ≤ P (E1 ) + P (E2 ) + P (E3 ) .
So we can respectively bound P (E1 ), P (E2 ) and P (E3 ) to bound P (E0 ).
,
Understanding Generalization and Optimization Performance for Deep CNNs
Step 1. Bound P (E1 ): We first bound P (E1 ) as follows: ! n 1 X t P (E1 ) =P sup f (w, D (i) ) − f (wkw , D (i) ) ≥ 3 w∈Ω n i=1 n ! 1 X ¬3 (i) (i) ≤ ED∼D sup f (w, D ) − f (wkw , D ) t w∈Ω n i=1 ! 1 Pn (i) (i) f (w, D ) − f (w , D ) 3 k w ≤ ED∼D sup n i=1 sup kw − wkw k2 t kw − wkw k2 w∈Ω w∈Ω
3
e (w, D) ≤ ED∼D sup ∇Q
, n t 2 w∈Ω
where ¬ holds since by Markov inequality, we have that for an arbitrary nonnegative random variable x, then P(x ≥ t) ≤
E(x) . t
e
Now we only need to bound ED∼D supw∈Ω ∇Q n (w, D) . Therefore, by Lemma 11, we have 2
e
ED∼D sup ∇Q (w, D)
= ED∼D n 2
w∈Ω
h
where β , ϑ˜ rl c˜l dl +
Pl
i=1
!
n
1 X
(i) ∇f (w, D ) ≤ ED∼D supk∇f (w, D)k2 ≤ β. sup
w∈Ω w∈Ω n i=1 2
2 Ql ϑbl+1 2 di−1 s −ss +1) ri−1 ci−1 s=i ds bs (k16p 2 p2 bi 2 di
P (E1 ) ≤ We further let t≥
in which ϑ = 1/8. Therefore, we have
6β . ε
Then we can bound P(E1 ): P(E1 ) ≤ Step 2.
3β . t
i 2 1/2
ε . 2
Bound P (E2 ): Recall that we use Θ to denote the index of wkw and we have |Θ| ≤ θ P 9(bl+1 + li=1 di bi ) . We can bound P (E2 ) as follows:
Ql+1
i=1
n ! 1 X t (i) P (E2 ) =P sup f (wkw , D ) − ED∼D (f (wkw , D)) ≥ 3 wkw ∈Θ n i=1 ! ! θ Pl n 1 X 9(bl+1 + i=1 di bi ) t ≤ sup P f (wkw , D (i) ) − ED∼D (f (wkw , D)) ≥ n i=1 3 wkw ∈Θ !θ Pl ¬ 9(bl+1 + i=1 di bi ) 2nt2 ≤ 2 exp − 2 , α
where ¬ holds because in Lemma 15, we have ! n 1 X 2nt2 (i) (i) f (w, D )−E(f (w, D )) > t ≤ exp − 2 , P n i=1 α
n i =
Understanding Generalization and Optimization Performance for Deep CNNs
where α = 1. Thus, if we set t≥ then we have
v P u u α2 θ log 9(bl+1 + li=1 di bi ) + log t
4 ε
2n
P (E2 ) ≤
,
ε . 2
Step 3. Bound P (E3 ): We first bound P (E3 ) as follows:
t P (E3 ) =P sup kED∼D (f (wkw , D)) − ED∼D (f (w, D))k2 ≥ 3 w∈Ω kED∼D (f (wkw , D) − f (w, D)k2 ) t =P sup sup kw − wkw k2 ≥ kw − wkw k2 3 w∈Ω w∈Ω t ≤P ED∼D sup k∇Qw (w, D)k2 ≥ 3 w∈Ω ¬ t ≤P β ≥ , 3 where ¬ holds since we utilize Lemma 11. We set enough small such that β < t/3 always holds. Then it yields P (E3 ) = 0. h i− 21 P 18p2 (bl+1 + li=1 di bi ) Ql ds bs 2 (ks −ss +1)2 Step 4. Final result: To ensure P(E0 ) ≤ ε, we just set = . Note that s=1 ϑ2 nbl+1 16p2 6β ε
> 3β due to ε ≤ 1. Thus we can obtain
v P u 9(bl+1 + li=1 di bi ) u + log 6β t α2 θ log t ≥ max ε , 2n
By comparing the values of α, we can observe that if n ≥ cf 0 exists such a universal constant cf such that e sup Q n (w)−Q(w) ≤ α
w∈Ω
4 ε
.
P √ l2 (bl+1 + li=1 di bi )2 maxi ri ci θ%ε2
where cf 0 is a constant, there
v √ u P l ds bs (ks −ss +1) uθ n log +log + log(b )+log l+1 t i=1 4p 128p2 2n
4 ε
=
s
θ%+log 2n
4 ε
Pl holds with probability at least 1 − ε, where θ = al+1 (dl+1 + r˜l c˜l dl − 2al+1 + 1) + i=1 ai (ki 2 di + di−1 − 2ai + 1), √ Pl n % = i=1 log di bi (k4pi −si +1) + log(bl+1 ) + log 128p 2 , and α = 1. The proof is completed. D.2
Proof of Theorem 1
Proof. By Lemma 1 in the manuscript, we know that if n ≥ cf 0 l2 (bl+1 + universal constant, then with probability at least 1 − ε, we have e sup Q n (w) − Q(w) ≤
w∈Ω
s
Pl
i=1
θ% + log 2n
4 ε
di bi )2 maxi
√
ri ci /(θ%ε2 ) where cf 0 is a
,
where the total √freedom degree θ of the network is θ = al+1 (dl+1 + r˜l c˜l dl + 1) + Pl di bi (ki −si +1) n % = i=1 log + log(bl+1 ) + log 128p 2 . 4p
Pl
i=1
ai (ki 2 di−1 + di + 1) and
Understanding Generalization and Optimization Performance for Deep CNNs
Thus based on such a result, we can derive the following generalization bound: e n (w)) e −Q e ≤ ES∼D sup ES∼D EA (Q(w)
w∈Ω
e e Q (w) − Q(w) n ≤ sup Q n (w) − Q(w) ≤ w∈Ω
s
4 ε
θ% + log 2n
.
Thus, the conclusion holds. The proof is completed. D.3
Proof of Theorem 2
k Proof. Recall that the weight of each kernel and the feature maps has magnitude bound separately, i.e. w(i) ∈ 2 r˜l c˜l dl dl+1 ki di−1 1 2 f ∈B (bl+1 ). Since W(i) = [vec(W ), vec(W ), · · · , B (rw ) (i = 1, · · · , l; k = 1, · · · , di ) and w (l+1)
(i)
(i)
2 d f(i) kF ≤ di bi . vec(W(i)i−1 )] ∈ Rki di−1 ×di , we have kW f(i,) is the di bi /(bl+1 + Pl di bi )-covering net of the matrix W f(i) which is the set of all parameters So here we assume W i=1 in the i-th layer. Then by Lemma 7, we have the covering number
i
n ≤
9(bl+1 +
Pl
i=1
di bi )
!ai (ki 2 di−1 +di −2ai +1)
,
f(i) obeys rank(W f(i) ) ≤ ai for 1 ≤ i ≤ l. For the last layer, we also can construct an bl+1 /(bl+1 + since the rank of W Pl i=1 di bi )-covering net for the weight matrix W(l+1) . Here we have n
l+1
≤
9(bl+1 +
Pl
i=1
di bi )
!al+1 (dl+1 +˜rl c˜l dl −2al+1 +1)
,
since the rank of W(l+1) obeys rank(W(l+1) ) ≤ al+1 . Finally, we arrange them together to construct a set Θ and claim that there is always an -covering net w in Θ for any parameter w. Accordingly, we have |Θ| ≤
l+1 Y
i
n =
9(bl+1 +
Pl
i=1 di bi )
i=1
!al+1 (dl+1 +˜rl c˜l dl −2al+1 +1)+Pli=1 ai (ki 2 di−1 +di −2ai +1)
=
9(bl+1 +
Pl
i=1
di bi )
!θ
Pl where θ = al+1 (dl+1 + r˜l c˜l dl − 2al+1 + 1) + i=1 ai (ki 2 di−1 + di − 2ai + 1) which is the total freedom degree of the
network. So we can
always find a vector wkw ∈ Θ such that kw − wkw k2 ≤ . Accordingly, we can decompose
e
∇Qn (w) − ∇Q(w) as 2
e
∇Qn (w) − ∇Q(w) 2
n
1 X
∇f (w, D (i) ) − ED∼D (∇f (w, D)) =
n
i=1 2
n n
1 X 1X
= ∇f (w, D (i) ) − ∇f (wkw , D (i) ) + ∇f (wkw , D (i) ) − ED∼D (∇f (wkw , D))
n n i=1 i=1
+ ED∼D (∇f (wkw , D)) − ED∼D (∇f (w, D))
n
2 n
1 X
1 X
∇f (w, D (i) ) − ∇f (wkw , D (i) ) + ∇f (wkw , D (i) ) − ED∼D (∇f (wkw , D)) ≤
n
n i=1 i=1 2 2
+ ED∼D (∇f (wkw , D)) − ED∼D (∇f (w, D)) .
2
,
Understanding Generalization and Optimization Performance for Deep CNNs
Here we also define four events E0 , E1 , E2 and E3 as
e ≥ t , (w) − ∇Q(w) E0 = sup ∇Q
n 2 w∈Ω
( ) n
1 X t
(i) (i) E1 = sup ∇f (w, D ) − ∇f (wkw , D ) ≥ ,
3 w∈Ω n i=1 2
n
( )
1 X
t
(i) E2 = sup ∇f (wkw , D ) − ED∼D (∇f (wkw , D)) ≥ ,
3 wkw ∈Θ n i=1 2
) (
t
. E3 = sup ED∼D (∇f (wkw , D)) − ED∼D (∇f (w, D)) ≥
3 w∈Ω 2
Accordingly, we have
P (E0 ) ≤ P (E1 ) + P (E2 ) + P (E3 ) .
So we can respectively bound P (E1 ), P (E2 ) and P (E3 ) to bound P (E0 ).
Step 1. Bound P (E1 ): We first bound P (E1 ) as follows:
n
!
1 X t
(i) (i) P (E1 ) =P sup ∇f (w, D ) − ∇f (wkw , D ) ≥
3 w∈Ω n i=1 2
! n
1 X ¬3
(i) (i) ≤ ED∼D sup ∇f (w, D ) − ∇f (wkw , D )
t w∈Ω n i=1 2 !
1 Pn (i) (i)
∇f (w, D ) − ∇f (w , D ) 3 kw i=1 n 2 sup kw − wkw k2 ≤ ED∼D sup t kw − wkw k2 w∈Ω w∈Ω
3
2e
≤ ED∼D sup ∇ Qn (w, D) , t 2 w∈Ω
where ¬ holds since by Markov inequality, we have that for an arbitrary nonnegative random variable x, then P(x ≥ t) ≤ E(x) t .
e n (w, D) Now we only need to bound ED∼D supw∈Ω ∇2 Q
. Here we utilize Lemma 14 to achieve this goal: 2
ED∼D
where γ =
2
2e
2 ∗
sup ∇ Qn (w, D) ≤ ED∼D sup ∇ f (w, D) − ∇ f (w , D) 2 ≤ γ. 2
w∈Ω
ϑbl+1 2 d20 2 2 2 l r0 c0 b1 4 d21
hQ
l ds bs 2 (ks −ss +1)2 √ s=1 8 2p2
w∈Ω
i2 1/2
. Therefore, we have
P (E1 ) ≤ We further let t≥
3γ . t
6γ . ε
Then we can bound P(E1 ): P(E1 ) ≤
ε . 2
Step 2. Bound P (E2 ): By Lemma 2, we know that for any vector x ∈ Rd , its `2 -norm can be computed as kxk2 ≤
1 sup hλ, xi . 1 − λ∈λ
Understanding Generalization and Optimization Performance for Deep CNNs
where λ = {λ1 , . . . , λkw } be an -covering net of Bd (1).
Pl Let λ be the 12 -covering net of Bd (1), where d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1 di . Recall that we use Θ to denote the index of θ P Ql+1 3(bl+1 + li=1 di bi ) wkw so that kw − wkw k ≤ . Besides, |Θ| ≤ i=1 n i = . Then we can bound P (E2 ) as follows:
n
!
1 X
t
(i) ∇f (wkw , D ) − ED∼D (∇f (wkw , D)) ≥ P (E2 ) =P sup
3 wkw ∈Θ n i=1 2 * ! + n 1X t (i) =P sup 2 λ, ∇f (wkw , D ) − ED∼D (∇f (wkw , D)) ≥ n i=1 3 wkw ∈Θ,λ∈λ1/2 ! θ Pl X n 9(bl+1 + i=1 di bi ) 1 t d (i) ≤6 sup P λ, ∇f (wkw , D ) − ED∼D (∇f (wkw , D)) ≥ n i=1 6 wkw ∈Θ,λ∈λ1/2 !θ Pl ¬ 9(bl+1 + i=1 di bi ) nt2 d , 2 exp − ≤6 72β 2
where ¬ holds since by Lemma 16, we have n E 1 X D λ, ∇w f (w, D (i) ) −ED∼D ∇w f (w, D (i) ) > t P n i=1
h Pl where β , ϑ˜ rl c˜l dl + i=1 Thus, if we set
2 2 Ql ϑbl+1 2 di−1 s −ss +1) ri−1 ci−1 s=i ds bs (k16p 2 p2 bi 2 di
t≥ then we have
i1/2
!
nt2 ≤ exp − 2 2β
.
in which ϑ = 1/8.
v P u u 72β 2 d log(6) + θ log 9(bl+1 + li=1 di bi ) + log t n
P (E2 ) ≤
4 ε
,
ε . 2
Step 3. Bound P (E3 ): We first bound P (E3 ) as follows: t P (E3 ) =P sup kE(∇f (wkw , x)) − ED∼D (∇f (w, x))k2 ≥ 3 w∈Ω kED∼D (∇f (wkw , x) − ∇f (w, x)k2 ) t sup kw − wkw k2 ≥ =P sup kw − wkw k2 3 w∈Ω w∈Ω
t
e n (w, x) ≤P ED∼D sup ∇2 Q
≥ 3 2 w∈Ω t ≤P γ ≥ . 3 We set enough small such that γ < t/3 always holds. Then it yields P (E3 ) = 0. Step 4.
6β ≥ ε i− 12 ds bs 2 (ks −ss +1)2 . s=i 16p2
Final result: h P 18p2 (bl+1 + li=1 di bi ) Ql ϑ2 nbl+1
Note that
3β.
Finally, to ensure P(E0 )
v P u 9(bl+1 + li=1 di bi ) u + log 6γ t 72β 2 d log(6) + θ log t ≥ max ε , n
≤
4 ε
ε, we just set
.
=
Understanding Generalization and Optimization Performance for Deep CNNs l2 b
2
(b
P + l
d b )2 (r c d )4
0 0 0 l+1 i=1 i i By comparing the values of β and γ, we have if n ≥ cg0 dl+1 where cg0 is a universal constant, 4 b 8 (d log(6)+θ%)ε2 max (r c ) i i i 0 1 then there exists a universal constant cg such that v h √ u i ds bs (ks −ss +1) u d + 1 θ Pl log n 4
+log(b )+log +log 2 l+1 t i=1 log(6) 4p 128p ε
e n (w)−∇w Q(w) sup ∇w Q
≤cg β n 2 w∈Ω s d + 12 θ%+ 12 log 4ε ≤cg β , n Pl holds with probability at least 1 − ε, where d = r˜lc˜l dl dl+1 + i=1 ki 2 di−1 di , θ = al+1 (dl+1 + r˜l c˜l dl + √ Pl Pl di bi (ki −si +1) 2 n = + log(bl+1 ) + log 128p and β , 1) + 2 , i=1 log i=1 ai (ki di−1 + di + 1), % 4p h i Pl ϑbl+1 2 di−1 Ql ds bs 2 (ks −ss +1)2 1/2 ϑ˜ rl c˜l dl + i=1 p2 bi 2 di ri−1 ci−1 s=i in which ϑ = 1/8. The proof is completed. 16p2
D.4
Proof of Corollary 1
Proof. By Theorem 2, we know that there exist universal constants cg0 and cg such that if n P 2 l2 b (bl+1 + li=1 di bi )2 (r0 c0 d0 )4 cg0 dl+1 4 b 8 (d log(6)+θ%)ε2 max (r c ) , i i i 0 1
then
e n (w) − ∇w Q(w) sup ∇w Q
≤ cg β 2
w∈Ω
s
2d + θ% + log 2n
4 ε
≥
holds with probability at least 1 − ε, where % is provided in Lemma 1. Here β and d are defined as β = h Pl bl+1 2 di−1 Ql dj bj 2 (kj −sj +1)2 i1/2 Pl rl cl dl + r c and d = r˜l c˜l dl dl+1 + i=1 ki 2 di−1 di , respectively. 2 2 i−1 i−1 2 i=1 8p bi di j=i 8p 16p2 So based on such a result, we can derive that if n ≥ c2g (2d + θ% + log(4/ε))β 2 /(2), then we have s
√ 2d + θ% + log
e n (w) e n (w) e 2 ≤ ∇w Q e + ∇w Q e − ∇w Q(w) e ≤ + cg β k∇Q(w)k 2n 2 2
4 ε
√ ≤ 2 .
e is a 4-approximate stationary point in population risk with probability e 22 ≤ 4, which means that w Thus, we have k∇Q(w)k at least 1 − ε. The proof is completed. D.5
Proof of Theorem 3
Proof. Suppose that {w(1) , w(2) , · · · , w(m) } are the non-degenerate critical points of Q(w). So for any w(k) , it obeys inf λki ∇2 Q(w(k) ) ≥ ζ, i
where λki ∇2 Q(w(k) ) denotes the i-th eigenvalue of the Hessian ∇2 Q(w(k) ) and ζ is a constant. We further define a d 2 set D = {w ∈ R | k∇Q(w)k2 ≤ and inf i |λi ∇ Q(w(k) ) | ≥ ζ}. According to Lemma 5, D = ∪∞ k=1 Dk where each Dk is a disjoint component with w(k) ∈ Dk for k ≤ m and Dk does not contain any critical point of Q(w) for k ≥ m + 1. On the other hand, by the continuity of ∇Q(w), it yields k∇Q(w)k2 = for w ∈ ∂Dk . Notice, we set the value of blow which is actually a function related to n. Then by utilizing Theorem 2, we let sample number n sufficient large such that
e
sup ∇Q (w)−∇Q(w)
≤ n 2 2 w∈Ω
holds with probability at least 1 − ε, where is defined as v √ u ds bs (ks −ss +1) u d log(6) + θ Pl log n + log(bl+1 ) + log 128p + log 2 t i=1 4p , cg β 2 n
4 ε
.
Understanding Generalization and Optimization Performance for Deep CNNs
This further gives that for arbitrary w ∈ Dk , we have
e e inf t∇Q n (w) + (1 − t)∇Q(w) = inf t ∇Qn (w) − ∇Q(w) + ∇Q(w) w∈Dk w∈Dk 2 2
e ≥ inf k∇Q(w)k2 − sup t ∇Qn (w) − ∇Q(w) w∈Dk
w∈Dk
≥ . 2 Similarly, by utilizing Lemma 18, let n be sufficient large such that
ζ
e n (w) − ∇2 Q(w) sup ∇2 Q
≤ 2 op w∈Ω
holds with probability at least 1 − ε, where ζ satisfies
ζ ≥ cv γ 2
s
d + θ% + log n
4 ε
2
(6)
.
e n (w) for arbitrary w ∈ Dk as Assume that b ∈ Rd is a vector and satisfies bT b = 1. In this case, we can bound λki ∇2 Q follows: e n (w)b e n (w) = inf min bT ∇2 Q inf λki ∇2 Q w∈Dk w∈Dk bT b=1 e n (w) − ∇2 Q(w) b + bT ∇2 Q(w)b = inf min bT ∇2 Q w∈Dk bT b=1 e n (w) − ∇2 Q(w) b ≥ inf min bT ∇2 Q(w)b − min bT ∇2 Q w∈Dk bT b=1 bT b=1 (7) T 2 e n (w) − ∇2 Q(w) b ≥ inf min b ∇ Q(w)b − max bT ∇2 Q T T w∈Dk b b=1 b b=1
k 2 e n (w) − ∇2 Q(w) = inf inf |λi ∇ f (w(k) , x) | − ∇2 Q
w∈Dk i op ζ ≥ . 2 2 e This means that in each set Dk , ∇ Qn (w) has no zero eigenvalues. Then, combine this and Eqn. (6), by Lemma 4 we know e n (w) has also no critical point in Dk ; that if the population risk Q(w) has no critical point in Dk , then the empirical risk Q otherwise it also holds. e n (w). Assume that in Dk , Q(w) has Now we bound the distance between the corresponding critical points of Q(w) and Q (k) e a unique critical point w(k) and Qn (w) also has a unique critical point wn . Then, there exists t ∈ [0, 1] such that for any z ∈ ∂Bd (1), we have ≥k∇Q(wn(k) )k2
= max h∇Q(wn(k) ), zi z T z=1
= max h∇Q(w(k) ), zi + h∇2 Q(w(k) + t(wn(k) − w(k) ))(wn(k) − w(k) ), zi z T z=1
E1/2 ¬D 2 ≥ ∇2 Q(w(k) ) (wn(k) − w(k) ), (wn(k) − w(k) )
≥ζkwn(k) − w(k) k2 ,
(k)
where ¬ holds since ∇Q(w(k) ) = 0 and holds since w(k) + t(wn − w(k) ) is in Dk and for any w ∈ Dk we have 2 P l bl+1 2 (bl+1 + li=1 di bi )2 (r0 c0 d0 )4 d+θ% , inf i |λi ∇2 Q(w) | ≥ ζ. So if n ≥ ch max where ch is a constant, then 4 8 2 2 ζ d0 b1 d%ε maxi (ri ci ) v √ u ds bs (ks −ss +1) u d log(6) + θ Pl log n + log(bl+1 ) + log 128p + log 4ε 2 t i=1 4p 2c β g kwn(k) − w(k) k2 ≤ ζ n
Understanding Generalization and Optimization Performance for Deep CNNs
holds with probability at least 1 − ε. D.6
Proof of Corollary 2
Proof. By Theorem 3, we know that the non-degenerate stationary point w(k) in the m non-degenerate stationary points in (k)
population risk, denoted by {w(1) , w(2) , · · · , w(m) } uniquely corresponding to a non-degenerate stationary point wn in the empirical risk. On the other hand, for any w(k) , it obeys inf λki ∇2 Q(w(k) ) ≥ ζ, i
where ∇ Q(w(k) ) denotes the i-th eigenvalue of the Hessian ∇2 Q(w(k) ) and ζ is a constant. We further define a d 2 set D = {w ∈ R | k∇Q(w)k2 ≤ and inf i |λi ∇ Q(w(k) ) | ≥ ζ}. According to Lemma 5, D = ∪∞ k=1 Dk where each Dk is a disjoint component with w(k) ∈ Dk for k ≤ m and Dk does not contain any critical point of Q(w) for k ≥ m + 1. λki
2
(k)
(k)
Then wn also belong to the component Dk due to the unique corresponding relation between w(k) and wn . Then from Eqn. (6) and (7), we know that if the assumptions in Theorem 3 hold, then for arbitrary w ∈ Dk and t ∈ (0, 1),
ζ
e
k 2 e inf t∇Q (w) + (1 − t)∇Q(w) ≥ and inf ∇ Q (w)
λ ≥ , n n i w∈Dk w∈Dk 2 2 2
e n (w) has no zero eigenvalues. Then, combine this and where and ζ are constants. This means that in each set Dk , ∇2 Q e n (w) also Eqn. (6), we can obtain that in Dk , if Q(w) has a unique critical point w(k) with non-degenerate index sk , then Q n has a unique critical point w(k) in Dk with the same non-degenerate index sk . Namely, the number of negative eigenvalues (k)
(k)
of the Hessian matrices ∇2 Q(w(k) ) and ∇2 Q(wn ) are the same. This further gives that if one of the pair (w(k) , wn ) is a local minimum or saddle point, then another one is also a local minimum or a saddle point. The proof is completed.
E E.1
Proof of Auxiliary Lemmas Proof of Lemma 8
Proof. (1) Since G (z) is a diagonal matrix and its diagonal values are upper bounded by σ1 (zi )(1 − σ1 (z)) ≤ 1/4 where zi denotes the i-th entry of zi , we can conclude kG (z)M k2F ≤
1 kM k2F 16
and
kN G (z)k2F ≤
1 kN k2F . 16
(2) The operator Q (·) maps a vector z ∈ Rd into a matrix of size d2 × d whose ((i − 1)d + i, i) (i = 1, · · · , d) entry equal to σ1 (zi )(1 − σ1 (zi ))(1 − 2σ1 (zi )) and rest entries are all 0. This gives 1 σ1 (zi )(1 − σ1 (zi ))(1 − 2σ1 (zi )) = (3σ1 (zi ))(1 − σ1 (zi ))(1 − 2σ1 (zi )) 3 3 1 3σ1 (zi ) + 1 − σ1 (zi ) + 1 − 2σ1 (zi ) ≤ 3 3 23 ≤ 4. 3 This means the maximal value in Q (z) is at most kQ (z)M k2F ≤
23 34 .
Consider the structure in Q (z), we can obtain
26 kM k2F 38
and
kN Q (z)k2F ≤
26 kN k2F . 38
(3) up (M ) represents conducting upsampling on M ∈ Rs×t×q . Let N = up (M ) ∈ Rps×pt×q . Specifically, for each slice N (:, :, i) (i = 1, · · · , q), we have N (:, :, i) = up (M (:, :, i)). It actually upsamples each entry M (g, h, i) into a matrix of p2 same entries p12 M (g, h, i). So it is easy to obtain kup (M )k2F ≤
1 kM k2F . p2
Understanding Generalization and Optimization Performance for Deep CNNs
(4) Let M = W (:, :, i) and N = δei+1 (:, : i). Assume that H = M ~ N ∈ Rm1 ×m2 , where m1 = r˜i−1 − 2ki + 2 and m2 = c˜i−1 − 2ki + 2. Then we have kHk2F =
m1 X m2 X i=1 j=1
|H(i, j)|2 =
m1 X m2 X i=1 j=1
hMΩi,j , N i2 ≤
m1 X m2 X i=1 j=1
kMΩi,j k2F kN k2F ,
where Ωi,j denotes the entry index of M for the (i, j)-th convolution operation (i.e. computing the H(i, j)). Since for P each convolution computing, each element in M is involved at most one time, we can claim that any element m1 Pm2 in M in i=1 kM k2F occurs at most (ki − si + 1)2 since there are si − 1 rows and columns between each Ω i,j j=1 neighboring nonzero entries in N which is decided by the definition of δe in Sec. B.1. Therefore, we have i+1
m1 X m2 X i=1 j=1
kMΩi,j k2F ≤ (ki − si + 1)2 kM k2F ,
which further gives e k2F ≤ (ki − si + 1)2 kM k2F kN k2F . kM ~N
Consider all the slices in δei+1 , we can obtain
e k2F ≤ (ki − si + 1)2 kW k2F kδei+1 k2F . kδei+1 ~W
(5) Since for softmax activation function σ2 , we have y, we can obtain
Pdl+1 i=1
vi = 1 (vi ≥ 0) and there is only one nonzero entry (i.e. 1) in
0 ≤ kv − yk22 = kvk22 + kyk22 − 2hv, yi = 2 − 2hv, yi ≤ 2. The proof is completed.
References Candes, E. and Plan, Y. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE TIT, 57(4):2342–2359, 2009. Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301): 13–30, 1963. Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for non-convex losses. Annals of Statistics, 2017. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices, compressed sensing. Cambridge Univ. Press, Cambridge, pp. 210–268, 2012.