Scalable Adaptive Stochastic Optimization Using Random Projections
Gabriel Krummenacher♦∗  [email protected]
Yannic Kilcher♦  [email protected]
Brian McWilliams♥∗  [email protected]
Joachim M. Buhmann♦  [email protected]
Nicolai Meinshausen♣  [email protected]

♦ Institute for Machine Learning, Department of Computer Science, ETH Zürich, Switzerland
♣ Seminar for Statistics, Department of Mathematics, ETH Zürich, Switzerland
♥ Disney Research, Zürich, Switzerland
∗ Authors contributed equally.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Abstract

Adaptive stochastic gradient methods such as AdaGrad have gained popularity in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second order information by accumulating past gradients which are used to tune the step size adaptively. In certain situations the full-matrix variant of AdaGrad is expected to attain better performance, however in high dimensions it is computationally impractical. We present Ada-LR and RadaGrad, two computationally efficient approximations to full-matrix AdaGrad based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve similar performance to full-matrix AdaGrad but at a much smaller computational cost. We show that the regret of Ada-LR is close to the regret of full-matrix AdaGrad, which can have an up-to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that Ada-LR and RadaGrad perform similarly to full-matrix AdaGrad. On the task of training convolutional neural networks as well as recurrent neural networks, RadaGrad achieves faster convergence than diagonal AdaGrad.
1 Introduction
Recently, adaptive stochastic optimization algorithms have gained popularity for large-scale convex and non-convex optimization problems. Among these, AdaGrad [10] and its variants [22] have received particular attention and have proven among the most successful algorithms for training deep networks. Although these problems are inherently highly non-convex, recent work has begun to explain the success of such algorithms [3, 5].

AdaGrad adaptively sets the learning rate for each dimension by means of a time-varying proximal regularizer. The most commonly studied and utilised version considers only a diagonal matrix proximal term. As such it incurs almost no additional computational cost over standard stochastic gradient descent (SGD).
However, when the data has low effective rank the regret of AdaGrad may have a much worse dependence on the dimensionality of the problem than its full-matrix variant (which we refer to as Ada-Full). Such settings are common in high dimensional data where there are many correlations between features, and can also be observed in the convolutional layers of neural networks. The computational cost of Ada-Full is substantially higher than that of AdaGrad: it requires computing the inverse square root of the matrix of gradient outer products to evaluate the proximal term, a cost which grows with the cube of the dimension. As such it is rarely used in practice.

In this work we propose two methods that approximate the proximal term used in Ada-Full, drastically reducing computational and storage complexity with little adverse effect on optimization performance. First, in Section 3.1 we develop Ada-LR, a simple approximation using random projections. This procedure reduces the computational complexity of Ada-Full by a factor of p but retains similar theoretical guarantees. In Section 3.2 we systematically profile the most computationally expensive parts of Ada-LR and introduce further randomized approximations, resulting in a truly scalable algorithm, RadaGrad. In Section 3.3 we outline a simple modification to RadaGrad, reducing the variance of the stochastic gradients, which greatly improves practical performance. Finally, we perform an extensive comparison of RadaGrad with several widely used optimization algorithms on a variety of deep learning tasks. For image recognition with convolutional networks and language modeling with recurrent neural networks we find that RadaGrad, and in particular its variance-reduced variant, achieves faster convergence.
1.1 Related work
Motivated by the problem of training deep neural networks, many new adaptive optimization methods have been proposed very recently. Most computationally efficient among these are first order methods similar in spirit to AdaGrad, which suggest alternative normalization factors [22, 29, 7]. Several authors propose efficient stochastic variants of classical second order methods such as L-BFGS [6, 21]. Efficient algorithms exist to update the inverse of the Hessian approximation by applying the matrix-inversion lemma or by directly updating the Hessian-vector product using the “double-loop” algorithm, but these are not applicable to AdaGrad-style algorithms. In the convex setting these methods can show great theoretical and practical benefit over first order methods but have yet to be extensively applied to training deep networks.

On a different note, the growing zoo of variance reduced SGD algorithms [20, 8, 19] has shown vastly superior performance to AdaGrad-style methods for standard empirical risk minimization and convex optimization. Recent work has aimed to move these methods into the non-convex setting [1]. Notably, [23] combine variance reduction with second order methods.

Most similar to RadaGrad are methods which propose factorized approximations of second order information. Several focus on the natural gradient method [2], which leverages second order information through the Fisher information matrix. [15] approximate the inverse Fisher matrix using a sparse graphical model. [9] use low-rank approximations whereas [27] propose an efficient Kronecker product based factorization. Concurrently with this work, [13] propose a randomized preconditioner for SGD. However, their approach requires access to all of the data at once in order to compute the preconditioning matrix, which is impractical for training deep networks. [24] propose a theoretically motivated algorithm similar to Ada-LR and a faster alternative based on Oja's rule to update the SVD.

Fast random projections. Random projections are low-dimensional embeddings Π : R^p → R^τ which preserve, up to a small distortion, the geometry of a subspace of vectors. We concentrate on the class of structured random projections, among which the Subsampled Randomized Fourier Transform (SRFT) has particularly attractive properties [16]. The SRFT consists of a preconditioning step after which τ columns of the new matrix are subsampled uniformly at random: Π = √(p/τ) SΘD with the definitions: (i) S ∈ R^{τ×p} is a subsampling matrix. (ii) D ∈ R^{p×p} is a diagonal matrix whose entries are drawn independently from {−1, 1}. (iii) Θ ∈ R^{p×p} is a unitary discrete Fourier transform (DFT) matrix. This formulation allows very fast implementations using the fast Fourier transform (FFT), for example using the popular FFTW package (http://www.fftw.org/). Applying the FFT to a p-dimensional vector can be achieved in O(p log τ) time. Similar structured random projections
have gained popularity as a way to speed up [25] and robustify [28] large-scale linear regression, and for distributed estimation [18, 17].
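As an illustration, the sketch below (assuming NumPy; the helper name is ours, not the paper's) applies an SRFT-style projection to a gradient vector. A plain unitary FFT is used, so this simple version costs O(p log p) rather than the O(p log τ) attainable with a pruned transform, and real-valued variants (e.g. a DCT) are often preferred in practice to avoid complex outputs.

import numpy as np

def srft(x, tau, rng):
    # Pi x = sqrt(p/tau) * S * Theta * D * x, with D a random +/-1 diagonal,
    # Theta the unitary DFT and S a uniform row subsampling.
    p = x.shape[0]
    d = rng.choice([-1.0, 1.0], size=p)            # diagonal of D
    z = np.fft.fft(d * x, norm="ortho")            # unitary DFT
    idx = rng.choice(p, size=tau, replace=False)   # subsampling S
    return np.sqrt(p / tau) * z[idx]

rng = np.random.default_rng(0)
g = rng.standard_normal(1024)
g_tilde = srft(g, tau=64, rng=rng)                 # tau-dimensional sketch of g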
1.2 Problem setting
The problem considered by [10] is online stochastic optimization where the goal is, at each step, to predict a point β_t ∈ R^p which achieves low regret with respect to a fixed optimal predictor, β^opt, for a sequence of (convex) functions F_t(β). After T rounds, the regret is defined as R(T) = ∑_{t=1}^T F_t(β_t) − ∑_{t=1}^T F_t(β^opt). Initially, we will consider functions F_t of the form F_t(β) := f_t(β) + ϕ(β), where f_t and ϕ are convex loss and regularization functions respectively. Throughout, the vector g_t ∈ ∂f_t(β_t) refers to a particular subgradient of the loss function. Standard first order methods update β_t at each step by moving in the opposite direction of g_t according to a step-size parameter, η. The AdaGrad family of algorithms [10] instead uses an adaptive learning rate which can be different for each feature. This is controlled using a time-varying proximal term which we briefly review. Defining G_t = ∑_{i=1}^t g_i g_i^⊤ and H_t = δI_p + (G_{t−1} + g_t g_t^⊤)^{1/2}, the Ada-Full proximal term is given by ψ_t(β) = ½⟨β, H_t β⟩.
Clearly when p is large, constructing G_t and finding its root and inverse at each iteration is impractical. In practice, rather than the full outer product matrix, AdaGrad uses a proximal function consisting of the diagonal of G_t, ψ_t(β) = ½⟨β, (δI_p + diag(G_t)^{1/2})β⟩. Although the diagonal proximal term is computationally cheaper, it is unable to capture dependencies between coordinates in the gradient terms. Despite this, AdaGrad has been found to perform very well empirically. One reason for this is that modern high-dimensional datasets are typically also very sparse. Under these conditions, coordinates in the gradient are approximately independent.
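For concreteness, a minimal sketch (assuming NumPy/SciPy; not the authors' implementation) contrasting the diagonal update used by AdaGrad with the full-matrix update of Ada-Full:

import numpy as np
from scipy.linalg import sqrtm

def adagrad_step(beta, g, accum, eta=0.1, delta=1e-8):
    # Diagonal AdaGrad: accum holds diag(G_t), the sum of squared gradients.
    accum += g ** 2
    beta -= eta * g / (delta + np.sqrt(accum))
    return beta, accum

def ada_full_step(beta, g, G, eta=0.1, delta=1e-8):
    # Full-matrix AdaGrad: needs the matrix square root of G_t, O(p^3) per step.
    G += np.outer(g, g)
    H = delta * np.eye(len(g)) + np.real(sqrtm(G))
    beta -= eta * np.linalg.solve(H, g)
    return beta, G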
2 Stochastic optimization in high dimensions
AdaGrad has attractive theoretical and empirical properties and adds essentially no overhead above a standard first order method such as SGD. This raises the question of what we might hope to gain by introducing additional computational complexity. In order to motivate our contribution, we first present an analogue of the discussion in [11], focussing on the case when data is high-dimensional and dense. We argue that if the data has low-rank (rather than sparse) structure, Ada-Full can effectively adapt to the intrinsic dimensionality. We also show in Section 3.1 that Ada-LR has the same property. First, we review the theoretical properties of AdaGrad algorithms, borrowing the g_{1:T,j} notation of [10].

Proposition 1. AdaGrad and Ada-Full achieve the following regret (Corollaries 6 & 11 from [10]) respectively:

R_D(T) ≤ 2‖β^opt‖_∞ ∑_{j=1}^p ‖g_{1:T,j}‖ + δ‖β^opt‖_1,        R_F(T) ≤ 2‖β^opt‖ · tr(G_T^{1/2}) + δ‖β^opt‖.        (1)
The major difference between R_D(T) and R_F(T) is the inclusion of the final diagonal and full-matrix proximal term, respectively. Under a sparse data generating distribution, AdaGrad achieves an up-to exponential improvement over SGD which is optimal in a minimax sense [11]. While data sparsity is often observed in practice in high-dimensional datasets (particularly web/text data), many other problems are dense. Furthermore, in practice applying AdaGrad to dense data results in a learning rate which tends to decay too rapidly. It is therefore natural to ask how dense data affects the performance of Ada-Full.

For illustration, consider the case when the data points x_i are sampled i.i.d. from a Gaussian distribution P_X = N(0, Σ). The resulting variable will clearly be dense. A common feature of high dimensional data is low effective rank, defined for a matrix Σ as r(Σ) = tr(Σ)/‖Σ‖ ≤ rank(Σ) ≤ p. Low effective rank implies that r ≪ p and therefore that the eigenvalues of the covariance matrix decay quickly. We will consider distributions parameterised by covariance matrices Σ with eigenvalues λ_j(Σ) = λ_0 j^{−α} for j = 1, . . . , p.
Functions of the form F_t(β) = F_t(β^⊤x_t) have gradients bounded as ‖g_t‖ ≤ M‖x_t‖. For example, the least squares loss F_t(β^⊤x_t) = ½(y_t − β^⊤x_t)² has gradient g_t = x_t(y_t − x_t^⊤β_t) = x_t ε_t, with |ε_t| ≤ M. Let us consider the effect of distributions parametrised by Σ on the proximal terms of full-matrix and diagonal AdaGrad. Plugging X into the proximal terms of (1) and taking expectations with respect to P_X we obtain for AdaGrad and Ada-Full respectively:
E ∑_{j=1}^p ‖g_{1:T,j}‖ ≤ M ∑_{j=1}^p E √(∑_{t=1}^T x_{t,j}²) ≤ pM√T,        E tr((∑_{t=1}^T g_t g_t^⊤)^{1/2}) ≤ M √(Tλ_0) ∑_{j=1}^p j^{−α/2},        (2)
where the first inequality is from Jensen and the second from noticing that the sum of T squared Gaussian random variables is a χ² random variable. We can consider the effect of a fast-decaying spectrum: for α ≥ 2, ∑_{j=1}^p j^{−α/2} = O(log p), and for α ∈ (1, 2), ∑_{j=1}^p j^{−α/2} = O(p^{1−α/2}).
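A quick numerical check of this dimension dependence (a sketch using NumPy; the values of α are illustrative) shows how slowly the relevant sum grows when the spectrum decays fast:

import numpy as np

# Growth of sum_{j<=p} j^(-alpha/2): roughly log p for alpha = 2,
# but p^(1 - alpha/2) for alpha = 1.3.
for alpha in (2.0, 1.3):
    for p in (10**2, 10**4, 10**6):
        s = np.sum(np.arange(1.0, p + 1) ** (-alpha / 2))
        print(f"alpha={alpha}, p={p:>7d}: sum={s:8.1f}")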
When the data (and thus the gradients) are dense yet have low effective rank, Ada-Full is able to adapt to this structure. On the contrary, although AdaGrad is computationally practical, in the worst case it may have exponentially worse dependence on the data dimension (p compared with log p). In fact, the discrepancy between the regret of Ada-Full and that of AdaGrad is analogous to the discrepancy between AdaGrad and SGD for sparse data.
Algorithm 1 Ada-LR
Input: η > 0, δ ≥ 0, τ
Output: β_T
1: for t = 1 . . . T do
2:   Receive g_t = ∇f_t(β_t)
3:   G_t = G_{t−1} + g_t g_t^⊤
4:   Project: G̃_t = G_t Π
5:   QR = G̃_t {QR-decomposition}
6:   B = Q^⊤ G_t
7:   U, Σ, V = B {SVD}
8:
9:
10:  β_{t+1} = β_t − ηV(Σ^{1/2} + δI)^{−1} V^⊤ g_t
11: end for

Algorithm 2 RadaGrad
Input: η > 0, δ ≥ 0, τ
Output: β_T
1: for t = 1 . . . T do
2:   Receive g_t = ∇f_t(β_t)
3:   Project: g̃_t = Πg_t
4:   G̃_t = G̃_{t−1} + g_t g̃_t^⊤
5:   Q_t, R_t ← qr_update(Q_{t−1}, R_{t−1}, g_t, g̃_t)
6:   B = G̃_t^⊤ Q_t
7:   U, Σ, W = B {SVD}
8:   V = QW
9:   γ_t = η(g_t − VV^⊤ g_t)
10:  β_{t+1} = β_t − ηV(Σ^{1/2} + δI)^{−1} V^⊤ g_t − γ_t
11: end for
3 Approximating Ada-Full using random projections
It is clear that in certain regimes, Ada-Full provides stark optimization advantages over AdaGrad in terms of the dependence on p. However, Ada-Full requires maintaining a p × p matrix, G, and computing its square root and inverse. Therefore, computationally the cost of Ada-Full scales with the cube of p, which is impractical in high dimensions.

A naïve approach would be to simply reduce the dimensionality of the gradient vector, g̃_t = Πg_t ∈ R^τ. Ada-Full is now directly applicable in this low-dimensional space, returning a solution vector β̃_t ∈ R^τ at each iteration. However, for many problems the original coordinates may have some intrinsic meaning or, in the case of deep networks, may be parameters of a model. In such cases it is important to return a solution in the original space. Unfortunately, in general it is not possible to recover such a solution from β̃_t [31].

Instead, we consider a different approach to maintaining and updating an approximation of the AdaGrad matrix while retaining the original dimensionality of the parameter updates β and gradients g.
3.1 Randomized low-rank approximation
As a first approach we approximate the inverse square root of G_t using a fast randomized singular value decomposition (SVD) [16]. We proceed in two stages: First we compute an approximate basis Q for the range of G_t. Then we use Q to compute an approximate SVD of G_t by forming the smaller matrix B = Q^⊤G_t and computing its low-rank SVD, UΣV^⊤ = B. This is faster than computing the SVD of G_t directly if Q has few columns. An approximate basis Q can be computed efficiently by forming the matrix G̃_t = G_tΠ by means of a structured random projection and then constructing an orthonormal basis for the range of G̃_t by QR-decomposition. The randomized SVD allows us to quickly compute the square root and pseudo-inverse of the proximal term H_t by setting H̃_t^{−1} = V(Σ^{1/2} + δI)^{−1}V^⊤. We call this approximation Ada-LR and describe the steps in full in Algorithm 1.
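A minimal sketch of this two-stage randomized approximation (assuming NumPy; a Gaussian test matrix stands in for the SRFT, and the helper name is ours, not the paper's):

import numpy as np

def ada_lr_inverse(G, tau, delta, rng):
    # Approximate H_t^{-1} = V (Sigma^{1/2} + delta*I)^{-1} V^T via a randomized SVD of G.
    p = G.shape[0]
    Pi = rng.standard_normal((p, tau))       # random test matrix (an SRFT in the paper)
    Q, _ = np.linalg.qr(G @ Pi)              # approximate basis for range(G)
    B = Q.T @ G                              # tau x p
    _, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    V = Vt.T                                 # p x tau
    return (V / (np.sqrt(sigma) + delta)) @ V.T

# Usage inside an optimizer step: beta -= eta * ada_lr_inverse(G, tau, delta, rng) @ g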
In practice, using a structured random projection such as the SRFT leads to an approximation of the original matrix G_t of the form ‖G_t − QQ^⊤G_t‖ ≤ ϵ with high probability [16], where ϵ depends on τ, the number of columns of Q; on p; and on the τth singular value of G_t. Briefly, if the singular values of G_t decay quickly and τ is chosen appropriately, ϵ will be small (this is stated more formally in Proposition 2). We leverage this result to derive the following regret bound for Ada-LR (see C.1 for proof).

Proposition 2. Let σ_{k+1} be the kth largest singular value of G_t. Setting the projection dimension as 4[√k + √(8 log(kn))]² ≤ τ ≤ p and defining ϵ = √(1 + 7p/τ) · σ_{k+1}, with failure probability at most O(k^{−1}), Ada-LR achieves regret

R_LR(T) ≤ 2‖β^opt‖ tr(G_T^{1/2}) + (2τ√ϵ + δ)‖β^opt‖.
Due to the randomized approximation we incur an additional 2τ√ϵ‖β^opt‖ compared with the regret of Ada-Full (eq. 1). So, under the earlier stated assumption of fast decaying eigenvalues, we can use an identical argument as in eq. (2) to obtain a dimension dependence of O(log p + τ). Approximating the inverse square root decreases the complexity of each iteration from O(p³) to O(τp²). We summarize the cost of each step in Algorithm 1 and contrast it with the cost of Ada-Full in Table A.1 in Section A. Even though Ada-LR removes one factor of p from the runtime of Ada-Full, it still needs to store the large matrix G_t. This prevents Ada-LR from being a truly practical algorithm. In the following section we propose a second algorithm which directly stores a low dimensional approximation to G_t that can be updated cheaply. This allows for an improvement in runtime to O(τ²p).
3.2 RadaGrad: A faster approximation
From Table A.1, the expensive steps in Algorithm 1 are the update of G_t (line 3), the random projection (line 4) and the projection onto the approximate range of G_t (line 6). In the following we propose RadaGrad, an algorithm that reduces the complexity to O(τ²p) by only approximately solving some of the expensive steps in Ada-LR while maintaining similar performance in practice.
To compute the approximate range Q, we do not need to store the full matrix G_t. Instead we only require the low dimensional matrix G̃_t = G_tΠ ∈ R^{p×τ}. This matrix can be computed iteratively by setting G̃_t = G̃_{t−1} + g_t(Πg_t)^⊤. This directly reduces the cost of the random projection to O(p log τ), since we only project the vector g_t instead of the matrix G_t; it also makes the update of G̃_t faster and saves storage. We then project G̃_t on the approximate range of G_t and use the SVD to compute the inverse square root. Since G_t is symmetric, its row and column space are identical, so little information is lost by projecting G̃_t instead of G_t on the approximate range of G_t (this idea is similar to bilinear random projections [14]). The advantage is that we can now compute the SVD in O(τ³) and the matrix-matrix product on line 6 in O(τ²p). See Algorithm 2 for the full procedure.

The most expensive steps are now the QR decomposition and the matrix multiplications in steps 6 and 8 (see Algorithm 2 and Table A.1). Since at each iteration we only update the matrix G̃_t with the rank-one matrix g_t g̃_t^⊤, we can use faster rank-1 QR-updates [12] instead of recomputing the full QR decomposition.
To speed up the matrix-matrix product G̃_t^⊤Q for very large problems (e.g. backpropagation in convolutional neural networks), a multithreaded BLAS implementation can be used.
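A rough sketch of one such iteration (assuming NumPy and SciPy's qr_update for the rank-1 QR update; the function below is illustrative and omits initialisation, which could use scipy.linalg.qr(G_sk, mode='economic')):

import numpy as np
from scipy.linalg import qr_update

def radagrad_step(beta, g, G_sk, Q, R, Pi, eta, delta):
    # G_sk is the p x tau sketch G_t Pi; Q, R are its economic QR factors.
    g_sk = Pi.T @ g                                   # project the gradient (tau-dim)
    G_sk = G_sk + np.outer(g, g_sk)                   # rank-1 update of the sketch
    Q, R = qr_update(Q, R, g, g_sk)                   # rank-1 QR update
    B = G_sk.T @ Q                                    # tau x tau
    _, sigma, Wt = np.linalg.svd(B)
    V = Q @ Wt.T                                      # p x tau
    corr = eta * (g - V @ (V.T @ g))                  # corrected update (Section 3.3)
    beta = beta - eta * V @ ((V.T @ g) / (np.sqrt(sigma) + delta)) - corr
    return beta, G_sk, Q, R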
3.3 Practical algorithms
Here we outline several simple modifications to the RadaGrad algorithm to improve practical performance.

Corrected update. The random projection step only retains at most τ eigenvalues of G_t. If the assumption of low effective rank does not hold, important information from the p − τ smallest eigenvalues might be discarded. RadaGrad therefore makes use of the corrected update

β_{t+1} = β_t − ηV(Σ^{1/2} + δI)^{−1}V^⊤ g_t − γ_t,    where    γ_t = η(I − VV^⊤)g_t.

γ_t is the projection of the current gradient onto the space orthogonal to the one captured by the random projection of G_t. This ensures that important variation in the gradient which is poorly approximated by the random projection is not completely lost. Consequently, if the data has rank less than τ, ‖γ_t‖ ≈ 0. This correction only requires quantities which have already been computed but greatly improves practical performance.

Variance reduction. Variance reduction methods based on SVRG [20] obtain lower-variance gradient estimates by means of computing a “pivot point” over larger batches of data. Recent work has shown improved theoretical and empirical convergence in non-convex problems [1], in particular in combination with AdaGrad. We modify RadaGrad to use the variance reduction scheme of SVRG. The full procedure is given in Algorithm 3 in Section B. The majority of the algorithm is as RadaGrad except for the outer loop which computes the pivot point µ every epoch, which is used to reduce the variance of the stochastic gradient (line 4). The important additional parameter is m, the update frequency for µ. As in [1] we set this to m = 5n. Practically, as is standard practice, we initialise Rada-VR by running AdaGrad for several epochs.

We study the empirical behaviour of Ada-LR, RadaGrad and its variance reduced variant in the next section.
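A small sketch of the variance-reduced gradient used here (SVRG-style; grad_fi is a user-supplied per-example gradient oracle, assumed for illustration):

def svrg_pivot(grad_fi, beta_snapshot, n):
    # Pivot mu: gradient of the full objective at the epoch snapshot beta_snapshot.
    return sum(grad_fi(i, beta_snapshot) for i in range(n)) / n

def vr_gradient(grad_fi, i, beta, beta_snapshot, mu):
    # Lower-variance stochastic gradient fed to the RadaGrad update (cf. Algorithm 3, line 4).
    return grad_fi(i, beta) - grad_fi(i, beta_snapshot) + mu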
4 Experiments

4.1 Low effective rank data
We compare the performance of our proposed algorithms against both the diagonal and full-matrix AdaGrad variants in the idealised setting where the data is dense but has low effective rank. We generate binary classification data with n = 1000 and p = 125. The data is sampled i.i.d. from a Gaussian distribution N(µ_c, Σ) where Σ has rapidly decaying eigenvalues λ_j(Σ) = λ_0 j^{−α} with α = 1.3, λ_0 = 30. Each of the two classes has a different mean, µ_c.

Figure 1: Comparison of (a) the logistic loss and (b) the spectrum (largest eigenvalues, normalised by their sum) of the proximal term on simulated data, for Ada-Full, Ada-LR, RadaGrad and AdaGrad.
For each algorithm, learning rates are tuned using cross validation. The results for 5 epochs are averaged over 5 runs with different permutations of the data set and instantiations of the random projection for Ada-LR and RadaGrad. For the random projection we use an oversampling factor, so Π ∈ R^{(10+τ)×p}, to ensure accurate recovery of the top τ singular values, and then set the values of λ_{[τ:p]} to zero [16].
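For reference, a sketch of this simulation setup (assuming NumPy; the class means are not specified in the paper and are drawn at random here):

import numpy as np

def make_low_rank_data(n=1000, p=125, alpha=1.3, lam0=30.0, seed=0):
    # Binary classification data with covariance eigenvalues lam0 * j^(-alpha).
    rng = np.random.default_rng(seed)
    eigvals = lam0 * np.arange(1, p + 1) ** (-alpha)
    U, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random eigenbasis
    Sigma_half = U * np.sqrt(eigvals)                  # Sigma = U diag(eigvals) U^T
    y = rng.integers(0, 2, size=n)
    mu = rng.standard_normal((2, p))                   # class means (assumed, not from the paper)
    X = mu[y] + rng.standard_normal((n, p)) @ Sigma_half.T
    return X, y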
Figure 2: Comparison of training loss (top row) and test accuracy (bottom row) on (a) MNIST, (b) CIFAR and (c) SVHN, for RadaGrad, Rada-VR, AdaGrad and AdaGrad+SVRG.

Figure 1a shows the mean loss on the training set. The performance of Ada-LR and RadaGrad matches that of Ada-Full. On the other hand, AdaGrad converges to the optimum much more slowly. Figure 1b shows the largest eigenvalues (normalized by their sum) of the proximal matrix for each method at the end of training. The spectrum of G_t decays rapidly, which is matched by the randomized approximation. This illustrates the dependencies between the coordinates in the gradients and suggests G_t can be well approximated by a low-dimensional matrix which considers these dependencies. On the other hand, the spectrum of AdaGrad (equivalent to the diagonal of G_t) decays much more slowly. The learning rates η chosen by RadaGrad and Ada-Full are roughly one order of magnitude higher than for AdaGrad.
4.2 Non-convex optimization in neural networks
Here we compare RadaGrad and Rada-VR against AdaGrad and the combination of AdaGrad + SVRG on the task of optimizing several different neural network architectures.

Convolutional Neural Networks. We used modified variants of standard convolutional network architectures for image classification on the MNIST, CIFAR-10 and SVHN datasets. These consist of three 5 × 5 convolutional layers generating 32 channels with ReLU non-linearities, each followed by 2 × 2 max-pooling. The final layer was a dense softmax layer and the objective was to minimize the categorical cross entropy. We used a batch size of 8 and trained the networks without momentum or weight decay in order to eliminate confounding factors. Instead, we used dropout regularization (p = 0.5) in the dense layers during training. Step sizes were determined by coarsely searching a log scale of possible values and evaluating performance on a validation set. We found RadaGrad to have a higher impact with convolutional layers than with dense layers, due to the higher correlations between weights. Therefore, for computational reasons, RadaGrad was only applied to the convolutional layers. The last dense classification layer was trained with AdaGrad.

In this setting Ada-Full is computationally infeasible. The number of parameters in the convolutional layers is between 50-80k. Simply storing the full G matrix in double precision would require more memory than is available on top-of-the-line GPUs.

The results of our experiments can be seen in Figure 2, where we show the objective value during training and the test accuracy. We find that both RadaGrad variants consistently outperform both AdaGrad and the combination of AdaGrad + SVRG on these tasks. In particular, combining RadaGrad with variance reduction results in the largest improvement for training, although both RadaGrad variants quickly converge to very similar values of test accuracy.
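For reference, a sketch of the convolutional architecture described above (the paper does not state which framework was used; PyTorch is used here, and the padding and input sizes are our assumptions):

import torch.nn as nn

class SmallConvNet(nn.Module):
    # Three 5x5 conv layers with 32 channels and ReLU, each followed by 2x2 max-pooling,
    # then dropout (p = 0.5) and a dense layer producing logits for cross-entropy training.
    def __init__(self, in_channels=3, n_classes=10, spatial=4):  # spatial=4 for 32x32 inputs
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(32 * spatial * spatial, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))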
For all models, the learning rate selected by RadaGrad is approximately an order of magnitude larger than the one selected by AdaGrad. This suggests that RadaGrad can make more aggressive steps than AdaGrad, which results in the relative success of RadaGrad over AdaGrad, especially at the beginning of the experiments. We observed that RadaGrad performed 5-10× slower than AdaGrad per iteration. This can be attributed to the lack of GPU-optimized SVD and QR routines. These numbers are comparable with other similar recently proposed techniques [24]. However, due to the faster convergence we found that the overall optimization time of RadaGrad was lower than for AdaGrad.

Recurrent Neural Networks. We trained the strongly-typed variant of the long short-term memory network (T-LSTM, [4]) for language modelling, which consists of the following task: given a sequence of words from an original text, predict the next word. We used pre-trained GloVe embedding vectors [30] as input to the T-LSTM layer and a softmax over the vocabulary (10k words) as output. The loss is the mean categorical cross-entropy. The memory size of the T-LSTM units was set to 256. We trained and evaluated our network on the Penn Treebank dataset [26]. We subsampled strings of length 20 from the dataset and asked the network to predict each word in the string, given the words up to that point. Learning rates were selected by searching over a log scale of possible values and measuring performance on a validation set.

Figure 3: Comparison of training loss (left) and test loss (right) on the language modelling task with the T-LSTM, for RadaGrad and AdaGrad.
We compared RadaGrad with AdaGrad, without variance reduction. The results of this experiment can be seen in Figure 3. During training, we found that RadaGrad consistently outperforms AdaGrad: RadaGrad both reduces the training loss more quickly and reaches a smaller value (5.62 × 10^{−4} vs. 1.52 × 10^{−3}, a 2.7× reduction in loss). Again, we found that the selected learning rate is an order of magnitude higher for RadaGrad than for AdaGrad. RadaGrad is able to exploit the fact that T-LSTMs perform type-preserving update steps, which should preserve any low-rank structure present in the weight matrices. The relative improvement of RadaGrad over AdaGrad in training is also reflected in the test loss (1.15 × 10^{−2} vs. 3.23 × 10^{−2}, a 2.8× reduction).
5 Discussion
We have presented Ada-LR and RadaGrad, which approximate the full proximal term of AdaGrad using fast, structured random projections. Ada-LR enjoys similar regret to Ada-Full and both methods achieve similar empirical performance at a fraction of the computational cost. Importantly, RadaGrad can easily be modified to make use of standard improvements such as variance reduction. Variance reduction in particular has stark benefits for non-convex optimization in convolutional and recurrent neural networks. We observe a marked improvement over widely-used techniques such as AdaGrad and SVRG, the combination of which has recently been proven to be an excellent choice for non-convex optimization [1]. Furthermore, we tried to incorporate exponential forgetting schemes similar to RMSProp and Adam into the RadaGrad framework but found that these methods degraded performance. A downside of such methods is that they require additional parameters to control the rate of forgetting.

Optimization for deep networks has understandably been a very active research area. Recent work has concentrated on either improving estimates of second order information or investigating the effect of variance reduction on the gradient estimates. It is clear from our experimental results that a thorough
study of the combination provides an important avenue for further investigation, particularly where parts of the underlying model might have low effective rank. Acknowledgements. We are grateful to David Balduzzi, Christina Heinze-Deml, Martin Jaggi, Aurelien Lucchi, Nishant Mehta and Cheng Soon Ong for valuable discussions and suggestions.
References

[1] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[2] S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[3] D. Balduzzi. Deep online convex optimization with gated games. arXiv preprint arXiv:1604.01952, 2016.
[4] D. Balduzzi and M. Ghifary. Strongly-typed recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[5] D. Balduzzi, B. McWilliams, and T. Yeoman-Butler. Neural Taylor approximations: Convergence and exploration in rectifier networks. arXiv preprint arXiv:1611.02345, 2016.
[6] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. arXiv preprint arXiv:1401.7020, 2014.
[7] Y. N. Dauphin, H. de Vries, J. Chung, and Y. Bengio. RMSProp and equilibrated adaptive learning rates for non-convex optimization. arXiv preprint arXiv:1502.04390, 2015.
[8] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
[9] G. Desjardins, K. Simonyan, R. Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070, 2015.
[10] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[11] J. C. Duchi, M. I. Jordan, and H. B. McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Systems, 2013.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[13] A. Gonen and S. Shalev-Shwartz. Faster SGD using sketched conditioning. arXiv preprint arXiv:1506.02649, 2015.
[14] Y. Gong, S. Kumar, H. Rowley, and S. Lazebnik. Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of CVPR, pages 484–491, 2013.
[15] R. Grosse and R. Salakhudinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In Proceedings of the 32nd International Conference on Machine Learning, pages 2304–2313, 2015.
[16] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[17] C. Heinze, B. McWilliams, and N. Meinshausen. DUAL-LOCO: Distributing statistical estimation using random projections. In Proceedings of AISTATS, 2016.
[18] C. Heinze, B. McWilliams, N. Meinshausen, and G. Krummenacher. LOCO: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.
[19] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems, 2015.
[20] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[21] N. S. Keskar and A. S. Berahas. adaQN: An adaptive quasi-Newton algorithm for training RNNs. Nov. 2015.
[22] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] A. Lucchi, B. McWilliams, and T. Hofmann. A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316, 2015.
[24] H. Luo, A. Agarwal, N. Cesa-Bianchi, and J. Langford. Efficient second order online learning via sketching. arXiv preprint arXiv:1602.02202, 2016.
[25] M. W. Mahoney. Randomized algorithms for matrices and data. arXiv:1104.5557v3 [cs.DS], Apr. 2011.
[26] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
[27] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[28] B. McWilliams, G. Krummenacher, M. Lucic, and J. M. Buhmann. Fast and robust least squares estimation in corrupted linear models. In Advances in Neural Information Processing Systems, volume 27, 2014.
[29] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2413–2421, 2015.
[30] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[31] L. Zhang, M. Mahdavi, R. Jin, T. Yang, and S. Zhu. Recovering optimal solution by dual random projection. arXiv preprint arXiv:1211.3046, 2012.
Supplementary Information for Scalable Adaptive Stochastic Optimization Using Random Projections

A Computational Complexity
Table A.1: Comparison of computational complexity in big-O notation between Ada-Full, Ada-LR and RadaGrad.

Operation                      Line   Ada-Full   Ada-LR     RadaGrad
Πg_t                           3      –          –          p log τ
G_t = G_{t−1} + g_t g_t^⊤      3/4    p²         p²         pτ
G_t Π                          4      –          p² log τ   –
QR-decomposition               5      –          τ²p        τ²p
Q^⊤ G_t                        6      –          τp²        τ²p
SVD                            7      –          τ²p        τ³
QW                             8      –          –          τ²p
β_{t+1} update                 10     p³         τp         τp
Total                                 p³         τp²        τ²p

B Rada-VR: RadaGrad with variance reduction
Algorithm 3 Rada-VR
Input: η > 0, δ ≥ 0, τ, S number of epochs, m iterations per epoch, initial β_0^1
1: for s = 1 . . . S do
2:   µ = ∇ ∑_{i=1}^n f_i(β_0^s)
3:   for t = 1 . . . m − 1 do
4:     Compute VR gradient: g_t = ∇f_t(β_t^s) − ∇f_t(β_0^s) + µ
5:     Project: g̃_t = Πg_t
6:     G̃_t = G̃_{t−1} + g_t g̃_t^⊤
7:     Q_t, R_t ← qr_update(Q_{t−1}, R_{t−1}, g_t, g̃_t)
8:     B = G̃_t^⊤ Q
9:     U, Σ, W = B {SVD}
10:    V = QW
11:    β_{t+1}^s = β_t^s − ηV(Σ^{1/2} + δI)^{−1} V^⊤ g_t − γ_t
12:   end for
13:   β_0^{s+1} = β_{t+1}^s
14: end for
Output: β_m^S
C Analysis
C.1 Regret bound for Ada-LR

The following proof is based on the proof for Theorem 7 in [10]. The key difference is that instead of having the square root and (pseudo-)inverse of the full matrix G_t, namely G_t^{1/2} and S_t^†, we have the approximate square root and inverse based on the randomized SVD [16]: S̃_t = (QQ^⊤G_t)^{1/2} and S̃_t^† = (QQ^⊤G_t)^{−1/2}. Essentially we use the proximal function ψ_t = ⟨x, S̃_t x⟩ or ψ_t = ⟨x, H̃_t x⟩, where we set H̃_t = δI + S̃_t. Here Q is the approximate basis for the range of the matrix G_t [16]. We first state the following facts about the relationship between G and G̃^{−1/2}.
Lemma 3. Defining G̃^{−1/2} = (QQ^⊤G)^{−1/2} we have
(I) G̃^{−1/2} G = (G^{−1}(QQ^⊤)G²)^{1/2},
(II) tr((G^{−1}(QQ^⊤)G²)^{1/2}) = tr(G̃^{1/2}).

We also require the following lemma, which bounds the sequence of proximal terms by the trace of the final G̃_T^{1/2}.

Lemma 4 (based on Lemma 10 in [10]).

∑_{t=1}^T ⟨g_t, G̃_t^{−1/2} g_t⟩ ≤ 2 ∑_{t=1}^T ⟨g_t, G̃_T^{−1/2} g_t⟩ = 2 tr(G̃_T^{1/2}).    (3)
We are now ready to prove Proposition 2.

Proof of Proposition 2. Inspecting Lemma 6,

R(T) ≤ (1/η) ψ_T(β^opt) + (η/2) ∑_{t=1}^T ‖f_t'(β_t)‖²_{ψ*_{T−1}},

we first bound the term ∑_{t=1}^T ‖f_t'(β_t)‖²_{ψ*_{T−1}}.
From [10, Proof of Theorem 7] we have that the squared dual norm associated with ψ_t is ‖x‖²_{ψ_t*} = ⟨x, (δI + (QQ^⊤G_t)^{1/2})^{−1} x⟩, and thus it is clear that ‖g_t‖²_{ψ_t*} ≤ ⟨g_t, (QQ^⊤G_t)^{−1/2} g_t⟩. Lemma 8 shows that ‖g_t‖²_{ψ*_{t−1}} ≤ ⟨g_t, S̃_t^† g_t⟩ as long as δ ≥ ‖g_t‖₂. Lemma 4 then implies that

∑_{t=1}^T ‖f_t'(β_t)‖²_{ψ*_{T−1}} ≤ 2 tr(G̃_T^{1/2}).
We now bound 2 tr(G̃_T^{1/2}) by 2(tr(G_T^{1/2}) + τ√ϵ):

tr(G̃_T^{1/2}) − tr(G_T^{1/2}) = tr(G̃_T^{1/2} − G_T^{1/2})                                                      (4)
                              = ∑_{j=1}^τ [λ_j(G̃_T^{1/2}) − λ_j(G_T^{1/2})] − ∑_{j=τ+1}^p λ_j(G_T^{1/2})        (5)
                              ≤ ∑_{j=1}^τ [λ_1(G̃_T^{1/2}) − λ_1(G_T^{1/2})]                                     (6)

since λ_j(G̃_T) = 0 for all j > τ.

Now, using the reverse triangle inequality and Theorem 5 we obtain

∑_{j=1}^τ [λ_1(G̃_T^{1/2}) − λ_1(G_T^{1/2})] ≤ ∑_{j=1}^τ ‖G̃_T^{1/2} − G_T^{1/2}‖₂    (7)
                                             ≤ ∑_{j=1}^τ √ϵ                          (8)
                                             ≤ τ√ϵ.                                  (9)
It remains to show that ψ_T(β^opt) in Lemma 6 is bounded by (δ + √ϵ + tr(G_T^{1/2}))‖β^opt‖² to get the statement of Proposition 2:

ψ_T(β^opt) = ⟨β^opt, (δI + (QQ^⊤G_T)^{1/2}) β^opt⟩
           ≤ ‖β^opt‖² ‖(QQ^⊤G_T)^{1/2}‖₂ + δ‖β^opt‖²
           ≤ ‖β^opt‖² (√ϵ + ‖G_T^{1/2}‖) + δ‖β^opt‖²
           ≤ ‖β^opt‖² (√ϵ + tr(G_T^{1/2})) + δ‖β^opt‖²,
where we again use the reverse triangle inequality and Theorem 5 as above. Finally, plugging this into the statement of Lemma 6 and setting η = ‖β^opt‖₂ (as in Corollary 11 in [10]) we get the expression for the regret of Ada-LR as stated in Proposition 2.

C.2 Proofs of supporting results

Proof of Lemma 3. By direct computation we have for (I)

G̃^{−1/2} G = (QQ^⊤G)^{−1/2} G = ((QQ^⊤G)^{−1} G²)^{1/2} = (G^{−1}(QQ^⊤)^{−1} G²)^{1/2} = (G^{−1}(QQ^⊤) G²)^{1/2},

and for (II)

tr((G^{−1}(QQ^⊤)G²)^{1/2}) = tr((Q^⊤GQ)^{1/2}) = tr((QQ^⊤G)^{1/2}) = tr(G̃^{1/2}).

Proof of Lemma 4. We set up the following proof by induction. In the base case:
⟨g_1, G̃_1^{−1/2} g_1⟩ = tr(G̃_1^{−1/2} g_1 g_1^⊤) = tr(G̃_1^{1/2}) ≤ 2 tr(G̃_1^{1/2}),
where we have used (II). Now, assuming that the lemma is true for T − 1, we get:

∑_{t=1}^T ⟨g_t, G̃_t^{−1/2} g_t⟩ ≤ 2 ∑_{t=1}^{T−1} ⟨g_t, G̃_{T−1}^{−1/2} g_t⟩ + ⟨g_T, G̃_T^{−1/2} g_T⟩.

Now using that G̃_{T−1}^{−1/2} does not depend on t and (II):

∑_{t=1}^{T−1} ⟨g_t, G̃_{T−1}^{−1/2} g_t⟩ = tr(G̃_{T−1}^{−1/2} G_{T−1}) = tr(G̃_{T−1}^{1/2}).

Therefore we get

∑_{t=1}^T ⟨g_t, G̃_t^{−1/2} g_t⟩ ≤ 2 tr(G̃_{T−1}^{1/2}) + ⟨g_T, G̃_T^{−1/2} g_T⟩.    (10)
We can rewrite

tr(G̃_{T−1}^{1/2}) = tr((Q_{T−1}Q_{T−1}^⊤ G_T − Q_{T−1}Q_{T−1}^⊤ g_T g_T^⊤)^{1/2}).    (11)

Now, since range(Q_{T−1}) ⊂ range(Q_T) and by Proposition 8.5 in [16], we can use Lemma 7 with ν = 1 and g = g_T to obtain:

2 tr(G̃_{T−1}^{1/2}) + ⟨g_T, G̃_T^{−1/2} g_T⟩ ≤ 2 tr(G̃_T^{1/2}).    (12)
D Supporting Results
Theorem 5 (SRFT approximation error, Theorem 11.2 in [16]). Defining ϵ = √(1 + 7p/τ) · σ_{k+1}, the following holds with failure probability at most O(k^{−1}):

‖G_t − QQ^⊤ G_t‖₂ ≤ ϵ,    (13)

where σ_{k+1} is the kth largest singular value of G_t, and 4[√k + √(8 log(kn))]² ≤ τ ≤ p.

Lemma 6 (Proposition 2 from [10]).

R(T) := ∑_{t=1}^T [f_t(β_t) + ϕ(β_t) − f_t(β^opt) − ϕ(β^opt)] ≤ (1/η) ψ_T(β^opt) + (η/2) ∑_{t=1}^T ‖f_t'(β_t)‖²_{ψ*_{T−1}}.
Lemma 7 (Lemma 8 from [10]). Let B ⪰ 0. For any ν such that B − νgg^⊤ ⪰ 0 the following holds:

2 tr((B − νgg^⊤)^{1/2}) ≤ 2 tr(B^{1/2}) − ν tr(B^{−1/2} gg^⊤).
Lemma 8 (Lemma 9 from [10]). Let δ ≥ ‖g‖₂ and A ⪰ 0. Then

⟨g, (δI + A^{1/2})^{−1} g⟩ ≤ ⟨g, ((A + gg^⊤)^†)^{1/2} g⟩.