Neural Optimizers with Hypergradients for Tuning Parameter-Wise Learning Rates
Jie Fu*, Ritchie Ng*, Danlu Chen, Ilija Ilievski, Christopher Pal, Tat-Seng Chua
Introduction
Training the Optimizer
- Recently, there has been a rising interest in learning to learn, i.e. learning to update an optimizee's parameters directly
- In this work, we combine hand-designed and learned optimizers: an LSTM-based optimizer learns to propose parameter-wise learning rates for the hand-designed optimizer and is trained with hypergradients
- Our method can be used to tune any continuous dynamic hyperparameter (including momentum), but we focus on learning rates as a case study
- Because an LSTM is limited to a few hundred time-steps, as its long-term memory contents are diluted at every time-step, we do not propose learning rates at every iteration
- We conjecture that learning rates tend not to change dramatically across iterations, so the LSTM optimizer only proposes learning rates every S iterations
- Essentially, we use a straight line to approximate the optimal learning-rate curve within S steps, in the hope that the actual curve is not highly bumpy, following the computational graph in Fig. 2
- As a by-product, proposing learning rates only every S iterations also makes training and testing efficient, since gradients can be averaged over the S iterations in between (see the sketch after this list)
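To make the every-S-iterations scheme concrete, below is a minimal, hypothetical sketch (not the authors' code): a 3-layer LSTM maps gradients averaged over the previous S iterations to parameter-wise log learning rates, and a hand-written Adam-style update applies those rates in between proposals. The toy data stream, the averaged-gradient state description, and the log-learning-rate head are all illustrative assumptions, and the hypergradient training of the LSTM itself is omitted.

```python
# Hypothetical sketch of proposing parameter-wise learning rates every S iterations.
# Assumptions: toy random data, averaged gradients as the LSTM input, and a
# hand-written Adam update; the hypergradient training of the LSTM is omitted.
import torch
import torch.nn as nn

S = 50  # interval between learning-rate proposals (assumed)

def toy_batches(num_batches=200, batch_size=120):
    """Toy data stream standing in for MNIST batches (784 features, 10 classes)."""
    for _ in range(num_batches):
        yield torch.randn(batch_size, 784), torch.randint(0, 10, (batch_size,))

# Optimizee: 2-hidden-layer MLP with sigmoid activations.
optimizee = nn.Sequential(
    nn.Linear(784, 20), nn.Sigmoid(),
    nn.Linear(20, 20), nn.Sigmoid(),
    nn.Linear(20, 10),
)
params = list(optimizee.parameters())
n_params = sum(p.numel() for p in params)

# LSTM optimizer: maps a per-parameter state description to a log learning rate.
lstm = nn.LSTM(input_size=1, hidden_size=20, num_layers=3)
head = nn.Linear(20, 1)
hidden = None
log_lr = torch.full((n_params,), -3.0)  # start around a learning rate of e^-3

# Hand-written Adam accumulators over the flattened parameters.
m = torch.zeros(n_params)
v = torch.zeros(n_params)
beta1, beta2, eps = 0.9, 0.999, 1e-8
grad_sum = torch.zeros(n_params)

for t, (x, y) in enumerate(toy_batches(), start=1):
    optimizee.zero_grad()
    loss = nn.functional.cross_entropy(optimizee(x), y)
    loss.backward()
    g = torch.cat([p.grad.reshape(-1) for p in params])
    grad_sum += g

    if t % S == 0:
        # Every S iterations, feed the averaged gradient as the state description
        # and let the LSTM propose new parameter-wise (log) learning rates.
        state = (grad_sum / S).reshape(1, n_params, 1)  # (seq=1, batch, features)
        with torch.no_grad():
            out, hidden = lstm(state, hidden)
            log_lr = head(out).reshape(-1)
        grad_sum.zero_()

    # Hand-designed Adam step, scaled element-wise by the proposed learning rates.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    step = torch.exp(log_lr) * (m / (1 - beta1 ** t)) / ((v / (1 - beta2 ** t)).sqrt() + eps)
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p -= step[offset:offset + n].view_as(p)
            offset += n
```

Proposing rates only at every S-th iteration keeps the LSTM's effective unroll short, which is the point of the straight-line approximation described above.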
Setup
- Fig. 1 shows one step of the LSTM optimizer, where all LSTMs share the same parameters but have separate hidden states
- The red lines indicate gradients, and the blue lines indicate hypergradients
Figure 2: Computational graph for chaining the hypergradients from the optimizee.
Experiments
Experiment Setup
Figure 1: One step of an LSTM optimizer with hypergradient and gradient flow.
- In (Andrychowicz et al., 2016), an LSTM optimizer g with its own set of parameters φ is used to minimize the optimizee loss l, with updates of the form of Eq. 1 (reconstructed below)
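The body of Eq. 1 is not legible in this extraction; based on the cited work (Andrychowicz et al., 2016), it is presumably the learned update rule

\[
\theta_{t+1} = \theta_t + g_t\big(\nabla_\theta l(\theta_t), \phi\big), \tag{1}
\]

where the update g_t is produced directly by the LSTM optimizer with parameters φ and added to the optimizee parameters θ_t.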
- We have the following objective function, which depends on the entire training trajectory (Eq. 2, reconstructed below)
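The body of Eq. 2 is likewise missing here; following the trajectory objective of Andrychowicz et al. (2016), it presumably takes the form

\[
\mathcal{L}(\phi) = \mathbb{E}_{l}\!\left[\sum_{t=1}^{T} w_t\, l(\theta_t)\right], \tag{2}
\]

where the θ_t are the optimizee parameters produced by the update rule and the weights w_t ≥ 0 weight the losses along the trajectory.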
Generating Learning Rates
- We evaluate our method on MNIST, SVHN, and CIFAR-10
- A 2-hidden-layer MLP (20 neurons per layer) with sigmoid activation functions is trained using Adam with a batch size of 120
- Our LSTM optimizer controls the learning rates of Adam
- We use a fixed set of random initial parameters for every meta-iteration on all datasets
- The learning rates for the baseline optimizers are grid-searched from 10^-1 to 10^-10
- The LSTM optimizer has 3 layers of 20 hidden units each and is itself trained by Adam with a fixed learning rate of 10^-7
- The LSTM is trained for 5 meta-iterations and unrolled for 50 steps (see the configuration sketch below)
Results
- Fig. 3 shows that our LSTM optimizer (NOH) significantly improves the optimizee's convergence rate and final accuracy after training for 5 meta-iterations on MNIST
- We freeze the NOH and transfer it to SVHN and CIFAR-10, where it also improves optimizee performance
- Fig. 3 also shows that the learning rates for SVHN, and especially CIFAR-10, decay more slowly than those on MNIST
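Collecting the experiment setup above into one place, a hypothetical configuration sketch (illustrative field names, not the authors' code):

```python
# Hypothetical summary of the reported experimental setup; field names are illustrative.
EXPERIMENT = {
    "datasets": ["MNIST", "SVHN", "CIFAR-10"],          # NOH trained on MNIST, then frozen and transferred
    "optimizee": {
        "model": "MLP",
        "hidden_layers": [20, 20],                      # 2 hidden layers, 20 neurons each
        "activation": "sigmoid",
        "optimizer": "Adam",                            # learning rates proposed by the LSTM optimizer
        "batch_size": 120,
    },
    "baselines": {
        "lr_grid": [10.0 ** -k for k in range(1, 11)],  # 1e-1 ... 1e-10
    },
    "lstm_optimizer": {
        "layers": 3,
        "hidden_units": 20,
        "meta_optimizer": "Adam",
        "meta_learning_rate": 1e-7,
        "meta_iterations": 5,
        "unroll_steps": 50,
    },
}
```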
Issues
- The loss defined in Eq. 2 is equivalent to optimizing only the final loss when we set w_T = 1 (and all other weights to zero), but then back-propagation through time becomes inefficient
- The neural optimizer learns to update the parameters directly, which makes the learning task unnecessarily difficult
- The LSTM optimizer needs to be unrolled for too many time-steps, and thus training the LSTM optimizer itself is difficult
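Spelling out the first issue under the reconstructed Eq. 2: concentrating the trajectory weights on the final step collapses the objective to the final loss,

\[
w_t = \mathbb{1}[t = T] \;\Longrightarrow\; \mathcal{L}(\phi) = \mathbb{E}_{l}\big[\, l(\theta_T) \,\big],
\]

so the only learning signal arrives at step T and the gradient with respect to φ must still be back-propagated through all T unrolled updates, which is what makes BPTT inefficient in this setting.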
Proposed Solution
- In contrast to Eq. 1, we adopt the update rule shown in Eq. 3 (a reconstruction is sketched after the next bullet)
- where h(·), the input to the LSTM optimizer, is defined as the state description vector of the optimizee gradients at iteration t
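The body of Eq. 3 is also missing from this extraction. Given the surrounding description, namely that the LSTM proposes parameter-wise learning rates for a hand-designed optimizer from the state description h(·), it presumably has a form along the lines of

\[
\theta_{t+1} = \theta_t - \alpha_t \odot u_t, \qquad \alpha_t = g\big(h(\nabla_\theta l(\theta_t)),\, \phi\big), \tag{3}
\]

where u_t is the update direction produced by the hand-designed optimizer (Adam in our experiments), α_t is the vector of parameter-wise learning rates proposed by the LSTM optimizer g, and ⊙ denotes element-wise multiplication; the exact form in the paper may differ.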
Figure 3: Learning curves of the optimizee and randomly picked parameter-wise learning rate schedules for Adam on different datasets.
References
Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. Neural Information Processing Systems, 2016.