Attention based Efficient Model Adaptation

Anonymous
Abstract

We present a new method to efficiently adapt off-the-shelf deep neural networks to new computer vision tasks. This is achieved with the help of light-weight adaptation modules that learn to modify the hidden responses of such networks. During cross-task transfer, the modules are attached across convolutional units of the pre-trained network and trained from scratch using the supervised objective for the target task. The adapters generate attention masks that attenuate the output of the convolutional layers they are connected to. These masks effectively guide the subsequent stages of the pre-trained network towards possible regions of interest for the new problem. None of the convolutional layers in the pre-trained network are changed during the transfer. The frugal design of the proposed adapter modules ensures that an existing deep network can be extended to solve a new kind of problem with very few additional parameters. Compared to fine-tuning, which creates a copy of an entire network for every new task, the proposal is more scalable and efficient. Experiments are designed to evaluate the adaptation of networks trained on the imageNet dataset to mid-scale image classification benchmarks such as MIT SUN, MIT Indoor and Caltech 256. Results indicate that, with only 8–15% additional parameters, the attention based scheme achieves performance comparable to that of full fine-tuning.
One of the most remarkable virtues of neural networks is their transferability. Very deep networks (Krizhevsky, Sutskever, and Hinton 2012; He et al. 2016) trained with large scale datasets such as imageNet (Deng et al. 2009) or Places (Zhou et al. 2014) learn to recognize generic semantic elements (e.g. "faces" and "object parts") within their hidden layers (Zeiler and Fergus 2014). Such strong selectivity allows neural networks trained for one task (e.g. object recognition) to be adapted easily to multiple other related tasks (e.g. scene recognition, face verification, object tracking). A simple way to extend a pre-trained neural network to solve other problems is to use its activations as a generic image representation. For example, the responses of an imageNet trained object recognition network can be used as an image descriptor to train fine-grained recognizers, attribute detectors and object detectors with a reasonable degree of success (Donahue et al. 2014; Gong et al. 2014; Kaiming et al. 2014). An even more powerful approach is to
adapt the entire pre-trained neural network to the new problem. This process, commonly referred to as fine-tuning (Girshick et al. 2014), requires a few controlled iterations of supervised back-propagation on a dataset labeled for the new task. Due to its simplicity and good performance on a wide range of tasks, fine-tuning is now a widely accepted standard for knowledge transfer across neural networks (Girshick et al. 2014; Girshick 2015; Ren et al. 2015; Lin, RoyChowdhury, and Maji 2018; Zhou et al. 2016; Simonyan and Zisserman 2014). Despite its recognized merits, fine-tuning based model adaptation faces severe limitations in terms of scalability. First, the technique does not provide fine-grained control over the extent to which a network must be adapted. In most cases, all except the initial few layers of the network are updated during adaptation. These layers account for more than 90% of the total parameters of a network. As a result, for every new task fine-tuning adds tens of millions of parameters to the model repository. A second, perhaps related, problem is that of catastrophic forgetting (French 1999). Upon fine-tuning, a neural network forgets the original task completely. Thus, an object recognition net adapted to perform face verification no longer remains generic enough to recognize other objects. The original model has to be retained separately along with its versions specialized for different problems. Due to these issues, fine-tuning does not appear to be an efficient and scalable solution for new computer vision tasks. A possible approach to achieving efficient adaptation is to set aside a small number of layers in a pre-trained network for task-specificity. For every new task, only these parameters are learned through adaptation while the rest of the network is kept unchanged. Lightweight layers such as batch normalization, which primarily act as regularizers (Ioffe and Szegedy 2015), could also potentially be candidates for task specific tuning. They are, however, too limited to ensure a performance that is comparable to full fine-tuning. Rebuffi et al. propose a residual module (Rebuffi, Bilen, and Vedaldi 2017) that combines a batch normalization layer and a linear layer embedded in the original network for adaptation. A more expressive adapter helps them achieve task-specific accuracy close to that of full fine-tuning at the cost of very few extra parameters. The most critical problem with their method, however, is that the original large scale model must be trained with the residual
modules in it. A neural network pre-trained without these modules cannot be augmented at the time of task specific learning. Their technique, therefore, is perhaps more suited to a multi-task learning scenario, where the whole system is trained concurrently on the original large scale problem (e.g. imageNet object recognition) and other related small scale problems (e.g. SUN scene classification, VOC object detection etc.). In this work, we present a framework for network adaptation that is more efficient and scalable than fine-tuning as well as the residual adapters method in (Rebuffi, Bilen, and Vedaldi 2017). We propose to learn light-weight adaptation modules that attach to the filters in a pre-trained large scale neural network and attenuate their activations using generated attention masks. During our adaptation process, most units from the original pre-trained network are kept unchanged while the parameters within the adapter modules are learned to minimize a supervised loss for the new task. The attention masks generated by the proposed adapters serve to guide the subsequent layers of the pre-trained network towards areas of the receptive field relevant to the new problem. We use residual networks (He et al. 2016) pre-trained on imageNet object recognition as our universal model and adapt it to other tasks such as scene classification using the SUN and MIT Indoor datasets (Quattoni and Torralba 2009; Xiao et al. 2010) and unseen object recognition on the Caltech 256 dataset (Griffin, Holub, and Perona 2006). On these benchmarks, the proposed attention based networks perform almost as well as fine-tuning while adding only 8–15% more parameters for each new task. Our adaptation method is also comparable in performance with the residual adapter framework in (Rebuffi, Bilen, and Vedaldi 2017). It is noteworthy that the attention modules are not pre-trained on imageNet, unlike the residual adapters of (Rebuffi, Bilen, and Vedaldi 2017). They are merely attached to the filters of the learned imageNet model during adaptation and trained from scratch. In fact, we show that such structural augmentations cannot be made using the residual adapters (Rebuffi, Bilen, and Vedaldi 2017), mainly because they are serial units and cause a large co-variate shift at the input of the subsequent layers which hinders learning. On the other hand, if the proposed attention modules are pre-trained within an imageNet network and then fine-tuned to each task, they outperform the method in (Rebuffi, Bilen, and Vedaldi 2017).
Attention based Model Adaptation

Our model adaptation architecture leverages the residual convolutional neural networks (ResNets) introduced in (He et al. 2016). Every stage of a ResNet consists of a basic residual block, shown in Fig. 1 a). It is composed of two convolution filters W0 and W1, each followed by a batch normalization and a scaling layer. The input is short-circuited to the output of this block in order to impose a structural constraint that makes training of very deep architectures possible (He et al. 2016). An input map x0 ∈ R^{N×D×D}, where N indicates the number of channels and D indicates the spatial dimensions, gets transformed by the residual block as
z0 = W0 ∗ x0,   y0 = (z0 − μ0) / √(σ0² + ε0),   x1 = γ0 y0 + β0,   (1)
z1 = W1 ∗ x1,   y1 = (z1 − μ1) / √(σ1² + ε1),   x2 = γ1 y1 + β1,   (2)
x̃0 = x0 + x2.   (3)

Figure 1: a) The basic building block of a Residual network (He et al. 2016). b) The ResAdaptNet block with residual adapters (Rebuffi, Bilen, and Vedaldi 2017). c) Our AttnNet block, with attention based modules H(., φi) attached across the conv layers.

The convolution filters in a basic block have a kernel size of 3 × 3 and the number of channels C ∈ {64, 128, 256, 512} varies depending on the depth of the block within the network. The batch normalization parameters μ0, μ1 ∈ R^C and σ0, σ1 ∈ R^C_+ are the means and variances estimated from the input minibatch. The scaling parameters γ0, γ1 and β0, β1 serve to restore the identity of the normalized input signal (Ioffe and Szegedy 2015). The processed output x2 ∈ R^{N×D×D} of the block is then added back to the input signal x0 using a short circuit connection. Fine-tuning to adapt the ResNet entails updating the Wi's in each residual block according to data driven gradients. Our approach instead is to attenuate the responses zi ∈ R^{C×D×D} with attention masks ai ∈ R^{C×D×D} generated by trainable parallel modules H(., φi):
a0 = H(x0, φ0),   ẑ0 = z0 ⊙ a0,   (4)
μ̂0 = E{ẑ0},   σ̂0² = Var{ẑ0},   (5)
y0 = (ẑ0 − μ̂0) / √(σ̂0² + ε0).   (6)

Here ⊙ denotes a pointwise multiplication between the convolutional map z0 and the attention map a0, which produces the masked output ẑ0. The batch norm layer parameters, as a result, change into the running means and variances {μ̂0, σ̂0} of the masked input ẑ0. Fig. 1 c) shows a schematic of a Resnet block with attention based modules attached to its conv layers Wi.
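For concreteness, the masking in (4)–(6) can be sketched in PyTorch as below. This is a minimal illustration rather than the authors' implementation; the class and attribute names (AttenuatedConv, attn) are ours, and the attention module H(., φ0) is passed in as an arbitrary sub-network.

```python
import torch
import torch.nn as nn

class AttenuatedConv(nn.Module):
    """Wraps a frozen pre-trained conv W0 and attenuates its response z0
    with a mask a0 = H(x0, phi0), following Eqs. (4)-(6)."""
    def __init__(self, pretrained_conv: nn.Conv2d, attention_module: nn.Module):
        super().__init__()
        self.conv = pretrained_conv      # W0, kept frozen during adaptation
        self.attn = attention_module     # H(., phi0), trained from scratch
        # Batch norm is applied to the *masked* response, so its running
        # statistics become the {mu_hat, sigma_hat} of Eq. (5); its affine
        # parameters play the role of the scaling layer.
        self.bn = nn.BatchNorm2d(pretrained_conv.out_channels)

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        z0 = self.conv(x0)               # z0 = W0 * x0
        a0 = self.attn(x0)               # attention mask with values in [0, 1]
        z0_hat = z0 * a0                 # pointwise attenuation, Eq. (4)
        return self.bn(z0_hat)           # normalize + scale, Eqs. (5)-(6)
```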
The effect of attenuating the responses of these filters propagates through the rest of the module. For example, a convolutional layer that follows the masked unit processes the input values in its receptive field as well as their relative importance as indicated by the ai values. The role of the attention modules H(., φi) can thus be seen as guiding the imageNet layers to pay attention to the regions that are important to the new task. An illustration of this attention based attenuation of network activations is shown in Figure 2.

Figure 2: An illustration of response attenuation. The attention mask generated by our trainable, light-weight adapters is combined with the activations of the imageNet trained network and propagated forward.

There are many possible ways to implement the attention modules H(., φ). For reasons of efficiency, however, our emphasis is on light-weight designs with sufficient flexibility to achieve good results on any new task. The following section describes the two adapter prototypes that we used in our experiments.
Adapter Designs

Figure 3 illustrates the two designs, Attn-1 and Attn-2, for the attention based adapter module that we used in our experiments. In both cases, a convolution W0 with Co filters of spatial size k × k and input dimension Cin denotes the imageNet trained layer that is being attenuated.

Figure 3: Architectures of the adapters H(., φi) used in our method: a) Attn-1, b) Attn-2. BN denotes a batch normalization layer, Scale indicates a scaling and translation layer and σ(.) denotes a sigmoid non-linearity.

Attn-1 samples the input to the imageNet layer W0 and propagates it through another, parallel convolutional unit that has Cm filters, each of size Cin × k × k. The output is then batch normalized, scaled and transformed into Cm dimensional attention maps using a sigmoid non-linearity σ(.). The number of attention masks is kept smaller than the total number of channels in the response of W0, i.e. Cm < Co. To match the size of the activation map of W0 exactly, the Cm attention masks are tiled. Attenuating the output response of W0 with tiled attention maps explicitly enforces multiple channels of the imageNet filter to share common areas of the receptive field. This serves as a structural regularization for the adaptation process.

The architecture of Attn-2 uses two sequential convolutional stages within the adapter. The first stage samples the input to W0 and convolves it with Cm depthwise filters of size Cin/R × k × k. Depthwise convolutions are performed by equally dividing the channels of the input signal into R subsets, each of which is processed by one filter (Chollet 2016). While an obvious advantage of depthwise convolution
is faster computation, our primary goal in using this strategy is to reduce the number of parameters and keep the adapter lightweight. The output of the depthwise conv layer is subjected to a ReLU non-linearity and 1 × 1 convolutions as shown in the figure. The generated map is transformed into Cm attention masks with a batchnorm-scaling-sigmoid stage similar to that of Attn-1. The masks are then tiled to match the size of the imageNet output and multiplied with it pointwise for the desired attenuation.

The benefit of using a sigmoid non-linearity in our framework is twofold. First, a sigmoid is known to learn slowly because its gradient σ(x)(1 − σ(x)) is small for most activation values and close to zero when the activation saturates (i.e. σ(x) approaches 0 or 1). As a result, the adapter unit receives gentler gradients and updates its parameters slowly. The adaptation rate is, therefore, implicitly controlled, unlike in fine-tuning where the only way to do so is to manipulate the learning rate. A second benefit is that the range of the sigmoid output is [0, 1], unlike that of a linear or a ReLU unit. A sigmoid output, therefore, does not alter the dynamic range of a feature x when multiplied with it. We believe that this prevents large covariate shifts (Ioffe and Szegedy 2015) in the signal attenuated by our adapters and keeps the learning process from slowing down. Learning a randomly initialized serial unit (e.g. a conv filter, a residual block or a residual adapter (Rebuffi, Bilen, and Vedaldi 2017)) between layers of a pre-trained imageNet network, however, will suffer from this problem. We briefly demonstrate this effect in our experimental section.
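To make the Attn-2 description concrete, here is a minimal PyTorch-style sketch based on our reading of the text above. The class name Attn2Adapter and the default hyper-parameters are assumptions; the paper specifies only the overall structure (depthwise k × k convolution with grouping factor R, ReLU, 1 × 1 convolution, batchnorm-scaling-sigmoid, tiling to Co channels).

```python
import torch
import torch.nn as nn

class Attn2Adapter(nn.Module):
    """Sketch of the Attn-2 module H(., phi): depthwise k x k conv (grouping
    factor R) -> ReLU -> 1 x 1 conv -> BN/scale -> sigmoid -> tile to Co."""
    def __init__(self, c_in: int, c_out: int, c_m: int = 128, k: int = 3, r: int = 4):
        super().__init__()
        assert c_in % r == 0 and c_m % r == 0 and c_out % c_m == 0
        self.c_out, self.c_m = c_out, c_m
        # Grouped conv: each filter sees only c_in / r input channels, so its
        # size is (c_in / r) x k x k as in the text. Assumes the wrapped
        # imageNet layer W0 has stride 1; otherwise the adapter conv would
        # need a matching stride for the mask and response sizes to agree.
        self.dw = nn.Conv2d(c_in, c_m, k, padding=k // 2, groups=r, bias=False)
        self.pw = nn.Conv2d(c_m, c_m, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_m)    # batch norm with learned scale/shift

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.bn(self.pw(torch.relu(self.dw(x0)))))
        # Tile the c_m masks c_out / c_m times along the channel axis so the
        # mask matches the c_out-channel response of the imageNet layer W0.
        return a.repeat(1, self.c_out // self.c_m, 1, 1)
```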
Relation to Gradient Descriptors

The proposed adaptation method generates an embedding that is reminiscent of gradient based representations used in classical vision, namely Fisher vectors (FV) (Sánchez et al. 2013). An FV uses a Gaussian mixture codebook to encode image features into a high dimensional descriptor. Each Gaussian codeword N(μk, Wk) transforms a descriptor x0 into
zk ∝ ak Wk (x0 − μk),   (7)

where ak ∈ [0, 1] represents the posterior probability of x0 being sampled from the k-th codeword. In a two codeword case, a0 can be expressed as a sigmoid function. The input transformation due to our adapters, shown in Figure 1 c), is quite similar to an FV. In our framework, the input x0 propagates through a linear layer W0 and a parallel sigmoid layer H(., φ0) to generate an output z0 = W0 ∗ x0 and its confidence mask a0, respectively. The map z0 is then scaled with the confidence a0 and the result is centered using a batch norm layer. If W0 is assumed to have Co 1 × 1 filters, each element ẑ0^{i,j} of the masked output can be expressed as a Co dimensional FV of the element x0^{i,j} under a Gaussian codeword with mean μ̂0¹:

ẑ0^{i,j} ∝ a0^{i,j} ( W0 ∗ x0^{i,j} − μ̂0 ).   (8)

¹For the sake of this comparison, we drop the normalization term √(σ̂0² + ε0), as it can be absorbed into the subsequent scaling layer.

An average pooling over the entire attenuated map ẑ0 will result in a pooled FV descriptor as described in (Sánchez et al. 2013). Model adaptation is performed using the encoding in (8) by training a task-specific branch H(., φ0) that generates the attention or confidence map a0.

Table 1: Comparison of the proposed method with fine-tuning, classifier training and residual adapters (Rebuffi, Bilen, and Vedaldi 2017) on MIT SUN scene, MIT Indoor scene and Caltech 256 object classification. Performance is reported as % accuracy. The number of additional parameters needed for adaptation is reported as a percentage of the total parameters in Resnet-18. (*) indicates that the adapter modules were pre-trained with the imageNet model.

Model          nparams (%)   SUN (%)   MIT Indoor (%)   Caltech (%)
FT             100           55.6      72.5             80.3
CLS            -             48.8      66.6             77.2
BN             ≤1            52.9      69.2             78.6
AttnNet (4)    8             54.1      70.0             79.64
AttnNet (2)    15            54.9      69.9             79.6
AttnNet* (4)   8             55.9      70.7             79.75
AttnNet* (2)   15            56.2      70.0             80.0
ResAdaptNet*   11            54.7      70.7             79.43
Residual Adapters

Figure 1 b) depicts the residual adapter method presented in (Rebuffi, Bilen, and Vedaldi 2017) for multi-domain learning. In this method, task specific adaptation is achieved using modules H(., φ0) that consist of a batch normalization and scaling layer, followed by a 1 × 1 convolution stage and a residual shortcut. The residual adapter samples the output z0 = W0 ∗ x0 of a pre-trained imageNet conv layer and attenuates it as

z0 = W0 ∗ x0,   y0 = (z0 − μ0) / √(σ0² + ε0),   ζ0 = γ0^t y0 + β0^t,
δ0 = W0^t ∗ ζ0,   ẑ0 = z0 + δ0.   (9)
In short, the output z0 of the imageNet layer W0 is batch normalized using {μ0, σ0}, scaled using the task specific {γ0^t, β0^t} and affine transformed using a task specific conv layer W0^t before being added back to itself through a shortcut connection. The attenuated ẑ0 then propagates to the subsequent layers of the residual block.

Residual adapters, like our attention modules, provide a parameter efficient framework for model adaptation. There are, however, some key differences between the two frameworks. First, the residual adapters are composed of fully linear operations (BN, scaling, 1 × 1 conv). The combined effect of these adapters and the conv layer preceding them is, therefore, a single affine transformation of the input x0 (Rebuffi, Bilen, and Vedaldi 2017). Our attention based adapters, on the other hand, transform the input into masks using a sigmoid (see the Adapter Designs section). The attenuating transformation H(x0, φ0) is, therefore, non-linear and likely to be more invariant. Second, the residual adapters in (Rebuffi, Bilen, and Vedaldi 2017) need to be pre-trained within the imageNet model before they are adapted to a new task; the adaptation of these modules to different tasks is, therefore, equivalent to fine-tuning. The attention based adapters proposed in this work can be attached across the conv layers of any off-the-shelf neural network and trained from scratch on a new task. In fact, we show that it is necessary to pre-train residual adapters on imageNet: training them from scratch during task adaptation fails completely. The likely reason is that residual adapters are serial layers, and randomly initializing them between successive pre-trained layers may result in large co-variate shifts in the activations of subsequent layers.
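For comparison, a minimal sketch of the serial residual adapter transformation in (9), as we understand it from (Rebuffi, Bilen, and Vedaldi 2017); this is a paraphrase of the equation above, not their released code.

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """z0 -> BN -> task-specific scale/shift -> task-specific 1x1 conv,
    added back to z0 through a shortcut (Eq. 9)."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)   # {mu0, sigma0} and {gamma0^t, beta0^t}
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)  # W0^t

    def forward(self, z0: torch.Tensor) -> torch.Tensor:
        zeta0 = self.bn(z0)            # normalize, then task-specific scaling
        delta0 = self.conv1x1(zeta0)   # task-specific affine transform
        return z0 + delta0             # shortcut: z0_hat = z0 + delta0
```

Unlike the attention adapters, this module sits serially on the response z0, which is why a random initialization between pre-trained layers shifts the statistics seen by the layers that follow.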
Overall Architecture and Efficiency

To adapt an imageNet trained neural network, we attach the proposed attention modules only across its final few layers. These layers are generally known to exhibit semantic selectivity and, therefore, have to be tuned with task-specific information. The initial layers of the network, on the other hand, capture generic visual primitives and can be used as a single feature extractor for all the tasks. Additionally, the attention modules are designed to be very lightweight compared to the imageNet trained convolutional units that they attach across: these modules amount to only about 10% of the size of the entire imageNet network. Our motivation behind these design choices is to enable efficient multi-task inference in environments with limited storage and memory resources (e.g. mobile devices, autonomous drones and AR/VR headsets). The lightweight design of our attention based modules allows a smaller storage requirement for the entire multi-task classifier. If the size of a base network trained on imageNet is P, its adaptation to N new tasks
using fine-tuning will increase the model size to (N + 1)P. Using our method, the overall model size is only P(1 + N/10), since, for every new task, only ∼P/10 parameters are added. A second obvious advantage of our design is the speed of inference. Since our task specific adapters are attached only to the last few layers of the base network, the initial layers can be used as a common feature extraction stage for multiple tasks. This saves a large amount of processing time, since the feature extraction layers are the slowest. On the other hand, multi-task inference with fully fine-tuned networks requires full forward propagation, sequentially, through all the task-specific networks. This makes them slower than the proposed adaptation method, despite the added computation in the attention modules required for the latter. Note that in (Rebuffi, Bilen, and Vedaldi 2017) the residual adapters are present in all layers, not just the last few. As a result, no part of their network can be identified as a common feature extractor for all tasks. The latency of sequential multi-task inference using their method is, therefore, just as high as it is for fine-tuning.

In some specific cases of multi-task learning, a multi-head network architecture is preferred over full fine-tuning for each individual task (He et al. 2017). The initial layers of such a network are task-agnostic feature extractors, whereas the final few are tuned to the different tasks of interest (e.g. detection, segmentation, proposal generation, counting etc.). This framework appears similar to ours, but there are two main differences. First, while a multi-head classifier results in fewer model parameters compared to fine-tuning, it is still more expensive than the proposed method. The task-specific layers (heads) in multi-head architectures have a large number of parameters; a single residual block at the top of Resnet-18, for example, has about 40% of the network's parameters. Fine-tuning it to different tasks, therefore, adds more parameters to the model than our method. Second, a multi-head architecture is feasible only for very related tasks, such as detection and segmentation of the same stimuli. In such cases, most of the network layers can be assigned to extract common features, whereas relatively smaller heads (one or two residual stages) can be assigned to perform task-specific processing. In our case, we do not assume that the tasks are so closely related. In fact, in our experiments, we show successful adaptation of an object recognition network to holistic scene classification using our attention based framework.
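As a quick, illustrative check of the storage comparison above (the numbers are ours: P ≈ 11.7M is the approximate parameter count of a Resnet-18, and 10% is the adapter overhead quoted above):

```python
def multitask_model_size(P: float, N: int, adapter_frac: float = 0.10):
    """Total stored parameters for N adapted tasks: fine-tuning keeps a full
    copy per task, the attention adapters add only adapter_frac * P per task."""
    finetune = (N + 1) * P                  # original model + N fine-tuned copies
    adapted = P * (1 + adapter_frac * N)    # shared backbone + N adapter sets
    return finetune, adapted

# Example: a Resnet-18 (~11.7M parameters) adapted to 5 new tasks.
ft, attn = multitask_model_size(P=11.7e6, N=5)
print(f"fine-tuning: {ft / 1e6:.1f}M params, attention adapters: {attn / 1e6:.1f}M params")
# Fine-tuning stores roughly 70M parameters versus roughly 18M with adapters.
```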
Experiments

We present a comprehensive evaluation of our attention based adaptation method using off-the-shelf networks trained on imageNet object recognition. As target datasets in the adaptation scenario, we use well known image classification benchmarks: MIT Indoor, MIT SUN and Caltech 256. MIT SUN (Xiao et al. 2010) consists of images from 397 classes of indoor and outdoor scenes. The dataset provides a wide coverage of known scene categories and has more than 100 images per class. The standard procedure for evaluation on this benchmark is to use 50 images per class for training and 50 images per class for testing. The MIT Indoor dataset (Quattoni and Torralba 2009) has 67 categories of indoor scenes such as "office", "bedroom", "livingroom", "store" etc. A training-test
split of roughly 80-20 images per class is specified by the authors. Caltech 256 (Griffin, Holub, and Perona 2006) was the most challenging benchmark for object image classification before imageNet. It has images belonging to 256 diverse object categories. On this dataset, training is performed with 60 images per class and the remaining images are used for testing. Performance on all benchmarks is reported in terms of average per-class classification accuracy.

For all our experiments we use a standard Resnet-18 (He et al. 2016) and its extension, which we refer to as Resnet-26. A Resnet-18 model has the sequential structure conv1(64 × 7 × 7) − 2×R(64) − 2×R(128) − 2×R(256) − 2×R(512). Here, conv1(64 × 7 × 7) denotes the initial convolutional layer, whereas K×R(.) denotes a sequence of K residual blocks. An example of a residual block is depicted in Figure 1 a); thus, R(128) is a residual block with two conv(128 × 3 × 3) filters. To extend Resnet-18 we add one residual block to each of its stages and construct a deeper Resnet-26: conv1(64 × 7 × 7) − 3×R(64) − 3×R(128) − 3×R(256) − 3×R(512). To adapt imageNet trained Resnets, we attach our attention modules across the conv filters in R(256) and R(512). The resulting networks are referred to as AttnNets. The residual adapters framework of (Rebuffi, Bilen, and Vedaldi 2017) needs pre-trained Resnets with the adapters already in them; unlike the Attn modules, the residual adapters cannot be attached to an off-the-shelf imageNet model and trained from scratch. Therefore, we first train the Resnets augmented with residual adapters on the imageNet dataset. The adapter modules in this pre-trained model are then fine-tuned for a new task. We refer to these adapted networks as ResAdaptNets. In addition to (Rebuffi, Bilen, and Vedaldi 2017), we also evaluate AttnNets against common baselines such as full fine-tuning and classifier-only training. During adaptation, all networks are trained with a learning rate of 0.001, which is dropped by 1/10th approximately every 10000 iterations. The learning rate for the classifier layer is set to be 10 times higher than that of the lower layers. This is a commonly used approach that we prefer because it ensures the best performance for the fine-tuning baseline.
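The training protocol above can be sketched as follows. This is an illustrative setup, not the authors' code: the adapters list is a simplified stand-in for the attention modules attached across the conv layers of R(256) and R(512), 397 is the number of SUN classes, and the use of SGD with momentum 0.9 is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Off-the-shelf imageNet Resnet-18 (torchvision >= 0.13 weights API);
# all pre-trained layers stay frozen during adaptation.
backbone = resnet18(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False

# New task-specific classifier (e.g. 397 SUN classes).
backbone.fc = nn.Linear(backbone.fc.in_features, 397)

# Stand-in for the attention adapters attached across the conv layers of
# layer3 / layer4 (R(256) and R(512)); in the real model these would be
# Attn-2 style modules wired in parallel to each conv, as sketched earlier.
adapters = nn.ModuleList([
    nn.Sequential(nn.Conv2d(c, 128, 3, padding=1, bias=False),
                  nn.BatchNorm2d(128),
                  nn.Sigmoid())
    for c in (256, 256, 512, 512)
])

base_lr = 0.001
optimizer = torch.optim.SGD(
    [
        {"params": adapters.parameters(), "lr": base_lr},
        {"params": backbone.fc.parameters(), "lr": 10 * base_lr},  # classifier at 10x lr
    ],
    lr=base_lr,
    momentum=0.9,
)
# Drop the learning rate by 1/10 roughly every 10000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
```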
Comparison of Attention Modules

We begin with an experiment that compares the two attention architectures proposed in the Adapter Designs section by using them to adapt an imageNet trained Resnet-18 to the SUN dataset. For the Attn-1 module, we design the conv layer with a kernel size of 3 and vary its output dimension in the range Cm ∈ {8, 16, 32, 64}. A higher dimension corresponds to a heavier adapter. Cm also represents the number of unique attention masks that are generated within the module. At the output, these are tiled Co/Cm times to match the size of the response map generated by the imageNet conv layer W0. The tiled maps are then pointwise multiplied to attenuate the response. The results for different numbers of masks Cm are reported in terms of the accuracy of the adapted model on SUN. For the Attn-2 module, we use a kernel size of 3 for the first conv layer and a fixed output dimension Cm = 128. The second conv layer is also of a fixed size 1 × 1 × Cm. To vary the parameters of the module, we use the depthwise convolution strategy in the first conv layer. Specifically, we allow each conv filter in this layer to process only Cin/R channels of the input. This limits the size of the filters to Cin/R × 3 × 3 instead of Cin × 3 × 3 as in standard convolution. The channel grouping factor is varied as R ∈ {1, 2, 4, 16, 32, 128} to vary the complexity of Attn-2, and its impact is also reported in terms of accuracy on the SUN benchmark.

Figure 4: Performance comparison of modules Attn-1 and Attn-2 on SUN scene classification (accuracy vs. approximate number of adaptable parameters, ×10^6). Attn-1 saturates while Attn-2 improves with an increase in the number of parameters.

Figure 5: Both Resnet-26 and ResAdaptNet-18 are trained with a pre-trained Resnet-18 initialization. Layers common with the pre-trained net are copied over and frozen. The remaining, randomly initialized layers are trained from scratch on SUN classification. (Test accuracy vs. number of training iterations; a) ResAdaptNet-18 under training schedules 1–4, b) Resnet-26 under training schedules 1–5, each compared against the AttnNet.)

Figure 4 shows the performance of both modules plotted against the approximate number of adaptable parameters used in the network. When the size of the adapter is restricted (< 0.3 million parameters), the relatively deeper Attn-2 fares better. The performance of Attn-1 improves steadily with an increase in the number of unique masks Cm and, in fact, narrowly outperforms Attn-2 when both networks train 0.5 to 1 million adaptable parameters. The improvements in Attn-1, however, saturate at Cm = 32 (∼1M params), while Attn-2 continues to improve as its grouping factor R is gradually reduced to 1 (standard convolution). Due to the clear superiority of Attn-2, we use it as our adapter of choice for the remainder of the experiments. The models adapted using Attn-2 are simply referred to as AttnNet.
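As a rough, back-of-the-envelope illustration of how Cm and R control adapter size (this estimate is ours; it ignores batch norm and scaling parameters and assumes the 1 × 1 stage of Attn-2 maps Cm channels to Cm channels):

```python
def attn1_params(c_in: int, c_m: int, k: int = 3) -> int:
    """Attn-1: a single k x k conv producing c_m masks."""
    return c_m * c_in * k * k

def attn2_params(c_in: int, c_m: int = 128, k: int = 3, r: int = 1) -> int:
    """Attn-2: depthwise k x k conv with grouping factor r, then a 1 x 1 conv."""
    return c_m * (c_in // r) * k * k + c_m * c_m

# For a single 512-channel imageNet layer, increasing R shrinks Attn-2:
for r in (1, 2, 4, 16, 32, 128):
    print(f"R = {r:3d}: Attn-2 ~ {attn2_params(512, r=r):,} params")
print(f"Cm = 32: Attn-1 ~ {attn1_params(512, 32):,} params")
```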
Transfer

In this set of experiments we compare the performance of AttnNet with ResAdaptNets, full fine-tuning (FT), batch norm fine-tuning (BN) and classifier re-training (CLS), using Resnet-18 as the backbone architecture. FT simply adapts 100% of the network and is, therefore, the most expensive method. CLS retrains only the pre-softmax linear classifier and does not entail tuning any hidden parameters. In BN, only the batch norm-scaling layer pairs are fine-tuned to the new task, which account for less than 1% of the Resnet parameters. AttnNet (R) indicates an attention module that implements depthwise convolution of the input map by dividing it into R groups of channels. For a fair comparison with ResAdaptNets, we also trained a version of AttnNet that is pre-trained on imageNet with the adapters in place (denoted AttnNet*). These are then fine-tuned to the task of interest as in (Rebuffi, Bilen, and Vedaldi 2017). All the networks are evaluated on SUN, MIT Indoor and Caltech classification. The results shown in Table 1 support interesting conclusions.
First, among all the hidden layers of a deep network, fine-tuning the batch normalization layers does not seem to be a bad idea. The BN layers account for less than 1% of the total model parameters, yet the performance of BN adaptation is significantly better than simple classifier re-training: BN achieves accuracies of 52.9% and 78.6% on SUN and Caltech respectively, a 1–4% improvement over CLS at negligible cost. Second, AttnNet (2) reports a performance comparable to that of full fine-tuning, which is quite remarkable; FT leverages tens of millions of parameters to fit to the new task, while AttnNet (2) uses only about 15% of that. A third conclusion is that AttnNet (4) is comparable to ResAdaptNet without having to be pre-trained on imageNet, and when pre-training is pursued, a fine-tuned AttnNet* (4) outperforms the ResAdaptNet on both SUN and Caltech. Note that both require around 10% additional parameters. A final observation is that the scale of improvements on Caltech is much smaller than on SUN, perhaps because both imageNet and Caltech lie in the same visual domain (object images). The pre-trained model is, therefore, already quite representative of Caltech images and does not gain much from adaptation. SUN, on the other hand, is a completely separate domain (scene images); as a result, improvements due to a better approach are very evident. Similar experiments are conducted using Resnet-26, an extended version of Resnet-18, on the SUN, Caltech and MIT Indoor scene datasets. The results, reported in Table 2, reflect very similar trends.
Importance of Attention

Next we try to ascertain the importance of sigmoidal attention to our style of model adaptation. While spatial attention based signal attenuation is intuitive, it is by no means the only way to change a signal passing through a network. A simple alternative is to attach a convolutional layer across an existing imageNet conv layer as an additive branch. We evaluate this approach using a pre-trained Resnet-18. As adapters, we attach 1 × 1 convolutional layers across the conv layers in the last four residual blocks (R(256)-R(256)-R(512)-R(512)).
Table 2: Comparative evaluation of the proposed method, using a Resnet-26, on MIT SUN scene, MIT Indoor scene and Caltech 256 object classification (% accuracy; nparams as in Table 1).

Model          nparams (%)   SUN (%)   Caltech (%)   MIT Indoor (%)
FT             100           56.4      81.9          71.8
CLS            -             48.4      78.6          65.9
AttnNet (4)    8             55.6      80.9          71.9
AttnNet (2)    15            56.3      80.9          68.9
AttnNet* (4)   8             56.1      80.8          70.5
ResAdaptNet    11            55.7      81.7          70.2
Table 3: Additive convolution as an alternative to attention based adapters (% accuracy).

Model         SUN     Caltech   MIT Indoor
AttnNet (4)   54.14   79.64     70.0
AddConv       52.4    78.64     68.6
The number of filters in the adapter units matches the number of filters in the pre-trained conv layers they attach to. The adapter outputs, in this case, are added to the network responses. The results in Table 3 show that the attention based unit outperforms the additive unit (denoted AddConv) on both SUN and Caltech. In fact, the latter performs even worse than just fine-tuning the batch norm units (BN). The reason for this, perhaps, is the relative inability of the network to learn the new task. Note that, unlike the BN layers that are fine-tuned, the additive layer is initialized at random during training. The high dynamic range of the linear adapter output may cause large covariate shifts, making optimization difficult. A sigmoid, in comparison, is much more invariant due to its limited output range of [0, 1]. Attention based adaptation, therefore, is more stable due to the nature of the sigmoid non-linearity.²

²We also tried replacing the sigmoid with tanh and ReLU; neither performed well.
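A minimal sketch of this additive baseline is given below (AddConvAdapter is a hypothetical name of ours). It assumes the wrapped imageNet conv preserves the spatial size (3 × 3 with padding 1), so the 1 × 1 branch output can be added directly to the response.

```python
import torch
import torch.nn as nn

class AddConvAdapter(nn.Module):
    """Baseline adapter: a randomly initialized 1 x 1 conv attached in parallel
    to a frozen imageNet conv. Its (unbounded) output is *added* to the network
    response instead of multiplying it with a [0, 1] attention mask."""
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.conv = pretrained_conv                    # frozen W0
        self.branch = nn.Conv2d(pretrained_conv.in_channels,
                                pretrained_conv.out_channels,
                                kernel_size=1,
                                stride=pretrained_conv.stride,
                                bias=False)

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        return self.conv(x0) + self.branch(x0)         # additive, not attentive
```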
Network Structure Adaptation

A key feature of our method is that it can be used directly with any off-the-shelf neural network. AttnNets do not require their adapters to be trained on imageNet beforehand; they can be plugged across any layer of an existing model and learned using small scale datasets. ResAdaptNets (Rebuffi, Bilen, and Vedaldi 2017) cannot be learned from scratch, perhaps also for reasons of covariate shift: the residual adapters process inputs serially and, therefore, initializing them at random between pre-trained units may cause difficulties in learning. To study this effect, we initialize the standard layers in a ResAdaptNet using an imageNet trained Resnet-18. The remaining adapter parameters are initialized at random. Despite trying several training schedules (different choices of iterations and learning rate steps), the network does not succeed in learning. The test error trends on SUN are shown in Figure 5 a). To verify that ResAdaptNet is not a special case, we perform another experiment in which we try to learn a Resnet-26 using a pre-trained Resnet-18. Resnet-26 is constructed by introducing an additional residual block sequentially in each stage of Resnet-18. During training, the parameters in these blocks are initialized at random, whereas the other layers are
initialized using the imageNet Resnet-18. The training behavior is very similar to that of ResAdaptNet, as shown in Figure 5 b). Compared to both these methods, an AttnNet is much more successful at learning without deterministic initialization; its test error (also shown in Figure 5) decays much faster than the rest.
Conclusion

We propose a method to perform efficient adaptation of imageNet trained neural networks to smaller vision tasks. It relies on light-weight attention modules that can be attached to units in the pre-trained network. These modules are then trained from scratch using a supervised objective for the desired new task. Most of the pre-trained network layers are left unchanged during the process. Within the adapted model, the attention modules guide the subsequent layers towards input regions that are related to the new task. The proposed adaptation method performs as well as complete fine-tuning while using far fewer parameters for every new task.
References

Chollet, F. 2016. Xception: Deep learning with depthwise separable convolutions. CoRR abs/1610.02357.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 248–255.
Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; and Darrell, T. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML).
French, R. M. 1999. Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. Trends in Cognitive Sciences 3(4):128–135.
Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Girshick, R. 2015. Fast R-CNN. In IEEE International Conference on Computer Vision (ICCV).
Gong, Y.; Wang, L.; Guo, R.; and Lazebnik, S. 2014. Multi-scale orderless pooling of deep convolutional activation features. In European Conference on Computer Vision (ECCV), volume 8695, 392–407.
Griffin, G.; Holub, A.; and Perona, P. 2006. The Caltech-256. Technical report, Caltech.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.
He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV).
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167.
Kaiming, H.; Xiangyu, Z.; Shaoqing, R.; and Sun, J. 2014. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision (ECCV).
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 1097–1105.
Lin, T.-Y.; RoyChowdhury, A.; and Maji, S. 2018. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6):1309–1322.
Quattoni, A., and Torralba, A. 2009. Recognizing indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 413–420.
Rebuffi, S.-A.; Bilen, H.; and Vedaldi, A. 2017. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems (NIPS).
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS).
Sánchez, J.; Perronnin, F.; Mensink, T.; and Verbeek, J. J. 2013. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision 105(3):222–245.
Simonyan, K., and Zisserman, A. 2014. Two-stream convolutional networks for action recognition in videos. CoRR abs/1406.2199.
Xiao, J.; Hays, J.; Ehinger, K.; Oliva, A.; and Torralba, A. 2010. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3485–3492.
Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 818–833.
Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; and Oliva, A. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems (NIPS).
Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; and Torralba, A. 2016. Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2921–2929.