Attention Branch Network: Learning of Attention Mechanism for Visual Explanation

Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi
Chubu University, 1200 Matsumotocho, Kasugai, Aichi, Japan
{[email protected], [email protected], yamashita@isc, fujiyoshi@isc}.chubu.ac.jp
Abstract
Visual explanation enables humans to understand the decision making of a Deep Convolutional Neural Network (CNN), but by itself it does not contribute to performance improvement. In this paper, we focus on the attention map for visual explanation, whose high-response regions indicate the locations that are important for image recognition. Such regions can significantly improve the performance of a CNN when they are used in an attention mechanism that focuses the network on specific regions of an image. We therefore propose Attention Branch Network (ABN), which extends a top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN is applicable to several image recognition tasks by introducing an attention branch and is trainable for visual explanation and image recognition in an end-to-end manner. We evaluate ABN on several image recognition tasks such as image classification, fine-grained recognition, and multiple facial attributes recognition. Experimental results show that ABN outperforms the baseline models on these tasks while generating an attention map for visual explanation. Our code is available¹.
Figure 1. Network structures of Class Activation Mapping and our Attention Branch Network.
1. Introduction

Deep Convolutional Neural Network (CNN) [1, 17] approaches have achieved excellent performance on various image recognition tasks in computer vision [25, 9, 7, 34, 8, 12, 18]. However, although these CNN approaches achieve impressive performance on such tasks, it is difficult to interpret the resulting models. To understand the decision making of a CNN, methods for interpreting CNNs have been proposed [39, 41, 26, 4, 24, 3, 22]. "Visual explanation" has been used to interpret a CNN by highlighting the attended regions during the inference process.¹

¹ https://github.com/machine-perception-robotics-group/attention_branch_network

Visual explanation methods can be categorized into bottom-up and top-down methods. Bottom-up methods typically use gradients together with auxiliary data such as noise [4] or a class index [24, 3]. These methods interpret a CNN without retraining or modifying the architecture; however, they require a backpropagation pass to obtain the gradients. In contrast, top-down methods can interpret a CNN during the inference process. Class Activation Mapping (CAM) [41], a representative top-down method, visualizes an attention map for each category using the responses of a convolution layer. It replaces the fully connected layer with a convolution layer and global average pooling (GAP) [20] and obtains class-specific feature maps whose high-response positions represent the class, as shown in Fig. 1(a). However, replacing the fully connected layer with a convolution layer and passing the responses through GAP decreases the classification performance of the CNN.

To avoid this problem, bottom-up methods are often used for interpreting CNNs. The regions highlighted by visual explanation can be regarded as the locations that are important for image recognition. To exploit top-down methods, which can visualize an attention map during a forward pass, we extend a top-down visual explanation model into an attention mechanism. By employing the attention map for visual explanation as an attention mechanism, our network is trained while paying attention to the locations that are important for image recognition. An attention mechanism built on a top-down visual explanation model can thus simultaneously interpret the CNN and improve its performance. Inspired by top-down visual explanation methods and attention mechanisms, we propose Attention Branch Network (ABN), which extends a top-down visual explanation model by introducing a branch structure with an attention mechanism, as shown in Fig. 1(b). ABN consists of three components: a feature extractor, an attention branch, and a perception branch. The feature extractor contains multiple convolution layers that extract feature maps. The attention branch applies an attention mechanism by introducing a top-down visual explanation model; this component is central to ABN because it generates the attention map used for both the attention mechanism and visual explanation. The perception branch outputs the class probabilities by feeding both the feature maps and the attention map into convolution layers. ABN has a simple structure and is trainable in an end-to-end manner using the training losses of both branches. Moreover, by introducing the attention branch into various baseline models such as ResNet [9], ResNeXt [34], and a multi-task network [27], ABN can be applied to several networks and image recognition tasks. Our contributions are as follows:

• ABN extends a top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN is the first attempt to improve the performance of a CNN by incorporating a top-down visual explanation method.

• ABN is applicable to various baseline network models such as VGGNet [14], ResNet [9], and multi-task learning [27] by dividing the baseline model and inserting an attention branch that generates an attention map.

• ABN improves recognition performance and provides visual explanation simultaneously, because the attention map is produced during the forward pass.

2. Related works

2.1. Interpreting CNN

Several visual explanation methods that highlight the regions important for image recognition as an attention map have been proposed [30, 39, 41, 26, 13, 4, 24, 3, 22]. Visual explanation methods fall into two types: bottom-up methods, which are gradient based, and top-down methods, which use the responses of a forward pass. For example, SmoothGrad [4] obtains sensitivity maps by repeatedly adding noise to the input image and averaging the resulting sensitivity maps. Guided backpropagation [13] and Gradient-weighted Class Activation Mapping (Grad-CAM) [24, 3] are further bottom-up methods; they visualize an attention map through a backward pass that propagates only the positive gradients for a specific class. Grad-CAM and guided backpropagation are widely used because they can interpret various pre-trained models via the attention map of a specific class. Top-down methods visualize an attention map from the response values of a forward pass through a convolution or deconvolution layer. Although top-down methods require retraining and modifying the network model, they can directly visualize an attention map during the forward pass. CAM [41] visualizes an attention map for each class using the responses of a convolution layer and the weights of the last fully connected layer. CAM performs well on weakly supervised object localization but degrades image classification performance because it replaces fully connected layers with convolution layers and passes the responses through GAP. We build ABN on CAM, which can visualize the attention map of a CNN during a forward pass. CAM is readily compatible with an attention mechanism that directly weights the feature map. In contrast, bottom-up visual explanation methods are difficult to combine with an attention mechanism because they require a backpropagation pass to compute gradients. We therefore use CAM as the basis of the attention mechanism in the proposed method.

2.2. Attention mechanism

Attention mechanisms have been used in various fields such as computer vision and natural language processing [19, 15, 32, 12]. They have been widely applied in sequential models [15, 36, 37, 2, 31] built on recurrent neural networks and Long Short-Term Memory (LSTM) [10]. A typical attention model for sequential data was proposed by Xu et al. [15]; it is based on two types of attention, soft attention and hard attention. The soft attention mechanism of Xu et al. is used as a gate of the LSTM and has been applied to image captioning and visual question answering [36, 37]. In addition, the Non-local Neural Network [33], which uses a self-attention approach, and the Recurrent Attention Model [21], which controls the attention location by reinforcement learning, have been proposed. Recent attention mechanisms have also advanced single-image recognition tasks [32, 12, 6]. Typical attention models for a single image are the Residual Attention Network [32] and the Squeeze-and-Excitation Network (SENet) [12]. The Residual Attention Network includes two attention components: a stacked network structure consisting of multiple attention modules, and attention residual learning, which applies residual learning [9] to the attention mechanism. SENet introduces a squeeze-and-excitation block containing a channel-wise attention mechanism into each residual block.
Figure 2. Detailed structure of the Attention Branch Network: (a) overview of the Attention Branch Network, (b) structure of the attention branch, and (c) structure of the perception branch.
3. Attention Branch Network
ABN consists of three modules: a feature extractor, an attention branch, and a perception branch, as shown in Fig. 1. The feature extractor contains multiple convolution layers and extracts feature maps from an input image. The attention branch converts the CAM-based attention locations into an attention map for an attention mechanism. The perception branch outputs the probability of each class by receiving the feature maps from the feature extractor together with the attention map. ABN is built on a baseline model such as VGGNet [14] or ResNet [9]: the feature extractor and perception branch are constructed by dividing the baseline model at a specific layer, and the attention branch is constructed after the feature extractor on the basis of CAM. ABN can be applied to several image recognition tasks by introducing the attention branch; we instantiate ABN for image classification, fine-grained recognition, and multi-task learning.

ABN is designed around the attention map for visual explanation, which represents the regions important for image recognition. Previous attention models extract an attention map for the attention mechanism using only the response values of convolution layers during the forward pass. In contrast, ABN obtains an effective attention map for image recognition by generating the attention map for visual explanation on the basis of a top-down method.
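To make this data flow concrete, a high-level forward pass could be organized as in the following PyTorch-style sketch. This is a sketch under the three-module decomposition above; the module names and return signature are our own assumptions, not the authors' released code.

```python
import torch.nn as nn

class ABN(nn.Module):
    """Sketch of the three-module Attention Branch Network forward pass."""
    def __init__(self, feature_extractor, attention_branch, perception_branch):
        super().__init__()
        self.feature_extractor = feature_extractor   # lower layers of the baseline model
        self.attention_branch = attention_branch     # CAM-based branch -> (class logits, attention map)
        self.perception_branch = perception_branch   # top layers of the baseline model

    def forward(self, x):
        feat = self.feature_extractor(x)                    # g(x): (B, C, h, w)
        att_logits, att_map = self.attention_branch(feat)   # L_att is computed on att_logits
        feat = (1.0 + att_map) * feat                       # attention mechanism (Eq. 2, Sec. 3.2)
        per_logits = self.perception_branch(feat)           # L_per is computed on per_logits
        return att_logits, per_logits, att_map
```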
3.1. Attention branch
CAM consists of a K × 3 × 3 convolution layer, GAP, and a fully connected layer, as shown in Fig. 1(a). Here, K is the number of categories, and "K × 3 × 3 convolution layer" denotes a convolution layer with K kernels of size 3 × 3. This layer outputs a K × h × w feature map that represents the attention location of each class. The K × h × w feature map is down-sampled to K × 1 × 1 by GAP, and the probability of each class is obtained by passing the result through the fully connected layer with a softmax function. To visualize the attention map of a class, CAM computes the weighted sum of the K × h × w feature maps with the weights of the last fully connected layer. Instead of stacking fully connected layers, CAM stacks convolution layers, and this restriction is also adopted in the attention branch: a fully connected layer, which connects each unit to all units of the next layer, destroys the ability of the convolution layers to localize the attention area. Therefore, if the baseline model contains fully connected layers, as VGGNet does, the attention branch replaces them with 3 × 3 convolution layers, similar to CAM, as shown at the top of Fig. 2(b). For a ResNet baseline, the attention branch is constructed from the residual block, as shown at the bottom of Fig. 2(b); here, we set the stride of the first convolution layer in the residual block to 1 to maintain the resolution of the feature map.

To generate an attention map, we build a top layer in the attention branch that both outputs the probability of each class and produces the attention map for the attention mechanism. However, CAM cannot generate an attention map during training because its attention map is computed from the feature map and the fully connected layer weights after training. To address this issue, we replace the fully connected layer with a K × 1 × 1 convolution layer, which imitates the last fully connected layer of CAM in a forward pass. After the K × 1 × 1 convolution layer, the attention branch outputs class probabilities through GAP and a softmax function. Finally, the attention branch generates an attention map from the K × h × w feature map: the K feature maps are aggregated by a 1 × 1 convolution layer with a single output channel, producing a 1 × h × w map, which is normalized by a sigmoid function and used as the attention map for the attention mechanism.
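As a concrete illustration, one possible attention-branch head matching the description above is sketched below: a K × 1 × 1 convolution for class scores, GAP with softmax (applied later in the loss) for the branch output, and a single-channel 1 × 1 convolution with a sigmoid for the attention map. The layer names, the use of batch normalization here, and the channel count `in_ch` are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AttentionBranchHead(nn.Module):
    """Sketch of the attention-branch top layer: class scores via K x 1 x 1 conv + GAP,
    and a 1-channel attention map via 1 x 1 conv + sigmoid (names are illustrative)."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.score_conv = nn.Conv2d(in_ch, num_classes, kernel_size=1)  # K x 1 x 1 conv
        self.att_conv = nn.Conv2d(num_classes, 1, kernel_size=1)        # aggregate K maps -> 1 map
        self.bn = nn.BatchNorm2d(num_classes)

    def forward(self, feat):                                   # feat: (B, in_ch, h, w)
        class_maps = self.bn(self.score_conv(feat))            # (B, K, h, w)
        logits = class_maps.mean(dim=(2, 3))                   # GAP -> (B, K), used for L_att
        att_map = torch.sigmoid(self.att_conv(class_maps))     # (B, 1, h, w) attention map M(x)
        return logits, att_map
```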
3.2. Perception branch

The perception branch outputs the final probability of each class by receiving the feature map from the feature extractor together with the attention map. The structure of the perception branch is the same as the conventional top layers of image classification models such as VGGNet or ResNet, as shown in Fig. 2(c). First, the attention map is applied to the feature map by the attention mechanism. We consider the two attention mechanisms in Eq. 1 and Eq. 2. Here, g_c(x_i) is the feature map from the feature extractor, M(x_i) is the attention map, and g'_c(x_i) is the output of the attention mechanism, as shown in Fig. 2(a). Note that c ∈ {1, . . . , C} indexes the channel.
g'_c(x_i) = M(x_i) · g_c(x_i)                (1)

g'_c(x_i) = (1 + M(x_i)) · g_c(x_i)          (2)
Equation 1 is simply the element-wise product of the attention map and the feature map of a specific channel c. In contrast, Eq. 2 highlights the feature map at the peaks of the attention map while preventing regions with low attention values from being suppressed to zero.
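A minimal sketch of these two masking variants, assuming a single-channel attention map broadcast over C feature channels (the function name is ours):

```python
import torch

def apply_attention(feat: torch.Tensor, att: torch.Tensor, residual: bool = True) -> torch.Tensor:
    """feat: (B, C, h, w) feature map g(x); att: (B, 1, h, w) attention map M(x) in [0, 1].

    residual=False implements Eq. 1, g'(x) = M(x) * g(x);
    residual=True  implements Eq. 2, g'(x) = (1 + M(x)) * g(x), which keeps
    low-attention regions from being zeroed out entirely.
    """
    return (1.0 + att) * feat if residual else att * feat
```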
3.3. Training

ABN is trainable in an end-to-end manner using the losses of both branches. Our training loss L(x_i) is simply the sum of the two branch losses, as in Eq. 3:

L(x_i) = L_att(x_i) + L_per(x_i)          (3)

Here, L_att(x_i) denotes the training loss of the attention branch for an input sample x_i, and L_per(x_i) denotes the training loss of the perception branch. For image classification, the loss of each branch is computed by the combination of a softmax function and cross-entropy. The feature extractor is optimized by the gradients of both the attention and perception branches during backpropagation. When ABN is applied to other image recognition tasks, the training loss can be changed according to the baseline model.
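A schematic training step under these definitions might look as follows; the model interface and names are placeholders consistent with the sketch in Sec. 3, not the released implementation.

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def train_step(model, optimizer, images, labels):
    """One ABN update: the total loss is the sum of the attention-branch and
    perception-branch cross-entropy losses (Eq. 3), so the feature extractor
    receives gradients from both branches."""
    att_logits, per_logits, _att_map = model(images)   # assumed model outputs
    loss = criterion(att_logits, labels) + criterion(per_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```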
3.4. ABN for multi-task learning

ABN with a classification model is designed so that one branch generates the attention map and the other outputs the probability of each class. This design can also be applied to other image recognition tasks, such as multi-task learning, which we explain in this section. Conventional multi-task learning has output units producing the recognition scores of each task [27], and the training loss of a single network is defined over the multiple tasks. However, applying ABN to multi-task learning raises a problem. In image classification, the relation between inputs and recognition tasks is one-to-one, whereas in multi-task learning it is one-to-many. A one-to-one relation can be handled by focusing on a specific target location with a single attention map, but a one-to-many relation cannot focus on multiple target locations with a single attention map. To address this issue, we generate an attention map for each task by introducing multi-task learning into the attention and perception branches. Note that we use ResNet with multi-task learning as the baseline model.

To output multiple attention maps corresponding to specific tasks, we design the attention branch for multi-task learning as shown in Fig. 3. First, the feature map at residual block 4 is convolved by a T × 1 × 1 convolution layer, which outputs a T × 14 × 14 feature map. The probability score of a specific task t ∈ {1, . . . , T} is obtained by applying GAP and a sigmoid function to the 14 × 14 feature map of task t. In training, we optimize with the combination of a sigmoid function and binary cross-entropy loss. The 14 × 14 feature maps are used as the attention maps.

We also introduce multi-task learning into the perception branch. The converted feature map g'^t_c(x) is first generated from the attention map M^t(x) of task t and the feature map g(x) of the feature extractor, as in Eq. 4 and as discussed in Sec. 3.2. After generating g'^t_c(x), the probability score of task t is computed by the perception branch p_per(·), which outputs the probability of each task from the input feature map g'^t(x):

g'^t_c(x_i) = M^t(x_i) · g_c(x_i)              (4)

O(g'^t_c(x_i)) = p_per(g'^t_c(x_i); θ)          (5)

The probability matrix of the perception branch consists of T × 2 components, defining a two-category classification for each task. The probability O^t(g'^t_c(x_i)) of task t is used when the perception branch receives the feature map g'^t_c(x) to which the attention map of task t has been applied, as shown in Fig. 3. This process is repeated for every task.
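The per-task masking and the weight-shared perception branch could be sketched as follows. The 14 × 14 spatial size comes from the text; the sigmoid normalization of the per-task maps before masking, the module names, and the `perception` interface returning (B, T, 2) logits are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskAttention(nn.Module):
    """Sketch of ABN for multi-task learning: T task-specific attention maps,
    each masking the shared feature map before a weight-shared perception branch."""
    def __init__(self, in_ch: int, num_tasks: int, perception: nn.Module):
        super().__init__()
        self.task_conv = nn.Conv2d(in_ch, num_tasks, kernel_size=1)  # T x 1 x 1 conv
        self.perception = perception   # shared branch returning (B, T, 2) logits (assumed)
        self.num_tasks = num_tasks

    def forward(self, feat):                                     # feat: (B, C, 14, 14)
        task_maps = self.task_conv(feat)                         # (B, T, 14, 14)
        att_scores = torch.sigmoid(task_maps.mean(dim=(2, 3)))   # GAP + sigmoid, BCE target per task
        outputs = []
        for t in range(self.num_tasks):
            att = torch.sigmoid(task_maps[:, t:t + 1])           # M^t(x), (B, 1, 14, 14)
            masked = att * feat                                  # Eq. 4: g'^t(x) = M^t(x) * g(x)
            outputs.append(self.perception(masked)[:, t])        # Eq. 5: keep task t's 2-way score
        return att_scores, torch.stack(outputs, dim=1)           # (B, T), (B, T, 2)
```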
Figure 3. Attention Branch Network for multi-task learning.

Table 1. Comparison of the top-1 error [%] on CIFAR100 for each attention mechanism.

                      ResNet20   ResNet32   ResNet44   ResNet56   ResNet110
g(x)                     31.47      30.13      25.90      25.61       24.14
g(x) · M(x)              30.61      28.34      24.83      24.22       23.28
g(x) · (1 + M(x))        30.46      27.91      25.59      24.07       22.82
4. Experiments

4.1. Experimental details on image classification
We first evaluate ABN on image classification using the CIFAR10, CIFAR100, Street View House Numbers (SVHN) [23], and ImageNet [5] datasets. The input image size is 32×32 pixels for CIFAR10, CIFAR100, and SVHN, and 224×224 pixels for ImageNet. The number of categories is 10 for CIFAR10 and SVHN, 100 for CIFAR100, and 1,000 for ImageNet. During training, we apply standard data augmentation. For CIFAR10, CIFAR100, and SVHN, the images are zero-padded with 4 pixels on each side, randomly cropped back to 32×32 pixels, and horizontally mirrored at random. For ImageNet, the images are resized to 256×256 pixels, randomly cropped to 224×224 pixels, and horizontally mirrored at random. The dataset sizes are as follows: CIFAR10 and CIFAR100 each consist of 60,000 images (50,000 for training and 10,000 for testing), SVHN consists of 604,388 training images (train: 73,257, extra: 531,131) and 26,032 testing images, and ImageNet consists of 1,281,167 training images and 50,000 validation images. We optimize the networks by Stochastic Gradient Descent (SGD) with momentum and a batch size of 256. The number of training epochs is 300 for CIFAR10 and CIFAR100, 40 for SVHN, and 90 for ImageNet. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs.
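A rough PyTorch-style sketch of the CIFAR/SVHN data augmentation and the learning-rate schedule described above; the momentum value is not stated in the text and is therefore a placeholder.

```python
import torch
import torchvision.transforms as T

# CIFAR10/100 and SVHN augmentation: zero-pad 4 px, random 32x32 crop, random horizontal flip.
cifar_train_tf = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_optimizer(model, epochs=300):
    # SGD with momentum; initial lr 0.1, divided by 10 at 50% and 75% of training.
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # momentum value assumed
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[epochs // 2, int(epochs * 0.75)], gamma=0.1)
    return opt, sched
```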
4.2. Image classification
Analysis of the attention mechanism. We compare the accuracies of the attention mechanisms in Eq. 1 and Eq. 2 using the ResNet {20, 32, 44, 56, 110} models on CIFAR100. Table 1 shows the top-1 errors of both attention mechanisms; g(x) denotes the conventional ResNet. First, we compare ABN with the attention mechanism g(x) · M(x) of Eq. 1 against the conventional ResNet g(x): the attention mechanism g(x) · M(x) achieves a lower top-1 error than the conventional ResNet. We also compare the two attention mechanisms g(x) · M(x) and g(x) · (1 + M(x)); g(x) · (1 + M(x)) is slightly more accurate than g(x) · M(x). In the Residual Attention Network, which evaluates the same attention mechanisms, accuracy decreased with the mechanism g(x) · M(x) [32]. This result suggests that our attention map responds to regions that are effective for image classification. We therefore employ the attention mechanism g(x) · (1 + M(x)) of Eq. 2 by default.
Table 2. Comparison of top-1 error [%] on the CIFAR10, CIFAR100, SVHN, and ImageNet datasets. ∗ indicates re-implemented results; numbers in brackets denote the difference from the corresponding re-implemented baseline.

Method               CIFAR10         CIFAR100        SVHN            ImageNet
VGGNet [14]          –               –               –               31.2
VGGNet+BN            –               –               –               26.24∗
ResNet [9]           6.43            24.14∗          2.18∗           22.19∗
VGGNet+CAM [41]      –               –               –               33.4
VGGNet+BN+CAM        –               –               –               27.42∗ (+1.18)
ResNet+CAM           –               –               –               22.11∗ (−0.08)
WideResNet [38]      4.00            19.25           2.42∗           21.9
DenseNet [11]        4.51            22.27           2.07∗           22.2
ResNeXt [34]         3.84∗           18.32∗          2.16∗           22.4
Attention [32]       3.90            20.45           –               21.76
AttentionNeXt [32]   –               –               –               21.20
SENet [12]           –               –               –               21.57
VGGNet+BN+ABN        –               –               –               25.55 (−0.69)
ResNet+ABN           4.91 (−1.52)    22.82 (−1.32)   1.86 (−0.32)    21.37 (−0.82)
WideResNet+ABN       3.78 (−0.22)    18.12 (−1.13)   2.24 (−0.18)    –
DenseNet+ABN         4.17 (−0.34)    21.63 (−0.64)   2.01 (−0.06)    –
ResNeXt+ABN          3.80 (−0.04)    17.70 (−0.62)   2.01 (−0.15)    –
SENet+ABN            –               –               –               20.77 (−0.80)
Accuracy on CIFAR and SVHN. Table 2 shows the top-1 errors on CIFAR10/100, SVHN, and ImageNet. We evaluate the top-1 error of various baseline models, CAM, and ABN on image classification. Each entry is either the top-1 error reported in the cited paper or the top-1 error of our model; '∗' indicates re-implemented results, and the numbers in brackets denote the difference in top-1 error from the re-implemented baseline model. On CIFAR and SVHN, we evaluate the top-1 error with the following models: ResNet (depth=110), DenseNet (depth=100, growth rate=12), Wide ResNet (depth=28, widen factor=4, drop ratio=0.3), and ResNeXt (depth=28, cardinality=8, widen factor=4). Note that ABN is constructed by dividing each ResNet-family model at residual block 3.

ResNet, Wide ResNet, DenseNet, and ResNeXt all improve in accuracy when ABN is introduced. On CIFAR10, ResNet and DenseNet decrease the top-1 error from 6.43% to 4.91% and from 4.51% to 4.17%, respectively. Additionally, all ResNet-family models decrease the top-1 error by more than 0.6% on CIFAR100.

Accuracy on ImageNet. We evaluate image classification accuracy on ImageNet in Table 2 in the same manner as for CIFAR10/100 and SVHN. On ImageNet, we evaluate the top-1 error of VGGNet (depth=16), ResNet (depth=152), and SENet (ResNet152 model). First, we compare the top-1 errors with CAM. The performance of CAM slightly decreases for certain baseline models because the fully connected layers are removed and a GAP is added [41]. Similarly, the performance of VGGNet+Batch Normalization (BN) [29] with CAM decreases even in our re-implementation. In contrast, the performance of ResNet with CAM is almost the same as that of the baseline ResNet. The structure of ResNet, which already contains a GAP and a fully connected layer as its last layers, resembles that of CAM; ResNet with CAM can be easily constructed by stacking the K × 1 × 1 convolution layer on the last residual block, whose first convolution layer has its stride set to 1. Therefore, ResNet with CAM is unlikely to lose performance from removing the fully connected layer and adding a GAP. On the other hand, ABN outperforms the conventional VGGNet and CAM models, and similarly performs better than the conventional ResNet and CAM models.

We also compare the accuracy with conventional attention models. SENet reduces the top-1 error from 22.19% to 21.90% with ResNet152, whereas ABN reduces it from 22.19% to 21.37%, indicating that ABN is more accurate than SENet. Moreover, ABN can be combined with SENet: SENet with ABN reduces the top-1 error from 22.19% to 20.77% relative to the conventional ResNet152. The Residual Attention Network achieves the following top-1 errors with 224 × 224 input images: 21.76% for the ResNet model and 21.20% for the ResNeXt model, indicating that ResNet152+SENet with ABN also performs better than these Residual Attention Networks.
Table 3. Comparison of accuracy on the CompCars dataset.

Method           model [%]   maker [%]
VGG16                 85.9        90.4
ResNet101             90.2        90.1
VGG16+ABN             90.7        92.9
ResNet101+ABN         97.1        98.1
Figure 4. Visualizing the high-attention areas with CAM, Grad-CAM, and our ABN. CAM and Grad-CAM visualize the attention maps for the top-1 class. In the third example, whose ground truth is 'Australian_terrier', Grad-CAM and CAM predict 'Seat_belt' while ABN predicts 'Australian_terrier'.
Figure 5. Visualizing attention maps on fine-grained recognition (maker recognition vs. car model recognition for the same input images).
Figure 6. Comparison of the distribution maps of features at residual block 4 obtained by t-SNE. Left: distribution for the conventional ResNet101. Center and right: distributions for the Attention Branch Network, before and after applying the attention map, respectively.
Visualizing attention maps. We compare the attention maps visualized by Grad-CAM, CAM, and ABN. Grad-CAM extracts an attention map from the baseline ResNet152, and CAM and ABN are constructed with ResNet152 as the baseline model. Figure 4 shows the attention maps of each model on the ImageNet dataset. As shown in Fig. 4, Grad-CAM, CAM, and ABN highlight similar regions. For example, in the first column of Fig. 4, the models classify "Violin" and highlight the violin's location in the original image. Similarly, for "Cliff" in the second column, the cliff region is highlighted. The third column is a typical example containing multiple objects, namely "Seat belt" and "Australian terrier". In this case, Grad-CAM (the conventional ResNet152) and CAM fail, but ABN classifies the image correctly. Visualizing the attention maps in the third column shows that the attention map of ABN highlights each object; the attention map can therefore focus on a specific region even when multiple objects appear in an image.
4.3. Fine-grained recognition

We evaluate ABN on fine-grained recognition using the Comprehensive Cars (CompCars) dataset [35], which has 36,451 training images and 15,626 testing images with 432 car models and 75 makers. We use VGG16 and ResNet101 as baseline models and optimize them by SGD with momentum. The total number of training epochs is 50, and the mini-batch size is 32. The learning rate starts from 0.01 and is multiplied by 0.1 at 25 and 35 epochs. The input image is resized to 323×224 pixels; this size is obtained from the average aspect ratio of the bounding boxes in the training data, which suppresses distortion of the car shape (a short sketch of this computation is given below).
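As a small illustration of this resizing step, the target width could be derived from the average bounding-box aspect ratio roughly as follows; the function name, the use of PIL, and the helper signature are assumptions, with 323 × 224 being the size reported above.

```python
from PIL import Image

def resize_for_compcars(img: Image.Image, boxes_wh, target_h: int = 224) -> Image.Image:
    """Resize to target_h x (target_h * mean aspect ratio), e.g. roughly 323 x 224,
    so the average car shape is not distorted. boxes_wh: list of (width, height)
    bounding boxes from the training set (illustrative helper, not the authors' code)."""
    mean_aspect = sum(w / h for w, h in boxes_wh) / len(boxes_wh)
    target_w = round(target_h * mean_aspect)
    return img.resize((target_w, target_h), Image.BILINEAR)
```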
Table 3 shows the car model and maker recognition accuracies on the CompCars dataset. The car model recognition accuracy of ABN improves by 4.9% and 6.2% with VGG16 and ResNet101, respectively. Moreover, maker recognition improves by 2.0% and 7.5%, respectively. These results indicate that ABN is also effective for fine-grained recognition. We visualize the attention maps for car model and maker recognition in Fig. 5. In these visualizations, the training and testing images are the same for car model and maker recognition; nevertheless, the attention maps differ depending on the recognition task.

We also compare the feature representations of the conventional ResNet101 and ABN with ResNet101. In this experiment, we visualize and analyze the feature distributions with t-distributed Stochastic Neighbor Embedding (t-SNE) [30], using the feature maps at the final layer of residual block 4 and 5,000 testing images of the CompCars dataset. Figure 6 shows the resulting distribution maps. The feature maps of the conventional ResNet101 and of the feature extractor in ABN are clustered by car pose. In contrast, the feature map after applying the attention map is separated by both car pose and detailed car shape.
Figure 7. Visualizing attention maps on multiple facial attributes recognition. The examples show attention maps for attributes such as Eyeglasses, Sideburns, Wearing_Hat, Wavy_Hair, Male, Pale_Skin, Smiling, Wearing_Necklace, Wearing_Lipstick, Young, High_Cheekbones, Wearing_Necktie, and Black_Hair.
Table 4. Comparison of accuracy on the CelebA dataset.

Method            Average accuracy [%]   Odds (tasks won by ABN)
FaceTracer [16]                  81.13                     40/40
PANDA-l [40]                     85.43                     39/40
LNet+ANet [42]                   87.30                     37/40
MOON [28]                        90.93                     29/40
ResNet101                        90.69                     27/40
ABN                              91.07                         –
4.4. Multi-task Learning

For multi-task learning, we evaluate multiple facial attributes recognition on the CelebA dataset [42], which consists of 40 facial attribute labels and 202,599 images (182,637 training images and 19,962 testing images). The total number of training epochs is 10, and the learning rate is set to 0.01. Table 4 shows the average recognition rate and the number of facial attribute tasks on which ABN outperforms each previous method; the third column of Tab. 4 gives the number of tasks that ABN wins when compared with each conventional model for every facial attribute. The accuracy of each individual facial attribute task is given in the appendix. Compared with ResNet101, ABN is 0.38% more accurate, and the accuracy of 27 facial attribute tasks is improved. ABN also performs better than conventional facial attribute recognition methods such as FaceTracer [16], PANDA-l [40], LNet+ANet [42], and the Mixed Objective Optimization Network (MOON) [28]. In particular, ABN outperforms these methods on difficult tasks such as "arched eyebrows", "pointy nose", "wearing earring", and "wearing necklace". Figure 7 shows the attention maps of ABN on the CelebA dataset. The attention maps highlight specific locations such as the mouth, eyes, beard, and hair, and these locations correspond to the specific facial attribute tasks, as shown in Fig. 7. It is conceivable that these highlighted locations contribute to the performance improvement of ABN.
5. Conclusion

We proposed the Attention Branch Network, which extends a top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN is simultaneously trainable for visual explanation and image recognition with an attention mechanism in an end-to-end manner, and it is applicable to several CNN models and image recognition tasks. In our experiments, we evaluated ABN on image classification, fine-grained recognition, and multi-task learning, and ABN improved performance on all of these tasks. We plan to apply ABN to reinforcement learning, which does not include labels in the training process.
References

[1] K. Alex, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems, pages 1097–1105, 2012.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2016.
[3] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. arXiv preprint arXiv:1710.11063, 2017.
[4] S. Daniel, T. Nikhil, K. Been, B. V. Fernanda, and W. Martin. SmoothGrad: removing noise by adding noise, 2017.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
[6] L. Drew, S. Dan, E. Sven, and S. Thomas. Global-and-local attention networks for visual recognition. arXiv, abs/1805.08819, 2018.
[7] H. Emily M. and C. Rama. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In Association for the Advancement of Artificial Intelligence, 2017.
[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In International Conference on Computer Vision, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition, pages 770–778, 2016.
[10] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
[12] H. Jie, S. Li, and S. Gang. Squeeze-and-excitation networks. Computer Vision and Pattern Recognition, 2017.
[13] S. Jost Tobias, D. Alexey, B. Thomas, and R. Martin. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations, 2015.
[14] S. Karen and Z. Andrew. Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, 2015.
[15] X. Kelvin, B. Jimmy, K. Ryan, C. Kyunghyun, C. Aaron, S. Ruslan, Z. Rich, and B. Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
[16] N. Kumar, P. N. Belhumeur, and S. K. Nayar. FaceTracer: A search engine for large collections of images with faces, October 2008.
[17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[18] C. Liang-Chieh, Z. Yukun, P. George, S. Florian, and A. Hartwig. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, 2018.
[19] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing, pages 1412–1421, 2015.
[20] L. Min, C. Qiang, and Y. Shuicheng. Network in network. International Conference on Learning Representations, 2014.
[21] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Neural Information Processing Systems, pages 2204–2212, 2014.
[22] G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2018.
[23] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In Neural Information Processing Systems, 2011.
[24] S. Ramprasaath R., C. Michael, D. Abhishek, V. Ramakrishna, P. Devi, and B. Dhruv. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision, pages 618–626, 2017.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems, pages 91–99, 2015.
[26] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[27] C. Richard. Multitask learning: A knowledge-based source of inductive bias. In International Conference on Machine Learning, pages 41–48, 1993.
[28] E. Rudd, M. Gunther, and T. Boult. MOON: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision, 2016.
[29] I. Sergey and S. Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[30] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems, pages 5998–6008, 2017.
[32] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In Computer Vision and Pattern Recognition, 2017.
[33] W. Xiaolong, G. Ross, G. Abhinav, and H. Kaiming. Non-local neural networks. Computer Vision and Pattern Recognition, 2018.
[34] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition, pages 5987–5995, 2017.
[35] L. Yang, P. Luo, C. C. Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
[36] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In Computer Vision and Pattern Recognition, pages 21–29, 2016.
[37] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. Computer Vision and Pattern Recognition, pages 4651–4659, 2016.
[38] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
[39] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.
[40] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In Computer Vision and Pattern Recognition, pages 1637–1644, 2014.
[41] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. Computer Vision and Pattern Recognition, 2016.
[42] L. Ziwei, L. Ping, W. Xiaogang, and T. Xiaoou. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.