The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

Peng Wang∗, Qi Wu∗, Chunhua Shen, Anton van den Hengel
School of Computer Science, The University of Adelaide, Australia
{p.wang,qi.wu01,chunhua.shen,anton.vandenhengel}@adelaide.edu.au
(∗ indicates equal contribution.)

arXiv:1612.05386v1 [cs.CV] 16 Dec 2016
Abstract

One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations from detection and counting, to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image, question, answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. We propose here instead a more general and scalable approach which exploits the fact that very good methods to achieve these operations already exist, and thus do not need to be trained. Our method thus learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine [10]. The core of our proposed method is a new co-attention model. In addition, the proposed approach generates human-readable reasons for its decision, and can still be trained end-to-end without ground truth reasons being given. We demonstrate the effectiveness of our approach on two publicly available datasets, Visual Genome and VQA, and show that it produces state-of-the-art results in both cases.
Figure 1: Two real example results of our proposed model. Given an image-question pair, our model generates not only an answer, but also a set of reasons (as text) and visual attention maps. The colored words in the question have Top-3 weights, ordered as red, blue and cyan. The highlighted area in the attention map indicates the attention weights on the image regions. The Top-3 weighted visual facts are re-formulated as human readable reasons. The comparison between the results for different questions relating to the same image shows that our model can produce highly informative reasons relating to the specifics of each question.
1. Introduction
Visual Question Answering (VQA) is an AI-complete task lying at the intersection of computer vision (CV) and natural language processing (NLP). Current VQA approaches are predominantly based on a joint embedding [3, 9, 19, 23, 35, 41] of image features and question representations into the same space, the result of which is used to predict the answer. One of the advantages of this approach is its ability to exploit a pre-trained CNN model. In contrast to this joint embedding approach, we propose a co-attention based method which learns how to use a set of off-the-shelf CV methods in answering image-based questions.

Applying the existing CV methods to the image generates a variety of information, which we label the image facts. Inevitably, much of this information is not relevant to the particular question asked, so part of the role of the attention mechanism is to determine which types of facts are useful in answering a question. The fact that the VQA-Machine is able to exploit a set of off-the-shelf CV methods in answering a question means that it does not need to learn how to perform these functions itself. The method instead learns to predict the appropriate combination of algorithms to exploit in response to a previously unseen question and image. It thus represents a step towards a neural network capable of learning an algorithm for solving its problem. In this sense it is comparable to the Neural Turing Machine [10] (NTM), whereby an RNN is trained to use an associative memory module in solving its larger task.
The method that we propose does not alter the parameters of the external modules it uses, but it is able to exploit a much wider variety of module types.

In order to enable the VQA-Machine to exploit a wide variety of available CV methods, and to provide a compact but flexible interface, we formulate the visual facts as triplets. The advantages of this approach are threefold. Firstly, many relevant off-the-shelf CV methods produce outputs that can be reformulated as triplets (see Table 1). Secondly, such compact formats are human readable and interpretable, which allows us to provide human-readable reasons along with the answers (see Fig. 1 for an example). Finally, the proposed triplet representation is similar to the triplet representations used in some Knowledge Bases, such as <cat, eat, fish>; the method might thus be extendable to accept information from these sources.

To select the facts which are relevant in answering a specific question, we employ a co-attention mechanism. This is achieved by extending the approach of Lu et al. [16], which proposed a co-attention mechanism that jointly reasons about the image and question, to also reason over a set of facts. Specifically, we design a sequential co-attention mechanism (see Fig. 3) which aims to ensure that attention can be passed effectively between all three forms of data. The initial question representation (without attention) is thus first used to guide fact weighting. The weighted facts and the initial question representation are then combined to guide the image weighting. The weighted facts and image regions are then jointly used to guide the question attention mechanism. All that remains is to run the fact attention again, but informed by the question and image attention weights, as this completes the circle and means that each attention process has access to the output of all the others. All of the weighted features are further fed into a multi-layer perceptron (MLP) to predict the answer. One of the advantages of this approach is that the question, image, and facts are interpreted together, which in particular means that information extracted from the image (and represented as facts) can guide the question interpretation. A question such as 'Who is looking at the man with a telescope?' means different things when combined with an image of a man holding a telescope, rather than an image of a man viewed through a telescope.

Our main contributions are as follows:

• We propose a new VQA model which is able to learn to adaptively combine multiple off-the-shelf CV methods to answer questions.
• To achieve that, we extend the co-attention mechanism to a higher order which is able to jointly process questions, images, and facts.
• The method that we propose generates not only an answer to the posed question, but also a set of supporting information, including the visual (attention) reasoning and human-readable textual reasons. To the best of our knowledge, this is the first VQA model that is capable of outputting human-readable reasons on free-form open-ended visual questions.
• Finally, we evaluate our proposed model on two VQA datasets. Our model achieves the state of the art in both cases. A human agreement study is conducted to evaluate the reason generation ability of our model.
2. Related Work

Joint Embedding. Most recent methods are based on a joint embedding of the image and the question using a deep neural network. Practically, image representations are obtained through CNNs that have been pre-trained on an object recognition task. The question is typically passed through an RNN, which produces a fixed-length vector representation. These two representations are jointly embedded into the same space and fed into a classifier which predicts the final answer. Many previous works [3, 9, 19, 23, 33, 41] adopted this approach, while some [8, 13, 24] have proposed modifications of this basic idea. However, these modifications either focus on developing more advanced embedding techniques or employ different question encoding methods; only very few methods actually aim to improve the visual information available [21, 30, 35]. This is a surprising situation given that VQA can be seen as encompassing the vast majority of CV tasks (by phrasing the task as a question, for instance). It is hard to imagine that a single pre-trained CNN model (such as VGG [26] or ResNet [11]) would suffice for all such tasks, or be able to recover all of the required visual information. The approach that we propose here, in contrast, is able to exploit the wealth of methods that already exist to extract useful information from images, and thus does not need to learn to perform these operations from a dataset that is ill-suited to the task.

Attention Mechanisms. Instead of directly using the holistic global-image embedding from the fully connected layer of a CNN, several recent works [12, 16, 25, 38, 39, 42] have explored image attention models for VQA. Specifically, the feature map (normally the convolutional layer of a pre-trained deep CNN) is used with the question to determine spatial weights that reflect the most relevant regions of the image. Recently, Lu et al. [16] determined attention weights on both image regions and question words. In our work, we extend this co-attention to a higher order so that the image, question and facts can be jointly weighted.

Modular Architecture and Memory Networks. Neural Module Networks (NMNs) were introduced by Andreas et al. in [1, 2]. In NMNs, the question parse tree is turned into an assembly of modules from a predefined set, which are then used to answer the question. Dynamic Memory Networks (DMNs) [37] retrieve the 'facts' required to answer the question, where the 'facts' are simply CNN features calculated over small image patches. In comparison to NMNs and DMNs, our method uses a set of external algorithms that does not depend on the question. The set of algorithms used is larger; however, the method for combining their outputs is more flexible, and varies in response to the question, the image, and the facts.

Explicit Reasoning. One of the limitations of most VQA methods is that it is impossible to distinguish between an answer which has arisen as a result of the image content, and one selected because it occurs frequently in the training set [28]. This is a significant limitation to the practical application of the technology, particularly in Medicine or Defence, as it makes it impossible to have any faith in the answers provided. One solution to this problem is to provide human readable reasoning to justify or explain the answer, which has been a longstanding goal in Neural Networks (see [7, 29], for example). Wang et al. [31] propose a VQA framework named "Ahab" that uses explicit reasoning over an RDF (Resource Description Framework) Knowledge Base to derive the answer, which naturally gives rise to a reasoning chain. This approach is limited to a hand-crafted set of question templates, however. FVQA [32] used an LSTM and a data-driven approach to learn the mapping of images/questions to RDF queries, but only considers questions relating to specified Knowledge Bases. In this work, we employ attention mechanisms over facts provided by multiple off-the-shelf CV methods. The facts are formulated as human-understandable structural triplets and are further processed into human readable reasons. To the best of our knowledge, this is the first approach that can provide human readable reasons for the open-ended VQA problem. A human agreement study is reported in Sec. 4.2.3, which demonstrates the performance of this approach.
Figure 2: The proposed VQA model. The input question, facts and image features are weighted at three question-encoding levels. Given the co-weighted features at all levels, a multi-layer perceptron (MLP) classifier is used to predict answers. Then the ranked facts are used to generate reasons. (The example in the figure is the question "What is the girl doing?", answered with "riding a horse" and supported by the reasons: the image contains the object of horse; the image contains the action riding; the girl is on the horse.)

3. Models

In this section, we introduce the proposed VQA model, which takes questions, images and facts as inputs and outputs a predicted answer with ranked reasons. The overall framework is described in Sec. 3.1, while Sec. 3.2 demonstrates how the three types of input are jointly embedded using the proposed sequential co-attention model. Finally, the module used for generating answers and reasons is introduced in Sec. 3.3.

3.1. Overall Framework

The entire model is shown in Fig. 2. The first step sees the input question encoded at three different levels. At each level, the question features are embedded jointly with the images and facts via the proposed sequential co-attention model. Finally, a multi-layer perceptron (MLP) is used to predict answers based on the outputs (i.e., the weighted question, image and fact features) of the co-attention modules at all levels. Reasons are generated by ranking and reformulating the weighted facts.

Hierarchical Question Encoding. We apply a hierarchical question encoding [16] to effectively capture the information from a question at multiple scales, i.e., word, phrase and question level. Firstly, the one-hot vectors of the question words Q = [q_1, ..., q_T] are embedded individually to continuous vectors Q^w = [q^w_1, ..., q^w_T]. Then 1-D convolutions with different filter sizes (unigram, bigram and trigram) are applied to the word-level embeddings Q^w, followed by a max-pooling over the different filters at each word location, to form the phrase-level features Q^p = [q^p_1, ..., q^p_T]. Finally, the phrase-level features are further encoded by an LSTM, resulting in the question-level features Q^q = [q^q_1, ..., q^q_T].
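As a concrete illustration, a minimal PyTorch-style sketch of this hierarchical encoder is given below; the class name, vocabulary size and padding choices are our own illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class HierarchicalQuestionEncoder(nn.Module):
    """Word / phrase / question level features, following Lu et al. [16]."""
    def __init__(self, vocab_size, d=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)          # word level
        # unigram / bigram / trigram 1-D convolutions over the word embeddings
        self.conv1 = nn.Conv1d(d, d, kernel_size=1)
        self.conv2 = nn.Conv1d(d, d, kernel_size=2, padding=1)
        self.conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(d, d, batch_first=True)       # question level

    def forward(self, tokens):                            # tokens: (B, T) word indices
        qw = self.embed(tokens)                           # (B, T, d)  word-level Q^w
        x = qw.transpose(1, 2)                            # (B, d, T) for Conv1d
        c1 = self.conv1(x)
        c2 = self.conv2(x)[:, :, :x.size(2)]              # trim bigram output back to length T
        c3 = self.conv3(x)
        # max over the three filter sizes at each word location -> phrase-level Q^p
        qp = torch.max(torch.stack([c1, c2, c3]), dim=0).values.transpose(1, 2)
        qq, _ = self.lstm(qp)                             # (B, T, d)  question-level Q^q
        return qw, qp, qq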
Triplet                  Example
(img, scene, img_scn)    (img, scene, office)
(img, att, img_att)      (img, att, wedding)
(img, contain, obj)      (img, contain, dog)
(obj, att, obj_att)      (shirt, att, red)
(obj1, rel, obj2)        (man, hold, umbrella)

Table 1: Facts represented by triplets. img, scene, att and contain are specific tokens, while img_scn, obj, img_att, obj_att and rel refer to vocabularies describing image scenes, objects, image/object attributes and relationships between objects.
Encoding Image Regions. Following [16, 39], the input image is resized to 448 × 448 and divided into 14 × 14 regions. The corresponding regions of the last pooling layer of the VGG-19 [26] or ResNet-100 [11] networks are extracted and further embedded using a learnable embedding weight. The outputs of the embedding layer, denoted V = [v_1, ..., v_N] (N = 196 is the number of image regions), are taken as the image features.

Encoding Facts. In this work, we use triplets of the form (subject, relation, object) to represent facts in an image, where subject and object denote two visual concepts and relation represents a relationship between these two concepts. This triplet format is very general and is widely used in large-scale structured knowledge graphs (such as DBpedia [4], Freebase [5], YAGO [18]) to record a surprising variety of information. In this work, we consider the five types of visual concepts shown in Table 1, which respectively record information about the image scene, objects in the image, attributes of objects and of the whole image, and relationships between two objects. Vocabularies are constructed for subject, relation and object, and the entities in a triplet are represented by individual one-hot vectors. Three embeddings are learned end-to-end to project the three triplet entities to continuous-valued vectors (f_s for subject, f_r for relation, f_o for object). The concatenated vector f = [f_s; f_r; f_o] is then used to represent the corresponding fact. By applying different types of visual models, we obtain a list of encoded fact features F = [f_1, ..., f_M], where M is the number of extracted facts. Note that this approach may easily be extended to use any existing vision method that extracts image information which can usefully be recorded as a triplet, or even more generally, any method which generates output that can be encoded as a fixed-length vector.
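A minimal sketch of this fact encoder, under the 128/128/256 embedding split reported later in Sec. 4.1 (the module and argument names are illustrative, not the authors' code):

import torch
import torch.nn as nn

class FactEncoder(nn.Module):
    """Embed (subject, relation, object) triplets into fixed-length fact vectors."""
    def __init__(self, n_subj, n_rel, n_obj, d_s=128, d_r=128, d_o=256):
        super().__init__()
        self.f_s = nn.Embedding(n_subj, d_s)   # subject embedding
        self.f_r = nn.Embedding(n_rel, d_r)    # relation embedding
        self.f_o = nn.Embedding(n_obj, d_o)    # object embedding

    def forward(self, subj, rel, obj):
        # subj / rel / obj: (B, M) entity indices for M facts per image
        f = torch.cat([self.f_s(subj), self.f_r(rel), self.f_o(obj)], dim=-1)
        return f                               # (B, M, d_s + d_r + d_o) = (B, M, 512)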
3.2. Sequential Co-attention

Given the encoded question/image/fact features, the proposed co-attention approach sequentially generates attention weights for each feature type, using the other two as guidance, as shown in Fig. 3.
Figure 3: The sequential co-attention module. Given the feature sequences for the question (Q), facts (F) and image (V), this module sequentially generates the weighted features (ṽ, q̃, f̃).
The operation in each of the five attention modules is denoted x̃ = Atten(X, g_1, g_2), which can be expressed as follows:

\begin{align}
H_i &= \tanh(W_x x_i + W_{g_1} g_1 + W_{g_2} g_2), \tag{1a}\\
\alpha_i &= \operatorname{softmax}(w^{\top} H_i), \quad i = 1, \dots, N, \tag{1b}\\
\tilde{x} &= \sum_{i=1}^{N} \alpha_i x_i, \tag{1c}
\end{align}
where X = [x_1, ..., x_N] ∈ R^{d×N} is the input sequence, and the fixed-length vectors g_1, g_2 ∈ R^d are the attention guidances. W_x, W_{g1}, W_{g2} ∈ R^{h×d} and w ∈ R^h are embedding parameters to be learned. Here α denotes the attention weights over the input sequence features, and the weighted sum x̃ is the resulting attended feature.

In the proposed co-attention approach, the encoded question/image/fact features (see Sec. 3.1) are sequentially fed into the attention module (Eq. 1) as input sequences, and the weighted features from the previous two steps are used as guidance. Firstly, the question features are summarized without any guidance (q̃_0 = Atten(Q, 0, 0)). At the second step, the fact features are weighted based on the summarized question features (f̃_0 = Atten(F, q̃_0, 0)). Next, the weighted image features are generated with the weighted fact features and summarized question features as guidance (ṽ = Atten(V, q̃_0, f̃_0)). In step 4 (q̃ = Atten(Q, ṽ, f̃_0)) and step 5 (f̃ = Atten(F, ṽ, q̃)), the question and fact features are re-weighted based on the outputs of the previous steps. Finally, the weighted question/image/fact features (q̃, f̃, ṽ) are used for answer prediction, and the attention weights of the last attention module, α_f, are used for reason generation.
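The attention operation of Eq. (1) and the five-step sequence can be sketched as follows. This is an illustrative PyTorch reimplementation, not the released Torch7 code; in particular, the paper may use separate parameters for each of the five attention modules, whereas a single shared module is passed here for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Atten(nn.Module):
    """x_tilde = Atten(X, g1, g2): weight a feature sequence X given two guidance vectors."""
    def __init__(self, d=512, h=512):
        super().__init__()
        self.Wx = nn.Linear(d, h)
        self.Wg1 = nn.Linear(d, h)
        self.Wg2 = nn.Linear(d, h)
        self.w = nn.Linear(h, 1)

    def forward(self, X, g1, g2):                               # X: (B, N, d); g1, g2: (B, d)
        H = torch.tanh(self.Wx(X) + self.Wg1(g1).unsqueeze(1) + self.Wg2(g2).unsqueeze(1))
        alpha = F.softmax(self.w(H).squeeze(-1), dim=1)         # (B, N) attention weights
        x_tilde = (alpha.unsqueeze(-1) * X).sum(dim=1)          # (B, d) weighted sum
        return x_tilde, alpha

def sequential_co_attention(Q, F_facts, V, att):
    """The five steps of Sec. 3.2; `att` is an Atten module (shared here for simplicity)."""
    zero = Q.new_zeros(Q.size(0), Q.size(2))
    q0, _      = att(Q, zero, zero)          # 1. summarize the question without guidance
    f0, _      = att(F_facts, q0, zero)      # 2. weight facts given the question summary
    v, _       = att(V, q0, f0)              # 3. weight image regions given question and facts
    q, _       = att(Q, v, f0)               # 4. re-weight the question
    f, alpha_f = att(F_facts, v, q)          # 5. re-weight the facts (alpha_f ranks reasons)
    return q, v, f, alpha_f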
3.3. Answer Prediction and Reason Generation

Similar to many previous VQA models [16, 22, 23, 41], the answer prediction process is treated as a multi-class classification problem, in which each class corresponds to a distinct answer. Given the weighted features generated from the word/phrase/question levels, a multi-layer perceptron (MLP) is used for classification:

\begin{align}
h^w &= \tanh\big(W_w(\tilde{q}^w + \tilde{v}^w + \tilde{f}^w)\big), \tag{2a}\\
h^p &= \tanh\big(W_p\,[(\tilde{q}^p + \tilde{v}^p + \tilde{f}^p);\, h^w]\big), \tag{2b}\\
h^q &= \tanh\big(W_q\,[(\tilde{q}^q + \tilde{v}^q + \tilde{f}^q);\, h^p]\big), \tag{2c}\\
p &= \operatorname{softmax}(W_h h^q), \tag{2d}
\end{align}
where {q̃^w, ṽ^w, f̃^w}, {q̃^p, ṽ^p, f̃^p} and {q̃^q, ṽ^q, f̃^q} are the weighted features from the three levels, [·;·] denotes concatenation, W_w, W_p, W_q and W_h are parameters, and p is the probability vector over answers. As shown in Fig. 2, the attention weights of the facts from all three levels (α^w_f, α^p_f, α^q_f) are summed together. Then the top-3 ranked facts are automatically reformulated (by simple rule-based approaches, see the supplementary material) into human readable sentences, which are taken as the reasons.
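A corresponding sketch of the prediction head of Eq. (2) and the fact ranking used for reasons; the hidden sizes are simplified to a single d (Sec. 4.1 uses a larger dimension for h^q) and the answer vocabulary size is a placeholder.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    def __init__(self, d=512, n_answers=3000):
        super().__init__()
        self.Ww = nn.Linear(d, d)
        self.Wp = nn.Linear(2 * d, d)
        self.Wq = nn.Linear(2 * d, d)
        self.Wh = nn.Linear(d, n_answers)

    def forward(self, word, phrase, question):
        # each argument is a tuple (q_tilde, v_tilde, f_tilde) from one level
        hw = torch.tanh(self.Ww(sum(word)))                              # Eq. (2a)
        hp = torch.tanh(self.Wp(torch.cat([sum(phrase), hw], dim=-1)))   # Eq. (2b)
        hq = torch.tanh(self.Wq(torch.cat([sum(question), hp], dim=-1))) # Eq. (2c)
        return F.softmax(self.Wh(hq), dim=-1)                            # Eq. (2d)

def top_reasons(alpha_w, alpha_p, alpha_q, k=3):
    """Sum the fact attention weights from the three levels and return the top-k fact indices."""
    return torch.topk(alpha_w + alpha_p + alpha_q, k, dim=-1).indices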
4. Experiments

We evaluate our models on two datasets: Visual Genome QA [14] and VQA-real [3]. The Visual Genome QA contains 1,445,322 questions on 108,077 images. Since the official split of the Visual Genome dataset has not been released, we randomly generate our own split, with 723,060 training, 54,506 validation and 667,753 testing question/answer examples, based on 54,038 training, 4,039 validation and 50,000 testing images. The VQA dataset [3] is one of the most widely used datasets; it comprises two parts, one using natural images and a second using cartoon images. In this paper, we only evaluate our models on the real image subset, which we label VQA-real. VQA-real comprises 123,287 training and 81,434 test images.
4.1. Implementation Details

Facts Extraction. Using the training splits of the Visual Genome dataset, we trained three naive multi-label CNN models to extract objects, object attributes and object-object relations. Specifically, we extract the top-5000/10000/15000 triplets of the forms (img, contain, obj), (obj, att, obj_att) and (obj1, rel, obj2) respectively, based on the annotations of Visual Genome, and use them as class labels. We then formulate these as three separate multi-label classification problems using an element-wise logistic loss function. The pre-trained VGG-16 model is used as initialization and only the fully-connected layers are fine-tuned. We also use the scene classification model of [40] and the image attribute model of [33] to extract facts not included in the Visual Genome dataset (i.e., image scenes and image attributes). Thresholds are set for these visual models and, on average, 76 facts are extracted for each image. The confidence score of each predicted fact is added as an additional dimension of the encoded fact feature f.
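The multi-label fact classifiers can be sketched as follows; the element-wise logistic loss corresponds to a binary cross-entropy over the multi-hot triplet labels. The backbone handling and label count below are placeholders, not the exact training code.

import torch
import torch.nn as nn

class MultiLabelFactClassifier(nn.Module):
    """Predict a set of triplet labels (e.g. the top-5000 (img, contain, obj) facts) for an image."""
    def __init__(self, backbone, feat_dim, n_labels=5000):
        super().__init__()
        self.backbone = backbone              # e.g. a pre-trained VGG-16 feature extractor, kept fixed
        self.fc = nn.Sequential(              # only the fully-connected layers are fine-tuned
            nn.Linear(feat_dim, 4096), nn.ReLU(), nn.Dropout(),
            nn.Linear(4096, n_labels),
        )

    def forward(self, images):
        with torch.no_grad():                 # the CNN features are not updated
            feats = self.backbone(images)     # (B, feat_dim)
        return self.fc(feats)                 # one logit per candidate triplet

# the element-wise logistic loss is binary cross-entropy over the multi-hot target vector
criterion = nn.BCEWithLogitsLoss()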
Training Parameters. In our system, the dimensions of the encoded question/image/fact features (d in Eq. 1) and of the hidden layers of the LSTM and co-attention models (h in Eq. 1) are set to 512. For facts, the subject/relation/object entities are embedded to 128/128/256-dimensional vectors respectively and concatenated to form 512-d vectors. We use a two-layer LSTM model. For the MLP in Eq. (2), the dimensions of h^w and h^p are also 512, while the dimension of h^q is set to 1024 for the VQA dataset and 2048 for the Visual Genome dataset. For prediction, we take the top 3000 answers for the VQA dataset and the top 5000 answers for the Visual Genome dataset. The whole system is implemented in Torch7 [6] (the code and pre-trained models will be released upon acceptance of the paper) and trained end-to-end but with fixed CNN features. For optimization, the RMSProp method is used with a base learning rate of 2 × 10⁻⁴ and momentum 0.99. The model is trained for up to 256 epochs, until the validation error has not improved in the last 5 epochs.
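A rough PyTorch analogue of this optimization setup (the original implementation is in Torch7, so the mapping of the momentum setting onto RMSProp arguments is an assumption):

import torch
import torch.nn as nn

model = nn.Linear(512, 3000)   # stand-in for the trainable parts of the VQA-Machine

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=2e-4,        # base learning rate from Sec. 4.1
    momentum=0.99,  # interpreted here as RMSProp momentum; the Torch7 setting may map differently
)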
4.2. Results on the Visual Genome QA

Metrics. We use the accuracy value and the Wu-Palmer similarity (WUPS) [36] to measure performance on the Visual Genome QA (see Table 2). Before comparison, all responses are made lowercase, numbers are converted to digits, and punctuation and articles are removed (see the sketch after the baseline descriptions below). Accuracy according to question type is also reported.

Baselines and State-of-the-art. The first baseline method is VGG+LSTM from Antol et al. [3], which uses a two-layer LSTM to encode the questions and the last hidden layer of VGG [26] to encode the images. The image features are then ℓ2-normalized. We use the author-provided code (https://github.com/VT-vision-lab/VQA_LSTM_CNN) to train the model on the Visual Genome QA training split. VGG+Obj+Att+Rel+Extra+LSTM uses the same configuration as VGG+LSTM, except that we concatenate additional image features extracted by VGG-16 models (fc7) that have been pre-trained on the different CV tasks described in the previous section. HieCoAtt-VGG is the original model presented in [16], which is the current state of the art; the authors' implementation (https://github.com/jiasenlu/HieCoAttenVQA) is used to train the model.
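A minimal sketch of the answer normalization mentioned under Metrics above (the exact rules in the authors' evaluation script may differ):

import re

ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10"}

def normalize_answer(ans):
    """Lowercase, map number words to digits, strip punctuation and articles."""
    ans = ans.lower()
    ans = re.sub(r"[^\w\s]", "", ans)                       # remove punctuation
    tokens = [NUMBER_WORDS.get(t, t) for t in ans.split()   # word -> digit where applicable
              if t not in ARTICLES]                         # drop articles
    return " ".join(tokens)

assert normalize_answer("A Boat!") == "boat"
assert normalize_answer("Two") == "2"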
4.2.1 Ablation Studies with Ground Truth Facts
In this section, we conduct an ablation study to evaluate the effectiveness of incorporating different types of facts. To avoid the bias that would be caused by the varying accuracy with which the facts are predicted, we use the ground truth facts provided by the Visual Genome dataset as the inputs to our proposed models for ablation testing.

GtFact(Obj) is the initial implementation of our proposed co-attention model, using only the 'object' facts, which means that the co-attention mechanisms only operate between the question and the input object facts.
Methods                                 What     Where    When    Who     Why     How      Overall  WUPS@0.9  WUPS@0.0
                                        (60.5%)  (17.0%)  (3.5%)  (5.5%)  (2.7%)  (10.8%)
VGG+LSTM [3]                            35.12    16.33    52.71   30.03   11.55   42.69    32.46    38.30     58.39
VGG+Obj+Att+Rel+Extra+LSTM              36.88    16.85    52.74   32.30   11.65   44.00    33.88    39.61     58.89
HieCoAtt-VGG [16]                       39.72    17.53    52.53   33.80   12.62   45.14    35.94    41.75     59.97
Ours-GtFact(Obj)                        37.82    17.73    51.48   37.32   12.84   43.10    34.77    40.83     59.69
Ours-GtFact(Obj+Att)                    42.21    17.56    51.89   37.45   12.93   43.90    37.50    43.24     60.39
Ours-GtFact(Obj+Rel)                    38.25    18.10    51.13   38.22   12.86   43.32    35.15    41.25     59.91
Ours-GtFact(Obj+Att+Rel)                42.86    18.22    51.06   38.26   13.02   44.26    38.06    43.86     60.72
Ours-GtFact(Obj+Att+Rel)+VGG            44.28    18.87    52.06   38.87   12.93   46.08    39.30    44.94     61.21
Ours-PredFact(Obj+Att+Rel)              37.13    16.99    51.70   33.87   12.73   42.87    34.01    39.92     59.20
Ours-PredFact(Obj+Att+Rel+Extra)        38.52    17.86    51.55   34.65   12.87   44.34    35.20    41.08     59.75
Ours-PredFact(Obj+Att+Rel)+VGG          40.34    17.80    52.12   34.98   12.78   45.37    36.44    42.16     60.09
Ours-PredFact(Obj+Att+Rel+Extra)+VGG    40.91    18.33    52.33   35.50   12.88   46.04    36.99    42.73     60.39

Table 2: Ablation study on the Visual Genome QA dataset. Accuracy (%) for the different question types and overall is shown; the percentage of questions of each type is given in parentheses. We additionally report the overall WUPS at 0.9 and 0.0 for each model.
The overall accuracy of this model on the test split is 34.77%, which already outperforms the baseline model VGG+LSTM. However, there is still a gap between this model and HieCoAtt-VGG (35.94%), which applies the co-attention mechanism to the questions and the whole-image features. Considering that the image features are not yet used at this stage, this result is reasonable.

Based on this initial model, we add the 'attribute' and 'relationship' facts separately, producing two models, GtFact(Obj+Att) and GtFact(Obj+Rel). Table 2 shows that the former performs better than the latter (37.50% vs. 35.15%), although both outperform the previous 'object'-only model. This suggests that 'attribute' facts are more effective than 'relationship' facts. However, when it comes to questions starting with 'where' and 'who', GtFact(Obj+Rel) performs slightly better (18.10% vs. 17.56%, 38.22% vs. 37.45%), which suggests that the 'relationship' facts play a more important role in these types of question. This makes sense because 'relationship' facts naturally encode location and identity information, for example <cat, on, mat> and <boy, holding, ball>. The GtFact(Obj+Att) model exceeds the image-feature-based co-attention model HieCoAtt-VGG by a large margin, 1.56%. This is mainly caused by the substantial improvement on the 'what' questions, from 39.72% to 42.21%, which is not surprising because the 'attribute' facts include detailed information relevant to the 'what' questions, such as 'color', 'shape', 'size', 'material' and so on.

In GtFact(Obj+Att+Rel), all of the facts are plugged into the proposed model, which brings further improvements for nearly all question types, achieving an overall accuracy of 38.06%. Compared with the baseline model VGG+Obj+Att+Rel+Extra+LSTM, which uses a conventional method (concatenating and embedding) to introduce additional features into VQA, our model outperforms it by 4.18%, a large gap. However, for the 'when' questions, we find that the performance of our 'facts'-based models is always lower than that of the image-based ones. We observe that this is mainly because the annotated facts in the Visual Genome normally do not cover 'when'-related information, such as time and day or night. The survey paper [34] makes a similar observation: "98.8% of the 'when' questions in the Visual Genome can not be directly answered from the provided scene graph annotations". Hence, in the final model GtFact(Obj+Att+Rel)+VGG, we add the image features back, allowing for a higher-order co-attention between image, question and facts, and achieving an overall accuracy of 39.30%. It also brings a 1% improvement on the 'when' questions. The WUPS evaluation exhibits the same trend as the above results; the question type-specific results can be found in the supplementary material.

4.2.2 Evaluation with Predicted Facts

In a normal VQA setting the ground truth facts are not provided, so we now evaluate our model using predicted facts. All facts were predicted by models that have been pre-trained on different computer vision tasks, e.g., object detection, attribute prediction, relationship prediction and scene classification (see Sec. 4.1 for more details). The PredFact(Obj+Att+Rel) model uses all of the predicted facts as input, while PredFact(Obj+Att+Rel+Extra) additionally uses the predicted scene category, which is trained on a different data source, the MIT Places 205 [40]. From Table 2, we see that although all facts are included, the performance of these two facts-only models is still lower than that of the previous state-of-the-art model HieCoAtt-VGG. Considering that errors in fact prediction may be passed on to the question answering part, and that image features are not used, these results are reasonable. If the fact prediction models perform better, the final answer accuracy will also be higher, as shown in the previous ground-truth facts experiments.
Figure 4: Qualitative results produced by our complete model on the Visual Genome QA test split. For each example, the image, QA pair, attention map and predicted Top-3 reasons are shown in order. Our model is capable of co-attending between question, image and supporting facts. The colored words in the question have the Top-3 identified weights, ordered as red, blue and cyan. The highlighted area in the attention map indicates the attention weights on the image regions (from red: high to blue: low). The Top-3 identified facts are re-formulated as human readable reasons, shown as bullets. The question/answer pairs and generated reasons are:

Q: What is in the water? A: boat. Reasons: This image contains the object of boat. / The boat is on the water. / This image happens in the scene of harbor.
Q: What time of day is it? A: daytime. Reasons: The sky is blue. / The horse is pulling. / This image contains the object of road.
Q: What color is the plane's tail? A: blue. Reasons: The tail is blue. / This image contains the object of airplane. / This image happens in the scene of airport.
Q: What is sitting on the chair? A: cat. Reasons: This image contains the object of cat. / The cat is on the chair. / The fur is on the cat.
Q: Who is surfing? A: man. Reasons: The man is riding the surfboard. / The man is wearing the wetsuit. / This image contains the action of surfing.
Q: Where is the kite? A: in the sky. Reasons: The kite is in the sky. / This image contains the object of kite. / This image happens in the scene of beach.
Q: How is the weather outside? A: rainy. Reasons: The woman is with the umbrella. / This image contains the attribute of rain. / This image contains the object of umbrella.
Q: What the policemen riding? A: horse. Reasons: The military officer is on the horse. / The military officer is riding the horse. / This image contains the object of police.
Q: Where is this place? A: office. Reasons: This image contains the object of desk. / The printer is on the desk. / This image happens in the scene of office.
Q: Who is talking on a phone? A: woman. Reasons: The woman is holding the telephone. / The woman is wearing the sunglasses. / The woman is wearing the short pants.
PredFact(Obj+Att+Rel+Extra) outperforms the PredFact(Obj+Att+Rel) model by 1.2% because the extra scene-category facts are included. This suggests that our model benefits from using multiple off-the-shelf CV methods, even though they are trained on different data sources and on different tasks, and that as more facts are added, our model performs better. Compared with the baseline model VGG+Obj+Att+Rel+Extra+LSTM, our PredFact(Obj+Att+Rel+Extra) outperforms it by a large margin, which suggests that our co-attention model can exploit the fact-based information more effectively. Image features are further input to our model, producing two variants, PredFact(Obj+Att+Rel)+VGG and PredFact(Obj+Att+Rel+Extra)+VGG. Both of these outperform the state of the art, and our complete model PredFact(Obj+Att+Rel+Extra)+VGG outperforms it by 1%, i.e., more than 6,000 additional questions are correctly answered. Please note that all of these results are produced using the naive fact extraction models described in Sec. 4.1. We believe that as better fact extraction models (such as the relationship prediction model of [15], for instance) become available, the results will improve further.
4.2.3 Human Agreement on Predicted Reasons
A key differentiator of our proposed model is that the weights resulting from the fact attention process can be used to generate human-interpretable reasons for the answer produced. To evaluate the human agreement with these generated reasons, we sampled 1,000 questions that had been correctly answered by our PredFact(Obj+Att+Rel+Extra)+VGG model, and conducted a human agreement study on the generated reasons.
Methods (in order): iBOWING [41], MCB-VGG [8], DPPnet [22], D-NMN [1], VQA team [3], SMem [38], SAN [39], ACK [35], DMN+ [37], MRN-VGG [13], HieCoAtt-VGG [16], Re-Ask-ResNet [20], FDA-ResNet [12], MRN-ResNet [13], HieCoAtt-ResNet [16], MCB-Att-ResNet [8], Ours-VGG, Ours-ResNet.

Test-dev, Open-Ended, Y/N: 76.6 80.7 80.5 80.5 80.9 79.3 81.0 80.5 82.5 79.6 78.4 81.1 82.4 79.7 82.5 81.2 81.5
Test-dev, Open-Ended, Num/Other: 35.0/42.6 37.2/41.7 37.4/43.1 36.8/43.1 37.3/43.1 36.6/46.1 38.4/45.2 36.8/48.3 38.3/46.8 38.4/49.1 36.4/46.3 36.2/45.8 38.4/49.3 38.7/51.7 37.6/55.6 37.7/50.5 38.4/53.0
Test-dev, Open-Ended, All: 55.7 57.1 57.2 57.9 57.8 58.0 58.7 59.2 60.3 60.5 60.5 58.4 59.2 61.5 61.8 64.7 61.7 63.1
Test-dev, Multiple-Choice, Y/N: 76.7 80.8 80.5 82.6 79.7 82.4 79.7 81.3 81.5
Test-dev, Multiple-Choice, Num/Other: 37.1/54.4 38.9/52.2 38.2/53.0 39.9/55.2 40.1/57.6 39.7/57.2 40.0/59.8 39.9/60.5 40.0/62.2
Test-dev, Multiple-Choice, All: 61.7 62.5 62.7 64.8 64.9 65.6 65.8 69.1 66.8 67.7
Test-std, Open-Ended, Y/N: 76.8 80.3 80.6 80.8 81.1 78.2 82.4 81.3 81.4
Test-std, Open-Ended, Num/Other: 35.0/42.6 36.9/42.2 36.4/43.7 37.3/43.1 37.1/45.8 36.3/46.3 38.2/49.4 36.7/50.9 38.2/53.2
Test-std, Open-Ended, All: 55.9 57.4 58.0 58.2 58.2 58.9 59.4 60.4 58.4 59.5 61.8 62.1 61.9 63.3
Test-std, Multiple-Choice, Y/N: 76.9 80.4 80.6 82.4 81.4 81.4
Test-std, Multiple-Choice, Num/Other: 37.3/54.6 38.8/52.8 37.7/53.6 39.6/58.4 39.0/60.8 39.8/62.3
Test-std, Multiple-Choice, All: 62.0 62.7 63.1 66.3 66.1 67.0 67.8

Table 3: Single model performance on the VQA-real test set in the open-ended and multiple-choice settings. Not all methods report every split, so some column lists are shorter than the method list; within each column, values are listed in the method order above.
Since this is the first VQA model that can generate human readable reasons, there is no previous work that we can follow to perform the human evaluation. We have thus designed the following human agreement experimental protocol. First, an image with the question and our correctly generated answer is given to a human subject. Then a list of human readable reasons (ranging from 20 to 40) is shown. These reasons are formulated from facts predicted by the fact extraction models introduced in Sec. 4.1. Although these reasons are all related to the image, not all of them are relevant to the particular question asked. The task of the human subjects is thus to choose the reasons that are related to answering the question. The human agreement is calculated by matching the human-selected reasons with our model-ranked reasons. In order to ease the human subjects' workload and to give them a unique guide for selecting the 'reasons', we ask them to select only the top-1 reason that is related to answering the question; they can choose nothing if they think none of the provided reasons are useful. Finally, in the evaluation, we calculate the rate at which the human-selected reason can be found in our generated top-1/3/5 reasons. We find that 30.1% of the human-selected top-1 reasons can be matched with our model's top-1 ranked reason. For the top-3 and top-5, the matching rates are 54.2% and 70.9%, respectively. This suggests that the reasons generated are both interpretable and informative. Figure 4 shows some example reasons generated by our model.
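The top-1/3/5 matching rate used above can be computed as in the following sketch (the reason identifiers and data layout are illustrative, not the actual study data):

def topk_match_rate(model_ranked, human_top1, k):
    """Fraction of questions whose human-selected reason appears in the model's top-k reasons.

    model_ranked: list of lists of reason ids, best first, one list per question
    human_top1:   list of the single reason id chosen by the human subject (or None)
    """
    hits = sum(1 for ranked, chosen in zip(model_ranked, human_top1)
               if chosen is not None and chosen in ranked[:k])
    return hits / len(human_top1)

# e.g. topk_match_rate(ranked, chosen, 1) -> 0.301, k=3 -> 0.542, k=5 -> 0.709 in our study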
4.3. Results on the VQA-real

Table 3 compares our approach with the state of the art on the VQA-real dataset. Since we do not use any ensemble models, we only compare with the single models on the VQA test leader-board. The test-dev split is normally used for validation, while test-standard is the default test data. The first section of Table 3 shows the state-of-the-art methods that use VGG features, except iBOWING [41], which uses GoogLeNet features [27]. The second section gives the results of models that use ResNet [11] features. Two versions of our complete model are evaluated in the last section, using VGG and ResNet features respectively. Ours-VGG produces the best result on all of the splits, compared with models using the same VGG image encoding. Ours-ResNet ranks second amongst the single models using ResNet features on the test-dev split, but achieves state-of-the-art results on test-std, for both Open-Ended and Multiple-Choice questions. The best result on test-dev with ResNet features is achieved by the Multimodal Compact Bilinear (MCB) pooling model with visual attention [8]. We believe that MCB could be integrated within our proposed co-attention model, by replacing the linear embedding steps in Eqs. 1 and 2, but we leave this as future work.
5. Conclusion

We have proposed a new approach which is capable of adaptively combining the outputs of other algorithms in order to solve a new, more complex problem. We have shown that the approach can be applied to the problem of Visual Question Answering, and that in doing so it achieves state-of-the-art results. Visual Question Answering is a particularly interesting application of the approach, as in this case the new problem to be solved is not completely specified until run time. In retrospect, it seems strange to attempt to answer general questions about images without first providing access to readily available image information that might assist in the process. In developing our approach we proposed a co-attention method applicable to questions, images and facts jointly. We also showed that the attention-weighted facts serve to illuminate why the method reached its conclusion, which is critical if such techniques are to be used in practice.
References
[1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Learning to compose neural networks for question answering. In Proc. Conf. of North American Chapter of Association for Computational Linguistics, 2016.
[2] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural Module Networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
[4] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. Springer, 2007.
[5] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In Proc. Advances in Neural Inf. Process. Syst. Workshop, 2011.
[7] M. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained neural networks. In Proc. Int. Conf. Mach. Learn., pages 37–45, 1994.
[8] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proc. Conf. Empirical Methods in Natural Language Processing, 2016.
[9] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering. In Proc. Advances in Neural Inf. Process. Syst., 2015.
[10] A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[12] I. Ilievski, S. Yan, and J. Feng. A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485, 2016.
[13] J.-H. Kim, S.-W. Lee, D.-H. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, and B.-T. Zhang. Multimodal residual learning for visual QA. In Proc. Advances in Neural Inf. Process. Syst., 2016.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
[15] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei. Visual relationship detection with language priors. In Proc. Eur. Conf. Comp. Vis., 2016.
[16] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In Proc. Advances in Neural Inf. Process. Syst., 2016.
[17] L. Ma, Z. Lu, and H. Li. Learning to Answer Questions From Image using Convolutional Neural Network. In Proc. Conf. AAAI, 2016.
[18] F. Mahdisoltani, J. Biega, and F. Suchanek. YAGO3: A knowledge base from multilingual Wikipedias. In CIDR, 2015.
[19] M. Malinowski, M. Rohrbach, and M. Fritz. Ask Your Neurons: A Neural-based Approach to Answering Questions about Images. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
[20] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A deep learning approach to visual question answering. arXiv preprint arXiv:1605.02697, 2016.
[21] A. Mallya and S. Lazebnik. Learning models for actions and person-object interactions with transfer to question answering. In Proc. Eur. Conf. Comp. Vis., 2016.
[22] H. Noh, P. H. Seo, and B. Han. Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[23] M. Ren, R. Kiros, and R. Zemel. Image Question Answering: A Visual Semantic Embedding Model and a New Dataset. In Proc. Advances in Neural Inf. Process. Syst., 2015.
[24] K. Saito, A. Shin, Y. Ushiku, and T. Harada. DualNet: Domain-invariant network for visual question answering. arXiv preprint arXiv:1606.06108, 2016.
[25] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Representations, 2015.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
[28] D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. CoRR, abs/1609.05600, 2016.
[29] S. Thrun. Extracting rules from artificial neural networks with distributed representations. In Proc. Advances in Neural Inf. Process. Syst., pages 505–512, 1995.
[30] T. Tommasi, A. Mallya, B. Plummer, S. Lazebnik, A. C. Berg, and T. L. Berg. Solving visual madlibs with multiple cues. In Proc. British Machine Vis. Conf., 2016.
[31] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570, 2015.
[32] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, and A. Dick. FVQA: Fact-based visual question answering. arXiv preprint arXiv:1606.05433, 2016.
[33] Q. Wu, C. Shen, A. v. d. Hengel, L. Liu, and A. Dick. What Value Do Explicit High Level Concepts Have in Vision to Language Problems? In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[34] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel. Visual question answering: A survey of methods and datasets. arXiv preprint arXiv:1607.05910, 2016.
[35] Q. Wu, P. Wang, C. Shen, A. Dick, and A. v. d. Hengel. Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[36] Z. Wu and M. Palmer. Verbs semantics and lexical selection. In Proc. Conf. Association for Computational Linguistics, 1994.
[37] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In Proc. Int. Conf. Mach. Learn., 2016.
[38] H. Xu and K. Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Proc. Eur. Conf. Comp. Vis., 2016.
[39] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked Attention Networks for Image Question Answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
[40] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Proc. Advances in Neural Inf. Process. Syst., 2014.
[41] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, and R. Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015.
[42] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded Question Answering in Images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
A. Reformatting Facts to Reasons

In this section we provide additional detail on the method used to reformulate the ranked facts into human-readable reasons using a template-based approach. Each extracted fact is encoded as a structured triplet (subject, relation, object) (see Sec. 3.1 for more details), where subject and object denote two visual concepts and relation represents a relationship between these two concepts. This structured representation enables us to reformulate a fact into a human-readable sentence easily, with a pre-defined template. All of the templates are shown in Table 4, and example reasons generated from them are given in Table 5. For the (img, att, img_att) triplet, since we also have the super-class label of the img_att vocabulary, such as 'action', 'number', 'attribute' etc., we can generate more varied reasons. For example, we can generate a reason such as 'This image contains the action of surfing', because 'surfing' belongs to the super-class 'action'. A code sketch of this procedure follows the two tables below.
Fact Triplet             Reason Template
(img, scene, img_scn)    This image happens in the scene of img_scn.
(img, att, img_att)      This image contains the attribute of img_att.
(img, contain, obj)      This image contains the object of obj.
(obj, att, obj_att)      The obj is obj_att.
(obj1, rel, obj2)        The obj1 is rel the obj2.

Table 4: Facts represented by triplets and the corresponding reason templates. img, scene, att and contain are specific tokens, while img_scn, obj, img_att, obj_att and rel refer to vocabularies describing image scenes, objects, image/object attributes and relationships between objects. These words are filled into the pre-defined templates to generate reasons.

Example Fact             Example Reason
(img, scene, office)     This image happens in the scene of office.
(img, att, wedding)      This image contains the attribute of wedding.
(img, contain, dog)      This image contains the object of dog.
(shirt, att, red)        The shirt is red.
(man, hold, umbrella)    The man is hold the umbrella.

Table 5: Example facts in triplet form and their corresponding generated reasons, based on the templates in Table 4.
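A minimal sketch of this template-based reformulation; the super-class handling for img_att is simplified to a caller-supplied label, which may not match the authors' exact rules.

TEMPLATES = {
    "scene":   "This image happens in the scene of {obj}.",
    "img_att": "This image contains the {cls} of {obj}.",   # cls: 'attribute', 'action', 'number', ...
    "contain": "This image contains the object of {obj}.",
    "obj_att": "The {subj} is {obj}.",
    "rel":     "The {subj} is {rel} the {obj}.",
}

def fact_to_reason(subj, rel, obj, att_superclass=None):
    """Turn a (subject, relation, object) triplet into a human-readable reason."""
    if subj == "img" and rel == "scene":
        return TEMPLATES["scene"].format(obj=obj)
    if subj == "img" and rel == "att":
        return TEMPLATES["img_att"].format(cls=att_superclass or "attribute", obj=obj)
    if subj == "img" and rel == "contain":
        return TEMPLATES["contain"].format(obj=obj)
    if rel == "att":
        return TEMPLATES["obj_att"].format(subj=subj, obj=obj)
    return TEMPLATES["rel"].format(subj=subj, rel=rel, obj=obj)

print(fact_to_reason("img", "att", "surfing", att_superclass="action"))
# -> This image contains the action of surfing.
print(fact_to_reason("man", "hold", "umbrella"))
# -> The man is hold the umbrella.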
B. WUPS Evaluation on the Visual Genome QA

The WUPS metric calculates the similarity between two words based on their common subsequence in the taxonomy tree. If the similarity between the two words is greater than a threshold, then the candidate answer is considered correct. We report results at thresholds 0.9 and 0.0, following [17, 23]; a code sketch of this matching is given after the tables. Tables 6 and 7 show the question type-specific results.

Methods                                 What     Where    When    Who     Why     How      Overall
                                        (60.5%)  (17.0%)  (3.5%)  (5.5%)  (2.7%)  (10.8%)
VGG+LSTM [3]                            42.84    18.61    55.81   35.75   12.53   45.78    38.30
VGG+Obj+Att+Rel+Extra+LSTM              44.44    19.12    55.85   37.94   12.62   46.98    39.61
HieCoAtt-VGG [16]                       47.38    19.89    55.83   39.40   13.72   48.15    41.75
GtFact(Obj)                             45.86    20.05    55.19   42.99   13.86   48.80    40.83
GtFact(Obj+Att)                         49.74    19.87    55.53   42.95   13.98   47.00    43.24
GtFact(Obj+Rel)                         46.35    20.45    54.93   43.75   13.89   46.42    41.25
GtFact(Obj+Att+Rel)                     50.45    20.55    54.98   43.81   14.15   47.37    43.86
GtFact(Obj+Att+Rel)+VGG                 51.69    21.18    55.90   44.06   14.05   49.03    44.94
PredFact(Obj+Att+Rel)                   44.87    19.46    54.97   39.70   13.89   46.04    39.92
PredFact(Obj+Att+Rel+Extra)             46.23    20.27    55.30   40.31   13.95   47.37    41.08
PredFact(Obj+Att+Rel)+VGG               47.86    20.17    55.77   40.35   13.81   48.34    42.16
PredFact(Obj+Att+Rel+Extra)+VGG         48.48    20.69    55.76   40.96   13.97   48.97    42.73

Table 6: Ablation study on the Visual Genome QA dataset. WUPS at 0.9 (%) for the different question types is shown. The percentage of questions of each type is shown in parentheses.
Methods                                 What     Where    When    Who     Why     How      Overall
                                        (60.5%)  (17.0%)  (3.5%)  (5.5%)  (2.7%)  (10.8%)
VGG+LSTM [3]                            66.49    27.39    63.29   55.25   15.99   72.25    58.39
VGG+Obj+Att+Rel+Extra+LSTM              67.10    27.64    63.25   56.05   16.12   72.61    58.89
HieCoAtt-VGG [16]                       68.41    28.50    62.97   56.83   17.41   73.33    59.97
GtFact(Obj)                             67.94    28.57    62.27   58.19   17.42   72.78    59.69
GtFact(Obj+Att)                         69.07    28.37    62.68   58.35   17.62   73.06    60.39
GtFact(Obj+Rel)                         68.22    28.85    62.10   58.60   17.51   72.66    59.91
GtFact(Obj+Att+Rel)                     69.43    28.85    62.10   58.67   18.00   73.24    60.72
GtFact(Obj+Att+Rel)+VGG                 69.99    29.29    62.91   58.56   17.70   73.80    61.21
PredFact(Obj+Att+Rel)                   67.36    28.27    62.30   56.73   17.54   72.68    59.20
PredFact(Obj+Att+Rel+Extra)             67.98    28.91    62.56   57.11   17.70   72.96    59.75
PredFact(Obj+Att+Rel)+VGG               68.55    28.71    62.70   57.01   17.43   73.32    60.09
PredFact(Obj+Att+Rel+Extra)+VGG         68.88    29.07    62.91   57.28   17.76   73.38    60.39

Table 7: Ablation study on the Visual Genome QA dataset. WUPS at 0.0 (%) for the different question types is shown. The percentage of questions of each type is shown in parentheses.
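For reference, the thresholded WUPS matching described at the start of this appendix can be sketched with NLTK's WordNet interface as follows. This follows the commonly used set-based WUPS formulation, which down-weights similarities below the threshold by 0.1; the authors' exact evaluation script may differ.

from nltk.corpus import wordnet as wn   # requires nltk and the 'wordnet' corpus

def wup(word_a, word_b):
    """Best Wu-Palmer similarity over all synset pairs of the two words."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a) for s2 in wn.synsets(word_b)]
    return max(scores, default=0.0)

def wups(answer, truth, threshold):
    """Set-based WUPS score between two tokenized answers at a given threshold."""
    def soft(a, b):
        s = wup(a, b)
        return s if s >= threshold else 0.1 * s      # down-weight low-similarity matches

    def cover(xs, ys):
        prod = 1.0
        for x in xs:
            prod *= max(soft(x, y) for y in ys)
        return prod

    return min(cover(answer, truth), cover(truth, answer))

# e.g. wups(["boat"], ["ship"], threshold=0.9)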