Learning Multimodal Semantic Models for Image Question Answering

Zichao Yang¹, Xiaodong He², Jianfeng Gao², Li Deng²
¹Carnegie Mellon University, ²Microsoft Research, Redmond, WA 98052, USA
[email protected], {xiaohe, jfgao, deng}@microsoft.com

Abstract

This paper presents a set of models and methods that learn to answer natural language questions about an image. The proposed methods use deep neural networks to jointly learn the semantic representations of the image and the question to predict the answer. We carried out evaluations on three image question answering (QA) datasets. The experimental results demonstrate that the proposed image QA system, based on multimodal semantic models, gives superior performance, outperforming previous state-of-the-art approaches by a significant margin.

1 Introduction

With the advancement of deep learning in computer vision and in natural language processing, the merging of vision and language is becoming an increasingly important research area. One of the recent successes in this area is automatic caption generation for images [2, 18, 3, 9, 7, 13, 20]. The models underlying this success jointly learn high-level representations of images using convolutional neural networks (CNNs) and of word sequences using recurrent neural networks (RNNs) or CNNs. More recently, image question answering (QA) has been proposed as a new challenging task in the vision and language area. In image QA, we need to answer natural language questions according to the content of a reference image. To facilitate research on image QA, several datasets have been constructed, either through automatic generation based on image caption data, or by human labeling of questions and answers given images [4, 14, 11, 1, 12].

Though closely related, image QA differs from automatic image captioning in several aspects. Unlike image captioning, language generation is not a main focus for image QA. Therefore, instead of learning models for generating syntactically fluent language, a model for image QA focuses more on capturing the semantics of the image and the question effectively, and thereby predicting answers correctly.

In this paper, we propose using deep neural networks to jointly learn the semantic representations of the image and the question, and then to predict the answer based on the learned abstract semantic representations. Unlike previous work [11], no domain-specific knowledge in language or vision processing, such as natural language parsing or image segmentation, is required. Instead, the method reported in this paper is an end-to-end image QA framework that is learnable directly from the training data. The overall architecture of our model is shown in Fig. 1. Our model is composed of three parts: an image model for extracting semantic representations of images, a question (text) model for extracting semantic representations of natural language questions, and a prediction model that combines the image representation and the question representation to predict the answer.

The main contributions of our work are: 1) a novel end-to-end trainable framework, based on joint image and text semantic representation learning, for image QA tasks; 2) a comprehensive evaluation on three image QA benchmarks; 3) a detailed analysis and demonstration of the effectiveness of the proposed image QA system through case studies.

2 The Proposed Model

In this work, we map the image $I$ and the question $Q$ into the same semantic space through joint learning. Denote their semantic vectors by $v_I$ and $v_Q$, respectively. We assume that the semantics of the image can be decomposed into the semantics of the question plus the semantics of the answer and some background noise. Thus, subtracting $v_Q$ from $v_I$ provides strong information for predicting the answer. Based on this assumption, we propose the multimodal semantic learning based image QA framework illustrated in Fig. 1.

Figure 1: The multimodal model for image QA. A CNN maps the image to $v_I$, a CNN/LSTM maps the question (e.g., "what is the color of the horses") to $v_Q$, and a softmax layer over $v_I - v_Q$ predicts the answer (e.g., "brown").

2.1 Predicting answers from semantic representation of images and questions

As discussed above, given the semantic vectors $v_I$ and $v_Q$ for the image and the question, subtracting $v_Q$ from $v_I$ provides strong information for predicting the answer. In the three image QA benchmarks, the set of possible answers is pre-defined. Therefore, we predict the answer directly from $v_I - v_Q$ using a classifier, i.e.,

$$p_A = \mathrm{softmax}\big(W_{mm}(v_I - v_Q) + b_{mm}\big). \qquad (1)$$
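Eq. (1) amounts to a single linear layer followed by a softmax over the predefined answer set. Below is a minimal sketch of this classifier; the paper does not name its toolkit, so PyTorch is assumed here, and the vector dimension and number of candidate answers are placeholders.

```python
import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    """Answer classifier of Eq. (1): p_A = softmax(W_mm (v_I - v_Q) + b_mm)."""
    def __init__(self, dim: int, num_answers: int):
        super().__init__()
        # W_mm and b_mm are folded into one linear layer.
        self.linear = nn.Linear(dim, num_answers)

    def forward(self, v_i: torch.Tensor, v_q: torch.Tensor) -> torch.Tensor:
        # v_i, v_q: (batch, dim) semantic vectors of the image and the question.
        logits = self.linear(v_i - v_q)
        return torch.softmax(logits, dim=-1)  # p_A over the predefined answer set

# Example with hypothetical sizes: 512-d joint space, 430 candidate answers.
v_i, v_q = torch.randn(4, 512), torch.randn(4, 512)
p_a = AnswerPredictor(dim=512, num_answers=430)(v_i, v_q)  # (4, 430)
```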

2.2 The Image Model

We use a CNN to learn the image features. Specifically, we use the VGGnet [16] to extract the image representation. Denoting the image by $I$, the VGGnet image feature is

$$h_I = \mathrm{CNN}_{vgg}(I). \qquad (2)$$

We then add an MLP layer to further transform the VGGnet image feature into the joint image-text semantic space:

$$v_I = \tanh(W_I h_I + b_I). \qquad (3)$$
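The sketch below illustrates Eqs. (2)-(3) under stated assumptions: the paper only says that VGGnet features are used, so the specific depth (VGG-19), the 4096-d penultimate activation used as $h_I$, and the 512-d joint space are placeholders, and PyTorch/torchvision is assumed.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageModel(nn.Module):
    """h_I = CNN_vgg(I), Eq. (2); v_I = tanh(W_I h_I + b_I), Eq. (3)."""
    def __init__(self, joint_dim: int = 512):
        super().__init__()
        vgg = models.vgg19(weights=None)  # pretrained weights would normally be loaded
        self.features, self.avgpool = vgg.features, vgg.avgpool
        # Keep the fully connected layers up to the 4096-d penultimate activation.
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:-1])
        self.proj = nn.Linear(4096, joint_dim)  # W_I, b_I

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, 224, 224)
        h_i = self.fc(torch.flatten(self.avgpool(self.features(images)), 1))
        return torch.tanh(self.proj(h_i))  # v_I in the joint image-text space
```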

2.3 The Question Model

In this work, we explore using the LSTM and the CNN to capture the semantic meaning of questions for image QA tasks.

2.3.1 The LSTM based question model

As illustrated in Fig. 2, the output $h_t$ after seeing the word sequence up to time $t$ preserves the semantic meaning of the text up to word $t$. Given the question $q = [q_1, ..., q_T]$, where $q_t$ is the one-hot vector representation of the word at position $t$, we first embed the words into a vector space through an embedding matrix, $x_t = W_e q_t$. Then, at every time step, we feed the corresponding embedding vector to the LSTM, and at the end we use the last output state as the representation of the whole question:

$$x_t = W_e q_t, \quad t \in \{1, 2, ..., T\} \qquad (4)$$
$$h_t = \mathrm{LSTM}(x_t), \quad t \in \{1, 2, ..., T\} \qquad (5)$$

The model is shown in Fig. 2, where the question "what is the color of the horse" is fed into the LSTM, and the final hidden state of the LSTM after consuming the last word of the question is taken as the semantic representation of the question, i.e., $v_Q = h_T$.

Figure 2: The LSTM based question model. Each question word ("what", "is", ..., "horse") is embedded through $W_e$ and fed to the LSTM; the final hidden state is taken as $v_Q$.
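Under the same assumptions (PyTorch, placeholder vocabulary and dimensions), a minimal sketch of Eqs. (4)-(5): the question indices are embedded through $W_e$ and fed to an LSTM, and the final hidden state is taken as $v_Q = h_T$.

```python
import torch
import torch.nn as nn

class LSTMQuestionModel(nn.Module):
    """x_t = W_e q_t, Eq. (4); h_t = LSTM(x_t), Eq. (5); v_Q = h_T."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, joint_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)        # W_e
        self.lstm = nn.LSTM(embed_dim, joint_dim, batch_first=True)

    def forward(self, question: torch.Tensor) -> torch.Tensor:
        # question: (batch, T) word indices of the question.
        x = self.embed(question)        # (batch, T, embed_dim)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, joint_dim)
        return h_n.squeeze(0)           # v_Q = h_T
```

For padded batches, one would pack the sequences (e.g., with `pack_padded_sequence`) so that $h_T$ corresponds to the last real word of each question.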

2.3.2 The CNN based question model

CNNs have been demonstrated to be powerful models for capturing the semantic meaning of text [15, 6]. In this work, we adopt a CNN architecture similar to those in [8, 15]. The diagram of the CNN question model is shown in Fig. 3. In the CNN, we first embed the words into a vector space, then apply convolution and pooling operations to the word embedding vectors. We use three convolution filters, which cover one (unigram), two (bigram), or three (trigram) words, respectively. A max pooling operation over time follows the convolution operation to generate a global semantic representation vector for the whole question.

Figure 3: The CNN based question model. The red, blue, and orange-colored parts illustrate the unigram, bigram, and trigram convolution and pooling operations, respectively. The semantic representation of the whole question is the vector after max pooling.
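A minimal sketch of this question CNN under the same assumptions: unigram, bigram, and trigram convolutions over the word embeddings, each followed by max pooling over time. The number of filters and the concatenation of the three pooled vectors into $v_Q$ are assumptions, as is any extra projection needed to match the dimension of $v_I$.

```python
import torch
import torch.nn as nn

class CNNQuestionModel(nn.Module):
    """Unigram/bigram/trigram convolutions with max pooling over time."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, num_filters: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per n-gram window size (1, 2, 3).
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=n, padding=n - 1)
            for n in (1, 2, 3)
        )

    def forward(self, question: torch.Tensor) -> torch.Tensor:
        # question: (batch, T) word indices.
        x = self.embed(question).transpose(1, 2)             # (batch, embed_dim, T)
        pooled = [torch.tanh(conv(x)).max(dim=2).values      # max pooling over time
                  for conv in self.convs]
        return torch.cat(pooled, dim=1)                      # v_Q: (batch, 3 * num_filters)
```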

3 Experiments

3.1 Datasets & Baselines

We evaluate our model on three common benchmarks recently proposed for image QA: DAQUAR-ALL [11], DAQUAR-REDUCED, and COCO-QA [14]. We compare our model with methods proposed in recent work on image QA. In [11], the authors use a semantic parser on the question and image segmentation to derive the answer. Other papers make use of neural network models [14, 12, 10]; although these papers also use an LSTM or a CNN, our approach has a different model architecture. We report the accuracy of answers in percentage in this paper. In addition, since the reference models also report the Wu-Palmer similarity (WUPS) measure [19], we report WUPS0.9 and WUPS0.0 as well.
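For reference, a simplified sketch of the thresholded Wu-Palmer score for single-word answers, using NLTK's WordNet interface; the full WUPS evaluation of [11, 19] aggregates over answer sets and is more involved, so this only illustrates the thresholding behind WUPS0.9 and WUPS0.0.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def wup_score(pred: str, truth: str, threshold: float = 0.9) -> float:
    """Thresholded Wu-Palmer similarity between two single-word answers."""
    if pred == truth:
        return 1.0
    # Best similarity over all synset pairs of the two words.
    sims = [s1.wup_similarity(s2) or 0.0
            for s1 in wn.synsets(pred) for s2 in wn.synsets(truth)]
    best = max(sims, default=0.0)
    # Scores below the threshold are down-weighted by a factor of 0.1.
    return best if best >= threshold else 0.1 * best
```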

3.2 Training and results

All the models are trained using stochastic gradient descent with a momentum of 0.9. The best learning rate is picked from {0.1, 0.01, 0.001}. The batch size is fixed at 100. Gradient clipping [5] and dropout [17] are used during training. The experimental results on the DAQUAR-ALL, DAQUAR-REDUCED, and COCO-QA datasets are shown in Tables 1, 4, and 3, respectively. The names of our models are self-explanatory. The experimental results show that our models outperform previous models significantly, by about 4% absolute across all datasets.
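As a hedged sketch of this training setup (PyTorch assumed; the clipping threshold and the model interface below are placeholders not specified in the text):

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer, images, questions, answers,
               max_grad_norm: float = 5.0):  # clipping threshold is a placeholder
    """One SGD step; `model` is assumed to map (images, questions) to p_A."""
    model.train()  # keeps dropout layers active during training
    optimizer.zero_grad()
    p_a = model(images, questions)                          # (batch, num_answers)
    loss = nn.functional.nll_loss(torch.log(p_a + 1e-8), answers)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# lr is selected from {0.1, 0.01, 0.001} on held-out data; batch size is 100.
```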

3.3 A Case Study

To further verify our assumption that the semantic meaning of the image can be decomposed into two parts, the semantics of the question and the semantics of the answer plus some background noise, in Fig. 4 we compute, for each row, the $v_Q$ of the question and the $v_I$ of the reference image, and then retrieve the four nearest images, measured by the distance between each image's semantic vector $v_I$ and the difference $v_I - v_Q$ of the query. Ideally, if the semantic information of the image is decomposable, then $v_I - v_Q$ corresponds to the semantic representation of the object of the answer plus some background. In Fig. 4, the nearest-neighbor images retrieved by $v_I - v_Q$ all contain the object of the answer, plus various backgrounds, showing that the proposed approach learns the image and the question in a unified semantic space, which then helps predict the correct answer effectively in our framework.
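A minimal sketch of this retrieval, assuming precomputed image vectors and Euclidean distance (the text only says "distance", so the metric is an assumption):

```python
import torch

def retrieve_nearest(v_i_query: torch.Tensor, v_q_query: torch.Tensor,
                     all_image_vecs: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Indices of the k images whose v_I is closest to v_I - v_Q of the query."""
    target = v_i_query - v_q_query                         # (dim,)
    dists = torch.norm(all_image_vecs - target, dim=1)     # (num_images,)
    return torch.topk(dists, k, largest=False).indices

# all_image_vecs: (num_images, dim) matrix of precomputed v_I vectors.
```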

Figure 4: Image query with $v_I - v_Q$.

Table 1: DAQUAR-ALL results, in percentage

Methods                     Accuracy  WUPS0.9  WUPS0.0
Multi-World [11]:
  Multi-World                   7.86    11.86    38.79
Neural-based [12]:
  Language                     17.15    22.80    58.42
  Neural-based                 19.43    25.28    62.00
CNN [10]:
  IMG-CNN                      23.40    29.59    62.95
Ours:
  LSTM-MM                      27.80    34.27    68.09
  CNN-MM                       28.61    34.91    68.84
Human [11]:
  Human                        50.20    50.82    67.27

Table 2: COCO-QA accuracy per class, in percentage

Methods                     Objects  Number  Color  Location
VSE [14]:
  GUESS                         2.11   35.84  13.87      8.93
  BOW                          37.27   43.56  34.75     40.84
  LSTM                         35.87   45.34  36.26     38.42
  IMG                          40.37   29.26  42.68     44.19
  IMG+BOW                      58.66   44.10  51.96     49.39
  VIS+LSTM                     56.53   46.10  45.87     45.52
  2-VIS+BLSTM                  58.17   44.79  49.53     47.34
Ours:
  LSTM-MM                      61.94   47.51  53.65     51.70
  CNN-MM                       62.96   46.97  53.86     53.15

Table 3: COCO-QA results, in percentage

Methods                     Accuracy  WUPS0.9  WUPS0.0
VSE [14]:
  GUESS                          6.65    17.42    73.44
  BOW                           37.52    48.54    82.78
  LSTM                          36.76    47.58    82.34
  IMG                           43.02    58.64    85.85
  IMG+BOW                       55.92    66.78    88.99
  VIS+LSTM                      53.31    63.91    88.25
  2-VIS+BLSTM                   55.09    65.34    88.64
CNN [10]:
  IMG-CNN                       54.95    65.36    88.58
  CNN                           32.70    44.32    80.89
Ours:
  LSTM-MM                       58.88    68.93    90.02
  CNN-MM                        59.69    69.62    90.20

Table 4: DAQUAR-REDUCED results, in percentage

Methods                     Accuracy  WUPS0.9  WUPS0.0
Multi-World [11]:
  Multi-World                   12.73    18.20    51.47
Neural-based [12]:
  Language                      31.65    38.35    80.08
  Neural-based                  34.68    40.76    79.54
VSE [14]:
  GUESS                         18.24    29.65    77.59
  BOW                           32.67    43.19    81.30
  LSTM                          32.73    43.50    81.62
  IMG+BOW                       34.17    44.99    81.48
  VIS+LSTM                      34.41    46.05    82.23
  2-VIS+BLSTM                   35.78    46.83    82.15
CNN [10]:
  IMG-CNN                       39.66    44.86    83.06
Ours:
  LSTM-MM                       43.45    49.09    83.09
  CNN-MM                        43.45    47.70    83.22
Human [11]:
  Human                         60.27    61.04    78.96

4 Conclusion

In this paper, we propose a novel framework for image QA and explore several important variations of it. The evaluation on three image QA benchmarks shows that our models outperform the methods proposed in previous studies. Further, we observe that the CNN performs better than the LSTM at extracting the general semantic meaning of sentences, while the LSTM captures the representation of numbers in context better than the CNN.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. arXiv preprint arXiv:1505.00468, 2015.
[2] Xinlei Chen and C. Lawrence Zitnick. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
[3] Hao Fang, Saurabh Gupta, Forrest N. Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back. CoRR, abs/1411.4952, 2014.
[4] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv preprint arXiv:1505.05612, 2015.
[5] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[6] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM, 2013.
[7] Andrej Karpathy, Armand Joulin, and Li Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889–1897, 2014.
[8] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
[9] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[10] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. arXiv preprint arXiv:1506.00333, 2015.
[11] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690, 2014.
[12] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. arXiv preprint arXiv:1505.01121, 2015.
[13] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632, 2014.
[14] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. arXiv preprint arXiv:1505.02074, 2015.
[15] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101–110. ACM, 2014.
[16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[18] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
[19] Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, 1994.
[20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
