Explainable Image Understanding using Vision and Reasoning
Somak Aditya
Department of Computer Science, Arizona State University, Tempe, USA
AAAI-17 Doctoral Consortium
Image Understanding Through Text
What is understanding? (Do you understand?)
● Ask students questions about a subject; if a student can answer them, he/she "understands" it.
● UNDERSTANDING here is equivalent to question answering.
Quality of understanding (How much do you understand?)
● Increase the difficulty of the questions.
● According to Bloom's Taxonomy [4], the levels are:
○ Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation
Image understanding through text:
● Has gained huge popularity recently.
● Primarily two tasks have been designed:
○ Caption Generation
○ Visual Question Answering
General Architecture
The architecture should:
● Explicitly model the connections between vision, reasoning, and knowledge. (Modular or not, reasoning and knowledge have to be modeled/learnt by any image understanding system; VQA and captioning models learn some such knowledge implicitly, but not transparently.)
Explainability
i. (Knowledge) List the objects in the image.
ii. (Comprehension) What will the man do next?
iii. (Application) How to cut tofu?
iv. (Analysis) Why is the man holding the bowl with his other hand?
v. (Synthesis) Can you propose how else to cut tofu?
vi. (Evaluation) Is there a better way to cut tofu?
Given the results:
● Why did you do that?
● Why not something else?
● When do you succeed?
● When do you fail?
● When can I trust you?
● How do I correct an error?
How do you explain:
● (Customer) Natural language, simple.
● (Manager) Structured, detailed.
Difficulty in Current Architectures
Difficulties of explainability in end-to-end learning:
● What to fix? (a module, a parameter, or a function?)
● Some work exists on understanding learnt models [3].
● Can we explain in natural-language space or in symbolic space?
● Are structured explanations possible?
● Largely unexplored.
What can we do?
○ [Impose a structure] Use knowledge and probabilistic reasoning to replace the final layers. (Current work)
○ [Explanation interface] Use knowledge and probabilistic reasoning to explain. (Planned)
DeepIU: An Architecture
What do we try? [Examples]
● Visual commonsense for scene understanding using perception, semantic parsing, and reasoning
● [Image Captioning] Image understanding through Scene Description Graphs
● [Architecture] DeepIU: an architecture for image understanding
● [Puzzles/New Challenge] Image riddles using vision and reasoning
[Figures: an example; the implementation used in SDG; another example]
Components Used
Type of Knowledge Used:
• A knowledge graph: combined semantic parses of sentences.
• A Bayesian network: dependencies among objects and scene constituents.
How did we store the knowledge:
• A graph on the file system, with a self-built query engine.
Reasoning Module Used:
• Probabilistic reasoning using the Bayesian network, and IF-THEN reasoning using the constructed knowledge base (a sketch follows below).
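A minimal sketch of these components, assuming a triple store standing in for the knowledge graph and a toy conditional-probability table standing in for the Bayesian network; all facts, relation names, and probabilities below are illustrative, not the actual implementation:

# Knowledge graph: combined semantic parses stored as
# (subject, relation, object) triples.
KG = [
    ("man", "agent_of", "hold"),
    ("hold", "object", "bowl"),
    ("bowl", "instance_of", "artifact"),
]

def query(subj=None, rel=None, obj=None):
    """Self-built query engine: return triples matching an (s, r, o) pattern."""
    return [t for t in KG
            if (subj is None or t[0] == subj)
            and (rel is None or t[1] == rel)
            and (obj is None or t[2] == obj)]

# Bayesian-network dependency among objects and scene constituents,
# reduced here to a toy conditional probability table.
P_constituent_given_scene = {
    ("stove", "kitchen"): 0.60,
    ("stove", "beach"):   0.01,
}

print(query(rel="agent_of"))                            # [('man', 'agent_of', 'hold')]
print(P_constituent_given_scene[("stove", "kitchen")])  # 0.6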
Application: Image Riddles
Toy Example: What is the connecting word? The answer is "Fall". Why? i) The first image depicts the season fall, ii) the second has a waterfall, iii) the third has rainfall, and iv) in the fourth a statue is "fall"-ing.
The tofu example:
● A snapshot from a cooking video.
● We detect ⟨subject, action, object⟩ triplets.
● Here, we detect ⟨knife, cutting, bowl⟩.
● Downstream inference: "Knife is cutting bowl" (?)
○ Not possible!
○ Humans use commonsense.
○ The knife is cutting something inside the bowl, i.e., the tofu.
Motivations:
GUR+All is our method. Higher bars on the right mean more correctly solved puzzles and more intelligent (less gibberish) answers.
● Visual Question Answering: the task requires an explicit model of commonsense reasoning. However, a large percentage of the dataset concentrates on "what", "where", and "how many" questions.
● VQA: a constrained set of answers.
● Image Riddles: target answers in train and test are mostly unique (zero-shot).
● Lastly, puzzles are fun.
● We used:
○ (Commonsense Knowledge) An artifact cutting an artifact is abnormal.
○ (Ontological Knowledge) From the semantic parser (K-Parser): knife and bowl are artifacts.
○ (Reasoning) Using Answer Set Programming, it is easy to conclude that "Knife is cutting bowl" is not true (a sketch follows below).
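A minimal sketch of this inference; the actual system encodes the rule in Answer Set Programming (roughly: abnormal(cut(X,Y)) :- artifact(X), artifact(Y).), rendered here in Python with illustrative predicate names:

# Ontological knowledge from the semantic parser (K-Parser): type facts.
ARTIFACTS = {"knife", "bowl"}

def abnormal_cut(agent, patient):
    # Commonsense rule: an artifact cutting an artifact is abnormal.
    return agent in ARTIFACTS and patient in ARTIFACTS

# Detected triplet: <knife, cutting, bowl>
if abnormal_cut("knife", "bowl"):
    # Reject the literal reading; prefer "knife cuts something inside the bowl".
    print('"Knife is cutting bowl" is abnormal -> revise the interpretation')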
Image Riddles
Interpretable Intermediate Structure
An undirected labeled graph:
● Nodes are verbs (actions) and nouns (objects, regions, attributes).
● Verbs are connected to concrete nouns via semantic roles.
● Inferred aspects are connected to a dummy node, SCENE (a sketch follows below).
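A minimal sketch of such a graph for a "man cutting tofu in a bowl" scene, using networkx; the node kinds, role labels, and scene label are illustrative, not the actual SDG schema:

import networkx as nx

sdg = nx.Graph()
sdg.add_node("cut", kind="verb")
for noun in ("man", "knife", "tofu", "bowl"):
    sdg.add_node(noun, kind="noun")
sdg.add_node("SCENE", kind="dummy")  # anchor node for inferred aspects

# Verbs connect to concrete nouns with semantic roles.
sdg.add_edge("cut", "man", role="agent")
sdg.add_edge("cut", "tofu", role="patient")
sdg.add_edge("cut", "knife", role="instrument")
sdg.add_edge("cut", "bowl", role="location")

# Inferred aspects (e.g., the scene label) connect to the dummy SCENE node.
sdg.add_node("kitchen", kind="inferred")
sdg.add_edge("SCENE", "kitchen", role="scene_label")

print(list(sdg.edges(data=True)))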
Visual Detections Used:
• Object recognition: CNN trained on ILSVRC [Girshick et al.]
• Scene classification [Zhou et al.]
• Scene constituents: pre-trained CNN (VGG-16) features + a multi-class SVM (see the sketch below)
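A minimal sketch of the CNN-features-plus-SVM recipe, using torchvision and scikit-learn as stand-ins for the original pipeline; image preprocessing and training data are omitted:

import torch
from torchvision import models
from sklearn.svm import LinearSVC

# Frozen, pre-trained VGG-16 as a feature extractor (no fine-tuning).
vgg = models.vgg16(weights="IMAGENET1K_V1").eval()
extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

def features(batch):
    # batch: (N, 3, 224, 224) tensor of preprocessed images.
    with torch.no_grad():
        return extractor(batch).numpy()

# Multi-class SVM over the frozen CNN features.
clf = LinearSVC()
# clf.fit(features(train_images), train_labels)
# predictions = clf.predict(features(test_images))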
Components Used
Visual Detections Used:
• Residual Network (ResNet-200)
• Clarifai API (commercial, fine-tuned)
Type of Knowledge Used:
• ConceptNet: semi-curated, with a large vocabulary.
How did we create/store the knowledge:
• Publicly available APIs.
Reasoning Module Used:
• Probabilistic Soft Logic (PSL): uses first-order-logic syntax to define Markov Random Field potentials (a sketch of the knowledge lookup and a PSL-style potential follows below).
Utilities:
• Question answering, sentence generation.
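A minimal sketch of how the knowledge and reasoning pieces might fit together, assuming the public ConceptNet REST endpoint (api.conceptnet.io) and a hand-rolled hinge-loss potential in place of a full PSL engine; the rule, its weight, and all truth values are illustrative:

import requests

def related_concepts(word, limit=5):
    # Query the public ConceptNet API for concepts related to a detected word.
    url = f"http://api.conceptnet.io/related/c/en/{word}?filter=/c/en"
    edges = requests.get(url, timeout=10).json().get("related", [])
    # Each entry looks like {"@id": "/c/en/autumn", "weight": 0.61}.
    return {e["@id"].rsplit("/", 1)[-1]: e["weight"] for e in edges[:limit]}

# related_concepts("fall") -> {"autumn": ..., ...}  (requires network access)

def hinge_potential(weight, body, head):
    # PSL compiles a weighted rule  w : body -> head  (truth values in [0, 1])
    # into the hinge-loss MRF potential  w * max(0, body - head).
    return weight * max(0.0, body - head)

# Toy grounding of:  2.0 : seen(word) & related(word, ans) -> answer(ans)
seen, related, answer = 0.9, 0.7, 0.4
body = max(0.0, seen + related - 1.0)      # Lukasiewicz conjunction
print(hinge_potential(2.0, body, answer))  # 0.4: penalty for a low answer truth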
https://imageriddle.wordpress.com/imageriddle/
Summary
Drawbacks:
• Explainability: how did you get the result?
• Explainability: do you know what to fix?
• Interpretable intermediate structure?
Scene Description Graph
• (KR/SRL paradigms) Connecting large-scale knowledge bases and reasoning efficiently.
• ASP: not probabilistic.
• Probabilistic Soft Logic: not well documented; inflexible in incorporating phrase semantics.
Solutions:
• A modular architecture (performing comparably to a previous deep-learning method).
• Defined the SDG (supports reasoning; can be used in QA and sentence generation).
• Modified and extended the Probabilistic Soft Logic semantics; will be made available for use.
• Constraint-addition capabilities.
• Predicates: not just symbols, but phrases and words (similarity calculated using embeddings; a sketch follows below).
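A minimal sketch of the embedding-based predicate similarity, with made-up three-dimensional vectors standing in for pre-trained word embeddings (e.g., word2vec):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in vectors; a real system would load pre-trained embeddings.
emb = {
    "fall":     np.array([0.80, 0.10, 0.30]),
    "autumn":   np.array([0.70, 0.20, 0.30]),
    "keyboard": np.array([0.05, 0.90, 0.10]),
}

# A rule mentioning "fall" can fire (softly) on a detection of "autumn",
# because the two predicate arguments are close in embedding space.
print(cosine(emb["fall"], emb["autumn"]))    # high
print(cosine(emb["fall"], emb["keyboard"]))  # low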
Future Work
• Extending the current Probabilistic Soft Logic.
• Visual Question Answering.
• Using unsupervised semantic parsing to create graphs from scenes.
• Extending these ideas to robotic vision.
References
Check out: http://bit.ly/1MMN1wZ
[1] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y., 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, vol. 14, pp. 77-81.
[2] Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., and Darrell, T., 2016, October. Generating visual explanations. In European Conference on Computer Vision (pp. 3-19). Springer International Publishing.
[3] Agrawal, A., Batra, D., and Parikh, D., 2016. Analyzing the behavior of visual question answering models. arXiv preprint arXiv:1606.07356.
[4] Bloom, B. S., 1956. Taxonomy of educational objectives. Vol. 1: Cognitive domain. New York: McKay, pp. 20-24.