Computational Perception of Physical Object Properties

by

Jiajun Wu

B.Eng., B.Ec., Tsinghua University (2014)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2016

© Massachusetts Institute of Technology 2016. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 29, 2016 (signature redacted)

Certified by: William T. Freeman, Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science, Thesis Supervisor (signature redacted)

Certified by: Joshua B. Tenenbaum, Professor of Computational Cognitive Science, Thesis Supervisor (signature redacted)

Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students (signature redacted)


Computational Perception of Physical Object Properties
by Jiajun Wu

Submitted to the Department of Electrical Engineering and Computer Science on January 29, 2016, in partial fulfillment of the requirements for the degree of Master of Science

Abstract

We study the problem of learning physical object properties from visual data. Inspired by findings in cognitive science that even infants are able to perceive a physical world full of dynamic content at an early age, we aim to build models to characterize object properties from synthetic and real-world scenes. We build a novel dataset containing over 17,000 videos with 101 objects in a set of visually simple but physically rich scenarios. We further propose two novel models for learning physical object properties by incorporating physics simulators, either a symbolic interpreter or a mature physics engine, with deep neural nets. Our extensive evaluations demonstrate that these models can learn physical object properties well and that, with a physics engine, the responses of the model positively correlate with human responses. Future research directions include incorporating the knowledge of physical object properties into the understanding of interactions among objects, scenes, and agents.

Thesis Supervisor: William T. Freeman
Title: Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science

Thesis Supervisor: Joshua B. Tenenbaum
Title: Professor of Computational Cognitive Science


Acknowledgments

I would like to express sincere gratitude to my advisors, Professor William Freeman and Professor Joshua Tenenbaum. Bill and Josh are always inspiring and encouraging, and have led me through my research with profound insights. Not only have they taught me how to aim for top-quality research, but they have also shared with me invaluable lessons about life.

I deeply appreciate the guidance and support of my undergraduate advisor, Professor Zhuowen Tu, who introduced me to the world of AI and vision and has been my mentor and friend ever since. I also thank Professor Andrew Chi-Chih Yao and Professor Jian Li for advising me during my undergraduate study, Dr. Yuandong Tian for mentoring me at Facebook AI Research, and Dr. Kai Yu and Dr. Yinan Yu for mentoring me at Baidu Research.

This thesis would not have been possible without the inspiration and support of my colleagues in the MIT Vision Group and the Computational Cognitive Science (CoCoSci) Group. I would like to express my appreciation to my collaborators, Dr. Joseph Lim, Tianfan Xue, and Dr. Ilker Yildirim. I am also thankful to other encouraging and helpful group members, especially Andrew Owens, Donglai Wei, Dr. Tomer Ullman, Katie Bouman, Kelsey Allen, Tejas Kulkarni, Dr. Dilip Krishnan, Dr. Hossein Mohabi, Dr. Tali Dekel, Dr. Daniel Zoran, Pedro Tsividis, and Hongyi Zhang. I would also like to extend my appreciation to my dear friends for their support in my academic and daily life.

I received the Edwin S. Webster Fellowship during my first year, and have been partially funded by NSF-6926677 (Reconstructive Recognition). I appreciate the support from all funding agencies. Finally, I thank my parents for their lifelong encouragement and love.


Contents

1  Introduction
2  Modeling the Physical World
   2.1  Scenarios
   2.2  The Physics 101 Dataset
3  Physical Object Model: Learning with a Symbolic Interpreter
   3.1  Visual Property Discoverer
   3.2  Physics Interpreter
   3.3  Physical World Simulator
   3.4  Experiments
        3.4.1  Learning Physical Properties
        3.4.2  Detecting Objects with Unusual Properties
        3.4.3  Predicting Outcomes
4  Physical Object Model: Incorporating a Physics Engine
   4.1  The Galileo Model
        4.1.1  Tracking as Recognition
        4.1.2  Inference
   4.2  Simulations
   4.3  Bootstrapping as Efficient Perception in Static Scenes
   4.4  Experiments
        4.4.1  Outcome Prediction
        4.4.2  Mass Prediction
        4.4.3  "Will it move" Prediction
5  Beyond Understanding Physics
6  Conclusion

List of Figures

2-1  Abstraction of the physical world, and a snapshot of our dataset.

2-2  Illustrations of the scenarios in our Physics 101 dataset.

2-3  Physics 101: this is the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a string. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm.

3-1  Our first model exploits advances in machine learning (convolutional neural networks): we supervise all levels through a physics interpreter, which provides physical constraints on the values each layer can take. During training and testing, our model has no labels of physical properties, in contrast to standard approaches.

3-2  Charts of the estimates for the rings. The physical properties, especially density, of the first ring are different from those of the other rings. The difference is hard to perceive from visual appearance alone; however, by observing videos with object interactions, our algorithm is able to learn the properties and find the outlier. All figures are on a log-normalized scale.

3-3  Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from top to bottom, left to right: dough, metal coin, metal pole, plastic block, plastic doll, and porcelain.

4-1  Our second model formalizes a hypothesis space of physical object representations, where each object is defined by its mass, friction coefficient, 3D shape, and a positional offset w.r.t. an origin. To model videos, we draw objects from that hypothesis space into the physics engine. The simulations from the physics engine are compared to observations in the velocity space.

4-2  Simulation results. Each row represents one video in the data: (a) the first frame of the video, (b) the last frame of the video, (c) the first frame of the simulated scene generated by Bullet, (d) the last frame of the simulated scene, (e) the estimated object with larger mass, (f) the estimated object with larger friction coefficient.

4-3  Mean squared errors of oracle estimation, our estimation, and uniform estimations of mass on a log-normalized scale, and the correlations between estimations and ground truths.

4-4  The log-likelihood traces of several chains with and without recognition-model (LeNet) based initializations.

4-5  Mean errors in numbers of pixels of human predictions, Galileo outputs, and a uniform estimate calculated by averaging ground-truth ending points over all test cases. As the error patterns are similar for both target objects (foam and cardboard), the errors here are averaged across target objects for each material.

4-6  Heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses).

4-7  Average accuracy of human predictions and Galileo outputs on the tasks of mass prediction and "will it move" prediction. Error bars indicate standard deviations of human accuracies.

List of Tables

3.1  Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation.

3.2  Correlation coefficients of our estimations and ground truth for mass, density, and volume.

3.3  Mean squared errors in pixels of human predictions (H), model outputs (M), or a uniform estimate minimizing the mean squared error (U).

3.4  Correlation coefficients on the tasks of predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats.

4.1  Correlations between pairs of outputs in the mass prediction experiment (in Spearman's coefficient) and in the "will it move" prediction experiment (in Pearson's coefficient).


Chapter 1  Introduction

Our visual system is designed to perceive a physical world that is full of dynamic content. Consider yourself watching a Rube Goldberg machine unfold: as the kinetic energy moves through the machine, you may see objects sliding down ramps, colliding with each other, rolling, entering other objects, falling - many kinds of physical interactions between objects of different masses, materials, and other physical properties. How does our visual system recover so much content from the dynamic physical world? What is the role of experience in interpreting a novel dynamical scene?

Further, there is evidence that babies form a visual understanding of basic physical concepts, as a component of common sense knowledge, at a very young age; they learn properties of objects from their motions [1]. As young as 2.5 to 5.5 months old, infants learn basic physics even before they acquire advanced high-level knowledge like semantic categories of objects [5, 1]. Both infants and adults also use their physics knowledge to discover latent labels of object properties and to predict the physical behavior of objects [2]. These facts suggest the importance for a visual system of understanding physics, and motivate our goal of building a machine with such visual competency.

Recent behavioral and computational studies of human physical scene understanding support an account in which people's judgments are best explained as probabilistic simulations of a realistic, but mental, physics engine [2, 15]. Specifically, these studies suggest that the brain carries detailed but noisy knowledge of the physical

attributes of objects and the laws of physical interaction between objects (i.e., Newtonian mechanics). To understand a physical scene, and more crucially, to predict the future dynamical evolution of a scene, the brain relies on simulations from this mental physics engine.

Even though the probabilistic simulation account is appealing, there remain practical and conceptual gaps. First, as a practical matter, the probabilistic simulation approach has been shown to work only with synthetically generated stimuli in 2D or 3D block worlds, and the joint inference of mass and coefficient of friction is not handled [2]. Second, as a conceptual matter, previous research rarely clarifies how a mental physics engine could take advantage of the agent's previous experience [18]. Humans have lifelong experience with dynamical scenes, and a fuller account of human physical scene understanding should address it.

We aim to build on the idea that humans utilize a realistic physics engine as part of a generative model to interpret real-world physical scenes. Given a video as observation, physical scene understanding in the model corresponds to inverting the generative model by probabilistic inference to recover the underlying physical object properties in the scene. Our formulation combines deep learning, which serves as a powerful low-level visual recognition system, with a physics simulator to estimate physical properties directly from unlabeled videos. We study two possible forms of a physics simulator: the first is a symbolic physics interpreter encoded as layers in a deep network; the second is a mature physics engine. Compared to recent studies in vision and robotics on predicting physical interactions for 3D reasoning [10, 23] and tracking [16], our goal is to infer physical object properties directly, and we incorporate a generative physics simulator with a powerful discriminative recognition model. This distinguishes our framework from previous methods introduced in the computer vision and robotics community for predicting physical interactions or properties of objects for various purposes [14, 20, 10, 23, 19, 3, 4, 8, 24].

We also construct a video dataset for evaluating machine and human performance on real-world data. We collected a dataset of 101 objects made of different materials and with a variety of masses and volumes. We started by collecting videos of these

objects from multiple viewpoints in four scenarios: objects slide down an inclined surface and possibly collide with another object; objects fall onto surfaces made of different materials; objects splash in water; and objects hang on a spring. These seemingly straightforward setups require understanding multiple physical properties, e.g., material, mass, volume, density, coefficient of friction, and coefficient of restitution, as discussed later. We call this dataset Physics 101, highlighting that we are learning elementary physics, while also indicating the current object count. Our dataset contains not only over 12,000 RGB videos, but also more than 4,000 depth videos and audio recordings, which could benefit our future study of learning from multimodal data.

Based on the estimates derived from visual input with a physics simulator, a natural extension is to generate or synthesize training data for automatic learning systems by bootstrapping from the videos already collected and labeling them with the model's estimates. This is a self-supervised learning algorithm for inferring generic physical properties, and relates to the wake/sleep phases in Helmholtz machines [9], and to the cognitive development of infants. Extensive studies suggest that infants either are born with or quickly learn physical knowledge about objects when they are very young, even before they acquire more advanced high-level knowledge like semantic categories of objects [5, 1].

Young babies are sensitive to the physics of objects mainly through the motion of foreground objects against the background [1]; in other words, they learn by watching videos of moving objects. But later in life, and clearly in adulthood, we can perceive physical attributes in static scenes without any motion. Here, building upon the idea of Helmholtz machines [9], our approach suggests one potential computational path to the development of the ability to perceive physical content in static scenes. Following recent work [22], we train a recognition model (i.e., sleep cycle) in the form of a deep convolutional network, where the training data are generated in a self-supervised manner by the generative model itself (i.e., wake cycle: real-world videos observed by our model and the resulting physical inferences). Interestingly, this computational solution asserts that the infant starts with a relatively reliable mental physics engine, or acquires one soon after birth.

Our research has various generalizations and applications. With physical object properties, we may build intelligent systems for high-level scene understanding, including the study of physics-related concepts like object stability in the scene, and we may incorporate agents interacting with the physical world for particular goals. Our study is inspired by findings in developmental psychology, but can also lead to interesting and fundamental research questions there, for instance, whether there exist connections between the learning processes of infants and machines on physical concepts.


Chapter 2  Modeling the Physical World

Highly involved physical processes underlie daily events in our physical world, even simple scenarios like objects sliding down an inclined surface. As shown in Figure 2-1a, we can divide all involved physical properties into two groups: the first is the intrinsic physical properties of objects, like volume, material, and mass, many of which we cannot directly measure from the visual input; the second is the descriptive physical properties which characterize the scenario in the video, including but not limited to the velocities of objects, the distances objects travel, or whether objects float when thrown into water. The second group of properties is observable and is determined by the first group, while both of them determine the content of the video. Our goal is to build an architecture that can automatically discover those observable descriptive physical properties from unlabeled videos, and use them as supervision to further learn and infer unobservable latent physical properties. Our generative model can then apply learned knowledge of physical object properties to other tasks, like predicting outcomes in the future.

The computer vision community has made much progress through its datasets, and there are datasets of objects, attributes, materials, and scene categories. Here, we introduce a new type of dataset, Physics 101, capturing physical interactions of objects. The dataset consists of four different scenarios, for each of which plenty of intriguing questions may be asked. For example, in the ramp scenario, will the object on the ramp move, and if so and the two objects collide, which of them will move next and how far?

Figure 2-1: Abstraction of the physical world, and a snapshot of our dataset. (a) Abstraction of physical properties and how they determine the content of a video: intrinsic physical object properties (material, volume, mass, coefficient of friction, coefficient of restitution) determine descriptive physical properties (acceleration, velocity, bounce height, extended distance), and both determine the videos. (b) Our scenario and a snapshot of our dataset, Physics 101, of various objects at different stages. Our data are taken by four sensors (3 RGB and 1 depth).

2.1  Scenarios

We seek to learn physical properties of objects by observing videos. To this end, we build a dataset by recording videos of moving objects. We pick an introductory setup with four different scenarios, illustrated in Figures 2-1b and 2-2. We introduce each scenario in detail below.

Ramp
We put an object on an inclined surface, and the object may either slide down or remain static, due to gravity and friction. This seemingly straightforward scenario already involves understanding many physical object properties, including material, coefficient of friction, mass, and velocity. Figure 2-2a analyzes the physics behind our setup. At first, there are three external forces on the object: a gravitational force G, a normal force N from the surface, and a friction force R. When the friction force R is strong enough, the object does not move. Otherwise, the object starts to slide.

Figure 2-2: Illustrations of the scenarios in our Physics 101 dataset. (a) The ramp scenario (initial setup, before collision, at collision, after collision, final result). Several physical properties determine whether object A will move, whether it will reach object B, and how far each object will move; here, N, R, and G indicate a normal force, a friction force, and a gravitational force, respectively. (b) The spring scenario (initial setup, after extension). (c) The liquid scenario (a floating object, a sunk object). (d) The fall scenario (initial setup, at collision, bounce).

After the object reaches the ground, these forces still exist, but now the object slows down due to the friction force R. If object A slides all the way to B, then A hits B and both of them move. How far A and B move depends on their friction coefficients, masses, and the velocity of A at the moment of collision. In this scenario, the observable descriptive physical properties are the velocities of the objects and the distances both objects travel. The latent properties directly involved are the coefficient of friction and the mass.
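To make the force balance concrete, the short sketch below (our illustration, not part of the thesis) applies the same analysis numerically; the 20° slope and friction coefficient of 0.3 are assumed example values.

```python
import math

def ramp_acceleration(theta_deg: float, mu: float, g: float = 9.81) -> float:
    """Acceleration of a block on an incline of angle theta with friction
    coefficient mu; returns 0 if friction is strong enough to hold the block."""
    theta = math.radians(theta_deg)
    if math.tan(theta) <= mu:            # gravity component <= maximum friction
        return 0.0                       # the object stays put
    return g * (math.sin(theta) - mu * math.cos(theta))

# Example with assumed values: a 20-degree ramp and mu = 0.3.
print(ramp_acceleration(20.0, 0.3))      # ~0.59 m/s^2, the block slides
print(ramp_acceleration(10.0, 0.3))      # 0.0, the block does not move
```

Note that the mass cancels out of the acceleration; it matters only once the collision with object B is considered.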

Spring
We hang objects on a spring, and gravity on the object stretches the spring, as shown in Figure 2-2b. Here the observable descriptive physical property is the length by which the spring is stretched, and the latent properties are the mass of the object and the elasticity of the spring.

Fall
We drop objects in the air, and they fall freely onto various surfaces; Figure 2-2d illustrates this scenario. Here the observable descriptive physical properties are the bounce heights of the object, and the latent properties are the coefficients of restitution of the object and the surface.

Figure 2-3: Physics 101: this is the set of objects we used in our experiments. We vary object material, color, shape, and size, together with external conditions such as the slope of a surface or the stiffness of a string. Videos recording the motions of these objects interacting with target objects will be used to train our algorithm.

Liquid
As shown in Figure 2-2c, we drop objects into some liquid, and they may float or sink at various speeds. In this scenario, the observable descriptive physical property is the velocity of the sinking object (0 if it floats), and the latent properties are the densities of the object and the liquid.

2.2  The Physics 101 Dataset

The outcomes of various physical events depend on multiple factors: objects' materials (density and friction coefficient), sizes and shapes (volume), the slopes of ramps (gravity), the elasticities of springs, etc. We collect our dataset while varying all these conditions. Figure 2-3 shows the entire collection of our 101 objects; the following gives more details about our variations:

Material
Our 101 objects are made of 15 different materials: cardboard, dough, foam, hollow rubber, hollow wood, metal coin, metal pole, plastic block, plastic doll, plastic ring, plastic toy, porcelain, rubber, wooden block, and wooden pole.

Appearance
For each material, we have 4 to 12 objects of different sizes, shapes, and colors.

Slope (ramp)
We also vary the angle α between the inclined surface and the ground (to vary the gravitational component along the ramp). We set α = 10° and 20° for each object.

Target (ramp)
We have two different target objects: a cardboard box and a foam box. They are made of different materials, and thus have different friction coefficients and densities.

Spring
We use two springs with different stiffness.

Surface (fall)
We drop objects onto five different surfaces: foam, glass, metal, a wooden table, and a woolen rug. These materials have different coefficients of restitution.

We also measure the physical properties of these objects. We record the mass and volume of each object, which together determine density. Please refer to the supplementary material for the statistics of all these measured properties.

For each setup, we record 3 to 10 trials. We measure multiple times because some external factors, e.g., orientations of objects and rough planes, may lead to different outcomes. Having more than one trial per condition increases the diversity of our dataset by making it cover more possible outcomes. Finally, we record each trial from three different viewpoints: one side view, one top-down view, and one upper-top view. For the first two views, we take data with DSLR cameras, and for the upper-top view, we use a Kinect V2 to record both RGB and depth maps. After removing trials with significant noise, we have 4,352 trials in total. Since each trial is captured as three RGB videos and one depth video, there are 17,408 video clips altogether. These video clips constitute the Physics 101 dataset.


Chapter 3  Physical Object Model: Learning with a Symbolic Interpreter

We aim to discover physical object properties under a unified system with minimal supervision, rather than training a separate classifier or regressor for each label (such as material and volume) in a fully supervised manner. With this philosophy, we develop two physical object models: one uses deep learning with a symbolic physics interpreter to recognize physical properties, and the other incorporates a mature physics engine and predicts physical properties via an analysis-by-synthesis approach. Both methods have built-in knowledge of physics, and work in an unsupervised setting. With these generative models, we are able not only to discover physical properties (e.g., material, volume) simply by observing motions of objects in unlabeled videos, but also to predict physical interactions (e.g., how far an object will move, if it moves at all) based on the inferred properties.

In this chapter we describe our first model, shown in Figure 3-1. Our method is based on a convolutional neural network (CNN) [11], which consists of three components. The bottom component is a visual property discoverer, which aims to discover physical properties like material or volume that can at least partially be observed from visual input; the middle component is a physics interpreter, which explicitly encodes physical laws into the network structure and models latent physical properties like density and mass; the top component is a physical world simulator, which characterizes descriptive physical properties like the distances objects travel, all of which we can directly observe from videos.

Figure 3-1: Our first model exploits advances in machine learning (convolutional neural networks): we supervise all levels through a physics interpreter, which provides physical constraints on the values each layer can take. During training and testing, our model has no labels of physical properties, in contrast to standard approaches. (The architecture stacks, from bottom to top: videos, a visual property discoverer (ConvNets), visual intrinsic physical properties (material, volume), a physics interpreter applying physical laws to produce latent intrinsic properties (density, coefficient of friction, coefficient of restitution, mass), and a physical world simulator producing descriptive properties such as acceleration on the ramp, velocity, bounce height, sinking acceleration, and extended distance.)

Our network corresponds to the physical world model introduced in Chapter 2. We emphasize that our model learns object properties from completely unlabeled data. We do not provide any labels for physical properties like material, velocity, or volume; instead, our model automatically discovers observations from videos and uses them as supervision for the top physical world simulator, which in turn advises what the physics interpreter should discover.

3.1  Visual Property Discoverer

The bottom meta-layer of our architecture in Figure 3-1 is designed to discover and predict low-level properties of objects, including material and volume, which can at least partially be perceived from the visual input. These properties are the basic ingredients for predicting any derived physical properties at upper layers, e.g., density and mass.

In order to interpret any physical interaction of objects, we first need to locate objects inside videos. We use a KLT point tracker [17] to track moving objects. We also compute a general background model for each scenario to locate foreground objects. Image patches of objects are then supplied to our visual property discoverer.

Material and volume
Material and volume are properties that can be estimated directly from image patches. Hence, we apply LeNet [12] on top of the image patches extracted by the tracker. Once again, rather than directly supervising each LeNet with labels, we supervise them through automatically discovered observations provided to our physical world simulator. To be precise, we do not have any individual loss layer for the LeNet components. Note that inferring object volumes from static images is an ambiguous problem; however, this problem is alleviated by our data from different viewpoints and by having both RGB and depth maps.
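As a rough illustration of this tracking step, the sketch below uses OpenCV's pyramidal Lucas-Kanade tracker as a stand-in for the KLT point tracker; the video filename and all parameter values are placeholders, not the thesis's actual pipeline.

```python
# Minimal sketch: KLT-style point tracking to get object locations (for patch
# cropping) and per-frame velocities (for the physical observations).
import cv2
import numpy as np

cap = cv2.VideoCapture("ramp_trial.avi")        # hypothetical video file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)

velocities = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    good_new, good_old = nxt[status == 1], pts[status == 1]
    # Mean displacement of tracked points approximates the object's velocity
    # (pixels per frame); the mean location gives a patch center to crop.
    velocities.append(np.mean(good_new - good_old, axis=0))
    prev_gray, pts = gray, good_new.reshape(-1, 1, 2)
```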

3.2  Physics Interpreter

The second meta-layer of our model is designed to encode physical laws. For instance, if we assume an object is homogeneous, then its density is determined by its material, and the mass of an object is the product of its density and volume. Based on material and volume, we derive a number of physical properties in this physics interpreter, which are later used to connect to real-world observations. The following describes how we represent each physical property as a layer, as depicted in Figure 3-1:

Material
An N_m-dimensional vector, where N_m is the number of different materials. The value of each dimension represents the confidence that the object is made of that material. This is an output of our visual property discoverer.

Volume
A scalar representing the predicted volume of the object. This is an output of our visual property discoverer.

Coefficient of friction and density
Each is a scalar representing the predicted physical property based on the output of the material layer. Each output is the inner product of N_m learned parameters and the responses of the material layer.

Coefficient of restitution
An N_m-dimensional vector representing how much of the kinetic energy remains after a collision between the input object and objects of various materials. The representation is a vector, not a scalar, because the coefficient of restitution is determined by the materials of both objects involved in the collision.

Mass
A scalar representing the predicted mass based on the outputs of the density layer and the volume layer. This layer is the product of the density and volume layers.
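The following PyTorch-style sketch shows one way the layers above could be wired together. The thesis implementation is in Torch7, so the module structure, sizes, and names here are our assumptions rather than the original code.

```python
# A minimal sketch of the physics interpreter as differentiable layers.
import torch
import torch.nn as nn

class PhysicsInterpreter(nn.Module):
    def __init__(self, num_materials: int):
        super().__init__()
        # One learned scalar per material for density and friction; a learned
        # vector per material for the coefficient of restitution.
        self.density_of_material = nn.Linear(num_materials, 1, bias=False)
        self.friction_of_material = nn.Linear(num_materials, 1, bias=False)
        self.restitution_of_material = nn.Linear(num_materials, num_materials,
                                                 bias=False)

    def forward(self, material_conf, volume):
        # material_conf: (batch, N_m) confidences from the visual discoverer
        # volume:        (batch, 1)   scalar volume prediction
        density = self.density_of_material(material_conf)     # inner product
        friction = self.friction_of_material(material_conf)   # inner product
        restitution = self.restitution_of_material(material_conf)
        mass = density * volume                                # mass = density x volume
        return {"density": density, "friction": friction,
                "restitution": restitution, "mass": mass}

# Example: a batch of two objects with 15 materials.
interpreter = PhysicsInterpreter(num_materials=15)
outputs = interpreter(torch.rand(2, 15), torch.rand(2, 1))
```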

3.3  Physical World Simulator

Our physical world simulator connects the inferred physical properties to real-world observations. We have different observations for different scenarios: the velocities of objects and the distances objects travel for the ramp scenario, the length by which the spring is stretched for the spring scenario, the bounce height for the fall scenario, and the velocity at which the object sinks for the liquid scenario. All observations can be derived from the output of our tracker. To connect those observations to the physical properties our model infers, we employ the following physical laws.

Newton's law
F = mg sin θ - μmg cos θ = ma, or (sin θ - μ cos θ)g = a, where θ is the angle between the inclined surface and the ground, μ is the coefficient of friction, and a is the acceleration of the object (observation). This is used for the ramp scenario.

Conservation of momentum and energy
C_R = (v_b - v_a)/(u_a - u_b), where v_i is the velocity of object i after the collision, and u_i is its velocity before the collision. All u_i and v_i are observations. This is also used for the ramp scenario.

Hooke's law
F = kX, where X is the distance by which the spring is extended (our observation), k is the stiffness of the spring, and F = G = mg is the gravitational force on the object. This is used for the spring scenario.

Bounce
C_R = sqrt(h/H), where C_R is the coefficient of restitution, h is the bounce height (observation), and H is the drop height. This can be viewed as another instance of conservation of energy and momentum, and is used for the fall scenario.

Buoyancy
dVg - d_w Vg = ma = dVa, or (d - d_w)g = da, where d is the density of the object, d_w is the density of water (a constant), and a is the acceleration of the object in water (observation). Note that for d < d_w, a = 0. This is used for the liquid scenario.

We use the MSE between our model's estimate and the target value supplied by the physical world simulator as our loss during training.
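A minimal sketch of how these laws act as supervision: each function maps inferred latent properties to a predicted observation, which is then compared with the tracker's measurement through the MSE loss. The constants, units, and function names below are ours, not the thesis's.

```python
# Physical laws as forward functions from latent properties to observations.
import math

G = 9.81             # gravitational acceleration
WATER_DENSITY = 1.0  # density of water in the chosen units (assumed)

def ramp_acceleration(mu, theta):            # Newton's law (ramp scenario)
    return (math.sin(theta) - mu * math.cos(theta)) * G

def spring_extension(mass, stiffness):       # Hooke's law (spring scenario)
    return mass * G / stiffness              # X = F / k with F = mg

def bounce_height(restitution, drop_height): # bounce (fall scenario)
    return (restitution ** 2) * drop_height  # from C_R = sqrt(h / H)

def sinking_acceleration(density):           # buoyancy (liquid scenario)
    if density <= WATER_DENSITY:
        return 0.0                            # the object floats
    return (density - WATER_DENSITY) * G / density

def mse(predicted, observed):                 # training loss
    return (predicted - observed) ** 2
```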

3.4  Experiments

In this section, we present experiments with our models in various settings. We start with extensive verification of our models on learning physical properties. Later, we investigate the generalization ability of our model on other tasks like detecting objects with unusual properties, predicting outcomes given partial information, and transferring knowledge across different scenarios. We use Torch7 [6] for all experiments. For learning physical properties from Physics 101, we study our algorithm in the following settings:

- Split by frame: for each trial of each object, we use 95% of the patches we get from tracking as training data and the other 5% as test data.

- Split by trial: for each object, we use all patches in 95% of the trials as training data and the patches in the other 5% of the trials as test data.

- Split by object: we randomly choose 95% of the objects, use their patches as training data, and use the others as test data.

Among these three settings, split by frame is the easiest, as for each patch in the test data the algorithm may find a very similar patch in the training data. Split by

object is the most difficult setting, as it requires the model to generalize to objects that it has never seen before. We consider training our model in different ways:

- Oracle training: we train our model with images of objects and their associated ground-truth labels. We apply oracle training to those properties for which we have ground-truth labels (material, mass, density, and volume).

- Standalone training: we train our model on data from one scenario. Automatically extracted observations serve as supervision.

- Joint training: we jointly train the entire network on all training data without any labels of physical properties. Our only supervision is the physical laws encoded in the top physical world simulator. Data from different scenarios supervise different layers in the network.

Oracle training is designed to test the ability of each component, and can be viewed as an upper bound on the performance the model may achieve. Our focus is on standalone and joint training, where our model learns from unlabeled videos directly.

We are also interested in understanding how well our model can infer physical properties purely from depth maps. Therefore, besides using RGB data, we conduct experiments where the training and test data are depth maps only.

3.4.1  Learning Physical Properties

Material perception: We start with the task of material classification. Table 3.1 shows the accuracy of the oracle models on material classification. We observe that they achieve nearly perfect results in the easiest case, and are still significantly better than chance in the most difficult split-by-object setting. Both depth maps and RGB maps give good performance on this task with oracle training.

Methods           Frame   Trial   Object
Depth (Oracle)    92.6    62.5    35.7
RGB (Oracle)      99.9    77.4    52.2
RGB (ramp)        26.9    24.7    19.7
RGB (spring)      29.9    22.4    14.3
RGB (fall)        29.4    25.0    17.0
RGB (liquid)      22.2    15.4    12.6
RGB (joint)       35.5    28.7    25.7
Depth (joint)     38.3    26.9    22.4
Uniform            6.67    6.67    6.67

Table 3.1: Accuracies (%, for oracle) or clustering purities (%, for joint training) on material estimation. In the joint training case, as there is no supervision on the material layer, the network need not specifically map the responses in that layer to material labels, and we do not expect the numbers to be comparable with the oracle case. Our analysis merely shows that even in this case the network implicitly grasps some knowledge of object materials.

In the standalone and joint training cases, given that we have no labels on materials, it is not possible for the model to classify materials; instead, we expect it to cluster objects by their materials. To measure this, we perform K-means on the responses of the material layer on test data, and use purity, a common measure for clustering, to assess whether our model indeed discovers clusters of materials automatically. As shown in Table 3.1, the clustering results indicate that the system learns the materials of objects to a certain extent.

Physical parameter estimation: We then test our systems, trained with or without oracles, on the task of physical property estimation. We use the Pearson product-moment correlation coefficient as our measure. Table 3.2 shows the results on estimating mass, density, and volume. Notice that here we evaluate the outputs on a log scale to avoid unbalanced emphasis on objects with large volumes or masses. We observe that with an oracle our model can learn all physical parameters well. For standalone and joint learning, our model is also consistently better than a nontrivial baseline, which selects the optimal uniform estimate minimizing the mean squared error.

                  Mass                      Density                   Volume
Methods           Frame   Trial   Object    Frame   Trial   Object    Frame   Trial   Object
RGB (Oracle)      0.79    0.72    0.67      0.83    0.74    0.65      0.77    0.67    0.61
Depth (Oracle)    0.79    0.72    0.67      0.83    0.74    0.65      0.77    0.67    0.61
RGB (spring)      0.40    0.35    0.20      N/A     N/A     N/A       N/A     N/A     N/A
RGB (liquid)      N/A     N/A     N/A       0.33    0.27    0.30      N/A     N/A     N/A
RGB (joint)       0.58    0.42    0.38      0.38    0.39    0.39      0.40    0.37    0.30
Depth (joint)     0.43    0.32    0.25      0.49    0.37    0.17      0.30    0.20    0.22
Uniform           0       0       0         0       0       0         0       0       0

Table 3.2: Correlation coefficients between our estimates and ground truth for mass, density, and volume.

Figure 3-2: Charts of the estimates for the rings (estimates vs. ground truth for mass, density, and volume of rings a-e). The physical properties, especially density, of the first ring are different from those of the other rings. The difference is hard to perceive from visual appearance alone; however, by observing videos with object interactions, our algorithm is able to learn the properties and find the outlier. All figures are on a log-normalized scale.

3.4.2  Detecting Objects with Unusual Properties

Sometimes objects with similar appearances have distinct physical properties. In this section, we test whether our system is able to find these expectation-violating cases. In Physics 101, among the five plastic rings, the bottom part of the smallest ring is made of a different material with a larger density, which makes its mass greater than those of the other four, although its volume is smaller. The material of the smallest ring also has a lower friction coefficient, indicating that the velocity of the smallest ring at collision would be higher than those of the others.

In Figure 3-2, we show the estimates of our RGB joint model for the properties of all five rings, as well as their appearances. As shown, it is hard to perceive the difference between the physical properties of the first ring and those of the others purely from visual appearance. By observing videos in which the rings slide down and hit other objects, our system can learn the physical parameters and identify the outlier.

                  Foam                      Cardboard
Material          H       M       U         H       M       U
cardboard         28.8    40.7    97.0      15.0    77.2    84.0
dough             27.4    25.2    84.4      150.9   105.1   113.4
hollow wood       35.7    19.4    108.9     81.0    35.0    21.4
metal coin        13.4    32.2    149.8     31.9    33.3    75.8
metal pole        272.9   257.6   280.0     91.4    188.7   184.0
plastic block     29.8    82.1    97.6      46.9    57.2    35.0
plastic doll      49.4    23.6    44.0      128.8   41.8    93.9
plastic toy       30.1    41.9    121.2     33.3    9.5     70.6
porcelain         138.5   127.0   110.9     196.0   216.6   314.8
wooden block      45.9    32.8    36.2      47.3    37.5    14.2
wooden pole       78.9    88.0    138.9     58.7    89.8    74.3
Mean              68.2    70.1    115.4     80.1    81.1    98.3

Table 3.3: Mean squared errors in pixels of human predictions (H), model outputs (M), and a uniform estimate minimizing the mean squared error (U).

Figure 3-3: Heat maps of user predictions, model outputs (in orange), and ground truths (in white). Objects from top to bottom, left to right: dough, metal coin, metal pole, plastic block, plastic doll, and porcelain.

3.4.3  Predicting Outcomes

We may apply our model to a variety of outcome prediction tasks for different scenarios. We consider three of them: how far an object will move after being hit by another object; how high an object will bounce after being dropped from a certain height; and whether an object will float in water. With estimated physical object properties, our model can answer these questions using physical laws.

Transferring Knowledge Across Multiple Scenarios
As some physical knowledge is shared across multiple scenarios, it is natural to evaluate how knowledge learned in one scenario may be applied to a novel one. Here we consider the case where the model is trained on all but the fall scenario. We then apply the model to the fall scenario for predicting how high an object bounces. Our intuition is that the coefficients of restitution learned from the ramp scenario can help to predict this to some extent.

Tasks            Methods          Frame   Trial   Object
Collision Dist   RGB (joint)      0.65    0.42    0.33
Collision Dist   Uniform          0       0       0
Bounce Height    RGB (joint)      0.35    0.31    0.23
Bounce Height    RGB (transfer)   0.22    0.21    0.11
Bounce Height    Uniform          0       0       0
Float            RGB (joint)      0.94    0.87    0.84
Float            Uniform          0.70    0.70    0.70

Table 3.4: Correlation coefficients on the tasks of predicting the moving distance and the bounce height, and accuracies on predicting whether an object floats.

Results
Table 3.4 shows outcome prediction results. We can see that our method works well, and can also transfer learned knowledge across multiple scenarios.

Behavior Experiments
We would like to see how well our model does compared to humans. To do this, we conducted experiments on Amazon Mechanical Turk on predicting the moving distance of an object after a collision. Specifically, among all objects that slide down, we select one object of each material and show AMT workers the videos of the object, but only up to the moment of collision. We then ask workers to label where they believe the target object (either cardboard or foam) will be after the collision, i.e., how far the target will move. Before testing, each user is shown four full videos of other objects made of the same material, which contain complete collisions, so that users can infer the physical properties associated with the material and the target object. We tested 30 users per case.

Table 3.3 shows the mean squared errors in pixels of human predictions (H), model predictions (M), and a uniform estimate minimizing the mean squared error (U). We can see that the performance of our model is close to that of humans on this task. Figure 3-3 shows the heat maps of user predictions, model outputs (orange), and ground truths (white).


Chapter 4  Physical Object Model: Incorporating a Physics Engine

4.1  The Galileo Model

Here we describe our second model. Compared with the first one, our second model (shown in Figure 4-1) incorporates a physics engine at its core, and its gist can be summarized as probabilistically inverting the physics engine to recover the unobserved physical properties of objects. For this model, we focus on the ramp scenario, and in honor of the famous physicist, we name our model Galileo.

The first component of Galileo is the physical object representation, where each object is a rigid body represented not only by its 3D geometric shape (or volume) and its position in space, but also by its mass and its friction. All of these object attributes are treated as latent variables in the model, and are approximated or estimated on the basis of the visual input. Specifically, we collectively refer to the unobserved latent variables of an object as its physical representation T. For each object i, T_i consists of its mass m_i, friction coefficient k_i, 3D shape V_i, and position offset p_i w.r.t. an origin in 3D space.

We place uniform priors over the mass and the friction coefficient of each object: m_i ~ Uniform(0.001, 1) and k_i ~ Uniform(0, 1), respectively. For the 3D shape V_i, we have four variables: a shape type t_i, and the scaling factors for the three dimensions, x_i, y_i, z_i.

Figure 4-1: Our second model formalizes a hypothesis space of physical object representations, where each object is defined by its mass (m), friction coefficient (k), 3D shape (S), and a positional offset (x) w.r.t. an origin. To model videos, we draw two physical objects from that hypothesis space into a 3D physics engine. The simulated velocities from the physics engine are compared, through a likelihood function, to the velocities observed by a tracking algorithm.

We simplify the possible shape space in our model by constraining each shape type t_i to be one of three types with equal probability: a box, a cylinder, and a torus. Note that applying scaling differently to each dimension of these three basic shapes results in a large space of shapes. (For the box, x_i, y_i, and z_i can all take different values; for the torus, we constrain the scaling factors such that x_i = z_i; and for the cylinder, we constrain them such that y_i = z_i.) The scaling factors are chosen to be uniform over a range of values that captures the extent of the different shapes in the dataset.

Remember that our scenario consists of an object on the ramp and another on the ground. The position offset p_i of each object is uniform over the set {-5, ..., 5}. This indicates that the position of the object on the ramp can be perturbed along the ramp (i.e., in 2D) by at most 5 units upwards or downwards from its starting position, which is 30 units up the ramp from the ground.
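The following sketch (ours, not the thesis's code) draws a physical object representation T_i from the priors described above; the shape-scale range and the integer offset grid are assumptions used only for illustration.

```python
# Sampling a physical object representation from the priors.
import random
from dataclasses import dataclass

@dataclass
class PhysicalObject:
    mass: float
    friction: float
    shape_type: str        # 'box', 'cylinder', or 'torus'
    scale: tuple           # (x, y, z) scaling factors
    offset: int            # position offset along the ramp

def sample_object(scale_range=(0.5, 3.0)) -> PhysicalObject:
    shape = random.choice(["box", "cylinder", "torus"])
    x = random.uniform(*scale_range)
    y = random.uniform(*scale_range)
    z = random.uniform(*scale_range)
    if shape == "torus":
        z = x              # constraint: x = z for the torus
    elif shape == "cylinder":
        z = y              # constraint: y = z for the cylinder
    return PhysicalObject(
        mass=random.uniform(0.001, 1.0),
        friction=random.uniform(0.0, 1.0),
        shape_type=shape,
        scale=(x, y, z),
        offset=random.randint(-5, 5),
    )
```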


The next component of our generative model is a fully-fledged realistic physics engine, which we denote ρ. Specifically, we use the Bullet physics engine [7], following earlier related work. The physics engine takes as input a specification of each of the physical objects in the scene within the basic ramp setting, and simulates the scene forward in time, generating simulated velocity vectors for each object, v_s1 and v_s2 respectively, among other physical properties such as positions and a rendered image of each simulation step. In light of initial qualitative analysis, we use velocity vectors as our feature representation when evaluating the hypotheses generated by the model against data.

We employ a standard tracking algorithm (the KLT point tracker [17]) to "lift" the visual observations to the velocity space. That is, for each video, we first run the tracking algorithm, and we obtain velocities simply from the center locations of each tracked moving object between frames. This gives us the velocity vectors for the object on the ramp and the object on the ground, v_o1 and v_o2, respectively. Note that we could replace the KLT tracker with state-of-the-art tracking algorithms for more complicated scenarios.

The third part of Galileo is the likelihood function. We evaluate the observed real-world videos with respect to the model's hypotheses using the velocity vectors of the objects in the scene. Given a pair of observed velocity vectors, v_o1 and v_o2, the recovery of the physical object representations T_1 and T_2 for the two objects via physics-based simulation can be formalized as

P(T_1, T_2 | v_o1, v_o2, ρ(·)) ∝ P(v_o1, v_o2 | v_s1, v_s2) · P(v_s1, v_s2 | T_1, T_2, ρ(·)) · P(T_1, T_2),    (4.1)

where we define the likelihood function as P(v_o1, v_o2 | v_s1, v_s2) = N(v_o | v_s, Σ), where v_o is the concatenation of v_o1 and v_o2, and v_s is the concatenation of v_s1 and v_s2. The dimensionalities of v_o and v_s are kept the same for a video by adjusting the number of simulation steps used to obtain v_s according to the length of the video; from video to video, the length of these vectors may vary. In all of our simulations, we fix Σ to 0.05, which is the only free parameter in our model. Experiments show that the value of Σ does not change our results significantly.

4.1.1  Tracking as Recognition

The posterior distribution in Equation 4.1 is intractable. In order to alleviate the burden of posterior inference, we use the output of our recognition model to predict and fix some of the latent variables in the model. Specifically, we determine V_i, i.e., {t_i, x_i, y_i, z_i}, using the output of the tracking algorithm, and fix these variables without further sampling them. Furthermore, we also fix the values of the p_i on the basis of the output of the tracking algorithm.

4.1.2  Inference

Once we initialize and fix the latent variables using the tracking algorithm as our recognition model, we perform single-site Metropolis-Hastings updates on the remaining four latent variables, m_1, m_2, k_1, and k_2. At each MCMC sweep, we propose a new value for one of these random variables, where the perturbation proposal distribution is Uniform(-0.05, 0.05). In order to help with mixing, we also use a broader proposal distribution, Uniform(-0.5, 0.5), every 20 MCMC sweeps.
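A minimal sketch of this single-site Metropolis-Hastings loop follows, with the Bullet simulation and the tracked velocities stubbed out as placeholders (`simulate`, `v_obs`); bound checks against the uniform priors are omitted for brevity, and all names are our assumptions.

```python
# Single-site MH over mass and friction, with a Gaussian velocity likelihood.
import math
import random

SIGMA = 0.05  # fixed observation noise (the model's only free parameter)

def log_likelihood(state, v_obs, simulate):
    v_sim = simulate(state)                       # simulated velocity vectors
    sq = sum((a - b) ** 2 for a, b in zip(v_sim, v_obs))
    return -sq / (2.0 * SIGMA ** 2)

def mh_sweeps(state, v_obs, simulate, n_sweeps=75):
    names = ["m1", "m2", "k1", "k2"]
    log_p = log_likelihood(state, v_obs, simulate)
    for sweep in range(n_sweeps):
        width = 0.5 if sweep % 20 == 19 else 0.05  # broader proposal every 20 sweeps
        name = random.choice(names)                # single-site update
        proposal = dict(state)
        proposal[name] = state[name] + random.uniform(-width, width)
        new_log_p = log_likelihood(proposal, v_obs, simulate)
        if random.random() < math.exp(min(0.0, new_log_p - log_p)):  # accept/reject
            state, log_p = proposal, new_log_p
    return state, log_p
```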

4.2  Simulations

For each video, as mentioned earlier, we use the tracking algorithm to initialize and fix the shapes of the objects, S_1 and S_2, and the position offsets, p_1 and p_2. We also obtain the velocity vector of each object using the tracking algorithm. We determine the length of the physics engine simulation by the length of the observed video; that is, the simulation runs until it outputs a velocity vector for each object that is as long as the input velocity vector from the tracking algorithm.

We use 150 videos from our Physics 101 dataset, uniformly distributed across different object categories. We run 16 MCMC chains for a single video, each of which is 75 MCMC sweeps long. We report the results with the highest log-likelihood score across the 16 chains (i.e., the MAP estimate).

In Figure 4-2, we illustrate the results for three individual videos.

Figure 4-2: Simulation results. Each row represents one video in the data: (a) the first frame of the video, (b) the last frame of the video, (c) the first frame of the simulated scene generated by Bullet, (d) the last frame of the simulated scene, (e) the estimated object with larger mass, (f) the estimated object with larger friction coefficient.

Every two frames in the top row show the first and the last frame of a video, and the bottom-row images show the corresponding frames from our model's simulation with the MAP estimate.

We quantify different aspects of our model in the following behavioral experiments, where we compare our model against human subjects' judgments. Furthermore, we use the inferences made by our model on these 150 videos to train a recognition model that arrives at physical object perception in static scenes. Importantly, note that our model can generalize across a broad range of tasks beyond the ramp scenario. For example, once we infer the friction coefficient of an object, we can predict by simulation whether it will slide down a ramp with a different slope. We test some of these generalizations in Section 4.4.

4.3  Bootstrapping as Efficient Perception in Static Scenes

Based on the estimates derived from the visual input with a physics engine, we bootstrap from the videos already collected by labeling them with Galileo's estimates. This is a self-supervised learning algorithm for inferring generic physical properties. As discussed in Chapter 1, this formulation is also related to the wake/sleep phases in Helmholtz machines, and to the cognitive development of infants.

Here we focus on two physical properties: mass and friction coefficient. To do this, we first estimate these physical properties using the method described in the earlier sections. Then we train LeNet [13], a widely used deep neural network for small-scale datasets, using image patches cropped from the videos based on the output of the tracker as data, and the estimated physical properties as labels. The trained model can then be used to predict these physical properties of objects based purely on visual cues, even for objects that never appeared in the training set.
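A PyTorch-style sketch of this bootstrapping step follows, training a small LeNet-like regressor on tracker patches with Galileo's estimates as targets; the architecture details and the placeholder tensors are assumptions, not the thesis's Torch7 implementation.

```python
# Self-supervised bootstrapping: a small CNN regresses the physical properties
# that Galileo inferred for each cropped patch.
import torch
import torch.nn as nn

class SmallRegressor(nn.Module):
    """LeNet-style network predicting log-mass and friction from a patch."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 2),             # outputs: [log mass, friction]
        )

    def forward(self, x):                  # x: (batch, 3, 32, 32) patches
        return self.head(self.features(x))

model, loss_fn = SmallRegressor(), nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# `patches` are crops from the tracker and `galileo_labels` are the model's own
# estimates (self-supervision); here they are random placeholders.
patches = torch.rand(8, 3, 32, 32)
galileo_labels = torch.rand(8, 2)
optimizer.zero_grad()
loss = loss_fn(model(patches), galileo_labels)
loss.backward()
optimizer.step()
```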

We choose one

object per material as our test cases, use all data of those objects as test data, and the others as training data. We compare our model with a baseline, which always outputs a uniform estimate calculated by averaging the masses of all objects in the test data, and with an oracle algorithm, which is a LeNet trained using the same training data, but has access to the ground truth masses of training objects as labels. Apparently, the performance of the oracle model can be viewed as an upper bound of our Galileo system. Table 4-3 compares the performance of Galileo, the oracle algorithm, and the baseline. We can observe that Galileo is much better than baseline, although there is still some space for improvement. Because we trained LeNet using static images to predict physical object properties such as friction and mass ratios, we can use it to recognize those attributes in a quick bottom-up pass at the very first frame of the video. To the extent that the trained LeNet is accurate, if we initialize the MCMC chains with these bottom-up predictions, we expect to see an overall boost in our log-likelihood traces. We test by running several chains with and without LeNet-based initializations. Results can be seen in Figure 4-4. Despite the fact that LeNet is not achieving perfect performance by itself, we indeed get a boost in speed and quality in the inference.

40

Methods   MSE     Corr
Oracle    0.042   0.71
Galileo   0.052   0.44
Uniform   0.081   0

Figure 4-3: Mean squared errors of the oracle estimates, our estimates, and uniform estimates of mass on a log-normalized scale, and the correlations between the estimates and the ground truths.

Figure 4-4: The log-likelihood traces of several chains with and without recognition-model (LeNet) based initializations (log-likelihood vs. number of MCMC sweeps, comparing initialization with the recognition model against random initialization).

4.4 Experiments

In this section, we conduct experiments from multiple perspectives to evaluate our model. Specifically, we use the model to predict how far objects will move after a collision; whether an object will remain stable in a different scene; and which of two objects is heavier, based on observations of collisions. For every experiment, we also conduct behavioral experiments on Amazon Mechanical Turk so that we may compare the performance of humans and machines on these tasks.

4.4.1 Outcome Prediction

In the outcome prediction experiment, our goal is to measure and compare how well humans and machines can predict the moving distance of an object when only part of the video is observed. Specifically, for the behavioral experiments on Amazon Mechanical Turk, we first show users four full videos of objects made of a certain material, each containing a complete collision. In this way, users may infer the physical properties associated with that material. We then select a different object made of the same material and show users a video of it, but only up to the moment of collision. We finally ask users to label where they believe the target object (either cardboard or foam) will end up after the collision, i.e., how far the target will move. We tested 30 users per case.

Given a partial video, for Galileo to generate predicted destinations, we first run it to fit the observed part of the video and derive an estimate of the object's friction coefficient. We then estimate its density by averaging the density values derived from other objects of the same material, using the collisions in which they are involved. We further estimate the density (mass) and friction coefficient of the target object by averaging our estimates from other collisions. We now have all the information the model needs to predict the ending point of the target after the collision. Note that the information available to Galileo is exactly the same as that available to humans.
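For intuition about what this prediction step involves, the following sketch replaces the full Bullet simulation with a one-dimensional point-mass approximation: the sliding object accelerates down the ramp, transfers momentum to the target in an (assumed) elastic collision, and the target then decelerates under kinetic friction. The numerical values, ramp geometry, and collision model are illustrative assumptions; the actual system runs the 3D physics engine with the inferred properties.

```python
# Point-mass approximation of the prediction step (illustration only; the
# actual system runs the Bullet simulation). Assumptions: 1-D motion, an
# elastic collision, and arbitrary example values for ramp length and angle.
import math

G = 9.81  # gravitational acceleration, m/s^2


def speed_at_ramp_bottom(length, angle_deg, mu_slider):
    """Speed of the sliding object at the bottom of the ramp (0 if it never moves)."""
    angle = math.radians(angle_deg)
    accel = G * (math.sin(angle) - mu_slider * math.cos(angle))
    return math.sqrt(2 * accel * length) if accel > 0 else 0.0


def target_displacement(m_slider, mu_slider, m_target, mu_target,
                        ramp_length=1.0, angle_deg=20.0):
    """Predicted sliding distance of the target after a 1-D elastic collision."""
    v1 = speed_at_ramp_bottom(ramp_length, angle_deg, mu_slider)
    v2 = 2 * m_slider * v1 / (m_slider + m_target)  # target speed after collision
    return v2 ** 2 / (2 * mu_target * G)            # distance until friction stops it


# Hypothetical usage: the slider's friction comes from fitting the partial
# video; the masses and the target's friction are material-level averages.
print(target_displacement(m_slider=0.3, mu_slider=0.2, m_target=0.05, mu_target=0.4))
```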

We compare three kinds of predictions: human feedback, Galileo's output, and, as a baseline, a uniform estimate calculated by averaging the ground-truth ending points over all test cases. Figure 4-5 shows the Euclidean distance in pixels between each of them and the ground truth. We can see that human predictions are much better than the uniform estimate, but still far from perfect. Galileo performs similarly to humans on average on this task. Figure 4-6 shows, for some test cases, heat maps of user predictions, Galileo's outputs (orange crosses), and the ground truths (white crosses). The error correlation between humans and the model is 0.70. A correlation analysis is not meaningful for the uniform baseline, since it makes the same prediction for every test case.

4.4.2 Mass Prediction

The second experiment is to predict which of two objects is heavier, after observing a video of the two colliding. For this task, we randomly choose 50 objects and test each of them on 50 users. For Galileo, we can directly obtain its guess from its estimates of the masses of the two objects. Figure 4-7 demonstrates that humans and our model achieve about the same accuracy on this task. We also calculate correlations between the different outputs.


Figure 4-5: Mean errors, in pixels, of human predictions, Galileo outputs, and a uniform estimate calculated by averaging ground truth ending points over all test cases. As the error patterns are similar for both target objects (foam and cardboard), the errors here are averaged across target objects for each material.

Figure 4-6: Heat maps of user predictions, Galileo outputs (orange crosses), and ground truths (white crosses).

For the correlation analysis, we use the ratio of the masses of the two objects estimated by Galileo as its predictor, and aggregate human responses for each trial to obtain the proportion of people making each decision. As the relation is highly nonlinear, we calculate Spearman's rank correlation coefficients. From Table 4.1, we notice that human responses, machine outputs, and the ground truths are all positively correlated.
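The correlation analysis itself is straightforward; the sketch below shows how the Spearman coefficients could be computed between the aggregated human judgments, the model's estimated mass ratios, and the ground-truth ratios. The arrays are placeholder values, not the actual experimental data.

```python
# Spearman rank correlations between aggregated human judgments, Galileo's
# mass-ratio predictor, and the ground truth. The arrays are placeholders.
import numpy as np
from scipy.stats import spearmanr

human_prop = np.array([0.9, 0.3, 0.7, 0.1, 0.6])     # fraction judging A heavier, per trial
galileo_ratio = np.array([3.1, 0.4, 1.8, 0.2, 1.2])  # Galileo's estimated mass ratio m_A / m_B
truth_ratio = np.array([2.7, 0.5, 2.0, 0.3, 0.9])    # ground-truth mass ratio

rho_hg, _ = spearmanr(human_prop, galileo_ratio)     # human vs. Galileo
rho_ht, _ = spearmanr(human_prop, truth_ratio)       # human vs. truth
rho_gt, _ = spearmanr(galileo_ratio, truth_ratio)    # Galileo vs. truth
print(rho_hg, rho_ht, rho_gt)
```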

4.4.3 "Will it move" Prediction

Our third experiment is to predict whether a certain object will move in a different scene, after observing one of its collisions. On Amazon Mechanical Turk, we show users a video containing a collision of two objects. In this video, the angle between the inclined surface and the ground is 20 degrees. We then show users the first frame of a 10-degree video of the same object, and ask them to predict whether the object will slide down the surface in this case. We randomly choose 50 objects for the experiment, divide them into lists of 10 objects per user, and have each item tested by 50 users overall.

Figure 4-7: Average accuracy of human predictions and Galileo outputs on the tasks of mass prediction and "will it move" prediction. Error bars indicate standard deviations of human accuracies.

                                   Human vs Galileo   Human vs Truth   Galileo vs Truth
Mass (Spearman's coeff.)           0.51               0.68             0.52
"Will it move" (Pearson's coeff.)  0.56               0.42             0.20

Table 4.1: Correlations between pairs of outputs in the mass prediction experiment (in Spearman's coefficient) and in the "will it move" prediction experiment (in Pearson's coefficient).

For Galileo, it is straightforward to predict the stability of an object in the 10-degree case using estimates from the 20-degree video. Interestingly, both humans and the model are at chance on this task (Figure 4-7), and their responses are reasonably correlated (Table 4.1). Again, we aggregate human responses for each trial to obtain the proportion of people making each decision. Moreover, both the subjects and the model show a bias towards saying "it will move." Future controlled experimentation and simulations will investigate what underlies this correspondence.
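For a rigid block on an incline, this generalization reduces to a simple static-friction test: the block slides when the tangent of the slope angle exceeds the friction coefficient. The sketch below illustrates that check with a hypothetical estimate; the model's actual prediction relies on the friction coefficient inferred from the 20-degree video, whether applied analytically as here or through simulation.

```python
# Static-friction test for the "will it move" generalization: a block on an
# incline slides when tan(angle) exceeds the friction coefficient. The value
# of mu_hat below is a hypothetical estimate, not one from the experiments.
import math


def will_it_move(mu, angle_deg):
    """True if an object with friction coefficient mu slides at this slope angle."""
    return math.tan(math.radians(angle_deg)) > mu


mu_hat = 0.25
print(will_it_move(mu_hat, 20.0))  # True: consistent with the observed 20-degree video
print(will_it_move(mu_hat, 10.0))  # False: predicted to stay put on the 10-degree ramp
```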


Chapter 5

Beyond Understanding Physics

The perception of intrinsic object properties such as physical properties, appearances, and affordances plays a key role in explaining many of our daily observations, including the interactions among objects, between objects and scenes, and between agents and objects.

Scene Understanding

Knowledge about physical object properties could be crucial to scene understanding. Is the configuration of the objects in the room stable? What may happen if someone throws a ball against some particular object? Will people inside the room be safe if there is a minor earthquake? To answer questions like these, a computational system needs to understand basic physical laws, which could be provided by a mature physics engine, as well as some level of physical object properties. An initial attempt could be to build a system working with synthetic scenes, which we can generate at very little cost and of which we have perfect knowledge. We are actively designing a new generative model with a physics engine, which follows the architecture of our second model but focuses on scene understanding. We hope the model can achieve two goals: first, using physics to help generate physically plausible scenes; and second, discriminatively predicting the stability and other physical properties of every location in a given scene.


Cognitive Agents

In the dynamic world we live in, we are not only observers, but also participants. Similarly, with a model of physical objects, it is natural to incorporate an agent that actively explores and interacts with the world. An agent could be unintentional: infants may play with an object not for any particular purpose; instead, they merely want to discover what they can do to the object and what the responses will be. In perceiving physical object properties, it is reasonable to expect an agent that actively interacts with objects to perform better than a computational system that only learns by watching videos. Agents may also pursue goals such as moving an object to a certain place efficiently, or deconstructing an unstable pile of building blocks. Besides combining deep learning with a physics engine, a natural next step for such active agents is to integrate reinforcement learning into the loop, which has been proven effective in similar tasks.

Developmental Psychology

In our second model, we assume uniform priors on physical properties like mass and coefficient of friction during sampling. This does not seem to align with the intuition that people expect objects with larger volumes to be heavier, or objects with smoother surfaces to have smaller coefficients of friction. To what extent do these priors exist, and if so, how do they affect human decisions? These questions could have profound implications when agents try to interact with objects; e.g., a robot should exert a smaller force on a light and fragile object to avoid breaking it. Rigorously answering these questions, however, requires a careful design of behavioral experiments. Further, as discussed in Chapter 1, infants acquire basic concepts of physics at an early age. If we observe these priors in adults, when do young children develop similar concepts, and what kinds of priors do they have in their minds? A thorough understanding of these questions could inspire research in both developmental psychology and artificial intelligence.


Chapter 6

Conclusion

In this thesis, we discussed the task of learning physical object properties. We studied several scenarios with which humans are familiar, and in which even young children can learn to infer the physical object properties involved. We proposed a novel dataset, Physics 101, which contains over 17,000 videos, taken from four viewpoints, of 101 objects in four scenarios. We further proposed two novel models for learning physical properties of objects by incorporating physics simulators with deep neural nets, and conducted extensive evaluations.

The main contribution of this thesis is to show that a generative vision system, with physical object representations and a realistic 3D physics engine or a symbolic physics interpreter at its core, can efficiently deal with real-world data when proper recognition models and feature spaces are used. Our behavioral study also points towards the possibility of an account of human vision with generative physical knowledge at its core, and various recognition models as helpers to enable efficient inference. We hope our work can inspire future study on learning physical and other commonsense knowledge from visual data.


Bibliography

[1] Renée Baillargeon. Infants' physical world. Current Directions in Psychological Science, 13(3):89-94, 2004.

[2] Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. PNAS, 110(45):18327-18332, 2013.

[3] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the materials in context database. In CVPR, 2015.

[4] Katherine L Bouman, Bei Xiao, Peter Battaglia, and William T Freeman. Estimating the material properties of fabric from video. In ICCV, 2013.

[5] Susan Carey. The origin of concepts. Oxford University Press, 2009.

[6] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[7] Erwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 2010.

[8] Abe Davis, Katherine L Bouman, Justin G Chen, Michael Rubinstein, Frédo Durand, and William T Freeman. Visual vibrometry: Estimating material properties from small motions in video. In CVPR, 2015.

[9] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.

[10] Zhaoyin Jia, Andy Gallagher, Ashutosh Saxena, and Tsuhan Chen. 3D reasoning from blocks to stability. IEEE TPAMI, 2014.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

[13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[14] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

[15] Adam N Sanborn, Vikash K Mansinghka, and Thomas L Griffiths. Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review, 120(2):411, 2013.

[16] John Schulman, Alex Lee, Jonathan Ho, and Pieter Abbeel. Tracking deformable objects with point clouds. In ICRA, 2013.

[17] Carlo Tomasi and Takeo Kanade. Detection and tracking of point features. IJCV, 1991.

[18] Tomer Ullman, Andreas Stuhlmüller, Noah Goodman, and Josh Tenenbaum. Learning physics from dynamical scenes. In CogSci, 2014.

[19] Manik Varma and Andrew Zisserman. A statistical approach to material classification using image patch exemplars. IEEE TPAMI, 31(11):2032-2047, 2009.

[20] Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.

[21] Jiajun Wu, Ilker Yildirim, Joseph J. Lim, William T. Freeman, and Joshua B. Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In NIPS, 2015.

[22] Ilker Yildirim, Tejas D Kulkarni, Winrich A Freiwald, and Joshua B Tenenbaum. Efficient analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations. In CogSci, 2015.

[23] Bo Zheng, Yibiao Zhao, Joey C Yu, Katsushi Ikeuchi, and Song-Chun Zhu. Detecting potential falling objects by inferring human action and natural disturbance. In ICRA, 2014.

[24] Yixin Zhu, Yibiao Zhao, and Song-Chun Zhu. Understanding tools: Task-oriented object modeling, learning and recognition. In CVPR, 2015.