DO CONVOLUTIONAL NEURAL NETWORKS ACT AS COMPOSITIONAL NEAREST NEIGHBORS?

Chunhui Liu (1,2)*, Aayush Bansal (1), Victor Fragoso (3), Deva Ramanan (1)
(1) Carnegie Mellon University, (2) Peking University, (3) West Virginia University

ABSTRACT

We present a simple approach based on pixel-wise nearest neighbors to understand and interpret the functioning of state-of-the-art neural networks for pixel-level tasks. We aim to understand and uncover the synthesis/prediction mechanisms of state-of-the-art convolutional neural networks. To this end, we primarily analyze the synthesis process of generative models and the prediction mechanism of discriminative models. The main hypothesis of this work is that convolutional neural networks for pixel-level tasks learn a fast compositional nearest-neighbor synthesis/prediction function. Our experiments on semantic segmentation and image-to-image translation show qualitative and quantitative evidence supporting this hypothesis.

1 INTRODUCTION

Convolutional neural networks (CNNs) have revolutionized computer vision, producing impressive results for discriminative tasks such as image classification and semantic segmentation. More recently, they have also produced startlingly impressive results for image generation through generative models. However, in both cases, such feedforward networks largely operate as "black boxes". As a community, we are still not able to succinctly state why and how such feedforward functions generate a particular output from a given input. If a network fails on a particular input, why? How will a network behave on never-before-seen data? To answer such questions, there is a renewed interest in so-called explainable AI¹. The central goal in this (re)invigorated space is the development of machine learning systems that are designed to be more interpretable and explanatory.

Interpreting networks: Rather than redesigning a network to be more interpretable, a complementary approach is to post-hoc add interpretability to an existing network. Deep hierarchical features dramatically outperform manually-designed hierarchies (typically based upon edges, parts, and objects (Marr, 1982)), so it would be valuable to understand what non-obvious hierarchical concepts are being learned by such high-performing deep networks. What did classic visual hierarchies fail to capture? An illustrative example of such reasoning might be post-hoc interpretation of the AlphaGo network (Silver et al., 2016); while an automated Go player itself is certainly interesting, what may be more so is understanding the novel game heuristics (unknown to current experts) learned by the network.

Spatial prediction: In this work, we focus on the class of CNNs designed to make predictions at each image pixel. Many problems in computer vision can be cast in this framework, from semantic segmentation to depth estimation to image synthesis/translation. We explore a simple hypothesis for explaining the behaviour of such networks: they operate by cutting-and-pasting (composing) image patches found in the training data. Consider Fig. 1, where we visualize the output of Isola et al. (2016)'s translation network trained to synthesize images of building facades from label masks. Why does the network generate the strange diagonal gray edge at the top? To answer this question, we visualize image pixels extracted from the closest-matching nearest-neighbor (NN) patches found in the training data. Remarkably, NN-synthesis looks quite similar to the CNN output, providing a clear explanation for the synthesized corner artifact.

* Work done as a summer intern at Carnegie Mellon University.
¹ http://www.darpa.mil/program/explainable-artificial-intelligence


[Figure 1: image panels omitted. Each triple shows Input, Convolutional Neural Networks, and Compositional Nearest Neighbors for (a) Labels-to-Facades, (b) Facades-to-Labels, and (c) Images-to-Labels.]

Figure 1: We propose a non-parametric method to explain and modify the behavior of convolutional networks, including those that classify pixels and those that generate images. For example, given the label mask on the top-left, why does a network generate strange gray border artifacts? Given the image on the bottom-right, how is a network able to segment out the barely-visible lamppost? We show that the output of such networks can be explained through a simple compositional nearest-neighbor operation. On the right of every image triple, we show the output obtained by simply (1) matching input patches to those in the training set and (2) returning a corresponding output label. This means that the output is created by cutting-and-pasting (composing) patches of training images. Such an operation can readily explain the strange behavior: the border artifact and lamppost were matched to such output labels in the training set! Importantly, non-parametric matching does not require any complex feedforward operations given a suitable embedding for comparing patches. We provide evidence that networks implicitly learn such embeddings at all layers. Such a perspective allows one to "explain" errors and modify the biases of a network by changing the set of image patches used for non-parametric matching.

Compositional embeddings: Our central thesis is consistent with recent work on network memorization (Zhang et al., 2016), but notably, naive memorization fails to explain how/why networks generalize to never-before-seen data. We explain the latter through composition: the synthesized output in Fig. 1 consists of windows copied from different training images. Given a database of N training images with K pixels each, global nearest-neighbors can produce N possible output images. Compositional nearest-neighbors can produce exponentially more, (NK)^K, outputs, obtained by independently matching each of the K patches in the input query to one of the NK patches in the training set. But many of these outputs may be unrealistic. For example, one should not synthesize a facade by composing together windows from training images with different architectural styles. To ensure global consistency, one needs a carefully-tuned metric for comparing patches that can encode such global knowledge. But where do we obtain such a metric? Devlin et al. (2015a) demonstrate that the penultimate ("FC7") layer of classification networks learns semantic embeddings that can be used to compare images with surprising accuracy. We apply this observation to spatial prediction networks to learn semantic embeddings of image patches. Interestingly, we show that such an interpretation also holds for the internal layers of a network: e.g., one can "explain" the activations in a layer by matching to training patches using embeddings computed from the previous layer. This may suggest a novel perspective of CNNs as learning compositions of embeddings.

Correspondence and bias: Beyond being a mechanism for interpretation, we demonstrate that compositional NN matching is a viable algorithm that can rival the accuracy of highly-tuned CNNs. Although slower than a feedforward net, compositional matching is attractive in two respects: (1) it provides spatial correspondences between pixels in the predicted output and pixels in the training set, which may be useful in applications such as label transfer (Liu et al., 2011); (2) implicit biases of the network can be explicitly manipulated by changing the set of images used for matching, which need not be the same set used for training the network. As an illustrative example, we can force a pre-trained translation network to predict European or American building facades by restricting the set of images used for matching. Such a manipulation may, for example, be used to modify a biased face recognition network to process genders and races in a more egalitarian fashion.

Contribution: We introduce compositional NN embeddings for interpreting discriminative and generative CNNs. Specifically, we demonstrate that CNNs appear to work by memorizing image patches from training data, and then composing them into new configurations. To make compositional matching viable, CNNs learn an embedding of image patches that captures both global and local semantics. We validate our hypothesis on state-of-the-art networks for image translation and semantic image segmentation. Finally, we also show evidence that compositional matching can be used to generate spatial correspondences and manipulate the implicitly-learned biases of a network.

2 RELATED WORK

We broadly classify networks for spatial prediction into two categories: (1) discriminative prediction, where one seeks to infer high-level semantic information from RGB values, and (2) generative synthesis, where the intent is to synthesize a new image from a given input "prior". There is a broad literature for each of these tasks, and here we discuss the work most relevant to ours.

Discriminative Models: An influential formulation for state-of-the-art spatial prediction tasks is that of fully convolutional networks (Long et al., 2015). These have been used for pixel prediction problems such as semantic segmentation (Long et al., 2015; Hariharan et al., 2015; Ronneberger et al., 2015; Bansal et al., 2017a; Chen et al., 2016), depth/surface-normal estimation (Bansal et al., 2016; Eigen & Fergus, 2015), and low-level edge detection (Xie & Tu, 2015; Bansal et al., 2017a). Substantial progress has been made to improve performance by employing deeper architectures (He et al., 2015), increasing the capacity of the models (Bansal et al., 2017a), utilizing skip connections, or adding intermediate supervision (Xie & Tu, 2015). However, we do not precisely know what these models are actually capturing to do pixel-level prediction. In the race for better performance, the interpretability of these models has typically been ignored. In this work, we take a first step toward analyzing the prediction space of these discriminative models, focusing on encoder-decoder architectures for spatial classification (Ronneberger et al., 2015; Badrinarayanan et al., 2017).

Generative Models: Goodfellow et al. (2014) proposed a two-player min-max formulation where a generator (G) synthesizes an image from random noise z, and a discriminator (D) is used to distinguish the generated images from real images. While this Generative Adversarial Network (GAN) formulation was originally proposed to synthesize an image from random noise vectors z, it could also be used to synthesize new images from other priors, such as a low-resolution image or label mask, by treating z as an explicit input to be conditioned upon. This conditional image synthesis via a generative adversarial formulation has been well utilized by multiple follow-up works to synthesize a new image conditioned on a low-resolution image (Denton et al., 2015), class labels (Radford et al., 2015), and other inputs (Isola et al., 2016; Zhu et al., 2017). While the quality of synthesis from different inputs has rapidly improved, interpretation of GANs has been relatively unexplored. In this work, we examine the influential Pix2Pix network of Isola et al. (2016) and demonstrate an intuitive non-parametric representation for explaining its impressive results.

Interpretability & Nearest Neighbors: There is a substantial body of work (Zeiler & Fergus, 2014; Mahendran & Vedaldi, 2015; Zhou et al., 2014; Bau et al., 2017) on interpreting general convolutional neural networks (CNNs). The earlier work of Zeiler & Fergus (2014) presented an approach to understand and visualize the functioning of intermediate layers of a CNN. Mahendran & Vedaldi (2015) proposed to invert deep features to visualize what is learnt by CNNs, similar to inverting HOG features to understand object detection (Vondrick et al., 2013). Zhou et al. (2014) demonstrated that object detectors automatically pop up while learning the representation for scene categories. Krishnan & Ramanan (2016) explored interactive modification of a pre-trained network to learn novel concepts, and recently Bau et al. (2017) proposed to quantify interpretability by measuring scene semantics such as objects, parts, texture, material, etc. Despite this, understanding the space of pixel-level CNNs is not well studied. The recent work of PixelNN (Bansal et al., 2017b) focuses on high-quality image synthesis by making use of a two-stage matching process that begins with feedforward CNN processing and ends with a nonparametric matching of high-frequency detail. We differ in our focus on interpretability rather than image synthesis, our examination of networks for both discriminative classification and generative synthesis, and our simpler single-stage matching process that does not require feedforward processing.

Compositionality & CNNs: The design of part-based models (Crandall et al., 2005; Felzenszwalb et al., 2008), pictorial structures or spring-like connections (Fischler & Elschlager, 1973; Felzenszwalb & Huttenlocher, 2005), star-constellation models (Weber et al., 2000; Fergus et al., 2003), and recent works using CNNs share a common theme of compositionality. While the earlier works explicitly enforce the idea of composing different parts for object recognition in the algorithmic formulation, there have been suggestions that CNNs also take a compositional approach (Zeiler & Fergus, 2014; Krishnan & Ramanan, 2016; Bau et al., 2017). In this work, we explicitly use a nearest-neighbors pipeline to strongly suggest that Convolutional Neural Networks behave as Compositional Nearest Neighbors!

3 COMPOSITIONAL NEAREST NEIGHBORS

We now introduce our method to interpret various fully convolutional networks designed for pixel-level tasks.

Global Nearest-Neighbor Embeddings: Our starting point is the observation that classification networks can be interpreted as linear classifiers defined on nonlinear features extracted from the penultimate layer (e.g., "FC7" features) of the network. We formalize this perspective with the following notation:

$$\mathrm{Label}(x) = k^* \quad \text{where} \quad k^* = \operatorname*{argmax}_{k \in \{1,\dots,K\}} w_k \cdot \phi(x) \qquad \text{[K-way classification]} \tag{1}$$

where φ(x) ∈ R^N corresponds to the penultimate FC7 features computed from input image x. Typically, the parameters of the linear classifier {w_y} and those of the feature encoder φ(·) are trained on large-scale supervised datasets:

$$D = \{(x_n, y_n)\} \qquad \text{[Training database]} \tag{2}$$

Devlin et al. (2015b) make the observation that penultimate features φ(x) can be interpreted as an embedding in R^N. By extracting such embeddings for training images x_n, Devlin et al. (2015b) build a nonparametric nearest-neighbor (NN) predictor for complex tasks such as image captioning. We write this as follows:

$$\mathrm{Label}(x) = y_{n^*} \quad \text{where} \quad n^* = \operatorname*{argmin}_{n} \, \mathrm{Dist}\big(\phi(x), \phi(x_n)\big) \qquad \text{[NN classification]} \tag{3}$$
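As a concrete illustration (a minimal sketch rather than the original implementation; the array names and feature dimension are hypothetical placeholders), such a global NN classifier can be written in a few lines given precomputed embeddings and cosine distance:

```python
import numpy as np

def nn_classify(query_feat, train_feats, train_labels):
    """Nearest-neighbor classification in the spirit of Eq. (3): return the label
    of the training image whose embedding is closest to phi(x) under cosine distance."""
    q = query_feat / np.linalg.norm(query_feat)
    db = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    n_star = int(np.argmax(db @ q))       # max cosine similarity == min cosine distance
    return train_labels[n_star], n_star   # n_star indexes the "visual explanation"

# Hypothetical usage: train_feats is an (N, 4096) array of FC7-style embeddings phi(x_n),
# train_labels an (N,) array of labels y_n, and query_feat a (4096,) query embedding.
```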

Importantly, the above NN classifier performs quite well even when feature encoders φ(·) are trained for classification rather than as an explicit embedding. In some sense, deep nets seem to implicitly learn embeddings upon which simple linear classifiers (or regressors) operate. We argue that such an NN perspective is useful in interpreting the predicted classification, since the corresponding training example can be seen as a visual "explanation" of the prediction: e.g., the predicted label for x is "dog" because x looks similar to training image x_{n*}.

Pixel-wise Nearest-Neighbor Embeddings: We now extend the above insight to pixel-prediction networks that return a prediction for each pixel i in an image:

$$\mathrm{Label}_i(x) = k^* \quad \text{where} \quad k^* = \operatorname*{argmax}_{k \in \{1,\dots,K\}} w_k \cdot \phi_i(x) \qquad \text{[K-way pixel classification]} \tag{4}$$

where we write Label_i(·) for the label of the ith pixel and φ_i(·) for its corresponding feature vector. Because we will also examine pixel-level prediction networks trained to output a continuous value, we write out the formulation for pixel-level regression:

$$\mathrm{Predict}_i(x) = W \phi_i(x) \quad \text{where} \quad W \in \mathbb{R}^{M \times N} \qquad \text{[Pixel regression]} \tag{5}$$

For concreteness, consider a Pix2Pix (Isola et al., 2016) network trained to regress RGB values at each pixel location. These predictions are obtained by convolving features from the penultimate layer with filters of size 4 × 4 × 128. In this case, the three filters that generate R, G, and B values can be written as a matrix W of size M × N, where M = 3 and N = 4 × 4 × 128 = 2048.
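To make the shapes concrete, the sketch below (a hedged illustration with random stand-in values, not the authors' code) shows how three 4 × 4 × 128 output filters can be flattened into the M × N matrix W of Eq. (5) and applied to a single patch feature φ_i(x):

```python
import numpy as np

# Hypothetical final-layer weights of a Pix2Pix-style decoder: three output
# channels (R, G, B), each produced by a 4 x 4 x 128 convolution kernel.
filters = np.random.randn(3, 4, 4, 128)

# Flattening each kernel gives the matrix W of Eq. (5): M = 3, N = 4*4*128 = 2048.
W = filters.reshape(3, -1)                      # shape (3, 2048)

# phi_i(x): the local 4 x 4 x 128 neighborhood of penultimate features at pixel i,
# flattened with the same memory layout as the kernels.
phi_i = np.random.randn(4, 4, 128).reshape(-1)  # shape (2048,)

rgb_i = W @ phi_i                               # predicted (R, G, B) value at pixel i
print(rgb_i.shape)                              # (3,)
```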

[Figure 2: image panels omitted. Left: Input: Label → Nearest Neighbor Search over Label-Image Pairs in Training Set → Output: Image. Right: Input: Image → Nearest Neighbor Search over Image-Label Pairs in Training Set → Output: Label.]
Figure 2: Overview of Pipeline: Given an input label/image on the top-left, our approach extracts an embedding for each pixel. We visualize two pixels with a yellow and white dot. The embedding captures global context, which is visualized with the surrounding rectangular box. We then find the closest matching patches in the training set (with a nearest neighbor search), and then report back the corresponding pixel labels to generate the final output (bottom-left). We visualize an example for label-to-image synthesis on the left, and image-to-label prediction on the right.

[Figure 3: image panels omitted. Three pairs of feature-map visualizations, each comparing Pix2Pix against CompNN.]
Figure 3: We visualize a non-parametric approach to computing activations from internal layers. By matching to a training database of decoder_4 features from Pix2Pix, we can compute activations for the next layer (decoder_3) with nearest neighbors. Each image represents a feature map of 3 consecutive channels visualized in the R, G, and B planes. The collective set of 6 images displays 18 out of the 256 channels in decoder_3. We see that the non-parametric prediction looks nearly identical to the actual activations computed in decoder_3 by Pix2Pix.

Analogously, φ_i(x) corresponds to the N-dimensional features extracted by reshaping local 4 × 4 convolutional neighborhoods of features from the penultimate feature map (of size H × W × 128). We can now perform nearest-neighbor regression to output pixel values:

$$\mathrm{Predict}_i(x) = y_{n^*,\,m^*} \quad \text{where} \quad (n^*, m^*) = \operatorname*{argmin}_{n,m} \, \mathrm{Dist}\big(\phi_i(x), \phi_m(x_n)\big) \qquad \text{[NN pixel regression]}$$

where y_{n*,m*} refers to the m*th pixel of the n*th training image. Importantly, pixel-level nearest-neighbor matching reveals spatial correspondences for each output pixel. We demonstrate that these can be used to provide an intuitive explanation of pixel outputs, including an explanation of errors that otherwise seem quite mysterious (see Fig. 1 and Fig. 2).

Distance function: We explored different distance functions such as Euclidean and cosine distance. Similar to past work (Devlin et al., 2015b), we found that cosine distance consistently performed slightly better, so we use it in all of our experiments.
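Putting the pieces together, a minimal and deliberately brute-force numpy sketch of this compositional NN regression might look as follows; the patch features, training outputs, and shapes are hypothetical placeholders rather than the exact arrays used in our experiments:

```python
import numpy as np

def comp_nn_regress(query_feats, train_feats, train_outputs):
    """Brute-force compositional NN pixel regression under cosine distance.

    query_feats:   (K, D)    patch embeddings phi_i(x) for the K query pixels
    train_feats:   (N, K, D) patch embeddings for the N training images
    train_outputs: (N, K, C) per-pixel outputs (e.g. RGB values or labels)
    Returns (K, C) predictions plus the (n*, m*) correspondence for every pixel.
    """
    N, K, D = train_feats.shape
    db = train_feats.reshape(N * K, D)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)    # unit-normalize patches
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)

    sim = q @ db.T                                         # (K, N*K) cosine similarities
    n_star, m_star = np.unravel_index(sim.argmax(axis=1), (N, K))
    return train_outputs[n_star, m_star], n_star, m_star   # copy output pixels
```

The returned indices are the spatial correspondences mentioned above; in practice the exhaustive comparison over all NK training patches can be pruned with the bottleneck-based shortlist described later in this section.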

Nearest-Neighbor Embeddings of Interior Layers: We now extend our embedding interpretation of neural nets to internal layers. Recall that activations a at a spatial position i and layer j can be computed using thresholded linear functions of features (activations) from the previous layer:

$$a_{ij}(x) = \max\big(0,\, W\phi_{ij}(x)\big), \quad \text{where} \quad a_{ij} \in \mathbb{R}^{M},\ \phi_{ij} \in \mathbb{R}^{N},\ W \in \mathbb{R}^{M \times N} \tag{6}$$

where we write a_{ij} for the vector of activations corresponding to the ith pixel position from layer j, possibly computed with bilinear interpolation (Long et al., 2015). We write φ_{ij} for the local convolutional neighborhood of features (activations) from the previous layer that are linearly combined with a bank of linear filters W to produce a_{ij}. For concreteness, by matching to a training database of decoder_4 features from Pix2Pix, we can compute activations for the next layer (decoder_3) with nearest neighbors. Here, a_{ij} ∈ R^M where M = 256, and φ_{ij} ∈ R^N where N = 4 × 4 × 512 = 8192.

We similarly posit that one can produce approximate activations by nearest neighbors. Specifically, let us run Pix2Pix on the set of training images, and construct a dataset of training patches with features φ_{ij}(x_n) as data and corresponding activation vectors a_{ij}(x_n) as labels. We can then predict activation maps for a query image x with NN:

$$a_{ij}(x) = a_{m^*j}(x_{n^*}) \quad \text{where} \quad (m^*, n^*) = \operatorname*{argmin}_{m,n} \, \mathrm{Dist}\big(\phi_{ij}(x), \phi_{mj}(x_n)\big) \qquad \text{[NN activations]}$$

Notably, this is done without requiring explicit access to the filters W. Rather, such responses are implicitly encoded in the training dataset of patches and activation labels. We show that such an approach actually produces reasonable activations, which suggests a novel interpretation of neural networks as learning hierarchies of embeddings (Fig. 3). An important special case is the bottleneck feature, which is computed from an activation map of size 1 × 1 × 512. In this case, we posit that the corresponding feature φ_{ij}(x) ∈ R^512 is a good global descriptor of image x. In our experiments, we found that such a global descriptor could be used to prune the NN pixel search, significantly speeding up run-time performance (e.g., we first prune the training database to a shortlist of images with similar bottleneck features, and then search through these images for similar patches).

Bias modification: Finally, our results suggest that the matching database from Eq. (2) serves as an explicit "associative memory" of a network (Carpenter, 1989). We can explicitly modify the memory by changing the dataset of training images {x_n}, labels {y_n}, or both. We experiment with various modifications in our experimental results. One modification that consistently produced smoother visual results was to refine the training labels to those predicted by a network:

$$\{(x_n, y_n)\} \;\Rightarrow\; \{(x_n, \mathrm{CNN}(x_n))\} \tag{7}$$

Such a procedure for "self-supervised" learning is sometimes used when training labels are known to be noisy. From an associative-memory perspective, we posit that such labels capture a more faithful representation of a network's internal memory. Unless otherwise specified, all results make use of the above matching database. We visualize the impact of self-supervised labels in Fig. 4.
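As a rough sketch of the pruning strategy just described (a hedged illustration; the shortlist size and helper names are hypothetical, and it assumes the comp_nn_regress function from the earlier sketch is in scope), the global bottleneck descriptor can shortlist training images before the per-pixel patch search:

```python
import numpy as np

def pruned_comp_nn(query_bottleneck, query_feats,
                   train_bottlenecks, train_feats, train_outputs, shortlist=10):
    """Two-stage compositional NN search.

    Stage 1: rank training images by cosine similarity of their global 512-d
             bottleneck descriptors and keep only a small shortlist.
    Stage 2: run the per-pixel patch search (comp_nn_regress, defined earlier)
             over the shortlisted images only.
    """
    b = query_bottleneck / np.linalg.norm(query_bottleneck)
    B = train_bottlenecks / np.linalg.norm(train_bottlenecks, axis=1, keepdims=True)
    keep = np.argsort(-(B @ b))[:shortlist]        # indices of most similar images

    preds, n_star, m_star = comp_nn_regress(query_feats,
                                            train_feats[keep],
                                            train_outputs[keep])
    return preds, keep[n_star], m_star             # map back to global image indices
```

Swapping train_outputs for the network's own predictions CNN(x_n) would implement the self-supervised matching database of Eq. (7), and restricting keep to a curated subset of images (e.g., only European facades) would implement the bias-modification experiments shown later (Fig. 8).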

3.1 SUFFICIENT STATISTICS FOR PIXEL-LEVEL TASKS

We now discuss information-theoretic properties of the nearest-neighbor embedding φ_i(·) introduced in Sec. 3. Specifically, we show that this embedding produces sufficient statistics (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017; Achille & Soatto, 2017) for various pixel-level tasks.

Global embeddings: We begin with the simpler case of predicting a global class label y from an image x. If we assume that the input image x, global feature embedding φ(x), and output label y form a Markov chain x → φ(x) → y, we can write the following:

$$p(y \mid \phi(x), x) = p(y \mid \phi(x)). \qquad \text{[Sufficiency]} \tag{8}$$

As Achille & Soatto (2017) show, standard loss functions in deep learning (such as cross-entropy) search for an embedding that minimizes the entropy of the label y given the representation φ(x). This observation explicitly shows that the embedding φ(x) is trained to serve as a sufficient representation of the data x that is rich enough in information to predict the label y. In particular, if the learned embedding satisfies the above Markov assumption, the prediction y will not improve even when given access to the raw data x.
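For reference, this claim can be phrased via the data-processing inequality; the identity below is standard information theory stated as supporting context rather than text drawn from the paper. Since φ(x) is a deterministic function of x, the chain y → x → φ(x) always holds, so the embedding can never carry more information about y than x does, and it loses none exactly when Eq. (8) holds:

```latex
\begin{align*}
  I\bigl(y;\phi(x)\bigr) &\le I(y;x)
    && \text{(data-processing inequality, since } \phi(x) \text{ is a function of } x\text{)} \\
  I\bigl(y;\phi(x)\bigr) &= I(y;x)
    \iff p\bigl(y \mid \phi(x), x\bigr) = p\bigl(y \mid \phi(x)\bigr)
    && \text{(equality iff sufficiency, Eq. (8))}
\end{align*}
```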

[Figure 4: image panels omitted. Two rows, each showing Input, Pix2Pix, Original Labels, Self-Supervised Labels, and Ground Truth.]
Figure 4: Bias modification of training labels: Given the label input on the left, we show results of Pix2Pix, and non-parametric matching to the training set using the original labels and the predicted “self-supervised” labels of the Pix2Pix network. Generating images with the predicted labels looks smoother, though the qualitative behavior of the network is still explained by the original training labels.

Pixelwise embeddings: Spatial networks predict a set of pixelwise labels {y_i} given an input image. If we assume that the pixelwise labels are conditionally independent given {φ_i(x)} and x, then we can write the joint posterior distribution over labels with the following product:

$$p(\{y_i\} \mid \phi(x), x) = \prod_i p(y_i \mid \phi(x), x) \qquad \text{[Conditional Spatial Independence]} \tag{9}$$
$$= \prod_i p(y_i \mid \phi_i(x)), \qquad \text{[Sufficiency]} \tag{10}$$

where φ_i(x) are the sufficient statistics needed to predict label y_i, and φ(x) = {φ_i(x)} is the set of sufficient statistics. One can similarly show that pixel-wise cross-entropy losses jointly minimize the entropy of the labels y_i given φ_i(x). This suggests that the pixelwise features φ_i(x) serve as a remarkably rich characterization of the image. This characterization includes both global properties (e.g., the color of a building facade being synthesized) as well as local properties (e.g., the presence of a particular window ledge being synthesized). Importantly, this requires the conditional independence assumption from Eq. (9) to be conditioned on the entire image x rather than just the local pixel value x_i (from which it would be hard to extract global properties). An interesting observation from the factorization shown in Eq. (10) is that we can synthesize an image by predicting pixel values independently. Thus, this theoretical observation suggests that a simple nearest-neighbor regression for every output pixel can synthesize plausible images.

Multi-layered representations: The above pixelwise formulation also suggests that internal feature layers serve as sufficient representations to predict activations for subsequent layers. Interestingly, skip connections that directly connect lower layers to higher layers break the Markov independence assumption. In other words, skip connections suggest that higher-layer features often do not serve as sufficient representations, in that subsequent predictions improve when given access to earlier layers. However, Eq. (10) technically still holds so long as we write φ_i(x) for the concatenated representation including lower-level features. For example, in Pix2Pix, we write φ_i(x) ∈ R^N where N = 4 × 4 × (64 + 64) = 2048, where the second set of 64-channel features is copied from the first encoder layer. In the next section, we show qualitative and quantitative experiments supporting this analysis.
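The skip-connection case can be made concrete with a small sketch (random stand-in features with shapes matching the Pix2Pix sizes quoted above; not the authors' code): φ_i(x) is simply a 4 × 4 patch taken from the channel-wise concatenation of the decoder features and the copied encoder features.

```python
import numpy as np

H, W = 256, 256
decoder_feats = np.random.randn(H, W, 64)   # last decoder layer (hypothetical stand-in)
encoder_feats = np.random.randn(H, W, 64)   # first encoder layer, copied by the skip connection

# Concatenate along channels, then read off the local 4 x 4 neighborhood at pixel (i, j).
feats = np.concatenate([decoder_feats, encoder_feats], axis=-1)   # (H, W, 128)

def phi(feats, i, j):
    """phi_i(x): flattened 4 x 4 x (64+64) neighborhood, i.e. a 2048-d patch embedding."""
    patch = feats[i:i + 4, j:j + 4, :]       # boundary handling omitted for brevity
    return patch.reshape(-1)

print(phi(feats, 10, 10).shape)              # (2048,)
```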

[Figure 5: image panels omitted. Left: Input, Pix2Pix, CompNN, Ground Truth. Right: Input, SegNet, CompNN, Ground Truth.]
Figure 5: Semantic Segmentation: Results of semantic segmentation on the CityScape and CamVid datasets. These results suggest two observations. First, the difference between the outputs of the networks (Pix2Pix and SegNet columns) and the NN embedding (CompNN column) is surprisingly small; thus, our method can faithfully interpret discriminative deep networks on pixel-level classification. Second, we notice some noisy edges with high gradients (see columns 5-8). This phenomenon can also be used to understand the difficulty of the image segmentation task: borders with high gradients are usually hard to classify due to ambiguous patches in the training set.

4 EXPERIMENTS

We now present experimental results for discriminative networks trained for semantic segmentation, as well as generative networks trained for image synthesis. The goal of the experiments is to show that images generated with a simple NN regression are nearly identical to those synthesized by the networks.

Networks: We use SegNet (Badrinarayanan et al., 2017), a recent state-of-the-art network for image segmentation, and Pix2Pix (Isola et al., 2016), a state-of-the-art network for conditional image synthesis and translation. We evaluate our findings on multiple datasets/tasks on which the original networks were trained. These include tasks such as synthesizing facades from architectural labels and vice versa, predicting segmentation class labels from urban RGB images, and synthesizing Google maps from aerial/satellite views.

Semantic Segmentation: We use the CityScape (Cordts et al., 2016) and CamVid (Brostow et al., 2008; 2009) datasets for the task of semantic segmentation. Both datasets are annotated with semantic class labels for outdoor images collected from a car driving on the road. We use SegNet (Badrinarayanan et al., 2017) for CamVid sequences, and Pix2Pix (Isola et al., 2016) for the CityScape dataset. Figure 5 shows qualitative results for these datasets. We can observe in Figure 5 that the compositional NN produces a result (CompNN column) that is nearly identical to the one produced by the network (Pix2Pix and SegNet columns). This suggests that our method enables a good interpretation of discriminative deep networks on pixel-level classification.

Architectural Labels to Facades: We follow Isola et al. (2016) for this setting, which uses the annotations from Tyleček & Šára (2013). There are 400 training images in this dataset and 100 images in the validation set. We use the same dataset to generate architectural labels from images of facades. Pix2Pix (Isola et al., 2016) models are trained using the 400 training images for both labels-to-facades and vice versa. The first and second rows of Figure 6 show qualitative examples of synthesizing real-world images using the pixel-wise nearest-neighbor embedding. We observe that the NN embedding perfectly explains the generation of deformed edges (see the CompNN column), and how the generative architecture is effectively memorizing patches from the training set.

Satellite to Maps: This dataset contains 1096 training images and 1000 test images scraped from Google Maps. We use the same settings used by Isola et al. (2016) for this task. The third row of Figure 6 qualitatively shows how Google maps are synthesized from satellite images. We can observe that our synthesis (CompNN column) is nearly identical to the image generated by the network (Pix2Pix column).

[Figure 6: image panels omitted. Two rows, each showing Input, Pix2Pix, CompNN, and Original Image.]
Figure 6: Image Synthesis: The results suggest that our approach can also explain the outputs of Pix2Pix. We conclude this because our approach reproduces the color and structure of the Pix2Pix output, including a few artifacts (e.g., the image cuts in the 6th and 7th columns).

[Figure 7: image panels omitted. Columns show Input, Original, Pix2Pix, and CompNN, alongside training images A, B, C, and D.]
Figure 7: Correspondence Map: Given the input label mask on the top left, we show the ground-truth image and the output of Pix2Pix below. Why does Pix2Pix synthesize the peculiar red awning from the input mask? To provide an explanation, we use CompNN to synthesize an image by explicitly cutting-and-pasting (composing) patches from training images. We color-code pixels in the training images to denote correspondences. For example, CompNN copies doors from training image A (blue) and the red awning from training image C (yellow).

Correspondence Map: Figure 7 shows a correspondence map that explicitly illustrates how an output image is synthesized by cutting-and-pasting patches from training images. We can observe that patches are selected from many different styles of facades, but in a manner that ensures the composed output is globally consistent while maintaining the appropriate local structures (such as windows, doors, awnings, etc.). This implies that the learned patch embedding captures both global semantics and local structure.

Bias modification: Figure 8 shows that the output of nonparametric matching (third to sixth columns from left to right) can be controlled through explicit modification of the matching database.
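A correspondence map of the kind shown in Fig. 7 can be rendered directly from the (n*, m*) indices returned by the matcher sketched earlier; the color assignment below is a hypothetical illustration rather than the exact visualization used in the figure.

```python
import numpy as np

def correspondence_map(n_star, height, width, num_train, seed=0):
    """Color-code each output pixel by the training image it was copied from,
    producing a map in the style of Fig. 7 (one random color per training image)."""
    palette = np.random.default_rng(seed).uniform(0.0, 1.0, size=(num_train, 3))
    return palette[np.asarray(n_star)].reshape(height, width, 3)   # (H, W, 3) map
```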

[Figure 8: image panels omitted. Columns show Labels, Pix2Pix, and Bias Modification results.]
Figure 8: Bias modification: Given the same label input, we show different results obtained by matching to different databases (using an embedding learned by Pix2Pix). By modifying the database to include specific buildings from specific locations, one can introduce and remove implicit biases in the original network (e.g., one can generate "European" facades versus "American" facades).

Approach                       | Facades-to-Labels | CityScape | CamVid
-------------------------------|-------------------|-----------|-------
Baseline CNN                   | 0.545             | 0.738     | 0.785
CompNN (Self-Supervised Label) | 0.505             | 0.735     | 0.798
CompNN (Original Label)        | 0.493             | 0.722     | 0.757
Table 1: Quantitative Evaluation: We compare our compositional nearest neighbor approach to baseline networks trained on the same training/matching database. We report the fraction of correctly-classified pixels, where predicted segmentation labels are compared to ground-truth labels. We specifically compare with (Isola et al., 2016) for Facades and CityScape, and with (Badrinarayanan et al., 2017) for CamVid. Our results are quite competitive, and sometimes even outperform the baseline network.

In simple words, this experiment shows that we can control properties of the synthesized image by simply specifying the exemplars in the database. We can observe in Figure 8 that the synthesized images preserve the structure (e.g., windows, doors, and roof), since it is given by the conditional input, while the textural components (e.g., color) change with the database.

Quantitative Evaluation: We present a quantitative analysis of our pixel-wise nearest neighbor approach against end-to-end pipelines in Table 1. We report the classification accuracy of predicted labels against ground-truth labels for the task of semantic segmentation. We can observe in Table 1 that our results (CompNN with Self-Supervised and Original Labels) are competitive with, and sometimes better than, those of the baseline. We note that CompNN with Self-Supervised labels achieves higher accuracy than CompNN with Original labels. Finally, CompNN with Self-Supervised labels outperforms the baseline on the CamVid dataset, and achieves quite competitive classification accuracy on the Facades-to-Labels and CityScape datasets.
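The reported metric is simply the fraction of correctly-classified pixels; a minimal sketch of that computation (assuming integer label maps of identical shape; the optional ignore_label argument is a hypothetical convenience) is:

```python
import numpy as np

def pixel_accuracy(pred_labels, gt_labels, ignore_label=None):
    """Fraction of correctly-classified pixels (the metric reported in Table 1)."""
    valid = np.ones(gt_labels.shape, dtype=bool)
    if ignore_label is not None:
        valid = gt_labels != ignore_label      # optionally skip void/ignore pixels
    return float((pred_labels[valid] == gt_labels[valid]).mean())
```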

5 DISCUSSION

In this paper, we have presented a simple approach based on pixel-wise nearest neighbors to understand and interpret the functioning of convolutional neural networks for spatial prediction tasks. Our analysis suggests that CNNs behave as compositional nearest-neighbor operators over a training set of patch-label pairs that acts as an associative memory. But beyond simply memorizing, CNNs can generalize to novel data by composing together local patches from different training instances. We also argued that networks for pixel-level tasks learn sufficient statistics that enable the generation of pixel predictions. Our analysis and experiments not only support this argument, but also enable example-based explanations of network behavior and explicit modulation of the implicit biases learned by the network. We hope that our framework enables further analysis of convolutional networks from a non-parametric perspective.

REFERENCES

Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. CoRR, abs/1706.01350, 2017. URL http://arxiv.org/abs/1706.01350.

Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.

Aayush Bansal, Bryan Russell, and Abhinav Gupta. Marr Revisited: 2D-3D model alignment via surface normal prediction. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Aayush Bansal, Xinlei Chen, Bryan Russell, Abhinav Gupta, and Deva Ramanan. PixelNet: Representation of the pixels, by the pixels, and for the pixels. arXiv:1702.06506, 2017a.

Aayush Bansal, Yaser Sheikh, and Deva Ramanan. PixelNN: Example-based image synthesis. CoRR, abs/1708.05349, 2017b.

David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. CoRR, abs/1704.05796, 2017.

Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. Segmentation and recognition using structure from motion point clouds. In Proc. of the European Conference on Computer Vision (ECCV), pp. 44–57, 2008.

Gabriel J. Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.

Gail A. Carpenter. Neural network models for pattern recognition and associative memory. Neural Networks, 2(4):243–257, 1989.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR, abs/1606.00915, 2016.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

David Crandall, Pedro Felzenszwalb, and Daniel Huttenlocher. Spatial priors for part-based recognition using statistical models. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. CoRR, abs/1506.05751, 2015.

Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C. Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015a.

Jacob Devlin, Saurabh Gupta, Ross B. Girshick, Margaret Mitchell, and C. Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning. CoRR, abs/1505.04467, 2015b.

David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2015.

Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 2005.

R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.

M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Trans. Comput., 22(1), January 1973.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.

Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. arXiv, 2016.

Vivek Krishnan and Deva Ramanan. Tinkering under the hood: Interactive zero-shot learning with net surgery. CoRR, abs/1612.04901, 2016.

Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2368–2382, 2011.

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional models for semantic segmentation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. 1982.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. CoRR, abs/1503.02406, 2015. URL http://arxiv.org/abs/1503.02406.

Radim Tyleček and Radim Šára. Spatial pattern templates for recognition of objects with regular structure. In Proc. German Conference on Pattern Recognition (GCPR), Saarbrücken, Germany, 2013.

C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2013.

Markus Weber, Max Welling, and Pietro Perona. Towards automatic discovery of object categories. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2000.

Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proc. of the IEEE International Conference on Computer Vision (ICCV), 2015.

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Proc. of the European Conference on Computer Vision (ECCV), pp. 818–833, 2014.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. In Proc. of the International Conference on Learning Representations (ICLR), 2014.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.