Discriminative k-shot learning using probabilistic models arXiv ...

1 downloads 0 Views 751KB Size Report
Jun 1, 2017 - Richard E. Turner† ...... [2] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. ... [4] Jake Snell, Kevin Swersky, and Richard Zemel.
Discriminative k-shot learning using probabilistic models Matthias Bauer∗ † ‡

Mateo Rojas-Carulla∗ † ‡

Jakub Bartłomiej Świątkowski†

Bernhard Schölkopf‡ Richard E. Turner† † Department of Engineering, University of Cambridge, Cambridge, UK

arXiv:1706.00326v1 [stat.ML] 1 Jun 2017



Max Planck Institute for Intelligent Systems, Tübingen, Germany ∗

MB and MR contributed equally to this work

This paper introduces a probabilistic framework for k-shot image classification. The goal is to generalise from an initial large-scale classification task to a separate task comprising new classes and small numbers of examples. The new approach not only leverages the feature-based representation learned by a neural network from the initial task (representational transfer), but also information about the form of the classes (concept transfer). The concept information is encapsulated in a probabilistic model for the final layer weights of the neural network which then acts as a prior when probabilistic k-shot learning is performed. Surprisingly, simple probabilistic models and inference schemes outperform many existing k-shot learning approaches and compare favourably with the state-of-the-art method in terms of error-rate. The new probabilistic methods are also able to accurately model uncertainty, leading to well calibrated classifiers, and they are easily extensible and flexible, unlike many recent approaches to k-shot learning.

1. Introduction A child encountering images of helicopters for the first time is able to generalize to instances with radically different appearance from only a handful of labelled examples. This remarkable feat is supported in part by a high-level feature-representation of images acquired from past experience. However, it is likely that information about previously learned concepts, such as aeroplanes and vehicles, is also leveraged (e.g. that sets of features like tails and rotors tend to co-occur or that objects like pilots and drivers are likely to appear in images of new vehicles). The goal of this paper is to build machine systems for performing k-shot learning, which leverage both existing feature representations of the inputs and existing class information that have both been honed by learning from large amounts of labeled data. K-shot learning has enjoyed a recent resurgence in the academic community [1–5]. In contrast to the main body of existing work, this paper proposes a general framework based upon a combination of deep learning and traditional probabilistic modelling subsuming two elegant existing approaches in this vein [5, 6]. The motivation is that deep learning will learn powerful feature representations and that probabilistic inference will transfer top-down conceptual information from old classes. Representational learning is driven by a large number of training examples from the original classes making it amenable to standard deep learning [7, 8]. However, the

1

transfer of top-down conceptual information to the new classes relies on a relatively small number of existing classes and k-shot data points, which means probabilistic inference is appropriate. This dual approach has an interesting relation to the fields of sensory neuroscience and cognitive science. Our basic setup mimics that of the motivating example above: a standard convolutional neural network (CNN) is trained on a large training set, albeit one which is fully labelled. This learns a rich representation of images at the top hidden layer of the CNN. Accumulated knowledge about classes is embodied in the top layer softmax weights of the network. This information is extracted by training a probabilistic model on these weights. K-shot learning can then 1) use the representation of images provided by the CNN as input to a new softmax function, 2) learn the new softmax weights by combining prior information about their likely form derived from the original dataset with the k-shot likelihood. Our contributions. We develop a general framework for k-shot learning that combines deep learning and probabilistic inference. A range of probabilistic models and associated approximate inference schemes for performing the concept transfer are evaluated. Assessment of the method on CIFAR-100 [9] and miniImageNet [3] is rigorous and multifaceted: k-shot accuracies are complemented by measures of the quality of the probabilistic outputs (held-out log-likelihood scores for k-shot learning as well as calibration curves). These are critical for decision making problems such as active learning, and with small numbers of examples uncertainty is rife [10]. In order to test the robustness of the methods, evaluation is performed both when the k-shot classes are selected at random and in adversarial settings where the new classes are very distinct from the old ones. Finally we consider the online k-shot learning setting where the goal is to predict on the new and the old classes. The main conclusions from our experiments are: 1. A surprisingly simple variant of our approach outperforms many existing methods [3, 11] and is competitive with state-of-the-art prototypical networks [4]; 2. fully probabilistic models lead to predictions with desirable features such as well-calibrated probabilities, which may help avoid catastrophic forgetting in online learning settings and are helpful in adversarial settings; 3. the new approach is more extensible and flexible than existing approaches: it automatically handles additional k-shot examples without requiring retraining to achieve excellent errorrates and extends to situations where the number of classes grows over time. In the following section, we introduce the framework of k-shot learning. Moreover, we present the machinery needed to perform probabilistic k-shot learning given a knowledge base, such as a neural network trained with large amounts of data.

2. Probabilistic k-shot learning This section formally describes the basic task and notation, before developing the probabilistic framework for k-shot learning. k-shot learning task. We consider the following discriminative k-shot learning task: First, we receive a large dataset e e = {u e that indicate which of the C e classes e i , yei }N e i and labels yei ∈ {1, . . . , C} D i=1 of images u N each image belongs to. Second, we receive a small dataset D = {ui , yi }i=1 of C new classes, yi ∈ {Ce + 1, Ce + C}, with k images from each new class. The goal is to construct a probabilistic 2

e and D to predict well on unseen images u∗ from model that can leverage the information in D the new classes; the performance is evaluated against ground truth labels y ∗ .

Notation. We further introduce a convolutional neural network (CNN) Φϕ as feature extractor whose last hidden layer activations are mapped to two sets of softmax output units corresponding to the e and the C classes in the small dataset D, respectively. These Ce classes in the large dataset D f for the old classes and W for the separate mappings are parametrized by weight matrices W new classes. Denoting the output of the final hidden layer as x = Φϕ (u), the first softmax f = softmax(W fx e n , W) e n ) and the second p(yn | xn , W) = softmax(Wxn ), cf. units compute p(yen | x Fig. 1 (left).

phase 4: k-shot testing

phase 3: k-shot learning

or

phase 1: representational learning

shared layers

phase 2: concept learning

new classes

old classes

or Figure 1: left: Shared feature extractor Φϕ and separate softmax units with associated weights for the original task involving the old class labels and for k-shot learning on the new classes. right: Graphical model for probabilistic k-shot learning

2.1. A framework for probabilistic k-shot learning The probabilistic k-shot learning approach comprises four phases mirroring the dataflow: Phase 1: Representational learning. e is used to train the CNN Φϕ using standard deep learning optimisation The large dataset D approaches. This involves learning both the parameters ϕ of the feature extractor up to the last f The network parameters ϕ are fixed from this hidden layer, as well as the softmax weights W. point onwards and shared across phases. This is a standard setup for multi-task learning and in the present case it ensures that the features derived from the representational learning can be leveraged for k-shot learning. Phase 2: Concept learning. f are effectively used as data for concept learning by training a probabilisThe softmax weights W tic model that detects structure in these weights which can be transferred for k-shot learning. This approach will be justified in the next section. For the moment, we consider a general class of probabilistic models in which the two sets of weights are generated from shared hyperparameters f W, θ) = p(θ)p(W|θ)p(W|θ) f θ, so that p(W, (see Fig. 1); specific choices for these distributions will be considered in Section 2.4.

3

Phases 3 and 4: k-shot learning and testing. Probabilistic k-shot learning leverages the learned representation Φϕ from phase 1 and the f W, θ) from phase 2 to build a (posterior) predictive model for unseen probabilistic model p(W, new examples using examples from the small dataset D.

2.2. Probabilistic approach to concept modelling and k-shot learning Given the dataflow and assumed probabilistic model, a completely probabilistic approach would involve the following steps. In the concept learning phase, the initial dataset would be used to e The k-shot learning form the posterior distribution over the concept hyperparameters p(θ | D). e phase combines the information about the new weights provided by D with the information in the k-shot dataset D to form the posterior distribution e ∝ p(W | D) e p(W | D, D)

Y n

e = p(yn | xn , W) where p(W | D)

Z

e p(W | θ)p(θ | D)dθ.

(1)

Finally, in the k-shot testing phase predictions for the new labels are found by averaging the softmax outputs over the posterior distribution of the softmax weights given by the two datasets, e = p(y | x , D, D) ∗



Z

e p(y ∗ | x∗ , W)p(W | D, D)dW.

(2)

The approach is intractable and requires approximation. The main challenge is computing the posterior distribution over the hyper-parameters given the initial dataset. However, progress can be made if we assume that the posterior distribution over the weights can be well approximated f | D) f −W f MAP ). This is an arguably justifiable assumption e ≈ δ(W by the MAP value p(W as the initial dataset is large and so the posterior will concentrate on narrow modes (with f MAP ) due to the structure of the e ≈ p(θ | W similar predictive performance). In this case p(θ | D) e in equations (1) and (2) can be replaced probabilistic model and consequently all instances of D MAP f by the analogous expressions involving W without additional approximation. This greatly simplifies the learning pipeline as the probabilistic modelling only needs to have access to the weights returned by the representational learning. Remaining intractabilities involve only a small number of data points D and can be handled using standard approximate inference tools.

2.3. Summary of the full learning pipeline for probabilistic k-shot learning Phase 1: Representational learning. Deep learning is used to train a CNN. The representation of input images at the last hidden layer, x = Φϕ (u), is used in subsequent phases. The final layer softmax weights are assumed to f MAP . be MAP estimates W

Phase 2: Concept learning. f MAP ) ∝ p(θ)p(W f MAP |θ). For A probabilistic model is fit directly to the MAP weights p(θ | W f MAP ) ≈ conjugate models a full posterior can be retained, otherwise a MAP estimate p(θ | W δ(θ − θMAP ) is used.

Phase 3: k-shot learning. f MAP ) ∝ p(W | W f MAP ) QN p(yn |xn , W) The posterior distribution over the new softmax weights p(W | D, W n=1 is generally intractable. The posterior can, however, be approximated using the MAP estimate f MAP ) ≈ δ(W − WMAP ) or through sampling Wm ∼ p(W | D, W f MAP ). Note that p(W | D, W

4

R

f MAP )dθ is analytic for conjugate models and, if instead a MAP f MAP ) = p(W|θ)p(θ | W p(W | W f MAP ) ≈ p(W|θ MAP ). estimate for θ is provided by the concept modelling stage, then p(W | W

Phase 4: k-shot testing. R f MAP )dW. f MAP ) = p(y ∗ | x∗ , W)p(W | D, W Approximate inference is used to compute p(y ∗ | x∗ , D, W f MAP ) ≈ p(y ∗ | x∗ , WMAP ). If the k-shot learning phase provides a MAP estimate of W then p(y ∗ | x∗ , D, W 1 PM ∗ ∗ MAP ∗ ∗ f If samples are returned then p(y | x , D, W ) ≈ M m=1 p(y | x , Wm ).

2.4. Probabilistic models for the training weights

The model over the weights is key: a good model will transfer useful knowledge that improves performance. However, the usual trade-off between model complexity and learnability is parf are high-dimensional and the number ticularly egregious in our setting where the weights W of classes is small. With an eye on simplicity, the models considered in this paper make two assumptions. First, treating the weights from the hidden layer to one of the softmax outputs as a vector, we assume independence. Second, we assume the distribution between the weights of old and new classes to be identical, that is, f W, θ) = p(θ) p(W,

e C Y

c0 =1

e c0 |θ) p(w

C Y

c=1

dist

e c0 |θ) = p(wc |θ). p(wc |θ) where p(w

(3)

We now describe the specific model classes and associated inference schemes used in the experiments (with the shorthand name in square parentheses). Full derivations can be found in Appendix A. (i) Multivariate Gaussian. A Gaussian model p(w|θ) = N (w|µ, Σ) with a conjugate Normal-inverse-Wishart prior p(θ) = f p(µ, Σ) = N IW(µ, Σ | µ0 , κ0 , Λ0 , ν0 ). The integrated posterior on the new weights p(W | W) is a multivariate Student t-distribution [Gauss (integr. prior)]. Alternatively, MAP solutions θMAP = {µMAP , ΣMAP } can be estimated [Gauss (MAP prior)]. Isotropic [iso] and correlated covariances are considered to test whether modelling dependencies between the dimensions of the weight vectors yields better performance. Biases have their own hyperparameters in this model and the ones that follow, but we drop them from the notation for readability. (ii) Mixture of multivariate Gaussians. A Gaussian mixture model can potentially leverage cluster structure in the weights (animal classes might have similar weights, for example). This is related to the tree-based prior proposed in [5]. MAP inference is performed as exact inference is intractable. Similarly to the Gaussian case, different structures for the covariance of each cluster were tested. In our experiments, we fit the parameters of the GMM via maximum likelihood using the EM algorithm. Both 3 [GMM (3)] and 10 [GMM (10)] clusters were considered. For the CIFAR-100 dataset, in which the classes are grouped into superclasses, we fit a Gaussian model to the weight vectors belonging to each superclass, and equally distribute the mass across the clusters [GMM (supercl.)]. (iii) Laplace prior. Sparsity is an attractive feature which could be helpful for modelling the weights e.g. if each class is well characterised by a small number of features, whilst other features are irrelevant. As such, we consider a product of independent Laplace distributions. We also considered versions with tied [Laplace (iso)] and untied [Laplace (diag)] variance parameters.

5

Approximate inference schemes In all cases the gradients of the densities w.r.t. W can be computed, enabling MAP inference in the k-shot learning phase to be efficiently performed via gradient-based optimisation using L-BFGS [12]. Alternatively, Markov Chain Monte Carlo (MCMC) sampling can be performed to approximate the associated integral. Due to the high dimensionality of the space and as gradients are available, we employ Hybrid Monte Carlo (HMC) sampling in the form of the recently proposed NUTS sampler that automatically tunes the HMC parameters (step size and number of leapfrog steps) [13]. Unless stated otherwise, we used 4500 HMC samples and discarded the first 500 samples as burn-in. For the GMMs we employed pymc3 [14] to perform MAP inference. Baselines In addition to the modelling approaches above, we compare to two baselines that do not transfer weight information from the training data. The first performs k-shot learning by nearest neighbour classification in the hidden layer activation space with cosine metric [Nearest Neighbour]. The second performs logistic regression via maximum likelihood estimation [Logistic Regression].

3. Related work There has been a flurry of recent activity in the area of active learning since the inspirational work of Lake et al. [1]. In this section we present and discuss prior work related to the new approach. Embedding methods map the k-shot training and test points into a non-linear space and perform classification by assessing which training points are closest, according to a metric, to the test points. Siamese Networks [2] train the embedding using a same/different prediction task derived from the original dataset and use a weighted L1 metric for classification. Matching Networks [3] construct a set of k-shot learning tasks from the original dataset to train an embedding defined through an attention mechanism that linearly combines training labels weighted by their proximity to test points. More recently, Prototypical Networks [4] are a streamlined version of Matching Networks in which embedded classes are summarised by their mean in the embedding space. These embedding methods learn representations for k-shot learning, but do not directly leverage concept transfer. Amortised optimisation methods [11] also simulate related k-shot learning tasks from the initial dataset, but instead train a second network to initialise and optimise a CNN to perform accurate classification on these small datasets. This method can then be applied for new k-shot tasks. Importantly, both embedding and amortized optimisation methods improve when the system is trained for a specific k-shot task: to perform well in 5-shot learning, training is carried out with episodes containing 5 examples in each class. The approach proposed in this paper is more flexible; it is not tailored for a specific k and so it is more robust to variability in k (over time or across classes) and does not require retraining. Deep probabilistic methods include the approach developed in this paper. The work most closely related to our own is not an approach to k-shot learning per se, but rather a method for training CNNs with highly imbalanced classes [5]. It is similar in that it trains a form of Gaussian mixture model over the final layer weights using MAP inference that regularises learning. Our work considers a different setting and explores a broader range of probabilistic models and inference schemes. Burgess et al. [6] propose an elegant approach to k-shot learning that is an instance of the framework described here: a Gaussian model is fit to the weights with MAP inference. The evaluation is promising, but limited. One of the goals in this paper is to perform a comprehensive evaluation.

6

4. Experiments The models and inference schemes are tested on a 5-way classification task, with 1, 5 and 10 k-shot examples for CIFAR-100 [9] and miniImageNet [3]. We first detail general information about the datasets and the network, before turning to the specific experiments. Datasets CIFAR-100 consists of 100 classes each with 500 training and 100 test images of size 32×32. The classes are grouped into 20 superclasses with 5 classes each. Unless otherwise stated, we used a random split into 80 base classes and 20 k-shot learning classes, see Appendix B.1. miniImageNet has become a standard testbed for k-shot learning and is derived from the ImageNet ILSVRC12 dataset [15] by extracting 100 out of the 1000 classes. Each class contains 600 images downscaled to 84 × 84 pixels. We use the 100 classes (80 train/validation, 20 test) proposed in [11], which we split into 500 training and 100 test samples. All our splits and code will be published. Representational learning with CNNs We employ standard CNNs that are inspired by the VGG architecture [16] for the representational learning on the C base classes, cf. Phase 1 in Section 2.1. These trained networks then f and the fixed feature representation Φϕ for the k-shot learning and testing. provide both W Details on the architecture and training can be found in Appendix B.4.

4.1. Analysis of the models on held-out training weights First, we analyse how well the different prior models for the new softmax weights are able to f We randomly excluded 10 of those weights and evaluated their fit the Ce training weights W. held-out negative log likelihood given the remaining C − 10 weights. We emphasise that this approach also constitutes a principled way to set the hyperparameters of the prior and, critically, relies on an explicit probabilistic model. We used this procedure to fix the hyperprior variances for our experiments on CIFAR-100 and miniImageNet. The negative log probabilities are averaged over 50 random splits and results of best optimised values w.r.t. hyperparameters are shown in Table 1 for CIFAR-100 (lower is better). We find that multivariate Gaussian models generally outperform other models but that isotropic Gaussians and Laplace are not much worse. For the GMMs, the models with component assignment set to the merged superclasses outperform the models for which the cluster assignment is optimised via EM, and models with fewer components generally perform better. We attribute the superior performance of the simpler models to the small number of data points (C − 10 = 70 training weights) and the high dimensionality of the space, which entail that fitting even simple models is difficult. The strong performance of the isotropic Gaussian begs the question of whether a logistic regression trained with a correctly chosen regularisation parameter would perform equally well, as this regularisation essentially corresponds to adding a Gaussian prior on the weights. We found that, in terms of accuracies, the Gaussian MAP model performs as well as regularising with the optimal choice of regularisation parameter that was selected using an oracle validation set (see Fig. 6 in the Supplement for full details). However, in real scenarios this oracle information is not available: the k-shot training set is small and separating off a validation dataset is deleterious.

7

Model

Optimised held-out neg. log prob.

Gauss (isotropic) Gauss (integr. prior) GMM super (iso) GMM super (full) GMM 3-means (iso) GMM 3-means (diag) GMM 10-means iso Laplace (diag) −2,000

−1759 ± 3 −2006 ± 4 −1309 ± 4 −1426 ± 5 −1090 ± 8 −1000 ± 12 −928 ± 13 −1766 ± 5

0

−1,000

Table 1: Held-out negative log probabilities on random 70/10-splits of the training weights on CIFAR-100 (lower is better). Values are optimised w.r.t. the hyperprior variances and averaged over 50 splits. See Table 6 in the Supplement for an extended version.

Log Likelihood

0.7

400

0.6

500

1.0 Proportion of times correct

Accuracy

600

0.5

700 0.4 1

5 k-shot

10

Gauss (iso) MAP Gauss (MAP prior) MAP Gauss (MAP prior) HMC

1

5 k-shot

Gauss (integr. prior) MAP Gauss (integr. prior) HMC GMM (supercl.) MAP

10

Calibration curve (5-shot)

0.5

0.0 0.0

GMM (3, iso) MAP Laplace (diag) MAP Laplace (diag) HMC

0.5 Predicted probability

1.0

Nearest Neighbour Logistic Regression

Figure 2: Results for 5-way classification on CIFAR-100 averaged over 20 splits and 10 restarts with different training images. Our methods outperform the baselines in terms of accuracy and test log-likelihood, and lead to better calibrated classifiers.

4.2. CIFAR-100 Random set Accuracies are measured on a 5-way classification task on the k-shot classes for k ∈ {1, 5, 10}. The results in Fig. 2 were averaged two-fold: (i) 20 random splits of the 5 k-shot classes; (ii) 10 repetitions of each split with different k-shot training examples. Our methods outperform both baseline methods. Among our models, no statistically significant difference in accuracy is observed. The only exception is the Laplace MAP, which consistently underperforms; however, Laplace HMC attains the same accuracy as our other methods. In terms of test log-likelihood, the improvement relative to the baseline is even larger. We do not plot the baselines as they are orders of magnitude worse. Finally, we show calibration curves for the methods that provide probabilities over the output classes. Calibration in the context of deep learning is discussed in [17]. The construction of the curves is described in Appendix D. Our methods are generally well calibrated, with Gaussian models generally better than Laplace models, whereas logistic regression is poorly calibrated. Interestingly, both GMM approaches are not able to outperform the other, simpler models. This

8

Accuracy

Log Likelihood

1.0 Proportion of times correct

2000 2100

0.25

2200

0.20

2300 0.15 2400 1

5 k-shot

10

Gauss (iso) MAP Gauss (MAP prior) MAP Gauss (MAP prior) HMC

1

5 k-shot

Gauss (integr. prior) MAP Gauss (integr. prior) HMC GMM (supercl.) MAP

10

Calibration curve (5-shot)

0.5

0.0 0.0

GMM (3, iso) MAP Laplace (diag) MAP Laplace (diag) HMC

0.5 Predicted probability

1.0

Nearest Neighbour Logistic Regression

Figure 3: Adversarial split: The super classes “aquatic mammals” and “fish” were removed from training of the CNN and 10-way k-shot learning is performed on the 10 held-out classes. Results are averaged over 10 restarts with different training images. is in line with the previous observation that the simpler models are better able to explain the weights, cf. Section 4.1. Again, we attribute this inability of mixture models to use their larger expressivity to the small number of data points and the high-dimensionality of weight-space which means learning even simple models is difficult. These observations suggest the use of mixture models in this type of k-shot learning framework is not beneficial and is in contrast to the approach of Srivastava and Salakhutdinov [5] who employ a tree-structured mixture model. The authors show that learning the parameters of the mixture outperforms using a naïve initialization, but do not compare against a simpler model. Adversarial set Similar experiments are performed on 10-way classification but with the left-out k-shot classes selected adversarially, see Fig. 3. We observed a considerable drop in accuracy compared to the random set (even when accounting for the different number of classes), which indicates that the feature extractor Φϕ learns features less relevant for classifying the new examples. The trend singled out on the random set remains the same: our methods still outperform the baselines in terms of accuracy, log likelihood, and calibration. Interestingly, the models are less well calibrated, with HMC variants generally doing better. This is also true for the log likelihoods for the Gauss (MAP prior) HMC and Laplace HMC. This provides evidence that considering the uncertainty in the new weights is beneficial for prediction, especially when this uncertainty is high as the new classes are less like those encountered during phase 1. Adversarial k-short learning is briefly discussed in [3].

4.3. mini ImageNet The results for 5-way classification are presented in Fig. 4 and are in line with our observations on CIFAR: All methods (with the exception of Laplace MAP) outperform the baselines in terms of accuracy and log likelihoods and are also better calibrated. Again, most methods display a very similar performance. Table 2 shows that our probabilistic models are competitive with the current state-of-the-art Prototypical Networks on miniImageNet splits from [11] and outperform many of the other methods. Importantly, our method is not trained for a specific k, and therefore provides a general approach not suffering from the large variation in performance as k varies displayed by Prototypical Networks.

9

Accuracy

0.6

500

0.5

600

0.4

700 1

5 k-shot

Log Likelihood

400

10

1

Gauss (iso) MAP Gauss (MAP prior) MAP Gauss (MAP prior) HMC

1.0 Proportion of times correct

0.7

5 k-shot

10

Gauss (integr. prior) MAP Gauss (integr. prior) HMC

Calibration curve (5-shot)

0.5

0.0 0.0

0.5 Predicted probability

Laplace (diag) MAP Laplace (diag) HMC

1.0

Nearest Neighbour Logistic Regression

0.5 Accuracy on all classes

0.6

Accuracy on old classes

Accuracy on new classes

0.4

0.3 0.4

0.3

0.2

0.2

0.2

0.1

0.0

0.0

0.1 0.0

1

5 k-shot

10

Gauss (iso) MAP Gauss (MAP prior) MAP

1

5 k-shot

10

1

Gauss (MAP prior) HMC Gauss (integr. prior) MAP

5

Proportion of times correct

Figure 4: Results for 5-way classification on miniImageNet averaged over 20 splits and 10 restarts with different images. Our methods outperform the baselines in term of accuracy and test log-likelihood, but most display similar performances. The obtained classifiers are also better calibrated.

10

Gauss (integr. prior) HMC Laplace (diag) MAP

1.0Calibration curve (5-shot) 0.5 0.0 0.0 0.5 1.0 Predicted probability Laplace (diag) HMC Logistic Regression

Figure 5: Classification on all 100 classes (80 old and 20 k-shot). Contrary to the baseline, our methods display limited forgetting of the old clases after k-shot learning. There is a trade-off between forgetting on the old classes and learning on the new ones, with Gauss (MAP prior) HMC displaying a good performance in both scenarios. Evaluation on all 100 classes The issue of catastrophic forgetting [18] is a well known problem when performing k-shot learning. Whilst not the focus of this work, we evaluate the performance on all 100 classes in a 20-way 5-shot learning setting for miniImageNet. During k-shot learning and testing we employ a softmax which includes both the new and the old weights resulting in a total of 100 weight vectors. f MAP for While the k-shot weights were modelled probabilistically, we use the map estimate W the old weights. Accuracies are reported in Fig. 5 for i) all the 100 classes, ii) the old 80 classes only, and iii) the new 20 classes only. There is a trade-off between accuracy on the old and on the new classes. However, all probabilistic methods lead to limited forgetting, whereas logistic regression leads to forgetting about the old classes completely (0% accuracy). Gauss (MAP prior) HMC provides a good trade-off between performance on old and new classes, as well as overall performance over the 100 classes. It achieves 47.0 ± 0.1% accuracy over all classes in the 5-shot setting, against an upper-bound of 58.2% achieved by the same CNN architecture trained on all data for all 100 classes. These trade-off properties of our models are encouraging and

10

Method

1-shot

Matching Networks [3]* Matching Networks FCE [3]* Meta-Learner LSTM [11] Prototypical Nets (1-shot) [4] Prototypical Nets (5-shot) [4]

43.4 43.6 43.4 49.4 45.1

Gauss (MAP pr.) HMC (ours)

± ± ± ± ±

0.8% 0.8% 0.8% 0.8% 0.8%

50.0% ± 0.5%

5-shot 51.1 55.3 60.6 65.4 68.2

± ± ± ± ±

0.7% 0.7% 0.7% 0.7% 0.7%

64.3% ± 0.6%

Table 2: Accuracy on 5-way classification on miniImageNet. Our best method, Gauss (MAP prior) HMC, is competitive with Prototypical Networks on 1-shot learning and outperforms many other methods on 5-shot learning. *Results reported by [11]. open a wide array of interesting research directions, also towards online learning across classes. More precisely, one could use similar methods to extend a previously trained neural network to new classes in an online fashion, leveraging the features and information learned on the original classes.

5. Conclusion This paper developed a probabilistic framework for k-shot learning that exploits the powerful features and class information learned by a neural network on a large training dataset. Probabilistic models are used to transfer information in the network weights to new classes. Experiments on CIFAR-100 and miniImageNet show that simple models are competitive with other state-of-theart methods and yield well-calibrated classifiers. Moreover, we show that probabilistic inference can be beneficial when the k-shot classes are selected in an adversarial fashion, rendering the trained feature extractor less able to extract meaningful representations. Preliminary results hint that the presented probabilistic framework may help avoid catastrophic forgetting by automatically balancing performance on the new and old classes. In contrast to previous approaches, this new approach is flexible and extensible, being applicable to general discriminative models and k-shot learning paradigms. It automatically handles additional k-shot examples without requiring retraining and extends to situations where the number of classes grows over time. Extensions in future work could also leverage original networks that have a large collection of output layers (e.g. sets of softmax layers that perform different 5-way classifications) to develop more sophisticated and more accurate priors for concept transfer, as well as more powerful classifiers for k-shot learning, such as Bayesian neural networks. Acknowledgements MB gratefully acknowledges funding by a Qualcomm European Research Studentship. RET thanks EPSRC grants EP/M0269571 and EP/L000776/1 as well as Google for funding.

References [1] Brenden Lake, Ruslan Salakhutdinov, and Joshua Tenenbaum. “Human-level concept learning through probabilistic program induction”. In: Science 350.6266 (2015). [2] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition”. In: Deep Learning workshop, International Conference of Machine Learning (2015).

11

[3] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. “Matching networks for one shot learning”. In: Advances in Neural Information Processing Systems. 2016. [4] Jake Snell, Kevin Swersky, and Richard Zemel. “Prototypical Networks for Few-shot Learning”. In: arXiv e-print: 1703.05175 (2017). [5] Nitish Srivastava and Ruslan R Salakhutdinov. “Discriminative transfer learning with treebased priors”. In: Advances in Neural Information Processing Systems. 2013. [6] Jordan Burgess, James Robert Lloyd, and Zoubin Ghahramani. “One-Shot Learning in Discriminative Neural Networks”. In: NIPS Bayesian Deep Learning workshop (2016). [7] Juergen Schmidhuber. “Deep learning in neural networks: An overview”. In: Neural Networks 61 (2015). [8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012. [9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech. rep. 2009. [10] Zoubin Ghahramani. “Probabilistic machine learning and artificial intelligence”. In: Nature 521.7553 (2015). [11] Sachin Ravi and Hugo Larochelle. “Optimization as a model for few-shot learning”. In: International Conference on Learning Representations. Vol. 1. 2. 2017. [12] Dong C Liu and Jorge Nocedal. “On the limited memory BFGS method for large scale optimization”. In: Mathematical programming 45.1 (1989). [13] Matthew D Hoffman and Andrew Gelman. “The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo.” In: Journal of Machine Learning Research 15.1 (2014). [14] John Salvatier, Thomas. Wiecki, and Christopher Fonnesbeck. “Probabilistic programming in Python using PyMC3.” In: PeerJ Computer Science 2 (2016). [15] Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C Berg, and Li Fei-Fei. “Detecting avocados to zucchinis: what have we done, and where are we going?” In: Proceedings of the IEEE International Conference on Computer Vision. 2013. [16] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv e-print:1409.1556 (2014). [17] Patrick McClure and Nikolaus Kriegeskorte. “Representation of uncertainty in deep neural networks through sampling”. In: arXiv e-print:1611.01639 (2016). [18] Robert M French. “Catastrophic forgetting in connectionist networks”. In: Trends in cognitive sciences 3.4 (1999).

12

Supplement to: “Discriminative k-shot learning using probabilistic models” g A. Models for p(W | W)

A.1. Gaussian model

f as a Gaussian distribution: Possibly the simplest approach consists on modeling p(W | W) f = p(W | W)

Z

f N (W | µ, Σ)p(µ, Σ | W)dµdΣ.

(4)

Details for this section can be found in [1]. The normal-inverse-Wishart distribution for µ and Σ is a conjugate prior for the Gaussian, which allows for the posterior to be written in closed form. More precisely, p(µ, Σ) = N IW(µ, Σ | µ0 , κ0 , Λ0 , ν0 ) κ0 1 1 −1 t −1 = |Σ|−(ν0 +p)/2+1 e− 2 tr(Λ0 Σ )− 2 (µ−µ0 ) Σ (µ−µ0 ) , Z

(5)

f also follows a normal-inversewhere Z is the normalising constant. The posterior p(µ, Σ | W) Wishart distribution: f = N IW(µ, Σ | µ , κ , Λ , ν ), p(µ, Σ | W) (6) e N e N e N e N

where

µNe =

κ0

e κ0 + N e κNe = κ0 + N

µ0 +

ΛNe = Λ0 + S + e, νNe = ν0 + N

e N

e κ0 + N

e κ0 N

e κ0 + N

f W

f − µ0 )(W f − µ0 ) t (W

f and S is the sample covariance of W. For this model, we can integrate (4) in closed form, which results in the following multivariate Student t-distribution: f = tν p(W | W)

e −p+1

N

!

ΛNe (κNe + 1) µNe , . κNe (νNe − p + 1)

As with other approaches, one can also compute the MAP solutions for the mean µM AP and f = N (W | µM AP , ΣM AP ). covariance ΣM AP , such that p(W | W) f depends on the hyperFor both the analytic posterior and the MAP approximation, p(W | W) parameters of the normal-inverse-Wishart distribution: µ0 , ν0 , κ0 and Λ0 .

A.2. Mixture of Gaussians (GMM) f as a mixture of A natural extension to the Gaussian model consists in modelling p(W | W) Gaussians with S components:

13

f = p(W | W) P

Z

S X

s=1

!

f πs N (W | µs , Σs ) p(µ1 , . . . , µS , Σ1 , . . . , ΣS | W)dµ 1 . . . dµS dΣ1 . . . dΣS , (7)

where Ss=1 πs = 1. In this work, we only compute the MAP mean and covariance for each of the clusters, as opposed to averaging over the parameters of the mixture. The resulting posterior is f = p(W | W)

S X

s=1

πs N (W | µM AP,s , ΣM AP,s ).

(8)

The components of the mixture are fit in two ways. For CIFAR-100, the classes are grouped into 20 superclasses, each containing 5 of the 100 classes. One option is therefore to initialize 20 components, each fit with the data points in the corresponding superclass. For each such individual Gaussian, the MAP inference method presented in the previous section can be used. In order to increase the number of weight examples in each superclass, we merge the original superclasses into 9 larger superclasses. The merging of the superclasses is the following: • Aquatic mammals + fish • flowers + fruit and vegetables + trees • insects + non-insect invertebrates + reptiles • medium-sized mammals + small mammals • large carnivores + large omnivores and herbivores • people • large man-made outdoor things + large natural outdoor things • food containers + household electrical devices + household furniture • Vehicles 1 + Vehicles 2. The parameters of the mixture can also be fit using maximum likelihood with EM. We use the implementation of EM in scikit-learn. Both 3 and 10 clusters are considered, both in CIFAR-100 and miniImageNet. Note that similarly to the Gaussian model, we consider isotropic, diagonal or full covariance models for the covariance matrices.

A.3. Laplace distribution The last model we consider is a Laplace distribution. We do so to analyse a potential effect of enforcing sparse weight vectors, which is motivated by the fact that some classes may use only relevant features for classification, and discard the rest. Moreover, we observe empirically that the weights seem to be more peaked around their mean than a Gaussian, which justifies a model concentrating more mass around the mean. We consider a prior which factors along the feature dimensions:   e f p C X Y 1 | W − µ | ij j f | {µj }, {λj }) = . p(W exp − 2λ λ j j i j

14

where the product over j is along the feature dimensions and the sum over i is across the classes. We fit the parameters µ and λ via maximum likelihood: f ij ) µMLE,j = mediani (W 1 X f λMLE,j = |Wij − µj |, N i

such that f = p(W | W)

p Y j

1

2λMLE,j

exp −

C X |Wij − µMLE,j | i

λMLE,j

!

.

An isotropic Laplace model with mean µ and scale λ is also considered:

where

f | µ, λ) = 1 exp − p(W 2λ

P

ij

!

f ij − µ| |W , λ

f µMLE = median(W) 1 X f λMLE = |Wij − µ|, N p ij

B. Training and evaluation procedure details The experiments are performed on two color image datasets CIFAR-100 [2] and miniImageNet [3].

B.1. CIFAR-100 CIFAR-100 consists of 100 classes grouped into 20 superclasses with 5 classes each. For example, the superclass "fish" contains the classes aquarium fish, flatfish, ray, shark, and trout. Each class has 500 training and 100 test colour images of size 32 × 32. For k-shot learning and testing, we split the 100 classes into 80 base classes used for network training and 20 k-shot learning classes. classes_base = [ 0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 13, 14, 15, 16, 17, 18, 19, 21, 22, 24, 25, 27, 28, 32, 34, 35, 36, 38, 40, 42, 43, 44, 45, 46, 48, 49, 50, 51, 52, 53, 54, 55, 56, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69, 70, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 ] classes_heldout = [ 8, 11, 12, 20, 23, 26, 29, 30, 31, 33, 37, 39, 41, 47, 57, 68, 71, 72, 81, 84 ] Moreover, we consider adversarial splits, for which we leave out entire superclasses from the base classes for the network training.

15

B.2. mini ImageNet To construct miniImageNet we use the 100 ImageNet ILSVRC12 [4] classes proposed in [5] and select 600 images based on their lists. It is split into 64 training classes, 16 validation classes and 20 test classes. As our method does not require validation classes, we use the 64 + 16 classes for the training of our network. The images are rescaled to 84 × 84 pixels. The complete list of classes and images we used will be provided together with the code to replicate the experiments.

B.3. k-shot learning procedure For the 5-way k-shot learning we randomly sample N = 5 classes out of the 20 test classes and k examples per class from the 500 training part of those classes. For each randomly selected set of N = 5 classes we performed k-shot learning 10 times with different training examples from the same class. We repeat this procedure for 20 random splits. To evaluate the accuracies, test-log likelihoods, and calibration curves we used the 100 examples provided in the test part of each class. Given the trained CNN and the 5 test classes with k support points, we use last layer weights of the 80 training classes to infer respective weights for the 5 new test classes. The inferred weights are used to perform 5-way classification on the sampled test data points. The reported accuracies are averaged across splits and data samples.

B.4. Network architecture and training Network for miniImageNet

Network for CIFAR-100 Output size

Layers

32 × 32 × 3 16 × 16 × 64 8 × 8 × 64 4 × 4 × 128 2 × 2 × 128 2 × 2 × 128 256 256 128 Ce

Input patch 2× (Conv2D, ELU), Pool 2× (Conv2D, ELU), Pool 2× (Conv2D, ELU), Pool 2× (Conv2D, ELU), Pool Dropout (0.5) FullyConnected, ELU Dropout (0.5) FullyConnected, ELU FullyConnected, SoftMax

Output size 84 × 84 × 3 42 × 42 × 32 21 × 21 × 64 11 × 11 × 128 6 × 6 × 128 3 × 3 × 128 3 × 3 × 128 512 512 256 Ce

Layers Input patch (Conv2D, ELU), Pool (Conv2D, ELU), Pool (Conv2D, ELU), Pool (Conv2D, ELU), Pool (Conv2D, ELU), Pool Dropout (0.5) FullyConnected, ELU Dropout (0.5) FullyConnected, ELU FullyConnected, SoftMax

2× 2× 2× 2× 2×

Table 3: Network architectures. All 2D convolutions have kernel size 3 × 3 and padding SAME; max-pooling is performed with stride 2. The output of the shaded layer corresponds to Φϕ (u), the feature space representation of the image u, which is used as input for probabilistic k-shot learning The network architecture was inspired by the VGG networks [6]. We do not employ batch normalisation [7] as the network training distribution and k-shot learning distribution will be different. To speed up training, we employ exponential linear units (ELUs), which have been reported to lead to faster convergence as compared to ordinary ReLUs [8]. To regularise the networks, we employ dropout [9] and regularisation of the weights in the fully connected layers. The networks are trained with the ADAM optimiser [10] with decaying learning rate.

16

B.5. Method descriptions Table 4 and Table 5 show descriptions of the methods analysed for respectively phase 2 (Table 1) and phase 3 (Fig. 2 and Fig. 4). Phase 2 Prior distribution

Method Gauss (iso) Gauss (MAP prior) Gauss (integr. prior) GMM (supercl.) GMM (3, iso) GMM (3, diag) GMM (10, iso) Laplace (diag)

Gaussian isotropic covariance Gaussian isotropic covariance Gaussian full covariance GMM on superclasses iso. cov. GMM on 3 clusters iso. cov. GMM on 3 clusters diagonal cov. GMM on 10 clusters iso. cov. Laplace diagonal covariance

Inference MAP MAP Integrated MAP MLE MLE MLE MLE

Table 4: Description of the inference for the parameters of the prior in phase 2 for the models in from Table 1.

Method Gauss (iso) MAP Gauss (MAP prior) MAP Gauss (MAP prior) HMC Gauss (integr. prior) MAP Gauss (integr. prior) HMC GMM (supercl.) MAP GMM (3, iso) MAP Laplace (diag) HMC Laplace (diag) MAP

Phase 3 Prior distribution

Inference

Gaussian Gaussian Gaussian Gaussian Gaussian GMM on superclasses GMM on 3 isotropic comp. Laplace (diagonal) Laplace (diagonal)

MAP MAP HMC MAP HMC MAP MAP HMC MAP

Table 5: Methods and inference procedure during phase 3 for the models used in Fig. 2 and Fig. 4.

C. Analysis of the model on held-out training weights In Table 6 we present an extended version of Table 1 from the main text.

D. Calibration curves In order to construct the curves, we bin the interval [0, 1] into 10 equally sized bins. For each of them, we count both how many of the output probabilities of the classifier are contained in the bin, as well as how many of these assignments lead to the correct prediction. A well calibrated classifier is such that examples to which it assigns a probability of 42% to belonging to a given class is right on 42% of those examples, and so forth.

17

Model

Optimised value of mean negative log probability

Gauss (iso) Gauss (MAP prior) Gauss (integr. prior) GMM super (iso) GMM super (diag) GMM super (full) GMM 3-means (iso) GMM 3-means (diag) GMM 10-means iso GMM 10-means (diag) Laplace (iso) Laplace (diag) −2,000

−1,000

−1759 ± 3 −1961 ± 5 −2006 ± 4 −1309 ± 4 −1396 ± 5 −1425 ± 5 −1090 ± 8 −1000 ± 12 −928 ± 13 −481 ± 25 −1738 ± 4 −1766 ± 5

0

Table 6: Held-out log probabilities on random 70/10-splits of the training weights for the different models on CIFAR-100. Values are averaged over 50 splits.

E. Regularised Logistic Regression We report accuracy of regularised logistic regression in the CIFAR-100 dataset as a function of the regularisation parameter, see Fig. 6. The best achieved performance is on par or slightly superior to the Gaussian MAP model, at the expense of no principled way to fit the regularisation parameter during training. The hyperprior parameters in the probabilistic model are set by minimisation of the held-out log probabilities on the training weight as detailed in Section 4.1 in the main text. 0.50

Accuracy

0.45

Regularised Logistic Regression Gauss MAP (MAP Prior)

0.40 0.35 0.30 −6 10

10−5 10−4 10−3 10−2 Regularisation Parameter

10−1

Figure 6: Comparison of regularised logistic regression with our Gaussian MAP on 20-way 5shot learning on CIFAR-100. Regularised logistic regression outperforms our method slightly with the right choice of regularisation parameter; however, compared to our methods, there is no principled way to determine a good value without a validation set.

18

References [1] Kevin Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. [2] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech. rep. 2009. [3] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. “Matching networks for one shot learning”. In: Advances in Neural Information Processing Systems. 2016. [4] Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C Berg, and Li Fei-Fei. “Detecting avocados to zucchinis: what have we done, and where are we going?” In: Proceedings of the IEEE International Conference on Computer Vision. 2013. [5] Sachin Ravi and Hugo Larochelle. “Optimization as a model for few-shot learning”. In: International Conference on Learning Representations. Vol. 1. 2. 2017. [6] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv e-print:1409.1556 (2014). [7] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv e-print:1502.03167 (2015). [8] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. “Fast and accurate deep network learning by exponential linear units (elus)”. In: arXiv e-print:1511.07289 (2015). [9] Yarin Gal and Zoubin Ghahramani. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning”. In: International Conference on Machine Learning. 2016. [10] Diederik Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv e-print:1412.6980 (2014).

19

Suggest Documents