Transfer Learning Algorithms for Image Classification

by

Ariadna Quattoni

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2009.

© Ariadna Quattoni, MMIX. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author: Department of Electrical Engineering and Computer Science, May 22, 2009

Certified by: Michael Collins, Associate Professor, Thesis Supervisor

Certified by: Trevor Darrell, Associate Professor, Thesis Supervisor

Accepted by: Terry P. Orlando, Chairman, Department Committee on Graduate Students

Transfer Learning Algorithms for Image Classification
by Ariadna Quattoni

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 2009, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Abstract

An ideal image classifier should be able to exploit complex high dimensional feature representations even when only a few labeled examples are available for training. To achieve this goal we develop transfer learning algorithms that: 1) leverage unlabeled data annotated with meta-data and 2) exploit labeled data from related categories.

In the first part of this thesis we show how to use the structure learning framework (Ando and Zhang, 2005) to learn efficient image representations from unlabeled images annotated with meta-data. In the second part we present a joint sparsity transfer algorithm for image classification. Our algorithm is based on the observation that related categories might be learnable using only a small subset of shared relevant features. To find these features we propose to train classifiers jointly with a shared regularization penalty that minimizes the total number of features involved in the approximation. To solve the joint sparse approximation problem we develop an optimization algorithm whose time and memory complexity is O(n log n), with n being the number of parameters of the joint model. We conduct experiments on news-topic and keyword prediction image classification tasks. We test our method in two settings, transfer learning and multitask learning, and show that in both cases leveraging knowledge from related categories can improve performance when training data per category is scarce. Furthermore, our results demonstrate that our model can successfully recover jointly sparse solutions.

Thesis Supervisor: Michael Collins
Title: Associate Professor

Thesis Supervisor: Trevor Darrell
Title: Associate Professor


Acknowledgments

I would like to thank my two advisors, Professor Michael Collins and Professor Trevor Darrell, for their advice. They have both given me a great deal of independence in pursuing my ideas and contributed towards their development. My gratitude also goes to my thesis reader, Professor Antonio Torralba. I would also like to thank my dearest friend Paul Nemirovsky for introducing me to computer science and giving me constant encouragement throughout my years at MIT. Finally, a very special thanks goes to Xavier Carreras, who has contributed to the ideas of this thesis and who has given me unconditional support.


Contents

1 Introduction
  1.1 Problem: Transfer Learning for Image Classification
    1.1.1 Image Classification
    1.1.2 Transfer Learning
  1.2 Thesis Contributions
  1.3 Outline of the thesis

2 Background
  2.1 Notation
  2.2 Empirical Risk Minimization
  2.3 Linear Classifiers and Regularization
    2.3.1 Finding Sparse Solutions
  2.4 Optimization: Projected SubGradient Methods

3 Previous Work
  3.1 Notation
  3.2 A brief overview of transfer learning
  3.3 Learning Hidden Representations
    3.3.1 Transfer Learning with Neural Networks
    3.3.2 Structural Learning
    3.3.3 Transfer by Learning A Distance Metric
  3.4 Feature Sharing
    3.4.1 Enforcing Parameter Similarity by Minimizing The Euclidean Distance between Parameter Vectors
    3.4.2 Sharing Parameters in a Class Taxonomy
    3.4.3 Clustering Tasks
    3.4.4 Sharing A Feature Filter
    3.4.5 Feature Sharing using l1,2 Regularization
    3.4.6 Sharing Features Via Joint Boosting
  3.5 Hierarchical Bayesian Learning
    3.5.1 Transfer Learning with Hidden Features and Shared Priors
    3.5.2 Sharing a Prior Covariance
    3.5.3 Learning Shape And Appearance Priors For Object Recognition
  3.6 Related Work Summary

4 Learning Image Representations Using Images with Captions
  4.1 Introduction
  4.2 Learning Visual Representations
    4.2.1 Learning Visual Representations from Auxiliary Tasks
    4.2.2 Metadata-derived auxiliary problems
  4.3 Examples Illustrating the Approach
  4.4 Reuters Image Dataset
  4.5 Experiments on Images with Captions
    4.5.1 Data
  4.6 Image Representation
    4.6.1 The Baseline Model
    4.6.2 The Data-SVD Model
    4.6.3 A Model with Predictive Structure
    4.6.4 The Word-Classifiers Model
    4.6.5 Cross-Validation of Parameters
    4.6.6 Results
  4.7 Chapter Summary

5 Transfer Learning for Image Classification with Sparse Prototype Representations
  5.1 Introduction
  5.2 Learning a sparse prototype representation from unlabeled data and related tasks
    5.2.1 Computing the prototype representation
    5.2.2 Discovering relevant prototypes by joint sparse approximation
    5.2.3 Computing the relevant prototype representation
  5.3 Experiments
    5.3.1 Baseline Representation
    5.3.2 Raw Feature Baseline Model
    5.3.3 Low Rank Baseline Model
    5.3.4 The Sparse Prototype Transfer Model
    5.3.5 Results
  5.4 Chapter Summary

6 An Efficient Projection for l1,∞ Regularization
  6.1 Introduction
  6.2 A projected gradient method for l1,∞ regularization
    6.2.1 Constrained Convex Optimization Formulation
    6.2.2 An Application: Multitask Learning
    6.2.3 A Projected Gradient Method
  6.3 Efficient Projection onto the l1,∞ Ball
    6.3.1 Characterization of the solution
    6.3.2 An efficient projection algorithm
    6.3.3 Computation of hi
  6.4 Related Work
  6.5 Synthetic Experiments
  6.6 Image Annotation Experiments
    6.6.1 Evaluation and Significance Testing
    6.6.2 Results
  6.7 Scene Recognition Experiments
  6.8 Chapter Summary

7 Conclusion
  7.1 Thesis Contributions
  7.2 Discussion
    7.2.1 Representations and Sparsity
    7.2.2 Differences between feature sharing and learning hidden representations approach
  7.3 Future Work

Chapter 1

Introduction

An ideal image classifier should be able to exploit complex high dimensional feature representations even when only a few labeled examples are available for training. Learning with small training sets is important because there are many real world applications where only a few labeled examples might be available. For example, when building a system that classifies images in a personal photo collection based on user defined categories, an appealing application would ask the user to label only a handful of images. As the photo collection application illustrates, there is a pressing need for “sample efficient” learning algorithms.

The photo collection application also illustrates another point: we might have only a few labeled examples for a given task, but plenty of labeled data for hundreds of related tasks might be available. Ideally, an image classifier should be able to exploit all of these resources, just as humans can exploit previous experience when learning some concepts. For example, when a child learns to recognize a new letter of the alphabet he will use examples provided by people with different handwriting styles, using pens of different colors and thicknesses. Without any prior knowledge a child would need to consider a large set of features as potentially relevant for learning the new concept, so we would expect the child to need a large number of examples. But if the child has previously learnt to recognize other letters, he can probably discern the relevant attributes (e.g. number of lines, line curvatures) from the irrelevant ones (e.g. the color of the lines) and learn the new concept with a few examples.

This observation suggests that a “sample efficient” image classification algorithm might need to exploit knowledge gained from related tasks. Otherwise, in the absence of prior knowledge, a typical image classification task would require the exploration of a large and complex feature space. In such high dimensional spaces some features will be discriminative, but most probably a large number of them will be irrelevant. If we knew which features were irrelevant we might be able to learn a concept from fewer examples. While we do not know a priori what features are irrelevant, related tasks might share irrelevant features, and we can use training data from these tasks to discover them.

The goal of this thesis is to develop efficient transfer learning algorithms for image classification that can exploit rich feature representations. Our emphasis is on minimizing the amount of supervised training data necessary to train such classifiers by discovering shared representations among tasks.

1.1 Problem: Transfer Learning for Image Classification

1.1.1 Image Classification

In an image classification task our goal is to learn a mapping from images x to class labels y. In general we will be considering binary classification tasks, and thus y ∈ {+1, −1}. For example, in the news domain setting we might define a class for every news story. In this case a task consists of predicting whether an image belongs to a particular story or not. Similarly, we can consider a caption word to be a class and predict whether a given word is a proper annotation for an image.

Because our goal is to develop an approach that can handle a wide range of image classification tasks, we work with general, rich feature representations (i.e. not tailored or engineered for a particular problem) and delegate the task of finding useful features to the training algorithm. In other words, our philosophy is to use a flexible and general feature extraction process which generates a large number of features, and to let the training algorithm discover which features are important in a given domain.

1.1.2 Transfer Learning

We now introduce some notation that will be useful in setting up the transfer learning problem. We assume that we have a collection of tasks and a set of supervised training samples for each of them. Our data is of the form $D = \{T_1, T_2, \ldots, T_m\}$, where $T_k = \{(x_1^k, y_1^k), (x_2^k, y_2^k), \ldots, (x_{n_k}^k, y_{n_k}^k)\}$ is the training set for task $k$. Each supervised training sample consists of some input point $x \in \mathbb{R}^d$ and its corresponding label $y \in \{+1, -1\}$. We will usually refer to the individual dimensions of $x$ as features.

Broadly speaking, a transfer algorithm will exploit commonality among tasks to learn better classifiers. The transfer algorithm achieves this by training classifiers on $D$ jointly; the meaning of joint training depends on the transfer learning framework. In a feature sharing framework (Evgeniou and Pontil, 2004; Jacob et al., 2008; Jebara, 2004; Obozinski et al., 2006; Torralba et al., 2006) the relatedness assumption is that tasks share relevant features. In a learning hidden representations framework (Thrun, 1996; Ando and Zhang, 2005; Argyriou et al., 2006; Amit et al., 2007) the relatedness assumption is that there exists a mapping from the original input space to an underlying shared feature representation. In a hierarchical Bayesian learning framework (Bakker and Heskes, 2003; Raina et al., 2006; Fei-Fei et al., 2006) tasks are assumed to be related by means of sharing a common prior distribution over classifier parameters. The work that we present in chapter 4 is an instance of learning hidden representations, and the transfer algorithm presented in chapters 5 and 6 is an instance of feature sharing.

We consider two transfer learning settings, which we call symmetric transfer and asymmetric transfer (Thrun, 1996). In a symmetric transfer setting there are a few training samples for each task and the goal is to share information across tasks in order to improve the average performance over all tasks. We also refer to this setting as multitask learning. In contrast, in an asymmetric transfer setting there is a set of tasks for which a large amount of supervised training data is available; we call these tasks auxiliary (Thrun, 1996). The goal is to use training data from the auxiliary tasks to improve the classification performance of a target task $T_0$ (Thrun, 1996) for which training data is scarce.

It has been shown (Ando and Zhang, 2005) that asymmetric transfer algorithms can be used for semi-supervised learning by automatically deriving auxiliary training sets from unlabeled data. The auxiliary tasks are designed so that they can be useful in uncovering important latent structures. In chapter 4 we will show a semi-supervised application where we learn an image representation using unlabeled images annotated with some form of meta-data that is used to derive auxiliary tasks.
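To make the data layout concrete, the following minimal sketch (Python with numpy; the list-of-pairs representation and all names are illustrative assumptions, not code from this thesis) builds a collection D for both the symmetric and the asymmetric setting:

```python
import numpy as np

# A task T_k is a list of (x, y) pairs with x in R^d and y in {+1, -1};
# the collection D is simply a list of such tasks.
d = 50
rng = np.random.default_rng(0)

def make_task(n_samples):
    # Random placeholder data standing in for image features and labels.
    X = rng.normal(size=(n_samples, d))
    y = rng.choice([-1, +1], size=n_samples)
    return list(zip(X, y))

# Symmetric (multitask) setting: a few examples for every task.
D_symmetric = [make_task(10) for _ in range(5)]

# Asymmetric setting: large auxiliary tasks plus a scarce target task T_0.
D_auxiliary = [make_task(1000) for _ in range(5)]
T_0 = make_task(8)

print(len(D_symmetric), len(D_auxiliary), len(T_0))
```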

1.2 Thesis Contributions

In this thesis we study efficient transfer algorithms for image classification. These algorithms can exploit rich (i.e. high dimensional) feature spaces and are designed to minimize the amount of supervised training data necessary for learning. In the first part of this work we apply the structure learning framework of Ando and Zhang (2005) to a semi-supervised image classification setting. In the second part we present a transfer learning model for image classification based on joint regularization.

• Learning Image Representations using Unlabeled Data annotated with Meta-Data: Current methods for learning visual categories work well when a large amount of labeled data is available, but can run into severe difficulties when the number of labeled examples is small. When labeled data is scarce it may be beneficial to use unlabeled data to learn an image representation that is low-dimensional, but nevertheless captures the information required to discriminate between image categories. We present a semi-supervised image classification application of asymmetric transfer where the auxiliary tasks are derived automatically using unlabeled data annotated with meta-data. In particular, we consider a setting where we have thousands of images with associated captions, but few images annotated with story labels. We take the prediction of content words from the captions to be our auxiliary tasks and the prediction of a story label to be our target task. Our objective is to use the auxiliary tasks to learn a lower dimensional representation that still captures the relevant information necessary to discriminate between different stories. To this end we applied the structure learning framework of Ando and Zhang, which is based on using data from the auxiliary tasks to learn a subspace of the original input space that is discriminative for all auxiliary tasks. Projecting into this space gives us a new image representation that is useful for learning target classifiers. Our results indicate that the representations learned via structure learning on the auxiliary tasks can improve the performance of the target story classifiers. In particular, we show that the induced representation can be used to minimize the amount of supervised target training data necessary for learning accurate image classifiers. Our experiments show that our method significantly outperforms a fully-supervised baseline model and a model that ignores the captions and learns a visual representation by performing PCA on the unlabeled images alone. In brief, we show that when meta-data labels are suitably related to a target task, the structure learning method can discover feature groupings that speed learning of the target task. Our current work concentrated on captions as the source of meta-data, but more generally other types of meta-data could be used.

• A Transfer Algorithm based on l1,∞ Joint Regularization: In the second part of this thesis we develop a joint regularization transfer algorithm for image classification. Our algorithm is based on the observation that related tasks might be learnable using only a small subset of shared relevant features. To find these features we propose to train tasks jointly using a shared regularization penalty. The shared regularization penalty is used in our model to induce solutions where only a few shared features are used by any of the classifiers. Previous approaches to joint sparse approximation (Obozinski et al., 2006; Torralba et al., 2006) have relied on greedy coordinate descent methods. In contrast, our objective function can be expressed as a linear program, and thus a globally optimal solution can be found with an off-the-shelf package. We test our algorithm on an image classification asymmetric transfer learning problem. The classification problem consists of predicting topic labels for images. We assume that we have enough training data for a set of auxiliary topics but less data for a target topic. Using data from the auxiliary topics we select a set of discriminative features that we utilize to train a classifier for the target topic.

• An Efficient Training Algorithm for l1,∞ Regularization: While the training algorithm for joint regularization described in chapter 5 is feasible for small problems, it becomes impractical for high dimensional feature spaces. In chapter 6 we address this problem and develop a time and memory efficient general optimization algorithm for training l1,∞ regularized models. The algorithm has O(n log n) time and memory complexity, with n being the number of parameters of the joint model. This cost is comparable to the cost of training m independent sparse classifiers. Thus our work provides a tool that makes implementing a joint sparsity regularization penalty as easy and almost as efficient as implementing the standard l1 and l2 penalties. For our final set of experiments we consider a symmetric transfer learning setting where each task consists of predicting a keyword for an image. Our results show that l1,∞ regularization leads to better performance than both l1 and l2 regularization and that it is effective in discovering jointly sparse solutions in high dimensional spaces.

1.3 Outline of the thesis

The structure of the thesis is as follows: chapter 2 provides the necessary background on linear prediction models and regularization. Chapter 3 reviews existing work on general transfer algorithms and transfer algorithms for image classification. Chapter 4 describes our work on learning image representations from unlabeled data annotated with meta-data. Chapter 5 presents our joint regularization transfer algorithm based on l1,∞ regularization. Chapter 6 develops a more general, efficient training algorithm for l1,∞ regularization. Finally, in chapter 7 we draw conclusions and discuss future lines of research.

Chapter 2

Background

In this chapter we provide the general machine learning background necessary to understand the remaining chapters. Section 2.2 describes the supervised learning setting that will be assumed throughout the thesis and the empirical risk minimization framework. This section shows that, to ensure that a classifier will generalize to unseen samples, one must control the complexity of the class of functions explored by the training algorithm. Section 2.3 describes the setting of linear classification together with a regularization framework designed to control the complexity of a function class. The joint regularization model that we present in chapter 5 is an instantiation of this regularization framework in the transfer learning setting. Finally, in Section 2.4 we describe a general optimization framework for training regularized models. While multiple algorithms have been developed for this purpose, we focus our attention on primal projected gradient methods. The optimization algorithm for joint regularization presented in chapter 6 follows this approach.

2.1 Notation

We will use uppercase letters to denote matrices and lowercase bold letters to denote vectors; for example, we write $a_k$ for the $k$-th column of matrix $A$ and $a^k$ for the $k$-th row. Sets will be represented using uppercase italic letters.

We write $\|w\|_p$ to indicate the $p$-norm of $w \in \mathbb{R}^d$, that is:

$$\|w\|_p = \Big(\sum_{i=1}^{d} |w_i|^p\Big)^{1/p}$$

For example, $\|w\|_2$ is used to denote the Euclidean norm and $\|w\|_1$ is used to denote the $l_1$ norm. We write $\mathrm{Loss}(f(x), y)$ for a loss function and, when useful, we adopt the notation $f_w(x)$ to indicate that the function $f$ is parameterized by $w$.

2.2 Empirical Risk Minimization

In a supervised learning framework the goal is to learn a mapping $h: \mathcal{X} \to \mathcal{Y}$ from some input space $\mathcal{X}$ to labels $\mathcal{Y}$. In this chapter we consider binary classification tasks, i.e. $\mathcal{Y} = \{+1, -1\}$. To learn this mapping, the training algorithm is given a set of $n$ pairs $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x \in \mathbb{R}^d$ is a training pattern and $y \in \mathcal{Y}$ is its corresponding label. It is common to assume that each labeled example is drawn independently at random from some fixed probability distribution $p(x, y)$.

Consider learning a function from a fixed hypothesis class $\mathcal{H}$. For example, $\mathcal{H}$ could be the class of linear prediction models of the form $h(x) = w \cdot x$, for some parameter vector $w \in \mathbb{R}^d$. A learning algorithm takes a training set $D$ as input and finds a hypothesis $h \in \mathcal{H}$ that will have small error on future samples drawn from $p(x, y)$. More formally, assume that we have a loss function $\mathrm{Loss}(h(x), y)$ that returns the cost of making prediction $h(x)$ when the true label for the sample is $y$. Some popular classification loss functions are:

- The zero-one loss: $\mathrm{Loss}_{0\text{-}1}(h(x), y) = 1$ if $\mathrm{sign}(h(x)) \neq y$ and $0$ otherwise
- The hinge loss: $\mathrm{Loss}_{hinge}(h(x), y) = \max(0, 1 - h(x)\,y)$
- The exponential loss: $\mathrm{Loss}_{exp}(h(x), y) = \exp(-h(x)\,y)$
- The logistic loss: $\mathrm{Loss}_{log}(h(x), y) = \ln(1 + \exp(-h(x)\,y))$

We now define the expected risk of a hypothesis $h$ to be:

$$R(h) = E[\mathrm{Loss}(h(x), y)] \tag{2.1}$$

where $E$ denotes the expectation over random variables $(x, y)$ drawn from the distribution $p(x, y)$. Ideally, we would like to find the hypothesis with minimum expected risk:

$$h^* = \operatorname{argmin}_{h \in \mathcal{H}} E[\mathrm{Loss}(h(x), y)] \tag{2.2}$$

Since $p(x, y)$ is unknown the expected risk cannot be minimized directly; instead we find a hypothesis that minimizes the empirical risk on $D$:

$$\hat{h} = \operatorname{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(h(x_i), y_i) \tag{2.3}$$

For this strategy to work we need to make sure that, given a large enough training set, the empirical risk will be a good estimate of the expected risk. Consider a $[0, 1]$-bounded real-valued loss function $\mathrm{Loss}(h(x), y)$; this is a random variable with expected value $R(h) = E[\mathrm{Loss}(h(x), y)]$. Since it is a bounded random variable, we can use Hoeffding's inequality to bound the deviation of its empirical mean from its expected value. By doing that we obtain the following result (e.g. see Anthony and Bartlett (1999)):

$$\forall h \in \mathcal{H} \quad P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(h(x_i), y_i) - R(h)\Big| \geq \epsilon\Big) \leq 2\exp(-2\epsilon^2 n) \tag{2.4}$$

Thus the probability that the empirical loss of a function $h$ deviates from its expected loss by $\epsilon$ or more decays exponentially with the size of the training set. Recall that the expected risk we wish to approach is the expected loss of the best hypothesis in $\mathcal{H}$. Therefore, to guarantee convergence to the expected risk it suffices to ensure that there is no hypothesis $h \in \mathcal{H}$ such that the difference between its empirical loss and its expected loss is greater than $\epsilon$. That is, we must bound the following:

$$P\Big(\max_{h \in \mathcal{H}} \Big|\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(h(x_i), y_i) - R(h)\Big| \geq \epsilon\Big) \tag{2.5}$$

$$= P\Big(\bigcup_{h \in \mathcal{H}} \Big\{\Big|\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(h(x_i), y_i) - R(h)\Big| \geq \epsilon\Big\}\Big) \tag{2.6}$$

By combining (2.4) with a union bound we can state that for a finite hypothesis class $\mathcal{H}$ (e.g. see Anthony and Bartlett (1999)):

$$P\Big(\max_{h \in \mathcal{H}} \Big|\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}(h(x_i), y_i) - R(h)\Big| > \epsilon\Big) \leq 2|\mathcal{H}|\exp(-2\epsilon^2 n) \tag{2.7}$$

Notice that the complexity of a hypothesis class (i.e. $|\mathcal{H}|$) is closely related to the speed of convergence of the empirical risk to the expected risk. The difference between the empirical risk and the expected risk is usually referred to as the estimation or generalization error. Similar results linking the complexity of a hypothesis class $\mathcal{H}$ to the generalization error can be derived for infinite hypothesis classes using combinatorial measures of complexity. In particular, for the class of linear classifiers the generalization error can be bounded in terms of the norm of the parameter vector $w$ (Zhang, 2002).

The bound (2.7) suggests that to obtain a small generalization error it is best to search over a small hypothesis class. However, to ensure that there is an $h \in \mathcal{H}$ with small expected risk it is best to consider a large hypothesis class, i.e. we want the approximation error to be small. The optimal complexity of $\mathcal{H}$ should be chosen to balance the tradeoff between approximation and generalization error. In the next section we describe a regularization technique designed with this goal in mind.
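As a concrete illustration of the definitions above, here is a small sketch (Python with numpy; the function names are ours, not from the thesis) that evaluates the four losses and the empirical risk (2.3) of a linear hypothesis h(x) = w · x:

```python
import numpy as np

def loss_01(margin):     # margin = h(x) * y
    return (margin <= 0).astype(float)   # sign(h(x)) != y  <=>  y*h(x) <= 0

def loss_hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def loss_exp(margin):
    return np.exp(-margin)

def loss_log(margin):
    return np.log1p(np.exp(-margin))

def empirical_risk(w, X, y, loss):
    # (1/n) sum_i Loss(h(x_i), y_i) for the linear hypothesis h(x) = w . x
    margins = (X @ w) * y
    return loss(margins).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)

for loss in (loss_01, loss_hinge, loss_exp, loss_log):
    print(loss.__name__, empirical_risk(w_true, X, y, loss))
```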

2.3 Linear Classifiers and Regularization

In this section we consider linear classifiers of the form $h(x) = w \cdot \phi(x)$, for some $w \in \mathbb{R}^d$. This type of classifier is linear in an arbitrary function $\phi(x)$ of the input, but for notational simplicity we will refer to $\phi(x)$ as $x$.

In a regularization framework the tradeoff between approximation and generalization error is achieved by introducing a complexity penalty. More precisely, one minimizes a penalized version of the empirical risk:

$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(w \cdot x_i, y_i) + \gamma \Phi(w) \tag{2.8}$$

Notice that this formulation is the same as (2.3) except that we have added a regularization penalty term $\Phi(w)$. Generally speaking, this term measures the complexity of a function $h$, and the constant $\gamma$ in (2.8) controls the tradeoff between generalization and approximation error (Kearns and Vazirani, 1994). In the case of linear classification the two most commonly used penalty terms are:

- $\Phi_{l_2}(w) = \sum_{j=1}^{d} w_j^2$
- $\Phi_{l_1}(w) = \sum_{j=1}^{d} |w_j|$

To give a concrete example, notice that by combining an $l_2$ penalty with a hinge loss we get the well known SVM primal objective (Vapnik, 1995):

$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}_{hinge}(w \cdot x_i, y_i) + \gamma \sum_{j=1}^{d} w_j^2 \tag{2.9}$$

When the regularization penalty can be expressed as a convex constraint on the parameters $w$, the regularization problem can be formulated as a constrained convex optimization:

$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(w \cdot x_i, y_i) \quad : \quad w \in \Omega \tag{2.10}$$

where $\Omega$ is some convex set. For example, for the SVM objective we would get the corresponding convex constrained formulation:

$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(w \cdot x_i, y_i) \quad : \quad \sum_{j=1}^{d} w_j^2 \leq C \tag{2.11}$$

Similarly, an $l_1$ regularized problem can be expressed as:

$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}(w \cdot x_i, y_i) \quad : \quad \sum_{j=1}^{d} |w_j| \leq C \tag{2.12}$$

Here $C$ plays a role analogous to that of $\gamma$ in the previous formulation in that it controls the tradeoff between approximation and generalization error. Both $l_2$ and $l_1$ regularized linear models have been widely studied by the learning theory community (Kakade et al., 2008). The main results show that it is possible to derive generalization bounds for these models where the generalization error can be expressed as a function of $C$.

So far we have refrained from making any computational considerations; that is, we have not considered how a training algorithm would optimize (2.8) and (2.10). Section 2.4 addresses this by describing a general optimization method for (2.10).
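To make the two formulations concrete, the following sketch (Python with numpy; illustrative, not from the thesis) evaluates the penalized SVM objective (2.9) and checks the feasibility constraint of the corresponding constrained form (2.11):

```python
import numpy as np

def svm_primal_objective(w, X, y, gamma):
    # Penalized form (2.9): mean hinge loss + gamma * ||w||_2^2.
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).mean()
    return hinge + gamma * (w ** 2).sum()

def feasible_l2(w, C):
    # Constrained form (2.11): is w inside the convex set sum_j w_j^2 <= C?
    return (w ** 2).sum() <= C

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.choice([-1, 1], size=50)
w = 0.1 * rng.normal(size=8)

print(svm_primal_objective(w, X, y, gamma=0.1))
print(feasible_l2(w, C=1.0))
```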

2.3.1 Finding Sparse Solutions

Until now we have reviewed structural risk minimization formulations based on convex regularization penalties. However, when the goal is to obtain sparse solutions (i.e. solutions with few non-zero parameters), a natural penalty to consider is the $l_0$ penalty given by:

$$\|w\|_0 = |\{j : w_j \neq 0\}| \tag{2.13}$$

While this penalty seems to be the most natural choice to induce sparsity, it has the drawback that finding an exact solution becomes a hard combinatorial problem. In general, solving for a given $l_0$ norm is known to be an NP-hard problem (Donoho, 2004). To address this computational challenge there exist two main sparse approximation approaches:

- Greedy Schemes: This approach is based on iterative optimization algorithms that add one feature at a time. At iteration $t$ the algorithm adds to the model the feature that most reduces some form of residual loss. Broadly speaking, the residual loss at iteration $t$ is what cannot be predicted from the model with $t-1$ features.¹
- Convex Relaxation: These approaches use the convex $l_1$ penalty to approximate the $l_0$ penalty. It has been shown that under certain conditions the $l_1$ convex relaxation is equivalent to the $l_0$ penalty (Donoho, 2004).²

While approximation guarantees have been proven for some greedy algorithms under certain conditions (Tropp, 2004), proving such guarantees is in general not a trivial task. In contrast, for the $l_1$ convex relaxation provably optimal solutions can be easily obtained by standard polynomial-time optimization algorithms (Tropp, 2006b). In other words, by using a convex formulation of the sparse approximation problem we can take advantage of the vast amount of results in convex optimization theory. In chapter 5 we will follow this approach and propose a transfer algorithm based on a convex relaxation of a joint sparsity inducing regularization penalty. Using standard methods in convex optimization, we develop in chapter 6 an efficient training algorithm that can scale to large numbers of examples and dimensions.

¹ Multiple greedy schemes have been proposed by various communities under different names (statistics: forward stagewise regression; approximation theory: greedy algorithms; learning theory: boosting methods; signal processing: projection pursuit methods).
² By equivalent we mean that the optimal solution to the $l_1$ regularized problem will be a solution with minimum $l_0$ penalty.
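The following sketch illustrates the convex relaxation approach on a toy problem (Python with numpy; we use a squared loss and the standard ISTA proximal gradient scheme for concreteness, a choice of ours rather than anything prescribed by the thesis). The soft-thresholding step is what produces exact zeros:

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau*||.||_1: shrinks coordinates toward zero and
    # sets small ones exactly to zero -- the mechanism by which the l1
    # relaxation yields sparse solutions.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, gamma, n_iters=500):
    # Minimize (1/2n)||Xw - y||^2 + gamma*||w||_1 by proximal gradient (ISTA).
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1/L for the smooth part
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * gamma)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                  # only 3 relevant features
y = X @ w_true + 0.01 * rng.normal(size=100)

w_hat = lasso_ista(X, y, gamma=0.1)
print("non-zero coordinates:", np.flatnonzero(np.round(w_hat, 3)))
```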

2.4 Optimization: Projected SubGradient Methods

There are multiple primal and dual methods for optimizing a convex function subject to convex constraints. In this section we focus on primal projected subgradient methods because they are the most relevant to this work. A subgradient of a function $f$ at a point $x$ is any vector $g$ satisfying $f(y) \geq f(x) + g^T(y - x)$ for all $y$.

Projected subgradient methods have recently been revived in the machine learning community for solving classification and regression problems involving large numbers of features. For some of these large scale problems the alternative dual interior point methods impose computational and memory demands that make them impractical. Furthermore, while the convergence rates of gradient based methods are known to be relatively slow, in practice the approximate solutions obtained after a few iterations are good enough for most classification and regression applications (Shalev-Shwartz and Srebro, 2008).

Broadly speaking, projected subgradient methods alternate between unconstrained subgradient updates and projections onto the convex feasible set. For example, in the case of $l_2$ regularization (2.11) the convex set would be an $l_2$ ball of radius $C$; similarly, for $l_1$ regularization (2.12) the convex set would be an $l_1$ ball of radius $C$.

More formally, a projected subgradient method is an algorithm for minimizing a convex function $f(w)$ subject to convex constraints of the form $w \in \Omega$, where $\Omega$ is a convex set (Bertsekas, 1999). For the regularization problems described in the previous section, $f(w)$ is some convex loss function, $w$ is a parameter vector, and $\Omega$ is the set of all vectors satisfying a convex constraint penalty, i.e. $r(w) \leq C$. A projected subgradient algorithm works by generating a sequence of solutions $w^t$ via $w^{t+1} = P_\Omega(w^t - \eta_t \nabla f(w^t))$. Here $\nabla f(w^t)$ is a subgradient of $f$ at $w^t$ and $P_\Omega(w)$ is the Euclidean projection of $w$ onto $\Omega$, given by:

$$\min_{w' \in \Omega} \|w' - w\|_2 = \min_{w' \in \Omega} \sum_{j} (w'_j - w_j)^2 \tag{2.14}$$

Finally, $\eta_t$ is the learning rate at iteration $t$ that controls the amount by which the solution changes at each iteration. The following theorem shows the convergence properties of the projected subgradient method (Boyd and Mutapcic, 2007). Define $w^* = \operatorname{argmin}_{w \in \Omega} f(w)$ and $w^t_{best} = \operatorname{argmin}_{w^i,\, 1 \leq i \leq t} f(w^i)$, where the $w^i$ are the parameter vectors found by the projected subgradient method after each iteration.

Lemma 1. Assume a convex set $\Omega$ and constants $R$ and $G$ such that:

$$\|w^1 - w^*\|_2 \leq R \tag{2.15}$$

and that for all $w \in \Omega$:

$$\|\nabla f(w)\|_2 \leq G \tag{2.16}$$

Then $w^t_{best}$ satisfies:

$$f(w^t_{best}) - f(w^*) \leq \frac{R^2 + G^2 \sum_{i=1}^{t} \eta_i^2}{2 \sum_{i=1}^{t} \eta_i} \tag{2.17}$$

Proof: Let $z^{t+1} = w^t - \eta_t \nabla f(w^t)$ be the subgradient update before the projection step. We have that:

$$\|z^{t+1} - w^*\|_2^2 = \|w^t - \eta_t \nabla f(w^t) - w^*\|_2^2 = \|w^t - w^*\|_2^2 - 2\eta_t \nabla f(w^t)^T (w^t - w^*) + \eta_t^2 \|\nabla f(w^t)\|_2^2 \leq \|w^t - w^*\|_2^2 - 2\eta_t (f(w^t) - f(w^*)) + \eta_t^2 \|\nabla f(w^t)\|_2^2$$

where the last step follows directly from the definition of the subgradient, which gives us $\nabla f(w^t)^T (w^t - w^*) \geq f(w^t) - f(w^*)$. Notice that the projection $P_\Omega$ can only move us closer to an optimal point $w^*$:

$$\|w^{t+1} - w^*\|_2 = \|P_\Omega(z^{t+1}) - w^*\|_2 \leq \|z^{t+1} - w^*\|_2 \tag{2.18}$$

Combining the inequality above with (2.18) we obtain:

$$\|w^{t+1} - w^*\|_2^2 \leq \|w^t - w^*\|_2^2 - 2\eta_t (f(w^t) - f(w^*)) + \eta_t^2 \|\nabla f(w^t)\|_2^2 \tag{2.19}$$

Applying the above inequality recursively we get:

$$\|w^{t+1} - w^*\|_2^2 \leq \|w^1 - w^*\|_2^2 - 2\sum_{i=1}^{t} \eta_i (f(w^i) - f(w^*)) + \sum_{i=1}^{t} \eta_i^2 \|\nabla f(w^i)\|_2^2 \tag{2.20}$$

and therefore:

$$2\sum_{i=1}^{t} \eta_i (f(w^i) - f(w^*)) \leq R^2 + \sum_{i=1}^{t} \eta_i^2 \|\nabla f(w^i)\|_2^2 \tag{2.21}$$

Combining this with:

$$\sum_{i=1}^{t} \eta_i (f(w^i) - f(w^*)) \geq \Big(\sum_{i=1}^{t} \eta_i\Big) \min_i (f(w^i) - f(w^*)) = \Big(\sum_{i=1}^{t} \eta_i\Big) (f(w^t_{best}) - f(w^*)) \tag{2.22}$$

we get (2.17). $\square$

If we set the learning rate to $\eta_i = \frac{1}{\sqrt{i}}$ the above theorem gives us:

$$f(w^t_{best}) - f(w^*) \leq \frac{R^2 + G^2 \sum_{i=1}^{t} \frac{1}{i}}{2 \sum_{i=1}^{t} \frac{1}{\sqrt{i}}} \leq \frac{R^2 + G^2 \log(t+1)}{2\sqrt{t}} \tag{2.23}$$

The last inequality follows from two facts. First, we use the upper bound $\sum_{i=1}^{t} \frac{1}{i} \leq \log(t+1)$, which we prove by induction: for $t = 1$ it follows trivially, and for the inductive step, for $t > 1$, we need to show that $\log(t) + \frac{1}{t} \leq \log(t+1)$, which is easy to see by exponentiating both sides. Second, we use the lower bound $\sum_{i=1}^{t} \frac{1}{\sqrt{i}} \geq \sqrt{t}$, which we also prove by induction: for $t = 1$ it follows trivially, and for the inductive step we want to show that $\sqrt{t-1} + \frac{1}{\sqrt{t}} \geq \sqrt{t}$. Squaring both sides and rearranging terms, this reduces to $\frac{2\sqrt{t-1}}{\sqrt{t}} + \frac{1}{t} \geq 1$, which holds because $\frac{2\sqrt{t-1}}{\sqrt{t}} \geq 1$ for all $t \geq 2$.

Consider the computational cost of each iteration of the projected subgradient method. Each iteration involves two steps. In the first step we must compute a subgradient of the objective function, $\nabla f(w^t)$; for the hinge loss this can be done in $O(nd)$ time and memory, where $n$ is the total number of examples and $d$ is the dimensionality of $w$. In the second step we must compute the projection onto the convex set, $P_\Omega(w)$. Continuing with the SVM example, for the $l_2$ penalty computing the projection is trivial to implement and can be done in $O(d)$ time and memory.

The projected subgradient method has been applied to multiple regularized classification problems. Shalev-Shwartz et al. (2007) developed an online projected gradient method for $l_2$ regularization. Duchi et al. (2008) proposed an analogous algorithm for $l_1$, showing that computing projections onto the $l_1$ ball can be done in $O(d \log d)$ time. The results in (Shalev-Shwartz et al., 2007; Duchi et al., 2008) show that for large scale optimization problems involving $l_1$ and $l_2$ regularization, projected gradient methods can be significantly faster than state-of-the-art interior point solvers.
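The following sketch puts the pieces of this section together (Python with numpy; an illustrative implementation, not the thesis's code): a projected subgradient loop for the mean hinge loss with learning rate 1/√t, the trivial l2-ball projection, and the O(d log d) sort-based l1-ball projection in the style of Duchi et al. (2008):

```python
import numpy as np

def project_l2(w, C):
    # Euclidean projection onto {w : ||w||_2^2 <= C}: rescale if outside.
    norm, radius = np.linalg.norm(w), np.sqrt(C)
    return w if norm <= radius else w * (radius / norm)

def project_l1(w, C):
    # Projection onto {w : ||w||_1 <= C} in O(d log d) via sorting,
    # following the sort-based algorithm analyzed by Duchi et al. (2008).
    if np.abs(w).sum() <= C:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - C) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (css[rho] - C) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def projected_subgradient(X, y, project, C, T=500):
    # w_{t+1} = P_Omega(w_t - eta_t * g_t) with eta_t = 1/sqrt(t), for the
    # mean hinge loss f(w) = (1/n) sum_i max(0, 1 - y_i w.x_i).
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        margins = y * (X @ w)
        active = margins < 1                    # subgradient of the hinge
        g = -(X[active] * y[active, None]).sum(axis=0) / n
        w = project(w - g / np.sqrt(t), C)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = np.sign(X @ rng.normal(size=30))
w = projected_subgradient(X, y, project_l1, C=10.0)
print("l1 norm of solution:", np.abs(w).sum())  # stays in the feasible set
```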


Chapter 3

Previous Work

In this chapter we review related work on general transfer learning algorithms as well as previous literature on transfer learning algorithms for image classification. The chapter is organized as follows: section 3.1 provides the necessary notation used throughout the chapter; section 3.2 gives a high level overview of the three main lines of research in transfer learning: learning hidden representations, feature sharing and hierarchical Bayesian learning. Each of these lines of work is described in more detail in sections 3.3, 3.4 and 3.5 respectively. Finally, section 3.6 highlights the main contributions of this thesis in the context of the previous literature.

3.1 Notation

In transfer learning we assume that we have a multitask collection of $m$ tasks and a set of supervised training samples for each of them: $D = \{T_1, T_2, \ldots, T_m\}$, where $T_k = \{(x_1^k, y_1^k), (x_2^k, y_2^k), \ldots, (x_{n_k}^k, y_{n_k}^k)\}$ is the training set for task $k$. Each supervised training sample consists of some input point $x \in \mathbb{R}^d$ and its corresponding label $y \in \{+1, -1\}$; we will usually refer to the dimensions of $x$ as features.

In a symmetric transfer setting there are a few training samples for each task and the goal is to share information across tasks to improve the average performance. We also refer to this setting as multitask learning. In asymmetric transfer there is a set of tasks for which a large amount of supervised training data is available; we call these tasks auxiliary. The goal is to use training data from the auxiliary tasks to improve the classification performance of a target task $T_0$ for which training data is scarce.

3.2 A brief overview of transfer learning

Transfer learning has had a relatively long history in machine learning. Broadly speaking, the goal of a transfer algorithm is to use supervised training data from related tasks to improve performance. This is usually achieved by training classifiers for related tasks jointly. What is meant by joint training depends on the transfer learning framework; each framework develops a particular notion of relatedness and the corresponding transfer algorithms designed to exploit it. We can distinguish three main lines of research in transfer learning: learning hidden representations, feature sharing and hierarchical Bayesian learning. The work that we present in chapter 4 is an instance of learning hidden representations, and the transfer algorithm presented in chapters 5 and 6 is an instance of feature sharing.

In learning hidden representations the relatedness assumption is that there exists a mapping from the original input space to an underlying shared feature representation. This latent representation captures the information necessary for training classifiers for all related tasks. The goal of a transfer algorithm is to simultaneously uncover the underlying shared representation and the parameters of the task-specific classifiers. For example, consider learning a set of image classifiers for predicting whether an image belongs to a particular news story or not. To achieve this goal we start with a basic low level image representation such as the responses of local filters. In a learning hidden representations framework we assume that there exists a hidden transformation that will produce a new representation that is good for training classifiers in this domain. In this example, the underlying representation would capture the semantics of an image and map semantically-equivalent features to the same high level meta-feature.

Broadly speaking, in a feature sharing framework the relatedness assumption is that tasks share relevant features. More specifically, we can differentiate two types of feature sharing approaches: parameter tying approaches and joint sparsity approaches. In the parameter tying framework we assume that optimal classifier parameters for related tasks will lie close to each other. This is a rather strong assumption since it requires related tasks to weight features similarly. To relax this assumption we could require only that the tasks share the same set of non-zero parameters. This is what is assumed in the joint sparsity framework, i.e. the relatedness assumption is that to train accurate classifiers for a set of related tasks we only need to consider a small subset of relevant features. The goal of a transfer algorithm is to simultaneously discover the subset of relevant features and the parameter values for the task-specific classifiers.

Returning to the news story prediction problem, consider using a non-parametric or kernel based image representation as our starting point. More precisely, consider a representation where every image is represented by its similarity to a large set of unlabeled images. For this example, the transfer assumption states that there is a subset of prototypical unlabeled images such that knowing the similarity of a new image to this subset would suffice for predicting its story label.

One of the main differences between learning hidden representations and feature sharing approaches is that while the first one infers hidden representations, the latter chooses shared features from a large pool of candidates. This means that a feature sharing transfer algorithm must be able to efficiently search over a large set of features. Feature sharing transfer algorithms are suitable for applications where we can generate a large set of candidate features from which to select a shared subset. Furthermore, this approach is a good fit when finding the subset of relevant features is in itself a goal. For example, in computational biology one might be interested in finding a subset of genes that are markers for a group of related diseases.

In a hierarchical Bayesian learning framework tasks are assumed to be related by means of sharing a common prior distribution over classifier parameters. This shared prior can be learned in a classical hierarchical Bayesian setting using data from related tasks. By sharing information across different tasks the prior parameter distribution can be better estimated.

3.3 Learning Hidden Representations

3.3.1 Transfer Learning with Neural Networks

One of the earliest works on transfer learning was that of Thrun (1996), who introduced the concept of asymmetric transfer. Thrun proposed a transfer algorithm that uses $D$ to learn an underlying feature representation $v(x)$. The main idea is to find a new representation where every pair of positive examples for a task will lie close to each other while every pair of positive and negative examples will lie far from each other. Let $P^k$ be the set of positive samples for the $k$-th task and $N^k$ the set of negative samples; Thrun's transfer algorithm minimizes the following objective:

$$\min_{v \in \mathcal{V}} \sum_{k=1}^{m} \Big[ \sum_{x_i \in P^k} \sum_{x_j \in P^k} \|v(x_i) - v(x_j)\|^2 - \sum_{x_i \in P^k} \sum_{x_j \in N^k} \|v(x_i) - v(x_j)\|^2 \Big] \tag{3.1}$$

where $\mathcal{V}$ is the set of transformations encoded by a two layer neural network. The transformation $v(x)$ learned from the auxiliary training data is then used to project the samples of the target task: $T_0' = \{(v(x_1), y_1), \ldots, (v(x_n), y_n)\}$. Classification for the target task is performed by running a nearest neighbor classifier in the new space.

The paper presented experiments on a small object recognition task consisting of recognizing 10 different objects. The results showed that when labeled data for the target task is scarce, the representation obtained by running their transfer algorithm on auxiliary training data could improve the classification performance of a target task.
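As an illustration, the sketch below (Python with numpy) evaluates objective (3.1) for a candidate representation; for simplicity we use a linear map as a stand-in for the two layer neural network that parameterizes the family of transformations in Thrun's work, so this is an assumption of ours, not his model:

```python
import numpy as np

def representation_loss(V, tasks):
    # Objective (3.1): for each task, pull positive pairs together and
    # push positive/negative pairs apart in the transformed space v(x).
    # Here v(x) = V @ x, a linear stand-in for the two layer network.
    total = 0.0
    for X, y in tasks:
        Z = X @ V.T
        P, N = Z[y == 1], Z[y == -1]
        pull = ((P[:, None, :] - P[None, :, :]) ** 2).sum()
        push = ((P[:, None, :] - N[None, :, :]) ** 2).sum()
        total += pull - push
    return total

rng = np.random.default_rng(0)
tasks = []
for _ in range(3):
    X = rng.normal(size=(20, 10))
    y = rng.choice([-1, 1], size=20)
    tasks.append((X, y))

V = rng.normal(size=(5, 10))  # maps R^10 to a 5-dimensional representation
print(representation_loss(V, tasks))
```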

3.3.2 Structural Learning

The ideas of Thrun were further generalized by several authors (Ando and Zhang, 2005; Argyriou et al., 2006; Amit et al., 2007). The three works that we review in the remainder of this section can all be cast under the framework of structure learning (Ando and Zhang, 2005).¹ We start by giving an overview of this framework, followed by a discussion of three particular instantiations of the approach.

¹ Note that structure learning in this sense is different from structured learning, which refers to learning in structured output domains, e.g. parsing.

In a structure learning framework one assumes the existence of task-specific parameters $w_k$ for each task and shared parameters $\theta$ that parameterize a family of underlying transformations. Both the structural parameters and the task-specific parameters are learned together via joint risk minimization on some supervised training data $D$ for $m$ related tasks.

Consider learning linear predictors of the form $h_k(x) = w_k^T v(x)$ for some $w \in \mathbb{R}^z$ and some transformation $v: \mathbb{R}^d \to \mathbb{R}^z$ in $\mathcal{V}$. In particular, let $\mathcal{V}$ be the family of linear transformations $v_\theta(x) = \theta x$, where $\theta$ is a $z$ by $d$ matrix that maps a $d$-dimensional input vector to a $z$-dimensional space.² We can now define the task-specific parameter matrix $W = [w_1, \ldots, w_m]$, where $w_k \in \mathbb{R}^z$ are the parameters for the $k$-th task and $w_{j,k}$ is the parameter value for the $j$-th hidden feature and the $k$-th task. A structure learning algorithm finds the optimal task-specific parameters $W^*$ and structural parameters $\theta^*$ by minimizing a jointly regularized empirical risk:

$$\operatorname{argmin}_{W,\theta} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}(w_k^T \theta x_i^k, y_i^k) + \gamma \Phi(W) + \lambda \Psi(\theta) \tag{3.2}$$

The first term in (3.2) measures the mean error of the $m$ classifiers by means of some loss function $\mathrm{Loss}$. The second term is a regularization penalty on the task-specific parameters $W$, and the last term is a regularization penalty on the structural parameters $\theta$. Different choices of regularization functions $\Phi(W)$ and $\Psi(\theta)$ result in different structure learning algorithms.

² For some of the approaches reviewed in this section $z < d$, and thus the transformation projects the inputs to a shared lower dimensional space. For other approaches $z = d$ and feature sharing is realized by other means.

Sharing a Low Dimensional Feature Representation

Ando and Zhang (2005) combine an $l_2$ regularization penalty on the task-specific parameters, $\sum_{k=1}^{m} \|w_k\|_2^2$, with an orthonormal constraint on the structural parameters, $\theta\theta^T = I$, resulting in the following objective:

$$\operatorname{argmin}_{W,\theta} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}(w_k^T \theta x_i^k, y_i^k) + \gamma \sum_{k=1}^{m} \|w_k\|_2^2 \tag{3.3}$$

$$\text{s.t. } \theta\theta^T = I \tag{3.4}$$
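To illustrate how the learned structural parameter is used downstream, the following sketch (Python with numpy) projects target-task samples with a z by d matrix θ satisfying the constraint θθᵀ = I in (3.4); here θ is a random orthonormal stand-in obtained by a QR factorization, not the output of Ando and Zhang's alternating minimization:

```python
import numpy as np

rng = np.random.default_rng(0)
d, z = 100, 10

# Stand-in for a learned structural parameter with orthonormal rows,
# obtained from a QR factorization rather than from training.
Q, _ = np.linalg.qr(rng.normal(size=(d, z)))
theta = Q.T                                     # z x d, theta @ theta.T = I

X_target = rng.normal(size=(30, d))             # scarce target-task samples
X_shared = X_target @ theta.T                   # new z-dimensional features

print(np.allclose(theta @ theta.T, np.eye(z)))  # orthonormality check
print(X_shared.shape)
```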

where $\theta$ is a $z$ by $d$ matrix; $z$ is assumed to be smaller than $d$ and its optimal value is found using a validation set. Thus, feature sharing in this approach is realized by mapping the high dimensional feature vector $x$ to a shared low dimensional feature space $\theta x$. Ando and Zhang proposed to minimize (3.3) subject to (3.4) using an alternating minimization procedure. Their algorithm will be described in more detail in chapter 4, where we will apply their approach to an image classification task.

In Ando and Zhang (2005) this transfer algorithm is applied in the context of asymmetric transfer, where auxiliary training sets are utilized to learn the structural parameter $\theta$. The structural parameter is then used to project the samples of the target task and train a classifier on the new space. The paper presented experiments on text categorization where the auxiliary training sets were automatically derived from unlabeled data. More precisely, the auxiliary tasks consisted of predicting frequent content words for a set of unlabeled documents. Note that, given that the auxiliary training sets are automatically derived from unlabeled data, their transfer algorithm can be regarded as a semi-supervised training algorithm. Their results showed that the semi-supervised method gave significant improvements over a baseline method that trained on the labeled data, ignoring the auxiliary training sets.

Sharing Hidden Features by a Sparse Regularization on the Latent Space

Argyriou et al. (2006) proposed an alternative model to learn shared hidden representations. In their approach the structural parameter $\theta$ is assumed to be a $d$ by $d$ matrix, i.e. the linear transformation does not map the inputs $x$ to a lower dimensional space. Instead, sharing of hidden features across tasks is realized by a regularization penalty imposed on the task-specific parameters $W$.

Consider the matrix $W = [w_1, w_2, \ldots, w_m]$ and the joint minimization problem on $D$. The row $w^j$ of $W$ corresponds to the parameter values for hidden feature $j$ across the $m$ tasks. Requiring only a few hidden features to be used by any task is equivalent to requiring only a few rows of $W$ to be non-zero. We can achieve this goal by imposing a regularization penalty on the parameter matrix $W$. Argyriou et al. proposed the use of the following matrix norm: $l_{1,2}(W) = \sum_{j=1}^{z} \|w^j\|_2$. The $l_{1,2}$ norm is known to promote row sparsity in $W$ (see section 3.4). With these ingredients, the problem of learning a few hidden features shared across tasks can be formulated as:

$$\min_{W,\theta} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}(w_k^T \theta x_i^k, y_i^k) + \gamma\, l_{1,2}(W) \tag{3.5}$$

The constant $\gamma$ controls the amount of regularization in the joint model. The authors showed that (3.5) is equivalent to a convex problem for which they developed an alternating minimization algorithm. The algorithm involves two steps: in the first step the parameters $W$ for each classifier are trained independently of each other; in the second step, for a fixed $W$, the hidden structure $\theta$ is found by solving an eigenvalue problem.

The paper presented experiments on a product rating problem where the goal is to predict ratings given by different subjects. In the context of multitask learning, predicting the ratings for a single subject can be regarded as a task. The transfer learning assumption is that predictions made by different subjects are related. The results showed that their transfer algorithm gave better performance than a baseline model where each task was trained independently with an $l_1$ penalty.

Learning Shared Structures by Finding Low Rank Parameter Matrices

Amit et al. (2007) proposed a regularization scheme for transfer learning based on a trace norm regularization penalty. Consider the following $m$ by $d$ parameter matrix $W = [w_1, w_2, \ldots, w_m]$, where each row corresponds to the parameters of one task. Using the trace norm penalty, the Amit et al. transfer algorithm minimizes the following jointly regularized objective:

$$\min_{W} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}(w_k^T x_i^k, y_i^k) + \gamma \Omega(W) \tag{3.6}$$

where $\Omega(W) = \sum_i |\gamma_i|$ and $\gamma_i$ is the $i$-th singular value of $W$. This norm is used because it is known to induce solution matrices $W$ of low rank (Srebro and Jaakkola, 2003). Recall that the rank of a $d$ by $m$ matrix $W$ is the minimum $z$ such that $W$ can be factored as $W = W'^T \theta$, for a $z$ by $m$ matrix $W'$ and a $z$ by $d$ matrix $\theta$. Notice that $\theta$ no longer appears in (3.6); this is because in this formulation of structure learning we do not search explicitly for a transformation $\theta$. Instead, we utilize the regularization penalty $\Omega(W)$ to encourage solutions where the task-specific parameters $W$ can be expressed as the combination of a few bases shared across tasks.

The optimization problem in (3.6) can be expressed as a semi-definite program and solved with an interior-point method. However, the authors argue that interior point methods scale poorly with the size of the training set, and they proposed a gradient based method that minimizes a smoothed approximation of (3.6). In addition to the primal formulation, the paper presents a kernelized version of (3.6). It is shown that although the weight vectors $w'_k$ cannot be directly retrieved from the dual solution, they can be found by solving a linear program in $m$ variables.

The authors conducted experiments on a multiclass image classification task where the goal is to distinguish between 72 classes of mammals. The performance of their transfer learning algorithm is compared to that of a baseline multiclass SVM classifier. Their results show that the trace-norm penalty can improve multiclass accuracy when only a few samples are available for training.
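The two matrix penalties reviewed in this subsection are easy to state in code. The sketch below (Python with numpy; function names are ours) computes the l1,2 norm and the trace norm, the latter via singular values:

```python
import numpy as np

def l12_norm(W):
    # l_{1,2}(W) = sum over rows j of ||w^j||_2: penalizing it drives whole
    # rows to zero, i.e. few hidden features shared across the task columns.
    return np.linalg.norm(W, axis=1).sum()

def trace_norm(W):
    # Sum of singular values; penalizing it encourages low-rank W, i.e.
    # task parameters expressible as combinations of a few shared bases.
    return np.linalg.svd(W, compute_uv=False).sum()

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 6))  # rank 2
full_rank = rng.normal(size=(10, 6))

print("trace norms:", trace_norm(low_rank), trace_norm(full_rank))
print("l1,2 norm:  ", l12_norm(low_rank))
```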

3.3.3

Transfer by Learning A Distance Metric

In Fink (2004) the authors proposed an asymmetric transfer algorithm. Similar to Thrun (1996), this algorithm learns a distance metric using auxiliary data. This distance metric is then utilized by a nearest neighbor classifier for a target task for which it is assumed that there is a single positive and negative training example. While Thrun’s transfer algorithm followed a neural network approach, Fink’s transfer 31

algorithm follows a max-margin approach. Consider learning a function d : X × X → R with the following properties: (i) d(x, x') ≥ 0, (ii) d(x, x') = d(x', x) and (iii) d(x, x') + d(x', x'') ≥ d(x, x''). A function satisfying these three properties is called a pseudo-metric. Fink's transfer algorithm learns a pseudo-metric using D. Ideally, we would like to learn a function d that assigns a smaller distance to pairs having the same label than to pairs with different labels. More precisely, we could require the difference in distance to be at least γ for every Tk ∈ D. That is, for every auxiliary task we must have:

\forall_{(x_i,1),(x_j,1)\in T_k}\ \forall_{(x_q,-1)\in T_k} \quad d(x_i, x_j) \le d(x_i, x_q) - \gamma \qquad (3.7)

In particular, if we restrict ourselves to pseudo-metrics of the form d(x_i, x_j)^2 = ||θx_i − θx_j||_2^2, we can reduce the problem of learning a pseudo-metric on D to that of learning a linear projection θ that achieves γ separation. The underlying transfer learning assumption is that a projection θ that achieves γ separation on the auxiliary tasks will most likely achieve γ separation on the target task. Therefore, if we project the target samples using θ and run a nearest neighbor classifier in the new space, we are likely to obtain good performance. For learning θ, the authors chose the online metric learning algorithm of Shalev-Shwartz et al. (2004). One of the advantages of this algorithm is that it has a dual form that allows the use of kernels; this dual version of the algorithm is the one used by the authors. The paper presented experiments on a character recognition dataset, showing that the learned pseudo-metric could significantly improve performance on the target task.
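As a minimal illustration of this pipeline (a sketch of the nearest neighbor step only, not the metric learner of Shalev-Shwartz et al.), the code below assumes a projection θ has already been learned and uses the induced pseudo-metric d(x, x') = ||θx − θx'||_2 inside a one-nearest-neighbor classifier; all names and data are hypothetical:

import numpy as np

def pseudo_metric(theta, x1, x2):
    # d(x, x') = ||theta x - theta x'||_2, induced by the projection theta.
    return np.linalg.norm(theta @ x1 - theta @ x2)

def nearest_neighbor_predict(theta, train_X, train_y, x):
    # 1-NN in the projected space; with one positive and one negative
    # example this matches the single-example target-task setting.
    dists = [pseudo_metric(theta, xi, x) for xi in train_X]
    return train_y[int(np.argmin(dists))]

rng = np.random.default_rng(1)
theta = rng.normal(size=(5, 20))       # hypothetical learned projection
train_X = rng.normal(size=(2, 20))     # one positive, one negative example
train_y = np.array([1, -1])
print(nearest_neighbor_predict(theta, train_X, train_y, rng.normal(size=20)))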

3.4 Feature Sharing

3.4.1 Enforcing Parameter Similarity by Minimizing the Euclidean Distance between Parameter Vectors

We start this section with two parameter tying transfer algorithms. Evgeniou and Pontil (2004) proposed a simple regularization scheme for transfer learning that encourages the

task-specific parameters of classifiers for related tasks to be similar to each other. More precisely, consider training m linear classifiers of the form:

h_k(x) = (v + w_k)^T x \qquad (3.8)

The parameter vector v ∈ R^d is shared by the m tasks, while the parameters W = [w1, w2, ..., wm], with w_k ∈ R^d, are specific to each task. As in structure learning, the goal of the transfer algorithm is to estimate the task-specific parameters W and the shared parameter v simultaneously using supervised training data from D. Let us define the parameter matrix W' = [(w1 + v), (w2 + v), ..., (wm + v)], where (w_k + v) ∈ R^d are the parameters for the k-th task. Notice that any matrix can be written in the above form for some offset v; thus, without imposing a regularization scheme on v and w_k, training classifiers of the form (3.8) reduces to training m independent linear classifiers. Evgeniou and Pontil proposed to minimize the following regularized joint objective:

\min_{W'} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}({w'}_k^T x_i^k, y_i^k) + \frac{\gamma}{m} \sum_{k=1}^{m} \|w_k\|_2^2 + \lambda \|v\|_2^2 \qquad (3.9)

Intuitively, the l2 regularization penalty on the shared parameter v controls the norm of the average classifier for the m tasks, while the l2 penalty on w_k controls how much the task-specific parameters (v + w_k) differ from this average classifier. The ratio between the regularization constants γ and λ determines the amount of parameter sharing enforced in the joint model. When γ/m > λ the model penalizes more strongly deviations from the average model; thus a large γ favors solutions where the parameters of each classifier w'_k are similar to each other. On the other hand, when λ > γ/m the regularization penalty will favor solutions where v is close to zero, making the task parameters more dissimilar to each other. In other words, when γ/m tends to infinity the transfer algorithm reduces to pooling all supervised data in D to train a single classifier with parameters v. On the other hand, when λ tends to infinity the transfer algorithm reduces to solving m independent tasks. When the hinge loss is used in optimization (3.9), its Lagrangian reveals that at the optimal solution W'^*:


v^* = \frac{\lambda}{(\lambda + \gamma)\,m} \sum_{k=1}^{m} w_k \qquad (3.10)

This suggests that the minimization in (3.9) can be expressed solely in terms of W'; in particular, the authors show that (3.9) is equivalent to:

\min_{W'} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}({w'}_k^T x_i^k, y_i^k) + \gamma \sum_{k=1}^{m} \|{w'}_k\|_2^2 + \frac{\lambda}{m} \sum_{k=1}^{m} \sum_{k'=1}^{m} \|{w'}_k - {w'}_{k'}\|_2^2 \qquad (3.11)

The first regularization term is the usual l2 penalty on the parameters of each classifier, while the second regularization term favors solutions where the parameters of each classifier are similar to each other. Problem (3.11) can be expressed as a quadratic program and solved with standard interior-point methods. In addition to the primal formulation, the authors present a dual version of (3.11) that enables the use of kernels. The transfer algorithm was tested on the school data from the UCI machine learning repository, where the goal is to predict exam scores of students from different schools. When modeled as a multitask problem, predicting scores for the students of a given school is regarded as a task. The authors compared their transfer algorithm with the transfer algorithm of Bakker and Heskes (2003). Their results showed that sharing information across tasks using their approach led to better performance. Notice that this algorithm makes a very strong relatedness assumption (i.e. the weights of the classifiers must be similar to each other). For this reason it is only appropriate for transfer learning applications where the tasks are closely related to each other.
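To make the parameter tying concrete, the following minimal sketch evaluates the objective in (3.9), using a squared loss for concreteness; the data is hypothetical and this is not the authors' quadratic programming solver:

import numpy as np

def joint_objective(v, W, tasks, gamma, lam):
    # Objective (3.9): task k predicts with (v + w_k); the gamma/m term
    # penalizes deviations w_k from the shared v, the lam term controls v.
    m = len(tasks)
    total = 0.0
    for k, (X, y) in enumerate(tasks):
        residual = X @ (v + W[k]) - y
        total += np.mean(residual ** 2)
    return total + (gamma / m) * np.sum(W ** 2) + lam * (v @ v)

rng = np.random.default_rng(2)
tasks = [(rng.normal(size=(30, 8)), rng.normal(size=30)) for _ in range(4)]
v, W = np.zeros(8), np.zeros((4, 8))
print(joint_objective(v, W, tasks, gamma=1.0, lam=0.1))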

3.4.2 Sharing Parameters in a Class Taxonomy

For multiclass problems where classes are organized in a hierarchical taxonomy, Cai and Hofmann proposed a max-margin approach that uses discriminant functions structured to mirror the class hierarchy. More specifically, assume a multiclass problem with m classes organized in a tree class taxonomy H = 〈G, E〉, where G is a set of nodes and E is a set of edges. The leaves

of H correspond to the m classes; for convenience we will label the leaf nodes using indexes from 1 to m. In addition, assume a weight matrix W = [w1, w2, ..., w_{|G|}], where w_n is a weight vector associated with node n ∈ G. Consider a discriminant function that measures the compatibility between an input x and a class label y ∈ {1, 2, ..., m} of the form:

f(x, y, W) = \sum_{j=1}^{|P_y|} w_{P_y^j}^T x \qquad (3.12)

where P_y^j is the j-th node on the path from the root node to node y. The discriminant function in (3.12) causes the parameters for internal nodes of the class taxonomy to be shared across classes. It is easy to see that (3.12) can be written as a standard discriminant function f(φ(x, y), W) for some mapping φ(x, y), and hence standard SVM optimization methods can be used. The paper presented results on document categorization showing the advantages of leveraging the taxonomy.
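The path-sum in (3.12) is straightforward to state in code. The sketch below, built on a hypothetical toy taxonomy, scores a class by summing the node weights along its root-to-leaf path; the weights of internal nodes are automatically shared by all classes below them:

import numpy as np

# Hypothetical taxonomy: node 0 is the root, nodes 3-6 are the m = 4
# leaf classes; parent[n] gives the parent of node n (root has parent -1).
parent = {0: -1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

def path_from_root(y):
    # Nodes on the path from the root down to leaf y.
    path, node = [], y
    while node != -1:
        path.append(node)
        node = parent[node]
    return list(reversed(path))

def discriminant(x, y, W):
    # f(x, y, W) = sum over the path nodes of w_n^T x, as in (3.12).
    return sum(W[n] @ x for n in path_from_root(y))

rng = np.random.default_rng(3)
W = rng.normal(size=(7, 10))            # one weight vector per node
x = rng.normal(size=10)
scores = {y: discriminant(x, y, W) for y in (3, 4, 5, 6)}
print(max(scores, key=scores.get))      # predicted leaf class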

3.4.3 Clustering Tasks

Evgeniou and Pontil's transfer algorithm assumes that the parameters of all tasks are similar to each other, i.e. it is assumed that there is a single cluster of parameter vectors with mean v. For some multitask settings such an assumption might be too restrictive, because not all tasks might be related to each other. However, a given set of tasks might contain subsets, or clusters, of related tasks. The work of Jacob et al. (2008) addresses this problem and proposes a regularization scheme that can take into account the fact that tasks might belong to different clusters. While Evgeniou and Pontil's transfer algorithm searches for a single mean vector v, Jacob et al.'s transfer algorithm searches for r cluster means. More precisely, their algorithm searches for r cluster means v_p and cluster assignments for the task parameters such that the sum of the differences between each mean v_p and the parameter vectors assigned to cluster p is minimized. Searching for the optimal cluster assignments is a hard combinatorial problem; instead, the authors propose an approximation algorithm based on a convex approximation of the k-means objective.

3.4.4 Sharing A Feature Filter

Jebara (2004) developed a joint sparsity transfer algorithm under the maximum entropy discrimination (MED) formalism. In a risk minimization framework one finds the optimal linear classifier parameters w^* by minimizing a regularized loss function on some training set T. In contrast, in the MED framework we regard the classifier parameters w as a random variable and find a distribution p(w) such that the expected loss on T is small, where the expectation is taken with respect to p(w). In addition, in the MED framework one assumes some prior distribution p_0(w) on the parameter vectors. The MED objective tries to find a distribution p(w) with the following properties: 1) it has small expected loss on T, and 2) it is close to the prior distribution, i.e. p(w) has small relative (Shannon) entropy with respect to p_0(w). Putting it all together, the single task MED objective for the hinge loss is given by:

\min_{p(w),\,\epsilon} \int_w p(w) \ln\frac{p(w)}{p_0(w)}\, dw \qquad (3.13)

\text{s.t.}\ \forall_{i=1:n} \quad \int_w p(w)\,\big[y_i w^T x_i + \epsilon_i\big]\, dw \ge 1 \qquad (3.14)

where ε = [ε1, ε2, ..., εn] are the standard slack variables resulting from the hinge loss. When the prior distribution p_0(w) is assumed to be a zero-mean Gaussian we recover the standard SVM objective. To perform feature selection, consider incorporating a binary feature filter into the linear classifier: h(x) = \sum_{j=1}^{d} s_j w_j x_j. The feature filter is given by s = [s1, s2, ..., sd], where each entry s_j ∈ {0, 1} indicates whether a feature should be selected or not. We can now define a joint prior distribution over the parameters w and the feature filter s: p_0(w, s) = p_0(w) \prod_{j=1}^{d} p_0(s_j). A natural choice for p_0(w) is a zero-mean normal distribution; for p_0(s_j) we assume a Bernoulli distribution given by p_0(s_j) = \kappa^{s_j}(1 - \kappa)^{1 - s_j}. Since s is a feature filter, the constant κ controls the amount of sparsity in the model.

To generalize the single task feature selection approach to the multitask case, we simply assume that the feature filter s is shared across the m tasks. The joint prior for the parameter matrix W and feature filter s is given by:

p_0(W, s) = \prod_{k=1}^{m} p_0(w_k) \prod_{j=1}^{d} p_0(s_j) \qquad (3.15)

Putting it all together, the joint feature selection transfer algorithm finds the task-specific parameter distribution p(W) and the shared feature filter distribution p(s) by solving:

\min_{p(W,s),\,[\epsilon^1,\ldots,\epsilon^m]} \int_{W,s} p(W, s) \ln\frac{p(W, s)}{p_0(W, s)}\, d(W, s) \qquad (3.16)

\text{s.t.}\ \forall_{k=1:m}\ \forall_{i=1:n_k} \quad \int_{w_k} p(w_k)\,\Big[y_i^k \sum_{j=1}^{d} w_j^k s_j x_j^k + \epsilon_i\Big] \ge 1 \qquad (3.17)

The authors propose to solve (3.16)-(3.17) using a gradient-based method. The paper presented experiments on the UCI multiclass dermatology dataset, where a one-vs-all classifier was trained for each class. Their results showed that multitask feature selection can improve the average performance.
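As a toy illustration of the shared filter (not Jebara's MED solver), the sketch below draws a single Bernoulli feature filter, as in the prior (3.15), and applies it to the classifiers of all m tasks, so that a feature with s_j = 0 is switched off for every task simultaneously:

import numpy as np

rng = np.random.default_rng(4)
d, m, kappa = 10, 3, 0.3

s = rng.random(d) < kappa               # shared filter: s_j ~ Bernoulli(kappa)
W = rng.normal(size=(m, d))             # task-specific parameters

def filtered_score(k, x):
    # h_k(x) = sum_j s_j * w_{k,j} * x_j for task k.
    return (W[k] * s) @ x

x = rng.normal(size=d)
print([filtered_score(k, x) for k in range(m)])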

3.4.5 Feature Sharing using l1,2 Regularization

Obozinski et al. (2006) proposed a joint sparsity transfer algorithm based on l1,2 regularization. Recall from section 3.3.2 that for a parameter matrix W = [w1, w2, ..., wm] the l1,2 regularization penalty is defined as l_{1,2}(W) = \sum_{j=1}^{d} \|w^j\|_2, where w^j denotes the j-th row of W. This norm has been shown to promote row sparsity. In particular, Obozinski et al. (2008) studied the properties of l1,2 regularization in the context of the joint training of m regression functions. Their study reveals that under certain conditions the l1,2 regularized model can discover the subset of features (i.e. rows of W) that are non-zero in at least one of the m regression tasks. In other words, we can regard the l1,2 norm as a convex relaxation of the l_{r0} penalty given by:

l_{r0}(W) = \big|\{\, j = 1:d \;\mid\; \max_k(w_{j,k}) \ne 0 \,\}\big| \qquad (3.18)

While in Argyriou et al. (2006) the regularization penalty acted on parameters corresponding to hidden features, in Obozinski et al. (2006) the regularization penalty is imposed directly on parameters corresponding to the features x_j. This results in the following jointly regularized objective:

\min_{W} \sum_{k=1}^{m} \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{Loss}(w_k^T x_i^k, y_i^k) + \gamma \sum_{j=1}^{d} \|w^j\|_2 \qquad (3.19)

where w^j are the coefficients of one feature across the m tasks. To optimize (3.19) the authors extended the path-following coordinate descent algorithm of Zhao et al. (2007). Broadly speaking, a coordinate descent method is an algorithm that greedily optimizes one parameter at a time. In particular, Zhao et al.'s algorithm iterates between two steps: 1) a forward step, which finds the feature that most reduces the objective loss, and 2) a backward step, which finds the feature that most reduces the regularized objective loss. To use this algorithm in the context of multitask learning, the authors modified the backward step to ensure that the parameters of one feature across the m tasks are updated simultaneously. Thus in the backward step they select the feature (i.e. row w^j of W) with the largest directional derivative with respect to the l1,2 regularized objective in (3.19). The authors conducted experiments on a handwritten character recognition dataset containing samples generated by different writers. Consider the binary task of distinguishing between a pair of characters. One possibility for solving this task is to ignore the different writers and learn a single l1 regularized classifier pooling examples from all writers (i.e. the pooling model). Another possibility is to train an l1 classifier for each writer (i.e. the independent l1 model). Yet another possibility is to train classifiers for all writers jointly with an l1,2 regularization penalty (i.e. the l1,2 model). The paper compares these three approaches, showing that joint l1,2 regularization results in improved performance. In addition, the paper presents results on a gene expression cancer dataset. Here the task is to find genetic markers (i.e. a subset of genes) that are predictive of four types of cancer. The data consists of gene signatures for both healthy and ill individuals. As in the handwritten recognition case, we could consider training independent l1 classifiers to detect each type of cancer, or training classifiers jointly with an l1,2 regularization penalty. The


experiments showed that in terms of performance both approaches are indistinguishable. However, when we look at feature selection, the l1,2 model selects significantly fewer genes.
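The row-wise structure of the l1,2 penalty is easy to see in code. The following minimal sketch (hypothetical data; not the path-following algorithm of Zhao et al.) computes l_{1,2}(W) and applies the corresponding proximal step, which shrinks entire feature rows and can zero a feature for all tasks at once:

import numpy as np

def l12_penalty(W):
    # l_{1,2}(W): sum over feature rows of the l2 norm of each row.
    return np.linalg.norm(W, axis=1).sum()

def l12_prox(W, step):
    # Group soft-thresholding: shrink each row's norm by `step`; rows
    # whose norm falls below `step` become exactly zero, removing that
    # feature from every task simultaneously.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - step / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(5)
W = rng.normal(size=(6, 3))             # d = 6 feature rows, m = 3 tasks
W_sparse = l12_prox(W, step=1.0)
print(l12_penalty(W), int((np.linalg.norm(W_sparse, axis=1) == 0).sum()))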

3.4.6 Sharing Features Via Joint Boosting

Torralba et al. (2006) proposed a feature sharing transfer algorithm for multiclass object recognition based on boosting. The main idea is to reduce the computational cost of multiclass object recognition by making the m boosted classifiers share weak learners. Let us first consider training a single classifier with a boosting algorithm. For this, we define F = {f_1(x), f_2(x), ..., f_q(x)} to be a set of candidate weak learners. For example, F could be the family of weighted stumps, i.e. functions of the form:

f(x) = a \quad \text{if } x_j > \beta \qquad (3.20)
f(x) = b \quad \text{if } x_j \le \beta \qquad (3.21)

for some real weights a, b ∈ R, some threshold β ∈ R and some feature index j ∈ {1, 2, ..., d}. Assume that we wish to train an additive classifier of the form h(x) = \sum_{f \in \Phi} f(x) that combines the outputs of some subset of weak learners Φ ⊆ F. Boosting provides a simple way to sequentially add one weak learner at a time so as to minimize the exponential loss on the training set T:

\min_{\Phi} \sum_{i=1}^{n} \exp\Big(-y_i \sum_{f \in \Phi} f(x_i)\Big) \qquad (3.22)

Several boosting algorithms have been proposed in the literature; this paper uses gentle boosting (Friedman et al., 1998). Gentle boosting performs an iterative greedy optimization where at each step t we add a weak learner f_t(x) to obtain a new classifier h_t(x) = \sum_{j=1}^{t-1} f_j(x) + f_t(x). The weak learner added at step t is given by:

\operatorname*{argmin}_{f_t \in F} \sum_{i=1}^{n} \exp\Big(-y_i \sum_{j=1}^{t-1} f_j(x_i)\Big)\,(y_i - f_t(x_i))^2 \qquad (3.23)

In the multiclass case, the standard boosting approach is to learn an additive model h_k(x) = \sum_{f \in \Phi_k} f(x) for each class by minimizing the joint exponential loss:

\min_{\{\Phi_1, \Phi_2, \ldots, \Phi_m\}} \sum_{i=1}^{n} \sum_{k=1}^{m} \exp\Big(-y_i^k \sum_{f \in \Phi_k} f(x_i)\Big) \qquad (3.24)

Notice that each class has its own set of weak learners Φ_k, i.e. there is no sharing of weak learners across classes. Torralba et al. propose to enforce sharing of weak learners across classes by modifying the structure of the class-specific additive models h_k(x). In particular, let us define R to be a subset of tasks, R ⊆ {1, 2, ..., m}. For each such subset, consider a corresponding additive classifier h_R(x) = \sum_{f \in \Phi_R} f(x) that performs the binary task of deciding whether an example belongs to any class in R or not. Using these basic classifiers we can define an additive classifier for the k-th class of the form h_k(x) = \sum_{R : k \in R} h_R(x). At iteration t the joint boosting algorithm adds a new weak learner to one of the 2^m additive models h_R so as to minimize the joint loss on D:

\sum_{k=1}^{m} \sum_{i=1}^{n} \exp\Big(-y_i^k \sum_{R : k \in R} h_R(x_i)\Big) \qquad (3.25)

A naive implementation of (3.25) would require exploring all possible subsets of classes and would therefore have a computational cost of O(d 2^m).³ The authors show that when the weak learners are decision stumps, an O(d m²) time search heuristic can be used to approximate (3.25). The paper presented experiments on an object recognition dataset containing 21 object categories. The basic features (i.e. the features used to create the decision stumps) were normalized filter responses computed at different locations of the image. They compare their joint boosting algorithm to a baseline that trains a boosted classifier for each class independently of the others. For the independent classifiers they limit the number of boosting iterations so that the total number of weak learners used by the m independently trained classifiers is the same as the total number of weak learners used by the classifiers trained with joint boosting. Their results showed that for a fixed total number of weak learners (i.e. for a given multiclass run-time cost) the accuracy of joint boosting is superior to that of independent boosting. One interesting result is that joint boosting tends to learn weak classifiers that are

³We have an O(d) cost because every feature in X needs to be evaluated to find the best decision stump.


general enough to be useful for multiple classes. For example, joint boosting will learn weak classifiers that can detect particular edge patterns, similar to the response properties of V1 cells.
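As a concrete illustration of the weak learners involved (a sketch, not Torralba et al.'s shared-stump search), the code below fits a regression stump of the form (3.20)-(3.21) to weighted targets, which is the core operation inside the gentle boosting step (3.23):

import numpy as np

def fit_stump(X, y, weights):
    # Find the feature j, threshold beta and outputs a, b minimizing the
    # weighted squared error sum_i weights_i * (y_i - f(x_i))^2.
    best_err, best = np.inf, None
    n, d = X.shape
    for j in range(d):
        for beta in np.unique(X[:, j]):
            above = X[:, j] > beta
            a = np.average(y[above], weights=weights[above]) if above.any() else 0.0
            b = np.average(y[~above], weights=weights[~above]) if (~above).any() else 0.0
            pred = np.where(above, a, b)
            err = np.sum(weights * (y - pred) ** 2)
            if err < best_err:
                best_err, best = err, (j, beta, a, b)
    return best

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
y = np.sign(X[:, 2])                    # a stump on feature 2 is optimal
w = np.full(50, 1.0 / 50)               # boosting weights exp(-y*h) with h = 0
print(fit_stump(X, y, w))               # expect feature index 2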

3.5 Hierarchical Bayesian Learning

3.5.1 Transfer Learning with Hidden Features and Shared Priors

Bakker and Heskes (2003) presented a hierarchical Bayesian learning model for the transfer learning of m regression tasks (i.e. y is a real-valued output). In the proposed model, some of the parameters are assumed to be shared directly across tasks (as in the structure learning framework) while others are more loosely coupled by means of a common prior distribution. The idea of sharing a prior distribution is very similar to the approach of Argyriou et al. (2006), where we assume the existence of a shared mean parameter vector. The main difference is that in addition to the shared prior, Bakker and Heskes (2003) learn a hidden shared transformation. The prior distribution and the shared parameters are inferred jointly using data from all related tasks in a maximum likelihood framework. In particular, consider linear regression models of the form h(x) = \sum_{j=1}^{z} w_j g_j^{\theta}(x),

where z is the number of hidden features in the model and g_j^θ(x) is a function that returns the j-th hidden feature. As in most transfer algorithms, we have task-specific parameters W = [w1, w2, ..., wm] and shared parameters θ. In addition, we assume that the task-specific parameters w_k are sampled from a shared Gaussian prior distribution N(m, Σ) with z-dimensional mean m and z by z covariance matrix Σ. The joint distribution of the task-specific model parameters W and the training data D conditioned on the shared parameters Λ = (θ, m, Σ) is given by:

p(D, W \mid \Lambda) = \prod_{k=1}^{m} p(T_k \mid w_k, \theta)\; p(w_k \mid m, \Sigma) \qquad (3.26)

where we are assuming that given the shared parameters Λ the m tasks are independent.

To obtain the optimal parameters Λ^* we integrate out the task-specific parameters W and find the maximum likelihood solution. Once the maximum likelihood parameters Λ^* are known, we can compute the task-specific parameters w_k^* by:

w_k^* = \operatorname*{argmax}_{w_k} p(w_k \mid T_k, \Lambda^*) \qquad (3.27)

We have assumed so far that all task-specific parameters are equally related to each other, i.e. they are all samples from a single prior distribution N(m, Σ). However, in real applications not all tasks might be related to each other; there might instead be some underlying clustering of tasks. The authors address this problem by replacing the single Gaussian prior distribution with a mixture of q Gaussian distributions. Thus, in the task clustering version of the model, each task-specific parameter w_k is assumed to be sampled from w_k \sim \sum_{r=1}^{q} \alpha_r N(m_r, \Sigma_r). The paper presented experiments on the school dataset of the UCI repository and showed that their transfer algorithm improved performance as compared to training each task independently. In addition, the experiments showed that their transfer algorithm was able to provide a meaningful clustering of the different tasks.
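To make the generative assumptions concrete, the sketch below (with hypothetical dimensions) samples task-specific weight vectors from the shared Gaussian prior N(m, Σ) and from the mixture-of-Gaussians variant used for task clustering:

import numpy as np

rng = np.random.default_rng(7)
z, n_tasks, q = 4, 5, 2

# Single shared prior: w_k ~ N(mean, cov) for every task.
mean, cov = np.zeros(z), np.eye(z)
W = rng.multivariate_normal(mean, cov, size=n_tasks)

# Task clustering variant: each task first picks a cluster r with
# probability alpha_r, then samples w_k ~ N(mean_r, cov_r).
alphas = np.array([0.7, 0.3])
means = [np.zeros(z), 3.0 * np.ones(z)]
covs = [np.eye(z), 0.5 * np.eye(z)]
clusters = rng.choice(q, size=n_tasks, p=alphas)
W_clustered = np.stack([rng.multivariate_normal(means[r], covs[r])
                        for r in clusters])
print(W.shape, W_clustered.shape, clusters)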

3.5.2 Sharing a Prior Covariance

Raina et al. (2006) proposed a Bayesian logistic regression algorithm for asymmetric transfer. Their algorithm uses data from auxiliary tasks to learn a prior distribution on model parameters. In particular, the learnt prior encodes useful underlying dependencies between pairs of parameters and can be used to improve performance on a target task. Consider a binary logistic regression model making predictions according to p(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^T x)}. For this model, the MAP parameters w^* are given by:

w^* = \operatorname*{argmax}_{w} \sum_{i=1}^{n} \log p(y = y^i \mid x^i, w) + \lambda\, p_0(w) \qquad (3.28)

The left-hand term is the likelihood of the training data under our model. The right-hand term is a prior distribution on parameter vectors w. The most common choice for p_0(w) is a zero-mean multivariate Gaussian prior N(0, Σ). In general, Σ is defined to be an

identity covariance matrix, i.e. we assume that the parameters are independent of each other and have equal prior variance. The main idea of Raina et al. (2006) is to use data from the auxiliary tasks to learn a more informative covariance matrix Σ. To give an intuition of why this might be helpful, think of a text categorization task where the goal is to predict whether an article belongs to a given topic or not. Consider a bag-of-words document representation, where every feature encodes the presence or absence of a word from some reference vocabulary. When training classifiers for the auxiliary tasks we might discover that the parameters for moon and rocket are positively correlated, i.e. when they occur in a document they typically predict the same label. The transfer assumption is that the parameter correlations learnt on the auxiliary tasks can be predictive of the parameter correlations on the target task. In particular, the authors propose the following Monte Carlo sampling algorithm to learn each entry E[w_j^* w_q^*] of Σ from auxiliary training data (a runnable sketch follows below):

• Input: D, j, q
• Output: E[w_j^* w_q^*]
• for p = 1 to z
  – Choose a random auxiliary task k
  – Choose a random subset of features Λ that includes the feature pair (j, q)
  – Train a logistic classifier on dataset T_k using only the features in Λ to obtain the estimates w_j^p, w_q^p
• end
• return E[w_j^* w_q^*] = \frac{1}{z} \sum_{p=1}^{z} w_j^p w_q^p
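The following sketch implements the sampling loop above with scikit-learn's LogisticRegression; the data container, the number of samples z and the feature-subset size are hypothetical, and each auxiliary task is assumed to contain both labels:

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_cov_entry(aux_tasks, j, q, z=50, subset_size=20, seed=0):
    # Monte Carlo estimate of E[w_j* w_q*] from auxiliary tasks.
    # aux_tasks: list of (X, y) pairs, one per auxiliary task.
    rng = np.random.default_rng(seed)
    d = aux_tasks[0][0].shape[1]
    products = []
    for _ in range(z):
        X, y = aux_tasks[rng.integers(len(aux_tasks))]     # random task k
        others = [f for f in range(d) if f not in (j, q)]
        subset = [j, q] + list(rng.choice(others, size=subset_size - 2,
                                          replace=False))
        w = LogisticRegression().fit(X[:, subset], y).coef_[0]
        products.append(w[0] * w[1])   # coefficients at positions of j and q
    return float(np.mean(products))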

The paper presented results on a text categorization dataset containing documents from 20 different news-groups. The news-group classes were randomly paired to construct 10 binary classification tasks. The asymmetric transfer learning experiment consisted of two steps. In the first step a prior covariance matrix Σ is learned using data from 9 auxiliary tasks. In the second step this prior covariance is used in (3.28) to train a classifier for the 10-th, held-out, target task. As a baseline model they train a logistic regression classifier for each task using an identity covariance matrix as a prior. Their results showed that for small training sets the covariance learnt from auxiliary data led to lower test errors than the baseline model.

3.5.3 Learning Shape And Appearance Priors For Object Recognition

Fei-Fei et al. (2006) presented a hierarchical Bayesian transfer learning algorithm for object recognition. Their transfer algorithm uses auxiliary training data to learn a prior distribution over object model parameters. This prior distribution is then used to aid the training of a target object classifier for which there is a single positive and a single negative training example. Consider learning a probabilistic generative model for a given object class. We are going to assume that for every image we first compute a set of u interesting regions, which we use to construct an appearance and shape representation. In particular, we represent each image using an appearance matrix A = [a1, a2, ..., au], where a_j is a feature representation for the j-th region, and a u by 2 shape matrix S containing the center location of each region. A constellation object model assumes the existence of q latent object parts. More precisely, we define an assignment vector r = [r1, r2, ..., rq] where r_j ∈ R = {1, 2, ..., u} assigns one of the u interesting regions to the j-th latent object part. Using the latent assignment r and assuming appearance parameters w^A and shape parameters w^S, we can write the generative object model as:

p(S, A, w^A, w^S) = \sum_{r \in R} \prod_{p=1}^{q} p(a_p \mid w^A)\; p(S \mid w^S)\; p(r)\; p(w^A)\, p(w^S) \qquad (3.29)

Consider a normal distribution for both appearance and shape, and let w^A = \{(\mu_1^A, \Sigma_1^A), \ldots, (\mu_q^A, \Sigma_q^A)\} and w^S = \{\mu^S, \Sigma^S\} be the appearance and shape parameters respectively. We can then re-write the generative object model as:

p(S, A, w^A, w^S) = \sum_{r \in R} \prod_{p=1}^{q} N(\mu_p^A, \Sigma_p^A)\; N(\mu^S, \Sigma^S)\; p(r)\; p(w^A)\, p(w^S) \qquad (3.30)

The first term in (3.30) is the likelihood of the observations {A, S} under the object model. The second term is a prior distribution over the appearance parameters w^A and shape parameters w^S; these priors can be learned from auxiliary tasks, i.e. related object classes. The prior distribution over a part's mean appearance is assumed to be normally distributed, \mu_j^A \sim N(\beta^A, \Lambda^A), where \beta^A and \Lambda^A are the appearance hyper-parameters. Similarly, for the shape prior we have \mu^S \sim N(\gamma^S, \Upsilon^S), where \gamma^S and \Upsilon^S are the shape hyper-parameters. The proposed asymmetric transfer algorithm learns the shape and appearance hyper-parameters from related object classes and uses them when learning a target object model. In particular, consider learning an object model for the m auxiliary tasks using uniform shape and appearance priors, and let \mu_k^{S*} be the maximum likelihood parameters for object k. Their transfer algorithm computes the shape hyper-parameters as \gamma^S \sim \frac{1}{m}\sum_{k=1}^{m} \mu_k^{S*}. The appearance hyper-parameters are computed in an analogous manner from the auxiliary training data. The paper presents results on a dataset of 101 object categories. The shared prior hyper-parameters are learnt using data from four object categories. This prior is then used when training object models for the remaining target object classes. For these target classes they assume there is only one positive example available for training. The results showed that when only one sample is available for training, the transfer algorithm can improve classification performance.

3.6 Related Work Summary

In this chapter we have reviewed three main approaches to transfer learning: learning hidden representations, feature sharing and hierarchical Bayesian learning. As we would expect, each approach has its advantages and limitations. The learning hidden representations approach is appealing because it can exploit latent structures in the data. However, such power comes at a relatively high computational cost: both the Ando and Zhang (2005) and Argyriou et al. (2006) joint training algorithms make use of alternating optimization schemes where each iteration involves training m classifiers. The feature sharing approach might not be able to uncover latent structures, but as we will show in chapter 6 it can be implemented very efficiently, i.e. we can derive joint learning algorithms whose computational cost is in essence the same as that of independent training. This approach is appealing for applications where we can easily generate a large number of candidate features to choose from. This is the case in many image classification problems; for example, in chapter 5 we generate a large set of candidate features using a kernel function and unlabeled examples. The Bayesian approach has the advantage that it falls under a well studied family of hierarchical Bayesian methods. However, assuming a parametric prior distribution over model parameters might pose important limitations. For example, assuming a shared Gaussian distribution over model parameters (Bakker and Heskes, 2003; Fei-Fei et al., 2006) implies that different tasks will weight features similarly. We will see in the experiments in chapter 6 that this assumption rarely holds. That is, it is common to find features that are useful for a pair of tasks but whose corresponding weights have opposite signs. The other main limitation of Bayesian approaches is their computational cost; both Raina et al.'s and Fei-Fei et al.'s algorithms involve costly Monte Carlo approximations. For example, each Monte Carlo iteration of Raina et al.'s algorithm involves training a set of m classifiers. The transfer algorithm that we present in chapters 5 and 6 falls under the feature sharing approach. We have reviewed four such algorithms. Jebara (2004) proposes an interesting formulation for joint feature selection, but it has the limitation of assuming that the probability of a feature being selected follows a Bernoulli distribution. Argyriou et al. proposed a simple and efficient parameter tying transfer algorithm. However, their algorithm suffers from the same limitation as Bakker and Heskes's, i.e. the transfer learning assumption is too strong, since two related tasks might share the same relevant features but might weight them differently. The two transfer algorithms most closely related to our work are the joint sparsity algorithms of Obozinski et al. and Torralba et al. Both these algorithms and the algorithm presented in chapters 5 and 6 can be thought of as approximations to an l_{r0} regularized joint objective. The optimization algorithms proposed by Obozinski et al. and Torralba et al. are greedy (i.e. coordinate descent algorithms). In contrast, our transfer algorithm finds a global optimum by directly optimizing a convex relaxation of the l_{r0} penalty. More precisely, we develop a simple and general optimization algorithm which for any convex loss has guaranteed convergence rates of O(1/\epsilon^2). One of the most attractive properties of our transfer algorithm is that its computational cost is comparable to the cost of training m independent sparse classifiers (with the most efficient algorithms for that task (Duchi et al., 2008)).


Chapter 4

Learning Image Representations Using Images with Captions

The work presented in this chapter was published in Quattoni et al. (2007). In this chapter we investigate a semisupervised image classification application of asymmetric transfer where the auxiliary tasks are derived automatically using unlabeled data. Thus any of the transfer learning algorithms described in 6.4 can be used in this semisupervised context. In particular, we use the Reuters data described in chapter 2 and consider a setting where we have thousands of images with associated captions, but few images annotated with news story labels. We take the prediction of content words from the captions to be our auxiliary tasks and the prediction of a story label to be our target task. Our goal is to leverage the auxiliary unlabeled data to derive a lower dimensional representation that still captures the relevant information necessary to discriminate between different stories. We will show how the structure learning framework of Ando and Zhang (2005) can be used to learn such a representation. This chapter is organized as follows: Section 4.1 motivates the need for semi-supervised algorithms for image classification, Section 4.2 presents our model for learning image representations from unlabeled data annotated with meta-data based on structure learning, Section 4.3 illustrates the approach by giving some toy examples of the types of representations that it can learn, and Section 4.5 describes experiments on image classification

where the goal is to predict the news-topic of an image and the meta-data are image captions. Finally, Section 4.7 summarizes the results of this chapter and highlights some open questions for future work.

4.1 Introduction

When few labeled examples are available, most current supervised classification methods may work poorly; for example, when a user defines a new category and provides only a few labeled examples. To reach human performance, it is clear that knowledge beyond the supervised training data needs to be leveraged. When labeled data is scarce it may be beneficial to use unlabeled data to learn an image representation that is low-dimensional, but that nevertheless captures the information required to discriminate between image categories. There is a large literature on semi-supervised learning approaches, where unlabeled data is used in addition to labeled data. We do not aim to give a full overview of this work; for a comprehensive survey article see (Seeger, 2001). Most semi-supervised learning techniques can be broadly grouped into three categories depending on how they make use of the unlabeled data: density estimation, dimensionality reduction via manifold learning, and function regularization. Generative models trained via EM can naturally incorporate unlabeled data for classification tasks (Nigam et al., 2000; Baluja, 1998). In the context of discriminative category learning, Fisher kernels (Jaakkola and Haussler, 1998; Holub et al., 2005) have been used to exploit a learned generative model of the data space in an SVM classifier. In some cases unlabeled data may contain useful meta-data that can be used to learn a low-dimensional representation that reflects the semantic content of an image. As one example, large quantities of images with associated natural language captions can be found on the web. This chapter describes an algorithm that uses images with captions or other meta-data to derive an image representation that allows significantly improved learning in cases where only a few labeled examples are available. In our approach the meta-data is used to induce a representation that reflects an underlying

part structure in an existing, high-dimensional visual representation. The new representation groups together synonymous visual features, i.e. features that consistently play a similar role across different image classification tasks. Ando and Zhang introduced the structure learning framework, which makes use of auxiliary problems in leveraging unlabeled data. In this chapter we introduce auxiliary problems that are created from images with associated captions. Each auxiliary problem involves taking an image as input, and predicting whether or not a particular content word (e.g., man, official, or celebrates) is in the caption associated with that image. In structure learning, a separate linear classifier is trained for each of the auxiliary problems and manifold learning (e.g., SVD) is applied to the resulting set of parameter vectors, finding a low-dimensional space which is a good approximation to the space of possible parameter vectors. If features in the high-dimensional space correspond to the same semantic part, their associated classifier parameters (weights) across different auxiliary problems may be correlated in such a way that the basis functions learned by the SVD step collapse these features to a single feature in a new, low-dimensional feature-vector representation. In a first set of experiments, we use synthetic data examples to illustrate how the method can uncover latent part structures. We then describe experiments on the classification of news images into different topics. We compare a baseline model that uses a bag-of-words SIFT representation of image data to our method, which replaces the SIFT representation with a new representation that is learned from 8,000 images with associated captions. In addition, we compare our method to (1) a baseline model that ignores the meta-data and learns a new visual representation by performing PCA on the unlabeled images and (2) a model that uses as a visual representation the output of word classifiers trained using captions and unlabeled data. Note that our goal is to build classifiers that work on images alone (i.e., images which do not have captions), and our experimental set-up reflects this, in that training and test examples for the topic classification tasks include image data only. The experiments show that our method significantly outperforms the baseline models.


4.2 Learning Visual Representations

A good choice of representation of images will be crucial to the success of any model for image classification. The central focus of this chapter is a method for automatically learning a representation from images which are unlabeled, but which have associated meta-data, for example natural language captions. We are particularly interested in learning a representation that allows effective learning of image classifiers in situations where the number of training examples is small. The key to the approach is to use the meta-data associated with the unlabeled images to form a set of auxiliary problems which drive the induction of an image representation. We assume the following scenario:

• We have labeled (supervised) data for an image classification task. We will call this the target task. For example, we might be interested in recovering images relevant to a particular topic in the news, in which case the labeled data would consist of images labeled with a binary distinction corresponding to whether or not they were relevant to the topic. We denote the labeled examples as the target set T_0 = {(x_1, y_1), ..., (x_n, y_n)} where (x_i, y_i) is the i-th image/label pair. Note that test data points for the target task contain image data alone (these images do not have associated caption data, for example).

• We have m auxiliary training sets, T_k = {(x_1^k, y_1^k), ..., (x_{n_k}^k, y_{n_k}^k)} for k = 1...m. Here x_i^k is the i-th image in the k-th auxiliary training set, y_i^k is the label for that image, and n_k is the number of examples in the k-th training set. The auxiliary training sets consist of binary classification problems, distinct from the target task, where each y_i^k is in {−1, +1}. Shortly we will describe a method for constructing auxiliary training sets using images with captions.

• The aim is to learn a representation of images, i.e., a function that maps images x to feature vectors v(x). The auxiliary training sets will be used as a source of information in learning this representation. The new representation will be applied when learning a classification model for the target task.

In the next section we will describe a method for inducing a representation from a set of auxiliary training sets. The intuition behind this method is to find a representation which is relatively simple (i.e., of low dimension), yet allows strong performance on the auxiliary


training sets. If the auxiliary tasks are sufficiently related to the target task, the learned representation will allow effective learning on the target task, even in cases where the number of training examples is small.

4.2.1 Learning Visual Representations from Auxiliary Tasks

This section describes the structure learning algorithm (Ando and Zhang, 2005) for learning a representation from a set of auxiliary training sets. We assume that x ∈ R^d is a baseline feature-vector representation of images. In the experiments in this chapter x is a SIFT histogram representation (Sivic et al., 2005). In general, x will be a "raw" representation of images that would be sufficient for learning an effective classifier with a large number of training examples, but which performs relatively poorly when the number of training examples is small. For example, with the SIFT representation the feature vectors x are of relatively high dimension (we use d = 1,000), making learning with small amounts of training data a challenging problem without additional information. Note that one method for learning a representation from the unlabeled data would be to use PCA, or some other density estimation method, over the feature vectors U = {x_1, ..., x_q} for the set of unlabeled images (we will call this method the data-SVD method). The method we describe differs significantly from PCA and similar methods in its use of the meta-data associated with the images, for example captions. Later we will describe synthetic experiments where PCA fails to find a useful representation, but our method is successful. In addition we describe experiments on real image data where PCA again fails, but our method is successful in recovering representations which significantly speed learning. Given the baseline representation, the new representation is defined as v(x) = θx where θ is a projection matrix of dimension z × d.¹ The value of z is typically chosen such that z ≪ d. The projection matrix is learned from the set of auxiliary training sets, using the structure learning approach described in (Ando and Zhang, 2005). Figure 4-1

¹Note that the restriction to linear projections is not necessarily limiting. It is possible to learn non-linear projections using the kernel trick; i.e., by expanding feature vectors x to a higher-dimensional space, then taking projections of this space.


Input 1: auxiliary training sets {(x_1^k, y_1^k), ..., (x_{n_k}^k, y_{n_k}^k)} for k = 1...m. Here x_i^k is the i-th image in the k-th training set, y_i^k is the label for that image, and n_k is the number of examples in the k-th training set. We consider binary classification problems, where each y_i^k is in {−1, +1}. Each image x ∈ R^d is given by a feature representation of the image.

Input 2: target training set T_0 = {(x_1, y_1), ..., (x_n, y_n)}.

Structural learning using auxiliary training sets:

Step 1: Train m linear classifiers. For k = 1...m, choose the optimal parameters on the k-th training set to be w_k^* = \operatorname{argmin}_w \mathrm{Loss}_k(w) where

\mathrm{Loss}_k(w) = \sum_{i=1}^{n_k} \mathrm{Loss}(w \cdot x_i^k, y_i^k) + \frac{\gamma}{2}\|w\|_2^2

(See section 4.2.1 for more discussion.)

Step 2: Perform SVD on the parameter vectors. Form a matrix W of dimension d × m by taking the parameter vectors w_k^* for k = 1...m as columns. Compute a projection matrix θ of dimension z × d by taking the first z eigenvectors of WW'.

Output: the projection matrix θ ∈ R^{z×d}.

Train using the target training set: Define v(x) = θx. Choose the optimal parameters on the target training set to be v^* = \operatorname{argmin}_v \mathrm{Loss}(v) where

\mathrm{Loss}(v) = \sum_{i=1}^{n} \mathrm{Loss}(v \cdot v(x_i), y_i) + \frac{\gamma_2}{2}\|v\|_2^2

Figure 4-1: The structure learning algorithm (Ando and Zhang, 2005).


shows the algorithm. In a first step, linear classifiers w_k^* are trained for each of the m auxiliary problems. In several parameter estimation methods, including logistic regression and support vector machines, the optimal parameters w^* are taken to be w^* = \operatorname{argmin}_w \mathrm{Loss}(w), where Loss(w) takes the following form:

\mathrm{Loss}(w) = \sum_{i=1}^{n} \mathrm{Loss}(w \cdot x_i, y_i) + \frac{\gamma}{2}\|w\|_2^2 \qquad (4.1)

Here {(x_1, y_1), ..., (x_n, y_n)} is a set of training examples, where each x_i is an image and each y_i is a label. The constant γ > 0 dictates the amount of regularization in the model. The function Loss(w · x, y) is some measure of the loss for the parameters w on the example (x, y). For example, in support vector machines (Cortes and Vapnik, 1995) Loss is the hinge loss, defined as Loss(m, y) = (1 − ym)_+, where (z)_+ is z if z ≥ 0 and 0 otherwise. In logistic regression the loss function is

\mathrm{Loss}(m, y) = -\log \frac{\exp\{ym\}}{1 + \exp\{ym\}} \qquad (4.2)

Throughout this chapter we use the loss function in Eq. 4.2, and classify examples with sign(w · x), where sign(z) is 1 if z ≥ 0 and −1 otherwise. In the second step, SVD is used to identify a matrix θ of dimension z × d. The matrix defines a linear subspace of dimension z which is a good approximation to the space of induced weight vectors w_1^*, ..., w_m^*. Thus the approach amounts to manifold learning in classifier weight space. Note that there is a crucial difference between this approach and the data-SVD approach: in data-SVD, the SVD is run over the data space, whereas in this approach the SVD is run over the space of parameter values. This leads to very different behaviors of the two methods. The matrix θ is used to constrain the learning of new problems. As in (Ando and Zhang, 2005), the parameter values are chosen to be w^* = θ'v^* where v^* = \operatorname{argmin}_v \mathrm{Loss}(v) and

\mathrm{Loss}(v) = \sum_{i=1}^{n} \mathrm{Loss}((\theta' v) \cdot x_i, y_i) + \frac{\gamma}{2}\|v\|_2^2 \qquad (4.3)

This essentially corresponds to constraining the parameter vector w^* for the new problem to lie in the sub-space defined by θ. Hence we have effectively used the auxiliary training

problems to learn a sub-space constraint on the set of possible parameter vectors. If we define v(x) = θx, it is simple to verify that

\mathrm{Loss}(v) = \sum_{i=1}^{n} \mathrm{Loss}(v \cdot v(x_i), y_i) + \frac{\gamma}{2}\|v\|_2^2 \qquad (4.4)

and also that sign(w∗ · x) = sign(v∗ · v(x)). Hence an alternative view of the algorithm in figure 4-1 is that it induces a new representation v(x). In summary, the algorithm in figure 4-1 derives a matrix θ that can be interpreted either as a sub-space constraint on the space of possible parameter vectors, or as defining a new representation v(x) = θx.
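The two-step procedure of figure 4-1 can be sketched compactly with NumPy and scikit-learn; the auxiliary data below is random and purely illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_projection(aux_tasks, z):
    # Step 1: train one linear classifier per auxiliary problem.
    W = np.column_stack([LogisticRegression().fit(X, y).coef_[0]
                         for X, y in aux_tasks])     # W is d x m
    # Step 2: SVD of W; the top-z left singular vectors (equivalently,
    # the first z eigenvectors of WW') give the projection theta (z x d).
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :z].T

rng = np.random.default_rng(8)
aux = [(rng.normal(size=(40, 30)), rng.choice([0, 1], size=40))
       for _ in range(8)]
theta = learn_projection(aux, z=5)
v_of_x = theta @ rng.normal(size=30)     # new representation v(x) = theta x
print(theta.shape, v_of_x.shape)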

4.2.2 Metadata-derived auxiliary problems

A central question is how auxiliary training sets can be created for image data. A key contribution of this chapter is to show that unlabeled images which have associated text captions can be used to create auxiliary training sets, and that the representations learned with these unlabeled examples can significantly reduce the amount of training data required for a broad class of topic-classification problems. Note that in many cases images with captions are readily available, and thus the set of captioned images available may be considerably larger than our set of labeled images. Formally, denote a set of images with associated captions as (x'_1, c_1), ..., (x'_q, c_q), where (x'_i, c_i) is the i-th image/caption pair. We base our m auxiliary training sets on m content words (w_1, ..., w_m). A natural choice for these words is the m most frequent content words seen within the captions.² The m auxiliary training sets can then be created as follows. Define I_k[c] to be 1 if word w_k is seen in caption c, and −1 otherwise. Create a training set T_k = {(x'_1, I_k[c_1]), ..., (x'_q, I_k[c_q])} for each k = 1...m. Thus the k-th training set corresponds to the binary classification task of predicting whether or not the word w_k is seen in the caption for an image x'.
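A minimal sketch of this construction, with hypothetical captions and a tiny stop list:

from collections import Counter

stop_list = {"the", "a", "of", "in", "at", "and"}    # hypothetical stop list

def build_auxiliary_labels(captions, m):
    # Pick the m most frequent content words across all captions.
    counts = Counter(w for c in captions for w in c.lower().split()
                     if w not in stop_list)
    words = [w for w, _ in counts.most_common(m)]
    # labels[k][i] = +1 if word k appears in caption i, and -1 otherwise.
    labels = [[1 if w in c.lower().split() else -1 for c in captions]
              for w in words]
    return words, labels

captions = ["The president celebrates in Washington",
            "A man celebrates at the stadium",
            "The official speaks"]
print(build_auxiliary_labels(captions, m=3))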


Figure 4-2: Concept figure illustrating how, when appropriate auxiliary tasks have already been learned, manifold learning in classifier weight space can group features corresponding to functionally defined visual parts. Parts (eyes, nose, mouth) of an object (face) may have distinct visual appearances (the top row of cartoon part appearances). A specific face (e.g., a or b) is represented with the boolean indicator vector as shown. Matrix D shows all possible faces given this simple model; PCA on D is shown row-wise in PD (the first principal component is also shown above in green as PD1). No basis in PD groups together eye or mouth appearances; different part appearances never co-occur in D. However, idealized classifiers trained to recognize, e.g., faces with a particular mouth and any eye (H,S,N), or a particular eye and mouth (LL,LC,LR,EC), will learn to group features into parts. Matrix T and the blue vectors above show these idealized boolean classifier weights; the first principal component of T is shown in red as PT1, clearly grouping together the four cartoon eye and the three cartoon mouth appearances. Positive and negative components of PT1 would be very useful features for future learning tasks related to faces in this simple domain because they group together different appearances of eyes and mouths.



Figure 4-3: Synthetic data involving objects constructed from letters. (a) There are 10 possible parts, corresponding to the first 10 letters of the alphabet. Each part has 5 possible observations (corresponding to different fonts). (b) Each object consists of 3 distinct parts; the observation for each part is drawn uniformly at random from the set of possible observations for that part. A few random draws for 4 different objects are shown.

4.3 Examples Illustrating the Approach

Figure 4-2 shows a concept figure illustrating how PCA in a classifier weight space can discover functional part structures given idealized auxiliary tasks. When the tasks are defined such that solving them requires grouping different visual appearances, the distinct part appearances become correlated in the weight space, and techniques such as PCA will be able to discover them. In practice the ability to obtain such ideal classifiers is critical to our method's success. Next we will describe a synthetic example where the method is successful; in the following section we present real-world examples where auxiliary tasks are readily available and yield features that speed the learning of future tasks. We now describe experiments on synthetic data that illustrate the approach. To generate the data, we assume that there is a set of 10 possible parts. Each object in our data consists of 3 distinct parts; hence there are \binom{10}{3} = 120 possible objects. Finally, each of the 10 parts has 5 possible observations, giving 50 possible observations in total (the observations for each part are distinct).

²In our experiments we define a content word to be any word which does not appear on a "stop list" of common function words in English.


As a simple example (see figure 4-3), the 10 parts might correspond to 10 letters of the alphabet. Each "object" then consists of 3 distinct letters from this set. The 5 possible observations for each part (letter) correspond to visually distinct realizations of that letter; for example, these could correspond to the same letter in different fonts, or the same letter with different degrees of rotation. The assumption is that each observation will end up as a distinct visual word, and therefore that there are 50 possible visual words. The goal in learning a representation for object recognition in this task would be to learn that different observations from the same part are essentially equivalent; for example, that observations of the letter "a" in different fonts should be collapsed to the same point. This can be achieved by learning a projection matrix θ of dimension 10 × 50 which correctly maps the 50-dimensional observation space to the 10-dimensional part space. We show that the use of auxiliary training sets, as described in section 4.2.1, is successful in learning this structure, whereas PCA fails to find any useful structure in this domain. To generate the synthetic data, we sample 100 instances of each of the 120 objects as follows. For a given object y, define P_y to be the set of parts that make up that object. For each part p ∈ P_y, generate a single observation uniformly at random from the set of possible observations for p. Each data point generated in this way consists of an object label y together with a set of three observations. We can represent these observations by a 50-dimensional binary feature vector x, where only 3 dimensions (corresponding to the three observations) are non-zero. To apply the auxiliary data approach, we create 120 auxiliary training sets. The k-th training set corresponds to the problem of discriminating between the k-th object and all other 119 objects. A projection matrix θ is learned from the auxiliary training sets. In addition, we can also construct a projection matrix using PCA on the data points x alone. Figures 4-4 and 4-5 show the projections learned by PCA and by the auxiliary tasks method. PCA fails to learn useful structure; in contrast, the auxiliary task method correctly collapses observations for the same part to nearby points.
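A sketch of this generative process, with the dimensions used in the text:

import numpy as np
from itertools import combinations

rng = np.random.default_rng(9)
n_parts, obs_per_part = 10, 5
objects = list(combinations(range(n_parts), 3))      # the 120 objects

def sample_example(obj):
    # One observation per part, uniform among that part's 5 realizations;
    # the result is a 50-dimensional binary indicator vector.
    x = np.zeros(n_parts * obs_per_part)
    for p in obj:
        x[p * obs_per_part + rng.integers(obs_per_part)] = 1.0
    return x

X = np.stack([sample_example(obj) for obj in objects for _ in range(100)])
y = np.repeat(np.arange(len(objects)), 100)
print(X.shape, y.shape)                              # (12000, 50) (12000,)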

58


Figure 4-4: The representations learned by PCA on the synthetic data problem. The first figure shows projections 1 vs. 2; the second figure shows projections 2 vs. 3. Each plot shows 50 points corresponding to the 50 observations in the model; observations corresponding to the same part have the same color. There is no discernible structure in the figures. The remaining dimensions were found to similarly show no structure.


Figure 4-5: The representations learned by structure learning on the synthetic data problem. The first figure shows projections 1 vs. 2; the second figure shows projections 2 vs. 3. Each plot shows 50 points corresponding to the 50 observations in the model; observations corresponding to the same part have the same color. There is clear structure in features 2 and 3, in that observations corresponding to the same part are collapsed to nearby points in the projected space. The remaining dimensions were found to show similar structure to those in dimensions 2 and 3.


4.4 Reuters Image Dataset

For the experiments in chapters 4, 5 and 6 we collected an image dataset from the Reuters news website (http://today.reuters.com/news/). Images on the Reuters website are associated with stories, which correspond to different topics in the news. An image can belong to more than one topic, although most images are associated with a single topic. In addition to the topic label, every image has a caption consisting of a few sentences describing the image. This data was collected over two different time periods, resulting in two versions of the dataset: Reuters.V1 and Reuters.V2. The Reuters.V1 version contains 10,382 images collected in February 2006 over a period of one week, covering a total of 108 news topics. The Reuters.V2 version contains all the images in Reuters.V1 plus images collected during a week in February 2007, resulting in a total of 20,382 images covering 258 topics. The topics span a wide range of categories including world, sports, entertainment, and politics news stories. Figure 4-6 shows some examples of news topics and corresponding images. Figures 4-7 to 4-12 show an image for each of the 258 topics in the dataset. The number next to the topic label indicates the number of images for that topic. Figure 4-13 shows some images with their corresponding captions, and Figure 4-14 shows some examples of the most frequent words appearing in the captions and corresponding images.

4.5 Experiments on Images with Captions

4.5.1 Data

For this section we used the Reuters.V1 dataset described in Section 4.4. The experiments involved predicting the topic variable y for test images. We reserved 8,000 images as a source of training data, and an additional 1,000 images as a potential source of development data. The remaining 1,382 images were used as a test set. Multiple training sets of different sizes, and for different topics, were created as follows. We created training sets T_{n,y} for

Figure 4-6: Example images from some news topics: German Chancellor, SuperBall, Grammys, Iraq, Golden Globes, Figure Skating and Australian Open


Figure 4-7: This figure shows topics ranked 1-45 in frequency. The number next to the topic label indicates the number of images for that topic.


Figure 4-8: This figure shows topics ranked 46-95 in frequency. The number next to the topic label indicates the number of images for that topic.

Figure 4-9: This figure shows topics ranked 96-145 in frequency. The number next to the topic label indicates the number of images for that topic.

Figure 4-10: This figure shows topics ranked 146-195 in frequency. The number next to the topic label indicates the number of images for that topic.

Figure 4-11: This figure shows topics ranked 196-240 in frequency. The number next to the topic label indicates the number of images for that topic.

Figure 4-12: This figure shows topics ranked 241-248 in frequency. The number next to the topic label indicates the number of images for that topic.

Figure 4-13: Example images with their corresponding captions.


Figure 4-14: This figure shows examples of images that were assigned a given annotation word (i.e. the given word appeared in its corresponding caption), for words: team, president and actress.


n = {1, 2, 4, 8, 16, 32, 64} and y = {1, 2, 3, . . . , 15}, where Tn,y denotes a training set for topic y which has n positive examples from topic y, and 4n negative examples. The 15 topics corresponded to the 15 most frequent topics in the training data. The positive and negative examples were drawn randomly from the training set of size 8,000. We will compare various models by training them on each of the training sets Tn,y, and evaluating the models on the 1,382 test images.

In addition, each of the 8,000 training images had associated captions, which can be used to derive an image representation (see section 4.2.1). Note that we make no use of captions on the test or development data sets. Instead, we will use the 8,000 training images to derive representations that are input to a classifier that uses images alone. In summary, our experimental set-up corresponds to a scenario where we have a small amount of labeled data for a target task (predicting the topic for an image), and a large amount of unlabeled data with associated captions.

4.6 Image Representation

In this chapter we will use x to denote a feature-vector representation that is based on SIFT features for an image x. A SIFT detector (Mutch and Lowe, 2006) is run over the image x, identifying a set of image patches. This detection step is run over the entire set of 10,382 images. k-means clustering is then performed on the image patches, using a SIFT representation of patches as input to the clustering step. In our experiments we chose the number of clusters k to be 1,000. Each image is then represented by a feature vector $x \in \mathbb{R}^{1000}$, where the j-th component of the feature vector, $x_j$, is the number of times a SIFT patch in the j-th cluster is seen in x. Thus x essentially represents a histogram of counts of each cluster. This representation has been used in several other image classification problems; we will refer to it as a SIFT histogram (SH) representation.
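The construction of such a SIFT histogram can be sketched as follows (a minimal sketch, assuming the patch descriptors have already been extracted; the clustering settings below are illustrative, not the thesis's own implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_sift_histograms(patch_descriptors_per_image, k=1000):
    """patch_descriptors_per_image: list of (n_patches_i x 128) SIFT descriptor
    arrays, one per image. Returns an (n_images x k) matrix of cluster counts."""
    all_patches = np.vstack(patch_descriptors_per_image)
    kmeans = KMeans(n_clusters=k).fit(all_patches)   # build the visual dictionary
    histograms = np.zeros((len(patch_descriptors_per_image), k))
    for i, descriptors in enumerate(patch_descriptors_per_image):
        for word in kmeans.predict(descriptors):     # match each patch to a cluster
            histograms[i, word] += 1                 # histogram of cluster counts
    return histograms
```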

4.6.1 The Baseline Model

A baseline model was trained on all training sets Tn,y. In each case the resulting model was tested on the 1,382 test examples. The baseline model consists of a logistic regression

model over the SIFT features: to train the model we used conjugate gradient descent to find the parameters w* which maximize the regularized log-likelihood (see equations 4.1 and 4.2). When calculating equal-error-rate statistics on test data, the value of P(y = +1|x; w*) is calculated for each test image x; this score is then used to rank the test examples. The parameter γ in Eq. 4.1 dictates the amount of regularization used in the model. For the baseline model, we used the development set of 1,000 examples to optimize the value of γ for each training set Tn,y. Note that this will in practice give an upper bound on the performance of the baseline model, as assuming 1,000 development examples is almost certainly unrealistic (particularly considering that the training sets contain at most 320 examples). The values of γ that were tested were $10^k$, for k = −5, −4, . . . , 4.
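As a sketch, this training procedure could look as follows (assuming scipy's conjugate-gradient routine; the exact form of equations 4.1 and 4.2, defined earlier in the chapter, is paraphrased here):

```python
import numpy as np
from scipy.optimize import minimize

def train_logistic(X, y, gamma):
    """X: (n x d) SIFT histograms, y: labels in {+1,-1}, gamma: regularization
    weight. Maximizes the regularized log-likelihood via conjugate gradient."""
    def neg_objective(w):
        margins = y * (X @ w)
        return np.logaddexp(0, -margins).sum() + gamma * w.dot(w)
    return minimize(neg_objective, np.zeros(X.shape[1]), method='CG').x

def score(w, X):
    """P(y=+1|x;w), used to rank test images for equal-error-rate statistics."""
    return 1.0 / (1.0 + np.exp(-(X @ w)))
```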

4.6.2 The Data-SVD Model

As a second baseline, we trained a logistic-regression classifier, but with the original feature vectors x in training and test data replaced by z-dimensional feature vectors v(x) = θx, where θ was derived using PCA. A matrix F of dimension 1,000 × 8,000 was formed by taking the feature vectors x for the 8,000 data points; the projection matrix θ was constructed from the first z eigenvectors of $FF^{\top}$. The PCA model has free parameters z and γ. These were optimized using the method described in section 4.6.5. We call this model the data-SVD model.
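A minimal sketch of this construction (numpy-based; variable names are illustrative):

```python
import numpy as np

def data_svd_projection(F, z):
    """F: (1000 x 8000) matrix whose columns are the training feature vectors.
    Returns theta, the (z x 1000) projection onto the top-z eigenvectors of F F^T."""
    eigvals, eigvecs = np.linalg.eigh(F @ F.T)   # eigendecomposition of F F^T
    order = np.argsort(eigvals)[::-1]            # eigenvalues, largest first
    return eigvecs[:, order[:z]].T               # top-z eigenvectors as rows

# v(x) = theta @ x then replaces the original 1000-dimensional representation
```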

4.6.3 A Model with Predictive Structure

We ran experiments using the structure prediction approach described in section 4.2. We train a logistic-regression classifier on feature vectors v(x) = θx, where θ is derived using the method in section 4.2.1. Using the 8,000 training images, we created 100 auxiliary training sets corresponding to the 100 most frequent content words in the captions.³ Each training set involves prediction of a particular content word. The input to the classifier is the SIFT representation of an image. Next, we trained linear classifiers on each of the 100 auxiliary training sets to induce parameter vectors w1 . . . w100. Each parameter vector is of dimension 1,000; we will use W to refer to the 1,000 × 100 matrix which contains all parameter values. The projection matrix θ consists of the z eigenvectors in $\mathbb{R}^d$ which correspond to the z largest eigenvalues of $WW^{\top}$.

³Content words are defined as any words which do not appear on a "stop" list of common function words.
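A sketch of this construction, reusing the eigendecomposition pattern above (the word-classifier training step is elided; any linear learner producing w1 . . . w100 would do):

```python
import numpy as np

def predictive_structure_projection(word_classifier_weights, z):
    """word_classifier_weights: list of 100 weight vectors of dimension 1000,
    one per auxiliary content-word prediction task."""
    W = np.column_stack(word_classifier_weights)   # 1000 x 100 parameter matrix
    eigvals, eigvecs = np.linalg.eigh(W @ W.T)     # eigenvectors of W W^T
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:z]].T                 # theta: z x 1000, v(x) = theta @ x
```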

4.6.4 The Word-Classifiers Model

As a third baseline, we trained a logistic-regression classifier with the original feature vectors x in training and test data replaced by 100-dimensional feature vectors v(x) = θx, where $\theta = W^{\top}$. This amounts to training the topic classifiers on the outputs of the word classifiers. Notice that this model is similar to a model with predictive structure where z = 100. The word-classifiers model has a free parameter γ. This was optimized using the method described in section 4.6.5. We call this model the word-classifiers model.

4.6.5 Cross-Validation of Parameters

The data-SVD and the predictive structure models have two free parameters: the dimensionality of the projection z, and γ.⁴ A single topic (the 7th most frequent topic in the training data) was used to tune these parameters for the three model types. For each model type the model was trained on all training sets Tn,7 for n = 1, 2, 4, 8, . . . , 64, with values for z taken from the set {2, 5, 10, 20, 30, 40, 100} and values for γ chosen from {0.00001, 0.0001, . . . , 1000}. Define $E^{n}_{z,\gamma}$ to be the equal-error-rate on the development set for topic 7, when trained on the training set Tn,7 using parameters z and γ. We choose the value z* for all experiments on the remaining 14 topics as

$$z^{*} = \arg\min_{z} \sum_{i \in \{1,2,4,\ldots,64\}} \min_{\gamma} E^{i}_{z,\gamma}$$

This corresponds to making a choice of z* that performs well on average across all training set sizes. In addition, when training a model on a training set with i positive examples, we chose $\gamma^{*}_{i} = \arg\min_{\gamma} E^{i}_{z^{*},\gamma}$ as the regularization constant.

⁴We use γ to refer to the parameter γ2 in figure 4-1; γ1 in figure 4-1 was set to 0.1 in all of our experiments.

The motivation for using a single


[Figure: "Reuters V.1 Dataset: 14 Topics". Average equal error rate vs. number of positive training examples (1 to 64) for the Baseline, Word-Classifier, PCA and Structural Learning models.]

Figure 4-15: Equal error rate averaged across topics, with standard deviations calculated from ten runs for each topic. The equal error rates are averaged across 14 topics; the 7th most frequent topic is excluded as this was used for cross-validation (see section 4.6.5).

topic as a validation set is that it is realistic to assume that a fairly substantial validation set (1,000 examples in our case) can be created for one topic; this validation set can then be used to choose values of z* and γ*_i for all remaining topics. The word-classifiers model has one free parameter: the constant γ used in Eq. 4.1. The value for this parameter was also optimized using the development set.
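A sketch of this selection procedure (assuming a hypothetical function eer(z, gamma, n) that trains on Tn,7 with the given parameters and returns the development-set equal error rate):

```python
Z_GRID = [2, 5, 10, 20, 30, 40, 100]
GAMMA_GRID = [10.0 ** k for k in range(-5, 4)]
SIZES = [1, 2, 4, 8, 16, 32, 64]

def select_parameters(eer):
    # z* minimizes the EER summed over training sizes, with gamma tuned per size
    z_star = min(Z_GRID, key=lambda z: sum(min(eer(z, g, n) for g in GAMMA_GRID)
                                           for n in SIZES))
    # gamma*_i is then tuned per training-set size at the chosen z*
    gamma_star = {n: min(GAMMA_GRID, key=lambda g: eer(z_star, g, n)) for n in SIZES}
    return z_star, gamma_star
```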

4.6.6 Results

Figure 4-15 shows the mean equal error rate and standard error over ten runs for the experiments on the Reuters dataset. The equal error rate is one minus the recall obtained when the decision threshold of the classifier is set so that the proportion of false rejections equals the proportion of false acceptances. For example, an equal error rate of 30% means that when the proportion of false rejections is equal to the proportion of false acceptances, 70% of the positive examples are labeled correctly and 30% of the negative examples are misclassified as positive.
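As a sketch, the equal error rate can be computed from ranked classifier scores as follows (an illustrative numpy implementation, not the thesis's own code):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """scores: real-valued classifier scores; labels: array of {+1, -1}.
    Returns the error rate at the threshold where the false rejection rate
    is closest to the false acceptance rate."""
    thresholds = np.sort(np.unique(scores))
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    far = np.array([np.mean(scores[labels == -1] >= t) for t in thresholds])
    i = np.argmin(np.abs(frr - far))   # threshold balancing the two error rates
    return (frr[i] + far[i]) / 2.0
```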

For all training set sizes the structure learning model improves performance over the three other models. The average performance with one positive training example is around 62% with the structure learning method; to achieve similar performance with the word-classifiers model requires around four positive examples, and with the baseline model between four and eight. Similarly, the performance with four positive examples for the structure learning method is around 67%; both the word-classifiers model and the baseline model require between 32 and 64 positive examples to achieve this performance. PCA's performance is lower than the baseline model for all training sizes, and the gap between the two increases with the size of the training set. The performance of the word-classifiers model is better than the baseline for small training sets, but the gap between the two decreases with the size of the training set.

Figures 4-16 and 4-17 show equal error rates for two different topics. The first topic, "Australian Open", is one of the topics that exhibits the most improvement from structure learning. The second topic, "Winter Olympics", is one of the three topics for which structure learning does not improve performance. As can be observed from the Australian Open curves, the use of structure learning features speeds up the generalization of the classifier: the structure learning model trained with only two positive examples performs comparably to the baseline model trained with sixty-four examples. For the Winter Olympics topic the three models perform similarly. At least for a small number of training examples, this topic exhibits a slow learning curve; i.e. there is no significant improvement in performance as we increase the size of the labeled training set. This suggests that this is an inherently more difficult class.

4.7 Chapter Summary

Current methods for learning visual categories work well when a large amount of labeled data is available, but can run into severe difficulties when the number of labeled examples is small. When labeled data is scarce it may be beneficial to use unlabeled data to learn an image representation that is low-dimensional, but nevertheless captures the information required to discriminate between image categories.

[Figure: ROC curves (true positive rate vs. false positive rate) for the "Australian Open" topic, for the Baseline, PCA and Structural models trained with 1, 2, 4 and 64 positive examples.]

Figure 4-16: ROC curves for the "Australian Open" topic.

[Figure: ROC curves (true positive rate vs. false positive rate) for the "Winter Olympics" topic, for the Baseline, PCA and Structural models trained with 1, 4, 8 and 64 positive examples.]

Figure 4-17: ROC curves for the "Winter Olympics" topic.


This chapter described a method for learning representations from large quantities of unlabeled images which have associated captions; the goal is to improve learning in future image classification problems. The method makes use of auxiliary training sets corresponding to different words in the captions, and of structure learning, which learns a manifold in parameter space. Experiments show that our method significantly outperforms (1) a fully-supervised baseline model, (2) a model that ignores the captions and learns a visual representation by performing PCA on the unlabeled images alone, and (3) a model that uses the output of word classifiers trained using captions and unlabeled data. In brief, our results show that when meta-data labels are suitably related to a target task, the structure learning method can discover feature groupings that speed learning of the target task. Our current work concentrates on captions as the source of meta-data, but more generally other types of meta-data could be used. Future work includes exploration of automatic determination of relevance between target and auxiliary tasks, and experimental evaluation of the effectiveness of structure learning from more weakly related auxiliary domains.


Chapter 5

Transfer Learning for Image Classification with Sparse Prototype Representations

The work presented in this chapter was published in Quattoni et al. (2008). Chapter 4 presented a semi-supervised application of a structure learning algorithm. In this chapter we follow a different approach and develop a joint sparsity transfer algorithm for image classification based on a regularization matrix norm derived from simultaneous sparse signal approximation (Tropp, 2006a). This norm is designed to promote joint sparsity across tasks. While chapter 4 presented a semi-supervised application of a transfer algorithm, we now turn our focus to an asymmetric transfer setting. The remaining sections are organized as follows: Section 5.1 gives the motivation for the work presented in this chapter, Section 5.2 describes the joint sparsity transfer algorithm, Section 5.3 presents experiments on an asymmetric transfer image classification task, and Section 5.4 provides a summary.

5.1 Introduction

In this chapter we develop a joint sparsity transfer algorithm that can learn an efficient representation from a set of related tasks and which explicitly takes advantage of unlabeled

data. Our method uses unlabeled data to define a prototype representation based on computing kernel distances to a potentially large set of unlabeled points. However, to train an image classifier in a given domain we might only need to consider distances to a small set of prototypes; if these prototypes were known, we might be able to learn with fewer examples by removing irrelevant prototypes from the feature space. In general we will not know this a priori, but related tasks may share a significant number of such prototypes. Our transfer algorithm identifies the set of prototypes which are jointly most relevant for a given set of tasks, and uses that reduced set of points as the feature space for future related tasks. Our experiments show that using the transferred representation significantly improves learning with small training sets when compared to the original feature space.

One of the advantages of our transfer learning method is that the prototype representation is defined using an arbitrary kernel function (Balcan et al., 2004). Recent progress has shown that visual category recognition can improve with the use of kernels that are optimized to particular tasks (Varma and Ray, 2007). We discover an optimal subset of relevant prototypes with a jointly regularized optimization that minimizes the total number of reference prototypes involved in the approximation. Our joint regularization exploits a norm derived from simultaneous sparse signal approximation methods (Tropp, 2006a), leading to an optimization problem that can be expressed as a linear program.

Our approach builds on the work of Balcan et al. (2004), who proposed the use of a representation based on kernel distances to unlabeled datapoints, and the work of Tropp (2006a) on simultaneous sparse approximation. The latter problem involves approximating a set of signals by a linear combination of elementary signals while at the same time minimizing the total number of elementary signals involved in the approximation. Tropp proposed a joint optimization with a shared regularization norm over the coefficients of each signal and gave theoretical guarantees for this approach.

We evaluate our method on a news-topic prediction task, where the goal is to predict whether an image belongs to a particular news-topic. Our results show that learning a representation from previous topics does improve future performance when supervised training data are limited.

5.2 Learning a sparse prototype representation from unlabeled data and related tasks

We now describe our sparse prototype learning algorithm for learning a representation from a set of unlabeled data points U = {x1, x2, . . . , xq} and a collection of training sets of related problems D = {T1, . . . , Tm}, where $T_k = \{(x^k_1, y^k_1), \ldots, (x^k_{n_k}, y^k_{n_k})\}$. In all cases, x ∈ X (for example, X = $\mathbb{R}^d$) and y ∈ {+1, −1}. In the following subsections we describe the three main steps of our algorithm. In the first step, we compute a prototype representation using the unlabeled dataset. In the second step, we use data from related problems to select a small subset of prototypes that is most discriminative for the related problems. Finally, in the third step we create a new representation based on kernel distances to the selected prototypes. We will use the sparse prototype representation to train a classifier for a target problem. Algorithm 1 provides pseudo-code for the algorithm. The next three sub-sections describe the three steps of the algorithm in detail.

5.2.1 Computing the prototype representation

Given an unlabeled data set U = {x1, x2, . . . , xq} and a kernel function k : X × X → R, the first step of our algorithm computes a prototype representation based on kernel distances to the unlabeled data points in U. We create the prototype representation by first computing the kernel matrix K of all points in U, i.e. Kij = k(xi, xj) for xi and xj in U. We then create a projection matrix θ formed by taking all the eigenvectors of K corresponding to non-zero eigenvalues (the eigenvectors are obtained by performing SVD on K). The new representation is then given by:

$$z(x) = \theta^{\top} \varphi(x), \qquad (5.1)$$

Input 1: Unlabeled dataset U = {x1, x2, . . . , xq} for xi ∈ X (e.g. X = $\mathbb{R}^d$)
Input 2: Collection of related problems D = {T1, . . . , Tm}, where $T_k = \{(x^k_1, y^k_1), (x^k_2, y^k_2), \ldots, (x^k_{n_k}, y^k_{n_k})\}$ for x ∈ X and y ∈ {+1, −1}
Input 3: Kernel function k : X × X → R
Input 4: Threshold ρ
Input 5: Regularization constants λk, for k = 1 : m

Step 1: Compute the prototype representation
• Compute the kernel matrix for all unlabeled points: Kij = k(xi, xj) for xi ∈ U, xj ∈ U
• Compute the eigenvectors of K by performing SVD: compute a projection matrix θ of dimension q × q by taking the eigenvectors of K, where each column of θ corresponds to an eigenvector.
• Project all points $x^k_i$ in D to the prototype space: $z(x^k_i) = \theta^{\top}\varphi(x^k_i)$, where $\varphi(x) = [k(x, x_1), \ldots, k(x, x_q)]^{\top}$, xi ∈ U

Step 2: Discover relevant prototypes by joint sparse approximation
Let W be a q × m matrix where Wj,k corresponds to the j-th coefficient of the k-th problem.
• Choose the optimal matrix W* to be:

$$\min_{W,\varepsilon} \; \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n_k} \varepsilon^k_i + \sum_{j=1}^{q} \max_k |W_{j,k}|$$
$$\text{s.t. for } k = 1 : m \text{ and } i = 1 : n_k: \quad y^k_i\, w_k^{\top} z(x^k_i) \ge 1 - \varepsilon^k_i\,, \quad \varepsilon^k_i \ge 0$$

where wk is the k-th column of W, corresponding to the parameters for problem k.

Step 3: Compute the relevant prototype representation
• Define the set of relevant prototypes to be: $R = \{r : \max_k |W^{*}_{rk}| > \rho\}$
• Create a projection matrix B by taking all the columns of θ corresponding to the indexes in R. B is then a q × h matrix, where h = |R|.
• Return the representation given by: $v(x) = B^{\top}\varphi(x)$
Output: The function $v(x) : \mathbb{R}^q \to \mathbb{R}^h$

Algorithm 1: The sparse prototype representation learning algorithm.

where $\varphi(x) = [k(x, x_1), \ldots, k(x, x_q)]^{\top}$ and xi ∈ U. We will refer to the columns of θ as prototypes. This representation was first introduced by Balcan et al., who proved it has important theoretical properties. In particular, given a target binary classification problem and a kernel function, they showed that if the classes can be separated with a large margin in the induced kernel space then they can be separated with a similar margin in the prototype space. In other words, the expressiveness of the prototype representation is similar to that of the induced kernel space. By means of this technique, our joint sparse optimization can take advantage of a kernel function without having to be explicitly kernelized.

Another possible method for learning a representation from the unlabeled data would be to create a q × h projection matrix L by taking the top h eigenvectors of K and defining the new representation g(x) = L⊤ϕ(x); we call this approach the low rank technique. The method we develop in this chapter differs significantly from the low rank approach in that we use training data from related problems to select discriminative prototypes, as we describe in the next step. In the experimental section we show the advantage of our approach compared to the low rank method.
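A sketch of this first step (numpy-based; the low rank variant simply truncates to the top h eigenvectors):

```python
import numpy as np

def prototype_representation(K):
    """K: (q x q) kernel matrix over the unlabeled set U. Returns theta and a
    function mapping phi(x) to the prototype space z(x) = theta' phi(x)."""
    eigvals, eigvecs = np.linalg.eigh(K)
    theta = eigvecs[:, eigvals > 1e-12]      # eigenvectors with non-zero eigenvalues
    return theta, lambda phi_x: theta.T @ phi_x

def phi(x, U, kernel):
    """phi(x) = [k(x, x1), ..., k(x, xq)] for the unlabeled points x1..xq in U."""
    return np.array([kernel(x, u) for u in U])
```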

5.2.2 Discovering relevant prototypes by joint sparse approximation

In the second step of our algorithm we use a collection of training sets from related problems, D = {T1, . . . , Tm}, where $T_k = \{(x^k_1, y^k_1), \ldots, (x^k_{n_k}, y^k_{n_k})\}$, to find a subset of discriminative prototypes. Our method is based on the search for a sparse representation in the prototype space that is jointly optimal for all problems in D. Consider first the case of learning a single sparse classifier on the prototype space of the form:

$$f(x) = w^{\top} z(x), \qquad (5.2)$$

where $z(x) = \theta^{\top}\varphi(x)$ is the representation described in step 1 of the algorithm. A sparse model will have a small number of prototypes with non-zero coefficients. Given a training set with n examples, a natural choice for parameter estimation in such a setting would be to take the optimal parameters w* to be:

$$\min_{w} \; \lambda \sum_{i=1}^{n} \mathrm{Loss}(f(x_i), y_i) + \sum_{j=1}^{q} |w_j|. \qquad (5.3)$$

The left term of Equation (5.3) measures the error that the classifier incurs on training examples, measured in terms of a loss function Loss. In this chapter we will use the hinge loss, given by Loss(f(x), y) = max(0, 1 − y f(x)). The right hand term of Equation (5.3) is an l1 norm on the coefficient vector, which promotes sparsity. In the context of regression, Donoho (2004) has proven that the solution with smallest l1 norm is also the sparsest solution, i.e. the solution with the least number of non-zero coefficients. The constant λ dictates the trade-off between sparsity and approximation error on the data.

For transfer learning, our goal is to find the most discriminative prototypes for the problems in D, i.e. find a subset of prototypes R such that each problem in D can be well approximated by a sparse function whose non-zero coefficients correspond to prototypes in R. Analogous to the single sparse approximation problem, we will learn jointly sparse classifiers on the prototype space using the training sets in D. The resulting classifiers will share a significant number of non-zero coefficients, i.e. active prototypes. Let us define a q × m coefficient matrix W, where Wjk corresponds to the j-th coefficient of the k-th problem. In this matrix, the k-th column of W is the set of coefficients for problem k, which we will refer to as wk, while the j-th row corresponds to the coefficients of prototype j across the m problems. The m classifiers represented in this matrix correspond to:

$$f_k(x) = w_k^{\top} z(x). \qquad (5.4)$$

It is easy to see that the number of non-zero rows of W corresponds to the total number of prototypes used by any of the m classifiers. This suggests that a natural way of posing the joint sparse optimization problem would be to choose the optimal coefficient matrix W ∗ to be:

$$\min_{W} \; \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n_k} \mathrm{Loss}(y^k_i, f_k(x^k_i)) + \ell_{r0}(W) \qquad (5.5)$$

where:

$$\ell_{r0}(W) = |\{\, j \in 1 : q \;|\; \max_k |W_{j,k}| \neq 0 \,\}| \qquad (5.6)$$

is a pseudo-norm that counts the number of non-zero rows in W.¹ As in the single sparse approximation problem, the two terms in Equation (5.5) balance the approximation error against some notion of sparsity. Here, the left hand term minimizes a weighted sum of the losses incurred by each classifier on their corresponding training sets, where λk weights the loss for the k-th problem. The right hand term minimizes the number of prototypes that have a non-zero coefficient for some of the related problems. Due to the presence of the ℓr0 pseudo-norm in the objective, solving (5.5) might result in a hard combinatorial problem. Instead of solving it directly we use a convex relaxation of the ℓr0 pseudo-norm suggested in the literature of simultaneous sparse approximation (Tropp, 2006a), the l1,∞ norm, which takes the following form:

$$\sum_{j=1}^{q} \max_{k} |W_{j,k}| \qquad (5.7)$$

Using the l1,∞ norm we can rewrite Equation (5.5) as:

$$\min_{W} \; \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n_k} \mathrm{Loss}(y^k_i, f_k(x^k_i)) + \sum_{j=1}^{q} \max_{k} |W_{j,k}| \qquad (5.8)$$

The right hand term of Equation (5.8) promotes joint sparsity by combining an l1 norm and an l∞ norm on the coefficient matrix. The l1 norm operates on a vector formed by the maximum absolute values of the coefficients of each prototype across problems, encouraging most of these values to be 0. On the other hand, the l∞ norm on each row promotes non-sparsity among the coefficients of a prototype: as long as the maximum absolute value of a row is not affected, no penalty is incurred for increasing the value of a row's coefficient. As a consequence, only a few prototypes will be involved in the approximation, but the prototypes involved will contribute to solving as many problems as possible. When using the hinge loss, the optimization problem in Equation (5.8) can be formulated as a linear program:

¹The number of non-zero rows is the number of rows for which at least one element is different from 0.

$$\min_{W, \varepsilon, t} \; \sum_{k=1}^{m} \lambda_k \sum_{i=1}^{n_k} \varepsilon^k_i + \sum_{j=1}^{q} t_j \qquad (5.9)$$

such that for j = 1 : q and k = 1 : m

$$-t_j \le W_{j,k} \le t_j \qquad (5.10)$$

and for k = 1 : m and i = 1 : nk

$$y^k_i\, w_k^{\top} z(x^k_i) \ge 1 - \varepsilon^k_i \qquad (5.11)$$

$$\varepsilon^k_i \ge 0 \qquad (5.12)$$

The constraints in Equation (5.10) bound the coefficients for the j-th prototype across the m problems to lie in the range [−tj , tj ]. The constraints in Equations (5.11) and (5.12) impose the standard slack variable constraints on the training samples of each problem.
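Since the objective in Equation (5.8) is convex, the same problem can also be handed to an off-the-shelf convex solver. The following is a minimal sketch using cvxpy (an assumption for illustration; the thesis solves the equivalent linear program instead):

```python
import cvxpy as cp

def joint_sparse_approximation(Z_list, y_list, lams):
    """Z_list[k]: (n_k x q) task-k data in the prototype space; y_list[k]: (n_k,)
    labels in {+1,-1}; lams[k]: loss weight for task k. Returns W* (q x m)."""
    q, m = Z_list[0].shape[1], len(Z_list)
    W = cp.Variable((q, m))
    hinge = sum(lams[k] * cp.sum(cp.pos(1 - cp.multiply(y_list[k],
                                                        Z_list[k] @ W[:, k])))
                for k in range(m))
    l1_inf = cp.sum(cp.max(cp.abs(W), axis=1))   # sum of row-wise maxima
    cp.Problem(cp.Minimize(hinge + l1_inf)).solve()
    return W.value
```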

5.2.3 Computing the relevant prototype representation In the last step of our algorithm we take the optimal coefficient matrix W ∗ computed in Equation (5.9) of Step 2 and a threshold ρ and create a new representation based on kernel distances to a subset of relevant prototypes. We define the set of relevant prototypes to be: ∗ R = {r : max |Wrk | > ρ}. k

(5.13)

We construct a new projection matrix B by taking all the columns of θ corresponding to indices in R, where θ is the matrix computed in the first step of our algorithm. B is then a q × h matrix, where h = |R|. The new sparse prototype representation is given by:

$$v(x) = B^{\top} \varphi(x). \qquad (5.14)$$

When given a new target problem we project every example in the training and test set using v(x). We could potentially train any type of classifier on the new space; in our experiments we chose to train a linear SVM.
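A sketch of this last step, continuing the hypothetical names from the sketches above (theta from Step 1 and the learned coefficient matrix W from Step 2):

```python
import numpy as np

def relevant_prototype_map(theta, W, rho):
    """Select relevant prototypes (Eq. 5.13) and build v(x) = B' phi(x) (Eq. 5.14)."""
    R = np.max(np.abs(W), axis=1) > rho    # rows whose max |W*_rk| exceeds rho
    B = theta[:, R]                        # keep the corresponding eigenvectors
    return lambda phi_x: B.T @ phi_x
```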

Figure 5-1: Example images. From top to bottom, left to right: SuperBowl, Sharon, Danish Cartoons, Australian Open, Trapped Coal Miners, Golden Globes, Grammys, Figure Skating, Academy Awards and Iraq.

5.3 Experiments

For this section we used the dataset Reuters.V1 described in Section 4.4. Around 40 percent of the images belonged to one of the ten most frequent topics, which we used as the basis for our experiments: SuperBowl, Golden Globes, Danish Cartoons, Grammys, Australian Open, Ariel Sharon, Trapped Coal Miners, Figure Skating, Academy Awards and Iraq. Figure 5-1 shows some example images from each of these categories. The experiments involved the binary prediction of whether an image belonged to a particular news topic. The data was partitioned into unlabeled, training, validation and testing sets: we reserved 2,000 images as a source of unlabeled data, 1,000 images as a potential source of validation data and 5,000 images as a source of supervised training data. The remaining 2,382 images were used for testing. For each of the 10 most frequent topics we created multiple training sets of different sizes Ty,n for n = 1, 5, 10, 15, . . . , 50


and y = 1, 2, . . . , 10, where Ty,n denotes a training set for topic y which has n positive examples from topic y and 4n negative examples. The positive and negative examples were drawn at random from the pool of supervised training data of size 5,000. The total number of positive images in this pool for the ten top topics was 341, 321, 178, 209, 196, 167, 170, 146, 137 and 125 respectively.

We consider a transfer learning setting where we have access to unlabeled data U and a collection of training sets D from m related tasks. Our goal is to learn a representation from D and U and use it to train a classifier for the target task. In our experimental setting we took the 10 most frequent topics and held out one of them to be the target task; the other nine topics were used to learn the sparse prototype representation. We did this for each of the 10 topics in turn. The training set for related topic j in the collection of related training sets D was created by sampling all nj positive examples of topic j and 2nj negative examples from the pool of supervised training data. We test all models using the 2,382 held out test images, but we remove images belonging to topics in D. We did this to ensure that the improvement in performance was not just the direct consequence of better recognition of the topics in D; in practice we observed that this did not make a significant difference to the experimental results.

Our notion of relatedness assumes that there is a small subset of relevant prototypes such that all related problems can be well approximated using elements of that subset; the size of the relevant subset defines the strength of the relationship. In this experiment we deal with problems that are intuitively only weakly related, and yet we observe a significant improvement in performance when only a few examples are available for training. In the future we plan to investigate ways of selecting sets of more strongly related problems.

5.3.1 Baseline Representation

For all the experiments we used an image representation based on a bag of words representation that combined color, texture and raw local image information. For every image in the dataset we sampled image patches on a fixed grid and computed three types of features for each image patch: color features based on HSV histograms, texture features consisting of mean responses of Gabor filters at different scales and orientations, and 'raw' features made of normalized pixel values. For each feature type we created a visual dictionary by performing vector quantization on the sampled image patches; each dictionary contained 2,000 visual words. To compute the representation for a new image xi we sample patches on a fixed grid to obtain a set of patches pi = {xi1, xi2, . . . , xih} and then match each patch to its closest entry in the corresponding dictionary. The final baseline representation for an image is given by the 6,000 dimensional vector: [[cw1, . . . , cw2000], [tw1, . . . , tw2000], [rw1, . . . , rw2000]], where cwi is the number of times that an image patch in pi was matched to the i-th color word, twi the number of times that an image patch in pi was matched to the i-th texture word, and rwi the number of times that an image patch in pi was matched to the i-th raw word.

5.3.2 Raw Feature Baseline Model

The raw feature baseline model (RFB) consists of training a linear SVM classifier over the bag of words representation by choosing the optimal parameters to be:

$$w^{*} = \arg\min_{w} \; \lambda \sum_{i} \mathrm{Loss}(f(x_i), y_i) + \frac{1}{2}\|w\|^2_2 \qquad (5.15)$$

where $f(x) = w^{\top}x$ and Loss is the hinge loss described in Section 5.2.2. We conducted preliminary experiments using the validation data from topic 1 and found 0.01 to be the value of λ resulting in the best equal error rate for all training sizes (where we tried the values {0.01, 0.1, 1, 10, 100}); we also noticed that on the validation set the baseline model was not very sensitive to this parameter. We set the constant λ for this and all other models to 0.01. A baseline model was trained on all training sets Ty,n of the 10 most frequent topics and tested on the 2,382 test images. As explained in the previous section, we removed from the testing set images that belonged to any of the other nine most frequent topics.


5.3.3 Low Rank Baseline Model

As a second baseline (LRB), we trained a linear SVM classifier but with the baseline feature vectors x in training and testing replaced by h-dimensional feature vectors $g(x) = L^{\top}\varphi(x)$. L is the matrix described in Section 5.2.1, created by taking the top h eigenvectors of K, where K is the kernel matrix over unlabeled data points. We present results for different values of h = {50, 100, 200}.

For all experiments in this section we used an RBF kernel over the bag of words representation: $k(x_i, x_j) = \exp(-\beta \|x_i - x_j\|^2_2)$. In a preliminary stage we tried a range of values β = {0.003, 0.03, 0.3} on the unlabeled data; 0.03 was the value that resulted in a non-degenerate kernel (i.e. neither a diagonal kernel nor a kernel made of ones). The value of β was then fixed to 0.03 for all experiments.

5.3.4 The Sparse Prototype Transfer Model

We ran experiments using the sparse prototype transfer learning (SPT) approach described in Section 5.2.3. For each of the ten topics we train a linear SVM on feature vectors v(x) obtained by running the sparse prototype transfer learning algorithm on the training sets of the remaining nine topics. For a target held-out topic j we use the 2,000 unlabeled points in U and a collection of training sets from related problems, D = {T1, . . . , Tj−1, Tj+1, . . .}, to compute the sparse prototype representation v(x) based on kernel distances to relevant prototypes. We report results for different values of the threshold ρ; in practice ρ could be validated using leave-one-out cross-validation.

5.3.5 Results

For all experiments we report the mean equal error rate and the standard error of the mean. Both measures were computed over nine runs of the experiments, where each run consisted of randomly selecting a training set from the pool of supervised training data. The equal error rate is one minus the recall obtained when the decision threshold of the classifier is set so that the proportion of false rejections equals the proportion of false acceptances. For example, an equal error rate of 30 percent means that when the proportion of false rejections is equal to the proportion of false acceptances, 70 percent of the positive examples are labeled correctly and 30 percent of the negative examples are misclassified as positive.

[Figure: "Results for all Models". Equal error rate vs. number of positive training examples (1 to 50) for RFB, SPT 1-4 and LRB 50/100/200.]

Figure 5-2: Mean equal error rate over 10 topics for RFB, LRB (for h = {50, 100, 200}, see section 5.2.1 for details) and SPT (for ρ = {1, 2, 3, 4}, see section 5.2.3 for details).

Figure 5-2 shows the mean equal error rate averaged over the ten most frequent topics for the RFB, LRB and SPT models. As we can observe from this figure, the low rank approach fails to produce a useful representation for all choices of h. In contrast, our sparse transfer approach produces a representation that is useful when training classifiers with a small number of training examples (i.e. fewer than 10 positive examples); the improvement is most significant for ρ ≥ 3. For larger training sets the RFB baseline gives on average better performance than the SPT model. We speculate that when larger training sets are available the sparse representation needs to be combined with the raw feature representation; in the future we plan to investigate methods for fusing both representations. However, we would like to emphasize that there exist important applications for which only very few examples might be available for training.


[Figure: "Average equal error rate for models trained with 5 examples". Equal error rate per topic for RFB and SPT 3.]

Figure 5-3: Mean equal error rate per topic for classifiers trained with five positive examples, for the RFB model and the SPT model with ρ = 3 (see Section 5.2.3 for details). SB: SuperBowl; GG: Golden Globes; DC: Danish Cartoons; Gr: Grammys; AO: Australian Open; Sh: Sharon; TC: Trapped Coal Miners; FS: Figure Skating; AA: Academy Awards; Ir: Iraq.

For example, when building a classifier for a user-defined category it would be unrealistic to expect the user to provide more than a handful of positive examples.

Figure 5-3 shows mean equal error rates for each topic when trained with 5 positive examples, for the baseline model and the transfer learning model with ρ = 3. As we can see from this figure, the sparse prototype transfer learning method significantly improves performance for 8 out of the 10 topics.

Figure 5-4 shows learning curves for the baseline and sparse transfer learning models for three different topics. The first topic, Golden Globes, is one of the topics that shows the most improvement from transfer learning, exhibiting significantly better performance across all training sizes. The second topic, Academy Awards, shows a typical learning curve for the sparse prototype transfer learning algorithm, where we observe a significant improvement in performance when a few examples are available for training. Finally, the third topic, Super Bowl, is the topic for which the sparse prototype transfer algorithm


[Figure: Learning curves (equal error rate vs. number of positive training examples, 1 to 50) for the Golden Globes, Academy Awards and Super Bowl topics, for RFB and SPT 3.]

Figure 5-4: Learning curves for the Golden Globes, Academy Awards and Super Bowl topics respectively, for the RFB model and the SPT model with ρ = 3 (see Section 5.2.3 for details about the parameter ρ).

results in the worst performance. We speculate that this topic might not be visually related to any of the other topics used for transfer. We have also noticed that this is one of the most visually heterogeneous topics, since it contains images from the football field, a press conference and after-game celebrations.

5.4 Chapter Summary

To learn a new visual category from few examples, prior knowledge from unlabeled data as well as previous related categories may be useful. In this chapter we presented a transfer learning algorithm which exploits available unlabeled data and an arbitrary kernel function by first forming a representation based on kernel distances to a large set of unlabeled data points. To transfer knowledge from previous related problems we observe that a category might be learnable using only a small subset of reference prototypes.

Related problems may share a significant number of relevant prototypes; we find such a concise representation by performing a joint loss minimization over the training sets of related problems with a shared regularization penalty that minimizes the total number of prototypes involved in the approximation. This optimization problem can be formulated as a linear program that can be solved with an off-the-shelf package.

We conducted experiments on a news-topic prediction task where the goal is to predict whether an image belongs to a particular news topic. Our results show that when only a few examples are available for training a target topic, leveraging knowledge learnt from other topics can significantly improve performance.

The LP formulation presented in this chapter has two main limitations. The first limitation is that while it is feasible for small problems, it does not scale (both in terms of time and memory) to problems with a large number of variables.² The second limitation is that the linear program formulation cannot be easily extended to handle other losses. Chapter 6 addresses these problems by developing a general and efficient algorithm for l1,∞ regularization.

²To give a concrete example, for a problem with 100,000 training samples and 6,000 dimensions the number of non-zero elements of the LP matrix would be around $10^9$.


Chapter 6

An Efficient Projection for l1,∞ Regularization

The work presented in this chapter was published in Quattoni et al. (2009). The LP formulation presented in chapter 5 had the limitation that it could not scale (both in terms of time and memory) to problems with a large number of examples and dimensions. To give a concrete example, for a problem with 100 tasks, 100,000 training samples and 6,000 dimensions, we will have a total of a = 700,000 variables and b = 1,300,000 constraints, and an LP matrix with around $10^9$ non-zero entries. To illustrate the scalability of general LP solvers, Figure 6-1 shows time performance as a function of the number of non-zero entries of the LP matrix for a standard benchmark.¹ The results shown correspond to the best commercial solver for this benchmark. As we can see from this figure, it is very difficult to use a general LP solver for large-scale problems. Another important limitation of general LP solvers is that their memory requirements are in the order of O(min(a, b)²), making them impractical for problems involving large numbers of variables and constraints (Shalev-Shwartz et al., 2007). In addition, the LP formulation is specific to the hinge loss and cannot be easily extended to handle arbitrary loss functions. In this chapter we address these limitations and develop a general and efficient algorithm for l1,∞ regularization based on formulating the problem as a constrained convex optimization problem for which we derive a projected gradient method.

¹http://plato.asu.edu/bench.html


[Figure: "LP solver Benchmark". Time in CPU seconds vs. number of non-zero entries of the LP matrix (x 10⁵).]

Figure 6-1: Time in CPU seconds as a function of the number of non-zero elements of the LP matrix, for the best commercial LP solver (CPLEX: http://www.cplex.com/) on a set of standard benchmarks.

The remaining sections are organized as follows: Section 6.1 gives some motivation and frames our optimization algorithm in the larger context of projected gradient methods for minimizing large-scale regularized objectives. Section 6.2 presents a general constrained convex optimization formulation for l1,∞ regularized models, Section 6.2.2 shows multitask learning as an instance of the general formulation, Section 6.2.3 reviews the projected gradient method which is the basis for our algorithm, Section 6.3 presents an algorithm to compute efficient projections onto the l1,∞ ball, Section 6.4 reviews related optimization methods for l1,∞ regularization, Section 6.5 shows results of applying our algorithm to a synthetic dataset, Section 6.6 presents experiments on a symmetric transfer learning image annotation task, and Section 6.7 presents results on a scene recognition dataset. Finally, Section 6.8 summarizes the results of this chapter.

6.1 Introduction

Learning algorithms based on l1 regularized loss functions have had a relatively long history in machine learning, covering a wide range of applications such as sparse sensing (Donoho, 2004), l1-logistic regression (Ng, 2004), and structure learning of Markov networks (Lee


et al., 2006). A well known property of l1 regularized models is their ability to recover sparse solutions. Because of this they are suitable for applications where discovering significant features is of value and where computing features is expensive. In addition, it has been shown that in some cases l1 regularization can lead to sample complexity bounds that are logarithmic in the number of input dimensions, making it suitable for learning in high dimensional spaces (Ng, 2004).

Analogous to the use of the l1 norm for single task sparse approximation, chapter 5 proposed the use of an l1,∞ norm for enforcing joint sparsity. Recall from chapter 5 that the l1,∞ norm is a matrix norm that penalizes the sum of maximum absolute values of each row. As we have shown in the previous chapter, for a multitask application we can apply the l1,∞ regularization penalty to a parameter matrix W = [w1, . . . , wm], where wi is a parameter vector for the i-th task. In this case the l1,∞ regularizer is used to promote feature sharing across tasks and discover solutions where only a few features are non-zero in any of the m tasks (i.e. jointly sparse solutions). Other applications of l1,∞ regularization are simultaneous sparse signal approximation (Tropp, 2006a; Turlach et al., 2005) and structure learning of Markov networks (Schmidt et al., 2008).

Finding jointly sparse solutions is important when we wish to control the run-time complexity of evaluating a set of classifiers. Consider a multitask problem with m tasks and a dual representation. At test time the cost of evaluating the m classifiers will be dominated by computing inner products between the test point and the support vectors of each task. Clearly, to control the run-time complexity we need to ensure that there will be a small number of shared support vectors.

In this chapter we present an efficient projected subgradient method for optimization of l1,∞ regularized convex objectives, which we formulate as a constrained convex optimization problem. Projected gradient methods iterate between performing unconstrained gradient-based updates followed by projections to the feasible set, which in our case is an l1,∞ ball. These methods have been shown to scale well and have been proposed for solving constrained optimization problems involving large numbers of variables. For example,

Shalev-Shwartz et al. (2007) developed a projected gradient method for l2 regularization and Duchi et al. (2008) proposed an analogous algorithm for l1. However, the problem of finding an efficient projection method for l1,∞ constraints remains open. The main challenge in developing a projected gradient algorithm for l1,∞ constraints resides in being able to efficiently compute Euclidean projections onto the l1,∞ ball. We show that this can be done in O(n log n) time and O(n) memory, where n is the number of parameters of our model.

We apply our algorithm to a multitask image annotation problem where the goal is to predict keywords for a given image. We show that l1,∞ regularization performs significantly better than both independent l2 and independent l1 regularization. Furthermore, we show that l1,∞ is able to find jointly sparse solutions (i.e. parameter matrices with few non-zero rows).

6.2 A projected gradient method for l1,∞ regularization

In this section we start by describing a convex constrained optimization formulation of l1,∞ regularization, followed by a concrete application to multitask learning. We finish by introducing the projected gradient approach that we will use to solve the problem.

6.2.1 Constrained Convex Optimization Formulation

Assume we have a dataset D = (z1, z2, . . . , zn) with points z belonging to some set Z, and a d × m parameter matrix W = [w1, . . . , wm], where $w_k \in \mathbb{R}^d$ for k ∈ {1, . . . , m} is the k-th column of W. For example, in a multitask setting wk would be the parameters for the k-th task. We also have a convex loss function L(z, W) that measures the loss incurred by W on a training sample z. Let us now define the l1,∞ norm:

$$\|W\|_{1,\infty} = \sum_{j=1}^{d} \max_{k} |W_{j,k}| \qquad (6.1)$$
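In code, the norm is simply the sum of row-wise maxima of absolute values (a two-line numpy sketch):

```python
import numpy as np

def l1_inf_norm(W):
    """||W||_{1,inf}: sum over rows of the maximum absolute value in each row."""
    return np.sum(np.max(np.abs(W), axis=1))
```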

When used as a regularization norm, l1,∞ induces solutions where only a few rows contain non-zero values. In fact, Tropp (2006a) showed that under certain conditions the l1,∞ regularization norm is a convex relaxation of a pseudo-norm that counts the number of

non-zero rows of W . One possibility for defining the l1,∞ regularization problem is to set it as a soft constraint:

$$\min_{W} \; \sum_{i=1}^{n} \mathrm{Loss}(z_i, W) + \lambda \|W\|_{1,\infty} \qquad (6.2)$$

Here λ is a parameter that captures the trade-off between error and sparsity. Another natural way of formulating the regularization problem is to set it as a constrained convex optimization:

$$\min_{W} \; \sum_{i=1}^{n} \mathrm{Loss}(z_i, W) \quad \text{s.t.} \quad \|W\|_{1,\infty} \le C \qquad (6.3)$$

In this case C is a bound on ||W ||1,∞ and serves a role analogous to that of λ in the previous formulation. In this chapter we concentrate on this latter formulation.

6.2.2 An Application: Multitask Learning

To give a concrete application, let us describe the multitask joint regularization setting. The goal here is to train m jointly-sparse linear classifiers, one for each task. By jointly sparse we mean that we wish only a few features to be non-zero in any of the m problems. We can formulate this problem as an l1,∞ regularized objective. Following the notation from the previous section, Z is the set of tuples (xi, yi, li) for i = 1 . . . n, where each $x_i \in \mathbb{R}^d$ is a feature vector, li ∈ {1, . . . , m} is a label specifying which of the m tasks the example corresponds to, and yi ∈ {+1, −1} is the label for the example. Equivalently, using the notation from chapter 5, we could assume a collection of datasets D = {T1, T2, . . . , Tm}, where $T_k = \{(x^k_1, y^k_1), (x^k_2, y^k_2), \ldots, (x^k_{n_k}, y^k_{n_k})\}$. Assume that we wish to learn m linear classifiers of the form:

$$f_k(x) = w_k \cdot x \qquad (6.4)$$

and let W = [w1, w2, . . . , wm] be a d × m matrix where Wj,k corresponds to the j-th parameter of the k-th problem. So in the ideal case we would like a solution matrix W with a few non-zero rows (i.e. a few active features). In order to assess the classification

performance, multiple classification losses could be used; in this chapter we used the hinge loss:

$$\mathrm{Loss}_H(z, W) = \max(0,\, 1 - f_k(x)\, y) \qquad (6.5)$$

In this formulation, the hinge loss encourages correct classification of the points and the l1,∞ norm is similar to l1 regularization, but it encourages joint sparsity across the different tasks.

6.2.3 A Projected Gradient Method

Our algorithm for optimizing equation (6.3) is based on the projected subgradient method for minimizing a convex function F(W) subject to convex constraints of the form W ∈ Ω, where Ω is a convex set (Bertsekas, 1999). In our case F(W) is some convex loss function, W is a parameter matrix and Ω is the set of all matrices with ||W||1,∞ ≤ C. A projected subgradient algorithm works by generating a sequence of solutions $W^t$ via $W^{t+1} = P_{\Omega}(W^t - \eta \nabla^t)$. Here $\nabla^t$ is a subgradient of F at $W^t$ and $P_{\Omega}(W)$ is the Euclidean projection of W onto Ω, given by:

$$\min_{W' \in \Omega} \|W' - W\|_{2}^{2} = \min_{W' \in \Omega} \sum_{j,k} (W'_{j,k} - W_{j,k})^{2} \qquad (6.6)$$
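A minimal sketch of the resulting update loop (project is the l1,∞ projection developed in section 6.3; eta0 and the subgradient routine are assumed to be supplied by the caller):

```python
import numpy as np

def projected_subgradient(subgradient, W0, C, eta0, n_iters, project):
    """Minimize F subject to ||W||_{1,inf} <= C. subgradient(W) returns a
    subgradient of F at W; project(W, C) is the Euclidean projection onto
    the l1,inf ball of radius C (section 6.3)."""
    W = project(W0, C)
    for t in range(1, n_iters + 1):
        eta = eta0 / np.sqrt(t)                    # decaying rate eta_0 / sqrt(t)
        W = project(W - eta * subgradient(W), C)   # gradient step, then project
    return W
```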

Finally, η is the learning rate that controls the amount by which the solution changes at each iteration. Standard results in the optimization literature (Bertsekas, 1999) show that when $\eta = \eta_0/\sqrt{t}$ and F(W) is a convex Lipschitz function, the projected algorithm will converge to an ε-accurate solution in O(1/ε²) iterations.

For the hinge loss case described in the previous section, computing the subgradient of the objective of equation (6.3) is straightforward. The subgradient for the parameters of each task can be computed independently of the other tasks, and for the k-th task it is given by summing examples of the task whose margin is less than one:

$$\nabla^{t}_{k} = \sum_{i \,:\, l_i = k,\; f_k(x_i) y_i < 1} y_i x_i \qquad (6.7)$$

6.3 Efficient Projection onto the l1,∞ Ball

6.3.1 Characterization of the solution

The remaining ingredient is the Euclidean projection (6.6) onto the l1,∞ ball of radius C. For ease of exposition we assume for now that the matrix A being projected is non-negative; the general case is handled at the end of this section. The projection can be written as the constrained problem:

$$P_{1,\infty}: \quad \min_{B,\mu} \; \frac{1}{2} \sum_{i,j} (A_{i,j} - B_{i,j})^{2} \qquad (6.8)$$
$$\text{s.t.} \quad B_{i,j} \le \mu_i \;\; \forall i,j \qquad (6.9)$$
$$\sum_i \mu_i \le C \qquad (6.10)$$
$$B_{i,j} \ge 0 \;\; \forall i,j \qquad (6.11)$$
$$\mu_i \ge 0 \;\; \forall i \qquad (6.12)$$

where the auxiliary variables µi stand for the maximum of row i in the projected matrix B. Introducing Lagrange multipliers αi,j, βi,j, γi ≥ 0 and θ ≥ 0 for constraints (6.9), (6.11), (6.12) and (6.10) respectively, we obtain the Lagrangian

$$L = \frac{1}{2}\sum_{i,j}(A_{i,j}-B_{i,j})^{2} + \sum_{i,j}\alpha_{i,j}(B_{i,j}-\mu_i) - \sum_{i,j}\beta_{i,j}B_{i,j} - \sum_i \gamma_i \mu_i + \theta\Big(\sum_i \mu_i - C\Big)$$

The first lemma characterizes the optimal solution.

Lemma 1 Let B*, µ* be the solution of P1,∞. Then for each row i, either (a) µi > 0 and $\sum_j (A_{i,j} - B_{i,j}) = \theta$; or (b) µi = 0 and $\sum_j A_{i,j} \le \theta$.

Proof: Differentiating L with respect to Bi,j gives the optimality condition

$$\frac{\partial L}{\partial B_{i,j}} = B_{i,j} - A_{i,j} + \alpha_{i,j} - \beta_{i,j} = 0.$$

Differentiating L with respect to µi gives the optimality condition

$$\frac{\partial L}{\partial \mu_i} = \theta - \sum_j \alpha_{i,j} - \gamma_i = 0.$$

We now assume µi > 0 to prove (a). The complementary slackness conditions imply that whenever µi > 0 then γi = 0, and therefore $\theta = \sum_j \alpha_{i,j}$. If we assume that Bi,j > 0, by complementary slackness then βi,j = 0, and therefore αi,j = Ai,j − Bi,j. If Bi,j = 0 then Bi,j − µi ≠ 0 and so αi,j = 0 due to complementary slackness; we then observe that Ai,j = −βi,j, but since Ai,j ≥ 0 and βi,j ≥ 0 it must be that Ai,j = 0, so we can also express αi,j = Ai,j − Bi,j. Thus $\theta = \sum_j (A_{i,j} - B_{i,j})$, which proves (a).

When µi = 0, Bi,j = 0 because of (6.9) and (6.11). Then, using the optimality condition for ∂L/∂Bi,j we get that αi,j = Ai,j + βi,j. Plugging this into ∂L/∂µi we get $\theta = \sum_j (A_{i,j} + \beta_{i,j}) + \gamma_i$. By definition βi,j ≥ 0 and γi ≥ 0, which proves (b).

Lemma 1 means that when projecting A, for every row whose sum is greater than θ, the sum of the values in the row will be reduced by the constant θ. The rows whose sum is less than θ will become zero. The next lemma reveals how to obtain the coefficients of B given the optimal maximums µ.

Lemma 2 Let µ be the optimal maximums of problem P1,∞. The optimal matrix B of P1,∞


satisfies:

$$A_{i,j} \ge \mu_i \implies B_{i,j} = \mu_i \qquad (6.13)$$
$$A_{i,j} \le \mu_i \implies B_{i,j} = A_{i,j} \qquad (6.14)$$
$$\mu_i = 0 \implies B_{i,j} = 0 \qquad (6.15)$$

Proof: If µi = 0 the lemma follows directly from (6.9) and (6.11). The rest of the proof assumes that µi > 0. To prove (6.13), assume that Ai,j ≥ µi but Bi,j ≠ µi. We consider two cases. When Bi,j > 0, by (6.9) if Bi,j ≠ µi then Bi,j < µi, which means αi,j = 0 due to complementary slackness; this, together with βi,j = 0 (which holds since Bi,j > 0), implies that Bi,j = Ai,j, and therefore Ai,j < µi, which contradicts the assumption. When Bi,j = 0, then αi,j = 0 and Ai,j = 0 (see the proof of Lemma 1), which contradicts the assumption.

To prove (6.14), assume that Ai,j ≤ µi but Bi,j ≠ Ai,j. We again consider two cases. If Bi,j > 0, then βi,j = 0; given that Bi,j ≠ Ai,j, then αi,j > 0, and so Bi,j = µi due to complementary slackness. But since αi,j > 0, Ai,j > Bi,j = µi, which contradicts the assumption. If Bi,j = 0, then Ai,j = 0 (see the proof of Lemma 1), which contradicts the assumption.

With these results, the problem of projecting into the l1,∞ ball can be reduced to the following problem, which finds the optimal maximums µ:

$$M_{1,\infty}: \quad \text{find} \;\; \mu, \theta \qquad (6.16)$$
$$\text{s.t.} \quad \sum_i \mu_i = C \qquad (6.17)$$
$$\sum_{j : A_{i,j} \ge \mu_i} (A_{i,j} - \mu_i) = \theta\,, \;\; \forall i \;\text{s.t.}\; \mu_i > 0 \qquad (6.18)$$
$$\sum_j A_{i,j} \le \theta\,, \;\; \forall i \;\text{s.t.}\; \mu_i = 0 \qquad (6.19)$$
$$\forall i \;\; \mu_i \ge 0\,; \quad \theta \ge 0 \qquad (6.20)$$

With µ we can create the matrix B using Lemma 2. Intuitively, the new formulation reduces finding the projection onto the l1,∞ ball to finding a new vector of maximum absolute values that will be used to truncate the original matrix. The constraints express that the cumulative mass removed from a row is kept constant across all rows, except for those

rows whose coefficients become zero. A final lemma establishes that there is a unique solution to M1,∞. Therefore, the original projection P1,∞ reduces to finding the solution of M1,∞.

Lemma 3 For a matrix A and a constant C < ||A||1,∞, there is a unique solution µ*, θ* to the problem M1,∞.

Proof: For any θ ≥ 0 there is a unique µ that satisfies (6.18), (6.19) and (6.20). To see this, consider first $\theta \ge \sum_j A_{i,j}$: in this case we must have µi = 0, by equation (6.19). For $\theta < \sum_j A_{i,j}$ we have µi = fi(θ), where fi is the inverse of the function

$$g_i(\mu) = \sum_{j : A_{i,j} \ge \mu} (A_{i,j} - \mu)$$

gi(µ) is a strictly decreasing function on the interval [0, maxj Ai,j], with $g_i(0) = \sum_j A_{i,j}$ and gi(maxj Ai,j) = 0. Therefore it is clear that fi(θ) is well defined on the interval $[0, \sum_j A_{i,j}]$. Next, define

$$N(\theta) = \sum_i h_i(\theta)$$

where hi(θ) = 0 if $\theta > \sum_j A_{i,j}$ and hi(θ) = fi(θ) otherwise. N(θ) is strictly decreasing on the interval $[0, \max_i \sum_j A_{i,j}]$. Hence there is a unique θ* that satisfies N(θ*) = C, and there is a unique µ* such that µ*_i = hi(θ*) for each i.

So far we have assumed that the input matrix A is non-negative. For the general case, it is easy to prove that the optimal projection never changes the sign of a coefficient (Duchi et al., 2008). Thus, given the coefficient matrix W used by our learning algorithm, we can run the l1,∞ projection on A, where A is a matrix made of the absolute values of W , and then recover the original signs after truncating each coefficient.

6.3.2 An efficient projection algorithm

In this section we describe an efficient algorithm to solve problem $M_{1,\infty}$. Given a $d \times m$ matrix $A$ and a ball of radius $C$, the goal is to find a constant $\theta$ and a vector $\mu$ of maximums for the new projected matrix, such that $C = \sum_i \mu_i$.

As we have shown in the proof of Lemma 3, $\mu$ and $\theta$ can be recovered using the functions $N(\theta)$ and $h_i(\theta)$. Each function $h_i(\theta)$ is piecewise linear with $m + 1$ intervals. Furthermore $N(\theta)$, the sum of the functions $h_i(\theta)$, is also piecewise linear, with $dm + 1$ intervals. Section 6.3.3 describes the intervals and slopes of $h_i$. Our algorithm builds these functions piece by piece until it finds a constant $\theta$ that satisfies the conditions of problem $M_{1,\infty}$; it then recovers $\mu$. The cost of the algorithm is dominated by sorting and merging the x-coordinates of the $h_i$ functions, which form the intervals of $N$. Therefore the complexity is $O(dm \log dm)$ in time and $O(dm)$ in memory, where $dm$ is the number of parameters in $A$. As a final note, the algorithm only needs to consider the non-zero parameters of $A$; thus, in this complexity cost, $dm$ can be interpreted as the number of non-zero parameters. This property is particularly attractive for learning methods that maintain sparse coefficients.

6.3.3 Computation of h_i

In this section we describe the intervals and slopes of the piecewise linear functions $h_i$. Let $s_i$ be a vector of the coefficients of row $i$ in $A$ sorted in decreasing order, with an added 0 at position $m + 1$: $s_{i,1} \geq s_{i,2} \geq \ldots \geq s_{i,m} \geq s_{i,m+1} = 0$. Then define the points

$$r_{i,k} = \sum_{j: A_{i,j} \geq s_{i,k}} (A_{i,j} - s_{i,k}) = \sum_{j=1}^{k} (s_{i,j} - s_{i,k}) = \sum_{j=1}^{k-1} s_{i,j} - (k-1)\, s_{i,k}, \qquad 1 \leq k \leq m+1.$$

Each point $r_{i,k}$ corresponds to the reduction in magnitude for row $i$ that is obtained if we set the new maximum to $s_{i,k}$. Clearly $h_i(r_{i,k}) = s_{i,k}$. Furthermore it is easy to see that $h_i$ is piecewise linear with intervals $[r_{i,k}, r_{i,k+1}]$ for $1 \leq k \leq m$ and slope

$$\frac{s_{i,k+1} - s_{i,k}}{\sum_{j=1}^{k} s_{i,j} - k\, s_{i,k+1} - \sum_{j=1}^{k-1} s_{i,j} + (k-1)\, s_{i,k}} = -\frac{1}{k}$$

After the point $r_{i,m+1}$ the function is constant with value zero. Note that this comes from equation (6.19), which establishes that $\mu_i = 0$ for $\theta > r_{i,m+1} = \sum_{j=1}^{m} A_{i,j}$.
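To make the procedure concrete, the following Python/numpy sketch (our illustration, not the thesis implementation) computes this projection for a dense matrix. For readability it locates theta by binary search over the breakpoints r_{i,k} instead of the streaming merge described above, so it is simple but does not carry the O(dm log dm) guarantee.

import numpy as np

def project_l1inf(W, C):
    """Euclidean projection of W onto the l1,inf ball of radius C (sketch)."""
    A = np.abs(W)
    if A.max(axis=1).sum() <= C:
        return W.copy()                                   # already feasible
    d, m = A.shape
    # s_i: row coefficients sorted decreasingly, with an appended 0
    S = np.hstack([-np.sort(-A, axis=1), np.zeros((d, 1))])
    cum = np.hstack([np.zeros((d, 1)), np.cumsum(S[:, :-1], axis=1)])
    R = cum - np.arange(m + 1) * S        # breakpoints r_{i,k} (non-decreasing)

    def h(i, theta):
        """New maximum of row i when theta mass is removed from it."""
        if theta >= R[i, -1]:             # row fully zeroed (mu_i = 0)
            return 0.0
        k = np.searchsorted(R[i], theta, side='right') - 1
        return S[i, k] - (theta - R[i, k]) / (k + 1)      # slope -1/(k+1)

    def N(theta):
        return sum(h(i, theta) for i in range(d))

    # N is piecewise linear, decreasing from ||A||_{1,inf} to 0; find the
    # breakpoint segment where it crosses C, then interpolate exactly.
    thetas = np.unique(R)
    lo, hi = 0, len(thetas) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if N(thetas[mid]) > C:
            lo = mid
        else:
            hi = mid
    n_lo, n_hi = N(thetas[lo]), N(thetas[hi])
    theta = thetas[lo] + (n_lo - C) * (thetas[hi] - thetas[lo]) / (n_lo - n_hi)
    mu = np.array([h(i, theta) for i in range(d)])
    return np.sign(W) * np.minimum(A, mu[:, None])        # truncate, restore signs

As a quick sanity check, projecting a 3-by-4 matrix of all 2s onto the ball of radius 3 returns a matrix of all 1s: each of the three row maxima is capped at 1, so the row maxima sum to 3.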

6.4 Related Work

A number of optimization methods have been proposed for l1,∞ regularization. In particular, Turlach et al. (2005) developed an interior point algorithm for optimizing a twice differentiable objective regularized with an l1,∞ norm. One limitation of this approach is that it requires the exact computation of the Hessian of the objective function, which might be computationally expensive for some applications in terms of both memory and time.

Schmidt et al. (2008) propose a projected gradient method for l1,∞ regularized models that differs from ours in the projection strategy. In our case, the objective corresponds directly to the loss function, and we project onto the l1,∞ ball. In contrast, their method minimizes the following regularized objective:

$$\sum_{i=1}^{n} \text{Loss}(z_i, W) + \lambda \sum_{j=1}^{d} \mu_j \qquad \text{s.t.} \quad |W_{j,k}| \leq \mu_j$$

This formulation introduces auxiliary variables $\mu_j$ that stand for the maximum absolute value of each group (i.e., the variables $W_{j,k}$ for any $k$). The gradient step updates all the variables without taking the constraints into account: the main variables are updated according to the gradient of the loss function, whereas the auxiliary variables are lowered by the constant $\lambda$. Then a projection step makes a second update that enforces the constraints. In this case, this step can be decomposed into $d$ independent $\ell_\infty$ projections, each enforcing that the auxiliary variables $\mu_j$ are in fact the maximum absolute values of each group.

Standard results in optimization theory (see Section 2.4) show that the convergence of a projected gradient method is of the order

$$\epsilon \leq O\left( \frac{D^2 + G^2 \log(T+1)}{2\sqrt{T}} \right)$$

where $T$ is the number of iterations; $D$ is an upper bound on the distance between the initial solution and the optimal solution; and $G$ is an upper bound on the norm of the gradient of the objective at any feasible point. In our formulation, $G$ depends only on the gradient of the loss function; for example, for the hinge loss it can be shown that $G$ is bounded by the size of an $\ell_2$ ball containing all the training examples (Shalev-Shwartz et al., 2007). In the method of Schmidt et al. (2008), $G$ has an additional term corresponding to the gradient of $\lambda \sum_j \mu_j$, so the convergence of their method has an additional dependence on $d\lambda^2$.

As we have shown in chapter 5, for the special case of a linear objective the regularization problem can be expressed as a linear program. While this is feasible for small problems, it does not scale to problems with a large number of variables.

Our l1,∞ projection algorithm is related to the l1 projection of Duchi et al. (2008) in that theirs is a special case of our algorithm for m = 1. The derivation of the general case for l1,∞ regularization is significantly more involved, as it requires reducing a set of l∞ regularization problems, tied together through a common l1 norm, to a problem that can be solved efficiently.
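For concreteness, the following sketch (ours) shows one way the projection of Section 6.3 plugs into a projected subgradient loop for the sum of average hinge losses used in the experiments below. It assumes the project_l1inf sketch given earlier, with X[k] and Y[k] holding the examples and labels (in {-1, +1}) of task k.

import numpy as np

def train_joint(X, Y, C, eta0=1.0, T=200):
    """Projected subgradient for sum-of-average-hinge-losses, l1,inf ball."""
    m, d = len(X), X[0].shape[1]
    W = np.zeros((d, m))                                  # column k holds w_k
    for t in range(1, T + 1):
        G = np.zeros_like(W)
        for k in range(m):
            margins = Y[k] * (X[k] @ W[:, k])
            viol = margins < 1                            # hinge subgradient support
            G[:, k] = -(X[k][viol] * Y[k][viol, None]).sum(axis=0) / len(Y[k])
        # gradient step with rate eta0/sqrt(t), then project back onto the ball
        W = project_l1inf(W - (eta0 / np.sqrt(t)) * G, C)
    return W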

6.5 Synthetic Experiments

For the synthetic experiments we considered a multitask setting where we compared the l1,∞ projection with both independent l2 projections for each task and independent l1 projections. In all cases we used a projected subgradient method, so the only difference is in the projection step. For all the experiments we used the sum of average hinge losses per task as our objective function. For these experiments, as well as the experiments in the following section, the learning rate was set to $\eta_0/\sqrt{t}$, where $\eta_0$ was chosen to minimize the objective on the training data (we tried the values 0.1, 1, 10 and 100). All models were run for 200 iterations.

To create data for these experiments we first generated parameters $W = [w_1, w_2, \ldots, w_m]$ for all tasks, where each entry was sampled from a normal distribution with zero mean and unit variance. To generate jointly sparse vectors we randomly selected 10% of the features to be the global set of relevant features $V$. Then for each task we randomly selected a subset $v \subseteq V$ of relevant features for the task; the size of $v$ was sampled uniformly at random from $\{|V|/2, \ldots, |V|\}$, and all parameters outside $v$ were zeroed. Each dimension of the training points $x_i^k$ for each task was also generated from a normal distribution with zero mean and unit variance, and all vectors were then normalized to have unit norm. The corresponding labels $y_i^k$ were set to $\text{sign}(w_k \cdot x_i^k)$. The test data was generated in the same fashion.
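A minimal numpy sketch of this generation process (our illustration; the function name and defaults are ours):

import numpy as np

def make_joint_sparse_data(m_tasks=60, d=200, n_per_task=40, frac=0.10, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, m_tasks))                 # w_k in column k
    V = rng.choice(d, size=int(frac * d), replace=False)  # shared relevant set
    for k in range(m_tasks):
        # task-specific subset v of V, |v| uniform in {|V|/2, ..., |V|}
        v = rng.choice(V, size=rng.integers(len(V) // 2, len(V) + 1), replace=False)
        mask = np.zeros(d, dtype=bool)
        mask[v] = True
        W[~mask, k] = 0.0                                 # zero parameters outside v
    X = rng.standard_normal((m_tasks, n_per_task, d))
    X /= np.linalg.norm(X, axis=2, keepdims=True)         # unit-norm examples
    Y = np.sign(np.einsum('kni,ik->kn', X, W))            # labels y = sign(w_k . x)
    return W, V, X, Y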

[Figure 6-2 appears here: "Synthetic Experiments Results: 60 problems, 200 features, 10% relevant". Top panel: test error vs. number of training examples per task for the l2, l1 and l1,∞ models. Bottom panel: feature selection precision and recall for the l1 and l1,∞ models.]

Figure 6-2: Synthetic experiments. Top: test error. Bottom: feature selection performance.

The number of dimensions for these experiments was set to 200 and the number of problems to 60. We evaluated three types of projections: l1,∞, independent l2 and independent l1. For each projection the ball constraints $C$ were set to be the true norm of the corresponding parameters: for the l1,∞ norm we set $C = \|W\|_{1,\infty}$, and for the independent l1 and l2 norms we set $C_k = \|w_k\|_1$ and $C_k = \|w_k\|_2$, respectively, where $C_k$ is the regularization parameter for task $k$. We trained models with different numbers of training examples, ranging from 10 to 640 examples per task, and evaluated the classification error of the resulting classifiers on the test data.

Figure 6-2 shows the results of these experiments. As we would expect given the amount of feature sharing between tasks, the l1,∞ projection results in better generalization than both independent l1 and independent l2 projections. Since we know the relevant feature set $V$, we can also evaluate how well the l1 and l1,∞ projections recovered these features. For each model we take the learned coefficient matrix $W$ and select all features with non-zero coefficients for at least one task. The bottom plot shows precision and recall of feature selection for each model as we increase the number of training examples per task. Both the l1 and l1,∞ models can easily recognize that a feature is in the relevant set: the recall of both models is high even with very few training examples. The main difference between the two models is in the precision at recovering relevant features: the l1,∞ model returns significantly sparser solutions, and thus has higher precision.
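These feature-selection numbers can be computed directly from the learned coefficient matrix; a small sketch (ours), reusing the relevant set V returned by the generator above:

import numpy as np

def support_metrics(W_hat, V):
    """Precision/recall of the recovered feature support against V."""
    pred = np.flatnonzero(np.any(W_hat != 0, axis=1))     # features used by any task
    tp = len(np.intersect1d(pred, V))
    return tp / max(len(pred), 1), tp / len(V)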

6.6 Image Annotation Experiments

In these experiments we use our algorithm in a multitask learning application. We compare the performance of independent l2, independent l1, and joint l1,∞ regularization. In all cases we used the sum of hinge losses as our objective function. To train the l1 regularized models we used the projection method of Duchi et al. (2008), which is a special case of our projection when m = 1. To train the l2 regularized models we used the standard SVM-Light software (http://svmlight.joachims.org/).

For this section we used the Reuters.V2 dataset described in Section 4.4. Images on the Reuters website have associated captions. We selected the 40 most frequent content words as our target prediction tasks (a content word is defined as one not appearing in a list of stop words). That is, each task involved the binary prediction of whether a content word was an appropriate annotation for an image. Examples of words include: awards, president, actress, actor, match, team, people.

We partitioned the data into three sets: 10,382 images for training, 5,000 images as validation data, and 5,000 images for testing. For each of the 40 most frequent content words we created multiple training sets of different sizes, n = {40, 80, 160, 320, 640}; each training set contained n positive examples and 2n negative examples, all randomly sampled from the pool of supervised training data.

For all experiments we used the vocabulary tree representation (Nister and Stewenius, 2006) as our image representation, with 11,000 features in our case. As a preprocessing step we performed SVD to obtain a new basis of the image space in which features are uncorrelated. In preliminary experiments we observed that, for both the l1 and l1,∞ models, this transformation eased the task of finding sparse solutions.

6.6.1 Evaluation and Significance Testing

To compare the performance of different classifiers we use the AUC criterion, which is commonly used in the evaluation of retrieval tasks. (We would like to thank Shivani Agarwal for useful discussions on this subject.) For a single task, assuming a labeled test set $(x_i, y_i)$ for $i = 1 \ldots n$, the AUC measure for a function $f$ can be expressed (e.g., see Agarwal et al. (2005)) as

$$\frac{1}{n^+ n^-} \sum_{i: y_i = +1} \ \sum_{j: y_j = -1} I[f(x_i) > f(x_j)] \quad (6.21)$$

where $n^+$ is the number of positive examples in the test set, $n^-$ is the number of negative examples, and $I[\pi]$ is the indicator function, which is 1 if $\pi$ is true and 0 otherwise. The AUC measure can be interpreted (Agarwal et al., 2005) as an estimate of the following expectation, the probability that the function $f$ correctly ranks a randomly drawn positive example over a randomly drawn negative item:

$$AUC(f) = E_{X^+ \sim D^+, \, X^- \sim D^-} \big[ I[f(X^+) > f(X^-)] \big]$$

Here $D^+$ is the distribution over positively labeled examples, and $D^-$ is the distribution over negative examples. This interpretation allows us to develop a simple significance test, based on the sign test, to compare the performance of two classifiers $f$ and $g$ (more specifically, a significance test for the hypothesis that $AUC(f) > AUC(g)$). Assuming that $n^+ < n^-$ (which is the case in all of our experiments), and given a test set, we create pairs of examples $(x_i^+, x_i^-)$ for $i = 1 \ldots n^+$. Here each $x_i^+$ is a positive test example, and $x_i^-$ is an arbitrary

negative test example; each positive and negative example is a distinct item from the test set. Given the $n^+$ pairs, we can calculate the following counts:

$$s^+ = \sum_i I\big[(f(x_i^+) > f(x_i^-)) \wedge (g(x_i^+) < g(x_i^-))\big]$$

$$s^- = \sum_i I\big[(f(x_i^+) < f(x_i^-)) \wedge (g(x_i^+) > g(x_i^-))\big]$$

These counts are used to calculate significance under the sign test.
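As a concrete reference, here is a small Python sketch (ours, not the thesis code) of the single-task AUC of Eq. 6.21 and of the paired sign test just described. It assumes scores are numpy arrays and labels take values in {-1, +1}.

import math
import numpy as np

def auc(scores, labels):
    """AUC per Eq. 6.21."""
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    return float((pos[:, None] > neg[None, :]).mean())

def sign_test(f_scores, g_scores, labels):
    """One-sided sign test for the hypothesis AUC(f) > AUC(g)."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == -1)
    s_plus = s_minus = 0
    for i, j in zip(pos, neg):        # pair each positive with a distinct negative
        f_ok = f_scores[i] > f_scores[j]
        g_ok = g_scores[i] > g_scores[j]
        s_plus += int(f_ok and not g_ok)
        s_minus += int(g_ok and not f_ok)
    n = s_plus + s_minus
    if n == 0:
        return s_plus, s_minus, 1.0
    # binomial tail P(X >= s_plus) for X ~ Binomial(n, 1/2)
    p = sum(math.comb(n, k) for k in range(s_plus, n + 1)) / 2.0 ** n
    return s_plus, s_minus, p

The returned p-value is the probability of observing at least s+ discordant wins for f when both classifiers are in fact equally good.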

# samples    l2 p-value    l1 p-value
4800         0.0042        0.00001
9600         0.0002        0.009
19200        0.00001       0.00001
38400        0.35          0.36
63408        0.001         0.46

Table 6.1: Significance tests for the Image Annotation task, comparing l1,∞ with l2 and l1.

[Figure 6-3 appears here: "Performance Comparison" of the l1,∞, l1 and l2 models as the number of training samples grows from 4,800 to 63,408.]

Figure 6-3: Model Comparison in the Image Annotation task: l1,∞ (red), l1 (green), l2 (black).

In our experiments the set-up is slightly more complicated, in that we have a multitask setting where we simultaneously measure the performance on several tasks rather than a single task. The test set in our case consists of examples $(x_i, y_i, l_i)$ for $i = 1 \ldots n$, where $l_i$ specifies the task of the $i$'th example. We replace the AUC measure in Eq. 6.21 with the following measure:

$$\frac{1}{n^+} \sum_l \ \sum_{i: l_i = l, y_i = +1} \ \sum_{j: l_j = l, y_j = -1} \frac{I[f_l(x_i) > f_l(x_j)]}{n_l^-}$$

where $n^+$ is the total number of positive examples in the test set, $n_l^-$ is the number of negative examples for the $l$'th task, and $f_l$ is the classifier for the $l$'th task. It is straightforward to develop a variant of the sign test for this setting; for brevity the details are omitted.
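Under the same assumptions as the sketch above, the multitask measure can be computed as follows (ours; tasks[i] holds l_i and scores[i] holds the output of the i'th example's own task classifier, f_{l_i}(x_i)):

import numpy as np

def multitask_auc(scores, labels, tasks):
    """Multitask AUC measure: pairwise comparisons within each task."""
    total = 0.0
    for l in np.unique(tasks):
        s, y = scores[tasks == l], labels[tasks == l]
        pos, neg = s[y == 1], s[y == -1]
        if len(pos) and len(neg):
            total += (pos[:, None] > neg[None, :]).sum() / len(neg)
    return total / np.sum(labels == 1)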


Figure 6-4: l1,∞ model top ranked images for the words actor, football and award.

Figure 6-5: l1,∞ model top ranked images for the words final, actress and match.

Figure 6-6: l1,∞ model top ranked images for the words minister and tournament.

Figure 6-7: Parameter matrices of the image annotation task trained with 4,800 training samples for l1 (left) and l1,∞ (right). Each row corresponds to the parameters of a feature across tasks. Gray corresponds to 0 (inactive parameter), whereas black levels indicate the absolute value of the parameters.

6.6.2 Results

To validate the constant $C$ for each model we assume that we have 10 words for which we have validation data. We chose the $C$ that maximized the AUC, trying the values $C \in \{0.1, 1, 10, 100\}$.

Figure 6-3 shows results on the Reuters dataset for l2, l1 and l1,∞ regularization as a function of the total number of training samples. Table 6.1 shows the corresponding significance tests for the difference between l1,∞ and the other two models. As we can see, l1,∞ regularization performs significantly better than both l1 and l2 for training sizes of fewer than 20,000 samples; for larger training sizes all models seem to perform similarly. Figures 6-4, 6-5 and 6-6 show some retrieval results for the l1,∞ model.

Figure 6-7 shows the absolute values of the parameter matrices of each model trained with 4,800 examples, where each column corresponds to the parameters of one task and each row to a feature. As we can see, l1,∞ regularization produces a joint sparsity pattern, where many tasks use the same features (i.e., rows).

Figure 6-8: AUC measure with respect to joint sparsity in the Image Annotation task, for l1 (top) and l1,∞ (bottom).

Figure 6-8 shows the accuracy of the learned model as a function of the total number of non-zero features (i.e., the number of features that were used by any of the tasks), where we obtained solutions of different sparsity by controlling the $C$ parameter. Notice that the l1,∞ model is able to find a solution with 66% average AUC using only 30% of the features, while the l1 model needs to use 95% of the features to achieve a performance of 65%.

[Figure 6-9 appears here: "Scene Dataset: L1 vs L1INF", average AUC vs. number of non-zero features (400 to 2,000).]

Figure 6-9: Average AUC as a function of the total number of features used by all models. To create this figure we ran both the l1 and l1,∞ models with different values of C, which resulted in models with different numbers of non-zero features.

6.7 Scene Recognition Experiments

We also conducted experiments on an indoor scene recognition dataset containing 67 indoor scene categories. We used 50 images per class for training, 30 images as a source of unlabeled data from which to build a representation, and the remaining 20 images for testing.

We used the unlabeled data $U = \{x_1, x_2, \ldots, x_q\}$ to create a very simple representation based on similarities to the unlabeled images. As the basic image representation $g(x)$ we chose the Gist descriptor (Oliva and Torralba, 2001). Using the unlabeled images we compute the representation

$$v(x) = [k(x, x_1), k(x, x_2), \ldots, k(x, x_q)] \quad \text{where} \quad k(x_i, x_j) = \exp\left( \frac{-\|g(x_i) - g(x_j)\|_2^2}{2\sigma^2} \right) \quad (6.22)$$

We will refer to the unlabeled images as templates.
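A minimal sketch of this construction (ours): gist_fn stands in for an assumed Gist-descriptor function, which is not specified here, mapping an image to a 1-D numpy array.

import numpy as np

def template_features(images, templates, gist_fn, sigma=1.0):
    """Compute v(x) per Eq. 6.22 for each image."""
    G = np.stack([gist_fn(u) for u in templates])         # q x p template descriptors
    feats = []
    for x in images:
        g = gist_fn(x)
        d2 = np.sum((G - g) ** 2, axis=1)                 # squared l2 distances
        feats.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    return np.stack(feats)                                # one row v(x) per image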

In the experiments we trained one-vs-all classifiers for each category using the hinge loss, comparing l1 regularization with joint l1,∞ regularization. As a performance metric we report the average AUC computed over all classes.

Figure 6-9 shows the average performance of the l1 and l1,∞ regularized models as a function of the total number of non-zero features used by any of the 67 classifiers. To create this figure we ran both models with different values of C. As we would expect, when no sparsity restriction is imposed both models perform equivalently; for sparse models, however, l1,∞ regularization results in significantly better performance. In particular, the l1,∞ regularized model obtains ≈ 80% AUC using ≈ 10% of the features, while the l1 model needs ≈ 30% of the features to achieve similar performance.

Figures 6-10 and 6-11 show some examples of features (i.e., templates) chosen by a joint model that has a total of ≈ 100 non-zero parameters. We observe that the features chosen by the jointly sparse model seem to have a generic nature. For example, the first and third features in Figure 6-10 are filter-type features that seem to discriminate between vertical and horizontal structures. The next-to-last feature in Figure 6-11 appears to recognize indoor scenes that contain shelves; for this feature, Figure 6-12 shows the parameter values across the 67 categories. Most of the features chosen by the l1,∞ model seem to have a similar pattern, in the sense that they partition the space of scenes evenly.


Figure 6-10: Examples of features (i.e., templates) chosen by a joint model with a total of 100 non-zero parameters. Each row corresponds to one feature. The first column shows the template (i.e., an unlabeled image). The following six columns show the images in the training set most similar to the template. The last three columns show the images least similar to the template.


Figure 6-11: Further examples of features (i.e., templates) chosen by a joint model with a total of 100 non-zero parameters. Each row corresponds to one feature. The first column shows the template (i.e., an unlabeled image). The following six columns show the images in the training set most similar to the template. The last three columns show the images least similar to the template.



Figure 6-12: The figure on the top left shows the parameter weights across the 67 categories for the next-to-last feature in Figure 6-11. The figure on the top right shows the categories with positive parameter weights (red) and negative parameter weights (green). The figure on the bottom shows example images of positive and negative categories.


6.8 Chapter Summary

In recent years the l1,∞ norm has been proposed for joint regularization. In essence, this type of regularization aims at extending the l1 framework for learning sparse models to a setting where the goal is to learn a set of jointly sparse models.

In this chapter we have presented a simple and effective projected gradient method for training joint models with l1,∞ constraints. The main challenge in developing such a method lies in being able to compute efficient projections onto the l1,∞ ball. We presented an algorithm that works in O(n log n) time and O(n) memory, where n is the number of parameters. This matches the computational cost of the most efficient algorithm for computing l1 projections (Duchi et al., 2008).

We applied our algorithm to a multitask image annotation problem. Our results show that l1,∞ leads to better performance than both l2 and l1 regularization, and that l1,∞ is effective in discovering jointly sparse solutions. One advantage of our approach is that it can be easily extended to an online convex optimization setting (Zinkevich, 2003), in which the gradients would be computed with respect to one or a few examples. Future work should explore this possibility.


Chapter 7

Conclusion

7.1 Thesis Contributions

Ideally, we would like to build image classifiers that can exploit a large number of complex features. Working with such rich representations comes at an added cost: training might require large amounts of supervised data, but only a few labeled examples might be available for a given task. To address this problem we proposed transfer algorithms that: 1) leverage unlabeled examples annotated with meta-data and 2) leverage supervised training data from related tasks. In particular, this thesis makes the following contributions:

• Learning Image Representations using Unlabeled Data annotated with Meta-Data: We presented a method based on structure learning (Ando and Zhang, 2005) that leverages unlabeled data annotated with meta-data. Our method learns an image representation that is low-dimensional but nevertheless captures the information required to discriminate between image categories. Our results show that using the meta-data is essential to derive efficient image representations: a model that ignores the meta-data and learns a visual representation by performing PCA on the unlabeled images alone did not improve performance.

• Jointly Regularized Model: We developed a transfer algorithm that can exploit supervised training data from

related tasks. Our algorithm is based on the observation that to train a classifier for a given image classification task we might only need to consider a small subset of relevant features, and that related tasks might share a significant number of such features. We can find the optimal subset of shared features by performing a joint loss minimization over the training sets of related tasks, with a shared regularization penalty that minimizes the total number of features involved in the approximation. In our first set of experiments we used the joint model in an asymmetric transfer setting. Our results demonstrate that when only a few examples are available for training a target task, leveraging knowledge learnt from other tasks using our joint learning framework can significantly improve performance.

• An Efficient Algorithm for Joint Regularization: While the LP formulation is feasible for small datasets, it becomes intractable for optimization problems involving thousands of dimensions and examples. To address this limitation we presented a general, simple and effective projected gradient method for training l1,∞ regularized models with guaranteed convergence rates of $O(1/\epsilon^2)$. The proposed algorithm has $O(n \log n)$ time and memory complexity, with $n$ being the number of parameters of the joint model. This cost is comparable to the cost of training $m$ independent sparse classifiers; therefore our work provides a tool that makes implementing a joint sparsity regularization penalty as easy, and almost as efficient, as implementing the standard l1 and l2 penalties.

We tested our algorithm on a symmetric transfer image annotation problem. Our results show that when training data per task is scarce, l1,∞ leads to better performance than both l2 and l1 regularization. Furthermore, l1,∞ is effective in discovering jointly sparse solutions. This is important for settings where: 1) computing features is expensive, and 2) we need to evaluate a large number of classifiers. In such cases, finding jointly sparse solutions is essential to control the run-time cost of evaluating multiple classifiers.


7.2 Discussion

7.2.1 Representations and Sparsity

For the experiments on the Reuters dataset presented in chapter 6 we performed SVD on the raw feature representation. Performing SVD results in a new feature representation with the following properties: 1) features are uncorrelated, and 2) every feature is a linear combination of the raw features. We have observed that, for some feature representations, performing SVD to obtain a new representation with properties 1 and 2 eases the task of finding sparse solutions.

Whether or not whitening the data is necessary depends greatly on the feature representation itself. Ideally, the raw feature representation would be rich and complex enough to make the SVD step unnecessary; for example, no such step was needed for the scene recognition experiments presented in Section 6.7. Note, however, that performing SVD has consequences for the computational cost of the method at test time: for a given test point, all the raw features must be computed in order to project the point onto the sparse SVD representation. Finding a sparse solution will still lead to computational savings in the SVD representation, both in memory and time, since fewer eigenvectors need to be stored and fewer inner products need to be computed.
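A minimal sketch of this preprocessing step (ours; X is an assumed n x d numpy matrix of raw training features):

import numpy as np

# Decorrelate features via SVD: after centering, the columns of X @ Vt.T
# (equal to U * S) are orthogonal, i.e., uncorrelated on the training set.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
X_proj = (X - mean) @ Vt.T               # new training representation

# A test point must be projected with the same basis, which requires
# computing all of its raw features first (the test-time cost noted above):
# x_test_proj = (x_test - mean) @ Vt.T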

7.2.2 Differences between the feature sharing and learning hidden representations approaches

The learning hidden representations approach of Ando and Zhang (2005) is appealing because it can exploit latent structures in the data. However, this power comes at a relatively high computational cost: when used for symmetric transfer, the corresponding joint training algorithm must perform an alternating optimization in which each iteration involves training classifiers for every task. The feature sharing approach might not be able to uncover latent structures, but as we have shown in chapter 6 it can be implemented very efficiently, i.e., we can derive joint learning algorithms whose computational cost is essentially the same as that of independent training.

This approach is appealing for applications where we can easily generate a large number of candidate features to choose from. This is the case in many image classification problems; for example, in chapter 5 we generate a large set of candidate features using a kernel function and unlabeled examples. Furthermore, finding jointly sparse models is useful when obtaining a subset of relevant features is itself a goal. For example, consider a representation based on distances to unlabeled images and an active learning setting where we improve our representation by refining the annotation of the most relevant unlabeled images (i.e., annotating the objects contained in each image).

We would like to note that, as shown in chapter 4, in the case of asymmetric transfer a single iteration of the alternating minimization algorithm might suffice to compute a useful representation. The representation learnt with this approach is orthogonal to the representation learnt with joint regularization; therefore, future work should explore combining both representations to further improve performance.

7.3 Future Work

Some possible avenues of future work are:

• Task Clustering: One limitation of our transfer algorithm is that it assumes the existence of a subset of relevant features shared by all tasks. In practice, for a given set of problems there might instead be clusters of tasks that share relevant features among them. Discovering such clusters might allow us to share information more efficiently across tasks.

• Online Optimization: One advantage of the projected gradient method is that it can be easily extended to an online convex optimization setting (Zinkevich, 2003). In this setting the gradients would be computed with respect to one or a few examples, which might improve the overall efficiency of the algorithm.

• Feature Representations and Feature Sharing: If there exists a subset of features that can be shared across tasks, our algorithm will find it. However, if no such subset exists, we will not be able to share information across tasks. Therefore, given a vision application it would be interesting to investigate which types of feature representations are optimal for promoting sharing.

• Generalization Properties of l1,∞ Regularized Linear Models: The generalization properties of l1 regularized linear models have been well studied (Zhang, 2002), but little is known about the generalization properties of l1,∞ regularized linear models (Tropp, 2006a; Negahban and Wainwright, 2008).

Bibliography

S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. JMLR, 6:393-425, 2005.

Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of ICML, 2007.

R. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853, 2005.

M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Proceedings of NIPS, 2006.

B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99, 2003.

M. Balcan, A. Blum, and S. Vempala. Kernels as features: On kernels, margins, and low-dimensional mappings. Machine Learning Journal, 65(1):79-94, 2004.

S. Baluja. Probabilistic modeling for face orientation discrimination: Learning from labeled and unlabeled data. In Neural Information Processing Systems (NIPS), 1998.

D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

S. Boyd and A. Mutapcic. Subgradient methods. Lecture Notes for EE364b, Stanford University, 2007.

L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of ICML, 2004.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

D. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Technical report, Statistics Dept., Stanford University, 2004.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning, 2008.

T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of KDD, 2004.

L. Fei-Fei, P. Perona, and R. Fergus. One-shot learning of object categories. Pattern Analysis and Machine Intelligence, 28(4), 2006.

M. Fink. Object classification from a single example utilizing class relevance metrics. In Proceedings of NIPS, 2004.

J. H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 1998.

A. Holub, M. Welling, and P. Perona. Combining generative models and Fisher kernels for object recognition. In Proceedings of ICCV, 2005.

T. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In Advances in Neural Information Processing Systems 11, 1998.

L. Jacob, F. Bach, and J. P. Vert. Clustered multi-task learning: A convex formulation. In Proceedings of NIPS, 2008.

T. Jebara. Multi-task feature and kernel selection for SVMs. In Proceedings of ICML, 2004.

S. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds and regularization. In NIPS, 2008.

M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

S. I. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using l1-regularization. In NIPS, 2006.

J. Mutch and D. Lowe. Multiclass object recognition with sparse, localized features. In Proceedings of CVPR, 2006.

S. Negahban and M. Wainwright. Phase transitions for high-dimensional joint support recovery. In Proceedings of NIPS, 2008.

A. Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.

K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2):103-134, 2000.

D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.

G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Technical report, 2006.

G. Obozinski, M. Wainwright, and M. Jordan. High-dimensional union support recovery in multivariate regression. In Proceedings of NIPS, 2008.

A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145-175, 2001.

A. Quattoni, M. Collins, and T. Darrell. Learning visual representations using images with captions. In Proceedings of CVPR, 2007.

A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In Proceedings of CVPR, 2008.

A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for l1,infinity regularization. In Proceedings of ICML, 2009.

R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 713-720, 2006.

M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. In CVPR, 2008.

M. Seeger. Learning with labeled and unlabeled data. Technical report, Institute for Adaptive and Neural Computation, University of Edinburgh, 2001.

S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In International Conference on Machine Learning, 2008.

S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of ICML, 2004.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning, 2007.

J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering object categories in image collections. In Proceedings of ICCV, 2005.

N. Srebro and T. Jaakkola. Weighted low-rank approximations. In Proceedings of ICML, pages 720-727, 2003.

S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.

A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. Pattern Analysis and Machine Intelligence, in press, 2006.

J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.

J. Tropp. Algorithms for simultaneous sparse approximation, part II: convex relaxation. Signal Processing, 86(3):589-602, 2006a.

J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006b.

B. Turlach, W. Venables, and S. Wright. Simultaneous variable selection. Technometrics, 47(3):349-363, 2005.

V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In Proceedings of ICCV, 2007.

T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.

P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Technical Report 703, Statistics Department, UC Berkeley, 2007.

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML, 2003.