Datum-Wise Classification: A Sequential Approach to Sparsity

Gabriel Dulac-Arnold*1, Ludovic Denoyer1, Philippe Preux2, and Patrick Gallinari1

arXiv:1108.5668v1 [cs.AI] 29 Aug 2011

1 Université Pierre et Marie Curie - UPMC, LIP6 Case 169, 4 Place Jussieu, 75005 Paris, France ([email protected])
2 LIFL (UMR CNRS) & INRIA Lille Nord-Europe, Université de Lille, Villeneuve d'Ascq, France ([email protected])

Abstract. We propose a novel classification technique whose aim is to select an appropriate representation for each datapoint, in contrast to the usual approach of selecting a representation encompassing the whole dataset. This datum-wise representation is found by using a sparsity-inducing empirical risk, which is a relaxation of the standard L0-regularized risk. The classification problem is modeled as a sequential decision process that sequentially chooses, for each datapoint, which features to use before classifying. Datum-Wise Classification extends naturally to multi-class tasks, and we describe a specific case where our inference has complexity equivalent to that of a traditional linear classifier, while still using a variable number of features. We compare our classifier to classical L1-regularized linear models (L1-SVM and LARS) on a set of common binary and multi-class datasets and show that, for an equal average number of features used, we can get improved performance using our method.

1 Introduction

Feature Selection is one of the main contemporary problems in Machine Learning and has been approached from many directions. One modern approach to feature selection in linear models consists in minimizing an L0-regularized empirical risk. This particular risk encourages the model to strike a good balance between a low classification error and high sparsity (where only a few features are used for classification). As the L0-regularized problem is combinatorial, many approaches such as the LASSO [1] try to address the combinatorial problem by using more practical norms such as L1. These approaches have been developed with two main goals in mind: restricting the number of features to improve classification speed, and limiting the used features to the most useful in order to prevent overfitting. These classical approaches to sparsity aim at finding a sparse representation of the feature space that is global to the entire dataset.*
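To make the dataset-level alternative concrete, here is a minimal proximal-gradient (ISTA) solver for the L1-relaxed problem. This is an illustrative sketch (the data, names, and parameter values are our own, not from the paper); note how the soft-thresholding step drives coefficients exactly to zero, and that the resulting sparse support is shared by every datapoint:

```python
import numpy as np

def ista_lasso(X, y, lam=0.1, iters=500):
    """Minimize (1/2n)||Xw - y||^2 + lam*||w||_1 via proximal gradient (ISTA)."""
    n, d = X.shape
    # Step size from the Lipschitz constant of the smooth part.
    lr = n / (np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y) / n
        w = w - lr * grad
        # Soft-thresholding: the proximal operator of the L1 norm.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ true_w + 0.01 * rng.standard_normal(200)
w = ista_lasso(X, y, lam=0.1)
print("nonzero coefficients:", int(np.count_nonzero(w)))
```

The recovered support is global: every datum is classified (here, regressed) using the same few nonzero coefficients, which is precisely the dataset-level sparsity that the datum-wise approach below contrasts with.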

* This work was partially supported by the French National Agency of Research (Lampada ANR-09-EMER-007).


We propose a new approach to sparsity where the goal is to limit the number of features per datapoint, hence datum-wise sparse classification (DWSC). This means that our approach allows the choice of features used for classification to vary relative to each datapoint; datapoints that are easy to classify can be inferred on without looking at very many features, while more difficult datapoints can be classified using more features. The underlying motivation is that, while classical approaches balance between accuracy and sparsity at the dataset level, our approach optimizes this balance at the individual datum level, thus resulting in equivalent accuracy at higher overall sparsity. This kind of sparsity is interesting for several reasons: First, simpler explanations are always to be preferred, as per Occam's Razor. Second, in the knowledge extraction process, such datum-wise sparsity is able to provide unique information about the underlying structure of the data space. Typically, if a dataset is organized into two different subspaces, the datum-wise sparsity principle will allow the model to automatically choose to classify using only the features of one or the other subspace.

DWSC considers feature selection and classification as a single sequential decision process. The classifier iteratively chooses which features to use for classifying each particular datum. In this sequential decision process, datum-wise sparsity is obtained by introducing a penalizing reward when the agent chooses to incorporate an additional feature into the decision process. The model is learned using an algorithm inspired by Reinforcement Learning [2].

The contributions of the paper are threefold: (i.) We propose a new approach where classification is seen as a sequential process in which one has to choose which features to use depending on the input being inferred upon. (ii.)
This new approach results in a model that obtains good performance in terms of classification while maximizing datum-wise sparsity, i.e. the mean number of features used for classifying the whole dataset. It also naturally handles multi-class classification problems, solving them by using as few features as possible for all classes combined. (iii.) We perform a series of experiments on 14 different corpora and compare the model with those obtained by the LARS [3] and an L1-regularized SVM, thus providing a qualitative study of the behaviour of our algorithm.

The paper is organized as follows: First, we define the notion of datum-wise sparse classifiers and explain the interest of such models in Section 2. We then describe our sequential approach to classification and detail the learning algorithm and its complexity in Section 3. We describe how this approach can be extended to multi-class classification in Section 4. We detail experiments on 14 datasets, and also give a qualitative analysis of the behaviour of this model in Section 6. The related work is given in Section 7.

2 Datum-Wise Sparse Classifiers

We consider the problem of supervised multi-class classification (note that this includes the binary supervised classification problem as a special case) where one wants to learn a classification function fθ : X → Y to associate one category y ∈ Y to


a vector x ∈ X, where X = R^n, n being the dimension of the input vectors. θ is the set of parameters learned from a training set composed of input/output pairs Train = {(xi, yi)}, i ∈ [1..N]. These parameters are commonly found by minimizing the empirical risk defined by:

θ* = argmin_θ (1/N) Σ_{i=1}^{N} Δ(fθ(xi), yi),    (1)

where Δ is the loss associated with a prediction error. This empirical risk minimization problem does not consider any prior assumption or constraint concerning the form of the solution and can result in overfitting models. Moreover, when facing a very large number of features, obtained solutions usually need to perform computations on all the features for classifying any input, thus negatively impacting the model's classification speed.

We propose a different risk minimization problem where we add a penalization term that encourages the obtained classifier to classify using on average as few features as possible. In comparison to classical L0- or L1-regularized approaches, where the goal is to constrain the number of features used at the dataset level, our approach enforces sparsity at the datum level, allowing the classifier to use different features when classifying different inputs. This results in a datum-wise sparse classifier that, when possible, only uses a few features for classifying easy inputs, and more features for classifying difficult or ambiguous ones.

We consider a different type of classifier function that, in addition to predicting a label y given an input x, also provides information about which features have been used for classification. Let us denote Z = {0; 1}^n. We define a datum-wise classification function fθ of parameters θ as:

fθ : X → Y × Z,    fθ(x) = (y, z),

where y is the predicted output and z is an n-dimensional vector z = (z^1, ..., z^n), where z^i = 1 implies that feature i has been taken into consideration for computing label y on datum x. By convention, we denote the predicted label as yθ(x) and the corresponding z-vector as zθ(x). Thus, if zθ^i(x) = 1, feature i has been used for classifying x into category yθ(x).

This definition of datum-wise classifiers has two main advantages: First, as we will see in the next section, because fθ can explain its use of features with zθ(x), we can add constraints on the features used for classification. This allows us to encourage datum-wise sparsity, which we define below. Second, while this is not the main focus of our article, analysis of zθ(x) gives a qualitative explanation of how the classification decision has been made, which we study in Section 6. Note that the way we define datum-wise classification is an extension of the usual definition of a classifier.
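The (y, z) signature described above can be sketched as a small interface. The thresholds and the two-feature decision rule below are purely hypothetical, chosen only to show how z records which features were consulted for each individual datum:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DatumWiseResult:
    y: int              # predicted label
    z: np.ndarray       # binary mask: z[i] = 1 iff feature i was consulted

def f_theta(x: np.ndarray) -> DatumWiseResult:
    """Hypothetical datum-wise classifier: consult feature 0, and only
    look at feature 1 when feature 0 is ambiguous (near zero)."""
    z = np.zeros(len(x), dtype=int)
    z[0] = 1
    if abs(x[0]) > 1.0:             # easy datum: one feature suffices
        return DatumWiseResult(int(x[0] > 0), z)
    z[1] = 1                        # hard datum: acquire a second feature
    return DatumWiseResult(int(x[0] + x[1] > 0), z)

easy = f_theta(np.array([3.0, -0.2]))
hard = f_theta(np.array([0.3, 0.9]))
print(easy.y, int(easy.z.sum()))    # easy datum used 1 feature
print(hard.y, int(hard.z.sum()))    # hard datum used 2 features
```

The point of the interface is that z varies per input, which is exactly what a global feature-selection method cannot express.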


2.1 Datum-Wise Sparsity

Datum-wise sparsity is obtained by adding a penalization term to the empirical loss defined in equation (1) that limits the average number of features used for classifying:

θ* = argmin_θ (1/N) Σ_{i=1}^{N} Δ(yθ(xi), yi) + λ (1/N) Σ_{i=1}^{N} ||zθ(xi)||_0.    (2)

The term ||zθ(xi)||_0 is the L0 norm of zθ(xi) (the L0 'norm' is not a proper norm, but we will refer to it as such in this paper, as is common in the sparsity community), i.e. the number of features selected for classifying xi, that is, the number of elements in zθ(xi) equal to 1. In the general case, the minimization of this new risk results in a classifier that on average selects only a few features for classifying, but may use a different set of features w.r.t. the input being classified. We consider this to be the crux of the DWSC model: the classifier takes each datum into consideration differently during the inference process.
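Equation (2) translates directly to code. The sketch below assumes a 0/1 loss Δ and uses a hypothetical container for the (y, z) outputs; it is illustrative, not the paper's implementation:

```python
import numpy as np
from collections import namedtuple

Result = namedtuple("Result", "y z")  # predicted label and binary feature mask

def datum_wise_risk(results, y_true, lam):
    """Eq. (2): mean 0/1 loss plus lam times the mean ||z||_0
    (the average number of features used per datum)."""
    loss = np.mean([float(r.y != yt) for r, yt in zip(results, y_true)])
    avg_features = np.mean([np.count_nonzero(r.z) for r in results])
    return loss + lam * avg_features

# Two datapoints: one classified correctly using 1 feature,
# one misclassified using 3 features.
results = [Result(1, np.array([1, 0, 0, 0])),
           Result(0, np.array([1, 1, 1, 0]))]
print(datum_wise_risk(results, y_true=[1, 1], lam=0.1))  # 0.5 + 0.1 * 2 = 0.7
```

Because the L0 term is averaged over data rather than counted once over a shared weight vector, an easy datum that uses one feature directly lowers the risk, even if other data need many features.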


Fig. 1. The sequential process for a problem with 4 features (f1 , ..., f4 ) and 3 possible categories (y1 , ..., y3 ). Left: The gray circle is the initial state for one particular input x. Small circles correspond to terminal states where a classification decision has been made. In this example, the classification (bold arrows) has been made by sequentially choosing to acquire feature 3 then feature 2 and then to classify x in category y1 . The bold (red) arrows correspond to the trajectory made by the current policy. Right: The value of zθ (x) for the different states are illustrated. The value on the arrows corresponds to the immediate reward received by the agent assuming that x belongs to category y1 . At the end of the process, the agent has received a total reward of 0 − 2λ.
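The caption's accounting can be checked mechanically. This small helper (our own, not from the paper) sums the immediate rewards of a trajectory, assuming a reward of −λ per acquired feature and 0/−1 for a correct/incorrect final label:

```python
def episode_reward(actions, true_label, lam):
    """Total reward of a trajectory: -lam for each feature acquisition,
    then 0 for a correct final label and -1 for an incorrect one."""
    *features, label = actions
    return -lam * len(features) + (0.0 if label == true_label else -1.0)

# Fig. 1 trajectory: acquire f3, then f2, then classify as y1 (correct),
# giving the caption's total of 0 - 2*lam.
print(episode_reward(["f3", "f2", "y1"], true_label="y1", lam=0.1))
```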

Note that the optimization of the loss defined in equation (2) is a combinatorial problem that cannot be easily solved. In the next section of this paper, we propose an original way to deal with this problem, based on a Markov Decision Process.

3 Datum-Wise Sparse Sequential Classification

We consider a Markov Decision Problem (MDP, [4]; the MDP is deterministic in our case) to classify an input x ∈ R^n. At the beginning, we have no information about x, that is, we have no attribute/feature values. Then, at each step, we can choose to acquire a particular feature of x, or to classify x. The act of classifying x in the category y ends an "episode" of the sequential process. The classification process is a deterministic process defined by:

– A set of states X × Z, where state (x, z) corresponds to the state where the agent is currently classifying datum x and has selected the features specified by z. The number of currently selected features is thus ||z||_0.
– A set of actions A, where A(x, z) denotes the set of possible actions in state (x, z). We consider two types of actions:
  • Af is the set of feature selection actions Af = {f1, ..., fn} such that, for a ∈ Af, a = fj corresponds to choosing feature j. Action fj corresponds to a vector with only the j-th element equal to 1, i.e. fj = (0, ..., 1, ..., 0). Note that the set of possible feature selection actions in state (x, z), denoted Af(x, z), is equal to the subset of currently unselected features, i.e. Af(x, z) = {fj s.t. zj = 0}.
  • Ay is the set of classification actions Ay = Y, which correspond to assigning a label to the current datum. Classification actions stop the sequential decision process.
– A transition function defined only for feature selection actions (since classification actions are terminal):

T : X × Z × Af → X × Z,    T((x, z), fj) = (x, z′),

where z′ is an updated version of z such that z′ = z + fj.

Policy We define a parameterized policy πθ which, for each state (x, z), returns the best action as defined by a scoring function sθ(x, z, a):

πθ : X × Z → A    and    πθ(x, z) = argmax_a sθ(x, z, a).

The policy πθ decides which action to take by applying the scoring function to every action possible from state (x, z) and greedily taking the highest-scoring action. The scoring function reflects the overall quality of taking action a in state (x, z), which corresponds to the total reward obtained by taking action a in (x, z) and thereafter following policy πθ:

sθ(x, z, a) = r(x, z, a) + Σ_{t=1}^{T} (rθ^t | (x, z), a).

Here (rθ^t | (x, z), a) corresponds to the reward obtained at step t while having started in state (x, z) and followed the policy with parameterization θ for t steps. Taking the sum of these rewards gives us the total reward from state (x, z) until the end of the episode. Since the policy is deterministic, we may refer to a parameterized policy using simply θ. Note that the optimal parameterization θ* obtained after learning (see Sec. 3.3) is the parameterization that maximizes the expected reward in all state-action pairs of the process. In practice, the initial state of such a process for an input x corresponds to an empty z vector where no feature has been selected. The policy πθ sequentially picks, one by one, a set of features pertinent to the classification task, and then chooses to classify once enough features have been considered.
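The greedy inference loop described above can be sketched as follows. The scoring function here is a hand-crafted stand-in for the learned sθ (the paper learns it with a Reinforcement-Learning-inspired algorithm), included only to show the state representation, the transition z′ = z + fj, and the stopping behaviour:

```python
import numpy as np

def rollout(x, score, max_steps=None):
    """Greedy episode: from state (x, z), repeatedly take the argmax-scoring
    action. Actions are ('feature', j) for an unselected feature j, or
    ('classify', y) for a terminal labeling action."""
    n = len(x)
    z = np.zeros(n, dtype=int)
    labels = [0, 1]
    for _ in range(max_steps or n + 1):
        actions = [("feature", j) for j in range(n) if z[j] == 0]
        actions += [("classify", y) for y in labels]
        kind, arg = max(actions, key=lambda a: score(x, z, a))
        if kind == "classify":
            return arg, z                       # (predicted label, mask used)
        z = z + np.eye(n, dtype=int)[arg]       # transition: z' = z + f_j
    return None, z

# Toy hand-crafted scorer (illustrative, not a learned s_theta): acquire
# feature 0 first, then classify by the sign of what has been seen.
def score(x, z, a):
    kind, arg = a
    if kind == "feature":
        return 0.5 if arg == 0 else -1.0        # only feature 0 is worth acquiring
    if z.sum() == 0:
        return -2.0                             # don't classify before looking
    seen = (x * z).sum()                        # only selected features are visible
    return 1.0 if (seen > 0) == (arg == 1) else -1.0

label, z = rollout(np.array([2.0, -1.0, 0.5]), score)
print(label, z.tolist())
```

With a learned scorer, the number of features acquired before the classify action wins the argmax varies per datum, which is where the datum-wise sparsity comes from.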

Reward The reward function reflects the immediate quality of taking action a in state (x, z) relative to the problem at hand. We define a reward function over the training set (xi, yi) ∈ Train:

r : X × Z × A → R,

which reflects how good a decision it is to take action a in state (xi, z) for input xi, relative to our classification task. This reward is defined as follows:

– If a corresponds to a feature selection action, then the reward is −λ.
– If a corresponds to a classification action, i.e. a = y, we have: r(xi, z, y) = 0 if y = yi, and r(xi, z, y) = −1 if y ≠ yi.

In practice, we set λ
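The reward definition translates directly to code; the tuple encoding of actions below is our own convention, not the paper's:

```python
def reward(action, true_label, lam=0.1):
    """Immediate reward r(x, z, a): -lam for acquiring a feature,
    0 for a correct classification, -1 for an incorrect one."""
    kind, arg = action
    if kind == "feature":
        return -lam
    return 0.0 if arg == true_label else -1.0

print(reward(("feature", 3), true_label="y1"))      # feature acquisition costs lam
print(reward(("classify", "y1"), true_label="y1"))  # correct label
print(reward(("classify", "y2"), true_label="y1"))  # incorrect label
```

Because every acquired feature costs −λ while a wrong label costs −1, λ directly controls the trade-off between sparsity and accuracy: a larger λ makes the agent classify earlier, with fewer features.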
