An Efficient Method for Variables Selection Using SVM-Based Criteria

B. Ghattas∗, A. Ben Ishak



Abstract. The problem of feature selection for Support Vector Machine (SVM) classification is investigated in the linear two-class case. We suggest a new method of feature selection based on ranking scores derived from SVMs, and we analyze the effect of retraining on the ranking rules based on these scores. Our feature selection algorithm is a forward selection strategy that follows the decreasing order of variable importance, and it determines in a simple way how many of the selected features should be provided to the predictor. Finally, we illustrate the effectiveness of our approach on linear synthetic data and on some challenging benchmark problems based on microarray data. Results demonstrate a significant improvement of generalization performance using only a few variables.

Keywords: Support vector machines (SVMs), Feature selection, SVM-based criteria, Ranking rules, Bounds and margin sensitivity, Forward selection, Subset search strategy, Bootstrap, Cross validation, Microarray data.

1 Introduction

Many advanced pattern classification algorithms make it possible to infer knowledge from large datasets. This knowledge is then used for prediction and/or explanatory studies. Some difficulties arise when these algorithms are applied to datasets having very many attributes but very few instances. Typically, microarray technology allows researchers to simultaneously measure expression levels associated with thousands of genes in a single experiment. However, the number of replicates in these experiments is often severely limited; usually fewer than a hundred examples are available altogether for training and testing. This gives rise to datasets having a large number of gene expression values and a relatively small number of samples. In the classification framework each training example x_i ∈ R^n consists of n explanatory measurements, referred to as variables, attributes or features, characterizing the problem. Each training example is associated with a label specifying its class. During a learning process, machine learning algorithms try to estimate dependencies between the examples and their labels. Intuitively, it may seem that increasing the number of variables cannot damage the discrimination quality, but in practice it turns out to be a major problem.


Indeed, the success of a classification task is strongly affected by the quality of its explanatory variables; redundant, noisy or unreliable variables may impair the learning process. Many recent applications involve numerous irrelevant and redundant variables and often comparably few training examples. In this challenging setting, machine learning methods related to feature selection play a fundamental role in increasing efficiency and enhancing the comprehensibility of results. Thus, the motivation for feature selection is three-fold: improving generalization ability, facilitating data understanding and reducing the time complexity and storage requirements.

In this paper we suggest a new methodology for the problem of variable selection. We investigate the efficiency of ranking scores derived from Support Vector Machines for binary classification in the linear case. The relevance of a variable is measured by its influence on the weight vector norm or on estimates of the generalization error. After performing a variable ranking, we adopt a forward selection strategy to determine the optimal subset, as done in [7]. This paper can be seen as an extension of [12] and as an alternative search strategy for the SVM-RFE algorithm proposed by Guyon [10]. By analogy with the SVM-RFE score we introduce new ranking criteria based on two common bounds for the generalization error of an SVM predictor. Extensive experiments are conducted to compare the various ranking criteria. Scores for each variable are estimated using the dataset at hand but also by bootstrap resampling. A forward selection strategy is also suggested, allowing us to determine the optimal number of important variables to use in the model, i.e. the one achieving a minimal test error. The generalization error is estimated by random splitting, leave-one-out or cross validation. Our method performs variable selection using ranking scores derived from the rich properties of SVMs and alternates the search strategy with the process of training and performance assessment. To our knowledge, the question of how many ranked features must be provided to the predictor has not been addressed so far. We show through numerical experiments that our algorithm efficiently solves this problem.

The paper is organized as follows. The next Section introduces the basics of SVMs and the two widely used bounds for their generalization error. The different SVM-based criteria for variable ranking are described in Section 3, with a detailed analysis of these criteria concerning the retraining effects and some ranking rule equivalences. Section 4 applies our feature selection methodology to simulated and benchmark real datasets. Finally, in the last sections we discuss our results and present some possible extensions.

2 SVM for Classification

In this Section we introduce the basic ideas of support vector machines. This class of algorithms, introduced in [3] and [14], has shown high performance in a large number of real applications.


2.1 Linear SVM

Suppose we have a training set S of l samples in X ⊆ R^n, each belonging to one of the two classes of Y = {−1, +1}, the two classes being linearly separable:

S = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)} ⊆ (X × Y)^l.

A hyperplane in the n-dimensional space X is completely defined by a pair (w, b) ∈ R^n × R such that ⟨w, x⟩ + b = 0, where ⟨·,·⟩ denotes the standard inner product and ‖·‖ the associated norm in R^n. A hyperplane is optimal if it separates the two classes perfectly and is the farthest away from the closest training vectors of each class. In other words, it is the one that maximizes the margin defined by γ = 1/‖w‖. This means that the optimal hyperplane has to solve the following optimization problem:

Minimize_w   ‖w‖²/2
Subject to   y_i(⟨w, x_i⟩ + b) ≥ 1,  ∀i ∈ {1, ..., l}     (1)

This formulation of the SVM optimization problem is called the hard margin formulation since no training errors are allowed. This quadratic problem can be solved in the dual space of the Lagrange multipliers α_i ≥ 0, i ∈ {1, ..., l}. By forming the Lagrangian and solving the stationarity conditions, the problem can be translated into:

Maximize_α   g(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i,j=1}^{l} y_i y_j α_i α_j ⟨x_i, x_j⟩
Subject to   Σ_{i=1}^{l} y_i α_i = 0,  α_i ≥ 0,  i = 1, 2, ..., l     (2)

The optimal solution α* = (α_1*, ..., α_l*) of problem (2) specifies the coefficients of the optimal hyperplane:

w* = Σ_{i=1}^{l} α_i* y_i x_i = Σ_{i∈sv} α_i* y_i x_i,

where sv = {i ∈ {1, ..., l} ; α_i* ≠ 0}. Vectors x_i for which α_i* ≠ 0 are called support vectors¹ and are the closest ones to the separating hyperplane. The threshold b* is chosen to maximize the margin and is given by:

b* = − [ max_{y_i=−1}(⟨w*, x_i⟩) + min_{y_i=+1}(⟨w*, x_i⟩) ] / 2.

The decision function given by an SVM is:

f(x) = sign( Σ_{i∈sv} α_i* y_i ⟨x_i, x⟩ + b* )     (3)

¹ The training samples for which α_i* = 0 may be removed from the training set without affecting the solution.
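In practice the dual solution can be recovered from any standard SVM implementation. The following minimal sketch (ours, not the authors' Matlab code) uses scikit-learn and approximates the hard-margin case with a large C; it reconstructs w* from the support vectors and their coefficients α_i* y_i.

    import numpy as np
    from sklearn.svm import SVC

    def linear_svm_solution(X, y, C=1e6):
        # A very large C approximates the hard-margin SVM of problem (1).
        model = SVC(kernel="linear", C=C).fit(X, y)
        # dual_coef_ stores alpha_i* y_i for the support vectors and
        # support_ their indices, so w* = sum_i alpha_i* y_i x_i.
        w = model.dual_coef_[0] @ X[model.support_]
        b = model.intercept_[0]
        return w, b

    # The decision function (3) is then sign(<w, x> + b):
    # w, b = linear_svm_solution(X, y); predictions = np.sign(X @ w + b)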


For the non-separable case, the training data cannot be separated by a hyperplane without errors because of a few margin violations. In such a situation, the optimization problem (1) has no solution. Therefore, the previous constraints must be relaxed by introducing slack variables ξ_i, which results in the so-called soft margin SVM algorithm, where the optimization problem is:

Minimize_w   ‖w‖²/2 + C Σ_{i=1}^{l} ξ_i
Subject to   y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,  i = 1, 2, ..., l
             ξ_i ≥ 0,  i = 1, 2, ..., l     (4)

Here, the constant C determines the trade-off between the maximization of the margin and the minimization of the classification error. When C is sufficiently large and the training set S is separable, the solution of this optimization problem coincides with the one obtained for the separable case.

2.2 Non-linear SVM

The above algorithms are limited to the linearly separable case and to a relaxed version of it that allows some training errors. As the decision function (3) depends on inner products between vectors rather than on the input vectors explicitly, one can extend SVMs to non-linear problems by means of a kernel function K that satisfies the Mercer conditions (symmetric positive semi-definite function). Such a kernel induces an implicit non-linear function ϕ which maps the sample points x_i ∈ X into a high-dimensional (even infinite-dimensional) feature space T, where one constructs the optimal hyperplane separating the mapped points ϕ(x_i). This is equivalent to a non-linear separating surface in X. In the above training problems the data appear only in the form of inner products ⟨x_i, x_j⟩, so in the feature space we only deal with the data in the form ⟨ϕ(x_i), ϕ(x_j)⟩ = K(x_i, x_j). Thus we never need to know ϕ explicitly. The matrix (K(x_i, x_j))_{1≤i,j≤l} is called the Gram matrix of the training examples. Some widely used kernels are:

• Polynomial: K(x, z) = (⟨x, z⟩ + 1)^d, where d ∈ N is the degree.
• Radial Basis Function (RBF): K(x, z) = exp(−‖x − z‖²/(2σ²)), where σ ∈ R*₊ is the bandwidth.

In the linearly separable case K is simply the dot product in the original space and ϕ is the identity.

2.3 Bounds on the Generalization Error for SVMs

Since the expected error of an SVM is not accessible, one has to build estimates or bounds for it. We present here two widely used bounds of the generalization performance of an SVM. Almost all the bounds introduced for the expected error rate of an SVM are derived from the leave-one-out estimate. This l-fold cross-validation procedure consists in learning a decision function from l − 1 examples, testing the remaining one and repeating until all elements have served as test examples. The number of errors in the leave-one-out procedure is denoted by L. It is known that this


procedure gives an almost unbiased estimate of the expected generalization error ([11]). The theoretical wealth and the structure of SVMs make it possible to estimate their generalization performance from bounds on the leave-one-out error. The two most common error bounds for SVMs are the radius-margin bound and the span bound:

• Radius-margin bound: For SVMs without threshold and with no training errors, Vapnik ([15]) established the following upper bound on the number of test errors obtained by the leave-one-out procedure:

L ≤ R²/γ² = R² ‖w*‖²     (5)

where γ is the margin and R is the radius of the smallest sphere enclosing all the training data. Computing the radius of the smallest sphere enclosing the training points can be achieved by solving the following quadratic problem ([15]):

Maximize_β   R² = Σ_{t=1}^{l} β_t K(x_t, x_t) − Σ_{t,s=1}^{l} β_t β_s K(x_t, x_s)
Subject to   Σ_{t=1}^{l} β_t = 1,  β_t ≥ 0,  t = 1, 2, ..., l     (6)

• Span bound: A tighter bound has been given by Vapnik and Chapelle ([16]). Under some assumptions the authors derived an estimate of the number of errors made during the leave-one-out procedure:

L ≤ Σ_{p∈sv} α_p* S_p²     (7)

where the span S_p is the distance between the support vector x_p and a set Λ_p of constrained linear combinations of the other support vectors,

S_p = d(x_p, Λ_p) = min_{x∈Λ_p} ‖x_p − x‖

where

Λ_p = { Σ_{i=1, i≠p}^{l} λ_i x_i : Σ_{i=1, i≠p}^{l} λ_i = 1, and ∀i ≠ p, α_i* + y_i y_p α_p* λ_i ≥ 0 }.

The squared span S_p² is related to the extended matrix K̃_sv of the Gram matrix K_sv of the support vectors,

K̃_sv = [ K_sv  1 ]
       [ 1ᵀ    0 ],

where K_sv = (K(x_i, x_j))_{i,j∈sv}, through the equation

S_p² = 1 / (K̃_sv⁻¹)_pp ,

where 1 is a unit column of length #sv (# denotes the cardinality of the set sv of support vectors) and 1ᵀ is its transpose, K̃_sv is a (#sv + 1) × (#sv + 1) matrix, and (K̃_sv⁻¹)_pp is the p-th diagonal element of K̃_sv⁻¹ (when the matrix K̃_sv is singular, a small ridge is added).

Note however that according to these two bounds, the larger the margin, the better the generalization capacity is. Moreover, the generalization ability of an SVM is independent of the dimensionality of the feature space.
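Both quantities entering the radius-margin bound (5) can be computed numerically: ‖w*‖² from a trained linear SVM and R² by solving the quadratic problem (6). The sketch below is an illustrative Python implementation under those assumptions (a generic solver is used for (6); this is not the authors' code).

    import numpy as np
    from scipy.optimize import minimize
    from sklearn.svm import SVC

    def squared_radius(X):
        # Problem (6) with the linear kernel: maximize
        # sum_t beta_t K(x_t, x_t) - sum_{t,s} beta_t beta_s K(x_t, x_s)
        # subject to sum_t beta_t = 1 and beta_t >= 0.
        K = X @ X.T
        diag = np.diag(K)
        l = X.shape[0]
        res = minimize(lambda b: -(b @ diag - b @ K @ b),
                       np.full(l, 1.0 / l),
                       bounds=[(0.0, None)] * l,
                       constraints=({"type": "eq", "fun": lambda b: b.sum() - 1.0},),
                       method="SLSQP")
        return -res.fun

    def radius_margin_bound(X, y, C=1e6):
        # Right-hand side of (5): R^2 * ||w*||^2 for a (nearly) hard-margin linear SVM.
        model = SVC(kernel="linear", C=C).fit(X, y)
        w = model.dual_coef_[0] @ X[model.support_]
        return squared_radius(X) * (w @ w)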

3 SVM-Based Criteria for Ranking Variables

An interesting property of SVMs is that they come with several statistics from which criteria for variable ranking can be derived. Similarly to [12], we investigate three criteria derived from SVMs for assessing the importance of each variable. These criteria are the weight vector bound, the radius-margin bound and the span bound. Each criterion gives rise to three scores called: zero-order score, difference-order score and first-order score.

3.1 The Criteria Derived From the SVM

The three criteria derived from the SVM that we will consider are the following:

The weight vector bound:   W = ‖w*‖²
The radius-margin bound:   RW = R² ‖w*‖²
The span bound:            Spb = Σ_{p=1}^{l} α_p* S_p²

Let us denote:

C : the value of any of the three criteria computed after learning with the entire dataset.
C(−i) : the value of that criterion when computed without retraining, but omitting the i-th component of any n-valued vector involved in the computation of C. This is the importance measure of variable i without retraining.
C_r(−i) : the value of any criterion computed after retraining an SVM without considering the i-th component. This is the importance measure of variable i obtained with retraining.

3.2 The Scores Based on the Selected Criteria

As for the criterion used in the SVM-RFE algorithm recently proposed by [10], we introduce two analogous scores based on the radius-margin bound and the span bound. For each variable in a dataset we define three scores computed from any of the three criteria of the previous Section.

3.2.1 The Zero-order Scores

A zero-order score for a variable i based on criterion C is equal to either C_r(−i) (when retraining is used) or C(−i) (without retraining). The zero-order scores for the three criteria are denoted W^0, RW^0 and Spb^0. We add an index r when retraining is used. The most relevant variable i for the classification problem is the one that maximizes the score when retraining is used. Without retraining, the most relevant variable is the one that minimizes the zero-order score.
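As an illustration, the retrained zero-order weight-vector score W^0_r(−i) can be computed by simply dropping variable i and refitting the SVM. The sketch below (Python, our naming, scikit-learn assumed) returns one score per variable and the corresponding ranking, most relevant first.

    import numpy as np
    from sklearn.svm import SVC

    def zero_order_scores_with_retraining(X, y, C=1e6):
        # W^0_r(-i): retrain a linear SVM without variable i and record ||w||^2.
        # With retraining, a larger value means a more relevant variable.
        scores = np.empty(X.shape[1])
        for i in range(X.shape[1]):
            Xi = np.delete(X, i, axis=1)
            m = SVC(kernel="linear", C=C).fit(Xi, y)
            w = m.dual_coef_[0] @ Xi[m.support_]
            scores[i] = w @ w
        ranking = np.argsort(-scores)      # most relevant variables first
        return scores, ranking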

3.2.2 Difference-order Scores

Here the relevance of variable i is measured by |C − C(−i)|. The difference-order scores for the three criteria are denoted ∆W, ∆RW and ∆Spb. We add an index r when retraining is used. The most relevant variable i for the classification problem is the one that maximizes the difference-order score, with or without retraining.

3.2.3 First-order Scores

In this case one measures the infinitesimal sensitivity of C to each variable. This can be done by introducing a scaling factor v_i and computing the derivative of C with respect to that virtual weighting factor. Each factor v_i, i = 1, ..., n, acts as a component-wise multiplicative term on the i-th variable and is set to 1 when evaluating the score:

∂C(i) = (∂C/∂v_i)|_{v_i=1},  i = 1, ..., n.

The three scores are denoted respectively ∂W, ∂RW and ∂Spb. Computing these first-order scores requires the derivatives of ‖w‖², R² and S_p² with respect to v_i, which are solutions of optimization problems. The computation of these derivatives can be done thanks to a lemma given in [6]. We give in the appendix some details necessary for computing these scores. The most relevant variable i for the classification problem is the one that maximizes the first-order score. Retraining is clearly unnecessary for this category of scores.
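In the linear case the first-order weight-vector score has a simple closed form: since ∂K((v·x_t),(v·x_s))/∂v_i at v = 1 equals 2 x_{ti} x_{si} (see the appendix), ∂W(i) reduces to 2 w_i², i.e. twice the SVM-RFE criterion. The following sketch (ours, scikit-learn assumed) computes the score from the dual coefficients and checks this identity.

    import numpy as np
    from sklearn.svm import SVC

    def first_order_weight_scores(X, y, C=1e6):
        # dW(i) = sum_{t,s} y_t y_s alpha_t* alpha_s* dK/dv_i |_{v=1};
        # with the linear kernel dK/dv_i |_{v=1} = 2 x_{ti} x_{si}.
        m = SVC(kernel="linear", C=C).fit(X, y)
        ay = m.dual_coef_[0]               # alpha_t* y_t on the support vectors
        Xs = X[m.support_]
        A = np.outer(ay, ay)               # (alpha_t* y_t)(alpha_s* y_s)
        dW = 2.0 * np.array([Xs[:, i] @ A @ Xs[:, i] for i in range(X.shape[1])])
        w = ay @ Xs
        assert np.allclose(dW, 2.0 * w ** 2)   # closed form in the linear case
        return dW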


3.3 Effects of Retraining on the Scores

In previous works ([10], [12]) retraining has been considered unnecessary for the zero-order and the difference scores. In this Section we examine this assumption and analyze its effect on the corresponding scores. When retraining is performed at each variable removal, the exact value of the weight vector or of the bounds is computed. These values give either an estimate of the separability of the new dataset (removing a variable induces a new structure for the initial dataset) or an estimate of the generalization performance of the new SVM predictor. Without retraining, however, we only partially evaluate the contribution of a variable to each of the three criteria. Consequently, W(−i), RW(−i) and Spb(−i) no longer directly relate to the margin or to the generalization ability. For instance, when retraining is performed, |W − W_r(i)| measures the difference between two weight vectors corresponding to different training data structures, whereas without retraining it evaluates the partial contribution of the i-th variable to the weight vector obtained on the initial training set. This fact completely changes the meaning of the zero-order and difference criteria. Thus, with retraining, a relevant variable should affect the weight vector or the generalization error bounds more than an irrelevant one, while without retraining, a relevant variable should contribute to the initial weight vector or to the initial generalization error bounds more than an irrelevant one. As a result, the ranking rule corresponding to the zero-order criteria is reversed when retraining is not performed: the most relevant variable i for the classification problem is the one that minimizes the zero-order score. Note that the ranking rules for the difference criteria, with or without retraining, are exactly the same.

3.4 Some Equivalences Between the Scores in the Linear Case

In this section we establish some equivalences between a few of the SVM-based ranking rules. All the equivalences deal with the linear dependence case (i.e. K(x, z) = ⟨x, z⟩). Proofs of the main results are quite simple and are given in the appendix.

• Under the retraining procedure, maximizing ∆W_r is equivalent to maximizing W^0_r, and therefore the relevance orders given by W^0_r and ∆W_r are exactly the same.
• Without retraining, maximizing |C − C(−i)| is equivalent to minimizing C(−i), for the three criteria.
• Without retraining, W^0, ∆W and ∂W are equivalent.

Lemma 1. The following inequality holds: ∀i = 1, ..., n,

W^0 ≤ W^0_r(i).

The following results are obtained without retraining.

Lemma 2. The following inequality holds for the three importance measures W, RW and Spb: ∀i,

C^0 ≥ C^0(−i).

Lemma 2 shows that the difference C − C(−i) is always positive for the three criteria. This means that, when retraining is not performed, each difference criterion is equivalent to its corresponding zero-order one. In other words, maximizing |C − C(−i)| is equivalent to minimizing C(−i). Moreover, the following lemma holds.

Lemma 3. When retraining is not performed the following equality holds:

W^0 − W^0(i) = (1/2) (∂‖w‖²/∂v_i)|_{v_i=1}.

Finally, according to the previous lemma we conclude that the relevance ranks based on W^0, ∆W and ∂W are identical. In view of these equivalences, a few criteria will not be considered in the experimental section when we deal with linearly separable datasets. The extension of these results to the non-linear dependence case is possible but not obvious, because the variable removal is done in the initial space X, where non-linear Mercer kernels do not define an inner product. This issue is beyond the scope of this paper. The following table gives a summary of the scores and their equivalences in the linear dependence case.

                       Zero-order              Difference-order        First-order
Weight vector          W^0 (*)    W^0_r (**)   ∆W (*)     ∆W_r (**)    ∂W (*)
Radius-margin bound    RW^0 (***) RW^0_r       ∆RW (***)  ∆RW_r        ∂RW
Span bound             Spb^0 (****) Spb^0_r    ∆Spb (****) ∆Spb_r      ∂Spb

Tab. 1: The criteria indexed by r are computed with retraining. The criteria marked by the same number of asterisks are equivalent.

3.5 Using the Scores to Select Features

Each score described in this section may be computed from an SVM model learned from the dataset at hand. Once the scores are computed, all the variables may be ranked in a decreasing order of importance.

3.5.1 Nested Models for Selecting the Optimal Number of Relevant Features

We have chosen a stepwise forward procedure to evaluate the contribution of each feature to the model. A sequence of nested increasing models is constructed. The model M^(1) uses one feature, the most important one, and M^(k) uses the k most important features, k = 1, ..., n. We estimate the error rate of each model M^(k) by cross validation, by random splitting test samples or by the leave-one-out procedure. The model having the lowest mean error is chosen as the one having the optimal number of features.
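Given a ranking of the variables, this nested-model search can be coded in a few lines. The sketch below (Python, our naming; stratified random splits are used to estimate the error of each M^(k)) returns the optimal number of features and the error curve.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedShuffleSplit

    def select_k(X, y, ranking, n_splits=50, test_size=0.3, C=1e6):
        # Error of each nested model M^(k) built on the k top-ranked variables.
        # (For large n one may first step by tens, as done in Section 4.)
        splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                          random_state=0)
        errors = []
        for k in range(1, len(ranking) + 1):
            cols = ranking[:k]
            errs = []
            for tr, te in splitter.split(X, y):
                m = SVC(kernel="linear", C=C).fit(X[tr][:, cols], y[tr])
                errs.append(np.mean(m.predict(X[te][:, cols]) != y[te]))
            errors.append(np.mean(errs))
        errors = np.array(errors)
        return int(np.argmin(errors)) + 1, errors   # k_opt and the error curve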

3.5.2 Bootstrapping the Scores

Our first experiments showed clearly that computing the scores using only the dataset at hand gives good but unstable results. The scores may change significantly when the same dataset is used with one observation left out. To get better estimates of the scores we therefore computed, as in bagging ([4]), a bootstrap estimate of the scores: B bootstrap samples are drawn, giving B estimates of each score, and their average is used as a better estimate. The results appear more robust to data variation.
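A possible implementation of the bootstrap estimate of a score (Python; score_fn stands for any of the per-variable scores defined above) is:

    import numpy as np

    def bootstrap_scores(X, y, score_fn, B=100, random_state=0):
        # Average a per-variable score over B bootstrap resamples of (X, y).
        # For very small samples one may resample within each class so that
        # both classes stay present in every bootstrap sample.
        rng = np.random.RandomState(random_state)
        total = np.zeros(X.shape[1])
        for _ in range(B):
            idx = rng.randint(0, len(y), size=len(y))   # draw with replacement
            total += score_fn(X[idx], y[idx])
        return total / B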

4 Experiments and Results

We have carried out several experiments to assess the performance of our feature selection strategy on real and synthetic datasets. We have used a modified version of the optimization functions of the Matlab toolbox SVM-KM ([5]). Experiments are conducted using each of the 9 scores introduced in the previous Section, with and without retraining (retraining is not necessary for the Spb score). In all the experiments we compute the scores for each variable twice: first using the whole dataset at hand, and second using 100 bootstrap samples and averaging the score of each variable over these samples. To assess the optimal number of relevant features, we construct nested increasing SVM models, where at each stage we add the variables to the model one at a time in decreasing order of importance. The performance of each model is assessed using 50 stratified random splitting test samples (we have checked that results are similar if we use cross validation or the leave-one-out procedure). For the synthetic dataset we also check the efficiency of our method when varying the number of features and the sample size. Our first aim is to show that our scoring strategy can find the truly important variables in the presence of noise and with few observations compared to the number of variables. For large datasets the stepwise procedure proceeds in two steps: first we introduce the variables in the model ten at a time to localize the global minimum, and then we rerun the same procedure introducing the variables one by one, going slightly beyond the localized minimum. In this way we save computation time and we obtain the precise number of optimal variables to use. The scheme of our experiments is given in Table 2.

Fix the number of bootstrap samples B. Let D be the whole dataset.
Score(D, B) and output X^(1), ..., X^(n)
For k = 1..n
    M^k = f(X^(1), ..., X^(k))
    Er_k = TestRS(M^k, D)
k_opt = Argmin_k {Er_k}
Test the model M^{k_opt} by leave-one-out using D.

Tab. 2: Scheme of the experiments used in the following sections. Score is the function which computes the scores using B bootstrap samples. TestRS is a function testing a model using 50 random splitting samples from the learning dataset.

Further, this scheme will be used within a cross validation procedure.

4.1 Datasets

Here we give a brief description of the synthetic and real-life datasets investigated in our experiments.

4.1.1 Toy Data

We used the linearly separable dataset described in [17] and [18]. In this two-class linear problem, the first six features are relevant. The two classes y = 1 and y = −1 are equally probable. With probability 0.7, the first three features {x_1, x_2, x_3} are drawn as x_i = y N(i, 1) and the second three features {x_4, x_5, x_6} are drawn as x_i = y N(0, 1); otherwise, the first three are drawn as x_i = y N(0, 1) and the second three as x_i = y N(i − 3, 1). The remaining features are noise, x_i = N(0, 20), i = 7, ..., n. It is clear from this construction that the first six features are redundant. Note that we can vary the number (n − 6) of noise features and the dataset size l according to our aims.
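A possible generator for this toy problem, following our reading of the description above (the noise scale 20 is taken as a standard deviation), is:

    import numpy as np

    def toy_data(l=100, n=200, random_state=0):
        # Two-class linear toy problem of [17, 18]: 6 relevant (redundant)
        # features followed by n - 6 pure-noise features.
        rng = np.random.RandomState(random_state)
        y = rng.choice([-1, 1], size=l)
        X = np.empty((l, n))
        for k in range(l):
            if rng.rand() < 0.7:
                X[k, 0:3] = y[k] * rng.normal([1, 2, 3], 1)   # x_i = y N(i, 1)
                X[k, 3:6] = y[k] * rng.normal(0, 1, size=3)   # x_i = y N(0, 1)
            else:
                X[k, 0:3] = y[k] * rng.normal(0, 1, size=3)
                X[k, 3:6] = y[k] * rng.normal([1, 2, 3], 1)   # x_i = y N(i-3, 1)
        X[:, 6:] = rng.normal(0, 20, size=(l, n - 6))         # noise features
        return X, y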

4.1.2 Real Applications and Datasets

We use several microarray datasets which have been widely experimented with, all of them concerning a two-class classification problem.

• Colon Cancer: The colon cancer problem is described in [2]. In this dataset, 62 tissue samples probed by DNA microarrays contain 22 normal and 40 colon cancer examples. These two classes have to be discriminated from the expression profiles of 2000 genes.

• Lymphoma Dataset: The lymphoma problem is described in [1]. In this dataset the goal is to separate cancerous and normal tissues in a large B-cell lymphoma. The dataset contains 96 expression profiles concerning 4026 genes; 62 samples are in the classes "DLCL", "FL" and "CLL" (malignant) and the remaining 34 are labelled "otherwise".

• Prostate Cancer: This dataset is described in [13]. It contains 102 samples with expression profiles concerning 12600 genes. The task is to separate tumor from normal samples.

• Leukemia Dataset: The leukemia discrimination problem is described in [8]. The ALL-AML dataset contains 72 samples, each with an expression profile over 7129 genes. The task is to distinguish between the two variants of leukemia, ALL and AML. The training set consists of 38 examples and the test set of 34 examples.

As usually done, we have normalized each dataset to zero mean and unit standard deviation for each variable. Table 3 gives a summary of these datasets.

Dataset           # of features   Training set size   Test set size   # observations +1/-1
Colon Cancer      2000            62                  not available   22/40
Lymphoma          4026            96                  not available   62/34
Prostate Cancer   12600           102                 not available   52/50
Leukemia          7129            38                  34              27/11 - 20/14

Tab. 3: Real datasets description.

4.2 Results

In addition to the 10 selected scores, we have also used the Fisher discrimination criterion as an extra ranking score,

FDS(i) = (µ_i⁺ − µ_i⁻) / (η_i⁺ + η_i⁻),  i = 1, 2, ..., n,

where µ_i⁺ and µ_i⁻ are the mean values of the i-th variable in the positive and negative classes and η_i⁺, η_i⁻ are the corresponding standard deviations. The most relevant variable i for the classification problem is the one that maximizes FDS(i). In the experiments this criterion is called FDS. Note that the Fisher discrimination score behaves almost identically to the Pearson correlation coefficient (see [9]).
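The FDS criterion is straightforward to vectorize; a short sketch (Python, labels assumed to be ±1) is:

    import numpy as np

    def fisher_scores(X, y):
        # FDS(i) = (mu_i^+ - mu_i^-) / (eta_i^+ + eta_i^-) for every variable i.
        # (Many implementations use the absolute difference in the numerator.)
        pos, neg = X[y == 1], X[y == -1]
        return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0))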

4.2.1 Toy Experiments

As this dataset is linearly separable, we have used a standard linear SVM throughout the experiments. First we check the ability of our strategy to retrieve the truly important variables when noisy variables are present, and when varying the sample size.

Effect of the sample size for retrieving the relevant features

For this experiment we fix the number of features to n = 200. The first 6 features are the original relevant features of the model, the others are noise. We vary the sample size, taking l = 50, 100 and 200. We compute the 11 selected scores after learning an SVM on each dataset, and rank the 200 features according to each score. Table 4 gives the highest rank needed to retrieve the 6 relevant variables originally used in the model, for each sample size. All of the scores except the zero-order span bound always rank 5 of the relevant variables among the first 6. For 50 and 100 samples the 6th relevant variable appears late. For 200 observations, the 6 relevant variables are ranked as the most important, except for the zero-order and difference-order span bound scores.

sample size \ Score   FDS   RW^0   Spb^0   ∆RW_r   ∆Spb_r   W^0_r   RW^0_r   Spb^0_r   ∂W   ∂RW   ∂Spb
50                    7     10     199     10      103      10      10       19        10   10    9
100                   7     8      200     9       7        9       9        7         8    8     9
200                   6     6      200     6       32       6       6        199       6    6     6

Tab. 4: Highest rank necessary to retrieve the 6 relevant variables of the model, for the three sample sizes 50, 100 and 200.

The worst results are obtained with the zero-order and difference-order span bound scores. For the other scores, it is clear that by increasing the sample size the 6 relevant variables are ranked as the most important ones. These results show clearly that we may retrieve the first 5 relevant variables even with a small sample size, except with the zero-order span bound score. We suspect the 6th variable to be redundant with the first 5. The Spb^0 score will not be considered in the rest of the experiments.

Figure 1 shows, for the three sample sizes, the performance of the nested increasing models where variables are introduced one by one in decreasing order of importance. Each panel corresponds to one sample size. The curves for the ten scores are overlaid. Each point is the mean error of the corresponding model using the k most important variables (x-axis), computed over 50 independently regenerated test samples of 50 observations each.

Fig. 1: Effect of varying the sample size. For each sample size, the curves corresponding to the 10 scores are overlaid. We use 200 features. The y-axis is the mean error over 50 regenerated test samples; the x-axis gives the number of important variables used.

• The shape of all these curves is similar: they decrease to a single global minimum and then increase. Only one curve seems different; it corresponds to the ∆Spb_r score.
• The global minimum is often reached when using the 4 most important retrieved variables, whatever score is used.
• The mean errors decrease when the sample size increases.

Figure 2 gives a better insight into these results when using 50 observations.

Varying the number of features

The first 4 variables relevant for the model are retrieved as the 4 most important ones whatever the number of features and for all the scores. The mean test error using these 4 variables is about 1%. For all the scores the minimum error is reached using the 7 most important variables when 500 features are tested (MER = 0.8%) and using 6 variables when 1000 features are included in the model (0.76%). The fifth and sixth originally relevant variables appear soon when using 500 features, and very late when using 1000 features. Table 5 gives the positions at which the 6 relevant variables are ranked.

Figure 3 gives the mean test error curves for the nested models. The left panel corresponds to 500 features, the right one to 1000 features. The shape of the error curve of the nested models is the same for all the scores, whatever the number of features.

Fig. 3: Mean test error by random splitting when using 500 features (left panel) and 1000 features (right panel).

Fig. 2: A comparison of the variable ranking scores on the toy problem. The x-axis is the number of features and the y-axis is the average test error over 50 regenerated test sets (n = 200, l = 50).

Score \ # of features   200   500   1000
FDS                     7     18    182
RW^0                    10    11    180
∆RW_r                   10    12    473
∆Spb_r                  103   10    602
W^0_r                   10    12    180
RW^0_r                  10    12    178
Spb^0_r                 19    10    176
∂W                      10    12    179
∂RW                     10    11    182
∂Spb                    9     10    594

Tab. 5: Highest rank necessary to retrieve the 6 relevant variables of the model, for the three numbers of features 200, 500 and 1000. The number of observations is set to 50.

Bootstrapping the scores

It is known that leaving one observation out does not affect the SVM solution when the omitted observation is not a support vector. We have checked the variability of a score when one observation is omitted: all the scores considered here are affected when one observation is left out, whether it is a support vector or not. A natural idea is then to look at the distribution of the scores. Figure 4 shows, for the first 9 variables of the toy model, the bootstrap distribution of their ∂Spb score. A dashed vertical line marks the mean of each distribution, and a continuous line marks the observed value of the score on the original dataset. For the first 6 variables, which are the relevant ones for the model, there is a higher variability in the distribution of the bootstrap scores, and the mean score is significantly lower than the observed one. For the noisy variables the scores are less variable and the bootstrap mean is closer to the observed value of the score.


Fig. 4: Bootstrap distribution of the ∂Spb score for the first 9 variables (50 observations, 200 variables, 500 bootstrap samples). Only the first 6 variables are relevant for the model. The two vertical lines correspond to the bootstrap mean score and to the estimated score from the original dataset.

Table 6 gives, for each score, the positions at which the 6 relevant variables appear in the ranking, when the ranking is done using the bootstrap mean score of each variable. It is clear that by using the bootstrap mean instead of the observed value of a score (given in the first column of Table 5), we retrieve the original variables of the model more rapidly in the ranking.

Variable \ Score   FDS   RW^0   ∆RW_r   ∆Spb_r   W^0_r   RW^0_r   Spb^0_r   ∂W    ∂RW   ∂Spb
1                  2     2      2       2        2       2        2         2     2     2
2                  1     1      1       1        1       1        1         1     1     1
3                  4     4      4       4        4       4        4         4     4     4
4                  5     5      5       5        5       5        5         5     5     5
5                  3     3      3       3        3       3        3         3     3     3
6                  122   123    22      123      22      123      123       123   123   123

Tab. 6: Ranking of the variables according to the mean of the bootstrapped scores over 500 samples; 200 features used with 50 observations. The 6th variable is always ranked at position 8.

4.2.2 Real World Applications

We performed the experiments on the 4 datasets described above: colon, leukemia, lymphoma and prostate. For leukemia a test sample is provided. The different steps of the experiments we have conducted are the following:

• In order to rank the variables we tried both possibilities: using only the dataset at hand and using 50 bootstrap samples.
• Once the ranking is done we compute the nested models and test their performance using 50 stratified random splitting test samples (we have also tried to make this selection using cross validation and leave-one-out; random splitting seems to be the best choice, generally choosing more variables than the other methods). In this way we determine the optimal number of variables to use in the model (achieving the lowest mean error).
• Once the optimal number of variables is fixed, we estimate the performance of the models using this number of variables by leave-one-out and cross validation.

Figure 5 shows the results for the colon dataset when ranking with the bootstrap mean scores. Each score corresponds to a curve; the scores are grouped according to their equivalences. The colon dataset is linearly separable without training errors, and the predictor used is a hard margin linear SVM achieving an average test error of 17% using all available variables.

Fig. 5: Colon with bootstrap: average test error on 50 random splits of the Colon Cancer dataset.

We can see that our strategy selects a small number of variables achieving a zero mean error by random splitting for 7 of the scores. The FDS score seems to give the worst results. To check the similarities between the scores, we have computed for the colon dataset the Spearman rank correlation between the 10 scores. Table 7 gives the correlation matrix for the 10 scores when computed over 50 bootstrap samples.

Tab. 7: Spearman rank correlation between the 10 scores for the colon dataset when using bootstrap.

We can see mainly that there are two very high correlations, for the following pairs of scores: (∂RW and RW^0) and (∂W and W^0_r). These correlations are nearly equal to one although no equivalences exist between these scores. Table 8 gives the number of variables necessary to achieve the minimum mean error for all the datasets, with and without bootstrap. Note that for Lymphoma and Prostate we first introduced the variables ten at a time in the stepwise procedure to localize the minimum, and then reran the procedure introducing the variables one by one.

           Colon                   Leukemia                Lymphoma              Prostate
Score      with       without     with       without      with      without     with       without
FDS        0.117(3)   0.117(3)    0.088(7)   0.058(133)   0.034(88) 0.028(82)   0.034(195) 0.037(84)
RW^0       0(55)      0(25)       0.088(22)  0.088(15)    0(44)     0(37)       0.022(27)  0.007(860)
∆RW_r      0(43)      0(17)       0.118(15)  0.058(3)     0(111)    0(93)       0.02(40)   0.015(421)
∆Spb_r     0.005(36)  0.063(384)  0.118(16)  0.058(13)    0(54)     0(43)       0.005(95)  0.004(102)
W^0_r      0.005(40)  0(28)       0.118(2)   0.118(48)    0(66)     0(103)      0.024(42)  0.012(1034)
RW^0_r     0(45)      0(17)       0.118(15)  0.058(3)     0(108)    0(97)       0.02(40)   0.015(418)
Spb^0_r    0.0133(44) 0.006(64)   0.088(17)  0.058(13)    0(50)     0(77)       0.007(53)  0.006(79)
∂W         0.005(40)  0(28)       0.118(2)   0.118(48)    0(65)     0(103)      0.024(42)  0.012(1030)
∂RW        0(55)      0(25)       0.088(22)  0.088(15)    0(44)     0(37)       0.022(27)  0.007(860)
∂Spb       0.005(17)  0(23)       0.118(11)  0.058(13)    0(83)     0(82)       0.001(102) 0.002(27)

Tab. 8: Results for the DNA microarray benchmark datasets. For each dataset, with or without bootstrap, we give the minimum mean test error and, between parentheses, the smallest number of selected variables for which it is reached. The test error is averaged over 50 random splitting samples. Note that, using all the variables, the mean test errors over 50 random splits are: Colon: 0.17, Leukemia: 0.20588, Lymphoma: 0.06, Prostate: 0.075.

We have also estimated these errors using leave-one-out and 10-fold cross-validation. The results are very similar to those obtained by random splitting, but slightly more optimistic. Two questions seem interesting: the variables common to the different scores, and the variables common to the rankings obtained with and without the bootstrap. Table 9 gives the number of variables commonly selected by each pair of scores for the colon dataset when using bootstrap. Values appearing on the diagonal are the numbers of variables selected by each score.

Here we retrieve the same results as when computing the Spearman rank correlations: the scores ∂RW and RW^0 retrieve exactly the same variables, and so do the scores ∂W and W^0_r. Table 10 shows, for the different datasets, the number of variables commonly selected when the ranking is done with and without the bootstrap. Together with Table 8, we can see that for almost all the scores and all the datasets the variables selected without bootstrapping are also selected with bootstrapping, recalling that with bootstrap the optimal number of variables needed to reach the minimum error is almost always higher.

          FDS   RW^0   ∆RW_r   ∆Spb_r   W^0_r   RW^0_r   Spb^0_r   ∂W   ∂RW   ∂Spb
FDS       3     1      2       1        1       2        1         1    1     2
RW^0            55     38      32       40      39       35        40   55    15
∆RW_r                  43      25       37      43       27        37   38    16
∆Spb_r                         36       27      26       29        27   32    14
W^0_r                                   40      38       29        40   40    15
RW^0_r                                          45       28        38   39    16
Spb^0_r                                                   44        29   35    15
∂W                                                                  40   40    15
∂RW                                                                      55    15
∂Spb                                                                           17

Tab. 9: Comparing the scores: variables commonly selected by the 10 criteria for the colon dataset when using bootstrap.

data \ score   FDS   RW^0   ∆RW_r   ∆Spb_r   W^0_r   RW^0_r   Spb^0_r   ∂W   ∂RW   ∂Spb
colon          3     25     17      35       26      17       23        26   25    15
lymphoma       21    4      10      5        9       10       7         9    4     3
leukemia       7     15     3       11       2       3        11        2    15    8
prostate       84    27     40      61       42      40       42        42   27    27

Tab. 10: Common selected variables when using the original dataset and 100 bootstrap samples for ranking.

5 Unbiased Estimation of the Mean Error for Our Procedure

In this section we cross-validate our procedure for retrieving a good model for a given dataset. The aim is to remove the bias from the estimation of the mean test error obtained in the previous section, mainly for the real datasets. Table 11 describes the whole procedure.

Table 12 gives the results for the three datasets for which no test sample is available. We have used only the three first-order scores. Mean test errors are given together with the mean number of variables used.

Let (L) be the whole dataset and B the number of bootstrap samples.
Partition (L) randomly with stratification into ten equal subsets L_1, ..., L_10. Let L_{-j} = L − L_j.
For j = 1..10
    Score(L_{-j}, B) and output X^(1), ..., X^(n)
    For k = 1..n
        M^k = f(X^(1), ..., X^(k))
        Er_k = TestRS(M^k, L_{-j})
    kopt_j = Argmin_k {Er_k}
    er_j = mean error of M^{kopt_j} over L_j
Output  er_bar = (1/10) Σ_{j=1}^{10} er_j

Tab. 11: 10-fold cross validation for the experiments described in Table 2.
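The wrapper of Table 11 can be placed around any concrete ranking and selection routine. A compact sketch (Python; rank_variables and select_k are placeholders for the scoring and nested-model steps described earlier) is:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import StratifiedKFold

    def outer_cv_error(X, y, rank_variables, select_k, C=1e6):
        # Unbiased estimate of the whole procedure: ranking and the choice of
        # k_opt are redone inside each fold, never using the left-out subset.
        errs = []
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        for tr, te in cv.split(X, y):
            ranking = rank_variables(X[tr], y[tr])       # scores computed on L_{-j}
            k_opt, _ = select_k(X[tr], y[tr], ranking)   # nested-model search on L_{-j}
            cols = ranking[:k_opt]
            m = SVC(kernel="linear", C=C).fit(X[tr][:, cols], y[tr])
            errs.append(np.mean(m.predict(X[te][:, cols]) != y[te]))
        return float(np.mean(errs))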

            ∂W             ∂RW            ∂Spb
Colon       0.233 (35.1)   0.214 (43.3)   0.197 (31.8)
Lymphoma    0.051 (86.5)   0.042 (71)     0.073 (70.5)
Prostate    0.054 (756.6)  0.053 (573.3)  0.052 (95.5)

Tab. 12: 10-fold cross validation mean errors for three real datasets, using the three first-order scores. The mean number of variables used in each case is given between parentheses.

It is important to note that the mean test errors obtained in these experiments are averaged over 10 runs of our procedure, each a priori giving rise to a different optimal model using a different number of variables. Figure 6 shows, in the upper panels, the mean error of each of the 10 models, for each dataset and for each score. The lower panels show the variation of the number of variables selected for the models whose errors are averaged. It is clear that the models whose errors are averaged in this kind of experiment are very different from one another. For the colon dataset the error rates are higher than those obtained when using all the variables. For the two other datasets the error rates are generally lower than those obtained when using all the variables.

Fig. 6: Cross validation of our procedure. The upper panels give the boxplots of the mean test errors for each dataset and for the three first-order scores, computed within the cross validation. The lower panels give the boxplots of the number of features selected using each cross validation sample.

6 Conclusion

In this paper we have proposed a new approach to the problem of feature selection using support vector machines for classification. The key components of our methodology are: exploiting the SVM-based scores for variable ranking, using the bootstrap to estimate the scores and performing a forward selection search for the optimal variable subset. The ranking scores are derived from the weight vector norm and from two widely used upper bounds of the leave-one-out error. Inspired by the SVM-RFE algorithm, we have introduced the difference-order criteria as an extra category besides the zero-order and first-order criteria.


At first, we derived some equivalences between the proposed ranking rules; then we analyzed the influence of retraining on the ranking rules. Our feature selection search strategy proved to be very efficient at retrieving the relevant variables of the model, even with few observations at hand and a very large number of features. The results on a variety of DNA microarray problems appear better than the results reported so far in the literature. The number of variables is dramatically reduced while significantly improving the SVM generalization ability. The performance of the algorithm is equivalent for the different scores whenever they are based on the SVM, but computations are faster when using the derivative scores or the other scores without retraining. Our method seems to be suitable for all types of problems; we have demonstrated its effectiveness on very high-dimensional problems with very few observations. Of course a lot of work remains to be done in order to extend this research to non-linear dependencies and to the multiclass case. On the practical side, our method seems to be robust against instability, but more work should be devoted to the interpretation of the results, in collaboration with researchers from the microarray domain.

References

[1] A. A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403: 503-511, 2000.

[2] U. Alon et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96(12): 6745-6750, 1999.


[3] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, 1992. ACM.

[4] L. Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.

[5] S. Canu, Y. Grandvalet, and A. Rakotomamonjy. SVM and Kernel Methods Matlab Toolbox. Perception Systèmes et Information, INSA de Rouen, France, http://asi.insa-rouen.fr/~arakotom/toolbox/index, 2003.

[6] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3): 131-159, 2002.

[7] B. Ghattas. Agrégation d'arbres de décision binaires : Application à la prévision de l'ozone dans les Bouches du Rhône. PhD thesis, Université de la Méditerranée-GREQAM, Marseille, France, 2000.

[8] T. R. Golub et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531-537, 1999.

[9] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3: 1157-1182, 2003.

[10] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3): 389-422, 2002.

[11] A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3, 1969.

[12] A. Rakotomamonjy. Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3: 1357-1370, 2003.

[13] D. Singh et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): 203-209, 2002.

[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

[15] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

[16] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2000.

[17] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3: 1439-1461, 2003.


[18] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems, Cambridge, MA, 2001. MIT Press.

7 Appendices

7.1 Computational Details for the First-order Scores

• Weight vector score:

∂W(i) = Σ_{t,s=1}^{l} y_t y_s α_t* α_s* [ ∂K((v·x_t), (v·x_s)) / ∂v_i ]|_{v_i=1}

where v = (v_1, ..., v_n) and (·) is the component-wise vector product defined by (v·z) = (v_1 z_1, ..., v_n z_n)ᵀ.

• Radius-margin bound score: Denote

K′_v(x_s, x_t) = ∂K((v·x_t), (v·x_s)) / ∂v_i .

Then

∂RW(i) = [ R² Σ_{t,s=1}^{l} y_t y_s α_t* α_s* K′_v(x_s, x_t) + ‖w‖² Σ_{t,s=1}^{l} (β_t* δ_ts − β_t* β_s*) K′_v(x_s, x_t) ]|_{v_i=1}

where δ_ts is the Kronecker symbol and the β_t*, t = 1, ..., l, maximize the target function of the optimization problem (6).

• Span bound score:

∂Spb(i) = Σ_{p∈sv} { S_p² [ −H⁻¹ (∂H/∂v_i) (α*, b*)ᵀ ]_p + α_p* S_p⁴ [ K̃_sv⁻¹ (∂K̃_sv/∂v_i) K̃_sv⁻¹ ]_pp }|_{v_i=1}

where

H = [ Kʸ_sv  Y ]
    [ Yᵀ     0 ],   with (Kʸ_sv)_pq = y_p y_q (K_sv)_pq,

and Y is a column vector containing the classes of the support vectors (as for the matrix K̃_sv, a small ridge is added to H to make sure that it is invertible).

The derivative of the kernel function with respect to v_i is also necessary to compute these scores. We give the derivatives of the widely used kernels:


• The polynomial kernel derivative:

[ ∂K((v·x), (v·z)) / ∂v_i ]|_{v=1} = 2 d x_i z_i (⟨x, z⟩ + 1)^{d−1},  i = 1, ..., n.

• The Gaussian kernel derivative:

[ ∂K((v·x), (v·z)) / ∂v_i ]|_{v=1} = −(1/σ²) (x_i − z_i)² exp(−‖x − z‖²/(2σ²)),  i = 1, ..., n.
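These two derivatives are easy to check numerically; a small Python sketch (our helper names) follows.

    import numpy as np

    def poly_kernel_dvi(x, z, i, d):
        # d/dv_i of (<v*x, v*z> + 1)^d evaluated at v = 1.
        return 2.0 * d * x[i] * z[i] * (np.dot(x, z) + 1.0) ** (d - 1)

    def rbf_kernel_dvi(x, z, i, sigma):
        # d/dv_i of exp(-||v*x - v*z||^2 / (2 sigma^2)) evaluated at v = 1.
        sq_dist = np.sum((x - z) ** 2)
        return -((x[i] - z[i]) ** 2 / sigma ** 2) * np.exp(-sq_dist / (2.0 * sigma ** 2))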

7.2 Proofs of the Main Results

7.2.1 Proof of Lemma 1

We first recall that W^0 and W^0_r(i) are respectively the solutions of the following optimization problems:

Minimize_{w∈R^n}   ‖w‖²/2
Subject to         y_j(⟨w, x_j⟩ + b) ≥ 1,  ∀j ∈ {1, ..., l}

and

Minimize_{w∈R^n}   ‖w‖²/2
Subject to         y_j(⟨w, x_j⟩ + b) ≥ 1,  ∀j ∈ {1, ..., l}
                   ⟨w, e_i⟩ = 0

where e_i is the i-th vector of the canonical basis of R^n. Now let us define the sets Ω and Ωⁱ as follows:

Ω  = { w ∈ R^n ; y_j(⟨w, x_j⟩ + b) ≥ 1, ∀j ∈ {1, ..., l} }
Ωⁱ = { w ∈ R^n ; y_j(⟨w, x_j⟩ + b) ≥ 1, ∀j ∈ {1, ..., l}, and ⟨w, e_i⟩ = 0 }

The result follows from Ωⁱ ⊂ Ω.

7.2.2 Proof of Lemma 2

We give the proof for the three criteria.

For the weight vector criterion we have: ∀i, W^0 ≥ W^0(i). To prove this result we only have to express the difference between the two sides of the inequality as

W^0 − W^0(i) = ‖w‖² − ‖w(i)‖² = Σ_{t,s=1}^{l} y_t y_s α_t* α_s* [ K(x_t, x_s) − K^(i)(x_t, x_s) ]

and check that ( K(x_t, x_j) − K^(i)(x_t, x_j) )_{1≤t,j≤l} is a positive semi-definite matrix. The matrix ( K^(i)(x_t, x_j) )_{1≤t,j≤l} is the Gram matrix of the training data when the i-th variable has been omitted. In the linear case we can write

( K(x_t, x_s) − K^(i)(x_t, x_s) )_{1≤t,s≤l} = ( K̂^(i)(x_t, x_s) )_{1≤t,s≤l},

where ( K̂^(i)(x_t, x_s) )_{1≤t,s≤l} is the Gram matrix of the training data when only the i-th variable has been considered, hence

‖w‖² − ‖w(i)‖² ≥ 0

and the result follows.

For the span bound criterion: ∀i, Spb^0 ≥ Spb^0(i). We first recall the span definitions

S_p = d(x_p, Λ_p) = min_{x∈Λ_p} ‖x_p − x‖,
S_p(i) = d(x_p^(i), Λ_p^(i)) = min_{x^(i)∈Λ_p^(i)} ‖x_p^(i) − x^(i)‖,

where z^(i) designates the vector z in which the i-th component is set to zero and Λ_p^(i) the corresponding set of constrained linear combinations. In the linear case it is obvious to check that removing the i-th vector component is equivalent to setting it to zero. Since for each i

‖x_p − x‖ ≥ ‖x_p^(i) − x^(i)‖,

we get

min_{x∈Λ_p} ‖x_p − x‖ ≥ min_{x^(i)∈Λ_p^(i)} ‖x_p^(i) − x^(i)‖.

For the radius-margin criterion it is sufficient to prove that: ∀i, R ≥ R(i). We prove this result using the primal formulation of the optimization problem (6). We can write

R² = min_{a∈R^n} sup { ‖x_j − a‖² ; j = 1, 2, ..., l }

and

R²(i) = min_{a∈R^n} sup { ‖x_j^(i) − a^(i)‖² ; j = 1, 2, ..., l },

where z^(i) designates the vector z in which the i-th component is set to zero. It is straightforward to check that

sup { ‖x_j − a‖² ; j = 1, 2, ..., l } ≥ sup { ‖x_j^(i) − a^(i)‖² ; j = 1, 2, ..., l }

and the result follows.

7 Appendices

7.1.3

proof of lemma 3

On the one hand we have 2 2 2 2 kwk − kw(i)k = kwk − kw(i)k =

l X

(according to lemma 2 )

ˆ (i) (xt , xs ) yt ys αt∗ αs∗ K

t,s=1

On the other hand we obtain ! l 2 X ∂ kwk ˆ (i) (xt , xs ) yt ys αt∗ αs∗ 2K = ∂vi t,s=1 (vi =1)

from which the result follows.