A New Linear Dimensionality Reduction Technique based on Chernoff Distance

Luis Rueda¹ and Myriam Herrera²

¹ Department of Computer Science and Center for Biotechnology, University of Concepción, Edmundo Larenas 215, Concepción, 4030000, Chile. [email protected]
² Department and Institute of Informatics, National University of San Juan, Cereceto y Meglioli, San Juan, 5400, Argentina. [email protected]

Abstract. A new linear dimensionality reduction (LDR) technique for pattern classification and machine learning is presented, which, though linear, aims at maximizing the Chernoff distance in the transformed space. The corresponding two-class criterion, which is maximized via a gradient-based algorithm, is presented and initialization procedures are also discussed. Empirical results of this and traditional LDR approaches combined with two well-known classifiers, linear and quadratic, on synthetic and real-life data show that the proposed criterion outperforms the traditional schemes.

1 Introduction

Pattern classification and machine learning constitute two important areas of artificial intelligence. These areas have grown significantly in the past few years and have found many applications in medical diagnosis, bioinformatics, DNA microarray data analysis, security, and many more. Designing a fast classifier while maintaining a reasonable level of accuracy is a problem that has been well studied. Related to this is the well-known linear dimensionality reduction (LDR) problem, which has the advantage of involving only a simple linear algebraic operation and of being simple to implement and understand. Various LDR schemes for reducing to dimension one have been reported in the literature, including the well-known Fisher's discriminant analysis (FDA) approach [5], direct Fisher's discriminant analysis [6], the perceptron algorithm (the basis of the back-propagation neural network learning algorithms) [12], piecewise recognition models [11], removal classification structures [1], adaptive linear dimensionality reduction [9] (which outperforms Fisher's classifier for some data sets), linear constrained distance-based classifier analysis [4] (an improvement to Fisher's approach designed for hyperspectral image classification), recursive Fisher's discriminant [3], pairwise linear classifiers [15], and the best hyperplane classifier [13].

Consider the two-class classification problem with classes ω1 and ω2, represented by two normally distributed n-dimensional random vectors, x1 ∼ N(m1, S1) and x2 ∼ N(m2, S2), with a priori probabilities p1 and p2, respectively.

The aim is to linearly transform x1 and x2 into new normally distributed random vectors y1 and y2 of dimension d, d < n, using a matrix A of order d × n, in such a way that the classification error in the transformed space is as small as possible. Let SW = p1 S1 + p2 S2 and SE = (m1 − m2)(m1 − m2)^t be the within-class and between-class scatter matrices, respectively. The FDA criterion consists of maximizing the distance between the transformed distributions by finding the matrix A that maximizes the following function [5]:

J_F(A) = \mathrm{tr}\{ (A S_W A^t)^{-1} (A S_E A^t) \} .   (1)

The matrix A that maximizes (1) is obtained from the eigenvalue decomposition of the matrix

S_F = S_W^{-1} S_E ,   (2)

by taking the d eigenvectors whose eigenvalues are the largest ones. Since the eigenvalue decomposition of the matrix (2) leads to only one non-zero eigenvalue, (m1 − m2)^t SW^{-1} (m1 − m2), whose eigenvector is given by SW^{-1}(m1 − m2), we can only reduce to dimension d = 1.

Loog and Duin have recently proposed a new LDR technique for normally distributed classes [8], namely LD, which takes the Chernoff distance in the original space into consideration to minimize the error rate in the transformed space. They consider the concept of directed distance matrices, and a linear transformation in the original space, to finally generalize Fisher's criterion in the transformed space by substituting the within-class scatter matrix for the corresponding directed distance matrix. The LD criterion consists of minimizing the classification error in the transformed space by obtaining the matrix A that maximizes the function:

J_{LD2}(A) = \mathrm{tr}\Big\{ (A S_W A^t)^{-1} \Big[ A S_E A^t - A S_W^{1/2} \frac{p_1 \log(S_W^{-1/2} S_1 S_W^{-1/2}) + p_2 \log(S_W^{-1/2} S_2 S_W^{-1/2})}{p_1 p_2} S_W^{1/2} A^t \Big] \Big\} .   (3)

The solution to this criterion is given by the matrix A composed of the d eigenvectors (those with the largest eigenvalues) of the following matrix:

S_{LD2} = S_W^{-1} \Big[ S_E - S_W^{1/2} \frac{p_1 \log(S_W^{-1/2} S_1 S_W^{-1/2}) + p_2 \log(S_W^{-1/2} S_2 S_W^{-1/2})}{p_1 p_2} S_W^{1/2} \Big] .   (4)
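As a concrete illustration of how the FDA and LD transformations are obtained, the following minimal sketch (not the authors' code; it assumes NumPy/SciPy and known Gaussian parameters m1, m2, S1, S2 with priors p1, p2) builds the matrices of Equations (2) and (4) and keeps their d leading eigenvectors:

# Hypothetical sketch: FDA and LD transformation matrices from Eqs. (2) and (4).
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def fda_matrix(m1, m2, S1, S2, p1, p2, d):
    """Rows are the d leading eigenvectors of S_W^{-1} S_E, Eq. (2)."""
    SW = p1 * S1 + p2 * S2
    SE = np.outer(m1 - m2, m1 - m2)
    evals, evecs = np.linalg.eig(np.linalg.solve(SW, SE))
    order = np.argsort(-evals.real)
    return evecs[:, order[:d]].real.T                      # d x n

def ld_matrix(m1, m2, S1, S2, p1, p2, d):
    """Rows are the d leading eigenvectors of S_LD2, Eq. (4)."""
    SW = p1 * S1 + p2 * S2
    SE = np.outer(m1 - m2, m1 - m2)
    SW_h = fractional_matrix_power(SW, 0.5)                # S_W^{1/2}
    SW_mh = fractional_matrix_power(SW, -0.5)              # S_W^{-1/2}
    M = (p1 * logm(SW_mh @ S1 @ SW_mh)
         + p2 * logm(SW_mh @ S2 @ SW_mh)) / (p1 * p2)
    evals, evecs = np.linalg.eig(np.linalg.solve(SW, SE - SW_h @ M @ SW_h))
    order = np.argsort(-evals.real)
    return evecs[:, order[:d]].real.T                      # d x n

Since S_F in (2) has rank one, fda_matrix is only meaningful for d = 1, as noted above.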

The traditional classification problem has usually been solved by minimizing the error or, equivalently, by maximizing the separability between the underlying distributions using different criteria. The FDA criterion discussed above aims at minimizing the error by maximizing the Mahalanobis distance between the distributions, resulting in an optimal criterion when the covariance matrices are coincident. When the covariances are different, the optimal classifier is quadratic; linear classification, in this case, amounts to maximizing the separability between the distributions by using a criterion that generalizes the Mahalanobis distance [7]. On the other hand, the LD criterion utilizes, as pointed out above, a directed distance matrix, which is incorporated in Fisher's criterion assuming the within-class scatter matrix is the identity.

In this paper, we take advantage of the properties of the Chernoff distance and propose a new criterion for LDR that aims at maximizing the separability of the distributions in the transformed space. Since we assume the original distributions are normal, the distributions in the transformed space are also normal. Thus, the Bayes classifier in the transformed space is quadratic, and deriving a closed-form expression for the classification error is not possible. Let p(y|ωi) be the class-conditional probability density of a vector y = Ax in the transformed space given class ωi. The Chernoff distance between the two distributions, p(y|ω1) and p(y|ω2), is given as follows:

\Pr[\mathrm{error}] = \int p^{β}(y|ω_1)\, p^{1-β}(y|ω_2)\, dy = e^{-k(β)} ,   (5)

where

k(β) = \frac{β(1-β)}{2} (Am_1 - Am_2)^t [β A S_1 A^t + (1-β) A S_2 A^t]^{-1} (Am_1 - Am_2) + \frac{1}{2} \log \frac{|β A S_1 A^t + (1-β) A S_2 A^t|}{|A S_1 A^t|^{β} |A S_2 A^t|^{1-β}} .   (6)

As clearly follows from Equations (5) and (6), the smaller the error is, the larger the value of k(β) is; hence, in this paper, we propose to maximize (6), which indeed provides a good surrogate for the error.
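The quantity k(β) of Equation (6) is straightforward to evaluate numerically. The sketch below (a minimal illustration under the same assumptions as before, not taken from the paper) computes it for the transformed distributions:

# Hypothetical sketch: Chernoff distance k(beta) of Eq. (6) in the space y = Ax.
import numpy as np

def chernoff_k(A, m1, m2, S1, S2, beta):
    dm = A @ (m1 - m2)                        # difference of transformed means
    C1, C2 = A @ S1 @ A.T, A @ S2 @ A.T       # transformed covariances
    Cb = beta * C1 + (1.0 - beta) * C2
    quad = 0.5 * beta * (1.0 - beta) * dm @ np.linalg.solve(Cb, dm)
    logdet = lambda C: np.linalg.slogdet(C)[1]
    return quad + 0.5 * (logdet(Cb) - beta * logdet(C1) - (1.0 - beta) * logdet(C2))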

2 Chernoff-distance Linear Dimensionality Reduction

The criterion that we propose (referred to as the RH criterion) aims at maximizing the Chernoff distance between the transformed random vectors. Here, we consider p1 = β and p2 = 1 − β (note that β ∈ [0, 1]), since p1 and p2 "weight" the respective covariance matrices in the Chernoff distance. Since after the transformation new random vectors of the form y1 ∼ N(Am1, A S1 A^t) and y2 ∼ N(Am2, A S2 A^t) are obtained, the aim is to find the matrix A that maximizes:

J*_{c12}(A) = p_1 p_2 (Am_1 - Am_2)^t [A S_W A^t]^{-1} (Am_1 - Am_2) + \log \frac{|A S_W A^t|}{|A S_1 A^t|^{p_1} |A S_2 A^t|^{p_2}} ,   (7)

which can also be written as follows (cf. [14]):

J*_{c12}(A) = \mathrm{tr}\{ p_1 p_2 (A S_W A^t)^{-1} A S_E A^t + \log(A S_W A^t) - p_1 \log(A S_1 A^t) - p_2 \log(A S_2 A^t) \} ,   (8)

where log(·) applied to a matrix denotes the matrix logarithm.
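A direct way to read Equation (8) is via the identity tr{log C} = log|C| for a symmetric positive-definite matrix C, so that (8) reduces to (7); with β = p1, (7) is exactly twice the exponent k(β) of Equation (6). A minimal sketch of (8), assuming NumPy/SciPy (not the authors' code), is:

# Hypothetical sketch: the proposed criterion J*_{c12}(A) of Eq. (8);
# log(.) on matrices is the matrix logarithm.
import numpy as np
from scipy.linalg import logm

def J_c12(A, m1, m2, S1, S2, p1, p2):
    SW = p1 * S1 + p2 * S2
    SE = np.outer(m1 - m2, m1 - m2)
    ASWAt = A @ SW @ A.T
    term1 = p1 * p2 * np.trace(np.linalg.solve(ASWAt, A @ SE @ A.T))
    term2 = np.trace(logm(ASWAt)
                     - p1 * logm(A @ S1 @ A.T)
                     - p2 * logm(A @ S2 @ A.T))
    return term1 + term2.real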

We now show³ that for any value of J*_{c12}(A), where the rows of A are linearly independent, there exists a matrix Q with orthonormal rows that attains the same Chernoff distance in the transformed space.

Lemma 1. Let A be any real d × n matrix, d ≤ n, whose rows are linearly independent, and let J*_{c12}(A) be defined as in (8). Then

\max_{A} J*_{c12}(A) = \max_{\{Q : Q Q^t = I_d\}} J*_{c12}(Q) .   (9)

In order to maximize J*_{c12}, we propose the following algorithm, which is based on the gradient method. The learning rate, one of the parameters of the algorithm, is obtained by maximizing the objective function in the direction of the gradient. The first task is to find the gradient matrix, using the operator ∇, as follows:

∇J*_{c12}(A) = p_1 p_2 [ S_E A^t (A S_W A^t)^{-1} - S_W A^t (A S_W A^t)^{-1} (A S_E A^t)(A S_W A^t)^{-1} ]^t
             + [ S_W A^t (A S_W A^t)^{-1} - p_1 S_1 A^t (A S_1 A^t)^{-1} - p_2 S_2 A^t (A S_2 A^t)^{-1} ]^t .   (10)

The procedure that maximizes J*_{c12} is shown in Algorithm Chernoff_LDA_Two, which receives as a parameter a threshold, θ, indicating when the search stops. Another parameter is the learning rate, η_k, which determines how fast the algorithm converges. To initialize A we use the result of FDA or LD, whichever leads to the larger value of the Chernoff distance in the transformed space. Also, as discussed in connection with Lemma 1, once A is obtained there always exists a matrix Q with orthonormal rows such that A can be decomposed into RQ (cf. [14]). An additional step is therefore introduced in the algorithm, which decomposes A into RQ and uses Q in the next iteration.

Algorithm Chernoff_LDA_Two
Input: threshold θ
begin
    A^(0) ← the matrix in {A_F, A_LD} with the larger J*_{c12}   // better of FDA and LD
    k ← 0
    repeat
        η_k ← arg max_{η>0} φ_{k12}(η)
        B ← A^(k) + η_k ∇J*_{c12}(A^(k))
        decompose B into R and Q
        A^(k+1) ← Q
        k ← k + 1
    until |J*_{c12}(A^(k-1)) − J*_{c12}(A^(k))| < θ
    return A^(k), J*_{c12}(A^(k))
end

³ The proofs of all subsequent lemmas and theorems are given in the unabridged version of this paper [14].
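A compact way to see the whole procedure is the following sketch (a hypothetical illustration, not the authors' implementation), which implements the gradient of Equation (10) and the main loop of the algorithm; for brevity it uses a crude backtracking step in place of the secant search described below, and it reuses J_c12 from the previous sketch:

# Hypothetical sketch: gradient of Eq. (10) and the main loop of the algorithm.
import numpy as np

def grad_J(A, m1, m2, S1, S2, p1, p2):
    SW = p1 * S1 + p2 * S2
    SE = np.outer(m1 - m2, m1 - m2)
    iW = np.linalg.inv(A @ SW @ A.T)
    i1 = np.linalg.inv(A @ S1 @ A.T)
    i2 = np.linalg.inv(A @ S2 @ A.T)
    G1 = p1 * p2 * (SE @ A.T @ iW - SW @ A.T @ iW @ (A @ SE @ A.T) @ iW)
    G2 = SW @ A.T @ iW - p1 * S1 @ A.T @ i1 - p2 * S2 @ A.T @ i2
    return (G1 + G2).T                                      # d x n, as in Eq. (10)

def chernoff_lda_two(A0, m1, m2, S1, S2, p1, p2, theta=1e-6, max_iter=500):
    A = A0                                                  # e.g. the better of FDA and LD
    J_prev = J_c12(A, m1, m2, S1, S2, p1, p2)
    for _ in range(max_iter):
        G = grad_J(A, m1, m2, S1, S2, p1, p2)
        eta = 1.0                                           # crude backtracking stand-in
        while eta > 1e-12 and J_c12(A + eta * G, m1, m2, S1, S2, p1, p2) <= J_prev:
            eta *= 0.5
        Q, _ = np.linalg.qr((A + eta * G).T)                # B = RQ with Q Q^t = I_d (Lemma 1)
        A = Q.T
        J_new = J_c12(A, m1, m2, S1, S2, p1, p2)
        if abs(J_new - J_prev) < theta:
            break
        J_prev = J_new
    return A, J_prev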

Theorem 1. Let {A^(k)}_{k=1}^∞ be the sequence of matrices generated by Algorithm Chernoff_LDA_Two. If ∇J*_{c12}(A^(k)) ≠ 0, then J*_{c12}(A^(k)) < J*_{c12}(A^(k+1)). Otherwise, the algorithm terminates.

Algorithm Chernoff_LDA_Two needs a learning rate, η_k, which depends on the case at hand and is related to the convergence/divergence trade-off. When η is small, convergence is slower but more likely; when η is large, convergence is faster but the algorithm could diverge. There are many ways of computing η_k, one of them being the value that maximizes J*_{c12} in the next step [2]. Consider the following function of η:

φ_{k12}(η) = J*_{c12}(A^(k) + η ∇J*_{c12}(A^(k))) .   (11)

Since the idea is to find the A that maximizes J*_{c12}(A), we choose the value of η that maximizes φ_{k12}(η) as follows:

η_k = \arg\max_{η>0} φ_{k12}(η) = \arg\max_{η>0} J*_{c12}(A^(k) + η ∇J*_{c12}(A^(k))) .   (12)

Although η_k is defined by Equation (12), it has to be computed using an iterative method. One such method is the secant method, as proposed in [2]. Starting from initial values η^(0) and η^(1), we update η at step j + 1 as follows:

η^{(j+1)} = η^{(j)} - \frac{η^{(j)} - η^{(j-1)}}{\frac{dφ_{k12}}{dη}(η^{(j)}) - \frac{dφ_{k12}}{dη}(η^{(j-1)})} \, \frac{dφ_{k12}}{dη}(η^{(j)}) ,   (13)

where dφ_{k12}/dη is obtained by differentiating (11). This procedure is repeated until the difference between η^(j-1) and η^(j) is as small as desired. One of the important aspects of applying the secant method is finding the initial values of η. We set one of them to η_0 = 0, while the other, η_1, is obtained from the angle between A at step k and the matrix obtained by adding to the former the product of the learning rate and the gradient matrix, as per the following theorem.

Theorem 2. Let φ_{k12} : R^{d×n} → R be the continuously differentiable function defined in (11), where J*_{c12}(·) is defined in (8) and its gradient is given in (10). Then, the initial values of the secant method are given by η_0 = 0 and

η_1 = \frac{d^2 ε - d}{\mathrm{tr}\{A^{(k)} [∇J*_{c12}(A^{(k)})]^t\}} ,   (14)

where ε = cos θ, and θ is the angle between A^(k) and [A^(k) + η_k ∇J*_{c12}(A^(k))].

Theorem 2 can be interpreted geometrically in the following way. Since ‖A‖_F is a norm that satisfies the properties of a metric, we can ensure that there exists a matrix norm ‖A‖ induced or compatible in R^n such that, for any A ≠ 0, ‖A‖ = √λ_1 holds, where λ_1 is the largest eigenvalue of AA^t [2, pp. 33]. Then, since that eigenvalue is λ_1 = 1, the induced matrix norm satisfies ‖A^(k)‖ = ‖A^(k+1)‖ = 1. In this way, we ensure that the rows of A^(k) and A^(k+1) reside in the same hypersphere in R^n, whose radius is unity. Then, since those rows are linearly independent, they could be "rotated" independently using a vector⁴, η, of dimension d. However, Algorithm Chernoff_LDA_Two uses a scalar η instead. For this reason, the "rotation" can be seen on a hypersphere of radius d, and all the rows of A are rotated using the same scalar. As an example, if we choose θ̂ = π/180 and suppose that A^(k) is of order 1 × n, i.e. a vector in R^n, we obtain a value of ε ≈ 0.9998. Thus the variation between A^(k) and A^(k+1) is one degree, where, obviously, the value of η_1 also depends on A^(k) and ∇J*_{c12}(A^(k)).

⁴ Using a vector, η, to update A^(k) is beyond the scope of this paper, and is a problem that we are currently investigating.
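A minimal sketch of the secant search for η_k (Eqs. (11)-(14)) is given below; it is a hypothetical illustration, not the authors' code, reusing grad_J from the earlier sketch. The derivative of φ_{k12} at a given η is the Frobenius inner product of ∇J*_{c12}(A + ηG) with G, and the initial values follow Theorem 2.

# Hypothetical sketch: secant line search for eta_k, Eqs. (11)-(14).
import numpy as np

def secant_eta(A, G, m1, m2, S1, S2, p1, p2,
               cos_theta=np.cos(np.pi / 180.0), tol=1e-8, max_iter=50):
    d = A.shape[0]
    # dphi/deta at a given eta: <grad J(A + eta G), G>_F
    dphi = lambda eta: np.sum(grad_J(A + eta * G, m1, m2, S1, S2, p1, p2) * G)
    # Initial values as in Theorem 2: eta_0 = 0 and eta_1 from the angle theta
    # between A^(k) and A^(k) + eta * grad J (epsilon = cos theta).
    eta_prev, eta = 0.0, (d**2 * cos_theta - d) / np.trace(A @ G.T)
    for _ in range(max_iter):
        f, f_prev = dphi(eta), dphi(eta_prev)
        if abs(f - f_prev) < 1e-15 or abs(eta - eta_prev) < tol:
            break
        eta_prev, eta = eta, eta - (eta - eta_prev) / (f - f_prev) * f
    return eta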

3 Simulations on Synthetic Data

The tests on synthetic data involve ten datasets of dimensions n = 10, 20, . . . , 100, each with two randomly generated normally distributed classes. The two classes of each dataset, ω1 and ω2, are thus fully specified by their parameters m1, m2, S1 and S2. We also randomly generated p1 in the interval [0.3, 0.7] and assigned p2 = 1 − p1. We trained the three LDR techniques, FDA, LD and RH (the method proposed in this paper), using these parameters, and for each dataset we generated 100,000 samples for testing purposes. For each dataset, we found the corresponding transformation matrix A for each dimension d = 1, . . . , n − 1. After the linear transformation is performed, we tested two classifiers: the linear classifier, which is obtained by averaging the covariance matrices in the transformed space, and the quadratic classifier, which is the one that minimizes the error rate assuming that the parameters in the transformed space are given by Am_i and A S_i A^t (a sketch of this evaluation protocol is given at the end of this section).

Table 1. Error rates for the three classifiers, FDA, LD and RH, where the samples are projected onto the d∗-dimensional space, d∗ being the dimension that gives the lowest error rate for d = 1, . . . , n − 1.

 n    FDA+Q            LD+Q             RH+Q             FDA+L            LD+L             RH+L
      error      d∗    error      d∗    error      d∗    error      d∗    error      d∗    error      d∗
 10   0.286530   1     0.053140*  9     0.053230   9     0.289790   1     0.288820*  6     0.288830   9
 20   0.222550   1     0.019680   18    0.019580*  18    0.227000   1     0.220180   3     0.218780*  4
 30   0.151190   1     0.002690*  24    0.002690*  24    0.182180*  1     0.182480   27    0.182480   27
 40   0.287250   1     0.006600   36    0.006570*  36    0.297840   1     0.295370   8     0.294660*  6
 50   0.370450   1     0.005490*  49    0.005490*  49    0.396160*  1     0.397450   1     0.397450   1
 60   0.320760   1     0.000680*  56    0.000680*  56    0.322920   1     0.316030   21    0.315250*  23
 70   0.381870   1     0.000010*  28    0.000010*  28    0.381960   1     0.381910*  30    0.381910*  30
 80   0.323140   1     0.000000*  37    0.000000*  37    0.342980   1     0.334170   23    0.334080*  25
 90   0.324740   1     0.000000*  30    0.000000*  30    0.326360   1     0.324740*  1     0.324740*  1
 100  0.198610   1     0.000000*  31    0.000000*  31    0.278590*  1     0.278730   78    0.278720   72

The minimum error rates obtained for each individual classifier on the synthetic data are shown in Table 1. The first column gives the dimension of each dataset. The next columns correspond to the error rate and the best dimension d∗ for the three LDR methods and for each classifier, quadratic and linear. The '*' symbol beside an error rate indicates that it is the lowest among the three methods, FDA, LD and RH. Note that for FDA, d∗ = 1, since, as pointed out earlier, the objective matrix contains only one non-zero eigenvalue. We observe that for the quadratic classifier LD and RH outperformed FDA on all the datasets. Also, LD and RH jointly achieved the minimum error rate on seven datasets, while RH obtained the best error rate in nine out of ten datasets. For the linear classifier, again, LD and RH outperformed FDA, and RH achieved the lowest error rate in six out of ten datasets, outperforming LD.

In Table 2, the results for dimensionality reduction and classification in dimension d = 1 are shown. For the quadratic classifier, we observe that, as in the previous case, LD and RH outperformed FDA, and that the latter did not obtain the lowest error rate on any of the datasets. On the other hand, RH yielded the lowest error rate in nine out of ten datasets, outperforming LD. FDA, however, performed very well with the linear classifier, achieving the lowest error rate in eight out of ten datasets. RH, though not the best, outperformed LD, yielding the lowest error rate in two out of ten datasets. Note also that the good performance of FDA with the linear classifier is due to the fact that the optimal Bayes classifier for normal distributions is linear when the covariances are coincident.

Table 2. Error rates for the quadratic and linear classifiers in the one-dimensional space, where the transformed data was obtained using the FDA, LD and RH methods.

 n    FDA+Q      LD+Q       RH+Q       FDA+L      LD+L       RH+L
 10   0.286530   0.169750   0.154790*  0.289790*  0.320460   0.385010
 20   0.222550   0.218260   0.204680*  0.227000   0.229260   0.222490*
 30   0.151190   0.022950*  0.022950*  0.182180*  0.277120   0.277120
 40   0.287250   0.219680   0.219590*  0.297840*  0.458030   0.458030
 50   0.370450   0.237150   0.237080*  0.396160*  0.397450   0.397450
 60   0.320760   0.122350*  0.122440   0.322920*  0.440710   0.440710
 70   0.381870   0.061530*  0.061530*  0.381960*  0.402320   0.402320
 80   0.323140   0.060320*  0.060320*  0.342980*  0.444530   0.444530
 90   0.324740   0.087150*  0.087150*  0.326360   0.324740*  0.324740*
 100  0.198610   0.093410*  0.093410*  0.278590*  0.332370   0.332370
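The evaluation protocol referred to above can be sketched as follows (a hypothetical illustration assuming NumPy, not the authors' code; "averaging" the covariance matrices is read here as the prior-weighted average used for S_W, which is one possible interpretation):

# Hypothetical sketch: empirical error of the quadratic and linear classifiers
# in the transformed space y = Ax for two synthetic Gaussian classes.
import numpy as np

def gaussian_score(Y, m, C, prior):
    """log(prior) + log N(y; m, C), up to a constant common to both classes."""
    diff = Y - m
    maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(C), diff)
    return np.log(prior) - 0.5 * (np.linalg.slogdet(C)[1] + maha)

def empirical_error(A, m1, m2, S1, S2, p1, p2, n_samples=100000, linear=False, seed=0):
    rng = np.random.default_rng(seed)
    n1 = int(round(p1 * n_samples))
    X = np.vstack([rng.multivariate_normal(m1, S1, n1),
                   rng.multivariate_normal(m2, S2, n_samples - n1)])
    labels = np.r_[np.zeros(n1), np.ones(n_samples - n1)]
    Y = X @ A.T                                   # transformed samples
    C1, C2 = A @ S1 @ A.T, A @ S2 @ A.T
    if linear:                                    # linear classifier: pooled covariance
        C1 = C2 = p1 * C1 + p2 * C2
    s1 = gaussian_score(Y, A @ m1, C1, p1)
    s2 = gaussian_score(Y, A @ m2, C2, p2)
    return np.mean((s2 > s1).astype(float) != labels)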

4 Experiments on Real-life Data

We also performed experiments on real-life data involving 44 two-class datasets drawn from the UCI repository [10]. Six datasets are two-class problems in their own right; from the multi-class datasets we extracted pairs of classes. The priors were estimated as p_i = n_i/(n_i + n_j), where n_i and n_j are the numbers of samples for classes ω_i and ω_j, respectively. The three LDR techniques, FDA, LD and RH, were trained, and the mean of the error rate was computed for the quadratic (Q) and linear (L) classifiers in a ten-fold cross-validation manner. The errors for the best value of d, d = 1, . . . , n, are shown in Table 3. The first column indicates the name of the dataset and the classes separated by "," (when classes are not given, the problem itself is two-class), where the dataset names are as follows: W = Wisconsin breast cancer, B = Bupa liver, P = Pima, D = Wisconsin diagnostic breast cancer, C = Cleveland heart-disease, S = SPECTF heart, I = Iris, T = Thyroid, G = Glass, N = Wine, J = Japanese vowels, L = Letter, and E = Pendigits.

For the quadratic classifier, RH outperformed both FDA and LD, since the former obtained the lowest error rate in 34 out of 44 cases, while FDA and LD obtained the lowest error rate in 17 and 16 cases, respectively. In the case of the linear classifier, RH also outperformed FDA and LD: the former was the best in 31 cases, while the latter two were best in 15 and 26 cases, respectively. In this case, although RH is coupled with a linear classifier, while it optimizes the Chernoff distance and is expected to work well with a quadratic classifier, RH obtained the lowest error rate in more cases than LD. Also, for the quadratic classifier, on datasets B, P, S and G,1,3, the error rate yielded by RH is significantly smaller than that of LD. In particular, on the G,1,3 dataset, the difference between RH and LD is approximately 9%, and with respect to FDA it is more than 10%. For the linear classifier on the same dataset, RH is also more accurate than FDA and LD, with differences of approximately 4% and 6%, respectively.

We also plotted the error rate and the Chernoff distance for the SPECTF dataset for all values of d, for LD and RH. FDA was excluded since, as pointed out earlier, it can only transform the data to dimension 1. The corresponding plots of the error of the quadratic classifier and of the Chernoff distance are depicted in Figs. 1 and 2, respectively. The error rate (in general) decreases as the dimension d of the new space increases. Also, in this case, RH clearly leads to a lower error rate than LD, while both converge to similar error rates for values of d close to n. This reflects the fact that as the Chernoff distance in the transformed space increases (see Fig. 2), the error rate of the quadratic classifier decreases.

[Figure omitted] Fig. 1. Quadratic classifier error rates for the SPECTF dataset as a function of the reduced dimension d, for LD and RH.

[Figure omitted] Fig. 2. Chernoff distance in the transformed space for the SPECTF dataset as a function of d, for LD and RH.

Table 3. Error rates for the two-class datasets drawn from the UCI machine learning repository.

 Dataset  FDA+Q            LD+Q             RH+Q             FDA+L            LD+L             RH+L
          error      d∗    error      d∗    error      d∗    error      d∗    error      d∗    error      d∗
 W        0.030754   1     0.027835*  1     0.030754   1     0.039621   1     0.038150*  6     0.039621   1
 B        0.362017   1     0.388571   4     0.353613*  1     0.309916   1     0.330168   5     0.301261*  5
 P        0.226435*  1     0.251265   2     0.226435*  1     0.229033*  1     0.230383   7     0.229033*  1
 D        0.031522*  1     0.040266   27    0.031522*  1     0.042079   1     0.029889*  20    0.036785   28
 C        0.164943   1     0.168276   11    0.161379*  11    0.161609   1     0.158391   8     0.144828*  5
 S        0.247773   1     0.045588   41    0.042810*  36    0.233646   1     0.176373*  19    0.180378   15
 I,1,2    0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1
 I,1,3    0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1
 I,2,3    0.050000   1     0.030000*  1     0.040000   2     0.030000*  1     0.040000   1     0.030000*  1
 T,1,2    0.021637   1     0.010819   4     0.005263*  3     0.059357   1     0.032749   4     0.027193*  4
 T,1,3    0.022222*  1     0.027778   2     0.027778   2     0.038889   1     0.027778*  4     0.027778*  4
 T,2,3    0.000000*  1     0.000000*  2     0.000000*  1     0.000000*  1     0.000000*  4     0.000000*  1
 G,1,2    0.310000*  1     0.397619   7     0.397619   8     0.281905*  1     0.295714   8     0.289048   7
 G,1,3    0.223611   1     0.204167   1     0.112500*  8     0.223611   1     0.204167   1     0.161111*  8
 G,1,5    0.000000*  1     0.000000*  5     0.000000*  1     0.000000*  1     0.000000*  5     0.000000*  1
 G,1,7    0.020000*  1     0.040000   8     0.020000*  1     0.040000   1     0.030000*  1     0.040000   1
 G,2,3    0.158611   1     0.213333   8     0.153611*  8     0.158611*  1     0.167222   4     0.166111   8
 G,2,5    0.109722   1     0.098333*  7     0.098333*  6     0.099722   1     0.088333*  7     0.088333*  6
 G,2,7    0.027273*  1     0.063636   7     0.027273*  1     0.046364   1     0.037273   8     0.018182*  8
 G,3,5    0.000000*  1     0.000000*  1     0.000000*  1     0.025000   1     0.000000*  6     0.000000*  7
 G,3,7    0.060000   1     0.020000*  2     0.040000   4     0.060000*  1     0.060000*  1     0.060000*  1
 G,5,7    0.050000*  1     0.070000   4     0.050000*  1     0.050000   1     0.050000   8     0.025000*  2
 N,1,2    0.007143   1     0.007692   6     0.000000*  6     0.007692   1     0.007143*  11    0.007692   1
 N,1,3    0.000000*  1     0.000000*  3     0.000000*  1     0.000000*  1     0.000000*  3     0.000000*  1
 N,2,3    0.016667   1     0.016667   3     0.008333*  7     0.016667   1     0.008333*  12    0.016667   1
 J,1,2    0.001435*  1     0.005263   3     0.001435*  1     0.001435*  1     0.001435*  11    0.001435*  1
 J,1,3    0.000370*  1     0.001108   7     0.000370*  1     0.001108*  1     0.001108*  11    0.001108*  1
 J,4,5    0.007512   1     0.001778*  7     0.004865   3     0.004417   1     0.000881*  9     0.004861   1
 J,6,7    0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1     0.000000*  1
 J,8,9    0.066800   1     0.051309*  11    0.052896   6     0.069473   1     0.071601   11    0.068404*  8
 L,C,G    0.083547   1     0.051096   15    0.047083*  10    0.083547   1     0.084903   12    0.081574*  6
 L,D,O    0.033400   1     0.015402   15    0.014777*  10    0.032784   1     0.030216*  14    0.032776   12
 L,J,T    0.009741   1     0.004520   10    0.003875*  8     0.009741   1     0.009741   15    0.009087*  10
 L,K,R    0.098878   1     0.041405*  12    0.042081   10    0.096207   1     0.095522   13    0.094207*  1
 L,M,N    0.031751   1     0.015847   13    0.014590*  13    0.034936   1     0.033033*  13    0.034936   1
 L,O,Q    0.045591*  1     0.057280   11    0.046253   1     0.046237   1     0.050133   11    0.045583*  9
 L,P,R    0.020505   1     0.012176   9     0.010248*  9     0.022432   1     0.021787*  7     0.022428   6
 L,U,V    0.010748   1     0.007595   15    0.006966*  9     0.012018   1     0.011381*  10    0.011381*  9
 L,V,W    0.027057   1     0.027048   15    0.022438*  10    0.029706   1     0.031035   13    0.028381*  5
 E,1,2    0.003051   1     0.001312   10    0.000873*  10    0.006556*  1     0.006556*  10    0.006556*  1
 E,3,4    0.002277   1     0.002277   1     0.002273*  8     0.002277*  1     0.002277*  1     0.002277*  1
 E,5,6    0.001370   1     0.000457   6     0.000000*  8     0.001826   1     0.002283   11    0.001822*  13
 E,7,8    0.000911   1     0.000455*  3     0.000455*  3     0.000911   1     0.000455*  1     0.000911   1
 E,9,10   0.011357   1     0.000472*  12    0.000943   12    0.012300   1     0.009933   11    0.008518*  6

5 Conclusion

We have presented a new criterion for linear dimensionality reduction (LDR) which aims at maximizing the Chernoff distance in the transformed space. The criterion is maximized by means of a gradient-based algorithm, for which convergence and initialization results are discussed. The proposed method has been shown to outperform traditional LDR techniques, such as FDA and LD, when coupled with quadratic and linear classifiers on synthetic and real-life datasets. The method can be naturally extended to the multi-class scenario, an extension that we are currently formalizing and testing.

References

1. M. Aladjem. Linear Discriminant Analysis for Two Classes via Removal of Classification Structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):187–192, 1997.
2. E. Chong and S. Zak. An Introduction to Optimization. John Wiley and Sons, Inc., New York, NY, 2nd edition, 2001.
3. T. Cooke. Two Variations on Fisher's Linear Discriminant for Pattern Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):268–273, 2002.
4. Q. Du and C. Chang. A Linear Constrained Distance-based Discriminant Analysis for Hyperspectral Image Classification. Pattern Recognition, 34(2):361–373, 2001.
5. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, NY, 2nd edition, 2000.
6. H. Gao and J. Davis. Why Direct LDA is not Equivalent to LDA. Pattern Recognition, 39:1002–1006, 2006.
7. M. Herrera and R. Leiva. Generalización de la Distancia de Mahalanobis para el Análisis Discriminante Lineal en Poblaciones con Matrices de Covarianza Desiguales. Revista de la Sociedad Argentina de Estadística, 3(1-2):64–86, 1999.
8. M. Loog and R. P. W. Duin. Linear Dimensionality Reduction via a Heteroscedastic Extension of LDA: The Chernoff Criterion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):732–739, 2004.
9. R. Lotlikar and R. Kothari. Adaptive Linear Dimensionality Reduction for Classification. Pattern Recognition, 33(2):185–194, 2000.
10. D. Newman, S. Hettich, C. Blake, and C. Merz. UCI Repository of Machine Learning Databases, 1998. University of California, Irvine, Dept. of Computer Science.
11. A. Rao, D. Miller, K. Rose, and A. Gersho. A Deterministic Annealing Approach for Parsimonious Design of Piecewise Regression Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(2):159–173, 1999.
12. S. Raudys. Evolution and Generalization of a Single Neurone: I. Single-layer Perceptron as Seven Statistical Classifiers. Neural Networks, 11(2):283–296, 1998.
13. L. Rueda. Selecting the Best Hyperplane in the Framework of Optimal Pairwise Linear Classifiers. Pattern Recognition Letters, 25(2):49–62, 2004.
14. L. Rueda and M. Herrera. Linear Discriminant Analysis by Maximizing the Chernoff Distance in the Transformed Space. Submitted for publication, 2006. Electronically available at http://www.inf.udec.cl/~lrueda/papers/ChernoffLDAJnl.pdf.
15. L. Rueda and B. J. Oommen. On Optimal Pairwise Linear Classifiers for Normal Distributions: The Two-Dimensional Case. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):274–280, February 2002.