A non-parametric dimensionality reduction technique using gradient descent of misclassification rate
S. Redmond, C. Heneghan
Department of Electronic Engineering, University College Dublin, Ireland
[email protected], [email protected]
Abstract
We present a technique for dimensionality reduction. The technique uses a gradient descent approach to sequentially find orthogonal vectors such that, when the data is projected onto each vector, the classification error is minimised. We make no assumptions about the structure of the data, and the technique is independent of the classifier model used. Our approach has advantages over other dimensionality reduction techniques, such as Linear Discriminant Analysis (LDA), which assumes unimodal Gaussian class distributions, and Principal Component Analysis (PCA), which is ignorant of class labels. In this paper we present the results of a comparison of our technique with PCA and LDA when applied to various 2-dimensional distributions and to the two-class cancer diagnosis task from the Wisconsin Diagnostic Breast Cancer Database, which contains 30 features.
1 Originality and contribution
At the time of submission we were not aware that this concept had been previously presented in a more general form in [1, 2]. However, the work we present here was arrived at independently and does contain subtle differences. Nonetheless, we hope the reader finds the examples in this paper illustrative of the basic concept described in [1] and [2]. In a typical pattern recognition problem, we are usually faced with large sets of features which may have utility in providing reliable classification. In practice, however, many of these features may be strongly correlated with one another, or may not contribute to classification in any way. Accordingly, we may wish to perform dimensionality reduction on our data for various reasons: the dimension of the data may be too large to handle from a memory or computational point of view, or, in the case of classification, the removal of features containing no information can improve classification results. Projection pursuit is a popular technique for reducing the dimension of large data sets, in which we seek a projection of the higher dimensional data onto lower dimensions. However, many projection pursuit techniques attempt to maximise, or minimise, some objective function which inherently makes assumptions about the structure of the data. For Linear Discriminant Analysis (LDA) the assumption is that each class distribution is unimodal Gaussian, and that the class separation information is contained in the difference of the class means as much as in the class variances. Principal Component Analysis (PCA) assumes that the class separation information is contained in the direction of maximum variance of the data. We shall present a technique which attempts to converge on a projection which minimises the classification error of the chosen classifier, instead of imposing a structure on the data by maximising such an objective function.
2 Introduction
Dimensionality reduction can also be thought of as a feature selection process. In feature selection, we generally retain the minimum (and best) set of features for classification. Feature selection is therefore a special case of dimensionality reduction in which the basis is an exact subset of the feature basis. A feature selection algorithm is a vital building block of any pattern recognition system. Feature selection - the rejection of null features, those that contain no information - can greatly improve recognition results. In theory, when using features that contain little or no relevant information, the performance of the ideal classifier will not degrade: one could simply include all features in the classification process, and the features containing no information would be ignored by the classifier. In practice this is rarely true - null features add noise to the system, and the removal of these redundant features can greatly improve results. For example, Witten [3] notes experiments with a decision tree classifier which show that adding a binary random variable to the feature set can degrade performance by 5-10%. Here we will present a dimensionality reduction technique which sequentially chooses the vectors of an orthogonal basis of lower dimension than the data, so that when the data is projected onto this basis the classifier error is minimised. In Section 3 we give a brief overview of some existing techniques used for dimensionality reduction and their shortcomings, to motivate our new technique. In Section 4 we give a detailed description of our technique. Section 5 outlines the data sets, the classifier model, and the experimental procedure used in comparing our method with LDA and PCA as dimensionality reduction techniques. Finally, in Sections 6 and 7 we summarise the comparative results and draw our conclusions.
3 Review of existing techniques
Firstly, we will review some existing methods of dimensionality reduction. In general, these techniques try to choose a projection which preserves the class separation information of the data but suppresses some of the noisy or null features. An intuitive example of dimensionality reduction would be projecting 3-dimensional data onto a 2-dimensional plane. Some techniques for doing this are Principal Component Analysis (PCA) [4], Linear Discriminant Analysis (LDA) [5], and Orthogonal LDA (OLDA) [6]. We will now briefly describe these algorithms in more detail. PCA finds the directions of maximum variance of the data by finding the eigenvectors of the covariance matrix of all the data, irrespective of class. A lower dimensional representation of the data may then be found by projecting onto the m eigenvectors corresponding to the largest m eigenvalues. However, PCA is ignorant of the class labels attached to the data, so a good class separation in the lower dimensional data is not guaranteed. LDA attempts to find projections which maximise the separation between the means of the classes while simultaneously trying to minimise the variance of each class about its mean. LDA implicitly assumes each class belongs to a single Gaussian distribution. We will not present the mathematics of the LDA technique here; however, for a C-class problem LDA will return C − 1 eigenvectors corresponding to the C − 1 non-zero eigenvalues. The magnitudes of the corresponding eigenvalues indicate how well the objective function is maximised along each eigenvector. A shortcoming of LDA is that its objective function accounts for Euclidean or Mahalanobis distance, but not classification error. OLDA simply extends LDA to choose successive orthogonal projections. For example, if we use LDA in 3-dimensional space and choose the eigenvector corresponding to the maximum eigenvalue to project onto, we then create a 2-dimensional basis (a plane) orthogonal to that vector, project the data onto that basis, and perform LDA on the 2-dimensional data to obtain the second projection vector, orthogonal to the first. Again, in general this technique does not directly optimise with respect to classification accuracy. As mentioned in Section 1, a more general version of this work has been explored independently in [1, 2]; we point the interested reader there.
4 Proposed Method
Let X be an n × d matrix of features, where n is the number of instances and d is the number of dimensions the feature space spans. In general X is drawn from 2 or more classes which we wish to distinguish. We wish to choose a projection of the d-dimensional data, X, onto m orthogonal vectors (m ≤ d) such that the classification error is minimised for whatever classifier is chosen. The projection should also be robust when we move to independent test data. We use a gradient descent method. We choose an initial d × 1 vector, w, at random and project the data X onto it:

p = Xw.    (1)
We pass the vector p to the classifier as both the training and test data. Hence, here we are training and testing using the same data. In practice, of course, this would not be the case; we do it here to illustrate the mechanism of the algorithm without worrying about training bias. In practice a transformation matrix is found using the training data only. Once the dimensionality reduction transformation matrix has been found, it can be applied to any new data.
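As a small illustration of Equation (1) and of applying a learned projection to new data, the following NumPy sketch projects a training set and an unseen set onto the same vector w; the variable names here are ours, purely for illustration.

```python
import numpy as np

def project(X, W):
    """Project an (n x d) data matrix onto the columns of W (d x m)."""
    return X @ W

# Hypothetical usage: the projection is learned from training data only and
# then applied, unchanged, to any new data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))   # 100 training instances, 5 features
X_new = rng.normal(size=(20, 5))      # unseen data

w = rng.normal(size=(5, 1))           # a single projection vector (d x 1)
p_train = project(X_train, w)         # p = Xw, Equation (1)
p_new = project(X_new, w)             # the same transformation on new data
```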
For the data vector p, we identify which instances have been erroneously classified, using our classifier of choice. For example, Figure 1 (a)(i) shows two Gaussian distributions in a 2-dimensional feature space. Instances from class 1 are marked with an 'x' and those from class 2 with a '+'. The initial vector w is shown starting at the origin and pointing southeast. Figure 1 (a)(ii) shows a histogram of p separated by class. The dashed vertical line shows where the classifier placed the decision boundary which minimises misclassification. The solid histogram values to the left of the decision boundary and the dashed histogram values to the right of the boundary represent histograms of the wrongly classified instances. To improve our classification, we wish to adjust w so that the errors on the right of the decision boundary move toward the left and those on the left move toward the right. In Figure 1 (a)(i) we have marked the misclassified instances we wish to move to the right with a '▽' and those we wish to move to the left with an 'o'. We now compute the mean of the instances marked '▽' and denote it a, and the mean of the instances marked 'o' and denote it b. The direction in which we wish w to move is hence v = b − a. The vector v is shown in Figure 1 (a)(i). We only want to move w a small increment in that direction, so our update equation for w is:

w_new = w_old + εv.    (2)
Figure 1: Plots of 2 different distributions. (a) (i) Two Gaussian distributions in a 2-dimensional feature space, with the vectors w, v, and εv indicated; (ii) a histogram of the data projected onto the vector w (p = Xw). (b) The 3-class case in a 2-dimensional feature space: (i) scatter plot, (ii) histogram of the projection onto w.
Here ε is a small step size. We then compute the new value of p with w_new, classify again, and re-adjust w until convergence is reached. This idea generalises easily to multiple classes. In the multi-class case, the direction in which we choose to move a misclassified instance (left or right along w) is determined by the direction along w in which the nearest correct classification of that class is located. What we are trying to do is choose the next projection p_new = Xw_new so that, on average, the misclassifications of p_new are moved closer to the correct classifications of the same class, and hence encourage those misclassifications to be turned into correct classifications.
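A minimal sketch of one such update, written in Python/NumPy, is given below. It is our own illustration, not the original implementation: `classify` stands for any classifier of choice applied to the 1-dimensional projection, we approximate the "nearest correct classification" of a class by the mean projection of its correctly classified instances, and the renormalisation of w is also our choice, to keep the step size ε comparable across iterations.

```python
import numpy as np

def update_w(X, y, w, classify, eps=0.05):
    """One iteration of the projection update.

    X : (n, d) data matrix; y : (n,) integer class labels;
    w : (d,) current projection vector;
    classify : function taking the 1-d projection p and the labels y and
               returning predicted labels (any classifier can be plugged in).
    """
    p = X @ w                                   # Equation (1): p = Xw
    y_hat = classify(p, y)
    wrong = y_hat != y

    # Decide, for each misclassified instance, whether it must move left or
    # right along w. As a stand-in for the "nearest correct classification"
    # we use the mean projection of the correctly classified members of the
    # instance's own class.
    move_right = np.zeros(len(y), dtype=bool)
    move_left = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        correct_c = (y == c) & ~wrong
        if not correct_c.any():
            continue                            # no anchor for this class
        target = p[correct_c].mean()
        wrong_c = (y == c) & wrong
        move_right |= wrong_c & (p < target)    # must move right along w
        move_left |= wrong_c & (p >= target)    # must move left along w

    if not (move_right.any() and move_left.any()):
        return w                                # nothing informative to adjust
    a = X[move_right].mean(axis=0)              # mean of right-moving instances
    b = X[move_left].mean(axis=0)               # mean of left-moving instances
    v = b - a
    w_new = w + eps * v                         # Equation (2)
    return w_new / np.linalg.norm(w_new)        # keep w at unit length (our choice)
```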
An example of a 3-class problem is shown in Figure 1 (b), in which the class instances are marked with '+', 'x', and '△'. Again, right-moving instances are marked '▽' and left-moving instances 'o'. w is shown starting at the origin and pointing northwest. Since we are using a gradient descent method it is possible to converge upon a locally minimal solution, which is not necessarily globally optimal. To help overcome this and find a global minimum we search for w from several different initial vectors. These could be chosen using LDA or PCA, but in our case we chose them to be a set of random but orthogonal vectors. We then choose the w which has converged upon the best projection (the lowest classification error). Once w has been found we can create an orthonormal basis which is orthogonal to w but of dimension d − 1, and we then project the data onto this basis to create a new data set, with the information contained in the direction of w removed. This orthonormal basis is easily created using a QR decomposition of a d × d matrix which has w as its first column and d − 1 other random columns; the basis is formed from the last d − 1 columns of the Q matrix. Once we have transformed the data, X, to a new (d − 1)-dimensional data set, we repeat the procedure of finding a w for this reduced-dimension data set. The sequence of finding w and reducing the dimension of the data can continue until we have found m different w vectors. As we have already stated, we are passing both the training and test data to the classifier so as to keep the concept simple and illustrate the algorithm. In practice we may not have much training data, and hence cross-validation would be used to obtain a reliable estimate of classification error; the algorithm may be applied in this case to minimise the misclassification rate. If we were using 5-fold cross-validation, for example, we would have 5 training-test data pairs. For one training-test pair we train the classifier (given a projection, p_train, of the training data onto w) and test with a projection, p_test, of the test data onto w. We then decide whether the points in the test data are left-moving or right-moving (as above). This is done for all 5 sets of test data, using their corresponding training sets to train. Then the vectors a, b, and v are calculated and w is updated.
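The deflation step just described, repeated m times, can be sketched as follows. This is again our own illustration, assuming a routine `find_w` that runs the gradient descent above on the current data and returns the best projection vector found. The `back_map` matrix accumulates the successive changes of basis so that each returned vector can be applied directly to the original d-dimensional data.

```python
import numpy as np

def orthogonal_complement(w):
    """Orthonormal basis (d x (d-1)) for the subspace orthogonal to w."""
    d = len(w)
    M = np.column_stack([w, np.random.randn(d, d - 1)])
    Q, _ = np.linalg.qr(M)             # first column of Q is parallel to w
    return Q[:, 1:]                    # remaining columns span w's complement

def reduce_dimension(X, y, m, find_w):
    """Sequentially find m projection vectors, deflating after each one.

    find_w : routine implementing the gradient descent of Section 4; it takes
             the current data and labels and returns the best w found.
    Returns a (d x m) matrix whose columns are the projection vectors
    expressed in the original feature space.
    """
    d = X.shape[1]
    back_map = np.eye(d)               # maps current coordinates back to original
    W = []
    X_cur = X
    for _ in range(m):
        w = find_w(X_cur, y)           # best w for the current (reduced) data
        W.append(back_map @ w)         # express w in the original d dimensions
        B = orthogonal_complement(w)   # basis orthogonal to w
        X_cur = X_cur @ B              # remove the information along w
        back_map = back_map @ B        # compose the coordinate maps
    return np.column_stack(W)
```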
5 Data and experimental design
5.1 Data Sets
5.1.1 Synthetic Data - 2 dimensions
To illustrate the utility of our technique, we provide examples of the simplest possible dimensionality reduction task: from two dimensions to one. We constructed 6 different 2-dimensional data sets which we considered interesting, i.e. data sets for which LDA and PCA may find it difficult to choose a single projection that discriminates the class information. Plots of the data sets are shown in Figure 2.

Figure 2: Data sets (a)-(f).

We will briefly describe each data set. Data set (a) consists of 2 classes. Instances of class 1 are marked with '+' and class 2 with 'x'. Class 1 comprises 3 Gaussian distributions centred at (0, 1), (1.5, 0) and (0, -1). Class 2 comprises 3 Gaussian distributions centred at (0, 0), (0, 2) and (-1.5, 1). All Gaussian distributions have the covariance matrix
$$\Sigma_n = \begin{pmatrix} 0.05 & 0 \\ 0 & 0.05 \end{pmatrix}.$$
Data set (b) consists of 6 classes. Each class distribution is Gaussian, with covariance $\Sigma_n$. Data set (c) consists of 2 classes. Each class distribution is a mixture of two Gaussian distributions with different means. Class 1 is marked with a '+' and its Gaussian means are (0, 2) and (2, 0). Class 2 is marked with an 'x' and its Gaussian means are (0, 0) and (2, 2). All Gaussians have covariance $\Sigma_n$. Data set (d) comprises 2 classes. Class 1, marked 'x', consists of 500 instances drawn from a Gaussian distribution centred at (0, 0) with covariance $\Sigma_n$, plus 1 instance at (20, 0). Class 2, marked '+', consists of 500 instances drawn from a Gaussian distribution centred at (1, 0) with covariance $\Sigma_n$, plus 1 instance at (-19, 0). Data set (e) contains 2 classes. Class 1, marked '+', is drawn from a Gaussian distribution with mean (0, 0) and covariance
$$\Sigma = \begin{pmatrix} 0.45 & 0 \\ 0 & 0.05 \end{pmatrix},$$
and so is scaled along the x-ordinate. Class 2, marked 'x', is drawn from a Gaussian distribution which also has mean (0, 0) but has covariance
$$\Sigma = \begin{pmatrix} 0.05 & 0 \\ 0 & 0.45 \end{pmatrix},$$
and thus is scaled along the y-ordinate. Finally, data set (f) contains 2 classes. Class 1, marked '+', is drawn from a Gaussian distribution with mean (0, 0) and covariance
$$\Sigma = \begin{pmatrix} 0.45 & 0 \\ 0 & 0.05 \end{pmatrix},$$
and hence is scaled along the x-ordinate. Class 2, marked 'x', is drawn from a Gaussian distribution with mean (0, 0.1) which is not scaled in any direction and has covariance matrix $\Sigma_n$.
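For concreteness, here is one way data set (c), for example, could be generated; the number of instances per Gaussian mode is our own choice for illustration, as the paper only specifies instance counts for data set (d).

```python
import numpy as np

def gaussian_mixture(means, n_per_mode, cov, rng):
    """Draw n_per_mode points from each Gaussian mode and stack them."""
    return np.vstack([rng.multivariate_normal(m, cov, n_per_mode) for m in means])

# Data set (c): two classes, each a mixture of two Gaussians with Sigma_n = 0.05 I.
rng = np.random.default_rng(1)
cov = 0.05 * np.eye(2)
class1 = gaussian_mixture([(0, 2), (2, 0)], 100, cov, rng)   # class 1, marked '+'
class2 = gaussian_mixture([(0, 0), (2, 2)], 100, cov, rng)   # class 2, marked 'x'
X = np.vstack([class1, class2])
y = np.array([0] * len(class1) + [1] * len(class2))
```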
5.1.2 Wisconsin Diagnostic Breast Cancer Database - 30 dimensions
The Wisconsin Diagnostic Breast Cancer Database is a database of 30 features from 569 patients tested for breast cancer. There are two classes, Benign and Malignant. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The database is available from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html.
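A convenient way to obtain the same data in Python, assuming scikit-learn is installed (its bundled copy of the dataset matches the 569 × 30 description above), is:

```python
from sklearn.datasets import load_breast_cancer

# Bundled copy of the Wisconsin Diagnostic Breast Cancer data: 569 instances,
# 30 features, two classes (malignant / benign).
data = load_breast_cancer()
X, y = data.data, data.target
print(X.shape)              # (569, 30)
print(data.target_names)    # ['malignant' 'benign']
```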
5.2 Classifier Design
The tool that we will use for classification is a quadratic discriminant classifier (QDC), based on Bayes' rule. A quadratic discriminant classifier is derived as follows. Let $\omega_i$ signify the ith class. Let x denote the feature vector corresponding to a certain instance of the data, X, i.e. x is a row of X. Using Bayes' rule we wish to find the class i which maximises the posterior probability:
$$p(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}.$$
Maximising $p(\omega_i|x)$ is equivalent to maximising its logarithm. Assuming a normal distribution for the feature vector, $p(x|\omega_i)$ becomes
$$p(x|\omega_i) = (2\pi)^{-\frac{d}{2}} |\Sigma_i|^{-\frac{1}{2}} \exp\left\{-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right\},$$
where $\Sigma_i$ is the covariance matrix of the ith class and $\mu_i$ is the mean vector of the ith class. Substituting $p(x|\omega_i)$ into the natural logarithm of $p(\omega_i|x)$, our problem is transformed into finding the class i which maximises the discriminant value $g_i(x)$ for a given test feature vector x:
$$g_i(x) = x^T H_i x + h_i^T x + k_i,$$
where $H_i = -\tfrac{1}{2}\Sigma_i^{-1}$, $h_i = \Sigma_i^{-1}\mu_i$, and $k_i = -\tfrac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. The class with the highest discriminant value is chosen as the assigned class for that feature vector. To construct the quadratic discriminant classifier, therefore, we must estimate the covariance matrix and mean of the features for each class, and also the prior probability of the class occurring.
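A compact implementation of this classifier, estimating the class means, covariances, and priors from the training data as described, might look like the following sketch (our own code, not from the paper); for the 1-dimensional projections used here, X is simply the column vector p.

```python
import numpy as np

class QDC:
    """Quadratic discriminant classifier using the discriminant g_i(x) above."""

    @staticmethod
    def _as_matrix(X):
        X = np.asarray(X, dtype=float)
        return X[:, None] if X.ndim == 1 else X   # allow the 1-d projection p

    def fit(self, X, y):
        X, y = self._as_matrix(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.params_ = []
        n = len(y)
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            Sigma = np.atleast_2d(np.cov(Xc, rowvar=False))
            Sigma_inv = np.linalg.inv(Sigma)
            H = -0.5 * Sigma_inv                          # H_i
            h = Sigma_inv @ mu                            # h_i
            k = (-0.5 * mu @ Sigma_inv @ mu               # k_i, with the prior
                 - 0.5 * np.log(np.linalg.det(Sigma))     # P(omega_i) estimated
                 + np.log(len(Xc) / n))                   # from class frequencies
            self.params_.append((H, h, k))
        return self

    def predict(self, X):
        X = self._as_matrix(X)
        g = np.column_stack([
            np.einsum('ij,jk,ik->i', X, H, X) + X @ h + k  # g_i(x) for each row
            for H, h, k in self.params_
        ])
        return self.classes_[np.argmax(g, axis=1)]
```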
5.3 Experimental Design
We present each of our 6 synthetic data sets and the Breast Cancer data to the three techniques under comparison: (1) our gradient descent method, (2) PCA, and (3) LDA. The PCA projection is chosen as the eigenvector corresponding to the maximum eigenvalue after performing an eigenvalue decomposition on the covariance matrix of X. The LDA projection we choose corresponds to the eigenvector with the maximum eigenvalue returned when maximising the objective function
$$J(w) = \frac{|w^T S_B w|}{|w^T S_W w|},$$
where $|\cdot|$ denotes the determinant, $S_B$ is a matrix representing the scatter of the class means about the overall mean, and $S_W$ is a matrix describing the scatter of instances about each class mean (see [6] for a detailed description of the scatter matrices). The eigenvectors are those of the matrix $S_W^{-1} S_B$. Each method returns a 1-dimensional projection of the initial multi-dimensional data. This projection is used to train and test the QDC described in Section 5.2. We use classification error as the measure of performance.
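For reference, the PCA and LDA projections used in this comparison can be computed as in the sketch below (our own code, under the standard scatter-matrix definitions referenced above); the LDA direction is taken as the leading eigenvector of $S_W^{-1} S_B$.

```python
import numpy as np

def pca_direction(X):
    """Eigenvector of the covariance matrix with the largest eigenvalue."""
    C = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    return vecs[:, np.argmax(vals)]

def lda_direction(X, y):
    """Leading eigenvector of S_W^{-1} S_B."""
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
        diff = (mu_c - overall_mean)[:, None]
        S_B += len(Xc) * diff @ diff.T                   # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    return np.real(vecs[:, np.argmax(np.real(vals))])
```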
6 Results
6.1 Results - Synthetic data
As Table 1 shows, the gradient descent technique outperforms or equals the performance of PCA and LDA on all 6 data sets. The data sets where PCA performed worse were simply those where the class information was not in the direction of maximum variance. We will briefly interpret the results for each data set and explain why LDA failed to perform as well. In data set (a), the multi-modal Gaussian structure of the class distributions violated the assumptions on which LDA is based. In data set (b), when LDA attempted to maximise the variance of the class means about the overall mean, it chose a projection which maximised the LDA objective function but obscured the
outlying class at (2.7, 0). In data set (c) the assumptions of LDA are again violated, with two bimodal class distributions. Also, since the class means are approximately equal, no separating information is contained in the means, which is what LDA depends upon. In data set (d) the two outlying points at (-19, 0) and (20, 0) cause the distributions to appear skewed along the x-axis; hence the LDA technique attempts to minimise the variance about the class means and chooses to project onto the y-axis, a projection containing no class separability information. In data set (e) both classes have the same mean, and hence the projection LDA chooses is essentially random, dependent on the variances of the distributions. In data set (f) the means of the class distributions are offset along the y-axis, but the separating information is contained along the x-axis; LDA ignores the variance and maximises the distance between the means.

Table 1: Results: synthetic data (classification error and projection chosen by each method).

               Gradient Descent           PCA                        LDA
               Error   Projection         Error   Projection         Error   Projection
Data set (a)   26%     (-0.91, 0.42)      33%     (-0.1, 0.98)       62%     (-0.82, -0.58)
Data set (b)   3%      (-0.44, 0.9)       17%     (0.01, 0.99)       23%     (-0.6, -0.8)
Data set (c)   8%      (0.7, -0.72)       10%     (0.72, 0.68)       49%     (0, -1)
Data set (d)   1.6%    (-1, 0)            1.6%    (1, 0)             48%     (0, -1)
Data set (e)   24%     (0, 1)             39%     (0.89, 0.45)       39%     (0.5, 0.87)
Data set (f)   19%     (1, 0)             19%     (1, 0)             45%     (0, -1)
6.2 Results - Wisconsin Diagnostic Breast Cancer Data
Table 2: Results: Breast Cancer data (classification error).

                      Gradient Descent   PCA    LDA
Breast cancer data    1.4%               9.1%   37.6%

We see that LDA performs the worst on this data set. This is most likely due to the data being skewed and non-Gaussian. PCA performs reasonably well because it is only a two-class problem and the direction of maximum variance is very likely to lie along the vector between the means of the two classes. In fact the data set is linearly separable, but our choice of classifier, which assumes that the class distributions are Gaussian when projected onto w, caused some errors because the projected distributions are not Gaussian. Plots of the histograms of the final projections, p, are shown in Figure 3.
Figure 3: Histograms of the final projections, p, of the Breast Cancer data for (a) Gradient Descent, (b) PCA, and (c) LDA.
7 Conclusions
We have described a gradient descent method for dimensionality reduction which makes no assumptions about the structure of the data and can use an arbitrary classifier. While the system we describe here uses the same data for training and testing, this is only for illustration purposes; in reality the gradient descent process would operate on the classification results from several cross-validation folds. We have compared our system's performance with that of PCA and LDA, both standard techniques for dimensionality reduction, and found it to be superior, or at least equal, on all data sets.
References

[1] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Transactions on Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992.

[2] X. Wang and K. K. Paliwal, "Using minimum classification error training in dimensionality reduction," vol. 1, pp. 338-345, 2000.

[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.

[4] I. T. Jolliffe, Principal Component Analysis. New York: Springer, 2002.

[5] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, pp. 179-188, 1936.

[6] T. Okada and S. Tomita, "An optimal orthonormal system for discriminant analysis," Pattern Recognition, vol. 18, no. 2, pp. 139-144, 1985.