Comparison of Data Reduction Techniques Based on the Performance of SVM-type Classifiers

Ramona Georgescu†, Christian R. Berger†, Peter Willett†, Mohammad Azam∗, and Sudipto Ghoshal∗

†Dept. of Electr. and Comp. Engineering, University of Connecticut, Storrs, CT 06269, USA
∗Qualtech Systems Inc., 100 Great Meadow Road, Wethersfield, CT 06109, USA
Abstract – In this work, we applied several techniques for data reduction to publicly available datasets with the goal of comparing how an increasing level of compression affects the performance of SVM-type classifiers. We consistently attained correct rates in the neighborhood of 90%, with Principal Component Analysis (PCA) having a slight edge over the other data reduction methods (PLS, SRM, and OMP). One dataset proved to be hard to classify, even in the case of no dimensionality reduction. Even on this most challenging dataset, performing PCA was considered to offer some advantages over the other compression techniques. Based on our assessment, data reduction appears to be a useful tool that can provide a significant reduction in signal processing load with an acceptable loss in performance.

Keywords: Data Reduction, PCA, PLS, SRM, OMP, Classification, SVM, PSVM.
1 Introduction
The recent advances in data collection and storage capabilities have led to information overload in many applications, e.g., on-line monitoring of spacecraft operations with time series data. It is therefore desirable to perform data reduction before storage or transmission of the data; this reduction can be lossy, as not all features of the data might be relevant. Another motivation for our study is that in real systems, the dimensionality of the acquired data is often lower than that of the space it is measured in. Reconstruction from low dimensional samples is needed and is possible with varying degrees of accuracy. Also, in classification, the dimensionality of the data needed to distinguish between different classes is a measure of the information loss that can be tolerated in lossy compression schemes.

In this paper, we consider the setup in Figure 1; sensors acquire a large amount of data, where each observation is often of high dimension by itself, e.g., ten to thirty dimensional. The data is compressed for storage or transmission and reconstructed before performing classification. We impose the constraint that each observation has to be stored or transmitted separately, which strongly limits the possibilities for data compression. Since the data is of unknown structure, we differentiate two cases: i) the codebook used for encoding is built based on a training subset of the data; or ii) the codebook is fixed a priori without knowledge of any data. For the first case we employ Principal Component Analysis (PCA) and Partial Least Squares (PLS), while for the blind case we consider Structurally Random Matrices (SRM) and Orthogonal Matching Pursuit (OMP). We compare their performance on the classification of publicly available datasets, where for classification we consider Support Vector Machines (SVM) and Proximal Support Vector Machines (PSVM). The experiments are performed on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, the Ionosphere dataset (both available at the Machine Learning Repository of the University of California, Irvine [1]), and on the FD001 Turbofan Engine Degradation Simulation dataset provided by NASA [2]. References to relevant studies performed on these datasets can be found at [1] and [3], respectively.
2 Data Reduction Techniques

2.1 Principal Component Analysis
PCA is an orthogonal linear transformation that converts the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset by presumably retaining those characteristics of the dataset that contribute most to its variance, i.e., by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the most important aspects of the data.
Figure 1: System diagram. Sensors → Dimensionality Reduction (PCA, PLS, SRM, OMP) → Bandlimited Channel → Reconstruction of Sensor Data → Classifier (SVM, PSVM), with a Shared Codebook.

For a (n × p) data matrix X, in which each row represents an observation and each column represents a feature, the PCA transformation is given by

X = X_S X_L^T    (1)
where X_S are the scores, i.e., a matrix of dimension (n × p) that represents X in the principal component space, and X_L are the loadings, i.e., a matrix of dimension (p × p) that contains the principal component coefficients. The loadings represent the "axes" of the new coordinate system and the scores can be thought of as the coefficients corresponding to the new "axes". We used the princomp function in the Statistics Toolbox of MATLAB (R2008a) to implement PCA and calculate the scores and loadings of the training set. We project the observations in the testing set on the loadings obtained for the training set and thus obtain the scores for the testing set (as the equation above suggests); then, we pass the scores obtained for the training set and the scores obtained for the testing set to the chosen classifier.
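As a minimal illustration of this procedure, the following NumPy sketch mirrors the train/test projection described above (this is our own illustration, not the MATLAB princomp implementation used in the paper; variable names are ours):

```python
import numpy as np

def pca_fit(X_train, n_comp):
    """Compute the training mean and the first n_comp loadings."""
    mu = X_train.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    XL = Vt[:n_comp].T                       # (p x n_comp) loadings
    return mu, XL

def pca_scores(X, mu, XL):
    """Project observations onto the training loadings to obtain scores."""
    return (X - mu) @ XL                     # (n x n_comp) scores

# Example usage (hypothetical arrays X_train, X_test):
# mu, XL = pca_fit(X_train, n_comp=5)
# train_scores = pca_scores(X_train, mu, XL)   # passed to the classifier
# test_scores  = pca_scores(X_test,  mu, XL)   # projected on training loadings
```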
2.2 Partial Least Squares
PLS constructs a linear model that describes the relationship between dependent (response) variables Y and independent (predictor) variables X, e.g., one could try predicting a person's weight (modeled as Y) from their age and gender (stored in X). This linear model tries to find the multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space. PLS is therefore a supervised learning technique, while PCA belongs to unsupervised learning. For a (n × p) data matrix X, in which each row represents an observation and each column represents a feature, and a (n × 1) vector Y that contains the corresponding classes for the observations in X, the PLS transformation is given by

X = X_S X_L^T + error_X    (2)

Y = Y_S Y_L^T + error_Y    (3)
With n_comp being the number of components retained, X_S is a matrix of dimension (n × n_comp) representing the extracted predictor scores and X_L is a matrix of dimension (p × n_comp) containing the predictor loadings. Y_S is of dimension (n × n_comp) and Y_L is of dimension (1 × n_comp). PLS uses the nonlinear iterative partial least squares (NIPALS) algorithm [4] to find weight vectors w and c that maximize [cov(x_s, y_s)]^2 = [cov(Xw, Yc)]^2, where x_s = Xw, y_s = Yc and cov(x_s, y_s) = x_s^T y_s / n denotes the sample covariance between the score vectors x_s and y_s. In other words, PLS finds components from X that are also relevant for Y by performing a simultaneous decomposition of X and Y with the constraint that these components explain as much as possible of the covariance between X and Y. Several variants of PLS exist; we used the SIMPLS form of PLS introduced by de Jong, in which X_S is an orthonormal matrix while Y_S is neither orthogonal nor normalized [6]. The matrix constructed from the weight vectors w is a (p × n_comp) matrix W of PLS weights with the property that X_S = X W. We used the plsregress function in the Statistics Toolbox of MATLAB (R2008a) to implement PLS and calculate the scores and loadings of X and Y in the training set. We pass the scores obtained on the training set to the chosen classifier, along with the scores obtained for the testing set by projecting the observations in the testing set (X) on the loadings of the training set (X_L) according to the formula

X_S = X W (X_L^T W)^{-1}.    (4)
A regression step, in which the decomposition of X is used to predict Y [6], can follow the PLS decomposition. We have chosen not to take this approach and instead feed the results of the decomposition to a classifier.
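A rough Python analogue of this PLS pipeline is sketched below, assuming scikit-learn's PLSRegression (which uses NIPALS rather than the SIMPLS variant used in the paper) and loading WDBC through scikit-learn for illustration only. The test scores are obtained with exactly the projection of Eq. (4):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_breast_cancer

# WDBC data and a simple (non-random) train/test split, for illustration only
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr = X[:400], X[400:], y[:400].astype(float)

# Standardize with training-set statistics, as in Section 4
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr_s, X_te_s = (X_tr - mu) / sd, (X_te - mu) / sd

pls = PLSRegression(n_components=5, scale=False).fit(X_tr_s, y_tr)

# Eq. (4): X_S = X W (X_L^T W)^{-1}, using the training-set weights and loadings
W, XL = pls.x_weights_, pls.x_loadings_
test_scores = X_te_s @ W @ np.linalg.inv(XL.T @ W)

train_scores = pls.transform(X_tr_s)   # scores of the training set
# train_scores and test_scores are then passed to the SVM/PSVM classifier
```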
2.3 Structurally Random Matrices
A K-sparse signal x of length d is a signal that, after some linear transform, can be represented by at most K nonzero coefficients, K ≪ d. Compressive sensing [7], [8], [9] shows that such signals can be recovered from a small number of random linear measurements. Structurally random matrices [10] provide such measurements with a matrix that is fixed a priori, without knowledge of any data: the signal is first randomized (e.g., by random sign flips or a random permutation), then passed through a fast orthonormal transform, and finally a random subset of the resulting coefficients is kept as the compressed representation. Since no codebook has to be learned, SRM is applicable when no training set is available before transmission.

2.4 Orthogonal Matching Pursuit

OMP [11], [12] reconstructs a sparse signal from such compressed measurements by greedily selecting, one at a time, the columns of the measurement matrix that are most correlated with the current residual. OMP finds a suboptimal solution to the sparse recovery problem, but does so at a much lower computational cost than exact sparse reconstruction.
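A generic OMP sketch in NumPy is shown below; the dictionary size and sparsity level are illustrative choices, not the settings of Section 4, and the code is our own illustration rather than the implementation used in the paper:

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal Matching Pursuit: greedily recover a k-sparse x from y = A x."""
    m, d = A.shape
    residual, support = y.astype(float).copy(), []
    x_hat = np.zeros(d)
    for _ in range(k):
        # Select the column of A most correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # Least-squares fit on the selected columns, then update the residual
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        x_hat[:] = 0.0
        x_hat[support] = coef
        residual = y - A @ x_hat
    return x_hat

# Toy check: a 5-sparse signal of length d = 100 from m = 50 random measurements
rng = np.random.default_rng(0)
m, d, k = 50, 100, 5
A = rng.standard_normal((m, d)) / np.sqrt(m)
x = np.zeros(d)
x[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
print("recovery error:", np.linalg.norm(omp(A, A @ x, k) - x))
```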
3 Classification Techniques

3.1 Support Vector Machine

A linear SVM separates two classes with a hyperplane defined by a weight vector w and offset b, chosen such that w'x_k + b > 0 for all the x_k of one class, and w'x_j + b < 0 for all the x_j of the other class. If the data are in fact separable in this way, there is probably more than one way to do it. Among the possible hyperplanes, SVMs select the one where the distance from the hyperplane to the closest data points (the "margin") is as large as possible. In practice, it is unlikely that a line will exactly separate the data, and even if a curved decision boundary does, exactly separating the data is probably not desirable: if the data has noise and outliers, a smooth decision boundary that ignores a few data points is better than one that loops around the outliers. Adding slack variables allows a point to lie a small distance on the wrong side of the hyperplane without violating the problem constraints. An optimization-based derivation of SVMs is presented in the tutorial in [13].
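To make the role of the slack penalty and the kernel concrete, here is a small scikit-learn sketch (our illustration on the WDBC data, not the SVM code used in the paper; C and σ are example values):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                      # WDBC
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

scaler = StandardScaler().fit(X_tr)                              # standardize (Section 4)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# C weights the slack variables: small C tolerates points on the wrong side of
# the margin; large C approaches a hard-margin separator.
linear_svm = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
sigma = 5.0
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * sigma**2)).fit(X_tr, y_tr)

print("linear SVM correct rate:", linear_svm.score(X_te, y_te))
print("RBF SVM correct rate:", rbf_svm.score(X_te, y_te))
```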
3.2 Proximal Support Vector Machine
SVM classifies points by assigning them to one of two disjoint halfspaces. In PSVM, the separating planes are not bounding planes but can be thought of as proximal planes, around which the points of each class are clustered and which are pushed as far apart as possible, as seen in Figure 2. This formulation can also be interpreted as regularized least squares. PSVM leads to a fast and simple algorithm for generating a linear or nonlinear classifier that requires only the solution of a single system of linear equations. In contrast, SVM solves a quadratic or linear program that requires considerably longer computational time. Computational results on publicly available datasets indicate that the proximal SVM classifier has test set correctness comparable to that of SVM classifiers [14].

Figure 2: (a) SVM approach vs. (b) PSVM approach.
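Because the linear PSVM reduces to a single linear system, it can be sketched in a few lines (our NumPy illustration of the formulation in [14], not the authors' code; labels are assumed to be ±1):

```python
import numpy as np

def psvm_train(X, y, nu=1.0):
    """Linear PSVM: minimize (nu/2)||D(Xw - e*gamma) - e||^2 + (1/2)(w'w + gamma^2),
    which reduces to one regularized least-squares system."""
    m, n = X.shape
    H = np.hstack([X, -np.ones((m, 1))])                   # H = [X  -e]
    # (I/nu + H'H) [w; gamma] = H' D e, and D e = y for labels in {-1, +1}
    u = np.linalg.solve(np.eye(n + 1) / nu + H.T @ H, H.T @ y)
    return u[:n], u[n]                                      # w, gamma

def psvm_predict(X, w, gamma):
    return np.sign(X @ w - gamma)

# Example usage (hypothetical standardized data, labels y_train in {-1, +1}):
# w, gamma = psvm_train(X_train, y_train, nu=1.0)
# y_hat = psvm_predict(X_test, w, gamma)
```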
4 Results

Standardizing the data is reasonable when the variables are in different units or when the variances of the different columns differ substantially. Therefore, the data was centered by subtracting off the column means and scaled by dividing each column by its standard deviation. The data was randomly split such that 2/3 of the instances went to the training set and 1/3 to the testing set. Ten Monte Carlo runs were averaged for each scenario. For both the nonlinear SVM and the nonlinear PSVM we used a Gaussian Radial Basis Function (RBF) kernel, described by

K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)).    (12)
For all the data reduction methods, we chose to run experiments with 2, 5, 10, 15, 20, 25 and 30 components. For OMP, transmitting the nonzero elements together with their positions in the data obtained after executing OMP doubles the communication effort; therefore, the number of components retained becomes 4, 10, 20, 30. The values we tried for tuning the standard deviation of the Gaussian RBF, σ, were 0.1, 1, 5, 10, 25, 50. PSVM requires tuning of the parameter ν; the values we tried for ν were 0.1, 1, 10, 100, 1000. The value of ν that gave the best performance with the linear PSVM was also used for the nonlinear PSVM. In the following figures, plots for the σ and/or ν that achieve the best performance for each particular classifier-data reduction method combination are shown. For OMP, we set the number of rows of A to 100 to allow the algorithm enough flexibility in calculating its solution. It has to be noted that running OMP introduces a slight time delay compared to the other methods, which has to be taken into account when selecting a data reduction method.
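The experimental protocol just described can be sketched as the following loop (a hypothetical Python/scikit-learn version restricted to PCA and the nonlinear SVM; the paper's experiments were run in MATLAB and also tuned the PSVM parameter ν):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)                      # WDBC
results = {}
for n_comp in (2, 5, 10, 15, 20, 25, 30):
    for sigma in (0.1, 1, 5, 10, 25, 50):
        rates = []
        for run in range(10):                                    # Monte Carlo runs
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=1/3, random_state=run)
            scaler = StandardScaler().fit(X_tr)
            pca = PCA(n_components=n_comp).fit(scaler.transform(X_tr))
            Z_tr = pca.transform(scaler.transform(X_tr))
            Z_te = pca.transform(scaler.transform(X_te))
            clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(Z_tr, y_tr)
            rates.append(clf.score(Z_te, y_te))
        results[(n_comp, sigma)] = np.mean(rates)

best = max(results, key=results.get)
print("best (components, sigma):", best, "mean correct rate:", results[best])
```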
4.1 Results on the WDBC Dataset
For a performance comparison between the various data reduction methods, see Figures 3 and 4. All methods perform well, allowing both classifiers to achieve over 90% correct rate in nearly all scenarios. Therefore, data reduction is highly recommended; in some cases the approach is able to reduce 30 dimensions to 5 or fewer with negligible loss in performance. More specifically, SRM and OMP display, as expected, a clear increase in correct rate with increasing number of retained components, converging to the "no reduction" case when all dimensions are retained. On the contrary, the PCA and PLS curves tend to be flatter, without an obvious connection between the number of retained dimensions and the correct classification rate. Even more surprising, in the case of the linear SVM or PSVM classifiers, applying PCA or PLS sometimes renders better performance than using the full data. We can only speculate that this is connected to the suboptimality of the linear SVM/PSVM classifiers, which is here compensated for by the feature extraction performed by PCA/PLS. It should also be noted that a 3% decrease in correct rate, as displayed by PCA and PLS between maximum and minimum compression, is equivalent in the case of the WDBC dataset to only 3 × 170/100 ≃ 5 additionally misclassified samples in the testing set. In all scenarios, SVM and PSVM can be considered to achieve similar classification correct rates. For both classifiers, the nonlinear version is slightly superior in performance to the linear version; WDBC is an easy dataset to classify, even for a linear classifier. Difficulty was increased progressively by analyzing two other datasets.

Figure 3: WDBC, SVM performance comparison (correct rate vs. dimension; PCA, PLS, SRM, OMP, no reduction): (a) Linear SVM, (b) Nonlinear SVM.

Figure 4: WDBC, PSVM performance comparison (correct rate vs. dimension; PCA, PLS, SRM, OMP, no reduction): (a) Linear PSVM, (b) Nonlinear PSVM.
4.2 Results on the Ionosphere Dataset
The same approach presented above for the WDBC dataset was used for the Ionosphere dataset, with the exception of changes in the values of σ and ν, which were adjusted for tuning purposes. PCA revealed that more principal components are needed to explain a given variance in the Ionosphere dataset than in the case of the WDBC dataset. Therefore, it was expected that the Ionosphere dataset would be more difficult to classify than the WDBC dataset. This is reflected in slightly smaller classification correct rates (on average about 85%) in all scenarios. For a performance comparison between the various data reduction methods, please refer to Figures 5 and 6. For both SVM and PSVM, a nonlinear classifier shows a noticeable improvement over a linear classifier of approximately 5%. For all scenarios, SVM and PSVM achieve comparable classification correct rates for each data reduction technique. As in the case of WDBC, PCA looks like a consistently good method for data reduction, performing well over all levels of compression. In the more difficult to classify Ionosphere dataset, PCA emerges more clearly as the preferred dimensionality reduction method. For both the WDBC and the Ionosphere datasets, applying data reduction techniques results in a small loss in performance while requiring considerably fewer computational resources when compared to applying no compression and thus using all available data for classification.

Figure 5: Ionosphere, SVM performance comparison (correct rate vs. dimension; PCA, PLS, SRM, OMP, no reduction): (a) Linear SVM, (b) Nonlinear SVM.

Figure 6: Ionosphere, PSVM performance comparison (correct rate vs. dimension; PCA, PLS, SRM, OMP, no reduction): (a) Linear PSVM, (b) Nonlinear PSVM.
4.3 Results on the PHM Dataset

We chose to label the FD001 Turbofan Engine Degradation Simulation dataset as the PHM dataset (please see [3]), and we will refer to it as such. The data consists of multiple multivariate time series, divided into training and test subsets. Each time series is from a different engine, but the data is considered to be from a fleet of engines of the same type. Each engine starts with different degrees of initial wear and manufacturing variation, which is unknown to the user. Included in the data are three operational settings that have a substantial effect on engine performance. Each engine is operating normally at the start of each time series and develops a fault at some point during the series. In the training set, the fault grows in magnitude until system failure. Therefore, we considered the Remaining Useful Life (RUL), i.e., the number of remaining operational cycles before failure, to be zero at the end of each engine's time series, and we populated the RUL vector associated with the time series by increasing the RUL by one cycle for each step backwards in time. The PHM dataset was created as a time-to-failure problem, but we chose to recast it as a classification problem. In the test set, the time series ends some time prior to system failure and an RUL vector is provided. We expanded this vector in a similar manner, by increasing the RUL by one cycle for each step backwards in time.
The RUL vector was then recast as the two classes needed for the SVM. For both the training set and the test set, values in the RUL vector greater than the mean value of the RUL vector were considered to indicate class 1, i.e., the engine is operating normally, with the rest of the values indicating class 0, i.e., the engine has suffered a fault. The data is contaminated with sensor noise. For denoising, a median filter with a jumping window of length 15 is applied to each column in the data and to the RUL vector. The entire training dataset was used for training and the entire testing dataset was used for testing; there was no randomness involved in the selection. We chose to apply to PHM the data reduction method we deemed worked best on the other datasets, PCA, combined with the nonlinear version of SVM. Looking at how the reduced data appears in 2D (Figure 7), it can be observed that the data is very hard to classify even with a nonlinear classifier. The plot gives the intuition that increasing the number of retained components will decrease performance, which is evident in Figure 8. These effects could be caused by the fact that the PHM dataset was designed as a regression problem and not as a classification problem.

Figure 7: PHM data reduced by PCA to 2 dimensions, showing the training points of classes 0 and 1 and the support vectors.

Figure 8: PCA with the nonlinear SVM on the PHM dataset: correct rate vs. number of retained dimensions.
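A sketch of this labeling and denoising step is given below (our NumPy illustration under the assumptions stated above; the synthetic sensor column merely stands in for the NASA data):

```python
import numpy as np

def jumping_median(col, window=15):
    """Median filter with a non-overlapping ("jumping") window: each block of
    `window` samples is replaced by its median."""
    out = np.empty(len(col), dtype=float)
    for start in range(0, len(col), window):
        out[start:start + window] = np.median(col[start:start + window])
    return out

def rul_to_labels(rul):
    """Recast the RUL vector as two classes: 1 = operating normally (RUL above
    its mean), 0 = faulty (RUL at or below its mean)."""
    return (rul > rul.mean()).astype(int)

# One engine with 200 cycles of a single noisy sensor reading (synthetic stand-in)
rng = np.random.default_rng(1)
sensor = np.linspace(1.0, 0.2, 200) + 0.05 * rng.standard_normal(200)
rul = np.arange(200)[::-1]                  # RUL counts down to zero at failure

sensor_denoised = jumping_median(sensor)    # applied to each data column
labels = rul_to_labels(rul)                 # class per observation, fed to the SVM
```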
5 Conclusion
Several techniques for data reduction (PCA, PLS, SRM and OMP) were applied to three datasets (WDBC, Ionosphere and PHM) in an effort to analyze how their performance changes with the level of data compression. The comparison of these techniques was based on how well the data can be classified by an SVM or PSVM (linear and nonlinear versions of each) as the number of retained components decreases. On the WDBC and Ionosphere datasets, all methods consistently attained good correct rates, in the neighborhood of 95% and 90% respectively. As expected, PCA and PLS outperform SRM and OMP, since they optimize their data reduction based on a training data set. Still, SRM and OMP show little performance degradation, especially in conjunction with the nonlinear SVM/PSVM, and could be used favorably if no training set is easily available before transmission. In absolute performance, PCA seemed to have a slight edge over the other data reduction methods. SVM and PSVM were similar in performance in all cases, with a nonlinear classifier showing a small improvement over a linear classifier with respect to correct rates at a given level of compression (most noticeable on the Ionosphere dataset). PCA together with a nonlinear SVM was chosen to be run on the last and most complex dataset. PHM proved to be a hard-to-classify dataset even in the absence of dimensionality reduction. Even in such a difficult case, PCA was considered worth applying. Based on these assessments, data reduction appears to be a useful tool that can provide a significant reduction in resource requirements for storage or transmission, with an acceptable loss in performance.
Acknowledgements

This research was supported under Contract # NNX08CD30P: NASA STTR 2007, Phase I: Data Reduction Techniques for Real-time Fault Detection and Diagnosis, and Multiple Fault Inference with Imperfect Tests.
References

[1] A. Asuncion and D. J. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Science, 2007.

[2] http://ti.arc.nasa.gov/project/prognostic-data-repository/

[3] A. Saxena, K. Goebel, D. Simon, and N. Eklund, "Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation," Proc. First Intl. Conf. on Prognostics and Health Management, Oct. 2008.

[4] H. Wold, "Path models with latent variables: The NIPALS approach," Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building, pp. 307-357, Academic Press, 1975.

[5] H. Abdi, "Partial least square regression (PLS regression)," Encyclopedia of Measurement and Statistics, pp. 740-744, Sage Publications, 2007.

[6] S. de Jong, "SIMPLS: An alternative approach to partial least squares regression," Chemometrics and Intelligent Laboratory Systems, Vol. 18, pp. 251-263, Mar. 1993.

[7] R. Baraniuk, "Compressive sensing," IEEE Signal Processing Magazine, Vol. 24, No. 4, pp. 118-121, Jul. 2007.

[8] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, Vol. 52, No. 12, pp. 5406-5425, Feb. 2006.

[9] D. Donoho, "Compressed sensing," IEEE Transactions on Information Theory, Vol. 52, No. 4, pp. 1289-1306, Apr. 2006.

[10] T. T. Do, L. Gan, Y. Chen, N. Nguyen, and T. D. Tran, "Fast and Efficient Dimensionality Reduction Using Structurally Random Matrices," Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Apr. 2009.

[11] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, Vol. 53, No. 12, pp. 4655-4666, Dec. 2007.

[12] W. Li and J. C. Preisig, "Estimation of Rapidly Time-Varying Sparse Channels," IEEE Journal of Oceanic Engineering, Vol. 32, No. 4, pp. 927-939, Oct. 2007.

[13] J. P. Lewis, "A Short SVM (Support Vector Machine) Tutorial," www.idiom.com/~zilla/code.html.

[14] G. Fung and O. L. Mangasarian, "Proximal Support Vector Machine Classifiers," Proc. Seventh ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 77-86, Aug. 2001.