Efficient Feature Selection for PTR-MS Fingerprinting of Agroindustrial Products

Pablo M. Granitto (1), Franco Biasioli (2), Cesare Furlanello (3), and Flavia Gasperi (2)

(1) CIFASIS, CONICET/UNR/UPC, Bv 27 de Febrero 210 Bis, 2000 Rosario, Argentina
[email protected]
(2) FEM-IASMA Research Center, Agrifood Quality Department, Via E. Mach 1, 38010 San Michele all'Adige (TN), Italy
{franco.biasioli,flavia.gasperi}@iasma.it
(3) FBK-irst, Via Sommarive 18, 38100 Povo (TN), Italy
[email protected]
Abstract. We recently introduced the Random Forest - Recursive Feature Elimination (RF-RFE) algorithm for feature selection. In this paper we apply it to the identification of relevant features in the spectra (fingerprints) produced by Proton Transfer Reaction - Mass Spectrometry (PTR-MS) analysis of four agro-industrial products (two datasets of berry cultivars and two of typical cheeses, all from North Italy). The method is compared with the more traditional Support Vector Machine - Recursive Feature Elimination (SVM-RFE), extended to handle multiclass problems. Using replicated experiments we estimate unbiased generalization errors for both methods. We analyze the stability of the two methods and find that RF-RFE is more stable than SVM-RFE in selecting small subsets of features. Our results also show that RF-RFE outperforms SVM-RFE at finding small subsets of features with high discrimination levels on PTR-MS datasets.
1 Introduction
Proton Transfer Reaction - Mass Spectrometry (PTR-MS) [1] is a spectrometric technique with a growing number of applications ranging from medical diagnosis to environmental monitoring [2]. It allows fast, non-invasive, time-continuous measurements of volatile organic compounds (VOCs). These compounds play a relevant role in food and agro-industrial applications. They are related to the real or perceived quality of food and to its sensory characterisation, and they are emitted during most transformation/preservation processes. Among the applications of PTR-MS based classification in food science and technology, we can cite the detection of the effect of different pasteurisation processes of fruit juices [3], the classification of strawberry cultivars [4] or the characterisation of Italian 'Grana' cheeses [5]. Here PTR-MS is used to produce a fingerprint of each sample in the form of a spectrum vector whose components are the intensities of the spectrometric peaks
at different m/z ratios. Each PTR-MS spectrum can contain up to 500 m/z values. Although this is a relatively low number compared with other spectrometric or spectroscopic approaches, the number of analysed samples per class is usually low in experimental practice, introducing issues similar to the ones faced in the classification of high-throughput microarray and proteomic data. Moreover, due to the absence of separation, each peak in the spectrum can be related to one or more compounds. The identification of small sets of relevant features for the food product under analysis is of interest for several operative reasons: in particular, we focus on identifying a few relevant 'quality' markers that can be measured in a simple, fast and cheap way, or on concentrating on a few relevant masses the identification efforts needed to compensate for the lack of separation. There are indications that PTR-MS features can be related to genetic aspects [4] or to sensory characteristics of food [6]; classification based on PTR-MS data could thus provide a tool to better investigate these fields, possibly providing a link between sensory properties and genetics.

In this application domain, we introduce instruments from the recent feature selection literature [7,8,9]. As a general taxonomy, the feature selection mechanism may be implemented inside the learning algorithm in embedded methods, while wrapper methods directly consider the classifier outputs, as in a black-box approach. In both cases, care is required to avoid overfitting during the selection process (the selection bias problem [10]), particularly in real applications on small datasets. The use of resampling methods within a complete validation setup is a typical strategy to avoid these problems [11]. The SVM-RFE algorithm [9] introduced a ranking of the features within Support Vector Machine classifiers by Recursive Feature Elimination (RFE). This strategy has found several applications in bioinformatics [12] and also in Quantitative Structure-Activity Relationship (QSAR) studies [13]. SVM-RFE is often used in practice with linear SVMs, and it can easily be extended from binary to multiclass classification problems. We developed the alternative RF-RFE method [14], which basically replaces the SVM with Breiman's Random Forest (RF) [15] at the core of the RFE method. RF is a natural multiclass algorithm with an internal unbiased measure of feature importance, and we use this internal measure to rank masses by their relevance for discrimination. In this paper, we apply the two feature selection and classification methods and compare their performance in indicating highly discriminative masses for PTR-MS multiclass data.

A usually neglected problem in feature selection is the instability of the selection process [16]. Quite different ranked lists of features may be obtained for classifiers developed on slightly different data replicates, as typically observed in functional profiling from microarray data. In the last part of this paper we compare the stability of the two versions of the RFE algorithm.

The article is organized as follows: in Section 2 we describe the full feature selection schemes for RF-RFE and SVM-RFE. In Section 3 we compare both methods on the four real PTR-MS datasets and analyse the stability of the solutions. Finally, we draw some conclusions in Section 4.
2 The Feature Selection Setup
A feature selection method that uses (in any way) information about the targets may lead to overfitting, in particular with the very low samples-to-features ratios typical of spectrometric experiments. Thus, in order to obtain unbiased estimates of the prediction error with small PTR-MS datasets, feature ranking and selection should be included in the modelling, and not treated as a pre-processing step; moreover, we need to appropriately decouple selection from error estimation [10].
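As an illustration of this requirement, the following is a minimal sketch in Python (not taken from the paper; the toy data shapes and the helper name rank_features are ours): the essential constraint is that any supervised ranking must see only the training part of each split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def rank_features(X, y):
    """Stand-in supervised ranking: order features by RF importance."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1]

# Toy data with PTR-MS-like proportions (few samples, many peaks).
rng = np.random.default_rng(0)
X, y = rng.random((60, 200)), rng.integers(0, 4, 60)

# Biased protocol (do NOT do this): ranking on the full data before splitting
# lets information about the test samples leak into the selected subset.
# Unbiased protocol: split first, then rank and model on the training part only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
top10 = rank_features(X_tr, y_tr)[:10]
model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr[:, top10], y_tr)
test_error = 1.0 - model.score(X_te[:, top10], y_te)
```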
Fig. 1. The computational setup used for the feature selection process (for each replicate i: learning set, feature selection with RF-RFE or SVM-RFE, selected feature subset, RF/SVM modeling, test error on the corresponding test set).
We use a computational setup consisting of two nested processes. The outer loop performs n random splits of the dataset into a training set (used to develop the models, including the feature selection step) and a test set, used to estimate the accuracy of the models. The inner process (Figure 1) supports the selection of nested subsets of features and the development of classifiers over these subsets, using only the learning subset provided by the outer loop. The results of the n replicated experiments are then aggregated to obtain a comprehensive feature ranking and accuracy estimation.

The RFE selection method [9] is basically a recursive process that ranks features according to some measure of their importance. At each iteration the feature importances are measured and the least relevant feature is removed. The (inverse) order in which features are eliminated is used to construct the final ranking. The feature selection process itself then consists simply in taking the first k features from this ranking.

The original SVM-RFE method was developed to select features in binary classification problems. Among the various strategies for solving multiclass problems with binary classifiers [17,18], we choose the One-vs-One method to extend SVM-RFE to handle multiclass datasets.
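A minimal sketch of this RFE backbone follows (our paraphrase in Python, not the authors' code; importance_fn stands for any importance measure, such as the SVM weights or the RF importances described below).

```python
import numpy as np

def rfe_ranking(X, y, importance_fn):
    """Generic Recursive Feature Elimination.

    importance_fn(X, y) must return one importance score per column of X.
    The least important feature is dropped at each iteration, and the
    inverse elimination order gives the final ranking (best feature first).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > 1:
        scores = importance_fn(X[:, remaining], y)
        worst = remaining[int(np.argmin(scores))]
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining + eliminated[::-1]

# Selecting the k most relevant features then amounts to:
#   selected = rfe_ranking(X_train, y_train, importance_fn)[:k]
```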
Table 1. Details of the four datasets. The columns 'min #' and 'max #' show the minimum and maximum number of samples per class in the corresponding dataset. The last column shows the number of production years included in the dataset.

Dataset       min m/z   max m/z   Samples   Classes   min #   max #   Years
Strawberry       20       250       233        9        21      30       3
Raspberry        20       250        92        5        17      19       2
Nostrani         20       259        48        6         8       8       1
Grana            20       259        60        4        15      15       1
In this case, a problem with c classes is decomposed into p = c(c - 1)/2 binary problems. To solve each problem we train a linear SVM [19], obtaining p decision functions

\[ D_i(\mathbf{x}) = \mathbf{x} \cdot \mathbf{w}_i , \qquad i = 1, \dots, p \tag{1} \]

The weight vectors \mathbf{w}_i corresponding to all binary problems are then averaged,

\[ \mathbf{W} = \frac{1}{p} \sum_{i=1}^{p} \mathbf{w}_i \tag{2} \]
and the components of W are used for ranking the features. In all our experiments we use a fixed value of C = 100, following [9]. We performed a series of experiments on the PTR-MS datasets using different C values, finding that on our particular data the results are almost independent of the value of C. It must be noted that other datasets could require a full tuning of this parameter [20].

In a previous work [14] we introduced Random Forest - Recursive Feature Elimination (RF-RFE). We showed how RF's internal measure of feature importance can replace the SVM weights for feature ranking. Also, since RF uses the Out-of-Bag subsets to estimate the importances, the computational effort is not increased. Moreover, RF was developed as a multiclass algorithm, which suggests that it could provide a better measure of importance for this kind of problem.
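The two importance measures can be sketched as follows. This is a hedged sketch based on Eqs. (1)-(2) and on scikit-learn conventions, not the original implementation: for a linear SVC the one-vs-one weight vectors are exposed in coef_, and ranking by the magnitude of the averaged weights is our reading of "the components of W"; note also that scikit-learn's feature_importances_ is the impurity-based measure rather than the Out-of-Bag permutation importance used in [14,15].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def svm_ovo_importance(X, y, C=100.0):
    """Multiclass SVM-RFE importance following Eqs. (1)-(2).

    With a linear kernel, SVC solves the c(c-1)/2 one-vs-one problems and
    stores one weight vector per pair in coef_ (shape: p x n_features).
    We average them (Eq. 2) and rank by magnitude.
    """
    svc = SVC(kernel="linear", C=C).fit(X, y)
    W = svc.coef_.mean(axis=0)
    return np.abs(W)

def rf_importance(X, y, n_trees=500):
    """RF-RFE importance: Random Forest's internal feature importance."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    return rf.feature_importances_
```

Either function can be plugged into the rfe_ranking sketch above as importance_fn.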
3 Results
We considered four datasets. The first two refer to the cultivar characterization of berry fruits (Strawberries [4] and Raspberries) and the last two to the typicality assessment of cheeses (Nostrani [21] and Grana [6]). All products come from Trento Province, North Italy, or other places in the same area. Table 1 shows the details of each dataset. In all cases the headspace composition of the samples was measured by direct injection into a PTR-MS apparatus (experimental details can be found in previous papers [3,4]). Each sample was then associated with its PTR-MS spectrum, normalised to unit total area.
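As a small sketch of this preprocessing step (variable names are ours; we assume each raw spectrum is stored as a vector of peak intensities over the measured m/z range):

```python
import numpy as np

def normalise_to_unit_area(raw_spectra):
    """Normalise each PTR-MS spectrum so that its peak intensities sum to one.

    raw_spectra: array of shape (n_samples, n_peaks).
    """
    raw_spectra = np.asarray(raw_spectra, dtype=float)
    totals = raw_spectra.sum(axis=1, keepdims=True)
    return raw_spectra / np.where(totals > 0, totals, 1.0)
```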
Fig. 2. Mean classification errors for SVM-RFE and RF-RFE on the Strawberry dataset. Bars show one standard deviation evaluated over 100 replications.

Fig. 3. Mean classification errors for SVM-RFE and RF-RFE on the Raspberry dataset. Bars show one standard deviation evaluated over 100 replications.
In all cases we replicated the feature selection process over n = 100 runs. For each run, we split the dataset at random into train and test sets in a 75%/25% proportion, stratifying on class frequencies. The train set is used by RF-RFE and SVM-RFE to select features and to develop models, which are then evaluated on the test set. It is important to note that the 100 runs are not completely independent of each other, because there is considerable overlap among any pair of train sets or any pair of test sets. The results obtained from this kind of replicated experiment on wide datasets therefore usually do not have statistical significance; they are only strong indications of the expected behaviour of the different methods.
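A sketch of this replication scheme follows (our reading of the protocol, not the original code; ranking_fn stands for a full RFE ranking such as the rfe_ranking sketch above, and the aggregated means and standard deviations correspond to the curves in Figures 2 to 5).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

def replicated_errors(X, y, ranking_fn, subset_sizes, n_runs=100):
    """Repeat the split / select / model / test cycle and aggregate test errors."""
    splitter = StratifiedShuffleSplit(n_splits=n_runs, test_size=0.25, random_state=0)
    errors = {k: [] for k in subset_sizes}
    for train_idx, test_idx in splitter.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        ranking = ranking_fn(X_tr, y_tr)   # feature selection on the train set only
        for k in subset_sizes:
            cols = ranking[:k]
            model = RandomForestClassifier(n_estimators=500).fit(X_tr[:, cols], y_tr)
            errors[k].append(1.0 - model.score(X_te[:, cols], y_te))
    return {k: (np.mean(v), np.std(v)) for k, v in errors.items()}
```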
Fig. 4. Mean classification errors for SVM-RFE and RF-RFE on the Nostrani dataset. Bars show one standard deviation evaluated over 100 replications.

Fig. 5. Mean classification errors for SVM-RFE and RF-RFE on the Grana dataset. Bars show one standard deviation evaluated over 100 replications.
3.1 Modeling Error
In Figure 2 we compare both selection methods on the Strawberry dataset. We show mean classification errors (± one standard deviation) for RF and SVM models fitted on subsets of different sizes, selected with the corresponding RFE methods.
Table 2. Mean inter-quartile distances (MIQ) (×10²) for different models and subset sizes. The first two columns show the RF and SVM MIQ for the 10 features with the highest median ranking. The following columns show the same results for 50 m/z values and for the full sets.

                  10 m/z values     50 m/z values     All m/z values
Dataset            RF      SVM       RF      SVM       RF      SVM
Strawberry         1.4     1.9       3.6     5.7      10.9     9.7
Raspberry          2.5     5.5      12.1    10.0      30.3    17.0
Nostrani           3.9     4.9      10.4    13.2      21.6    19.0
Grana              4.6     6.4      18.3    19.1      30.8    26.3
In this case RF-RFE clearly outperforms SVM-RFE when using only a few masses. Both methods have a similar behaviour for more than 11 features, reaching their minimum modeling error with around 35 features. This minimum error level is very similar for both methods, with a small edge for SVM.

In Figure 3 we show the corresponding results for the Raspberry dataset. Here RF-RFE shows lower mean errors than SVM-RFE for all subset sizes. The differences between the two methods are the largest among the four datasets under evaluation.

The same analysis was repeated for the cheese datasets (Figures 4 and 5). The results for the Nostrani dataset are similar to the Raspberry ones: RF-RFE again shows lower mean errors than SVM-RFE for all subset sizes, and it reaches the minimum mean error with about 50 features. For the Grana dataset (Figure 5), both methods show the same behaviour for small subsets, but in this case SVM-RFE outperforms RF-RFE with larger subsets, reaching the minimum mean error for subsets of 17 features.

3.2 Complexity
Both algorithms showed comparable execution times. In our experiments on PTR-MS datasets, RF-RFE was in all cases slightly heavier than SVM-RFE, with running times ranging from 1.2 to 1.9 times those of SVM-RFE, but these ratios are clearly problem-dependent. Also, any tuning of the C parameter of the SVMs would increase the running time of SVM-RFE considerably.

3.3 Stability
As feature selection methods are unstable [16], each replicate of the selection process gives a different ranking. This means that the error levels shown in
Figures 2 to 5 are only indications of the expected behaviour of both methods, which cannot be associated with any particular subset. Of course, a higher stability of the selection method helps in the identification of the most relevant features, because in that case the rankings are more similar. In order to measure the stability of both methods, we assign relative ranking positions to each feature on a linear scale between 1 (first) and 0 (last). An ideal (totally stable) selection method would return the same value for each feature in all replicates; at the opposite extreme, a completely unstable method would return a random value in [0, 1]. Thus, the dispersion of the distribution of this relative ranking (measured over the 100 replications) is correlated with the instability of the selection method. In Table 2 we show the mean inter-quartile distances (MIQ) of these distributions evaluated on different subsets of features. For the 10 and 50 features most relevant for classification, the RF-RFE values are clearly smaller than the SVM-RFE ones; only in the Raspberry dataset with 50 features does SVM-RFE show a smaller MIQ than RF-RFE. For the full sets, the SVM selections are always more stable, but this fact has little influence on the selection of the most relevant m/z values.
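The stability measure can be computed as in the following sketch (our implementation of the description above; rankings holds one ranked feature list per replicate, and the optional n_top argument restricts the average to, e.g., the 10 or 50 features with the highest median relative position, as in Table 2).

```python
import numpy as np

def mean_interquartile_distance(rankings, n_features, n_top=None):
    """Mean inter-quartile distance (MIQ) of the relative ranking positions.

    rankings: list of arrays, each listing feature indices from most (first)
    to least (last) relevant in one replicate. Each feature gets a relative
    position on a linear scale from 1 (first) to 0 (last); a smaller MIQ over
    the replicates means a more stable selection method.
    """
    positions = np.empty((len(rankings), n_features))
    scale = np.linspace(1.0, 0.0, n_features)
    for r, ranking in enumerate(rankings):
        positions[r, np.asarray(ranking)] = scale
    q75, q25 = np.percentile(positions, [75, 25], axis=0)
    iqr = q75 - q25
    if n_top is not None:
        # restrict to the n_top features with the highest median position
        top = np.argsort(np.median(positions, axis=0))[::-1][:n_top]
        iqr = iqr[top]
    return float(np.mean(iqr))
```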
4 Conclusion
In this paper we used RF-RFE (coupled with replicated experiments) for feature selection on PTR-MS datasets, and compared it with SVM-RFE. Feature selection methods can be evaluated on at least two aspects: their capacity to find the smallest subset with a given error level, and their capacity to reach the minimum possible error regardless of the number of selected features. For the first task we showed that RF-RFE has similar or better performance than SVM-RFE on all four datasets. For the second one, RF-RFE showed similar or better performance than SVM-RFE on 3 out of the 4 datasets under analysis. Furthermore, we compared the stability of the selected features, a usually neglected aspect of the feature selection process, and showed that RF-RFE is more stable than SVM-RFE in selecting the most relevant features for discrimination. Overall, RF-RFE seems to be more appropriate than SVM-RFE for fingerprinting agroindustrial products with PTR-MS. Work in progress includes the use of other multiclass strategies or non-linear extensions of the SVM-RFE method, the analysis of more agroindustrial products and the identification of the compounds associated with the selected masses.
Acknowledgements

We acknowledge partial support for this project from ANPCyT, Argentina (PICT 643) and from the PAT projects MIROP, RASO, INTERBERRY, QUALIFRAPE and SAMPPA (Trento, Italy).
References

1. Hansel, A., Jordan, A., Holzinger, R., Prazeller, P., Vogel, W., Lindinger, W.: Proton transfer reaction mass spectrometry: on-line trace gas analysis at the ppb level. Int. J. Mass. Spectrom. Ion Procs. 149/150, 609–619 (1995)
2. Lindinger, W., Hansel, A., Jordan, A.: On-line monitoring of volatile organic compounds at ppt level by means of Proton-Transfer-Reaction Mass Spectrometry (PTR-MS): Medical application, food control and environmental research. Int. J. Mass. Spectrom. Ion Procs. 173, 191–241 (1998)
3. Biasioli, F., Gasperi, F., Aprea, E., Colato, L., Boscaini, E., Märk, T.D.: Fingerprinting mass spectrometry by PTR-MS: heat treatment vs. pressure treatments of red orange juice - a case study. Int. J. Mass. Spectrom. 223–224, 343–353 (2003)
4. Biasioli, F., Gasperi, F., Aprea, E., Mott, D., Boscaini, E., Mayr, D., Märk, T.D.: Coupling Proton Transfer Reaction-Mass Spectrometry with Linear Discriminant Analysis: a Case Study. J. Agr. Food Chem. 51, 7227–7233 (2003)
5. Boscaini, E., Van Ruth, S., Biasioli, F., Gasperi, F., Märk, T.D.: Gas Chromatography-Olfactometry (GC-O) and Proton Transfer Reaction-Mass Spectrometry (PTR-MS) Analysis of the Flavor Profile of Grana Padano, Parmigiano Reggiano, and Grana Trentino Cheeses. J. Agr. Food Chem. 51, 1782–1790 (2003)
6. Biasioli, F., Gasperi, F., Aprea, E., Endrizzi, I., Framondino, V., Marini, F., Mott, D., Märk, T.D.: Correlation of PTR-MS spectral fingerprints with sensory characterisation of flavour and odour profile of Trentingrana cheese. Food Qual. Prefer. 17, 63–75 (2006)
7. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
8. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324 (1996)
9. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn. 46, 389–422 (2002)
10. Ambroise, C., McLachlan, G.: Selection bias in gene extraction on the basis of microarray gene-expression data. P. Natl. Acad. Sci. USA 99, 6562–6566 (2002)
11. Furlanello, C., Serafini, M., Merler, S., Jurman, G.: Entropy-Based Gene Ranking without Selection Bias for the Predictive Classification of Microarray Data. BMC Bioinformatics 4, 54 (2003)
12. Ramaswamy, S., et al.: Multiclass cancer diagnosis using tumor gene expression signatures. P. Natl. Acad. Sci. USA 98, 15149–15154 (2001)
13. Li, H., Ung, C.Y., Yap, C.W., Xue, Y., Li, Z.R., Cao, Z.W., Chen, Y.Z.: Prediction of Genotoxicity of Chemical Compounds by Statistical Learning Methods. Chem. Res. Toxicol. 18, 1071–1080 (2005)
14. Granitto, P.M., Furlanello, C., Biasioli, F., Gasperi, F.: Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products. Chemometr. Intell. Lab. 83, 83–90 (2006)
15. Breiman, L.: Random Forests. Mach. Learn. 45, 5–32 (2001)
16. Breiman, L.: Heuristics of instability and stabilization in model selection. Ann. Stat. 24, 2350–2383 (1996)
17. Hsu, C.-W., Lin, C.-J.: A comparison of methods for multi-class support vector machines. IEEE T. Neural Networ. 13, 415–425 (2002)
18. Allwein, E., Schapire, R., Singer, Y.: Reducing Multiclass to Binary: A Unified Approach for Margin Classifiers. J. Mach. Learn. Res. 1, 113–141 (2000)
19. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
20. Huang, T.-M., Kecman, V.: Gene extraction for cancer diagnosis by support vector machines. Artif. Intell. Med. 35, 185–194 (2005)
21. Gasperi, F., Biasioli, F., Framondino, V., Endrizzi, I.: Ruolo dell'analisi sensoriale nella definizione delle caratteristiche dei prodotti tipici: l'esempio dei formaggi trentini / The role of sensory analysis in the characterization of traditional products: the case study of the cheese from Trentino. Sci. Tecn. Latt.-Cas. 55, 345–364 (2004)