Anal Bioanal Chem (2014) 406:2591–2601 DOI 10.1007/s00216-014-7677-z
RESEARCH PAPER
Application of data mining methods for classification and prediction of olive oil blends with other vegetable oils

Cristina Ruiz-Samblás · José M. Cadenas · David A. Pelta · Luis Cuadros-Rodríguez
Received: 30 October 2013 / Revised: 14 December 2013 / Accepted: 31 January 2014 / Published online: 28 February 2014 © Springer-Verlag Berlin Heidelberg 2014
Abstract The aim of this article is to study tree-based ensemble methods, new emerging modelling techniques, for the authentication of samples of olive oil blends, checking their suitability both for classifying the samples according to the type of oil used for the blend and for predicting the amount of olive oil in the blend. The performance of these methods has been investigated on chromatographic fingerprint data of olive oil blends with other vegetable oils, without needing either to identify or to quantify the chromatographic peaks. Different data mining methods—classification and regression trees, random forest and M5 rules—were tested for classification and prediction. In addition, these classification and regression tree approaches were also used for feature selection prior to modelling, in order to reduce the number of attributes in the chromatogram. The good outcomes obtained show that these methods yield interpretable models with much more information than traditional chemometric methods and, from a single chromatogram, provide valuable information for detecting which vegetable oil is mixed with olive oil and the percentage of oil used.

Keywords Authentication of edible oils · Quantification of olive oil · Data mining · Decision tree · Random forest · Rule-based system

C. Ruiz-Samblás · L. Cuadros-Rodríguez (*)
Department of Analytical Chemistry, University of Granada, c/ Fuentenueva, s.n., 18071 Granada, Spain
e-mail: [email protected]

J. M. Cadenas
Department of Information Engineering and Communication, Espinardo Campus, University of Murcia, 30100 Murcia, Spain

D. A. Pelta
Department of Computer Science and Artificial Intelligence, University of Granada, c/ Periodista Daniel Saucedo Aranda, s.n., 18071 Granada, Spain
Introduction

Food authentication is the process by which a food is verified as complying with its label description. There is a clear trend in the international market towards labelling products with information about their composition, quality and origin, which brings about the need to develop and standardize analytical methods to either confirm the information given by the label or uncover fraud. A common analytical strategy is to compare information from specimens to be authenticated with that obtained from collections of genuine specimens and to search for a marker substance of the adulterant which is not present, or is present at a different concentration, in the genuine product. A whole battery of methods, covering every facet of instrumental analytical chemistry, is nowadays available for the determination of marker substances. Very often, "hyphenated" methods, mostly an online combination of chromatographic and spectroscopic methods, are deployed to solve authentication problems [1]. In addition, pattern recognition methods and subsequent data evaluation by multivariate statistical methods are also important [2-7].

Multivariate chemometric tools have been playing an important role in analytical chemistry for data treatment in order to obtain interpretable results for pattern recognition [8]. The design of a pattern recognition system requires careful attention to the definition of pattern classes, feature (or attribute) extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance validation [9]. However, most modelling methods may not work well, or may even be invalid, when too many descriptors enter the model. A key reason is that overfitting usually occurs when the number of descriptors significantly exceeds the number of training samples or when high multicollinearity exists among the descriptors [10, 11].

The authentication of olive oil is a very interesting and important issue to be addressed, since nowadays the production of high-quality olive oils is strongly controlled by the European Union (EU) to avoid counterfeits and misinformation for consumers [12, 13]. The current EU requirements on this issue are specified in the recent Commission Implementing Regulation (EU) 29/2012 on marketing standards for olive oil, in relation to the control of the content of olive oil mixed with other vegetable oils, as well as for foodstuffs which have olive oil as an ingredient [14]. According to this regulation, the presence of more than 50 % olive oil has to be indicated on the label, but if the proportion is lower than 50 %, the name "olive oil" cannot be used on the label. Owing to these requirements, there is a real need for analytical methods capable of verifying compliance. Such methods have been complemented with data treatment methods based on classic chemometrics [15-17].

Current analytical techniques, chromatography among them, have great power to extract important and valuable information about a specific sample from an analytical signal. There are several ways to acquire a chromatographic signal, the difference among them being the cost in terms of time and the type of information that they can extract. In general, most applications use a data set derived from the raw analytical signal provided by the instrument, such as peak areas or concentrations [18]. The use of chromatographic fingerprints (recorded full chromatograms) for the development of the corresponding multivariate models is a very innovative approach, since it helps to highlight chemically relevant information and patterns present in the data [19]. Despite the suitability of the chromatographic fingerprint method for olive oil quantification purposes, it has not been used very extensively in chromatography, although some exceptions exist [20, 21].

Currently, there is a trend towards introducing new mathematical techniques more extensively in analytical chemistry for complex data analysis. The versatility of data mining methods makes them able to deal with the typical problems that arise when fingerprinting techniques are used: too many descriptors in the model, complex data having missing values, mixtures of different data types, multiple classes or unbalanced data sets. For instance, some data mining methods such as artificial neural networks (ANN) [22, 23], decision trees [24] and ensemble methods [25] have been applied in different realms of food authentication. Random forest (RF) methods have become increasingly popular in bioinformatics [26-30]. In this context, the RF method is also used as an attribute subset selection method for the identification of important descriptors, identifying and removing as many irrelevant and redundant descriptors as possible [31]. In addition, RF methods have been successfully applied in several areas having large data volumes, including spectroscopy [32, 33], chemoinformatics, quantitative structure–activity relationship modelling [34, 35] and omics sciences [36, 37]. However, they have not been used much to solve food authentication problems from analytical chemical data.
In this study, advanced data mining methods are applied to chromatographic fingerprints of blends of olive oil with other edible vegetable oils with the goal of authentication. The samples were analysed by gas chromatography coupled to mass spectrometry, and the whole raw chromatograms were used for data treatment. The first aim of this contribution is to check the suitability of these computational learning techniques for classifying the blends according to the vegetable oil used for blending; the second is to predict the proportion of olive oil used in each blend on the basis of regression.
Data mining techniques in chemometrics

The "classic" chemometric approaches are being improved with data mining methods. Data mining refers to the process of discovering previously unknown useful patterns in large volumes of data by extracting non-evident information from a data set and then transforming it into an interpretable structure for later use. Methods from statistics, computational intelligence and database systems are applied. The new modelling techniques are not comparable to traditional chemometric ones, which have a solid theoretical foundation in statistics. However, their advantages include good predictive capability and a balanced treatment of all attributes that helps to counter overfitting. A review of the historical development of data mining and of the existing state-of-the-art tools has been published recently [38]. For more extensive information on the principles, techniques and applications of data mining, the reader is referred to [39-42].

Supervised data mining commonly emphasizes the principles and algorithms for constructing predictive models, e.g. for classification or regression [43], where the quality of a model is assessed in terms of predictive accuracy [44]. In addition, the model should yield rules (i.e. the relationships discerned from the data) that can be made interpretable and understandable to human decision makers [45]. Initially, the supervised data mining methods followed two parallel pathways of development, mainly for classification purposes: ANN and decision trees [39]. ANN sought to express a nonlinear function directly by assigning weights to the input attributes (the "cause"), accumulating their effects and "reacting" to produce an output value (the "effect") following some sort of decision function. Decision trees were concerned with expressing the effects directly, developing methods to find "rules" that could be evaluated to separate the input values into one of several "bins" without having to express the functional relationship directly. A significant advantage of both methods is that they are non-parametric, in the sense that they do not rely on assumptions about the data distribution.
Over the past decade, together with ANN, other learning methods such as support vector machines (SVM) have been proposed for chemical studies on complex data analysis and have attracted the attention of the chemometrics community [46], both as a classification technique [36, 47-49] and, by extension, as a successful tool for solving calibration problems [36, 49, 50]. SVM is a discriminative classifier formally defined by a separating hyperplane. It takes a set of input data and predicts, for each given input, which of two possible classes forms the output; the algorithm yields an optimal hyperplane which categorizes new examples. In particular cases, both methods perform better than linear methods owing to the presence, in practice, of nonlinear effects (such as temperature variation, baseline drift and multicollinearity) in the analytical instrumental signal, which decrease the accuracy of linear methods. In the following sections, the features of the methods used in this contribution will be briefly outlined.

Decision trees

Recently, decision trees have been receiving increasing attention and are becoming more popular in the treatment of chemical data [40, 46, 51]. They can model linear as well as nonlinear relationships and, in addition, they are interpretable, easy to understand and fast to build. Decision trees are recursive algorithms implementing a divide-and-conquer strategy. They follow a hierarchical structure in which, at each level, a test with two or more possible outcomes is applied to one or more attribute values. Classification of an instance (a sample) starts at the root of the tree. The instance is evaluated at a node and takes the branch appropriate to the outcome. The classification is represented by the leaves. If the decision tree is built recursively on the training set until each leaf is totally classified, there is a great risk of obtaining an overfitted model. To avoid overfitting, part of the training data is set aside and used to test the decision tree; the branches that give poor predictions are then pruned [42]. The process may continue hierarchically through several internal nodes until a final leaf is reached, at which point it is asserted that the instance or sample belongs to a particular class [52]. Model trees are built repeatedly, and the best rule at each iteration is selected. The attractiveness of decision trees resides in their immediate conversion to rules of the "if … then …" type, which are adequate for interpretation by decision makers. The foremost advantages of decision trees are that they not only build a readily interpretable model but also perform automatic stepwise attribute selection and complexity reduction. However, they can have low prediction accuracy, especially for regression purposes [53].
In the structure of a tree, each node is either a leaf indicating a class or an internal decision node that specifies some test to be conducted on a single attribute value. For regression trees, each leaf represents the average value of the instances that it covers, whereas for model trees, each leaf is a regression model. Examples of decision-tree-based methods include classification and regression trees (CART) and M5. CART and SVM were identified in 2006, by the IEEE International Conference on Data Mining, as being among the top ten data mining algorithms [54, 55].

CART was proposed by Breiman et al. [56] in 1984. The goal of CART is to explain the response by selecting some useful attributes from a large pool of attributes. CART is a non-parametric algorithm, used for linear and nonlinear local fitting with categorical (classification) or continuous (regression) attributes. It consists of three steps: (1) generating the fully grown tree, (2) pruning it back and (3) using cross-validation to find its correct size. It starts by subdividing the regression space into small regions and partitioning the subdivisions again (recursive partitioning). At each partition, it tries all of the attributes to test the quality of the split and selects the attribute with the best performance for local modelling, i.e. the one that most reduces a chosen error function, such as the sum of the squared differences between the observed values and the sample mean in that partition. The splitting continues until the sum of the squared differences of all partitions reaches some threshold (fully grown tree) [57].

The first implementation of model trees, M5, was rather abstractly defined by Quinlan [58] in 1992, and the idea was reconstructed and improved in a system called M5P (from "M5 prime", M5′) [59]. The M5P algorithm builds regression trees whose leaves are composed of multivariate linear regression models, instead of discrete values, and the nodes of the tree are chosen over the attribute that maximizes the expected error reduction as a function of the standard deviation of the output parameter. The construction of the tree requires three steps. The first step generates a regression tree using the training data and calculates a linear model (using linear regression) for each node of the tree. The second step tries to simplify the regression tree generated in the previous step (first post-pruning), deleting from the linear models those attributes whose removal does not increase the error. The aim of the third step is to reduce the size of the tree without reducing its accuracy (second post-pruning). To increase efficiency, M5P performs the last two steps at the same time, so that the tree is parsed only once. This simplifies the number of nodes as well as the nodes themselves [60].
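To make the grow-and-prune idea concrete, the following minimal sketch (in Python, assuming scikit-learn is available; its tree module implements a CART-style algorithm) fits a classification tree and a regression tree to simulated data standing in for the 62 × 1,007 chromatographic matrix. All names and values are illustrative, not those of the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((62, 1007))             # rows: blend samples; columns: chromatogram points
y_class = rng.integers(0, 5, size=62)  # five vegetable-oil classes (hypothetical labels)
y_conc = rng.uniform(10, 90, size=62)  # percentage of olive oil in each blend

# Classification tree: grown fully, then cut back by cost-complexity pruning,
# mirroring the grow-and-prune strategy described above
clf = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y_class)

# Regression tree: each leaf predicts the mean of the training instances it covers
reg = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y_conc)

print("tree depth:", clf.get_depth())
print("predicted olive oil %:", reg.predict(X[:3]))
```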
Random forest

The RF method belongs to the class of ensemble methods. The basic idea behind an ensemble method is to use a set of individual classifier models simultaneously, then combine their outputs to return a decision (using, e.g., a majority rule). In this way, the ensemble can outperform its individual constituent models [61-64]. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. RF was first introduced in 2001, in an article by Breiman [65]. RF is an ensemble learning method for classification (and regression) that operates by constructing a multitude of individual unpruned decision trees (models) at training time (a "decision forest" [53]), using bootstrap sampling of the training data and random selection of attributes. The output of an RF is a prediction obtained by appropriately combining the predictions of the individual trees. Thanks to this ensemble nature, the accuracy of an RF model improves significantly over that of a single tree. The RF method shows significant advantages over techniques such as SVM and ANN [66]: (1) it is faster to train, while achieving the same accuracy; (2) it is more interpretable (attribute relevance can be estimated during training with little additional computation, sample proximities can be plotted, and the output decision trees can be visualized); (3) it readily handles larger numbers of predictors; (4) it has fewer parameters; and (5) cross-validation is unnecessary, because it generates an internal unbiased estimate of the generalization error (test error) as the forest building progresses. The potential of RF for modelling linear and nonlinear multivariate calibration allows it to be used for attribute selection too, with two different objectives: (1) to find the subset of attributes with the minimum possible generalization error and (2) to select the smallest possible subset with a given discrimination capacity. A good selection of attributes improves the prediction accuracy, facilitates the interpretation of complex data structures and reduces the calculation time for predictors [67-69].
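The hedged sketch below illustrates the ensemble mechanics just described (bootstrap sampling, random attribute subsets and the internal out-of-bag error estimate), again with scikit-learn and simulated data; it is not the FRF_fs implementation used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((62, 1007))            # simulated chromatographic matrix
y = np.repeat(np.arange(5), 13)[:62]  # five balanced oil classes

forest = RandomForestClassifier(
    n_estimators=100,     # many unpruned trees form the "decision forest"
    bootstrap=True,       # each tree is trained on a bootstrap sample
    max_features="sqrt",  # random attribute subset considered at every split
    oob_score=True,       # internal, unbiased generalization-error estimate
    random_state=1,
).fit(X, y)

# No separate cross-validation is needed for a first error estimate:
print("OOB accuracy:", forest.oob_score_)
# Attribute relevance comes almost for free during training:
print("most relevant attribute:", int(np.argmax(forest.feature_importances_)))
```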
Rule-based systems

Decision trees can be converted into a set of mutually exclusive rules organized in a hierarchical structure. Each rule corresponds to a leaf of the tree and is established by taking into account all the conditions on the path to that leaf. Rules are used to support decision making in classification and regression [41]. Rule-based systems are fairly simplistic, consisting of little more than a set of "if … then …" statements. However, they are really only feasible for problems whose knowledge can be written in the form of such rules and whose problem area is not large. If there are too many rules, the system can become difficult to maintain and can suffer from performance degradation. CART and M5 are two commonly used rule induction, or decision tree induction, algorithms. M5 rules is an algorithm for inducing simple, accurate decision lists from model trees. Like conventional decision tree learners, M5P builds a tree by splitting the data on the basis of the values of predictive attributes; however, instead of selecting attributes by an information-theoretic metric, M5P chooses attributes that minimize intra-subset variation in the class values of instances that go down each branch.
Experimental framework

Samples

The samples were formed by blending different kinds of edible vegetable oil, i.e. olive oil (including four categories: extra virgin, virgin, olive oil and pomace oil), with sunflower oil (SUN), corn oil (COR), sesame oil (SES), soya oil (SOY) and seed oil (SEE). All the samples were purchased in Spain and France and were stored in dark bottles at −4 °C until they were analysed. The blend samples were prepared by mixing one olive oil with one vegetable oil in different proportions, resulting in five different concentrations from 10 to 90 % (w/w). The samples were analysed by gas chromatography and mass spectrometry. More details relating to the sample composition, experimental set-up and analytical conditions can be found in [21].

Chemometrics tools

The chromatographic data were exported from the instrument software to MATLAB® version 7.8.0 R2009a (The MathWorks, Natick, MA, USA) and PLS_Toolbox 7.0 (Eigenvector Research, Wenatchee, WA, USA) for data preprocessing. All the data mining treatment was performed using the FRF_fs algorithm [61, 67] and the algorithms provided in the Weka package (http://www.cs.waikato.ac.nz/ml/weka/index.html).

Data sets

A data set can be understood as a matrix with as many rows as examples (oil samples, in our case) analysed and as many columns as the number of data points in the entire chromatogram recorded during the acquisition time. Clearly, the signal maxima are the heights of the different chromatographic peaks. Three different data sets were generated, associated with the different tasks to be performed:

1. Data set for training the classification model (BDOIL-CLAS). The matrix was composed of 62 examples (rows) and 1,007 attributes (columns), all of them numerical. The 1,008th attribute corresponds to the five different classes of vegetable oil used (other than the olive oil), namely SUN, COR, SES, SOY and SEE, where the label did not specify what kind of seeds were used.
2. Data set for training the regression model (BDOIL-REGR). The matrix was composed of 62 examples (rows) and 1,007 attributes (columns), all of them numerical. The 1,008th attribute corresponds to the percentage of olive oil in each sample, with five different concentrations from 10 to 90 %.

3. Data set for validation (BDOIL-REGR-TEST). The matrix was composed of 16 examples with the same structure as BDOIL-REGR.

Before any data treatment, the whole chromatograms were preprocessed, an indispensable step when the data points of the whole chromatogram are used to acquire useful information. Preprocessing not only eliminates unnecessary sources of variance, which would otherwise impede the analysis, but can also be used to remove baseline contributions and to align peaks so as to eliminate retention time drift from run to run [70]. The chromatograms were baseline-corrected with a penalized asymmetric least squares algorithm [71]. Peak shifting was corrected with interval correlation optimized shifting (iCoshift) [70], and finally the chromatograms were normalized and mean-centred in order to remove the variability related to the overall offset. A detailed description of the preprocessing applied can be found in [21].

Measuring classification and prediction quality

Next, we describe how a classification task is performed and evaluated. Cross-validation was applied to evaluate the prediction reliability of the classification methods. The process is as follows. The given data set E is split into k sets of examples, C1, …, Ck. Then, for each i, a training set Di = E − Ci is built, and the precision of the model obtained from Di is tested on the examples in Ci (Di and Ci are the training and test sets, respectively). The final precision of the method is estimated by averaging the precision over the k cross-validation trials [44]. In this study, a fivefold cross-validation repeated three times (3×5-fold cross-validation) was used. When a classification is performed, the following features are evaluated:

- True positive (TP) rate: the proportion of examples classified as class x among all examples which truly are of class x, i.e. how much of the class was captured.
- False positive (FP) rate: the proportion of examples classified as class x, but belonging to a different class, among all examples which are not of class x.
- True negative rate and false negative (FN) rate: the counterparts of the previous definitions.
- Precision rate (or positive predictive value): TP/(TP + FP). This reflects the probability that a positive example reflects the underlying condition being tested for.
- Sensitivity (or recall): TP/(TP + FN). This reflects the capability of a method to correctly classify a positive sample. It corresponds to the TP rate.
- F-measure: (2 × precision × sensitivity)/(precision + sensitivity).
Receiver operating characteristic (ROC) curves are also used to evaluate the "power" of a classification method for different asymmetric weights [72, 73], as in our case. Since the area under the ROC curve (denoted by AUC) is a portion of the area of the unit square, its value will always be between 0.0 and 1.0. A realistic classifier should not have an AUC lower than 0.5 [the area under the diagonal line between (0,0) and (1,1)]. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [73]. To evaluate the precision of the regression methods, two data sets are provided: a training set and a test set. The former is used to construct a regression model, which is then used to infer a value for the examples in the latter set. For estimating the quality of the regression, the following measures were used: (1) the mean absolute error, which is a risk function corresponding to the expected value of the absolute error loss; (2) the mean square error, which is a risk function corresponding to the expected value of the squared error loss; and (3) the Pearson correlation coefficient, which is a measure of the linear dependence (correlation) between two attributes, giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation and −1 is total negative correlation.
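As an illustration, the measures defined above can be computed as in the following sketch (scikit-learn and NumPy assumed; the label vectors are toy values, not data from this study).

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, mean_absolute_error,
                             mean_squared_error)

# Classification: true and predicted classes, plus scores for the ROC curve
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.1, 0.6])

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("sensitivity:", recall_score(y_true, y_pred))   # TP / (TP + FN)
print("F-measure:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))         # area under the ROC curve

# Regression: mean absolute error, root mean square error, Pearson r
t = np.array([10.0, 30.0, 50.0, 70.0, 90.0])          # true olive oil %
p = np.array([12.0, 28.0, 55.0, 66.0, 91.0])          # predicted olive oil %
print("MAE:", mean_absolute_error(t, p))
print("RMSE:", np.sqrt(mean_squared_error(t, p)))
print("Pearson r:", np.corrcoef(t, p)[0, 1])
```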
Results

Classification models to identify the vegetable oil

Obtaining a subset of attributes

From a computational learning point of view, data sets with few examples often pose problems. For instance, with just one classifier it is difficult to make accurate predictions, because small data sets may induce overfitting, thus causing problems in the prediction of specific correlations between inputs and outputs. Nevertheless, an exact relation between the probabilities of classification errors, the number of training examples, the number of attributes and the real parameters is needed in order to obtain reliable models. In such a situation, one can find a decision function that separates the training samples well but performs poorly on the test data. To overcome these difficulties, attribute selection methods can be applied to partially reduce the redundancy in the descriptors. As a general rule, a minimum of 10·|A|·|C| training examples is required for an |A|-dimensional classification problem with |C| classes [9]; for instance, a problem with nine attributes and five classes would call for about 450 training examples. Following this rule, when data sets have few examples (small data sets), the learning algorithm should incorporate some measure to build better models and obtain good classification reliability.

After the preprocessing of the chromatogram data, the FRF_fs method [67] for attribute selection, based on the RF ensemble approach, was applied to the BDOIL-CLAS data set. After this treatment, nine attributes out of 1,007 were selected: 157, 159, 192, 196, 197, 409, 410, 759 and 771. Figure 1 shows the location of these attributes (vertical lines) among all the variables which form the whole chromatogram. The method also ranks them in terms of relevance, the following order being obtained: (1) 196, (2) 192, (3) 197, (4) 157, (5) 159, (6) 771, (7) 759, (8) 410 and (9) 409.

As shown in Fig. 1, the chosen attributes match, in the chromatogram, the three main groups of chromatographic peaks of triacylglycerols. For instance, the first five attributes belong to the group of small peaks at the beginning of the chromatogram, which are the ones with lower numbers of carbons. This means that they contain most of the modelling information in the chromatogram. Attributes 409 and 410 belong to the first peak in the second group, and attributes 771 and 759 match the last part of the chromatogram. Curiously, the attributes selected do not correspond exactly to the chromatographic peaks but are found in the valleys. In addition, although attributes 409 and 410 are next to each other and, from the chemical point of view, seem unable to provide any differentiation in the chromatogram, they have been included in the combination of attributes selected because, for the computational learning process, they are very different.

Fig. 1 The selection of attributes with more information for building the models on the whole chromatogram fingerprint. The vertical lines indicate the places where the selected attributes are located (157, 159, 192, 196, 197, 409, 410, 759 and 771)
Consequently, the models will be built both without and with attribute selection. This selection of attributes is not unique: depending on the method used for attribute selection, other attributes could be selected.
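The actual selection was performed with the authors' FRF_fs algorithm, which is not reproduced here; the sketch below is a plain random forest stand-in that ranks attributes by impurity-based importance and keeps the nine best, simply to illustrate the workflow on simulated data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.random((62, 1007))            # BDOIL-CLAS-like matrix (simulated)
y = np.repeat(np.arange(5), 13)[:62]  # five oil classes

forest = RandomForestClassifier(n_estimators=500, random_state=2).fit(X, y)
order = np.argsort(forest.feature_importances_)[::-1]
selected = np.sort(order[:9])         # keep the nine highest-ranked attributes
print("selected attribute indices:", selected)

X_reduced = X[:, selected]            # 62 x 9 matrix for the subsequent models
```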
Classification with the RF method

Case 1: using the full set of attributes. Firstly, a model based on RF with 100 trees was constructed. As mentioned before, 3×5-fold cross-validation was used to validate the outcomes. According to the results obtained, the average precision rate of the classifier was 95.70 % (around 59 examples were correctly classified). The errors occurred in the classes SUN, COR and SEE: one sample of class SUN was classified as SEE, one sample of class COR was classified as SEE, and one sample of class SEE was classified as COR. All the examples of classes SES and SOY were classified correctly. The AUC of the classifier for every class was higher than 0.993, which means an almost perfect level of classification. These results allow one to conclude that the classifier has a very high level of precision.

Case 2: using the reduced set of attributes. The same approach for classification was applied to the data set of the nine attributes selected previously by the RF-based attribute selection method mentioned before. As in the previous case, RF (100 trees) with 3×5-fold cross-validation was used to validate the outcomes. The results obtained are shown in Table 1.
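A sketch of this validation scheme is shown below (scikit-learn assumed, data simulated): a 100-tree forest evaluated by fivefold cross-validation repeated three times, giving 15 train/test trials whose precision is averaged.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.random((62, 9))               # e.g. the nine selected attributes
y = np.repeat(np.arange(5), 13)[:62]  # five oil classes

# Fivefold cross-validation repeated three times = 15 train/test trials
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=3)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=3),
    X, y, cv=cv, scoring="accuracy",
)
print(f"mean precision over {len(scores)} trials: {scores.mean():.3f}")
```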
Table 1 Random forest results for classification of the vegetable oils used in the blends

Class     TP rate   FP rate   Precision   F-measure   ROC area
SUN       1         0         1           1           1
COR       0.917     0         1           0.957       1
SEE       1         0.020     0.929       0.963       1
SES       1         0         1           1           1
SOY       1         0         1           1           1
Average   0.983     0.004     0.986       0.984       1

TP true positive, FP false positive, ROC receiver operating characteristic, COR corn oil, SEE seed oil, SES sesame oil, SOY soya oil, SUN sunflower oil
In this case, the precision rate of the classifier was 98.34 % (around 61 examples were correctly classified). Now the classifier performs better, with the advantage of working with only nine attributes instead of 1,007 as in the previous case. This means that the attributes which were removed were simply introducing "noise" into the classifier. The error appears in one sample of class COR, which is classified as SEE; the remaining classes are classified correctly. In this case, the AUC was 1 for all classes.

From the results of both cases, it can be concluded that the classification model based on RF has good characteristics for the two different data sets. However, it performed slightly better when attribute selection was first applied. For this reason, it makes sense to search for an interpretable classifier, where the decision process can be "human understandable". This is done in the next section.

Classification using the CART method: towards an interpretable model

Starting from the data set with nine attributes, CART was able to produce a human-interpretable classifier with a number of conditions that is easy to understand and follow. This classifier lets us easily understand the decision process when the examples are classified. The output of the method is the decision tree shown in Fig. 2. The binary decision tree works as follows. Given a new sample, we start from the root node. Each branch of the tree represents a condition, and we go down the tree following those branches where the condition is true. The process finishes when we reach a leaf, which gives us the class of the sample. In our case, we observe attribute R759 of the sample and determine whether R759 ≥ 33,738.2. If it is, then we decide that the sample belongs to class SES. Otherwise, we need to observe attribute R192 and check the new conditions. From a total of 62 examples, just two classification errors occurred: one example of the COR class was classified as SEE and one example of the SOY class was classified as SES. Overall, the precision rate was 96.77 %. The remaining samples were classified correctly. Additional measures concerning the classification behaviour of the model obtained are given in Table 2. It is important to note that, in the worst case, we just need to check the values of four attributes (157, 192, 409 and 759) to determine the class. This is a great improvement over standard classification techniques that may use the full set of attributes, or over techniques (such as ANN) that provide a black-box model.
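For illustration, scikit-learn can render a fitted CART-style tree as exactly this kind of nested "if … then …" text; the sketch below uses simulated data, so the thresholds it prints are not those of Fig. 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(4)
X = rng.random((62, 9)) * 40000       # nine selected attributes (simulated scale)
y = np.repeat(np.arange(5), 13)[:62]  # five oil classes

names = ["R157", "R159", "R192", "R196", "R197", "R409", "R410", "R759", "R771"]
tree = DecisionTreeClassifier(random_state=4).fit(X, y)

# Prints nested conditions such as "|--- R759 <= ..."; with the real data,
# the root split would be the paper's R759 >= 33,738.2 test.
print(export_text(tree, feature_names=names))
```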
Prediction of olive oil concentration

In this section, we will describe two rule-based regression models that are able to predict the olive oil concentration of a sample. The difference between the models lies in the structure of the rules used.
Fig. 2 Decision tree obtained by the classification and regression trees (CART) method for the BDOIL-CLAS data set. COR corn oil, SEE seed oil, SES sesame oil, SOY soya oil, SUN sunflower oil

Table 2 Classification and regression trees (CART) results for classification of the vegetable oils used in the blends

Class     TP rate   FP rate   Precision   F-measure   ROC area
SUN       1         0         1           1           1
COR       0.917     0         1           0.957       0.958
SEE       1         0.020     0.929       0.963       0.990
SES       1         0.020     0.929       0.963       0.990
SOY       0.917     0         1           0.957       0.958
Average   0.968     0.009     0.970       0.968       0.980
Fig. 3 M5 rules for predicting olive oil concentration
Prediction using the M5 rules method

The application of the method to the BDOIL-REGR data set led to a set of ten mutually exclusive rules that allows us to predict the olive oil concentration of a sample. The set is shown in Fig. 3, and its use is straightforward: evaluate the rules from top to bottom until the condition in the IF part is true. At that point, a prediction is obtained, either as a specific value or as the application of a linear expression.
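A decision list of this kind can be applied with a few lines of code; the sketch below (plain Python, with hypothetical placeholder rules rather than the actual ten rules of Fig. 3) shows the top-to-bottom evaluation.

```python
from typing import Callable, List, Tuple

# A rule pairs an IF condition with a prediction model (constant or linear)
Rule = Tuple[Callable[[dict], bool], Callable[[dict], float]]

rules: List[Rule] = [
    # IF R409 <= 12,000 THEN apply a linear expression (hypothetical values)
    (lambda s: s["R409"] <= 12000.0, lambda s: 0.002 * s["R157"] + 15.0),
    # Default rule: always fires if nothing above matched
    (lambda s: True, lambda s: 50.0),
]

def predict(sample: dict) -> float:
    """Return the prediction of the first rule whose IF condition holds."""
    for condition, model in rules:
        if condition(sample):
            return model(sample)
    raise ValueError("a decision list must end with a default rule")

print(predict({"R409": 9500.0, "R157": 21000.0}))   # first rule fires
print(predict({"R409": 20000.0, "R157": 21000.0}))  # falls through to default
```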
Using just 17 attributes (out of 1,007), this set of rules allowed us to obtain the following results. When using the BDOIL-REGR data set, we obtained a mean absolute error of 0.007, a root mean square error of 0.091 and a correlation coefficient of 0.951; with the BDOIL-REGR-TEST data set, we obtained a mean absolute error of 0.148, a root mean square error of 0.178 and a correlation coefficient of 0.774.
Prediction using the M5P method

The application of the method to the BDOIL-REGR data set led to the decision tree and rules shown in Fig. 4. It can be observed that the model is more complex than before, but only 11 attributes (out of the 1,007 in the chromatogram) were now used for regression. Again, the model obtained shows great prediction ability. The results were as follows: for the BDOIL-REGR data set, we obtained a mean absolute error of 0.073, a root mean square error of 0.091 and a correlation coefficient of 0.948; for the BDOIL-REGR-TEST data set, we obtained a mean absolute error of 0.173, a root mean square error of 0.198 and a correlation coefficient of 0.664.

Fig. 4 M5P tree and rules for predicting olive oil concentration
Discussion

The results of these studies have proved that data mining techniques are very suitable for analytical data, so they can play an important role in modelling to extract useful information in the field of vegetable oils. In addition, if the results are compared with those obtained in [21] with the same data, they are very similar in terms of classification and prediction. However, the advantage of these techniques is that they yield interpretable models which provide much more information than conventional chemometrics, together with the possibility of building the model with very few attributes. This could be a great tool since, once the models are well developed, a single chromatogram of an unknown sample would make it possible to classify the vegetable oil used in the blend and, if desired, to know how much of that blend is olive oil, by just following easy and interpretable rules.
Acknowledgments The authors acknowledge the support from the Spanish Ministry of Economy and Competitiveness (projects TIN2011-27696-C02-01 and TIN2011-27696-C02-02) and the Andalusia Regional Government (Consejería de Innovación, Ciencia y Empresa, projects P07-FQN-02667 and P11-TIC-8001, and Consejería de Agricultura, Pesca y Desarrollo Rural). This work was also partially supported by European Regional Development Funds. The authors are grateful to the Andalusia Regional Government (Consejería de Economía, Innovación, Ciencia y Empresa) for the postdoctoral contract awarded to C.R.S.
References

1. Ulberth F (2004) Analytical approaches for food authentication. Mitt Geb Lebensmittelunters Hyg 95:561–572
2. Berrueta LA, Alonso-Salces RM, Héberger K (2007) Supervised pattern recognition in food analysis. J Chromatogr A 1158:196–214
3. Leardi R (2008) Chemometric methods in food authentication. In: Sun DW (ed) Modern techniques for authentication. Academic, Burlington
4. Forina M, Casale M, Oliveri P (2009) Application of chemometrics to food chemistry. In: Brown SD, Tauler R, Walczak B (eds) Comprehensive chemometrics: chemical and biochemical data analysis. Elsevier, Amsterdam
5. van der Veer G, van Ruth SM, Akkermans W (2011) Guidelines for validation of chemometric models for food authentication. Report 2011.022, RIKILT – Institute of Food Safety, Wageningen
6. Vandeginste B (2013) Chemometrics in studies of food origin. In: Brereton P (ed) New analytical approaches for verifying the origin of food. Woodhead, Cambridge
7. Marini F (ed) (2013) Chemometrics in food chemistry. Elsevier, Amsterdam
8. Brereton RG (2009) Chemometrics for pattern recognition. Wiley, Chichester
9. Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22:4–37
10. Næs T, Mevik B-H (2001) Understanding the collinearity problem in regression and discrimination analysis. J Chemom 15:413–426
11. Faber NM, Rajkó R (2007) How to avoid over-fitting in multivariate calibration – the conventional validation approach and an alternative. Anal Chim Acta 595:98–106
12. Lerma García MJ (2012) Characterization and authentication of olive and other vegetable oils. Springer, Berlin
13. Aparicio R, Harwood J (eds) (2013) Handbook of olive oil: analysis and properties, 2nd edn. Springer, Berlin
14. European Commission (2012) Commission Implementing Regulation (EU) No 29/2012 of 13 January 2012 on marketing standards for olive oil. Off J Eur Union L 12:14
15. Marini F, Bucci R, Magrì AL, Magrì AD (2010) An overview of the chemometric methods for the authentication of the geographical and varietal origin of olive oils. In: Preedy VR, Watson RR (eds) Olives and olive oil in health and disease prevention. Academic, London
16. Fauhl C, Reniero F, Guillou C (2000) 1H NMR as a tool for the analysis of mixtures of virgin olive oil with oils of different botanical origin. Magn Reson Chem 38:436–443
17. Maggio RM, Cerretani L, Chiavaro E, Kaufman TS, Bendini A (2010) A novel chemometric strategy for the estimation of extra virgin olive oil adulteration with edible oils. Food Control 21:890–895
18. Bosque Sendra JM, Cuadros Rodríguez L, Ruiz Samblás C, de la Mata AP (2012) Combining chromatography and chemometrics for the characterization and authentication of fats and oils from triacylglycerol compositional data – a review. Anal Chim Acta 724:1–11
19. Yang Z, Wu W, Gao M, Teng Q, He Y (2012) Analyzing feature selection of chromatographic fingerprints for oil production allocation. Lecture Notes Comput Sci 7530:458–446
20. de la Mata AP, Bosque Sendra JM, Bro R, Cuadros Rodríguez L (2011) Olive oil quantification of edible vegetable oil blends using triacylglycerols chromatographic fingerprints and chemometric tools. Talanta 85:177–182
21. Ruiz Samblás C, Marini F, Cuadros Rodríguez L, González Casado A (2012) Quantification of blending of olive oils and edible vegetable oils by triacylglycerol fingerprint gas chromatography and chemometric tools. J Chromatogr B 910:71–77
22. Marini F (2009) Artificial neural networks in foodstuff analyses: trends and perspectives – a review. Anal Chim Acta 635:121–131
23. Debska B, Guzowska-Swider B (2011) Application of artificial neural networks in food classification. Anal Chim Acta 705:283–291
24. Debska B, Guzowska-Swider B (2011) Decision trees in selection of featured determined food quality. Anal Chim Acta 705:261–271
25. Cao D-S, Xu Q-S, Zhang L-X, Huang J-H, Liang Y-Z (2012) Tree-based ensemble methods and their applications in analytical chemistry. Trends Anal Chem 40:158–167
26. Yang P, Yang YH, Zhou BB, Zomaya AY (2010) A review of ensemble methods in bioinformatics. Curr Bioinform 5:296–308
27. Boulesteix A-L, Janitza S, Kruppa J, König IR (2012) Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. WIREs Data Min Knowl Discov 2:493–507
28. Qi Y (2012) Random forest for bioinformatics. In: Zhang C, Ma Y (eds) Ensemble machine learning: methods and applications. Springer, New York
29. Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, van Hijum SAFT (2013) Data mining in the life sciences with random forest: a walk in the park or lost in the jungle? Brief Bioinform 14:315–326
30. Cadenas JM, Garrido MC, Martínez R, Pelta D, Bonissone PP (2013) Using a fuzzy decision tree ensemble for tumor classification from gene expression. In: Proceedings of the 5th international conference on fuzzy computation theory and applications. SciTePress Science and Technology Publications, INSTICC, Portugal
31. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23:2507–2517
32. Menze BH, Kelm BM, Masuch R, Himmelreich U, Bachert P, Petrich W, Hamprecht FA (2009) A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10:213
33. Ghasemi JB, Tavakoli H (2013) Application of random forest regression to spectral multivariate calibration. Anal Methods 5:1683–1871
34. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43:1947–1958
35. Ghasemi JB, Tavakoli H (2009) Application of random forest approach to QSAR prediction of aquatic toxicity. J Chem Inf Model 49:2481–2488
36. Brereton RG, Lloyd GR (2010) Support vector machines for classification and regression. Analyst 135:230–267
37. Geurts P, Fillet M, de Seny D, Meuwis M-A, Malaise M, Merville MP, Wehenkel L (2005) Proteomic mass spectra classification using decision tree-based ensemble methods. Bioinformatics 21:3138–3145
38. Mikut R, Reischl M (2011) Data mining tools. WIREs Data Min Knowl Discov 1:431–443
39. Nisbet R, Elder J, Miner G (2009) Handbook of statistical analysis and data mining applications. Academic, Burlington
40. Kantardzic M (2011) Data mining: concepts, models, methods, and algorithms, 2nd edn. Wiley, Hoboken
41. Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann, San Francisco
42. Bramer M (2013) Principles of data mining, 2nd edn. Springer, London
43. Berk RA (2008) Statistical learning from a regression perspective. Springer, New York
44. Guillet F, Hamilton HJ (2007) Quality measures in data mining. Springer, Berlin
45. Stahlbock R, Lessmann S, Crone SF (2010) Data mining and information systems: quo vadis? In: Stahlbock R, Crone SF, Lessmann S (eds) Data mining: special issue in Annals of Information Systems. Annals of Information Systems, vol 8. Springer, New York
46. Mutihac L, Mutihac R (2008) Mining in chemometrics. Anal Chim Acta 612:1–18
47. Belousov AI, Verzakov SA, von Frese J (2002) A flexible classification approach with optimal generalisation performance: support vector machines. Chemom Intell Lab Syst 64:15–25
48. Xu Y, Zomer S, Brereton RG (2006) Support vector machines: a recent method for classification in chemometrics. Crit Rev Anal Chem 36:177–188
49. Marini F, Bucci R, Magrì AL, Magrì AD (2008) Artificial neural networks in chemometrics: history, examples and perspectives. Microchem J 88:178–185
50. Andrade Garda JM, Carlosena Zubieta A, Gómez Carracedo MP, Gestal Pose M (2009) Multivariate regression using artificial neural networks. In: Andrade Garda JM (ed) Basic chemometric techniques in atomic spectroscopy. Royal Society of Chemistry, Cambridge
51. Brown SD, Myles AJ (2009) Decision tree modeling in classification. In: Brown SD, Tauler R, Walczak B (eds) Comprehensive chemometrics: chemical and biochemical data analysis. Elsevier, Amsterdam
52. Rokach L, Maimon O (2008) Data mining with decision trees: theory and applications. World Scientific, Singapore
53. Sutton CD (2005) Classification and regression trees, bagging, and boosting. In: Rao CR, Wegman EJ, Solka JL (eds) Data mining and data visualization. Handbook of statistics, vol 24. Elsevier, Amsterdam
54. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14:1–37
55. Wu X, Kumar V (2009) The top ten algorithms in data mining. Chapman & Hall/CRC, Boca Raton
56. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth/Chapman & Hall, Belmont
57. Questier F, Put R, Coomans D, Walczak B, Vander Heyden Y (2005) The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemom Intell Lab Syst 76:45–54
58. Quinlan JR (1992) Learning with continuous classes. In: Adams A, Sterling L (eds) AI '92: proceedings of the 5th Australian joint conference on artificial intelligence. World Scientific, Singapore
59. Wang Y, Witten IH (1997) Induction of model trees for predicting continuous classes. In: Proceedings of the poster papers of the 9th European conference on machine learning, Prague
60. Dolado JJ, Rodríguez D, Riquelme J, Ferrer Troyano F, Cuadrado JJ (2007) A two-stage zone regression method for global characterization of a project database. In: Zhang D, Tsai JJP (eds) Advances in machine learning applications in software engineering. Idea Group, Hershey
61. Bonissone PP, Cadenas JM, Garrido MC, Díaz-Valladares RA (2010) A fuzzy random forest. Int J Approx Reason 51:729–747
62. Berk RA (2006) An introduction to ensemble methods for data analysis. Sociol Methods Res 34:265–279
63. Rokach L (2010) Ensemble methods in supervised learning. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, New York
64. Rokach L (2010) Pattern classification using ensemble methods. World Scientific, Singapore
65. Breiman L (2001) Random forests. Mach Learn 45:5–32
66. Montillo AA (2009) Random forest. Guest lecture: statistical foundations of data analysis. Temple University, Philadelphia
67. Cadenas JM, Garrido MC, Martínez R (2013) Feature subset selection filter–wrapper based on low quality data. Expert Syst Appl 40:6241–6252
68. Genuer R, Poggi J-M, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recogn Lett 31:2225–2236
69. Kawakubo H, Yoshida H (2012) Rapid feature selection based on random forest for high dimensional data. Expert Syst Appl 40:6241–6252
70. Savorani F, Tomasi G, Engelsen SB (2011) iCoshift: an effective tool for the alignment of chromatographic data. J Chromatogr A 1218:7832–7840
71. Massart DL, Vandeginste BGM, Buydens LMC, de Jong S, Lewi PJ, Smeyers-Verbeke J (2007) Handbook of chemometrics and qualimetrics: part A. Elsevier, Amsterdam
72. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30:1145–1159
73. Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36