The 11th International Conference on Information Sciences, Signal Processing and their Applications: Special Sessions
SVR ACTIVE LEARNING FOR PRODUCT QUALITY CONTROL

Fouzi DOUAK(1,2), Farid MELGANI(1), Edoardo PASOLLI(1), Nabil BENOUDJIT(2)

(1) Dept. of Information Engineering and Computer Science, University of Trento, Via Sommarive, 14, I-38123, Trento, Italy. E-mail: {fouzi.douak, melgani, pasolli}@disi.unitn.it
(2) Laboratoire d’Electronique Avancée, Université de Batna, Avenue Boukhlouf Med El Hadi, 05000 Batna, Algeria. E-mail: [email protected]
ABSTRACT

In this work, the active learning approach is adopted to address the problem of training sample collection for the estimation of chemical parameters for product quality control from spectroscopic data. In particular, two strategies for support vector regression (SVR) are proposed. The first method selects samples that are distant in the kernel space from the current support vectors, while the second one uses a pool of regressors in order to choose the samples with the greatest disagreement among the different regressors. Experimental results on two real data sets show the effectiveness of the proposed solutions.

Index Terms – Active learning, chemical parameter estimation, product quality control, spectroscopy, support vector regression (SVR).

1. INTRODUCTION

In the last few years, spectroscopy has become an important technology for product analysis and quality control in different chemical fields. For example, it has been applied successfully in the pharmaceutical [1], food [2] and textile industries [3]. Indeed, it allows a fast acquisition of a large number of spectral data, which can be analyzed to yield accurate estimates of the concentration of the chemical component of interest in a given product.

From a methodological point of view, the problem of concentration estimation can be approached by supervised regression techniques, whose purpose is to define the model that relates the acquired observations to the concentration of interest. The performance of supervised approaches depends strongly on the quality and quantity of the labeled samples used to construct the model. By training samples, we mean pairs of spectral data acquired by a spectrometer and measurements of the concentration to estimate. In the literature, two main approaches to regression have been proposed. The first one is based on
linear models, appreciated for their simplicity, such as multiple linear regression, principal component regression, ridge regression and partial least squares regression. The second approach makes use of nonlinear models, such as radial basis function neural networks and support vector regression (SVR). These are characterized by greater computational complexity, but they can give better performances when a strong nonlinearity between the acquired spectral data and the concentrations to estimate is present.

While the regression process is typically carried out by assuming that a sufficient number of training samples is available, from a practical point of view the collection of training samples is not trivial, because the concentration measurements associated with the spectral data have to be performed manually by human experts and are thus subject to errors and to costs in terms of time and money. For this reason, there has been a growing interest in developing strategies for the semi-automatic construction of the training set. A first solution is given by the semisupervised approach [4], in which the unlabeled samples are automatically exploited during the design of the model in order to compensate for the deficit in labeled samples. By unlabeled samples, we mean samples whose spectral values are known, but for which the corresponding concentration values are unknown. Such samples have the advantage of being available at zero cost from the data under analysis. Another solution is represented by the active learning approach. Starting from a small and suboptimal training set, additional samples are selected from a set of unlabeled samples and added to the training set after labeling. The process is iterated until convergence. Despite the promising performance obtained by the active learning approach in the classification field, much less attention has been given to regression problems. In particular, this approach has not been investigated in the chemometrics field.

The objective of this work is to introduce the active learning approach for estimating concentrations from spectroscopic data for product quality control purposes.
In particular, two strategies for the state-of-the-art SVR [5],[6] are proposed. To the best of our knowledge, the combination of SVR and the active learning approach in the chemometrics field is new. The first method selects samples that are distant in the kernel space from the current support vectors (SVs), while the second one uses a pool of regressors in order to choose the samples with the greatest disagreement among the different regressors. To illustrate the capabilities of the proposed strategies, we conducted an experimental study based on two different real data sets: 1) a diesel data set for estimating the cetane number by near-infrared spectroscopy and 2) the Tecator data set for the estimation of the fat content in meat by near-infrared transmittance spectroscopy. The obtained results show that interesting performances can be achieved.

2. SUPPORT VECTOR REGRESSION (SVR)

The ε-insensitive SVR technique is based on the idea of finding an estimate f̂(x) of the unknown relationship y = f(x) between the vector of observations x and the desired concentration value y from the given set of training samples L such that: 1) f̂(x) has, at most, ε deviation from the desired targets y_i (i=1,…,n) and 2) it is as smooth as possible. This is usually performed by mapping the data from the original d-dimensional feature space to a higher dimensional transformed feature space, i.e., Φ(x) ∈ ℜ^d' (d' > d), where the function is more likely to be linear and can thus be approximated by the following linear model:

f̂(x) = ω*·Φ(x) + b*.    (1)

The SVR prediction function takes the form

f̂(x) = Σ_{i∈S} (α_i − α_i*) k(x_i, x) + b*    (2)

where k(·,·) is a kernel function defined as

k(x_j, x_i) = Φ(x_j)·Φ(x_i)    (3)

and S is the subset of indices (i=1,…,n) corresponding to the nonzero Lagrange multipliers α_i or α_i*. The Lagrange multipliers weigh each training sample according to its importance in determining a solution. The training samples associated with nonzero weights are called support vectors. In S, margin support vectors that lie on the ε-insensitive tube and nonmargin support vectors that correspond to errors coexist. The kernel k(·,·) should be chosen such that it satisfies the condition imposed by Mercer's theorem. A common example of a nonlinear kernel that fulfils Mercer's condition is the Gaussian kernel function.
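For concreteness, the following minimal sketch (not part of the original paper) shows how an ε-insensitive SVR with a Gaussian kernel can be fitted with a generic library implementation and how the support vectors and their weights α_i − α_i* in Eq. (2) can be inspected; the library (scikit-learn), the hyperparameter values, and the random data are illustrative assumptions only.

```python
# Illustrative sketch of eqs. (1)-(3): epsilon-insensitive SVR with a Gaussian
# (RBF) kernel, using scikit-learn. Data and hyperparameter values are dummies.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(50, 401)          # 50 spectra with 401 wavelength values
y = rng.rand(50)               # concentrations to estimate

svr = SVR(kernel="rbf", C=2.0, epsilon=0.1, gamma=1e-3)
svr.fit(X, y)

# Support vectors x_i and their weights (alpha_i - alpha_i*) appearing in eq. (2)
support_vectors = svr.support_vectors_
dual_weights = svr.dual_coef_.ravel()

# Prediction according to eq. (2): sum_i (alpha_i - alpha_i*) k(x_i, x) + b*
x_new = rng.rand(1, 401)
print(svr.predict(x_new))
```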
3. PROPOSED ACTIVE LEARNING METHODS

Let us consider a training set composed initially of n labeled samples L = {x_i, y_i}_{i=1}^n and an additional learning set composed of m unlabeled samples U = {x_j}_{j=n+1}^{n+m}, with m >> n. In order to expand the training set L with samples chosen from the learning set U and labeled manually by the human expert, an active learning strategy has the purpose of choosing them properly so as to minimize the error of the regression process while minimizing the number of samples to label. A generic active learning process can be subdivided into the following steps: 1) starting from the initial training set L, the unlabeled samples x_j of the learning set U are evaluated using an appropriate criterion h; 2) the samples are sorted according to h in order to obtain the sorted learning set U_s; 3) the first N_s samples are selected from the set U_s, where N_s is the number of samples to be added to the training set L; 4) the selected samples U'_s are labeled by the human expert and added to the training set L; 5) the entire process is iterated until the predefined convergence condition is satisfied (e.g., the total number of samples to add to the training set is reached, or the accuracy improvement on an independent calibration/validation set over the last iterations becomes insignificant). A schematic sketch of this generic loop is given below.
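Purely as an illustration (not the paper's implementation), the five steps above could be organized as in the following sketch; the criterion h, the labeling oracle, the SVR settings, and the stopping rule are placeholders to be supplied by the specific strategy.

```python
# Schematic sketch of the generic active learning loop (steps 1-5 above).
# Criterion, oracle and stopping rule are placeholders, not the paper's code.
import numpy as np
from sklearn.svm import SVR

def active_learning(X_train, y_train, X_pool, criterion, oracle,
                    n_select=20, n_iterations=5):
    X_train, y_train, X_pool = X_train.copy(), y_train.copy(), X_pool.copy()
    for _ in range(n_iterations):                       # step 5: convergence condition
        model = SVR(kernel="rbf").fit(X_train, y_train)
        h = criterion(model, X_train, y_train, X_pool)  # step 1: evaluate criterion h
        order = np.argsort(h)[::-1]                     # step 2: sort (most informative first)
        chosen = order[:n_select]                       # step 3: take the first Ns samples
        y_new = oracle(X_pool[chosen])                  # step 4: human expert labels them
        X_train = np.vstack([X_train, X_pool[chosen]])
        y_train = np.concatenate([y_train, y_new])
        X_pool = np.delete(X_pool, chosen, axis=0)
    return SVR(kernel="rbf").fit(X_train, y_train)
```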
In the following, two methods of selection for SVR are proposed.

3.1. Distance from the Support Vectors (SVd)

In the first method, we start by training an SVR regressor on the training set L and identifying the corresponding SVs. Note that the SVs are the only samples of the training set that contribute to defining the regression model. At this point, for each sample x_j (j=n+1,n+2,…,n+m) of the learning set U, the distances d_j = [d_{j,1}, d_{j,2}, …, d_{j,v}] ∈ ℜ^v in the kernel space from the v identified SVs are calculated:

d_{j,i} = ||Φ(x_j) − Φ(x_i)|| = √( k(x_j, x_j) + k(x_i, x_i) − 2 k(x_j, x_i) ).    (4)

After that, the minimum value d_{j,min} over the distances d_j is computed:
d_{j,min} = min_{i=1,…,v} d_{j,i}.    (5)

This is equivalent to considering the distance value related to the SV closest to the sample under analysis. Additionally, for each sample x_j we consider the value α̃_j = α_i − α_i* of the Lagrange multiplier associated with the closest SV x_i. Recall that the Lagrange multipliers weigh each training sample according to its importance in determining a solution. The most important training samples are those for which the corresponding Lagrange multipliers are equal to the regularization parameter C. At this point, the samples of the learning set U are sorted first according to the value α̃_j and then according to the distance value d_{j,min}. The final selection is obtained from this sorted set after including an additional selection constraint. In particular, if a new sample to select shares the same closest SV with a sample already selected at that iteration, it is discarded. In this way, we limit the selection of similar and redundant samples and choose samples distributed as much as possible over the feature space.
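A possible realization of this selection step is sketched below; it is not the paper's code, and it assumes a Gaussian kernel, a fitted scikit-learn SVR object, absolute values of the dual coefficients as α̃, and a descending sort (largest values first), all of which are assumptions made for illustration.

```python
# Illustrative sketch of the SVd selection step (eqs. (4)-(5)); it assumes a
# fitted scikit-learn SVR with a Gaussian (RBF) kernel. Names are hypothetical.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def svd_select(svr, X_pool, n_select, gamma):
    sv = svr.support_vectors_                 # the v current support vectors
    alpha = np.abs(svr.dual_coef_.ravel())    # |alpha_i - alpha_i*| of each SV

    # Eq. (4): distance in the kernel space; for the Gaussian kernel k(x, x) = 1
    k_ps = rbf_kernel(X_pool, sv, gamma=gamma)        # k(x_j, x_i)
    d = np.sqrt(np.maximum(2.0 - 2.0 * k_ps, 0.0))    # sqrt(k_jj + k_ii - 2 k_ji)

    d_min = d.min(axis=1)                     # eq. (5): distance to the closest SV
    closest = d.argmin(axis=1)                # index of the closest SV
    alpha_tilde = alpha[closest]              # multiplier of that closest SV

    # Sort first by alpha_tilde, then by d_min (largest values first here)
    order = np.lexsort((d_min, alpha_tilde))[::-1]

    selected, used_sv = [], set()
    for j in order:
        if closest[j] in used_sv:             # discard candidates sharing the closest SV
            continue
        selected.append(j)
        used_sv.add(closest[j])
        if len(selected) == n_select:
            break
    return np.array(selected)
```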
3.2. Pool of Regressors (PoR)

The second method is based on a pool of regressors. We suppose that the target value of each sample x_j (j=n+1,n+2,…,n+m) is predicted by n_s different regressors r_t (t=1,2,…,n_s). In particular, we define as µ_{j,t} the estimate associated with the sample x_j and the regressor r_t. Therefore, n_s different predictions are obtained for each sample. At this point, the different estimates are combined by calculating the variance value over them:
µ_{j,var} = (1/n_s) Σ_{t=1}^{n_s} (µ_{j,t} − µ_j)²    (6)

where

µ_j = (1/n_s) Σ_{t=1}^{n_s} µ_{j,t}.    (7)
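As an illustration only, Eqs. (6)-(7) and the corresponding selection rule could be computed as in the following sketch, assuming a list of already-trained regressors exposing a scikit-learn-style predict method; all names are hypothetical.

```python
# Illustrative sketch of the PoR disagreement criterion (eqs. (6)-(7)),
# given a pool of already-trained regressors; names are assumptions.
import numpy as np

def por_select(regressors, X_pool, n_select):
    # mu[t, j] is the prediction of regressor r_t for pool sample x_j
    mu = np.vstack([r.predict(X_pool) for r in regressors])
    mu_mean = mu.mean(axis=0)                       # eq. (7)
    mu_var = ((mu - mu_mean) ** 2).mean(axis=0)     # eq. (6)
    # Select the samples on which the regressors disagree the most
    return np.argsort(mu_var)[::-1][:n_select]
```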
Finally, the selection criterion that we propose is related to the value µ_{j,var}. The samples characterized by the largest values of variance, i.e., the greatest disagreement among the different regressors, are selected. Indeed, a high disagreement means that the considered sample has been estimated with high uncertainty, and thus adding it to the training set could be useful to improve the regression process. It remains to define how the different estimates are obtained. In this work, we propose to adopt the bagging approach, which has shown good capabilities in both regression and classification contexts [7]. Considering the original training set L, n_s different training subsets L_t (t=1,2,…,n_s) are constructed by randomly choosing a percentage p_s of samples from L. At this point, each training subset is considered independently of the others and used to train a different regressor.

4. EXPERIMENTS

4.1. Data Set Description and Experimental Setup

In order to validate the proposed methods, we have conducted an experimental analysis on two real data sets. The first one refers to multispectral acquisitions of diesel fuels [8]. It was built by the Southwest Research Institute in order to develop instrumentation to evaluate fuel on battle fields. Along with the spectral acquisitions, different properties are available, such as boiling point at 50% recovery, cetane number, density, freezing temperature, total aromatics, and viscosity. The data set contains only summer fuels, and outliers were removed. In our experiments, we consider one of the most difficult prediction tasks in this data set, i.e., the prediction of the cetane number of the fuel. All spectra range from 750 to 1550 nm, discretized into 401 wavelength values. The data set contains 20 high leverage spectra and 225 low leverage spectra, the latter being separated into two subsets labeled a and b. As suggested by the providers of the data, we have built a learning set with the high leverage spectra and subset a of the low leverage spectra
(thus yielding 133 spectra). The test set is made of the low leverage spectra of subset b (gathering 112 spectra).

The second data set deals with the determination of the fat content in meat samples analyzed by near-infrared transmittance spectroscopy [9]. Each sample contains finely chopped pure meat with different moisture, fat and protein contents. Those contents, measured in percent, are determined by analytic chemistry. The spectra, acquired by the Tecator Infratec Food and Feed Analyzer, record the light transmittance through the meat samples at 100 wavelengths in the range between 850 and 1050 nm. The corresponding 100 spectral variables are the absorbance values defined from the measured transmittance. The spectra are normalized according to the standard normal variate method, i.e., mean equal to zero and variance equal to one. For this data set, the learning and test sets contain 172 and 43 spectra, respectively.

In all the following experiments, the initial training samples required by the active learning process were selected randomly from the learning set U. Starting from 33 and 32 samples for the diesel and the Tecator data set, respectively, the active learning algorithms were run until all the learning samples had been added to the training set, adding 20 samples at each iteration. The entire process was run ten times, each time with a different initial training set, to yield statistically reliable results. At each run, the initial training samples were chosen in a completely random way. A regressor was also trained on the entire learning set in order to have a reference training scenario, called "full" training. On the one hand, the regression results obtained in this way represent a lower bound for the errors. On the other hand, we expect the upper error bound to be given by the completely random selection strategy (Ran). We recall that the purpose of any active learning strategy is to converge to the performance of the "full" training scenario faster than the random selection method. Regression performances were evaluated on the test sets in terms of the normalized mean squared error (NMSE):
NMSE = [ (1/w) Σ_{i=1}^{w} (ŷ_i − y_i)² ] / var(y)    (8)
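For reference, a direct computation of this error measure might look as follows (a sketch with hypothetical array names; the variance is computed over all available output values, as described next).

```python
# Sketch of the NMSE of eq. (8); y_true/y_pred are the test targets and
# predictions, y_all collects all available output values (hypothetical names).
import numpy as np

def nmse(y_true, y_pred, y_all):
    return np.mean((y_pred - y_true) ** 2) / np.var(y_all)
```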
In (8), w is the number of test samples. The parameter var(y) represents the variance of the output values and plays the role of a normalizing constant. It is calculated on all available samples, i.e., both learning and test samples. Concerning the parameter setting, a Gaussian kernel was adopted, thanks to the generally good prediction accuracy associated with this kernel. The regularization and kernel width parameters were tuned empirically by cross-validation in the ranges [2^-9, 2^9] and [2^-11, 2^3], respectively. Regarding the number of regressors n_s for the PoR strategy, it was fixed to four in all experiments.

4.2. Experimental Results

Fig. 1. Performances achieved in terms of (a), (c) NMSE and (b), (d) NMSE standard deviation on (a), (b) the diesel and (c), (d) the Tecator data sets.

Concerning the diesel data set, we report in Fig. 1(a),(b) the results obtained by evolving the active learning process in terms of NMSE and relative standard deviation (σNMSE),
respectively. First, note that poor performances are obtained at the starting point, since a small number of training samples is used to train the regressor. Another expected result is the performance improvement when additional samples are inserted into the training set. This results in graphs with an approximately monotonically decreasing behavior of NMSE and σNMSE, which tend to converge to the results yielded by the "full" regressor, for which the entire learning set is exploited to train the model. Although such a decrease is observed for both active and random selection, the active methods allow a faster convergence to the "full" result with respect to the random strategy. This means that similar values of prediction error can be obtained using a smaller quantity of training samples, which implies a reduction of the human expert work and a decrease of the computational time necessary to train the regressor. Additionally, the improvements in terms of σNMSE indicate greater levels of stability in defining the regression model. Between the two proposed strategies, the method PoR based on the pool of regressors yields better results than SVd based on the distances in the kernel space between labeled and unlabeled samples. The obtained results are
shown in greater detail in Table I. In particular, we consider the performances obtained when 40 additional samples are inserted in the training set. The best results are highlighted in bold font. It is confirmed that the proposed strategies are characterized by better performances both in terms of NMSE and relative σNMSE. For the Tecator data set, the results are reported as done for the previous data set in Fig. 1(c),(d) and Table I. Note that the active learning method PoR again gives the best performance both in terms of NMSE and σNMSE with respect to the other learning strategies. In contrast, poor results are obtained in this case for the SVd strategy, which exhibits values of NMSE and σNMSE comparable with those of the random selection.

TABLE I
NORMALIZED MEAN SQUARED ERROR (NMSE) AND RELATIVE STANDARD DEVIATION (σNMSE) ON THE DIESEL AND THE TECATOR DATA SETS

                     Diesel data set                         Tecator data set
Method    # training samples   NMSE     σNMSE     # training samples   NMSE     σNMSE
Full             133           0.4672     –              172           0.0024     –
Initial           33           0.6044   0.0428             32           0.0298   0.0138
Ran               73           0.5120   0.0261             92           0.0049   0.0018
SVd               73           0.4907   0.0215             92           0.0050   0.0019
PoR               73           0.4758   0.0103             92           0.0031   0.0011

5. CONCLUSION

In this work, we have introduced the active learning approach in the field of spectroscopic data analysis for the problem of product quality control. In particular, two active learning strategies based on the state-of-the-art SVR have been proposed to address the problem of training sample collection for the estimation of chemical
parameters, in order to minimize at the same time the human expert effort and the prediction error. The first method relies on the idea of adding samples that are distant from the current SVs, while the second one uses a pool of regressors in order to select the samples with the greatest disagreement among the different regressors. The experimental results obtained on two real data sets show the good capabilities of the proposed strategies in selecting significant samples. In general, the proposed methods are characterized by higher performances in terms of both accuracy and stability with respect to a completely random selection strategy. Comparing them, the best active strategy appears to be the one based on the pool of regressors. It is, however, the most computationally demanding, since it needs the training of different regressors to build the pool. Though we have focused on SVR in this work, the active sample selection could be used in combination with other supervised regression approaches. Moreover, while in this work the initial training set was chosen in a random way, more sophisticated initialization strategies could be envisioned in order to further improve the performances of the active learning process.

REFERENCES

[1] S. Sekulic, H.W. Ward, D. Brannegan, E. Stanley, C. Evans, S. Sciavolino, P. Hailey, P. Aldridge, "On-line monitoring of powder blend homogeneity by near-infrared spectroscopy," Anal. Chem., vol. 68, pp. 509-513, 1996.
[2] Y. Ozaki, R. Cho, K. Ikegaya, S. Muraishi, K. Kawauchi, "Potential of near-infrared Fourier transform Raman spectroscopy in food analysis," Appl. Spectrosc., vol. 46, pp. 1503-1507, 1992.
[3] M. Blanco, J. Coello, J. M. Garcia Fraga, H. Iturriaga, S. Maspoch, J. Pagès, "Determination of finishing oils in acrylic fibers by near-infrared reflectance spectroscopy," Analyst, vol. 122, pp. 777-781, 1999.
[4] X. Zhu, "Semi-supervised learning literature survey," Comput. Sci. Tech. Rep. 1530, Univ. Wisconsin-Madison, 2005.
[5] A. Smola, B. Schölkopf, "A tutorial on support vector regression," Stat. Comput., vol. 14, no. 3, pp. 199-222, 2004.
[6] N. Benoudjit, F. Melgani, H. Bouzgou, "Multiple regression systems for spectrophotometric data analysis," Chemom. Intell. Lab. Syst., vol. 95, pp. 144-149, 2009.
[7] L. Breiman, "Bagging predictors," Tech. Rep. 421, Univ. California, Berkeley, 1994.
[8] Diesel data set, available at http://www.eigenvector.com/data/SWRI.
[9] Tecator data set, available at http://lib.stat.cmu.edu/datasets/tecator.