Semisupervised Incremental Support Vector Machine Learning Based on Neighborhood Kernel Estimation

Jing Wang, Daiwei Yang, Wei Jiang, and Jinglin Zhou, Member, IEEE
Abstract—The semisupervised scheme has emerged as a popular strategy in the machine learning community because obtaining enough labeled data is expensive. In this paper, a semisupervised incremental support vector machine (SE-INC-SVM) algorithm based on neighborhood kernel estimation is proposed. First, kernel regression is constructed to estimate the unlabeled data from their labeled neighbors, and its estimation accuracy is discussed through an analogy with the traditional RBF neural network. An incremental scheme is then derived to improve the learning efficiency and reduce the computing time. Simulations on manual data sets and an industrial benchmark, the penicillin fermentation process, demonstrate the effectiveness of the proposed SE-INC-SVM method.

Index Terms—Incremental training, neighborhood kernel estimation (KE), semisupervised scheme, support vector machine (SVM).
I. INTRODUCTION

BIG data has drawn huge attention from information scientists, industrial technologists, and decision makers [1]. Machine learning is undoubtedly the core science of artificial intelligence and the fundamental way to systematically extract useful information from data. Traditional machine learning methods include the support vector machine (SVM) [2], fuzzy logic [3], the extreme learning machine (ELM) [4], and neural networks (NNs) [5]. With the help of machine learning methods, the capabilities of system modeling [10], multichannel data processing, process control [11], optimal decision making, and fault diagnosis [12] have been significantly enhanced. For example, NNs and fuzzy logic have been used to introduce more intelligence into control schemes due to their excellent learning characteristics [13], [18], [22].
Manuscript received August 12, 2016; revised December 26, 2016; accepted February 6, 2017. This work was supported in part by the National Natural Science Foundation of China under Grant 61573050 and Grant 61473025, and in part by the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, under Grant 20160107. This paper was recommended by Associate Editor Z. Liu. The authors are with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMC.2017.2667703
Most of them propose the control scheme without prior knowledge of the control direction.

Traditional machine learning techniques mainly employ either labeled data or unlabeled data to learn. However, it is difficult and expensive to obtain enough labeled data in some practical applications, such as social media analytics [6], industrial processes [7], medical diagnosis [8], and diabetes therapeutics [9]. In contrast, it is easy to obtain a large amount of unlabeled data at almost no cost; for example, the unlabeled data can be unmeasured product quality or undiagnosed samples. To address this problem, semisupervised machine learning has been proposed and has attracted considerable attention [14], [15]. Here, a small amount of labeled data is used for supervised learning and a large amount of unlabeled data is used for unsupervised learning, i.e., semisupervised learning.

The oldest semisupervised learning method may be the generative model [16] for classification, which uses expectation-maximization (EM) to model the label estimation. The EM algorithm has been applied to fault diagnosis with missing data [17], where the resulting classifiers were shown to perform better than supervised methods. Self-training is a common semisupervised learning technique. A self-training semisupervised SVM (S3VM) algorithm and its corresponding model selection technique were presented to train a classifier with little training data [19]. A novel framework was proposed in which an initial classifier is learned by incorporating prior information [20]. The co-training model demonstrates how radically the need for labeled data can be reduced if a large amount of unlabeled data is available [21]; it was then applied in an adaptive manner to select feedback documents for boosting the effectiveness of query expansion [23]. Balcan and Blum [24] gave an augmented version of a unified framework to reason about many of the different semisupervised learning approaches. Zhou and Li [25] gave a co-training algorithm based on canonical correlation analysis which also needs only one labeled point. They use a collaborative technology based on the co-training semisupervised regression algorithm with two K-nearest neighbor regression models using Minkowski distances of different orders. However, due to the defects of the K-nearest neighbor algorithm itself, there are still shortcomings in the choice of the K-nearest neighbor regression model, such as excessive storage and a large amount of calculation.

In recent years, the SVM has emerged as a popular method in the machine learning community. It has been successfully
applied to actual processes, such as industrial modeling under noisy environments [7], smart homecare surveillance systems [26], and corporate financial risk prediction [27]. If the standard SVM problem is solved with the unknown labels treated as additional optimization variables, the learning ability of the SVM is improved. This idea first appeared under the name of transductive SVM [3]. Since it learns an inductive rule defined over the entire input space, this approach is also referred to as S3VM. A wide spectrum of techniques has been applied to solve the nonconvex optimization problem associated with S3VMs [28], for example, local combinatorial search, gradient descent [29], continuation techniques [30], DC programming [31], semidefinite programming [32], nondifferentiable methods, deterministic annealing [33], and branch-and-bound algorithms [34]. Besides the traditional S3VM learning, a recently proposed semisupervised learning algorithm based on ELM theory is the semisupervised ELM [35], [36]; the ELM is a unified learning scheme for generalized single-hidden-layer feed-forward NNs [4]. All these semisupervised methods, which utilize both labeled and unlabeled data, can enhance the efficiency of model development in terms of time and cost.

It is noteworthy that previous research mainly focuses on classification, while regression remains almost untouched. The label of the unlabeled data is discrete and finite in classification problems; for example, the number of possible labels of an unlabeled sample is two in two-class classification. But the label of the unlabeled data is continuous and infinite in regression. Therefore, how to label the unlabeled data in regression, and in return use it to train the regression machine, is a hard problem. This paper mainly focuses on how to label the unknown data and how to train accurately with semisupervised support vector regression. The semisupervised incremental SVM (SE-INC-SVM) based on neighborhood kernel estimation mainly contains two strategies: 1) kernel estimation and 2) incremental learning. Kernel regression is used to estimate the unlabeled data from their labeled neighbors, and this estimation is updated during the training process with feedback. An incremental algorithm is employed to solve the S3VM.

The remainder of this paper is organized as follows. The SE-INC-SVM regression algorithm is presented in Section II, in which the model structure, mathematical description, kernel estimation (KE) for the unlabeled data, and the incremental learning strategy are described in detail. Then the performance of the proposed algorithm is verified by applying it to ten popular manual data sets and an actual penicillin fermentation process in Section III. The estimation of penicillin and cell concentrations is a difficult problem due to scarce measured quality data, while the proposed method shows better performance than other SVM algorithms. The conclusions are drawn in Section IV.

II. SEMISUPERVISED SVM BASED ON THE KERNEL ESTIMATION

A. SE-INC-SVM Structure

Based on the idea of the help-training algorithm, we develop a novel structure for semisupervised learning, which is
Fig. 1. SE-INC-SVM structure.
the combination of KE and support vector regression, as shown in Fig. 1. The two parts have different functions. The KE method is mainly used to label the unlabeled data, which are then transferred back to the support vector regression together with the labeled data. The output of the SVM, with the labeled data and the estimates of the unlabeled data as inputs, is fed back to adjust the KE. This process improves the prediction accuracy.

Suppose that the whole training set consists of $n$ labeled examples $\{(x_i, y_i)\}_{i=1}^{n}$ and $m$ unlabeled examples $\{(x_j^u, y_j^u)\}_{j=1}^{m}$, with $l = m + n$. Here, the labels $y_j^u$ are unknown and cannot be applied directly to the support vector regression, so the KE technique is used to predict them.

B. Neighborhood Kernel Estimation for Unlabeled Data

Generally, regression methods can be divided into two categories: 1) parametric methods and 2) nonparametric methods. KE is one of the most popular nonparametric techniques on account of the virtue of the kernel: it provides an unbinned, nonparametric estimate of the distribution from which a set of data is drawn. The KE for the unlabeled data $y_j^u$ is [37]
$$y_j^u = \frac{\sum_{i=1}^{n} K\left(x_j^u, x_i\right) y_i}{\sum_{i=1}^{n} K\left(x_j^u, x_i\right)}, \quad j = 1, \ldots, m \qquad (1)$$
where $y_j^u$ is the predicted label for the unlabeled datum $x_j^u$, and $K(x_j^u, x_i)$ is a kernel function satisfying Mercer's theorem, computed from the labeled data $x_i$. For simplicity, the Gaussian kernel is used to estimate the unlabeled data in this paper. Moreover, if all the labeled training data were used to predict each unlabeled datum in (1), the computing efficiency and the estimation accuracy would decrease greatly. So a neighborhood estimation method is exploited, in which the kernel function distinguishes adjacent from nonadjacent pairs of $x_j^u$ and $x_i$:
$$K\left(x_j^u, x_i\right) = \begin{cases} \exp\left(-\left\|x_j^u - x_i\right\|^2 / 2\sigma^2\right), & \text{if } x_i \in \mathcal{N}_\delta\left(x_j^u\right) \text{ or } x_j^u \in \mathcal{N}_\delta\left(x_i\right) \\ 0, & \text{otherwise.} \end{cases} \qquad (2)$$
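As a concrete illustration of (1) and (2), the following sketch estimates the labels of unlabeled points from their labeled neighbors with a truncated Gaussian kernel. The function name, the neighborhood radius `delta`, the kernel width `sigma`, and the fallback for points with no labeled neighbor are illustrative choices, not part of the original paper.

```python
import numpy as np

def neighborhood_kernel_estimate(X_lab, y_lab, X_unlab, sigma=1.0, delta=1.0):
    """Estimate labels of unlabeled points via the neighborhood KE of (1)-(2)."""
    y_est = np.zeros(len(X_unlab))
    for j, xu in enumerate(X_unlab):
        # Squared Euclidean distances to all labeled points.
        d2 = np.sum((X_lab - xu) ** 2, axis=1)
        # Gaussian kernel, truncated outside the delta-neighborhood (eq. (2)).
        k = np.where(np.sqrt(d2) <= delta, np.exp(-d2 / (2.0 * sigma ** 2)), 0.0)
        if k.sum() > 0:
            # Nadaraya-Watson style weighted average of neighboring labels (eq. (1)).
            y_est[j] = np.dot(k, y_lab) / k.sum()
        else:
            # No labeled neighbor within delta: fall back to the nearest labeled point.
            y_est[j] = y_lab[np.argmin(d2)]
    return y_est
```

In the full SE-INC-SVM these estimates are further refined by the feedback loop of Fig. 1.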
Fig. 2. Equivalent normalized RBF network for neighborhood KE.
In order to estimate the unknown label $y_j^u$, its neighbors can be found by computing the Euclidean distance between $x_j^u$ and $x_i$; $\delta$ is the radius of the neighborhood. Note that $\sigma$ is the kernel width, which controls how the feature-space images of the data points spread out and thereby determines the size of the minimal sphere.

The effectiveness of SE-INC-SVM largely depends on the accuracy of the KE method. If the estimate of each unlabeled datum is accurate, then training with these estimated unlabeled data together with the labeled data leads to higher accuracy of the incremental SVR. To illustrate the accuracy of KE, a comparative analysis between KE and the RBF network is given here. First, define a normalized radial basis function
$$\varphi\left(x_j^u, x_i\right) = \frac{K\left(x_j^u, x_i\right)}{\sum_{i=1}^{n} K\left(x_j^u, x_i\right)}, \quad i = 1, \ldots, n. \qquad (3)$$
Obviously,
$$\sum_{i=1}^{n} \varphi\left(x_j^u, x_i\right) = 1, \quad \forall x \in X$$
where $X$ represents the range of the input. Then, the KE of the unlabeled data (1) can be simplified as
$$y_j^u = \sum_{i=1}^{n} \varphi\left(x_j^u, x_i\right) y_i, \quad j = 1, \ldots, m. \qquad (4)$$
From another point of view, KE can be viewed as a sum of normalized radial basis functions in which $y_i$ is treated as a linear weight, that is, $\omega_i = y_i$, $i = 1, \ldots, n$. The KE (1) can then be written in the form of a weighted sum
$$y_j^u = \sum_{i=1}^{n} \omega_i\, \varphi\left(x_j^u, x_i\right), \quad j = 1, \ldots, m. \qquad (5)$$
Formula (5) can be represented as a normalized RBF NN [38], as shown in Fig. 2, where $d$ is the input dimension. Mathematically, an RBF NN is equivalent to an interpolation method in which $\omega_i$ is the interpolation node and $\varphi(x_j^u, x_i)$ is the interpolation basis function, here taking the form of a Gaussian kernel. In this interpolation, the result at $x_j^u$ is affected by all the interpolation nodes $x_i$: the closer a node is, the greater its impact. So the KE method of SE-INC-SVM is equivalent to interpolating the unlabeled data using the labeled neighbors. The incremental learning method that follows is equivalent to dynamically selecting the basis function centers of an RBF NN, which smooths the noise introduced at the interpolation nodes and thereby yields good interpolation accuracy. Owing to the equivalence between SE-INC-SVM and the RBF NN, many features of the RBF NN are also reflected in SE-INC-SVM, such as consistent approximation of nonlinear continuous functions and fast learning. Moreover, the proposed SE-INC-SVM algorithm learns incrementally, so it is more suitable than the RBF NN for processing time-series data, such as data acquired from industrial processes.

C. Incremental Learning Scheme for SE-INC-SVM

It is known from the transductive SVM that the primal quadratic optimization problem of S3VM regression is
$$\min_{(w, b, y^u)} I\left(w, b, y^u\right) = \frac{1}{2}\|w\|^2 + C_v \sum_{i=1}^{n} V\left(y_i, o_i\right) + C_u \sum_{j=1}^{m} V\left(y_j^u, o_j\right) \qquad (6)$$
where the first two parts represent the model complexity and the empirical risk on the labeled data, and the last one is a penalty function for the unlabeled data. In general, the optimization problem (6) can be simplified as
$$\min I = \frac{1}{2}\|w\|^2 + C_v \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) + C_u \sum_{j=1}^{m} \left(\xi_j^u + \xi_j^{u*}\right) \qquad (7)$$
$$\text{s.t.} \quad \begin{aligned} & w x_i + b - y_i \le \varepsilon + \xi_i^* \\ & y_i - w x_i - b \le \varepsilon + \xi_i \\ & w x_j^u + b - y_j^u \le \varepsilon + \xi_j^u \\ & y_j^u - w x_j^u - b \le \varepsilon + \xi_j^{u*}. \end{aligned} \qquad (8)$$
In the objective function (7), the first term measures the model complexity, the second is the prediction error between the labeled output data and the model output, and the third is the prediction error between the KE of the unlabeled output and the model output. The coefficients $C_v$ and $C_u$ are parameters given by the user, which trade off the margin size against mispredicting training examples (or excluding test examples). Since the unlabeled data are labeled by the kernel technique (1) and (2), the last two terms are essentially similar, so they can be merged into one term with different trade-off parameters. The optimization problems (7) and (8) are then rewritten as
$$\min I = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} C_i^v \left(\xi_i + \xi_i^*\right) + \sum_{j=1}^{m} C_j^u \left(\xi_j^u + \xi_j^{u*}\right) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n+m} C_i \left(\lambda_i + \lambda_i^*\right) \qquad (9)$$
$$\text{s.t.} \quad Y_i - w X_i - b \le \varepsilon + \lambda_i, \qquad w X_i + b - Y_i \le \varepsilon + \lambda_i^* \qquad (10)$$
with
$$Y = \begin{bmatrix} y_i \\ y_j^u \end{bmatrix}, \quad X = \begin{bmatrix} x_i \\ x_j^u \end{bmatrix}, \quad \lambda = \begin{bmatrix} \xi_i \\ \xi_j^u \end{bmatrix}, \quad \lambda^* = \begin{bmatrix} \xi_i^* \\ \xi_j^{u*} \end{bmatrix}, \quad C = \begin{bmatrix} C_i^v \\ C_j^u \end{bmatrix}.$$
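As the text notes next, once the unlabeled targets have been filled in by the kernel estimate (1), problems (9) and (10) take the form of a standard weighted ε-SVR over the combined data set, with per-sample trade-off weights $C_i$. The sketch below illustrates this reduction with scikit-learn's SVR and its sample_weight argument; the down-weighting factor for KE-labeled samples and the RBF parameters are illustrative choices, and this batch sketch stands in for the paper's incremental solver.

```python
import numpy as np
from sklearn.svm import SVR

def fit_semisupervised_svr(X_lab, y_lab, X_unlab, y_unlab_est,
                           C=3.0, epsilon=0.01, gamma=0.5, unlab_weight=0.3):
    """Batch illustration of (9)-(10): weighted eps-SVR on labeled data plus
    unlabeled data whose targets come from the neighborhood KE."""
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, y_unlab_est])
    # Labeled samples get full weight (C_i^v), KE-labeled ones a smaller weight (C_j^u).
    w_all = np.concatenate([np.ones(len(y_lab)),
                            unlab_weight * np.ones(len(y_unlab_est))])
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
    model.fit(X_all, y_all, sample_weight=w_all)
    return model
```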
Then, when the unlabeled data are estimated by the kernel regression technique, the optimization for the S3VM is transformed into a standard SVM optimization problem. The dual problem of (9) is
$$\begin{aligned} \max L = & -\frac{1}{2}\sum_{i=1}^{m+n}\sum_{j=1}^{m+n}\left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right)\left\langle X_i, X_j\right\rangle - \varepsilon \sum_{i=1}^{m+n}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{m+n} Y_i\left(\alpha_i - \alpha_i^*\right) \\ \text{s.t.} \quad & \sum_{i=1}^{m+n}\left(\alpha_i - \alpha_i^*\right) = 0, \quad 0 < \alpha_i, \alpha_i^* < C. \end{aligned} \qquad (11)$$
Define two new vectors
$$\beta = \Big[\underbrace{1, \ldots, 1}_{m+n}, \underbrace{-1, \ldots, -1}_{m+n}\Big], \qquad \gamma = \left[\alpha_1, \ldots, \alpha_{m+n}, \alpha_1^*, \ldots, \alpha_{m+n}^*\right].$$
Then the optimization problem (11) is rewritten as
$$\begin{aligned} \max L = & -\frac{1}{2}\sum_{i=1}^{2(m+n)}\sum_{j=1}^{2(m+n)} \beta_i \gamma_i \beta_j \gamma_j \left\langle X_i, X_j\right\rangle - \varepsilon \sum_{i=1}^{2(m+n)} \gamma_i + \sum_{i=1}^{2(m+n)} \beta_i Y_i \gamma_i \\ \text{s.t.} \quad & \sum_{i=1}^{2(m+n)} \beta_i \gamma_i = 0, \quad 0 < \gamma_i < C. \end{aligned} \qquad (12)$$
The constrained optimization problem (12) is relaxed to an unconstrained convex quadratic function by introducing a Lagrangian factor $\mu$
$$\max L = -\frac{1}{2}\sum_{i=1}^{2(m+n)}\sum_{j=1}^{2(m+n)} \beta_i \gamma_i \beta_j \gamma_j \left\langle X_i, X_j\right\rangle - \varepsilon \sum_{i=1}^{2(m+n)} \gamma_i + \sum_{i=1}^{2(m+n)} \beta_i Y_i \gamma_i + \mu \sum_{i=1}^{2(m+n)} \beta_i \gamma_i. \qquad (13)$$
The saddle point of the performance function $L$ is given by the KKT conditions, that is,
$$g_i = \frac{\partial L}{\partial \gamma_i} = \sum_{j=1}^{2(m+n)} H_{ij}\,\gamma_j + \mu\beta_i + \beta_i Y_i - \varepsilon, \qquad \frac{\partial L}{\partial \mu} = \sum_{i=1}^{2(m+n)} \beta_i \gamma_i = 0 \qquad (14)$$
where $H$ is a positive semidefinite matrix with $H_{ij} = \beta_i \beta_j \left\langle X_i, X_j\right\rangle$. According to the value of $g_i$, the whole training sample can be classified into two independent subsets, $S$ and $R$. $S = \{i : g_i = 0,\ 0 < \alpha_i < C\}$ is the subset of support vectors, which includes the margin support vectors lying strictly on the $\varepsilon$-tube. $R = \{i \in E \cup O\}$, with $E = \{i : g_i \le 0,\ \alpha_i = C\}$ and $O = \{i : g_i \ge 0,\ \alpha_i = 0\}$, is the subset of remaining vectors $O$ covered by the $\varepsilon$-tube and error support vectors $E$ exceeding the $\varepsilon$-tube.

The increment strategy is based on the basic incremental learning of $\gamma$-support vector regression [42]. An increment $\Delta\alpha_c$ is obtained when a new datum $x_c$, whose initial weight value is $\alpha_c$, is added for training. The KKT condition (14) can also be written in an incremental form
$$\Delta g_i = h_{ic}\,\Delta\alpha_c + \sum_{j=1}^{2(m+n)} h_{ij}\,\Delta\alpha_j + \beta_i\,\Delta\mu, \qquad 0 = \beta_c\,\Delta\alpha_c + \sum_{i=1}^{2(m+n)} \beta_i\,\Delta\alpha_i. \qquad (15)$$
It can also be transformed into matrix form according to the subsets defined above
$$\begin{bmatrix} \Delta g_c \\ \Delta g_s \\ \Delta g_r \\ 0 \end{bmatrix} = \begin{bmatrix} \beta_c & h_{cs} \\ \beta_s & h_{ss} \\ \beta_r & h_{rs} \\ 0 & \beta_s \end{bmatrix} \begin{bmatrix} \Delta\mu \\ \Delta\alpha_s \end{bmatrix} + \Delta\alpha_c \begin{bmatrix} h_{cc} \\ h_{cs} \\ h_{cr} \\ \beta_c \end{bmatrix}. \qquad (16)$$
Here, the subscript $c$ refers to the new datum $x_c$, and the subscripts $s$ and $r$ refer to the subsets $S$ and $R$, respectively. For the support vector subset $S$ we have $\Delta g_s = 0$. Then, extracting lines 4 and 2 from (16), the system can be rewritten as
$$\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & \beta_s^T \\ \beta_s & h_{ss} \end{bmatrix} \Delta s + \Delta\alpha_c \left[\beta_c \;\; h_{cs}\right]^T \qquad (17)$$
with $\Delta s = \begin{bmatrix} \Delta\mu \\ \Delta\alpha_s \end{bmatrix}$. Then, $\Delta s$ is easily found to be linear in $\Delta\alpha_c$, that is,
$$\Delta s = \zeta\,\Delta\alpha_c \qquad (18)$$
with $\zeta = -Q^{-1}\left[\beta_c \;\; h_{cs}\right]^T$ and $Q = \begin{bmatrix} 0 & \beta_s^T \\ \beta_s & h_{ss} \end{bmatrix}$. Further substituting (18) into the first and third lines of (16) yields
$$\begin{bmatrix} \Delta g_c \\ \Delta g_r \end{bmatrix} = \xi\,\Delta\alpha_c \qquad (19)$$
where $\xi = \begin{bmatrix} \beta_c & h_{cs} \\ \beta_r & h_{rs} \end{bmatrix}\zeta + \begin{bmatrix} h_{cc} \\ h_{cr} \end{bmatrix}$. Then all the parameters in the original model are given an incremental form when a new point is added for learning. But a new problem arises that must be considered.

D. Discussion About the Maximal Increment of $\Delta\alpha_c$

The system above cannot be applied directly to obtain the new SVM state, since the composition of the sets $S$ and $R$ will change as the new data are added. In order to
handle this problem, the main strategy of the algorithm is to identify the largest increment $\Delta\alpha_c^{\max}$ at which some points migrate between the sets $S$ and $R$. Four cases can occur, attributed to the structural changes.

Case 1: Some points in $S$ become bound support vectors. First, the set $S$ is divided into two subsets
$$I_+^S = \{i \in S : \zeta_i > \varepsilon\}, \qquad I_-^S = \{i \in S : \zeta_i < -\varepsilon\}.$$
The samples in set $I_+^S$ have positive sensitivity with respect to the weight of the current sample; that is, their weight would increase with $\Delta\alpha_c$ and reach $\alpha_i = C$, so these training data should be tested. Likewise, the samples in $I_-^S$ should be tested for whether they will reach zero due to their negative sensitivity. Data satisfying $-\varepsilon < \zeta_i < \varepsilon$ are ignored, as they are not sensitive to $\Delta\alpha_c$. Thus, the possible weight updates are
$$\Delta\alpha_i^{\max} = \begin{cases} C - \alpha_i, & i \in I_+^S \\ -\alpha_i, & i \in I_-^S. \end{cases}$$
The largest possible $\Delta\alpha_c^S$ before a datum moves from $S$ to $R$ is
$$\Delta\alpha_c^S = \min_i \frac{\Delta\alpha_i^{\max}}{\zeta_i}.$$
Case 2: $g_i$ of some data in $R$ reaches zero, which means these points in $R$ might reach the upper or lower bound. First, divide the set $R$ into two subsets
$$I_+^R = \{i \in E : \xi_i > \varepsilon\}, \qquad I_-^R = \{i \in O : \xi_i < -\varepsilon\}.$$
Then, the largest increment of $\alpha_c$ is
$$\Delta\alpha_c^R = \min_i \frac{-g_i}{\xi_i}.$$
Case 3: $g_c$ of the new datum reaches zero. This case is similar to Case 2, and the largest increment is computed as
$$\Delta\alpha_c^g = \frac{-g_c}{\xi_c}.$$
Case 4: For $\alpha_c = C$, the largest increment is
$$\Delta\alpha_c^\alpha = C - \alpha_c.$$
Considering all four cases above, the maximal increment of $\alpha_c$ is
$$\Delta\alpha_c^{\max} = \min\left(\Delta\alpha_c^S,\ \Delta\alpha_c^R,\ \Delta\alpha_c^g,\ \Delta\alpha_c^\alpha\right). \qquad (20)$$

Fig. 3. Flowchart of SE-INC-SVM.

E. Recursive Update of Inverse Matrix

After determining the maximal value $\Delta\alpha_c^{\max}$, $\Delta s$ and $\Delta g$ can be calculated. Since the computation of the inverse matrix $Q^{-1}$ requires large memory and a long time, which reduces the efficiency of the proposed algorithm, a recursive update strategy for the inverse matrix is employed. First, consider the case of adding a new sample to the learning algorithm. Assuming the new training datum becomes a support vector entering the set $S$, the inverse matrix of $Q$ can be expanded as
$$\tilde{Q}^{-1} = \begin{bmatrix} 0 & \beta_s^T & \beta_k \\ \beta_s & h_{ss} & h_{ks} \\ \beta_k & h_{ks} & h_{kk} \end{bmatrix}^{-1} = \begin{bmatrix} Q & \eta_k \\ \eta_k^T & h_{kk} \end{bmatrix}^{-1}. \qquad (21)$$
Here, the subscript $k$ corresponds to the new datum, and we define
$$\zeta_k = -Q^{-1}\eta_k.$$
According to the Sherman-Morrison-Woodbury formula, the update of $\tilde{Q}^{-1}$ is
$$\tilde{Q}^{-1} = \begin{bmatrix} Q^{-1} + \kappa^{-1}\zeta_k\zeta_k^T & \kappa^{-1}\zeta_k \\ \kappa^{-1}\zeta_k^T & \kappa^{-1} \end{bmatrix} = \begin{bmatrix} Q^{-1} & 0 \\ 0 & 0 \end{bmatrix} + \frac{1}{\kappa}\begin{bmatrix} \zeta_k \\ 1 \end{bmatrix}\begin{bmatrix} \zeta_k^T & 1 \end{bmatrix} \qquad (22)$$
with $\kappa = h_{kk} - \eta_k^T Q^{-1} \eta_k$. This recursive update is much more efficient than simply inverting the matrix $Q$, especially when $Q$ is large and the data are not sparse.
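A minimal numerical sketch of the rank-one expansion (21) and (22), assuming numpy; the variable names mirror the symbols above, and the check against a direct inverse (with a generic invertible Q) is added only for illustration.

```python
import numpy as np

def expand_inverse(Q_inv, eta_k, h_kk):
    """Grow the stored inverse when a new support vector k enters S, per (21)-(22)."""
    zeta_k = -Q_inv @ eta_k                   # zeta_k = -Q^{-1} eta_k
    kappa = h_kk - eta_k @ Q_inv @ eta_k      # kappa = h_kk - eta_k^T Q^{-1} eta_k
    n = Q_inv.shape[0]
    Q_tilde_inv = np.zeros((n + 1, n + 1))
    Q_tilde_inv[:n, :n] = Q_inv
    v = np.append(zeta_k, 1.0)
    return Q_tilde_inv + np.outer(v, v) / kappa   # rank-one correction of (22)

# Quick consistency check against a direct inverse (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); Q = A @ A.T + 4 * np.eye(4)
eta = rng.standard_normal(4); h = 10.0
Q_tilde = np.block([[Q, eta[:, None]], [eta[None, :], np.array([[h]])]])
assert np.allclose(expand_inverse(np.linalg.inv(Q), eta, h), np.linalg.inv(Q_tilde))
```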
TABLE I. Experimental data sets.

TABLE II. Training MSE for different ratios of labeled data.

Now, consider the case of deleting a datum from the support vector subset $S$. This is the inverse process, i.e., obtaining $Q^{-1}$ from the known $\tilde{Q}^{-1}$. Suppose that
$$\tilde{Q}^{-1} = \begin{bmatrix} Q & \eta_k \\ \eta_k^T & h_{kk} \end{bmatrix}^{-1} = \begin{bmatrix} q_{11} & q_{12} \\ q_{21} & q_{22} \end{bmatrix}. \qquad (23)$$
Comparing (22) and (23), we have $q_{11} = Q^{-1} + \kappa^{-1}\zeta_k\zeta_k^T$, $q_{12} = \kappa^{-1}\zeta_k$, $q_{21} = \kappa^{-1}\zeta_k^T$, and $q_{22} = \kappa^{-1}$. Then the inverse matrix after one support vector is deleted can be computed as
$$Q^{-1} = q_{11} - \frac{q_{12}\, q_{21}}{q_{22}}. \qquad (24)$$
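Conversely, removing a support vector only needs the partition (23) and the correction (24). A small sketch, assuming numpy and that the vector to be deleted occupies the last row and column of the stored inverse; permuting a mid-set vector to that position is bookkeeping omitted here.

```python
import numpy as np

def shrink_inverse(Q_tilde_inv):
    """Recover Q^{-1} after deleting the last support vector, per (23)-(24)."""
    q11 = Q_tilde_inv[:-1, :-1]
    q12 = Q_tilde_inv[:-1, -1:]
    q21 = Q_tilde_inv[-1:, :-1]
    q22 = Q_tilde_inv[-1, -1]
    return q11 - (q12 @ q21) / q22
```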
In order to show the whole learning process more clearly, a flowchart of the proposed algorithm is given in Fig. 3.

III. EXPERIMENTS AND SIMULATIONS

A. Manual Set Testing Experiments

In order to verify the validity of the incremental S3VM, this paper selects ten manual data sets for validation, given in the Appendix. All these data sets are popular for testing learning algorithms. The data sets are generated by drawing the variables uniformly from the stated ranges, and 3% random noise is added to all data to get closer to realistic conditions. The number of samples is the total number of samples, including training and test data. Table I shows the properties of the artificially sampled data. Here, 75% of the total samples are selected as training data and the remainder as test samples. Among the training samples, we choose 10%, 30%, and 50% of the samples as labeled ones, respectively, and the corresponding 90%, 70%, and 50% as unlabeled samples whose labels are deleted. Before training and learning, all data are normalized, i.e., converted into [0, 1]. Tenfold cross-validation is used to ensure the stability and effectiveness of the test experiment. The mean square error (MSE) is employed to measure the performance of the regression algorithm
$$\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{l=1}^{N} \varepsilon_l^2}.$$
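A brief sketch of the evaluation protocol just described (min-max scaling into [0, 1], random labeled/unlabeled splits, and the root-form MSE above), assuming numpy; the helper names are illustrative.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column into [0, 1], as done before training."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def mse(residuals):
    """Error measure used in the experiments: sqrt(sum(eps_l^2) / N)."""
    residuals = np.asarray(residuals)
    return np.sqrt(np.mean(residuals ** 2))

def split_labels(n_train, labeled_ratio=0.1, seed=0):
    """Randomly mark labeled_ratio of the training samples as labeled (10/30/50%)."""
    rng = np.random.default_rng(seed)
    labeled = np.zeros(n_train, dtype=bool)
    labeled[rng.choice(n_train, int(labeled_ratio * n_train), replace=False)] = True
    return labeled
```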
Another semisupervised algorithm, HELP-TRAINING-SVM [39] (HELP for short), is used for comparison with the proposed algorithm. Table II and Fig. 4 show the training MSE of the two algorithms on the different manual data sets with different ratios of labeled data. The MSE of SE-INC-SVM is lower than that of the HELP algorithm on most of the manual data sets, since all the parameters in SE-INC-SVM are adjusted dynamically to reduce the training error. HELP adjusts its parameters according to the highest-confidence training sample, which limits the model accuracy to some extent.
Fig. 4. Training MSE of different label ratios.
Now, the different ratios of labeled data are analyzed. The more labeled data there are, the lower the MSE is for most of the training results. In other words, more training samples guarantee adequate training and stronger generalization ability. This also accords with the general law of supervised learning algorithms. Table III and Fig. 5 show the training time of SE-INC-SVM and HELP-SVM on the different training sets. Obviously, the training time of SE-INC-SVM is less than that of the HELP algorithm. The high learning efficiency of the proposed algorithm comes from its unique learning strategy. HELP-SVM evaluates every unlabeled datum to look for the highest-confidence sample and then adds it to the labeled set, retraining the model at each iteration; this learning scheme makes training slow. In contrast, SE-INC-SVM labels the unlabeled data directly, once and for all, which simplifies the training process and improves the learning speed. This effect is even
TABLE III. Training time for different ratios of labeled data.
Fig. 5. Training time of different label ratios.

Fig. 6. Test MSE of different label ratios.

Fig. 7. Training error and prediction error with different (a) C and (b) σ².

TABLE IV. Test MSE for different ratios of labeled data.
greater when the ratio of labeled samples is high. SE-INC-SVM maintains a fixed number of parameters during the training process and is ready to add and delete data from the model. There are fewer support vectors in the model, which naturally leads to faster training. Table IV and Fig. 6 show the test MSE of the two algorithms for different proportions of labeled samples. The test MSE in Fig. 6 is plotted on a logarithmic scale to show the results more clearly. Regardless of the percentage of labeled data, the test MSE of SE-INC-SVM is always less than that of HELP-TRAINING-SVM. For the proposed algorithm, the model output is closer to the real relation due to the KE and the incremental training strategy, so SE-INC-SVM shows good generalization ability. Let us compare the model generalization error under different proportions of labeled samples. It is shown that the MSE decreases obviously when the labeled
ratio increases from 10% to 30%. When the ratio increases from 30% to 50%, the reduction of the MSE is smaller than that from 10% to 30%. More training samples are added for learning when the percentage of labeled samples is increased, and the generalization ability of the model improves to some extent. Once the model has been fully trained, adding more labeled samples will not significantly improve the generalization ability.

Now, we explore the effect of the parameters on the learning model more deeply. There are three factors in the proposed SVM algorithm: 1) the penalty factor C; 2) the kernel function parameter σ²; and 3) the sensitivity parameter ε. Fig. 7(a) shows the training error and test error for different values of the penalty factor C. The smaller the penalty factor is, the larger both the training error and the test error are, which results from insufficient training. As the penalty factor increases, the model is trained sufficiently, which decreases both errors. Fig. 7(b) explores the relation between the kernel parameter σ² and the model errors. The training error and test error follow a similar tendency as the kernel factor increases; a kernel factor that is too large or too small leads to higher error. So it is necessary to find a proper value to balance the training error and the test error. Fig. 8 studies the sensitivity parameter, which is also called the ε-tube width. It is shown that the prediction error and the training error have opposite trends: the wider the tube is, the more samples it covers, which inevitably leads to a decrease of the training error and an increase of the prediction error.

Fig. 8. Training error and prediction error with different ε.

Fig. 9. Penicillin fermentation process.

Fig. 10. Biomass concentration estimation result. (a) No noise. (b) 3% noise.

Fig. 11. Absolute error for biomass concentration estimation. (a) No noise. (b) 3% noise.
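The parameter study above, together with the tenfold cross-validation used later to fix C = 3, σ² = 2, and ε = 0.01, can be reproduced in outline with a standard cross-validated grid search. The sketch below uses scikit-learn's SVR rather than the authors' own solver, and the grid values are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def tune_svr(X_lab, y_lab):
    """Tenfold cross-validated search over C, the kernel width, and epsilon."""
    param_grid = {
        "C": [0.5, 1, 3, 10],                               # penalty factor
        "gamma": [1 / (2 * s2) for s2 in (0.5, 1, 2, 4)],   # gamma = 1 / (2 * sigma^2)
        "epsilon": [0.005, 0.01, 0.05],                     # epsilon-tube width
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                          cv=10, scoring="neg_mean_squared_error")
    search.fit(X_lab, y_lab)
    return search.best_params_, -search.best_score_
```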
B. Modeling Application in Penicillin Fermentation

Penicillin fermentation is a classical benchmark batch process that is hard to describe with a mechanistic mathematical model. A well-known platform [40], Pensim, has been widely used by the modeling, control, and monitoring community as a source of data for comparing various approaches. The flow sheet of the fermentation process is shown in Fig. 9. Three main steps should be considered before this batch process is modeled: 1) selection of the model variables; 2) setting of the model parameters; and 3) data preprocessing.

1) Selection of the Model Variables: Since many factors affect the penicillin fermentation process, it is important to select proper input and output variables to develop a reasonable black-box model. We should consider not only the impact of process variables on the product, but also the convenience of obtaining the data from the actual process. According to the process mechanism and correlation analysis [41], the culture volume, carbon dioxide concentration, and dissolved oxygen concentration are the three most important variables, so they are selected as the model inputs. The penicillin concentration and the cell concentration are the most important yields, so these two variables are selected as the outputs.
2) Model Parameter Setting: The previous experiment (manual data testing) and analysis discussed the effect of the parameters in SE-INC-SVM. Similarly, tenfold cross-validation is used to find the most suitable value of each parameter, i.e., C = 3, σ² = 2, and ε = 0.01.

3) Data Preprocessing: Since different dimensions and value ranges among these process data would result in an inaccurate model, it is necessary to eliminate these differences. Here, a standardized data processing method, min-max normalization, is used for preprocessing.

First, six batches of data are generated from Pensim under a normal condition. The sampling time of each batch is 0.5 h, except for the labeled data, whose sampling time is 8 h. The normal production run of penicillin fermentation is about 400 h, so we obtain 800 samples in one batch, of which 50 samples are labeled and the remaining 750 are unlabeled. These data resemble real industrial production, because industrial processes easily generate unlabeled data. We randomly select two batches of data as training data; one batch is in the normal operating condition and the other has 3% Gaussian noise added. The first 80 data points are used to obtain an initial model according to the incremental learning strategy of SE-INC-SVM. Then the model is trained continuously as each new datum arrives. The MSE and the absolute error are used to evaluate SE-INC-SVM and the HELP algorithm.

Figs. 10 and 11 show the estimation results for the biomass concentration under two conditions (no noise and 3% noise). SE-INC-SVM gives a better estimate than HELP-SVM, especially in the early period of the fermentation process. The proposed learning algorithm can quickly update the training model and decrease the model training error due to its
TABLE V. MSE of different algorithms' estimation.

Fig. 12. Penicillin concentration estimation result. (a) No noise. (b) 3% noise.

Fig. 13. Absolute error for penicillin concentration estimation. (a) No noise. (b) 3% noise.

incremental training strategy, which is very helpful for improving the prediction accuracy. The HELP algorithm performs even worse under the noisy operating condition, and its prediction accuracy is greatly reduced. Moreover, the generalization of SE-INC-SVM remains good. This is because SE-INC-SVM uses feedback, which reduces the sensitivity to noise, while HELP does not consider how to deal with unsmooth data. Therefore, the estimation error of SE-INC-SVM remains relatively small on noisy data. Table V shows the MSE of the two algorithms' estimates of the penicillin and biomass concentrations; the proposed algorithm again gives better results than the HELP algorithm. Figs. 12 and 13 show the estimates of the penicillin concentration. The results are similar to the cell concentration estimation but on a different scale. This further proves that the SE-INC model algorithm performs better whether with noise or not.

IV. CONCLUSION

A new semisupervised learning algorithm, SE-INC-SVM, is proposed, in which a kernel estimation strategy is employed to estimate the unlabeled data. The prediction accuracy is improved because the hidden information of the unlabeled data is fully exploited. The proposed algorithm also uses an incremental training scheme to realize recursive training, in order to enhance the training speed and reduce the training error. The new algorithm is tested on ten classic manual data sets and almost always achieves better results than HELP-training SVM, another semisupervised learning method. Furthermore, the application to penicillin fermentation is also successful: the proposed algorithm is shown to keep higher accuracy and training speed. Future work is to apply the SE-INC-SVM algorithm to more complex industrial problems.

APPENDIX

Data 1: 2-D Mexican Hat function
$$y = \operatorname{sinc}|x| = \frac{\sin|x|}{|x|}, \quad x \in U[-2\pi, 2\pi].$$
Data 2: 3-D Mexican Hat function
$$y = \operatorname{sinc}\sqrt{x_1^2 + x_2^2} = \frac{\sin\sqrt{x_1^2 + x_2^2}}{\sqrt{x_1^2 + x_2^2}}, \quad x_i \in U[-4\pi, 4\pi].$$
Data 3: Friedman #1 function
$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5, \quad x_i \in U[0, 1].$$
Data 4: Friedman #2 function
$$y = \sqrt{x_1^2 + \left(x_2 x_3 - \frac{1}{x_2 x_4}\right)^2}, \quad x_1 \in U[0, 100],\ x_2 \in U[40\pi, 560\pi],\ x_3 \in U[0, 1],\ x_4 \in U[1, 11].$$
Data 5: Friedman #3 function
$$y = \tan^{-1}\left(\frac{x_2 x_3 - \frac{1}{x_2 x_4}}{x_1}\right), \quad x_1 \in U[0, 100],\ x_2 \in U[40\pi, 560\pi],\ x_3 \in U[0, 1],\ x_4 \in U[1, 11].$$
Data 6: Gabor function
$$y = \frac{\pi}{2}\exp\left[-2\left(x_1^2 + x_2^2\right)\right], \quad x_i \in U[0, 1].$$
Data 7: Multi function
$$y = 0.79 + 1.27 x_1 x_2 + 1.56 x_1 x_4 + 3.42 x_2 x_5 + 2.06 x_3 x_4 x_5, \quad x_i \in U[0, 1].$$
Data 8: Plane function
$$y = 0.6 x_1 + 0.3 x_2, \quad x_i \in U[0, 1].$$
Data 9: Polynomial function
$$y = 1 + 2x + 3x^2 + 4x^3 + 5x^4, \quad x \in U[0, 1].$$
Data 10: SinC function
$$y = \frac{\sin(x)}{x}, \quad x \in U[0, 2\pi].$$
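As an example of how such a benchmark set can be produced with the 3% noise mentioned in Section III-A, the following sketch generates Data 3 (Friedman #1). The noise model, zero-mean Gaussian with a standard deviation of 3% of the output range, is one plausible reading of "3% random noise" and is an assumption here.

```python
import numpy as np

def friedman1(n_samples, noise_ratio=0.03, seed=0):
    """Generate the Friedman #1 data set (Data 3) with proportional Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 5))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])
    y += rng.normal(0.0, noise_ratio * (y.max() - y.min()), size=n_samples)
    return X, y
```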
R EFERENCES [1] C. L. P. Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on big data,” Inf. Sci., vol. 275, pp. 314–347, Aug. 2014. [2] A. Singla, S. Patra, and L. Bruzzone, “A novel classification technique based on progressive transductive SVM learning,” Pattern Recognit. Lett., vol. 42, pp. 101–106, Jun. 2014. [3] C. Chen et al., “Adaptive fuzzy asymptotic control of MIMO systems with unknown input coefficients via a robust Nussbaum gain based approach,” IEEE Trans. Fuzzy Syst., to be published, doi: 10.1109/TFUZZ.2016.2604848. [4] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 42, no. 2, pp. 513–529, Apr. 2012. [5] Y.-J. Liu and S. C. Tong, “Optimal control-based adaptive NN design for a class of nonlinear discrete-time block-triangular systems,” IEEE Trans. Cybern., vol. 46, no. 11, pp. 2670–2680, Nov. 2016. [6] C. L. P. Chen, D. C. Tao, and X. G. You, “Big learning in social media analytics,” Neurocomputing, vol. 204, pp. 1–2, Sep. 2016. [7] B. Fan, X. J. Lu, and H.-X. Li, “Probabilistic inference-based least squares support vector machine for modeling under noisy environment,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 46, no. 12, pp. 1703–1710, Dec. 2016. [8] P. J.-H. Hu et al., “Managing clinical use of high-alert drugs: A supervised learning approach to pharmacokinetic data analysis,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 37, no. 4, pp. 481–492, Jul. 2007. [9] Y. Wang et al., “Learning can improve the blood glucose control performance for type 1 diabetes mellitus,” Diabetes Technol. Ther., vol. 19, no. 1, pp. 41–48, Jan. 2017. [10] D. Zhao, Z. Lin, and Y. Q. Wang, “Integrated state/disturbance observers for two-dimensional linear systems,” IET Control Theory Appl., vol. 9, no. 9, pp. 1373–1383, Jun. 2015. [11] Y.-J. Liu, J. Li, S. C. Tong, and C. L. P. Chen, “Neural network controlbased adaptive learning design for nonlinear systems with full-state constraints,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 7, pp. 1562–1571, Jul. 2016 [12] D. Zhao, D. Shen, and Y. Q. Wang, “Fault diagnosis and compensation for two-dimensional discrete time systems with sensor faults and time varying delays,” Int. J. Robust Nonlin. Control, pp. 1–25, Jan. 2017, to be published, doi: 10.1002/rnc.3742. [13] G. X. Wen, C. L. P. Chen, Y.-J. Liu, and Z. Liu, “Neural network-based adaptive leader-following consensus control for a class of nonlinear multiagent state-delay systems,” IEEE Trans. Cybern., to be published, doi: 10.1109/TCYB.2016.2608499. [14] T. I. Dhamecha, R. Singh, and M. Vatsa, “On incremental semi-supervised discriminant analysis,” Pattern Recognit., vol. 52, pp. 135–147, Apr. 2016. [15] P. K. Mallapragada, R. Jin, A. K. Jain, and Y. Liu, “SemiBoost: Boosting for semi-supervised learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 2000–2014, Nov. 2009. [16] M. R. Sabuncu, B. T. T. Yeo, K. V. Leemput, B. Fischl, and P. Golland, “A generative model for image segmentation based on label fusion,” IEEE Trans. Med. Imag., vol. 29, no. 10, pp. 1714–1729, Oct. 2010. [17] K. Zhang, R. Gonzalez, B. Huang, and G. Ji, “Expectation–maximization approach to fault diagnosis with missing data,” IEEE Trans. Ind. Electron., vol. 62, no. 2, pp. 1231–1240, Feb. 2015. [18] Y.-J. Liu and S. 
Tong, “Barrier Lyapunov functions for Nussbaum gain adaptive control of full state constrained nonlinear systems,” Automatica, vol. 76, pp. 143–152, Feb. 2017. [19] Y. Li, C. Guan, H. Li, and Z. Chin, “A self-training semi-supervised SVM algorithm and its application in an EEG-based brain computer interface speller system,” Pattern Recognit. Lett., vol. 29, no. 9, pp. 1285–1294, Jul. 2008. [20] Y. He and D. Zhou, “Self-training from labeled features for sentiment analysis,” Inf. Process. Manag., vol. 47, no. 4, pp. 606–616, Jul. 2011. [21] M. Darnstädt, H. U. Simon, and B. Szörényi, “Supervised learning and co-training,” Theor. Comput. Sci., vol. 519, pp. 68–87, Jan. 2014. [22] Y.-J. Liu, S. C. Tong, C. L. P. Chen, and D.-J. Li, “Neural controller design-based adaptive control for nonlinear MIMO systems with unknown hysteresis inputs,” IEEE Trans. Cybern., vol. 46, no. 1, pp. 9–19, Jan. 2016, doi: 10.1109/TCYB.2015.2388582. [23] J. X. Huang, J. Miao, and B. He, “High performance query expansion using adaptive co-training,” Inf. Process. Manag., vol. 49, no. 2, pp. 441–453, Mar. 2013.
[24] M.-F. Balcan and A. Blum, “A discriminative model for semi-supervised learning,” J. ACM, vol. 57, pp. 517–527, Mar. 2008. [25] Z.-H. Zhou and M. Li, “Semisupervised regression with cotrainingstyle algorithms,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 11, pp. 1479–1493, Nov. 2007. [26] B.-W. Chen, C.-Y. Chen, and J.-F. Wang, “Smart homecare surveillance system: Behavior identification based on state-transition support vector machines and sound directivity pattern analysis,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 43, no. 6, pp. 1279–1289, Nov. 2013. [27] J. Sun, H. Li, and H. Adeli, “Concept drift-oriented adaptive and dynamic support vector machine ensemble with time window in corporate financial risk prediction,” IEEE Trans. Syst., Man, Cybern., Syst., vol. 43, no. 4, pp. 801–813, Jul. 2013. [28] Y.-F. Li and Z.-H. Zhou, “Towards making unlabeled data never hurt,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 175–188, Jan. 2014. [29] D. Needell, N. Srebro, and R. Ward, “Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm,” Math. Program., vol. 2, pp. 1–25, Feb. 2015. [30] M. Markov, M. Saghafi, I. A. Hiskens, and H. Dankowicz, “Continuation techniques for reachability analysis of uncertain power systems,” in Proc. IEEE Int. Symp. Circuits Syst., Melbourne, VIC, Australia, 2014, pp. 1816–1819. [31] H. M. Le, H. A. L. Thi, and M. C. Nguyen, “Sparse semi-supervised support vector machines by DC programming and DCA,” Neurocomputing, vol. 153, pp. 62–76, Apr. 2015. [32] F. K. W. Chan, H. C. So, W.-K. Ma, and K. W. K. Lui, “A flexible semi-definite programming approach for source localization problems,” Digit. Signal Process., vol. 23, no. 2, pp. 601–609, Mar. 2013. [33] M. S. Mehmetoglu, E. Akyol, and K. Rose, “A deterministic annealing approach to optimization of zero-delay source-channel codes,” in Proc. IEEE Inf. Theory Workshop, Seville, Spain, 2013, pp. 1–5. [34] S. Nakariyakul, “A comparative study of suboptimal branch and bound algorithms,” Inf. Sci., vol. 278, pp. 545–554, Sep. 2014. [35] T. C. Liu, Y. Yang, G.-B. Huang, Y. K. Yeo, and Z. P. Lin, “Driver distraction detection using semi-supervised machine learning,” IEEE Trans. Intell. Transp. Syst., vol. 17, no. 4, pp. 1108–1120, Apr. 2016. [36] G. B. Huang, S. Song, J. N. D. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2405–2417, Dec. 2014. [37] H. J. Bierens, “The Nadaraya–Watson kernel regression function estimator,” in Topics in Advanced Econometrics. New York, NY, USA: Cambridge Univ. Press, 1994, pp. 212–247. [38] A. Golbabai, S. Seifollahi, and M. Javidi, “Normalized RBF networks: Application to a system of integral equations,” Phys. Scripta, vol. 78, no. 1, pp. 1302–1314, Jul. 2008. [39] Y.-H. Cheng, J. I. Jie, and X.-S. Wang, “Semi-supervised support vector regression based on help-training,” Control Decis., vol. 27, no. 2, pp. 205–210, Feb. 2012. [40] G. Birol, C. Ündey, and A. Çinar, “A modular simulation package for fed-batch fermentation: Penicillin production,” Comput. Chem. Eng., vol. 26, no. 11, pp. 1553–1565, Nov. 2002. [41] J. Wang, H. T. Wei, L. L. Cao, and Q. B. Jin, “Soft-transition subPCA fault monitoring of batch processes,” Ind. Eng. Chem. Res., vol. 52, no. 29, pp. 9879–9888, Jun. 2013. [42] B. Gu et al., “Incremental learning for γ -support vector regression,” Neural Netw., vol. 67, pp. 140–150, Mar. 2015.
Jing Wang received the B.S. degree in industry automation, and the Ph.D. degree in control theory and control engineering from Northeastern University, Boston, MA, USA, in 1994 and 1998, respectively. She is a Professor with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China. She was a Visiting Professor with the University of Delaware, Newark, DE, USA, in 2014. Her current research interests include data-driven modeling, optimization and control for complex industrial process, nonlinear model-based control of polymer microscopic quality in chemical reactor, process monitoring, and fault diagnosis for complex industrial process.
Daiwei Yang received the B.S. and M.S. degrees in automation from the Beijing University of Chemical Technology, Beijing, China, in 2013 and 2016, respectively, where he is currently pursuing the Ph.D. degree with the College of Information Science and Technology. His current research interests include machine learning and artificial intelligence in big data environment.
Wei Jiang received the B.S. and M.S. degrees in automation from the Beijing University of Chemical Technology, Beijing, China, in 2012 and 2015, respectively. He is a Software Engineer with the China Huanqiu Contracting and Engineering Corporation, Beijing. His current research interests include machine learning and artificial intelligence in big data environment, data-driven modeling, and control for complex chemical process.
Jinglin Zhou (M’14) received the B.Eng. degree from Daqing Petroleum Institute, Daqing, China, in 1999, the M.Sc. degree from Hunan University, Changsha, China, in 2002, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2005. He was an Academic Visitor with the Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield, U.K. He is currently a Professor with the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing. His current research interests include stochastic distribution control, fault detection and diagnosis, and variable structure control and their applications.