Application of Support Vector Machines to Nonlinear System Identification

Haina Rong, Gexiang Zhang, Cuifang Zhang
School of Computer and Communication Engineering, Southwest Jiaotong University, Chengdu 610031, Sichuan, P.R. China
Email: [email protected]
Abstract

Choosing the kernel function for approximating a function is a key research issue for support vector machines (SVMs). Different kernel functions form different SVM models with distinct performances. In this paper, after the nonlinear system identification method using SVMs is discussed, a criterion for choosing the kernel function for system identification is given, and the effects of the kernel parameters are discussed. In the experiments, several kernel functions are used to form different SVM models, and each model is used to identify a typical nonlinear system. To analyze the effect of a parameter on the SVM, a wide range of parameter values is employed in the identification experiments. A large number of experimental results show that the radial basis kernel function is a good choice for identifying a nonlinear system using an SVM.
1. Introduction

In control systems, the relationship between the input and the output is usually determined by learning from samples. The identification problem for linear systems has been solved theoretically, but there is still no ideal method for nonlinear system identification [1]. Artificial neural networks (ANNs) are generally used in system identification because an ANN can approximate an arbitrary nonlinear function. However, several drawbacks, including local minima, overlearning and the difficulty of choosing the network structure, greatly confine the applications and development of ANNs [1-2].

The support vector machine (SVM), developed by Vapnik [2], is based on statistical learning theory and has gained popularity because of its many attractive characteristics and good performance. Compared with traditional neural networks, the SVM has a strict theoretical foundation, good generalization capability and no local minima. The SVM can solve high-dimensional problems and therefore avoids the "curse of dimensionality". The main reason that the SVM attracts more and more attention is that it adopts the structural risk minimization (SRM) principle, which has been shown to be superior to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks [3]. The SVM has therefore become a focus of machine learning after neural networks, and has been applied to the system identification problem [7] after its successful application in pattern recognition [4], regression [5] and control theory [7].

Different from other machine learning methods, the SVM transforms the input space into a feature space through a kernel function, so that a nonlinear problem in the input space becomes a linear problem in the feature space [2]. Since different kernel functions form different SVMs with different performances, kernel selection is the key problem when the SVM is used to solve classification or regression problems [3,5], and it strongly affects the solution of the regression problem. We therefore aim at a strategy for kernel and parameter selection through the study of several kernels used in SVM regression. After a brief introduction to the SVM method for nonlinear systems, this paper analyses several typical kernel functions and then uses SVMs with different kernels to identify a nonlinear system. Finally, an SVM adopting the RBF kernel is used to identify a Nonlinear AutoRegressive Moving Average with eXogenous inputs (NARMAX) system.

The rest of this paper is organized as follows. Section 2 presents support vector machines for system identification. Section 3 discusses the choice of kernel functions. Section 4 presents the nonlinear system identification model. Section 5 discusses parameter selection and analyzes the experimental results. Finally, concluding remarks are made in Section 6.
2. Support Vector Machines for System Identification

In SVM regression, the basic idea is to map the input data into a higher-dimensional feature space H via a nonlinear mapping \Phi and then to do linear regression in this space. Regression approximation therefore addresses the problem of estimating a function from a given data set D = \{(x_1, y_1), \ldots, (x_l, y_l)\}, where x_i \in R^n is the n-dimensional input vector and y_i \in R (i = 1, \ldots, l, with l the size of the training data) is the output of the corresponding system. We introduce a nonlinear mapping \Phi : R^n \to H, which maps the samples to a new data set \{(\Phi(x_1), y_1), \ldots, (\Phi(x_l), y_l)\}. System identification is to find a function f such that, after training, an arbitrary input x yields the corresponding output y. The SVM approximates the function in the following form:

    f(x) = \sum_{i=1}^{l} w_i \Phi_i(x) + b                                        (1)

where the \Phi_i(x) are the features of the inputs and w_i and b are coefficients. They are estimated by minimizing the risk function (2). Suppose the function f(x) approximates the pairs (x_i, y_i) with precision \varepsilon; then the risk function is

    R(C) = \frac{C}{l} \sum_{i=1}^{l} L_\varepsilon + \frac{1}{2}\|w\|^2            (2)

In equation (2), L_\varepsilon = \max\{0, |y - f(x)| - \varepsilon\} is the so-called \varepsilon-insensitive loss function, which does not penalize errors below \varepsilon (\varepsilon > 0). The second term, \frac{1}{2}\|w\|^2, is used as a measure of function flatness, and C is a regularization constant determining the tradeoff between the training error and model flatness. Introducing slack variables \xi^{(*)} (\xi^{(*)} denotes both \xi and \xi^* in the following) turns equation (2) into the constrained problem

    \min \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)                 (3)

subject to

    ((w \cdot x_i) + b) - y_i \le \varepsilon + \xi_i, \quad
    y_i - ((w \cdot x_i) + b) \le \varepsilon + \xi_i^*, \quad
    \xi_i, \xi_i^* \ge 0                                                            (4)

To handle the constraints, we introduce the Lagrange multipliers \alpha_i^{(*)} and \eta_i^{(*)}; the Lagrange function is

    L(w, b, \alpha^{(*)}, \xi^{(*)}, \eta^{(*)}) = \frac{1}{2}\|w\|^2
        + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)
        - \sum_{i=1}^{l} \alpha_i (\varepsilon + \xi_i + y_i - (w \cdot x_i) - b)
        - \sum_{i=1}^{l} \alpha_i^* (\varepsilon + \xi_i^* - y_i + (w \cdot x_i) + b)
        - \sum_{i=1}^{l} (\eta_i \xi_i + \eta_i^* \xi_i^*)                          (5)

To minimize equation (3), we find the saddle point of L. Taking the minimum with respect to w gives w = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i)\Phi(x_i), so equation (1) takes the form

    f(x) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) \Phi(x_i)^T \Phi(x) + b           (6)

For computational convenience, the form \Phi(x_i)^T \Phi(x) is often replaced with a so-called kernel function of the form

    k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle                              (7)

Then the approximation function has the form f(x) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) k(x_i, x) + b. The Lagrange multipliers \alpha_i^{(*)} can be obtained by maximizing the dual form, function (8):

    W(\alpha^{(*)}) = -\frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) k(x_i, x_j)
        - \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*)
        + \sum_{i=1}^{l} (\alpha_i^* - \alpha_i) y_i                                (8)

subject to

    \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0                                      (9a)

    0 \le \alpha_i^{(*)} \le C                                                      (9b)

By the nature of quadratic programming, only some of the coefficients \alpha_i^{(*)} are nonzero, and the data points associated with them are referred to as support vectors.
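Modern SVM libraries solve the dual problem (8)-(9) internally. As a minimal illustration of the formulation above (the paper gives no code, so the toy data, the parameter values, and the use of scikit-learn's SVR here are all assumptions), an \varepsilon-SVR can be fitted and its support vectors inspected as follows:

```python
# Minimal epsilon-SVR sketch (illustrative only; not the authors' code).
# scikit-learn's SVR solves the dual problem (8)-(9) and exposes the
# nonzero multipliers through dual_coef_ and support_vectors_.
import numpy as np
from sklearn.svm import SVR

# Toy training data: a noisy 1-D nonlinear function (assumed example).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + 0.01 * rng.standard_normal(100)

# C is the regularization constant of (3); epsilon is the width of the
# epsilon-insensitive loss L_eps in (2).
model = SVR(kernel="rbf", C=200.0, epsilon=0.01, gamma=0.5)
model.fit(x, y)

# Only points with nonzero alpha_i^(*) become support vectors; in the
# notation of (6), dual_coef_ holds the values (alpha_i* - alpha_i).
print("support vectors:", model.support_vectors_.shape[0], "of", len(x))
print("first dual coefficients:", model.dual_coef_.ravel()[:5])

# The fitted function is f(x) = sum_i (alpha_i* - alpha_i) k(x_i, x) + b.
x_test = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
print("predictions:", model.predict(x_test))
```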
3. Kernel Function Choice

Due to the introduction of the kernel function, the feature space (a Reproducing Kernel Hilbert Space) is induced by the kernel function instead of being defined explicitly. So the selection of the kernel is the key factor in determining the feature space. Here are several kernel functions in general use:

(1) Polynomial:

    k(x, x') = (\langle x, x' \rangle + 1)^d                                        (10a)

(2) Gaussian radial basis function (RBF):

    k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))                                    (10b)

(3) Multi-layer perceptron (Sigmoid function):

    k(x, x') = \tanh(\rho \langle x, x' \rangle + b)                                (10c)

(4) Fourier series:

    k(x, x') = \frac{\sin((N + 1/2)(x - x'))}{\sin((x - x')/2)}                     (10d)

(5) Spline:

    k(x, x') = 1 + \langle x, x' \rangle + \frac{1}{2}\langle x, x' \rangle \min(x, x') - \frac{1}{6}\min(x, x')^3    (10e)

Usually, a control system is nonlinear because of many interfering factors, so the system to be identified is almost always nonlinear and a linear kernel is not a good selection. The polynomial mapping is a popular choice for nonlinear models. It degenerates to a linear function when the parameter d = 1; as d increases, the space complexity grows exponentially, so in engineering applications the value of d must be confined within the limits of the hardware and memory.

In general, the radial basis kernel function is the first choice for nonlinear system identification because it has the form of a Gaussian function. Traditional techniques based on radial basis functions use some method, in most cases clustering, to determine the centers of the basis functions. An attractive characteristic of the SVM is that this selection is implicit, with each support vector contributing one local Gaussian function [12]; by further considerations, it is possible to select the global basis-function width using the SRM principle. The RBF kernel is a nonlinear mapping, and a linear kernel is a special case of it, since a linear kernel with a penalty parameter C has the same performance as the RBF kernel with some parameters (C, \gamma) [9].

The optimization problem to be solved in the SVM is a convex quadratic programming problem, so the kernel used in the SVM should satisfy the Mercer condition [2]. Both the polynomial and the RBF kernels satisfy the Mercer condition. Although the Sigmoid function is a good approximation function in neural networks, it satisfies the Mercer condition only for some parameters. In the case of \rho > 0 and b < 0 it can be used as a kernel, and it behaves like the RBF kernel for certain parameters [10]. However, the Sigmoid function is not better than the RBF kernel, because it has no special superiority and presents many difficulties in parameter choice, so it is not recommended in general cases.

The Fourier transform is very important in signal processing, and the Fourier series kernel is an expansion of the Fourier transform in 2N+1 dimensions. However, the expansion has no good approximation ability, so a regularization parameter is always introduced to enhance it, such as:

    k(x, x') = \frac{1 - q^2}{2(1 - 2q\cos(x - x') + q^2)}

where q is a regularization parameter. Even so, this kernel is probably not a good choice because its regularization capability is poor, which is evident from consideration of its Fourier transform [11].

The spline function is widely used because of its ability to approximate arbitrary functions. For both the finite and the infinite spline, the complexity of the solution depends on the number of support vectors rather than on the dimension of the space or the number of knots. To simplify computation, the infinite spline is usually used; it is defined on the interval [0, 1]. When the order is 1, the infinite spline kernel is the one shown in equation (10e).

From reference [6], we know that a product of Mercer kernels is also a Mercer kernel, so an n-dimensional kernel can be constructed from the one-dimensional kernels above as the product K(x, x') = \prod_{i=1}^{n} k(x_i, x'_i).
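The Mercer condition can be probed numerically: a Mercer kernel must produce a positive semidefinite (PSD) Gram matrix on every sample set. The sketch below (data and parameter values are assumptions chosen for demonstration) builds Gram matrices for the kernels (10a)-(10c) and inspects their smallest eigenvalues:

```python
# Numerical check of the Mercer condition for kernels (10a)-(10c)
# (illustrative sketch; sample points and parameters are assumptions).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (30, 1))      # 30 scalar sample points
D = X - X.T                             # pairwise differences (30 x 30)
P = X @ X.T                             # pairwise inner products (30 x 30)

grams = {
    "polynomial d=3":      (P + 1.0) ** 3,                 # (10a)
    "rbf sigma=0.5":       np.exp(-D**2 / (2 * 0.5**2)),   # (10b)
    "sigmoid rho=1, b=1":  np.tanh(P + 1.0),               # (10c), b > 0
    "sigmoid rho=1, b=-1": np.tanh(P - 1.0),               # (10c), b < 0
}
for name, K in grams.items():
    lam_min = np.linalg.eigvalsh(K).min()   # K is symmetric
    print(f"{name:22s} smallest eigenvalue = {lam_min:+.2e}")
# Polynomial and RBF Gram matrices come out PSD (up to round-off),
# while the sigmoid Gram matrix can be indefinite, matching the
# statement that it satisfies Mercer only for some parameters [10].
```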
[Figure 5. Error curve of the RBF kernel: identification error versus \sigma for w = 1, 10 and 30.]

[Figure 6. The changing curve of the number of support vectors of the RBF kernel versus \sigma for w = 1, 10 and 30.]

[Figure 7. Time curve of the RBF kernel: CPU time versus \sigma for w = 1, 10 and 30.]

Comparing Figure 5 with Figure 6, there is a certain correspondence between the error curve and the number of support vectors: the identification error is small when the number of support vectors is small. Figure 7 shows that the CPU time consumed in quadratic programming decreases as the value of \sigma increases, which indicates that the time complexity is always below a certain value.

[Figure 8. Error curve of the Polynomial kernel: identification error versus d for w = 1, 10 and 30.]

[Figure 9. The changing curve of the number of support vectors of the Polynomial kernel versus d for w = 1, 10 and 30.]

In the case of w = 1, the error satisfies the precision only if d > 3. In the case of w = 10, the error satisfies the precision only if d > 17. In the case of w = 30, there is no reasonable value of d in (0, 30) that makes the error smaller than \varepsilon. For the Polynomial kernel, the space complexity is already large when d is 30; if d increases further, it is difficult to optimize the SVM because of the limitation of memory. Therefore, it is easy to approximate a low-frequency nonlinear function but difficult to approximate a high-frequency nonlinear function when adopting the Polynomial kernel.
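In outline, the parameter sweeps behind Figures 5-10 can be reproduced as follows. The paper's exact single-frequency test system is not recoverable from this text, so a sine target of frequency w is assumed purely for illustration, and scikit-learn's SVR stands in for the quadratic-programming solver:

```python
# Sketch of the sigma sweep (Figures 5-7) and degree sweep (Figures
# 8-10); the test signal sin(w x) is an assumption, not the paper's.
import time
import numpy as np
from sklearn.svm import SVR

def run(model, x, y):
    t0 = time.perf_counter()
    model.fit(x, y)
    elapsed = time.perf_counter() - t0
    error = np.abs(model.predict(x) - y).max()
    return error, model.support_vectors_.shape[0], elapsed

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 200)).reshape(-1, 1)
for w in (1, 10, 30):                   # frequencies used in the figures
    y = np.sin(w * x).ravel()
    # RBF kernel (10b): sweep sigma, with gamma = 1/(2 sigma^2).
    for sigma in (0.01, 0.1, 1.0, 10.0, 100.0):
        m = SVR(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=200.0, epsilon=1e-4)
        err, nsv, dt = run(m, x, y)
        print(f"rbf  w={w:2d} sigma={sigma:6.2f} err={err:.1e} SVs={nsv:3d} t={dt:.3f}s")
    # Polynomial kernel (10a): sweep the degree d.
    for d in (1, 3, 5, 10, 17, 25):
        m = SVR(kernel="poly", degree=d, gamma=1.0, coef0=1.0, C=200.0, epsilon=1e-4)
        err, nsv, dt = run(m, x, y)
        print(f"poly w={w:2d} d={d:2d}           err={err:.1e} SVs={nsv:3d} t={dt:.3f}s")
```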
[Figure 10. Time curve of the Polynomial kernel: CPU time versus d for w = 1, 10 and 30.]

Comparing Figure 8 with Figure 9, the error curve has a certain relationship with the number of support vectors. Figure 10 shows an increasing trend in the time curve, which indicates that the time complexity grows as the value of d increases.

Because the spline kernel function has no parameter, only one result can be obtained at each frequency when the nonlinear system is identified using the SVM. The results are listed in Table 1; the identification performance is poor when an infinite spline of order 1 is adopted. So the spline kernel function is probably a bad selection for SVM regression.

From the above experimental analysis, we can draw the following conclusions: an SVM using the RBF kernel has good identification performance for a nonlinear system with different frequencies, and it is easy to choose a parameter for the RBF kernel because its valid parameter varies over a certain range; an SVM adopting the Polynomial kernel also has good identification performance for a low-frequency nonlinear system, but it is difficult to implement for a high-frequency nonlinear system; and the identification performance of an SVM adopting an infinite spline of order 1 is poor.
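As a concrete illustration of the parameter-free spline kernel discussed above, equation (10e) can be supplied to an SVR as a custom kernel, and, since it has no free parameter, only one run per frequency is possible. This is a sketch under assumed data, with inputs scaled to [0, 1] as the kernel's definition requires:

```python
# Sketch: the order-1 infinite spline kernel (10e) as a custom kernel
# (illustrative; data are assumed, inputs must lie in [0, 1]).
import numpy as np
from sklearn.svm import SVR

def spline_kernel(A, B):
    # k(x, x') = 1 + x x' + (1/2) x x' min(x, x') - (1/6) min(x, x')^3,
    # applied per dimension and multiplied (a product of Mercer kernels
    # is again a Mercer kernel [6]).
    K = np.ones((A.shape[0], B.shape[0]))
    for j in range(A.shape[1]):
        x = A[:, j][:, None]
        xp = B[:, j][None, :]
        mn = np.minimum(x, xp)
        K *= 1.0 + x * xp + 0.5 * x * xp * mn - mn**3 / 6.0
    return K

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, 200)).reshape(-1, 1)
for w in (1, 10, 30):
    y = np.sin(w * x).ravel()           # assumed test signal
    m = SVR(kernel=spline_kernel, C=200.0, epsilon=1e-4)
    m.fit(x, y)
    err = np.abs(m.predict(x) - y).max()
    print(f"spline w={w:2d} err={err:.3f} SVs={len(m.support_)}")
```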
Table 1. Identification results of the nonlinear system using SVMs with an infinite spline kernel of order 1

    w                      1        10       30
    Error                  0.1119   0.2559   0.1017
    Support vectors (%)    64       84       100
    Time (s)               4.6      4.4      4.2

To test the RBF kernel function further, we identify a SISO control system based on the NARMAX model:

    y(k+1) = \frac{y(k)\, y(k-1)\, y(k-2)\, u(k-1)\, (y(k-2) - 1) + u(k)}{1 + y(k-1)^2 + y(k-2)^2}        (13)

It is difficult to estimate the trend of the mapping between the inputs and the outputs of such a complex system. For this system, the input vector is [y(k), y(k-1), y(k-2), u(k), u(k-1)]. We use a 200-point random sequence limited to (-1, 1) as the control signal; half of these points are used to train the SVM and the rest are used to test the identified system. The regularization parameter C is chosen as 200, and the precision is chosen as 0.0001. The parameter \sigma varies from 0 to 100. For the RBF kernel with each specified value of \sigma, the SVM is used to identify the nonlinear system; the identification error, the number of support vectors needed in the regression, and the CPU time required by the quadratic programming are obtained from the experiment and shown in Figures 11, 12 and 13, respectively.

[Figure 11. Error curve: identification error versus \sigma.]

[Figure 12. The changing curve of the number of support vectors versus \sigma.]
[Figure 13. Time curve: CPU time versus \sigma.]

Figure 11 shows that the identification error satisfies the precision when \sigma lies in the range (0.9, 2), and the error reaches its minimum of 4.5 x 10^-5 when \sigma is 1.2. Comparing Figure 11 with Figure 12, the error becomes small as the number of support vectors decreases. Figure 13 shows that the CPU time spent in quadratic programming decreases as \sigma increases. So the SVM adopting the RBF kernel function performs well when it is used to identify a more complex nonlinear system such as a NARMAX system.
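The NARMAX experiment can be outlined in code: simulate equation (13) driven by a random input, build the regression vectors [y(k), y(k-1), y(k-2), u(k), u(k-1)], and fit an RBF SVR with the stated settings (C = 200, epsilon = 0.0001). The initial conditions, the random seed, and the use of scikit-learn are assumptions; this is an illustrative sketch, not the authors' code:

```python
# Sketch of the NARMAX identification experiment (paper settings:
# C = 200, epsilon = 1e-4, a 200-point random input in (-1, 1);
# the zero initial conditions and the seed are assumptions).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
n = 203                                 # a few extra steps for the delays
u = rng.uniform(-1.0, 1.0, n)
y = np.zeros(n + 1)
for k in range(2, n):                   # simulate equation (13)
    num = y[k] * y[k-1] * y[k-2] * u[k-1] * (y[k-2] - 1.0) + u[k]
    den = 1.0 + y[k-1]**2 + y[k-2]**2
    y[k + 1] = num / den

# Regression vectors [y(k), y(k-1), y(k-2), u(k), u(k-1)] -> y(k+1).
ks = np.arange(2, n)
X = np.column_stack([y[ks], y[ks-1], y[ks-2], u[ks], u[ks-1]])
t = y[ks + 1]

half = len(X) // 2                      # half for training, half for testing
sigma = 1.2                             # near the best value reported above
m = SVR(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=200.0, epsilon=1e-4)
m.fit(X[:half], t[:half])
err = np.abs(m.predict(X[half:]) - t[half:]).max()
print(f"test error = {err:.2e}, support vectors = {len(m.support_)}")
```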
6. Conclusions

Both the choice of kernel function and the setting of its parameters are important research issues when support vector machines are used for classification and regression. This paper discusses the choice among several kernel functions, including the radial basis kernel, the spline kernel, the polynomial kernel and the Fourier series kernel, and the effects of different parameters on the performance of SVMs with the same kernel function. A large number of experimental results show that the RBF kernel function is a good choice when an SVM is used to identify a nonlinear system, because the SVM with the RBF kernel can approximate well a nonlinear system with a high frequency and its parameter only needs to be chosen within a small range. Furthermore, the experimental results also show that the space complexity does not become very large when the parameter varies within a certain range, so system identification using SVMs is easy to implement on personal computers.
References

[1] S.S. Ge, "Book review", Automatica, vol. 39, pp. 177-179, 2003.
[2] V. Vapnik, The Nature of Statistical Learning Theory, translated by X.G. Zhang, Beijing: Tsinghua University Press, 2000.
[3] V.N. Vapnik, S. Golowich, and A. Smola, "Support vector method for function approximation, regression estimation, and signal processing", Advances in Neural Information Processing Systems, pp. 281-287, 1997.
[4] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[5] A. Smola and B. Schölkopf, "A tutorial on support vector regression", NeuroCOLT2 Technical Report, http://www.neurocolt.com, 1998.
[6] J.C. Platt, "Fast training of support vector machines using sequential minimal optimization", in Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 185-208, 1999.
[7] A. Gretton, A. Doucet, and R. Herbrich, "Support vector regression for black-box system identification", Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing, Piscataway: IEEE Press, pp. 341-344, 2001.
[8] Y. Tan and J. Wang, "A support vector machine with a hybrid kernel and minimal Vapnik-Chervonenkis dimension", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 385-395, 2004.
[9] C.W. Hsu, C.C. Chang, and C.J. Lin, "A practical guide to support vector classification", http://www.csie.ntu.edu.tw/~cjlin, 2003.
[10] H.T. Lin and C.J. Lin, "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods", http://www.csie.ntu.edu.tw/~cjlin, 2003.
[11] A.J. Smola and B. Schölkopf, "On a kernel-based method for pattern recognition, regression, approximation and operator inversion", Algorithmica, vol. 22, pp. 211-231, 1998.
[12] S.R. Gunn, "Support vector classification and regression", Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton, 1997.