AN ACTIVE-SET ALGORITHM FOR SUPPORT VECTOR MACHINES IN NONLINEAR SYSTEM IDENTIFICATION

Michael Vogt * and Vojislav Kecman **

* Darmstadt University of Technology, Institute of Automatic Control, Landgraf-Georg-Strasse 4, 64283 Darmstadt, Germany, E-mail: [email protected]
** University of Auckland, School of Engineering, Private Bag 92019, Auckland, New Zealand
Abstract: This contribution describes an active-set algorithm for the optimization of regression support vector machines (SVMs), intended mainly for system identification. Currently, SVMs are computed by solving a QP problem with working-set algorithms like the SMO method. Although these show good results in general, they may perform weakly in some situations, particularly when solving regression problems. In such cases, active-set techniques (which are robust general-purpose QP solvers) have been shown to be a reasonable alternative. The paper considers how to adapt them to SVM regression with fixed or variable bias term and applies them to the identification of a condensing boiler.

Keywords: system identification, regression, nonlinear, support vector machines, quadratic programming, active-set algorithm
1. INTRODUCTION

In recent years, support vector machines (SVMs) have become popular for classification and regression tasks (Vapnik, 1995; Schölkopf and Smola, 2002), since they can handle high-dimensional inputs and show good generalization behavior. The method has its foundation in classification and was later extended to regression. However, applications in the field of system identification are still rare. SVMs are computed by solving a quadratic programming (QP) problem (see Eqs. (7) and (9)), the size of which depends on the number N of training data. Currently, the QP problem is solved using working-set algorithms like the SMO method (Platt, 1999), ISDA (Huang and Kecman, 2004) or similar strategies (Chang and Lin, 2003). These can be implemented efficiently and have a memory consumption only proportional to the number of training data. Note that the memory consumption would be quadratic if the
whole QP problem were stored in memory. For that reason, working-set algorithms are suitable even for large-scale problems, i.e., for huge data sets. On the other hand, they may show weak results if the problem is ill-posed, if the SVM parameters (C and ε) are not chosen carefully, or if high precision is needed. The computation time may then increase by several orders of magnitude for the same data set. This applies in particular to regression, see Sec. 5.

Active-set algorithms are the classical solvers for QP problems. They are known to be robust, but they are often slower and require more memory than working-set algorithms. Only a few attempts have been made to utilize this technique for SVMs. E.g., in (Mangasarian and Musicant, 2001) it is applied to a modified SVM classification problem; the Chunking algorithm (Vapnik, 1995) is also closely related. However, no active-set algorithm is available that considers the special needs of the standard SVM regression problem. This is the main contribution of this paper.
The general idea is to find the active set A, i.e., those inequality constraints that are fulfilled with equality. If A is known, the Karush-Kuhn-Tucker (KKT) conditions reduce to a simple system of linear equations which yields the solution of the QP problem (Nocedal and Wright, 1999). Because A is unknown in the beginning, it is constructed iteratively by adding and removing constraints and testing whether the solution remains feasible.

The construction of A starts with an initial active set A^0 containing the indices of the bounded variables (lying on the boundary of the feasible region), whereas those in F^0 = {1, ..., N} \ A^0 are free (lying in the interior of the feasible region). Then the following steps are performed repeatedly for k = 1, 2, ...:

A1. Solve the KKT system for all variables in F^k.
A2. If the solution is feasible, find the variable in A^k that violates the KKT conditions most, move it to F^k, then go to A1.
A3. Otherwise find an intermediate value between the old and the new solution lying on the border of the feasible region, move one bounded variable from F^k to A^k, then go to A1.

The intermediate solution in step A3 is computed as a^k = η ā^k + (1 − η) a^(k−1) with maximal η ∈ [0, 1] (affine scaling), where ā^k is the solution of the linear system in step A1, i.e., the new iterate a^k lies on the line connecting a^(k−1) and ā^k. The optimum is found when step A2 leaves no violating variable in A^k.

In Sec. 2, a brief review of the SVM regression method is given. Sec. 3 then shows how to adapt the above algorithm to SVM regression for both fixed and variable bias term. In Sec. 4, the efficient solution of the KKT system is explained. An application example, the nonlinear system identification of a condensing boiler, is studied in Sec. 5.
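The loop A1–A3 can be sketched for a generic box-constrained QP, min_a (1/2) a^T H a − c^T a subject to 0 ≤ a_i ≤ C, which is the shape the fixed-bias regression dual takes in Sec. 3. The Python version below is only a toy illustration of the loop under that assumption, not the paper's C implementation (which maintains a pivoted Cholesky factor instead of re-solving the KKT system from scratch); all names are ours:

```python
import numpy as np

def active_set_box_qp(H, c, C, tol=1e-10, max_iter=1000):
    """Minimize 1/2 a^T H a - c^T a  s.t.  0 <= a_i <= C
    with the active-set loop A1-A3 (toy illustration)."""
    n = len(c)
    a = np.zeros(n)                 # feasible start: every variable on its lower bound
    free = np.zeros(n, dtype=bool)  # free set F; all variables start in the active set A
    for _ in range(max_iter):
        F = np.where(free)[0]
        a_bar = a.copy()
        if F.size:                  # A1: solve the KKT system for the free variables
            rhs = c[F] - H[np.ix_(F, ~free)] @ a[~free]
            a_bar[F] = np.linalg.solve(H[np.ix_(F, F)], rhs)
        if F.size == 0 or ((a_bar[F] > -tol) & (a_bar[F] < C + tol)).all():
            a = a_bar
            g = H @ a - c           # A2: multiplier is +g_i on the lower, -g_i on the upper bound
            mult = np.where(a <= tol, g, -g)
            mult[free] = np.inf     # only the bounded variables are checked
            worst = int(np.argmin(mult))
            if mult[worst] >= -tol:
                return a            # no violator left: KKT conditions hold, optimum found
            free[worst] = True      # free the worst violator and re-solve
        else:
            # A3: affine scaling a <- eta*a_bar + (1-eta)*a with maximal feasible eta
            d = a_bar - a
            eta, block, bound = 1.0, -1, 0.0
            for i in F:
                if d[i] < -tol and -a[i] / d[i] < eta:
                    eta, block, bound = -a[i] / d[i], i, 0.0
                elif d[i] > tol and (C - a[i]) / d[i] < eta:
                    eta, block, bound = (C - a[i]) / d[i], i, C
            a = a + eta * d
            a[block] = bound        # blocking variable lands exactly on its bound
            free[block] = False
    raise RuntimeError("active-set loop did not converge")
```

The actual regression problem additionally distinguishes the −ε/+ε cases of Eq. (12) and optionally carries the bias row of Eq. (17); the sketch shows only the generic A1–A3 mechanics.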
2. SUPPORT VECTOR MACHINE REGRESSION
Linear regression SVMs try to find a flat function

    f(x) = w^T x + b,                                                  (1)

so that all data {(x_i, y_i)}_{i=1..N} lie within an insensitivity zone of size ε around the function, see Fig. 1. Outliers are treated by two sets of slack variables ξ_i and ξ_i* measuring the distance above and below the insensitivity zone, respectively. This results in the following primal optimization problem (Schölkopf and Smola, 2002):

    min_{w,ξ,ξ*}  J_p(w, ξ, ξ*) = (1/2) w^T w + C Σ_{i=1..N} (ξ_i + ξ_i*)   (2a)
    s.t.  y_i − w^T x_i − b ≤ ε + ξ_i                                  (2b)
          w^T x_i + b − y_i ≤ ε + ξ_i*                                 (2c)
          ξ_i, ξ_i* ≥ 0,  i = 1, ..., N.                               (2d)

Fig. 1. Linear support vector machine regression (regression function with insensitivity zone of size ε; the slack variables ξ_i and ξ_i* measure the distances of outliers above and below the zone)

To solve this problem, its (primal) Lagrangian

    L_p(w, b, ξ, ξ*, α, α*, β, β*) =
        (1/2) w^T w + C Σ_{i=1..N} (ξ_i + ξ_i*) − Σ_{i=1..N} (β_i ξ_i + β_i* ξ_i*)
        − Σ_{i=1..N} α_i (ε + ξ_i − y_i + w^T x_i + b)
        − Σ_{i=1..N} α_i* (ε + ξ_i* + y_i − w^T x_i − b)               (3)

is needed. The dual variables α, α*, β and β* are the Lagrange multipliers of the primal constraints. This function has a minimum with respect to the primal variables and a maximum with respect to the dual ones (saddle point condition). According to the Karush-Kuhn-Tucker (KKT) conditions, the derivatives with respect to the primal variables must vanish in the optimum:

    ∂L_p/∂w = 0     ⇒  w = Σ_{i=1..N} (α_i − α_i*) x_i                 (4a)
    ∂L_p/∂b = 0     ⇒  Σ_{i=1..N} (α_i − α_i*) = 0                     (4b)
    ∂L_p/∂ξ_i = 0   ⇒  α_i + β_i = C,    i = 1, ..., N                 (4c)
    ∂L_p/∂ξ_i* = 0  ⇒  α_i* + β_i* = C,  i = 1, ..., N                 (4d)

This yields a dual objective function that depends solely on the dual variables α and α*:

    J_d(α, α*) = (1/2) Σ_{i=1..N} Σ_{j=1..N} (α_i − α_i*)(α_j − α_j*) x_i^T x_j
                 − Σ_{i=1..N} (α_i − α_i*) y_i + ε Σ_{i=1..N} (α_i + α_i*)   (5)

To solve nonlinear regression problems, the input space is mapped into a feature space by a nonlinear mapping Φ. The linear SVM is then applied to the features Φ(x) instead of the original variables x. Since the input variables occur in Eq. (5) only as scalar products x_i^T x_j, we define a kernel function

    K(x, x') = Φ^T(x) Φ(x').                                           (6)

Many different kernel functions are possible, e.g., linear kernels, polynomial kernels or Gauss kernels (Schölkopf and Smola, 2002). Consequently, the only difference between linear and nonlinear SVMs is the use of kernel functions instead of scalar products, resulting in the following dual optimization problem:

    min_{α,α*}  (1/2) Σ_{i=1..N} Σ_{j=1..N} (α_i − α_i*)(α_j − α_j*) K_ij
                − Σ_{i=1..N} (α_i − α_i*) y_i + ε Σ_{i=1..N} (α_i + α_i*)   (7a)
    s.t.  0 ≤ α_i^(*) ≤ C,  i = 1, ..., N                              (7b)
          Σ_{i=1..N} (α_i − α_i*) = 0                                  (7c)

where K_ij = K(x_i, x_j). The notation α_i^(*) is used as an abbreviation if an (in-)equality is valid for both α_i and α_i*. The SVM output is computed as

    f(x) = Σ_{i=1..N} (α_i − α_i*) K(x_i, x) + b                       (8)

with the bias term b, which provides a constant offset to f and is computed from the KKT conditions, see (Vapnik, 1995) and Fig. 1. Vectors x_i with α_i^(*) ≠ 0 are called support vectors; Eq. (8) is the support vector expansion. Usually only a small fraction of the data set are support vectors, typically about 5 %.

3. SOLVING THE REGRESSION PROBLEM

3.1 Regression with fixed bias term

We first start with a simplification and keep the bias term fixed, including the most important case b = 0. This is possible if the kernel function provides an implicit bias, as e.g. in the case of positive definite kernel functions (Poggio et al., 2002; Huang and Kecman, 2004; Vogt et al., 2003). The only effect is that slightly more support vectors are computed. Consequently, condition (4b) is not present and the dual optimization problem changes to

    min_{α,α*}  (1/2) Σ_{i=1..N} Σ_{j=1..N} (α_i − α_i*)(α_j − α_j*) K_ij
                − Σ_{i=1..N} (α_i − α_i*) y_i + ε Σ_{i=1..N} (α_i + α_i*)
                + b Σ_{i=1..N} (α_i − α_i*)                            (9a)
    s.t.  0 ≤ α_i^(*) ≤ C,  i = 1, ..., N.                             (9b)

The last term in Eq. (9a) vanishes either if b = 0 or if b is treated as a variable. In the second case, the additional equality constraint (7c) has to be imposed (as done in Eq. (7)). If b is kept fixed, the SVM is computed by solving the box-constrained convex QP problem (9), which is one of the most convenient types of QP problems.

To solve it with the active-set method described in Sec. 1, the KKT conditions of this problem must be found. Its (dual) Lagrangian is

    L_d(α, α*, λ, λ*, µ, µ*) =
        (1/2) Σ_{i=1..N} Σ_{j=1..N} (α_i − α_i*)(α_j − α_j*) K_ij
        − Σ_{i=1..N} (α_i − α_i*) y_i + ε Σ_{i=1..N} (α_i + α_i*)
        + b Σ_{i=1..N} (α_i − α_i*)
        − Σ_{i=1..N} λ_i α_i − Σ_{i=1..N} µ_i (C − α_i)
        − Σ_{i=1..N} λ_i* α_i* − Σ_{i=1..N} µ_i* (C − α_i*)            (10)

where λ_i^(*) and µ_i^(*) are the Lagrange multipliers of the constraints α_i^(*) ≥ 0 and α_i^(*) ≤ C, respectively. Introducing the prediction errors E_i = f(x_i) − y_i, the KKT conditions can be derived for i = 1, ..., N (Nocedal and Wright, 1999):

    ∂L_d/∂α_i  = ε + E_i − λ_i + µ_i = 0                               (11a)
    ∂L_d/∂α_i* = ε − E_i − λ_i* + µ_i* = 0                             (11b)
    0 ≤ α_i^(*) ≤ C                                                    (11c)
    λ_i^(*) ≥ 0,  µ_i^(*) ≥ 0                                          (11d)
    α_i^(*) λ_i^(*) = 0,  (C − α_i^(*)) µ_i^(*) = 0                    (11e)

According to the values of α_i and α_i*, five cases have to be considered:

    0 < α_i < C, α_i* = 0  (i ∈ F):
        λ_i = µ_i = µ_i* = 0,  λ_i* = 2ε > 0
        ⇒ Σ_{j∈F∪F*} a_j K_ij = y_i − ε − Σ_{j∈A_C∪A_C*} a_j K_ij      (12a)

    0 < α_i* < C, α_i = 0  (i ∈ F*):
        λ_i* = µ_i = µ_i* = 0,  λ_i = 2ε > 0
        ⇒ Σ_{j∈F∪F*} a_j K_ij = y_i + ε − Σ_{j∈A_C∪A_C*} a_j K_ij      (12b)

    α_i = α_i* = 0  (i ∈ A_0 ∩ A_0*):
        λ_i = ε + E_i > 0,  λ_i* = ε − E_i > 0,  µ_i = 0,  µ_i* = 0    (12c)

    α_i = C, α_i* = 0  (i ∈ A_C):
        λ_i = 0,  λ_i* = ε − E_i > 0,  µ_i = −ε − E_i > 0,  µ_i* = 0   (12d)

    α_i = 0, α_i* = C  (i ∈ A_C*):
        λ_i = ε + E_i > 0,  λ_i* = 0,  µ_i = 0,  µ_i* = −ε + E_i > 0   (12e)

The superscript * indicates that α_i* is concerned rather than α_i. Subscripts 0 and C correspond to the lower and the upper bound of α_i and α_i*. Obviously, there are more combinations than these five, but only these five cases can occur: it can be shown that α_i and α_i* cannot be different from zero simultaneously. If one of the variables is free (Eq. (12a) and (12b)) or equal to C (Eq. (12d) and (12e)), the other one must be zero.

The above conditions are exploited in each iteration step k. Cases (12a) and (12b) establish the linear system in step A1 of the algorithm for the currently free variables i ∈ F^k ∪ F*^k. Cases (12c)–(12e) are the conditions that must be fulfilled in the optimum for all variables in A_0^k ∪ A_0*^k ∪ A_C^k ∪ A_C*^k, i.e., step A2 searches for the worst violator among these variables. Note that A_0^k ∩ A_C^k = ∅ and A_0*^k ∩ A_C*^k = ∅ because a variable cannot lie on both bounds simultaneously. The variables in A_C^k ∪ A_C*^k are the bounded variables and also occur in the linear system of cases (12a) and (12b).

The condition α_i α_i* = 0 (see above) allows using the SVM coefficients a_i = α_i − α_i* instead of the Lagrange multipliers α_i and α_i*. With this abbreviation, the number of variables reduces from 2N to N, the algorithm is slightly faster, and many similarities to classification algorithms can be observed. With this modification, in step A1 the linear system

    H^k ā^k = c^k  with  ā_i^k = ᾱ_i^k − ᾱ_i*^k  for i ∈ F^k ∪ F*^k    (13)

    h_ij^k = K_ij
    c_i^k = y_i − Σ_{j∈A_C^k∪A_C*^k} a_j^k K_ij − ε   for i ∈ F^k
    c_i^k = y_i − Σ_{j∈A_C^k∪A_C*^k} a_j^k K_ij + ε   for i ∈ F*^k     (14)

has to be solved by the methods described in Sec. 4.1. If F^k ∪ F*^k contains p variables, then H^k is a p × p matrix. Step A2 of the algorithm computes

    λ_i^k = ε + E_i^k,   λ_i*^k = ε − E_i^k    for i ∈ A_0^k ∪ A_0*^k  (15a)
    µ_i^k = −ε − E_i^k,  µ_i*^k = −ε + E_i^k   for i ∈ A_C^k ∪ A_C*^k  (15b)

and checks if they are positive, i.e., if the KKT conditions are valid for i ∈ A_0^k ∪ A_0*^k ∪ A_C^k ∪ A_C*^k. Among the negative multipliers, the most negative one is selected and the corresponding variable is moved to F^k or F*^k, respectively.

3.2 Regression with variable bias term

If the bias term b is computed explicitly, the additional equality constraint (7c) has to be considered. In that case, the Lagrangian (10) and its derivatives (11) remain the same – with the important difference that b is not fixed any more. Note that the term

    b Σ_{i=1..N} (α_i − α_i*)                                          (16)

in the Lagrangian now results from the constraint (7c) (with b as its Lagrange multiplier), not from the objective function (9a). With these changes and again with the coefficients a_i = α_i − α_i*, the linear system becomes

    [ H^k  e ] [ ā^k ]   [ c^k ]  } p rows
    [ e^T  0 ] [ b^k ] = [ d^k ]  } 1 row                              (17)

with

    d^k = − Σ_{j∈A_C^k∪A_C*^k} a_j^k   and   e_i = 1.                  (18)

Since we retain the assumption that K(x_i, x_j) is positive definite, the Cholesky decomposition H = R^T R is available (see Sec. 4.1), and the block system (17) can be solved by the following procedure:

    • Solve R^T R u = e for u.
    • Compute b = ( Σ_j u_j c_j + Σ_{j∈A_C^k∪A_C*^k} a_j^k ) / Σ_j u_j, i.e., b = (u^T c − d^k)/(e^T u).
    • Solve R^T R ā = c − e b for ā.

This is basically a Gaussian elimination step applied to the blocks of the matrix. The computation of λ_i^(*) and µ_i^(*) remains the same as in Eq. (15) for the fixed bias term.

An additional topic has to be considered here: for a variable bias term, the Linear Independence Constraint Qualification (LICQ) (Nocedal and Wright, 1999) is violated when for each a_i one inequality constraint is active, e.g., when the algorithm is initialized with a_i = 0 for i = 1, ..., N. The algorithm uses a simplified version of Bland's rule to avoid cycling in these cases.

4. IMPLEMENTATION DETAILS

The active-set algorithm has been implemented as a C MEX-file under MATLAB. It is able to handle both fixed and variable bias terms. Since usually most of the coefficients a_i are zero in the optimum, it starts with a_i = 0 for i = 1, ..., N as initial feasible solution.

4.1 Cholesky decomposition with pivoting

The proposed algorithm assumes positive definite kernel functions. This is not a hard limitation because most of the commonly used kernels (Gaussians, inhomogeneous polynomials, ...) show this property. Consequently, the kernel matrix H has a Cholesky decomposition H = R^T R, where R is an upper triangular matrix. For a fixed bias term, the solution of the linear system in step A1 is found by simple back-substitution. For a variable bias term, the block algorithm described in Sec. 3.2 is used.

Unfortunately, H may be "nearly indefinite", i.e., it may become indefinite through round-off errors during the computation. This occurs e.g. for Gaussians having large widths. There are two ways to cope with this problem: first, to use Cholesky decomposition with pivoting, and second, to slightly enlarge the diagonal elements to make H "more definite". Although the Cholesky decomposition is numerically very stable, the active-set algorithm uses diagonal pivoting by default. In the "nearly indefinite" case, this allows extracting the largest positive definite part of H = (h_ij); all variables corresponding to the rest of the matrix are then set to zero.

Usually the Cholesky decomposition is computed using axpy operations (Golub and van Loan, 1996). However, the pivoting strategy needs the updated diagonal elements in each step, as they would be available if outer product updates were applied. Since these require many accesses to matrix elements, a mixed procedure is implemented that updates only the diagonal elements and uses axpy operations otherwise:

    Compute for i = 1, ..., p:
        Find k = arg max { |h̄_ii|, ..., |h̄_pp| }.
        Swap lines and columns i and k symmetrically.
        Compute r_ii = sqrt(h̄_ii).
        Compute for j = i + 1, ..., p:
            r_ij = ( h_ij − Σ_{l=1..i−1} r_li r_lj ) / r_ii
            h̄_jj ← h̄_jj − r_ij²

where h̄_jj are the updated diagonal elements and "←" indicates the update process. The result can be written as

    P H P^T = R^T R                                                    (19)

with the permutation matrix P. Of course, the implementation uses a pivot vector instead of the complete permutation matrix. Besides that, only the upper triangular part of R is stored, i.e., only memory for p(p + 1)/2 elements is needed. This algorithm is almost as fast as the standard Cholesky decomposition.

4.2 Memory consumption

Approximately the following memory is required:

    • N elements for the coefficient vector,
    • N elements for the index vector,
    • N_f (N_f + 3)/2 elements for the triangular matrix and the right-hand side of the linear system,

where N_f is the number of free variables in the optimum, i.e., those with 0 < α_i^(*) < C. As this number is unknown in the beginning, the algorithm starts with an initial amount of memory and increases it whenever variables are added. The index vector is needed to keep track of the sets F^(*), A_C^(*) and A_0^(*). It is also used as pivot vector for the Cholesky decomposition.

5. RESULTS

The following example shows how to use the active-set algorithm in system identification. We use a data set described in detail in (Vogt et al., 2003). The task is to estimate the outlet temperature T31 of a high efficiency (condensing) boiler. The inputs are the system temperature T41, the water flow F31 and the burner output P11, see Fig. 2. Second order dynamics are assumed for the output and all inputs, so the model has 11 regressors. The training data set consists of 3344 samples, the validation data set of 2629 samples. All experiments were done with MATLAB 6.5 under Microsoft Windows 2000 on an 800 MHz Pentium-III PC having 256 MB RAM.

Fig. 2. Block diagram of the boiler (inputs: T41, F31, P11; output: T31)

5.1 SVMs and RBF networks

We first examine the principal properties of the SVM regression method by comparing an SVM with Gauss kernels to an RBF network. The RBF network has exactly the same structure as the SVM, i.e., its neurons are located on the data points and all Gaussians have the same width σ. Therefore both models differ only in their training method. The SVM is trained by the proposed active-set algorithm, the RBF network by the OLS method (Chen et al., 1991). Tab. 1 compares both models for σ = 5. For the SVM, the insensitivity zone is set to ε = 0.01 and the precision to check the KKT conditions to τ = 10^−4. The model complexity (number of RBF neurons and the SVM's upper bound C) has been determined by cross-validation, i.e., the models with the best validation results have been selected.

Table 1. SVM and OLS training
    Prediction           SVM            OLS
    RMSE (Training)      0.0046         0.0097
    RMSE (Validation)    0.0054         0.0155
    SVs/Neurons          137 (C = 50)   16

    Simulation           SVM            OLS
    RMSE (Training)      0.0173         0.0333
    RMSE (Validation)    0.0206         0.0416
    SVs/Neurons          228 (C = 10)   9
This example shows that SVM training leads to a significantly smaller root-mean-square error (RMSE) than the least-squares based OLS method. This applies to both one-step-ahead prediction and simulation. However, SVMs need a higher complexity (number of support vectors) than the RBF networks.
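Both models are evaluated through the same kernel expansion, Eq. (8). As a reference, it can be sketched with Gauss kernels as follows, assuming the common width convention K(x, x') = exp(−||x − x'||²/(2σ²)) (the paper does not spell out its normalization; the function names are ours):

```python
import numpy as np

def gauss_kernel(X, x, sigma):
    """K(x_i, x) for all rows x_i of X, assuming the convention
    K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-((X - x) ** 2).sum(axis=1) / (2.0 * sigma ** 2))

def svm_output(X_sv, a_sv, b, x, sigma):
    """Support vector expansion f(x) = sum_i a_i K(x_i, x) + b, Eq. (8),
    evaluated over the support vectors only (a_i = alpha_i - alpha_i*)."""
    return float(a_sv @ gauss_kernel(X_sv, x, sigma)) + b
```

Only the support vectors (x_i with a_i ≠ 0) enter the sum, which is why the SV counts in Tab. 1 directly determine the evaluation cost per sample.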
5.2 Active-set and working-set methods

Tab. 2 compares the computation times of the active-set algorithm and the working-set method LIBSVM (Chang and Lin, 2003) when the upper bound C is varied and the precision is set to τ = 10^−4. The SVM uses Gauss kernels having a width of σ = 3. Since the focus is on the algorithm itself, we disregard the discussion of approximation errors in the following. Tab. 2 shows that LIBSVM can efficiently handle a large number of support vectors, whereas the active-set method shows its strength if the number is small.

Table 2. Variation of C

    C            10^0      10^1     10^2     10^3     10^4
    LIBSVM       2.3 s     3.6 s    9.6 s    54.5 s   247.5 s
    Active-Set   163.4 s   27.7 s   11.1 s   11.6 s   17.2 s
    All SVs      432       147      93       92       102
    Free SVs     34        45       54       82       100

Tab. 3 compares both algorithms for different precisions τ when the Gaussians' width is small (σ = 1) and C is set to 1000. In this setting all support vectors are free, which is an extreme case but not unusual (Platt, 1999).

Table 3. Variation of τ for σ = 1

    τ            10^−2    10^−3    10^−4    10^−5    10^−6
    LIBSVM       2.9 s    8.2 s    17.9 s   65.0 s   132.5 s
    Active-Set   6.7 s    20.0 s   25.5 s   26.4 s   26.4 s
    All SVs      156      130      131      132      131
    Free SVs     156      130      131      132      131

Obviously, the computation time of the
active-set method is nearly insensitive to the precision (except for τ = 10^−2). Tab. 4 shows the same experiment for a large width (σ = 5). Here only about half of the support vectors are free; clearly, this fraction increases for low precision.

Table 4. Variation of τ for σ = 5

    τ            10^−2    10^−3    10^−4    10^−5    10^−6
    LIBSVM       7.1 s    17.3 s   31.0 s   129.4 s  546.4 s
    Active-Set   1.5 s    10.5 s   12.0 s   12.4 s   12.6 s
    All SVs      86       99       103      102      102
    Free SVs     85       66       61       57       57

Another property
of active-set algorithms can be observed for high precision: once the correct active set is found, the solution is computed with full precision and does not change if τ is decreased further. The results show that the computation times depend mainly on the number of support vectors and on the precision. LIBSVM was always given 40 MB of cache because it becomes rather slow when the cache size is too small.
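These timings rest on the factorization routine of Sec. 4.1. A dense NumPy sketch of the Cholesky decomposition with diagonal pivoting, Eq. (19), is given below; as a simplification, it applies full outer-product updates to the trailing block and pivots on the updated diagonal, whereas the paper's C implementation updates only the diagonal elements via axpy operations and stores R in packed triangular form:

```python
import numpy as np

def pivoted_cholesky(H, tol=1e-12):
    """Cholesky decomposition with diagonal pivoting, P H P^T = R^T R.
    Stops as soon as the largest updated diagonal element falls below tol,
    i.e., it extracts the largest positive definite part of H."""
    A = np.array(H, dtype=float)   # working copy; trailing block is updated in place
    n = A.shape[0]
    R = np.zeros((n, n))
    p = np.arange(n)               # pivot vector instead of a full permutation matrix
    rank = n
    for i in range(n):
        k = i + int(np.argmax(np.diag(A)[i:]))   # pivot: largest updated diagonal element
        if A[k, k] <= tol:
            rank = i               # remaining part is numerically not positive definite
            break
        # symmetric swap of rows/columns i and k (and of the computed part of R)
        A[[i, k], :] = A[[k, i], :]
        A[:, [i, k]] = A[:, [k, i]]
        R[:i, [i, k]] = R[:i, [k, i]]
        p[[i, k]] = p[[k, i]]
        R[i, i] = np.sqrt(A[i, i])
        R[i, i + 1:] = A[i, i + 1:] / R[i, i]
        # outer-product update of the trailing block (the paper updates only its diagonal)
        A[i + 1:, i + 1:] -= np.outer(R[i, i + 1:], R[i, i + 1:])
    return R[:rank, :], p
```

The returned factor satisfies R^T R = P H P^T on the extracted part; variables beyond the detected rank would be set to zero, as described in Sec. 4.1.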
6. CONCLUSIONS

An active-set algorithm has been proposed for SVM regression. The general strategy has been adapted to this problem for both fixed and variable bias terms. The result is a robust algorithm that requires approximately 2N + (1/2) N_f² elements of memory, where N is the number of data and N_f the number of free support vectors. Simulation results show that the active-set algorithm is advantageous for SVM regression

    • when high precision is needed,
    • when the number of support vectors is small.
Currently, the algorithm changes the active set by only one variable per step, and most of the computation time is spent calculating the prediction errors E_i. Both issues can be improved significantly by introducing gradient projection steps. If this technique is combined with iterative solvers, a large number of free variables also becomes tractable. Although the algorithm shows important advantages in several situations, it should be regarded as a first step in this direction.

REFERENCES

Chang, Chih-Chung and Chih-Jen Lin (2003). LIBSVM: A library for support vector machines. Technical report. National Taiwan University, Taipei, Taiwan.
Chen, Sheng, C. F. N. Cowan and P. M. Grant (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks 2(2), 302–309.
Golub, Gene H. and Charles F. van Loan (1996). Matrix Computations. 3rd ed. The Johns Hopkins University Press, Baltimore, MD.
Huang, Te Ming and Vojislav Kecman (2004). Bias term b in SVMs again. In: Proceedings of the 12th European Symposium on Artificial Neural Networks (ESANN 2004). Bruges, Belgium. pp. 441–448.
Mangasarian, Olvi L. and David R. Musicant (2001). Active set support vector machine classification. In: Advances in Neural Information Processing Systems (NIPS 2000) (Todd K. Leen, Volker Tresp and Thomas G. Dietterich, Eds.). Vol. 13. MIT Press, Cambridge, MA. pp. 577–583.
Nocedal, Jorge and Stephen J. Wright (1999). Numerical Optimization. Springer Series in Operations Research. Springer-Verlag, New York.
Platt, John C. (1999). Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods – Support Vector Learning (Bernhard Schölkopf, Christopher J. C. Burges and Alexander J. Smola, Eds.). Chap. 12. MIT Press, Cambridge, MA.
Poggio, Tomaso et al. (2002). b. In: Uncertainty in Geometric Computations (Joab Winkler and Mahesan Niranjan, Eds.). Chap. 11, pp. 131–141. Kluwer Academic Publishers, Boston.
Schölkopf, Bernhard and Alexander J. Smola (2002). Learning with Kernels. Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA.
Vapnik, Vladimir N. (1995). The Nature of Statistical Learning Theory. Information Science and Statistics. Springer-Verlag, New York.
Vogt, Michael, Karsten Spreitzer and Vojislav Kecman (2003). Identification of a high efficiency boiler by Support Vector Machines without bias term. In: Proceedings of the 13th IFAC Symposium on System Identification (SYSID 2003). Rotterdam, The Netherlands. pp. 485–490.