IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 13, NO. 2, APRIL 2005

TSK-Fuzzy Modeling Based on ε-Insensitive Learning

Jacek M. Leski, Senior Member, IEEE

Abstract—In this paper, new learning methods tolerant to imprecision are introduced and applied to fuzzy modeling based on the Takagi–Sugeno–Kang fuzzy system. Fuzzy modeling has an intrinsic inconsistency: it may perform thinking tolerant to imprecision, but its learning methods are zero-tolerant to imprecision. The proposed methods make it possible to exclude this intrinsic inconsistency of fuzzy modeling, in which zero-tolerance learning is used to obtain a fuzzy model that is itself tolerant to imprecision. These new methods can be called ε-insensitive learning (or ε-learning); in order to fit the fuzzy model to real data, the ε-insensitive loss function is used. This leads to a weighted or “fuzzified” version of Vapnik’s support vector regression machine. This paper introduces two approaches to solving the ε-insensitive learning problem. The first approach leads to a quadratic programming problem with bound constraints and one linear equality constraint. The second approach leads to a problem of solving a system of linear inequalities. Two computationally efficient numerical methods for ε-insensitive learning are proposed. The ε-insensitive learning leads to a model with a minimal Vapnik–Chervonenkis dimension, which results in an improved generalization ability of the model and its robustness to outliers. Finally, numerical examples are given to demonstrate the validity of the introduced methods.

Index Terms—Fuzzy systems, generalization control, robust methods, statistical learning theory, tolerant learning.

I. INTRODUCTION

FUZZY modeling enables finding nonlinear models of reality in which knowledge is obtained as a set of IF–THEN rules with linguistically interpreted propositions. Fuzzy modeling is based on the premise that human thinking is tolerant to imprecision and that the real world is too complicated to be described precisely [46], [47]. Fuzzy modeling has recently played an important role in many engineering fields, such as pattern recognition, control, identification, data mining, and so on [9], [10], [17], [18], [30], [36], [43]. Methods of fuzzy IF–THEN rule extraction can be divided into: 1) rules obtained from a human expert and 2) rules obtained automatically from observed data, usually by means of artificial neural networks incorporated into fuzzy systems. Methods from the first group have a significant disadvantage: few experts can and/or want to share their knowledge. In the methods from the second group, knowledge is acquired automatically by learning algorithms of neural networks and/or evolutionary methods. Such a connection of neural networks and fuzzy models is usually called a neuro-fuzzy system.

Neuro-fuzzy and fuzzy modeling have an intrinsic inconsistency. They may perform thinking tolerant to imprecision, but neural network learning methods are zero-tolerant to imprecision; that is, they usually use the quadratic loss function to match reality and a fuzzy model. In this case, only a perfect match between reality and the model leads to zero loss. The approach to fuzzy modeling presented in this paper is based on the premise that human learning, as well as thinking, is tolerant to imprecision. Hence, zero loss is assumed for a difference between a model and reality smaller than some preset value, denoted ε. If this difference is greater than ε, then the loss increases linearly. The learning method based on this loss function may be called ε-insensitive learning or, shortly, ε-learning.

In applications, data from a training set are corrupted by noise and outliers. It follows that fuzzy system design methods need to be robust. According to Huber [16], a robust method should have the following properties: i) it should have reasonably good accuracy at the assumed model; ii) small deviations from the model assumptions should impair the performance only by a small amount; and iii) larger deviations from the model assumptions should not cause a catastrophe. In the literature, there are many robust loss functions [16]. In this work, the ε-insensitive loss function is used, which is a generalization of the absolute error loss function (ε = 0).

It is well known in approximation theory (Tikhonov regularization) and machine learning (statistical learning theory) that too precise learning on a training set leads to overfitting (overtraining), which results in poor generalization ability. The generalization ability is interpreted as the production of a reasonable decision for data previously unseen in the process of training [12], [41]. Vapnik–Chervonenkis (VC) theory (or statistical learning theory) has recently emerged as a general theory for the estimation of dependencies from a finite set of data [40], [42]. The most important element of VC theory is the structural risk minimization (SRM) induction principle. The SRM principle suggests a tradeoff between the quality of an approximation and the complexity of the approximating function [40], [41]. A measure of the approximating function complexity (or capacity) is called the VC-dimension. One of the simplest methods to control the VC-dimension is to change the insensitivity parameter ε in the loss function and to enforce flatness of the approximating function by a regularization constant [40]–[42].

For the last few years, there has been an increasing interest in fuzzy systems which incorporate tools well known from statistical learning theory. Fuzzy clustering with a weighted (or fuzzy) ε-insensitive loss function is introduced in [25] and [26].

Manuscript received February 6, 2002; revised July 3, 2003 and July 2, 2004. The author is with the Division of Biomedical Electronics, Institute of Electronics, Silesian University of Technology, 44-101 Gliwice, Poland (e-mail: [email protected]). Digital Object Identifier 10.1109/TFUZZ.2004.840094


This method leads to improved robustness to outliers with respect to the traditional fuzzy clustering methods. The support vector fuzzy regression machines have been introduced in [15]. The support vector interval regression network has been established in [19]. A differentiable approximation of the misclassification rate, with the use of the empirical risk minimization principle to improve learning of a neuro-fuzzy classifier, is proposed in [5]. Paper [7] proposes a support vector learning approach to construct fuzzy classifiers. Work [8] describes the support vector fuzzy clustering method. An ε-insensitive approach to learning of neuro-fuzzy systems has been introduced in [27]. A similar approach to learning a classifier, called a fuzzy support vector machine, has been independently introduced in [24].

Although the idea of learning tolerant to imprecision can be incorporated into all fuzzy system design methods, in this work a method based on fuzzy partition of the input space will be presented due to its simplicity. Pedrycz, in [33], used the fuzzy c-means clustering (introduced by Bezdek [3]) to find antecedent variable membership functions and then identified a relational fuzzy model. Many authors (cf. [6], [10], [23], [34], [37], and [43]) have recently used the fuzzy c-means method to find clusters in the input space preserving similarity of input data. In this approach, each cluster corresponds to a fuzzy IF–THEN rule in the Takagi–Sugeno–Kang form [38], [39]

$$R^{(i)}: \quad \text{IF } \mathbf{x} \text{ is } A^{(i)}, \quad \text{THEN } y^{(i)} = \mathbf{x}'^{\top}\mathbf{p}^{(i)}, \qquad i = 1, 2, \ldots, c \tag{1}$$

where $\mathbf{x} = [x_1, x_2, \ldots, x_t]^{\top} \in \mathbb{R}^{t}$ is the input variable, $y^{(i)} \in \mathbb{R}$ is the output variable, $\mathbf{x}' = [\mathbf{x}^{\top}, 1]^{\top}$ is the augmented input vector, and $\mathbf{p}^{(i)} = [p^{(i)}_1, \ldots, p^{(i)}_t, p^{(i)}_0]^{\top}$ is the vector of consequent parameters of the $i$th rule; $p^{(i)}_0$ denotes a bias of the $i$th local model. The antecedent fuzzy set of the $i$th rule has a membership function $A^{(i)}(\mathbf{x})$. In the case of Gaussian membership functions and the algebraic product used as the $t$-norm, the fuzzy antecedent is defined as

$$A^{(i)}(\mathbf{x}) = \prod_{j=1}^{t} \exp\!\left[-\frac{\left(x_j - c^{(i)}_j\right)^2}{2\left(\sigma^{(i)}_j\right)^2}\right] \tag{2}$$

where $x_j$ denotes the $j$th component of the input vector, and $c^{(i)}_j$, $\sigma^{(i)}_j$, $i = 1, \ldots, c$; $j = 1, \ldots, t$, are the parameters (centers and dispersions) of the membership functions for the $i$th rule and the $j$th input variable, respectively. These parameters are obtained as

$$c^{(i)}_j = \frac{\displaystyle\sum_{k=1}^{N} (u_{ik})^{m}\, x_{kj}}{\displaystyle\sum_{k=1}^{N} (u_{ik})^{m}} \tag{3}$$

and

$$\left(\sigma^{(i)}_j\right)^2 = \frac{\displaystyle\sum_{k=1}^{N} (u_{ik})^{m}\left(x_{kj} - c^{(i)}_j\right)^2}{\displaystyle\sum_{k=1}^{N} (u_{ik})^{m}} \tag{4}$$

where $x_{kj}$ denotes the $j$th component of the $k$th input datum, $m$ is the weighting exponent, and $u_{ik}$ denotes an element of the partition matrix obtained from the fuzzy c-means clustering of the input space. For the input $\mathbf{x}$, the overall output of the fuzzy model is computed by a weighted averaging aggregation of the outputs of the individual rules (1) as [18], [43]

$$y_0(\mathbf{x}) = \frac{\displaystyle\sum_{i=1}^{c} A^{(i)}(\mathbf{x})\, \mathbf{x}'^{\top}\mathbf{p}^{(i)}}{\displaystyle\sum_{i=1}^{c} A^{(i)}(\mathbf{x})} = \sum_{i=1}^{c} \tilde{A}^{(i)}(\mathbf{x})\, \mathbf{x}'^{\top}\mathbf{p}^{(i)} \tag{5}$$

where $\tilde{A}^{(i)}(\mathbf{x}) = A^{(i)}(\mathbf{x}) \big/ \sum_{l=1}^{c} A^{(l)}(\mathbf{x})$ is the normalized firing strength of the $i$th rule for the input $\mathbf{x}$, and $\mathbf{p}^{(i)}$ denotes the consequent parameter vector. It must also be noted that (2) and (5) describe a radial-basis neural network. This network is linear with respect to the parameters in the consequent part of the IF–THEN rules. So, if we define $\mathbf{d}(\mathbf{x})^{\top} = \left[\tilde{A}^{(1)}(\mathbf{x})\,\mathbf{x}'^{\top}, \ldots, \tilde{A}^{(c)}(\mathbf{x})\,\mathbf{x}'^{\top}\right]$ and $\mathbf{p} = \left[\mathbf{p}^{(1)\top}, \ldots, \mathbf{p}^{(c)\top}\right]^{\top}$, then the overall output of the fuzzy system can be written as $y_0(\mathbf{x}) = \mathbf{d}(\mathbf{x})^{\top}\mathbf{p}$.

Usually, for the consequent parameters $\mathbf{p}^{(i)}$, the least squares (LS) estimation method is applied [18], [22], [37], [38]. There are two approaches: 1) to solve one global LS problem for all IF–THEN rules, and 2) to solve $c$ independent weighted LS problems, one for each IF–THEN rule. The first approach leads to better global performance, while the second one leads to more reliable local performance. A combination of both approaches is suggested in [45]. In this work, both approaches, i.e., global learning and local learning, will be used to introduce the idea of learning tolerant to imprecision into fuzzy modeling.

Suppose we have the training set $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$, where $N$ stands for the data cardinality, and each independent input datum $\mathbf{x}_k$ has a corresponding dependent output datum $y_k$. For fixed antecedents obtained via clustering of the input space, the global LS solution to the consequent parameters estimation minimizes the following criterion function [45]:

$$\min_{\mathbf{p}}\; \sum_{k=1}^{N} \left(y_k - \mathbf{d}(\mathbf{x}_k)^{\top}\mathbf{p}\right)^2 \tag{6}$$

which can be written in the matrix form as

$$\min_{\mathbf{p}}\; \left(\mathbf{y} - \mathbf{D}\mathbf{p}\right)^{\top}\left(\mathbf{y} - \mathbf{D}\mathbf{p}\right) \tag{7}$$
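To make the antecedent construction and the model output (1)–(5) concrete, the following sketch computes the FCM-derived centers and dispersions (3)–(4) and the weighted-average output (5). It is a minimal illustration in NumPy under the notation reconstructed above; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def antecedent_params(X, U, m=2.0):
    """Centers c[i, j] and dispersions sigma[i, j] from a fuzzy partition U (c x N), cf. (3)-(4)."""
    W = U ** m                                    # weighted memberships
    centers = (W @ X) / W.sum(axis=1, keepdims=True)
    disp2 = np.empty_like(centers)
    for i in range(U.shape[0]):
        diff2 = (X - centers[i]) ** 2
        disp2[i] = (W[i] @ diff2) / W[i].sum()
    return centers, np.sqrt(disp2)

def firing_strengths(X, centers, sigmas):
    """Gaussian memberships A^(i)(x_k), cf. (2), normalized over the rules, cf. (5)."""
    A = np.array([np.exp(-((X - c) ** 2 / (2 * s ** 2)).sum(axis=1))
                  for c, s in zip(centers, sigmas)])   # shape: (c rules, N samples)
    return A / A.sum(axis=0, keepdims=True)

def tsk_output(X, centers, sigmas, P):
    """Weighted-average TSK output (5); P has one consequent row p^(i) = [p_1..p_t, p_0] per rule."""
    Atil = firing_strengths(X, centers, sigmas)        # (c, N)
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])      # augmented inputs x'
    local = P @ Xa.T                                   # local model outputs, (c, N)
    return (Atil * local).sum(axis=0)

# toy usage with a random partition matrix standing in for an FCM result
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 1))
U = rng.dirichlet(np.ones(3), size=100).T              # (c=3, N=100), columns sum to 1
centers, sigmas = antecedent_params(X, U)
P = rng.normal(size=(3, 2))                            # arbitrary consequents for illustration
y0 = tsk_output(X, centers, sigmas, P)
```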


where $\mathbf{y} = [y_1, y_2, \ldots, y_N]^{\top}$ and $\mathbf{D} = \left[\mathbf{d}(\mathbf{x}_1), \mathbf{d}(\mathbf{x}_2), \ldots, \mathbf{d}(\mathbf{x}_N)\right]^{\top}$. The local LS solution to the consequent parameters estimation minimizes the following criterion functions [45]:

$$\min_{\mathbf{p}^{(i)}}\; \sum_{k=1}^{N} \tilde{A}^{(i)}(\mathbf{x}_k)\left(y_k - \mathbf{x}_k'^{\top}\mathbf{p}^{(i)}\right)^2, \qquad i = 1, 2, \ldots, c \tag{8}$$

which can be written in the matrix form as

$$\min_{\mathbf{p}^{(i)}}\; \left(\mathbf{y} - \mathbf{X}'\mathbf{p}^{(i)}\right)^{\top}\mathbf{G}^{(i)}\left(\mathbf{y} - \mathbf{X}'\mathbf{p}^{(i)}\right), \qquad i = 1, 2, \ldots, c \tag{9}$$

where $\mathbf{X}' = [\mathbf{x}_1', \mathbf{x}_2', \ldots, \mathbf{x}_N']^{\top}$ and $\mathbf{G}^{(i)} = \mathrm{diag}\left(\tilde{A}^{(i)}(\mathbf{x}_1), \ldots, \tilde{A}^{(i)}(\mathbf{x}_N)\right)$.

The goal of this work is twofold: first, to introduce new learning methods tolerant to imprecision in the case of the fuzzy system design methods recalled previously; next, to investigate the generalization ability and the robustness to outliers of the fuzzy system obtained by means of this learning method for synthetic as well as real-world high-dimensional data.

This paper is organized as follows. Section II presents an introduction to the ε-insensitive learning method and shows that this approach leads to a quadratic programming problem. Section III presents a new numerical method, called iterative quadratic programming (IQP), to solve the problem of ε-insensitive learning. The ε-insensitive learning by solving a system of linear inequalities (εLSSLI), without the need to solve the quadratic programming problem, is introduced in Section IV. Section V presents simulation results and a discussion for the fuzzy modeling of real-world high-dimensional data. Finally, conclusions are drawn in Section VI.

II. ε-INSENSITIVITY IN TSK-FUZZY SYSTEM LEARNING

The problem of learning tolerant to imprecision can be presented as determining the consequent parameters $\mathbf{p}$ (or $\mathbf{p}^{(i)}$, $i = 1, \ldots, c$), where an ε-insensitive loss function is used in order to fit the fuzzy model to real data from the training set. For a scalar argument $t$, the ε-insensitive loss function has the form [41]

$$|t|_{\varepsilon} = \begin{cases} 0, & |t| \le \varepsilon \\ |t| - \varepsilon, & |t| > \varepsilon \end{cases} \tag{10}$$

and for a vector argument, $\mathbf{t} = [t_1, \ldots, t_N]^{\top}$, it can be defined as

$$\|\mathbf{t}\|_{\varepsilon} = \sum_{k=1}^{N} |t_k|_{\varepsilon}. \tag{11}$$

Using the augmented input vector, we seek a regression model in the form

$$y_0(\mathbf{x}_k) = \mathbf{d}(\mathbf{x}_k)^{\top}\mathbf{p} \tag{12}$$

for the global learning method, where $\mathbf{p}$ is obtained as a solution to the following problem:

$$\min_{\mathbf{p}}\; \sum_{k=1}^{N} \left|y_k - \mathbf{d}(\mathbf{x}_k)^{\top}\mathbf{p}\right|_{\varepsilon} + \frac{\tau}{2}\,\tilde{\mathbf{p}}^{\top}\tilde{\mathbf{p}} \tag{13}$$

where $\tilde{\mathbf{p}}$ is the vector $\mathbf{p}$ with the components corresponding to the biases excluded. The second term in (13) is related to the minimization of the Vapnik–Chervonenkis dimension (complexity) of the regression model [41]. The parameter $\tau \ge 0$ controls the tradeoff between the complexity of the regression model and the amount up to which errors are tolerated. Using the ε-insensitive loss function, the local learning (for the $i$th rule) can be written as the problem

$$\min_{\mathbf{p}^{(i)}}\; \sum_{k=1}^{N} \tilde{A}^{(i)}(\mathbf{x}_k)\left|y_k - \mathbf{x}_k'^{\top}\mathbf{p}^{(i)}\right|_{\varepsilon} + \frac{\tau}{2}\,\tilde{\mathbf{p}}^{(i)\top}\tilde{\mathbf{p}}^{(i)} \tag{14}$$

where $\tilde{\mathbf{p}}^{(i)}$ is the vector $\mathbf{p}^{(i)}$ with the last component, corresponding to the bias, excluded. The aforementioned problem can be called the weighted ε-insensitive estimator (or the fuzzy ε-insensitive estimator) with complexity control. Both global and local learning can be represented as the following problem:

$$\min_{\mathbf{w}}\; \sum_{k=1}^{N} g_k \left|y_k - \mathbf{z}_k^{\top}\mathbf{w}\right|_{\varepsilon} + \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}} \tag{15}$$

where the meaning of the notations $\mathbf{w}$, $\mathbf{z}_k$, and $g_k$ for the global and local learning methods is presented in Table I. In this table, $\mathbf{1}$ denotes the vector of dimension $N \times 1$ with all entries equal to 1. It is worth noting that, as a special case of (15), where $\mathbf{z}_k = \mathbf{x}_k'$ and $g_k = C$ for $k = 1, \ldots, N$, we obtain the criterion function used in the support vector regression (SVR) machine with a linear kernel [41].

TABLE I
NOTATIONS USED IN GLOBAL AND LOCAL ε-LEARNING
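The loss (10)–(11) and the weighted criterion (15) are straightforward to evaluate. The short sketch below, based on the reconstructed notation above (the names are illustrative, not from the paper), shows how a candidate parameter vector would be scored under global learning.

```python
import numpy as np

def eps_insensitive(t, eps):
    """Vapnik's eps-insensitive loss, elementwise, cf. (10)."""
    return np.maximum(np.abs(t) - eps, 0.0)

def criterion(w, Z, y, g, eps, tau, bias_idx):
    """Weighted eps-insensitive criterion with complexity control, cf. (15).

    Z: (N, d) regressor matrix, y: (N,) targets, g: (N,) weights,
    bias_idx: indices of w that are biases and are therefore not penalized.
    """
    data_term = g @ eps_insensitive(y - Z @ w, eps)
    w_tilde = np.delete(w, bias_idx)            # exclude bias components
    return data_term + 0.5 * tau * w_tilde @ w_tilde

# tiny usage example: one-dimensional global learning with a single rule
rng = np.random.default_rng(1)
Z = np.hstack([rng.normal(size=(50, 1)), np.ones((50, 1))])   # [x, 1]
y = 2.0 * Z[:, 0] + 0.1 * rng.normal(size=50)
J = criterion(np.array([1.5, 0.0]), Z, y, np.ones(50), eps=0.05, tau=0.5, bias_idx=[1])
```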


Criterion function (15) may also be treated as a “fuzzified” version of Vapnik’s SVR criterion.

Let us first apply the methodology used by Vapnik [42] to optimizing (15). In a general case, the inequalities $\left|y_k - \mathbf{z}_k^{\top}\mathbf{w}\right| \le \varepsilon$ are not satisfied for all training data. If we introduce slack variables $\xi_k, \xi_k^{*} \ge 0$, then for all data $k = 1, \ldots, N$ we can write

$$\begin{cases} y_k - \mathbf{z}_k^{\top}\mathbf{w} \le \varepsilon + \xi_k \\ \mathbf{z}_k^{\top}\mathbf{w} - y_k \le \varepsilon + \xi_k^{*}. \end{cases} \tag{16}$$

Using (16), the criterion (15) can be written in the form

$$\min\; \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}} + \sum_{k=1}^{N} g_k\left(\xi_k + \xi_k^{*}\right) \tag{17}$$

to be minimized subject to the constraints (16) and $\xi_k, \xi_k^{*} \ge 0$, $k = 1, \ldots, N$. The Lagrangian function of (17) with the previous constraints is

$$\begin{aligned} L = {} & \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}} + \sum_{k=1}^{N} g_k\left(\xi_k + \xi_k^{*}\right) - \sum_{k=1}^{N} \alpha_k\left(\varepsilon + \xi_k - y_k + \mathbf{z}_k^{\top}\mathbf{w}\right)\\ & - \sum_{k=1}^{N} \alpha_k^{*}\left(\varepsilon + \xi_k^{*} + y_k - \mathbf{z}_k^{\top}\mathbf{w}\right) - \sum_{k=1}^{N}\left(\mu_k \xi_k + \mu_k^{*}\xi_k^{*}\right) \end{aligned} \tag{18}$$

where $\alpha_k, \alpha_k^{*}, \mu_k, \mu_k^{*} \ge 0$ are the Lagrange multipliers. The objective is to minimize this Lagrangian with respect to $\mathbf{w}$, $\xi_k$, $\xi_k^{*}$. It must also be maximized with respect to the Lagrange multipliers. The following conditions for optimality (the saddle point of the Lagrangian) are obtained by differentiating (18) with respect to $\mathbf{w}$, $\xi_k$, $\xi_k^{*}$ and setting the results equal to zero:

$$\tau\tilde{\mathbf{w}} = \sum_{k=1}^{N}\left(\alpha_k - \alpha_k^{*}\right)\tilde{\mathbf{z}}_k, \qquad g_k - \alpha_k - \mu_k = 0, \qquad g_k - \alpha_k^{*} - \mu_k^{*} = 0. \tag{19}$$

The last two conditions (19) and the requirements $\mu_k, \mu_k^{*} \ge 0$ imply that $0 \le \alpha_k \le g_k$, $0 \le \alpha_k^{*} \le g_k$. From the first condition (19), we obtain

$$\tilde{\mathbf{w}} = \frac{1}{\tau}\sum_{k=1}^{N}\left(\alpha_k - \alpha_k^{*}\right)\tilde{\mathbf{z}}_k. \tag{20}$$

We cannot determine the bias parameters from the previous formula. For the bias components, differentiation of (18) yields the condition $\sum_{k=1}^{N}\left(\alpha_k - \alpha_k^{*}\right) = 0$.

Putting (19) in the Lagrangian (18), we get

$$\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}^{*}}\; -\frac{1}{2\tau}\sum_{k=1}^{N}\sum_{l=1}^{N}\left(\alpha_k - \alpha_k^{*}\right)\left(\alpha_l - \alpha_l^{*}\right)\tilde{\mathbf{z}}_k^{\top}\tilde{\mathbf{z}}_l - \varepsilon\sum_{k=1}^{N}\left(\alpha_k + \alpha_k^{*}\right) + \sum_{k=1}^{N} y_k\left(\alpha_k - \alpha_k^{*}\right). \tag{21}$$

Maximization of (21) with respect to $\boldsymbol{\alpha}, \boldsymbol{\alpha}^{*}$, subject to the constraints $\sum_{k=1}^{N}(\alpha_k - \alpha_k^{*}) = 0$ and $0 \le \alpha_k, \alpha_k^{*} \le g_k$, is referred to as the Wolfe dual formulation of (18). Once again, note that (21) is a generalization (or “fuzzified” version) of the solution obtained for the SVR machine. For $g_k = C$, $k = 1, \ldots, N$, Vapnik’s solution is obtained. At the saddle point, for each Lagrange multiplier, the Karush–Kuhn–Tucker (KKT) conditions must be satisfied

$$\begin{cases} \alpha_k\left(\varepsilon + \xi_k - y_k + \mathbf{z}_k^{\top}\mathbf{w}\right) = 0\\ \alpha_k^{*}\left(\varepsilon + \xi_k^{*} + y_k - \mathbf{z}_k^{\top}\mathbf{w}\right) = 0\\ \left(g_k - \alpha_k\right)\xi_k = 0\\ \left(g_k - \alpha_k^{*}\right)\xi_k^{*} = 0. \end{cases} \tag{22}$$

From (22), we see that the following conditions are satisfied.
• If $\alpha_k \in (0, g_k)$, then $\xi_k = 0$, and the data pair $(\mathbf{x}_k, y_k)$ is on the upper bound of the insensitivity region.
• If $\alpha_k^{*} \in (0, g_k)$, then $\xi_k^{*} = 0$, and the data pair is on the lower bound of the insensitivity region.
• If $\alpha_k = g_k$, then $\xi_k \ge 0$, and the data pair is above the upper bound of the insensitivity region.
• If $\alpha_k^{*} = g_k$, then $\xi_k^{*} \ge 0$, and the data pair is below the lower bound of the insensitivity region.
• If $\alpha_k = 0$, then the data pair is below the upper bound of the insensitivity region.
• If $\alpha_k^{*} = 0$, then the data pair is above the lower bound of the insensitivity region.

From the first two of these conditions and (22), we can determine the bias

$$\text{bias} = \begin{cases} y_k - \varepsilon - \tilde{\mathbf{z}}_k^{\top}\tilde{\mathbf{w}}, & \text{for } \alpha_k \in (0, g_k)\\ y_k + \varepsilon - \tilde{\mathbf{z}}_k^{\top}\tilde{\mathbf{w}}, & \text{for } \alpha_k^{*} \in (0, g_k). \end{cases} \tag{23}$$

Thus, we may determine the bias from (23) by taking any $k$ for which there are Lagrange multipliers in the open interval $(0, g_k)$. In the local learning case (the $i$th rule), the bias is equal to $p^{(i)}_0$; in the global learning case, the biases of the individual rules are obtained analogously.

Computation of the parameters $\mathbf{w}$ leads to the quadratic programming (QP) problem (21) with bound constraints and one linear equality constraint. For a large training set, the standard optimization techniques quickly become intractable in their memory and time requirements. A standard implementation of a QP solver requires the explicit storage of an $N \times N$ matrix. Osuna et al. in [32] and Joachims in [20] show that the large QP problem can be decomposed into a series of smaller QP subproblems over a part of the training data. Platt in [35] proposes the sequential minimal optimization (SMO) algorithm, which chooses two Lagrange multipliers and finds their optimal values analytically.
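For small problems, the dual (21) can be handed directly to a general-purpose solver. The sketch below builds the dual objective with the box constraints $0 \le \alpha_k, \alpha_k^{*} \le g_k$ and the single equality constraint, and solves it with SciPy's SLSQP routine; this is only an illustration of the problem structure under the reconstructed notation, not the decomposition or SMO-style solvers discussed above.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(Z_tilde, y, g, eps, tau):
    """Solve the dual QP (21): the variables are stacked as [alpha, alpha_star]."""
    N = len(y)
    K = (Z_tilde @ Z_tilde.T) / tau                   # Gram matrix of the non-bias regressors

    def neg_dual(v):
        a, a_star = v[:N], v[N:]
        d = a - a_star
        return 0.5 * d @ K @ d + eps * (a + a_star).sum() - y @ d

    cons = [{"type": "eq", "fun": lambda v: np.sum(v[:N] - v[N:])}]
    bounds = [(0.0, gk) for gk in g] * 2              # 0 <= alpha_k, alpha_k* <= g_k
    res = minimize(neg_dual, x0=np.zeros(2 * N), bounds=bounds,
                   constraints=cons, method="SLSQP")
    a, a_star = res.x[:N], res.x[N:]
    w_tilde = ((a - a_star) @ Z_tilde) / tau          # primal solution, cf. (20)
    return a, a_star, w_tilde
```

The bias would then be recovered from (23) using any multiplier strictly inside its box, as described above.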


A crucial disadvantage of these techniques is that they may give an approximate solution and may require many passes through the training set. In [31], an alternative approach that determines the solution of the QP problem for classification is presented. In the next section, this idea is used to solve the problem of fuzzy modeling with ε-insensitive learning.

III. ITERATIVE QP SOLUTION

Regrouping the terms in (18) yields a criterion, denoted (24), that is quadratic in $\mathbf{w}$ and in which the Lagrange multipliers $\alpha_k$, $\alpha_k^{*}$ enter through auxiliary parameters $a_k$, $a_k^{*}$ defined in (25). The condition for optimality is obtained by differentiating (24) with respect to $\mathbf{w}$ and setting the result equal to zero, which gives (26). If we define the matrix of regressors and the diagonal matrix of the parameters $a_k$, $a_k^{*}$, the solution to (26) with respect to $\mathbf{w}$ can be written in the closed form (27). Expression (27) is the optimal solution to (24). However, this solution is unconstrained, and in order to satisfy the KKT conditions from the previous section, a special setting of the parameters $a_k$ and $a_k^{*}$ is needed: each $a_k$ ($a_k^{*}$) must be set according to whether the corresponding datum lies inside, on the bound of, or outside the insensitivity region. Thus, the Lagrange multipliers can be treated as functions of the slack variables. From the aforementioned rules, we can see that these functions have a discontinuity at $\xi_k = 0$ ($\xi_k^{*} = 0$). In [31], the approximation of this type of function by means of a piecewise linear function is proposed. Here, the following "fuzzified" version is proposed as (28), where $L$ is a big positive value, with the companion definition (29). As a special case, for $g_k = C$ we obtain the functions proposed in [31]. It should be noted that there is no theoretical basis for the selection of these functions; however, a piecewise linear approximation of the discontinuity at $\xi_k = 0$ ($\xi_k^{*} = 0$) is the simplest one.

From (27) we can infer that $\mathbf{w}$ depends on $a_k$, $a_k^{*}$, and from (25), (28), and (29) that $a_k$, $a_k^{*}$ depend on $\mathbf{w}$. In this case, the most popular algorithm for approximating the solution is the Picard iteration through (27) and (25), (28), (29). This algorithm loops through the cycle of estimates and checks the termination criterion $\left\|\mathbf{w}^{[j+1]} - \mathbf{w}^{[j]}\right\| < \kappa$, where $\kappa$ is a preset parameter and the superscript $[j]$ denotes the iteration index. There is no theoretical guarantee that the aforementioned iterations lead to an optimal solution of the quadratic programming problem. However, in practice, the iteration through (27) and (25), (28), (29) leads to fast convergence to an optimal or near-optimal solution. The procedure of seeking the optimal $\mathbf{w}$ can be called iterative quadratic programming (IQP) and summarized in the following steps.
1) Initialize $a_k^{[1]}$, $a_k^{*[1]}$ for all $k = 1, \ldots, N$. Set the iteration index $j = 1$.
2) Compute $\mathbf{w}^{[j]}$ using (27).
3) Compute the slack variables $\xi_k^{[j]}$, $\xi_k^{*[j]}$, $k = 1, \ldots, N$.
4) Compute $a_k^{[j+1]}$, $a_k^{*[j+1]}$ using (28) and $\alpha_k^{[j+1]}$, $\alpha_k^{*[j+1]}$ using (25).
5) If the termination criterion is not satisfied, then set $j := j + 1$ and go to 2).
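Since (24)–(29) are not reproduced here, the following is only a schematic of the Picard-style loop this section describes: a closed-form weighted solve for w alternated with an update of per-datum weights, stopped when successive estimates of w agree. The specific weight-update rule in the sketch is an illustrative IRLS-style surrogate for the ε-insensitive loss, an assumption of this example rather than the paper's (28)–(29); the names are likewise illustrative.

```python
import numpy as np

def picard_eps_ridge(Z, y, g, eps, tau, bias_idx, n_iter=50, kappa=1e-6, delta=1e-8):
    """Schematic Picard iteration: alternate a weighted ridge solve for w with a
    data-dependent weight update; stop when successive estimates of w agree."""
    N, d = Z.shape
    I_tilde = np.eye(d)
    I_tilde[bias_idx, bias_idx] = 0.0            # biases are not penalized
    a = np.full(N, 1.0)                          # initial per-datum weights
    w = np.zeros(d)
    for _ in range(n_iter):
        A = np.diag(a * g)
        w_new = np.linalg.solve(Z.T @ A @ Z + tau * I_tilde + 1e-12 * np.eye(d),
                                Z.T @ A @ y)
        e = y - Z @ w_new
        # illustrative weight update: zero inside the insensitivity zone, ~1/|e| outside
        a = np.maximum(np.abs(e) - eps, 0.0) / (e ** 2 + delta)
        if np.linalg.norm(w_new - w) < kappa:    # termination criterion
            w = w_new
            break
        w = w_new
    return w
```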


IV. ε-INSENSITIVE LEARNING BY SOLVING A SYSTEM OF LINEAR INEQUALITIES

The minimization problem (15) can be rewritten in the matrix form

$$\min_{\mathbf{w}}\; \left\|\mathbf{y} - \mathbf{Z}\mathbf{w}\right\|_{\varepsilon,\mathbf{g}} + \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}} \tag{30}$$

where $\|\cdot\|_{\varepsilon,\mathbf{g}}$ denotes a weighted Vapnik loss function defined for a vector argument as

$$\left\|\mathbf{t}\right\|_{\varepsilon,\mathbf{g}} = \sum_{k=1}^{N} g_k\left|t_k\right|_{\varepsilon}. \tag{31}$$

To make the minimization problem (30) mathematically tractable, we see that the minimization of the first term can be equivalently written as the following requirements: $\mathbf{Z}\mathbf{w} \ge \mathbf{y} - \varepsilon\mathbf{1}$ and $-\mathbf{Z}\mathbf{w} \ge -\mathbf{y} - \varepsilon\mathbf{1}$. Defining the extended forms of $\mathbf{Z}$ and $\mathbf{y}$

$$\mathbf{Z}_e = \begin{bmatrix}\mathbf{Z}\\ -\mathbf{Z}\end{bmatrix}, \qquad \mathbf{y}_e = \begin{bmatrix}\mathbf{y} - \varepsilon\mathbf{1}\\ -\mathbf{y} - \varepsilon\mathbf{1}\end{bmatrix}$$

the previous requirements can be written as $\mathbf{Z}_e\mathbf{w} \ge \mathbf{y}_e$. In practically interesting cases, not all inequalities in the aforementioned system are fulfilled (except the case where ε is large enough to make all the data fall into the insensitivity region). Note that the solution of the inequality system $\mathbf{Z}_e\mathbf{w} \ge \mathbf{y}_e$ is equivalent to the solution of the equality system $\mathbf{Z}_e\mathbf{w} = \mathbf{y}_e + \mathbf{b}$, where $\mathbf{b}$ is an arbitrary positive vector, $\mathbf{b} > \mathbf{0}$. Indeed, we do not know $\mathbf{b}$. However, the method described later can obtain both $\mathbf{w}$ and $\mathbf{b}$.

We define the error vector as $\mathbf{e} = \mathbf{Z}_e\mathbf{w} - \mathbf{y}_e - \mathbf{b}$. If the $k$th (or $(N+k)$th) component of $\mathbf{e}$ is positive, then the $k$th datum falls into the insensitivity region, and by increasing the respective component of $\mathbf{b}$, the error $e_k$ ($e_{N+k}$) can be set to zero. If the $k$th ($(N+k)$th) component of $\mathbf{e}$ is negative, then the $k$th datum falls outside the insensitivity region, and it is impossible to decrease $e_k$ ($e_{N+k}$) and fulfill the condition $\mathbf{b} > \mathbf{0}$. In other words, we obtain a nonzero error only for the data outside the insensitivity region. Our minimization problem (30) can be approximated by the following:

$$\min_{\mathbf{w},\, \mathbf{b} > \mathbf{0}}\; \left(\mathbf{Z}_e\mathbf{w} - \mathbf{y}_e - \mathbf{b}\right)^{\top}\mathbf{D}\left(\mathbf{Z}_e\mathbf{w} - \mathbf{y}_e - \mathbf{b}\right) + \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}} \tag{32}$$

where $\mathbf{D}$ is a diagonal weight matrix. For mathematical simplicity, the previous criterion is an approximation of (30) in which the squared error rather than the absolute error is used. After describing the algorithm for the squared error, later in this section [see after (34)], an algorithm for the absolute error will be obtained using an iteratively reweighted least-squares method.

We obtain the conditions for optimality by differentiating (32) with respect to $\mathbf{w}$ and $\mathbf{b}$ and setting the results equal to zero

$$\begin{cases} \mathbf{w} = \left(2\,\mathbf{Z}_e^{\top}\mathbf{D}\,\mathbf{Z}_e + \tau\tilde{\mathbf{I}}\right)^{-1} 2\,\mathbf{Z}_e^{\top}\mathbf{D}\left(\mathbf{y}_e + \mathbf{b}\right)\\[2pt] \mathbf{D}\left(\mathbf{Z}_e\mathbf{w} - \mathbf{y}_e - \mathbf{b}\right) = \mathbf{0} \end{cases} \tag{33}$$

where $\tilde{\mathbf{I}}$ denotes the identity matrix with zeros in the positions corresponding to the bias components. From the first equation of (33), we see that the vector $\mathbf{w}$ depends on the vector $\mathbf{b}$. The vector $\mathbf{b}$ can be called a margin vector because its components determine the distance from a datum to the insensitivity region. For fixed $\mathbf{w}$, if a datum is placed in the insensitivity region, the corresponding distance can be increased in order to obtain zero error. However, if the datum is placed outside the insensitivity region, then the error is negative and we may decrease its magnitude only by decreasing the corresponding component of $\mathbf{b}$. The only way to prevent $\mathbf{b}$ from converging to zero is to start with $\mathbf{b} > \mathbf{0}$ and to refuse to decrease any of its components. Ho and Kashyap proposed an iterative algorithm for alternately determining $\mathbf{w}$ and $\mathbf{b}$ in which the components of $\mathbf{b}$ cannot decrease [13], [14]. The solution is obtained in an iterative way. The vector $\mathbf{w}^{[j]}$ is determined on the basis of the first equation from (33), where the superscript $[j]$ denotes the iteration index. The components of the vector $\mathbf{b}$ are modified by the components of the error vector $\mathbf{e}$, but only when this results in increasing the components of $\mathbf{b}$. Otherwise, the components of $\mathbf{b}$ remain unmodified. So, we write this modification as follows [13]:

$$\mathbf{b}^{[j+1]} = \mathbf{b}^{[j]} + \rho\left(\mathbf{e}^{[j]} + \left|\mathbf{e}^{[j]}\right|\right) \tag{34}$$

where $\rho > 0$ is a parameter and $|\cdot|$ denotes the componentwise absolute value. Note that the criterion (32) may be rewritten for the absolute error as

$$\min_{\mathbf{w},\, \mathbf{b} > \mathbf{0}}\; \sum_{k=1}^{2N} g_k\left|e_k\right| + \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}} \tag{35}$$

with $g_{N+k} = g_k$. Using the equality

$$\left|e_k\right| = \frac{e_k^2}{\left|e_k\right|} \tag{36}$$

criterion (35) may be rewritten as

$$\min_{\mathbf{w},\, \mathbf{b} > \mathbf{0}}\; \sum_{k=1}^{2N} \frac{g_k}{\left|e_k\right|}\, e_k^2 + \frac{\tau}{2}\,\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{w}}. \tag{37}$$

Thus, comparing (32) and (37), the absolute error criterion equivalent to (30) is easily obtained by selecting the following diagonal weight matrix:

$$\left[\mathbf{D}\right]_{kk} = \frac{g_k}{\left|e_k\right|}, \qquad k = 1, \ldots, 2N$$

where $e_k$ is the $k$th component of the error vector. However, the error vector depends on $\mathbf{w}$, so we use the vector from the previous iteration. This procedure is based on the premise that, near the optimum solution, sequential vectors $\mathbf{w}^{[j]}$ differ imperceptibly.
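Under the reconstruction above, one pass of the Ho–Kashyap-style scheme alternates the ridge-like solve for w in (33), the margin update (34), and the reweighting that turns the squared-error criterion (32) into the absolute-error one (37). The sketch below follows that reading; the names, the initialization, and the stopping rule are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def eps_lssli(Z, y, g, eps, tau, bias_idx, rho=0.9, n_iter=200, kappa=1e-6):
    """Sketch of eps-insensitive learning by solving a system of linear inequalities."""
    N, d = Z.shape
    Ze = np.vstack([Z, -Z])                              # extended regressor matrix
    ye = np.concatenate([y - eps, -y - eps])             # extended targets
    gg = np.concatenate([g, g])                          # data weights, repeated for both halves
    I_tilde = np.eye(d)
    I_tilde[bias_idx, bias_idx] = 0.0                    # biases are not penalized
    b = np.ones(2 * N)                                   # margin vector, kept positive
    D = gg.copy()                                        # diagonal of the weight matrix
    for _ in range(n_iter):
        # weighted ridge solve for w, cf. the first equation of (33)
        A = Ze * D[:, None]
        w = np.linalg.solve(2 * Ze.T @ A + tau * I_tilde + 1e-12 * np.eye(d),
                            2 * A.T @ (ye + b))
        e = Ze @ w - ye - b                              # error vector
        D = gg / np.maximum(np.abs(e), 1e-8)             # absolute-error reweighting, cf. (37)
        b_new = b + rho * (e + np.abs(e))                # Ho-Kashyap update (34): b never decreases
        if np.linalg.norm(b_new - b) < kappa:
            b = b_new
            break
        b = b_new
    return w, b
```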


It is worth noting that the traditional method to solve least absolute value (LAV) problems leads to a linear programming problem with equality constraints [1], [2]. For a large training set, standard optimization techniques quickly become intractable in their memory and time requirements. The above-presented algorithm is very similar to iteratively reweighted least squares (IRLS) [11], [21], [29]. However, in the original IRLS algorithm, the vector $\mathbf{b}$ remains unmodified during the iteration process. The presented approach to solving the linear matrix inequalities differs from IRLS by incorporating the modification of the margin vector and the term related to the minimization of the complexity of the solution. The procedure of seeking the optimal $\mathbf{w}$ and $\mathbf{b}$ may be called ε-insensitive learning by solving a system of linear inequalities (εLSSLI) and summarized in the following steps.
1) Fix $\tau \ge 0$ and $\varepsilon \ge 0$. Initialize $\mathbf{b}^{[1]} > \mathbf{0}$ and the weight matrix $\mathbf{D}^{[1]}$. Set the iteration index $j = 1$.
2) Compute $\mathbf{w}^{[j]}$ from the first equation of (33).
3) Compute the error vector $\mathbf{e}^{[j]} = \mathbf{Z}_e\mathbf{w}^{[j]} - \mathbf{y}_e - \mathbf{b}^{[j]}$.
4) Compute the weight matrix $\left[\mathbf{D}^{[j+1]}\right]_{kk} = g_k \big/ \left|e_k^{[j]}\right|$.
5) Compute $\mathbf{b}^{[j+1]}$ according to (34).
6) If $\left\|\mathbf{b}^{[j+1]} - \mathbf{b}^{[j]}\right\| > \kappa$, then set $j := j + 1$ and go to 2).

Remarks: The constant $\rho$ is a preset parameter. Appendix A shows that for $0 < \rho < 1$ and for any diagonal matrix $\mathbf{D}$ the above algorithm is convergent. If step 4) in this algorithm is omitted, then the squared error minimization procedure is applied. In practice, division by zero in step 4) does not occur. This results from the fact that some components of the vector $\mathbf{e}$ go to zero only as $\mathbf{b}$ goes to infinity; however, in this case, convergence is slow and condition 6) stops the algorithm. A similar algorithm for classification problems has been introduced in [28].

V. NUMERICAL EXPERIMENTS AND DISCUSSION

In the IQP and εLSSLI methods, fixed values of the internal parameters were used in all experiments. The iterations were stopped as soon as the Euclidean norm of the difference between a successive pair of vectors fell below a preset threshold. For both IQP and εLSSLI, the fuzzy c-means clustering algorithm was applied with the weighting exponent equal to 2. A random partition matrix was used for initialization, and the iterations were stopped as soon as the Frobenius norm of the difference between a successive pair of partition matrices fell below a preset threshold. All experiments were run on a Pentium IV 2.53-GHz processor running Windows XP and the MATLAB environment. The support vector regression (SVR) machine from the Matlab Support Vector Machine Toolbox by S. Gunn was used.¹ The toolbox implements an interior point algorithm for quadratic programming. For a large training set, this technique quickly becomes intractable in its memory and time requirements. There are many approaches to solving quadratic programming problems for large training sets; however, a crucial disadvantage of these techniques is that they may give an approximate solution of the quadratic problem. Thus, in all experiments, the interior point algorithm was used. A Gaussian kernel was used for the SVR machine.

¹This toolbox has been obtained via the Internet at http://www.isis.ecs.soton.ac.uk/resources/svminfo.

A. One-Dimensional Nonlinear Function Approximation

The purpose of this section is to make a comparative study of the performance of the proposed methods with respect to the support vector regression machine [41]. The objective of these experiments was to approximate data generated by means of a nonlinear function $f(x)$, given by (38). The training set consists of 100 data pairs. Each datum pair $(x_k, y_k)$ was generated by the following technique: first, a uniform random number $x_k$ was generated on a fixed interval; next, the value of $y_k$ was obtained as $f(x_k)$ plus a uniform random noise. Examples of these data are presented in Fig. 1, where the data pairs are denoted by crosses and the function $f(x)$ is a bold solid line. In the experiments, 50 different training sets, generated by the aforementioned procedure, were used.

The quality of approximation was measured using the root mean square error (RMSE) between the function $f(x)$ and the obtained model, computed on a uniform grid of 201 points. The parameters ε and τ were changed in the range from 0 to 3.5 (step 0.1). The number of IF–THEN rules was changed from 2 to 10. Tables II and III show the results for each number of IF–THEN rules for the global and local learning methods, respectively. The data in these tables are the mean and standard deviation of the RMSE over 50 experiments. The values of the ε and τ parameters for which the lowest mean RMSE was obtained are also shown. If we take these tables into account, it should be noted that, regardless of the number of IF–THEN rules, learning tolerant to imprecision leads to a better function approximation compared with zero-tolerant learning, for both global and local learning. The best approximation for each number of rules is obtained for ε and τ values different from zero. The IQP method for local learning leads to the best approximation of the function (for seven IF–THEN rules). Only slightly worse results were obtained for the εLSSLI global method. On the same datasets, the best approximation with the SVR machine was obtained for tuned values of the Gaussian kernel parameter, the insensitivity parameter, and the regularization parameter. Thus, we see that the SVR machine leads to a slightly better approximation compared with the fuzzy modeling methods based on ε-insensitive learning. However, the running time of the fuzzy modeling methods is approximately 20 times shorter than that of the SVR method. In other words, fuzzy modeling with learning tolerant to imprecision leads to results comparable to SVR, but its running time is significantly shorter.
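The evaluation protocol above (RMSE against the target function on a uniform grid, averaged over repeated random training sets) is simple to express in code. The sketch below uses an arbitrary placeholder target function and model, since the actual function (38), its constants, and the data-generation intervals are not reproduced here.

```python
import numpy as np

def rmse_on_grid(model, target, lo, hi, n_points=201):
    """RMSE between a fitted model and the target function on a uniform grid."""
    grid = np.linspace(lo, hi, n_points)
    return np.sqrt(np.mean((model(grid) - target(grid)) ** 2))

# illustrative protocol: mean and standard deviation of RMSE over 50 random training sets
target = np.sinc                                   # placeholder for the true function (38)
scores = []
rng = np.random.default_rng(2)
for _ in range(50):
    x = rng.uniform(-3.0, 3.0, size=100)
    y = target(x) + rng.uniform(-0.1, 0.1, size=100)
    coeffs = np.polyfit(x, y, deg=5)               # placeholder model standing in for the fuzzy model
    scores.append(rmse_on_grid(lambda g: np.polyval(coeffs, g), target, -3.0, 3.0))
print(np.mean(scores), np.std(scores))
```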


Fig. 1. Simulation results of function approximation using ε-insensitive local fuzzy modeling, where the bold line denotes the approximated function and the solid line denotes the model. Noisy data are denoted by crosses.

TABLE II
SIMULATION RESULTS OBTAINED FOR APPROXIMATION OF A NONLINEAR FUNCTION – GLOBAL LEARNING. THE BEST RESULT FOR EACH METHOD IS BOLDED

TABLE III
SIMULATION RESULTS OBTAINED FOR APPROXIMATION OF A NONLINEAR FUNCTION – LOCAL LEARNING. THE BEST RESULT FOR EACH METHOD IS BOLDED

The number of coefficients obtained for the fuzzy model is $c(3t+1)$, where $c$ denotes the number of IF–THEN rules and $t$ is the dimensionality of the input space ($t+1$ consequent parameters and $2t$ antecedent parameters per rule). Thus, the best approximation of the function is obtained for 28 coefficients. For the SVR machine, the number of coefficients is $(t+1)N_{SV} + 1$, where $N_{SV}$ denotes the number of support vectors. For the datasets used in this experiment, the number of support vectors varied from 17 to 34, so the number of coefficients varies from 35 to 69.


Fig. 2. Simulation results of function approximation using support vector regression with a Gaussian kernel, where the bold line denotes the approximated function and the solid line denotes the model. Noisy data are denoted by crosses.

Fig. 3. ECG signal used in tests. (a) Training part. (b) Testing part.

In conclusion, it should be stated that fuzzy modeling with ε-insensitive learning leads to a more economical solution (a small number of coefficients), which is independent of the training set used. Figs. 1 and 2 show examples of the models obtained using fuzzy modeling with ε-insensitive learning and the SVR machine, respectively.


B. Real-World High-Dimensional Data

The purpose of these experiments was to compare the generalization ability of the proposed fuzzy system with learning tolerant to imprecision, the classical (zero-tolerance) learning, and the support vector regression machine. The following benchmark databases were used.
• Data originating from [4], concerning the identification of a gas oven. Air and methane were delivered into the gas oven (the gas flow in ft/min is the input signal) to obtain a mixture of gases containing CO₂ (the percentage CO₂ content is the output signal). The data consist of 296 pairs of input–output samples with a 9-s sampling period. In order to identify the model, lagged values of the input and output signals were used as the input vector, and the current output sample was used as the output. The learning set consists of the first 100 pairs of data, and the testing set consists of the remaining 190 pairs of data.
• Data originating from [44], concerning the prediction of the sunspot number. The data consist of 280 measurements, with a one-year sampling period, of sunspot activity from 1700 to 1979 A.D. To identify the model, lagged values of the sunspot number were used as the input vector, and the current value was used as the output. The learning set consists of the first 100 vectors, and the testing set consists of the remaining 168 vectors.
• The ECG signal from the MIT-BIH database—the record numbered 100. The sampling frequency of this signal is equal to 360 Hz and the quantization step size is 5 µV. The learning process was conducted for the first 450 samples with one outlier—the learning set [see Fig. 3(a)]. The testing set consists of 5000 samples—the testing set [see Fig. 3(b)]. The order of the model was equal to 4, and a nonlinear one-step predictor was built; thus, the four previous samples were used as the input vector and the current sample as the output.

The two parameters, ε and τ, were changed in the ranges from 0 to 0.5 (step 0.01) and from 0 to 0.2 (step 0.01). The number of IF–THEN rules was changed from 2 to 6 (from 2 to 10 for the ECG database). After the training stage (fuzzy system design on the training set), the generalization ability of the designed model was determined as the root mean squared error (RMSE) on the test set. The training stage was repeated 25 times with a different random initialization of the FCM method for each combination of the above parameter values. Tables IV–VI show the lowest RMSE for each number of IF–THEN rules for the Box–Jenkins, Sunspots, and ECG databases, respectively. The values of the ε and τ parameters for which the lowest RMSE was obtained, as well as the RMSE for zero-tolerant learning, are also shown.

If we take these tables into account, several observations can be made. First of all, it should be noted that, regardless of the number of IF–THEN rules, learning tolerant to imprecision leads to a better generalization compared with zero-tolerant learning, for all databases. The best generalization for each number of rules is obtained for ε and τ values different from zero. It must also be noted that different dependencies of the generalization ability on the number of IF–THEN rules are observed for zero-tolerant learning and for learning tolerant to imprecision.
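Building the lagged input–output pairs for the one-step predictors described above is a small preprocessing step. The helper below is a generic sketch: the exact lag structures used for the Box–Jenkins and Sunspots data are not given here, so the usage shows only the ECG case with four past samples, and the signal is a placeholder.

```python
import numpy as np

def make_lagged(series, order):
    """Inputs are the `order` previous samples, the target is the current sample."""
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[order - i - 1 : len(series) - i - 1] for i in range(order)])
    y = series[order:]
    return X, y

# ECG-style one-step predictor of order 4: x_k = [s(k-1), ..., s(k-4)], y_k = s(k)
signal = np.sin(np.linspace(0, 20, 500))      # placeholder signal
X, y = make_lagged(signal, order=4)
X_train, y_train = X[:450 - 4], y[:450 - 4]   # pairs drawn from the first 450 samples
X_test, y_test = X[450 - 4:], y[450 - 4:]
```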


TABLE IV RMSE OBTAINED FOR THE TESTING PART OF THE BOX-JENKINS DATA. THE BEST RESULT FOR EACH METHOD IS BOLDED

TABLE V RMSE OBTAINED FOR THE TESTING PART OF THE SUNSPOTS DATA. THE BEST RESULT FOR EACH METHOD IS BOLDED

TABLE VI RMSE OBTAINED FOR THE TESTING PART OF THE ECG SIGNAL. THE BEST RESULT FOR EACH METHOD IS BOLDED

For zero-tolerant learning, increasing the number of rules results in a fast decrease in the generalization ability due to overfitting to the training set. For both learning methods tolerant to imprecision, a slower decrease in the generalization ability is observed in this case.


Fig. 4. Errors of ECG signal modeling obtained by (a) ε-insensitive global learning and (b) the SVR method.

The best generalization ability for the Box–Jenkins data is obtained using the IQP method for local learning; for the Sunspots database, the best generalization is obtained for the IQP method for local learning; and for the ECG database, the best generalization is obtained for the IQP method for global learning (the corresponding numbers of rules and values of ε and τ are given in Tables IV–VI). Only slightly worse results were obtained for the εLSSLI method. Fig. 4 shows the course of the error signal obtained by: (a) fuzzy modeling with eight IF–THEN rules and global IQP learning tolerant to imprecision, and (b) the SVR machine. The corresponding results obtained for the SVR method on the Box–Jenkins, Sunspots, and ECG databases are summarized in Table VII. Taking into account the numbers of support vectors, the following numbers of coefficients are obtained for the SVR machine: 1761 for the Box–Jenkins database, 2185 for the Sunspots database, and 1721 for the ECG database. For TSK fuzzy modeling, the best generalization is obtained for the following numbers of coefficients: 62, 185, and 104, respectively. Thus, the knowledge base obtained by fuzzy modeling with ε-insensitive learning is significantly smaller (and linguistically interpretable) with respect to that from the SVR machine. It can also be noted that TSK fuzzy modeling with ε-insensitive learning leads to better generalization than the SVR machine for real-world high-dimensional data. The proposed εLSSLI method converges after a few iterations (usually fewer than 10), and the IQP method converges after fewer than 50 iterations.

The running times for the ECG database were as follows: the SVR machine—519.9 s; the TSK model for eight IF–THEN rules: IQP global—0.32 s, IQP local—0.28 s, εLSSLI global—0.26 s, εLSSLI local—0.23 s, LS global—0.09 s, LS local—0.18 s. Thus, the running time of TSK fuzzy modeling with ε-insensitive learning was over 1600 times shorter than that of the SVR machine. To make the comparison easier, Table VII contains the RMSE and the number of coefficients achieved by the proposed algorithms and the SVR machine for a set of various problems.

Tests of the robustness to outliers of the proposed learning methods were also performed. First, for the Box–Jenkins database, the fifth output sample, which has the original value equal to 52.7, was set to a value equal to 100. Hence, only one outlier was added to the database in this case. In the second case, two outliers were added to the training set: the fifth output sample once again and, additionally, another output sample (with the original value 48.1) were set to 100. In the third case, Cauchy random noise was added to the output variable of the training part of the Box–Jenkins data. Finally, in the fourth case, Cauchy random noise was added to the input and output variables of the training part of the Box–Jenkins data. The training stage was repeated 25 times for a different random initialization of the FCM method, using the number of rules and the parameters ε and τ for which the best generalization without outliers and noise was obtained. Table VIII shows the lowest RMSE for the testing part of this database. From this table, we can see that the best generalization in the presence of outliers and noise was obtained for the εLSSLI method, while slightly worse results were obtained for the IQP method. For the noise added only to the output variable, the best results were obtained for the IQP method. Finally, it can be noted that the learning methods tolerant to imprecision lead to a better than two-fold generalization improvement with respect to the zero-tolerance learning.


TABLE VII COMPARISON OF RMSE AND THE NUMBERS OF COEFFICIENTS FOR EACH METHOD ON DIFFERENT DATASETS. THE BEST RESULT FOR EACH DATASET IS BOLDED

TABLE VIII RMSE OBTAINED FOR THE TESTING PART OF THE BOX-JENKINS DATA IN THE PRESENCE OF OUTLIERS AND NOISE IN THE TRAINING PART

The SVR machine leads to significantly better robustness to single outliers, but in the presence of noise on the input and output, the results obtained by TSK fuzzy modeling are better.

Interesting questions for future research are as follows: i) is it possible to improve the generalization ability of a fuzzy model by its fine tuning using gradient methods based on the ε-insensitive loss function? ii) is it true that using the ε-insensitive fuzzy c-means [25] instead of FCM clustering improves the generalization ability of a designed fuzzy system? and iii) how can the model complexity be controlled with respect to the premise part of the IF–THEN rules?

VI. CONCLUSION

In this work, a new approach to fuzzy modeling with learning tolerant to imprecision is presented. Vapnik's ε-insensitive loss function is used in this method of learning. It is shown that in this case a weighted ("fuzzified") version of the support vector regression is obtained. Numerically effective methods of designing the TSK-fuzzy system, called IQP and the ε-insensitive learning by solving a system of linear inequalities (εLSSLI), are also introduced. These methods establish the connection between fuzzy modeling and statistical learning theory, where an easy control of the system complexity is permitted. Numerical examples show the usefulness of the new methods in fuzzy system design with improved generalization ability and robustness to outliers compared with the classical zero-tolerance learning. The learning tolerant to imprecision always leads to a better generalization compared with the classical methods. It is impossible to decide which tolerant learning method is the best: the ε-insensitive learning by solving a system of linear inequalities usually has better robustness to outliers, while the iterative quadratic programming leads to a better generalization with respect to the εLSSLI, but its computational burden is greater. The experiments also show that the "fuzzified" version of the support vector machine leads to a more economical knowledge base, that is, one with a significantly smaller number of coefficients. Additionally, in this case, the knowledge base has a linguistic interpretation and is independent of the training set. In practical applications, it is also very important that TSK-fuzzy modeling with learning tolerant to imprecision leads to a generalization ability similar to that of the support vector machine, but its running time is dramatically shorter. Finally, it may be stated that fuzzy modeling with learning tolerant to imprecision can be competitive with the state-of-the-art methods based on the support vector machine.

APPENDIX

The first equation from (33) can be rewritten in a form showing that, for $\mathbf{b} > \mathbf{0}$, all elements of the error vector cannot be zero. If we define the changes of the error vector and of the margin vector between successive iterations, then using (33) and (34), the change of the criterion value (32) between successive iterations can be expressed. Using the first line of (33) and the update (34), after some simple algebra, the second and third terms of this expression simplify, and the change of the criterion reduces to a sum of two terms. The matrix involved is symmetric and positive semidefinite; as a result, the second term is negative or zero. For $0 < \rho < 1$, the first term is negative or zero. Thus, the sequence of criterion values is monotonically decreasing. Convergence requires that $\mathbf{e}^{[j]} + \left|\mathbf{e}^{[j]}\right|$ tends to zero (no modification in (34)), while $\mathbf{b}^{[j]}$ is bounded away from zero, due to $\mathbf{b}^{[1]} > \mathbf{0}$.

ACKNOWLEDGMENT

The author would like to thank the Associate Editor and the anonymous referees for their constructive comments that have helped to improve this paper.

REFERENCES

[1] I. Barrodale and F. D. K. Roberts, "An improved algorithm for discrete L1 linear approximation," SIAM J. Numer. Anal., vol. 10, no. 5, pp. 839–848, 1973.
[2] I. Barrodale and A. Young, "Algorithms for best L1 and L∞ linear approximations on a discrete set," Numerische Mathematik, vol. 8, pp. 295–306, 1966.
[3] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[4] G. E. P. Box and G. M. Jenkins, Time Series Analysis: Forecasting and Control. San Francisco, CA: Holden-Day, 1976.
[5] G. Castellano, A. M. Fanelli, and C. Mencar, "An empirical risk functional to improve learning in a neuro-fuzzy classifier," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. 725–730, Jan. 2004.
[6] J.-Q. Chen, Y.-G. Xi, and Z.-J. Zhang, "A clustering algorithm for fuzzy model identification," Fuzzy Sets Syst., vol. 98, pp. 319–329, 1998.
[7] Y. Chen and J. Z. Wang, "Support vector learning for fuzzy rule-based classification systems," IEEE Trans. Fuzzy Syst., vol. 11, no. 6, pp. 716–728, Dec. 2003.
[8] J.-H. Chiang and P.-Y. Hao, "A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing," IEEE Trans. Fuzzy Syst., vol. 11, no. 4, pp. 518–527, Aug. 2003.
[9] K. B. Cho and B. H. Wang, "Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction," Fuzzy Sets Syst., vol. 83, pp. 325–339, 1996.
[10] E. Czogala and J. M. Leski, Fuzzy and Neuro-Fuzzy Intelligent Systems. Heidelberg, Germany: Physica-Verlag, 2000.
[11] P. J. Green, "Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives," J. Roy. Statist. Soc., ser. B, vol. 46, pp. 149–192, 1984.
[12] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999.
[13] Y.-C. Ho and R. L. Kashyap, "An algorithm for linear inequalities and its applications," IEEE Trans. Electron. Comput., vol. EC-14, pp. 683–688, 1965.
[14] Y.-C. Ho and R. L. Kashyap, "A class of iterative procedures for linear inequalities," SIAM J. Control, vol. 4, pp. 112–115, 1966.
[15] D. H. Hong and C. Hwang, "Support vector fuzzy regression machines," Fuzzy Sets Syst., vol. 138, no. 2, pp. 271–281, 2003.
[16] P. J. Huber, Robust Statistics. New York: Wiley, 1981.
[17] H. Ishibuchi, R. Fujioka, and H. Tanaka, "Neural networks that learn from fuzzy IF–THEN rules," IEEE Trans. Fuzzy Syst., vol. 1, no. 2, pp. 85–97, Apr. 1993.
[18] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Upper Saddle River, NJ: Prentice-Hall, 1997.
[19] J.-T. Jeng, C.-C. Chuang, and S.-F. Su, "Support vector interval regression networks for interval regression analysis," Fuzzy Sets Syst., vol. 138, no. 2, pp. 283–300, 2003.
[20] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 169–184.
[21] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Comput., vol. 6, no. 2, pp. 181–214, 1994.
[22] E. Kim, M. Park, S. Ji, and M. Park, "A new approach to fuzzy modeling," IEEE Trans. Fuzzy Syst., vol. 5, no. 3, pp. 328–337, Jun. 1997.
[23] E. Kim, M. Park, S. Kim, and M. Park, "A transformed input-domain approach to fuzzy modeling," IEEE Trans. Fuzzy Syst., vol. 6, no. 4, pp. 596–604, Aug. 1998.
[24] C.-F. Lin and S.-D. Wang, "Fuzzy support vector machines," IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 464–471, Mar. 2002.


[25] J. M. Leski, "An ε-insensitive approach to fuzzy clustering," Int. J. Appl. Math. Comput. Sci., vol. 11, no. 4, pp. 101–115, 2001.
[26] J. M. Leski, "Toward a robust fuzzy clustering," Fuzzy Sets Syst., vol. 137, no. 2, pp. 215–233, 2003.
[27] J. M. Leski, "Neuro-fuzzy system with learning tolerant to imprecision," Fuzzy Sets Syst., vol. 138, no. 2, pp. 427–439, 2003.
[28] J. M. Leski, "An ε-margin nonlinear classifier based on IF–THEN rules," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1, pp. 68–76, 2004.
[29] P. McCullagh and J. A. Nelder, Generalized Linear Models. London, U.K.: Chapman and Hall, 1983.
[30] S. Mitra and Y. Hayashi, "Neuro-fuzzy rule generation: Survey in soft computing framework," IEEE Trans. Neural Netw., vol. 11, no. 3, pp. 748–768, May 2000.
[31] A. Navia-Vázquez, F. Pérez-Cruz, A. Artés-Rodríguez, and A. R. Figueiras-Vidal, "Weighted least squares training of support vector classifiers leading to compact and adaptive schemes," IEEE Trans. Neural Netw., vol. 12, no. 5, pp. 1047–1059, Sep. 2001.
[32] E. Osuna, R. Freund, and F. Girosi, "An improved training algorithm for support vector machines," in Proc. IEEE Workshop Neural Networks for Signal Processing, 1997, pp. 276–285.
[33] W. Pedrycz, "An identification algorithm in fuzzy relational systems," Fuzzy Sets Syst., vol. 13, pp. 153–167, 1984.
[34] W. Pedrycz, "Conditional fuzzy clustering in the design of radial basis function neural networks," IEEE Trans. Neural Netw., vol. 9, no. 4, pp. 601–612, Jul. 1998.
[35] J. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," in Advances in Kernel Methods – Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds. Cambridge, MA: MIT Press, 1999, pp. 185–208.
[36] D. Rutkowska, Neuro-Fuzzy Architectures and Hybrid Learning. Heidelberg, Germany: Physica-Verlag, 2002.
[37] M. Setnes, "Supervised fuzzy clustering for rule extraction," IEEE Trans. Fuzzy Syst., vol. 8, no. 4, pp. 416–424, Aug. 2000.
[38] M. Sugeno and G. T. Kang, "Structure identification of fuzzy model," Fuzzy Sets Syst., vol. 28, pp. 15–33, 1988.
[39] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modeling and control," IEEE Trans. Syst., Man, Cybern., vol. SMC-15, no. 1, pp. 116–132, 1985.
[40] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[41] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[42] V. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–999, Sep. 1999.
[43] L.-X. Wang, A Course in Fuzzy Systems and Control. New York: Prentice-Hall, 1998.
[44] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart, "Predicting the future: A connectionist approach," Int. J. Neural Syst., vol. 1, pp. 193–209, 1990.
[45] J. Yen, L. Wang, and C. W. Gillespie, "Improving the interpretability of TSK fuzzy models by combining global learning and local learning," IEEE Trans. Fuzzy Syst., vol. 6, no. 4, pp. 530–537, Aug. 1998.
[46] L. A. Zadeh, "Fuzzy sets," Inf. Control, vol. 8, pp. 338–353, 1965.
[47] L. A. Zadeh, "Outline of a new approach to the analysis of complex systems and decision processes," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 1, pp. 28–44, Jan. 1973.

Jacek M. Leski (M’98–SM’03) was born in Gliwice, Poland. He received the M.Sc., Ph.D., and D.Sc. degrees in electronics from Silesian University of Technology, Gliwice, Poland, in 1987, 1989, and 1995, respectively. He is a Professor of Biomedical Information Processing at Silesian University of Technology and, currently, Head of the Division of Biomedical Electronics. He also is a Professor at the Institute of Medical Technology and Equipment, Zabrze, Poland. His research interests are biomedical digital signal processing, fuzzy and neuro-fuzzy modeling, pattern recognition, and learning theory. In these subjects, he has authored and coauthored four monographs, nine chapters in monographs, and more than 150 international journal and conference papers. Dr. Leski is a Senior Member of the EMBS, the IEEE Signal Processing Society, the IEEE Systems, Man, and Cybernetics Society, the IEEE Neural Networks Society, the IEEE Communication Society, the IEEE Control Systems Society, the Polish Biomedical Engineering Society, and the Polish Society of Theoretical and Applied Electrotechnics.
