International Journal of Computational Cognition (http://www.YangSky.com/yangijcc.htm), Volume 1, Number 1, Pages 21–77, March 2003. Publisher Item Identifier S 1542-5908(03)10102-9/$20.00. Article electronically published on October 12, 2002 at http://www.YangSky.com/ijcc11.htm. Please cite this paper as: Jacek M. Łęski, “ε-Insensitive Learning Techniques for Approximate Reasoning Systems (Invited Paper)”, International Journal of Computational Cognition (http://www.YangSky.com/yangijcc.htm), Volume 1, Number 1, Pages 21–77, March 2003.
ε-INSENSITIVE LEARNING TECHNIQUES FOR APPROXIMATE REASONING SYSTEMS (INVITED PAPER)

JACEK M. ŁĘSKI
Abstract. First, an axiomatic approach to the definitions of fuzzy connectives is recalled. Based on these definitions, several important fuzzy connectives and their properties are described. Then, the idea of approximate reasoning using the generalized modus ponens and fuzzy implications is considered. After a review of well-known fuzzy systems, an artificial neural network based on the logical interpretation of if-then rules is presented. The elimination of the non-informative part of the final fuzzy set before defuzzification plays the key role in this system. Next, new learning methods tolerant to imprecision are introduced and used to train this system. The proposed learning methods make it possible to dispose of an intrinsic inconsistency of neuro-fuzzy modeling, in which zero-tolerance learning is used to obtain a fuzzy model that is tolerant to imprecision. These new methods may be called ε-insensitive learning; in order to fit the fuzzy model to real data, the weighted ε-insensitive loss function is used. ε-insensitive learning leads to a fuzzy model with minimal Vapnik-Chervonenkis dimension, which results in an improved generalization ability of the system. Another advantage of the proposed learning methods is their robustness to outliers. In this paper, two approaches to solving the ε-insensitive learning problem are presented. The first approach leads to a quadratic programming problem with bound constraints and one linear equality constraint. The second approach leads to a problem of solving a system of linear inequalities. Three computationally efficient numerical methods for ε-insensitive learning are proposed. Finally, an example is given to demonstrate the validity of the introduced methods. Copyright ©2002 Yang’s Scientific Research Institute, LLC. All rights reserved.
Received by the editors October 10, 2002 / final version received October 12, 2002. Key words and phrases: approximate reasoning, fuzzy connectives, generalized modus ponens, epsilon-insensitive learning, robust methods, statistical learning theory, neuro-fuzzy systems. ©2002 Yang’s Scientific Research Institute, LLC. All rights reserved.
1. Introduction

Fuzzy modeling makes it possible to find nonlinear models of reality in which knowledge is obtained as a set of if-then rules with linguistically interpreted propositions. Fuzzy modeling is based on the premise that human thinking is tolerant to imprecision and that the real world is too complicated to be described precisely [66], [67]. Fuzzy modeling now plays an important role in many engineering fields, such as pattern recognition, control, identification, data mining, and so on [10], [13], [28], [34], [35], [36], [38], [39], [58].

An investigation of inference processes in which the premises and/or conclusions of if-then rules are fuzzy is still the subject of many papers [4], [8], [9], [12], [17], [19], [30], [45], [47], [59]. In such processes, a sound and proper choice of logical operators plays an essential role. The theoretical (mathematical) and the practical (computational) behavior of logical operators in inference processes has to be known before such a choice is made. Both types of knowledge related to the well-known families of triangular norms and implications can be found in the literature [17], [19], [59]. Selected logical operators and fuzzy implications have also been investigated with respect to their behavior in inference processes. Fuzzy if-then rules have, on the one hand, a conjunctive interpretation and, on the other hand, an interpretation in terms of classical logical implication. Inference algorithms based on the conjunctive interpretation of if-then rules are simpler and faster than the algorithms used for the logical interpretation of such rules; on the other hand, the logical implication interpretation of if-then rules leads to intuitively better inference results. In this paper an inference with a specific defuzzification that leads to simpler, faster and intuitively acceptable results is presented, together with an artificial neural network that automatically generates this kind of fuzzy if-then rules.

In the literature, several methods of automatic fuzzy rule generation from numerical data have been described [7], [25], [27], [31], [46], [60]. The simplest method of rule generation is based on a clustering algorithm and the estimation of proper fuzzy relations from a set of numerical data [31], [60]. Another group of methods, which uses the learning capability of neural networks and the fact that both fuzzy systems and neural networks are universal approximators, has been successfully applied to various tasks. The problem here is the difficulty in understanding the identified fuzzy rules, since they are implicitly encoded in the network itself. Mitra et al. [46] have proposed a fuzzy multilayer perceptron generating fuzzy rules from the connection weights. Several methods of extracting rules from given data are based on the class of radial basis function networks (RBFNs). The fact
that there is a functional equivalence between RBFNs and fuzzy systems has been used by Jang et al. [26] to construct the Takagi-Sugeno-Kang type adaptive-network-based fuzzy inference system (ANFIS), which is trained by the back-propagation algorithm. Such a connection of neural networks and fuzzy models is usually called a neuro-fuzzy system. More general fuzzy reasoning schemes in ANFIS are employed by Horikawa et al. [25]. Radial-basis-function-based adaptive fuzzy systems developed along these lines have been described by Cho and Wang [7] and applied to system identification and prediction. Another type of fuzzy system, with moving fuzzy sets in the consequents of if-then rules, is shown in [33].

Methods of extracting fuzzy if-then rules can be divided into [13]: (1) rules obtained from a human expert, (2) rules obtained automatically from observed data, usually by means of artificial neural networks incorporated into fuzzy systems. Methods from the first group have a great disadvantage: few experts can or want to share their knowledge. In methods from the second group, knowledge is acquired automatically by the learning algorithms of neural networks. Neuro-fuzzy modeling has an intrinsic inconsistency: it is meant to perform thinking tolerant to imprecision, but the learning methods of neural networks are zero-tolerant to imprecision. Usually, these learning methods use the quadratic loss function to match reality and a fuzzy model; in this case only a perfect match between reality and the model leads to zero loss. The approach to neuro-fuzzy modeling presented in this paper is based on the premise that human learning, as well as thinking, is tolerant to imprecision. Hence, zero loss is assumed for an error less than some pre-set value, denoted ε. If the error is greater than ε, then the loss increases linearly. A learning method based on this loss function may be called ε-insensitive learning.

In real applications, data from a training set are corrupted by noise and outliers. It follows that fuzzy system design methods need to be robust. In the literature there are many robust loss functions [24]. In this work, the ε-insensitive loss function is used, which is a generalization of the absolute error loss function (ε = 0). According to Huber [24], a robust method should have the following properties: (i) it should have reasonably good accuracy at the assumed model, (ii) small deviations from the model assumptions should impair the performance only by a small amount, (iii) larger deviations from the model assumptions should not cause a catastrophe.

It is well known in approximation theory (Tikhonov regularization) and machine learning (statistical learning theory) that too precise learning on a training set leads to overfitting (overtraining), which results in poor generalization ability. Generalization ability is understood as the generation of reasonable decisions for data previously unseen in the process of training
[21], [53], [61]. Vapnik-Chervonenkis (VC) theory (or statistical learning theory) has recently emerged as a general theory for the estimation of dependencies from finite sets of data [62]. The most important element of VC theory is the Structural Risk Minimization (SRM) induction principle. The SRM principle suggests a tradeoff between the quality of an approximation and the complexity of the approximating function [61]. The measure of the complexity (or capacity) of the approximating function is called the VC-dimension. One of the simplest methods of controlling the VC-dimension is to change the insensitivity parameter ε in the loss function: increasing ε results in a decreasing VC-dimension.

The goal of this work is twofold. Firstly, the theoretical description and structure of an artificial neural network based on the logical interpretation of if-then rules (ANBLIR) are introduced. Secondly, the idea of learning tolerant to imprecision is used for automatic fuzzy if-then rule extraction. The paper is divided into 11 sections. Some introductory remarks are contained in Section 1. Section 2 presents a short review of the axiomatic approach to the definition of fuzzy connectives, that is, conjunctions, disjunctions, negations and implications. Section 3 recalls the main ideas of the fuzzy inference process using the generalized modus ponens with the conjunctive as well as the logical interpretation of if-then rules. Section 4 introduces the basics of fuzzy systems. In Section 5 the structure of the neuro-fuzzy system (ANBLIR) and the estimation of its parameters are shown. Section 6 introduces the ε-insensitive learning method and shows that this approach leads to a quadratic programming problem. Section 7 presents a new numerical method, called Iterative Quadratic Programming (IQP), to solve the ε-insensitive learning problem. Section 8 describes ε-insensitive learning using a method based on incremental solving of a quadratic programming problem. ε-insensitive Learning by Solving a System of Linear Inequalities (εLSSLI), without the need to solve a quadratic programming problem, is presented in Section 9. Section 10 illustrates the theoretical considerations by applying the neuro-fuzzy system to a system identification problem. Finally, concluding remarks are gathered in Section 11.

2. An approach to axiomatic definition of fuzzy connectives

We start our considerations by applying an axiomatic approach to the definition of fuzzy connectives, i.e., conjunction, disjunction, negation and implication [17], [18], [19]. Let us consider the class of intersection-union operators known as the triangular norms, i.e., the t-norm T and t-conorm (s-norm) S operators considered as functions, T : [0, 1] × [0, 1] → [0, 1];
S : [0, 1] × [0, 1] → [0, 1]. T serves as a basis for defining intersections of fuzzy sets, while S serves as a basis for defining unions of fuzzy sets. Taking into account the properties of classical sets, the following axioms may be accepted for t-norms [19]:

T1° T(x, 1) = x, T(x, 0) = 0 — boundary conditions,
T2° T(x, y) = T(y, x) — commutativity,
T3° if x ≤ q and y ≤ r, then T(x, y) ≤ T(q, y) and T(x, y) ≤ T(x, r) — monotonicity,
T4° T(x, T(y, z)) = T(T(x, y), z) — associativity,

and for s-norms:

S1° S(x, 0) = x, S(x, 1) = 1 — boundary conditions,
S2° S(x, y) = S(y, x) — commutativity,
S3° if x ≤ q and y ≤ r, then S(x, y) ≤ S(q, y) and S(x, y) ≤ S(x, r) — monotonicity,
S4° S(x, S(y, z)) = S(S(x, y), z) — associativity,

where q, r, x, y, z ∈ [0, 1]. In other words, a function T(·, ·) is a t-norm if and only if it satisfies conditions T1°–T4°, and a function S(·, ·) is an s-norm if and only if it satisfies conditions S1°–S4°. From the algebraic point of view, T is a semigroup operation on [0, 1] with identity 1 and S is a semigroup operation on [0, 1] with identity 0. The most important examples of corresponding t-norms and s-norms are given in Table 1.

Now, we will discuss the complement of a fuzzy set. According to the minimal requirements which are necessary to identify an operation called negation, we postulate the existence of a nonincreasing function n : [0, 1] → [0, 1] such that n(0) = 1, n(1) = 0. This class of functions can be narrowed by taking into account the following conditions:

N1° n is strictly decreasing,
N2° n is continuous,
N3° n(n(x)) = x for all x ∈ [0, 1].

A negation is strict if it satisfies N1° and N2°. A strict negation is called strong if N3° additionally holds. The specific strong negation N(x) = 1 − x is called the standard negation. Since a strict negation is a strictly decreasing and continuous function, its inverse n⁻¹ is also a strict negation. The equality n⁻¹(x) = n(x) holds for all x ∈ [0, 1] if and only if n is involutive, i.e., n(n(x)) = x holds for all x ∈ [0, 1]. It is easy to show that for every negation n the following inequalities are satisfied [19]:

(1)  ∀ x ∈ [0, 1]:  n_i(x) ≤ n(x) ≤ n_di(x),
Table 1. Selected t-norms and s-norms.

Name          t-norm                                               s-norm
Zadeh         M(x, y) = min(x, y)                                  M′(x, y) = max(x, y)
Algebraic     Π(x, y) = xy                                         Π′(x, y) = x + y − xy
Łukasiewicz   W(x, y) = max(x + y − 1, 0)                          W′(x, y) = min(x + y, 1)
Fodor         min0(x, y) = min(x, y) if x + y > 1, 0 otherwise     max1(x, y) = max(x, y) if x + y < 1, 1 otherwise
Drastic       Z(x, y) = min(x, y) if max(x, y) = 1, 0 otherwise    Z′(x, y) = max(x, y) if min(x, y) = 0, 1 otherwise
where n_i(x) and n_di(x) denote the intuitionistic and dual intuitionistic negations, defined as follows:

(2)  n_i(x) ≜ 1 for x = 0, and 0 for x > 0,
(3)  n_di(x) ≜ 1 for x < 1, and 0 for x = 1.

As in classical set theory, the de Morgan laws establish a link between union and intersection via complementation. If a t-norm T, an s-norm S and a strong negation n satisfy the de Morgan laws,

(4)  ∀ x, y ∈ [0, 1]:  n[T(x, y)] = S[n(x), n(y)],  n[S(x, y)] = T[n(x), n(y)],

then the triple (T, S, n) is called a de Morgan triple and T, S are called n-duals of each other. The t-norms and s-norms introduced in Table 1 are dual when considered with the standard negation N(x) = 1 − x. Obviously, for any t-norm T and any s-norm S the following inequalities hold [13]:

(5)  ∀ x, y ∈ [0, 1]:  Z(x, y) ≤ T(x, y) ≤ M(x, y),  M′(x, y) ≤ S(x, y) ≤ Z′(x, y).

Moreover, W ≤ Π and W ≤ min0, but Π and min0 are not comparable in this sense. Also Π′ ≤ W′ and max1 ≤ W′, but Π′ and max1 are not comparable in the sense mentioned above. A better illustration of the above inequalities can be obtained by means of an integrated index introducing a distance measure between arbitrary operations f₁, f₂ on the interval [0, 1] [9]. Let f₁, f₂ : [0, 1] × [0, 1] → [0, 1] be measurable functions treated as two-argument operations on [0, 1]. The pseudometric distance between the operations f₁ and f₂ with respect to the values of their arguments is calculated as follows:

(6)  d(f₁, f₂) ≜ ∫₀¹ ∫₀¹ |f₁(x, y) − f₂(x, y)| dx dy.

For the constant operations f₁(x, y) = 0, f₂(x, y) = 1 for all x, y ∈ [0, 1], we get d(f₁, f₂) = 1. Since the constant operations differ from the drastic operations Z, Z′ only by the boundary conditions, we also obtain the distance d(Z, Z′) = 1 for the drastic operations. Taking into account the min (M-norm) and max (M′-norm) operations, we can divide the operations on [0, 1] into three basic classes: (i) products (Z ≤ T ≤ M), (ii) averages (M ≤ A ≤ M′), (iii) sums (M′ ≤ S ≤ Z′). Note that the well-known averages belong to the class of averages, i.e., the arithmetic mean (x + y)/2, the geometric mean √(xy),
and the harmonic mean 2xy/(x + y). For example, the following distances may be determined using (6): d(Z, M) = d(M, M′) = d(M′, Z′) = 1/3, d(W, Π) = d(Π′, W′) = 1/12.

Now we will discuss fuzzy implications. According to the requirements which are necessary to identify an operation called a fuzzy implication, it is a function I : [0, 1] × [0, 1] → [0, 1] satisfying the following conditions [19]:

I1° if x ≤ z then I(x, y) ≥ I(z, y) — monotonicity with respect to the first argument,
I2° if y ≤ z then I(x, y) ≤ I(x, z) — monotonicity with respect to the second argument,
I3° I(0, y) = 1 — falsity implies anything,
I4° I(x, 1) = 1 — anything implies tautology,
I5° I(1, 0) = 0 — Booleanity,

where x, y, z ∈ [0, 1]. Let us also recall further properties, in terms of the function I, which may be important in some applications:

I6° I(1, x) = x — tautology cannot justify anything,
I7° I(x, I(y, z)) = I(y, I(x, z)) — exchange principle,
I8° x ≤ y if and only if I(x, y) = 1 — implication defines an ordering,
I9° I(x, 0) = N(x) is a strong negation,
I10° I(x, y) ≥ y,
I11° I(x, x) = 1 — identity principle,
I12° I(x, y) = I(N(y), N(x)) with a strong negation N,
I13° I is a continuous function.
The two most important families of such implications are related either to the formalism of Boolean logic or to the residuation concept from intuitionistic logic. For these concepts, suitable definitions are introduced below [19], [47], [57], [59]. An S-implication associated with an s-norm S (⋆_S) and a strong negation N is defined by

(7)  ∀ x, y ∈ [0, 1]:  I_{S,N}(x, y) = N(x) ⋆_S y.
An R-implication associated with a t-norm T (⋆_T) is defined by

(8)  ∀ x, y ∈ [0, 1]:  I_T(x, y) = sup_r { r | x ⋆_T r ≤ y }.
The last expression can be justified by the following classical set-theoretic identity [19]:

(9)  Ā ∪ B = (A \ B)ᶜ = ⋃ { Z | A ∩ Z ⊆ B },

where \ denotes the set-difference operator and ᶜ the complement.
We can see that both I_{S,N} and I_T satisfy conditions I1°–I5° for any t-norm T, s-norm S and strong negation N; thus they are fuzzy implications. For the sake of completeness we mention a third type of implication, used in quantum logic: the QL-implication associated with a t-norm T, an s-norm S and a strong negation N is defined by

(10)  ∀ x, y ∈ [0, 1]:  I_{T,S,N}(x, y) = N(x) ⋆_S (x ⋆_T y).
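As an illustration (not part of the original paper), the following Python sketch evaluates definitions (7), (8) and (10) for the Łukasiewicz de Morgan triple of Table 1; all function names are ours, and the supremum in the R-implication is approximated on a grid. According to Table 2, the Łukasiewicz implication is the QL-implication for S = W′ and T = M, so all three constructions coincide with min(1 − x + y, 1).

# Illustrative sketch (ours) of the Lukasiewicz de Morgan triple and the
# implications of Eqs. (7), (8) and (10).
def t_norm_W(x, y):          # Lukasiewicz t-norm W(x, y)
    return max(x + y - 1.0, 0.0)

def s_norm_W1(x, y):         # dual s-norm W'(x, y) = min(x + y, 1)
    return min(x + y, 1.0)

def neg(x):                  # standard strong negation N(x) = 1 - x
    return 1.0 - x

def s_implication(x, y):     # Eq. (7): I_{S,N}(x, y) = S(N(x), y)
    return s_norm_W1(neg(x), y)

def r_implication(x, y, steps=1000):
    # Eq. (8): I_T(x, y) = sup{ r : T(x, r) <= y }, approximated on a grid
    return max(r / steps for r in range(steps + 1)
               if t_norm_W(x, r / steps) <= y)

def ql_implication(x, y):    # Eq. (10) with S = W' and T = M = min (see Table 2)
    return s_norm_W1(neg(x), min(x, y))

# All three give min(1 - x + y, 1) = 0.7 for x = 0.7, y = 0.4:
print(s_implication(0.7, 0.4), r_implication(0.7, 0.4), ql_implication(0.7, 0.4))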
This definition follows from the classical interpretation of the implication p ⇒ q, i.e., ¬p ∨ (p ∧ q). In [15] another interpretation of the QL-implication is given; it arises from a pair of fuzzy rules: if x ∈ A, then y ∈ B, else (if x ∉ A) y ∈ C, which is interpreted as the fuzzy relation R ≜ (A × B) ∪ (Ā × C). Thus, in terms of membership functions, we have μ_R(x, y) = [μ_A(x) ⋆_T μ_B(y)] ⋆_S {N[μ_A(x)] ⋆_T μ_C(y)}, where μ_R, μ_A, μ_B and μ_C denote the membership functions of the fuzzy sets R, A, B and C, respectively. If C means ‘unknown’, i.e., μ_C(y) ≡ 1, then the QL-implication is recovered. If C means ‘undefined’, i.e., μ_C(y) ≡ 0, then the conjunctive interpretation of the if-then rule is obtained. Generally, I_{T,S,N} violates property I1°; conditions under which I1° is satisfied by a QL-implication can be found in [17].

Assuming that N : [0, 1] → [0, 1] is a strictly decreasing continuous function satisfying

(11)  N(0) = 1, N(1) = 0, N(N(x)) = x for all x ∈ [0, 1]

(a strong negation), the N-reciprocal of I, defined by

(12)  ∀ x, y ∈ [0, 1]:  I_N(x, y) = I(N(y), N(x)),

is also considered to be a fuzzy implication. The most important fuzzy implications, representing the classes of fuzzy implications discussed above, are juxtaposed in Table 2. Regarding a fuzzy implication as a two-argument function, we can find its location within the interval [0, 1] using the above-mentioned pseudometric distance (6), in the same way as for the fuzzy operations. For example, the following distances may be determined [13]: d(I_Łuka, I_Fodor) = 1/12; d(I_Reich, I_Zadeh) = 1/8. Below, the idea of approximate reasoning by means of the generalized modus ponens using fuzzy implications is recalled.

3. Approximate reasoning using fuzzy implications and generalized modus ponens

Fuzzy implications are mostly used as a way of interpreting if-then rules with a fuzzy antecedent and/or a fuzzy consequent. Such rules constitute a convenient form of expressing pieces of knowledge, and a set of
Table 2. Selected fuzzy implications.

Name           Form                                     Properties                        Type
Łukasiewicz    min(1 − x + y, 1)                        I1°–I13°                          R for T = W; S for S = W′; QL for S = W′, T = M
Fodor          1 if x ≤ y; max(1 − x, y) if x > y       I1°–I12°                          R for T = min0; S for S = max1; QL for S = max1, T = M
Kleene-Dienes  max(1 − x, y)                            I1°–I7°, I9°, I10°, I12°, I13°    S for S = M′; QL for S = W′, T = W
Zadeh          max{1 − x, min(x, y)}                    I2°, I3°, I5°, I6°, I9°, I13°     QL for S = M′, T = M
Reichenbach    1 − x + xy                               I1°–I7°, I9°, I10°, I12°, I13°    S for S = Π′
Goguen         min(y/x, 1)                              I1°–I8°, I10°, I11°               R for T = Π
Gödel          1 if x ≤ y; y if x > y                   I1°–I8°, I10°, I11°               R for T = M
Rescher        1 if x ≤ y; 0 if x > y                   I1°–I5°, I8°, I11°, I12°          ——
Figure 1. Graphical illustration of the Łukasiewicz fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
Figure 2. Graphical illustration of the Fodor fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
if-then rules forms a fuzzy rule base. Let us consider the canonical form of the fuzzy if-then rule R^{(i)}, which includes other types of fuzzy rules and fuzzy propositions as special cases, in the multi-input single-output (MISO) form

(13)  R^{(i)}: IF X₁ IS A₁^{(i)} AND · · · AND X_t IS A_t^{(i)}, THEN Y IS B^{(i)},
Figure 3. Graphical illustration of the Reichenbach fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
Figure 4. Graphical illustration of the Kleene-Dienes fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).

where X_k and Y stand for the linguistic variables of the antecedent and consequent, and A_k^{(i)}, B^{(i)} are fuzzy sets in the universes of discourse X_k ⊆ R and Y ⊆ R, respectively. Fuzzy if-then rules may be interpreted in two ways: as a conjunction of the antecedent and the consequent (Mamdani combination) or as a fuzzy implication [9], [15], [16], [59], [64]. In this paper both interpretations will
Figure 5. Graphical illustration of the Zadeh fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
Figure 6. Graphical illustration of the Goguen fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
be used. The linguistic form of the fuzzy if-then rule (13) can be expressed as the fuzzy relation

(14)  R^{(i)} = (A₁^{(i)} × · · · × A_t^{(i)}) ⟹ B^{(i)} = A^{(i)} ⟹ B^{(i)},
Figure 7. Graphical illustration of the Gödel fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
Figure 8. Graphical illustration of the Rescher fuzzy implication (surface plot of I(x, y) for x, y ∈ [0, 1]).
for the logical interpretation, and as

(15)  R^{(i)} = (A₁^{(i)} × · · · × A_t^{(i)}) ∩_{T_c} B^{(i)} = A^{(i)} ∩_{T_c} B^{(i)},
for the conjunctive interpretation, where A^{(i)} = A₁^{(i)} × · · · × A_t^{(i)} is a fuzzy relation in X = X₁ × · · · × X_t defined by

(16)  μ_{A^{(i)}}(x) = μ_{A₁^{(i)}}(x₁) ⋆_T · · · ⋆_T μ_{A_t^{(i)}}(x_t),
∩_{T_c} stands for the intersection of fuzzy sets using the t-norm T_c, and ⋆_T denotes the respective t-norm T. In terms of membership functions, Eqs. (14) and (15) may be written as

(17)  μ_{R^{(i)}}(x, y) = I(μ_{A^{(i)}}(x), μ_{B^{(i)}}(y)),

(18)  μ_{R^{(i)}}(x, y) = μ_{A^{(i)}}(x) ⋆_{T_c} μ_{B^{(i)}}(y),
for the logical and conjunctive interpretation, respectively.

Approximate reasoning is usually executed in a fuzzy inference system, which performs a mapping from an input fuzzy set A′ in X to a fuzzy set B′ in Y via a fuzzy rule base. The most commonly used inference rule in classical logic is modus ponens: if propositions A and A ⟹ B are true, then proposition B is also true, that is, (A ∧ (A ⟹ B)) ⟹ B, or

Premise I — fact:       A
Premise II — rule:      IF A, then B
Conclusion:             B

In the case of similarity (rather than equality) between the proposition in the fact and the antecedent of the rule, the inference procedure is called generalized modus ponens. It may be written as

Premise I — fact:       A′
Premise II — rule:      IF A, then B
Conclusion:             B′

where A′ is close to A and B′ is close to B. The method of determining the conclusion was introduced by Zadeh [67] as the compositional rule of inference, or sup-star composition:

(19)  B′ = A′ ∘ R = A′ ∘ (A ⟹ B),

or, equivalently,

(20)  μ_{B′}(y) = sup_{x∈X} [μ_{A′}(x) ⋆_{T′} μ_R(x, y)],

where ∘ stands for the sup-t-norm composition operation defined in (20). The above composition can easily be extended to the multidimensional case, that is, B′ = A′ ∘ R = A′ ∘ (A ⟹ B) with A, A′ fuzzy sets in X.

Approximate reasoning is usually executed in a fuzzy inference system which consists of I fuzzy if-then rules. In this case, two methods of approximate reasoning can be used: composition based inference (first aggregate,
then inference — FATI) and individual rule based inference (first inference, then aggregate — FITA). In composition based inference, a finite number of rules R^{(i)}, i = 1, ..., I, is aggregated via intersection or average operations, i.e.,

(21)  R = ⨁_{i=1}^{I} R^{(i)},

where ⨁ denotes the aggregation operation, using a t-norm T_A, an s-norm S_A or an average ⋆_A (for example, the normalized arithmetic sum) for the aggregation of the respective membership functions:

(22)  μ_R(x, y) = μ_{R^{(1)}}(x, y) ⊛ μ_{R^{(2)}}(x, y) ⊛ · · · ⊛ μ_{R^{(I)}}(x, y),

where ⊛ ∈ {⋆_{T_A}, ⋆_{S_A}, ⋆_A} stands for the alternatives of the aggregation operation.
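The following Python sketch (an added illustration; the membership functions and the grid are arbitrary choices of ours) implements the logical interpretation (17) with the Łukasiewicz implication and the sup-min composition (20) on a discretized universe.

import numpy as np

# Minimal sketch of the compositional rule of inference (19)-(20) on a grid;
# the membership functions below are arbitrary illustrations, not from the paper.
xs = np.linspace(0.0, 10.0, 101)                 # discretized universe X
ys = np.linspace(0.0, 10.0, 101)                 # discretized universe Y

mu_A  = np.exp(-0.5 * ((xs - 4.0) / 1.0) ** 2)   # antecedent fuzzy set A
mu_B  = np.exp(-0.5 * ((ys - 6.0) / 1.5) ** 2)   # consequent fuzzy set B
mu_A1 = np.exp(-0.5 * ((xs - 4.5) / 1.0) ** 2)   # observed fact A' (close to A)

# Logical interpretation (17) with the Lukasiewicz implication:
mu_R = np.minimum(1.0, 1.0 - mu_A[:, None] + mu_B[None, :])

# Sup-min composition (20): mu_B'(y) = sup_x min(mu_A'(x), mu_R(x, y))
mu_B1 = np.max(np.minimum(mu_A1[:, None], mu_R), axis=0)
print(mu_B1.max(), mu_B1.min())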
Taking into account an arbitrary input fuzzy set A′ in X and using the generalized modus ponens, we obtain the output of fuzzy inference (FATI)

(23)  B′_FATI = A′ ∘ R = A′ ∘ [⨁_{i=1}^{I} R^{(i)}] = A′ ∘ [⨁_{i=1}^{I} (A^{(i)} ⟹ B^{(i)})],
or, in terms of the membership functions,

(24)  μ_{B′_FATI}(y) = sup_{x∈X} [μ_{A′}(x) ⋆_{T′} μ_R(x, y)] = sup_{x∈X} { μ_{A′}(x) ⋆_{T′} [⨁_{i=1}^{I} μ_{R^{(i)}}(x, y)] },
where ⋆_{T′} stands for the t-norm T′ of the sup-t-norm composition operation. In individual rule based inference (FITA), each rule in the fuzzy rule base determines an output fuzzy set, and afterwards an aggregation via intersection or average operations is performed. Thus, the output fuzzy set is expressed by means of the formula

(25)  B′_FITA = ⨁_{i=1}^{I} { A′ ∘ (A^{(i)} ⟹ B^{(i)}) },
or, in terms of the membership functions,

(26)  μ_{B′_FITA}(y) = ⨁_{i=1}^{I} sup_{x∈X} [μ_{A′}(x) ⋆_{T′} μ_{R^{(i)}}(x, y)].
It can be proved [15] that B′_FATI is more specific than B′_FITA, i.e.,

(27)  B′_FATI ⊆ B′_FITA   or   ∀ y ∈ Y:  μ_{B′_FATI}(y) ≤ μ_{B′_FITA}(y).
This means that the consequent B′_FATI is equal to or contained in the intersection of the individual fuzzy inference results, B′_FITA. For simplicity of calculation, in practice the consequent B′_FATI is replaced by B′_FITA, under the assumption that the differences are not too large. If the input fuzzy sets A′₁, ..., A′_t (or A′) are singletons at x₀₁, ..., x₀_t (or x₀), then the conclusion B′_FATI is equal to B′_FITA, i.e., μ_{B′_FATI}(y) = μ_{B′_FITA}(y) for y ∈ Y. In this case we obtain

(28)  μ_{B′_FATI}(y) = [μ_{A′}(x₀) ⋆_{T′} μ_R(x₀, y)] ∨ sup_{x∈X, x≠x₀} [μ_{A′}(x) ⋆_{T′} μ_R(x, y)]
                     = μ_R(x₀, y) = ⨁_{i=1}^{I} μ_{R^{(i)}}(x₀, y),

since μ_{A′}(x₀) = 1 and μ_{A′}(x) = 0 for x ≠ x₀,
and

(29)  μ_{B′_FITA}(y) = ⨁_{i=1}^{I} { [μ_{A′}(x₀) ⋆_{T′} μ_{R^{(i)}}(x₀, y)] ∨ sup_{x∈X, x≠x₀} [μ_{A′}(x) ⋆_{T′} μ_{R^{(i)}}(x, y)] }
                     = ⨁_{i=1}^{I} μ_{R^{(i)}}(x₀, y) = μ_R(x₀, y),

where again the first term uses μ_{A′}(x₀) = 1 and the supremum over x ≠ x₀ vanishes.
Taking into account (28) and (29) we see that for singletons, the FATI and FITA inference results are the same. Finally, if we use interpretations
of fuzzy if-then rules discussed above, then we get

(30)  μ_{B′_FATI}(y) = μ_{B′_FITA}(y) = ⨁_{i=1}^{I} I[μ_{A^{(i)}}(x₀), μ_{B^{(i)}}(y)]
for the logical interpretation, and

(31)  μ_{B′_FATI}(y) = μ_{B′_FITA}(y) = ⨁_{i=1}^{I} [μ_{A^{(i)}}(x₀) ⋆_{T_c} μ_{B^{(i)}}(y)]
for the conjunctive interpretation of if-then rules.

4. Fundamentals of fuzzy systems

In approximate reasoning realized in fuzzy systems, fuzzy if-then rules play an essential role. They are often also used to capture the human ability to make decisions or to control in an uncertain and imprecise environment. In this section we use such fuzzy rules to recall the important fuzzy systems which are basic to our further considerations. Assume that I fuzzy if-then rules with t inputs and one output (MISO) are given. The i-th rule, in which the consequent is represented by a linguistic variable Y, may be written in the form

(32)  R^{(i)}: IF X₁ IS A₁^{(i)} AND · · · AND X_t IS A_t^{(i)}, THEN Y IS B^{(i)},
or in a pseudo-vector notation (33)
R(i) : IF X IS A(i) , THEN Y IS B (i) ,
where (34)
X = [X1 , X2 , ..., Xt ]
and X₁, X₂, ..., X_t and Y are linguistic variables which may be interpreted as the inputs of the fuzzy system and the output of that system. A₁^{(i)}, ..., A_t^{(i)} are linguistic values of the linguistic variables X₁, X₂, ..., X_t, and B^{(i)} is a linguistic value of the linguistic variable Y. A collection of the above rules for i = 1, 2, ..., I creates a rule base which may be activated (or fired) by the singleton inputs

(35)
X1 IS x01 AND · · · AND Xt IS x0t
or (36)
X IS x0 .
It can easily be concluded from (17) and (18) that, for this type of reasoning, the inferred value of the i-th rule output for singleton inputs may be written, for the logical interpretation of the if-then rules, in the form

(37)
µB 0(i) (y) = I(Fi (x0 ), µB (i) (y))
and for conjunctive interpretation (38)
µB 0(i) (y) = Tc (Fi (x0 ), µB (i) (y)) ,
where I(·, ·) stands for a fuzzy implication, ⋆_{T_c} for the t-norm T_c, and

(39)  F_i(x₀) = μ_{A₁^{(i)}}(x₀₁) ⋆_T · · · ⋆_T μ_{A_t^{(i)}}(x₀_t) = μ_{A^{(i)}}(x₀)
denotes the degree of activation (or the firing strength) of the i-th rule. The last equation represents the explicit connective (AND) of the predicates X_k IS A_k^{(i)}, k = 1, 2, ..., t, in the antecedent of the i-th fuzzy if-then rule. A crisp value of the output can be obtained from the Modified Indexed Center of Gravity (MICOG) defuzzification [10], [11], [14]:

(40)  MICOG[μ_B(y)] = ∫ y [μ_B(y) − α] dy / ∫ [μ_B(y) − α] dy,

where α ∈ [0, 1], α ≥ min_{y∈Y} μ_B(y), is a constant. The subtraction of the value α eliminates the non-informative part of the membership function μ_B(y). For α = 0, we get the well-known COG defuzzification. The final crisp value of the system output, for the normalized sum as aggregation and MICOG defuzzification, can be evaluated from the formula

(41)  y₀ = Σ_{i=1}^{I} ∫ y [μ_{B′^{(i)}}(y) − α_i] dy / Σ_{k=1}^{I} ∫ [μ_{B′^{(k)}}(y) − α_k] dy
        = Σ_{i=1}^{I} ∫ y {Ψ(F_i(x₀), μ_{B^{(i)}}(y)) − α_i} dy / Σ_{k=1}^{I} ∫ {Ψ(F_k(x₀), μ_{B^{(k)}}(y)) − α_k} dy,

where Ψ stands for the fuzzy implication I or the t-norm T_c for the logical or conjunctive interpretation of the if-then rules, respectively. A method of determining the values α_i will be described later. Let us introduce the following notation: μ_{B*^{(i)}}(y) ≜ μ_{B′^{(i)}}(y) − α_i, and y^{(i)} is the
center of gravity (COG) location of the fuzzy set B*^{(i)}, i.e.,

(42)  y^{(i)} = COG(μ_{B*^{(i)}}(y)) = ∫ y μ_{B*^{(i)}}(y) dy / ∫ μ_{B*^{(i)}}(y) dy.

Using (41) and (42), it is easy to obtain the general form of the final output value

(43)  y₀ = Σ_{i=1}^{I} y^{(i)} Area(μ_{B*^{(i)}}(y)) / Σ_{k=1}^{I} Area(μ_{B*^{(k)}}(y)),
where B*^{(i)} is the resulting conclusion of the i-th rule after removing its non-informative part, but before aggregation. Now we note that a fuzzy system with Larsen’s product operation as the conjunctive interpretation of the if-then rules and symmetric (isosceles) triangular membership functions of the consequents B^{(i)} can be written in the form well known from the literature [28]:

(44)  y₀ = Σ_{i=1}^{I} (w^{(i)}/2) F_i(x₀) y^{(i)} / Σ_{k=1}^{I} (w^{(k)}/2) F_k(x₀),
where w^{(i)} is the width of the triangle base for the i-th rule. It should be noted that the factor w^{(i)}/2 may be interpreted as a respective weight of the i-th rule, or its certainty factor. Another very important fuzzy system is the Takagi-Sugeno-Kang system. Assume that I fuzzy if-then rules (fuzzy conditional statements) with t inputs and one output (MISO) are given. The i-th rule may be written in the form

(45)  R^{(i)}: IF X₁ IS A₁^{(i)} AND · · · AND X_t IS A_t^{(i)}, THEN Y = f^{(i)}(X₁, ..., X_t),

or in pseudo-vector notation

(46)
R(i) : IF X IS A(i) , THEN Y = f (i) (X) .
A crisp value of the output for Larsen’s product operation for conjunctive interpretation of if-then rules and normalized sum as aggregation, can be
evaluated from [7]

(47)  y₀ = Σ_{i=1}^{I} F_i(x₀) f^{(i)}(x₀) / Σ_{k=1}^{I} F_k(x₀).
If the function f^{(i)} is of the form

(48)  f^{(i)}(x₀) = p₀^{(i)},

where p₀^{(i)} is a crisply defined constant in the consequent of the i-th rule, then such a model is called a zero-order Sugeno fuzzy model. The more general first-order Sugeno fuzzy model has the form

(49)  f^{(i)}(x₀) = p₀^{(i)} + p₁^{(i)} x₀₁ + · · · + p_t^{(i)} x₀_t,

where p₀^{(i)}, p₁^{(i)}, ..., p_t^{(i)} are all constants. In vector notation it takes the form

(50)  f^{(i)}(x₀) = p₀^{(i)} + p̃^{(i)⊤} x₀ = p^{(i)⊤} x₀′,

where p̃ denotes the parameter vector with the bias element excluded, the superscript ⊤ stands for transposition, and x₀′ denotes the extended input vector

(51)  x₀′ = [1, x₀⊤]⊤.

Notice that in both systems described above the consequent is crisp. In equation (44) the value describing the location of the COG of the consequent fuzzy set of an if-then rule is constant and equals y^{(i)} for the i-th rule. A natural extension of the above-described situation is the assumption that the location of the consequent fuzzy set is a linear combination of all inputs [33], [34], i.e.,

(52)  y^{(i)}(x₀) = p^{(i)⊤} x₀′.
Hence, using (43) and (52) yields

(53)  y₀ = Σ_{i=1}^{I} Area(μ_{B*^{(i)}}(y)) p^{(i)⊤} x₀′ / Σ_{k=1}^{I} Area(μ_{B*^{(k)}}(y)),
where B*^{(i)} is the conclusion of the i-th rule after removing its non-informative part, but before aggregation.
5. Neuro-fuzzy system with logical interpretation of if-then rules

Let us assume that the premise fuzzy sets A₁^{(i)}, ..., A_t^{(i)} of the if-then rules have Gaussian membership functions

(54)  A_j^{(i)}(x₀ⱼ) = exp[ −(x₀ⱼ − c_j^{(i)})² / (2 (s_j^{(i)})²) ],

where c_j^{(i)}, s_j^{(i)} stand for the center and dispersion of the membership function for the i-th rule and the j-th input variable, respectively. On the basis of (54), and with the explicit connective AND taken as the algebraic product, we get

(55)  A^{(i)}(x₀) = ∏_{j=1}^{t} A_j^{(i)}(x₀ⱼ).
Using (55), the degree of activation (firing strength) of the i-th rule can be written in the form

(56)  F_i(x₀) = exp[ −Σ_{j=1}^{t} (x₀ⱼ − c_j^{(i)})² / (2 (s_j^{(i)})²) ].

In addition, let us assume that the consequents B^{(i)} of the if-then rules have symmetric (isosceles) triangular membership functions with the width of the triangle base equal to w^{(i)}. For determining the fuzzy system output on the basis of (53), we must calculate Area(μ_{B*^{(i)}}(y)). From the definition of B*^{(i)} and (37), for the logical interpretation of if-then rules, we have

(57)
µB ∗(i) (y) = I[Fi (x0 ), µB (i) (y)] − αi .
For an implication satisfying condition I9◦ we assume (58)
αi = 1 − Fi (x0 ).
Now, let us determine Area(μ_{B*^{(i)}}(y)) for the Łukasiewicz and Reichenbach fuzzy implications. Taking into account that B^{(i)} is a symmetric fuzzy set and that the Łukasiewicz fuzzy implication can be presented as an R-implication in the
form I_Ł(x, y) = 1 − x + min(x, y), we get

(59)  Area(μ_{B*^{(i)}}(y)) = 2 ∫_{y^{(i)} − w^{(i)}/2}^{y^{(i)}} {I[F_i(x₀), μ_{B^{(i)}}(y)] − α_i} dy
                            = 2 ∫_{y^{(i)} − w^{(i)}/2}^{y^{(i)}} [1 − F_i(x₀) + F_i(x₀) ∧ μ_{B^{(i)}}(y) − 1 + F_i(x₀)] dy
                            = 2 ∫_{y^{(i)} − w^{(i)}/2}^{y^{(i)}} [F_i(x₀) ∧ (2(y − y^{(i)})/w^{(i)} + 1)] dy
                            = (w^{(i)}/2) F_i(x₀) (2 − F_i(x₀)).
For the Reichenbach fuzzy implication we obtain, in a similar way,

(60)  Area(μ_{B*^{(i)}}(y)) = 2 ∫_{y^{(i)} − w^{(i)}/2}^{y^{(i)}} {I[F_i(x₀), μ_{B^{(i)}}(y)] − α_i} dy
                            = 2 ∫_{y^{(i)} − w^{(i)}/2}^{y^{(i)}} [1 − F_i(x₀) + F_i(x₀) μ_{B^{(i)}}(y) − 1 + F_i(x₀)] dy
                            = 2 ∫_{y^{(i)} − w^{(i)}/2}^{y^{(i)}} F_i(x₀) (2(y − y^{(i)})/w^{(i)} + 1) dy
                            = (w^{(i)}/2) F_i(x₀).
Fuzzy sets in the conclusion of an if-then rule, before as well as after inference, are graphically illustrated in Figs. 9 and 10 for the Łukasiewicz and Reichenbach fuzzy implications, respectively. The respective formulas for Area(μ_{B*^{(i)}}(y)), which are functions of the degree of activation F_i(x₀) and the width of the triangle base w^{(i)}, can be denoted by G[F_i(x₀), w^{(i)}]. For other fuzzy implications, the function G[F_i(x₀), w^{(i)}] is presented in Table 3. Now, the crisp value of the output of the fuzzy system can be evaluated from the following formula:

(61)  y₀ = Σ_{i=1}^{I} G[F_i(x₀), w^{(i)}] p^{(i)⊤} x₀′ / Σ_{k=1}^{I} G[F_k(x₀), w^{(k)}].
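As an added illustration of (54), (56), (59) and (61), the Python sketch below computes the crisp output of the system for the Łukasiewicz implication, G[F, w] = w(F − F²/2); the parameter values are arbitrary and the function names are ours.

import numpy as np

# Minimal sketch (ours) of the crisp output (61) of the neuro-fuzzy system for
# the Lukasiewicz implication; parameter values are arbitrary illustrations.
def firing_strengths(x0, c, s):
    """F_i(x0) of Eq. (56); c and s have shape (I, t)."""
    return np.exp(-np.sum((x0 - c) ** 2 / (2.0 * s ** 2), axis=1))

def anblir_output(x0, c, s, w, p):
    """Eq. (61): G-weighted combination of the local linear models p_i^T [1; x0]."""
    F = firing_strengths(x0, c, s)           # (I,)
    G = w * (F - 0.5 * F ** 2)               # area of informative part, Eq. (59)
    x0e = np.concatenate(([1.0], x0))        # extended input [1, x0], Eq. (51)
    local = p @ x0e                          # (I,) local model outputs, Eq. (52)
    return np.sum(G * local) / np.sum(G)

I_rules, t_in = 3, 2
rng = np.random.default_rng(0)
c = rng.uniform(-1, 1, (I_rules, t_in))      # centers c_j^(i)
s = np.full((I_rules, t_in), 0.7)            # dispersions s_j^(i)
w = np.ones(I_rules)                         # triangle base widths w^(i)
p = rng.normal(size=(I_rules, t_in + 1))     # consequent parameters p^(i)
print(anblir_output(np.array([0.2, -0.4]), c, s, w, p))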
Let us observe that equations (56) and (61) describe a radial basis function neural network. For t inputs and I if-then rules, the following unknown parameters should be determined:

• c_j^{(i)}, s_j^{(i)} for j = 1, 2, ..., t and i = 1, 2, ..., I — the parameters of the membership functions of the fuzzy sets in the antecedents of the if-then rules,
• p_j^{(i)} for j = 0, 1, ..., t and i = 1, 2, ..., I — the parameters determining the location of the moving fuzzy sets in the consequents of the if-then rules,
Figure 9. Triangular fuzzy set B^{(i)} in the conclusion of an if-then rule (solid line), the result of inference B′^{(i)} for the Łukasiewicz fuzzy implication (non-informative part in blue, informative part in red), and the result B*^{(i)} after removal of the non-informative part (green); the triangle is centered at y^{(i)} with base width w^{(i)}, and the non-informative level depends on the rule’s firing strength.

• w^{(i)} for i = 1, 2, ..., I — the widths of the triangle bases of the moving fuzzy sets in the consequents of the if-then rules.

Obviously, the number of if-then rules is pre-set, or it may be determined by a method using a validity index in fuzzy clustering or by maximization of the generalization ability of the system. Usually, the above-mentioned unknown parameters are estimated by means of a gradient descent method. Therefore the so-called training set is necessary, i.e., a set of inputs for which the output values are known: {x₀(n), t₀(n)}, n = 1, 2, ..., N. The measure of the error of the output value may be defined for a single pair from the training set as

(62)
En = L (t0 (n) − y0 (n)) ,
where t₀(n), y₀(n) denote the desired (target) and actual values of the system output for x₀(n), respectively, and L(·) stands for a loss function. Most frequently a quadratic loss function is used, that is, L(·) = ½(·)². In the next section another loss function will be used. For the entire training set, we define the error function as the average of Eₙ:

(63)
E = (1/N) Σ_{n=1}^{N} Eₙ.
Figure 10. Triangular fuzzy set B^{(i)} in the conclusion of an if-then rule (solid line), the result of inference B′^{(i)} for the Reichenbach fuzzy implication (non-informative part in blue, informative part in red), and the result B*^{(i)} after removal of the non-informative part (green); the triangle is centered at y^{(i)} with base width w^{(i)}, and the non-informative level depends on the rule’s firing strength.
In the so-called batch mode of learning, parameter updating is performed after the presentation of all examples from the training set, which is called an epoch. Thus, the minimization of the error E is made iteratively, for each parameter α ∈ {c_j^{(i)}, s_j^{(i)}, p₀^{(i)}, p_j^{(i)}, w^{(i)}}, j = 1, ..., t, i = 1, ..., I:

(64)  α_new = α_old − η (∂E/∂α)|_{α = α_old},
46
JACEK M. L Ã ESKI
We may express the partial derivatives of error En with respect to the unknown parameters as: £ ¤ ∂En [y (i) (x0 (n)) − y0 (n)]Fi (x0 (n)) ∂G Fi (x0 (n)), w(i) = An (i) I h i ∂Fi (x0 (n)) X ∂cj G Fk (x0 (n)), w(k) k=1 (i)
(65)
×
∂En (i)
∂sj
=
x0j (n) − cj ³ ´2 , (i) sj
£ ¤ [y (i) (x0 (n)) − y0 (n)]Fi (x0 (n)) ∂G Fi (x0 (n)), w(i) An I h i ∂Fi (x0 (n)) X G Fk (x0 (n)), w(k) k=1 (i)
(66)
(67)
×
∂En (i)
∂pj
(x0j (n) − cj )2 , ³ ´3 (i) sj
h i G Fi (x0 (n)), w(i) x (n) , j 6= 0, An I i 0j X h (k) G Fk (x0 (n)), w k=1 h i = (i) G F (x (n)), w i 0 An I , j = 0. i X h (k) G Fk (x0 (n)), w k=1
(68)
£ ¤ ∂En y (i) (x0 (n)) − y0 (n) ∂G Fi (x0 (n)), w(i) = An I , i ∂w(i) ∂w(i) X h (k) G Fk (x0 (n)), w k=1
where (69)
An =
∂En . ∂y0 (n)
Indeed, for the quadratic loss function, we obtain An = − (t0 (n) − y0 (n)). The respective derivatives for various fuzzy implications are shown in Table 3. A very simple operation speeding up convergence is proposed by Jang et al. [28]. If in four successive steps of gradient descent learning the error E increases and decreases commutatively, then the learning rate parameter is
Implication L Ã ukasiewicz ¡ ¢ Fodor ¡ Fi (x0 ) > 12 ¢ Fodor Fi (x0 ) ≤ 12 Reichenbach Kleene-Dienes ¡ ¢ Zadeh ¡Fi (x0 ) > 21 ¢ Zadeh Fi (x0 ) ≤ 12 Goguen G¨odel Rescher
¡ ¢ ∂G (Fi (x0 ),w(i) ) G Fi (x0 ), w(i) ∂Fi (x0 ) ¢ ¡ w(i)¡ Fi (x0 ) − 12 Fi (x0 )2 ¢ w(i) (1 − Fi (x0 )) w(i) 12¡− Fi (x0 ) + Fi (x0¢)2 w(i) (2Fi (x0 ) − 1) (i) 2 w Fi (x0 ) − Fi (x0 ) w(i) (1 − 2Fi (x0 )) (i) w w(i) 2 Fi (x0 ) 2 w(i) 2 (i) F (x ) w F i 0 i (x0 ) 2 ¡ ¢ 1 (i) (i) Fi (x0 ) − 2 w w 0 0 −w(i) 0 w(i) (Fi (x0 ) − 1) w(i) −w(i) 0 1 2
− Fi (x0 ) + Fi (x0 )2 Fi (x0 ) − Fi (x0 )2 1 2 Fi (x0 ) 1 2 2 Fi (x0 ) Fi (x0 ) − 12 0 −1 Fi (x0 ) − 1 −1
∂G (Fi (x0 ),w(i) ) ∂w(i) Fi (x0 ) − 12 Fi (x0 )2
£ ¤ Table 3. Function G Fi (x0 ), w(i) and its derivatives for selected fuzzy implications.
ε-INSENSITIVE LEARNING TECHNIQUES 47
48
JACEK M. L Ã ESKI
decreased, i.e., multiplied by nD < 1. However, if in four successive steps of gradient descent learning the error E decreases, then the learning rate parameter is increased, i.e., multiplied by nI > 1. Another problem is the estimation of the number I of if-then rules and initial values of membership functions for antecedents of if-then rules. Typ(i) ically, the centers cj are sub-sampled from the set of examples from the training set or all examples are used as the centers. Usually, all dispersions parameters of Gaussian membership functions are set equal to a prespecified positive value. The problem of estimation initial values of membership functions for antecedent of if-then rules may also be solved by means of preliminary clustering the input part of the training set using the fuzzy c-means method [2], [6], [50], [51]. Indeed, in our case we have I clusters. So, the name fuzzy I-means method will be a better. In this method each input vector x0 (n); n = 1, 2, ..., N is assigned to clusters represented by prototypes vi ; i = 1, ..., I measured by grade of membership uin ∈ [0, 1]. The (I × N )-dimensional partition matrix U becomes from the set of all possible fuzzy partitions into I clusters is defined by ¯ ( I ¯ X I×N ¯ =f I = U∈R ∀ uin ∈ [0, 1], uin = 1, ¯ ∀ ¯1≤i≤I 1≤n≤N i=1 ) N X (70) 0< uin < N . n=1
The fuzzy I-means criterion function has the form (71)
Jm (U, V) =
I X N X
m
(uin ) d2in ,
i=1 n=1
where U ∈ =f I , V = [v1 , v2 , ..., vI ] ∈ Rt×I and m is a weighting exponent in (1, ∞). The quantity din is the inner product induced norm (72)
2
>
d2in = kx0 (n) − vi k = (x0 (n) − vi ) (x0 (n) − vi ) .
It can be proved that a local minimum of criterion (71) may be obtained by an iterative method of commutative modification of partition matrix and prototypes [1], [2]:
(73)
∀
∀
1≤i≤I 1≤n≤N
uin
−1 2 ¶ m−1 I µ X d in , = d jn j=1
ε-INSENSITIVE LEARNING TECHNIQUES N P
(74)
∀
1≤i≤I
vi =
n=1
m
(uin ) N P n=1
49
x0 (n) . m
(uin )
The optimal partition is a fixed point of (73) and (74), and the solution is obtained from the Picard iteration. This algorithm is called fuzzy ISODATA or Fuzzy I-Means and can be described in the following steps: 1◦ Fix I (1 < I < N ), m ∈ (1, ∞). Initialize U(0) ∈ =f I . Set the iteration index, j = 0. (j) (j) (j) 2◦ Calculate fuzzy centers for the j-th iteration V(j) = [v1 , v2 , · · · , vI ] (j) using (74) and U . 3◦ Update fuzzy partition matrix U(j+1) for the (j + 1)-th iteration using (73). ° ° 4◦ If °U(j+1) − U(j) °F > ξU , then j ← j + 1 and go to 2◦ else stop. ³ ´ P P 2 k·kF denotes the Frobenius norm (kUkF = T r UU> = i n u2in ) and ξU is a pre-set parameter. The iterations are stopped as soon as the Frobenius norm in a successive pair of U matrices is less than the pre-set value ξ. In this algorithm, the parameter m influences a fuzziness of the clusters; a larger m results in fuzzier clusters. For m −→ 1+ , the fuzzy I-means solution becomes a hard one, and for m −→ ∞ the solution is as fuzzy as possible: uin = 1c , for all i, n. There is no theoretical basis for the optimal selection of m, and usually m = 2 is chosen. According to the above written algorithm the calculations are initialized using a random partition matrix U which fulfils conditions from (70). Such a method leads to the local minimum of criterion (71). Therefore the most frequently used solution is multiple repeated calculations in accordance with the above algorithm for various random realizations of initial partition matrix. Usually, validity indices which measure the cluster quality are used. One of the most popular validity indices is the extended Xie-Beni index [63] N X I X
(75)
VXB =
m
(uin ) d2in
n=1 i=1
N min kvk − vi k
.
k6=i
Indeed, we search a c-partition for which index VXB is minimal, that is, minimizing clusters’ compactness whilst maximizing their separation. As a result of preliminary clustering of the training set the following assumption (i) for initialization of the system can be made: cj = vij for j = 1, 2, ..., t;
50
JACEK M. L Ã ESKI
i = 1, 2, ..., I and N X
³ (76)
(i)
sj
´2
=
³ ´2 (i) uin x0j (n) − cj
n=1 N X
, uin
n=1
Another solution accelerating the convergence of the method is the estimation of parameters p(i) ; i = 1, ..., I by means of least squares method. The output value y0 of the system in equation (61) may be considered to be a linear combination of unknown parameters p(i) . Let us introduce the following notation: ¤ £ G Fi (x0 (n)), w(i) (i) (77) S (x0 (n)) = I , i X h (k) G Fk (x0 (n)), w k=1
· d (x0 (n)) =
.. (2) .. .. S (1) (x0 (n)) x0> (x0 (n)) x0> 0 (n) .S 0 (n) . · · · .
(78)
i> , S (I) (x0 (n)) x0> 0 (n)
(79)
· ¸> . . . P = p(1)> ..p(2)> .. · · · ..p(I)>
Now, for the n-th example (61) may be written in the form >
(80)
y0 (n) = d (x0 (n)) P.
Thus, for the consequent parameters p(i) , the Least Squares (LS) estimation is applied [54], [55], [56]. There are two approaches: (1) to solve one global LS problem, (2) to solve I independent weighted LS problems, one for each if-then rule. The first approach leads to better global performance, while the second leads to more reliable local performance. Combining both above approaches is suggested in [65]. In this work both approaches, that is, global learning and local learning, will be used to introduce an idea of learning tolerant to imprecision. For fixed antecedents obtained via clustering of the input space, the global LS solution to the consequent parameters estimation, minimizing the following criterion function [65] (81)
Ig (P) =
N X n=1
2
>
[t0 (n) − y0 (n)] = (t − Xd P) (t − Xd P) ,
ε-INSENSITIVE LEARNING TECHNIQUES
51
can be written in the matrix form, as ¡ ¢−1 > (82) P = X> Xd t, d Xd where >
Xd , [d (x0 (1)) , d (x0 (2)) , · · · , d (x0 (N ))] ∈ RN ×I(t+1) , >
t = [t0 (1) , t0 (2) , · · · , t0 (N )] . The local LS solution to the consequent parameters estimation, minimizing the following criterion functions [65] (i) Il
³
(i)
p
´ =
N X
h i2 S (i) (x0 (n)) t0 (n) − y (i) (x0 (n))
n=1
³ (83)
=
t − Xp(i)
´>
³ ´ D(i) t − Xp(i) ,
can be written in the matrix form, as ³ ´−1 (84) p(i) = X> D(i) X X> D(i) t,
i = 1, 2, ..., I,
where >
X , [x01 , x02 , · · · , x0N ] ∈ RN ×(t+1) , ³ ´ D(i) = diag S(i) ; h iN . S(i) = S (i) (x0 (n)) n=1
For the global learning in the case of sequential mode the solution (82) may be easily replaced by the recurrent least squares method. For k-th example from the training set, we get [32]: (85) n o > P (k) = P (k − 1) + G(k − 1)d (x0 (k)) y0 (k) − d (x0 (k)) P (k − 1) , G (k) = (86)
G (k − 1) − G(k − 1)d (x0 (k)) n o−1 > × d (x0 (k)) G(k − 1)d (x0 (k)) + 1 d(x0 (k))> G(k − 1).
To initialize the computations we take: ½ P (0) = 0, (87) G (0) = ζI, where I is an identity matrix, ζ is a large positive constant. Both local and global learning may be executed using the following schemes: (i) the parameters form the antecedents and the consequents of if-then rules are adjusted separately. First, the antecedents parameters are
52
JACEK M. L Ã ESKI
adjusted using unsupervised learning — clustering the input data using (73) and (74). Second, the consequents parameters are adjusted by means of gradient descent method (67), (68) or the least squares method: (82) for global learning and (84) for local learning; (ii) the parameters are adjusted in two-phase learning. First, as in the previous method, the antecedents parameters are adjusted using unsupervised learning. Second, all parameters (antecedent and consequent) are adjusted by means of the gradient descent method (65) — (68). (iii) First, the antecedents parameters are adjusted using unsupervised learning. Finally, in each iteration parameters p(i) are estimated on the basis of (82) or (84), whereas the other parameters by means of the gradient descent method (65), (66) and (68). 6. ε-insensitive learning of a neuro-fuzzy system The neuro-fuzzy systems presented in the previous section as well as other Neuro-fuzzy systems known from the literature have an intrinsic inconsistency. It may perform thinking tolerant to imprecision, but neural networks learning methods are zero-tolerant to imprecision [40], [42], [43], [44]. Usu2 ally, these learning methods use the quadratic loss function L (·) = 12 (·) , to match reality and a fuzzy model. In this case only perfect matching between reality and the model leads to a zero loss. Presented in this paper approach to neuro-fuzzy modeling is based on the premise that human learning as well as thinking is tolerant to imprecision. Hence, zero loss for an error less than some pre-set value, noted ε, is assumed. If the error is greater than ε, then the loss increases linearly. Learning method based on this loss function can be called ε-insensitive learning or εlearning. Problem of learning tolerant to imprecision can be presented as determining the systems parameters, where in order to fit the fuzzy model to real data from the training set an ε-insensitive loss function is used. Using the ε-insensitive loss function the measure of the error for the n-th example has the form [61] En (88)
= L (t0 (n) − y0 (n)) , et0 (n) − y0 (n)dε ½ 0, |t0 (n) − y0 (n)| ≤ ε, , |t0 (n) − y0 (n)| − ε, |t0 (n) − y0 (n)| > ε.
In this case, the quantity An from the partial derivatives of error En with respect to the unknown parameters may be expressed as ½ ∂En 0, |t0 (n) − y0 (n)| ≤ ε, = (89) An = sgn (y0 (n) − t0 (n)) , |t0 (n) − y0 (n)| > ε, ∂y0 (n) where sgn(·) denotes signum function. The ε-insensitive fuzzy clustering is introduced in [37], [40], [41]. A more complicated problem is to use the ε-insensitive loss function in estimation of
ε-INSENSITIVE LEARNING TECHNIQUES
53
the parameters of the consequents. However, in other side, solution to this problem enables to control the generalization ability of the system. According to the statistical learning theory to obtain the maximum generalization ability, we need to control both training error and model complexity. In short, we need to explain the data with a simple model. This refers to Occam’s razor principle: “Entities should not be multiplied beyond necessity” or in other words: “The simplest explanation is the best” [61], [62]. The neuro-fuzzy system described in the previous section can be viewed as a mixture of experts system [21]. Each expert is represented by a fuzzy if-then rule with a moving fuzzy set in consequent. A region of work of each expert/rule is obtained by the fuzzy clustering method. It is assumed that different experts work best in different regions of the input space. An integrating unit called gating network, described by (61), performs a function of mediator among the experts. In the further parts of this work, the problem of the estimation of consequence parameters with control the model complexity will be solved. 6.1. Global learning case. Using the ε-insensitive loss function the global learning criterion function is proposed in the following form (90)
min
P∈RI(t+1)
Ig (P) ,
N m l X τ e> e > P, t0 (n) − d (x0 (n)) P + P 2 ε n=1
e is a narrowed vector P, with excluded components corresponding to where P e = [e e (2)> , · · · , p e (I)> ]> (see Eq.(50)). The second term the biases: P p(1)> , p in (90) is related to minimization of the Vapnik-Chervonenkis dimension (complexity) of the regression model [61]. Parameter τ ≥ 0 controls the trade-off between the regression model complexity and the amount up to which errors are tolerated. > e (x0 )> P+a, e (x0 ) = e Taking into account that y0 = d (x0 ) P = d where d (I) > > (1) > (2) > [S (x0 ) x0 , S (x0 ) x0 , · · · , S (x0 ) x0 ] and (1)
(2)
(I)
a = S (1) (x0 ) p0 + S (2) (x0 ) p0 + · · · + S (I) (x0 ) p0
denotes overall bias, the criterion (90) can be written in the form (91)
N m l ³ ´ X e (x0 (n))> P e −a + τP e a , e > P. e t0 (n) − d Ig P, e It ,a∈R 2 ε P∈R n=1
min
In a general case, not for all examples (x0 (n) , t0 (n)) inequalities: t0 (n)− e (x0 (n))> P e (x0 (n))> P e − a ≤ ε and d e + a − t0 (n) ≤ ε are satisfied. If we d
54
JACEK M. L Ã ESKI
introduce slack variables ξn+ , ξn− ≥ 0, then for all examples we can write ( e (x0 (n))> P e − a ≤ ε + ξ+, t0 (n) − d n (92) > e e d (x0 (n)) P + a − t0 (n) ≤ ε + ξn− . Using (92), the criterion (91) can be written in the form N ³ ´ X ¡ + ¢ 1 > e a = 1 e P, e Ig P, ξn + ξn− + P τ n=1 2
(93)
and minimized subject to the constraints (92) and ξn+ ≥ 0, ξn− ≥ 0. The Lagrangian function of (93) with the above constraints is Gg
=
N ¢ 1 e> e 1 X ¡ + P P+ ξ + ξn− 2 τ n=1 n
−
N X
³ ´ > e + e λ+ n ε + ξn − t0 (n) + d (x0 (n)) P + a
n=1
−
N X
³ ´ − e (xn (n))> P e −a λ− ε + ξ + t (n) − d 0 n n
n=1
(94)
−
N X ¡
¢ − − + µ+ n ξn + µn ξn ,
n=1 − + − λ+ n , λn , µn , µn
where ≥ 0 are the Lagrange multipliers. The objective is to e a, ξ + , ξ − . It must be also minimize the above Lagrangian with respect to P, n n maximized with respect to the Lagrange multipliers. The following condie a, ξ + , ξ − tions for optimality we get by differentiating (94) with respect to P, n n and setting the results equal to zero: N X ¡ + ¢ ∂Gg = P e (x0 (n)) = 0, e− d λn − λ− n e ∂P n=1 N ∂Gg X ¡ + ¢ = λn − λ− n = 0, ∂a (95) n=1 1 ∂G g + − λ+ n − µn = 0, + = τ ∂ξ n 1 ∂Gg − = − λ− n − µn = 0. τ ∂ξn− − The last two (95) and the requirements µ+ n , µn ≥ 0 imply that ¤ £ conditions 1 + − λn , λn ∈ 0, τ . From the first condition (95), we obtain the so-called
ε-INSENSITIVE LEARNING TECHNIQUES
55
support vector expansion [61] e= P
(96)
N X ¡ + ¢ e λn − λ− n d (x0 (n)) , n=1
e for some training data, e can be described as a linear combination of d i.e., P which are named support vectors. Putting conditions (95) in the Lagrangian (94) we get Gg
(97)
=
−
N N ¢ ¢¡ + 1 XX¡ + >e e λ − λ− λj − λ− n j d (x0 (n)) d (x0 (j)) 2 n=1 j=1 n
−ε
N N X ¡ + ¢ X ¡ + ¢ λn + λ− λn − λ− n + n t0 (n) . n=1
n=1
− Maximization of (97) with respect to λ+ n , λn subject to constraints: N P − (λ+ n − λn ) = 0, (98) n=1 + − £ 1¤ λn , λn ∈ 0, τ
is the so called Wolfe dual formulation of (94). It is well-known from optimization theory that at the saddle point, for each Lagrange multiplier, the Karush-K¨ uhn-Tucker (KKT) conditions must be satisfied: ³ ´ > e + e λ+ n ε + ξn − t0 (n) + d (x0 (n)) P + a = 0, ³ ´ − − e (x0 (n))> P e − a = 0, λ ε + ξ + t (n) − d 0 n µn ¶ 1 (99) + − λn ξn+ = 0, τ ¶ µ 1 − λ− ξn− = 0. n τ ¡ 1¢ + From last two conditions (99) we see that λ+ n ∈ 0, τ =⇒ ξn = 0 and ¡ ¢ 1 − − λn ∈ 0, τ =⇒ ξn = 0. In this case, from the first two conditions (99) we have: ( ¡ ¢ e (x0 (n))> P e − ε, for λ+ ∈ 0, 1 , a = t0 (n) − d n τ¢ ¡ (100) e (x0 (n))> P e + ε, for λ− ∈ 0, 1 . a = t0 (n) − d n τ Thus, we may determine the parameter a from (100) by taking example ¡ ¢ x0 (n) for which we have the Lagrange multipliers in open interval 0, τ1 . From numerical point of view, it is better to take mean value of a obtained for all examples for which the conditions from (100) are satisfied. Taking
56
JACEK M. L Ã ESKI (1)
(2)
(I)
into account that a = S (1) (x0 ) p0 + S (2) (x0 ) p0 + · · · + S (I) (x0 ) p0 PI (i) (i) (x0 ) = 1, we see that p0 = a, i = 1, 2, ..., I. i=1 S
and
6.2. Local learning case. Using the ε-insensitive loss function the local learning criterion function, for the i-th model, is proposed in the form (101) N ³ ´ X m l τ (i) (i) (i) e (i) . e (i) , p0 = e (i)> p e (i)> x0 (n) + p0 Il p S (i) (x0 (n)) t0 (n) − p + p ε 2 n=1 The above equation can be called the weighted ε-insensitive estimator (or fuzzy ε-insensitive) with complexity control. The parameter τ ≥ 0 (regularization parameter) controls the trade-off between model generalization ability and model matching to the training data (a larger τ results in decreased complexity of the model). In a general case, not for all examples (x0 (n) , t0 (n)) inequalities: t0 (n)− (i) (i) (i)> e e (i)> x0 (n) + p0 − t0 (n) ≤ ε are satisfied. If we use p x0 (n) − p0 ≤ ε, p + − slack variables ξin , ξin ≥ 0, then for all examples we can write: ( (i) + e (i)> x0 (n) − p0 ≤ ε + ξin t0 (n) − p , (102) (i) − (i)> e p x0 (n) + p0 − t0 (n) ≤ ε + ξin . Using (102), the criterion (101) can be written in the form (103)
(i)
Il
³
(i)
e (i) , p0 p
´ =
N ¡ + ¢ 1 (i)> (i) 1 X (i) − e e p S (x0 (n)) ξin + ξin + p τ n=1 2
+ − and minimized subject to the constraints (102) and ξin ≥ 0, ξin ≥ 0. The Lagrangian function of (103) with the above constraints is (i)
Gl
=
N ¡ + ¢ 1 (i)> (i) 1 X (i) − e e + p p S (x0 (n)) ξin + ξin 2 τ n=1
−
N X
³ ´ (i) + (i)> e λ+ ε + ξ − t (n) + p x (n) + p 0 0 0 in in
n=1
−
N X
³ ´ (i) − (i)> e λ− ε + ξ + t (n) − p x (n) − p 0 0 0 in in
n=1
(104)
N X ¡ + + ¢ − − µin ξin + µ− in ξin , n=1 − + − λ+ in , λin , µin , µin
where ≥ 0 are the Lagrange multipliers. The objective is (i) + − e (i) , p0 , ξin to minimize this Lagrangian with respect to p , ξin . It must be
ε-INSENSITIVE LEARNING TECHNIQUES
57
also maximized with respect to the Lagrange multipliers. The following conditions for optimality are obtained by differentiating (104) with respect (i) + − e (i) , p0 , ξin to p , ξin and setting the results equal to zero: (i) ∂Gl ∂e p(i) (i) ∂Gl (i) ∂p0 ∂G(i) l + ∂ξ in (i) ∂Gl − ∂ξin
(105)
e (i) − =p
N X ¡ + ¢ λin − λ− in x0 (n) = 0, n=1
N X ¡ + ¢ = λin − λ− in = 0, n=1
1 (i) + S (x0 (n)) − λ+ in − µin = 0, τ 1 − = S (i) (x0 (n)) − λ− in − µin = 0. τ =
+ − The last two (105) ¤ and the requirements µin , µin ≥ 0 imply that £ conditions + − 1 (i) λin , λin ∈ 0, τ S (x0 (n)) . From the first condition (105), we obtain the support vector expansion
e (i) = p
(106)
N X ¡ + ¢ λin − λ− in x0 (n) . n=1
Putting conditions (105) in the Lagrangian (104) we get
(i)
Gl (107)
=
−
N N ¢¡ + ¢ > 1 XX¡ + λ − λ− λij − λ− in ij x0 (n) x0 (j) 2 n=1 j=1 in
−ε
N N X ¡ + ¢ X ¡ + ¢ λin + λ− λin − λ− in + in t0 (n) . n=1
n=1
− Maximization of (107) with respect to λ+ in , λin subject to constraints:
(108)
N X ¡ + ¢ λin − λ− in = 0, n=1 ¸ · 1 (i) + − λin , λin ∈ 0, S (x0 (n)) τ
58
JACEK M. L Ã ESKI
is the Wolfe dual formulation of (104). At the saddle point, for each Lagrange multiplier, the KKT conditions must be satisfied: ³ ´ (i) + + (i)> e λ ε + ξ − t (n) + p x (n) + p = 0, 0 0 0 in in ³ ´ (i) − − e (i)> x0 (n) − p0 = 0, ε + ξin + t0 (n) − p λ ¶ µin 1 (i) (109) + S (x0 (n)) − λ+ in ξin = 0, τ ¶ µ 1 (i) − S (x0 (n)) − λ− in ξin = 0. τ ¡ ¢ From last two conditions (109) we see that λ+ ∈ 0, τ1 S (i) (x0 (n)) imply in ¡ 1 (i) ¢ + − ξin = 0 and λ− (x0 (n)) imply ξin = 0. In this case, from the in ∈ 0, τ S first two conditions (109) we have: ( ¢ ¡ 1 (i) (i) > (i) e − ε, for λ+ (x0 (n)) , p0 = t0 (n) − x0 (n) p in ∈ ¡0, τ S ¢ (110) (i) > (i) 1 (i) e + ε, for λ− p0 = t0 (n) − x0 (n) p (x0 (n)) . in ∈ 0, τ S (i)
any x0 (n)¢ Thus, we may determine the parameter p0 from (110) by taking ¡ for which we have the Lagrange multipliers in open interval 0, τ1 S (i) (x0 (n)) . (i) e a or p e (i) , p0 6.3. A unification. Determination of the parameters P, leads to quadratic programming (QP) problem (97) or (107) with bound constraints and one linear equality constraint (98) or (108). It is easy to see that both problems may be presented as the maximization of the following Lagrangian
G =
(111)
−
N N ¢¡ + ¢ 1 XX¡ + λj − λ− λn − λ− n j Knj 2 n=1 j=1
−ε
N N X ¡ + ¢ X ¡ + ¢ λn + λ− + λn − λ− n n t0 (n) . n=1
with respect to (112)
− λ+ n , λn
n=1
subject to constraints: N ¢ X¡ + λn − λ− n = 0, n=1 − λ+ n , λn ∈ [0, ηn ] .
The solution is given by (113)
p=
N X ¡ + ¢ λn − λ− n rn . n=1
ε-INSENSITIVE LEARNING TECHNIQUES
59
Let us denote bias as z. Now, in both cases the determination of bias is given by: ½ (114)
+ z = t0 (n) − r> n p − ε, for λn ∈ (0, ηn ) , − p + ε, for λ z = t0 (n) − r> n ∈ (0, ηn ) . n
Thus, both global and local learning can be presented as a solution of the problem (111), with notations summarized in Table 4. In this table the notations needed in further consideration are also shown. For a large training set standard optimization techniques quickly become intractable in their memory and time requirements. Standard implementation of QP solvers require explicit storage of N × N matrix. Osuna et. al in [49] and Joachims in [29] shows that the large QP problems can be decomposed into a series of smaller QP subproblems over part of the data. Platt in [52] proposes Sequential Minimal Optimization algorithm. This method chooses two Lagrange multipliers and find their optimal values analytically. A disadvantage of these techniques is that they may give an approximate solution, and may require many passes through the training set. In [48] an alternative approach that determines the solution in the classification problem is presented. In the next section this idea is used to solve the problem of fuzzy modeling with ε-insensitive learning. In [5] an approach that determines the exact solution for p training data in terms of that for p−1 data, is presented to solve classification problems. In Section 8, this idea is used to solve the neuro-fuzzy learning problems. A third approach to ε-insensitive learning shown in the Section 9 leads to a problem of solving a system of linear inequalities. 7. Iterative QP solution Regrouping the terms in (94) or (104) and using the unification presented in the previous section yields G = (115)
N N X ¡ ¢ X ¡ ¢ 1 >e + − p Ip + ξn+ ηn − λ+ ξn− ηn − λ− n − µn + n − µn 2 n=1 n=1
+
N N X ¢2 X ¢2 γn+ ¡ γn− ¡ > t0 (n) − r> p − ε + rn p − t0 (n) − ε , n 2 2 n=1 n=1
where (116)
γn+ ,
2λ+ n , t0 (n) − r> np−ε
γn+ ,
2λ− n . r> n p − t0 (n) − ε
JACEK M. L Ã ESKI 60
x0 (n) µ· h iN h iN ¸¶ un un , diag |e |e n| n+N | n=1 n=1 £ (i) ¤> S (x0 (1)) , S (i) (x0 (2)) , · · · , S (i) (x0 (N )) t+1 ¡£ ¤¢ > diag 0, 1t×1
Local learning for the i-th rule p(i) > x0 (n) x0 (j) 1 (i) (x0 (n)) τS (i) p0
Table 4. Substitutions used in global and local learning.
a e (x0 (n)) d µh i2N ¶ diag |e1n |
Notation Global learning p P e (x0 (n))> d e (x0 (j)) Knj d 1 ηn τ z rn De1
1N ×1 I (t + 1) ¡£ ¤¢ > > > diag 0, 1t×1 , 0, 1t×1 , · · · , 0, 1t×1
n=1
u κ eI
ε-INSENSITIVE LEARNING TECHNIQUES
61
The condition for optimality is obtained by differentiating (115) with respect to p and setting the results equal to zero (117) N N X X ¡ ¢ ¡ ¢ ∂G e = Ip − γn+ t0 (n) − r> p − ε r + γn− r> n n n p − t0 (n) − ε rn = 0. ∂p n=1 n=1 ¡ ¢ ¡ ¢ + − If we define Γ+ = diag γ1+ , γ2+ , ..., γN and Γ− = diag γ1− , γ2− , ..., γN , the solution to (117) with respect to p can be written in the form (118) ³ ´−1 £ ¡ + ¢ ¤ − + > − e X> p = X> Γ + Γ X + I r r r Γ (t − ε1N ×1 ) + Xr Γ (t + ε1N ×1 ) , >
where Xr = [r1 , r2 , · · · , rN ] . Expression (118) is the optimal solution to (115). However, this solution is unconstrained and in order to satisfy the KKT conditions from the Section − + 6, the special setting of the parameters λ+ n and λn is needed: if ξn < 0 − + − + − + (ξn < 0), then λn = 0 (λn = 0); if ξn = 0 (ξn = 0), then λn ∈ (0, ηn ) + − + − (λ− n ∈ (0, ηn )); if ξn > 0 (ξn > 0), then λn = ηn (λn = ηn ). Thus, the Lagrange multipliers can be treated as functions of the slack variables. From the above rules we can see that these functions have a discontinuity at ξn+ = 0 (ξn− = 0). In [48], the following approximation of these functions is proposed ξn+ < 0, 0, + βξn , 0 ≤ ξn+ < (119) λ+ n = ηn , ξn+ ≥ ηβn ,
ηn β ,
ξn− < 0, 0, − βξn , 0 ≤ ξn− < λ− n = ηn , ξn− ≥ ηβn ,
ηn β ,
where β is a big positive value and (120)
ξn+ , t0 (n) − r> n p − ε,
ξn− , r> n p − t0 (n) − ε.
From (118) we can infer that p depends on γn+ , γn− , and from (116), (119), (120) that γn+ , γn− depend on p. In this case, the most popular algorithm for approximating the solution is the Picard iteration through (118) and (116), [k−1] (119), estimates p → ° © [k] ª (120).[k]This algorithm loops through the cycle°of[k] [k−1] ° ° γ → p and checks the termination criterion p − p < κ 1, 2 where κ1 is a pre-set parameter, and superscript [k] denotes the iteration index. The procedure of seeking optimal p can be called Iterative Quadratic Programming (εIQP) and summarized in the following steps: +[1]
(1) Initialize γn
−[1]
= β, γn
= 0 for all n. Iteration index, k = 1.
62
JACEK M. L Ã ESKI
(2) p[k]
³ =
³ ´ ´−1 h +[k] X> Γ+[k] + Γ−[k] Xr + eI X> (t − ε1N ×1 ) r r Γ i −[k] +X> (t + ε1N ×1 ) . r Γ
+[k+1]
−[k+1]
[k] [k] − ε, ξn − t0 (n) − ε. = t0 (n) − r> = r> (3) ξn np n p +[k+1] −[k+1] +[k+1] −[k+1] , λn using (119) and γn , γn using (4) Compute λn (116). ° ° (5) if (k = 1) or °p[k] − p[k−1] °2 > κ1 , then k = k + 1, go to 2 else stop.
8. Incremental learning Putting the first and the last two conditions (95) or (105) in the Lagrangian (94) or (104) and using notations from Table 4, we get H = −G =
(121)
N N N X ¢¡ + ¢ ¡ + ¢ 1 XX¡ + λn − λ− λj − λ− K + ε λn + λ− nj n n j 2 n=1 j=1 n=1
−
N X ¡
N X ¢ ¡ + ¢ − λ+ − λ t (n) − z λn − λ− 0 n n n .
n=1
λ± n
λ+ n
Defining , − written in the form max z∈R
(122)
min
λ− n
n=1
∈ [−ηn , +ηn ], the minimization of (121), can be
H {−ηn ≤λ± n ≤+ηn }
=
N N N X ¯ ±¯ 1 XX ± ± ¯λn ¯ λn λj Knj + ε 2 n=1 j=1 n=1
−
N X n=1
λ± n t0 (n) − z
N X
λ± n.
n=1
λ± n
Differentiating (122) with respect to and z yields: N X ¡ ±¢ ∂H Knj λ± j + z + ε sgn λn − t0 (n) , ± = ∂λn j=1 (123) N X ∂H λ± ∂z = j . j=1 Using (96) and (106) we see that the first of (123) is the KKT condition for N P local and global learning, respectively. Defining hn , Knj λ± j +z −t0 (n), j=1
e + z − t0 (n) and the following conditions should be we see that hn = e r> np satisfied: (1) if hn > 0, then the n-th example is below the regression line,
ε-INSENSITIVE LEARNING TECHNIQUES
63
(2) if hn = 0, then the n-th example is on the regression line, (3) if hn < 0, then the n-th example is above the regression line, (4) if hn + ε < 0, then the n-th example is above the insensitivity region and λ± n = +ηn , (5) if hn + ε = 0, then the n-th example is on the edge of the insensitivity region, and λ± n ∈ (0, +ηn ), (6) if hn + ε > 0, then the n-th example is on the + insensitivity region, and λ± n → 0 , (7) if hn − ε < 0, then the n-th example − is on the insensitivity region, and λ± n → 0 , (8) if hn − ε = 0, then the n-th example is on the edge of the insensitivity region, and λ± n ∈ (−ηn , 0), (9) if hn − ε > 0, then the n-th example is below the insensitivity region, and λ± n = −ηn . ³ ´ N The parameters values {λ± n }n=1 , z explicitly define the partition of the training data indices T r(N ) in the following groups: support vectors indices Sv (N ) , error vectors indices Er(N ) , and remaining vectors indices Re(N ) . Indeed, Sv (N ) ∪ Er(N ) ∪ Re(N ) = T r(N ) . In the incremental learning the solution in the iteration p is obtained from the solution in the iteration p−1. In the iteration p data pair (x0 (c) , t0 (c)) is added to the training set and index c is added to the training data indices, T r(p) = T r(p−1) ∪ {c}. First, we assume that initially λ± c is equal to zero, and it is changed by a small e and z change their values in each value ∂λ± c . The regression parameters, p incremental step to keep all elements in training set in equilibrium, i.e., keep the KKT conditions fulfilled. These conditions can be differentially expressed as:
(124)
X ∂hn = Knc ∂λ± Knj ∂λ± c + j + ∂z = 0, (p−1) j∈Sv X ± ∂λ± ∂λ + c j = 0. j∈Sv (p−1)
For vectors from the support set Sv (p−1) = {s1 , s2 , · · · , s` }, the following conditions are fulfilled ∂hsk = 0; 1 ≤ k ≤ `. Writing (124) for Sv (p−1) in matrix form yields (125)
Ξ·
∂z ∂λ± s1 .. . ∂λ± s`
= −
1 Ks 1 c .. . Ks ` c
± ∂λc ,
64
JACEK M. L Ã ESKI
where Ξ is symmetric not positive-definite Jacobian 0 1 ··· 1 1 Ks1 s1 · · · Ks1 s` (126) Ξ= . . . .. .. .. .. . . 1 Ks ` s 1 · · · Ks ` s ` In equilibrium state
½
(127) with sensitivities given by (128)
ψ ρs 1 .. .
∂z = ψ∂λ± c , ± ∂λ± n = ρn ∂λc ,
= −Υ
ρs `
1 Ks 1 c .. .
,
Ks ` c
where Υ = Ξ−1 , and for all example indices outside Sv (p−1) set, ρj = 0. Substituting (127) in ∂hn yields (129)
∂hn = κn ∂λ± c ,
where κn = 0 for all example indices from Sv (p−1) , and X (130) κn = Knc + Knj ρj + ψ, for n ∈ / Sv (p−1) . j∈Sv (p−1) (p−1) If ∂λ± moves across the c is sufficiently large, then indices from T r (p−1) (p−1) (p−1) sets Sv , Er , Re . On the basis of (127) and (129), it is possible to determine, the largest admissible value of ∂λ± c , to the first membership change, accordingly to: (1) hc +ε ≤ 0, with equality when c joins to Sv (p−1) , (2) hc − ε ≥ 0, with equality when c joins to Sv (p−1) , (3) λ± c ≤ +ηc , with equality when c joins to Er(p−1) , (4) λ± ≥ −η , with equality when c c c (p−1) joins to Er(p−1) , (5) 0 ≤ λ± ≤ +η , ∀n ∈ Sv , with equality to 0, n n (p−1) (p−1) when n transfers from Sv to Re , and equality to +ηn , when n (p−1) , with transfers from Sv (p−1) to Er(p−1) , (6) −ηn ≤ λ± n ≤ 0, ∀n ∈ Sv (p−1) (p−1) equality to −ηn , when n transfers from Sv to Er , and equality to 0, when n transfers from Sv (p−1) to Re(p−1) , (7) hn + ε ≤ 0, ∀n ∈ Er(p−1) , with equality when n transfers from Er(p−1) to Sv (p−1) , (8) hn − ε ≥ 0, ∀n ∈ Re(p−1) , with equality when n transfers from Re(p−1) to Sv (p−1) , (9) hn + ε ≥ 0, ∀n ∈ Re(p−1) , with equality when n transfers from Re(p−1) to Sv (p−1) , (10) hn − ε ≤ 0, ∀n ∈ Re(p−1) , with equality when n transfers from Re(p−1) to Sv (p−1) .
ε-INSENSITIVE LEARNING TECHNIQUES
65
If support vector indices set is extended by inclusion element s`+1 , then also the matrix Υ should be extended. The matrix Ξ is extended by adding one row and one column: 1 Ξ(p−1) Ks1 s`+1 (131) Ξ(p) = . .. . 1
Ks`+1 s1
···
Ks`+1 s`+1
Using for (131) the extension principle from the matrix theory [20], (see Appendix A) and (128), (130) yields 0 · → ¸ .. − ¤ 1 ρ £ − → . Υ(p−1) ρ 1 , (132) Υ(p) = + 0 κs`+1 1 0 ··· 0 0 → where − ρ = [ψ, ρs1 , · · · , ρs` ] . Above process is reversible, and when support vector with index s`+1 is excluded, we may use (132) to obtain Υ(p−1) . In case of excluding vector with index sj , j 6= ` + 1, we can change position of this vector to ` + 1, and use (132): h i. (p) (p) (p) (133) ∀ Υ(p−1) = Υ(p) Υjj . mn mn − Υmj Υjn >
(p−1)
m,n6=j; m,n∈Svi m,n6=0
Incremental learning (εIL) algorithm can be summarized in the following steps: (0) (1) Initialize hn = −t0 (n), λ± = Er(0) = Re(0) = ∅, n = 0, ∀n, Sv p = 1, (2) Select element with index c as the farthest from regression line, (3) If hc + ε > 0 or hc − ε < 0, then Re(p) = Re(p−1) ∪ {c}, (4) If hc + ε ≤ 0, then increase λ± c so that one condition is fulfilled: • hc + ε = 0, Sv (p) = Sv (p−1) ∪ {c}, update Υ(p) , (p) • λ± = Er(p−1) ∪ {c}, c = +ηc , Er • Index of a one element from T r(p−1) transfer between Sv (p−1) , Er(p−1) , Re(p−1) , (5) If hc + ε ≥ 0, then decrease λ± c so that one condition is fulfilled: • hc + ε = 0, Sv (p) = Sv (p−1) ∪ {c}, update Υ(p) , (p) • λ± = Er(p−1) ∪ {c}, c = −ηc , Er • Index of a one element from T r(p−1) transfer between Sv (p−1) , Er(p−1) , Re(p−1) ,
66
JACEK M. L Ã ESKI
± ± ± ± ± (p−1) (6) Update, λ± ; c ←− λc + ∂λc ; λn ←− λn + ρn ∂λc , n ∈ Sv ± ± (p−1) z ←− z + ψ∂λc ; hn ←− hn + κn ∂λc , n ∈ / Sv , (7) If there are not processed element indices in T r(N ) , then p ←− p+1, go to 2 else stop. Remarks: • In substance, it is indifferent what example is selected as (x0 (c) , t0 (c)). But to increase the convergence of the incremental learning, the example pair farthest from the regression line is selected. • The incremental learning has computational burden approximately quadratic with the cardinality of the training set. It is also approximately linear to data dimensionality t. • Some of examples are selected several times, and it is remunerative, from computational point of view, to cache Knj evaluations.
9. ε-insensitive learning by solving a linear inequalities system Using notations from Table 4, the criterion functions (91) and (101) can be rewritten in the matrix form as τ (134) min I (p) , et − Xr pdε,u + p>eIp, p 2 where e·dε,u denotes a weighted Vapnik loss function, defined for a vector >
argument, g = [g1 , g2 , · · · , gN ] , as (135)
egdε,u ,
N X
>
un egn dε , u = [u1 , u2 , · · · , uN ] .
n=1
To make the minimization problem (134) mathematically tractable, we see that the minimization of the first term can be equivalently written as requirements: Xr p + ε1N ×1 > t and Xr p − ε1N ×1 < t, where 1N ×1 is a (N × 1)-dimensional vector with all entries equal to 1. Defining the ex. tended versions of X and t: X , [X> ..−X> ]> and t , [t (1)−ε, t (2)− r
re
r
r
e
0
0
ε, · · · , t0 (N ) − ε, −t0 (1) − ε, −t0 (2) − ε, · · · , −t0 (N ) − ε]> , the above requirements can be written as: Xre p − te > 0. In practically interesting cases, not all inequalities in the above system are fulfilled (except the case, where ε is so large, then all data fall in the insensitivity region). The above inequalities system, in order to solve, is replaced by the following equalities system: Xre p − te = b, where b is an arbitrary positive vector, b > 0. We define the error vector as: e = Xre p − te − b. If the n-th (2n-th), where 1 ≤ n ≤ N , component of e is positive, en ≥ 0 (e2n ≥ 0), then the n-th datum falls on the insensitivity region, and by increase the respective component of b, the en (e2n ) can be set to zero. If n-th (2n-th) component of
ε-INSENSITIVE LEARNING TECHNIQUES
67
e is negative, then the n-th datum falls outside the insensitivity region, and it is impossible to decrease bn (b2n ) and fulfils the condition bn > 0 (b2n > 0). In other words, non-zero error corresponds only to datum outside the insensitivity region. Our minimization problem (134) can be approximated by the following one (136) τ > min I (p, b) , (Xre p − te − b) De (Xre p − te − b) + p>eIp, p∈Rκ ,b>0 2 µ· ¸¶ . where De denotes a diagonal weights matrix, De = diag u> ..u> . Dimensionality of the parameter vector is denoted as κ; see Table 4. For mathematical simplicity, the above criterion is approximation of (134), where squared error rather than absolute error is used. In further part of this work absolute error will be used. Conditions for optimality we get by differentiating (136) with respect to p, b and setting the results equal to zero: ( ³ ´−1 τe > p = X D X + I X> e re re re De (te + b) , 2 (137) e = Xre p − te − b = 0. From the first equation (137), we see that the vector p depends on vector b. The vector b can be called a margin vector, because its components determine the distance from a datum to the insensitivity region. For fixed p, if datum is lying on the insensitivity region, the corresponding distance can be increased to obtain zero error. However, if datum is lying outside the insensitivity region, then error is negative, and we may decrease the error only by decreasing the corresponding component of b. The only way to prevent b from converging to zero is to start with b > 0 and to refuse to decrease any of its components. Ho and Kashyap proposed an iterative algorithm for alternately determining p and b, where components of b cannot decrease [22], [23]. Now, this algorithm can be extended to our weighted squared error criterion with term corresponding to VC dimension. The solution is obtained in an iterative way. Vector p is determined on the basis of the first ³ ´−1 ¡ ¢ equation from (137), i.e., p[k] = X> De Xre + τ eI X> De te + b[k] , re
2
re
where superscript [k] denotes the iteration index. Components of the vector b are modified by components of the error vector e, but only in the case when it is response in increase components of b. Otherwise, the components of b remain unmodified. So, we write this modification as follows ¯ ¯´ ³ ¯ ¯ (138) b[k+1] = b[k] + ρ e[k] + ¯e[k] ¯ , where ρ > 0 is a parameter.
68
JACEK M. L Ã ESKI
Absolute error criterion, equivalent to (134) is easy obtained by selecting the following diagonal weight matrix De1
= diag (u1 /|e1 | , u2 /|e2 | , · · · uN /|eN | , u1 /|eN +1 | , u2 /|eN +2 | , · · · uN /|e2N | ) ,
where ei is the i-th component of the error vector. But, the error vector depends on p. So, we use vector p from previous iteration. This procedure is based on the premise that near the optimal solution sequence of vectors p differs imperceptibly. The procedure of seeking optimal p and b, may be called the ε-insensitive Learning by Solving a System of Linear Inequalities (εLSSLI) and may be summarized in the following steps: · ¸ . [1] >. > (1) Fix τ ≥ 0, 0 < ρ < 1 and De1 = diag( u .u ). Initialize b[1] > 0. Set iteration index k = 1, ´−1 ³ ¢ [k] ¡ [k] τe [k] , X> (2) p[k] = X> re De1 te + b re De1 Xre + 2 I (3) e[k] = Xre p[k] − te − b[k] , (4) ¯ ³ .¯ ¯ .¯ ¯ .¯ ¯ [k] ¯ ¯ [k] ¯ ¯ [k] ¯ [k+1] = diag u1 ¯e1 ¯ , · · · , uN ¯eN ¯ , u1 ¯eN +1 ¯ , · · · , De1 ¯´ .¯ ¯ [k] ¯ uN ¯e2N ¯ , ¡ [k] ¯ [k] ¯¢ [k] ¯ ¯ , = b + ρ (5) b[k+1] ° [k+1] ° e + e [k] ° ° (6) if b −b > κ2 , then k = k + 1, go to 2 else stop. Remarks κ2 is a pre-set parameter. Appendix B shows that for 0 < ρ < 1 and any diagonal matrix De , the above algorithm is convergent. If the step 4 in this algorithm is omitted, then the squared error minimization procedure is obtained. In practice, divide-by-zero-error in the step 4 does not occur. It follows from the fact that some components of vector e goes to zero as [k] goes to infinity. But in this case convergence is slow and the condition 6 stops the algorithm. 10. Numerical experiments and discussion Both local and global learning were examined using the following scheme: the parameters form the antecedents and the consequents of if-then rules were adjusted separately. First, the antecedents parameters were adjusted using clustering the input data using (73) and (74). Second, the consequents parameters are adjusted by means of the least squares method: (82) for global learning and (84) for local learning. For calculations, the Reichenbach
ε-INSENSITIVE LEARNING TECHNIQUES
69
fuzzy implication and the following parameters values: m = 2, ξU = 10−4 were applied. The fuzzy clustering was repeated 25 times for a different random initialization of the initial partition matrix. A c-partition with the minimal value of index VXB was selected. In the IQP method the parameters β = 103 and κ1 = 10−6 were used. In all experiments, b[1] = 10−6 and ρ = 0.98 were used in the εLSSLI method. The iterations were stopped as soon as the Euclidean norm in a successive pair of b vectors was less than κ2 = 10−4 . All experiments were run in the MATLAB environment. The purpose of these experiments was to compare the generalization ability of the proposed neuro-fuzzy system with learning tolerant to imprecision and classical zero-tolerance learning. Data originating from Box and Jenkins [3] work concerning the identification of a gas oven. Air and methane were delivered into the gas oven (gas flow in ft/min - an input signal x (n)) to obtain a mixture of gases containing CO2 (percentage content - output signal y (n)). The data consisted of 296 pairs of input-output samples in 9 sec. periods. To identify the model, the following vectors were used as input: x0 (n) , [y(n − 1)... y(n − 4) x(n) x(n − 1)... x(n − 6)]> and output t0 (n) , y(n). The training set consists of the first 100 pairs of data and the testing set consists of the remaining 190 pairs of data. Parameters τ and ε were changed in range from 0 to 0.5 (step 0.01), and the number of if-then rules was change from 2 to 6. After the training stage (neuro-fuzzy system design on the training set), the generalization ability of the designed model was determined as a root mean squared error (RMSE) on the test set. Tables 5, 6 show the lowest RMSE for each number of if-then rules for global and local learning, respectively. Also values of τ and ε parameters for which the lowest RMSE is obtained as well as RMSE for zero-tolerance learning are shown. Taking into account these tables several observations can be made. First of all, it should be noted, that for both global and local learning, despite of the number of if-then rules, learning tolerant to imprecision leads to a better generalization comparing with zero-tolerance learning. The best generalization for each number of rules is obtained for parameters ε and τ value different from zero. It must be also noted, that we observe different dependencies of the generalization ability from the number of if-then rules for zero-tolerance learning and tolerant to imprecision learning. For zerotolerance learning, increasing the number of rules results in fast decreasing the generalization ability due to the overfitting effect of the training set. For all tolerant to imprecision learning methods we have slower decreasing of the generalization ability. The best generalization ability is obtained using the εIQP local learning algorithm with I = 2, ε = 0.05 and τ = 0.23. The
JACEK M. L Ã ESKI 70
I 2 3 4 5 6
I 2 3 4 5 6
εIQP ε 0.06 0.02 0.03 0.05 0.03 τ 0.21 0.08 0.09 0.01 0.08
RMSE 0.3532 0.3730 0.3723 0.3885 0.5299
εIL ε 0.02 0.09 0.02 0.11 0.03 τ 0.22 0.17 0.09 0.12 0.05
εLSSLI RMSE ε 0.3454 0.02 0.3580 0.06 0.3732 0.04 0.3518 0.03 0.4957 0.01
τ 0.07 0.09 0.25 0.13 0.39
Zero-tolerance learning RMSE 0.3622 0.3785 0.4349 0.4272 0.5579
Table 5. RMSE obtained for the testing part of Box-Jenkins time series — global learning.
RMSE 0.3453 0.3678 0.3695 0.3843 0.5534
εIQP ε 0.05 0.03 0.02 0.07 0.05 τ 0.23 0.29 0.32 0.13 0.15
RMSE 0.3508 0.3664 0.3987 0.4147 0.5217
εIL ε 0.02 0.11 0.04 0.07 0.03
τ 0.09 0.16 0.09 0.23 0.04
εLSSLI RMSE ε 0.3502 0.03 0.3654 0.11 0.3945 0.04 0.4036 0.08 0.5213 0.06
τ 0.08 0.07 0.03 0.03 0.07
Zero-tolerance learning RMSE 0.3594 0.3776 0.4281 0.4159 0.5543
Table 6. RMSE obtained for the testing part of Box-Jenkins time series — local learning.
RMSE 0.3440 0.3478 0.3722 0.3831 0.5339
ε-INSENSITIVE LEARNING TECHNIQUES
71
results obtained by the εLSSLI method is a little bit worse comparing with the εIQP. The worst generalization ability is obtained by the εIL method. Finally, it can be noted that the εLSSLI algorithm converges after a few iterations, but its computational burden is approximately 3-times greater in comparison with the zero-tolerance learning. However, computational burden of the εIQP and εIL methods is approximately 4- and 7-times greater in comparison to the zero-tolerance learning, respectively. Tests for the robustness of outliers of the proposed learning methods were also performed. Firstly, the sixth output sample y6 , that has the original value equal to 52.4, was set to value equal to 100. Hence, in this case, only one outlier was added to the database. In the second case, two outliers were added to the training set: once again y6 and, additionally, y34 (original value 47.6) were set to 100. In the third case, the Cauchy random noise was added to output variable y of the training part of the dataset. Finally, in the fourth case, the Cauchy random noise was added to the input and output variable of the training part. The training stage was performed for the parameters I, ε and τ for which the best generalization without outliers and noise was obtained. Table 7 shows the RMSE for the testing part of the database. From this table we can see that the best generalization, in the presence of outliers and noise, was obtained for the εIL method, but slightly worse results were obtained for the εLSSLI method. Finally, it can be noted that learning methods tolerant to imprecision lead to a better than 2-times generalization ability improvement. Thus, in the presence of outliers and noise the learning methods tolerant to imprecision particularly improve the generalization ability of designed neuro-fuzzy systems. 11. Conclusions The inference algorithms based on conjunctive interpretation of implication in some cases seem to be faster, simpler and more exact than the fuzzy inference system based on logical interpretation. However, the interpretation of the fuzzy if-then rules based on fuzzy implications is sounder from the logical point of view. In this paper an artificial neural network based on logical interpretation of if-then rules (ANBLIR) has been described. Such a system can be used for an automatic if-then rule generation. The innovations of that system in comparison with the well known from literature are the logical interpretation of fuzzy if-then rules and the moving fuzzy set in consequents of if-then rules. A combination of gradient descent and least squares methods for parameter optimization has been discussed. For initialization of calculations preliminary fuzzy clustering has been used. A new approach to neuro-fuzzy modeling with learning tolerant to imprecision is also presented. In this method of learning a weighted ε-insensitive
JACEK M. L Ã ESKI 72
LS 2.4506 2.5093 3.3783 25.0577
Global εIQP 1.1652 1.1683 1.3145 3.3462
learning Local learning εIL εLSSLI LS εIQP εIL εLSSLI 0.9845 1.0137 2.2305 1.0476 0.9632 1.0127 1.0023 1.0149 2.3813 1.0301 1.0116 1.0149 1.1248 1.5428 2.7532 1.3452 1.2836 1.4272 2.4854 2.6632 11.8582 3.3641 2.1148 2.1316
Table 7. RMSE obtained for the testing part of the Box-Jenkins data in the presence of outliers and noise in the training part. Test 1 outlier 2 outliers noise on y noise on x and y
ε-INSENSITIVE LEARNING TECHNIQUES
73
loss function is used. Computationally effective numerical methods for εinsensitive learning of a neuro-fuzzy system, called iterative QP (εIQP), incremental learning (εIL) and learning by solving a system of linear inequalities (εLSSLI) are introduced. These methods establishes a connection between fuzzy modeling and the statistical learning theory, where easy control of VC-dimension (system complexity) is permitted. Numerical example shows the usefulness of the new method in design neuro-fuzzy systems with improved generalization ability and outliers robustness comparing with the classical zero-tolerance learning. Learning tolerant to imprecision always leads to better generalization comparing with the classical methods. It is impossible to decide which tolerance learning method is the best. The incremental learning algorithm has better outliers robustness but its computational burden is approximately 2-times greater with respect to εLSSLI. The εLSSLI is easier to implement and leads to a little bit worse generalization ability with respect to the incremental learning. Appendix A The extension principle is formulated as follows: A−1 bc> A−1 A−1 b · ¸−1 −1 − 0 A b A + d0 d = , > > −1 c d c A 1 − d0 d0 £ ¤> where£ d0 = d − c> A−1 b. In¤ our case, A = Γ(p) , b = 1Ks1 s`+1 · · · Ks` s`+1 , c> = 1Ks`+1 s1 · · · Ks`+1 s` and d = Ks`+1 s`+1 . Appendix B τe The first Eq. from (137) can be rewritten in the form: X> re De e = − 2 Ip. Thus, for τ > 0 all elements of the error vector can not be zero. If we ³ ´−1 ¯ ¯ [k] τe [k] ¯ [k] ¯ define X†re , X> D X + , then using I X> e re re re De and e+ , e + e 2 ¡ ¢ [k] [k+1] [k] † (137) = e + ρ Xre Xre − I e+ and eIe p[k+1] = ³ and (138)´yields: e [k] [k] e [k] + ρX†re e+ . Substitution the above results in (136) X†e b[k] + ρe+ = p gives: ¡ ¢ [k] I [k+1] = I [k] + 2ρe[k]> De Xre X†re − I e+ ¢> ¡ ¢ [k] [k]> ¡ +ρ2 e+ Xre X†re − I De Xre X†re − I e+ [k]
[k]>
[k]
e † +2τ ρe p[k]>eIX†re e+ + τ ρ2 e+ X†> re IXre e+ .
74
JACEK M. L Ã ESKI
[k] p[k] . Using the above From the first Eq.(137) we have X> = − τ2 eIe re De e and the equality ¡ ¢ [k] ¡ ¢ [k] [k]> 2ρe[k]> De Xre X†re − I e+ = ρe+ De Xre X†re − I e+ ,
after some simple algebra: I [k+1] − I [k]
[k]
ρ (ρ − 1) e[k]> De e+ ³ τ e´ † [k] [k]> > +ρ2 e+ X†> re Xre De Xre + I Xre e+ 2 2 [k]> † [k] −2ρ e+ De Xre Xre e+ . ³ ´ τe > † † Since X†> re Xre De Xre + 2 I Xre = De Xre Xre the second and third terms [k]>
=
[k]
simplify to: −ρ2 e+ De Xre X†re e+ . [k] [k]> [k] Thus: I [k+1] − I [k] = ρ (ρ − 1) e[k]> De e+ −ρ2 e+ De Xre X†re e+ . The matrix De Xre X†re is symmetric and positive semidefinite. It follows, that the second term is negative or zero. For 0 < ρ < 1 the first term is negative or zero. Thus, the sequence I [1] , I [2] , ... is monotonically decreasing. Con[k] vergence requires that e+ tends to zero (no modification in (138)), while e e[k] is bounded away from zero, due to X> re De e = −τ Ip. References [1] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press, 1982. [2] J.C. Bezdek and S.K. Pal, Fuzzy Models for Pattern Recognition. New York: IEEE Press, 1992. [3] G.E.P. Box and G.M. Jenkins, Time Series Analysis. Forecasting and Control. San Francisco: Holden-Day, 1976. [4] Z. Cao and A. Kandel, “Applicability of Some Fuzzy Implication Operators,” Fuzzy Sets and Systems, Vol. 31, pp. 151-186, 1989. [5] G. Cauwenberghs and T. Poggio, “Incremental and Decremental Support Vector Machine Learning,” Adv. Neural Information Processing Systems, Cambridge MA: MIT Press, Vol. 13, 2001. [6] J.-Q. Chen, Y.-G. Xi and Z.-J. Zhang, “A Clustering Algorithm for Fuzzy Model Identification,” Fuzzy Sets and Systems, Vol. 98, pp. 319-329, 1998. [7] K.B. Cho and B.H. Wang, “Radial Basis Function Based Adaptive Fuzzy Systems and Their Applications to System Identification and Prediction,” Fuzzy Set and Systems, Vol. 83, pp. 325-339, 1996. [8] O. Cordon, F. Herrera and A. Peregrin, “Applicability of the Fuzzy Operators in the Design of Fuzzy Logic Controllers,” Fuzzy Sets and Systems, Vol. 86, pp. 15-41, 1997. [9] E. CzogaÃla and R. Kowalczyk, “Investigation of Selected Fuzzy Operations and Implications in Engineering,” Fifth IEEE Int. Conf. on Fuzzy Systems, pp. 879-885, 1996.
ε-INSENSITIVE LEARNING TECHNIQUES
75
[10] E. CzogaÃla, J. Fodor and J. L Ã eski, “The Fodor Fuzzy Implication in Approximate Reasoning,” Systems Science, Vol. 23, No. 2, pp. 17-28, 1997. [11] E. CzogaÃla and J. L Ã eski, “An Equivalence of Approximate Reasoning Under Defuzzification,” BUSEFAL, Vol. 74, pp. 83-92, 1998. [12] E. CzogaÃla and J. L Ã eski, “Fuzzy Implications in Approximate Reasoning,” In: L.A. Zadeh and J. Kacprzyk (Eds.), Computing with Words in Information/Intelligent Systems, Vol.I: Foundations. Heidelberg, Springer-Verlag, pp. 342-357, 1999. Ã eski, Fuzzy and Neuro-Fuzzy Intelligent Systems. Heidelberg: [13] E. CzogaÃla and J. L Physica-Verlag, Springer-Verlag Comp., 2000. [14] E. CzogaÃla and J. L Ã eski, “On Equivalence of Approximate Reasoning Results Using Different Interpretations of Fuzzy If-Then Rules,” Fuzzy Sets and Systems, Vol. 117, pp. 279-296, 2001. [15] D. Dubois and H. Prade, “Fuzzy Sets in Approximate Reasoning, Part 1: Inference with Possibility Distributions,” Fuzzy Sets and Systems, Vol. 40, pp. 143-202, 1991. [16] D. Dubois and H. Prade “What Are Fuzzy Rules and How to Use Them,” Fuzzy Sets and Systems, Vol. 84, pp. 169-185, 1996. [17] J.C. Fodor, “On Fuzzy Implication Operators,” Fuzzy Sets and Systems, Vol. 42, pp. 293-300, 1991. [18] J.C. Fodor, “Contrapositive Symmetry of Fuzzy Implications,” Fuzzy Sets and Systems, Vol. 69, pp. 141-156, 1995. [19] J.C. Fodor and M. Roubens, Fuzzy Preference Modelling and Multicriteria Decision Support. Dordrecht: Kluwer, 1994. [20] F.R. Gantmacher, The Theory of Matrices. New York: Chelsa Publ., 1959. [21] S. Haykin, Neural Networks. A Comprehensive Foundation. 2nd Ed., Upper Saddle River: Prentice-Hall, 1999. [22] Y.-C. Ho and R.L. Kashyap, “An Algorithm for Linear Inequalities and its Applications,” IEEE Trans. Elec. Comp., Vol. 14, pp. 683-688, 1965. [23] Y.-C. Ho and R.L. Kashyap, “A Class of Iterative Procedures for Linear Inequalities,” J.SIAM Control., Vol. 4, pp. 112-115, 1966. [24] P.J. Huber, Robust Statistics, New York: Wiley, 1981. [25] S. Horikawa, T. Furuhashi and Y. Uchikawa, “On Fuzzy Modeling Using Fuzzy Neural Networks with the Back-propagation Algorithm,” IEEE Trans. Neur. Net., Vol. 4, pp. 801-806, 1992. [26] J.R. Jang and C. Sun, “Functional Equivalence Between Radial Basis Function and Fuzzy Inference Systems,” IEEE Trans. Neur. Net., Vol. 4, pp. 156-159, 1993. [27] J.R. Jang and C. Sun, “Neuro-Fuzzy Modeling and Control,” Proc. IEEE, Vol. 83, pp. 378-406, 1995. [28] J.R. Jang, C. Sun and F. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Upper Saddle River: Prentice-Hall, 1997. [29] T. Joachims, “Making Large-Scale Support Vector Machine Learning Practical,” In: B. Sch¨ olkopf, J.C. Burges and A.J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning. Cambridge MA: MIT Press, pp. 169-184, 1999. [30] E. Kerre, “A Comparative Study of the Behavior of Some Popular Fuzzy Implication Operators on the Generalized Modus Ponens,” In: L.A. Zadeh and J. Kacprzyk, (Eds.), Fuzzy Logic for the Management of Uncertainty. New York: Wiley, 1992. [31] B. Kosko, “Fuzzy Associative Memories,” In: A. Kandel, (Ed.), Fuzzy Expert Systems. Boca Raton: CRC Press, 1987.
76
JACEK M. L Ã ESKI
[32] P. de Larminat and Y. Thomas, Automatique des Systemes Lineaires, 2. Identification. Paris: Flammarion Sciences, 1977. [33] J. L Ã eski and E. CzogaÃla, “A New Artificial Neural Network Based Fuzzy Inference System with Moving Consequents in If-Then Rules,” BUSEFAL, Vol. 71, pp. 72-81, 1997. [34] J. L Ã eski and E. CzogaÃla, “A New Artificial Neural Network Based Fuzzy Inference System with Moving Consequents in If-Then Rules and its Applications,” Fuzzy Sets and Systems, Vol. 108, No.3, pp. 289-297, 1999. Ã eski and E. CzogaÃla, “A New Fuzzy Inference System Based on Artificial Neural [35] J. L Network and its Application,” In: L.A. Zadeh and J. Kacprzyk (Eds.), Computing with Words in Information/Intelligent Systems, Vol.II: Applications. Heidelberg: Springer-Verlag, pp. 75-94, 1999. Ã eski and N. Henzel, “A Neuro-Fuzzy System Using Logical Interpretation of [36] J. L If-Then Rules and its Application to Diabetes Mellitus Forecasting,” Archives of Control Sciences, Vol. 9, No. 1-2, pp. 107-122, 1999. [37] J. L Ã eski, “Robust Possibilistic Clustering,” Archives of Control Sciences, Vol. 10, No. 3-4, pp.141-155, 2000. [38] E. CzogaÃla, J. L Ã eski and Y. Hayashi, “A Classifier Based on Neuro-Fuzzy Inference System”, Journal of Advanced Computational Intelligence, Vol. 3, No. 4, pp. 282288, 2000. [39] J. L Ã eski and N. Henzel, “A Neuro-Fuzzy System Based on Logical Interpretation of If-Then Rules,” In: M. Russo and L.C. Jain (Eds.), Fuzzy Learning and Applications. Boca Raton, London: CRC Press, pp. 359-388, 2001. [40] J. L Ã eski, “An ε-Insensitive Approach to Fuzzy Clustering,” Int. J. Appl. Math.Comp.Sci., Vol. 11, No. 4, pp. 993-1007, 2001. [41] J. L Ã eski, “Towards a Robust Fuzzy Clustering”, Fuzzy Sets and Systems (in print). [42] J. L Ã eski, “ ε-insensitive Fuzzy c-regression Models: Introduction to ε-insensitive Fuzzy Modeling,” IEEE Trans. Systems, Man & Cybernetics (in print). [43] J. L Ã eski, “Improving Generalization Ability of Neuro-Fuzzy System by ε-insensitive Learning,” Int. J. Appl. Math. Comp. Sci. , Vol.12, No. 3, pp.101-110, 2002. [44] J. L Ã eski, “Neuro-fuzzy System with Learning Tolerant to Imprecision,” Fuzzy Sets and Systems (in print). [45] H. Maeda, “An Investigation on the Spread of Fuzziness in Multi-Fold Multi-Stage Approximate Reasoning by Pictorial Representation — Under Sup-Min Composition and Triangular Type Membership Function,” Fuzzy Sets and Systems, Vol. 80, pp. 133-148, 1996. [46] S. Mitra and S.K. Pal, “Fuzzy Multi-Layer Perceptron, Inferencing and Rule Generation,” IEEE Trans. Neur. Net., Vol. 6, pp. 51-63, 1995. [47] M. Mizumoto and H.-J. Zimmermann, “Comparison of Fuzzy Reasoning Methods,” Fuzzy Sets and Systems, Vol. 8, pp. 253-283, 1982. [48] A. Navia-V´ azquez, F. P´ erez-Cruz, A. Art´ es-Rodr´ıguez and A.R. Figueiras-Vidal, “Weighted Least Squares Training of Support Vector Classifiers Leading to Compact and Adaptive Schemes,” IEEE Trans. Neur. Nets., Vol. 12, No. 5, pp. 1047-1059, 2001. [49] E. Osuna, R. Freund and F. Girosi, “An Improved Training Algorithm for Support Vector Machines,” Proc. IEEE Workshop on Neural Networks for Signal Processing, pp. 276-285, 1997. [50] N.R. Pal and J.C. Bezdek, “On Cluster Validity for the Fuzzy c-means Model,” IEEE Trans. Fuzzy Systems, Vol. 3, pp. 370-379, 1995.
ε-INSENSITIVE LEARNING TECHNIQUES
77
[51] W. Pedrycz, “An Identification Algorithm in Fuzzy Relational Systems,” Fuzzy Sets and Systems, Vol. 13, pp. 153-167, 1984. [52] J. Platt, “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines,” In: B. Sch¨ olkopf, J.C. Burges and A.J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, Cambridge MA: MIT Press, pp. 185-208, 1999. [53] B.R. Ripley, Pattern Recognition and Neural Network. Cambridge: Cambridge University Press, 1996. [54] M. Setnes, “Supervised Fuzzy Clustering for Rule Extraction,” IEEE Trans. Fuzzy Systems, Vol. 8, No. 4, pp. 416-424, 2000. [55] M. Sugeno and G.T. Kang, “Structure Identification of Fuzzy Model,” Fuzzy Sets and Systems, Vol. 28, pp. 15-33, 1988. [56] H. Takagi and M. Sugeno, “Fuzzy Identification of Systems and its Application to Modeling and Control,” IEEE Trans. Systems, Man & Cybernetics, Vol. 15, No. 1, pp. 116-132, 1985. [57] E. Trillas and L. Valverde, “On Implication and Indistinguishability in the Setting of Fuzzy Logic,” In: J. Kacprzyk and R.R. Yager (Eds.), Management Decision Sup¨ Rheinland, port Systems using Fuzzy Sets and Possibility Theory. K¨ oln: Verlag TUV 1985. [58] L.-X. Wang, A Course in Fuzzy Systems and Control. New York: Prentice-Hall, 1998. [59] S. Weber, “A General Concept of Fuzzy Connectives, Negations and Implications Based on t-norms and t-conorms,” Fuzzy Sets and Systems, Vol. 11, pp. 115-134, 1983. [60] L. Wang and J.M. Mendel, “Generating Fuzzy Rules by Learning From Examples,” IEEE Trans. Systems, Man & Cybernetics, Vol. 22, pp. 1414-1427, 1992. [61] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998. [62] V. Vapnik, “An Overview of Statistical Learning Theory,” IEEE Trans. Neur. Nets., Vol. 10, No. 5, pp.988-999, 1999. [63] X.L. Xie and G. Beni, “A Validity Measure for Fuzzy Clustering,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 13, No. 8, pp. 841-847, 1991. [64] R.R. Yager, “On the Interpretation of Fuzzy If-Then Rules,” Applied Intelligence, Vol. 6, pp. 141-151, 1996. [65] J. Yen, L. Wang and C.W. Gillespie, “Improving the Interpretability of TSK Fuzzy Models by Combining Global Learning and Local Learning,” IEEE Trans. Fuzzy Systems, Vol. 6, No. 4, pp. 530-537, 1998. [66] L.A. Zadeh, “Fuzzy Sets,” Information and Control, Vol. 8, pp. 338-353, 1964. [67] L.A. Zadeh, “Outline of a New Approach to the Analysis of Complex Systems and Decision Processes,” IEEE Trans. Systems, Man & Cybernetics, Vol. 3, No. 1, pp. 28-44, 1973. Institute of Electronics, Technical University of Silesia, Akademicka 16, 44-100 Gliwice, Poland E-mail address:
[email protected]