EUSFLAT - LFA 2005
Improving the Interpretability of Data-Driven Evolving Fuzzy Systems

Edwin Lughofer, Johannes Kepler University Linz, [email protected]
Eyke Hüllermeier, Otto-von-Guericke-University Magdeburg, [email protected]
Erich Peter Klement, Johannes Kepler University Linz, [email protected]
Abstract

This paper develops methods for reducing the complexity and, thereby, improving the linguistic interpretability of Takagi-Sugeno fuzzy systems that are learned online in a data-driven, incremental way. In order to ensure the transparency of the evolving fuzzy system at any time, complexity reduction must be performed in an online mode as well. Our methods are evaluated on high-dimensional data coming from an industrial measuring process.

Keywords: Incremental learning, evolving fuzzy systems, complexity reduction.
1 Introduction
Takagi-Sugeno (TS) fuzzy systems [10] play an important role in system modelling and identification, as they combine potentially high approximation accuracy [11] with linguistic interpretability. Recently, [5] has pointed out the importance of identifying TS fuzzy systems in a data-driven, online manner. Among other things, this requires incremental learning methods that update a model whenever new observations have been made, referring only to the current model and the new data but not to old observations. In contrast to batch learning algorithms, which train a model (from scratch) using all of the data observed so far (e.g. [1]), incremental methods guarantee fast training of a model and thus qualify for online applications.

When learning fuzzy models in a data-driven way, the focus is usually on high approximation accuracy. Unfortunately, accurate models are often complex at the same time, hence aspects of transparency and readability necessarily suffer [5]. This motivates the consideration of methods for reducing the complexity and, thereby, improving the interpretability of fuzzy models. In this paper, corresponding methods will be developed for evolving TS fuzzy systems. Even though we shall focus on FLEXFIS [6, 5], a specific variant for incremental learning of TS models, our methods are of a more general nature. In principle, it should be possible to use them, perhaps in a slightly modified way, for other approaches as well.

FLEXFIS implements the Fuzzy Basis Function Network architecture, which is a specific case of a TS fuzzy system with multi-dimensional input $\vec{x} = (x_1, \dots, x_p)$ and a single output variable $y$. Thus, the latter is given by

$$\hat{f}(\vec{x}) = \hat{y} = \sum_{i=1}^{C} l_i(\vec{x})\,\Psi_i(\vec{x}) \qquad (1)$$
where $C$ is the number of rules, and the basis functions $\Psi_i$ are Gaussian kernels:

$$\Psi_i(\vec{x}) = \frac{\exp\left(-\frac{1}{2}\sum_{j=1}^{p}\frac{(x_j - c_{ij})^2}{\sigma_{ij}^2}\right)}{\sum_{k=1}^{C}\exp\left(-\frac{1}{2}\sum_{j=1}^{p}\frac{(x_j - c_{kj})^2}{\sigma_{kj}^2}\right)} \qquad (2)$$
Moreover, the consequent functions are defined as

$$l_i(\vec{x}) = w_{i0} + w_{i1}x_1 + w_{i2}x_2 + \dots + w_{ip}x_p. \qquad (3)$$
Note that a multi-dimensional Gaussian kernel represents the premise part of a fuzzy rule, the antecedents (one-dimensional Gaussian fuzzy sets) of which are combined by means of the product t-norm.

Learning a fuzzy model as defined above means fitting all of its parameters to the data: the number of rules ($C$), the centers ($c_{ij}$) and widths ($\sigma_{ij}$) of the multivariate Gaussian kernels, and the parameters appearing in the rule consequents as output weights ($w_{i0}, w_{i1}, \dots, w_{ip}$). As mentioned above, these parameter estimation problems have to be solved in an incremental manner.
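As an illustration of how model (1)-(3) is evaluated, the following sketch (ours, not part of the original FLEXFIS implementation; all function and variable names are hypothetical) computes a prediction for a single input vector:

```python
import numpy as np

def ts_predict(x, centers, widths, weights):
    """Evaluate a Takagi-Sugeno fuzzy model according to eqs. (1)-(3).

    x       : (p,)     input vector
    centers : (C, p)   cluster centers c_ij
    widths  : (C, p)   kernel widths sigma_ij
    weights : (C, p+1) consequent parameters w_i0, ..., w_ip per rule
    """
    # Unnormalized multivariate Gaussian kernels (numerator of eq. (2))
    act = np.exp(-0.5 * np.sum(((x - centers) / widths) ** 2, axis=1))
    psi = act / act.sum()                        # normalized basis functions, eq. (2)
    cons = weights[:, 0] + weights[:, 1:] @ x    # linear consequents l_i(x), eq. (3)
    return float(psi @ cons)                     # fuzzy-weighted sum, eq. (1)
```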
2 Basics of FLEXFIS
FLEXFIS consists of two main components, one for estimating the parameters that specify the antecedent parts of the rules in model (1) and one for updating the parameters in the rule consequents. These two problems are handled separately, because they are of a quite different nature: the first is an inherently nonlinear problem, while the second can be approached by linear regression techniques.

To solve the first problem, FLEXFIS exploits vector quantization in an incremental mode, combined with the idea of ART networks: whenever a new observation (data point) $\vec{x}$ arrives, its distance to all existing cluster centers is compared to a so-called vigilance parameter $\rho$. The observation initializes a new cluster if all distances are larger than $\rho$. Otherwise, it is assigned to the nearest cluster center, $\vec{c}_{win}$, which is then updated as follows:

$$\vec{c}_{win}^{\,(new)} = \vec{c}_{win}^{\,(old)} + \eta\,(\vec{x} - \vec{c}_{win}^{\,(old)}), \qquad (4)$$
with $\eta$ being the learning rate; the latter decreases with the number $k_{win}$ of data points lying nearest to $\vec{c}_{win}$ (this number is simply updated through counting). Moreover, the widths $\sigma_{win,j}$ are adapted by exploiting a recursive variance formula:

$$k_{win}\,\sigma_{win,j}^2 = (k_{win} - 1)\,\sigma_{win,j}^2 + k_{win}\,(\Delta c_{win,j})^2 + (c_{win,j} - x_{kj})^2, \qquad (5)$$

where $\Delta c_{win,j}$ is the distance between the $j$th entries of the vectors $\vec{c}_{win}^{\,(new)}$ and $\vec{c}_{win}^{\,(old)}$.
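For concreteness, here is a minimal sketch (ours) of one such update step for the winning cluster. The decaying learning rate $\eta = 1/k_{win}$ and the use of the updated center in the last term of (5) are assumptions, as the paper only states that $\eta$ decreases with $k_{win}$:

```python
import numpy as np

def update_winning_cluster(x, c_win, var_win, k_win):
    """One incremental update of the winning cluster, eqs. (4) and (5).

    x       : (p,) new data point
    c_win   : (p,) current center of the nearest cluster
    var_win : (p,) current squared widths sigma_win,j^2
    k_win   : number of points assigned to this cluster so far
    """
    k_win += 1
    eta = 1.0 / k_win                    # assumed form of the decaying learning rate
    c_new = c_win + eta * (x - c_win)    # center update, eq. (4)
    delta = c_new - c_win                # Delta c_win,j, shift of each center entry
    # Recursive variance formula, eq. (5), solved for the new variance;
    # the last term uses the updated center (assumption)
    var_new = ((k_win - 1) * var_win + k_win * delta**2 + (c_new - x)**2) / k_win
    return c_new, var_new, k_win
```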
Each new cluster gives rise to a new rule, i.e., there is a one-to-one correspondence between clusters and rules: the fuzzy sets that appear in the antecedent part of a rule are obtained by projecting a cluster to the various dimensions of the input space, and these fuzzy sets are connected with the product t-norm, see (2). A new cluster, consisting of a single point $\vec{x}$, is initialized with this point as its center; moreover, the width in the $j$th dimension is set to $\epsilon \cdot (u_j - l_j)$, where $\epsilon$ is a small constant and $[l_j, u_j]$ the range of the $j$th input variable.

Updating the parameters in the rule consequents is accomplished by means of a recursive weighted least squares (RWLS) estimation. In our approach, the parameters are estimated for each rule individually, which turned out to have several advantages [5]. This is why the weighted version of the well-known RLS estimation [4] has to be used: the influence of a data point on the parameter estimation is proportional to its membership in the corresponding cluster. Finally, in order to improve the adaptation to non-steady functional relationships that change over time, a "forgetting factor" $\lambda$ is used that limits the influence of old observations. This leads to the following update scheme for the linear consequent parameters:
$$\hat{\vec{w}}_i(k+1) = \hat{\vec{w}}_i(k) + \gamma(k)\left(y(k+1) - \vec{r}^{\,T}(k+1)\,\hat{\vec{w}}_i(k)\right) \qquad (6)$$

$$\gamma(k) = \frac{P_i(k)\,\vec{r}(k+1)}{\frac{\lambda}{\Psi_i(\vec{x}(k+1))} + \vec{r}^{\,T}(k+1)\,P_i(k)\,\vec{r}(k+1)} \qquad (7)$$

$$P_i(k+1) = \left(I - \gamma(k)\,\vec{r}^{\,T}(k+1)\right) P_i(k)\,\frac{1}{\lambda} \qquad (8)$$

with $P_i(k) = (R_i(k)^T Q_i(k) R_i(k))^{-1}$ the weighted inverse Hessian matrix and $\vec{r}(k+1) = [1\; x_1(k+1)\; x_2(k+1)\; \dots\; x_p(k+1)]^T$ the regressor values of the $(k+1)$th data point (which are identical for all $C$ rules).
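The following sketch (ours; variable names are hypothetical, and $\lambda = 0.99$ is merely an assumed value) performs one RWLS step for a single rule according to (6)-(8):

```python
import numpy as np

def rwls_step(w, P, x, y, psi_i, lam=0.99):
    """One recursive weighted least squares step for rule i, eqs. (6)-(8).

    w     : (p+1,)     current consequent parameters of rule i
    P     : (p+1, p+1) current weighted inverse Hessian matrix P_i(k)
    x, y  : new input vector and target value
    psi_i : basis function value Psi_i(x(k+1)), weighting the sample
    lam   : forgetting factor (assumed value)
    """
    r = np.concatenate(([1.0], x))                       # regressor r(k+1)
    gamma = (P @ r) / (lam / psi_i + r @ P @ r)          # gain vector, eq. (7)
    w = w + gamma * (y - r @ w)                          # parameter update, eq. (6)
    P = (np.eye(r.size) - np.outer(gamma, r)) @ P / lam  # matrix update, eq. (8)
    return w, P
```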
Note that incremental learning of rule consequents using RWLS actually assumes stable rule premises. However, since the premise parts are adapted as well, the RWLS estimates need to be corrected. Even though this can in principle be done by incorporating corresponding correction terms, it is not possible in online training, since these terms cannot be derived incrementally. Still, it can be shown that, when omitting the correction terms (setting them to 0), our algorithm still converges for a typical parameter adjustment, especially with respect to $\eta$ in (4), and delivers a suboptimal solution in the least squares sense [5, 6].
3 Improving Interpretability
The main objective of FLEXFIS as outlined above is to learn highly accurate models. The aspect of interpretability, on the other hand, has been neglected so far. Consequently, there is a considerable risk of obtaining models that are hardly more understandable than "black box" models such as, e.g., neural networks. The strategies and algorithms proposed in this section are meant to overcome this problem, at least to some extent.

3.1 Interpretability of Rule Consequents
To guarantee the interpretability of an individual rule consequent, the corresponding linear function should first of all provide a good local approximation to the data in the input/output space. Both formal and empirical investigations have shown that this requirement is often violated when using a global approach to parameter estimation. Here, 'global' means that the parameters of all $C$ rule consequents in (1) are adapted simultaneously to the data. This way, an optimal approximation quality can be achieved. However, due to the induced interaction between rules, a single rule consequent, considered in isolation, might no longer provide a good (local) approximation to the data. In the local estimation approach as outlined in the previous section, the parameters are estimated separately for each rule consequent. On the one hand, the overall approximation quality, as induced by the complete rule base, might deteriorate. On the other hand, the local approach guarantees that individual rule consequents fit the data rather nicely. See [5] for a more detailed comparison between global and local estimation.

The aforementioned effects are illustrated in Fig. 1. As can be seen, the linear rule consequents (shown as line segments) 'break out' in the global approach, some of them lying far away from the data and, hence, from the graph of the original function. In contrast, the local approach (with initial values of all linear parameters set to 0 and the corresponding inverse Hessian matrix set to $\alpha I$ for large enough $\alpha$) yields rule consequents in the immediate vicinity of the graph.

Figure 1: A sinusoidal relationship approximated with the local (left) and the global approach (right).
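To make the contrast explicit, the following batch-mode sketch (ours, under the assumption of fixed antecedents; it is not the incremental procedure used by FLEXFIS) juxtaposes the two estimators:

```python
import numpy as np

def global_and_local_consequents(X, y, Psi):
    """Batch comparison of global vs. local consequent estimation.

    X   : (N, p) inputs, y : (N,) targets
    Psi : (N, C) basis function values Psi_i(x_k) for fixed antecedents
    Returns (W_global, W_local), each of shape (C, p+1).
    """
    N, p = X.shape
    C = Psi.shape[1]
    R = np.hstack([np.ones((N, 1)), X])                 # regressors [1, x]

    # Global: all C consequents fitted jointly through one regression
    A = np.hstack([Psi[:, [i]] * R for i in range(C)])  # (N, C*(p+1))
    W_global = np.linalg.lstsq(A, y, rcond=None)[0].reshape(C, p + 1)

    # Local: one weighted regression per rule, weights = memberships
    W_local = np.empty((C, p + 1))
    for i in range(C):
        sw = np.sqrt(Psi[:, i])                         # sqrt-weights for LS
        W_local[i] = np.linalg.lstsq(sw[:, None] * R, sw * y, rcond=None)[0]
    return W_global, W_local
```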
3.2 Interpretability of Fuzzy Sets
In this section, an attempt is made to ensure a set of reasonable properties for the membership functions of the fuzzy sets appearing in the rule premises. In particular, these properties should guarantee linguistically interpretable partitions of the input variables. In this regard, we refer to four semantic properties found to be crucial in [7]: a moderate number of membership functions, distinguishability, normality and unimodality, and coverage.

Normality and unimodality as well as coverage of the input space are obviously fulfilled when choosing Gaussian fuzzy sets. The number of fuzzy sets, however, usually depends on the degree of nonlinearity of the underlying functional relationship. In FLEXFIS, this number can be controlled, at least to some extent, by the vigilance parameter $\rho$ (see Section 2): the larger $\rho$, the fewer clusters are created; fewer clusters in turn produce fewer rules and, hence, fewer fuzzy sets in each dimension.

The main problem of FLEXFIS is distinguishability. In fact, by projecting high-dimensional clusters to the one-dimensional axes of the input space, strongly overlapping fuzzy sets might be produced. For the two-dimensional case, this problem is illustrated in the left part of Fig. 2.

Figure 2: Clusters causing two strongly overlapping sets (left) and sets with close modal values (right).

A straightforward idea to avoid the above problem is to merge very similar fuzzy sets. Putting this idea into practice of course presupposes a suitable similarity measure. A standard measure in this regard is the so-called Jaccard index, which defines the similarity between two fuzzy sets $A$ and $B$ as follows:

$$S(A,B) = \frac{\int (\mu_A \cap \mu_B)(x)\,dx}{\int (\mu_A \cup \mu_B)(x)\,dx}, \qquad (9)$$
where the intersection in the numerator is given by the pointwise minimum of the membership functions, and the union in the denominator by the pointwise maximum. (In practice, the integrals are of course approximated numerically.) On the basis of this measure, an algorithm for merging fuzzy sets has been proposed in [9]. Fortunately, this algorithm is directly applicable within our incremental framework, as it does not need any information about previous data points. In our approach, two Gaussian fuzzy sets are merged into a new Gaussian kernel with the following parameters:

$$\mu_{new} = (\max(U) + \min(U))/2, \qquad (10)$$

$$\sigma_{new} = (\max(U) - \min(U))/2, \qquad (11)$$

where $U = \{\mu_A \pm \sigma_A,\; \mu_B \pm \sigma_B\}$. The idea underlying this definition is to reduce the approximate merging of two Gaussian kernels to the exact merging of two of their $\alpha$-cuts, for a specific value of $\alpha$. Here, we choose $\alpha = \exp(-1/2) \approx 0.6$, which is the membership degree of the inflection points $\mu \pm \sigma$ of a Gaussian kernel with parameters $\mu$ and $\sigma$.
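A minimal sketch (ours) of the similarity test and the merge follows; the merging threshold of 0.8 is an assumed value, not one prescribed by the paper:

```python
import numpy as np

def jaccard(mu_a, mu_b, grid):
    """Numerically approximated Jaccard similarity of two fuzzy sets, eq. (9)."""
    a, b = mu_a(grid), mu_b(grid)
    return np.trapz(np.minimum(a, b), grid) / np.trapz(np.maximum(a, b), grid)

def merge_gaussians(m_a, s_a, m_b, s_b):
    """Merge two Gaussian fuzzy sets via their alpha-cuts, eqs. (10) and (11)."""
    U = (m_a - s_a, m_a + s_a, m_b - s_b, m_b + s_b)
    return (max(U) + min(U)) / 2, (max(U) - min(U)) / 2  # mu_new, sigma_new

# Example: merge two sets if their similarity exceeds the assumed threshold
gauss = lambda m, s: (lambda x: np.exp(-0.5 * ((x - m) / s) ** 2))
grid = np.linspace(-5.0, 5.0, 2001)
if jaccard(gauss(0.0, 1.0), gauss(0.4, 1.1), grid) > 0.8:
    mu_new, sigma_new = merge_gaussians(0.0, 1.0, 0.4, 1.1)
```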
Next, we consider the situation where the modal values of adjacent fuzzy sets are close to each other. Since VQ-ART generates axis-parallel clusters, this situation can occur in the vicinity of very steep parts of the function to be approximated: the only chance to approximate such steep regions is to cover them with a relatively large number of such clusters (see Fig. 2). From Fig. 2 it becomes obvious that the narrow fuzzy sets on the right can be merged, as each of them triggers a rule consequent with almost the same slope. The FuZion algorithm is a routine that merges consecutive triangular membership functions whenever their modal values are close enough [2]. In FLEXFIS, we use this algorithm in a modified way: firstly, two fuzzy sets $A$ and $B$, both referring to the $j$th input variable, are merged only if the (partial) slopes of the consequents of the rules in which these fuzzy sets occur are approximately equal, i.e., if $w_{ij} \approx w_{kj}$ whenever $A$ occurs in the premise of the $i$th rule and $B$ in the premise of the $k$th rule. Thus, a merging is prevented in the case of a highly fluctuating functional relationship. Secondly, we merge Gaussian instead of triangular fuzzy sets, which can be done by extending (10)-(11) in a canonical way.¹

When applying this version of the FuZion algorithm to the fuzzy partition shown in Fig. 2 (right), one obtains the fuzzy sets (and the corresponding approximation) in Fig. 3. In this example, the approximation accuracy hardly suffered (compare solid vs. dotted lines), while the number of fuzzy sets could be reduced from 7 to 6 resp. 5 for different threshold values.

Figure 3: Approximation and fuzzy sets obtained by merging two sets (left) and three sets which are close to each other (right).

¹ In the current implementation, the width of the new fuzzy set is still defined in a slightly different way, as $\sigma_{new} = \max(|c_{new} - c_a|, |c_{new} - c_b|) + \max(\sigma_a, \sigma_b)$, where $c_a$ is the leftmost center, $c_b$ the rightmost, and $c_{new}$ the new one.
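As a sketch of the additional merging condition (ours; the relative tolerance is an assumed parameter, since the paper only requires $w_{ij} \approx w_{kj}$):

```python
def slopes_compatible(W, i, k, j, tol=0.1):
    """Check the modified FuZion precondition w_ij ~= w_kj for merging the
    fuzzy sets of input variable j occurring in rules i and k.

    W   : (C, p+1) consequent parameters, column 0 being the intercept w_i0
    tol : assumed relative tolerance on the partial slopes
    """
    wi, wk = W[i][j + 1], W[k][j + 1]   # partial slopes w.r.t. variable j
    return abs(wi - wk) <= tol * max(abs(wi), abs(wk), 1e-12)
```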
3.3 Rule Base Reduction

One possibility to reduce the number of rules is to delete rules that become redundant in the course of the iterative merging process of the fuzzy sets as outlined above. A well-known method for accomplishing this task is the fuzzy system simplification (FSS) algorithm [8]. Yet, this algorithm suffers from the drawback that, when removing redundant rules, it needs all of the training data for re-estimating the rule consequents of the replacing rule. It is hence not applicable in an online mode. For this reason, we developed a variant that builds models according to the two-layer architecture shown in Fig. 4.

Figure 4: A two-layer architecture for model building.

The first layer updates the original model, i.e., the model that has been trained using the techniques from Section 2. This model is maximally accurate, which might be of critical importance for applications in fields like fault detection, prediction, or control. The second layer improves the interpretability of the updated model whenever insight into the system behavior becomes a major issue. To this end, FSS is employed in combination with a merging strategy that can be used in an online mode:

$$w_{(new)j} = \frac{w_{1j}\,k_1 + \dots + w_{qj}\,k_q}{k_1 + \dots + k_q} \qquad (12)$$

for $j = 0, \dots, p$, where $w_{ij}$ is the parameter in the consequent of the $i$th redundant rule, pertaining to the $j$th input variable, and $k_i$ is the number of data points belonging to the corresponding cluster. Thus, the parameters of the new rule consequent are defined as a weighted average of the consequent parameters of the redundant rules, with the weights representing the relevance of the rules. This merging strategy can obviously be applied in online mode, as it does not require any training data.

Note that it would be possible to submit the improved model to the incremental learning process for new incoming data points (see the feedback path in Fig. 4 represented by the dotted line). However, this would entail a worse approximation quality, as the inverse Hessian matrix for the newly obtained rule, which is required for an accurate update of the rule consequent parameters, would have to be re-initialized to $\alpha I$ (merging the Hessian matrices of the redundant rules is extremely difficult) [5].
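A one-line sketch (ours) of this weighted averaging of the $q$ redundant consequents according to eq. (12):

```python
import numpy as np

def merge_redundant_consequents(W, k):
    """Weighted average of redundant rule consequents, eq. (12).

    W : (q, p+1) consequent parameters w_ij of the q redundant rules
    k : (q,)     cluster sizes k_i acting as relevance weights
    """
    return (k[:, None] * W).sum(axis=0) / k.sum()  # (p+1,) merged consequent
```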
4 Empirical Evaluation
In order to examine the effectiveness of the improvements suggested in the previous section, we have compared the original version of FLEXFIS with the extended variant, FLEXFIS*. In the experiments we used data coming from a diesel engine, recorded at an engine test bench. The original data set, which was split into a training set of 1810 and a test set of 136 samples, contained 80 measurement channels. However, 18 channels were neglected due to missing data. For each of the remaining 62 channels, a fuzzy model was trained, using that channel as the output variable and a subset of at most 5 of the other channels as input variables. In each case, the subset was determined by means of a feature selection technique [3].

The accuracy of FLEXFIS and FLEXFIS* was measured in terms of the average of the adjusted R-squared values obtained for the 62 fuzzy models. Likewise, complexity was measured in terms of the average number of fuzzy sets and rules.

Table 1: Comparison between FLEXFIS with and without interpretability improvements

Method   | Quality (Test) | Av. No. of Sets | Av. No. of Rules
FLEXFIS  | 0.856          | 7.18            | 7.18
FLEXFIS* | 0.853          | 4.39            | 7.07

From Table 1 it becomes obvious that the fuzzy set and rule merging strategies presented in the previous section lead to significant improvements regarding model complexity. In particular, the average number of fuzzy sets per input dimension could be decreased from about seven to four. At the same time, the approximation accuracy could be maintained at a high level.

The complexity reduction was even stronger for the housing data from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). When approximating the output variable in this data set with the five most relevant inputs, the average number of fuzzy sets was reduced from 13 to 5.8. The model quality, measured in terms of the mean squared error between predicted and measured outputs, decreased by only 0.87%. In this connection, let us note that for these data sets, FLEXFIS is also competitive with conventional batch learning methods (see [5]).
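For reference, the adjusted R-squared underlying this quality measure is the standard statistic; its exact form is not restated in the paper, so we give the common definition for $N$ samples and $p$ regressors:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - p - 1}.$$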
5 Conclusion
This paper has presented methods which aim at reducing the complexity of data-driven, incremental Takagi-Sugeno fuzzy models. In particular, we have addressed the problems of obtaining interpretable rule consequents, guaranteeing linguistically interpretable fuzzy partitions in each input dimension, and reducing the number of rules while maintaining a reasonably high approximation accuracy. Experiments with high-dimensional, real-world data sets have shown these methods to be effective in practice.
References

[1] R. Babuska. Fuzzy Modeling for Control. Kluwer Academic Publishers, Boston, 1998.

[2] J. Espinosa and J. Vandewalle. Constructing fuzzy models with linguistic integrity from numerical data - AFRELI algorithm. IEEE Transactions on Fuzzy Systems, 8:591-600, 2000.

[3] W. Groißböck, E. Lughofer, and E.P. Klement. A comparison of variable selection methods with the main focus on orthogonalization. In Proceedings of the SMPS Conference 2004, Oviedo, Spain, 2004.

[4] L. Ljung. System Identification: Theory for the User. Prentice Hall PTR, Upper Saddle River, New Jersey, 1999.

[5] E. Lughofer. Data-Driven Incremental Learning of Takagi-Sugeno Fuzzy Models. PhD thesis, Department of Knowledge-Based Mathematical Systems, Johannes Kepler University Linz, February 2005.

[6] E. Lughofer and E.P. Klement. FLEXFIS: A variant for incremental learning of Takagi-Sugeno fuzzy systems. In Proceedings of FUZZ-IEEE 2005, Reno, Nevada, U.S.A., 2005. To appear.

[7] J. Valente de Oliveira. Semantic constraints for membership function optimization. IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans, 29(1):128-138, 1999.

[8] M. Setnes. Simplification and reduction of fuzzy rules. In J. Casillas, O. Cordón, F. Herrera, and L. Magdalena, editors, Interpretability Issues in Fuzzy Modeling, volume 128 of Studies in Fuzziness and Soft Computing, pages 278-302. Springer, Berlin, 2003.

[9] M. Setnes, R. Babuska, U. Kaymak, and H.R. van Nauta Lemke. Similarity measures in fuzzy rule base simplification. IEEE Transactions on Systems, Man and Cybernetics - Part B, 28:376-386, 1998.

[10] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man and Cybernetics, 15(1):116-132, 1985.

[11] L.X. Wang. Fuzzy systems are universal approximators. In Proceedings of the IEEE International Conference on Fuzzy Systems, pages 1163-1169, 1992.