Data-Driven Initialization and Structure Learning in Fuzzy Neural Networks
M. Setnes*, A. Koene, R. Babuška, P.M. Bruijn
Control Laboratory, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands
tel: +31 15 278 3371, fax: +31 15 278 6679, email: [email protected]
*The work is partly supported by the Research Council of Norway.
Abstract
Initialization and structure learning in fuzzy neural networks for data-driven rule-based modeling are discussed. Gradient-based optimization is used to fit the model to data, and a number of techniques are developed to enhance the transparency of the generated rule base: data-driven initialization, similarity analysis for redundancy reduction, and evaluation of the rules' contributions. The initialization uses flexible hyper-boxes to avoid redundant and irrelevant coverage of the input space. Similarity analysis detects redundant terms, while the contribution evaluation detects irrelevant rules. Both are applied during network training for early pruning of redundant or irrelevant terms and rules, excluding them from further parameter learning (training). All steps of the modeling method are presented, and the method is illustrated on an example from the literature.
1 Introduction
Much previous work in the field of data-driven machine learning has concentrated on quantitative approximation, while paying little attention to the qualitative properties of the resulting rules. This has resulted in models that are able to reproduce input-output characteristics, but provide little insight into the actual working of the modeled process [1, 2]. The resulting rule base is often not much different from a black-box neural network model, and the rules are unsuitable for other purposes like, e.g., expert systems. Recently there have been attempts to remedy this situation by reducing the number of generated rules and maintaining a transparent rule structure. Examples of such approaches are adaptive spline modeling (ASMOD) [3] and fuzzy neural networks (FNN) with structure learning [4]. In ASMOD, reduction of the number of rules is achieved by generating global rules rather than local ones. In the FNN studied in [4], model transparency is sought through redundancy pruning: similarity analysis is applied to the learnt rules to detect redundancies. We propose using a similar idea during the learning process. By moving the redundancy detection
to an earlier stage in the modeling process, less effort is spent on training redundant rules. If redundancy in terms of similar membership functions (compatible terms) is detected, these are replaced by a single membership function (a common, generalized term). This approach has proven useful in complexity reduction of fuzzy rule bases in general [5]. Further, we introduce evaluation of the rules' cumulative relative contribution to detect rules that deal with situations that do not occur in the modeled process. The proposed initialization of the FNN is a data-driven method that utilizes flexible hyper-boxes to reduce redundancy and irrelevant coverage in the initial partitioning of the input space. The initialization can include expert knowledge in the form of partially or fully known qualitative and/or quantitative rules, and a priori information about bounds of (parts of) the input space. Figure 1a shows an overview of the process. The modeling method described in this paper has been developed for off-line data-driven modeling of MISO systems. Extension to MIMO systems can in general be achieved by several MISO models in parallel, while on-line application requires some structural changes (see Fig. 1a). The next section presents the FNN structure used, while Section 3 investigates the various stages of the model identification process. In Section 4, the modeling method is applied to a simple example known from the literature. Section 5 concludes the paper and gives some remarks on further studies.
2 FNN network structure
The generated fuzzy rules are of the following structure:
    R_i:  IF x_1 is A^i_{j_1} and ... x_d is A^i_{j_d} and ... x_D is A^i_{j_D} THEN c_i,
where R_i is the ith rule, x_d ∈ X_d is the input in dimension d, and c_i is the (singleton) consequent value of the ith rule. The antecedents are fuzzy sets defined on the domain of the input x_d, and they are indexed j_d = 1, 2, ..., J_d, where J_d is the number of antecedent fuzzy sets defined on X_d. Note that the same antecedent function A_{j_d} can be used in several rules.
Only AND connectives are used. Though OR connectives can reduce the number of rules, the resulting increase in complexity complicates the learning and makes the rules less transparent. The equivalent network structure is shown in Fig. 1b for a two-input case where J_1 = 3 and J_2 = 2.

Figure 1. a) Modeling steps, dashed line indicates on-line use. b) FNN with four rules.

2.1 Layer 1, antecedent functions
At this level, the membership values A_{j_d}(x_d) are determined. The antecedent membership functions used are similar to those proposed in [6]; they are parameterized by a mean m_{j_d}, a steepness s_{j_d} of the flanks, and a width w_{j_d}. This membership function definition is a generalization of a Gaussian function that enables the approximation of membership functions whose cores have a cardinality higher than one. This is required during structure learning, when similar membership functions are combined in a "Lukasiewicz-OR"-like manner.

2.2 Layer 2, rule antecedent connectivity
Here the degree of fulfillment (μ_i) of a rule is determined by AND conjunction of the propositions in the rule's antecedent:

    μ_i = ∏_{d=1}^{D} A^i_{j_d}(x_d),                                   (2)

where A^i_{j_d} is the fuzzy proposition concerning input variable d in the ith rule. The product operator is used to model the AND connective in order to get differentiable rules as required for back-propagation parameter learning. Other differentiable operators such as softmin can also be used [7].

2.3 Layer 3, output level
In this layer, the system output is determined by taking the weighted average of the rule consequents:

    ŷ = Σ_{i=1}^{N} μ_i c_i / Σ_{i=1}^{N} μ_i,                          (3)

where N is the total number of rules. This weighted, normalized sum makes the contribution of each consequent proportional to the relative degree of fulfillment of the rule. Here we consider only singleton consequents due to their ease both in implementation and in interpretation. However, more powerful consequent functions can be used as well to achieve smoother function approximation capabilities with fewer rules [8, 9].
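To make the three layers concrete, the sketch below computes the output of such a network for a single input vector. It is only an illustration: the flat-core generalized Gaussian used in membership() is an assumed parametrization (the paper states only that the function generalizes a Gaussian with a mean, a steepness and a width), and the names antecedents, R and c are hypothetical.

    import numpy as np

    def membership(x, m, s, w):
        # Assumed flat-core generalized Gaussian: membership 1 on [m - w, m + w],
        # Gaussian flanks of steepness s outside it (a core with cardinality > 1).
        d = np.maximum(np.abs(x - m) - w, 0.0)
        return np.exp(-s * d**2)

    def fnn_output(x, antecedents, R, c):
        """Layers 1-3 for one input vector x of length D.
        antecedents[d]: list of (m, s, w) tuples for input dimension d;
        R: N x D connectivity matrix (entry 0 = input not used in the premise);
        c: vector of N singleton consequents."""
        N, D = R.shape
        mu = np.ones(N)
        for i in range(N):
            for d in range(D):
                j = R[i, d]
                if j > 0:
                    m, s, w = antecedents[d][j - 1]
                    mu[i] *= membership(x[d], m, s, w)   # product models the AND connective
        return np.dot(mu, c) / np.sum(mu)                # weighted average of consequents, Eq. (3)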
3 Model identification
As shown in Fig. 1a, model identification is achieved in three stages: data-driven initialization, parameter learning and structure learning. The last two stages are repeated until the error criterion (E) has been minimized, or the number of iterations exceeds a given limit. While parameter learning deals with the numerical accuracy of the model, structure learning removes redundancies and irrelevant rules to ensure model transparency.

3.1 Data-driven initialization (DDI)
DDI initializes the three sections of the network sequentially: 1) the antecedent functions, 2) the rule layer connectivity, and 3) the consequent values. By initializing the rule connectivity only after full initialization of the antecedent layer, we ensure that the sequence of the antecedent function growth has no influence on the rule layer connectivity. A drawback, however, is that this requires the presentation of the training data to be repeated for each of the three stages of the initialization.

3.1.1 Antecedent initialization
Utilizing flexible hyper-boxes similar to those proposed in [10] for input space partitioning avoids generating antecedent functions in regions never visited by the input data. Similar to the training approach in [1], the initialization takes place sequentially as the training samples are presented. However, instead of placing the means of the antecedent functions at the location of the first N samples (N is the number of rules) like in [1], new antecedent functions are generated upon the presentation of a new data sample if this
sample is not a sufficient member of any previously created function. This avoids creating many similar functions, and complete coverage of the input space by the first N training samples is not required. The DDI algorithm for the antecedent functions is summarized below. The determination of the width of the antecedent functions requires an estimate of the number of membership functions that are needed per input dimension. A priori information concerning (parts of) the partitioning of the input space can be included in the initialization.
Algorithm 3.1
Input: Estimated number of antecedent functions M_d, d = 1, 2, ..., D, training data x_d(k), k = 1, 2, ..., K, and, optionally, a priori defined antecedent functions.
Output: Mean m_{j_d}, steepness s_{j_d} and width w_{j_d} of the antecedent functions A_{j_d}.

Step 1: Repeat for d = 1, 2, ..., D:
    IF no antecedent function exists for dimension d THEN create A_{1d} with m_{1d} = x_d(1), s_{1d} = s and w_{1d} = w_d, where

        s = …,                                                          (4)

        w_d = [ max_k(x_d(k)) − min_k(x_d(k)) ] / [ 2 (M_d − 1) … ].    (5)

Step 2: Repeat for k = 1, 2, ..., K:
    IF max_j(A_{j_d}(x_d(k))) ≤ λ THEN increment J_d, set j_d = J_d and create a new antecedent function A_{j_d} with m_{j_d} = x_d(k), s_{j_d} = s and w_{j_d} = w_d.

Equations (4) and (5) determine s and w_d such that the membership value at the cross-over point is 0.5 and the membership value at the center of a neighboring function is λ. Typically, λ is given some default value that approximates zero, e.g. λ = 0.001.
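The following sketch restates Algorithm 3.1 in code, reusing the membership function from the sketch in Section 2. It is a sketch under assumptions: the width follows the structure of (5), and the steepness expression merely stands in for (4), chosen here so that the membership at the center of a neighboring function is approximately λ.

    import numpy as np

    def init_antecedents(X, M, lam=0.001):
        """X: K x D matrix of training inputs; M[d]: estimated number of
        antecedent functions for dimension d. Returns one list of
        (mean, steepness, width) triples per input dimension."""
        K, D = X.shape
        antecedents = []
        for d in range(D):
            rng = X[:, d].max() - X[:, d].min()
            w = rng / (2 * (M[d] - 1))            # width estimate, cf. Eq. (5)
            s = np.log(1.0 / lam) / w**2          # assumed steepness, stands in for Eq. (4)
            funcs = [(X[0, d], s, w)]             # Step 1: seed with the first sample
            for k in range(1, K):                 # Step 2: scan the remaining samples
                x = X[k, d]
                if max(membership(x, m, s_, w_) for m, s_, w_ in funcs) <= lam:
                    funcs.append((x, s, w))       # sample not covered: add a new function
            antecedents.append(funcs)
        return antecedents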
3.1.2 Rule layer initialization
In the rule layer, only those antecedent nodes are connected that are simultaneously activated as a result of an input sample. This differs from full connectivity, which is commonly used with, e.g., fixed-grid partitioning of the input space. Though a fully connected rule layer ensures that each possible combination of input data can be processed, the number of generated rules grows exponentially with the number of input dimensions D. Irrelevant rules may be defined for input combinations that do not occur in the system. If information concerning the rule layer connectivity is available, it can be presented in the form of a connectivity matrix R = [r_{id}], r_{id} ∈ {0, 1, ..., J_d}. Recall that J_d is the total number of antecedent functions defined for input dimension d. The columns of R correspond to input dimensions, the rows represent the rules, and r_{id} corresponds to the number j_d of the antecedent function A_{j_d} defined for input d in rule i. If r_{id} = 0, input d is not used in the premise of rule i. For example, the connectivity matrix
    R = [ 2  1 ]
        [ 1  0 ]

corresponds to the premise "IF x_1 is A_{21} and x_2 is A_{12}" for rule 1, and "IF x_1 is A_{11}" for rule 2. The DDI algorithm for the rule layer is as follows:
Algorithm 3.2
IF available, use the supplied connectivity matrix, ELSE start with an empty matrix, R = ∅.
Repeat for k = 1, 2, ..., K:
    Repeat for d = 1, 2, ..., D:
        Step 1: Find the antecedent functions A_{j_d} that are activated by the sample x_d(k).
    Step 2: Augment R with all possible AND connections not already present, using the antecedents found in Step 1.
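A sketch of Algorithm 3.2 under the same assumptions. The activation test (membership above λ) is an assumption, since the precise criterion of Step 1 is not spelled out above; the membership function is the one from the sketch in Section 2.

    import numpy as np
    from itertools import product

    def init_connectivity(X, antecedents, lam=0.001, R_prior=None):
        """For every training sample, collect the antecedent functions it
        activates in each dimension and add every AND combination of them
        as a row (premise) of the connectivity matrix R, unless present."""
        K, D = X.shape
        rows = set() if R_prior is None else {tuple(r) for r in np.asarray(R_prior)}
        for k in range(K):
            active = []
            for d in range(D):
                idx = [j + 1 for j, (m, s, w) in enumerate(antecedents[d])
                       if membership(X[k, d], m, s, w) > lam]
                active.append(idx or [0])         # 0 marks an unused input dimension
            for premise in product(*active):      # Step 2: all AND combinations
                rows.add(premise)
        return np.array(sorted(rows), dtype=int)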
3.1.3 Consequent initialization
Unless a priori values are known, the consequent values of the rules are initialized as a weighted average of the desired outputs.
Algorithm 3.3
Step 1: Assign available a priori values to the corresponding consequents.
Step 2: Repeat for i = 1, 2, ..., N:
    IF consequent c_i is not already initialized THEN

        c_i = Σ_{k=1}^{K} μ_i(k) y(k) / Σ_{k=1}^{K} μ_i(k),             (6)
where μ_i(k) is the fulfillment of the premise of rule i for input sample k, and y(k) is the corresponding output.
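As an illustration of Algorithm 3.3 with (6), the sketch below fills in the free consequents; the matrix mu of premise fulfillments would be obtained by evaluating the rule premises on the training data, and the NaN convention for marking uninitialized consequents is only a convenience of this sketch.

    import numpy as np

    def init_consequents(mu, y, c_prior=None):
        """mu: K x N matrix of premise fulfillments mu_i(k); y: K desired
        outputs; c_prior: optional a priori consequents, NaN where unknown."""
        N = mu.shape[1]
        c = np.full(N, np.nan) if c_prior is None else np.asarray(c_prior, dtype=float)
        for i in range(N):
            if np.isnan(c[i]):                                  # Step 2: free consequents only
                c[i] = np.dot(mu[:, i], y) / np.sum(mu[:, i])   # weighted average, Eq. (6)
        return c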
3.2 Parameter learning
As in the original back-propagation learning by Rumelhart and McClelland [11], parameter learning minimizes the mean square error (E) between the model output (ŷ) and the actual output (y):

    E = (1/K) Σ_{k=1}^{K} ( ŷ(k) − y(k) )².                             (7)
Parameter learning adjusts the parameters of both the antecedent membership functions and the consequent values, applying the back-propagation algorithm and the general Widrow-Hoff learning rule:

    w(l+1) = w(l) − η ∂E/∂w,                                            (8)

where w is one of the adjustable parameters in a node, η is the learning rate, and l is the iteration number. Learning takes place iteratively, with repeated presentation of the training data set, until the error E has been minimized or l reaches a predefined maximum. For the derivation of the back-propagation algorithm for the FNN we refer to, e.g., [1, 12] or [13], in which the order of the parameter learning is also discussed. In this paper all parameters are learnt simultaneously.
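For illustration only, a generic parameter update in the spirit of (8): a finite-difference gradient replaces the analytic back-propagation derivatives of [1, 12, 13], and the step size eta is an arbitrary choice.

    import numpy as np

    def gradient_step(params, error_fn, eta=0.05, eps=1e-6):
        """One iteration w(l+1) = w(l) - eta * dE/dw for a flat parameter
        vector; error_fn maps a parameter vector to the error E of Eq. (7)."""
        params = np.asarray(params, dtype=float)
        grad = np.empty_like(params)
        for i in range(params.size):
            p_plus, p_minus = params.copy(), params.copy()
            p_plus[i] += eps
            p_minus[i] -= eps
            grad[i] = (error_fn(p_plus) - error_fn(p_minus)) / (2 * eps)
        return params - eta * grad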
3.3 Structure learning
Structure learning aims at ensuring the transparency of the learned model by eliminating redundancies in the rule and term set and by removing irrelevant rules. As illustrated in Fig. 1a, structure learning is an iterative process that is applied after each parameter learning step. As discussed in [5], similar fuzzy sets represent redundancy in the form of compatible concepts, and their existence hampers the transparency of the model. To remedy this, the similarities between the antecedent fuzzy sets A_{j_d} are evaluated as the model evolves, and "sufficiently" similar antecedent terms are merged. The consequents are evaluated in a similar way, and compatible consequents are also merged. The fuzzy antecedent sets can also be similar to a universal set X_d, adding no information to the model. Such sets are removed from the premise of the rules, and their corresponding entries r_{id} in the connectivity matrix R are replaced by 0. These operations reduce the term set used by the model. Combination of rules follows when the premises of two or more rules become equal, reducing the rule set. Irrelevant rules are detected by evaluation of the cumulative relative contribution (CRC). These are rules whose contribution to the model output never has any appreciable effect: either they deal with situations that never occur, or their relative degree of firing is always low.
3.3.1 Similarity-driven reduction
As in [5], compatible fuzzy sets are merged in an iterative manner. In each antecedent dimension, the pair of antecedent terms having the highest similarity above a given threshold T are merged. A new antecedent membership function is created, and the model is updated by substituting this new fuzzy set for the ones merged. Antecedent terms represented by fuzzy sets similar to the universal set, i.e. A_{j_d}(x_d) ≈ 1 ∀ x_d ∈ X_d, are removed. Different methods for assessing compatibility between fuzzy sets have been proposed [14]. Following [5], we apply the fuzzy Jaccard similarity index:

    S(A, B) = |A ∩ B| / |A ∪ B|,                                        (9)

where the min and max operators model the intersection and the union, respectively, and |·| denotes the cardinality of a fuzzy set. The similarity measure takes on values S ∈ [0, 1], where 1 reflects equality and 0 refers to non-overlapping fuzzy sets. If S(A_{j_d}, A_{j'_d}) > T, merging of the two membership functions A_{j_d} and A_{j'_d} takes place. The threshold T ∈ (0, 1) represents the user's willingness to trade accuracy against transparency. In the case of merging, it is important that the coverage of the input space is preserved for the following training iterations; hence, the following merging of two similar fuzzy sets A and B is proposed:

    m_new = ( m_A w_A + m_B w_B ) / ( w_A + w_B ).                      (10)

Figure 2 illustrates the merging of membership functions.
Figure 2. Merging of two fuzzy sets.

The merging of antecedent functions can lead to the merging of rules when two or more rules obtain equal premise parts. This requires the consequent values to be merged. Since the total model output is a weighted average of the outputs of the various rules, consequents are simply merged by averaging:

    c_new = (1/p) Σ_{i=1}^{p} c_i,                                      (13)

where p is the number of rules being merged.
Note that this merging of consequents does change the model output. As shown in [5], an optimal merging of
rules can be obtained if weighted rules are used. However, in an FNN training scheme, the small errors introduced by (13) are easily eliminated in the next training iterations. In order to reduce the set of different consequent values, compatible consequents are also merged. If the consequents used are fuzzy sets, (9) is applicable to the consequents as well. The similarity of the singleton consequents used in this paper is defined as the complement of the normalized distance between two consequents:

    S(c_i, c_l) = 1 − |c_i − c_l| / ( ȳ − y̲ ),                          (14)

where ȳ and y̲ are the upper and lower bounds of the output domain Y, and S(c_i, c_l) ∈ [0, 1]. If the degree of similarity between two consequents is greater than the threshold T, they can be merged as described in (13), regardless of the similarity of the premises of their respective rules.
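The similarity and merging operations can be sketched on a discretized domain as follows; only the merged mean of (10) is taken from the text above, and the handling of the remaining membership function parameters is left out, as their merging formulas are not reproduced here.

    import numpy as np

    def jaccard_similarity(A, B):
        """Fuzzy Jaccard index, Eq. (9): A and B hold the membership values
        of two fuzzy sets sampled at the same points of the domain."""
        return np.minimum(A, B).sum() / np.maximum(A, B).sum()

    def merge_means(mA, wA, mB, wB):
        """Width-weighted average of the means, Eq. (10)."""
        return (mA * wA + mB * wB) / (wA + wB)

    def consequent_similarity(ci, cl, y_lo, y_hi):
        """Similarity of two singleton consequents, Eq. (14): complement of
        their distance normalized by the span of the output domain."""
        return 1.0 - abs(ci - cl) / (y_hi - y_lo)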
3.3.2 Detection of irrelevant rules
Irrelevant rules can be detected by evaluating their cumulative relative contribution (CRC) to the model. The CRC of a rule is measured by taking the sum of the relative degree of firing of the rule over all data samples k in the training data set:

    γ_i = Σ_{k=1}^{K} [ μ_i(k) / Σ_{j=1}^{N} μ_j(k) ],                  (15)
where γ_i is the CRC of the ith rule. If the CRC of a rule is lower than a given threshold, γ_i < I, then this rule is considered irrelevant for the model output and can be deleted. When a rule R_i is removed based on the CRC, the corresponding ith row of the connectivity matrix R is removed. Like the merging threshold T, the CRC threshold I reflects the user's willingness to trade accuracy for less complexity; higher values of I remove more rules. It is advised, but not necessary, that I ≤ 1, to ensure that no rules that account for special cases (exceptions) are removed.
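A sketch of the CRC test follows; mu is the K x N matrix of rule fulfillments over the training set (an assumed representation), and the rows of R and the entries of c are dropped together with the pruned rules.

    import numpy as np

    def prune_irrelevant_rules(mu, R, c, I=1.0):
        """Remove rules whose cumulative relative contribution, Eq. (15),
        stays below the threshold I."""
        rel = mu / mu.sum(axis=1, keepdims=True)   # relative degree of firing per sample
        crc = rel.sum(axis=0)                      # gamma_i: cumulative relative contribution
        keep = crc >= I
        return R[keep], c[keep], crc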
4 Example
Consider a two-dimensional surface which is the normalization of the well-known 'Rosenbrock Valley', shown in Fig. 3, given by the equation

    y = [ 100 (x_2 − x_1²)² + (1 − x_1)² ] / 400.                       (16)

Figure 3. Rosenbrock Valley.

In order to illustrate the influence of the discussed techniques, a comparison is made with a model trained with the same type of rules, but without the application of the DDI and structure learning techniques (full connectivity). The results in Fig. 4 and Fig. 5 show both models after one learning iteration only (one epoch), using a training data set of 300 samples equally distributed over the surface (note that the aim of this example is to illustrate the method and not to perfectly learn the mapping of the 'Rosenbrock Valley'; this would require more training iterations). Structure learning was applied four times during the epoch and no a priori information was used. The required estimate of the number of antecedent functions was in each case set at seven membership functions per dimension, and the thresholds used were T = 0.6 for the similarity analysis and I = 1 for the CRC evaluation.

Figure 4. Without DDI and structure learning.

The model trained without DDI and structure learning (Fig. 4) was initialized with seven equally distributed fuzzy sets in both antecedent dimensions, and the consequent initialization of Algorithm 3.3. Studying the model obtained with DDI and structure learning (Fig. 5) reveals that the qualitative description of the resulting surface is comparable, but the number of antecedent terms and rules has been reduced
substantially in the latter case. Using structure learning, the number of generated rules is reduced to 15, using a total of 9 antecedent terms and 11 consequent terms, as compared to the 49 rules, using a total of 14 antecedent terms and 49 consequent terms, in the model with full connectivity.

Figure 5. With DDI and structure learning.

5 Conclusions and further research
We have proposed a combination of data-driven initialization, flexible hyper-boxes, and structure learning, by applying similarity analysis and cumulative relative contribution evaluation, for creating more transparent neuro-fuzzy models from measurement data. DDI ensures that no rules are created in regions not visited by the training data, while similarity analysis detects redundant terms and rules during parameter training, which are subsequently merged. In addition, CRC evaluation removes rules that become more or less irrelevant as parameter training proceeds. The application of structure learning during network training leads to rule base reduction early in the training process, and less effort is spent on training redundant terms and rules. The results so far have shown that a great reduction of the rule and term set is possible with these methods, resulting in a more transparent rule-based model. This paper presented a snapshot of on-going research concerning training methods for fuzzy rule-based models. Further research concentrates on three issues:

1. Extending to Takagi-Sugeno functional rule consequents.
2. Eliminating the requirement of an estimate of the number of antecedent functions in the initialization.
3. Investigating the possibilities of on-line application. On-line capabilities can be added by introducing an outer feedback loop (as shown in Fig. 1a), using the model at time t as a priori information for a new model at time t+1.

References
[1] L. X. Wang, Adaptive Fuzzy Systems and Control: Design and Stability Analysis, Prentice Hall, Englewood Cliffs, 1994.
[2] J. H. Nie and T. H. Lee, "Rule-based modeling: Fast construction and optimal manipulation," IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans, vol. 26, pp. 728-738, 1996.
[3] T. Kavli, "ASMOD - an algorithm for adaptive spline modeling of observation data," International Journal of Control, vol. 58, pp. 947-967, 1993.
[4] C. T. Chao, Y. J. Chen, and T. T. Teng, "Simplification of fuzzy-neural systems using similarity analysis," IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, vol. 26, pp. 344-354, 1996.
[5] M. Setnes, R. Babuška, U. Kaymak, and H. R. van Nauta Lemke, "Similarity measures in fuzzy rule base simplification," IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, vol. 28, no. 3, 1998.
[6] Y. Lin and G. A. Cunningham III, "A new approach to fuzzy-neural system modeling," IEEE Transactions on Fuzzy Systems, vol. 3, pp. 190-198, 1995.
[7] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Transactions on Neural Networks, vol. 3, pp. 724-740, 1992.
[8] M. Setnes, R. Babuška, and H. B. Verbruggen, "Rule-based modeling: Precision and transparency," IEEE Transactions on Systems, Man and Cybernetics - Part C: Applications and Reviews, vol. 28, no. 1, 1998.
[9] T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modelling and control," IEEE Transactions on Systems, Man, and Cybernetics, vol. 15, pp. 116-132, 1985.
[10] G. Carpenter, S. Grossberg, and D. Rosen, "Fuzzy ART: Fast stable learning and categorizing of analog patterns by an adaptive resonance system," Neural Networks, vol. 4, pp. 759-771, 1991.
[11] D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986.
[12] C. T. Lin, Neural Fuzzy Control Systems with Structure and Parameter Learning, World Scientific, Singapore, 1994.
[13] T. Furuhashi, T. Hasegawa, S. Horikawa, and Y. Uchikawa, "On design of adaptive fuzzy controller using fuzzy neural networks and a description of its dynamical behaviour," Fuzzy Sets and Systems, vol. 71, pp. 5-23, 1995.
[14] V. Cross, An Analysis of Fuzzy Set Aggregators and Compatibility Measures, Ph.D. thesis, Wright State University, Ohio, 1993.