In Proceedings of the 7th IFSA World Congress, Prague, 1, pp. 359-362, 1997. (invited paper)

TOLERATING MISSING VALUES IN A FUZZY ENVIRONMENT

Michael R. Berthold, Klaus-Peter Huber
Institute for Computer Design and Fault Tolerance (Prof. D. Schmid)
University of Karlsruhe, Am Zirkel 2, 76128 Karlsruhe, Germany
phone: +49-721-608 4219, fax: +49-721-370 455
eMail: [Michael.Berthold,KlausPeter.Huber]@Informatik.Uni-Karlsruhe.DE

Abstract. In this paper a technique is proposed to tolerate missing values based on a system of fuzzy rules for classification. The presented method is mathematically solid but nevertheless easy and efficient to implement. Three possible applications of this methodology are outlined: the classification of patterns with an incomplete feature vector, the completion of the input vector when a certain class is desired, and the training or automatic construction of a fuzzy rule set based on incomplete training data. In contrast to a static replacement of the missing values, here the evolving model is used to predict the most plausible value for the missing attribute. Benchmark datasets are used to demonstrate the capability of the presented approach in a fuzzy learning environment.

Keywords: Fuzzy Rules, Classification, Training, Missing Values.

1 Introduction

Pattern classification applications often involve incomplete information. This may be due to unrecorded information or occlusions of features, for example in vision applications. Dealing with incomplete information then means that a few of the input features are unknown. In this context three different scenarios can be of interest:

- classification with missing values, i.e. the model is used to predict the most likely class of the incomplete input vector.
- completion of input vectors. If the set of fuzzy rules is used to describe the behaviour of a system, then it is often of interest to find good values for all or part of the input features when a certain outcome is desired. This is similar to finding an inverse function of the model.
- training with incomplete data. The last issue concerns training of a set of fuzzy rules when some or all of the training data is incomplete. To use as much of the available information as possible during training it is desirable to make use of incomplete training vectors as well.

In this paper we concentrate on classifiers that are based on fuzzy rules, in contrast to the work presented in [1] and [4]. This simplifies the resulting math and allows the best guess for the missing values to be estimated efficiently. Few restrictions on the type of fuzzy system are assumed, i.e. rules have to be normalized and standard min/max inference is required. For most practical applications these restrictions are of no relevance.

This paper is organized as follows. In the next section the terminology used will be established. Thereafter a way to classify patterns with incomplete features will be presented. Then this method will be extended to find the most likely value for these features. After that it is described how automatic learning of fuzzy rules can be adapted to tolerate missing values in the training data, followed by a section describing experimental results.

2 Fuzzy Rule Based Classifiers

A fuzzy rule based classifier (see [5]) consists of a set of rules $R_i^k$ for each class $k$ with $1 \le k \le c$, $1 \le i \le m_k$, where $c$ denotes the number of classes and $m_k$ indicates the number of rules of class $k$. If the feature space is of dimensionality $n$, each feature vector $\vec{x}$ consists of $n$ attributes: $\vec{x} = (x_1, \ldots, x_n)$, $x_i \in \mathbb{R}$ ($\mathbb{R}$ denotes the domain of the attributes, usually the real numbers). A rule's activity $\mu_i^k(\vec{x})$ indicates the degree of membership of the pattern $\vec{x}$ to the corresponding rule $R_i^k$; the degree of membership for the entire class is computed using:

$$\mu^k(\vec{x}) = \max_{1 \le i \le m_k} \mu_i^k(\vec{x}) \qquad (1)$$

Using the normal notation of fuzzy rules, each rule can be decomposed into $n$ individual, one-dimensional membership functions:

$$\mu_i^k(\vec{x}) = \min_{j=1,\ldots,n} \mu_{i,j}^k(x_j) \qquad (2)$$

where $\mu_{i,j}^k$ indicates the projection of $\mu_i^k$ onto the $j$-th attribute. In addition, we assume normalized, one-dimensional membership functions:

$$\forall x_j \in \mathbb{R}: \; 0 \le \mu_{i,j}^k(x_j) \le 1 \quad \text{and} \quad \exists x_j \in \mathbb{R}: \; \mu_{i,j}^k(x_j) = 1 \qquad (3)$$
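To make the inference concrete, the two formulas above can be sketched in a few lines of code. This is an illustrative sketch, not the authors' implementation; the trapezoidal membership function and the two example rules are assumptions (any normalized one-dimensional membership functions satisfying equation 3 would do):

```python
# Sketch of equations 1 and 2 (illustrative; the trapezoidal membership
# function and the example rules below are assumptions, not from the paper).

def trapezoid(x, a, b, c, d):
    """Normalized trapezoidal membership: support [a, d], core [b, c]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def rule_activity(x, rule):
    """Equation 2: min over the rule's one-dimensional memberships."""
    return min(trapezoid(xj, *params) for xj, params in zip(x, rule))

def class_activity(x, rules):
    """Equation 1: max over all rules of one class."""
    return max(rule_activity(x, r) for r in rules)

# Two hypothetical rules of one class in a 2-D feature space:
rules_k = [
    [(0.0, 1.0, 2.0, 3.0), (0.0, 1.0, 2.0, 3.0)],
    [(2.0, 3.0, 4.0, 5.0), (2.0, 3.0, 4.0, 5.0)],
]
print(class_activity((1.5, 1.5), rules_k))  # pattern in the first rule's core -> 1.0
```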





3 Classification with Missing Values

For the case of missing input values, i.e. attributes for which no information about their true value is available, the feature vector can be written as $\vec{x} = (\vec{v}, ?)$, where $\vec{v} = (v_1, \ldots, v_{n_v})$ represents the part of the vector with the known values and $?$ replaces $\vec{u} = (u_1, \ldots, u_{n_u})$, the vector of unknown attributes ($n_v + n_u = n$). The activation of one rule can then be obtained through:

$$\mu_i^k(\vec{v}, ?) = \sup_{\vec{u} \in \mathbb{R}^{n_u}} \mu_i^k(\vec{v}, \vec{u}) \qquad (4)$$

Using equation 2 this becomes:

$$\mu_i^k(\vec{v}, ?) = \min\Big( \min_{1 \le j \le n_v} \mu_{i,j}^k(v_j), \; \min_{1 \le l \le n_u} \sup_{u_l \in \mathbb{R}} \mu_{i,l}^k(u_l) \Big)$$

Since we are dealing with normalized fuzzy rules (equation 3), $\mu_{i,l}^k$ will attain the maximum value 1 somewhere on $\mathbb{R}$, and this equation can be simplified to:

$$\mu_i^k(\vec{v}, ?) = \min\Big( \min_{1 \le j \le n_v} \mu_{i,j}^k(v_j), \; 1 \Big) = \min_{1 \le j \le n_v} \mu_{i,j}^k(v_j) \qquad (5)$$

independent of $\vec{u}$, i.e. it is sufficient to compute the activation of each rule in the projection of $\mathbb{R}^n$ onto $\mathbb{R}^{n_v}$. This means that only the activation of each one-dimensional membership function of the known attributes has to be computed.

The results obtained here are similar to the results obtained by [1] for Gaussian Basis Functions but are more efficient to compute. In the case of fuzzy rules the only assumption made is that the rule's activity is composed of one-dimensional normalized activation functions.

4 Input Vector Completion

If the input-output pattern $(\vec{x}, k)$ (input $\vec{x}$, class $k$) has missing values the same notation as above can be used: $\vec{x} = (\vec{v}, \vec{u})$. Usually finding the best guess for $\vec{u}$ is computationally expensive since the maximum of the model function has to be found in $\mathbb{R}^{n_u}$. In the case of fuzzy rules it becomes easier; the maximum of $\mu^k(\vec{v}, \vec{u})$ with respect to $\vec{u}$ has to be found:

$$\sup_{\vec{u} \in \mathbb{R}^{n_u}} \mu^k(\vec{v}, \vec{u}) \overset{\text{eq. 1}}{=} \sup_{\vec{u} \in \mathbb{R}^{n_u}} \max_{1 \le i \le m_k} \mu_i^k(\vec{v}, \vec{u}) = \max_{1 \le i \le m_k} \sup_{\vec{u} \in \mathbb{R}^{n_u}} \mu_i^k(\vec{v}, \vec{u}) \overset{\text{eq. 4}}{=} \max_{1 \le i \le m_k} \mu_i^k(\vec{v}, ?) \qquad (6)$$

This results in the rule with the highest degree of membership being selected:

$$i_{\max} = \operatorname*{argmax}_{1 \le i \le m_k} \mu_i^k(\vec{v}, ?) = \operatorname*{argmax}_{1 \le i \le m_k} \; \min_{1 \le j \le n_v} \mu_{i,j}^k(v_j) \qquad (7)$$

In the case of several maxima one of them can be chosen arbitrarily. The restriction to $\vec{v}$ on the best rule leads to a not necessarily normalized fuzzy rule in $\mathbb{R}^{n_u}$ with a subspace $A_{\max}(\vec{u}\,|\,\vec{v})$ of maximal degree of membership under the condition $\vec{v}$:

$$A_{\max}(\vec{u}\,|\,\vec{v}) = \{ \vec{u} \in \mathbb{R}^{n_u} : \mu_{i_{\max}}^k(\vec{v}, \vec{u}) \ge \mu_{i_{\max}}^k(\vec{v}, \vec{w}) \;\; \forall \vec{w} \in \mathbb{R}^{n_u} \} \qquad (8)$$

Using equation 5 this can be simplified to:

$$A_{\max}(\vec{u}\,|\,\vec{v}) = \{ \vec{u} : \mu_{i_{\max}}^k(\vec{v}, \vec{u}) \ge \mu_{i_{\max}}^k(\vec{v}, ?) \} \qquad (9)$$

where $\mu_{i_{\max}}^k(\vec{v}, ?)$ is the activation of the rule using the missing values (see equation 5). $A_{\max}(\vec{u}\,|\,\vec{v})$ describes a hyper-rectangle in $n_u$-dimensional space that is equal to the $\alpha$-cut of $\min_{1 \le l \le n_u} \mu_{i_{\max},l}^k(u_l)$ using $\alpha = \mu_{i_{\max}}^k(\vec{v}, ?)$, and is therefore easy to obtain. Now the "best" vector $(\vec{v}, \vec{u}_{\max})$ can be computed using a typical defuzzification technique, for example the Center of Gravity:

$$\vec{u}_{\max} = \frac{\int_{\vec{u} \in A_{\max}} \vec{u} \; \mu_{i_{\max}}^k(\vec{v}, \vec{u}) \, d\vec{u}}{\int_{\vec{u} \in A_{\max}} \mu_{i_{\max}}^k(\vec{v}, \vec{u}) \, d\vec{u}} \qquad (10)$$

Figure 1 illustrates this procedure using a two-dimensional example with one known and one unknown attribute. For the activation function a trapezoidal membership function is used, featuring a core region ($\mu = 1$) and a support region ($0 < \mu < 1$); for more details see [2].

[Figure 1 (graphic not reproduced): The proposed way to find a best guess for a missing value. The attribute v has the value a, u is unknown. Using the described method $\mu(?, a)$ can be obtained, resulting in an area of maximum activity $A_{\max}$. The "best guess" for u is found using the Center of Gravity method, resulting in $u_{\max}$.]
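Under the same illustrative assumptions as before (trapezoidal one-dimensional memberships, here with a single missing attribute marked `None`), equations 5, 7, 9 and 10 can be sketched as follows. Note that on $A_{\max}$ the winning rule's activity is constant ($= \alpha$, since the unknown dimension's membership is at least $\alpha$ there), so the Center of Gravity of equation 10 reduces to the centroid of the hyper-rectangle, here the midpoint of the interval:

```python
# Sketch of classification with missing values (equation 5) and input
# completion (equations 7, 9, 10). Illustrative only: trapezoidal
# memberships and one missing attribute (None) are assumptions.

def trapezoid(x, a, b, c, d):
    """Normalized trapezoidal membership: support [a, d], core [b, c]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def activity_with_missing(x, rule):
    """Equation 5: the min runs over the known attributes only."""
    return min(trapezoid(xj, *p) for xj, p in zip(x, rule) if xj is not None)

def best_guess(x, rules):
    """Equations 7, 9 and 10 for one missing attribute."""
    # Equation 7: rule with the highest activity on the known attributes.
    i_max = max(range(len(rules)),
                key=lambda i: activity_with_missing(x, rules[i]))
    alpha = activity_with_missing(x, rules[i_max])
    a, b, c, d = rules[i_max][x.index(None)]
    # Equation 9: A_max is the alpha-cut of the 1-D trapezoid, an interval.
    lo = a + alpha * (b - a)
    hi = d - alpha * (d - c)
    # Equation 10: on A_max the rule's activity is constant (= alpha),
    # so the Center of Gravity is simply the midpoint of the interval.
    return (lo + hi) / 2.0

rules_k = [
    [(0.0, 1.0, 2.0, 3.0), (0.0, 1.0, 2.0, 3.0)],
    [(2.0, 3.0, 4.0, 5.0), (2.0, 3.0, 4.0, 5.0)],
]
print(best_guess([1.2, None], rules_k))  # known value in rule 1's core -> 1.5
```

The midpoint shortcut is exact here because the activity is constant over the cut; for non-trapezoidal membership functions equation 10 would be evaluated by numeric integration instead.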

5 Training with Missing Values

Training with missing values is usually done by substituting the mean or another fixed value for each missing attribute. In [4] it was demonstrated how this can lead to severe problems during training. More sophisticated methodologies can be used, but for practical applications an efficient implementation is also required, especially in the case of large datasets. The approach presented here uses the rule model that evolves during training to predict the most plausible value for each of the missing attributes. As was shown above this is computationally uncritical: only the rule with the highest activation has to be found, and from its core region the value of $\vec{u}_{\max}$ has to be extracted.

A training vector with missing values is thus handled in two steps. First the best guess $\vec{u}_{\max}$ for $\vec{u}$ is determined, based on the known values $\vec{v}$. Then normal training is executed using the best guess for $\vec{u}$ together with the known values of the input $\vec{v}$, resulting in a complete $\vec{x} = (\vec{v}, \vec{u}_{\max})$.

In practice there are two possible ways to implement this type of training, depending on the underlying rule learning algorithm. If the algorithm operates iteratively and the evolving rule set is meaningful during intermediate presentation of training patterns, the current rule set can be used to replace the missing values. If the algorithm on the other hand is batch oriented, i.e. the rule set is only meaningful after training is finished, the rule set of the previous epoch is used to replace the missing values. In the following the latter approach has been used. At the beginning all missing values were replaced with the mean of the corresponding attribute (which is the "best guess" when no rule set is present). Then, after training using this dataset was finished, the constructed rule set was used to replace the missing values in the training data. With this new dataset the next training was performed, resulting in a new rule set.
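The batch scheme just described can be sketched as follows; `train_rules` and `complete` are hypothetical stand-ins for the underlying rule learner and for the best-guess completion of Section 4, not functions from the paper:

```python
# Sketch of the batch training scheme: missing values (None) start as
# attribute means; after each training run the freshly built rule set
# re-completes them. `train_rules` and `complete` are hypothetical
# placeholders for the rule learner and the best-guess completion.

def column_means(data):
    """Mean of each attribute over its known (non-None) entries."""
    means = []
    for j in range(len(data[0])):
        known = [x[j] for x in data if x[j] is not None]
        means.append(sum(known) / len(known))
    return means

def train_with_missing(data, labels, train_rules, complete, runs=5):
    means = column_means(data)
    # Initial "best guess": the attribute mean (no rule set exists yet).
    filled = [[m if v is None else v for v, m in zip(x, means)]
              for x in data]
    rules = train_rules(filled, labels)
    for _ in range(runs - 1):
        # Re-complete the missing values with the previous run's rule set.
        filled = [complete(x, rules, k) if None in x else list(x)
                  for x, k in zip(data, labels)]
        rules = train_rules(filled, labels)
    return rules
```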

6 Experimental Results

To demonstrate the usefulness of the proposed method an existing algorithm to construct a set of fuzzy rules from training data was adapted using the described technique to tolerate missing values during training. The underlying training algorithm (see [2] for a detailed description) uses a set of training data points $(\vec{x}, k)$ (where $\vec{x}$ is an $n$-dimensional input vector and $k$ indicates a class, $1 \le k \le c$) to iteratively construct a set of locally independent fuzzy rules. These rules only depend on relevant attributes, which is extremely beneficial when using a high-dimensional feature space. The algorithm is easy to use and trains fast. In addition its termination can be proven.

Two datasets from the StatLog archive [3] have been used. This archive evolved during a European ESPRIT project and contains a number of real-world datasets together with results of over 20 different classifiers. Both datasets used are based on real data and contain a large number of examples. The SatImage dataset was chosen because of its complexity (high number of attributes, high remaining generalization error), the Shuttle dataset because of its size (about 60,000 examples). Since the training algorithm used does not require a validation dataset, training could be performed on the entire training set, and the testing data was used to estimate the generalization capability of the constructed fuzzy rule set. Different levels of distortion were used to erase randomly chosen values, thus simulating different amounts of missing values. In addition to the presented fuzzy based replacement of missing values, a static substitution using mean and zero was also used.

6.1 Results on the SatImage Benchmark

The SatImage dataset contains 4,435 training cases divided into 6 classes in a 36-dimensional feature space. Testing is done on 2,000 patterns. This dataset was used first to demonstrate how using the evolving rule set to replace the missing values leads to a new rule set with better generalization. Table 1 shows the obtained results. Obviously for this dataset the replacement with zero is harmful; the replacement with the mean produces much better results. This illustrates clearly how a static substitution using a constant value can be harmful. The proposed method, however, finds the "best guess" dynamically, using the evolving rule set. This way of computing individual estimates for the missing values leads to rule sets with better generalization. Additionally the size of the constructed rule set grows more slowly with an increase in distortion. After a certain level of distortion (> 5% missing values) the size of the rule set even decreases, due to the increasing amount of freedom to choose a replacement for the missing values.

6.2 Results on the Shuttle Benchmark

In addition the Shuttle data was used because of its size: 43,314 training cases, 3 classes*), 9 dimensions, and 14,442 patterns to test generalization. Table 2 shows results for the Shuttle dataset, where the difference in the required number of rules between the different methodologies is even more drastic. At a level of 60% missing values, only 6 rules are required, while both other approaches need several thousand rules. With this dataset it is even more apparent how the rule-based replacement of missing values reduces the number of required rules without losing performance. Compared to the other methods, the generalization performance is even marginally better.

*) Three other classes from the original dataset were ignored because their occurrence was below 1%.
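The distortion procedure used above (erasing randomly chosen values to simulate missing data) can be sketched in a few lines; the toy dataset and the percentage below are illustrative stand-ins, not the actual StatLog data:

```python
# Sketch of the distortion procedure: erase a given fraction of randomly
# chosen attribute values (None marks a missing value). The toy dataset
# below is a stand-in, not the StatLog data.
import random

def distort(data, fraction, seed=0):
    """Return a copy of `data` with `fraction` of all values erased."""
    rng = random.Random(seed)
    rows, cols = len(data), len(data[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    out = [list(x) for x in data]
    for i, j in rng.sample(cells, int(fraction * rows * cols)):
        out[i][j] = None  # erase this value
    return out

dataset = [[float(i + j) for j in range(9)] for i in range(100)]
missing = distort(dataset, 0.20)
print(sum(v is None for row in missing for v in row))  # 20% of 900 cells -> 180
```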

Table 1: Results on the SatImage dataset with different levels of missing values.

  missing values | proposed method | replace w. mean | replace w. zero
                 | error    #rules | error    #rules | error    #rules
   0%            | 12.96%     506  | 12.96%     506  | 12.96%     506
   1%            | 12.41%     526  | 12.86%     529  | 13.86%     645
   5%            | 14.11%     532  | 14.51%     645  | 17.26%     966
  10%            | 13.76%     501  | 15.86%     745  | 27.51%    1285
  20%            | 15.31%     400  | 15.41%     843  | 43.12%    1444
  40%            | 16.11%     303  | 16.66%     950  | 51.33%    1353
  60%            | 16.96%     165  | 18.01%    1121  | 41.98%    1314

Table 2: Results on the Shuttle dataset with different levels of missing values.

  missing values | proposed method | replace w. mean | replace w. zero
                 | error    #rules | error    #rules | error    #rules
   0%            | 0.007%      14  | 0.007%      14  | 0.007%      14
  10%            | 0%          28  | 0%         183  | 0%         347
  20%            | 0.007%      31  | 0%         449  | 0%         971
  30%            | 0%          25  | 0.035%     934  | 0%        1730
  40%            | 0.007%      23  | 0.035%    1616  | 0.160%    2704
  60%            | 0%           6  | 0.160%    3481  |   -         -

7 Discussion

A methodology was proposed to deal with missing inputs for classification, completion of inputs, and training using sets of fuzzy rules under the usual assumptions, i.e. max/min inference and normalized rules. Fuzzy arithmetic was used to derive a computationally efficient way to compute the output of a fuzzy rule set when parts of the input vector are unknown. In addition the input vector can be completed, computing the so-called "best guess". During incremental training, the evolving set of fuzzy rules can also be used to compute the completed input vector and then use a conventional training algorithm. Experimental results show that the presented methodology outperforms standard static replacement techniques.

Acknowledgments

Thanks go to Prof. D. Schmid for his support and the opportunity to work on this project.

References

1. Subutai Ahmad and Volker Tresp. Some solutions to the missing feature problem in vision. In Stephen J. Hanson, Jack D. Cowan, and C. Lee Giles, editors, Advances in Neural Information Processing Systems, 5, pages 393-400, California, 1993. Morgan Kaufmann.
2. Klaus-Peter Huber and Michael R. Berthold. Building precise classifiers with automatic rule extraction. In IEEE International Conference on Neural Networks, 3, pages 1263-1268, 1995.
3. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited, 1994.
4. Volker Tresp, Subutai Ahmad, and Ralph Neuneier. Training neural networks with deficient data. In Jack D. Cowan, Gerald Tesauro, and Joshua Alspector, editors, Advances in Neural Information Processing Systems, 6, pages 128-135, California, 1994. Morgan Kaufmann.
5. Lotfi A. Zadeh. Soft computing and fuzzy logic. IEEE Software, pages 48-56, November 1994.