Missing Values in a Backpropagation Neural Net

Peter Vamplew and Anthony Adams
Department of Computer Science, University of Tasmania

Abstract

An empirical study of methods of handling missing values in a backpropagation neural network is presented. Neural networks can be applied to many real world systems to perform classification, pattern recognition or prediction on the basis of input data. However, many such applications cannot guarantee that the data provided to the network will be complete. The backpropagation network does not lend itself easily to dealing with missing values, due to its distributed nature of operation and the soft thresholds used in its nodes. Two common methods of handling this situation are tested, and four new methods are proposed and compared.

1 Introduction

Neural networks can be applied to many real world systems to perform classification, pattern recognition or prediction on the basis of input data. However, many such applications cannot guarantee that the data provided to the network will be complete. For example, errors with measuring equipment may mean that for some cases one or more of the required values will not be available. In a real-time control system it must be possible to handle this problem while the measuring equipment is being repaired. The purpose of this study is to implement and compare the performance of a number of different methods of adapting the most commonly used neural network architecture (the back-propagation network) to perform in the absence of complete data. For the purposes of this investigation it has been assumed that a complete set of training data is available, and so we are concerned only with the problem of missing values in the data on which the net is required to operate after training has been finalised.

2 The test data set

The test data set used for this study involved the classification of weed seeds into one of ten types, based on seven measurements of the dimensions of the seeds. The basic network used had a 7:N:10 topology, where each of the output nodes corresponded to a particular seed type and the most strongly activated output indicated the type assigned to the seed by the network. Values of N (the number of nodes in the hidden layer) from 6 to 12 were used. Similar results were found with all the nets over this range, and the results below are averages using these values. The original seed data consisted of measurements of 398 different seeds, giving 40 examples of each seed type, apart from two types which had only 39 examples. A test set consisting of ten examples of each seed type was extracted from these data, and the remaining 298 examples were used as the training set. The seven inputs were measurements of area, perimeter, etc., and were all scaled to range from -1 to 1.
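The paper does not spell out the scaling formula. As a minimal sketch, assuming a per-feature min-max mapping onto [-1, 1], the inputs could be prepared as follows; the random matrix is a stand-in for the actual seed measurements, which are not reproduced here.

```python
import numpy as np

def scale_to_unit_range(X):
    """Linearly rescale each column of X onto the range [-1, 1]."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

# Stand-in for the 398 x 7 matrix of seed measurements
# (area, perimeter, etc.); the real data is not available here.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(398, 7))
X_scaled = scale_to_unit_range(X)
assert X_scaled.min() >= -1.0 and X_scaled.max() <= 1.0
```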

3 Methods

There appear to be three general approaches to the problem of missing values (for example, see [1], [2]):
• Value substitution: Substitute another value in place of the missing value, either a value which is felt to be innocuous and unlikely to affect the overall result or an estimate of its most likely value.
• Network reduction: Attempt to perform the calculations without making use of the variable whose value is not known.
• Multiple outputs: Perform the calculation but allow multiple outcomes by taking into account the effects which different values of the unknown variable could have.

The first of these methods can easily be applied to a neural network, as the substitution can be done externally before the values are presented to the net, and so the net itself performs normally. The only task is to determine the most effective value to use as the replacement.

Different attempts were made at implementing the second method, namely trying to factor out the effect of the missing value and performing the calculation using only the values that are known. The last method, and to some extent the second method, are difficult to implement using a backpropagation net because of the distributed nature of the net and the soft thresholds used by the backpropagation algorithm. Every input affects every node in the network, rather than there being particular sections of the network dealing with the missing value which could thus be ignored or treated separately. This contrasts with the situation in a rule-based system, where it may be possible to produce an output whilst avoiding the rules which require the missing value. Alternatively, an expert system may choose to follow every path leading from rules dealing with the missing value, and produce several outputs, each annotated with the value of that variable which would produce that result (and possibly a measure of how likely such a value is, based on the examples which the system has already seen).

4 Substitution of another value for the missing value

This is the simplest solution to the problem, as it does not involve making any alterations to the network itself: a chosen value is simply substituted for the missing value. Three possibilities for selecting the value to be substituted were investigated during this study.

S1 The 'zero' solution
A common and simple method is to replace the missing value by zero. This cancels out the numerical effect of the node and, as zero is also the midpoint of the range of possible values for the scaled input data, it is hopefully a relatively innocuous value which will have little impact on the output of the network. Clearly for some data sets this will be untrue, as values around the midpoint may in fact be of significance. However, the simplicity of this solution made it worthy of examination.

S2 The 'average' solution
Substitution of the statistically most likely value: the average value over the whole training set for that particular input was used as the replacement value. If the distributions of values for the different classifications in that input are very different, then this solution would not be expected to be much better than the previous one.

S3 The 'best estimate' solution
A more sophisticated method attempted to make use of the other known input values to provide a more accurate estimate of the value for the unknown input. In our case, with seven inputs, a net with six inputs was trained on the full training set to produce an estimate of the seventh. Seven such nets were trained so that a missing value in any input could be estimated. The full classification net was then used with the six known values and the seventh, estimated, value. This method will not work if the inputs are uncorrelated, but in most real world situations the inputs are correlated (e.g. seed perimeter and seed area).
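For concreteness, the three substitution strategies can be sketched as follows. This is illustrative rather than the authors' code: in particular, a least-squares linear model stands in for the six-input estimator subnets of S3, and the training matrix is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(298, 7))  # stand-in training inputs
train_means = X_train.mean(axis=0)

def substitute_zero(x, missing):
    """S1: replace each missing input with 0, the mid-range value."""
    x = x.copy()
    x[list(missing)] = 0.0
    return x

def substitute_average(x, missing):
    """S2: replace each missing input with its training-set average."""
    x = x.copy()
    x[list(missing)] = train_means[list(missing)]
    return x

def fit_estimator(X, target):
    """Fit a predictor of input `target` from the other six inputs.
    A least-squares linear model stands in for the trained subnet."""
    known = [i for i in range(X.shape[1]) if i != target]
    A = np.hstack([X[:, known], np.ones((X.shape[0], 1))])  # bias column
    coeffs, *_ = np.linalg.lstsq(A, X[:, target], rcond=None)
    return known, coeffs

def substitute_estimate(x, target, estimator):
    """S3: fill in one missing input using its estimator."""
    known, coeffs = estimator
    x = x.copy()
    x[target] = np.append(x[known], 1.0) @ coeffs
    return x

estimators = [fit_estimator(X_train, i) for i in range(7)]  # one per input
x = X_train[0].copy()
print(substitute_estimate(x, 3, estimators[3]))  # estimate input 3
```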

5 Performing the classification without using the missing input

Three different approaches to implementing this concept were attempted and tested. Two of these used a network with a reduced number of inputs, whilst the third used flags to indicate whether each input was known or not.

R1 Trained reduced net
This first method was similar to S3. Several subnets were trained, each with one of the original inputs missing. However, rather than being trained to produce an estimate for the value of the missing input, the subnets were trained to classify the inputs directly. In other words, the subnets were performing the classification purely on the basis of the input values that were known.

R2 Derived reduced net
The second method was perhaps the most interesting approach from a theoretical point of view. The aim was to train the net on complete training data and then, when a missing value was encountered, to redistribute the weights leading from the missing input node by rescaling the other weights. Thus it was hoped to cancel out the effects of removing this input whilst maintaining the proportional relationships between the remaining weights. Alternatively, this could be implemented by temporarily altering the weight associated with the threshold. When applying the network to an input pattern with missing values, the threshold used was adjusted to maintain the same ratio between the threshold and the absolute sum of the weights connected to known inputs as between the original threshold and the absolute sum of all the weights. The training set for this method was complete, that is, without missing values. The aim was to use the basic network normally when all inputs were known, only modifying the operation of the net when unknown values were encountered. This method was the only one examined which attempted to modify the internal structure and operation of the network after it had been trained.

F1 A flagged network
The final approach was to associate each input value with a pair of input nodes, rather than just a single node as in the conventional network. A flag node was used to indicate whether the value was known or not, with -1 signalling a known value and 1 indicating a missing value. In the case of a known value, the second node had this value input to it as usual. In the case of a missing value, if the net was correctly interpreting the flag node then the actual input given to this node should not matter. However, in practice it was found useful to perform substitution on these value nodes, using one of the simpler substitution methods described above rather than just using a random value. This method involved training on data with missing values: for each exemplar in the training set an input was chosen randomly and the two nodes corresponding to it were set accordingly.
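As an illustration of the R2 threshold rescaling rule and the F1 input encoding described above, here is a minimal sketch. It is not the authors' implementation: the array shapes, the 7:8:10 layer size and all names are assumptions made for the example.

```python
import numpy as np

def rescaled_thresholds(weights, thresholds, known):
    """R2: adjust each hidden node's threshold so that the ratio
    threshold : sum(|weights from known inputs|) matches the original
    ratio threshold : sum(|weights from all inputs|).

    weights:    (n_hidden, n_inputs) input-to-hidden weight matrix
    thresholds: (n_hidden,) hidden-node thresholds
    known:      indices of the inputs whose values are available
    """
    total = np.abs(weights).sum(axis=1)
    partial = np.abs(weights[:, known]).sum(axis=1)
    return thresholds * partial / total

def flagged_encoding(x, missing, fill):
    """F1: encode each input as a (flag, value) pair of nodes:
    flag = -1 for a known value, +1 for a missing one, with a
    substituted value (e.g. the training average) in the value slot."""
    out = []
    for i, v in enumerate(x):
        if i in missing:
            out.extend([1.0, fill[i]])
        else:
            out.extend([-1.0, v])
    return np.array(out)

# Example: input 3 of a 7-input pattern is missing.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 7))      # hypothetical 7:8:10 net, first layer
t = rng.normal(size=8)
known = [0, 1, 2, 4, 5, 6]
t_adj = rescaled_thresholds(W, t, known)
# The forward pass would then use W[:, known], t_adj and the known inputs.
x = rng.uniform(-1.0, 1.0, size=7)
enc = flagged_encoding(x, missing={3}, fill=np.zeros(7))  # 14 input nodes
```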

6 Results and analysis

For the complete data set (no missing values) a backpropagation network was able to classify 57-62% (average 60.5%) of the test set correctly. This is, therefore, an upper limit on the number of correct classifications which could be expected once missing values were introduced to the test set.

Single missing values

The results obtained for each method on test cases with a single missing value are given in Table 1. The values shown are averages over several test runs and net configurations, commencing with different initial weights and with training exemplars presented in different random order.

Method               Full Data   S1      S2      S3
Average % correct    60.5        34.5    43.6    57.3

Method               R1      R2      F1
Average % correct    55.8    33.8    52.9

Table 1: Average percentage of correct classifications, full data and one missing value data.

As can be seen, the most effective methodologies were subnet estimation of missing values (S3), direct subnet classification of inputs (R1) and the flagged network (F1). All of these methods correctly classified around 53-57% of the test cases with one missing value, which in the best case (S3) is only 5% lower than the success rate achieved by the basic network with no missing values. One interesting point about the flagged network was that the value substituted in place of missing values did have an effect on performance (with better results when the average was used rather than a random substitution), even though such values were indicated by their associated flag node. It would appear that the net is failing to learn fully the significance of the indicator nodes. Possibly better results (or faster training times) could be obtained by commencing with a simplified training set which emphasises the nature of the indicator nodes before progressing to the full training set (e.g. for the first few iterations use only the average of all the examples of that type).

The zero or mid-range substitution method (S1) did not perform as well as the more sophisticated methods. However, this may be offset by its simplicity and low implementation cost. For the weed seed test data, although zero is the mid-range value, using the average as a substitute gave better results. This is due to the fact that many of the input values were skewed towards negative values, so that the average was well below zero. For data with a more even distribution the difference in performance between these two substitution values is unlikely to be as significant, although it might be expected that the average would still generally be the better choice.

The worst results were obtained by the derived reduced net (R2), which would seem to indicate that the methods used to attempt to factor out the effects of the missing inputs failed to accomplish this task. However, this method still has theoretical interest and is worthy of further study, as it holds the promise of producing results similar to those obtained by direct subnet classification, without the need for training several nets and with the ability to extend to cases of multiple missing values. An attempt was made to improve this method by training a basic net with all data, then examining the changes in weights when one input node was removed and the net retrained, in the hope that a more effective means of redistributing the weights to counter the effects of the missing value could be found. However, whilst there was some evidence of a pattern in the relationship between the initial and retrained weights, we were unable to isolate this with sufficient clarity to apply it to the derived reduced net.

Two missing values

Methods S1, S2 and F1 were tested for performance with two missing values. Expanding the direct subnet classification technique (R1) to cover the situation of multiple missing values would mean creating a separate net for every possible combination of missing and known values, and was not thought worthwhile; for similar reasons direct expansion of the best estimate method (S3) was rejected. However, by combining the subnet (S3) and substitution (S2) methods it is possible to produce results using only the initial subnets. For each missing value the appropriate subnet is used to produce an estimate of that value, with any further missing inputs to the subnet replaced by the average. The estimates produced can either be used directly as inputs to the main network, or passed back as inputs to the subnets to refine the estimates. This 'bootstrapping' process can be continued iteratively until the values stabilise. The process was tested with a single iteration of bootstrapping (S2/S3a) and with ten iterations (S2/S3b).

The threshold modification approach (R2) performed so poorly in the presence of a single missing value that it was deemed unnecessary to test it under the more difficult conditions of multiple missing values. The substitution methods obviously require no adaptation to the multiple missing value case, with every missing value being substituted as before. The flagged node net was tested using its original training data (F1a) and using training data with two missing values for each exemplar (F1b).

Method               S1      S2      S2/S3a
Average % correct    27.3    43.6    49.4

Method               S2/S3b  F1a     F1b
Average % correct    48.6    39.0    44.5

Table 2: Average percentage of correct classifications, two missing value data.

As expected, results were generally lower than those obtained with only single missing values. The zero substitution method (S1) yielded the poorest performance, whilst average substitution (S2) actually had the same success rate as with single missing values. The performance of the flagged network was affected by the training data presented to it. The network trained on data with two missing values (F1b) performed better than the net trained with single missing values (F1a). It would, therefore, appear that to give the best performance in the case of an unspecified number of missing values the flagged network should be trained on exemplars with a varying number of missing inputs.

The best results were obtained by the combination of average substitution and subnet estimation of missing values (S2/S3), which correctly identified almost 50% of the test cases. On average, the additional iterations of the bootstrapping process (S2/S3b) had little effect on the success rate. However, this method performed particularly well for some combinations of inputs (around 60% correct) and very poorly for others (less than 20% correct). It would appear that the extra iterations are actually harmful in cases where the missing inputs are poorly correlated with the remaining inputs, but beneficial in other circumstances.
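The combined S2/S3 bootstrapping loop can be sketched as follows, again using least-squares stand-ins for the estimator subnets; a single pass of the inner loop corresponds to S2/S3a and ten passes to S2/S3b. All names and the stand-in data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.uniform(-1.0, 1.0, size=(298, 7))   # stand-in training data
train_means = X_train.mean(axis=0)

def fit_estimator(X, target):
    """Least-squares stand-in for the six-input estimator subnet of S3."""
    known = [i for i in range(X.shape[1]) if i != target]
    A = np.hstack([X[:, known], np.ones((X.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(A, X[:, target], rcond=None)
    return known, coeffs

estimators = [fit_estimator(X_train, i) for i in range(7)]

def bootstrap_estimates(x, missing, n_iter=10):
    """S2/S3: fill missing inputs with the training-set average (S2),
    then iteratively re-estimate each one from its subnet (S3),
    feeding the current estimates of the other missing inputs back in."""
    x = x.copy()
    for idx in missing:
        x[idx] = train_means[idx]          # initial S2 substitution
    for _ in range(n_iter):
        for idx in missing:
            known, coeffs = estimators[idx]
            x[idx] = np.append(x[known], 1.0) @ coeffs
    return x

x = X_train[0].copy()
print(bootstrap_estimates(x, missing=[2, 5], n_iter=10))
```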

Acknowledgement We are grateful to our colleague, Mr P Collier, for the use of the seed data.

References

[1] Nie NH, Hull CH, Jenkins JG, Steinbrenner K and Bent DH (1970), SPSS: Statistical Package for the Social Sciences, McGraw-Hill Book Company, New York.

[2] Quinlan JR (1987), Decision Trees as Probabilistic Classifiers, Technical Report 87.6, New South Wales Institute of Technology.
