Adaptive Preprocessing for On-Line Learning with Adaptive Resonance Theory (ART) Networks

Harald Ruda & Magnús Snorrason
Charles River Analytics, 55 Wheeler St., Cambridge, MA 02138
[email protected], [email protected]
Abstract—Neural networks based on Adaptive Resonance Theory (ART) are capable of on-line learning. However, a limiting factor in on-line processing has been the need to preprocess input patterns so that features fall in the range [0.0, 1.0],¹ typically done with scale factors that depend on the input range of each feature. This paper demonstrates a method by which the scaling of features becomes adaptive, eliminating the need to batch-process patterns before presenting them to the ART network. The resulting network implementation for on-line learning does not require any prior knowledge of the feature signals, their ranges or otherwise. A variety of implications of this scheme are analyzed.

¹ This is strictly true for Fuzzy-ART and Fuzzy-ARTMAP; ART2 and ART2-A do not technically require preprocessing of features other than ensuring that all features are non-negative. In practice, however, scaling of features is often necessary.
INTRODUCTION

A classifier is capable of on-line learning if it can learn to classify patterns as they are presented, without storing the patterns for reference. It has been suggested that ART networks are capable of such on-line learning [1, 2]. Even though this has not been proven theoretically, one finds that learning of the whole training set is usually accomplished in a single cycle (especially with the learning rate parameter beta = 1.0). It is therefore practical to use ART networks for on-line learning tasks.

Traditionally, classification algorithms have been restricted to batch-mode learning; in other words, the whole training data set must be stored and available for use by the algorithm. In such situations the scaling of input features is trivial: the training set can be searched for the extreme values of each feature, and those values (possibly with added safety margins) can be used as estimates of the true ranges of each feature.
There are applications where batch operation is not desirable. For example, if there are real-time constraints acting on the system, one does not want to retrain the classifier from scratch each time a new pattern class is identified; rather, one wants an on-line classifier that can add knowledge about the new pattern to its previous knowledge. As another example, the classifier may be required to train on such a large number of patterns that presenting the patterns more than once is too time-consuming or otherwise resource intensive. These are some situations in which on-line learning is required.

Unless the true range for each feature is already known, these situations present a real problem for feature preprocessing. It is no longer possible to determine a priori the extreme values of features in the training set, because the whole training set is never available at once. This is the problem that our adaptive preprocessing solves, by linking the scaling of features with the scaling of weights. Adaptive preprocessing is also very convenient, even in batch-processing situations, because it eliminates the need for analysis of input variable ranges.

ALGORITHM

There are two key parts to the algorithm needed for implementing adaptive preprocessing:

• The observed range for each feature must be tracked. The maximum ($f_{\max,i}$) and minimum ($f_{\min,i}$) values of each feature i have to be stored and updated every time they change.

• When input pattern p contains a value for feature i ($f_i^p$) that falls outside of the previously observed range, the range must be updated and the weights relating each cluster ($w_{ji}$, for all j) must be adjusted such that previously seen patterns would be coded the same way if they were presented again.
The adjustment is performed according to the equations below, which apply specifically to Fuzzy-ARTMAP [3] and Fuzzy-ART [4]. First the new range has to be determined according to:
$f'_{\min,i} = \min(f_{\min,i},\, f_i^p) \qquad \text{and} \qquad f'_{\max,i} = \max(f_{\max,i},\, f_i^p)$   (1)
where the prime (') indicates the updated value, as in $f'_{\min,i}$. If this new range is different from the old, then the cluster weights are scaled:
$w'_{ji} = w_{ji}\,\dfrac{f_{\max,i} - f_{\min,i}}{f'_{\max,i} - f'_{\min,i}} + \dfrac{f_{\min,i} - f'_{\min,i}}{f'_{\max,i} - f'_{\min,i}}$   (2)

and the complement weights:
$w^{c\,\prime}_{ji} = w^{c}_{ji}\,\dfrac{f_{\max,i} - f_{\min,i}}{f'_{\max,i} - f'_{\min,i}} + \dfrac{f'_{\max,i} - f_{\max,i}}{f'_{\max,i} - f'_{\min,i}}$   (3)
where, for each feature i, the weights to every cluster j are adjusted. Whether or not the range changed in (1), each feature in every pattern must be normalized by mapping the observed feature range to [0.0, 1.0]:
$f^{\,p}_{\mathrm{normalized},i} = \dfrac{f_i^p - f'_{\min,i}}{f'_{\max,i} - f'_{\min,i}}$   (4)
Equations (2 – 4) provide a linear mapping of feature and weight values, but the general approach will also work with nonlinear mappings. As shown in Figure 1, clusters that were not selected by the current input pattern merely shrink through (2) and (3) as the observed range is expanded. Patterns which previously selected those clusters will still do so, because the expanded range also shrinks features by (4). Cluster growth occurs in the normal Fuzzy-ART manner of weight update in response to training patterns. If a given input feature value is outside the previously observed range, the cluster weight representation for that feature will expand if it is selected by the current input pattern. In summary, for the selected cluster, the weight representation can shrink for some features and simultaneously expand for others.
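For concreteness, the update can be expressed in a few lines of code. The following NumPy sketch is ours, not the original implementation: it assumes the direct and complement weight components of all clusters are stored in two arrays, w and wc, each of shape (n_clusters, n_features), and applies equations (1)-(4) feature by feature.

```python
import numpy as np

class AdaptiveScaler:
    """Tracks per-feature ranges and rescales cluster weights (eqs. 1-4)."""

    def __init__(self, n_features):
        self.fmin = np.full(n_features, np.inf)    # observed minima, f_min,i
        self.fmax = np.full(n_features, -np.inf)   # observed maxima, f_max,i

    def process(self, f, w, wc):
        """Absorb raw pattern f, rescale weights in place, return f in [0, 1].

        f  : (n_features,) raw input pattern
        w  : (n_clusters, n_features) direct weight components
        wc : (n_clusters, n_features) complement weight components
        """
        f = np.asarray(f, dtype=float)
        fmin_new = np.minimum(self.fmin, f)        # equation (1)
        fmax_new = np.maximum(self.fmax, f)

        changed = (fmin_new < self.fmin) | (fmax_new > self.fmax)
        # Only rescale features whose old range was already well defined.
        ok = changed & np.isfinite(self.fmin) & (self.fmax > self.fmin)
        if np.any(ok):
            old_span = self.fmax[ok] - self.fmin[ok]
            new_span = fmax_new[ok] - fmin_new[ok]
            scale = old_span / new_span
            w[:, ok] = (w[:, ok] * scale
                        + (self.fmin[ok] - fmin_new[ok]) / new_span)   # equation (2)
            wc[:, ok] = (wc[:, ok] * scale
                         + (fmax_new[ok] - self.fmax[ok]) / new_span)  # equation (3)

        self.fmin, self.fmax = fmin_new, fmax_new

        # Degenerate ranges (only one value seen so far) map to 0.
        span = np.where(self.fmax > self.fmin, self.fmax - self.fmin, 1.0)
        return (f - self.fmin) / span                                  # equation (4)
```

A Fuzzy-ART module would then present the complement-coded pair (x, 1 − x) of the returned normalized pattern x to its category layer in the usual way.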
[Figure 1 graphic: one feature f mapped from feature space (fmin, fmax, f'max) to the [0.0, 1.0] weight space, showing clusters A → A' and B → B'.]
Figure 1. A graphical representation of one feature (f) and the mapping of values from the original feature space to the [0.0, 1.0] weight space. At a given point in the training cycle, values of f from fmin to fmax have been observed and two clusters have been learned, A and B. Later in training, f'max is observed, which is larger than fmax. According to (2), the weight values representing A and B change. f'(A') shrinks because the mapping function now has a shallower slope. This is true for any cluster, other than the one which contains f'max, i.e. f'(B').
RESULTS

Figures 2 and 3 show results of tests to verify that adaptive preprocessing does not interfere with the proper functioning of the network. We used benchmark machine learning databases from the University of California, Irvine repository at ftp://ftp.ics.uci.edu/pub/machine-learning-databases.

The first test is a two-class problem based on categorical data. The Agaricus-Lepiota mushroom database contains over 8000 patterns describing mushrooms from 23 species of the genera Agaricus and Lepiota. A pattern consists of 22 features (such as “stalk-shape” or “odor”), each with 2-12 possible categorical values (such as “tapered” or “pungent”). The classification task is to distinguish between edible and poisonous mushrooms, which often look very similar.

The second test is a difficult six-class problem based on quantitative data. The glass database contains 214 patterns of 9 different measurements of chemical properties of glass. The classification task is to separate the data into 6 classes based on these 9-dimensional patterns.
[Figure 2 graphic: bar chart of Percent Correct (50-100%) versus Number of Training Patterns (10, 40, 80, 320) for the Adaptive, On-line, Full, and Limited conditions.]
Figure 2. A test to demonstrate that adaptive preprocessing does not cause degradation of classification performance. Fuzzy ARTMAP was used to classify the Agaricus-Lepiota mushroom database. Four processing conditions are shown: (Adaptive) uses the adaptive preprocessing. (On-line) uses the adaptive preprocessing with the on-line constraint of allowing only one cycle of learning for each set of training patterns. (Full) uses batch-scaling of features based on $f_{\min,i}$ and $f_{\max,i}$ over the whole data set. (Limited) uses batch-scaling of features based on $f_{\min,i}$ and $f_{\max,i}$ over the training set only. The mean and standard deviation of each bar were estimated from 50 different random selections of training sets. “Percent Correct” is the average over the evaluation set, which contains 8124 patterns minus the number of training patterns.
Notice the minimal effect of on-line learning, i.e. presenting each pattern only once, on the results in Figure 2. The penalty for on-line learning using adaptive preprocessing is never more than 3.5%, and at the best average performance (320 training patterns) the two are almost identical.
[Figure 3 graphic: bar chart of Percent Correct (0-70%) versus Number of Training Patterns (10, 20, 40, 80) for the Adaptive, On-line, Full, and Limited conditions.]
Figure 3. Same information and methods as in Figure 2, but for the glass database. The evaluation set contains 214 patterns minus the number of training patterns.
[Figure 4 graphic: per-feature ranges and means for the nine glass features, each plotted on its own scale.]
Figure 4. The ranges and means of each feature from the glass database. Feature 1 represents the refractive index of the sample; the other features represent concentrations of different oxides.
It is evident from the lower overall scores in Figure 3 that discriminating between the 6 glass categories is much harder than classifying the mushrooms. Still, adaptive preprocessing works just as well as fixed preprocessing based on the training part of the set, and consistently better than fixed preprocessing based on the whole set. Figure 4 shows that the natural range of the features is large for some features (such as 2, 5, and 7) and very small for others (1 in particular), and that the distribution within that range is very skewed for some (such as features 6, 8, and 9).

TYPES OF LEARNING IN FUZZY-ARTMAP

When adaptive preprocessing is added to Fuzzy-ARTMAP, the result is a network with four distinct types of learning that can be individually controlled (listed here from high to low level; a sketch of how these switches might be exposed follows the list):

• Supervised learning, or learning in the map field. This type of learning determines the connections between clusters and output classes. It is performed in training mode, i.e. when the current input pattern has an associated output class. Several clusters can map to the same output class, implementing decision surfaces of arbitrary complexity.

• Creation of new cluster nodes. When no previously established cluster meets the “vigilance” criterion, a new cluster is formed. This can be disallowed, forcing the choice of a previously created cluster. Conversely, cluster creation can be allowed during testing, although the mapping to an output class has to be deferred until relevant training data is presented.

• Modification of cluster weights. This is the type of learning Fuzzy-ARTMAP has in common with most other neural networks. Learning of this type is usually allowed in training mode. In certain situations, the network can also be allowed to update weights in testing mode.

• Updating of input feature ranges and the associated adjustment of cluster weights. Ordinarily, this type of learning applies in both training and testing modes (but see the section below on dealing with noisy data).
The choice of learning types allowed during training and testing modes is application dependent, and the number of options makes Fuzzy-ARTMAP with adaptive preprocessing applicable in a variety of situations.

ADAPTIVE PREPROCESSING

Nonlinear Mappings

Mapping the range of each input feature to the usable [0, 1] range in order to maximize the effectiveness of the network also opens the possibility of using a nonlinear mapping rather than a linear transformation. Indeed, any monotonic (and therefore one-to-one) function can be used for this purpose. Among the functions that might be used instead of lines are sigmoids (logistic or tanh), logarithmic, exponential, and power functions. At the moment the choice of scaling function has to be determined manually, with some knowledge of the features used, and a different function may be used for each feature. An intriguing (but computationally more expensive) option is to track the mean and variance of each input feature instead of the min and max. A sigmoid is then the natural choice of mapping function, and would seem ideal for features that are normally distributed.

Applicable Network Types

This adaptive preprocessing scheme is designed for analog ART networks that use complement coding, but it will work for any analog ART network, such as ART2, ART2-A, and Fuzzy-ART. The procedure should work equally well for any network where the weights can be considered templates of input features. Another consideration is whether a situation calls for feature normalization, which is often used in practice even if not theoretically required for the given paradigm.

A mean-and-variance tracking version of the procedure could be applied to multi-layer perceptrons. The input features would be normalized to zero mean and unit variance. Changes in the variance of a given feature affect the importance of the weights propagating from that feature; these weights can be adjusted according to the reciprocal of the variance change. Changes in the mean have higher-order effects on the required weight adjustments; these effects could possibly be ignored, or accounted for with bias adjustments. A deeper issue is the amount of training required for multi-layer perceptrons. As mentioned in the introduction, on-line learning is a major impetus for the described procedure, and multi-layer perceptrons are particularly ill suited for on-line learning due to their extensive training requirements. Until on-line learning abilities are developed for multi-layer perceptrons, we do not see a need to adapt our procedure to such networks.

Noise Tolerance

The adaptive preprocessing method introduced here can be sensitive to outliers, just as traditional fixed scaling methods are. One solution is to detect and discard extreme outliers that would otherwise adversely compress the useful range in weight space. If outliers cannot be detected, there are still several methods available to avoid problems. The simplest possibility is to predetermine absolute bounds for each feature and clip any feature values that fall outside these bounds. Another possibility is to keep a running mean and variance, as discussed above, and to clip any values that fall outside a predetermined number of standard deviations.
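As a concrete sketch of this running-statistics option (ours; the threshold k, the warm-up count, and all names are illustrative assumptions), the class below maintains a per-feature mean and variance with Welford's online update and clips incoming values at k standard deviations before they are handed to the adaptive scaler:

```python
import numpy as np

class RunningClipper:
    """Per-feature running mean/variance (Welford) with k-sigma clipping."""

    def __init__(self, n_features, k=4.0, warmup=20):
        self.k = k                      # allowed number of standard deviations
        self.warmup = warmup            # patterns seen before clipping starts
        self.n = 0
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)  # running sum of squared deviations

    def clip(self, f):
        """Clip pattern f to mean +/- k*std, then fold it into the statistics."""
        f = np.asarray(f, dtype=float)
        if self.n >= self.warmup:
            std = np.sqrt(self.m2 / (self.n - 1))
            f = np.clip(f, self.mean - self.k * std, self.mean + self.k * std)
        self.n += 1                     # Welford's online update
        delta = f - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + delta * (f - self.mean)
        return f
```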
A more elegant method is the use of a compressive nonlinearity, such as a sigmoid function; in that case, no feature value is large enough to adversely compress the useful range. It is also possible to use a learning rate parameter for the feature-range and weight adjustments, such that the range does not immediately expand to encompass noisy data; several such data points would then be necessary to establish a new bound. In a real application, the simultaneous use of several of these methods would be prudent.

Note that by (1), the observed range for each feature can only grow. A continuous baseline shift in a given feature will therefore cause the range to expand continuously, mapping the useful range of that feature into an ever-decreasing portion of the [0.0, 1.0] interval. This can eventually lead to resolution problems, but with the floating-point representation used in most computers, an extensive baseline shift is required. The method has the distinct advantage of breaking down gracefully as the resolution decreases, unlike methods based on clipping.

CONCLUSION

The use of adaptive preprocessing is necessary for true on-line learning with Fuzzy-ARTMAP, and it can greatly simplify the network's use in a variety of other applications. The procedure is a computationally simple addition that does not sacrifice performance, yet greatly increases system robustness. Additionally, if classification rules are to be extracted from the network, the cluster weights can be given direct meaning in terms of non-normalized feature values. Finally, there is potential for adapting the method to other on-line learning paradigms.

REFERENCES

[1] Carpenter, G. & Grossberg, S. (1987a). A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine. Computer Vision, Graphics, and Image Processing, 37, 54-115.

[2] Carpenter, G. & Grossberg, S. (1987b). ART2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns. Applied Optics, 26, 4919-4930.

[3] Carpenter, G., Grossberg, S., Markuzon, N., Reynolds, J. H., & Rosen, D. B. (1992). Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analog Multidimensional Maps. IEEE Transactions on Neural Networks, 3(5), 698-713.

[4] Carpenter, G., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast Stable Learning and Categorization of Analog Patterns by an Adaptive Resonance System. Neural Networks, 4, 759-771.