IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 30, NO. 1, FEBRUARY 2000
To Reject or Not to Reject: That is the Question— An Answer in Case of Neural Classifiers Claudio De Stefano, Carlo Sansone, and Mario Vento
Abstract—In this paper a method defining a reject option applicable to a given 0-reject classifier is proposed. The reject option is based on an estimate of the classification reliability, measured by a reliability evaluator ψ. Trivially, once a reject threshold σ has been fixed, a sample is rejected if the corresponding value of ψ is below σ. Obviously, as σ represents the least tolerable classification reliability level, the reject option becomes more or less severe as its value varies. In order to adapt the behavior of the reject option to the requirements of the considered application domain, a function P characterizing the reject option's adequacy to the domain has been introduced. It is shown that P can be expressed as a function of σ and, consequently, the optimal value of σ is defined as the one which maximizes the function P. The method for determining the optimal threshold value is independent of the specific 0-reject classifier, while the definition of the reliability evaluators is related to the classifier's architecture. General criteria for defining appropriate reliability evaluators within a classification paradigm are illustrated in the paper; they are based on the localization, in the feature space, of the samples that could be classified with a low reliability. The definition of the reliability evaluators for three popular neural network architectures (back-propagation, learning vector quantization, and probabilistic neural network) is presented. Finally, the method has been tested on a complex classification problem with data generated according to a distribution-of-distributions model.
Index Terms—Classification reliability, neural networks, reject option.
I. INTRODUCTION
IN RECENT YEARS, neural networks have been widely employed in many classification problems [1]–[5]. To this end, different network paradigms have been taken into consideration and much attention has been devoted to the criteria for suitably training [6]–[14] and sizing the network [15]–[19]. However, in real classification systems, other problems often hinder the achievement of satisfactory results: for instance, different classes of objects may contain some identical descriptions (overlapping regions), or the objects to be recognized may even be significantly different from those used to train the network. In such cases the classification can be unreliable and the risk of committing an error becomes high. It is then particularly useful to estimate the classification reliability in order to evaluate the convenience of rejecting the input sample rather than risking a wrong classification.

Manuscript received October 6, 1995; revised March 26, 1999. C. De Stefano is with the Facoltà di Ingegneria, Università del Sannio, I-82100 Benevento, Italy. C. Sansone and M. Vento are with the Dipartimento di Informatica e Sistemistica, Università di Napoli "Federico II," I-80125 Naples, Italy. Publisher Item Identifier S 1094-6977(00)02041-1.
The introduction of a reject option therefore aims to reject the highest possible percentage of samples which would otherwise be misclassified (i.e., misclassified without a reject option). However, it is worth noting that, even when used adequately, this criterion introduces a side effect whereby some samples that would otherwise have been correctly classified are rejected. It is, therefore, necessary to assess the utility of introducing a rejection criterion, i.e., to establish to what extent a reduction in the classification rate is acceptable with respect to the advantages deriving from a reduction in the number of misclassified samples. Moreover, this utility cannot be expressed in absolute terms but depends on the specific needs of the application domain. In some contexts, it is desirable to reduce the error rate as much as possible, even at the price of a significant reduction of the classification rate, while in other kinds of applications the objective may be the opposite. The first type of scenario could be one in which the correction of a misclassified sample has a high cost, for instance an error in an automatic postal delivery system due to an incorrect interpretation of the ZIP code. Vice versa, in other applications it may be desirable to carry out the classification regardless, even at the risk of a high error rate. This is the case, for instance, of a character classifier used in applications in which the text must in any case be extensively edited by hand afterwards.

The problem of defining a reject option has been tackled only occasionally in the literature. In particular, in [20] the problem of achieving an optimal tradeoff between error and reject rates has been addressed from a probabilistic viewpoint; in [21] a rejection criterion applicable to the nearest neighbor classifier is defined, while in [22] an experimental study is presented to evaluate the rejection capability of feed-forward neural networks in the presence of unreliably classified samples. In all the above cases, however, the suitability of making a rejection has not been evaluated on the basis of the requirements of the specific application domain.

In this paper, a method that defines a reject option adaptive to the application domain is proposed. The requirements of the domain are characterized by assigning a cost coefficient to the misclassified, rejected, and correctly classified samples. These costs represent the penalties attributed to each rejected and misclassified sample, and the gain associated to each correctly classified sample. The reject option can be applied to any 0-reject classifier and is based on the use of a reliability evaluator ψ which is able to estimate the generic sample's classification reliability. Trivially, once a reject threshold σ has been fixed, a sample is rejected if its ψ value is lower than σ.
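In code terms, the resulting decision scheme is simply the following. This is a minimal sketch of ours, not the paper's notation: `classifier` and `psi` stand for a generic trained 0-reject classifier and its reliability evaluator, both characterized later in the paper.

```python
def classify_with_reject(sample, classifier, psi, sigma):
    """Basic reject rule: accept the 0-reject decision only when the
    reliability estimate clears the threshold sigma."""
    label = classifier(sample)   # 0-reject decision
    if psi(sample) < sigma:      # reliability below the least tolerable level
        return None              # reject the sample
    return label                 # accept the classification
```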
In order to adapt the behavior of the reject option to the requirements of the considered domain, a function P (effectiveness function) characterizing the adequacy of the reject option to the domain has been introduced. As shown in Section II, P is expressed in terms of the previously defined cost coefficients and is defined so as to be independent of the absolute performance of the 0-reject classifier. It is shown that P can be expressed as a function of σ and, consequently, the optimal value of σ is assumed as the one which maximizes P.

The present paper is a generalization of [23], which also defines a method for determining the best value of the reject threshold for a given application domain. In [23] the reject option is implemented by means of two classification rules applicable to any neural classifier. The results obtained are evaluated and reported with reference to multilayer perceptron (MLP) neural classifiers in the context of an OCR application on isolated omni-writer handwritten characters. The classification rules, which may also be applied to other neural network paradigms, turned out to be particularly suitable for MLP networks. In the present paper, the method for determining the optimal threshold is generalized and rendered independent of the architecture of the considered classifier, thus making it applicable to any kind of classifier. This goal has been achieved by generalizing the two classification rules of [23] and defining reject criteria operating only on the basis of the reliability evaluator ψ, which is related to the classifier's architecture. Furthermore, general criteria for defining appropriate reliability evaluators within a given classification paradigm are illustrated here. They are based on the localization, in the feature space, of the samples that could be classified with a low reliability. The definition of the reliability evaluators for three popular neural network architectures is illustrated.

The paper is organized as follows: in Section II the rationale of the method is illustrated, and the algorithm for determining the optimal value of the reject threshold is described. In Section III the situations that typically yield unreliable classifications are characterized and the corresponding evaluators are defined with reference to three neural network paradigms; in particular, we considered the multilayer perceptron trained with the back-propagation algorithm, the learning vector quantization network, and the probabilistic neural network. In Section IV the experimental results relative to a complex classification problem, with data generated on the basis of the distribution-of-distributions model [24], are presented.

II. THE PROPOSED METHOD

As anticipated in the Introduction, the reject option is based on the evaluation of the reliability of each classification by means of a reliability evaluator ψ. ψ is defined on the basis of the output vector of the classifier and assumes values in the range [0, 1]; for a given sample, if the value of ψ is greater than a threshold σ, the classification will be accepted, otherwise the sample will be rejected (cf. Fig. 1).

The introduction of a reject option produces two opposite effects. On the one hand, all the misclassified samples having a ψ value less than σ are rejected; this effect is undoubtedly desirable, since the cost of a misclassification is always higher than the cost of a reject.
Fig. 1. The architecture of the proposed method: the reject option operates on the basis of the reliability evaluator ψ. The optimal value of the reject threshold σ is established through a training phase, as described in the paper.
On the other hand, the correctly classified samples with values of ψ less than σ are also rejected, and this is an undesirable side effect since it contributes to the reduction of the classification rate. In order to evaluate the true utility of the reject option, we therefore need to introduce a suitable function P (effectiveness function) evaluating the relative weights of these two effects in the considered application domain.

The effectiveness of the reject option can be measured by attributing a cost to a single misclassification (C_e) and rejection (C_r), and a gain to each correctly classified sample (C_c). These costs, which are in general functions of the classification, error, and reject rates, are used to weight the number of the misclassified, rejected, and correctly classified samples, respectively. In this way, given a generic set of samples and using R_c, R_e, and R_r to denote the percentages of correct classifications, misclassifications, and rejections, respectively, the function P can be generically expressed in the form

P = F(R_c, R_e, R_r).   (1a)

For P to actually measure the effectiveness of the reject option, it must be an increasing function with respect to R_c and a decreasing function with respect to R_e and R_r, that is,

∂P/∂R_c > 0,   ∂P/∂R_e < 0,   ∂P/∂R_r < 0.   (2a)

A further condition is added to point out that a misclassification is expected to have a more negative effect on P than a reject:

∂P/∂R_e < ∂P/∂R_r.   (2b)

In principle no further hypotheses on the form of P are necessary. It could be worthwhile, however, rendering P independent of the absolute performance of the 0-reject classifier so as to measure the real increase in performance whenever the reject option is adopted. To this end, denoting the recognition and the error rates at 0-reject with R_c0 and R_e0 respectively, we have

P = F(R_c − R_c0, R_e − R_e0, R_r).   (1b)

For most applications the costs introduced above may be considered constant and the function P linear. In fact, the costs are generally attributed by considering the burden of locating and correcting the error for the misclassified samples, and just the correction for the rejected samples.
Consequently, it is reasonable to assume that such a burden is independent of the relative number of correctly classified, misclassified, or rejected samples (i.e., that C_c, C_e, and C_r remain constant) and that P is proportional to the overall gain of the correctly classified samples, just as it is to the cost of all rejected and misclassified samples (i.e., P is linear). Under these hypotheses, and with (2a) holding, the function P can be written in the form

P = C_c (R_c − R_c0) − C_e (R_e − R_e0) − C_r R_r.   (3)

Note that (3) satisfies (2b) as C_e > C_r. In the following, the results of the method are reported in the case where P assumes the form (3). It must be explicitly noted that the method remains valid regardless of the form of P, and paper [23] shows how the discussion can be extended to the case of a generic P function.

It can be noted that, since R_c, R_e, and R_r depend on the value of the reject threshold σ, (3) is also a function of σ. To highlight the dependence of P on σ, let us consider the occurrence densities of correctly classified and misclassified samples as a function of the value of ψ; let us call them D_c(ψ) and D_e(ψ), respectively. For the sake of convenience, the density curve D_c(ψ) (D_e(ψ)) has been defined so that its integral extended to the interval [ψ1, ψ2] provides the percentage of correct classifications (misclassifications) with a value of ψ ranging from ψ1 to ψ2. From the definitions of D_c and D_e it follows that

∫_0^1 D_c(ψ) dψ = R_c0   and   ∫_0^1 D_e(ψ) dψ = R_e0.   (4)

From these distributions, the percentage R_r of the rejected samples can be easily calculated: in fact, considering the percentages of samples, correctly classified and misclassified at 0-reject, whose value of ψ is less than σ (referred to as R_cr and R_er, respectively), we have (see Fig. 2)

R_r(σ) = R_cr(σ) + R_er(σ) = ∫_0^σ D_c(ψ) dψ + ∫_0^σ D_e(ψ) dψ.   (5a)

In the same way, the percentages of the correctly classified and misclassified samples, with the reject option, are given by

R_c(σ) = R_c0 − ∫_0^σ D_c(ψ) dψ   and   R_e(σ) = R_e0 − ∫_0^σ D_e(ψ) dψ.   (5b)

Fig. 2. Two qualitative distributions D_c and D_e. The percentages of correctly classified and misclassified samples which are rejected by the introduction of the threshold σ (referred to as R_cr and R_er, respectively) are shown in gray; R_c (R_e) represents the percentage of samples which are correctly classified (misclassified) after the introduction of the reject option.

Consequently, substituting (4) and (5) into (3), it follows that

P(σ) = −C_c ∫_0^σ D_c(ψ) dψ + C_e ∫_0^σ D_e(ψ) dψ − C_r [∫_0^σ D_c(ψ) dψ + ∫_0^σ D_e(ψ) dψ]   (6a)

hence

P(σ) = (C_e − C_r) ∫_0^σ D_e(ψ) dψ − (C_c + C_r) ∫_0^σ D_c(ψ) dψ.   (6b)

From (6b) it is evident that, since the two integral functions have a monotonically increasing trend and C_e > C_r, an increase in σ implies that the first term contributes to an increase in P, while the second contributes to a decrease. The introduction of the function P allows us to determine the optimal value σ* of the reject threshold as the value for which

P(σ*) = max_σ P(σ).   (7)

It is worth noting that, once the cost coefficients have been fixed, the maximum value assumed by P obviously depends on the distributions D_c and D_e. Moreover, these have a form which depends on the adopted reliability evaluator. In this regard, a reliability evaluator is better the more D_e is concentrated for values of ψ close to zero and D_c for values close to one. The ideal case is the one which allows all the misclassified samples at 0-reject to be turned into rejected ones, without lowering the classification rate. This situation occurs when the adopted reliability evaluator gives rise to distributions such as those presented in Fig. 3. In this case, it is possible to find two values of ψ, denoted as ψ′ and ψ″, such that

D_e(ψ) = 0 for ψ > ψ′   and   D_c(ψ) = 0 for ψ < ψ″,   with ψ′ < ψ″.

The ideal value of P, indicated with P_id, is therefore obtained by choosing a threshold value σ_id in the range [ψ′, ψ″]. In this case, (6b) becomes

P_id = (C_e − C_r) R_e0.   (6c)
Fig. 3. An example of distributions in which the ideal P value can be obtained by rejecting all the misclassified samples without lowering the classification rate.
As this condition cannot generally be achieved in real cases, it is necessary to establish general criteria to find reliability evaluators that, for the considered network architecture, allow us to get near to the ideal situation.

Once a reliability evaluator has been chosen and the distributions D_c and D_e have been obtained, the training of the reject option consists of the determination of the optimal value σ* of the reject threshold. For this purpose, let us assume we have a classifier operating at 0-reject and a set S of samples, which may even coincide with the one used for training the neural network. Under the hypothesis that the set S is sufficiently representative of the target domain, the optimal reject threshold value is obtained by finding the value of σ that satisfies (7) on the set S. To this end, let us calculate the derivative of expression (6b) with respect to σ and set it equal to zero. In this way, it turns out that, on S,

D_c(σ) / D_e(σ) = C_n   (8)

where C_n = (C_e − C_r) / (C_c + C_r) will hereafter be denoted as the normalized cost. For instance, with C_c = 1, C_r = 3, and C_e = 18, the normalized cost is C_n = 15/4 = 3.75.

In order to evaluate the solutions of this equation it must be considered that, as the functions D_c(ψ) and D_e(ψ) are not available in analytical form, they have to be experimentally determined in tabular form. The process for determining σ* is performed on the set S, once the cost coefficients for the given application domain have been fixed, and is described by the following training algorithm, denoted TA.

1) The set S is submitted to the 0-reject classifier and then split into the subset S_e of misclassified samples and the subset S_c of correctly classified samples.

2) For each sample of the set S_c, the value of the evaluator ψ is computed. The set of values of ψ obtained for S_c allows us to construct the occurrence density function D_c(ψ). In the same way, by using the set S_e the function D_e(ψ) can be determined.

3) The values of σ satisfying (8) are determined with a numerical algorithm.

4) The value of σ that corresponds to the absolute maximum of P, i.e., σ*, is selected from the values computed in the previous step. It may well be that several values satisfy (8), because the density curves do not necessarily have a monotonic trend.
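To make the training procedure concrete, here is a minimal sketch in Python, assuming the ψ values of the correctly classified and misclassified samples of S have already been computed. Function and variable names are ours, not the paper's. Instead of solving (8) numerically and then selecting the absolute maximum (steps 3 and 4 of TA), it simply evaluates P(σ) of (6b) on a grid of candidate thresholds and keeps the best one, which is equivalent on tabulated densities.

```python
import numpy as np

def train_reject_threshold(psi_correct, psi_error,
                           Cc=1.0, Ce=6.0, Cr=3.0, bins=100):
    """Sketch of the training algorithm TA: tabulate D_c and D_e as
    histograms of reliability values and pick the threshold sigma that
    maximizes the effectiveness function P(sigma) of (6b)."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    n = len(psi_correct) + len(psi_error)
    # Occurrence densities, normalized so their integrals give Rc0 and Re0.
    Dc, _ = np.histogram(psi_correct, bins=edges)
    De, _ = np.histogram(psi_error, bins=edges)
    Dc, De = Dc / n, De / n
    # Cumulative integrals of D_c and D_e up to each candidate sigma.
    int_Dc = np.cumsum(Dc)
    int_De = np.cumsum(De)
    # Effectiveness function (6b): P = (Ce - Cr)*int_De - (Cc + Cr)*int_Dc.
    P = (Ce - Cr) * int_De - (Cc + Cr) * int_Dc
    best = int(np.argmax(P))
    return edges[best + 1], P[best]   # optimal threshold sigma*, P(sigma*)
```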
Fig. 4. The dashed lines identify two regions of the feature space containing samples belonging to class I and class J: the symbols “x” and “o” in (a) represent samples that are significantly different from the majority of those present in the TS; in (b) they are samples lying in an overlapping region.
From (8) it can be deduced that the optimal threshold value σ* depends on the normalized cost C_n. In particular, it can be verified that all the triples of cost coefficients (αC_c, αC_e, αC_r), obtained as α varies, provide the same value of C_n, and hence the same solution of (8). Finally, it can be noted that C_c is generally much lower than the other cost coefficients; in this case the normalized cost C_n ≅ (C_e − C_r)/C_r, and consequently the optimal reject threshold value depends only on the ratio C_e/C_r.

III. RELIABILITY EVALUATORS

The low reliability of a classification can generally be attributed to two fundamental situations: a) the considered sample is significantly different from those present in the training set (hereafter indicated by TS) [see Fig. 4(a)], i.e., it is located in a region of the feature space far from those occupied by the samples of the TS; and b) the considered sample is located in an overlapping region, i.e., in a region of the feature space in which samples of the TS belonging to two or more classes are present [see Fig. 4(b)].

Even though it is theoretically possible to introduce a single reliability evaluator able to identify the two above situations simultaneously, the proposed method is based on the separation of the two problems, i.e., on the introduction of two reliability evaluators ψ_a and ψ_b that identify samples falling within situations a) and b), respectively. This proposal is based on the consideration that the samples that fall into situation a) might not fall into situation b) and vice versa. Under this hypothesis, the reject option operates in accordance with the scheme presented in Fig. 5, i.e., rejecting a sample if ψ_a or ψ_b is lower than the optimal threshold σ_a or σ_b, respectively.

Therefore, in the reject option training phase, the procedure for calculating the optimal threshold value is repeated twice. In particular, once the distributions D_c(ψ_a) and D_e(ψ_a) have been computed and the optimal threshold σ_a evaluated, the algorithm TA is reapplied substituting steps 1) and 2) with steps 1b) and 2b), respectively:
Fig. 5. The reject option works on the basis of the two evaluators ψ_a and ψ_b; the rejected samples are the ones whose ψ_a value is lower than σ_a, or whose ψ_b value is lower than σ_b.

1b) The set S is submitted to the 0-reject classifier, which splits it into the two subsets S_c and S_e. Then, on the basis of the previously determined value σ_a, three subsets are obtained: S′_c, made of those samples of S_c with ψ_a values greater than σ_a; S′_e, containing the samples of S_e with ψ_a values greater than σ_a; and S′_r, containing the samples of S_c and S_e with ψ_a values less than σ_a.

2b) The value assumed by the reliability evaluator ψ_b is calculated for each sample of the sets S′_c and S′_e. The set of values obtained on S′_c allows us to numerically construct the function D_c(ψ_b); with a similar procedure on S′_e, we get D_e(ψ_b). Having obtained these distributions, the second threshold σ_b can be calculated with steps 3) and 4) of the algorithm TA.

In the following, the reliability evaluators will be defined for three neural classifiers: the multilayer perceptron trained with the back-propagation algorithm (BP) [25], the learning vector quantization network (LVQ) [26], and the probabilistic neural network (PNN) [27]. It is worth noting that these paradigms differ from one another in the meaning of their output vector and represent a wide variety of classifiers [3].

For the BP network used as a classifier, the activation function is typically a sigmoid with values in the range [0, 1]. Thus, for each input sample x, ideally only one neuron (the one associated to the class of x) would assume a value of one while all the others would have a value of zero (ideal output vector). However, the real output vector is typically different from the ideal one. The most used classification rule (winner-takes-all [3]) attributes the sample to the class associated with the output neuron with the highest value (the value of the winner neuron is hereinafter indicated by O_win).

As is well known, the BP network uses the values of the connection weights to define the hyperplanes that separate the regions associated to each class (decision regions) [3]. During the training phase, the network dynamically modifies the decision regions in such a way as to provide, for each sample, an output vector as close as possible to the corresponding ideal one. Therefore, the winner-takes-all rule attributes the input sample to the class associated to the decision region in which it lies. Consequently, samples which are significantly different from the ones belonging to the TS (just as they are different from the samples which contributed to determining the hyperplanes separating the decision regions) may lie outside every decision region built during the training phase; in this case all the output neurons can assume values near to zero. A simple classification problem in which this situation occurs is shown in Fig. 6 (see the sample in A).
It can be simply proved that an effective definition of the reliability evaluator ψ_a is

ψ_a = O_win.

In this way, the nearer the value of O_win is to zero, the more unreliable the classification is considered to be. With reference to situation b), it should be considered that the samples which lie in overlapping regions typically generate output vectors with two or more neurons having similar values (cf. Fig. 6, samples in B and C). This is mainly due to the fact that these samples fall in the neighborhood of one or more hyperplanes. Therefore, the higher the classification reliability, the greater the difference between the value of O_win and that of O_2win (the second winner neuron). An adequate definition of the reliability evaluator ψ_b is therefore

ψ_b = O_win − O_2win.
It can be simply noted that the reliability evaluators ψ_a and ψ_b assume values in the range [0, 1], as is to be expected.

For an LVQ net, the output vector is composed of the values of the distances between each Kohonen neuron (prototype) and the input sample. The net assigns the sample x to the class of the winner neuron, i.e., the prototype at the minimum distance d_win from x. The learning algorithms (e.g., LVQ1 [26]) modify the starting prototypes according to the TS in such a way as to partition the feature space like a Voronoi tessellation. Thus, the final prototypes defined by the net will be the centroids of the regions into which the feature space is partitioned. Obviously, samples significantly different from those present in the TS, not having contributed to the prototype adjustment process, have a distance from the winner neuron greater than that relative to the samples of the TS. Therefore, the reliability evaluator ψ_a can be defined as

ψ_a = 1 − d_win/d_max   if d_win ≤ d_max;   ψ_a = 0 otherwise

where d_max is the highest value of d_win over the TS. The second condition is introduced to avoid negative values of ψ_a since, for some samples of the data set, it may turn out that d_win > d_max, thus making 1 − d_win/d_max negative. Note that samples very far from the winner neuron, and hence possibly unreliable, have low values of ψ_a.
On the other hand, samples belonging to an overlapping region have a comparable distance from at least two prototypes. Consequently, in this case the reliability evaluator ψ_b must be a function of both d_win and d_2win, where d_2win is the distance between the input sample and the second winner neuron:

ψ_b = 1 − d_win/d_2win.

On the basis of this definition, ψ_b assumes values ranging from zero to one. Moreover, the higher the value of ψ_b, the further the second winner is from the input sample with respect to the winner.

In a PNN net, the output of each neuron assumes a value proportional to the probability that the input sample belongs to the class associated to that neuron. The distances between the input sample and all the samples belonging to the TS are computed and, on the basis of these values, the probability that the sample belongs to each class is evaluated. These probability density functions are generally computed using the Parzen method [28]. Then the PNN net assigns the input sample to the class associated to the output neuron with the highest value, taking into account the a priori probability of each class and the loss associated to a wrong decision for that class. As the value of O_win depends on the whole TS, samples different from those in the TS obviously have a low probability of belonging to any class; therefore the reliability evaluator ψ_a can be defined as

ψ_a = O_win/O_max   if O_win ≤ O_max;   ψ_a = 1 otherwise

where O_max is the highest value of O_win over the samples belonging to the TS. Samples lying in an overlapping region, instead, exhibit a similar probability of belonging to two or more classes. Thus, the reliability evaluator ψ_b can be defined as

ψ_b = 1 − O_2win/O_win

where O_2win is the value of the second winner neuron. According to this definition, the greater the probability that the input sample belongs to the winner class rather than to the second winner class, the more reliable the classification is considered to be. Finally note that, by definition, ψ_b ranges from zero to one.

Fig. 6. (a) A BP net trained to solve a simple three-class classification problem. (b) The lines locating the decision regions. On the basis of the weight values highlighted in (a), it is possible to verify that for samples in the shadowed region all the output neuron values are near to zero (e.g., for a sample lying in A(0, 1) the output vector is O_1 = 0.1, O_2 = 0.0, O_3 = 0.1). Moreover, in the presence of samples lying near the lines separating two decision regions, the output vector has two components with similar values (e.g., for a sample in B(0.9, 1.1) the output vector is O_1 = 0.02, O_2 = 0.60, O_3 = 0.41, while for a sample in C(1, 1) it is O_1 = 0.01, O_2 = 0.5, O_3 = 0.5).
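Summing up, the three pairs of evaluators admit a compact implementation. The following is an illustrative sketch of ours, assuming that d_max and O_max have already been measured on the TS as described above; function names are hypothetical.

```python
import numpy as np

def bp_evaluators(outputs):
    """BP net: psi_a is the winner output, psi_b the winner/second-winner gap."""
    o = np.sort(outputs)[::-1]        # output values in descending order
    return o[0], o[0] - o[1]          # psi_a, psi_b

def lvq_evaluators(distances, d_max):
    """LVQ net: distances to the prototypes; d_max is the highest winner
    distance observed on the training set."""
    d = np.sort(distances)            # distances in ascending order
    psi_a = max(0.0, 1.0 - d[0] / d_max)
    psi_b = 1.0 - d[0] / d[1]
    return psi_a, psi_b

def pnn_evaluators(outputs, o_max):
    """PNN net: outputs proportional to class-membership probabilities;
    o_max is the highest winner output observed on the training set."""
    o = np.sort(outputs)[::-1]
    psi_a = min(1.0, o[0] / o_max)
    psi_b = 1.0 - o[1] / o[0]
    return psi_a, psi_b
```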
IV. EXPERIMENTAL RESULTS

The method has been tested on a complex classification problem obtained by considering artificial data generated by a statistical process already used in [24]. This is based on the distribution-of-distributions model, i.e., on a model in which the data are generated using a set of distribution functions of different types (Gaussian, uniform, etc.) with different statistical moment values, so as to model the complexity of a large number of real applications. In particular, we have considered two types of distribution functions.

a) A Gaussian distribution to model those samples which have been adequately processed by the description algorithms, thus forming in the feature space clusters whose centroids can be effectively assumed as prototypes. Furthermore, as in real applications each class can be adequately represented only with more than one prototype, the considered model introduces a set of clusters for each class. The statistical parameters characterizing each cluster (the mean and the variance of the Gaussian distribution) have been randomly generated using other distribution functions. The random generation of the mean, representing the centroid of the relative cluster, simulates the independence of the prototypes and causes each cluster to be randomly located in the feature space. Moreover, the random generation of the variance effectively models the variability introduced by the description algorithms, generating clusters that are more or less spread throughout the feature space.

b) A uniform distribution to model the samples which have not been adequately described (noisy samples) and are consequently spread over the feature space. To simulate this situation, additional samples have been randomly generated for each class according to a uniform distribution. The latter has been extended to a large region of the feature space, so as to overlap all the considered clusters.
TABLE I. THE STATISTICAL PARAMETERS OF THE CLUSTERS MAKING UP THE TRAINING SET TS.
A bidimensional classification problem with three classes, each made up of three clusters, has been considered. This choice cannot be considered restrictive and helps to interpret the work of the reliability evaluators on a Cartesian plane. The a priori probability of each of the nine clusters has been generated according to a Gaussian distribution, while the mean and the variance of the clusters have been evaluated according to a uniform distribution in the range [−1, 1] and to a Gaussian distribution with mean 0.1 and variance 0.03, respectively. The noise has been generated by a bidimensional uniform distribution over the square [−1.5, 1.5] × [−1.5, 1.5]. The TS is made up of 2475 samples, including about 10% of noise. In Table I the statistical parameters of the nine considered clusters are shown.

The data set DS has been generated independently of the TS and is composed of 2700 samples. The clusters of the DS have the same centroids as the TS (otherwise they would represent a different classification problem), but the variance is 50% greater and the noise is about 20%. Fig. 7 shows the obtained DS, characterized by a widespread noise component and large overlapping regions. To highlight the complexity of the considered classification problem, Table II shows the classification results obtained on the same data by a Bayesian classifier and by the three considered neural classifiers operating at 0-reject. Even if this paper does not aim to compare the absolute performance of the three neural classifiers, it can be noted that their performance is similar to that of the Bayesian classifier, which represents an upper bound [4]. It is worth noting that, in these conditions, as the reject option is applied to well trained classifiers, it operates in unfavorable conditions.

Table III reports the reject thresholds obtained for values of C_n ranging from 0.17 to 3.75. These values of C_n have been obtained by fixing C_c equal to one and choosing C_r and C_e within the sets {3, 4, 5} and {6, 9, 12, 15, 18}, respectively. This choice seems adequate to cover a number of possible real situations, and allows us to verify the behavior of the reject option for a set of situations ranging from the case in which C_n is near to zero (i.e., C_r assumes values near to those of C_e) to the case in which C_n assumes relatively high values (i.e., C_r is much smaller than C_e).
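As an illustration, a minimal generator along these lines might look as follows. The sampling details (how the per-cluster sizes are balanced and how noise samples are labeled) are assumptions of ours, since the paper does not specify them completely; parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_set(n_samples, n_classes=3, clusters_per_class=3,
             var_mean=0.1, var_var=0.03, noise_frac=0.10):
    """Sketch of the distribution-of-distributions generator: Gaussian
    clusters with randomly drawn centroids and variances, plus uniform
    noise spread over the whole feature space."""
    X, y = [], []
    per_cluster = int(n_samples * (1 - noise_frac)) // (n_classes * clusters_per_class)
    for c in range(n_classes):
        for _ in range(clusters_per_class):
            centroid = rng.uniform(-1.0, 1.0, size=2)          # cluster mean ~ U[-1, 1]
            var = abs(rng.normal(var_mean, np.sqrt(var_var)))  # cluster variance ~ N(0.1, 0.03)
            X.append(rng.normal(centroid, np.sqrt(var), size=(per_cluster, 2)))
            y.append(np.full(per_cluster, c))
    n_noise = n_samples - per_cluster * n_classes * clusters_per_class
    X.append(rng.uniform(-1.5, 1.5, size=(n_noise, 2)))        # uniform noise over the square
    y.append(rng.integers(0, n_classes, size=n_noise))         # noise labels spread over classes
    return np.vstack(X), np.concatenate(y)

X_ts, y_ts = make_set(2475)   # training set, about 10% noise
```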
Fig. 7. The data set DS; boxed areas indicate regions of the feature space with significant overlaps among different classes.
TABLE II. RECOGNITION RATE ON TS AND DS OBTAINED BY A BAYESIAN CLASSIFIER AND BY THE THREE NEURAL CLASSIFIERS CONSIDERED, OPERATING AT 0-REJECT. For the PNN net, the recognition rate on TS is not significant, as the TS is used as the reference set.
It can be noted that the value of σ_a increases as C_n increases; in fact, when C_r is much smaller than C_e, the maximum of P is obtained by rejecting a larger number of samples, i.e., by increasing the threshold. Similar considerations hold for σ_b.

Fig. 8 shows, for the three classifiers and for a fixed value of C_n, the samples misclassified at 0-reject and the ones misclassified and rejected when the reject option is adopted. This figure points out the effectiveness of the considered evaluators: the evaluator ψ_a allows us to identify most of the samples located in areas of the feature space far from any centroid, while the evaluator ψ_b identifies a large number of samples located in overlapping regions. Note that for the BP network the evaluator ψ_a also allows us to identify some of the samples located in overlapping regions of the feature space. This situation can occur for samples located in the neighborhood of the hyperplanes separating two or more different classes and having at least two output neurons with comparable values, both lower than the threshold σ_a.

The overall effectiveness of the method is shown in Figs. 9–11, which respectively indicate the relative variations of the classification rate, the error rate, and the reject rate for different values of C_n.
TABLE III. VALUES OF THE REJECT THRESHOLDS σ_a AND σ_b OBTAINED FOR VALUES OF C_n RANGING FROM 0.17 TO 3.75. These values of C_n have been obtained by fixing C_c equal to one and choosing C_r and C_e within the sets {3, 4, 5} and {6, 9, 12, 15, 18}, respectively.
As expected, for high values of C_n the percentage of rejected samples increases and consequently the percentages of misclassifications and of correct classifications decrease; in this situation it is always more convenient to reject an unreliably classified sample rather than run the risk of misclassifying it. In particular, Figs. 9 and 10 point out that, for the three considered networks, the reduction of the misclassification rate is on average about 30%, while the reduction of the recognition rate is only about 7%. Furthermore, Fig. 11 clearly shows that the percentage of samples rejected by the LVQ net is, in most cases, greater than that obtained by the other neural classifiers, while PNN nets are generally characterized by lower reject rates.

Fig. 12 shows the variations of P as a function of C_n, confirming that the method performs well for all the considered networks and for any cost coefficient value. Note that the best result has been obtained by the PNN classifier for the highest considered value of C_n, with a value of P equal to about 45% of the ideal value P_id. This figure also indicates that the adequacy of the reject option increases as C_n increases and that significant increases in P can already be obtained for moderate values of C_n; the latter condition is practically verified in most real applications.

Finally, it should be considered that when C_n decreases [i.e., the difference (C_e − C_r) decreases], the convenience of introducing the reject option generally becomes negligible. As can be seen in (6b), the percentage of samples transformed from misclassified into rejected contributes to the increase in P proportionally to the difference (C_e − C_r); consequently, for values of C_e similar to C_r, the increase in P can be comparable to the decrease in P caused by the fall in the classification rate. In these conditions the applicability of a reject option becomes more critical; moreover, even if the utility of this option is verified on the TS, the convenience may well disappear on a DS that is not adequately represented by the TS. In the experimental context under consideration, as the DS is significantly more complex than the TS, the reject option is definitely unprofitable for the lowest considered values of C_n.
V. CONCLUSION

In this paper, a method for defining an optimal reject option tailored to a given application domain has been proposed. This option, which can be effectively applied to any 0-reject classifier, is based on the definition of a function ψ which allows us to estimate the classification reliability of the generic input sample. The results of the method have been tested with three neural network paradigms and proved to be very interesting. The reject option has allowed us to transform about 30% of the 0-reject misclassified samples into rejected ones with a small reduction in the classification rate (about 7%). Moreover, the reliability evaluators of the three considered neural classifiers were able to effectively localize in the feature space the majority of the unreliably classified samples. These results have been obtained using values of the cost coefficients within ranges that are large enough to meet the requirements of most real classification problems.

Investigations geared toward further improving the performance of the method may involve the introduction of a family of reliability evaluators, each able to detect, in the feature space, situations corresponding to highly risky classifications. However, it is important to verify that the effort made in defining a new reliability evaluator effectively contributes to improving the performance of the reject option. Preliminary tests demonstrated that evaluators able to detect specific situations (different from the general ones considered in the paper) involve too few samples. Consequently, in order to make the training set of the reject option representative, very large sets should be considered, and the computational cost of obtaining all the distributions significantly increases.

The use of joint distributions of ψ_a and ψ_b, instead of the separate ones, could also slightly improve the performance of our method. In this case, however, there is the problem of choosing the best way of combining the distributions.
Fig. 8. Samples misclassified at 0-reject by the three neural networks, and samples misclassified and rejected after the introduction of the reject option. Different symbols represent samples rejected by different evaluators ("x" denotes samples rejected by ψ_a and "o" denotes samples rejected by ψ_b).

Fig. 9. Percentage variation of the classification rate with respect to the 0-reject case as a function of C_n.
Fig. 10. Percentage variation of the error rate with respect to the 0-reject case as a function of C_n.

Fig. 11. Reject rates as a function of C_n.

Fig. 12. Values of the P/P_id ratio for the three neural classifiers considered, as a function of C_n.
As the optimal choice can generally depend on the data, it is necessary to define some criteria for choosing, given a specific application domain, the kind of combination that must be performed. This will be a matter of future investigation.

REFERENCES

[1] J. Y. Han, M. R. Sayeh, and J. Zhang, "Convergence and limit points of neural network and its application to pattern recognition," IEEE Trans. Syst., Man, Cybern., vol. 19, pp. 1217–1222, Sept./Oct. 1989.
[2] B. Hussain and M. R. Kabuka, "A novel feature recognition neural network and its application to character recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 98–106, Jan. 1994.
[3] R. P. Lippmann, "Pattern classification using neural networks," IEEE Commun. Mag., pp. 47–64, Nov. 1989.
[4] I. Sethi and A. K. Jain, Eds., Neural Networks and Statistical Pattern Recognition. Amsterdam, The Netherlands: North Holland, 1991.
[5] D. F. Specht, "Probabilistic neural networks and the polynomial adaline as complementary techniques for classification," IEEE Trans. Neural Networks, vol. 1, pp. 111–121, Mar. 1990.
[6] S. Becker and Y. Le Cun, "Improving the convergence of back-propagation learning with second order methods," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. San Mateo, CA, 1989, pp. 29–37.
[7] U. Bhattacharya and S. K. Parui, "On the rate of convergence of perceptron learning," Pattern Recognit. Lett., vol. 16, no. 5, pp. 491–498, May 1995.
[8] R. P. Brent, "Fast training algorithms for multilayer neural nets," IEEE Trans. Neural Networks, vol. 2, pp. 346–354, May 1991.
[9] C.-C. Chiang and H.-C. Fu, "Using multithreshold quadratic sigmoidal neurons to improve classification capability of multilayer perceptrons," IEEE Trans. Neural Networks, vol. 5, pp. 516–519, May 1994.
[10] S. Ergezinger and E. Thomsen, "An accelerated learning algorithm for multilayer perceptrons: Optimization layer by layer," IEEE Trans. Neural Networks, vol. 6, pp. 31–42, Jan. 1995.
[11] S. E. Fahlman, "Faster-learning variations on back-propagation: An empirical study," in Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds. San Mateo, CA, 1989, pp. 38–51.
[12] J. S. N. Jean and J. Wang, "Weight smoothing to improve network generalization," IEEE Trans. Neural Networks, vol. 5, pp. 752–763, Sept. 1994.
[13] A. G. Parlos, B. Fernandez, A. F. Atiya, J. Muthusami, and W. K. Tsai, "An accelerated learning algorithm for multilayer perceptron networks," IEEE Trans. Neural Networks, vol. 5, pp. 493–497, May 1994.
[14] X.-H. Yu, G.-A. Chen, and S.-X. Cheng, "Dynamic learning rate optimization of the back-propagation algorithm," IEEE Trans. Neural Networks, vol. 6, pp. 669–677, May 1995.
[15] M. A. Abou-Nasr and M. A. Sid-Ahmed, "Fast learning and efficient memory utilization with a prototype based neural classifier," Pattern Recognit., vol. 28, no. 4, pp. 581–593, Apr. 1995.
[16] E. B. Baum and D. Haussler, "What size net gives valid generalization?," Neural Computation, pp. 81–90, Jan. 1989.
[17] Y. Hirose, K. Yamashita, and Y. Hijiya, "Back-propagation algorithm which varies the number of hidden units," Neural Networks, vol. 4, pp. 61–66, 1991.
[18] D. C. Psichogios and L. H. Ungar, "SVD-NET: An algorithm that automatically selects network structure," IEEE Trans. Neural Networks, vol. 6, no. 1, pp. 513–516, Jan. 1995.
[19] R. N. Sharpe, M. Chow, S. Briggs, and L. Windingland, "A methodology using fuzzy logic to optimize feedforward artificial neural network configurations," IEEE Trans. Syst., Man, Cybern., vol. 24, pp. 760–768, May 1994.
[20] C. K. Chow, "On optimum recognition error and reject tradeoff," IEEE Trans. Inform. Theory, vol. IT-16, pp. 41–46, Jan. 1970.
[21] M. E. Hellman, "The nearest neighbor classification rule with a reject option," IEEE Trans. Syst. Sci. Cybern., vol. 6, pp. 179–185, July 1970.
[22] G. C. Vasconcelos, M. C. Fairhurst, and D. L. Bisset, "Investigating feedforward neural networks with respect to the rejection of spurious patterns," Pattern Recognit. Lett., vol. 16, no. 2, pp. 207–212, Feb. 1995.
[23] L. P. Cordella, C. De Stefano, F. Tortorella, and M. Vento, "A method for improving classification reliability of multilayer perceptrons," IEEE Trans. Neural Networks, vol. 6, pp. 1140–1147, Sept. 1995.
[24] J. de Villiers and E. Barnard, "Back-propagation neural networks with one and two hidden layers," IEEE Trans. Neural Networks, vol. 4, pp. 136–141, Jan. 1993.
[25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[26] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, pp. 1464–1480, Sept. 1990.
[27] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109–118, Jan. 1990.
[28] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Stat., vol. 33, pp. 1065–1076, 1962.
Claudio De Stefano was born in Naples, Italy, on October 4, 1961. He received the Laurea degree with honors in electronic engineering in 1990 and the Ph.D. degree in electronic and computer engineering in 1994, both from the University of Naples “Federico II,” Italy. From 1994 to 1996, he was an Assistant Professor of computer science at the Department of Computer Science and Systems of the University of Naples “Federico II.” In 1996, he joined the Faculty of Computer Engineering in Benevento, University of Sannio, where he is currently an Assistant Professor of computer architectures. He has been active in the fields of pattern recognition, image analysis, machine learning, and parallel computing. His current research interests include on-line and off-line handwriting recognition, cursive script segmentation, neural networks, and evolutionary learning systems. Dr. De Stefano is a Member of the International Association for Pattern Recognition (IAPR).
Carlo Sansone was born in Naples, Italy, in 1969. He received the Laurea degree (cum laude) in electronic engineering in 1993 and the Ph.D. degree in electronic and computer engineering in 1997, both from the University of Naples "Federico II." Since 1997, he has been an Assistant Professor of computer science, database systems, and neural programming at the University of Naples "Federico II" and, more recently, of computer science at the University of Cassino. His research interests are in the fields of neural network theory and classification methodologies, with applications in different areas of pattern recognition such as optical character recognition, document processing, and signature verification. Dr. Sansone is a Member of the International Association for Pattern Recognition (IAPR).
Mario Vento was born in Italy in 1960. In 1984, he received the Laurea degree (cum laude) in electronic engineering, and in 1988, the Ph.D. degree in electronic and computer engineering, both from the University of Naples "Federico II," Italy. Since 1989, he has been an Assistant Professor at the Dipartimento di Informatica e Sistemistica in the Faculty of Engineering of the University of Naples, where he is currently an Associate Professor of computer science and artificial intelligence. His interests involve basic research in the areas of artificial intelligence, image analysis, pattern recognition, machine learning, and parallel computing in artificial vision. He is especially dedicated to classification techniques, whether statistical, syntactic, or structural, giving contributions to neural network theory, statistical learning, exact and inexact graph matching, multiexpert classification, and learning methodologies for structural descriptions. He participated in several projects in the areas of handwritten character recognition, document processing, car plate recognition, signature verification, raster-to-vector conversion of technical drawings, and automatic interpretation of biomedical images. He has authored over 70 research papers in international journals and conference proceedings. Dr. Vento is a Member of the International Association for Pattern Recognition (IAPR) and of the IAPR Technical Committee on "Graph Based Representations" (TC15).