BAYES-NEAREST: a new Hybrid classifier Combining Bayesian Network and Distance Based algorithms

Elena Lazkano and Basilio Sierra
Department of Computer Science and Artificial Intelligence
University of the Basque Country
P.O. Box 649, E-20080 Donostia-San Sebastián, Basque Country, Spain
e-mail: {ccpsiarb, ccplaore}@si.ehu.es
http://www.sc.ehu.es/ccwrobot

Abstract. This paper presents a new hybrid classifier that combines the probability-based Bayesian Network paradigm with the distance-based Nearest Neighbor algorithm. The Bayesian Network structure is obtained from the data by using the K2 structural learning algorithm. The Nearest Neighbor algorithm is used in combination with the Bayesian Network in the deduction phase. For those databases in which some variables are continuous-valued, automatic discretizations of the data are performed. We show the performance of the new proposed approach compared with the Bayesian Network paradigm and with the well known Naive Bayes classifier on some standard databases; the results obtained by the new algorithm are better or equal according to the Wilcoxon statistical test.

1 Introduction

In recent years there has been extensive development of hybrid classifiers, that is, of classifiers built by combining ideas from several standard classifiers. This research has the potential to apply accurate composite classifiers to real world problems by intelligently combining known learning algorithms. Classifier combination falls within the supervised learning paradigm in the Machine Learning area (Mitchell, 1997). This task orientation assumes that we have been given a set of training examples, which are customarily represented by feature vectors. Each training example is labelled with a class target, which is a member of a finite, and usually small, set of class labels. The goal of supervised learning is to predict the class labels of examples that have not been seen. Combining the predictions of a set of component classifiers has been shown to yield accuracy higher than that of the most accurate component on a wide variety of supervised classification problems (Xu et al., 1992; Ho and Srihari, 1994; Lu, 1996; Dietterich, 1997; Sierra et al., 2001b). In this paper, we present a new hybrid classifier based on two families of well known classification methods; the first one is a probabilistic classifier used to obtain the first model, a Bayesian network (Pearl, 1988; Cowell et al., 1999), and the second one is a distance-based classifier (Dasarathy, 1991) which is combined with the Bayesian network in the classification process. We show the results obtained by the new approach and compare it with two probabilistic classifiers: Naive Bayes and Bayesian Networks.

The rest of the paper is organized as follows. Sections 2 to 5 review the concepts needed to understand the new hybrid classifier. Section 6 introduces the new proposed approach, and the results obtained are presented in Section 7. The last section is dedicated to conclusions and future work.

2 Bayesian Networks

Bayesian networks (BN) are probabilistic graphical models represented by directed acyclic graphs in which nodes are variables and arcs show the (in)dependencies among the variables (Castillo et al., 1997; Jensen, 1996). There are different ways of establishing the Bayesian network structure (Heckerman et al., 1995). It can be the human expert who designs the network, taking advantage of his/her knowledge about the relations among the variables. It is also possible to learn the structure by means of an automatic learning algorithm. A combination of both systems is a third alternative, mixing the expert knowledge and the learning mechanism. Within the supervised classification area, learning is performed using a training datafile, but there is always a special variable, namely the class, i.e. the one we want to deduce (Sierra et al., 2000). Some structural learning approaches take into account the existence of that special variable (Friedman et al., 1997; Sierra and Larrañaga, 1998), but most of them – and this is the case of the K2 algorithm explained below – consider all the variables in the same manner and make use of an evaluation metric to measure the appropriateness of a net given the data. Hence, a structural learning method needs two components: the learning algorithm and the evaluation measure (score+search).

The algorithm used in the experimentation described here is the K2 algorithm (Cooper and Herskovits, 1992). This algorithm assumes an order has been established for the variables so that the search space is reduced. The fact that X1, X2, ..., Xn is an ordering of the variables implies that only the predecessors of Xk in the list can be its parent nodes in the learned network. The algorithm also assumes that all the networks are equally probable but, because it is a greedy algorithm, it cannot ensure that the net resulting from the learning process is the most probable one given the data. Figure 1 shows the pseudo-code of the algorithm. The original algorithm used the K2 Bayesian metric to evaluate the net while it is being constructed:

$$P(D|S) = \log_{10}\left[\prod_{i=1}^{n}\prod_{j=1}^{q_i}\frac{(r_i-1)!}{(N_{ij}+r_i-1)!}\prod_{k=1}^{r_i}N_{ijk}!\right] \qquad (1)$$

where:

– P(D|S): a measure of the goodness of the Bayesian net S defined over the dataset D.
– n: the number of variables.
– r_i: the number of values or states that the i-th variable can take.
– q_i: the number of possible configurations of the parents of variable i.
– N_ijk: the frequency with which variable i takes its k-th value while its parent configuration is the j-th one.
– N_ij = \sum_{k=1}^{r_i} N_ijk.
– N: the number of entries in the database.

Input: an ordering of the variables, a database D and a metric g.
Output: the network structure.
for i = 1 to n do
    π_i = ∅
    p_old = g(i, π_i)
    ok_to_proceed = true
    while ok_to_proceed and |π_i| < max_parents do
        let z = argmax_{k ∈ pred(i) \ π_i} g(i, π_i ∪ {k})
        p_new = g(i, π_i ∪ {z})
        if p_new > p_old then
            p_old = p_new
            π_i = π_i ∪ {z}
        else
            ok_to_proceed = false
        end if
    end while
end for

Fig. 1. Pseudo-code of the K2 structural learning algorithm

In addition to this metric, we have tried one more measure in combination with the algorithm: the well known entropy metric, which measures the disorder of the given data:

$$P(D|S) = -\sum_{i=1}^{n}\sum_{j=1}^{q_i}\frac{N_{ij}}{N}\sum_{k=1}^{r_i}\frac{N_{ijk}+1}{N_{ij}+r_i}\ln\frac{N_{ijk}+1}{N_{ij}+r_i} \qquad (2)$$
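For concreteness, the following Python sketch shows one way the score g(i, π_i) of Figure 1 (in its logarithmic K2 form) and the greedy parent search could be implemented from frequency counts. The data representation and the helper names (k2_log_score, k2_search, max_parents) are illustrative assumptions of this sketch, not the authors' code.

import math

def k2_log_score(data, i, parents, arities):
    """log10 of the K2 metric term for variable i under the given parent set.

    data    : list of tuples, one discrete value (0 .. arity-1) per variable
    i       : index of the variable being scored
    parents : list of parent variable indices
    arities : list with the number of states of every variable
    """
    r_i = arities[i]
    counts = {}                       # parent configuration -> [N_ij1, ..., N_ijr]
    for row in data:
        j = tuple(row[p] for p in parents)
        counts.setdefault(j, [0] * r_i)[row[i]] += 1
    score = 0.0
    for n_ijk in counts.values():
        n_ij = sum(n_ijk)
        # ln[(r_i - 1)!] - ln[(N_ij + r_i - 1)!] + sum_k ln[N_ijk!]
        score += math.lgamma(r_i) - math.lgamma(n_ij + r_i)
        score += sum(math.lgamma(n + 1) for n in n_ijk)
    return score / math.log(10)       # convert natural log to log10

def k2_search(data, order, arities, max_parents=3):
    """Greedy K2 structural learning for a fixed ordering of the variables."""
    parents = {i: [] for i in order}
    for pos, i in enumerate(order):
        p_old = k2_log_score(data, i, parents[i], arities)
        candidates = set(order[:pos])          # only predecessors may be parents
        while candidates and len(parents[i]) < max_parents:
            z, p_new = max(((c, k2_log_score(data, i, parents[i] + [c], arities))
                            for c in candidates), key=lambda t: t[1])
            if p_new > p_old:
                p_old, parents[i] = p_new, parents[i] + [z]
                candidates.remove(z)
            else:
                break
    return parents

Using log-factorials (math.lgamma) keeps the computation numerically stable; the entropy metric of equation (2) could be plugged in simply by passing a different scoring function.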

2.1 Propagating the evidence

Evidence propagation or probabilistic inference consists of, given an instantiation of some of the variables, obtaining the a posteriori probability of one or more of the non-instantiated variables (Henrion, 1988). This computation is known to be an NP-hard problem, even for the case of a single variable. There are different alternatives for performing the propagation. Exact methods calculate the exact a posteriori probabilities of the variables, and the resulting error is only due to the limitations of the computer where the calculation is performed; their computational cost can be reduced by exploiting the independencies among the nodes in the net. Approximate propagation methods are based on simulation techniques that obtain approximate values for the probabilities needed. (Pearl, 1987) proposes a stochastic simulation method known as the Markov Sampling Method. It first assigns the evidence to the evidential nodes and then simulates the values for the rest of the nodes in the net. Initially, a realization is performed, i.e. values are assigned to the nodes based on their probabilities, and afterwards the non-evidential variables are simulated using an arbitrary order; a value is generated for each selected variable using its conditional probability function and the simulated values.

When the whole Markov Blanket¹ of the variable of interest is instantiated – as happens in many supervised classification tasks, where there are no missing values – there is no need for the simulation process to obtain the values of the non-evidential variables, and therefore P(Y = y_i | X_1 = x_1, ..., X_n = x_n) can be calculated using only the probability tables of the parents and children of the node, i.e. using the parameters saved in the model specification. For that particular case the method becomes an exact propagation method. Figure 2 shows an example of a BN structure obtained specifically for a classification task, in which the special variable (Cl) appears in the center of the structure.

Fig. 2. An example of a Bayesian Network structure. The Class node is associated with the class variable

¹ The Markov Blanket of a node is the set of nodes formed by its parents, its children and the parents of those children.
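As a concrete illustration of this special case, the hedged sketch below computes the class posterior when the whole Markov Blanket of the class node is instantiated: the posterior is proportional to P(class | its parents) times the product of P(child | its parents) over the children of the class. The dictionary-based encodings of the structure and of the conditional probability tables are assumptions made for the example, not the paper's implementation.

def class_posterior(evidence, class_node, class_values, parents, cpt):
    """Exact class posterior when the whole Markov Blanket of the class is observed.

    evidence     : dict {variable: observed value} for every non-class variable
    class_node   : name of the class variable
    class_values : list of possible class labels
    parents      : dict {variable: list of parent variables} (the BN structure)
    cpt          : dict {variable: {(parent_configuration, value): probability}}
    """
    children = [v for v, ps in parents.items() if class_node in ps]
    scores = {}
    for y in class_values:
        state = dict(evidence)
        state[class_node] = y
        # P(class = y | parents of the class)
        cfg = tuple(state[p] for p in parents[class_node])
        score = cpt[class_node][(cfg, y)]
        # multiplied by P(child | its parents) for every child of the class
        for c in children:
            cfg = tuple(state[p] for p in parents[c])
            score *= cpt[c][(cfg, state[c])]
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}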

3 Naive Bayes classifier

Theoretically, Bayes' rule minimizes the classification error by selecting the class y_j with the largest posterior probability for a given example X of the form X = <X_1, X_2, ..., X_n>, as indicated below (equation 3):

$$P(Y=y_j \mid X) = \frac{P(Y=y_j)\,P(X \mid Y=y_j)}{P(X)} \qquad (3)$$

Since X is a composition of n discrete values, one can expand this expression to equation 4:

$$P(Y=y_j \mid X_1=x_1, ..., X_n=x_n) = \frac{P(Y=y_j)\,P(X_1=x_1, ..., X_n=x_n \mid Y=y_j)}{P(X_1=x_1, ..., X_n=x_n)} \qquad (4)$$

where P(X_1=x_1, ..., X_n=x_n | Y=y_j) is the conditional probability of the instance X given the class y_j, P(Y=y_j) is the a priori probability of observing class y_j, and P(X_1=x_1, ..., X_n=x_n) is the prior probability of observing the instance X. All these parameters are estimated from the training set. However, a direct application of these rules is difficult due to the lack of sufficient data in the training set to reliably estimate all the conditional probabilities needed by the model.

One simple form of the previous diagnosis model has been studied that assumes independence of the feature variables X_1, X_2, ..., X_n given the class variable Y, which allows us to use the following equality (equation 5):

$$P(X_1=x_1, ..., X_n=x_n \mid Y=y_j) = \prod_{i=1}^{n} P(X_i=x_i \mid Y=y_j) \qquad (5)$$

where P(X_i=x_i | Y=y_j) is the probability of an instance of class y_j having the observed attribute value x_i. At the core of this paradigm lies an assumption of independence between the occurrence of feature values that does not hold in many tasks; however, it has been empirically demonstrated that this paradigm gives good results in medical tasks. This model is equivalent to the Bayesian network shown in Figure 3. Note that the structure is fixed, and the variable to be predicted is the parent of all predictor variables. In our experiments, we use this Naive Bayes (NB) classifier (Kohavi, 1996) in order to compare the results obtained by the BN approach and the new proposed method Bayes-Nearest.
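A minimal Naive Bayes sketch following equations 3 to 5 is given below. The Laplace smoothing used to avoid zero conditional probabilities is our own addition, since the paper does not specify how sparse counts are handled.

from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, X, y):
        """X: list of discrete feature vectors, y: list of class labels."""
        n = len(y)
        self.classes = sorted(set(y))
        self.prior = {c: y.count(c) / n for c in self.classes}   # P(Y = y_j)
        self.counts = defaultdict(Counter)   # (feature index, class) -> value counts
        self.values = defaultdict(set)       # feature index -> set of seen values
        for xi, c in zip(X, y):
            for f, v in enumerate(xi):
                self.counts[(f, c)][v] += 1
                self.values[f].add(v)
        return self

    def predict(self, x):
        best, best_p = None, -1.0
        for c in self.classes:
            p = self.prior[c]
            for f, v in enumerate(x):
                n_c = sum(self.counts[(f, c)].values())
                # Laplace-smoothed estimate of P(X_f = v | Y = c)
                p *= (self.counts[(f, c)][v] + 1) / (n_c + len(self.values[f]))
            if p > best_p:
                best, best_p = c, p
        return best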

4 The k-NN Classification Method

Fig. 3. Naive Bayes structure

The operation of the Nearest Neighbor algorithm is as follows. A new pair (X, Y) is given, where only the measurement X is observable, and it is desired to estimate Y by utilizing the information contained in the set of correctly classified points. We shall call X'_p ∈ {X_1, X_2, ..., X_n} the nearest neighbor of X if

$$d(X'_p, X) = \min_{i=1,...,n} d(X_i, X)$$

The NN classification decision rule assigns to X the category y'_p of its nearest neighbor X'_p. In case of a tie for the nearest neighbor, the decision rule has to be modified in order to break it. A mistake is made if y'_p ≠ y. An immediate extension of this decision rule is the so-called k-NN approach (Dasarathy, 1991; Aha et al., 1991; Sierra and Lazkano, 2002), which assigns to the candidate X the class most frequently represented among its k nearest neighbors.
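A possible implementation of the NN and k-NN rules just described is sketched below; the choice of the Hamming distance (natural for the discretized attributes used later in the experiments) is our assumption, not something the paper fixes.

from collections import Counter

def hamming(a, b):
    """Number of attributes on which two discrete cases differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

def knn_classify(train, x, k=1, distance=hamming):
    """train: list of (feature vector, label) pairs; x: the case to classify."""
    neighbours = sorted(train, key=lambda case: distance(case[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]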

5 Discretization of continuous attributes

Many classification algorithms require the classifying variables to be discretized, either because they need nominal values, or in order to reduce the computational cost and to accelerate the induction process. Discretization of the variables does not necessarily affect the learning process negatively; moreover, the performance of a classifier can even be improved by discretization. (Dougherty et al., 1995) compares several discretization methods over 16 databases and two classifiers, Quinlan's C4.5 (Quinlan, 1993) and the Naive Bayes classifier. The results show that in most of the cases the discretization process improves the performance of the classifiers. Discretization methods can be classified according to three criteria:

– Global/Local: local discretization produces partitions of the data that are applied to local regions of the instance space. On the other hand, global discretization produces a grid in the n-dimensional instance space, in which each variable is partitioned into regions independently of the rest of the attributes.

– Supervised/Non-supervised: supervised discretization uses the "class" variable while discretizing the data; non-supervised discretization does not.

– Static/Dynamic: static discretization performs, in each step, one discretization for each variable. The number of regions k may be an input to the process, or it can be determined independently for each variable. Dynamic discretization searches for the possible values of k simultaneously for all the variables, capturing their interdependencies during the discretization process.

Below, some discretization methods found in the literature are presented. The easiest way to perform a discretization is to fix a number of intervals k and divide the range of values of a variable into k intervals of the same length. This method is known as equal width interval binning. The equal frequency intervals method consists of dividing the set of values into k intervals or regions, each interval containing m/k adjacent instances (m being the total number of instances); a sketch of this binning is given after this list. These methods do not consider the class when determining the intervals. Therefore, they are non-supervised methods, and the classification information contained in the original database can be lost if the discretization merges cases tightly associated with different classes. Entropy-based discretization methods have proved to be very effective for different problems. These methods aim to divide the set of data into k subsets of minimal entropy. Given a dataset and a variable to discretize, the entries are ordered according to the values of the variable and the value that minimizes the entropy of the generated subsets is selected as a breakpoint. This criterion is applied recursively in each of the new subsets (Catlett, 1991; Fayyad and Irani, 1993).
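The equal frequency interval binning used later in the experiments can be sketched as follows; the exact handling of ties at the interval boundaries is an assumption of this illustration.

import bisect

def equal_frequency_cutpoints(values, k):
    """Return k - 1 cut points so that each interval holds roughly m/k instances."""
    ordered = sorted(values)
    m = len(ordered)
    return [ordered[(i * m) // k] for i in range(1, k)]

def discretize(value, cutpoints):
    """Map a continuous value to the index of its interval (0 .. k - 1)."""
    return bisect.bisect_right(cutpoints, value)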

6 Proposed approach: Bayes-Nearest

Classifier combination can fuse together different information sources to exploit their complementary information. The sources can be multi-modal, such as speech and vision, but can also be transformations (Kohavi, 1996) or partitions (Cowell et al., 1999; Murphy and Aha, 1994; Pearl, 1987) of the same signal. In each case, combination can produce appreciable gains, even when the individual classifiers exhibit widely varying accuracies. Underlying the hybrid approach is the concept of reductionism, where complex problems are solved through stepwise decomposition (Sierra et al., 1999). Intelligent hybrid systems involve specific (hierarchical) levels of knowledge defined in terms of concept granularity and corresponding interfaces. Specifically, the hierarchy would include connectionist and symbolic levels, with each level possibly consisting of an ensemble architecture by itself, and with proper interfaces between levels. As one moves upward in the hierarchical structure, there is a corresponding degree of data compression, allowing more powerful ('reasoning') methods to be employed on reduced amounts of data.

In this section a new classification algorithm that combines the Bayesian Network paradigm with the Nearest Neighbor algorithm is presented. The Bayesian Network structure is obtained from the database automatically by using the K2 structural learning algorithm. In order to better evaluate the classification power of the Bayesian Network approach, we use two different net evaluation metrics in the learning process (the so-called K2 metric and the entropy metric) and compare the results obtained by using each of them.

Input: a samples file containing n cases (X_i, Y_i), i = 1, ..., n, and a new case (X, Y) to be classified.
Output: the class for the new case.
    Select the nearest neighbour to X among the sample file cases
    Propagate this nearest case as evidence through the previously learned BN
    Output the class C_i whose a posteriori probability P_i is maximal among all the classes

Fig. 4. The pseudo-code of the Bayes-Nearest Algorithm.

The classification process carried out by our new algorithm is shown in algorithmic form in Figure 4. As explained before, the classification is done by looking for the case nearest to the new case to be classified in the Training Database, and by propagating the evidence of this nearest case through the previously learned BN. Figure 5 shows a schema of this new method.

Fig. 5. A general schema of the new case classification method: the 1-NN algorithm matches the new case against the discretized database to find its nearest case, which is then fed to the evidence propagation algorithm to produce the new case classification
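Putting the pieces together, the following is a hedged sketch of the Bayes-Nearest deduction phase of Figures 4 and 5: find the training case nearest to the new case and propagate that case, rather than the new case itself, through the previously learned Bayesian Network. The propagate argument stands for any exact propagation routine (for instance the Markov Blanket sketch of Section 2.1 bound to a learned BN); the dictionary-based case representation and the default Hamming distance are assumptions of the example.

def bayes_nearest_classify(train, x, class_values, propagate, distance=None):
    """Bayes-Nearest deduction phase (sketch).

    train        : list of (case, label) pairs, each case a dict {variable: value}
    x            : dict describing the new case to classify
    class_values : possible class labels
    propagate    : function(case) -> {class: a posteriori probability}
    """
    if distance is None:
        distance = lambda a, b: sum(a[v] != b[v] for v in a)   # Hamming distance
    # 1. look for the nearest neighbour of the new case in the training database
    nearest, _ = min(train, key=lambda case: distance(case[0], x))
    # 2. propagate the nearest case (not x itself) through the learned BN
    posterior = propagate(nearest)
    # 3. return the class whose a posteriori probability is maximal
    return max(class_values, key=lambda c: posterior[c])

The pre-propagation optimization mentioned in the conclusions amounts to caching propagate(case) for every training case offline, so that classifying a new case reduces to the 1-NN search plus a table lookup.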

7 Experimental Results

Six databases are used to test our hypothesis. All of them are obtained from the UCI Machine Learning Repository (Murphy and Aha, 1994). These domains are publicly available at the Statlog project WEB page (Michie et al., 1995). The characteristics of the databases are given in Table 1. As can be seen, we have chosen different types of databases, selecting some with a large number of predictor variables, some with a large number of cases, and some multi-class problems.

In order to give a realistic perspective of the applied methods, we use 10-fold cross-validation (Stone, 1974) in all experiments. All databases have been randomly separated into ten sets of training data and corresponding test data. Obviously, the validation files used have always been the same for the five algorithms shown. It should be noted, on the other hand, that the databases have been previously discretized when the values of the predictor variables required it. The discretization has been made using an equal frequency interval method and, in order to give a sounder comparison among the different classifiers, the same discretization has been used for all the databases. This allows us to isolate the effect of the classifying method on the obtained accuracy, as we have not tried to look for a specific discretization method for each database, which could benefit the presented new approach.

Domain      Training cases   Test cases   Num. of classes   Num. of attributes
Breast             614             69            10                 11
Letters         15,000          5,000            26                 16
Pendigit         7,494          3,493            10                 16
Satimage         4,435          2,000             7                 36
Shuttle         43,500         14,500             7                  9
Vote               300            135             2                 16

Table 1. Details of the experimental domains (ten same-size but different train and test files for each domain)
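The 10-fold cross-validation protocol can be sketched as follows; whether the original splits were stratified or plain random is not stated in the paper, so this is only an illustration of the procedure, and the function names are hypothetical.

import random

def ten_fold_indices(n_cases, seed=0):
    """Split the case indices into ten folds of (almost) equal size."""
    idx = list(range(n_cases))
    random.Random(seed).shuffle(idx)
    return [idx[f::10] for f in range(10)]

def cross_validate(cases, build_and_test):
    """build_and_test(train_cases, test_cases) -> accuracy on the test fold."""
    accuracies = []
    for test_idx in ten_fold_indices(len(cases)):
        held_out = set(test_idx)
        train = [cases[i] for i in range(len(cases)) if i not in held_out]
        test = [cases[i] for i in test_idx]
        accuracies.append(build_and_test(train, test))
    return sum(accuracies) / len(accuracies)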

The class distributions of the training databases can be seen in Figure 6. Different distributions appear: some of them are uniform, while in other cases the distribution is skewed.

Table 2 shows the results obtained for the different learning mechanisms and the different databases. This table shows that the new algorithm presents better classification percentages for five of the six standard databases used; for the "Vote" database the results are poorer. We have used the Wilcoxon rank-sum test (Wilcoxon, 1945) in order to compare the results obtained by each of the probabilistic classifiers on the databases, and we have obtained as a result that the new proposed approach outperforms the others for the "Breast" and "Pendigits" databases.

Fig. 6. Class distributions of the training databases (bar charts of the class percentages for the Letters, Pendigits, Satimage, Shuttle and Vote domains)

Datafile    Naive Bayes      K2-BN           K2-BN-nearest    Entr-BN          Entr-BN-nearest
Breast      61.79 ± 13.97    65.20 ± 18.54   57.33 ± 17.81    59.08 ± 19.14    66.95 ± 19.78
Letters     72.88 ±  1.69    71.94 ±  3.50   74.50 ±  3.50    69.97 ±  4.67    72.59 ±  4.21
Pendigits   84.37 ±  1.74    89.36 ±  1.23   90.67 ±  1.06    89.29 ±  0.96    90.53 ±  1.13
Satimage    72.31 ±  9.57    78.13 ±  6.85   78.07 ±  6.68    78.06 ±  6.62    78.44 ±  6.36
Shuttle     78.60 ±  0.55    98.17 ±  1.57   98.17 ±  1.57    98.46 ±  0.98    98.49 ±  0.98
Vote        90.11 ±  3.26    94.26 ±  4.63   91.74 ±  4.62    93.79 ±  4.88    91.73 ±  5.01

Table 2. Well-classified validated results (accuracy, %) obtained for each datafile by each of the tested paradigms. The K2-BN and K2-BN-nearest columns use the K2 metric during structural learning; the Entr-BN and Entr-BN-nearest columns use the entropy metric.

For the remaining four databases it is not possible to reject the null hypothesis of equal accuracy between the Bayesian Network and the Bayes-Nearest methods.
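For completeness, a comparison like the one reported here can be reproduced with the Wilcoxon rank-sum test available in SciPy; the per-fold accuracies below are made-up placeholder numbers, not the values behind Table 2.

from scipy.stats import ranksums

# Placeholder per-fold accuracies for two classifiers on the same ten folds;
# these numbers are made up for illustration, they are not the paper's results.
bn_folds     = [65.2, 63.1, 70.4, 58.9, 66.0, 64.3, 67.8, 61.5, 68.2, 66.7]
hybrid_folds = [66.0, 65.9, 72.1, 60.3, 67.5, 66.2, 69.0, 63.4, 70.1, 68.8]

statistic, p_value = ranksums(hybrid_folds, bn_folds)
# the null hypothesis of equal accuracy is rejected when p_value falls below 0.05
print(f"rank-sum statistic = {statistic:.2f}, p-value = {p_value:.3f}")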

8 Conclusions and Further Work

In this paper a new hybrid classifier that combines Bayesian Networks with distance-based algorithms has been presented. The main idea is to perform the evidence propagation in the Bayesian Network not for the case that is being classified at each moment, but for its nearest case in the training database.

The results obtained show that the new hybrid algorithm performs better for most of the databases used. However, the distance metric used in the nearest neighbour algorithm does not give good results when the number of values the predictor variables can take is small. It seems that the more combinations of values exist for the predictor variables, and the smaller the proportion of cases of each class reflected in the training database, the better the new algorithm behaves with respect to Bayesian Networks. Although this fact must be further analysed, the experiments performed show that the new approach can outperform the other probabilistic classifiers in so-called sparse databases – databases with a large number of variables or a high number of classes and a small number of cases.

Regarding the computational load, this classification technique allows us to pre-propagate all the cases in the database; thereby, to classify a new case it is only necessary to look for its nearest case and select the class with the maximum a posteriori probability previously obtained. This reduces the time needed to predict a new case and thus makes the new method suitable for real-time applications.

An extension of the presented approach is to select the feature subset that gives the best performance from the classification point of view. A Feature Subset Selection technique (Inza et al., 2000, 2001; Sierra et al., 2001a) can be applied in order to select which of the predictor variables should be used. This could benefit the hybrid classifier construction as well as its accuracy.

Bibliography

Aha, D., Kibler, D., and Albert, M. (1991). Instance-based learning algorithms. Machine Learning, 6:37–66.
Castillo, E., Gutiérrez, J., and Hadi, A. (1997). Expert Systems and Probabilistic Network Models. Springer-Verlag.
Catlett, J. (1991). On changing continuous attributes into ordered discrete attributes. In Proceedings of the European Working Session on Learning, pages 164–178. Springer-Verlag.
Cooper, G. F. and Herskovits, E. A. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347.
Cowell, R. G., Dawid, A. P., Lauritzen, S., and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer.
Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press.
Dietterich, T. G. (1997). Machine learning research: four current directions. AI Magazine, 18(4):97–136.
Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Proceedings of the International Conference on Machine Learning, pages 194–202.
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1029. Morgan Kaufmann.
Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 19(4):131–163.
Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20:197–243.
Henrion, M. (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In Proceedings of the Fourth Conference on Uncertainty in Artificial Intelligence, pages 149–163.
Ho, T. K. and Srihari, S. N. (1994). Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:66–75.
Inza, I., Larrañaga, P., Etxeberria, R., and Sierra, B. (2000). Feature subset selection by Bayesian networks based optimization. Artificial Intelligence, 123(1-2):157–184.
Inza, I., Larrañaga, P., and Sierra, B. (2001). Feature subset selection by Bayesian networks: a comparison with genetic and sequential algorithms. International Journal of Approximate Reasoning, 27(2):143–164.
Jensen, F. V. (1996). Introduction to Bayesian Networks. University College London.
Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
Lu, Y. (1996). Knowledge integration in a multiple classifier system. Applied Intelligence, 6:75–86.
Michie, D., Spiegelhalter, D., and Taylor, C., editors (1995). Machine Learning, Neural and Statistical Classification.
Mitchell, T. (1997). Machine Learning. McGraw-Hill.
Murphy, P. M. and Aha, D. W. (1994). UCI repository of machine learning databases.
Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):245–257.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Los Altos, California.
Sierra, B., Inza, I., and Larrañaga, P. (2000). Medical Bayesian networks. Lecture Notes in Computer Science, 1933:4–14.
Sierra, B. and Larrañaga, P. (1998). Predicting survival in malignant skin melanoma using Bayesian networks automatically induced by genetic algorithms. An empirical comparison between different approaches. Artificial Intelligence in Medicine, 14:215–230.
Sierra, B. and Lazkano, E. (2002). Probabilistic-weighted k nearest neighbor algorithm: a new approach for gene expression based classification. In KES02 Proceedings, pages 932–939. IOS Press.
Sierra, B., Lazkano, E., Inza, I., Merino, M., Larrañaga, P., and Quiroga, J. (2001a). Prototype selection and feature subset selection by estimation of distribution algorithms. A case study in the survival of cirrhotic patients treated with TIPS. Artificial Intelligence in Medicine, pages 20–29.
Sierra, B., Serrano, N., Larrañaga, P., Plasencia, E. J., Inza, I., Jiménez, J. J., Revuelta, P., and Mora, M. L. (2001b). Using Bayesian networks in the construction of a bi-level multi-classifier. Artificial Intelligence in Medicine, 22:233–248.
Sierra, B., Serrano, N., Larrañaga, P., Plasencia, E. J., Inza, I., Jiménez, J. J., Revuelta, P., and Mora, M. L. (1999). Machine learning inspired approaches to combine standard medical measures at an intensive care unit. Lecture Notes in Artificial Intelligence, 1620:366–371.
Stone, M. (1974). Cross-validation choice and assessment of statistical procedures. Journal of the Royal Statistical Society, 36:111–147.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics, 1:80–83.
Xu, L., Krzyzak, A., and Suen, C. Y. (1992). Methods for combining multiple classifiers and their applications to handwriting recognition. IEEE Transactions on Systems, Man, and Cybernetics, 22:418–435.
