Pattern Recognition Letters 27 (2006) 1714–1724 www.elsevier.com/locate/patrec

Partition based pattern synthesis technique with efficient algorithms for nearest neighbor classification

P. Viswanath a, M. Narasimha Murty b, Shalabh Bhatnagar b

a Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati 781 039, India
b Department of Computer Science and Automation, Indian Institute of Science, Bangalore, Karnataka 560 012, India

Received 10 March 2005; received in revised form 28 March 2006; available online 21 June 2006. Communicated by B. Kamgar-Parsi.

Abstract

The nearest neighbor (NN) classifier is a popular non-parametric classifier. It is conceptually simple and shows good performance. Due to the curse of dimensionality effect, the size of the training set it needs to achieve a given classification accuracy becomes prohibitively large when the dimensionality of the data is high. Generating artificial patterns can reduce this effect. In this paper, we propose a novel pattern synthesis method called partition based pattern synthesis which can generate an artificial training set of exponential order when compared with the given original training set. We also propose suitable faster NN based methods to work with the synthetic training patterns. Theoretically, the relationship between our methods and conventional NN methods is established. The computational requirements of our methods are also theoretically established. Experimental results show that NN based classifiers with the synthetic training set can outperform conventional NN classifiers and some other related classifiers.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Nearest neighbor classifier; Pattern synthesis; Artificial patterns; Curse of dimensionality

1. Introduction

The nearest neighbor (NN) classifier is a very popular non-parametric classifier that has been used in many fields since its conception in the early fifties (Fix and Hodges, 1951, 1952; Duda et al., 2000). Its wide applicability is due to its simplicity and good performance. It has no design phase and simply stores the training set. A test pattern is assigned to the class of its nearest neighbor in the training set. A simple extension is to take the k nearest neighbors and assign the label to the test pattern by majority voting among them. This method is known as the k nearest neighbor (k-NN) classifier.
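The decision rule just described is straightforward to implement. The following is a minimal sketch of the k-NN rule (k = 1 gives the NN classifier), assuming numeric feature vectors and Euclidean distance; the function and variable names are illustrative and not from the paper.

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, test_x, k=1):
    """Return the majority-vote label among the k nearest training patterns."""
    dists = np.sum((train_X - test_x) ** 2, axis=1)   # squared Euclidean distances
    nn_idx = np.argsort(dists)[:k]                    # indices of the k nearest neighbors
    votes = Counter(train_y[i] for i in nn_idx)
    return votes.most_common(1)[0][0]

# usage: for an (n, d) array X of training patterns with labels y,
# label = knn_classify(X, y, test_pattern, k=3)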


One of the severe drawbacks of NN classifiers is that their performance (classification accuracy) depends on the training set size. Cover and Hart (1967) show that the error of the NN classifier is bounded by twice the Bayes error when the available sample size is infinite. However, in practice one can never have an infinite number of training samples. With a fixed number of training samples, the classification error of the NN classifier tends to increase as the dimensionality of the data gets large. This is called the peaking phenomenon (Fukunaga and Hummels, 1987a; Hughes, 1968). Jain and Chandrasekharan (1982) point out that the number of training samples per class should be at least 5–10 times the dimensionality of the data. The peaking phenomenon is known to be more severe for the NN classifier than for parametric classifiers such as Fisher's linear and quadratic classifiers (Fukunaga and Hummels, 1987b; Fukunaga, 1990). Duda et al. (2000) mention that for non-parametric techniques the demand for a large number of samples


grows exponentially with the dimensionality of the feature space. This limitation is called the curse of dimensionality. Thus, it is widely believed that the size of the training set needed to achieve a given classification accuracy would be prohibitively large when the dimensionality of the data is high. Two possible remedies are: (i) to reduce the dimensionality by employing feature selection or extraction methods, or (ii) to increase the training set size by adding artificially generated training patterns to the training set. Feature selection (Kudo and Sklansky, 2000) and extraction (Fodor, 2002) are well studied areas with a variety of techniques. Artificially increasing the training set size, however, is a less addressed area and is the one dealt with in this paper.

Generating an artificial training set from the given training set is often called bootstrapping, and the artificial samples are called bootstrap samples (Efron, 1979). The bootstrap method has been successfully applied to error estimation of the NN classifier (Jain et al., 1987; Chernick et al., 1985; Weiss, 1991; Hand, 1986). Hamamoto et al. (1997) proposed a bootstrap technique for NN classifier design and experimentally showed it to outperform conventional NN classifiers.

In this paper we propose a novel bootstrap technique for NN classifier design that we call partition based pattern synthesis. The artificial training set generated by this method can be exponentially larger than the original set, so conventional k-NN classification methods are not suitable: their space and time requirements would be very high. We propose two methods to find the k nearest neighbors with smaller space and classification time requirements: (i) a divide-and-conquer method (DC-k-NNC), which is equivalent to the conventional k-NN classifier applied to the synthetic patterns (k-NNC(SP)), and (ii) a greedy method (GDC-k-NNC), which is even better than DC-k-NNC in terms of classification time and classification accuracy. The relationships among the various NN methods and upper bounds on their space and classification time requirements are theoretically established. Experimental results show that our synthetic pattern based k-NN classifiers outperform both the conventional k-NN classifier with the original training set and the k-NN classifier based on Hamamoto's bootstrapped training set. Other advantages of our methods are that no separate bootstrap step is required to generate the artificial training set, and no extra space is needed to store the synthetic training patterns.

The paper is organized as follows. Section 2 describes partition based pattern synthesis with examples. Section 3 explains various NN classification methods using synthetic patterns. Experimental studies are given in Section 4 and conclusions in Section 5. The Appendix gives the relationships among the NN methods and some other proofs.

2. Partition based pattern synthesis

To describe the concept of partition based pattern synthesis we use the following notation and definitions.


2.1. Notation and definitions

Pattern: A pattern (data instance) is described as an assignment of values X = (x_1, ..., x_d) to a set of features F = (f_1, ..., f_d). X[f_i] is the value of the feature f_i in the instance X. That is, if X = (x_1, ..., x_d) then X[f_i] = x_i.

Sub-pattern: Let A be some subset of F and X be a feature vector. Then we use X_A to denote the projection of X onto the features in A. X_A is called a sub-pattern of X.

Set of sub-patterns: Let 𝒵 be a set of patterns. Then 𝒵_B = {Z_B | Z ∈ 𝒵} is called the set of sub-patterns of 𝒵 with respect to the subset of features B.

Set of classes: Ω = {ω_1, ω_2, ..., ω_c} is the set of classes.

Set of training patterns: 𝒳 is the set of all training patterns. 𝒳_l is the set of training patterns for class ω_l, and 𝒳 = 𝒳_1 ∪ 𝒳_2 ∪ ... ∪ 𝒳_c.

Partition: π = {B_1, B_2, ..., B_p} is a partition of F, i.e., B_i ⊆ F for all i, ∪_i B_i = F, and B_i ∩ B_j = ∅ if i ≠ j, for all i, j.

2.2. Formal description

We assume that all of the data instances are drawn from some probability distribution over the space of feature vectors. Let X^1, ..., X^p be p patterns (not necessarily distinct) in a class and let π = {B_1, ..., B_p} be a partition of F. Then X = (X^1_{B_1}, X^2_{B_2}, ..., X^p_{B_p}) is a synthetic pattern belonging to the same class, such that X[f_i] = X^j[f_i] if f_i ∈ B_j, for 1 ≤ i ≤ d and 1 ≤ j ≤ p. It is worth noting that a pattern Y can be equivalently represented as (Y_{B_1}, Y_{B_2}, ..., Y_{B_p}); that is, all projections of Y can be combined to get back that pattern.

For a given class of patterns 𝒳_l and for a given partition π_l = {B_{l1}, ..., B_{lp}},

SS_{π_l}(𝒳_l) = {(X^1_{B_{l1}}, X^2_{B_{l2}}, ..., X^p_{B_{lp}}) | X^1, ..., X^p ∈ 𝒳_l}

is the synthetic set of patterns for the class. One can also see SS_{π_l}(𝒳_l) as the Cartesian product 𝒳_{lB_{l1}} × 𝒳_{lB_{l2}} × ... × 𝒳_{lB_{lp}}, that is, the Cartesian product of the sets of sub-patterns of 𝒳_l with respect to the different blocks of the partition. The given training set (which we call the original training set) is a subset of the derived synthetic set.

(i) Example: This illustrates the concept of partition based pattern synthesis. Let F = (f_1, f_2, f_3, f_4), let 𝒳_l = {(a, b, c, d), (p, q, r, s), (w, x, y, z)} be the given training set for the class ω_l, and let π_l = {B_{l1}, B_{l2}} with B_{l1} = (f_1, f_2) and B_{l2} = (f_3, f_4). Then the synthetic set for the class is SS_{π_l}(𝒳_l) = {(X^a_{B_{l1}}, X^b_{B_{l2}}) | X^a, X^b ∈ 𝒳_l} = 𝒳_{lB_{l1}} × 𝒳_{lB_{l2}} = {(a, b, c, d), (a, b, r, s), (a, b, y, z), (p, q, c, d), (p, q, r, s), (p, q, y, z), (w, x, c, d), (w, x, r, s), (w, x, y, z)}.
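As a concrete illustration of Example (i), the following sketch enumerates the synthetic set as the Cartesian product of per-block sub-patterns; blocks are represented as lists of feature indices, and all names are illustrative rather than taken from the paper.

from itertools import product

def synthesize(patterns, blocks):
    """Cartesian product of per-block sub-patterns, recombined into full patterns."""
    d = len(patterns[0])
    synthetic = []
    for choice in product(patterns, repeat=len(blocks)):   # one source pattern per block
        new = [None] * d
        for pattern, block in zip(choice, blocks):
            for f in block:
                new[f] = pattern[f]
        synthetic.append(tuple(new))
    return synthetic

patterns = [('a', 'b', 'c', 'd'), ('p', 'q', 'r', 's'), ('w', 'x', 'y', 'z')]
blocks = [(0, 1), (2, 3)]                  # B_l1 = (f1, f2), B_l2 = (f3, f4)
print(len(synthesize(patterns, blocks)))   # 9 synthetic patterns, as in Example (i)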



(ii) Example: We present one more simple example, with handwritten digits, which has geometric appeal. Fig. 1 illustrates two original digits (for the digit '3') and four synthetic digits derived from the original digits. In this example the digits are drawn on a two-dimensional grid of size 4 × 5. A cell with '1' indicates the presence of ink; empty cells are assumed to have '0'.

Patterns are represented row-wise. So the original digits are (1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1) and (0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0). Let the partition be π = {B_1, B_2} such that B_1 consists of the first eight features and B_2 of the remaining 12 features. Four synthetic patterns (two of which are nothing but the original patterns), as shown in Fig. 1, can be derived.

Fig. 1. Illustration of synthetic patterns with OCR data: two original patterns for the digit '3' (blank cells are assumed to have 0's) and four synthetic patterns derived from them.

2.3. Quality of synthetic patterns

For a given class ω_l and for a given partition π_l = {B_{l1}, ..., B_{lp}}, it can be seen that the original set of patterns 𝒳_l and the generated synthetic set SS_{π_l}(𝒳_l) are from the same probability distribution if the blocks of F, viz. B_{l1}, ..., B_{lp}, are statistically independent for the given class (a formal proof is given in Appendix A). We call such a partition an ideal partition. Unfortunately, such a partitioning of F may not exist at all! Nevertheless, a partition which is close to the ideal partition may be useful in many cases. An argument in favor of this is given below.

Let p(X|ω_l) be the class conditional probability density at X in the feature space characterized by the set of features F for the given class ω_l. Let q_i(X_{B_i}|ω_l) be the (marginalized) probability density at X_{B_i} in the subspace spanned by the features in B_i for the given class ω_l. Let q(X|ω_l) = q_1(X_{B_1}|ω_l) · ... · q_p(X_{B_p}|ω_l). Then the synthetic set for the class ω_l is generated according to the density q(X|ω_l), for 1 ≤ l ≤ c. We call q(X|ω_l) the synthetic class conditional density and p(X|ω_l) the original class conditional density for class ω_l.

Now, consider the k-nearest neighbor (k-NN) classifier, which classifies a given test pattern T according to the majority vote among the k nearest neighbors of T in the training set. Let n be the total number of training patterns (in the original training set). Let V be the volume of the smallest hyper-sphere centered at T which encompasses all k nearest neighbors of T, and let k_l of the k nearest neighbors be from class ω_l, for 1 ≤ l ≤ c. Then an estimate of p(T, ω_l) is

\hat{p}(T, \omega_l) = \frac{k_l}{nV},

and thus a reasonable estimate for the posterior probability P(ω_l|T) is

\hat{P}(\omega_l | T) = \frac{\hat{p}(T, \omega_l)}{\sum_{i=1}^{c} \hat{p}(T, \omega_i)} = \frac{k_l}{k}.

So, the k-NN classifier can be seen as an approximation of the Bayes classifier. Theoretically it is shown (Duda et al., 2000) that the necessary and sufficient conditions for \hat{p}(T, ω_l) to converge to p(T, ω_l) are lim_{n→∞} k_l = ∞ and lim_{n→∞} k_l/n = 0. So, asymptotically the k-NN classifier is equivalent to the Bayes classifier. Also, it is shown (Cover and Hart, 1967) that the error rate of the 1-NN classifier (i.e., when k = 1) is upper-bounded by twice the Bayes error when the available sample size is infinite. From this it is clear that as the value of n increases (and with it the value of k), the performance of the k-NN classifier improves.

Let the bias of the estimate \hat{p}(T, ω_l) be b = E(\hat{p}(T, ω_l) − p(T, ω_l)). As the training set size increases, on average, b decreases. Let the estimate of q(T, ω_l) using the synthetic set be \hat{q}(T, ω_l) and let its bias be b_s = E(\hat{q}(T, ω_l) − q(T, ω_l)). Since the synthetic set is larger, on average, b_s < b. If p(T, ω_l) = q(T, ω_l) (this is the case when the partition π_l used for the synthesis is ideal) then \hat{q}(T, ω_l) is definitely a better estimate of p(T, ω_l). Even when π_l is not ideal, for a carefully chosen π_l it is possible that q(T, ω_l) is close to p(T, ω_l) (for instance with respect to a distance such as the Kullback–Leibler distance), and since the synthetic set is larger, it is reasonable to assume that E(\hat{q}(T, ω_l) − p(T, ω_l)) < E(\hat{p}(T, ω_l) − p(T, ω_l)). That is, the estimate using the large synthetic set can be closer to the actual (unknown) value than that using the original set. So, in those domains where a good partition for doing the pattern synthesis exists (i.e., where the synthetic distribution is closer to the original distribution), a k-NN classifier based on synthetic patterns can give better performance than one using only the original patterns.
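The following is an illustrative sketch of this argument under the assumption of two statistically independent features (so that the partition {{f_1}, {f_2}} is ideal): the synthetic set is the Cartesian product of the two sub-pattern sets, and being much larger it supports a larger k in the k-NN density estimate k/(mV). The numbers, the rule used to scale k, and all names are illustrative choices, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 10
X = rng.standard_normal((n, 2))                       # original set: two independent N(0, 1) features

# synthetic set: Cartesian product of the two 1-D sub-pattern sets (size n * n)
S = np.array([(a, b) for a in X[:, 0] for b in X[:, 1]])

def knn_density_estimate(data, point, k):
    """k-NN density estimate k / (m * V), with V the area of the disc reaching the k-th neighbour."""
    r = np.sort(np.linalg.norm(data - point, axis=1))[k - 1]
    return k / (len(data) * np.pi * r ** 2)

t = np.zeros(2)                                       # estimate the density at the origin
true_density = 1.0 / (2.0 * np.pi)                    # standard bivariate normal density at the origin
k_syn = int(k * np.sqrt(len(S) / n))                  # the larger set supports a larger k (heuristic choice)
print(true_density, knn_density_estimate(X, t, k), knn_density_estimate(S, t, k_syn))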


Among the various possible partitions, it is important to choose a good one based on some criterion. One such criterion is the Kullback–Leibler distance. For the class ω_l, choose a partition π_l such that

D_{KL}(\pi_l) = \int p(X|\omega_l) \ln \frac{p(X|\omega_l)}{q(X|\omega_l)} \, dX

is minimum. Obviously, if the blocks in π_l are statistically independent then D_KL(π_l) attains its least possible value, zero. In the absence of such a partition one can use a partition which has the least possible distance. Unfortunately, this distance function is useful only when the probability structure of the problem is known, which in general is not the case; what is given is only a set of training examples. Domain knowledge, if available, can be used in choosing a good partition. One can also consider the following facts in deriving an approximate partitioning method. Let r_ij be the correlation coefficient between a pair of features f_i and f_j, computed as r_ij = Covariance(f_i, f_j) / (Stddev(f_i) · Stddev(f_j)), where Stddev(f_i) is the standard deviation of feature f_i. These features are said to be uncorrelated if r_ij = 0. It is a known fact that if f_i and f_j are statistically independent then they are uncorrelated; the converse, in general, is not true. However, uncorrelated variables are statistically independent if they have a multivariate normal distribution, and in practice uncorrelated features are often treated as if they were statistically independent. Further, it is common practice to consider variables uncorrelated for practical purposes if the magnitude of their correlation coefficient is below some threshold (such as 0.05). Considering these points, an approximate partitioning method is to cluster the set of features so that features in a cluster are better correlated with each other than with features in different clusters. The absolute value of the correlation coefficient, |r_ij|, can be used as a similarity measure between the features f_i and f_j while deriving the clustering. We used a well known clustering method, the average-linkage method (Duda et al., 2000).
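A sketch of this approximate partitioning is given below, assuming SciPy is available: features are clustered by average-linkage agglomerative clustering with 1 − |r_ij| as the dissimilarity, and the dendrogram is cut into p blocks (p itself being chosen elsewhere, e.g., by cross validation). Function names are illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def partition_features(X, p):
    """Cluster the d features of the training matrix X (n x d) into p blocks of feature indices."""
    corr = np.corrcoef(X, rowvar=False)                # d x d correlation matrix (assumes no constant feature)
    dissim = 1.0 - np.abs(corr)                        # low dissimilarity = strongly correlated features
    np.fill_diagonal(dissim, 0.0)
    Z = linkage(squareform(dissim, checks=False), method='average')
    labels = fcluster(Z, t=p, criterion='maxclust')    # cut the dendrogram into (at most) p clusters
    return [list(np.where(labels == c)[0]) for c in np.unique(labels)]

# usage (one partition per class, as in the paper): blocks = partition_features(X_train_of_class, p=3)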


3. Nearest neighbor (NN) classification methods

This section describes three NN-based classification methods using the synthetic patterns, along with their space and time requirements. The first is a simple, straightforward method which explicitly generates the synthetic set and applies the k-NN classifier (k-NNC) to it. This method is called k-NNC(SP). Since the synthetic set size is O(n^p), where n is the size of the original training set and p is the number of blocks of the partition used for the synthesis, the space and time requirements of k-NNC(SP) are O(n^p). The second method, called divide-and-conquer k-NNC (DC-k-NNC), applies a divide-and-conquer strategy to find the k nearest neighbors of the test pattern in the synthetic set. The synthetic set is not generated explicitly; the method performs an implicit generation. DC-k-NNC has space and time requirements of O(n + k^p), and it is shown to be equivalent to k-NNC(SP) as far as the classification decisions are concerned. Finally, the third method, called greedy DC-k-NNC (GDC-k-NNC), is a simplification of DC-k-NNC and has space and time requirements of O(n + k). GDC-k-NNC is not decision equivalent to either DC-k-NNC or k-NNC(SP). Nevertheless, a relation between GDC-k-NNC and DC-k-NNC is established, and experimentally GDC-k-NNC is shown to have performance (classification accuracy) on par with that of DC-k-NNC.

3.1. Divide-and-conquer k-NN classifier with synthetic patterns (DC-k-NNC)

This method is decision equivalent to k-NNC(SP); that is, for a given test pattern T, both k-NNC(SP) and DC-k-NNC classify it to the same class. This equivalence is established in Appendix B. DC-k-NNC is better than k-NNC(SP) in terms of both space and time requirements. DC-k-NNC searches each set of sub-patterns belonging to a block (of the partition) and to a class separately; the results are combined at a later stage to get the actual k nearest neighbors.

3.1.1. DC-k-NNC method

(1) Find the k nearest neighbors of the test pattern T in each class of synthetic patterns separately. That is, local k nearest neighbors in each class are found for the given test pattern. Let 𝒵_l denote these local k nearest neighbors in class ω_l, for 1 ≤ l ≤ c.
(2) Find the k global nearest neighbors from the union of all local neighbors, that is, from 𝒵_1 ∪ ... ∪ 𝒵_c.
(3) Classify T to the class with the majority vote among the k global nearest neighbors.

The first step of DC-k-NNC is elaborated in Algorithm 1. The projected set of original training patterns of a class (say class ω_l) onto the features in a block of the partition π_l (say block B_j) is searched to find the k nearest neighbors of the projection of T onto B_j. Let this set be 𝒱_j. Then the local k nearest neighbors of T in the synthetic set for the class, i.e., in SS_{π_l}(𝒳_l), must be a subset of 𝒱 = 𝒱_1 × ... × 𝒱_p = {(X^{1t_1}_{B_1}, X^{2t_2}_{B_2}, ..., X^{pt_p}_{B_p}) | 1 ≤ t_l ≤ k, 1 ≤ l ≤ p}, where p = |π_l|. A formal proof of this can be found in Appendix B. Note that |𝒱| = k^p only, whereas |SS_{π_l}(𝒳_l)| = O(n^p). The local k-NNs for class ω_l are found by searching 𝒱, and this is done for all classes. The distance function used is the squared Euclidean distance. Note that the k-NNs found using either the Euclidean or the squared Euclidean distance are the same (except for tie breaks when many patterns are at the same distance from the test pattern). This simplifies some of the proofs given in Appendix B.



Algorithm 1. Step 1 of DC-k-NNC( )
{Let T be the given test pattern and 𝒳_l the set of given original training patterns belonging to class ω_l, for 1 ≤ l ≤ c.}
for each class ω_l, for l = 1 to c do
  for each subset (i.e., block) of features B_j ∈ π_l do
    Find the k nearest neighbors of T_{B_j} in 𝒳_{lB_j} = {X_{B_j} | X ∈ 𝒳_l}. Let these k nearest neighbors be 𝒱_j = {X^{j1}_{B_j}, X^{j2}_{B_j}, ..., X^{jk}_{B_j}}.
  end for
  Find the Cartesian product 𝒱 = 𝒱_1 × ... × 𝒱_p = {(X^{1t_1}_{B_1}, X^{2t_2}_{B_2}, ..., X^{pt_p}_{B_p}) | 1 ≤ t_l ≤ k, 1 ≤ l ≤ p}, where p = |π_l|. {Note that |𝒱| = k^p.}
  Find 𝒵_l = the k nearest neighbors of T in 𝒱. {𝒵_l gives the k local nearest neighbors in class ω_l.}
end for

DC-k-NNC does an implicit pattern synthesis and avoids explicit generation of the synthetic patterns.

3.1.2. Space and time requirements of DC-k-NNC

The space requirement of DC-k-NNC is O(n + k^p), since n is the original training set size and 𝒱 (see Algorithm 1) for each class is of size O(k^p), where p, the number of blocks, is assumed to be constant for each class's partition. The time requirement of the method is also O(n + k^p).

3.2. Greedy DC-k-NNC (GDC-k-NNC)

This method is similar to DC-k-NNC except for step 1. It does not find the k nearest neighbors; instead it derives k representatives, called greedy-neighbors, from the synthetic set of each class. The principal difference from DC-k-NNC is that this method avoids forming the Cartesian product of the sets of k nearest sub-patterns (i.e., patterns projected onto a block) found for each block, as done in DC-k-NNC. Instead it finds k representatives such that the ith representative is obtained by combining the ith nearest sub-patterns belonging to each block. This method is not strictly decision equivalent to either k-NNC(SP) or DC-k-NNC. The relation between DC-k-NNC and GDC-k-NNC with respect to the classification decisions they make is given in Appendix B; under certain assumptions, they are decision equivalent. Experimental studies show that GDC-k-NNC has performance on par with (sometimes even better than) DC-k-NNC, while being computationally cheaper.

3.2.1. GDC-k-NNC method

(1) Find the k greedy-neighbors of the test pattern T in each class of synthetic patterns. That is, local k greedy-neighbors in each class are found for the given test pattern. Let 𝒢_l denote the local k greedy-neighbors in class ω_l, for 1 ≤ l ≤ c.
(2) Find the k global greedy-neighbors from the union of all local greedy-neighbors, that is, from 𝒢_1 ∪ ... ∪ 𝒢_c.
(3) Classify T to the class with the majority vote among the k global greedy-neighbors.

The first step of GDC-k-NNC is elaborated in Algorithm 2 (an illustrative code sketch of Step 1 of both Algorithms 1 and 2 follows the listing).

Algorithm 2. Step 1 of GDC-k-NNC( )
{Let T be the given test pattern and 𝒳_l the set of given original training patterns belonging to class ω_l, for 1 ≤ l ≤ c.}
for each class ω_l, for l = 1 to c do
  for each subset (i.e., block) of features B_j ∈ π_l do
    Find the k nearest neighbors of T_{B_j} in 𝒳_{lB_j} = {X_{B_j} | X ∈ 𝒳_l}. Let these k nearest neighbors be 𝒱_j = {X^{j1}_{B_j}, X^{j2}_{B_j}, ..., X^{jk}_{B_j}}, and let X^{ji}_{B_j} be the ith nearest neighbor.
  end for
  𝒢_l = {(X^{11}_{B_1}, X^{21}_{B_2}, ..., X^{p1}_{B_p}), (X^{12}_{B_1}, X^{22}_{B_2}, ..., X^{p2}_{B_p}), ..., (X^{1k}_{B_1}, X^{2k}_{B_2}, ..., X^{pk}_{B_p})} is taken to be the set of k greedy-neighbors in class ω_l.
end for
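The sketch below illustrates Step 1 of Algorithms 1 and 2 for a single class, assuming the class's training patterns are held in an (n, d) array, the blocks are lists of feature indices, and k does not exceed the number of class patterns; names are illustrative. DC-k-NNC searches the k^p candidates of the Cartesian product 𝒱, whereas GDC-k-NNC simply pairs the ith nearest sub-patterns block-wise.

import numpy as np
from itertools import product

def block_knn(patterns, blocks, test, k):
    """For each block, indices of the k patterns whose sub-patterns are nearest to the test sub-pattern."""
    per_block = []
    for B in blocks:
        d = np.sum((patterns[:, B] - test[B]) ** 2, axis=1)   # squared Euclidean distance on the block
        per_block.append(np.argsort(d)[:k])
    return per_block

def assemble(choice, patterns, blocks, d):
    """Combine one chosen sub-pattern per block into a full synthetic pattern."""
    out = np.empty(d)
    for idx, B in zip(choice, blocks):
        out[B] = patterns[idx, B]
    return out

def local_neighbors(patterns, blocks, test, k, greedy=False):
    """Local k nearest (DC-k-NNC) or k greedy (GDC-k-NNC) neighbors of `test` in one class's synthetic set."""
    d = patterns.shape[1]
    per_block = block_knn(patterns, blocks, test, k)
    if greedy:
        # GDC-k-NNC: the i-th greedy-neighbor combines the i-th nearest sub-pattern of every block
        return [assemble([pb[i] for pb in per_block], patterns, blocks, d) for i in range(k)]
    # DC-k-NNC: search the k^p candidates of the Cartesian product for the k nearest synthetic patterns
    cands = [assemble(c, patterns, blocks, d) for c in product(*per_block)]
    cands.sort(key=lambda z: np.sum((z - test) ** 2))
    return cands[:k]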

3.2.2. Space and time requirements of GDC-k-NNC

The space and time requirements of the method are both O(n + k), since 𝒢_l in Algorithm 2 takes only O(k) space and time.

4. Experiments

This section describes the experimental studies performed.

4.1. Datasets

We performed experiments with six different datasets, viz. OCR, WINE, VOWEL, THYROID, GLASS and PENDIGITS. Except for the OCR dataset, all are from the UCI Repository (Murphy, 1994). The OCR dataset is also used in (Babu and Murty, 2001; Ananthanarayana et al., 2001). The properties of the datasets are given in Table 1. For the OCR, VOWEL, THYROID and PENDIGITS datasets, the training and test sets are available separately. For the remaining datasets, 100 patterns are chosen randomly as the training patterns and the rest are used as test patterns. All the datasets have only numeric valued features. The OCR dataset has binary discrete features, while the others have continuous valued features. Except for the OCR dataset, all datasets are normalized to have zero mean and unit variance for each feature.


Table 1
Properties of the datasets used

Dataset      Number of features   Number of classes   Number of training examples   Number of test examples
OCR          192                  10                  6670                          3333
WINE         13                   3                   100                           78
VOWEL        10                   11                  528                           462
THYROID      21                   3                   3772                          3428
GLASS        9                    7                   100                           114
PENDIGITS    16                   10                  7494                          3498

4.2. Classifiers for comparison

The classifiers chosen for comparison are as follows:

NNC: The conventional NN classifier with the original training patterns. The distance measure is Euclidean distance. Space and time requirements are O(n), where n is the number of original training patterns.

k-NNC: The conventional k-NN classifier with the original training patterns. The distance measure is Euclidean distance. Threefold cross validation is done to choose the k value. Space and classification time requirements of the method are both O(n) when k is assumed to be a constant compared with n.

NNC with bootstrapped training set (NNC(BS)): We used the bootstrap method given in (Hamamoto et al., 1997) to generate an artificial training set (a code sketch of this bootstrap step is given after this list). The bootstrapping method is as follows. Let X be a training pattern and let X^1, ..., X^r be its r nearest neighbors in its class. Then X' = (Σ_{i=1}^{r} X^i)/r is the artificial pattern generated for X. In this manner an artificial pattern is generated for each training pattern, and NNC is applied to this new bootstrapped training set. The value of r is chosen by threefold cross validation. The bootstrapping step requires O(n^2) time, whereas the space and classification time requirements are both O(n).

DC-k-NNC: This method is given in Section 3.1. The partitioning method used is average-linkage clustering (Duda et al., 2000); it was found to give better classification accuracy than similar methods such as single-linkage and complete-linkage clustering. The number of blocks (i.e., the parameter p) and the k value are each chosen by threefold cross validation.

GDC-k-NNC: This method is given in Section 3.2. The parameters are chosen as for DC-k-NNC.
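A sketch of the bootstrap step described above for NNC(BS) is given below, assuming the training patterns and labels are held in arrays X (n × d) and y (n,); whether a pattern counts itself among its r nearest neighbours is a detail not fixed by the description above, and all names are illustrative.

import numpy as np

def hamamoto_bootstrap(X, y, r):
    """Replace each training pattern by the mean of its r nearest neighbours within its own class."""
    X_boot = np.empty(X.shape, dtype=float)
    for i in range(len(X)):
        same = np.where(y == y[i])[0]                  # indices of patterns of the same class
        d = np.sum((X[same] - X[i]) ** 2, axis=1)
        nn = same[np.argsort(d)[:r]]                   # here the nearest neighbour is X[i] itself
        X_boot[i] = X[nn].mean(axis=0)
    return X_boot, np.array(y)

# usage: Xb, yb = hamamoto_bootstrap(X_train, y_train, r=3), then apply NNC with (Xb, yb)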

Table 2
A comparison between the classifiers (showing CA (%))

Dataset      NNC     k-NNC   NNC(BS)   DC-k-NNC   GDC-k-NNC
OCR          91.12   92.68   92.88     94.09      94.89
WINE         94.87   96.15   97.44     98.16      98.72
VOWEL        56.28   60.17   57.36     60.17      60.17
THYROID      93.14   94.40   94.57     97.23      97.23
GLASS        71.93   71.93   71.93     72.06      72.06
PENDIGITS    96.08   97.54   97.54     97.54      97.54

Table 3
A comparison between the classifiers showing percentage accuracies for various training set sizes of the OCR dataset

Training set size   NNC     k-NNC   NNC(BS)   GDC-k-NNC
2000                87.67   87.79   88.86     92.65
4000                90.16   90.22   90.84     94.03
6670                91.11   92.68   92.88     94.89

4.3. Experimental results

Table 2 shows the comparison between the various classifiers on the datasets. It can be seen that in most cases the NN classifiers using synthetic patterns (i.e., DC-k-NNC and GDC-k-NNC) outperform the others, and GDC-k-NNC shows similar or better performance than DC-k-NNC. Table 3 compares the classifiers for varying training set sizes (obtained by random sampling) of the OCR dataset; the sizes used are 2000, 4000 and 6670. It can be seen that GDC-k-NNC with 2000 training patterns achieves almost the same accuracy as the other classifiers with 6670 training patterns. With an equal number of training patterns, GDC-k-NNC clearly outperforms the other classifiers.

Fig. 2 compares the classification accuracy with respect to the number of blocks (p) used in the synthesis for the OCR dataset. Increasing p increases the number of synthetic patterns, but this does not mean that a larger p gives better performance: since the synthetic distribution for larger values of p can deviate considerably from the original distribution, the performance can actually decrease. For the OCR dataset, p = 3 gives the best performance.

Fig. 2. CA for various values of p (OCR dataset; curves for k = 1 and k = 15; horizontal axis: p-value, vertical axis: classification accuracy (%)).

Fig. 3. CA vs. k for various p values (OCR dataset; curves for p = 2, 3, 4, 6; horizontal axis: k-value, vertical axis: classification accuracy (%)).

Fig. 3 shows the classification accuracy against the k value for the GDC-k-NNC classifier on the OCR dataset, for various values of p. From this also it can be seen that p values greater than 3 reduce the performance.

Feature selection is often used to reduce the curse of dimensionality effect. Feature selection together with prototype selection for the k-NN classifier is studied for the OCR dataset in (Babu and Murty, 2005), where the reported best CA is 93.58%, whereas GDC-k-NNC, which uses the synthetic training set, gives 94.89%.

5. Conclusions

The paper presents a novel pattern synthesis technique called partition based pattern synthesis to create artificial (synthetic) training patterns so that the number of training patterns can be increased. This can reduce the curse of dimensionality effect, and hence the performance of a classifier using the larger synthetic set can improve. The quality of the derived synthetic training set depends on the partitions used for the synthesis. For carefully chosen partitions, nearest neighbor (NN) methods using the larger synthetic set can give better performance than those using only the given (original) training set. If the partition is a good one then the synthetic distribution (from which the synthetic set can be seen to be derived) can be close to the original distribution, and since the synthetic set is larger, the bias in the non-parametric probability density estimate using the synthetic patterns can be smaller than the bias using only the original training patterns.

Increasing the number of blocks in the partitions increases the synthetic set size, but this does not mean that the classification accuracy (CA) also increases, because the synthetic distribution can deviate considerably from the original distribution, and the classifier using the synthetic set can then show poor performance. There is a trade-off: we require a large synthetic set, and we also require the synthetic distribution to be close to the original distribution. Cross validation is applied to choose the number of blocks in the partitions.

Since, in general, the probability structure of the problem is unknown and only a training set is given, we used approximate partitioning methods to find a partition of the set of features to be used for the synthesis. The set of features is clustered using the average-linkage clustering method, where the absolute value of the correlation coefficient between a pair of features is taken as the similarity measure between them.

Since the synthetic training set can be much larger than the original training set, applying k-NN classifiers to the synthetic set directly could be computationally costly. To overcome this problem, divide-and-conquer strategies are applied to derive new, faster k-NN methods that work with the synthetic training set. These are called DC-k-NNC and GDC-k-NNC. DC-k-NNC is decision equivalent to the k-NN classifier using the synthetic set and has space and time requirements of O(n + k^p), where n is the original training set size and p is the number of blocks in the partition used for the synthesis. GDC-k-NNC is a greedy version of DC-k-NNC with space and time requirements of O(n + k).

Experimental studies are done with various standard datasets and the results are compared with conventional NN methods. In many cases, NN methods using the synthetic patterns are found to outperform the conventional methods.

Appendix A. Relationship between original and synthetic patterns

Parzen-window based density estimation is used to prove that both the original and synthetic patterns for a class are from the same distribution if the blocks in the partition are statistically independent for the given class. According to the Parzen-window method, an estimate of the probability density at a point X, given a set of patterns drawn from the same distribution in the form of a training set, is obtained from

we used approximate partitioning methods in finding a partition of the set of features which can be used for the synthesis. The set of features is clustered using average-linkage clustering method where the absolute value of correlation coefficient between a pair of features is taken to be a similarity measure between them. Since the synthetic training set size can be larger than the original training set size, applying k-NN classifiers using the synthetic set could be a computationally costly affair. To overcome this problem, divide-and-conquer strategies are applied in deriving new faster k-NN methods to work with the synthetic training set. These are called DC-k-NNC and GDC-k-NNC. DC-k-NNC is decision equivalent to k-NN classifier using the synthetic set. It has space and time requirements of O(n + kp) where n is the original training set size and p is the number of blocks in a partition that is used for doing the synthesis. GDC-kNNC is a greedy version of DC-k-NNC which has space and time requirements of O(n + k). Experimental studies are done with various standard datasets. The results are compared with conventional NN methods. In many cases, NN methods using the synthetic patterns are found to outperform the conventional methods. Appendix A. Relationship between original and synthetic patterns Parzen-window based density estimation method is used to prove that both original and synthetic patterns for a class are from the same distribution, if the blocks in the partition are statistically independent for the given class. According to the Parzen-windows method, an estimation for the probability density at a point X, where a set of patterns drawn from the same distribution is given in the form of a training set, is obtained from ^pðX Þ ¼

\hat{p}(X) = \frac{k/n}{V},

where V is the volume of a (small) region in the feature space encompassing X, n is the total number of patterns in the training set and k is the number of training patterns present in the region. As V → 0, k → ∞ and n → ∞, \hat{p}(X) converges to the actual density p(X). V is assumed to be a hyper-cube whose edge is of length h, and hence V = h^d, where d is the dimensionality of the feature space.

Theorem A.1. For a given class ω_l and for a given partition π = {B_1, ..., B_p}, the original training set 𝒳_l and the synthetic set SS_π(𝒳_l) are from the same probability distribution if B_1, ..., B_p are statistically independent for the class.

Proof. Let p(X|ω_l) be the density at X for the given class ω_l. Let q_i(X_{B_i}|ω_l) be the (marginalized) probability density at X_{B_i} in the subspace spanned by the features in B_i, for the given class ω_l, for 1 ≤ i ≤ p. Then, since B_1, ..., B_p are statistically independent, p(X|ω_l) = q_1(X_{B_1}|ω_l) · ... · q_p(X_{B_p}|ω_l).


Let V be the volume of the hyper-cube around the pattern X in the feature space spanned by F, with V = h^d where h is the length of each edge of the hyper-cube. Let V_i be the projection (having edges corresponding to the dimensions in B_i) of V onto the subspace spanned by the features in B_i, for 1 ≤ i ≤ p, so that V_i = h^{|B_i|}. Let q̂_i(X_{B_i}|ω_l) be the Parzen-window based density estimate of q_i(X_{B_i}|ω_l). Let k_i be the number of patterns present in V_i out of the n patterns in 𝒳_{lB_i} (the set of projected training patterns of class ω_l onto the features in B_i). Then, for 1 ≤ i ≤ p, q̂_i(X_{B_i}|ω_l) = k_i/(nV_i). So,

\hat{p}(X|\omega_l) = \frac{k_1 k_2 \cdots k_p}{n^p V_1 V_2 \cdots V_p} = \frac{k_1 k_2 \cdots k_p}{n^p h^{|B_1|} h^{|B_2|} \cdots h^{|B_p|}} = \frac{k_1 k_2 \cdots k_p}{n^p h^d} = \frac{k_1 k_2 \cdots k_p}{n^p V}.

We show that \hat{p}(X|ω_l) obtained from the synthetic set is also (k_1 k_2 ... k_p)/(n^p V). Since the parameters (such as V) used are arbitrary, both estimates converge to the actual density asymptotically. According to the pattern synthesis, if X^1_{B_1} ∈ 𝒳_{lB_1}, X^2_{B_2} ∈ 𝒳_{lB_2}, ..., X^p_{B_p} ∈ 𝒳_{lB_p}, then (X^1_{B_1}, X^2_{B_2}, ..., X^p_{B_p}) is a synthetic pattern in SS_π(𝒳_l). Further, if X^i_{B_i} is present in V_i, for 1 ≤ i ≤ p, then (X^1_{B_1}, X^2_{B_2}, ..., X^p_{B_p}) is present in V. Since k_i patterns of 𝒳_{lB_i} are present in V_i, for 1 ≤ i ≤ p, there are (k_1 k_2 ... k_p) synthetic patterns present in V. Since |𝒳_l| is assumed to be n, the total number of synthetic patterns for class ω_l is |SS_π(𝒳_l)| = n^p. So, the Parzen-window estimate of the density at X based on the set of synthetic patterns is (k_1 k_2 ... k_p)/(n^p V). □

Appendix B. Relationship between the NN methods

This section establishes the relationships between (i) k-NNC(SP) and DC-k-NNC, and (ii) DC-k-NNC and GDC-k-NNC.

B.1. Relationship between k-NNC(SP) and DC-k-NNC

k-NNC(SP) is decision equivalent to DC-k-NNC (assuming the two methods use the same strategy to break ties). This means that, for a given test pattern, both k-NNC(SP) and DC-k-NNC give the same classification decision. This is established by showing that the k nearest neighbors found by both methods are the same. The k nearest neighbors (NNs) of a test pattern T need not be unique, since there could be many patterns at the same distance from T. To simplify some of the following proofs, we assume that there is a way of getting unique k-NNs from the given training set for a given test pattern T. One way of doing this is to define a unique identifier (pid) for each pattern; whenever two or more patterns are at the same distance from T, they are ordered according to their pids. So, the distance function together with the pids imposes a total ordering on the training set, and according to this ordering unique k-NNs are found. For this purpose, we first define the pattern identifier (pid) for a training pattern, then impose a total order over the training set with respect to the given test pattern. Similarly, a total order over a set of sub-patterns is imposed.

Pattern identifier (pid): a pid uniquely identifies a pattern in a training set.

pid for original training patterns: Each original training pattern X^i ∈ 𝒳 is assigned a unique non-negative integer called pid(X^i) (the row number of the pattern in the database can serve this purpose).

pid for projected training patterns (sub-patterns): pid(X_B) = pid(X), where X_B is the projection of X onto B, B ⊆ F. We assume that two sub-patterns X_{B_1} and Y_{B_2} can be compared (in order to get an order between them) only if B_1 = B_2.

pid for synthetic training patterns: Let Q be a synthetic training pattern, Q = (X^1_{B_1}, X^2_{B_2}, ..., X^p_{B_p}). Then pid(Q) = ⟨pid(X^1), pid(X^2), ..., pid(X^p)⟩, an ordered p-tuple.

Relation between pids: Let X, Y be two training patterns with pid(X) = ⟨a_1, a_2, ..., a_p⟩ and pid(Y) = ⟨b_1, b_2, ..., b_p⟩. Then we say

(i) pid(X) = pid(Y), if a_i = b_i for 1 ≤ i ≤ p;
(ii) pid(X) < pid(Y), if there exists m (0 ≤ m < p) such that a_i = b_i for 1 ≤ i ≤ m and a_{m+1} < b_{m+1};
(iii) pid(X) > pid(Y), if pid(X) ≠ pid(Y) and pid(X) ≮ pid(Y).

Thus, the relation between pids is their lexicographic ordering.

B.1.1. Total order on the training set

Let T be the test pattern. For any two patterns W_i, W_j in the training set 𝒲, we say that the relation W_i ⪯_M W_j is true if

(i) i = j, or
(ii) M(W_i, T) < M(W_j, T), or
(iii) i ≠ j, M(W_i, T) = M(W_j, T), and pid(W_i) < pid(W_j),

where M(W_i, T) is the distance between W_i and T based on the distance function M. So, ⟨𝒲, ⪯_M⟩ is a linearly (totally) ordered set, and the members of 𝒲 can be enumerated (without repetitions) as ⟨W_{t_1}, W_{t_2}, ..., W_{t_k}, ..., W_{t_n}⟩, where W_{t_i} ⪯_M W_{t_{i+1}} for 1 ≤ i < n and |𝒲| = n. Now, the ith nearest neighbor of T is W_{t_i} and is unique. The k nearest neighbors of T in 𝒲, based on the metric M, then form the unique set {W_{t_1}, W_{t_2}, ..., W_{t_k}}.
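A sketch of this tie-breaking total order, with the row index serving as the pid, might look as follows; names are illustrative.

import numpy as np

def ordered_neighbors(patterns, test, k):
    """Unique k nearest neighbours of `test` under the (squared distance, pid) total order."""
    d = np.sum((patterns - test) ** 2, axis=1)
    order = sorted(range(len(patterns)), key=lambda i: (d[i], i))   # pid = row index breaks distance ties
    return order[:k]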



Further, it is assumed that the squared Euclidean distance is used as the distance function. Note that the k-NNs found using either the Euclidean or the squared Euclidean distance are the same. This assumption is also made to simplify some of the following proofs.

Lemma B.1. Let 𝒵_l = {X^{t_1}, X^{t_2}, ..., X^{t_k}} be the set of k nearest neighbors of the test pattern T in the synthetic set for the class ω_l, i.e., in SS_{π_l}(𝒳_l), where π_l = {B_1, B_2, ..., B_p}. Then 𝒵_l ⊆ 𝒱_1 × 𝒱_2 × ... × 𝒱_p, where 𝒱_i is the set of k nearest sub-patterns of T_{B_i} in 𝒳_{lB_i}, for 1 ≤ i ≤ p.

Proof. Let X^o ∈ 𝒵_l and X^o ∉ 𝒱_1 × 𝒱_2 × ... × 𝒱_p. We show that this leads to a contradiction, and thus prove the lemma. X^o ∉ 𝒱_1 × 𝒱_2 × ... × 𝒱_p implies X^o_{B_i} ∉ 𝒱_i for at least one i in 1 ≤ i ≤ p. Suppose, without loss of generality, that X^o_{B_1} ∉ 𝒱_1; the other cases can be proved along similar lines. Let B_2 ∪ B_3 ∪ ... ∪ B_p = B′. Let 𝒱_1 = {X^{t_1}_{B_1}, X^{t_2}_{B_1}, ..., X^{t_k}_{B_1}} be the set of k nearest sub-patterns of T_{B_1} in 𝒳_{lB_1}, and let it satisfy X^{t_1}_{B_1} ⪯ X^{t_2}_{B_1} ⪯ ... ⪯ X^{t_k}_{B_1}. Now, X^o ∈ 𝒵_l implies X^o ∈ SS_{π_l}(𝒳_l), and hence X^o_{B_1} ∈ 𝒳_{lB_1} and X^o_{B′} ∈ 𝒳_{lB_2} × 𝒳_{lB_3} × ... × 𝒳_{lB_p}. Since X^o_{B_1} ∉ 𝒱_1, we have X^{t_k}_{B_1} ⪯ X^o_{B_1}, and therefore (X^{t_1}_{B_1}, X^o_{B′}) ⪯ (X^{t_2}_{B_1}, X^o_{B′}) ⪯ ... ⪯ (X^{t_k}_{B_1}, X^o_{B′}) ⪯ (X^o_{B_1}, X^o_{B′}). Since (X^o_{B_1}, X^o_{B′}) = X^o, the set {(X^{t_1}_{B_1}, X^o_{B′}), (X^{t_2}_{B_1}, X^o_{B′}), ..., (X^{t_k}_{B_1}, X^o_{B′})} is a set of k patterns in SS_{π_l}(𝒳_l) that are nearer to T than X^o, so X^o ∉ 𝒵_l, which is the contradiction. □

Lemma B.2. Let 𝒵 = {X^{t_1}, X^{t_2}, ..., X^{t_k}} be the k nearest neighbors of a test pattern T in SS_{π_1}(𝒳_1) ∪ SS_{π_2}(𝒳_2) ∪ ... ∪ SS_{π_c}(𝒳_c), and let 𝒵_l = {X^{m_1}, X^{m_2}, ..., X^{m_k}} be the k nearest neighbors of T in SS_{π_l}(𝒳_l), for ω_l ∈ Ω. Then 𝒵 ⊆ 𝒵_1 ∪ 𝒵_2 ∪ ... ∪ 𝒵_c.

Proof. Self explanatory. □

Theorem B.1. For a given test pattern T and original training set 𝒳, the k nearest neighbors found by k-NNC(SP) and DC-k-NNC are the same.

Proof. Lemma B.1 establishes the correctness of Step 1 of DC-k-NNC, which is given in Algorithm 1. Lemma B.2 shows that Step 2 of DC-k-NNC indeed finds the k nearest neighbors from the entire synthetic set of patterns. On the other hand, k-NNC(SP) finds the k nearest neighbors from the explicitly generated synthetic set of patterns. Because of the total ordering imposed on the synthetic set, the set of k nearest neighbors is unique, and hence the k nearest neighbors found by k-NNC(SP) and DC-k-NNC are the same. □

B.2. Relationship between DC-k-NNC and GDC-k-NNC

Under certain conditions, DC-k-NNC and GDC-k-NNC are decision equivalent. This is established both theoretically and experimentally.

In Appendix B.1, a total ordering over the training patterns was established based on distances (to the test pattern) and pattern identifiers (pids). When only the total orderings over the sub-patterns (corresponding to each block of the partition) are considered, there is a partial ordering among the synthetic patterns. This partial ordering can be resolved at a later stage to yield a total ordering. For simplicity, consider the case with p = 2, i.e., the number of blocks in the partitions is 2; this can be generalized to arbitrary p. Let T be the given test pattern. For a class ω_l, let π = {B_1, B_2} be the partition, and let X_1, X_2, ..., X_k, ..., X_n be the totally ordered sub-patterns in the set of sub-patterns 𝒳_{lB_1} = {X_{B_1} | X ∈ 𝒳_l} with respect to T_{B_1}, such that X_i ⪯ X_{i+1}. Similarly, let Y_1, Y_2, ..., Y_k, ..., Y_n be the total ordering of the sub-patterns in 𝒳_{lB_2} with respect to T_{B_2}. Here we assume |𝒳_l| = n. Then the synthetic set for the class is SS_π(𝒳_l) = 𝒳_{lB_1} × 𝒳_{lB_2} = {(X_1, Y_1), (X_1, Y_2), (X_2, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}, which can be arranged in the form of a partial ordering as shown in Fig. 4. The first nearest neighbor in the synthetic set is G_1 = (X_1, Y_1), and the second nearest neighbor can be either A_1 = (X_1, Y_2) or A_2 = (X_2, Y_1) (this can be resolved based on their distances to T and their pids, but for the time being let us assume that it is not resolved). Similarly, all other neighbors can be listed. The set of all greedy-neighbors (as found by GDC-k-NNC) in the synthetic set is {G_1 = (X_1, Y_1), G_2 = (X_2, Y_2), ..., G_n = (X_n, Y_n)}. For example, if k = 4, DC-k-NNC searches the set {G_1, ..., G_4, A_1, ..., A_12}, which is of size 16, to find the 4 nearest neighbors, whereas GDC-k-NNC without any search gives {G_1, G_2, G_3, G_4}. The 4 nearest neighbors found by DC-k-NNC can be any of the following: (a) {G_1, A_1, A_3, A_5}, (b) {G_1, A_1, A_2, A_3}, (c) {G_1, A_1, A_2, G_2}, (d) {G_1, A_1, A_2, A_4}, (e) {G_1, A_2, A_4, A_8}. Fig. 5 shows the cases (a)–(c).

Let S_i be a set of p nearest neighbors in the synthetic set of patterns belonging to class ω_i, and let S_j be a set of q nearest neighbors in the synthetic set of patterns belonging to class ω_j. Let G(S) be the set of greedy-patterns in S. Then we assume that if |S_i| ≥ |S_j| then |G(S_i)| ≥ |G(S_j)|. Let us call this the monotonicity assumption. With this assumption the relationship between DC-k-NNC and GDC-k-NNC can be easily established.

Fig. 4. Partial ordering of the synthetic patterns.


Fig. 5. Different ways of getting 4 nearest neighbors as found by DC-k-NNC (cases (a)–(c)).

Let us denote by g(m; T) the number of greedy-neighbors in a set of m nearest neighbors of the test pattern T (from the entire synthetic set). That is, if 𝒵 is the set of m nearest neighbors of T, then g(m; T) = |G(𝒵)|.

Theorem B.2. For a given test pattern T, DC-k-NNC with k = m and GDC-k-NNC with k = g(m; T) are decision equivalent (except for tie breaks), provided the monotonicity assumption holds.

Proof. Let 𝒵 be the set of m nearest neighbors found by DC-k-NNC, that is, from the entire set of synthetic patterns. Let 𝒵_i be the subset of 𝒵 containing all patterns of 𝒵 that belong to class ω_i, for 1 ≤ i ≤ c. Suppose 𝒵 has the maximum number of patterns from a class ω_l, that is, |𝒵_l| ≥ |𝒵_i| for 1 ≤ i ≤ c. Let 𝒵_g be the set of g(m; T) greedy-patterns found by GDC-k-NNC. Since the set of g(m; T) greedy-neighbors is unique (because of the total ordering imposed on the set of synthetic patterns), we have 𝒵_g = G(𝒵). So 𝒵_g contains exactly G(𝒵_i) as its greedy-neighbors from class ω_i, for 1 ≤ i ≤ c. From the monotonicity assumption, |𝒵_l| ≥ |𝒵_i| implies |G(𝒵_l)| ≥ |G(𝒵_i)| for any l and i. So |G(𝒵_l)| ≥ |G(𝒵_i)| for 1 ≤ i ≤ c, and 𝒵_g has the maximum number of patterns from class ω_l. Hence the classification decisions made by DC-k-NNC and GDC-k-NNC are the same except for tie breaks; a tie break is needed when more than one class has the maximum number of patterns. □

B.2.1. Approximation to g(m; T)

Since g(m; T) depends on the training patterns and the test pattern T, it is difficult to find exactly. We give a heuristic approximation to it which is independent of the training patterns and T. The total number of synthetic patterns is O(n^p), where n is the number of original training patterns and p is the number of blocks used in the partitions, whereas the total number of greedy-patterns is only O(n). So g(m; T) is approximated by m^{1/p}. Since the number of nearest neighbors cannot be a fraction, ⌈m^{1/p}⌉ is used instead of m^{1/p} (for example, with m = 27 and p = 3, ⌈m^{1/p}⌉ = 3).

Fig. 6. Comparison between DC-k-NNC and GDC-k-NNC for OCR data (both with p = 3; horizontal axis: k-value for GDC-k-NNC and square-root(k) for DC-k-NNC, vertical axis: classification accuracy (%)).

B.2.2. Experimental evidence

Because of the monotonicity assumption and the approximation to g(m; T), DC-k-NNC with k = m is not strictly equivalent to GDC-k-NNC with k = ⌈m^{1/p}⌉. Nevertheless, experimental results show that they are very similar as far as classification decisions are concerned. Fig. 6 shows a comparison between the methods (both with p = 3) for the OCR data (Ananthanarayana et al., 2001) (see Section 4 for details). The horizontal axis in the figure corresponds to the k value of GDC-k-NNC and to √k for DC-k-NNC. At the peak of GDC-k-NNC, the difference in classification accuracy (CA) between GDC-k-NNC and DC-k-NNC is about 0.62%, and this is also the maximum difference between them. With respect to the maximum CA, GDC-k-NNC is better than DC-k-NNC. In general, this is true for the other datasets too.

References

Ananthanarayana, V., Murty, M., Subramanian, D., 2001. An incremental data mining algorithm for compact realization of prototypes. Pattern Recognition 34, 2249–2251.
Babu, T.R., Murty, M.N., 2001. Comparison of genetic algorithms based prototype selection schemes. Pattern Recognition 34, 523–525.



Babu, T.R., Murty, M.N., 2005. On simultaneous selection of prototypes and features in large data. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (Eds.), Pattern Recognition and Machine Intelligence, First Internat. Conf., PReMI 2005, Proc., Lecture Notes in Computer Science, vol. LNCS-3776. Springer-Verlag, Berlin, Heidelberg, pp. 595–600.
Chernick, M., Murthy, V., Nealy, C., 1985. Application of bootstrap and other resampling techniques: Evaluation of classifier performance. Pattern Recognition Lett. 3, 167–178.
Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Trans. Information Theory 13 (1), 21–27.
Duda, R.O., Hart, P.E., Stork, D.G., 2000. Pattern Classification, 2nd ed. Wiley-Interscience, John Wiley & Sons.
Efron, B., 1979. Bootstrap methods: Another look at the jackknife. Ann. Statist. 7, 1–26.
Fix, E., Hodges Jr., J., 1951. Discriminatory analysis: non-parametric discrimination: Consistency properties. Report No. 4, USAF School of Aviation Medicine, Randolph Field, TX.
Fix, E., Hodges Jr., J., 1952. Discriminatory analysis: non-parametric discrimination: Small sample performance. Report No. 11, USAF School of Aviation Medicine, Randolph Field, TX.
Fodor, I.K., 2002. A survey of dimension reduction techniques. Tech. Rep. UCRL-ID-148494, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, P.O. Box 808, L-560, Livermore, CA 94551.
Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press.

Fukunaga, K., Hummels, D., 1987a. Bias of nearest neighbor error estimates. IEEE Trans. Pattern Anal. Machine Intell. 9, 103–112.
Fukunaga, K., Hummels, D., 1987b. Bayes error estimation using Parzen and k-NN procedures. IEEE Trans. Pattern Anal. Machine Intell. 9, 634–643.
Hamamoto, Y., Uchimura, S., Tomita, S., 1997. A bootstrap technique for nearest neighbor classifier design. IEEE Trans. Pattern Anal. Machine Intell. 19 (1), 73–79.
Hand, D., 1986. Recent advances in error rate estimation. Pattern Recognition Lett. 4, 335–346.
Hughes, G., 1968. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Information Theory 14 (1), 55–63.
Jain, A., Chandrasekharan, B., 1982. Dimensionality and sample size considerations in pattern recognition practice. In: Krishnaiah, P., Kanal, L. (Eds.), Handbook of Statistics, vol. 2. North Holland, pp. 835–855.
Jain, A., Dubes, R., Chen, C., 1987. Bootstrap technique for error estimation. IEEE Trans. Pattern Anal. Machine Intell. 9, 628–633.
Kudo, M., Sklansky, J., 2000. Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 33 (1), 25–41.
Murphy, P.M., 1994. UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine, CA. Available from:
Weiss, S., 1991. Small sample error rate estimation for k-NN classifiers. IEEE Trans. Pattern Anal. Machine Intell. 13, 285–289.
