Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007

SAMPLING BASED ON MINIMAL CONSISTENT SUBSET FOR HYPER SURFACE CLASSIFICATION

QING HE 1, XIU-RONG ZHAO 1,2, ZHONG-ZHI SHI 1

1 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2 Graduate University of Chinese Academy of Sciences, Beijing 100039, China
E-MAIL: {heq, zhaoxr, shizz}@ics.ict.ac.cn


Abstract: For Hyper Surface Classification (HSC), based on the concept of the Minimal Consistent Subset for a disjoint Cover set (MCSC), this paper proposes a judgmental sampling method that selects a representative subset from the original sample set. The sampling method depends on the sample distribution. HSC can directly solve nonlinear multi-class classification problems and makes the sample distribution observable: the distribution is obtained by adaptively dividing the sample space, and the resulting hyper surface model is used directly to classify large databases, based on the Jordan Curve Theorem in topology, while sampling for the MCSC. The number of distinct MCSCs is calculated. An MCSC yields the same classification model as the entire sample set and therefore fully reflects its classification ability; for any subset of the sample set that contains an MCSC, the classification ability remains the same. Moreover, a formula is put forward that predicts the testing accuracy exactly when samples are deleted from the MCSC. MCSC is therefore the best way of sampling from the original sample set for the Hyper Surface Classification method.

Keywords: Minimal Consistent Subset; Disjoint Cover Set; Hyper Surface Classification; Sampling

1. Introduction

Sampling methods are classified as either probability or non-probability. In probability sampling, each member of the population has a known non-zero probability of being selected. The advantage of probability sampling is that the sampling error can be calculated; sampling error is the degree to which a sample might differ from the population. In non-probability sampling, the degree to which the sample differs from the population remains unknown. Judgmental sampling is a common kind of non-probability method: the researcher selects the sample based on judgment. For example, a researcher may decide to draw the entire sample from one "representative" house, even though the population includes all houses. When using this method, the researcher must be confident that the chosen samples are truly representative of the entire population. In this paper, a judgmental sampling method is proposed to obtain the Minimal Consistent Subset for a disjoint cover set based on Hyper Surface Classification.

Nearest Neighbor (NN) classification is one of the important non-parametric classification methods and has been studied at length [1]. To tackle the high computational demands of NN, the approach advocated over the years has been to select a representative subset of the original training data, or to generate a new prototype reference set from the available samples. Probably the earliest study of this kind was the "condensed nearest neighbor rule" (CNN) presented by Hart in 1968 [2]. A consistent subset of a sample set is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set, and the Minimal Consistent Subset (MCS) is defined as a consistent subset with a minimum number of elements. Hart also pointed out that every set has a consistent subset, since every set is trivially a consistent subset of itself, and that every finite set has a minimal consistent subset, although the minimum size is not, in general, achieved uniquely. Hart's method indeed ensures consistency, but the condensed subset is not minimal, and it is sensitive to the randomly picked initial selection and to the order in which the input samples are considered. Later proposals include the "reduced nearest neighbor rule" of Gates [3] and the "iterative condensation algorithm" of Swonger [4]; none of these algorithms achieves minimality of the reference subset. The method proposed by Chang in 1974 creates a reference set by generating new representative prototypes [5]: two nearest neighbors of the same class are merged at each step as long as such merging does not increase the error rate. The editing algorithm MULTIEDIT, developed by Devijver and Kittler in 1980 [6], aims at


editing the training samples to make the resulting classification more reliable, especially for samples located near the boundaries between classes. MULTIEDIT has been proven to be asymptotically Bayes-optimal for infinite samples. In 1994, Dasarathy presented a condensing algorithm for selecting an optimal consistent subset based on his concept of the nearest unlike neighbor subset (NUNS) [7]. The algorithm is the best known in terms of consistent subset size and the representative nature of the selected samples. However, his conjecture on the minimality of the obtained MCS (also cited in Kuncheva, 1997 [8]) was later shown not to hold by Kuncheva and Bezdek (1998) [9] and Cerveron and Fuertes (1998) [10] for the popular IRIS data set. Zhang and Sun (2002) [11] minimize the number of reference samples subject to a constraint on the classification error rate. Brighton and Mellish introduced an algorithm that rivals the most successful existing algorithm [12]; when evaluated on 30 different problems, neither algorithm consistently outperformed the other: consistency is very hard to achieve. As Brighton and Mellish pointed out, the structure of the classes formed by the instances can be very different from problem to problem, and this leads to inconsistency when one instance selection scheme is applied across many problems. We need to gain insight into the structure of the classes within the sample space to deploy a sample selection scheme effectively.

To solve the same problem for Hyper Surface Classification, this paper proposes a judgmental sampling method that obtains the Minimal Consistent Subset for a disjoint cover set based on Hyper Surface Classification. Hyper Surface Classification (HSC) is a novel classification method based on a hyper surface, put forward by He, Shi and Ren (2002) [13]. In this method, a hyper surface model is obtained by adaptively dividing the sample space in the training process, and the hyper surface is then used directly to classify large databases according to whether the wind number is odd or even, based on the Jordan Curve Theorem in topology. HSC is a kind of covering learning algorithm. Experiments show that HSC can efficiently and accurately classify large data sets in two-dimensional and three-dimensional space [14]. Though HSC can in theory classify higher-dimensional data according to the Jordan Curve Theorem, it is not as easy to realize HSC in higher-dimensional space as in three-dimensional space. However, what we really need is an algorithm that can deal with data not only of massive size but also of high dimensionality. Thus, in He, Zhao and Shi [15], a simple and effective dimension reduction method that loses no essential information is proposed. Formally, the method is a dimension reduction method, but


in nature it is a dimension transposition method. In [16], based on the idea of ensembles, another solution to the problem of HSC on high-dimensional data sets is proposed.

The rest of this paper is organized as follows. Section 2 gives an outline of Hyper Surface Classification (HSC). Section 3 describes the sampling method for the Minimal Consistent Subset for a disjoint cover set based on HSC. Section 4 presents experimental results, followed by our conclusions in Section 5.

2. Overview of Hyper Surface Classification Method

HSC is a universal classification method based on the Jordan Curve Theorem in topology.

Jordan Curve Theorem. Let X be a closed set in the n-dimensional space R^n. If X is homeomorphic to a sphere in (n−1)-dimensional space, then its complement R^n \ X has two connected components, one called the inside and the other called the outside.

Classification Theorem. For any given point x ∈ R^n \ X, x is inside X ⇔ the wind number, i.e. the intersection number between any radial from x and X, is odd; and x is outside X ⇔ the intersection number between any radial from x and X is even.

From the two theorems above, we conclude that X can be regarded as the classifier, which divides the space into two parts, and the classification process is very simple: just count the intersection number between a radial from the sample point and the classifier X. How to construct the separating hyper surface for a data set whose distribution is unknown is an important problem. Based on the Jordan Curve Theorem, we put forward the following construction method for the separating hyper surface in [13].

Step 1. Input the training samples, containing k categories and d dimensions. Let the training samples be distributed within a rectangular region.
Step 2. Divide the region into 10 × 10 × ⋯ × 10 (d factors) small regions, called units.
Step 3. If some units contain samples from two or more different categories, divide them into smaller units repeatedly until each unit covers samples from at most one category.
Step 4. Label each unit with 1, 2, …, k according to the category of the samples inside, and combine adjacent units with the same label into a bigger unit.

Step 5. For each unit, save its contour as a link; this represents a piece of the hyper surface. All these pieces together make the final separating hyper surface.
Step 6. Input a new sample and calculate its wind number with respect to the separating hyper surface, which can be done by drawing a radial from the sample. The class of the sample is then decided according to whether the intersection number between the radial and the separating hyper surface is even or odd.

HSC tries to solve nonlinear multi-class classification problems in the original space, using multiple pieces of hyper surface, without having to map into higher-dimensional spaces. It is polynomial in time complexity if the samples of each category are distributed over finitely many connected components. Experiments show that HSC can efficiently and accurately classify large data sets in two-dimensional and three-dimensional space for multi-class classification. Even for three-dimensional data sets with up to 10^7 samples, HSC is still very fast [14].
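To make the construction above concrete, the sketch below shows one way Steps 2-4 (the adaptive division into units) could be coded; it is our own simplified illustration, not the authors' implementation. The class name HSCModel, the dictionary keyed by (layer, cell index), and the unit lookup used in predict_one in place of the ray-intersection test of Step 6 are all assumptions, and the contour saving of Step 5 and the merging of adjacent units in Step 4 are omitted.

```python
from collections import defaultdict

class HSCModel:
    """Simplified sketch of the HSC unit model (hypothetical names and layout).

    Samples are points in [0, 1)^d with integer class labels.  A unit is keyed
    by (layer, cell index per dimension); mixed units are split into finer
    layers until every stored unit holds samples of one category only.
    """

    def __init__(self, divisions=10, max_layers=5):
        self.divisions = divisions
        self.max_layers = max_layers
        self.unit_label = {}      # unit key -> category label (Step 4)
        self.unit_samples = {}    # unit key -> samples that fell into the unit

    def _unit_key(self, x, layer):
        side = self.divisions ** layer
        return (layer, tuple(int(v * side) for v in x))

    def fit(self, X, y):
        self._divide(list(zip(X, y)), layer=1)
        return self

    def _divide(self, samples, layer):
        buckets = defaultdict(list)
        for x, label in samples:                      # Step 2: assign to units
            buckets[self._unit_key(x, layer)].append((x, label))
        for key, bucket in buckets.items():
            categories = {label for _, label in bucket}
            if len(categories) == 1 or layer == self.max_layers:
                self.unit_label[key] = bucket[0][1]   # pure unit: label it
                self.unit_samples[key] = bucket
            else:
                self._divide(bucket, layer + 1)       # Step 3: split mixed units

    def predict_one(self, x):
        # Stand-in for Step 6: find the finest labelled unit containing x
        # instead of counting ray / hyper-surface intersections.
        for layer in range(1, self.max_layers + 1):
            key = self._unit_key(x, layer)
            if key in self.unit_label:
                return self.unit_label[key]
        return None

# Toy usage: two well separated categories in the unit square.
model = HSCModel().fit([(0.12, 0.40), (0.13, 0.41), (0.80, 0.75)], [0, 0, 1])
print(model.predict_one((0.11, 0.42)))   # -> 0
```

Classifying a point then amounts to finding the finest labelled unit containing it, which inside a labelled region agrees with the odd/even intersection test of the Classification Theorem.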


3. Minimal Consistent Subset for Disjoint Cover Set

Suppose C is the collection of all subsets of a finite sample set S, and C′ is a disjoint cover set for S, i.e. a subset C′ ⊆ C such that every element of S belongs to one and only one member of C′. A Minimal Consistent Subset for a disjoint Cover set (MCSC) of C′ is a sample subset formed by choosing one and only one sample from each element of the disjoint cover set C′. For the HSC method, we call two samples a and b equivalent if they have the same category and fall into the same unit, and the points falling into the same unit form an equivalence class. The cover set C′ is then the set of all equivalence classes in the hyper surface H. More specifically, let H̄ be the interior of H and let u be a unit in H̄. The Minimal Consistent Subset of HSC, denoted by S_min|H, is a sample subset formed by selecting one and only one representative sample from each unit included in the hyper surface, i.e.


S_min|H = ∪_{u ⊆ H̄} { one and only one s ∈ u }
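As a small illustration of this definition (our own sketch, not part of the original paper), a disjoint cover can be represented as a list of equivalence classes; every MCSC takes exactly one sample from each class, so the number of distinct MCSCs is the size of the Cartesian product of the classes:

```python
from itertools import product
from math import prod

# A disjoint cover set represented as a list of equivalence classes
# (hypothetical sample identifiers).
cover = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]

# Each MCSC chooses one and only one sample from every class.
all_mcsc = [list(choice) for choice in product(*cover)]

# The number of MCSCs equals the size of the Cartesian product of the classes.
assert len(all_mcsc) == prod(len(cls) for cls in cover)  # 3 * 1 * 2 = 6
```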

For a given sample set, we propose the following computation method for its MCSC.

Step 1. Input the samples, containing k categories and d dimensions. Let the samples be distributed within a rectangular region.

Step 2. Divide the region into 10 × 10 × ⋯ × 10 (d factors) small regions, called units.
Step 3. If some units contain samples from two or more different categories, divide them into smaller units repeatedly until each unit covers samples from at most one category.
Step 4. Label each unit with 1, 2, …, k according to the category of the samples inside, and unite adjacent units with the same label into a bigger unit.
Step 5. For each sample in the set, locate its position in the model, i.e. figure out which unit it lies in.
Step 6. Combine the samples that lie in the same unit into one equivalence class; this yields a number of equivalence classes in different layers.
Step 7. Pick one and only one sample from each equivalence class to form the MCSC of HSC.

The algorithm above confirms Hart's statement in [2] that every set has a consistent subset, since every set is trivially a consistent subset of itself, and that every finite set has a minimal consistent subset, although the minimum size is not, in general, achieved uniquely. For our method, the number of samples in each MCSC equals the number of equivalence classes, and the number of distinct MCSCs equals the size of the Cartesian product of these equivalence classes. The method ensures both consistency and minimality for the given cover set. Moreover, it is not sensitive to a randomly picked initial selection or to the order in which the input samples are considered.

We point out that some samples in the MCSC are replaceable, while others are not. As we can see from the process of dividing large regions into small units, several close samples with the same category may fall into the same unit. In that case these samples are equivalent to each other in the building of the classifier, and we can randomly pick one of them for the MCSC. Sometimes, however, a unit contains only one sample; this sample plays a unique role in forming the hyper surface and is therefore irreplaceable in the MCSC.
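Continuing the hypothetical HSCModel sketch from Section 2, Steps 5-7 reduce to grouping the training samples by the unit they fall into (which that sketch already stores) and keeping one representative per unit; units holding a single sample are exactly the irreplaceable samples discussed above.

```python
def mcsc_from_model(model):
    """Steps 5-7: keep one and only one representative sample per labelled unit."""
    return [bucket[0] for bucket in model.unit_samples.values()]

def irreplaceable_samples(model):
    """Samples that are alone in their unit and therefore cannot be swapped out."""
    return [bucket[0] for bucket in model.unit_samples.values() if len(bucket) == 1]
```

Retraining on the subset returned by mcsc_from_model reproduces the same units and labels, which is why the MCSC keeps the full classification ability of the original set.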


4. Experiments and Characteristics of MCSC

To make the concept of MCSC based on HSC clearer and more vivid, the following two figures are given.

We use the Breast-Cancer-Wisconsin data set from the UCI repository, which contains 699 samples from 2 different categories. The data set is first transformed into 3 dimensions by the method in [15] and then trained by HSC. The trained hyper surface model, composed of units in two layers, is shown in Figure 1; each unit may contain multiple samples from the same category. We then apply the MCSC computation method of Section 3 to obtain the MCSC of this data set. The MCSC is also used for training, and its hyper surface structure is shown in Figure 2. The two figures are exactly the same except for the number of samples contained in some units. For a specific sample set, the MCSC fully reflects its classification ability, and any addition to the MCSC will not improve the classification ability either. This can be seen in Table 1, which gives the classification ability of the MCSC on the Breast-cancer-Wisconsin, Wine, Iris, Sonar and Wdbc data sets. In Test I, the MCSC is used for training and the remaining samples for testing. In Test II, ten samples are deleted from the testing set and added to the training set. We can see that after training, the MCSC yields the same hyper surface as the original data set but contains far fewer samples. Table 1 also lists the number of MCSCs, i.e. the size of the Cartesian product of all equivalence classes.

Figure 1. The Hyper Surface Structure of Breast-Cancer-Wisconsin

Figure 2. The Hyper Surface Structure of MCSC for Breast-Cancer-Wisconsin

Table 1. The classification ability of the Minimal Consistent Subset

Data Set                | No. of Samples | No. of Samples in MCSC | No. of MCSC | Test I | Test II
Breast-cancer-Wisconsin | 699 | 229 | 28623793345289208950919781601425489920000 | 100% | 100%
Wine                    | 178 | 129 | 2087354105856 | 100% | 100%
Iris                    | 150 | 81  | 369853055631360 | 100% | 100%
Sonar                   | 208 | 186 | 663552 | 100% | 100%
Wdbc                    | 569 | 268 | 3908163630399410091399837598126964736000000000 | 100% | 100%

From the viewpoint of the computational demands, it is interesting to determine how much loss in the consistency property results from an incomplete set. Figure 3 shows the variation in the extent of consistency achieved by the selected subset as a function of the size of the selected subset for the data set of Breast-cancer-Wisconsin. During


this experiment, at each step we add to the training set the most representative remaining sample, i.e. the sample whose unit contains the most samples. As we can see, the consistency increases monotonically as the subset expands under this sampling scheme; it rises quickly at first and then slows down gradually


until one hundred percent consistency is achieved, which is offered by the MCSC. However, when we delete samples from the MCSC, the testing accuracy tends to fall, and we can predict the testing accuracy exactly. Suppose there are N samples in a data set and its MCSC contains n samples. If the MCSC is used for training and the other samples for testing, the accuracy will be 100%. When one sample is deleted from the training set and added to the testing set, the accuracy drops to

1 − m / (N − n + 1)    (1)

where m is the number of samples that fall into the same unit as the deleted one. Take the Breast-cancer-Wisconsin data set for example: 229 of its 699 samples are in the MCSC. Table 2 lists all the cases in which one sample is deleted from the MCSC and added to the testing set, and shows how the accuracy is affected. The trend can be seen clearly in Figure 4.
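As a quick numerical check of formula (1) (our own illustration, using the N = 699 and n = 229 of Breast-cancer-Wisconsin), the predicted accuracies reproduce the corresponding rows of Table 2:

```python
def accuracy_after_single_deletion(N, n, m):
    """Formula (1): accuracy after deleting one MCSC sample whose unit held m samples."""
    return 1 - m / (N - n + 1)

for m in (1, 2, 3, 117):
    print(m, f"{accuracy_after_single_deletion(699, 229, m):.2%}")
# 1 -> 99.79%, 2 -> 99.58%, 3 -> 99.36%, 117 -> 75.16%, as in Table 2
```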

Figure 3. Consistency vs. No. of samples selected

Table 2. Single deletion from the MCSC of Breast-cancer-Wisconsin

ID of deleted sample from MCSC | Samples in the same unit as the deleted one (m) | Accuracy by experiment | Accuracy by formula 1 − m/(699 − 229 + 1) | Number of such cases
4   | 1   | 99.79% | 99.79% | 155
26  | 2   | 99.58% | 99.58% | 39
10  | 3   | 99.36% | 99.36% | 11
27  | 4   | 99.15% | 99.15% | 6
35  | 5   | 98.94% | 98.94% | 3
43  | 6   | 98.73% | 98.73% | 4
20  | 7   | 98.51% | 98.51% | 1
30  | 8   | 98.30% | 98.30% | 2
6   | 10  | 97.88% | 97.88% | 1
178 | 11  | 97.66% | 97.66% | 1
37  | 17  | 96.39% | 96.39% | 1
17  | 34  | 92.78% | 92.78% | 1
9   | 39  | 91.72% | 91.72% | 1
1   | 48  | 89.81% | 89.81% | 1
3   | 71  | 84.93% | 84.93% | 1
7   | 117 | 75.16% | 75.16% | 1


Figure 4. Changing trend in accuracy with different single deletions (accuracy vs. the number of samples in the same unit as the one deleted from the MCSC)

Table 3. Multiple deletions from the MCSC of Breast-cancer-Wisconsin

No. | ID of deleted samples | Description | Accuracy by experiment | Accuracy by formula
1 | 4, 26 | k = 2, m = {1, 2} | 99.36% | 99.36%
2 | 4, 26, 10, 27, 76 | k = 5, m = {1, 2, 3, 4, 5} | 96.84% | 96.84%
3 | 4, 26, 10, 27, 76, 43, 20, 30, 6, 178 | k = 10, m = {1, 2, 3, 4, 5, 6, 7, 8, 10, 11} | 88.13% | 88.13%


In general, if k (1 ≤ k ≤ n) samples are deleted from the MCSC, the accuracy falls to

1 − (m_1 + m_2 + ⋯ + m_k) / (N − n + k)    (2)

where m_i is the number of samples that fall into the same unit as the i-th deleted sample. Experiments with multiple deletions are shown in Table 3, where the predicted results are consistent with the experiments.
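The same kind of check (again our own sketch) reproduces the Table 3 predictions from formula (2):

```python
def accuracy_after_deletions(N, n, ms):
    """Formula (2): accuracy after deleting k = len(ms) MCSC samples,
    where ms[i] is the number of samples in the unit of the i-th deleted sample."""
    return 1 - sum(ms) / (N - n + len(ms))

for ms in ([1, 2], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 10, 11]):
    print(accuracy_after_deletions(699, 229, ms))
# 0.99364..., 0.96842..., 0.88125 -- i.e. the 99.36%, 96.84% and 88.13% of Table 3
```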


5. Conclusions


We have argued that which samples are critical to the classification process depends on the sample distribution, and that we need insight into the structure of the classes within the sample space to deploy a sample selection scheme effectively. The Minimal Consistent Subset for a disjoint cover set (MCSC) provided by our HSC algorithm is a first step towards more perspicuous insight into problem structure and sample distribution. Division of the sample space is the key issue if we wish to ensure successful sample selection, and the key to deployment is the insight we have into how the classes are constructed within the sample space. The number of samples in each MCSC for HSC equals the number of equivalence classes, and the number of distinct MCSCs equals the size of the Cartesian product of all equivalence classes. Some samples in the MCSC are replaceable, while others are not. For any given training set, the MCSC fully reflects its classification ability, and for any subset of the training set that contains the MCSC, the classification ability remains the same. However, when samples are deleted from the MCSC, the testing accuracy tends to fall, and this can be predicted exactly by our formula. The method ensures both consistency and minimality. Moreover, it is not sensitive to a randomly picked initial selection or to the order in which the input samples are considered.

Acknowledgements

This work is supported by the National Science Foundation of China (No. 60435010, 90604017, 60675010), the 863 National High-Tech Program (No. 2006AA01Z128), the National Basic Research Priorities Programme (No. 2003CB317004) and the Nature Science Foundation of Beijing (No. 4052025).

References

[1] B. V. Dasarathy, "Nearest neighbor (NN) norms: NN pattern classification techniques", Los Alamitos, CA: IEEE Computer Society Press, 1991.


[2] P. E. Hart, "The condensed nearest neighbor rule", IEEE Trans. Information Theory, IT-14 (3), pp. 515-516, May 1968.
[3] G. W. Gates, "The reduced nearest neighbor rule", IEEE Trans. Information Theory, IT-18 (3), pp. 431-433, 1972.
[4] C. W. Swonger, "Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition", in: S. Watanabe (Ed.), Frontiers of Pattern Recognition, Academic Press, New York, pp. 511-519, 1972.
[5] C. L. Chang, "Finding prototypes for nearest neighbor classifiers", IEEE Trans. Computers, C-23 (11), pp. 1179-1184, Nov 1974.
[6] P. A. Devijver and J. Kittler, "On the edited nearest neighbor rule", Proc. 5th ICPR, Miami, Florida, pp. 72-80, 1980.
[7] B. V. Dasarathy, "Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design", IEEE Trans. Syst. Man Cybern., 24 (3), pp. 511-517, March 1994.
[8] L. I. Kuncheva, "Fitness functions in editing kNN reference set by genetic algorithms", Pattern Recognition, 30 (6), pp. 1041-1049, 1997.
[9] L. I. Kuncheva, J. C. Bezdek, "Nearest prototype classification: clustering, genetic algorithms, or random search", IEEE Trans. Syst. Man Cybern., 28 (1), pp. 160-164, 1998.

[10] V. Cerveron, A. Fuertes, "Parallel random search and Tabu search for the minimal consistent subset selection problem", Lecture Notes in Computer Science, Vol. 1518, pp. 248-259, Springer, Berlin, 1998.
[11] H. B. Zhang, G. Y. Sun, "Optimal reference subset selection for nearest neighbor classification by tabu search", Pattern Recognition, 35, pp. 1481-1490, 2002.
[12] H. Brighton, C. Mellish, "Advances in Instance Selection for Instance-Based Learning Algorithms", Data Mining and Knowledge Discovery, 6, pp. 153-172, 2002.
[13] Q. He, Z. Z. Shi, L. A. Ren, "The classification method based on hyper surface", Proceedings of the 2002 International Joint Conference on Neural Networks, pp. 1499-1503, 2002.
[14] Q. He, Z. Z. Shi, L. A. Ren, E. S. Lee, "A Novel Classification Method Based on Hyper Surface", International Journal of Mathematical and Computer Modeling, pp. 395-407, 2003.
[15] Q. He, X. R. Zhao, Z. Z. Shi, "Classification based on dimension transposition for high dimension data", Soft Computing, 11 (4), pp. 329-334, 2007.
[16] X. R. Zhao, Q. He, Z. Z. Shi, "HyperSurface Classifiers Ensemble for High Dimensional Data Sets", in Wang et al. (Eds.): 3rd International Symposium on Neural Networks (ISNN 2006), LNCS 3971, pp. 1299-1304, 2006.

