Available online at www.sciencedirect.com

Procedia Engineering 15 (2011) 1410–1415

www.elsevier.com/locate/procedia

Advances in Control Engineering and Information Science

A novel data selection method based on shadowed sets

Yu Zhou*, Haibin Su, Hongtao Zhang

North China Institute of Water Conservancy and Hydroelectric Power, Zhengzhou 450011, China

* Corresponding author. Tel.: +8615037128252. E-mail address: [email protected].

Abstract

One of the main factors affecting the supervised learning performance of a neural network (NN) is the training data set. However, for real-world problems it is not always easy to obtain high-quality training data. In this paper, a novel training data selection method based on shadowed sets is proposed that selects an informative and representative subset of the training data to improve the supervised learning performance of NN. The main goal of this work is to improve the generalization ability and reduce the misclassification error of the NN classifier. The paper first introduces the central idea of shadowed sets; the algorithm of the proposed data selection method is then described in detail; finally, taking the LVQ model as an example, experiments are conducted to test the validity of the method. The experimental results indicate that training an NN with the selected sample data reduces computational cost, shortens training time, and preserves generalization ability, which verifies the effectiveness and applicability of the proposed data selection method.

© 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of CEIS 2011. doi:10.1016/j.proeng.2011.08.261

Keywords: Data selection; neural network (NN); shadowed sets; supervised learning

1. Introduction

Supervised learning is one of the most important learning paradigms for neural networks (NN). It is well known that supervised training of an NN is highly dependent on the training data: the level of generalization, i.e., the ability to respond correctly to novel inputs, achievable with a fixed number of training data depends heavily on those data. Supervised learning usually works well only when high-quality training data are available. However, for real-world problems and practical engineering applications it is not always easy to create perfect training data, because training data are often difficult, expensive, or time-consuming to obtain, as they require the effort of experienced human annotators [1]. Without an effective strategy for selecting efficient training data, we must use the original training data, whose quality may be poor; in this case, training becomes difficult and inevitably leads to a drop in NN performance.


Training data selection addresses this problem by choosing the most representative and informative samples for supervised training; data selection aimed at better performance is therefore an important task. In this paper, we propose a novel data selection method based on shadowed set theory [2, 3] that can improve the quality of training data for several kinds of NN and thereby improve their supervised learning. The objectives of this paper are: (1) to select an informative and representative subset of the training data with the proposed method; (2) to improve the supervised learning performance (generalization ability, classification accuracy, and learning time) of NN by using the selected training data.

2. Outline of shadowed sets

The theory of shadowed sets was developed by Pedrycz [2, 3] to overcome the problem of excessive precision in describing imprecise information. Shadowed sets establish a sound compromise between a qualitative Boolean (two-valued) description of data and quantitative membership grades, and can be induced from a fuzzy set. Consider the fuzzy set depicted in Fig. 1. The idea is to modulate the membership values (MVs) along the lines of three-valued logic: a few MVs are elevated to 1 and a few others reduced to 0, which disambiguates the concept represented by the original fuzzy set. To maintain the overall level of vagueness, another region is designated the zone of uncertainty: in this part of the universe of discourse the MVs are left undefined, and the entire unit interval [0, 1], rather than a single value, is taken as a non-numeric model of the membership grade. The resulting mapping is called a shadowed set and is defined as A : X → {0, [0, 1], 1}, where X is a given universe of discourse. The co-domain of A consists of three components: 0, 1, and the unit interval [0, 1]. All elements for which A(x) assumes 1 (denoted core(A)) form the core of the shadowed set; they embrace all elements fully compatible with the concept conveyed by A. The elements of X for which A(x) attains 0 (denoted exclusion(A)) are excluded from A. The elements of X assigned the unit interval (denoted shadow(A)) are completely uncertain: we are not in a position to allocate any numeric membership grade to them. An illustration of a shadowed set is given in Fig. 1.

Fig. 1. A fuzzy set inducing a shadowed set via a threshold α: MVs above 1 − α form the core, MVs below α the exclusion zone, and the band in between the shadow.
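To make the three-valued mapping A : X → {0, [0, 1], 1} concrete, the following minimal Python sketch labels each element of a discretized universe as core, shadow, or exclusion for a given threshold α. The function name, the triangular fuzzy set, and the sample grid are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def shadowed_labels(mv, alpha):
    """Map membership values to the three-valued co-domain {0, [0, 1], 1}.

    mv: membership values A(x) sampled on a discretized universe X;
    alpha: threshold in [0, 1/2). Returns 'core', 'shadow' or 'exclusion'
    for each element, mirroring the regions of Fig. 1.
    """
    labels = np.full(mv.shape, "shadow", dtype=object)   # zone of uncertainty
    labels[mv <= alpha] = "exclusion"                    # MVs reduced to 0
    labels[mv >= 1.0 - alpha] = "core"                   # MVs elevated to 1
    return labels

# Illustrative triangular fuzzy set on X = [0, 4] (hypothetical data)
x = np.linspace(0.0, 4.0, 9)
mv = np.clip(1.0 - np.abs(x - 2.0) / 2.0, 0.0, 1.0)
print(list(zip(x, mv, shadowed_labels(mv, alpha=0.3))))
```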

One of the most appealing questions concerns the computation of the threshold α. An optimization based on the balance of vagueness was proposed in [2, 3]. Referring to Fig. 2, the reduction of some MVs to 0 and the elevation of others to 1 should be compensated by the increased uncertainty, in the form of the unit interval [0, 1], over the remaining ranges of A. The threshold α used for this quantification is selected according to the relationship


$$V(\alpha) = \left| \int_{-\infty}^{a_1} A(x)\,dx + \int_{a_2}^{+\infty} \bigl(1 - A(x)\bigr)\,dx - \int_{a_1}^{a_2} A(x)\,dx \right| \qquad (1)$$

i.e., Ω1 + Ω2 = Ω3, where A is the fuzzy set. In other words, the threshold α, located in [0, 1/2), should, if selected properly, zero Eqn. (1): V(α) = 0. The optimization task thus comes in the form

$$\alpha_{\mathrm{opt}} = \arg\min_{\alpha} V(\alpha) \qquad (2)$$

The three terms on the right-hand side of Eqn. (1) correspond to the regions Ω1, Ω2 and Ω3 in Fig. 2. The parameters a1 and a2 denote the integration boundaries, delineating the regions of the figure where the membership values are below the threshold α and above the threshold 1 − α, respectively.

Fig. 2. Computing the threshold α via an optimization that balances the regions Ω1 (reduced), Ω2 (elevated) and Ω3 (shadow).
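As a sanity check on Eqns. (1) and (2), the sketch below evaluates the balance V(α) numerically for a triangular fuzzy set and locates α_opt by a grid search over [0, 1/2). The trapezoidal integration and the example fuzzy set are our own illustrative choices, not part of the paper.

```python
import numpy as np

def balance(mv, x, alpha):
    """Numerical version of Eqn (1): |Omega1 + Omega2 - Omega3|.

    Omega1: area of MVs reduced to 0; Omega2: area added by elevating
    MVs to 1; Omega3: area under A over the shadow region.
    """
    low  = mv <= alpha            # region reduced to 0
    high = mv >= 1.0 - alpha      # region elevated to 1
    mid  = ~(low | high)          # shadow region
    omega1 = np.trapz(np.where(low,  mv,       0.0), x)
    omega2 = np.trapz(np.where(high, 1.0 - mv, 0.0), x)
    omega3 = np.trapz(np.where(mid,  mv,       0.0), x)
    return abs(omega1 + omega2 - omega3)

x  = np.linspace(0.0, 4.0, 2001)
mv = np.clip(1.0 - np.abs(x - 2.0) / 2.0, 0.0, 1.0)   # triangular fuzzy set
alphas = np.linspace(0.0, 0.5, 500, endpoint=False)   # alpha in [0, 1/2)
alpha_opt = alphas[np.argmin([balance(mv, x, a) for a in alphas])]
print("alpha_opt ~", round(float(alpha_opt), 3))
```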

In the discrete domain, Eqn. (1) is transformed into

$$V(\alpha_i) = \left| \sum_{x_k:\, u_{ik} \le \alpha_i} u_{ik} \;+\; \sum_{x_k:\, u_{ik} \ge u_{i\max} - \alpha_i} \bigl(u_{i\max} - u_{ik}\bigr) \;-\; \mathrm{card}\,\{x_k \mid \alpha_i < u_{ik} < u_{i\max} - \alpha_i\} \right| \qquad (3)$$

$$\alpha_i = \alpha_{\mathrm{opt}} = \arg\min_{\alpha} V(\alpha) \qquad (4)$$

where $u_{ik}$, $u_{i\min}$ and $u_{i\max}$ denote, respectively, the discrete membership values and the lowest and highest MVs of the i-th class.
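The following short Python sketch applies Eqns. (3) and (4) to one row of the fuzzy partition matrix; the grid-search resolution and the sample membership values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def optimal_threshold(u_i, num=500):
    """Grid search for Eqn (4): the alpha_i minimizing the discrete balance
    V of Eqn (3).

    u_i: one row of the fuzzy partition matrix (memberships of all data
    points in the i-th class). Candidates are scanned over [0, u_imax / 2),
    mirroring the continuous constraint alpha < 1/2.
    """
    u_max = u_i.max()
    def v(alpha):
        reduced  = u_i[u_i <= alpha].sum()                    # pushed to 0
        elevated = (u_max - u_i[u_i >= u_max - alpha]).sum()  # pushed up
        shadow   = np.count_nonzero((u_i > alpha) & (u_i < u_max - alpha))
        return abs(reduced + elevated - shadow)
    candidates = np.linspace(0.0, u_max / 2.0, num, endpoint=False)
    return candidates[np.argmin([v(a) for a in candidates])]

# Hypothetical membership row for one class
u_i = np.array([0.05, 0.10, 0.15, 0.30, 0.55, 0.70, 0.85, 0.90, 0.95])
print("alpha_i ~", round(float(optimal_threshold(u_i)), 3))
```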

3. Proposed training data selection mechanisms

Many researchers have proposed data selection methods from different viewpoints. Selecting training data at random from the sample space is the most commonly followed method; despite its simplicity, it does not guarantee that training will be optimal. Several improved methods therefore exist. In [4, 5] the authors choose the valuable training samples closest to the classification boundary. Other works [6, 7] put more emphasis on representative samples close to the class centers. In [8] the authors propose boundary data selection augmented with randomly selected data, which can accelerate the convergence of NN. In [9] the authors present three methods of selecting data points, namely center-based selection (CS), border-based selection (BS) and hybrid selection (HS), and show empirically that the HS method outperforms the other selection methods. From these past studies a clear picture emerges: two kinds of data points are especially important as training data for classification problems, namely the points close to class centers and the points close to the borders between classes. Hence, to improve the classification performance of NN, we select the data points close to class centers and boundaries as training data. In the framework of shadowed sets, we establish two corresponding data types: core data and boundary data.

Definition 1 (Core data): Assume that the number of classes is equal to c. Core data are the data points that belong to the core of at least one shadowed set:

$$\text{core data} = \{x \mid \exists\, i:\ x \in \mathrm{core}(A_i)\}, \quad i = 1, 2, \ldots, c \qquad (5)$$


Core data are formed by data points whose membership degree is elevated to one in some class; they are located in the central part of each class. The points in the core data are usually representative and valuable for the supervised learning of NN.

Definition 2 (Boundary data): Assume that the number of classes is equal to c. Boundary data are the data points that belong to the shadow of at least two shadowed sets:

$$\text{boundary data} = \{x \mid \exists\, i \neq j:\ x \in \mathrm{shadow}(A_i) \cap \mathrm{shadow}(A_j)\}, \quad i, j = 1, 2, \ldots, c \qquad (6)$$

This type of data is formed by points that do not belong to the core of any shadowed set but fall within the shadow of at least two of them. Boundary data consist of points around the borders between two or more classes. Since they are quite likely to lie near the decision boundaries of the classes, they can be regarded as "confusing samples", which require more attention in classification problems. It is worth noting that the capture of core data and boundary data is fully automated through the optimal thresholds of the shadowed sets; this is a distinguishing difference from other data selection methods. Moreover, once the core data and boundary data have been established, training data can be selected from them batch by batch, in contrast to the one-by-one selection of traditional methods, which increases the efficiency of data selection. The detailed algorithm of the proposed data selection method is described below; a code sketch follows the steps.

Step 1: Run the FCM algorithm to obtain the optimal fuzzy partition matrix U = [u_ik] of the available data set, i = 1, 2, …, c, k = 1, 2, …, N, where c stands for the number of classes and N for the number of data points.
Step 2: From the fuzzy partition matrix U = [u_ik], obtain c fuzzy sets.
Step 3: Compute the optimal threshold α_i for the i-th class according to Eqns. (3) and (4).
Step 4: Obtain the c shadowed sets induced by the c fuzzy sets and the c thresholds.
Step 5: Convert the partition matrix U = [u_ik] into the corresponding shadowed partition matrix U′ = [u′_ik], u′_ik ∈ {0, 1, [0, 1]}.
Step 6: According to U′ = [u′_ik], Definition 1 and Definition 2, establish the core data and the boundary data.
Step 7: Select the data. The basic principle is to choose the data closest to the class centers and boundaries. For this purpose, two control parameters β1 (between 0 and 1) and β2 (between 1 and 2) are introduced; they control the number of selected data. Using these control parameters, the shadowed partition matrix is updated as follows, and the core data and boundary data are then restructured according to Definitions 1 and 2:

$$u'_{ik} = \begin{cases} 1 & \text{if } u_{ik} > 1 - \beta_1 \alpha_i \\ 0 & \text{if } u_{ik} < \beta_2 \alpha_i \\ [0, 1] & \text{if } \beta_2 \alpha_i < u_{ik} < 1 - \beta_1 \alpha_i \end{cases} \qquad (7)$$

Step 8: Use the selected data to train the NN.
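The sketch below assembles Steps 2-8 under the assumption that Step 1 (FCM) has already produced a c × N partition matrix U; any FCM implementation will do, e.g. scikit-fuzzy's cmeans. The SHADOW flag encoding the unit interval [0, 1], the compact restatement of the Eqn (3)-(4) grid search, and the toy partition matrix are our own illustrative choices, not the authors' code.

```python
import numpy as np

CORE, EXCL, SHADOW = 1, 0, -1   # SHADOW encodes the unit interval [0, 1]

def optimal_threshold(u, num=500):
    """Eqn (3)-(4) by grid search, as in the earlier sketch (compact form)."""
    m = u.max()
    cand = np.linspace(0.0, m / 2.0, num, endpoint=False)
    def v(a):
        return abs(u[u <= a].sum() + (m - u[u >= m - a]).sum()
                   - np.count_nonzero((u > a) & (u < m - a)))
    return cand[np.argmin([v(a) for a in cand])]

def select_training_data(U, beta1=0.5, beta2=1.0):
    """Steps 3-7: thresholds, shadowed partition via Eqn (7), and the
    core/boundary data of Definitions 1 and 2. Returns selected indices."""
    c, N = U.shape
    alphas = [optimal_threshold(U[i]) for i in range(c)]          # Step 3
    Us = np.full((c, N), SHADOW)                                  # Steps 4-5, Eqn (7)
    for i in range(c):
        Us[i, U[i] > 1.0 - beta1 * alphas[i]] = CORE
        Us[i, U[i] < beta2 * alphas[i]] = EXCL
    in_core = (Us == CORE).any(axis=0)                            # Definition 1
    in_two_shadows = (Us == SHADOW).sum(axis=0) >= 2              # Definition 2
    core_idx = np.flatnonzero(in_core)
    boundary_idx = np.flatnonzero(in_two_shadows & ~in_core)
    return np.union1d(core_idx, boundary_idx)                     # Step 7

# Toy 2-class partition matrix (hypothetical numbers); Step 8 would then
# train the NN on the returned points.
U = np.array([[0.95, 0.80, 0.55, 0.45, 0.10, 0.05],
              [0.05, 0.20, 0.45, 0.55, 0.90, 0.95]])
print(select_training_data(U))
```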

4. Experimental study

Several experiments on classification problems with the LVQ neural network illustrate the application of the proposed data selection method. Both a synthetic data set and a benchmark data set are used to test the effectiveness and feasibility of the approach. To guarantee objectivity, each experiment was repeated 10 times with the same LVQ configuration.
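For readers who want to reproduce the setup: the paper does not spell out its LVQ configuration, so the following is only a minimal LVQ1 stand-in (one prototype per class, winner-take-all updates). The learning rate, epoch count, initialization, and smoke-test data are illustrative assumptions.

```python
import numpy as np

def train_lvq1(X, y, n_classes, lr=0.1, epochs=30, seed=0):
    """Minimal LVQ1: the winning prototype is pulled toward same-class
    samples and pushed away from different-class ones. A simplified
    stand-in for the paper's LVQ, not the authors' exact network."""
    rng = np.random.default_rng(seed)
    # Initialize each prototype at the mean of its class
    protos = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    for _ in range(epochs):
        for k in rng.permutation(len(X)):
            w = np.argmin(((protos - X[k]) ** 2).sum(axis=1))  # winner
            sign = 1.0 if w == y[k] else -1.0
            protos[w] += sign * lr * (X[k] - protos[w])
    return protos

def predict(protos, X):
    """Assign each sample to its nearest prototype."""
    return np.argmin(((X[:, None, :] - protos[None, :, :]) ** 2).sum(-1), axis=1)

# Tiny smoke test on two Gaussian blobs (hypothetical data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
protos = train_lvq1(X, y, n_classes=2)
print("train acc:", (predict(protos, X) == y).mean())
```

Training such a model on the indices returned by the selection sketch above, versus a random subset of equal size, would mimic the comparison reported below.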


4.1. Synthetic data set

The synthetic data set is a two-dimensional data set of 200 data points. The points are unlabeled, and we use FCM to label them. Two experiments are conducted. In experiment 1, we use the proposed data selection method to select training data for the LVQ and use the full synthetic data set to measure performance. In experiment 2, the same number of training data is selected at random. The computed optimal thresholds are α1 = 0.2639, α2 = 0.3214, α3 = 0.3418 and α4 = 0.2620, respectively. Based on Definitions 1 and 2 (β1 = 0.5, β2 = 1), we obtain core data comprising 54 points and boundary data comprising 39 points. We then use the selected data points to train the LVQ. Table 1 compares the experimental results. Compared with the traditional random selection method, the proposed method not only keeps the classification accuracy high and the learning time short but also preserves the generalization ability of the LVQ, which shows that the proposed data selection method is effective and feasible and selects representative and informative samples as the training data set.

Table 1. Experimental results using the synthetic data set

Experimental method       Experiment 1: data selection     Experiment 2: random
                          based on shadowed sets           selection method
Number of training data   93                               93
Learning speed (epochs)   5.9                              12.3
Training error            0.0295                           0.0269
Accuracy                  97.17%                           93.85%

4.2. Iris data set

This section considers the Iris data set. The system randomly chooses 75 instances from the Iris data set as the training data set and uses the remaining instances as the testing data set to measure the generalization ability of the trained network. Two experiments are conducted. In experiment 1, the training data set (75 instances) is used directly to train the LVQ. In experiment 2, data selection is conducted on the 75 training instances with our proposed method, the selected data are used to train the LVQ, and the 75 testing instances are used to measure performance. According to Eqns. (3) and (4), the computed optimal thresholds are α1 = 0.2003, α2 = 0.2006 and α3 = 0.0419. Table 2 exhibits the experimental results.

Table 2. Experimental results using the Iris data set

Experimental method       Experiment 1: without           Experiment 2: data selection
                          data selection                  based on shadowed sets
Number of training data   75                              40
Learning speed (epochs)   >200                            10
Training error            0.0533                          0.0333
Accuracy                  90.67%                          94.67%

It can be seen from Table 2 that experiment 2 requires a much shorter learning time. Moreover, the proposed method preserves the generalization ability of the network with less training data. We therefore find that the performance of the LVQ can be significantly improved with the selected training data produced by the proposed data selection method.


5. Conclusions

It is known that one of the important factors determining the fidelity of a neural network is the quality of its training data. To achieve the best possible supervised learning performance of NN, this paper proposes a novel training data selection method based on shadowed set theory to refine the training data. The main purpose of our work is to improve the generalization ability and classification accuracy of NN through the proposed data selection method. Compared with existing work, our method does not require much computational effort; implementing the selection mechanism with shadowed set theory makes the formation of core data and boundary data automatic. Experimental results on a synthetic data set and a benchmark data set indicate that the proposed selection method keeps the typical sample data while reducing the number of training data, and that using the selected data to train NN effectively improves the supervised learning performance of NN.

Acknowledgements

The authors gratefully acknowledge support from the Research Foundation of the Ministry of Education of China (Grant No. 107021) and the High-level Talents Scientific Research Project of the North China Institute of Water Conservancy and Hydroelectric Power (Grant No. 201117).

References

[1] X. J. Zhu, Semi-supervised learning literature survey, Technical Report 1530, Computer Science, University of Wisconsin-Madison, 2008.
[2] W. Pedrycz, Interpretation of clusters in the framework of shadowed sets, Pattern Recognition Letters 26 (2005) 2439-2449.
[3] W. Pedrycz, From fuzzy sets to shadowed sets: interpretation and computing, International Journal of Intelligent Systems 24 (2009) 48-61.
[4] K. Hara, K. Nakayama, Data selection method for generalization of multilayer neural network, IEICE Trans. Fundamentals E81-A (1998) 371-381.
[5] D. D. Lewis, W. A. Gale, A sequential algorithm for training text classifiers, in: Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval, 1994, pp. 3-12.
[6] G. Schohn, D. Cohn, Less is more: active learning with support vector machines, in: Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 839-846.
[7] Z. Xu, K. Yu, V. Tresp, J. Wang, Representative sampling for text classification using support vector machines, in: Proceedings of the 25th European Conference on Information Retrieval Research, 2003, pp. 393-407.
[8] K. Hara, K. Nakayama, A training method with small computation for classification, in: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN'00), 2000, pp. 3543-3547.
[9] D. Guan, W. Yuan, Y.-K. Lee, A. Gavrilov, S. Lee, Improving supervised learning performance by using fuzzy clustering method to select training data, Journal of Intelligent & Fuzzy Systems 19 (2008) 321-334.
