
An Empirical Study on Dimensionality Reduction and Improvement of Classification Accuracy Using Feature Subset Selection and Ranking

D. Asir Antony Gnana Singh
Department of CSE, M.I.E.T Engineering College, Tiruchirappalli - 7, India
[email protected]

S. Appavu Alias Balamurugan
Research Coordinator, K.L.N College of Information Technology, Madurai - 15, India
[email protected]

E. Jebamalar Leavline
Department of ECE, Anna University Chennai, BIT Campus, Tiruchirappalli - 24, India
[email protected]

Abstract - Data mining is a part of the process of knowledge discovery from data (KDD). The performance of data mining algorithms depends largely on the effectiveness of the preprocessing algorithms, and dimensionality reduction plays an important role in preprocessing. Many methods have been proposed for dimensionality reduction; among them, feature subset selection and feature ranking methods achieve significant dimensionality reduction by removing irrelevant and redundant features from high-dimensional data. This improves the prediction accuracy of the classifier, reduces the false prediction ratio, and reduces the time and space complexity of building the prediction model. This paper presents an empirical study of the feature subset evaluators Cfs, Consistency and Filtered, and the feature rankers Chi-squared and Information Gain. The performance of these methods is analyzed with a focus on dimensionality reduction and improvement of classification accuracy, using a wide range of test datasets and classification algorithms, namely the probability-based Naive Bayes, the tree-based C4.5 (J48) and the instance-based IB1.

Keywords - Data preprocessing; Dimensionality reduction; Feature subset selection; Classification accuracy; Data mining; Machine learning; Classifier.

I. INTRODUCTION

Dimensionality reduction for classification has attracted significant attention in both pattern recognition and machine learning. A high-dimensional data space may increase the computational cost and reduce the prediction accuracy of classifiers [1]-[3]. The classification process, known as supervised learning, builds a classifier by learning from a training data set [6], and it has been observed that when the number of features in the training data exceeds a particular range for a given sample space, the accuracy of the classifier decreases [1], [2]. There are two ways to achieve dimensionality reduction: feature extraction and feature selection [5], [3]. In feature extraction [12], [13], [26], the original features in the measurement space are initially


transformed into a new dimension-reduced space via some specified transformation, and the significant features are then determined in the new space. Although the significant variables determined in the new space are related to the original variables, the physical interpretation in terms of the original variables may be lost. In addition, although the dimensionality may be greatly reduced using feature extraction methods such as principal component analysis (PCA) [14], the transformed variables usually involve all the original variables; often, the original variables may be redundant when forming the transformed variables. In many cases, it is desirable to reduce not only the dimensionality in the transformed space but also the number of variables that need to be considered or measured [15], [16], [26]. Unlike feature extraction, feature selection seeks optimal or suboptimal subsets of the original features [16]-[19], [31], [32] that preserve the significant information carried by the complete data, in order to facilitate feature analysis for high-dimensional problems [22]-[24], [26].

This study mainly focuses on analyzing the performance of the Cfs, Consistency and Filtered attribute subset evaluators with respect to dimensionality reduction, over a wide range of test datasets and learning algorithms, namely the probability-based learner Naive Bayes, the tree-based learner C4.5 (J48) and the lazy instance-based learner IB1.
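To make the contrast concrete, the short sketch below (an illustration only, not part of the original study; it assumes NumPy is available and uses synthetic data) shows that a PCA component is a weighted mixture of every original variable, whereas feature selection simply keeps a subset of the original columns unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 original features

# Feature extraction (PCA): project onto the first principal component.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
first_component = Vt[0]                # loading vector mixes all 5 variables
print("PCA loadings:", np.round(first_component, 2))

# Feature selection: keep two of the original columns as they are.
selected = [0, 3]                      # hypothetical chosen indices
X_reduced = X[:, selected]
print("Selected shape:", X_reduced.shape)
```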

A. Feature Subset Selection

Feature subset selection is a process that removes irrelevant and redundant features from the dataset to improve the prediction accuracy of the learning algorithm. Irrelevant features reduce the predictive accuracy, while redundant features deteriorate the performance of the learners and require more computation time and other resources for training and testing [9], [8]. Feature subset selection methods can be classified into three categories, namely wrapper, filter and hybrid [10], [11]. In the wrapper approach, a predetermined learning model is assumed, and features are selected that justify the learning performance of that particular model [27], [11], whereas the filter approach relies on statistical analysis of the feature set without utilizing any learning model [28]. The hybrid approach is a combination of the filter and wrapper methods.

1) Subset Generation and Searching Methods: A search strategy and a criterion function are needed for subset selection. The search algorithm generates and compares candidate feature subsets by calculating their criterion function values as a measure of the effectiveness of each considered subset, and the feature subset with the best criterion function value is given as the output of the feature selection algorithm [12], [5], [11], [4]. In general, several search strategies are used to generate feature subsets [30]-[33]. The first, sequential forward search, starts with an empty set and adds the most effective features one by one. The second, sequential backward search, starts with the full set of attributes and at each step removes the worst attribute remaining in the set. The third, bidirectional selection, starts at both ends, adding and removing features concurrently. In the fourth, random search, the search is performed on randomly selected subsets using a sequential or bidirectional strategy. The fifth, complete search, exhaustively examines the subsets and therefore yields the best solution, but it is not feasible when the number of features is large [11], [30]-[33].
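As an illustration of the first strategy, the following sketch (a simplified outline, not the exact search used in this study; the toy merit function is a hypothetical stand-in for any subset criterion) greedily grows a subset until no remaining feature improves the criterion.

```python
def forward_selection(features, merit):
    """Greedy sequential forward search.

    features : iterable of candidate feature indices
    merit    : callable mapping a set of indices to a score (higher is better)
    """
    selected, best_score = set(), merit(set())
    improved = True
    while improved:
        improved = False
        for f in set(features) - selected:
            score = merit(selected | {f})
            if score > best_score:
                best_score, best_feature, improved = score, f, True
        if improved:
            selected.add(best_feature)
    return selected, best_score

# Toy criterion: prefer subsets containing features 1 and 3, penalise size.
toy_merit = lambda s: len(s & {1, 3}) - 0.1 * len(s)
print(forward_selection(range(5), toy_merit))   # -> ({1, 3}, 1.8)
```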


B. Feature Ranking

Feature ranking makes use of a scoring function S(i) computed from the values x_{k,i} and y_k (k = 1, ..., m examples and i = 1, ..., n features). By convention, it is assumed that a high score indicates high relevance and that features are sorted in decreasing order of S(i). We consider ranking criteria defined for individual features, independently of the context of the others. Features are ranked according to an evaluation measure such as chi-squared or information gain, producing a list of attributes with their ranks from the first to the last ranked feature [25]. A threshold value decides which of the ranked features are selected.

C. Classification Algorithms

Classification algorithms can be employed to evaluate the performance of feature subset evaluators. Several algorithms are available, each with its own strengths and pitfalls, and research shows that no single learning algorithm works best on all supervised learning problems. The most familiar algorithms, namely Naive Bayes, C4.5 (J48) and IB1, have been chosen to build the experimental setup for this study. This choice is supported by the most highlighted characteristics of each classifier: Naive Bayes estimates the parameters (mean and variance of the variables) necessary for classification from a minimal amount of training data; the C4.5 (J48) decision tree is simple to understand and interpret, requires little data preparation, is robust and performs well on large data in a short time; and IB1 is able to learn quickly from a very small dataset.

1) Naive Bayes: This classifier performs classification using basic Bayes' theorem and gives relatively good performance on classification tasks [47]. In general, Naive Bayes (NB) learns under the assumption that the features are independent given the class variable. More formally, the classifier is defined by the discriminant function

f_i(X) = P(c_i) ∏_{j=1}^{N} P(x_j | c_i)        (1)

where X = (x1, x2, ..., xN) denotes a feature vector and c_i denotes a possible class label. The training phase consists of estimating the conditional probabilities P(x_j|c_i) and the prior probabilities P(c_i). The priors P(c_i) are estimated by counting the training instances that fall into class c_i and dividing the count by the size of the training set. Similarly, the conditional probabilities are estimated from the frequency distribution of feature x_j within the training subset labeled with class c_i. To classify a test vector of unknown class, the posterior probability of each class is calculated given the feature values in the test vector, and the test vector is assigned to the class with the highest probability [49].
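A minimal sketch of the decision rule in equation (1) is given below (illustrative only; it assumes categorical features and plain frequency counting without smoothing, which differs from practical implementations).

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(c) and P(x_j|c) by frequency counting (no smoothing)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)                 # (class, feature index) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(c, j)][v] += 1
    return prior, cond, len(labels)

def classify_nb(x, prior, cond, n):
    """Pick the class maximising P(c) * prod_j P(x_j|c), as in equation (1)."""
    best_class, best_score = None, -1.0
    for c, count_c in prior.items():
        score = count_c / n
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / count_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
prior, cond, n = train_nb(rows, labels)
print(classify_nb(("rain", "mild"), prior, cond, n))   # -> 'yes'
```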

2) C4.5 (J48): Many techniques exist for building a decision tree. In all of them, the given data are formed into a tree structure in which the branches represent associations between feature values and class labels; C4.5 (J48) is the most familiar and among the best of these techniques [48]. It partitions the training data recursively, based on examining the potential of the feature values to separate the classes. The decision tree learns from a set of training data through an iterative process of choosing a feature and splitting the dataset on the values of that feature. Entropy, or information gain, is used to select the most representative features for classification: the selected features have the lowest entropy and the highest information gain. The learning algorithm proceeds as follows: first, the entropy measure is computed for each feature; second, the dataset is partitioned according to the possible values of the feature with the lowest entropy; and third, probabilities are estimated in exactly the same way as in the Naive Bayes approach [49]. Although features are chosen one at a time in a greedy manner, each choice depends on the results of previous tests.

3) IB1: This classifier works on the nearest neighbour classification principle. The distance between each training instance and the given test instance is calculated with the Euclidean distance measure; if more than one instance has the smallest distance to the test instance, the first instance found is used. Nearest neighbour is one of the most significant learning algorithms and can be adapted to a wide range of problems [46]. To classify an unclassified vector X, this algorithm ranks the neighbours of X amongst a given set of N data points (Xi, ci), i = 1, 2, ..., N, and uses the class labels cj (j = 1, 2, ..., K) of the K most similar neighbours to predict the class of the new vector X. In particular, the classes of these neighbours are weighted by their similarity to X, where similarity is measured by the Euclidean distance metric, and X is assigned the class label with the greatest number of votes among the K nearest neighbours. This learner works on the presumption that the classification of an instance is likely to be most similar to the classification of other instances that are nearby in the vector space. Unlike learners such as NB, it does not depend on prior probabilities. It is computationally cost-effective when the dataset is small, but the distance calculations become expensive when the dataset is large; this cost can be reduced by adopting PCA or information-gain-based feature ranking for dimensionality reduction.
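The rule just described can be sketched as a bare 1-nearest-neighbour classifier (an assumed illustration with Euclidean distance and first-found tie breaking, not the Weka IB1 implementation itself).

```python
import math

def ib1_classify(test_point, training_set):
    """Return the class of the training instance closest to test_point.

    training_set is a list of (feature_vector, class_label) pairs; the first
    instance found wins in case of a distance tie.
    """
    best_label, best_dist = None, math.inf
    for vector, label in training_set:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, test_point)))
        if dist < best_dist:                 # strict '<' keeps the first tie
            best_dist, best_label = dist, label
    return best_label

train = [((1.0, 1.0), "A"), ((5.0, 5.0), "B"), ((1.2, 0.8), "A")]
print(ib1_classify((0.9, 1.1), train))       # -> 'A'
```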

The rest of this article is organized as follows. In Section II, the related work is reviewed. Section III elucidates the proposed work. Section IV presents the experimental results with discussion. In Section V, the conclusion is drawn along with future directions.

II. RELATED WORK

Many feature subset evaluator and ranker algorithms have been proposed for choosing the most relevant feature subsets from datasets. In this section, the feature subset selection algorithms Cfs, Consistency and Filtered subset evaluator and the feature rankers Chi-squared and Information Gain are discussed.

a. CFS

In CFS (Correlation-based Feature Selection), subsets of features are evaluated rather than individual features [38], [39], [43]. The kernel of this heuristic principle is that the effectiveness of features is evaluated based on the degree of inter-correlation among the features and their ability to predict the class. The goodness of a feature subset is determined by the heuristic in equation (2): a good subset contains features that are highly correlated with the class and have low inter-correlation with each other.


Merit_S = k · r_cf / sqrt( k + k(k − 1) · r_ff )        (2)

where S is a feature subset containing k features, r_cf is the average feature-class correlation, and r_ff is the average feature-feature inter-correlation. The numerator can be thought of as indicating how predictive of the class a group of features is, and the denominator how much redundancy there is among them. This heuristic handles irrelevant features, as they will be poor predictors of the class, and it discriminates against redundant features, as they will be highly correlated with one or more of the other features. Since features are treated independently, Cfs cannot identify strongly interacting features, such as in a parity problem; however, it has been shown that it can identify useful features under moderate levels of interaction [38], [43]. In order to apply equation (2), it is necessary to compute the correlation (dependence) between features. Cfs first discretizes numeric features using the technique discussed in [42], [43] and then uses symmetrical uncertainty to estimate the degree of association between discrete features X and Y:

SU = 2.0 × [ (H(X) + H(Y) − H(X,Y)) / (H(X) + H(Y)) ]        (3)

After computing a correlation matrix, Cfs applies a heuristic search strategy to find a good subset of features according to equation (2). As mentioned at the beginning of this section, the modified forward selection search is used, which produces a list of features ranked according to their contribution to the goodness of the set.
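For concreteness, the sketch below (an assumed illustration, not the Weka code) computes the symmetrical uncertainty of equation (3) from empirical entropies and plugs the averaged correlations into the merit of equation (2).

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def joint_entropy(xs, ys):
    return entropy(list(zip(xs, ys)))

def symmetrical_uncertainty(xs, ys):
    """Equation (3): SU = 2 * (H(X) + H(Y) - H(X,Y)) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    return 2.0 * (hx + hy - joint_entropy(xs, ys)) / (hx + hy)

def cfs_merit(feature_columns, class_column):
    """Equation (2): k*r_cf / sqrt(k + k(k-1)*r_ff), with SU as the correlation."""
    k = len(feature_columns)
    r_cf = sum(symmetrical_uncertainty(f, class_column) for f in feature_columns) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (sum(symmetrical_uncertainty(feature_columns[i], feature_columns[j])
                for i, j in pairs) / len(pairs)) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

f1 = [0, 0, 1, 1]          # perfectly predicts the class
f2 = [0, 1, 0, 1]          # unrelated to the class
cls = [0, 0, 1, 1]
print(cfs_merit([f1], cls), cfs_merit([f1, f2], cls))   # merit drops when f2 is added
```

In this toy example the merit falls from 1.0 to about 0.71 when the irrelevant feature is added, which is the behaviour a forward search exploits.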

b. Consistency

In consistency-based subset evaluation, class consistency is used as the evaluation metric for selecting the feature subset [40], [41]. These methods look for combinations of features whose values divide the data into subsets containing a strong single-class majority, and the search is usually biased in favour of small feature subsets with high class consistency. The consistency-based subset evaluator used here adopts the consistency metric of [41]:

Consistency_S = 1 − ( Σ_{i=0}^{J} (D_i − M_i) ) / N        (4)

where S is a feature subset, J is the number of distinct combinations of feature values for S, D_i is the number of occurrences of the i-th feature value combination, M_i is the cardinality of the majority class for the i-th feature value combination, and N is the total number of instances in the

data set. Datasets with numeric features are first discretized using the methods found in [42], [43]. The modified forward selection search described earlier in this section is used to produce a list of features ranked according to their overall contribution to the consistency of the feature set.
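A minimal sketch of the consistency measure in equation (4), applied to already-discretized features (an illustration under that assumption, not the Weka evaluator), is:

```python
from collections import Counter, defaultdict

def consistency(rows, labels, subset):
    """Equation (4): 1 - sum_i(D_i - M_i) / N over value combinations of subset."""
    groups = defaultdict(Counter)              # feature-value combination -> class counts
    for row, c in zip(rows, labels):
        key = tuple(row[j] for j in subset)
        groups[key][c] += 1
    inconsistency = sum(sum(counts.values()) - max(counts.values())
                        for counts in groups.values())
    return 1.0 - inconsistency / len(rows)

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["a", "a", "b", "b"]
print(consistency(rows, labels, subset=[0]))   # first feature alone  -> 1.0
print(consistency(rows, labels, subset=[1]))   # second feature alone -> 0.5
```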

c. Filter

This approach is most suitable when the aim is only to reduce the dimensionality of the data, rather than to train a classifier. The subset selection combines the advantages of the Cfs subset evaluator with the Greedy Stepwise search algorithm, and it reduces the dimensionality of high-dimensional data as much as possible [44].

d. Chi-Squared

This feature ranker uses the chi-squared (χ²) test [50]. It estimates the worth of a feature by computing the value of the chi-squared statistic with respect to the class. The initial hypothesis H0 is the assumption that the feature and the class are unrelated, and it is tested with the chi-squared formula given in equation (5):

χ² = Σ_i Σ_j (O_ij − E_ij)² / E_ij        (5)

where O_ij is the observed frequency and E_ij is the expected (theoretical) frequency asserted by the null hypothesis. The greater the value of χ², the stronger the evidence against the hypothesis H0.
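The statistic in equation (5) can be sketched as follows (an assumed illustration that builds the feature/class contingency table from raw counts, without any continuity correction):

```python
from collections import Counter

def chi_squared(feature_values, class_labels):
    """Equation (5): sum over table cells of (O_ij - E_ij)^2 / E_ij."""
    n = len(feature_values)
    observed = Counter(zip(feature_values, class_labels))
    feat_totals = Counter(feature_values)
    class_totals = Counter(class_labels)
    chi2 = 0.0
    for fv in feat_totals:
        for cl in class_totals:
            expected = feat_totals[fv] * class_totals[cl] / n
            chi2 += (observed[(fv, cl)] - expected) ** 2 / expected
    return chi2

relevant = ["x", "x", "y", "y"]
irrelevant = ["x", "y", "x", "y"]
cls = ["pos", "pos", "neg", "neg"]
print(chi_squared(relevant, cls), chi_squared(irrelevant, cls))   # -> 4.0 0.0
```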

e. Information Gain

This is a ranker-based feature selection measure from information theory. Given that entropy is a criterion of impurity in a training set S, a measure of the additional information about Y provided by X can be defined as the amount by which the entropy of Y decreases [51]. The Information Gain (Info-Gain) measure is formulated in equation (6):

Info-Gain = H(Y) − H(Y|X) = H(X) − H(X|Y)        (6)

The information gained about Y after observing X is equal to the information gained about X after observing Y. A weakness of the information gain criterion is that it is biased in favor of features with more values, even when they are not more informative.
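A sketch of the information gain score in equation (6) for a discrete feature (illustration only; entropies are estimated from raw counts) is:

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature_values, class_labels):
    """Equation (6): H(Y) - H(Y|X), with Y the class and X the feature."""
    by_value = defaultdict(list)
    for x, y in zip(feature_values, class_labels):
        by_value[x].append(y)
    n = len(class_labels)
    h_y_given_x = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(class_labels) - h_y_given_x

outlook = ["sunny", "sunny", "rain", "rain"]
cls = ["no", "no", "yes", "yes"]
print(information_gain(outlook, cls))                 # -> 1.0 (fully informative)
print(information_gain(["a", "b", "a", "b"], cls))    # -> 0.0 (uninformative)
```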

III. PROPOSED WORK

For the proposed work, the experimental setup was constructed with three feature subset selection methods, two feature ranking methods and ten standard machine-learning datasets from the Weka dataset collection and the UCI Machine Learning Repository [44], [52]. These datasets range in size from a minimum of 5 to a maximum of 36 features, and from a minimum of 14 to a maximum of 1728 instances.


The experiment was carried out with the Weka data mining tool [44]. The three feature subset evaluators and two feature rankers were evaluated on these datasets using the well-known classification algorithms, namely the probability-based Naive Bayes (NB), the tree-based C4.5 (J48) and the instance-based IB1. To balance the improvement in dimensionality reduction against classification accuracy, a threshold value Tv is formulated and fixed on a trial-and-error basis as shown in equation (7); features whose rank value is greater than the threshold are selected from the feature lists produced by the Chi-squared and Information Gain rankers.

Tv = Min + (Max − Min) / 2        (7)

where Tv is the threshold value, Min is the minimum rank value and Max is the maximum rank value.
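The thresholding rule of equation (7) can be sketched as follows (a hypothetical illustration with made-up rank scores; in the study itself the ranked lists come from the Chi-squared and Information Gain rankers in Weka):

```python
def select_by_threshold(ranked_scores):
    """Keep features whose rank score exceeds Tv = Min + (Max - Min) / 2, equation (7)."""
    scores = list(ranked_scores.values())
    tv = min(scores) + (max(scores) - min(scores)) / 2
    return [f for f, s in ranked_scores.items() if s > tv], tv

ranks = {"f1": 0.92, "f2": 0.40, "f3": 0.75, "f4": 0.05}   # hypothetical ranker output
selected, tv = select_by_threshold(ranks)
print(round(tv, 3), selected)                              # -> 0.485 ['f1', 'f3']
```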

IV. RESULTS AND DISCUSSION

The summary of the datasets used in the experiments and the experimental results derived from the analyses of dimensionality reduction and improvement of classifier accuracy are shown in Table I, Table II and Table III, respectively.

TABLE I. SUMMARY OF DATA SETS

S.No.  Dataset         Instances  Features  Classes
1      Contact Lenses  24         5         3
2      Diabetes        768        9         2
3      Glass           214        10        7
4      Ionosphere      351        35        2
5      Iris            150        5         3
6      Labor           57         17        2
7      Soybean         683        36        19
8      Vote             435        17        2
9      Weather         14         5         2
10     Car             1728       7         4

TABLE II. COMPARISON OF REDUCED FEATURE SUBSETS BY FEATURE SUBSET EVALUATORS AND RANKERS

              Feature Subset Evaluators            Feature Rankers
S.No.  Dataset         Cfs   Consistency  Filtered   Chi-squared  Information Gain
1      Contact Lenses  1     4            1          2            2
2      Diabetes        4     4            3          3            3
3      Glass           8     7            5          7            8
4      Ionosphere      14    7            14         25           25
5      Iris            2     2            2          2            2
6      Labor           7     4            4          3            3
7      Soybean         22    13           7          10           10
8      Vote            4     10           1          4            3
9      Weather         2     2            1          1            1
10     Car             1     6            1          2            1

In the dimensionality reduction analysis, it is observed that the feature subset evaluators Filtered, Consistency and Cfs and the feature rankers Chi-squared and Information Gain all reduce the dimensionality considerably, as shown in Fig. 1. The performance of the Filtered method is superior to that of all the other methods in terms of dimensionality reduction, regardless of the classifier, and the Chi-squared ranker performs better than Information Gain, Consistency and Cfs.

TABLE III. SUMMARY OF CLASSIFIER ACCURACY (%) WITH RESPECT TO THE FEATURE SUBSET EVALUATORS AND RANKERS
(I - Cfs, II - Consistency, III - Filtered, IV - Chi-squared Ranker, V - Information Gain Ranker)

Accuracy of NB on reduced feature subsets:
S.No.  Dataset         I      II     III    IV     V
1      Contact Lenses  70.83  70.83  70.83  87.50  87.50
2      Diabetes        77.47  77.47  76.43  76.43  76.43
3      Glass           47.66  44.39  44.39  49.06  47.66
4      Ionosphere      92.02  87.17  92.02  83.47  83.19
5      Iris            96.00  96.00  96.00  96.00  96.00
6      Labor           91.22  87.71  87.71  84.21  84.21
7      Soybean         87.11  81.69  83.30  81.25  82.72
8      Vote            96.09  92.41  95.63  92.87  94.71
9      Weather         57.14  57.14  50.00  50.00  50.00
10     Car             70.02  85.53  70.02  76.85  76.85
       Average         78.55  78.03  76.63  77.76  77.92

Accuracy of C4.5 (J48) on reduced feature subsets:
S.No.  Dataset         I      II     III    IV     V
1      Contact Lenses  70.83  83.33  70.83  87.50  87.50
2      Diabetes        74.86  74.86  74.60  74.60  73.04
3      Glass           68.69  70.09  65.88  69.62  68.69
4      Ionosphere      90.59  87.46  90.59  91.45  91.16
5      Iris            96.00  96.00  96.00  96.00  96.00
6      Labor           77.19  82.45  80.70  80.70  80.70
7      Soybean         85.65  83.74  82.86  80.81  82.86
8      Vote            96.09  96.32  95.63  95.63  95.63
9      Weather         42.85  42.85  50.00  50.00  50.00
10     Car             70.02  92.36  70.02  76.56  76.56
       Average         77.27  80.94  77.71  80.28  80.21

Accuracy of IB1 on reduced feature subsets:
S.No.  Dataset         I      II     III    IV     V
1      Contact Lenses  66.66  62.50  66.66  75.00  75.00
2      Diabetes        68.35  68.35  70.18  70.18  68.48
3      Glass           71.02  70.09  71.02  77.57  71.02
4      Ionosphere      88.88  87.74  88.88  88.03  88.03
5      Iris            96.66  96.66  96.66  96.66  96.66
6      Labor           84.21  87.71  87.71  80.70  80.70
7      Soybean         83.89  76.57  79.94  79.20  78.91
8      Vote            94.02  93.33  91.72  93.79  93.10
9      Weather         78.57  78.57  64.28  64.28  64.28
10     Car             66.84  77.25  66.84  73.49  73.49
       Average         79.91  79.87  78.38  79.89  78.96

Fig. 1: Performance comparison in dimensionality reduction of feature subset evaluators and rankers

In the classification accuracy analysis, it is observed that the performance of Cfs is superior for the instance-based IB1 and the probability-based Naive Bayes (NB) and inferior for the tree-based C4.5 (J48). Consistency performs well with C4.5 (J48) compared with the other methods. The performance of Chi-squared with NB and C4.5 (J48) is considerably better than that of the Information Gain and Filtered methods, as shown in Fig. 2.


Fig. 2: Comparison of classifier accuracy with feature subset evaluators and rankers

V. CONCLUSION

This paper presented an experimental analysis of dimensionality reduction and improvement of classifier accuracy using the feature subset evaluators Cfs, Consistency and Filtered and the feature rankers Chi-squared and Information Gain. From this experimental study, it is observed that, in the dimensionality reduction analysis, the

performance of the Filtered method is superior to that of all the other methods, regardless of the classifier, and the Chi-squared ranker performs better than Information Gain, Consistency and Cfs. In the classification accuracy analysis, the performance of Cfs is superior for the instance-based IB1 and the probability-based Naive Bayes (NB) and inferior for the tree-based C4.5 (J48); Consistency performs well with C4.5 (J48) compared with the other methods; and the performance of Chi-squared with NB and C4.5 (J48) is considerably better than that of the Information Gain and Filtered methods.

REFERENCES

[1] C.-I Chang and S. Wang, "Constrained band selection for hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 6, pp. 1575-1585, Jun. 2006.
[2] A. Plaza, P. Martinez, J. Plaza, and R. Perez, "Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 466-479, Mar. 2005.
[3] L. Zhang, Y. Zhong, B. Huang, J. Gong, and P. Li, "Dimensionality reduction based on clonal selection for hyperspectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 45, no. 12, Dec. 2007.
[4] J. Wang and C.-I Chang, "Independent component analysis-based dimensionality reduction with applications in hyperspectral image analysis," IEEE Trans. Geosci. Remote Sens., vol. 44, no. 6, pp. 1586-1600, Jun. 2006.
[5] A. Jain and D. Zongker, "Feature selection: Evaluation, application, and small sample performance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153-158, Feb. 1997.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Elsevier, 2006.
[7] X. Wang and K. K. Paliwal, "Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition," Pattern Recognition, vol. 36, pp. 2429-2439, 2003.
[8] Q. Song, J. Ni, and G. Wang, "A fast clustering-based feature subset selection algorithm for high dimensional data," IEEE Trans. Knowledge and Data Engineering, 2011.
[9] G. H. John, R. Kohavi, and K. Pfleger, "Irrelevant features and the subset selection problem," in Proc. of the Eleventh International Conference on Machine Learning, 1994, pp. 121-129.
[10] I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1424-1437, 2004.
[11] Md. M. Kabir, Md. Shahjahan, and K. Murase, "A new hybrid ant colony optimization algorithm for feature selection," Expert Systems with Applications, vol. 39, pp. 3747-3763, 2012.
[12] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.
[13] A. R. Webb, Statistical Pattern Recognition, 2nd ed., Wiley, 2002.


[14] I. T. Jolliffe, Principal Component Analysis, 2nd ed., Springer, 2002.
[15] G. P. McCabe, "Principal variables," Technometrics, vol. 26, pp. 137-144, May 1984.
[16] W. J. Krzanowski, "Selection of variables to preserve multivariate data structure using principal components," Applied Statistics, vol. 36, no. 1, pp. 22-33, 1987.
[17] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301-312, Mar. 2002.
[18] B. Krishnapuram, A. J. Hartemink, L. Carin, and M. A. T. Figueiredo, "A Bayesian approach to joint feature selection and classifier design," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1105-1111, Sept. 2004.
[19] M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain, "Simultaneous feature selection and clustering using mixture models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1154-1166, Sept. 2004.
[20] R. Kohavi and G. H. John, "Wrappers for feature subset selection," Artificial Intelligence, vol. 97, nos. 1-2, pp. 273-324, Dec. 1997.
[21] A. J. Miller, Subset Selection in Regression, Chapman and Hall, 1990.
[22] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters, vol. 15, no. 11, pp. 1119-1125, Nov. 1994.
[23] S. K. Pal, R. K. De, and J. Basak, "Unsupervised feature evaluation: A neuro-fuzzy approach," IEEE Trans. Neural Networks, vol. 11, no. 2, pp. 366-376, Mar. 2000.
[24] K. Z. Mao, "Identifying critical variables of principal components for unsupervised feature selection," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 35, pp. 339-344, 2005.
[25] I. T. Jolliffe, "Discarding variables in a principal component analysis-I: Artificial data," Applied Statistics, vol. 21, no. 2, pp. 160-173, 1972.
[26] H.-L. Wei and S. A. Billings, "Feature subset selection and ranking for data dimensionality reduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, Jan. 2007.
[27] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[28] M. Dash and H. Liu, "Feature selection for classification," Intelligent Data Analysis, vol. 1, pp. 131-156, 1997.
[29] J. Huang, Y. Cai, and X. Su, "A hybrid genetic algorithm for feature selection wrapper based on mutual information," Pattern Recognition Letters, vol. 28, pp. 1825-1844, 2007.
[30] S. Guan, J. Liu, and Y. Qi, "An incremental approach to contribution-based feature selection," Journal of Intelligent Systems, vol. 13, no. 1, 2004.
[31] H. Peng, F. Long, and C. Ding, "Overfitting in making comparisons between variable selection methods," Journal of Machine Learning Research, vol. 3, pp. 1371-1382, 2003.
[32] E. Gasca, J. S. Sanchez, and R. Alonso, "Eliminating redundancy and irrelevance using a new MLP-based feature selection method," Pattern Recognition, vol. 39, pp. 313-315, 2006.
[33] C. Hsu, H. Huang, and D. Schuschel, "The ANNIGMA-wrapper approach to fast feature selection for neural nets," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 32, no. 2, pp. 207-212, 2002.
[34] R. Caruana and D. Freitag, "Greedy attribute selection," in Proc. of the 11th International Conference on Machine Learning, Morgan Kaufmann, 1994.
[35] C. Lai, M. J. T. Reinders, and L. Wessels, "Random subspace method for multivariate feature selection," Pattern Recognition Letters, vol. 27, pp. 1067-1076, 2006.
[36] D. J. Stracuzzi and P. E. Utgoff, "Randomized variable elimination," Journal of Machine Learning Research, vol. 5, pp. 1331-1335, 2004.
[37] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, 2005.
[38] M. A. Hall, "Correlation-based feature selection for machine learning," Ph.D. thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1998.
[39] M. A. Hall, "Correlation-based feature selection for discrete and numeric class machine learning," in Proc. of the 17th International Conference on Machine Learning (ICML 2000), 2000.
[40] H. Almuallim and T. G. Dietterich, "Learning with many irrelevant features," in Proc. of the Ninth National Conference on Artificial Intelligence, AAAI Press, 1991, pp. 547-552.
[41] H. Liu and R. Setiono, "A probabilistic approach to feature selection: A filter solution," in Proc. of the 13th International Conference on Machine Learning, Morgan Kaufmann, 1996, pp. 319-327.
[42] U. M. Fayyad and K. B. Irani, "Multi-interval discretisation of continuous-valued attributes," in Proc. of the Thirteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1993, pp. 1022-1027.
[43] M. A. Hall and G. Holmes, "Benchmarking attribute selection techniques for discrete class data mining," IEEE Trans. Knowledge and Data Engineering, vol. 15, no. 3, May/June 2003.


[44] R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, and D. Scuse, WEKA Manual for Version 3-6-6, University of Waikato, Hamilton, New Zealand, October 28, 2011.
[45] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proc. of the Tenth National Conference on Artificial Intelligence, San Jose, CA, AAAI Press, 1992, pp. 223-228. Available: http://www.isle.org/_langley/papers/bayes.aaai92.ps
[46] M. Kuramochi and G. Karypis, "Gene classification using expression profiles: a feasibility study," International Journal on Artificial Intelligence Tools, vol. 14, no. 4, pp. 641-660, 2005.
[47] P. Domingos and M. Pazzani, "Feature selection and transduction for prediction of molecular bioactivity for drug design," Machine Learning, vol. 29, pp. 103-130, 1997.
[48] E. P. Xing, M. I. Jordan, and R. M. Karp, "Feature selection for high-dimensional genomic microarray data," in Proc. of the 18th International Conference on Machine Learning, 2001, pp. 601-608.
[49] J. Novakovic, P. Strbac, and D. Bulatovic, "Toward optimal feature selection using ranking methods and classification algorithms," Yugoslav Journal of Operations Research, vol. 21, no. 1, pp. 119-135, 2011.
[50] M. A. Hall and L. A. Smith, "Practical feature subset selection for machine learning," in Proc. of the 21st Australian Computer Science Conference, 1998, pp. 181-191.
[51] H. Liu and R. Setiono, "Chi2: Feature selection and discretization of numeric attributes," in Proc. IEEE 7th International Conference on Tools with Artificial Intelligence, 1995, pp. 338-391.
[52] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2006.