2009 IEEE International Advance Computing Conference (IACC 2009) Patiala, India, 6-7 March 2009
NANO: A New Supervised Algorithm for Feature Selection with Discretization

J. Senthilkumar1, D. Manjula2, R. Krishnamoorthy3
1,2Department of Computer Science and Engineering, Anna University, Chennai, India
3Department of Information Technology, Anna University, Tiruchirapalli, India
[email protected], [email protected], [email protected]

Abstract- Discretization turns numeric attributes into discrete ones. Feature selection eliminates some irrelevant and/or redundant attributes. Data discretization and feature selection are two important tasks that are performed prior to the learning phase of data mining algorithms and significantly reduce the processing effort of the learning algorithm. In this paper, we present a new algorithm, called Nano, that performs data discretization and feature selection simultaneously. In the feature selection process, irrelevant and redundant attributes are eliminated using a measure of inconsistency, which is also used to determine the final number of intervals. The proposed Nano algorithm aims at keeping the minimal number of intervals with minimal inconsistency and establishes a tradeoff between these measures. The empirical results demonstrate that the proposed Nano algorithm is effective in feature selection and discretization of numeric and ordinal attributes.

Keywords- Discretization, feature selection, pattern classification

I. INTRODUCTION

Feature selection has been an active research area in the pattern recognition, statistics, and data mining communities. The main idea of feature selection is to choose a subset of input variables by eliminating features with little or no predictive information. Feature selection can significantly improve the comprehensibility of the resulting classifier models and often builds a model that generalizes better to unseen points. Further, it is often the case that finding the correct subset of predictive features is an important problem in its own right. For example, a physician may decide, based on the selected features, whether a dangerous surgery is necessary for treatment or not. Feature selection in supervised learning has been well studied, where the main goal is to find a feature subset that produces higher classification accuracy. The most common types of features used in data mining are nominal, continuous, and discrete values. Nominal features often assume a limited number of values that do not have any relationship of order among them.
Continuous features can assume an infinite number of ordinal values. For many data mining algorithms, it is important that features have both characteristics: a limited number of values and a relation of order among their values, so as to improve the speed and quality of the learning process. These two issues are present in discrete features, whose values preserve the relation of order among them; also, the number of values is often limited. Hence, an important step of the data preprocessing for a learning algorithm is the discretization of continuous features.
Many feature selection algorithms reported in the literature [3], [5], [10] have been shown to work effectively on discrete data or, even more strictly, on binary data. In order to deal with numeric attributes, a common practice for those algorithms is to discretize the data before conducting feature selection. This paper provides a way to select features directly from numeric attributes while discretizing them. Numeric data are very common in real world problems. However, many classification algorithms require that the training data contain only discrete attributes, and some would work better on discretized or binarized data [2], [4]. If those numeric data can be automatically transformed into discrete ones, these classification algorithms would be readily at our disposal. Nano is our effort towards this goal: discretize the numeric attributes as well as select features among them. The problem can be stated as: given data sets with numeric attributes (some of which are irrelevant and/or redundant, and the range of each numeric attribute could be very wide), find an algorithm that can automatically discretize the numeric attributes as well as remove irrelevant/redundant ones. In general, feature selection eliminates some irrelevant and/or redundant attributes. With feature selection, relevant features are extracted, and hence classification algorithms improve their predictive accuracy, shorten the learning period, and form simpler concepts. In fact, the usage of a large number of features remains a critical problem. As the number of features grows, the significance of each feature decreases and the time spent to analyse the data increases, leading to the effects of the "curse of
dimensionality" to occur. In the literature, there are abundant feature selection algorithms. Some use methods like principle component to compose a smaller number of new features [17], [18], [19]; some select a subset of the original attributes [5]. In [4], Kerber introduced a scheme called ChiMerge that discretizes numeric attributes based on the f statistic. ChiMerge consists of an initialization step and a bottom-up merging process, where intervals are continuously merged until a termination condition, which is determined by a significance threshold value a (set manually). It is an improvement from the most obvious simple methods such as equal-width-intervals, which divides the number line between the minimum and maximum values into N intervals of equal size; or equal-frequency-intervals, in which the interval boundaries are chosen so that each interval contains approximately the same number of training examples. Instead of defining a width or frequency threshold (which is not easy until scrutinizing each attribute and knowing what it is), ChiMerge requires a to be specified (ideally one a for each attribute). Nevertheless, too big or too small an a will over- or under-discretize an attribute. An extreme example of under-discretization is the continuous attribute itself. Over-discretization introduces many inconsistencies nonexistent before and, thus, change the characteristics In short, it is not easy to find a proper a for of the data. ChiMerge. It is thereby ideal to let the data determine what value a should take. To overcome this problem, in this paper a new algorithm called Nano is presented. Naturally, if we let the discretization continue as long as no more inconsistencies generated than in the the i ta, each originalndat attributes may isy bediscretized nto maximum, and some attributes discretized into one interval. Hence, these attributes can be removed the discriminating pwer wout affecting the power of the without original data.
In this paper we present a new algorithm that performs supervised data discretization of continuous features and feature selection. Nano uses a measure of inconsistency to determine the final number of intervals and to select features. As the number of intervals gets smaller, the number of inconsistencies increases. Some features may end up with the maximum number of intervals and the minimum number of inconsistencies. Nano aims at the minimal number of intervals with minimal inconsistency.
The remainder of the paper is structured as follows. Section II summarizes related work. Section III presents the design and implementation of the proposed algorithm Nano, which is used for discretization and
feature selection. Section IV discusses the experiments and results achieved. Finally, Section V summarizes the conclusions and the future directions.

II. RELATED WORK

Feature selection is an important technique in the preprocessing of data, and it has been an active area in Statistical Pattern Recognition, Machine Learning, and Data Mining since 1970 [1], [5], [7], [8], [9]. Feature selection has also been applied in fields like Medical Image Diagnosis [11], [12], [13], Image Restoration, and Gene Expression Analysis. Since selection of an optimal subset is always difficult, many feature selection methods have difficulty finding the best subset. One of the most well known feature selection algorithms is Relief [6], which measures the quality of features according to how well their values distinguish instances of different classes. However, Relief can only be applied, in the preprocessing of a classifier, to problems with two distinguished classes (yes or no). Relief also requires an input sample size, which users set before processing the data. In [14], a branch and bound algorithm was proposed for feature selection that requires the evaluation measure to be monotone. However, most evaluation methods cannot guarantee monotonicity. The approximately monotone criterion introduced in [15] makes up for this insufficiency. DTM [16] applies the decision tree C4.5 to feature subset selection. However, DTM takes much time. In [17], a modeling approach for feature selection that uses Gaussian distribution data was proposed. In [18], Mucciardi et al. present seven techniques for choosing subsets of properties and compare their performance on a nine-class vector cardiogram classification problem. The algorithm could rank all features, but could not decide the number of effective features; it requires users to provide the number of features to be selected.
In [5], an algorithm Chi2 was proposed, which is an improved version of ChiMerge for feature discretization and selection. Chi2 uses the χ² statistic to merge consecutive intervals, fusing the consecutive intervals that lead to the smallest χ² value at each step. The feature selection process is done by removing from the set of features the ones that generate only one interval, which are the features that tend to be class-independent. In this paper we propose a new feature selection algorithm that keeps the minimal number of intervals with minimal inconsistency.
III. THE PROPOSED NANO ALGORITHM

Nano is a novel supervised algorithm that performs discretization of the continuous values of the features. The Nano algorithm performs data discretization and feature selection simultaneously. A measure of inconsistency is employed to determine the final number of intervals and to select features. As the number of intervals gets smaller, the number of inconsistencies increases. Nano aims at keeping the minimal number of intervals with minimal inconsistency, establishing a tradeoff between these measures. The following descriptions are necessary before detailing the Nano algorithm.
The Class is the most important keyword of a diagnosis given by a specialist. Cut points are the limits of an interval of real values. An interval is also called a bin, and the most frequent class in a bin is called the majority class of the bin. Nano processes each feature separately. Let D be the set of training input transactions. Let f be a feature of the input feature vector F, and let f_i be the value of the feature f in a transaction i. Nano uses a data structure that links f_i to the class C_i, for all i ∈ D, where C_i is the class of the transaction i. Each line in the data structure is called an instance. An instance I_i belongs to an interval T_r if its value f_i lies between two consecutive cut points h_p and h_{p+1}, that is, f_i ∈ T_r = [h_p, h_{p+1}].

The Nano algorithm performs the following steps. In step 1, Nano first sorts the continuous values. In step 2, Nano defines the initial cut points: a cut point is created whenever the class label of the current instance I_i (with i ≥ 1) is different from the class label of the previous instance, i.e., C_i ≠ C_{i-1}. Let M_r be the majority class of the interval T_r and |M_r| the number of occurrences of M_r in the interval T_r. This condition may generate too many cut points, especially when working with noisy data. The larger the number of cut points, the larger is the number of intervals. Each interval represents an item in the process of mining association rules. The use of many items potentially generates a huge number of irrelevant rules with low confidence. Hence, it is important to keep the number of cut points small and, consequently, generate a small number of items. In this step, Nano produces pure bins, which are the bins of lowest possible entropy (zero). This step produces intervals that minimize the inconsistencies introduced by the discretization process.
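As a rough illustration of steps 1 and 2, the following Python sketch sorts one feature and places a candidate cut point wherever the class label changes between neighbouring instances; placing the boundary at the midpoint between the two neighbouring values is our own choice, and the sample data are hypothetical.

```python
def initial_cut_points(values, classes):
    """Steps 1-2, as we read them: sort instances by feature value and add a
    boundary wherever the class label changes between consecutive instances."""
    instances = sorted(zip(values, classes))   # pair each value with its class, sort by value
    cut_points = []
    for i in range(1, len(instances)):
        prev_val, prev_cls = instances[i - 1]
        cur_val, cur_cls = instances[i]
        if cur_cls != prev_cls:
            # Midpoint placement is an assumption; the method only requires a cut here.
            cut_points.append((prev_val + cur_val) / 2.0)
    return cut_points

values  = [1.0, 1.2, 1.5, 2.0, 2.2, 3.0, 3.5]
classes = ['A', 'A', 'B', 'B', 'A', 'A', 'A']
print(initial_cut_points(values, classes))   # one boundary per class change
```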
In the third step, Nano restricts the minimum frequency that a bin must present, so as to avoid a huge number of cut points. The Nano algorithm takes an input threshold, Minrange, which restricts the minimal number of occurrences of the majority class allowed in an interval. The number of occurrences of the majority class in an interval T_r must be greater than or equal to the Minrange threshold, i.e., |M_r| ≥ Minrange. If the Minrange condition is not satisfied by the interval T_r = [h_p, h_{p+1}], the right cut point h_{p+1} of the interval T_r is removed. Nano produces fewer bins for higher values of the Minrange threshold. However, some caution should be taken before adjusting the Minrange value, as the higher the Minrange, the higher are the inconsistencies generated by the discretization process.
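A minimal sketch of the Minrange filter is given below, under our reading that a discarded cut point simply lets the current interval extend to the next boundary; the helper names and example data are ours.

```python
from collections import Counter

def majority_count(interval_classes):
    # Number of occurrences of the most frequent (majority) class in a bin.
    return Counter(interval_classes).most_common(1)[0][1] if interval_classes else 0

def apply_minrange(instances, cut_points, minrange):
    """Step 3, as we read it: scan intervals left to right and drop the right
    cut point of any interval whose majority-class count is below Minrange."""
    kept = []
    left = float('-inf')
    for cp in sorted(cut_points):
        # Class labels of the instances falling into the current interval (left, cp].
        in_bin = [c for v, c in instances if left < v <= cp]
        if majority_count(in_bin) >= minrange:
            kept.append(cp)
            left = cp     # boundary accepted, start a new interval
        # otherwise cp is discarded and the interval extends to the next cut point
    return kept

# Hypothetical data: feature values already paired with their class labels.
instances = [(1.0, 'A'), (1.2, 'A'), (1.4, 'B'), (2.0, 'B'), (2.5, 'B'), (3.0, 'A')]
print(apply_minrange(instances, cut_points=[1.3, 2.25, 2.75], minrange=2))  # [1.3, 2.25]
```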
In the fourth step, Nano fuses consecutive intervals, using the measure of inconsistency rate to determine which intervals should be merged. Let M_r be the majority class of an interval T_r. Nano fuses consecutive intervals T_r and T_{r+1} that have the same majority class (M_r = M_{r+1}) and that also have inconsistency rates (δ_{T_r}, δ_{T_{r+1}}) below or equal to an input threshold δ_max (0 < δ_max < 1). The inconsistency rate of an interval T_r is given by the relation:
δ_{T_r} = (|T_r| - |M_r|) / |T_r|    (1)
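The inconsistency rate of equation (1) and the merging rule can be sketched as follows; representing each bin simply as a list of class labels is our own simplification, and the example bins are hypothetical.

```python
from collections import Counter

def inconsistency_rate(interval_classes):
    # Equation (1): fraction of instances in the bin outside its majority class.
    if not interval_classes:
        return 0.0
    majority = Counter(interval_classes).most_common(1)[0][1]
    return (len(interval_classes) - majority) / len(interval_classes)

def merge_intervals(intervals, delta_max):
    """Step 4, as we read it: fuse neighbouring bins that share the same majority
    class and whose inconsistency rates are both <= delta_max."""
    merged = [intervals[0]]
    for nxt in intervals[1:]:
        cur = merged[-1]
        same_majority = (Counter(cur).most_common(1)[0][0] ==
                         Counter(nxt).most_common(1)[0][0])
        if same_majority and inconsistency_rate(cur) <= delta_max \
                         and inconsistency_rate(nxt) <= delta_max:
            merged[-1] = cur + nxt       # fuse the two bins
        else:
            merged.append(nxt)
    return merged

bins = [['A', 'A', 'B'], ['A', 'A'], ['B', 'B', 'B']]
print([inconsistency_rate(b) for b in bins])       # [0.333..., 0.0, 0.0]
print(len(merge_intervals(bins, delta_max=0.4)))   # first two bins fuse -> 2 bins remain
```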
In the fifth step, the Nano algorithm calculates the global inconsistency value δ_GInc. Let T be the set of intervals into which a feature is discretized. For each feature, Nano computes the global inconsistency value, which is given by the relation:

δ_GInc = Σ_{T_r ∈ T} (|T_r| - |M_r|) / Σ_{T_r ∈ T} |T_r|    (2)
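Equation (2) can be computed directly from the same bin representation used above; the sketch below assumes each bin is a list of class labels.

```python
from collections import Counter

def global_inconsistency(intervals):
    """Equation (2): total non-majority instances over all bins of a feature,
    divided by the total number of instances of that feature."""
    total = sum(len(b) for b in intervals)
    if total == 0:
        return 0.0
    non_majority = sum(len(b) - Counter(b).most_common(1)[0][1] for b in intervals)
    return non_majority / total

bins = [['A', 'A', 'B'], ['B', 'B'], ['A']]
print(global_inconsistency(bins))   # 1 inconsistent instance out of 6 -> 0.1666...
```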
At the last step, the Nano algorithm calculates the global cut point value δ_GCut. For each feature, Nano computes δ_GCut, which is obtained with the relation:

δ_GCut = Σ_{T_r ∈ T} H_p    (3)
where H_p is the number of cut points of the feature f in an interval T_r. The feature selection criterion employed by Nano removes from the set of attributes every attribute whose global inconsistency value δ_GInc and global cut point value δ_GCut are greater than the input thresholds δ_GIncmax and δ_GCutmax. Since the number of inconsistencies of a feature is the factor that contributes most to disturbing the learning algorithm, discarding the most inconsistent attributes can contribute to improving accuracy and speeding up the learning algorithm. Algorithm-1 summarizes the proposed NANO algorithm. As we show in the section on experiments, NANO is well suited to feature selection and discretization of numeric and ordinal attributes.

Algorithm-1: NANO Algorithm
Input: Image feature vectors F, image classes C, and the Minrange, δ_max, δ_GIncmax, and δ_GCutmax thresholds.
Output: Processed feature vector V.
1. foreach feature f ∈ F do
2.   Sort the values available in f
3.   For each transaction i, create an instance of the form (c_i, f_i), where c_i ∈ C
4.   Create the vector H of cut points h_p: add a cut point if the class label of the current instance I_i (i ≥ 1) is different from the class label of the previous instance, i.e., c_i ≠ c_{i-1}
5. end for
6. foreach h_p ∈ H do
7.   Remove h_p if the number of occurrences of the majority class in the interval T_r does not satisfy the Minrange threshold, i.e., if |M_r| ≥ Minrange does not hold
8.   Remove h_p by fusing consecutive intervals T_r and T_{r+1} that have the same majority class (M_r = M_{r+1}) and also have inconsistency rates (δ_{T_r}, δ_{T_{r+1}}) below or equal to an input threshold δ_max (0 < δ_max < 1)
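To illustrate the final selection criterion, here is a minimal Python sketch under two assumptions: δ_GCut is taken simply as the total number of cut points of the feature, and a feature is removed only when both global values exceed their thresholds. The data structure, feature names, and threshold values are illustrative only.

```python
from collections import Counter

def global_inconsistency(intervals):
    # Equation (2): share of instances outside their bin's majority class.
    total = sum(len(b) for b in intervals)
    non_majority = sum(len(b) - Counter(b).most_common(1)[0][1] for b in intervals)
    return non_majority / total if total else 0.0

def select_features(discretized, ginc_max, gcut_max):
    """Final step, as we read it: keep a feature unless BOTH its global
    inconsistency value and its global cut point value exceed the thresholds.
    `discretized` maps feature name -> (list of bins, list of cut points)."""
    selected = []
    for name, (bins, cuts) in discretized.items():
        g_inc = global_inconsistency(bins)
        g_cut = len(cuts)                 # assumed reading of equation (3)
        if not (g_inc > ginc_max and g_cut > gcut_max):
            selected.append(name)
    return selected

# Hypothetical discretization results for two features.
discretized = {
    'f1': ([['A', 'A'], ['B', 'B', 'B']], [2.5]),                        # clean feature, kept
    'f2': ([['A', 'B'], ['B', 'A'], ['A', 'B'], ['B', 'A']], [1.0, 2.0, 3.0]),  # noisy, removed
}
print(select_features(discretized, ginc_max=0.3, gcut_max=2))   # ['f1']
```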