Feature Weighting for Nearest Neighbor by Estimation of Bayesian Networks Algorithms

I. Inza, P. Larrañaga, B. Sierra
Dept. of Computer Science and Artificial Intelligence, University of the Basque Country
P.O. Box 649, E-20080 San Sebastián, Basque Country, Spain
Telf: (+34) 943015106
Fax: (+34) 943219306
e-mail: [email protected]
Abstract: The accuracy of the nearest-neighbor classifier heavily depends on the weight given to each feature in the distance metric. In this paper a new method to learn accurate feature weights, called FW-EBNA (Feature Weighting by Estimation of Bayesian Network Algorithm) and inspired by the EDA (Estimation of Distribution Algorithm) optimization paradigm, is presented. FW-EBNA is based on an evolutionary, population-based and randomized search algorithm. The evolution of solutions is guaranteed by the factorization, using a Bayesian network, of the probability distribution of the best individuals found in each generation of the search: this learned Bayesian network is sampled to generate the next generation of solutions. The search, guided by a wrapper strategy for the evaluation of each possible solution, is performed in a restricted space of three discrete weights.

Keywords: Feature Weighting, Nearest Neighbor, wrapper, Estimation of Distribution Algorithm, Estimation of Bayesian Network Algorithm, Bayesian Network.
I. Introduction
The k-nearest neighbor (k-NN) classifier has long been used by the Pattern Recognition and Machine Learning communities (Dasarathy [1]) in supervised classification tasks. The basic approach implies storing all training instances and then, when a test instance is presented, retrieving the training instances nearest (least distant) to this test instance and using them to predict its class. Distance is classically defined as follows:

$$ distance(X, Y) = \sqrt{\sum_{i=1}^{n} w_i \, difference(x_i, y_i)^2} $$

where X is a training instance, Y the test instance and w_i is the weight value assigned to feature i. To compute the difference between two values, the overlap metric is used for symbolic features and the absolute difference (after normalization) for numeric ones. Dissimilarities among values of the same feature are computed and added, obtaining a representative value of the dissimilarity (distance) between the compared instances. In the basic nearest neighbor approach, dissimilarities in each dimension are added in a 'naive' manner, weighing dissimilarities in each dimension equally (for all features: w_i = 1). This approach is unrealistic: it allows irrelevant features to influence the distance computation and gives the same influence to features with different degrees of relevance. With this handicapped approach, each time an unimportant feature is added to the feature set with a weight similar to that of an important feature, the quantity of training instances needed to maintain the predictive accuracy grows exponentially (Lowe [2]): this phenomenon is known as the 'curse of dimensionality'. In order to find a 'realistic' weight for each feature of the problem, several approaches have been proposed by the Pattern Recognition and Machine Learning communities, approaches which are included under the 'Feature Weighting for Nearest Neighbor' term.
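To make the weighted metric concrete, the following sketch (our own illustration, not code from the paper; the function name `weighted_distance` is ours) computes the distance between two instances, using the overlap metric for symbolic features and the normalized absolute difference for numeric ones, as described above.

```python
import math

def weighted_distance(x, y, weights, numeric_ranges):
    """Weighted distance between instances x and y.

    x, y           -- sequences of feature values (numbers or symbols)
    weights        -- w_i, one weight per feature
    numeric_ranges -- per-feature (min, max) for numeric features, None for symbolic ones
    """
    total = 0.0
    for xi, yi, wi, rng in zip(x, y, weights, numeric_ranges):
        if rng is None:                      # symbolic feature: overlap metric
            diff = 0.0 if xi == yi else 1.0
        else:                                # numeric feature: normalized absolute difference
            lo, hi = rng
            diff = abs(xi - yi) / (hi - lo) if hi > lo else 0.0
        total += wi * diff ** 2
    return math.sqrt(total)

# Example: three features (numeric, symbolic, numeric) with weights 1.0, 0.5, 0.0
d = weighted_distance((2.0, "red", 7.0), (5.0, "blue", 1.0),
                      weights=(1.0, 0.5, 0.0),
                      numeric_ranges=((0.0, 10.0), None, (0.0, 10.0)))
print(round(d, 3))
```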
In this paper, we present a novel approach to search for a set of discrete feature weights for nearest neighbor classification. We use EBNA (Estimation of Bayesian Network Algorithm), a search engine based on the EDA (Estimation of Distribution Algorithm) paradigm. The EDA approach guarantees the evolution of the population of solutions by factorizing, with a Bayesian network, the probability distribution of the best individuals in each generation of the search, and generating a new generation of solutions from the learned Bayesian network. In this way, as in Genetic Algorithms, an evolutionary, population-based search is performed, while avoiding the difficult tuning of the well known crossover and mutation genetic operators. Our algorithm, called FW-EBNA (Feature Weighting by Estimation of Bayesian Network Algorithm), learns a set of attribute weights based on each attribute's demonstrated classification importance, incorporating a wrapper method to calculate the goodness of each proposed set of weights.

Section 2 introduces the EDA approach, Bayesian networks and the EBNA search engine, finally combining them to solve the Feature Weighting problem in the resulting new FW-EBNA algorithm. Experimental design and a discussion of the results appear in the next two sections. In order to properly locate the new algorithm, section 5 relates various previous approaches to the Feature Weighting problem. The last section summarizes the contribution of the work and foresees lines for future research.

II. Learning weights by Bayesian networks based optimization
In this section, the EDA paradigm and Bayesian networks will be explained. Bearing in mind these two concepts, EBNA, the search engine used in our Feature Weighting algorithm, will be presented. Once EBNA is explained, the new algorithm for Feature Weighting in nearest neighbor, FW-EBNA, can be presented.

A. EDA paradigm
Genetic Algorithms (GAs, Holland [3]) are one of the best known techniques for solving optimization problems. The GA is a randomized, population-based search method. First, a set of individuals (candidate solutions to our optimization problem) is generated (a population), then promising individuals are selected, and finally new individuals which will form the new population are generated using crossover and mutation operators. An interesting adaptation of this is the Estimation of Distribution Algorithm (EDA, Mühlenbein and Paaß [4]). In EDA (see Figure 1), there are no crossover or mutation operators: the new population is sampled from a probability distribution which is estimated from the selected individuals.
EDA
  D_0 ← Generate N individuals (the initial population) randomly.
  Repeat for l = 1, 2, ... until a stop criterion is met:
    D^s_{l-1} ← Select S ≤ N individuals from D_{l-1} according to a selection method.
    p_l(x) = p(x | D^s_{l-1}) ← Estimate the joint probability distribution of an individual being among the selected individuals.
    D_l ← Sample N individuals (the new population) from p_l(x).

Fig. 1. Main scheme of the EDA approach.
In this way a randomized, evolutionary, population-based search can be performed, using probabilistic information to guide the search. Although the EDA approach processes solutions in a different way to GAs, it has been empirically shown that the results of both approaches can be very similar (Pelikan et al. [5]). In this way, both approaches do the same except that EDA replaces the genetic crossover and mutation operators by the following two steps: (1) a probabilistic model of the selected promising solutions is induced; (2) new solutions are generated according to the induced model. In some tasks (known as 'deceptive' problems), GAs fail to identify the different building blocks (partial solutions) of the problem. When this occurs, the probabilistic modeling of the population proposed by the EDA approach has usually demonstrated efficacy in a proper building block identification (Harik [6]). The main problem of EDA resides in how the probability distribution of the selected individuals, p_l(x), is estimated. Obviously, the computation of 2^n probabilities (for a domain with n binary variables) is impractical. This has led to several approximations where the probability distribution is assumed to factorize according to a probability model (see Larrañaga et al. [7] for a review). Several approaches have been proposed assuming that the variables of the problem are independent or assuming some kind of pairwise dependencies between them. Based on the EBNA work of Etxeberria and Larrañaga [8], we propose the use of Bayesian networks, a general framework able to cover multivariate interactions between variables, as the models for representing the probability distribution of a set of candidate solutions in our Feature Weighting problem, applying automatic learning methods to induce the right distribution model in each generation in an efficient way.
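To make the estimate-and-sample loop of Figure 1 concrete, the following sketch assumes, for illustration only, the simplest of the factorizations mentioned above: all variables treated as independent, so the model reduces to the marginal frequency of each bit among the selected individuals (FW-EBNA itself uses a Bayesian network instead). The function name `univariate_eda` and the toy OneMax fitness are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def univariate_eda(fitness, n_vars, pop_size=100, sel_size=50, generations=30):
    """Minimal EDA loop over binary individuals, assuming independent variables.

    fitness -- user-supplied function mapping a 0/1 vector to a value to maximize.
    """
    probs = np.full(n_vars, 0.5)                                   # initial model: uniform marginals
    best_x, best_f = None, -np.inf
    for _ in range(generations):
        pop = (rng.random((pop_size, n_vars)) < probs).astype(int)  # sample the new population
        scores = np.array([fitness(ind) for ind in pop])
        selected = pop[np.argsort(scores)[-sel_size:]]              # select the S best individuals
        probs = selected.mean(axis=0).clip(0.05, 0.95)              # re-estimate the marginal model
        gen_best = scores.argmax()
        if scores[gen_best] > best_f:
            best_f, best_x = scores[gen_best], pop[gen_best]
    return best_x, best_f

# Toy usage: maximize the number of ones (OneMax)
print(univariate_eda(fitness=lambda x: x.sum(), n_vars=20))
```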
B. Bayesian networks
Bayesian networks (BNs) (Pearl [9]) constitute a probabilistic framework for reasoning under uncertainty. From an informal perspective, BNs are directed acyclic graphs (DAGs) where the nodes are random variables and the arcs specify the conditional independence assumptions (Dawid [10]) that must hold between the random variables. BNs are based upon the concept of conditional independence among variables. This concept allows a factorization of the probability distribution of the n-dimensional random variable (Z_1, ..., Z_n) in the following way:

$$ P(z_1, \ldots, z_n) = \prod_{i=1}^{n} P(z_i \mid pa(z_i)) $$

where z_i represents the value of the random variable Z_i, and pa(z_i) represents the value of the random variables which are parents of Z_i.
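As an illustration of this factorization, the sketch below computes the joint probability of a complete assignment from the local conditional distributions. The three-node structure and its probability tables are hypothetical and chosen only for the example; they are not taken from the paper.

```python
# A Bayesian network is described here by, for each node, its parent list and a
# conditional probability table indexed by the tuple of parent values.
# This toy structure (Z1 -> Z2, Z1 -> Z3) is purely illustrative.
network = {
    "Z1": ([], {(): {0: 0.6, 1: 0.4}}),
    "Z2": (["Z1"], {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}),
    "Z3": (["Z1"], {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.5, 1: 0.5}}),
}

def joint_probability(network, assignment):
    """P(z1, ..., zn) = product over i of P(z_i | pa(z_i))."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpt[parent_values][assignment[node]]
    return prob

# P(Z1=1, Z2=0, Z3=1) = P(Z1=1) * P(Z2=0 | Z1=1) * P(Z3=1 | Z1=1) = 0.4 * 0.2 * 0.5
print(joint_probability(network, {"Z1": 1, "Z2": 0, "Z3": 1}))  # 0.04
```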
Thus, in order to specify the probability distribution of a BN, one must give prior probabilities for all root nodes (nodes with no predecessors) and conditional probabilities for all other nodes, given all possible combinations of their direct predecessors. These numbers, in conjunction with the DAG, specify the BN completely. Once the network is constructed, it constitutes an efficient device to perform probabilistic inference. Nevertheless, the problem of building such a network remains. The structure and conditional probabilities necessary for characterizing the network can either be provided externally by experts or obtained, as in this paper, from an algorithm which automatically induces them. We will perform an automatic 'score + search' procedure: carrying out a search in the space of possible Bayesian network structures and measuring the goodness of each proposed structure. Therefore the 'score' and 'search' procedures must be reasonable in computational cost: we need to find the Bayesian network structure as fast as possible, so a simple algorithm which returns a good structure is preferred, even if it is not optimal with respect to the used score. In our work, Algorithm B (Buntine [11]) is used for learning Bayesian networks from data. Algorithm B is a greedy search heuristic. The algorithm starts with an arc-less structure and, at each step, it adds the arc with the maximum increase in the defined score: in our case the BIC metric (Schwarz [12]). The algorithm stops when adding an arc does not increase the utilized measure.
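The following is a minimal sketch of such a greedy arc-addition scheme. The `score` callable (for instance, a BIC computation over the selected individuals) is a user-supplied assumption, and the function names are ours; this is not the paper's implementation of Algorithm B.

```python
from itertools import permutations

def creates_cycle(arcs, new_arc):
    """True if adding new_arc = (u, v) to the arc set would create a directed cycle."""
    u, v = new_arc
    # A cycle appears iff u is already reachable from v; do a depth-first search from v.
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(b for a, b in arcs if a == node)
    return False

def greedy_structure_search(nodes, score):
    """Greedy arc addition in the spirit of Algorithm B.

    Starts from an arc-less structure and repeatedly adds the arc that most increases
    score(arcs); stops when no addition improves the score.  `score` is a user-supplied
    function (e.g., the BIC of the structure given the selected individuals).
    """
    arcs = set()
    current = score(arcs)
    while True:
        candidates = [(a, b) for a, b in permutations(nodes, 2)
                      if (a, b) not in arcs and not creates_cycle(arcs, (a, b))]
        scored = [(score(arcs | {arc}), arc) for arc in candidates]
        if not scored:
            break
        best_score, best_arc = max(scored)
        if best_score <= current:            # no arc increases the score: stop
            break
        arcs.add(best_arc)
        current = best_score
    return arcs
```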
C. Estimation of Bayesian Network Algorithm: EBNA

The general procedure of EBNA appears in Figure 2. The initial Bayesian network, BN_0, assigns the same probability to all possible individuals. N is the number of individuals in the population. S is the number of individuals selected from the population. Although S can be any value, we take into consideration the suggestion that appears in Etxeberria and Larrañaga [8], being S = N/2. If S is close to N then the populations will not evolve very much from generation to generation. On the other hand, a low S value will lead to low diversity, resulting in early convergence (Etxeberria and Larrañaga [8]). For individual selection, range-based selection is proposed, i.e., selecting the best N/2 individuals from the N individuals of the population. However, any selection method could be used. The Probabilistic Logic Sampling algorithm (PLS, Henrion [13]) is used to sample new individuals from the Bayesian network. Finally, the way in which the new population is created must be pointed out. In the given procedure all individuals from the previous population are discarded and the new population is composed of all the newly created individuals. This has the problem of losing the best individuals that have been previously generated; therefore, the following minor change has been made: instead of discarding all the individuals, we maintain the best individual of the previous generation and create N - 1 new individuals.

EBNA
  BN_0 ← Initialize a Bayesian network which assigns the same probability to all individuals.
  D_0 ← Sample N individuals from BN_0.
  For l = 1, 2, ... until a stop criterion is met:
    D^s_{l-1} ← Select S individuals from D_{l-1}.
    BN_l ← Learn a Bayesian network from D^s_{l-1} using the B search algorithm and the BIC metric.
    D_l ← Sample N individuals from BN_l by means of PLS.

Fig. 2. EBNA basic scheme.
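For completeness, a sketch of forward (ancestral) sampling in the spirit of PLS is given below: each node is instantiated after its parents, drawing its value from the corresponding conditional distribution. It reuses the hypothetical dictionary representation of the earlier Bayesian network snippet, assumes the structure is acyclic, and is not the paper's code.

```python
import random

def topological_order(network):
    """Order nodes so that every node appears after its parents (assumes a DAG)."""
    order, placed = [], set()
    while len(order) < len(network):
        for node, (parents, _) in network.items():
            if node not in placed and all(p in placed for p in parents):
                order.append(node)
                placed.add(node)
    return order

def pls_sample(network):
    """Draw one individual by sampling each node given its already-sampled parents."""
    sample = {}
    for node in topological_order(network):
        parents, cpt = network[node]
        dist = cpt[tuple(sample[p] for p in parents)]
        values, probs = zip(*dist.items())
        sample[node] = random.choices(values, weights=probs, k=1)[0]
    return sample

# With the toy network from the previous snippet, each call returns one new individual,
# e.g. pls_sample(network) -> {"Z1": 0, "Z2": 1, "Z3": 0}
```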
An elitist approach has been used to form the iterative populations. Instead of directly discarding the N - 1 individuals from the previous generation and replacing them with N - 1 newly generated ones, the 2N - 2 individuals are put together and the best N - 1 among them are taken. These best N - 1 individuals form the new population together with the best individual of the previous generation. In this way, the populations converge faster to the better individuals found; however, this also implies a risk of losing diversity within the population.
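A short sketch of this elitist population-formation step, under our own naming and with individuals represented as (weight vector, LOOCE) pairs, where a lower LOOCE is better:

```python
def next_population(previous, newly_sampled):
    """Elitist population formation.

    previous      -- list of (weights, looce) pairs of the current population (size N)
    newly_sampled -- list of (weights, looce) pairs sampled from the Bayesian network (size N-1)
    Returns the next population of size N: the best individual of the previous generation
    plus the best N-1 individuals of the remaining 2N-2.
    """
    by_error = sorted(previous, key=lambda ind: ind[1])   # lower LOOCE is better
    elite, rest = by_error[0], by_error[1:]
    pool = sorted(rest + newly_sampled, key=lambda ind: ind[1])
    return [elite] + pool[:len(previous) - 1]
```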
D. FW-EBNA: implementation details

Once the Feature Weighting problem for the nearest neighbor classifier and the EBNA algorithm have been presented, we will use the search engine provided by EBNA to solve the Feature Weighting problem, resulting in the new FW-EBNA algorithm. In order to specify the nature of our search space, rather than considering a continuous weight space, we restrict the space of possible weights to a set of discrete weights. In this way, we follow the findings of Kohavi et al. [14] to determine the number of possible discrete weights. Applying a wrapper approach (John et al. [15]), they found that considering only a small set of weights gave better results than using a larger set: increasing the number of possible weights largely increased the variance of their Feature Weighting algorithm, which induced a deterioration in the overall performance. However, when the wrapper approach was applied, they concluded that the overall utility of increasing the number of possible weights above two or three was negative. As we will also perform the search using a wrapper evaluation schema in a discrete space of weights, we restrict our search space to three possible weights for each feature, i.e. {0, 0.5, 1}. Thus, d being the number of features in a domain, the cardinality of the search space will be 3^d. To select the number of neighbors (k) of the k-NN classifier, we fix the number of neighbors to one, since our objective was to research the Feature Weighting rather than the number of neighbors. We use a wrapper schema to assess the evaluation function of each proposed solution, calculating the leave-one-out error (LOOCE)^1 of the 1-NN classifier applied over the proposed set of weights. An individual^2 in the search space being a possible set of our restricted feature weights, a common notation will be used to represent each individual: for a full d-feature problem, there are d bits in each proposed solution, each bit indicating whether a feature has a 0.0, 0.5 or 1.0 weight.

^1 The leave-one-out error is the most widely used evaluation function measure in Feature Weighting searches.
^2 Within a search procedure we will use the terms 'individual' and 'solution' indistinctly.
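A sketch of this wrapper evaluation function could look as follows; the name `looce` is ours, and for simplicity the sketch assumes purely numeric, already normalized features (the paper also handles symbolic features through the overlap metric):

```python
import numpy as np

def looce(X, y, weights):
    """Leave-one-out error of 1-NN under a given feature weight vector.

    X       -- (n_instances, n_features) array of (already normalized) numeric features
    y       -- class labels
    weights -- vector of weights in {0.0, 0.5, 1.0}, one per feature
    """
    X, weights = np.asarray(X, dtype=float), np.asarray(weights, dtype=float)
    errors = 0
    for i in range(len(X)):
        diff = X - X[i]
        dist = np.sqrt((weights * diff ** 2).sum(axis=1))   # weighted distance to every instance
        dist[i] = np.inf                                     # leave the instance itself out
        errors += int(y[int(dist.argmin())] != y[i])
    return errors / len(X)

# Each weight set proposed by the search is scored by its LOOCE, e.g.
# looce(X_train, y_train, weights=[1.0, 0.0, 0.5, ...])
```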
Fig. 3. FW-EBNA method: (1) current population of N weight sets over features X_1, ..., X_d with their leave-one-out errors (LOOCE); (2) selection of the N/2 best individuals; (3)-(4) induction of a Bayesian network with d nodes from the selected individuals; (5)-(6) sampling of N-1 new individuals from the Bayesian network and calculation of their evaluation function (LOOCE) values; (7) the individuals from the previous generation and the newly generated ones are put together and the best N-1 of them form the next population.
In each generation of the search, the induced Bayesian network will factorize the probability distribution of the selected solutions. The Bayesian network will be formed by d nodes, each one representing one feature of the domain. Each node has three possible values or states, representing the mentioned weights. Bearing in mind the general EBNA procedure, Figure 3 summarizes the FW-EBNA method. Following the basic EDA scheme, the initial population of weights is randomly created.

To adopt a stopping criterion we follow the findings of Ng [16] and Kohavi and Sommerfield [17]. Ng [16], in a work on the 'overfitting' phenomenon, demonstrates that when cross-validation is used to select from a large pool of different classification models in a noisy task with a too small training set, it may not be advisable to pick the model with minimum cross-validation error, and a model with higher cross-validation error may have better generalization power over novel test instances. Kohavi and Sommerfield [17] display the effect of 'overfitting' in a Feature Subset Selection problem using a wrapper cross-validated approach when the number of instances is small. As we usually work with small (fewer than 1,000 training instances) and noisy training sets, we decide to stop the search when, in a sampled new generation, no feature weight set appears whose LOOCE improves, at least with a p-value smaller than 0.1 (using a cross-validated paired t test [18]), on the LOOCE of the best^3 feature weight set of the previous generation. Thus, the best feature weight set of the previous generation is returned as FW-EBNA's solution. A detailed study of this stopping criterion within a wrapper strategy is carried out in Inza et al. [19]. Adopting this stopping criterion, our aim is to avoid the 'overfitting' risk of the wrapper process, only allowing the continuation of the search when a significant improvement in the accuracy estimation of the best solutions of consecutive generations appears. We hypothesize that when this significant improvement appears, the 'overfitting' risk decays and there is a basis for further generalization accuracy improvement over unseen instances. When this improvement is absent, we hypothesize that FW-EBNA is getting stuck in an area of the search space without statistically significantly better solutions (compared to those already found); therefore, it is better to stop the search to avoid the risk of 'overfitting'. This stopping criterion takes into account two critical characteristics of our evaluation function:
- the intrinsic uncertainty of the cross-validation estimate (Kohavi [20]), reflected in the variance of the cross-validation estimate;
- the high computational cost of the wrapper approach.
In this way, we consider that the further computational cost of simulation and evaluation of a new generation of solutions is justified when a significant improvement in the evaluation function between the best solutions of adjacent generations appears.

^3 We name as 'best' the set with the lowest LOOCE.
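The paper does not detail exactly how the cross-validated paired t test is applied to the LOOCE values; one plausible reading, sketched below under that assumption and with our own naming, compares the per-instance leave-one-out 0/1 losses of the best weight sets of two consecutive generations and allows the search to continue only when the one-sided p-value falls below 0.1.

```python
import numpy as np
from scipy import stats

def significant_improvement(losses_new_best, losses_prev_best, alpha=0.1):
    """One plausible reading of the stopping rule: continue the search only if the
    per-instance leave-one-out 0/1 losses of the new generation's best weight set are
    significantly lower than those of the previous generation's best (p < alpha).
    """
    diff = np.asarray(losses_prev_best, float) - np.asarray(losses_new_best, float)
    if diff.std(ddof=1) == 0.0:
        return bool(diff.mean() > 0)          # degenerate case: constant difference
    t, p_two_sided = stats.ttest_rel(losses_prev_best, losses_new_best)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return p_one_sided < alpha

# The search stops (and the previous best weight set is returned) when
# significant_improvement(...) is False for the newly sampled generation.
```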
We must fix two parameters in FW-EBNA: the population size (N) and the selection set size (S). As explained in the former section, S = N/2 is used, and N is fixed to 1,000, which we consider a large enough population size to estimate a reliable Bayesian network.

III. Experimental design
We have tested the power of FW-EBNA on four artificial and four real domains. All datasets, except 3-Weights, can be obtained from the UCI Repository (Murphy [21]). Table 1 summarizes the characteristics of these domains. LED24 is a well known artificial dataset with 7 equally relevant and 17 irrelevant binary features. In the Waveform-21 task all the features have different degrees of relevance (see Breiman et al. [22] for more details). Parity7+4 has 7 irrelevant features, while the parity of the sum of the other 4 equally relevant features determines the output. We have designed the 3-Weights domain, which includes 12 continuous features in the range [3, 6]; its target concept is to define whether the instance is nearer (using the Euclidean metric in all dimensions and summing them) to the (0, 0, ..., 0) or the (9, 9, ..., 9) prototype; in the distance computation a 1.0 weight is assigned to 4 features, a 0.5 weight to another 4 and a 0.0 weight to the 4 remaining ones. The 3-Weights dataset is inspired by the domain proposed by Kohavi et al. [14].
Table 1. Details of the experimental domains. C = continuous, N = nominal.

Domain              Num. of instances   Num. of classes   Num. of features
(1) LED24           600                 10                24 (24-N)
(2) Waveform-21     600                 3                 21 (21-C)
(3) Parity7+4       600                 2                 11 (11-N)
(4) 3-Weights       600                 2                 12 (12-C)
(5) Glass           214                 7                 9 (9-C)
(6) Vote             435                 2                 16 (16-N)
(7) CRX             690                 2                 15 (6-C, 9-N)
(8) Vehicle         846                 4                 18 (18-C)
These domains are selected to evaluate FW-EBNA's ability to tolerate different types of problematic features. In these datasets, which have irrelevant features and/or features with different degrees of relevance, we would expect FW-EBNA to outperform 1-NN with homogeneous weights, identifying the appropriate feature weights. The four proposed real datasets are well known in the Machine Learning literature. In spite of not knowing the target concept and the true relevance of their features, it is essential to test FW-EBNA on real datasets to ensure its capability. In these datasets, we would expect FW-EBNA to behave at least as well as 1-NN with homogeneous weights. Due to the randomized nature of FW-EBNA (two executions need not give the same result), instead of a classic single-execution training/testing holdout scheme, five replications of two-fold cross-validation (5x2cv) are applied to estimate the predictive accuracy of FW-EBNA in each domain. Each time a 2cv is performed, FW-EBNA is run independently in each of the two folds and has no access to the other fold: the accuracy of the weight set selected in each FW-EBNA run is measured on the fold that was not accessed. In this way, reported accuracies are the mean of the ten accuracies from the 5x2cv schema.
Table 2. A comparison of accuracy percentages of 1-NN with and without FW-EBNA. Reported accuracies are the mean of the ten accuracies included in the 5x2cv process; the standard deviation of this mean is also reported.

Domain               no-FW          FW-EBNA         p-value
(1) LED24            47.37 ± 3.36   69.03 ± 1.54    0.00
(2) Waveform-21      76.20 ± 1.24   76.87 ± 1.04    0.85
(3) Parity7+4        51.05 ± 1.88   100.00 ± 0.00   0.00
(4) 3-Weights        77.19 ± 3.36   85.99 ± 2.57    0.00
(5) Glass            64.58 ± 2.15   71.12 ± 4.01    0.05
(6) Vote             92.65 ± 1.31   94.44 ± 1.17    0.25
(7) CRX              81.56 ± 1.92   83.74 ± 1.94    0.10
(8) Vehicle          67.33 ± 2.11   69.43 ± 2.11    0.30
Average-artificial   62.95          82.97
Average-real         76.53          79.68
IV. Experimental results
Table 2 displays the results on the presented datasets. Experiments have been carried out on an SGI-Origin 200 computer. The notation 'no-FW' indicates the execution of the 1-NN algorithm with homogeneous weights for all features. Once the 5 iterations of 2-fold cross-validation (5x2cv) have been executed, a 5x2cv F test (Alpaydin [23]) was applied to determine whether the accuracy differences between the FW-EBNA approach and no-FW are significant or not. The 5x2cv F test is a variation of the well known 5x2cv paired t test (Dietterich [18]). The p-value of the test is reported, which is the probability of observing a value of the test statistic that is at least as contradictory to the null hypothesis (FW-EBNA and no-FW have the same accuracy), and supportive of the alternative hypothesis (FW-EBNA outperforms no-FW), as the one computed from the sample data (Mendenhall and Sincich [24]).
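The p-values in Table 2 come from Alpaydin's combined 5x2cv F test; the sketch below implements the standard formulation of that statistic (approximately F-distributed with 10 and 5 degrees of freedom under the null hypothesis) and may differ in minor details from the authors' own implementation.

```python
import numpy as np
from scipy import stats

def five_by_two_cv_f_test(p):
    """Combined 5x2cv F test (Alpaydin).

    p -- array of shape (5, 2); p[i, j] is the difference in error rates between the
         two classifiers on fold j of replication i of 2-fold cross-validation.
    Returns the F statistic and its p-value (F distribution with 10 and 5 d.o.f.).
    """
    p = np.asarray(p, dtype=float)
    p_bar = p.mean(axis=1, keepdims=True)          # mean difference per replication
    s2 = ((p - p_bar) ** 2).sum(axis=1)             # variance estimate per replication
    f = (p ** 2).sum() / (2.0 * s2.sum())
    return f, stats.f.sf(f, 10, 5)

# Usage: pass the five replications of fold-wise error differences, e.g.
# f, p_value = five_by_two_cv_f_test([[d11, d12], [d21, d22], [d31, d32], [d41, d42], [d51, d52]])
```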
In the LED24 domain FW-EBNA has been able to detect the major part of the irrelevant features of the domain and the homogeneous relevance of the relevant ones. In all the runs more than 14 of the 17 irrelevant features have been detected by FW-EBNA, assigning them a 0.0 weight. The homogeneous relevance of the first 7 features is usually detected by FW-EBNA: in 8 runs a 1.0 weight is learned for all relevant features and in 2 runs a 0.5 weight is learned for just one feature (the remaining relevant features having a weight of 1.0).

We think that the bias of FW-EBNA is not able to learn the special characteristics of the Waveform-21 weights. In spite of the absence of irrelevant attributes, all features have different relevance degrees. Although the weights learned by FW-EBNA are 'near' to the real weights, FW-EBNA is only able to 'approximate' the variety of real weights with its 3 levels. In this way, the achieved improvement over the homogeneous weight setting is insignificant. All FW-EBNA runs are able to learn the correct weights in Parity7+4: 7 irrelevant features with weight 0.0 and 4 interacting relevant ones with the same 1.0 weight.

FW-EBNA usually learns the optimal weights of the 3-Weights domain: taking into account the ten runs of the 5x2cv schema, FW-EBNA always assigns the target weight (1.0) to the highly relevant features, in 75% of the occasions the target weight (0.5) to the medium-relevance features, and in 95% of the occasions the zero weight (0.0) to the irrelevant ones.

FW-EBNA reports accuracy improvements over the no-FW approach in all real domains. The average accuracy increases from 76.53% to 79.68%, which indicates a 13.42% relative error reduction.
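The 13.42% figure follows directly from the average error rates in Table 2:

$$ \frac{(100 - 76.53) - (100 - 79.68)}{100 - 76.53} = \frac{23.47 - 20.32}{23.47} \approx 0.1342 $$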
Compared to the artificial domains, the attainment of statistically significant differences in real domains with few instances is a hard question; the p-values of the presented 5x2cv F test can be used to judge the degree of the performance improvement. To understand these accuracy differences between artificial and real domains, Kohavi and John [25], in a Feature Subset Selection approach, argue that real datasets are already preprocessed to include only relevant features, while the artificial ones include irrelevant features on purpose.

In FW-EBNA, the search algorithm that we propose is independent of the evaluation function used to guide the search. However, FW-EBNA's reliance on the wrapper approach (the LOOCE measure that guides the search) for the evaluation function makes it slow: in the proposed 5x2cv approach, the mean CPU times for one LOOCE estimation have varied from 3.3 seconds for the Glass domain to 60.4 seconds for the Vehicle domain. On the other hand, the times to induce the Bayesian networks in each generation have been very competitive: on average, 2.2 seconds are needed in the Glass domain (the domain with the fewest features) and 5.5 seconds in LED24 (the domain with the most features). Apart from these two time indicators, FW-EBNA's overall running time also depends on the generation at which the search stops: on average, with the exposed stopping criterion, the stop generation varied from 3.5 generations in the Vehicle domain to 6.5 generations in 3-Weights. Thus, before executing FW-EBNA with the present settings, we must take the amount of available computational resources into account.

V. Related work
Since the Pattern Recognition and Machine Learning communities have frequently used the nearest neighbor classifier, they have also proposed many variants to address the Feature Weighting problem. A good review of these efforts was done by Wettschereck et al. [26], organizing Feature Weighting algorithms along five different dimensions. In this section, in order to properly locate the new FW-EBNA approach, assuming that its principal contribution is the search mechanism, we organize the principal Feature Weighting algorithms with respect to the search strategy they use to optimize the set of weights, as well as pointing out the principal relationships between FW-EBNA and these known algorithms.

Several well-known Feature Weighting methods learn the set of weights by means of a hill-climbing, incremental and on-line strategy, making only one pass through the training data. Applying the nearest neighbor's distance function, they iteratively adjust the feature weights after one or several classifications (on the training set) are made. The weight adjustment is computed in consideration of whether the classification answer was correct or incorrect. Considering each training instance once, the weight adjustment has the purpose of decreasing the distance among instances of the same class and increasing the distance among instances of different classes. These types of algorithms can be seen in Salzberg [27] (in a Nearest Hyperrectangle framework), Aha [28] and Kira and Rendell [29].

Lowe [2] and Scherf and Brauer [30] have proposed another local search mechanism, gradient descent optimization, to optimize a set of continuous weights. Lowe applies gradient descent over the distance similarity metric to optimize feature weights so as to minimize the LOOCE on the training set. Close to the basic idea of FW-EBNA's stopping criterion, Lowe also tries to prevent large weight changes in the optimization process which are not statistically reliable on datasets with few examples. In the other work, Scherf and Brauer apply gradient descent to optimize feature weights so as to minimize a function which reinforces the distance similarities between all training instances of the same class while deteriorating the similarities between instances of different classes. Instead of being incremental on-line optimizers, these two approaches, like FW-EBNA, repeatedly pass through the training set: each time a new weight set is found, they access the training set to measure the value of the function to be optimized.

We can qualify hill-climbing and gradient descent optimization as 'local' search engines in the sense that they cannot escape from local optima. With the capability of escaping from local optima, Kohavi et al. [14] propose a best-first search. Using the wrapper approach, each time a weight set is found, the training set is accessed to estimate the accuracy of the proposed set by ten-fold cross-validation. Rather than considering a continuous weight space, they restrict (as FW-EBNA) the possible weights to a small, finite set.

All the algorithms reviewed until now are deterministic in the sense that all runs over the same data always give the same result. To date, Genetic Algorithms (GAs) and random sampling, two popular non-deterministic search strategies, have been the only non-deterministic search engines applied to the Feature Weighting problem. In the non-deterministic approach, randomness is used to avoid getting stuck in local optima: this implies that one should not expect the same weight set solution from different runs. The search strategy performed by FW-EBNA can also be located within this non-deterministic category. GAs were used by Kelly and Davis [31], Punch et al. [32] and Wilson and Martinez [33], performing the search in a space of continuous weights. These three works use the wrapper approach to guide the GA, with access to the training set each time a new weight set is found; however, they differ slightly in the way they apply it. Kelly and Davis measure the five-fold cross-validation accuracy on the training data; Punch et al., in a 5-NN approach, apart from the LOOCE on the training data, also propose a mixed fitness function which combines the LOOCE with the number of neighbors that were not used in the final classification of each training instance; Wilson and Martinez just measure the LOOCE on the training data. Skalak [34] uses Monte Carlo sampling to simultaneously select features (only two discrete weights are used: 0.0 and 1.0) and prototypes for nearest neighbor. He also utilizes 'random mutation hill climbing' (Papadimitriou and Steiglitz [35]), a local search method with a stochastic component: one position of the solution is changed at random until an improvement is achieved, bounding the maximum number of iterations. The wrapper approach is used, performing a hold-out estimate to measure the performance of the feature and prototype sets on the training set. The feature selection algorithm for nearest neighbor proposed by Aha and Bankert [36], in a domain with 204 features, also has a random sampling component: they randomly sample a specific part of the feature space for a fixed number of iterations and then begin a beam search with the best feature subset found during those iterations (as judged by the ten-fold cross-validation wrapper approach).

All the algorithms we have reviewed state the Feature Weighting task as a search problem. They are grouped under the wrapper term because they use feedback from the nearest neighbor classifier itself during training to learn the weights: in order to measure the value of the wrapper function to be optimized, on-line weighting algorithms access the training set only once, while the rest of the presented algorithms access the training set each time a new weight set is found.

Other well-known approximations do not state the Feature Weighting task as a search problem, learning the feature weights from the intrinsic characteristics of the data. These approaches do not make use of the nearest neighbor classifier itself to learn the weights, and thus they can be grouped under the filter term (John et al. [15]). To learn a set of weights, classic approximations make use of conditional probabilities (Creecy et al. [37]), class projections (Stanfill and Waltz [38]; Howe and Cardie [39]), mutual information (Wettschereck and Dietterich [40] in a nearest hyperrectangle approach) or information gain (van den Bosch and Daelemans [41]; Cardie and Howe [42] first build a decision tree to select features and then weight each feature according to its information gain).

VI. Summary and future work
Once the Feature Weighting problem for the nearest neighbor classifier has been stated as a search problem, GAs, due to their attractive randomized and population-based nature, have long been applied to solve it. This paper presents FW-EBNA, a new Feature Weighting algorithm with a new search engine (called EBNA) which shares these interesting characteristics of GAs, but also presents the interesting property of avoiding the use of crossover and mutation operators. In this way, the evolution of the population of weights is carried out by factorizing, by means of a Bayesian network, the probability distribution of the best solutions in each generation of the search. The search, performed in a restricted space of 3 possible discrete weights for each feature, is directed by a wrapper approach which considers feedback from the nearest neighbor classifier itself.

FW-EBNA has demonstrated its capacity to identify the true degree of relevance of different kinds of features in a variety of artificial domains. At the same time, interesting accuracy improvements with respect to the homogeneous weighting approach are achieved in a set of real tasks where the true weights of the features are not known. Continuing the work within the EDA approach for the Feature Weighting problem, an interesting direction to be explored is the extension of the EBNA algorithm to search in a space of continuous weights: we think that this approach could bring an improvement in domains such as Waveform-21, where all the features have different degrees of relevance. In order to deal with domains where the dimensionality is very high (as in Aha and Bankert [36] [43], datasets with 204 and 98 features respectively), another research direction is the use of simpler probability models (see Inza et al. [44]), models which assume fewer or no dependencies among the features of the problem.

References
[1] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, CA, 1991.
[2] D. Lowe, Similarity metric learning for a variable-kernel classifier, Neural Computation 7 (1995) 72-85.
[3] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.
[4] H. Mühlenbein, G. Paaß, From recombination of genes to the estimation of distributions I. Binary parameters, Lecture Notes in Computer Science 1411: Parallel Problem Solving from Nature - PPSN IV, pp. 178-187, Springer, Berlin, Germany, 1996.
[5] M. Pelikan, D.E. Goldberg, F. Lobo, A Survey of Optimization by Building and Using Probabilistic Models, IlliGAL Report no. 99018, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, IL, 1999.
[6] G. Harik, Linkage Learning via Probabilistic Modeling in the ECGA, IlliGAL Report no. 99010, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, Urbana, IL, 1999.
[7] P. Larrañaga, R. Etxeberria, J.A. Lozano, B. Sierra, I. Inza, J.M. Peña, A review of the cooperation between evolutionary computation and probabilistic graphical models, Proceedings of the II Symposium on Artificial Intelligence, CIMAF'99, La Habana, Cuba, 1999, pp. 314-324.
[8] R. Etxeberria, P. Larrañaga, Global Optimization with Bayesian networks, Proceedings of the II Symposium on Artificial Intelligence, CIMAF'99, La Habana, Cuba, 1999, pp. 332-339.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, Palo Alto, CA, 1988.
[10] A.P. Dawid, Conditional independence in statistical theory, Journal of the Royal Statistical Society, Series B 41 (1979) 1-31.
[11] W. Buntine, Theory refinement in Bayesian networks, Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, Los Angeles, USA, 1991, pp. 52-60.
[12] G. Schwarz, Estimating the dimension of a model, Annals of Statistics 7 (1978) 461-464.
[13] M. Henrion, Propagating uncertainty in Bayesian networks by probabilistic logic sampling, Uncertainty in Artificial Intelligence 2, pp. 149-163, Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1988.
[14] R. Kohavi, P. Langley, Y. Yun, The Utility of Feature Weighting in Nearest-Neighbor Algorithms, European Conference on Machine Learning, ECML'97, Prague, Czech Republic, 1997, poster.
[15] G. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, USA, 1994, pp. 121-129.
[16] A.Y. Ng, Preventing 'Overfitting' of Cross-Validation Data, Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, USA, 1997, pp. 245-253.
[17] R. Kohavi, D. Sommerfield, Feature Subset Selection using the wrapper model: overfitting and dynamic search space topology, Proceedings of the First International Conference on Knowledge Discovery and Data Mining, KDD'95, Montreal, Canada, 1995, pp. 192-197.
[18] T.G. Dietterich, Approximate Statistical Tests for Comparing Supervised Learning Algorithms, Neural Computation 10 (7) (1998) 1895-1924.
[19] I. Inza, P. Larrañaga, R. Etxeberria, B. Sierra, Feature Subset Selection by Bayesian Networks based Optimization, Technical Report no. EHU-KZAA-IK-2/99, University of the Basque Country, Spain, 1999.
[20] R. Kohavi, Feature subset selection as search with probabilistic estimates, Proceedings of the AAAI Fall Symposium on Relevance, New Orleans, USA, 1994, pp. 122-126.
[21] P. Murphy, UCI Repository of machine learning databases, University of California, Department of Information and Computer Science, Irvine, CA, 1995.
[22] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[23] E. Alpaydin, Combined 5x2cv F test for comparing supervised classification learning algorithms, Neural Computation, accepted for publication.
[24] W. Mendenhall, T. Sincich, Statistics for Engineering and The Sciences, Prentice Hall International, Englewood Cliffs, NJ, 1995.
[25] R. Kohavi, G. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1997) 273-324.
[26] D. Wettschereck, D.W. Aha, T. Mohri, A Review and Empirical Evaluation of Feature Weighting Methods for a Class of Lazy Learning Algorithms, Artificial Intelligence Review 11 (1997) 273-314.
[27] S.L. Salzberg, A nearest hyperrectangle learning method, Machine Learning 6 (1991) 251-276.
[28] D.W. Aha, Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms, International Journal of Man-Machine Studies 36 (1992) 267-287.
[29] K. Kira, L.A. Rendell, A practical approach to feature selection, Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, Scotland, 1992, pp. 249-256.
[30] M. Scherf, W. Brauer, Feature Selection by Means of a Feature Weighting Approach, Technical Report no. FKI-221-97, Forschungsberichte Künstliche Intelligenz, Institut für Informatik, Technische Universität München, Germany, 1997.
[31] J.D. Kelly, L. Davis, A Hybrid Genetic Algorithm for Classification, Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, Sydney, Australia, 1991, pp. 645-650.
[32] W.F. Punch, E.D. Goodman, M. Pei, L. Chia-Shun, P. Hovland, R. Enbody, Further Research on Feature Selection and Classification Using Genetic Algorithms, Proceedings of the International Conference on Genetic Algorithms, ICGA'93, 1993, pp. 557-564.
[33] R. Wilson, T.R. Martinez, Instance-Based Learning with Genetically Derived Attribute Weights, Proceedings of the International Conference on Artificial Intelligence, Expert Systems and Neural Networks, AIE'96, 1996, pp. 11-14.
[34] D. Skalak, Prototype and feature selection by sampling and random mutation hill climbing algorithms, Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, USA, 1994, pp. 293-301.
[35] C.H. Papadimitriou, K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[36] D.W. Aha, R.L. Bankert, Feature selection for case-based classification of cloud types: An empirical comparison, Working Notes of the AAAI-94 Workshop on Case-Based Reasoning, Seattle, USA, 1994, pp. 106-112.
[37] R.H. Creecy, B.M. Masand, S.J. Smith, D.L. Waltz, Trading MIPS and memory for knowledge engineering, Communications of the ACM 35 (1992) 48-64.
[38] C. Stanfill, D. Waltz, Toward memory-based reasoning, Communications of the ACM 29 (1986) 1213-1228.
[39] N. Howe, C. Cardie, Examining Locally Varying Weights for Nearest Neighbor Algorithms, Lecture Notes in Artificial Intelligence: Case-Based Reasoning Research and Development: Second International Conference on Case-Based Reasoning, pp. 455-466, Springer, Berlin, Germany, 1997.
[40] D. Wettschereck, T.G. Dietterich, An Experimental Comparison of the Nearest-Neighbor and Nearest-Hyperrectangle Algorithms, Machine Learning 19 (1995) 1-25.
[41] A. van den Bosch, W. Daelemans, Data-oriented methods for grapheme-to-phoneme conversion, Technical Report no. 42, Tilburg University, Institute for Language Technology and Artificial Intelligence, Tilburg, The Netherlands, 1993.
[42] C. Cardie, N. Howe, Improving Minority Class Prediction Using Case-Specific Feature Weights, Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, USA, 1997, pp. 57-65.
[43] D.W. Aha, R.L. Bankert, A comparative evaluation of sequential feature selection algorithms, Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, USA, 1995, pp. 1-7.
[44] I. Inza, M. Merino, P. Larrañaga, J. Quiroga, B. Sierra, M. Girala, Feature Subset Selection by Population-Based Incremental Learning. A case study in the survival of cirrhotic patients treated with TIPS, Technical Report no. EHU-KZAA-IK-1/99, University of the Basque Country, Spain, 1999.