Comparison between a filter and a wrapper approach to variable subset selection in regression problems

Ivan Kojadinovic*, Thomas Wottka**

* IREMIA, Université de la Réunion, 15, avenue René Cassin - BP 7151, 97715 Saint-Denis messag cedex 9, France
  e-mail: [email protected]
** MIT - Management Intelligenter Technologien GmbH, Promenade 9, 52076 Aachen, Germany
  e-mail: [email protected]

ABSTRACT: Variable subset selection is an important issue when learning regression models from data. In this paper, a filter approach to variable subset selection based on the notion of mutual information is compared with a wrapper approach based on fast non-parametric regression algorithms. A detailed comparison of the two approaches is made on three nonlinear and noisy regression problems.

KEYWORDS: variable subset selection, regression, mutual information, non-parametric regression algorithms

1 INTRODUCTION

Variable subset selection is an important issue when learning regression models from data. It is well known that regression models involving too many input variables suffer from the so-called "curse of dimensionality". The aim of a variable subset selection procedure is therefore to determine the subset of input variables leading to the model that generalizes best. The two main elements of such a procedure are a measure of relevance of a subset of input variables, say ω, and an optimization algorithm for searching for the optimal, or a near-optimal, subset with respect to ω.

From the relevance measure point of view, variable subset selection procedures can be classified into two groups [26]: filter procedures and wrapper procedures (see Figure 1). In the case of filter procedures, the relevance measure is defined independently of the learning algorithm; the subset selection procedure can then be seen as a preprocessing step. In the case of wrapper procedures, the relevance measure is defined directly from the learning algorithm, for example in terms of the cost of learning and the generalization ability of the model. Although filter approaches tend to be much faster, their major drawback is that the optimal subset of variables may not be independent of the representational biases of the algorithm that will be used during the learning phase [26]. In the case of wrapper procedures, the learning algorithm should fulfill two main conditions: the number of parameters to be optimized should be as low as possible and the algorithm should be highly computationally efficient. Consequently, basing an input variable selection procedure on the most popular algorithms for nonlinear function approximation, such as neural networks or fuzzy rule-based systems, is not feasible.

In this paper, we compare a filter approach, using the mutual information as relevance measure, with a wrapper approach based on fast non-parametric regression algorithms. After presenting both approaches from a theoretical perspective, we make a detailed comparison of their abilities to determine the best subset of candidate input variables in the case of three highly nonlinear and noisy regression problems.

In the rest of this paper, we place ourselves in a probabilistic framework. The set of candidate input variables will be denoted by ℵ and the random vector to be predicted by Y. The elements (i.e. random variables) of ℵ will be denoted by upper-case letters, e.g. X, and the subsets of ℵ (which can equivalently be seen as random vectors) by "double" upper-case letters, e.g. X.


[Figure 1 appears here: panel (a), the filter approach, shows the set of candidate input variables feeding a subset selection algorithm whose output is then passed to the learning algorithm; panel (b), the wrapper approach, shows the subset selection algorithm and the learning algorithm interacting through a subset evaluation loop.]

Figure 1: Two approaches to variable subset selection based on the incorporation of the learning algorithm [26].

2 THE FILTER APPROACH

In this section, we briefly describe the filter approach to variable subset selection proposed in [21]. This approach uses the notion of mutual information to define the measure of relevance ω. In a probabilistic framework, the mutual information [1, 5] arises as a natural measure of the general dependence between two random vectors. The estimation of the mutual information between two continuous random vectors from data is based on the estimation of probability densities and was initially proposed by Fraser and Swinney [7] for the analysis of chaotic time series. In [22], Moon et al. used a more robust estimation method but did not propose an appropriate estimator for the mutual information. As we will see, the relevance measure ω defined using the mutual information turns out to be a fuzzy measure. In order to be able to estimate the relevance ω(X) of a subset X of candidate input variables from a sample, we propose an estimator of the mutual information between two continuous random vectors. The estimation of the probability densities is carried out using one of the two following non-parametric methods: the classical adaptive kernel density estimation [24, 14] or the so-called projection pursuit density estimation [9, 8, 13].

2.1 ENTROPY AND MUTUAL INFORMATION

The concepts of entropy and mutual information are at the root of information theory [1, 5]. In a probabilistic framework, they arise as natural tools for measuring the information content of a random vector or for quantifying the information one random vector contains about another. The entropy of a random vector can be seen as a measure of the information (or the uncertainty) contained in its probability distribution: the less predictable a random vector is, the higher its entropy. Let X be a continuous random vector with probability density p(x). The entropy of X is defined by

    H(X) := - \int_S p(x) \log p(x) \, dx    (1)

where S is the support set of the random vector (i.e. the set where p(x) > 0). Note that for certain probability densities the above integral does not exist. Furthermore, one should keep in mind that not all continuous random vectors have a density function. In the rest of this paper, however, we will assume that we are dealing with random vectors that have a probability density function and whose entropy exists. Given two random vectors, the mutual information can be interpreted as a measure of the information one random vector contains about the other [5]. It can equivalently be seen as a measure of general dependence. The mutual information between two continuous random vectors X and Y with joint probability density p(x, y) and marginal probability densities p(x) and p(y) respectively is defined by

    I(X; Y) := \int \int p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy    (2)
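As a simple illustration of definition (2), for a bivariate Gaussian pair (X, Y) with correlation coefficient ρ, the integral can be evaluated in closed form:

    I(X; Y) = -\frac{1}{2} \log(1 - \rho^2)

so that independent components (ρ = 0) carry zero mutual information, while the mutual information diverges as |ρ| tends to 1.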

It can also be defined using entropies [5, p. 231], in which case we have

    I(X; Y) := H(X) + H(Y) - H(X, Y)    (3)

which can be rewritten as

    I(X; Y) = H(Y) - H(Y|X)    (4)

where H(Y|X) = H(X, Y) - H(X) is the conditional entropy of Y given X. From (4), it appears that I(X; Y) can be interpreted as a measure of the reduction of the uncertainty of Y given the knowledge of X. The mutual information can equivalently be seen as a particular case of the Kullback-Leibler distance between two probability densities [5, p. 232]. It follows that the mutual information is symmetric (i.e. I(X; Y) = I(Y; X)), always non-negative, and that it can be interpreted as a measure of the "distance" between the joint probability density p(x, y) and the product of the marginal densities p(x)p(y). Consequently, if X and Y are statistically independent (i.e. p(x, y) = p(x)p(y)), the mutual information is zero. On the contrary, the more dependent X and Y are, the higher their mutual information. If the logarithm used in (2) is in base 2, the mutual information is expressed in bits; if the natural logarithm is used, the results are in nats.

2.2 MUTUAL INFORMATION AS A MEASURE OF RELEVANCE OF A SUBSET OF VARIABLES

As mentioned in the introduction, the first step when developing a filter approach to input variable selection consists in defining a set function ω : P(ℵ) → R which assigns to each subset X of ℵ a measure of its ability to predict Y. In the previous subsection, we have seen that, in a probabilistic framework, the mutual information arises as a natural measure of general dependence between two random vectors. It therefore seems natural to define ω by

    \omega(X) := \begin{cases} 0 & \text{if } X = \emptyset \\ I(X; Y) & \text{for all } X \subseteq ℵ, \, X \neq \emptyset \end{cases}    (5)

It can easily be shown [21] that the set function ω : P(ℵ) → R is a fuzzy measure on ℵ, i.e. ω is positive and fulfills the following conditions:

(i) ω(∅) = 0
(ii) X_1 ⊆ X_2 implies ω(X_1) ≤ ω(X_2)

2.3 ESTIMATION OF THE MUTUAL INFORMATION FROM A SAMPLE

In order to assess the relevance ω(X) of a subset X of candidate input variables from data, one has to be able to estimate the mutual information from a sample. Let X and Y be two random vectors whose mutual information is to be estimated, and suppose that we have a random sample {(x_1, y_1), ..., (x_N, y_N)} of size N drawn according to the density p(x, y) of (X, Y). An unbiased estimator of the mutual information between X and Y (see [21]) is then given by

    \hat{I}(X; Y) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(x_i, y_i)}{p(x_i) p(y_i)}    (6)

Furthermore, it has been shown in [21] that the above estimator converges towards I(X; Y) in probability. In order to use the estimator defined in (6) to estimate the mutual information between X and Y, one needs to know the probability densities p(x, y), p(x) and p(y) at the sample points (x_i, y_i), x_i and y_i. In most real world problems, the probability densities of the observed variables are not explicitly known. It is therefore necessary to estimate the densities from the available data.
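As an illustration of how (6) can be put to work, here is a minimal Python sketch that plugs a fixed-bandwidth Gaussian kernel density estimate into the estimator; it is only a stand-in for the AKDE and PPDE estimators discussed next, and the function name mi_plugin is ours.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mi_plugin(x, y):
    # x: (N, dx) sample of X, y: (N, dy) sample of Y.
    # gaussian_kde expects samples as columns, i.e. arrays of shape (dim, N).
    xy = np.hstack([x, y]).T
    p_xy, p_x, p_y = gaussian_kde(xy), gaussian_kde(x.T), gaussian_kde(y.T)
    # Plug-in estimator (6): average log density ratio over the sample (in nats).
    return np.mean(np.log(p_xy(xy)) - np.log(p_x(x.T)) - np.log(p_y(y.T)))

# Quick check on a correlated bivariate Gaussian: the true value is
# -0.5 * log(1 - rho**2), i.e. about 0.144 nats for rho = 0.5.
rng = np.random.default_rng(0)
rho = 0.5
sample = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=2000)
print(mi_plugin(sample[:, :1], sample[:, 1:]))
```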


2.4 MULTIVARIATE DENSITY ESTIMATION

There exist two main approaches to multivariate density estimation: simple histogram-based methods and smoother non-parametric methods such as kernel estimators. Although simpler, histogram-based methods usually require a much larger sample size in order to estimate the density accurately, and they are very sensitive to the choice of their parameters (e.g. the number and position of the bins). Since the accuracy of the mutual information estimate directly depends on the quality of the density estimates, we have opted for the latter, more robust methods, even though they are computationally more expensive.

The first method we have implemented is the classical adaptive kernel density estimation (AKDE) [24, 14], which essentially consists in placing a kernel function at each observation and constructing the multivariate density from these kernels. The second method we have implemented is the so-called projection pursuit density estimation (PPDE) [8, 13, 14], which consists in searching for "interesting" projections of the multivariate data and constructing the multivariate density as a product of the univariate densities estimated along the "interesting" projection directions. Although the notion of "interestingness" is difficult to define, Huber [13] argued that the projection directions along which the data have a Gaussian density are the least interesting. Following Huber, Friedman [9] proposed an algorithm for multivariate density estimation which consists in searching for the least Gaussian projection directions. Recently, more efficient algorithms for finding such projections have been proposed in the framework of independent component analysis [4, 16]. Consequently, we have used the fixed-point algorithm proposed by Hyvärinen [15] as a search procedure for "interesting" projection directions, while keeping Friedman's original method for constructing the multivariate density estimate.

Both density estimation methods are compared in [21]. PPDE seems more attractive when the dimension of the data is high since, like all projection pursuit methods, it suffers less than other methods from the "curse of dimensionality". In the case of low-dimensional data, AKDE can lead to a better estimate if the underlying density is highly structured. In terms of computational expense, the two methods behave comparably for small sample sizes (e.g. N < 800). However, as the sample size grows, AKDE is increasingly penalized compared to PPDE since the multivariate density is estimated by placing a kernel function at each observation.
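For concreteness, here is a simplified sketch of the adaptive kernel idea: a two-stage estimator with a diagonal-bandwidth Gaussian kernel, in the spirit of [24]. The actual AKDE implementation used in [21] may differ in its pilot estimate and bandwidth rule, and the function name adaptive_kde is ours.

```python
import numpy as np
from scipy.stats import gaussian_kde

def adaptive_kde(sample, query, alpha=0.5):
    # sample: (N, d) observations, query: (M, d) evaluation points.
    n, d = sample.shape
    pilot = gaussian_kde(sample.T)                      # stage 1: fixed-bandwidth pilot estimate
    f_pilot = pilot(sample.T)
    g = np.exp(np.mean(np.log(f_pilot)))                # geometric mean of the pilot values
    lam = (f_pilot / g) ** (-alpha)                     # local bandwidth factors lambda_i
    h = pilot.factor * sample.std(axis=0, ddof=1)       # per-dimension base bandwidths (Scott's rule)
    # Stage 2: sum of Gaussian kernels with observation-dependent bandwidths h * lambda_i.
    dens = np.zeros(len(query))
    for x_i, lam_i in zip(sample, lam):
        h_i = h * lam_i
        z = (query - x_i) / h_i
        dens += np.exp(-0.5 * np.sum(z * z, axis=1)) / (np.prod(h_i) * (2.0 * np.pi) ** (d / 2.0))
    return dens / n
```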

3 THE WRAPPER APPROACH

The expression wrapper approach covers the category of variable subset selection algorithms that apply a learning algorithm in order to conduct the search for the optimal or a near-optimal subset [20]. In practice, the proposed wrapper approach to variable subset selection differs from the filter approach of the previous section in the way the measure of relevance ω is defined. While in the filter approach ω is estimated as a measure of statistical dependence between subsets of random variables, in the wrapper approach ω is defined as the accuracy obtained by nonlinear regression. Thus, the wrapper considers the relevance of the subset X of random variables with respect to the random vector Y to be predicted as the accuracy of the obtained nonlinear regression model f_{Y|X}:

    \hat{y} = f_{Y|X}(x)    (7)

with regressor X and regressand Y, where \hat{y} denotes an estimate of a realization y of the random vector Y and x denotes a realization of the random vector X. The two fundamental elements of the proposed approach are a regression algorithm and an accuracy measure. The accuracy measure corresponds to the measure of relevance ω that drives the whole variable subset selection in the wrapper approach. Suppose that we have two candidate subsets X' and X'' of ℵ. Subset X'' is considered more relevant than X' if it leads to a more accurate regression model. The choice of the accuracy measure depends on the application. In this paper, we propose a nonlinear version of the coefficient of determination (COD). For a subset X ⊆ ℵ, the measure of relevance ω is therefore defined by

    \omega(X) = \frac{1}{|Y|} \sum_{i=1}^{|Y|} COD_{Y_i|X}    (8)

where

    COD_{Y_i|X} := \frac{\Bigl(\,\overline{\bigl(f_{Y_i|X}(x) - \overline{f_{Y_i|X}(x)}\bigr)\bigl(y_i - \overline{y_i}\bigr)}\,\Bigr)^2}{\Bigl(\overline{f_{Y_i|X}(x)^2} - \overline{f_{Y_i|X}(x)}^{\,2}\Bigr)\Bigl(\overline{y_i^2} - \overline{y_i}^{\,2}\Bigr)}    (9)

and where the random vector Y is composed of |Y| random variables Y_1, Y_2, ..., Y_{|Y|} whose realizations are denoted by y_1, y_2, ..., y_{|Y|} respectively. The bars in (9) denote mean value estimates over the sample of size N, e.g. \overline{x} := \frac{1}{N} \sum_i x_i.
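As a concrete illustration, here is a minimal Python sketch of the accuracy measure (8)-(9); the function names cod and relevance are ours, introduced only for this example.

```python
import numpy as np

def cod(y_obs, y_pred):
    # Nonlinear coefficient of determination (9): the squared empirical correlation
    # between the model output f_{Y_i|X}(x) and the observed values y_i.
    f, y = np.asarray(y_pred, float), np.asarray(y_obs, float)
    cov = np.mean((f - f.mean()) * (y - y.mean()))
    var_f = np.mean(f ** 2) - f.mean() ** 2
    var_y = np.mean(y ** 2) - y.mean() ** 2
    return cov ** 2 / (var_f * var_y)

def relevance(y_obs, y_pred):
    # omega(X) as in (8): average COD over the |Y| components of the regressand;
    # y_obs and y_pred are arrays of shape (N, |Y|).
    return np.mean([cod(y_obs[:, i], y_pred[:, i]) for i in range(y_obs.shape[1])])
```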


3.1 k NEAREST NEIGHBOR REGRESSION (kNN)

Nearest neighbor search is a powerful method for a large number of nonlinear problems. It is used for non-parametric density estimation [12] and classification [17], the analysis of nonlinear dynamics [3, 6], nonlinear noise reduction [19, 23], etc. In classification problems it is often referred to as the k nearest neighbor method (kNN) [12], while in the context of time series analysis the technique appears as the method of false nearest neighbors [18, 3]. As for variable subset selection, kNN approaches have been successfully applied to classification as well as regression problems. Popular examples of regression problems comprise nonlinear time series prediction [6, 25]. In the domain of nonlinear dynamical systems, nearest neighbor regression is applied to time delay embedding [3], which represents a special case of subset selection.

In a probabilistic framework, finding the regression function (7) is equivalent to estimating the conditional mean of the random vector Y given the random vector X. Given the probability density function p(x, y) of (X, Y), this conditional mean is given by

    \hat{y} = \frac{\int dy \, y \, p(x, y)}{\int dy \, p(x, y)}    (10)

where x and y denote realizations of the random vectors X and Y. This statistically driven regression function has the property of possessing the least mean square error. The density function p(x, y) can be approximated, e.g., by an adaptive kernel density approach or a projection pursuit density estimation. In order to reduce the computational expense as much as possible, we propose a nearest neighbor version of the kernel density estimate:

    p(x, y) = \frac{1}{k} \sum_{i \in kNN(x)} \kappa_Y(y - y_i) \, \kappa(x - x_i)    (11)

The summation in (11) is carried out over the k nearest neighbors {x_{i_1}, x_{i_2}, ..., x_{i_k}} of the given realization x of the random vector X. In the remainder, this x is referred to as the query point. In the above equation, kNN(x) denotes the set of nearest neighbor indices {i_1, i_2, ..., i_k}, y_i denotes the regressand value assigned to x_i, and \kappa_Y and \kappa are kernel functions of the regressand Y and the regressor X respectively, to be specified in the following. Inserting (11) into (10) and using \int dy \, y \, \kappa_Y(y - y_i) = y_i, we obtain

    \hat{y} = \frac{\sum_{i \in kNN(x)} y_i \, \kappa(x - x_i)}{\sum_{i \in kNN(x)} \kappa(x - x_i)} =: f_{kNN}(x)    (12)

This equation defines the k nearest neighbor regression function f_{kNN}. A high performance nearest neighbor search is realized via Bentley's k-d tree method [10, 2]. As kernel function \kappa, we opted for the simple triangle function

    \kappa(x - x_i) = \begin{cases} 1 - \frac{1}{h} |x - x_i| & \text{if } |x - x_i| < h \\ 0 & \text{if } |x - x_i| \geq h \end{cases}    (13)

where |\cdot| denotes the Euclidean distance. Because of the form of equation (12), it is not necessary for the kernel function to satisfy the normalization condition \int dx \, \kappa(x) = 1. The parameter h defines the range within which a neighbor x_i of x contributes to the predicted value \hat{y}. The contribution of the neighbor x_i decreases as the distance |x - x_i| increases, and for distances larger than h the contribution vanishes completely. In the special case where none of the nearest neighbors of a given query point x contributes, the estimate is given by the mean value f_{kNN}(x) = \frac{1}{N} \sum_i y_i. Other choices of kernel such as the uniform, Gaussian or Epanechnikov functions are possible. The advantage of the choice (13) is that nearest neighbors lying too far away from the query point x do not affect the estimate \hat{y}. For simplicity, the parameter h is considered fixed here, in contrast to AKDE where h is adapted locally. Nevertheless, nearest neighbor regression based on (12) and (13) can easily be extended to a locally varying h(x), as described in [12] for density estimation, but at the expense of a higher computational cost. In the analysis presented in this paper (see next section), we fix h to 1.0. Another branch of nearest neighbor methods uses local linear regression instead of kernels; this is computationally much more expensive since, for every query point x, a linear regression calculation has to be carried out.
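A minimal sketch of the resulting regressor, using scipy's k-d tree in place of Bentley's implementation [10, 2]; the function name knn_regression and its default parameters are ours, and the regressand is assumed one-dimensional, as in the experiments of this paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_regression(x_train, y_train, x_query, k=10, h=1.0):
    # kNN regression (12) with the triangle kernel (13).
    # x_train: (N, d) regressor sample, y_train: (N,) regressand sample,
    # x_query: (M, d) query points; h is the kernel range, fixed to 1.0 as in the paper.
    tree = cKDTree(x_train)                      # k-d tree for fast nearest neighbor search
    dist, idx = tree.query(x_query, k=k)         # (M, k) distances and neighbor indices
    w = np.clip(1.0 - dist / h, 0.0, None)       # triangle kernel: 1 - |x - x_i|/h, zero beyond h
    num = np.sum(w * y_train[idx], axis=1)
    den = np.sum(w, axis=1)
    # If no neighbor falls within range h, fall back to the global mean of the regressand.
    return np.where(den > 0, num / np.where(den > 0, den, 1.0), y_train.mean())
```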


3.2 PROJECTION PURSUIT REGRESSION (PPR)

The projection pursuit regression method (PPR) aims at finding projection directions \alpha \in R^{|X|} in the variable space of the candidate explanatory subsets X ⊆ ℵ. Once M projection vectors are found, the realizations of the random vector Y can be approximated using the following equation:

    \hat{y} = \sum_{j=1}^{M} g_j(\alpha_j^T x) =: f_{PPR}(x)    (14)

where \alpha^T denotes the transpose of \alpha. Equation (14) defines the projection pursuit regression function f_{PPR}: a linear combination of M functions g_j, each applied to a one-dimensional variable obtained by projection. A projection vector \alpha_j transforms any random vector X into the one-dimensional random variable \alpha_j^T X. The \alpha_j and g_j are calculated by the projection pursuit algorithm described in the sequel.

The PPR calculation is carried out in several steps, as described in [11]. In the first step, \alpha_1 is estimated iteratively by minimizing the mean square error I(\alpha_1) = \overline{(y - g_1(\alpha_1^T x))^2} using a gradient descent optimization scheme. Here g_1 denotes a non-parametric function that approximates Y and is found by a smoothing operation. The next step is based on the projected data {z_2, g_2(\alpha_2^T x)}, where z_2 denotes the residuals of the previous step, i.e. z_2 := y - g_1(\alpha_1^T x); \alpha_2 is estimated so as to minimize the mean square error I(\alpha_2) = \overline{(z_2 - g_2(\alpha_2^T x))^2}. In exactly the same way, the j-th step carries out the minimization of the mean square error

    I(\alpha_j) = \overline{(z_j - g_j(\alpha_j^T x))^2}    (15)

based on the projected data {z_j, g_j(\alpha_j^T x)}, with the residuals of the previous step given by

    z_j = z_{j-1} - g_{j-1}(\alpha_{j-1}^T x), \quad j = 3, 4, ..., M    (16)

The objective function I(\alpha) to be minimized in a projection pursuit step is also referred to as the projection pursuit index. For the numerical calculations presented in the remainder, we set M = 3. The \alpha_j^T X in approximation (14) can be interpreted as new relevant variables generated from the original variables X; in that sense, PPR is a variable generator. Each variable \alpha_j^T X contributes additively to the target Y via the non-parametric function g_j.
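To make the stage-wise procedure concrete, here is a simplified sketch of this fitting scheme. It is not Friedman and Stuetzle's algorithm [11]: the ridge functions g_j are represented by fixed-degree polynomials instead of a non-parametric smoother, and the projection directions are found with a generic optimizer rather than gradient descent, but it follows the same residual-fitting recursion (14)-(16). All names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_ppr(x, y, n_terms=3, degree=5, seed=0):
    # Stage-wise fit: each term adds g_j(alpha_j^T x) to the model and passes
    # its residuals to the next term, in the spirit of (15)-(16).
    rng = np.random.default_rng(seed)
    n, d = x.shape
    residual = np.asarray(y, float).copy()
    terms = []
    for _ in range(n_terms):
        def pp_index(alpha, z=residual):
            a = alpha / (np.linalg.norm(alpha) + 1e-12)      # keep the direction normalized
            t = x @ a
            g = np.polyfit(t, z, degree)                     # polynomial ridge function along the projection
            return np.mean((z - np.polyval(g, t)) ** 2)      # projection pursuit index I(alpha)
        res = minimize(pp_index, rng.normal(size=d), method="Nelder-Mead")
        a = res.x / np.linalg.norm(res.x)
        g = np.polyfit(x @ a, residual, degree)
        residual = residual - np.polyval(g, x @ a)           # residuals z_{j+1} for the next term
        terms.append((a, g))
    return terms

def predict_ppr(terms, x):
    # f_PPR(x) = sum_j g_j(alpha_j^T x), cf. (14)
    return sum(np.polyval(g, x @ a) for a, g in terms)
```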

3.3 COMPARISON BETWEEN kNN AND PPR

PPR is computationally more expensive than the kNN method. This is due to the additional optimization step that has to be carried out while searching for projection directions. The search for these projections may be an advantage if one aims at eliminating the contributions of irrelevant variables in a given variable subset, especially in the case of sparse samples in a high-dimensional variable space; this is the reason why PPR does not suffer as much from the "curse of dimensionality". Each estimated projection can be interpreted as a new variable. While irrelevant variables in the subset deteriorate the result of kNN regression, the PPR method tends to eliminate them via projection. Problems with PPR arise when there is no low-dimensional structure in the data, so that no appropriate projection directions can be found in the variable space; in those cases, PPR is clearly outperformed by kNN. In the scope of this paper, we restrict ourselves to the case of one-dimensional regressands, i.e. |Y| = 1.

4 COMPARISON BETWEEN THE DIFFERENT RELEVANCE MEASURES

In the previous sections, several relevance measures were defined. Let us denote by ω_AKDE and ω_PPDE the two measures proposed in the filter approach, and by ω_kNN and ω_PPR the two measures proposed in the wrapper approach. In this section, we compare these four relevance measures on three nonlinear and noisy variable subset selection problems.

4.1 DESCRIPTION OF THE ARTIFICIAL PROBLEMS

In order to compare the four relevance measures mentioned above, we artificially generated three different variable subset selection problems. For each of them, the set of candidate input variables ℵ is composed of 8 random variables X_1, ..., X_8. The variables X_1, ..., X_7 are Gaussian with zero mean and unit variance. The variable X_8 is derived from X_1 by X_8 = sin(X_1) + Z, where Z is a white noise of zero mean and variance 0.1. The three variable subset selection problems are described in Table 1, which lists, for each problem, the relevant, the redundant and the irrelevant variables. All four relevance measures should be able to rank the subsets of ℵ according to their relevance for predicting Y. Since, for each problem, several variables play symmetric roles, we only calculate the relevances of the subsets listed in Table 2. Subset 9 has been added in order to see how the relevance measures deal with subsets containing redundant variables.


Table 1: Description of the three variable subset selection problems.

Problem | Equation                                 | Relevant variables | Redundant variables | Irrelevant variables
I       | Y = sin( \sum_{i=1}^{5} tanh(X_i) ) + Z  | X_1, ..., X_5      | X_8                 | X_6, X_7
II      | Y = \prod_{i=1}^{3} X_i + Z              | X_1, X_2, X_3      | X_8                 | X_4, ..., X_7
III     | Y = \prod_{i=1}^{5} X_i + Z              | X_1, ..., X_5      | X_8                 | X_6, X_7

Table 2: Index and cardinal of the subsets of ℵ whose relevance is calculated.

Subset index | |X| | Subset X
1            | 1   | {X_1}
2            | 2   | {X_1, X_2}
3            | 3   | {X_1, X_2, X_3}
4            | 4   | {X_1, X_2, X_3, X_4}
5            | 5   | {X_1, X_2, X_3, X_4, X_6}
6            | 5   | {X_1, X_2, X_3, X_4, X_5}
7            | 6   | {X_1, X_2, X_3, X_4, X_5, X_6}
8            | 7   | {X_1, X_2, X_3, X_4, X_5, X_6, X_7}
9            | 6   | {X_1, X_2, X_3, X_4, X_5, X_8}
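For reference, a minimal sketch of how samples for the three problems of Table 1 could be generated; the noise added to Y is assumed to have the same variance (0.1) as the noise used to build X_8, which the text does not state explicitly, and the function name make_problem is ours.

```python
import numpy as np

def make_problem(problem, n, rng=None):
    # Generate (X, Y) samples for problems I, II and III of Table 1.
    rng = rng or np.random.default_rng()
    x = rng.standard_normal((n, 8))                                # X_1, ..., X_7 ~ N(0, 1); last column overwritten below
    x[:, 7] = np.sin(x[:, 0]) + rng.normal(0.0, np.sqrt(0.1), n)   # X_8 = sin(X_1) + Z
    z = rng.normal(0.0, np.sqrt(0.1), n)                           # assumed noise on Y (variance not stated in the paper)
    if problem == "I":
        y = np.sin(np.tanh(x[:, :5]).sum(axis=1)) + z
    elif problem == "II":
        y = np.prod(x[:, :3], axis=1) + z
    else:                                                          # problem III
        y = np.prod(x[:, :5], axis=1) + z
    return x, y
```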

For problems I and III, the most relevant subset is {X_1, X_2, X_3, X_4, X_5}. In the case of problem II, the most relevant variable combination is {X_1, X_2, X_3}. As variable subset selection tasks, problems II and III are much harder than problem I. This is due to the strong nonlinearity arising in II and III: as the variable Y is generated as the product of input variables, multidimensional correlation analysis fails completely to select the most relevant variables. For problem II and especially problem III, the distribution of the target variable Y reveals a spiky structure at 0 and very heavy tails.

4.2 RESULTS AND COMPARISON

For all three variable subset selection problems, the relevances of all the subsets listed in Table 2 are calculated using the four relevance measures ω_AKDE, ω_PPDE, ω_kNN and ω_PPR. In order to study the influence of the sample size N on the relevances of the subsets, the calculations are carried out for N = 100, 200, 400, 800, 1600, 3200, 6400.

The results of the calculations for problem I are given in Figure 2, where the bold curve corresponds to Subset 6, the most relevant subset for this problem. As one can see, for all four measures and for almost all sample sizes, Subset 6 has maximum relevance. It is interesting to note that all four measures behave approximately monotonically with respect to subset inclusion: for almost all sample sizes and for all four relevance measures, all the subsets containing Subset 6 have very similar relevance. This behavior was to be expected in the case of the measures ω_AKDE and ω_PPDE because of the properties of the mutual information. It can also be noticed that the measures that behave the "most" monotonically are those based on projection pursuit algorithms. Finally, note that the relevance of Subset 6 does not vary much for N ≥ 400, except in the case of the measure ω_kNN.

The curves for problem II are given in Figure 3. All four measures identify Subset 3 as the most relevant for all sample sizes, except ω_PPDE, which requires at least 400 observations. Again, the two measures based on projection pursuit algorithms behave monotonically with respect to subset inclusion for N ≥ 400, which is not the case for ω_AKDE and ω_kNN. Note also that, for ω_PPDE and ω_PPR, a minimum sample size of 800 seems to be required in order to obtain a "stable" behavior.

As said in the previous subsection, the third problem is the most difficult. This is confirmed by the graphical results given in Figure 4. As one can see, ω_AKDE and ω_kNN successfully identify the most relevant subset as soon as N ≥ 400. However, in the case of ω_kNN, the relevance values are very small. Moreover, the behavior of ω_AKDE depends strongly on the sample size. The highly nonlinear structure of the data has an even worse effect on ω_PPDE and ω_PPR, since these measures are not able to identify the most relevant subset.

To conclude, it appears that, for all three artificially generated problems, ω_AKDE and ω_kNN are able to successfully identify the most relevant subset as soon as N ≥ 400. It is also interesting to note that both measures clearly discard subsets that contain irrelevant or redundant variables. The two measures based on projection pursuit algorithms perform well for problems I and II but have difficulties with problem III. Indeed, because of the highly nonlinear data structure, the projection pursuit algorithms have trouble finding "good" projection directions.



Figure 2: Relevances of the subsets listed in Table 2 for problem I, plotted against the sample size N. (a) ω_AKDE (b) ω_PPDE (c) ω_kNN (d) ω_PPR.

From a computational point of view, ω_kNN is clearly the most efficient measure because it is based on Bentley's k-d tree method to speed up the nearest neighbor search. The two measures based on projection pursuit algorithms come in second place. The worst measure in terms of computational efficiency is ω_AKDE, because its adaptive algorithm requires two passes through the observations. However, its cost could be drastically reduced by using the k-d tree method to access the data.
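Putting the pieces together, a hedged sketch of how the wrapper relevance ω_kNN could be evaluated on the candidate subsets of Table 2; it reuses the illustrative helpers sketched earlier (make_problem, knn_regression, cod, all our own names) and scores the subsets on a held-out split, although the paper does not detail how the COD was evaluated in its experiments.

```python
import numpy as np

# Candidate subsets of Table 2, given as 0-based column indices into the generated X matrix.
SUBSETS = {
    1: [0], 2: [0, 1], 3: [0, 1, 2], 4: [0, 1, 2, 3],
    5: [0, 1, 2, 3, 5], 6: [0, 1, 2, 3, 4],
    7: [0, 1, 2, 3, 4, 5], 8: [0, 1, 2, 3, 4, 5, 6],
    9: [0, 1, 2, 3, 4, 7],
}

x, y = make_problem("I", 1600, np.random.default_rng(1))
x_tr, x_te, y_tr, y_te = x[:1200], x[1200:], y[:1200], y[1200:]

def omega_knn(cols, k=10, h=1.0):
    # Wrapper relevance of one subset: COD of a kNN regression restricted to the chosen columns,
    # fitted on the training part and scored on the held-out part of the sample.
    y_hat = knn_regression(x_tr[:, cols], y_tr, x_te[:, cols], k=k, h=h)
    return cod(y_te, y_hat)

scores = {idx: omega_knn(cols) for idx, cols in SUBSETS.items()}
print(sorted(scores, key=scores.get, reverse=True))   # subset indices ranked by estimated relevance
```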

5 CONCLUSION

Variable subset selection is an important issue when learning regression models from data, since the presence of irrelevant or redundant variables usually deteriorates the performance of the model. In this paper, we have compared a filter approach to variable subset selection based on the notion of mutual information with a wrapper approach based on fast non-parametric regression algorithms. A detailed comparison of the two approaches has been made on three nonlinear and noisy artificial regression problems. This comparison has highlighted the limitations of the statistical algorithms on which the two approaches are based. Reassuringly, the two approaches have led to very similar results, which is why other criteria, such as computational efficiency when combined with a specific search algorithm, should be taken into account before discriminating between the two approaches.



Figure 3: Relevances of the subsets listed in Table 2 for problem II, plotted against the sample size N. (a) ω_AKDE (b) ω_PPDE (c) ω_kNN (d) ω_PPR.

REFERENCES

[1] N. Abramson. Information Theory and Coding. McGraw Hill, New York, 1963.
[2] J.L. Bentley. K-d trees for semidynamic point sets. In 6th Ann. ACM Sympos. Comput. Geom., page 187, 1990.
[3] L. Cao, A. Mees, K. Judd, and G. Froyland. Determining the embedding dimensions of input-output time series data. 1997.
[4] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994.
[5] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
[6] J.D. Farmer and J.J. Sidorowich. Predicting chaos dynamics. In J.A.S. Kelso, A.J. Mandell, and M.F. Shlesinger, editors, Dynamic Patterns in Complex Systems, page 265. World Scientific, Singapore, 1988.
[7] A. Fraser and H. Swinney. Independent coordinates for strange attractors from mutual information. Physical Review A, 33(2):1134–1140, 1986.
[8] J. Friedman, W. Stuetzle, and A. Schroeder. Projection pursuit density estimation. Journal of the American Statistical Association, 79(387):599–608, 1984.
[9] J.H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.
[10] J.H. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 209, 1977.



Figure 4: Relevances of the subsets listed in Table 2 for problem III, plotted against the sample size N. (a) ω_AKDE (b) ω_PPDE (c) ω_kNN (d) ω_PPR.

[11] J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76:817, 1981.
[12] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, 1990.
[13] P. Huber. Projection pursuit. The Annals of Statistics, 13(2):435–475, 1985.
[14] J. Hwang, S. Lay, and A. Lippman. Nonparametric multivariate density estimation: A comparative study. IEEE Transactions on Signal Processing, 42(10):2795–2810, 1994.
[15] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[16] A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys, 2:94–128, 1999.
[17] J. Strackeljahn and D. Behr. A feature selection method for acoustic analysis problems. In Third European Congress on Intelligent Techniques and Soft Computing (EUFIT), page 20, Aachen, Germany, 1995.
[18] M. Kennel, R. Brown, and H. Abarbanel. Determining minimum embedding dimension using a geometrical construction. Physical Review A, 45:3403–3411, 1992.
[19] M. Kennel and S. Isabelle. Method to distinguish possible chaos from colored noise and to determine embedding parameters. Physical Review A, 46:3111, 1992.


[20] R. Kohavi and G.H. John. The wrapper approach. In H. Liu and H. Motoda, editors, Feature Extraction, Construction and Selection, page 33. Kluwer Academic Publishers, 1998.
[21] I. Kojadinovic and H. Ralambondrainy. Input variable selection using information-theoretic functionals and a genetic algorithm. In COIL 2000: Symposium on Computational Intelligence and Learning, Chios, Greece, 2000.
[22] Y.-I. Moon, B. Rajagopalan, and U. Lall. Estimation of the mutual information using kernel density estimators. Physical Review E, 52(3):2318–2321, 1995.
[23] A. Pikovski. Discrete-time dynamic noise filtering. Sov. J. Commun. Technol. Electron., 31:81, 1986.
[24] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York, 1986.
[25] G. Sugihara and R. May. Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature, 344:734, 1990.
[26] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. In H. Liu and H. Motoda, editors, Feature Extraction, Construction and Selection, pages 118–135. Kluwer Academic Publishers, 1998.
