A Genetic Algorithm for Best Subset Selection in Linear Regression

Bradley C. Wallet¹, David J. Marchette¹, Jeffery L. Solka¹, and Edward J. Wegman²

¹ Naval Surface Warfare Center, Systems Research and Technology Department, Dahlgren, Virginia 22448
² George Mason University, Center for Computational Statistics, Fairfax, Virginia 22030

To appear in the Proceedings of the 28th Symposium on the Interface, 1996.

Abstract

Given a data set consisting of a large number of predictors plus a response, the problem addressed in this work is to select a minimal model which correctly predicts the response. Methods for achieving this subsetting of the predictors have been the topic of a considerable amount of study within the statistics community. Unfortunately, current methods often fail when the predictors are highly correlated. Furthermore, because of the exponential growth of the number of possible subsets as the number of candidate predictors increases, current methods have great difficulty handling high dimensional data sets. This paper details a method for variable selection using genetic algorithms. A genetic algorithm is described which uses a unique two-criteria population management scheme. This method is explorative in nature, and allows for an approximation of the all possible subsets method over a set of interesting model sizes. Results of an application of this method to data are discussed.

1. INTRODUCTION

The problem of dimensionality reduction in linear regression has been the topic of a considerable amount of study (Miller, 1990). The all subsets regression method of dimensionality reduction becomes intractable due to the exponential growth of the number of possible subsets. Similarly, stepwise methods fail when there is a significant amount of correlation between the predictors (Berk, 1978). Other techniques, such as ridge regression, do well under some circumstances, but generally break down when the response is strongly correlated with a linear combination of the regressors while being only weakly correlated with the individual regressors (Miller, 1990). Also, because the number of possible subsets grows exponentially with the number of candidate predictors, no current method copes well with high dimensional data.

It has been suggested that one of the areas in which genetic algorithms may have a significant impact upon modern statistics is subset selection (Chatterjee and Laudato, 1995). Since regression is one of the classical problems in statistics, it would seem obvious that this would be the first place to apply genetic algorithms in this manner. However, the application of genetic algorithms to variable selection in the regression model is not a straightforward process. Traditionally, genetic algorithms work to optimize some single fitness feature. In the case of linear regression and dimensionality reduction, it is not sufficient simply to optimize the fit of the model, since a model cannot be made worse by adding an additional term. Therefore, it is necessary to use some scoring system that penalizes for model size. The classical scoring system for comparing models of different sizes is Mallows' CP (Mallows, 1973). However, our experience has shown that using a canonical genetic algorithm to select a model based simply upon optimizing the CP will lead to less than optimal results. It is usually the case that when dealing with high dimensional data, the optimal CP is biased towards a model which is unnecessarily complex. This is not a surprising result, since previous research has shown that when dealing with a large number of predictors, there are often unfortunate false correlations (Rencher and Pun, 1980). Indeed, Mallows recently wrote that simply minimizing the CP is not always a good thing (Mallows, 1995).


In a recent discussion, De Jong stated that he felt that too much emphasis within the evolutionary computation community had been placed on hard optimization problems while not enough focus had been placed upon their explorative nature (De Jong, 1995). Since many of the techniques for doing regression are interactive in nature, it seems that an explorative genetic algorithm might be more appropriate. As was previously stated, the fit of a model cannot be decreased by adding an additional term. At the very worst, a zero coefficient can be associated with the new term, yielding a model equivalent to the previous one. In practice, by taking terms out of a model, one trades off model size against fit. Recognizing this, many traditional methods focus on looking at a curve of the trade-off and manually picking a point which looks desirable. With this in mind, we developed a genetic algorithm which attempts to approximate the all subsets regression method over a region of interesting model sizes. The most straightforward way of doing this would be to use a separate genetic algorithm to select optimal models at each interesting model size. However, the set of terms in a good model of size N is closely related to the set of terms in a good model of size N-1. We therefore developed a method to simultaneously evolve models of varying sizes and to use knowledge gained about good larger models to find good smaller models. We created a unique two-stage population management scheme which simultaneously optimizes two parameters in an explorative manner.

2. BACKGROUND

2.1 Linear regression and subsetting

Consider the standard linear regression problem where the model is assumed to be of the form

    y = Xβ + e                                (1)

where y is an η × 1 vector of observations, X is an η × (κ + 1) matrix of independent predictors {xj; j = 0, ..., κ}, β is a (κ + 1) × 1 vector of unknown coefficients, and e is an error term from an unspecified distribution with zero mean and a finite variance. Since β is unknown, we estimate it by the method of Least Squares (LS). The LS estimate of the regression coefficients, which will be denoted by b, is specified by

    b = (X'X)⁻¹ X'y                           (2)

where

    b' = (b0, b1, ..., bκ).                   (3)

Once b is known, it is then possible to predict future dependent variables based upon future observables using the linear equation

    ŷ = Xb.                                   (4)

A complete discussion of the linear regression model can be found in Draper and Smith (1981).

Now, consider the case where κ is large. In such a case, it is often desirable and necessary to select a subset of Κ = {1, 2, ..., κ}. Let Π be any subset of Κ having |Π| = π members. Let XΠ be the submatrix of X containing only those columns whose indices are in Π. Using equation 2, it is then possible to estimate a new coefficient vector, bΠ, with the same goal of estimating dependent variables. The question then becomes how to select Π so that the resulting model is in some way good or desirable.

One classical measure of the goodness of a model is the R² of the model. The R² is defined as

    R² = SSregression / SStotal               (5)

where SS denotes a sum of squares. In essence, R² can be thought of as the amount of variation in the y's explained by the model.

One of the earliest methods of variable selection was the all subsets regression method. In the all subsets method, all possible regressions are performed, and the highest R² at each model size is noted. This information is then plotted as a curve. It is then possible to understand the trade-offs involved and select a model which is most desirable with regards to both fit and size.
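As a concrete illustration of this notation, the short Python sketch below fits the least-squares coefficients bΠ for an arbitrary subset Π of the columns of X and reports the resulting R². It is only a minimal sketch of equations (2), (4), and (5), not the authors' implementation; the function name fit_subset and the convention that column 0 of X is an all-ones intercept column are our own assumptions.

import numpy as np

def fit_subset(X, y, subset):
    # X is assumed to be eta x (kappa + 1), with column 0 an all-ones intercept column
    cols = [0] + sorted(subset)                # always keep the intercept column
    X_pi = X[:, cols]
    # Equation (2): b = (X'X)^(-1) X'y, computed via a numerically safer lstsq call
    b_pi, *_ = np.linalg.lstsq(X_pi, y, rcond=None)
    y_hat = X_pi @ b_pi                        # Equation (4): predicted responses
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_residual = np.sum((y - y_hat) ** 2)
    r_squared = 1.0 - ss_residual / ss_total   # Equation (5), written via the residual sum of squares
    return b_pi, r_squared

For example, fit_subset(X, y, {14, 15}) would evaluate the two-term model {x14, x15} discussed later in Section 4, assuming predictor xj occupies column j of X.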


2.2 Genetic Algorithms

A canonical genetic algorithm maintains a population P = (x1, ..., xn) of individuals xi. Typically, the xi's are candidate inputs to an objective function F(x) for which it is desired to find an optimal answer in some space Ω. The closer to optimal an individual is, the higher its fitness is said to be. In a genetic algorithm, candidate solutions are usually encoded as binary bit strings, though richer alphabets and richer data structures are sometimes used (Wallet, Marchette, and Solka, 1996). More formally, using an alphabet ℵ = {0, 1}, a point of consideration, x ∈ Ω, is represented as a string of length l, c = (a1, ..., al), where ai ∈ ℵ. A bit string, c ∈ C, is called a chromosome. The bit string representation is the genotypic representation, and the natural representation is called the phenotypic representation. There is a function that maps Ω to C and vice versa, though due to the discrete nature of C, one function may not be precisely the inverse of the other.

As a genetic algorithm runs, it progresses through a series of populations where each population Pg builds upon the previous population, Pg-1, in order to evolve increasingly fit individuals, i.e. more optimal points in Ω. This is accomplished by selecting members from Pg-1 with replacement according to some stochastic scheme which takes into consideration the fitness of individuals. The members of Pg-1 chosen to compose Pg are randomly changed in the hope that some of them will be improved. This process is then repeated until some termination criterion is achieved, such as a specific number of generations having passed or an individual with a fitness above a certain threshold being found. Pseudocode for a typical genetic algorithm is as follows:

begin
    initialize g to 0
    initialize members of P0 to random values
    evaluate fitness of members of P0
    while (termination condition not reached) loop
        increment g
        select members of Pg-1 to compose Pg
        randomly change members of Pg
        evaluate fitness of members of Pg
    end loop
end
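As a purely illustrative rendering of this loop, the Python sketch below implements a generic bit-string genetic algorithm. It is not the authors' code: the population size, operator rates, and helper logic are our own assumptions, the fitness function is assumed to return positive values, and the mutation and crossover operators it uses are the ones described in the paragraphs that follow.

import random

def canonical_ga(fitness, chrom_len, pop_size=50, generations=100,
                 p_mutation=0.01, p_crossover=0.7):
    # initialize members of P0 to random bit strings
    pop = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # evaluate fitness of members of the current population (assumed positive)
        scores = [fitness(c) for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # select parents with probability proportional to fitness
            p1, p2 = random.choices(pop, weights=scores, k=2)
            child = p1[:]
            if random.random() < p_crossover:
                # one-point crossover, as illustrated in Figure 1 below
                cut = random.randrange(1, chrom_len)
                child = p1[:cut] + p2[cut:]
            # bit-flip mutation: invert each element with some small probability
            child = [1 - b if random.random() < p_mutation else b for b in child]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)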

There are many different ways in which to randomly change individuals in order to search the space of interest. Most genetic algorithms use at the minimum a perturbation and a recombination operation. The most common perturbation operator is mutation. In the case of binary bit string representations, each element of the bit string is inverted with some small probability. Besides perturbation, genetic algorithms typically employ a recombination operator as well (Figure 1). In fact, recombination is usually the primary method by which genetic algorithms search the problem space, with perturbation primarily meant to increase diversity in order to prevent premature convergence to a suboptimal answer.

    Parent 1:  00101|110
    Parent 2:  11000|000
    Offspring: 00101|000

Figure 1: An example of the recombination operator, crossover. A new individual was created by exchanging genetic information from either side of a randomly selected point in the parents.

A thorough coverage of the field of genetic algorithms and other forms of evolutionary computation can be found in Michalewicz (Michalewicz, 1994).

2.2.1 Selection Schemes

The selection from Pg-1 to form Pg is typically done via a stochastic method. The probability, pg-1(i), of an individual i in population Pg-1 being selected for creation of an individual in population Pg is in some way related to the fitness of individual i and the fitnesses of all the individuals in Pg-1. In this way, more fit individuals are more likely to be selected than less fit individuals. These probabilities are typically determined either proportionally to fitness or to rank.

Computing probabilities proportionate to fitness has some undesirable properties. Early in the search, when there is a great deal of variety in the population, fitness proportionate weighting tends to heavily favor the few best answers, or "super individuals." This leads to an incomplete search and can often result in convergence to a local maximum. Late in the search, when the goal is to get the last bit of fine tuning, fitness proportional probabilities lack the ability to distinguish between individuals which are very close in fitness.

Computing probabilities according to rank corrects many of the problems associated with fitness proportional probabilities. In this method, individuals are ordered according to relative fitness, and an equation, typically linear, is used to determine probabilities. Differences are thereby deemphasized when there is a great deal of diversity in the fitnesses of the population, so that less fit answers are given more of a chance to contribute what could potentially be good components.


Later, when attempting to fine tune among a number of answers with relatively close fitnesses, rank proportional weighting very strongly favors small improvements in fitness, unlike fitness proportional weighting.

2.2.2 (µ,λ) Population Management

In Germany during the 1960's, an alternative method of evolutionary computation was developed independently of genetic algorithms. This method, evolutionary strategies, does not involve binary encoding, and is heavily focused upon hard numerical optimization problems. One of the distinctive ideas developed by the evolutionary strategies group was that of a population which expands and then contracts at each generation. Starting with a population of size µ, a larger population of size λ is created by selecting parents and applying perturbation and recombination operators. A new population of size µ is then selected from the larger population without any explorative operators being applied. Typically, the selection of the smaller population from the larger population is done with consideration of fitness, but the selection of the larger population from the smaller population is done completely at random. It has been demonstrated that considering fitness at both stages leads to an algorithm that is too exploitative, at the expense of exploration and search of the space.
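A single generation of this scheme can be sketched as follows. This is our own minimal paraphrase, not code from the paper; the vary operator (recombination plus perturbation applied to two parents) and the fitness function are assumed to be supplied by the caller.

import random

def mu_lambda_step(parents, fitness, vary, lam):
    # expansion: lambda offspring from parents chosen completely at random (fitness ignored)
    offspring = [vary(random.choice(parents), random.choice(parents)) for _ in range(lam)]
    # contraction: keep the mu most fit offspring, with no further explorative operators applied
    offspring.sort(key=fitness, reverse=True)
    return offspring[:len(parents)]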

2.2.3 Two cycle population management

It is possible to modify the (µ,λ) population management used by evolutionary strategies in order to optimize two parameters simultaneously. Consider the following changes. Starting from a λ population, choose a µ population according to one of the two fitnesses, called the survival fitness. This selection is done without perturbation or recombination. In this way, the most fit individuals can be thought of as having survived. Given the µ population, create the next λ population based upon the second fitness function, called the reproductive fitness. In this stage, the typical perturbation and recombination operators are applied. In this way, answers can be evolved that are increasingly fit according to two criteria. Pseudocode for a typical dual criteria GA is as follows:

begin
    initialize g to 0
    initialize members of Pλ0 to random values
    evaluate fitness of members of Pλ0 according to criterion one
    while (termination condition not reached) loop
        increment g
        select members of Pλg-1 to compose Pµg based upon criterion one
        evaluate fitness of members of Pµg according to criterion two
        select members of Pµg to compose Pλg based upon criterion two
        randomly change members of Pλg
        evaluate fitness of members of Pλg according to criterion one
    end loop
end

3. VARIABLE SELECTION AND GENETIC ALGORITHMS

Consider the application of the two-criteria genetic algorithm described above to variable selection in linear regression. By using R² as the survival fitness and model size as the reproductive fitness, it is hoped that one would be able to evolve increasingly smaller models which do a good job of predicting the response. There are, of course, a number of things which one needs to consider.

Since R² cannot be decreased by adding a term to the model, the idea of evolving increasingly smaller models with increasingly better fit is somewhat of a misnomer. However, it is possible to evolve increasingly smaller models which have the maximum possible R² at each model size if the algorithm is structured correctly. Suppose the algorithm were to start with a larger (λ) population. Let the weights for determining the smaller (µ) population be assigned fitness proportionately. As discussed previously, given a number of individuals with nearly the same fitness, this method does very little selection. This is important, since a slightly smaller model is probably going to have at least a slightly lower R². After the selection of the smaller population, one then selects a larger population based upon model size. Since the members of the µ population have been selected based upon R², and no perturbative or recombinant operators have been applied, the current individuals are all in some sense fit at modelling the response. At this point, it is necessary to be very selective in considering model size, since the R² selection is at least partially working against evolving smaller models. To this end, a rank based selection process is most appropriate. This algorithm leads to a method which is adaptive in nature.
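The Python sketch below assembles these pieces. It is a reconstruction from the description in this paper rather than the authors' actual program: R² serves as the survival fitness, selected fitness proportionately, and model size serves as the reproductive fitness, selected by linear rank. The default population sizes mirror the λ = 50, µ = 20, 30-generation run described in Section 4, while the mutation rate and the hypothetical fit_subset helper from Section 2.1 are our own assumptions.

import random

def dual_criteria_ga(X, y, n_predictors, lam=50, mu=20, generations=30,
                     p_mutation=0.02):
    best_by_size = {}                                  # best R^2 seen at each model size
    # chromosomes are bit strings; bit j-1 set means predictor x_j is in the model
    pop = [[random.randint(0, 1) for _ in range(n_predictors)] for _ in range(lam)]

    def r_squared(chrom):
        subset = {j + 1 for j, bit in enumerate(chrom) if bit}
        if not subset:
            return 0.0
        _, r2 = fit_subset(X, y, subset)               # hypothetical helper from Section 2.1
        size = len(subset)
        if r2 > best_by_size.get(size, (0.0, None))[0]:
            best_by_size[size] = (r2, sorted(subset))  # running estimate of the all subsets curve
        return r2

    for _ in range(generations):
        # survival: choose mu individuals fitness proportionately on R^2 (no variation applied)
        r2s = [r_squared(c) for c in pop]
        survivors = random.choices(pop, weights=[r + 1e-12 for r in r2s], k=mu)
        # reproduction: rank survivors so that smaller models receive higher selection weight
        survivors.sort(key=sum)                        # ascending model size
        rank_weights = [mu - i for i in range(mu)]     # linear rank-based probabilities
        next_pop = []
        for _ in range(lam):
            p1, p2 = random.choices(survivors, weights=rank_weights, k=2)
            cut = random.randrange(1, n_predictors)    # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if random.random() < p_mutation else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return best_by_size

Graphing the returned best_by_size record against model size gives the estimate of the all subsets regression curve discussed below.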


Search effort is concentrated near the smallest models which have yielded good R². When the trade-off is small, the search races along, finding smaller and smaller models. In areas where the trade-off is high, the algorithm slows down and concentrates on finding good smaller models. Also, the algorithm has an inherent parallelism. Since a good model of size n is closely related to a good model of size n-1, this algorithm works simultaneously across a number of model sizes in doing its processing.

Along with the application of this algorithm, one can keep track of the best model found at each particular model size. Graphing this information then gives an estimate of the all subsets regression curve over a region of moderate to small model sizes. Given this information, it is then possible to select a desired model size and model with an understanding of the trade-off being made between model size and R².

Table 1: Florida Cloud Seed Data

  y       x1     x2      x3     x4    x5
  0.32     2.0   0.041   2.70   2.0   12.0
  1.18     3.0   0.100   3.40   1.0    8.0
  1.93     3.0   0.607   3.60   1.0   12.0
  2.67    23.0   0.058   3.60   2.0    8.0
  0.16     1.0   0.026   3.55   0.0   10.0
  6.11     5.3   0.526   4.35   2.0    6.0
  0.47     4.6   0.307   2.30   1.0    8.0
  4.56     4.9   0.194   3.35   0.0   12.0
  6.35    12.1   0.751   4.85   2.0    8.0
  5.74     6.8   0.796   3.95   0.0   10.0
  4.45    11.3   0.398   4.00   0.0   12.0
  1.16     2.2   0.230   3.80   0.0    8.0
  0.82     2.6   0.136   3.15   0.0   12.0
  0.28     7.4   0.168   4.65   0.0   10.0

4. APPLICATION TO DATA

The dual criteria genetic algorithm was applied to variable selection on the Florida cloud seeding data first reported by Biondini, Simpson, and Woodley (1977) and discussed extensively in Miller (1990). The point of this regression is to predict the target rainfall for days on which clouds were seeded by building a model of rainfall based upon days on which clouds were not seeded. If rainfall on days on which clouds were seeded were then to exceed the predicted values, it would be concluded that the seeding had an effect.

Rainfall was measured for 14 days on which clouds were not seeded. Five predictor values, x1 to x5, were gathered for these days (Table 1). The predictors x6 to x10 consisted of the squares of the original predictors, x1² to x5². The predictors x11 to x20 consisted of the cross products of the original five predictors taken two at a time. It is known that both forward selection and sequential replacement fail to find the best model for model sizes larger than two.

The dual criteria genetic algorithm was run on this data set for 30 generations using a λ population of size 50 and a µ population of size 20. The GA search matched the performance of these two methods for model sizes one and two while exceeding their performance on model sizes three to five. The SSresiduals for the best models at various model sizes found by the various methods are given in Table 2. A comparison of the models found for a number of interesting model sizes by the different methods is located in Table 3. In all cases, the models found by the GA can be shown to be optimal via an exhaustive search (Miller, 1990).
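In terms of the hypothetical helpers sketched earlier, such a run corresponds roughly to the following. Only the construction of the 20-column design matrix is shown concretely; the raw values come from Table 1, and the function names are our own.

import numpy as np
from itertools import combinations

# raw: 14 x 5 array of predictors x1..x5 from Table 1; y: length-14 response vector
# e.g. raw[0] = [2.0, 0.041, 2.70, 2.0, 12.0] and y[0] = 0.32

def build_design_matrix(raw):
    # columns: intercept, x1..x5, their squares x6..x10, pairwise cross products x11..x20
    squares = raw ** 2
    crosses = np.column_stack([raw[:, i] * raw[:, j]
                               for i, j in combinations(range(5), 2)])
    return np.column_stack([np.ones(raw.shape[0]), raw, squares, crosses])

# X = build_design_matrix(raw)
# best_by_size = dual_criteria_ga(X, y, n_predictors=20, lam=50, mu=20, generations=30)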

Table 2: Best SSresiduals found by model size for the various search algorithms

  Size   Forward Selection   Sequential Replacement   Genetic Algorithm
   1           26.87                 26.87                  26.87
   2           21.56                 21.56                  21.56
   3           19.49                 19.49                  12.61
   4           11.98                 11.98                  11.49
   5            9.05                  8.70                   6.61


Table 3: Best models found for different model sizes by various search methods

  Size   Forward Selection        Sequential Replacement    Genetic Algorithm
   1     x15                      x15                       x15
   2     x14, x15                 x14, x15                  x14, x15
   3     x14, x15, x17            x14, x15, x17             x9, x12, x20
   4     x12, x14, x15, x17       x12, x14, x15, x17        x9, x10, x17, x20
   5     x6, x12, x14, x15, x17   x1, x12, x15, x17, x19    x1, x2, x6, x12, x15

It should also be noted that the GA search found good models of model sizes larger than five. In this case, these models were not interesting, but in many cases they will be. Experience thus far has shown that the GA search does very well in searching through hundreds of variables.

5. CONCLUSIONS

The use of genetic algorithms appears to hold great promise for variable selection in linear regression. The two-criteria genetic algorithm which we developed allows not only for the location of desirable models, but also for an understanding of the trade-off between model size and fit when the number of candidate predictors is large.

The methods presented in this paper could be fine tuned, and a better understanding of how well the process works needs to be developed. This will come through expanded application to other data sets. Better visualization tools would also allow for a better understanding of how the algorithm could be improved.

This algorithm should easily extend to a number of other closely related applications in statistics. These include nonlinear regression, dimensionality reduction for discriminant analysis, semiparametric mixture model density estimation, and reduced kernel estimators.

6. REFERENCES

K.N. Berk, "Comparing subset regression procedures," Technometrics, Vol. 20, No. 1, pp. 1-6, February 1978.

R. Biondini, J. Simpson, and W. Woodley, "Empirical predictors for natural and seeded rainfalls in the Florida Area Cumulus Experiment (FACE) 1970-1975," Journal of Applied Meteorology, Vol. 16, pp. 585-594, 1977.

S. Chatterjee and M. Laudato, "Genetic algorithms and their statistical applications," presented at the 155th meeting of the American Statistical Association, 1995.

K.A. De Jong, personal conversation, 1995.

N.R. Draper and H. Smith, Applied Regression Analysis, New York: Wiley, 1981.

C.L. Mallows, "Some comments on CP," Technometrics, Vol. 15, No. 4, pp. 661-675, November 1973.

C.L. Mallows, "More comments on CP," Technometrics, Vol. 37, No. 4, pp. 362-372, November 1995.

Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, New York: Springer-Verlag, 1994.

A.J. Miller, Subset Selection in Regression, New York: Chapman and Hall, 1990.

A.C. Rencher and F.C. Pun, "Inflation of R2 in best subset regression," Technometrics, Vol. 22, No. 1, pp. 49-53, February 1980.

B.C. Wallet, D.J. Marchette, and J.L. Solka, "A matrix representation for genetic algorithms," Proceedings of the 10th Annual International AeroSense Symposium: Object Recognition IV, 1996.

K.R. William, "Designed Experiments," Rubber Age, Vol. 100, pp. 65-71, 1968.

