Automatic Feature Engineering for Regression Models with Machine Learning: an Evolutionary Computation and Statistics Hybrid

Vinícius Veloso de Melo (a), Wolfgang Banzhaf (b)

(a) Institute of Science and Technology (ICT), Federal University of São Paulo (UNIFESP), São José dos Campos, SP, Brazil
(b) Department of Computer Science and Engineering and BEACON Center for the Study of Evolution in Action, Michigan State University, East Lansing, MI, 48864, USA

Abstract

Symbolic regression (SR) is a well-studied task in Evolutionary Computation (EC), where adequate free-form mathematical models must be automatically discovered from observed data. Statisticians, engineers, and general data scientists still prefer traditional regression methods over EC methods because of the solid mathematical foundations, the interpretability of the models, and the lack of randomness, even though such deterministic methods tend to provide lower-quality predictions than stochastic EC methods. On the other hand, while EC solutions can be big and uninterpretable, they can be created with less bias, finding high-quality solutions that would be avoided by human researchers. Another interesting possibility is using EC methods to perform automatic feature engineering for a deterministic regression method instead of evolving a single model; this may lead to smaller solutions that can be easy to understand. In this contribution, we evaluate an approach called Kaizen Programming (KP) to develop a hybrid method employing EC and Statistics. While the EC method builds the features, the statistical method efficiently builds the models, which are also used to provide the importance of the features; thus, features are improved over the iterations, resulting in better models. Here we examine a large set of benchmark SR problems known from the EC literature. Our experiments show that KP outperforms traditional Genetic Programming - a popular EC method for SR - and also shows improvements over other methods, including other hybrids and well-known statistical and Machine Learning (ML) ones. More in line with ML than EC approaches, KP is able to provide high-quality solutions while requiring only a small number of function evaluations.

Keywords: Feature engineering, Machine learning, Symbolic regression, Kaizen Programming, Linear regression, Genetic programming, Hybrid.

Email addresses: [email protected] (Vinícius Veloso de Melo), [email protected] (Wolfgang Banzhaf)

Preprint submitted to Elsevier

October 13, 2017

1. Introduction

In a traditional regression task, one seeks to model the relationship between a dependent variable (the response) and one or more independent variables (also called explanatory variables). In a statistical regression approach, the practitioner employs a predetermined function f (or manually develops a variation) to combine the explanatory variables x in order to calculate the output y:

y = f(x, β) + ε,    (1)

where β is a set of parameters (constants, one for each variable), and ε is a measure of error. An optimization method has to adjust β to minimize the "lack of fit".

Symbolic regression (SR), on the other hand, is a non-linear regression analysis technique that generates mathematical expressions to fit a given dataset. Being an optimization approach, SR optimizes the mathematical expressions according to some criterion, such as goodness-of-fit and/or expression complexity. Another particular aspect of SR is that it assumes no a priori model; nevertheless, one may be provided. Therefore, an initial expression, or group of expressions, is randomly generated from the operand and operator sets provided by the user. Operands are the features of the dataset and other constants, such as π. Operators are functions that generate data (random distributions, for instance) or functions to be applied to the operands (arithmetical, geometrical, etc.). As SR may start with random expressions, and usually there is no mechanism to avoid specific constructions, the algorithm is free to explore the search space of solutions. Thus, it may find high-quality models that would never be discovered by humans because the relationships among the variables might not make sense from a human perspective. Nonetheless, if necessary, domain knowledge and bias can be employed in grammar-based SR algorithms [34].

One may notice that SR is a more general, mixed-type problem, where not only the parameters β must be optimized but also an appropriate function f must be found. Therefore, SR must optimize both the model structure and its parameters, while traditional regression techniques optimize the parameters of a model supplied by the user. Clearly, SR solves a substantially more difficult problem and requires particular algorithms in order to work properly, and many researchers are still looking for good heuristics to improve the search.

Over the last years, SR has been widely studied with Evolutionary Computation (EC) techniques able to produce computer code. As examples one may cite Genetic Programming (GP, [25, 5]), Multi Expression Programming (MEP, [31]), Gene Expression Programming (GEP, [15]), Grammatical Evolution (GE, [34]), Linear Genetic Programming (LGP, [8]), Cartesian Genetic Programming (CGP, [29]), Behavioral Programming (BP, [26]), and Stack-based Genetic Programming (Stack-based GP, [32, 39]). These methods evolve populations of individuals, each being a single model. Related non-EC techniques may also be found in the literature, for instance Fast Function Extraction (FFX, [27])
and Prioritized grammar enumeration (PGE, [42]); different from the EC methods, these two are very successful in performing feature engineering, which is usually defined as the process of creating relevant features from the original features in the data in order to increase the predictive power of the learning algorithm.

As previously stated, Evolutionary Computation is largely used for finding models composed of a single highly predictive feature. In EC methods, the differential survival of fitter solutions is one of the main ingredients. In most cases, a population of individuals is evolved to solve a particular task, where an individual represents a complete solution. Competition among the individuals is used to control evolution, allowing the population to converge to the best individual. There is no guarantee of convergence to the optimum, of course, due to the stochasticity of the process, only of approximation. Thus, it is important to examine techniques that can provide better guidance in stochastic global optimization tasks such as SR.

An interesting proposal for better guidance is the Cooperative Co-evolution Algorithm (CCeA, [33]). It was proposed as a Genetic Algorithm extension to provide an environment where subcomponents could "emerge" and collaboration could automatically appear. CCeA is an evolutionary approach; therefore, one expects that emergent behavior will occur. Over the years, several issues of this approach have been addressed, e.g., the need for larger populations than traditional Evolutionary Algorithms or for multiple populations (one population per subcomponent); the credit assignment problem; and the selection of subcomponents for combination, either at random or based on their individual fitnesses (good subcomponents do not always produce good solutions when put together). CCeAs have been applied with success in solving several tasks including SR [33, 41, 1, 6].

De Melo [9] proposed Kaizen Programming (KP) to also search for collaboration among subcomponents. Unlike CCeA, KP was proposed as an iterative approach focused on efficient problem-solving techniques that could come from Statistics, ML, classical Artificial Intelligence (AI), Econometrics, or other related areas. For instance, KP has been used with Logistic Regression [12, 10], CART decision trees [13], and Random Forests [36]. Also, a greedy approach was developed for solving a control problem known as the virtual Lawn Mower [37]. These techniques used by KP can be seen as powerful local optimizers that may need good starting points. For providing such points, KP searches the solution space through random search, recombination, variation, and sampling, among other methods. KP then uses those starting points as subcomponents, and the local optimization techniques try to find the best combination of the subcomponents to get the highest-quality solution. Later, one can identify what was combined and the importance given by the local optimizers to each subcomponent.

The application of KP to SR means that, instead of directly searching for a solution, KP will search for a set of features (mostly non-linear, but single features may be selected) for a known model; in this paper, a standard linear model optimized by Ordinary Least Squares. A relevant characteristic of such an approach is the posterior use of other statistical tools for further feature and
model selection (AIC, BIC, among others), to calculate prediction and confidence intervals, and to perform residual analysis, among others.

A known statistical approach related to KP for regression tasks is called basis expansion ([20], Chapter 5, Page 115): "The core idea in this chapter is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features." This basis expansion strategy is used by techniques such as MARS [17]. MARS starts with a model having only the mean of the response values as the intercept term. Then it greedily iterates, adding to the model those pairs of basis functions that give the maximum reduction in the loss function (sum-of-squares residual error). The basis functions are simply hinge functions that return the positive part of a partition, otherwise returning zero.

Another deterministic related method is Fast Function Extraction (FFX, [27]). FFX uses path-wise regularized linear learning techniques to create generalized linear models, i.e., a composition of nonlinear basis functions (formulas) with linearly-learned coefficients. The basis functions for a given complexity (tree depth) are all created and evaluated at once, and complexity increases in the course of iterations, resulting in an exponential growth in the number of basis functions. Also, as all combinations are created, there is no learning to identify poor basis functions, which are greedily selected by a stepwise procedure. Many models are built with different levels of complexity and are filtered using a multi-objective domination procedure.

Traditional statistical methods for basis expansion are deterministic and mostly limited in terms of the functions that can be used and in the complexity of the discovered bases (due to using exhaustive search). KP, on the other hand, has experts that can be either deterministic or non-deterministic, allowing, for instance, that basis functions rejected in one cycle may reappear in a later cycle and be significant.

A similar approach was taken by Icke and Bongard [21], who proposed a hybrid version in which the resulting features of many models of an FFX run are passed on to GP for another step in model building. The authors hypothesized that such an approach would increase the chances of GP to succeed by letting FFX extract informative features while GP builds more complex models. The results reported in the paper are that the hybrid algorithm provided an advantage over GP only for the bigger datasets with 10 and 25 variables. As one may notice, in this case, GP is the method expected to solve the problem; this is exactly the opposite of our approach.

In this paper, we present a deeper investigation of the technique presented in [9]. The main contributions here are more details on the framework, how we used GP's components in KP, a comprehensive experimental section, a comparison with related work from the literature, and a longer discussion on the pros and cons of KP. Nevertheless, it is important to be clear that the focus of this paper is on SR benchmark functions, while the performance on real-world
datasets has been investigated elsewhere [11, 14].

This paper is organized as follows: Section 2 discusses the most related work, Section 3 introduces the proposed hybrid method, Section 4 reports on the experiments using a number of selected benchmark functions from the literature, and Section 5 concludes. A further discussion regarding KP properties, including identified weaknesses, is presented in AppendixC. Supplementary material containing more detailed descriptive statistics of the experiments reported here is available.

2. Related Work

Although there are many SR techniques employing some kind of constant optimization, here we present the most closely related ones, meaning EC techniques that build linear models through the combination of individuals or partial solutions. Other related approaches were described in the Introduction.

Keijzer [22] investigated the use of linear regression on the output of arbitrary symbolic expressions, with application to symbolic regression. He showed that the use of a scaled error measure performed better than its unscaled counterpart on all possible symbolic regression problems.

GPTIPS [35] is a free open-source MATLAB-based software platform for symbolic data mining. GPTIPS uses a Multi-Gene Genetic Programming (MGGP) approach. In MGGP, an individual is a forest, and each tree is a feature for a multiple linear regression model optimized via OLS. GPTIPS can deal with multiple objectives via Pareto tournament selection, helping the search for high-quality yet low-complexity models. To avoid multicollinearity issues, GPTIPS uses the Moore-Penrose pseudo-inverse instead of standard matrix inversion. Following the traditional EC approach, it is expected that the multiple trees will eventually complement each other instead of representing the same information.

Arnaldo et al. [2] proposed Multiple Regression Genetic Programming (MRGP), which decouples and linearly re-combines a program's subexpressions via multiple regression on the target variable. Instead of using the output of the original individual, fitness is calculated from the output of this new model. The importance of the subexpressions in the new model is not heuristically employed to generate better expressions in further generations, as happens in KP. MRGP can also be used after a run to improve the fitness of a final, evolved solution, by trying to remove those parts that do not improve fitness.

Following KP, in Evolutionary Feature Synthesis (EFS) the population does not consist of complete models but of a set of features used to build a single model. The features tend to be small and are created stochastically. EFS starts with all original features of the dataset. Then, an iterative process starts by building a model and selecting the important features using the same technique employed by FFX (path-wise regularized linear learning). New features are created by applying unary and binary functions to the features in the current population, increasing their complexity. To deal with multicollinearity, EFS
removes features with high Pearson correlation, but a new feature is compared only to its parent. Irrelevant features are identified and removed because their estimated coefficients tend to zero.

Our method is explained next.

3. The hybrid method

Kaizen Programming (KP) was proposed in [9] as a hybrid technique based on the concepts of the Kaizen methodology. KP is an abstraction of important parts of both the Kaizen methodology and the Plan-Do-Check-Act (PDCA) [18] cycle, with the goal of applying it as a Computational Intelligence tool. Therefore, it is not a simulation of a Kaizen event. For easier understanding of KP, Table 1 relates important terms used here (taken from the Kaizen methodology) to concepts from EC and ML. From now on, we will employ EC and ML terms. However, it is also important to notice that while KP uses some EC components, most EC core characteristics are not used by KP. This statement will be explained and justified throughout the paper.

Table 1: A mapping between terms of the Kaizen Programming implemented in this work, Evolutionary Computation, and Machine Learning.

Kaizen Programming for SR               Evolutionary Computation     Machine Learning
Idea                                    Individual                   Feature
Standard                                Population                   Set of features
New ideas of an expert                  Offspring                    New constructed features
Experts (agents)                        Crossover and Mutation       Procedures to construct new features
Linear combination of ideas             Individual                   Solution
RSE or Adjusted R2                      Group/Global fitness         Loss function
Contribution measure of an individual   Individual/Local fitness     Importance

The experts are data types (the data structure and the procedures that operate on it) that propose ideas. As we chose GP to provide the experts (the tree-based structure and the evolutionary operators), it is easier to think of it as converting GP into the KP framework. Thus, the team contains functions for crossover and mutation, each of which is considered a different and independent expert. However, other algorithms can be used. In [11, 14], KP was implemented with Simulated Annealing and a linear data structure, while in [36, 37] we used Linear Genetic Programming with a particular linear data structure.

For SR, as investigated here, in ML terms we employ a cycle of feature generation followed by feature selection, where features are actually partial solutions that solve the problem by decomposition. New features are generated from the current set, are joined with the current set, and are selected based on a contribution/importance measure instead of a quality measure that treats them independently. In statistical terms, KP builds a model with non-linear basis functions which are iteratively improved by different procedures. This approach is, thus, collaborative, and features that exhibit very poor individual quality may present a very high contribution to the final complete solution. Algorithm 1 presents a high-level description of KP using GP and Ordinary Least Squares (OLS).

Algorithm 1 The hybrid method (KP) using GP and OLS.
1. Generate the initial population as CurrentPop
2. Evaluate CurrentPop
3. BestPop ← CurrentPop
4. BestPopQuality ← CurrentPopQuality
5. Loop while target is not achieved
   (a) PLAN: Generate offspring
   (b) DO: Create ExpandedPop containing CurrentPop and the offspring
   (c) CHECK:
       - Build a model on ExpandedPop, and calculate the importance of each individual
       - Select the most important individuals into ReducedPop
       - Build a new model on ReducedPop and calculate its quality
   (d) ACT:
       - Update CurrentPop if the new model is better
       - Update BestPop if the new model is better
       - Restart if necessary
6. Return BestPop, BestPopQuality
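To make the cycle concrete, the following is a minimal, self-contained R sketch of Algorithm 1. It is not the authors' implementation: ideas are plain R expressions in a single variable x, the "experts" are a crude mutation (wrap an idea in a unary function) and a crude crossover (combine two ideas with a binary operator), the model is fit with lm(), and the per-feature p-values play the role of the contribution measure of Eq. (9). The target function and all helper names are hypothetical.

# Toy KP cycle for SR in R (illustrative sketch only, not the paper's code).
set.seed(42)
x <- seq(-1, 1, by = 0.01)
y <- 3 * sin(x) + x^2                   # hypothetical target for this demo

unary <- c("sin", "cos", "tanh", "abs")
random_leaf <- function() if (runif(1) < 0.5) quote(x) else round(runif(1, -5, 5), 2)
random_idea <- function() call(sample(unary, 1), random_leaf())
expert_mutate <- function(idea) call(sample(unary, 1), idea)                # one expert
expert_cross  <- function(a, b) call(sample(c("+", "-", "*"), 1), a, b)     # another expert

evaluate <- function(ideas)             # DO: feature matrix (n x number of ideas)
  sapply(ideas, function(e) rep_len(eval(e, list(x = x)), length(x)))

contribution <- function(Fm, y, alpha = 0.05) {   # CHECK: Eq. (9), without the theta test
  fit <- lm(y ~ Fm - 1)                           # a single OLS fit for all ideas
  p <- rep(1, ncol(Fm))                           # dropped/collinear columns get no credit
  p[!is.na(coef(fit))] <- coef(summary(fit))[, "Pr(>|t|)"]
  ifelse(p > alpha, 0, 1 - p)
}
quality <- function(ideas) summary(lm(y ~ evaluate(ideas) - 1))$adj.r.squared

ps <- 5; os <- 5
current <- replicate(ps, random_idea(), simplify = FALSE)
best_q <- quality(current)
for (cycle in 1:30) {
  offspring <- replicate(os, if (runif(1) < 0.5)                    # PLAN
      expert_mutate(sample(current, 1)[[1]])
    else
      expert_cross(sample(current, 1)[[1]], sample(current, 1)[[1]]),
    simplify = FALSE)
  expanded  <- c(current, offspring)                                # DO
  contrib   <- contribution(evaluate(expanded), y)                  # CHECK
  candidate <- expanded[order(contrib, decreasing = TRUE)[1:ps]]
  q <- quality(candidate)
  if (q > best_q) { current <- candidate; best_q <- q }             # ACT
}
best_q; current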

We use the OLS method to build the multiple linear regression models because it is a well-known method, frequently employed, fast, and efficient, even though it has some important requirements; for instance, the number of examples must be bigger than the number of features. Obviously, any regression method that provides some form of feature importance can be used. Failures must be treated properly in the implementation (in the current implementation, model-building errors are caught and treated as exceptions that return poor-quality results). OLS converges to the single optimum as it solves a convex optimization problem [7]. Hence, in every iteration KP efficiently finds a local optimum, and can get closer to the global optimum in subsequent iterations by providing increasingly better response surfaces (sets of features) to OLS.

The stages of the proposed method are explained below.

3.1. PLAN

This stage is the first difference from traditional GP. GP has a parameter called crossover probability, directly related to the offspring size. After applying crossover, usually a single child is generated, which may then undergo mutation. In KP, the offspring size (os) is independent of the population size (ps); thus, individuals are continuously generated and stored until os is reached. For the SR task investigated here, there are only two complete solutions being considered at every cycle, no matter the number of ideas. This means that if there are
only two ideas, or a thousand, all of them are used in a single model; therefore, it is a single "fitness" evaluation. In GP, a thousand individuals mean a thousand fitness evaluations.

The second difference is that there is no biased selection mechanism (nor selection pressure) for mating (there is a selection for the next cycle); thus, all individuals have the same probability of being selected. The reason is simple: the population has the most important individuals, and they are all used at the same time to build a single complete solution; thus, all of them should be improved. For EC methods, selection pressure plays a very important role and there are several procedures for it; some of the most popular are roulette wheel, rank selection, and tournament selection.

The third difference is that, as explained before, KP experts are independent of each other. Therefore, an individual can be generated by crossover, crossover and mutation, or just mutation; different types of crossover and mutation can be applied. As an example, suppose a problem with only one feature (x) and ps = 3; then, an arbitrarily generated population could be: Ind1 = -(x^2)/(123.91 - x + tanh(10)); Ind2 = -13.502 * sin(x); Ind3 = sqrt(abs((5.21343) * x)).

3.2. DO

Here is another characteristic of KP (the fourth difference to GP, and similar to FFX). To calculate the importance of each feature, KP analyzes the current and new features together, so a posterior feature selection can be employed. KP creates a feature matrix F of dimension n x (ps + os), where n is the number of observations in the training set; thus, F contains the features calculated from ExpandedPop. This stage is another clear difference from most related methods, which evaluate each feature separately and then select the best ones for aggregation, trying to predict which features would be useful to the model.

3.3. CHECK

The fifth difference is that GP provides the features but does not solve the problem (the opposite of the hybrid approach in [21]). In this phase, KP creates a linear model using F and y (the response variable), and calculates the coefficients using OLS. OLS minimizes the sum of squared residuals, which are the squared vertical distances between the observed responses in the sample dataset and the responses predicted by the linear model. Using the previous example with features F1, F2, and F3, the model is created in the form:

ŷ_i = β̂_1 F_{i,1} + β̂_2 F_{i,2} + β̂_3 F_{i,3} + ε_i,    (2)

where ŷ_i, i = 1, ..., n, is the calculated output for a particular input, ε_i is the error term, usually described as the effect of the variables that were omitted
from the equation, and β̂_1, β̂_2, and β̂_3 are the coefficients estimated by OLS using

β̂ = (F^T F)^{-1} F^T y.    (3)

Since the KP variant investigated here employs OLS, only the weights β̂_i can be optimized by it. Any constant configuration, for instance, sin(x * CONST), CONST/x, (1 - CONST), or CONST^2, is updated via GP mutation. For efficiency purposes, such constants will be optimized by more adequate numerical optimization methods in future work. One can also notice that Eq. 2 does not explicitly contain an intercept. In fact, it is expected that one of the features is a constant generated by KP and becomes the intercept. Nevertheless, the intercept can be easily enforced into the model.

The main point here is that this linear model is used only to guide the search for better features. Hence, it is important to be clear that the coefficients estimated by OLS are not included in the standard solution (set of ideas, partial solutions, individuals) during the search, only in the final solution. This avoids increasing solution complexity.

In order to determine the contribution of each feature to the complete solution (its individual fitness), we do not use the sum of residuals, i.e., the part of the overall model error that the feature accounts for, to assign credit to each individual. Instead, we calculate its p-value, which may be interpreted as the probability of keeping the quality of the model after dropping that particular feature; therefore, high probabilities suggest that changes in that particular feature are not associated with changes in the response. This is the sixth difference: in KP individuals collaborate, while in GP they compete. In [27, 21], FFX starts with no features and uses stepwise selection to include those with coefficients different from zero, one at a time, until reaching a maximum number of features.

To select features for KP, the variance-covariance matrix C of the estimated regression coefficients has to be calculated using the estimated error variance σ̂²:

σ̂² = Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − r),    (4)

C = σ̂² (F^T F)^{-1},    (5)
where r is the number of features in the input matrix. In order to calculate the contributions, one uses F as the input matrix, so r = ps + os. It is important to note that the same individual may have a distinct contribution in distinct models, since σ̂² is part of this measure. The variance-covariance matrix is used to calculate the standard error (se) of a particular coefficient β̂_j:

se(β̂_j) = sqrt(C_{jj}).    (6)

Using se, one can test the null hypothesis H0: β̂_j = 0 using the t-statistic:

t_{β̂_j} = β̂_j / se(β̂_j).    (7)

Finally, the p-value of feature j is calculated using:

p-value_j = 2 × (1 − T(df, |t_{β̂_j}|)),    (8)

where T is the cumulative distribution function of Student's t-distribution, df are the residual degrees of freedom, and |t_{β̂_j}| is the absolute value of the observed t-statistic.

If p-value_j is not significant (p-value_j > α), then the individual (evaluated as a feature) was not "useful enough" to help solve the problem and will be discarded in the next cycle. A penalty also applies to duplicated/collinear features, for which we set p-value_j = 1.0, the maximum value (in matrix F, when two columns are collinear, we keep the one that appears first and set β̂_j = 0.0 for the second one, which is later removed; better analyses could be used to determine which feature should be removed). Another procedure is the removal of features downscaled to very small values, because β̂_j F_j ≈ 0; for this purpose, we provide a user-defined threshold θ. McConaghy [27] uses a somewhat similar approach in his FFX method: it ignores zero-value (i.e., useless) coefficients and avoids large-value coefficients that could result in an over-fitted model. Since one wants to maximize the contribution of a feature, contrib_j is penalized and set to 0.0 in these cases. The use of this contribution criterion to judge importance is a key characteristic of the method for performing feature selection, which may also help to control bloat. In summary:

             | 0.0,              if p-value_j > α
contrib_j =  | 0.0,              if |β̂_j| < θ            (9)
             | 1 − p-value_j,    otherwise.

In this work, we employ a hard constraint for the p-value. However, if feasible models are slow to appear, one could start with a higher value for α and reduce it over time; otherwise, empty models (with no significant features) could be returned. We deal with this issue by keeping α small and constant, but running KP for more cycles, expecting that, at some point, significant features will appear.

It should be noted that distinct features may have distinct scalings. Thus, it could be useful to standardize the dataset, something often suggested for
Artificial Neural Networks and Support Vector Machines. Currently, however, we employ a single θ for all features. θ may also be bypassed altogether by setting it to a negative value.

An example to show the importance of the CHECK stage is as follows. For our example above, let the target function y be f(x) = sin(x), where -1.0 ≤ x ≤ 1.0, spaced by 0.01, giving 201 points. Now, suppose that one is using a traditional EC algorithm to solve the symbolic regression problem and that the current best individual in the population is fbest(x) = -(x^2)/(123.91 - x + tanh(10)) - 13.502 × sin(x) + sqrt(abs((5.21343) × x)). A conventional GP method inserts expressions trying to reduce the error caused by previously inserted expressions. A substantial increase in solution size without a corresponding increase in solution quality is a common result, called bloat. Confronted with such a situation, GP would normally perform a local-search algorithm, which would, however, cost many evaluations.

If that current best solution had been found in a KP run, it would have three contributing individuals: Ind1 = -(x^2)/(123.91 - x + tanh(10)); Ind2 = -13.502 × sin(x); and Ind3 = sqrt(abs((5.21343) × x)). These individuals would be calculated into three features (F1, F2, and F3) and joined to become the matrix F. One obtains Table 2 after performing multiple linear regression, as in Eq. 2, and collecting the corresponding statistics.

Table 2: Resulting statistics after running OLS on x and f(x).

Feature   β̂              p-value
F1        7.98513e-15     5.99e-01
F2        -7.40631e-02    < 2e-16
F3        -4.54363e-18    4.85e-01

Using Eq. 9 and α = 0.05 (the commonly adopted significance level), one can conclude that only F2 is significant. The other two features are therefore discarded, which leads to a new best fit of fnew_best(x) = -13.502 × sin(x). When the estimated coefficient is applied to this new solution, one has fnew_best(x) = (-0.0740631) × (-13.502) × sin(x), resulting in fnew_best(x) = 1.0 × sin(x), which is the exact target function. This process is executed in a single KP iteration, thus saving a substantial number of function evaluations. However, it is important to remember that the estimated coefficient is applied only to the final solution. During the evolutionary process, the estimated coefficients are used only to calculate the contribution of individual expert ideas, which are used to guide the search.

The previous steps are performed to evaluate the importance of features as partial solutions. Nonetheless, the new complete solution also has to be evaluated. One possible measure is the residual standard error (RSE = sqrt(σ̂²), see Eq. 4), since it is an unbiased estimate of the true standard error of the sample. The quality of the fit increases as this error is minimized. The other possible approach, the one employed here, is to compare how much of the initial variation in the sample dataset can be reduced by regressing on F.
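The worked example above can be checked with a few lines of R (a sketch, not code from the paper): build the three features, regress y = sin(x) on them with OLS via lm(), and read the per-feature p-values and the adjusted R² from the model summary. The constant in Ind3 is read here as 5.21343; its exact value does not affect the conclusion that only F2 is significant.

# Reproduce the CHECK-stage statistics of the sin(x) example with plain OLS.
x  <- seq(-1, 1, by = 0.01)                    # 201 points
y  <- sin(x)
F1 <- -(x^2) / (123.91 - x + tanh(10))
F2 <- -13.502 * sin(x)
F3 <- sqrt(abs(5.21343 * x))
fit <- lm(y ~ F1 + F2 + F3 - 1)                # Eq. (2): no explicit intercept
summary(fit)$coefficients                      # estimates, t-statistics, p-values
summary(fit)$adj.r.squared                     # adjusted R-squared of the fit (cf. Eq. 10)
# Only F2 should appear clearly significant; its estimate is about -0.0741,
# and -0.0741 * (-13.502) recovers the unit coefficient of sin(x).
# (R may warn about an essentially perfect fit, since the target is recovered exactly.)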

In Statistics, the coefficient of determination R² is the proportion of variability in the dataset that is accounted for by a particular statistical model. While R² is a goodness-of-fit measure, a variant, the adjusted R², is a comparative measure of suitability for models with distinct sets of features. Adjusted R² is calculated as follows:

Adj.R² = 1 − (1 − R²) (n − 1)/(n − r − 1)
       = R² − (1 − R²) r/(n − r − 1),    (10)

with

R² = 1 − [ Σ_{i=1}^{n} (y_i − ŷ_i)² ] / [ Σ_{i=1}^{n} (y_i − ȳ)² ],    (11)

where r is the number of features in the reduced F (the subset containing the most important features, limited by the minimum between ps and the number of significant features of F). While R² may increase just because the number of features increases and actually never decreases (see [30], page 27), adjusted R² increases only if a new feature improves the model more than would be expected by chance. Since KP generates distinct models with different numbers of features, selected according to their contribution to the model, the adjusted R² is a more appropriate measure for model selection. Therefore, KP tries to increase the quality of the information that is fed to the linear model, instead of simply aggregating more information.

3.4. ACT

In this stage, selection takes place. It performs corrective actions so as to adjust the current standard (dubbed population). The new solution, built using only the significant features (limited by ps), replaces the CurrentPop only if it is better. A new cycle begins if the termination criterion is not met. A parsimony procedure could be employed here, accepting a slightly lower-quality model that is less complex.

Given that KP employs a small number of individuals, as opposed to the hundreds of individuals commonly required by EC techniques, a restart procedure may be necessary to escape from stagnation if the current best solution remains the same for a number of iterations defined by the user. The restart procedure saves the current standard, increases the number of features (ps) by an expansion factor EF, and generates new random features to fill the new slots. The user can choose to restart with completely random new features or keep some of those from the current standard, like the elitism approach in traditional GP. Nonetheless, it is easy to notice that KP restarts near the previous search-space region if some of the current individuals remain.

3.5. Resulting solution

As a demonstration, we discuss the resulting solution when applying our algorithm to a symbolic regression problem that uses OLS as the model builder. We use the following function in 5 variables:

f(x0, x1, x2, x3, x4) = -5.41 + 4.9 * (x3 - x0 + x1/x4) / (3 * x4),    (12)
where x_i = U(-50, 50). As a solution, we expect an additive function that approximates the response variable, not necessarily an exact function with error zero. However, a single feature may be the final solution if it is sufficient. A KP run using the configuration of Table 8 resulted in the best solution shown in Figure 1. While 8 features were used, only 4 were found to be significant. The weights were calculated by OLS. In Figure 2(a), the values of the different features are plotted using 50 random samples of x_i, i = 0, ..., 4. Figure 2(b) shows the composite final output ŷ and compares it to the original function of Equation (12). The values of the observed and predicted outputs overlap completely, i.e., the exact function was found up to slight approximation errors. The overlap can also be seen in Figure 2(c), where the quality measures show a near-perfect fit.

f_found(x0, x1, x2, x3, x4) =
    (0.261141430577 * (-1 * ((1/((-1 * (((1/(-1 * (x0))) * (1/-6.2545928837239835)))) * (-1 * (x4)))))))
  + (-1.63333333333 * (-1 * ((1/((-1 * (((1/(x4 * x3)) * x4))) * (-1 * (x4)))))))
  + (-1.63333333333 * (-1 * ((1/((-1 * (((1/x1) * x4))) * (-1 * (x4)))))))
  + (0.6446225282 * (-1 * ((1/(1/8.3925084267666)))))

(a)

f_simplified(x0, x1, x2, x3, x4) =
    -1.6333333333324 * x0/x4     // F1
  + 1.63333333333 * x1/x4^2      // F2
  + 1.63333333333 * x3/x4        // F3
  - 5.41000000000209             // F4

(b)

Figure 1: (a) the best solution found by a KP run; (b) the simplified version of (a).
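A quick numerical check (not part of the paper) confirms that the simplified expression in Figure 1(b) matches Eq. (12) up to rounding of the evolved constants; x2 does not appear because the target does not use it. In R:

# Compare the simplified KP solution from Fig. 1(b) against the target Eq. (12).
set.seed(1)
n  <- 50
x0 <- runif(n, -50, 50); x1 <- runif(n, -50, 50)
x3 <- runif(n, -50, 50); x4 <- runif(n, -50, 50)
target <- -5.41 + 4.9 * (x3 - x0 + x1 / x4) / (3 * x4)                 # Eq. (12)
found  <- -1.6333333333324 * x0 / x4 + 1.63333333333 * x1 / x4^2 +
           1.63333333333 * x3 / x4 - 5.41000000000209                  # Fig. 1(b)
max(abs(target - found))   # negligible; differences stem only from the rounded constants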

It is interesting to note that one of the features (F4) contributes a constant to the solution. The individual error of that feature is high, of course, since it is a very poor fit to the data. In a traditional GP run, such an individual would be discarded. However, since KP is a collaborative approach and the quality of an individual (a feature, a partial solution) is its contribution or importance to the final solution, that constant was maintained. Table 3 shows the resulting statistics using the approximation found in Figure 1, where one may see the calculated coefficients (β̂) and the significance of each final feature (a p-value < 0.05 is significant).

Bloat can be reduced in this approach because non-significant features can be easily identified and removed, though large solutions might still occur in KP if a large number of features and big expressions are allowed. However, as the final solution is a linear model with several features, a post-processing feature selection may be used to reduce the number of features and the overfitting (a sketch of such a step is given below).
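As one concrete illustration of such a post-processing step (an assumption of this sketch, not a procedure from the paper), the final linear model can be pruned with a standard AIC-based stepwise search in R; F1-F4 below are placeholder features filled with synthetic data only to make the snippet executable:

# Prune a final additive model with backward stepwise selection on AIC.
set.seed(7)
d <- data.frame(F1 = rnorm(100), F2 = rnorm(100), F3 = rnorm(100), F4 = rnorm(100))
d$y <- 2 * d$F1 - 0.5 * d$F3 + rnorm(100, sd = 0.1)        # F2 and F4 are irrelevant here
full    <- lm(y ~ ., data = d)
reduced <- step(full, direction = "backward", trace = 0)   # AIC-guided feature removal
summary(reduced)                                           # the irrelevant features are dropped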

Figure 2: (a) Plot of significant features (see Fig. 1-b) over 50 random sample points; (b) Observed (Eq. 12) and Predicted (Fig. 1-b) outputs; (c) Observed versus Predicted outputs with quality measures (R² = 1, MAE = 1.38e-11, RMSE = 3.96e-11).

Table 3: Resulting statistics for f_found.

Feature   β̂             p-value
F1        2.611e-01      —
F2        -1.633e+00     —
F3        -1.633e+00     —
F4        6.446e-01      —

The R script below evaluates a small and a large evolved solution for the target f(x) = x^3 + x^2 + x on newly sampled points and compares their residuals (the protected-operator definitions at the top are partly reconstructed, as noted in the comments):

# Protected operators. The first halves of these definitions did not survive the
# source text; the guards below follow the usual protected-operator pattern and
# are therefore a reconstruction, not the verbatim original.
safeDiv = Vectorize(function(lt, rt) if (abs(rt) > 1e-16) lt / rt else 1e6)
safeExp = Vectorize(function(x) exp(min(x, 300)))   # original guard not recoverable
safeLog = Vectorize(function(x) if (abs(x) > 1e-16) log(abs(x)) else 0)

##############################
# new data
##############################

npoints = 20

# Run the code again to have new *random* points
x = runif(npoints, -1, 1)

y = x^3 + x^2 + x

y_solSmall = (7.3279773053 * x) + (-6.33990946248 * sin(x)) + (0.997468453012 * (x * x))

y_solBig = (-0.782293773635 * sin(safeExp(x))) +
           (2.36144688938 * ((x - 1.0) + safeExp(1.0))) +
           (-2.22275271899 * sin((x + x))) +
           (-7.24514581586 * cos(cos((1.0 - x)))) +
           (0.0106223446492 * ((x - 1.0) + ((x * 1.0) * (safeDiv(1.0, x))))) +
           (-3.08702453607 * cos(safeExp(1.0))) +
           (0.003613203796 * (safeLog(x) + safeExp(1.0))) +
           (0.370244807271 * ((1.0 - x) * sin(x)))

# Statistics of the residuals (prediction error)
cat("Training set: Statistics of the residuals (small solution)\n")
print(summary(y - y_solSmall))

cat("Training set: Statistics of the residuals (big solution)\n")
print(summary(y - y_solBig))

# Plot the expected data (y) and the predicted data
par(mfrow = c(2, 1))
matplot(cbind(y, y_solSmall, y_solBig), type = 'o', lwd = c(1, 4, 6), cex = 2,
        main = '1: y  2: y_solSmall  3: y_solBig', ylab = 'f(x)', xlab = 'points')
boxplot(data.frame("Small Solution" = y - y_solSmall, "Big Solution" = y - y_solBig),
        main = 'Residuals')

AppendixB. Keijzer benchmark set

The sensitivity analysis for the corresponding functions is presented here for reference. Table B.22 shows the results of the runs using the function set provided in Table 7. Figure B.5 shows the violin plots. As explained earlier, this function set does not provide all operators used in the benchmark problems; therefore, finding the exact solution is not possible for all of them. One can see that only for functions keijzer5, 6, 8, and 10 did KP find solutions that were sufficiently good for the test set.

Table B.22: Mean results for the Keijzer benchmark set using the test data set. Smallest absolute values are in bold face.

Function    Conf1 (MAE_test / NLSE_test / RMSE_test)   Conf2 (MAE_test / NLSE_test / RMSE_test)   Conf3 (MAE_test / NLSE_test / RMSE_test)
keijzer1    0.05476 / 0.06292 / 0.06434                0.007816 / 0.01291 / 0.02268               0.3907 / 4.536 / 4.53
keijzer2    1.389 / 76.36 / 76                         0.1096 / 0.2952 / 0.2388                   141.8 / 8928 / 8927
keijzer3    0.839 / 42.62 / 41.95                      0.2326 / 0.7337 / 0.5366                   0.1746 / 0.6684 / 0.6793
keijzer4    0.1731 / 0.5823 / 0.2328                   0.06882 / 0.03623 / 0.08667                0.04591 / 0.01136 / 0.05834
keijzer5    0.001718 / 9.032E-05 / 0.002916            8.861E-06 / 2.849E-09 / 1.852E-05          9.969E-06 / 1.131E-07 / 5.139E-05
keijzer6    0.09592 / 0.04391 / 0.1323                 0.0003385 / 7.032E-07 / 0.0005593          5.664E-05 / 5.585E-08 / 0.0001071
keijzer7    0.01777 / 0.0009216 / 0.02337              2.708E-06 / 4.867E-10 / 5.054E-06          1.747E-08 / 2.149E-13 / 1.063E-07
keijzer8    2.54E-15 / 7.957E-30 / 2.832E-15           2.62E-15 / 5.373E-30 / 2.855E-15           1.327E-14 / 7.614E-27 / 2.088E-14
keijzer9    0.02312 / 0.008679 / 0.07088               0.00456 / 0.0216 / 0.1038                  0.07267 / 1.947 / 2.124
keijzer10   0.0119 / 0.002863 / 0.0164                 0.001799 / 0.000105 / 0.002884             0.001168 / 6.779E-05 / 0.001839
keijzer11   2.909E+07 / 2.291E+08 / 2.272E+08          1.836E+09 / 1.444E+10 / 1.432E+10          1.971E+10 / 1.552E+11 / 1.539E+11
keijzer12   2.154 / 11.14 / 11.52                      0.001217 / 2.611E-05 / 0.005146            2.002E+08 / 1.576E+09 / 1.563E+09
keijzer13   1.548E+07 / 1.121E+08 / 1.114E+08          3.874E+08 / 3.042E+09 / 3.017E+09          2E+11 / 1.575E+12 / 1.562E+12
keijzer14   8.945E+05 / 7.044E+06 / 6.986E+06          2.055E+08 / 1.619E+09 / 1.605E+09          4.07E+10 / 3.205E+11 / 3.179E+11
keijzer15   0.147 / 0.1767 / 0.3217                    0.0001166 / 8.07E-07 / 0.0002907           5.025E+04 / 3.957E+05 / 3.925E+05

Function    Conf4 (MAE_test / NLSE_test / RMSE_test)   Conf5 (MAE_test / NLSE_test / RMSE_test)   Conf6 (MAE_test / NLSE_test / RMSE_test)
keijzer1    0.05452 / 0.05965 / 0.06318                0.00872 / 0.02051 / 0.02728                0.3541 / 3.513 / 3.494
keijzer2    0.1746 / 0.7361 / 0.389                    0.1174 / 0.3452 / 0.2608                   5.282 / 307.3 / 307.2
keijzer3    0.3437 / 3.298 / 2.515                     0.2951 / 2.97 / 2.831                      0.1892 / 0.5791 / 0.5727
keijzer4    0.1583 / 0.3214 / 0.208                    0.06634 / 0.02526 / 0.08253                0.04923 / 0.01361 / 0.06249
keijzer5    0.001163 / 4.451E-05 / 0.00221             9.795E-06 / 5.925E-09 / 2.026E-05          1.947E-06 / 4.854E-10 / 4.631E-06
keijzer6    0.106 / 0.05582 / 0.1505                   0.0004566 / 3.495E-06 / 0.0007642          9.487E-05 / 2.687E-07 / 0.0001877
keijzer7    0.01717 / 0.0008665 / 0.02179              1.076E-07 / 2.192E-13 / 1.408E-07          2.489 / 25.47 / 2.509
keijzer8    2.256E-15 / 4.453E-30 / 2.501E-15          2.945E-15 / 2.155E-29 / 3.41E-15           1.902E+05 / 6.022E+06 / 6.019E+06
keijzer9    0.01527 / 0.00386 / 0.05299                0.00391 / 0.009094 / 0.08732               5.02E+93 / 1.589E+95 / 1.588E+95
keijzer10   0.01308 / 0.003789 / 0.01802               0.002765 / 0.003184 / 0.006125             0.0007759 / 4.474E-05 / 0.001361
keijzer11   2.129E+07 / 1.476E+08 / 1.461E+08          4.791E+07 / 3.665E+08 / 3.635E+08          2.838E+09 / 2.232E+10 / 2.214E+10
keijzer12   0.1896 / 0.02793 / 0.2759                  3.763E-05 / 1.348E-08 / 0.0001013          5E+04 / 3.938E+05 / 3.905E+05
keijzer13   8.703E+07 / 6.783E+08 / 6.728E+08          3.888E+08 / 3.055E+09 / 3.03E+09           2.994E+10 / 2.354E+11 / 2.335E+11
keijzer14   4.801E+07 / 3.781E+08 / 3.75E+08           1.622E+08 / 1.277E+09 / 1.266E+09          6.18E+09 / 4.867E+10 / 4.827E+10
keijzer15   1.977E+05 / 1.557E+06 / 1.544E+06          0.0007148 / 6.538E-05 / 0.00332            4.917E+06 / 3.872E+07 / 3.841E+07

Figure B.5: Violin and dot plots (sensitivity analysis) of final results for the Keijzer benchmark set.

AppendixC. Further discussion

In this section, we discuss some important questions raised during our research and by the anonymous reviewers. Many questions are still open and should be answered in future work.

What is the main aspect of KP that makes it succeed? In the task investigated in this paper, KP performs feature expansion from the current best partial solutions and then applies feature selection to test the new reduced feature set. The importance of a feature is calculated from information given by the model itself, so features are only as important as the model can tell they are. We understand that the expansion/selection mechanism is crucial, but it seems that the most important part of the methodology is the technique responsible for creating the final solution. This final solution should adequately select and combine partial solutions based on their importance to the model, thus performing an efficient local search to find the best way to combine the solutions. While symbolic regression can be a non-linear optimization task, OLS quickly solves a convex optimization task using useful partial solutions discovered by the experts, which are improved over the cycles. Having said that, it is clear that the evolutionary algorithm is not the main technique; instead, it searches for good starting points for the local-search method.

What is the individual importance of crossover and mutation? Based on the results reported, it seems that crossover is more useful for expressions that cannot be decomposed into partial solutions for the additive model. At the moment, there is no experimental analysis to confirm this impression, so we have to defer it to future work. But as previously explained, one may use experts that do not rely on mutation and crossover at all.

How big are the solutions produced by KP? The short answer is that they were usually small when compared to GP trees of depth d = 17, which is the maximum size commonly adopted in the literature; final average sizes are not available from most work in the literature for comparison. The long answer is presented next, using the maximum size allowed for each partial solution. Suppose an empty solution and that each new feature Fj is of depth d, the maximum allowed depth. After Fj is multiplied by its corresponding coefficient βj, it grows by one level. Thus, each final feature has a minimum depth of d + 1. This first feature is then inserted into the empty solution. When two features are added together, the addition term grows the tree another level; therefore, the new solution has depth d + 2. For each new feature, the solution tree grows by just one level, corresponding to the addition term. Consequently, the maximum tree depth is d + ps, as shown in Figure C.6. For a model with the intercept, the final formula is d + ps + 1.

Figure C.6: Example of tree depth for different ps, where the minimum d = 0.

In terms of nodes, the maximum number in a perfect binary tree is 2^(d+1) − 1, d > 0. When the coefficient is multiplied by the partial solution, the new sub-tree has two more nodes; thus, 2^(d+1) + 1 nodes. To combine two partial solutions (ps = 2), one inserts an addition node; thus, 1 + 2 × (2^(d+1) + 1) nodes. Therefore, a complete solution contains a maximum of (ps − 1) + ps × (2^(d+1) + 1) nodes, which can be checked in Figure C.6. Now, back to our experimental section: for the second experimental analysis, with a maximum of ps = 10 × D experts (D = 5 variables) and maximum depth d = 10, the worst-case scenario gives 102,499 nodes in a single complete solution. This number lies between the sizes of perfect tree solutions with d = 15 (65,535 nodes) and d = 16 (131,071 nodes).

Even though the exact size of the final solution is not available in this paper, the supplementary material contains information on the final size, the total number of partial solutions evaluated (Evals), and the total number of non-terminal nodes (NodeEvals), considering all partial solutions, even when they are repeated. With such information, one can calculate the average number of non-terminal nodes in a partial solution as NodeEvals/Evals. One may then notice that this value is around 4 for many functions, even the hardest ones (see Tables 7-18 in the supplementary material). In a perfect tree, the number of nodes is N = 2 × I + 1, where I is the number of non-terminal nodes. Thus, N = 2 × 4 + 1 = 9. For such a tree, the depth is approximately d = log2(N + 1) − 1 = log2(10) − 1 ≈ 2.32, which is much less than d = 10. Using d ≈ 2.32 and ps = 50, the maximum number of nodes in the complete solution, using the previous formula, is about (50 − 1) + 50 × (2^(2.32+1) + 1) ≈ 599, which is much less than the theoretical limit of 102,499 nodes. It is still a large number, but it is an upper bound supposing that all 50 partial solutions were important and that all of them were perfect trees. Finally, in the same supplementary material one may check the column named FSN (Final Solution Number of Nodes) and find that the mean values for this example (see Table 18) varied from 8.42 nodes (keijzer8) to 277.2 nodes (keijzer5).
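The arithmetic above is easy to re-check; the short R snippet below (added here as a worked check, not from the paper) evaluates the same bounds:

# Worked check of the size bounds discussed above.
d  <- 10; ps <- 50
(ps - 1) + ps * (2^(d + 1) + 1)      # 102499: worst-case node count for d = 10, ps = 50
2^(15 + 1) - 1                       # 65535: perfect tree of depth 15
2^(16 + 1) - 1                       # 131071: perfect tree of depth 16
I  <- 4                              # average non-terminal nodes per partial solution
N  <- 2 * I + 1                      # 9 nodes in such a (near-)perfect tree
d2 <- log2(N + 1) - 1                # about 2.32
(ps - 1) + ps * (2^(d2 + 1) + 1)     # about 599, far below the worst case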

For problems where KP finds the exact solution, does the solution consist of a single term ("feature") or multiple terms? For most problems that could be adequately decomposed into an additive model, more than one partial solution was used. However, for some problems one partial solution was responsible for explaining almost all variation while the others modeled small fluctuations. When a single partial solution is the exact solution, one may argue that a single feature was needed (a single GP method in the current implementation).

What is the novelty presented by KP? The idea of KP is to have 1) procedures to generate partial solutions to expand the current set, preferably learning from past trials; 2) a procedure to efficiently build a complete solution and calculate the importance of the partial solutions in order to perform selection; and 3) a procedure to efficiently build a complete solution and calculate the quality of that complete solution. The complete solution is created by a technique that forces cooperation, instead of expecting its emergence. KP is not linked to any particular technique for any of those three aspects. GP procedures were used here because GP is a well-known symbolic regression technique. OLS was used because it is popular, has solid statistical foundations, and is the simplest possible model, being in addition fast and efficient. Adjusted R² was used because models have distinct numbers of parameters and need to be penalized to keep complexity down. The p-value was used for feature selection purposes because it is a direct measure of a variable's importance in a linear model. Finally, model improvement is guided by a variable's importance, not by how well it solves the problem on its own. KP is, indeed, similar to FFX and the hybrid method proposed by [21], but the differences have already been presented in this paper. Comparisons to more methods are presented in Section 2.

What are the weaknesses of the methodology? As stated before, KP is an alternative search method that can provide good solutions after a few cycles. However, it has some characteristics that the user must take into account to make a reasonable choice. KP works by decomposing the original problem into partial solutions; if this cannot be done, this implementation of KP will likely fail. In that case, one should use other methods, such as GP, which evolve individuals as single solutions. Population-based EC techniques are intrinsically parallel, as the individuals are independent. Several works in the literature have populations of thousands of individuals. Therefore, one may have, in theory, an arbitrarily large population and evaluate it in a single step. KP, on the other hand, requires a serial procedure to build the model, even though the partial solutions can be calculated in parallel.

The current implementation uses OLS, which makes some strong assumptions about the data, while most related techniques have no such requirements. This issue could be reduced by replacing OLS with a more flexible technique. KP has three main modules and, thus, three sets of parameters to be adjusted, which can be time-consuming. However, when using restarts, KP can start with simpler models (few and small features) and become more complex (many and large features) over the cycles, reducing the number of parameters that must be tuned in advance.

