Learning Graphical Models for Parameter Tuning
Mauro Birattari, Marco Chiarandini, Marco Saerens, Thomas Stützle
IRIDIA – Technical Report Series Technical Report No. TR/IRIDIA/2011-002 January 2011
IRIDIA – Technical Report Series ISSN 1781-3794 Published by: IRIDIA, Institut de Recherches Interdisciplinaires et de Développements en Intelligence Artificielle
Université Libre de Bruxelles, Av. F. D. Roosevelt 50, CP 194/6, 1050 Bruxelles, Belgium. Technical report number TR/IRIDIA/2011-002
The information provided is the sole responsibility of the authors and does not necessarily reflect the opinion of the members of IRIDIA. The authors take full responsibility for any copyright breaches that may result from publication of this paper in the IRIDIA – Technical Report Series. IRIDIA is not responsible for any use that might be made of data appearing in this publication.
Abstract We introduce a new method for deciding the values of categorical and numerical parameters of optimization algorithms. The method is based on graphical models and Bayesian learning. Each parameter is modelled by a node of the network and parameter dependencies by arcs. Each node has an associated local probability distribution. Both discrete and continuous variables can be treated, assuming Gaussian linear regression for the latter. Learning is achieved by a combination of importance sampling techniques and Bayesian calculus. We describe the method and review the main elements of the theory underlying its components. We then present its application to simple cases and compare its performance with methods from the literature. The results show that the method achieves results comparable to the state of the art while being arguably more principled and offering interesting features, among them the flexibility to handle different tuning scenarios, the possibility of including prior knowledge, and the output of relevant information. We make available, in the form of an R package, an implementation of the method that also works in parallel environments.
1 Introduction
Author affiliations: M. Birattari and T. Stützle, IRIDIA, Université Libre de Bruxelles. M. Chiarandini, IMADA, University of Southern Denmark, Campusvej 55, DK-5230 Odense, Denmark. M. Saerens, Machine Learning Group, Université catholique de Louvain, 1348 Louvain-La-Neuve, Belgium.
The need to calibrate parameters by tuning is present everywhere in science and engineering. In computer science and operations research, algorithms for solving optimization problems typically have a large number of inherent parameters, often related to heuristic choices, that cannot be assessed by theoretical analysis alone. Calibration and tuning must be carried out by executing computational experiments,
first in the laboratory on synthetic data and then in conditions similar to those of real-life applications. This procedure can be burdensome because of the large number of candidate configurations to test and the high cost in time that each single run of the algorithm may imply. Moreover, it is of low intellectual content. Parameters in optimization algorithms can be numerical or categorical. Examples are the branching rule in a branch and bound algorithm, the type of neighborhood in a local search heuristic, the rate of evaporation of pheromone in ant colony optimization, or the algorithmic modules and their order in hybrid solvers. The heuristic algorithm that ranked third in the international timetabling competition in 2007 had eighteen parameters to decide [16]. The open source ZIB Optimization Suite, a solver for mixed integer programming (MIP) problems,¹ exposes up to 876 parameters, and the IBM ILOG CPLEX system,² the widely used commercial solver that represents the cutting edge of MIP solver technology, has at least 63 parameters that can take non-default settings [26]. Moreover, it is well known that algorithms may behave differently depending on the input data they have to solve. That is, the best parameter setting depends on the input data and may vary from one type of input data to another. Often, to mitigate the effect of overtuning on specific instances, algorithms are allowed to take random decisions, thus introducing stochasticity in the results and further complicating the situation. In recent years, a considerable amount of research within computer science has been concerned with the development of appropriate automatic methods for calibrating optimization algorithms by tuning. Given a problem, a set of solution algorithms with relevant parameters exposed, and a description of the input data, the goal is to find the setting of parameter values that is most likely to perform best.
¹ Zuse Institute Berlin (ZIB), 2003-2010. ZIB Optimization Suite. http://zibopt.zib.de/.
² IBM ILOG CPLEX, 2010. High-performance mathematical programming engine. http://www.ibm.com/software/integration/optimization/cplex/
In the literature, this undertaking is sometimes referred to as the algorithm tuning and configuration problem, or the algorithm selection problem. A formalization of the latter can already be found in Rice (1975) [39]. It can be seen as a subproblem in the wider area of algorithm engineering [43], which deals with the whole production cycle of algorithms, including design, theoretical analysis, implementation, testing and experimental evaluation. The main stream of this research is an interdisciplinary approach with the field of statistics [5]. The main aspects of statistics that are appealing in this context are the objectivity provided by the mathematical framework and the existing body of methods for estimating the contribution of factors while minimizing the number of experiments to perform. This latter category includes advanced techniques for experimental design, response surface modelling and sequential testing. In this work, we describe a general method for the task of automatic tuning based on a framework different from that of previous work: we look at the problem from the perspective of graphical models. Graphical models are well-known abstractions for representing knowledge, developed in the field of machine learning within artificial intelligence. This representation, in computer-readable form, is manipulated by various algorithms to learn and make inferences. Graphical models are well suited to handling uncertainty due to different causes, such as lack of knowledge or the inherent stochasticity of events. Moreover, they can cope with situations
where not all possibilities can be considered because they are simply too many. This is achieved by shifting the analysis of the different possibilities into the framework of probability calculus and focusing instead on the likelihood of possible outcomes. Our contribution is the formalization of the tuning task as a reasoning activity in a graphical model that encodes the tuning parameters and how they relate to one another. The idea is to use a network in which nodes represent the parameters to tune and connections represent the dependencies between these parameters. Each node is a random variable with a probability distribution that reflects the knowledge about the parameter that it represents. We then seek to learn the configuration of parameter values that is most likely to perform best, in the same way as expert systems learn the most probable explanation of the input evidence. In our specific context, this requires a further step: recognizing a stochastic optimization problem and solving it in a similar vein as in rare event simulation. In other terms, we wish to modify the joint probability distribution of the model in such a way that it becomes more likely to infer configurations of parameter values that perform well. We achieve this by sampling configurations from the network and learning only from those that perform best. We carry out the task of learning the joint distribution by revising previous knowledge on the basis of new evidence within the well-known framework of Bayesian calculus. The framework within which we cast the tuning task leads to a few favorable features. We are able to treat all types of parameters (categorical, integer and continuous), even though dealing with mixed types of parameters in graphical models is still challenging and a focus of current research.
It is also possible to handle different types of dependencies, such as those of probabilistic type or those of nesting, which arise whenever a parameter choice is meaningful only in the presence of other choices. We restrict ourselves to learning probability distributions, but future work could also consider learning the structure of dependencies among the tuning parameters. Experiments show that the performance of the proposed method is comparable to that of previously proposed tuning approaches in the field of optimization. At the same time the new method exhibits favorable features that none of the previous methods has. Among them are the possibility of visualizing results, flexibility with respect to the many different tuning scenarios that arise in optimization, a solid mathematical foundation in probability calculus, the ability to define and use a priori knowledge, and the extraction from the final model of relevant information on which parameters are important. In order to foster research on and diffusion of the method, we make available, in the form of an R package, our implementation, which also works in a parallel environment. R is a free software environment for statistical computing and graphics that is rapidly gaining users in both academia and industry. We start by reviewing the recent literature on the topic of tuning in optimization in Section 2. In Section 3 we introduce the notation and define the problem. The same section also contains a sketch of the tuning algorithm put forward in this paper. Section 4 refines the description of the algorithm, reviewing the elements of Bayesian calculus that are used to carry out the learning phase. Section 5 presents numerical results on three cases. The first two exemplify the
method while the third case is used to carry out experiments on the parameters of the method and for comparison with another tuning method from the literature. We conclude in Section 6 summarizing the main features of the method introduced and looking at future work.
2 Literature
We distinguish two main approaches to algorithm tuning: the offline strategy and the online strategy. Offline tuning considers a set of training instances that are representative of the population of instances for which the algorithm is to be tuned. The outcome of the tuning is a configuration of parameters that has the best average performance on the population of instances. Measures other than the average, such as the median or other quantiles, can also be used. The main feature of this approach, however, is that the configuration has to be decided a priori and cannot be adapted to the specific instance. Online tuning considers instead the possibility of changing the parameter configuration while executing the algorithm. Thus values are decided according to the ongoing results of the algorithm run. In general, tuning methods for the online case can be used in the offline context as well. Gagliolo (2010) showed that the offline policy is clearly dominated by selecting a posteriori the best configuration on each specific instance. More interesting from a practical point of view is the fact that an online strategy can also outperform the offline one [19]. This author presents different ways of accomplishing the task of online tuning by solving the problem of time allocation among configurations. The study is, however, limited to the time-to-goal response, and how this should extend to solution quality is not clear. Between the offline and the online strategy there is a whole range of possibilities that have been referred to as static selection [39]: given a new unseen instance, recognize some fundamental properties of the instance and choose the configuration of algorithm parameters accordingly. There is a considerable body of recent literature on both strategies. In this work we focus on offline tuning and hint at how our method can be adapted for static selection as well. One of the first serious attempts is perhaps F-race by Birattari et al. (2002) [9].
This method implements a sequential testing procedure called a race in the machine learning literature. At each new training instance a set of surviving candidate configurations is executed, and candidates that prove statistically inferior are discarded and not run on successive instances. The success of F-race is due to the statistical test used: the Friedman test, which transforms results into ranks, thus making it easy to handle the issue of different scales among training instances. The pitfall of the method is the need to decide a priori the configurations on which to focus the tests, without modifying them on the basis of the results collected. An attempt to work around this issue is Iterated F-race [2]. An approach that has received considerable attention in the literature is the model-based approach, with roots in advanced statistics and kriging [30]. Its main components are space-filling designs, i.e., designs that spread points evenly throughout the experimental region (e.g., fractional factorial designs, central composite designs,
Latin Hypercube Designs [33]) and model-based regression. Several applications of these techniques, focused not on tuning but rather on screening factors with a significant impact, have appeared in the literature [40, 7]. Their use as a tuning method in optimization was instead proposed by Bartz-Beielstein (2006) [4]. The Sequential Parameter Optimization method proposed there has the main feature of iterating the process of sampling design points and fitting a regression model. This work has spawned a considerable body of related research involving experts from statistics (see for example [6]). The goal of this thread is to refine several issues, such as deciding which among linear regression, splines, neural networks, support vector regression, and kriging models are most appropriate [11]. The pitfall of the literature on this approach is that, although the method is presented as an offline tuning method, it is always used on one single instance. A recent attempt to handle multiple training instances within this framework has appeared in [25], but the resulting method is quite intricate. Tuning of algorithm parameters corresponds to optimization of a function where no expression is available for the function or its derivatives. Due to the numerical solution and/or the real-world phenomena being modeled, the objective function landscape can be multimodal and noisy. This view led to the attempt to solve the tuning problem by means of model-free search heuristics. Within this paradigm the main proposal has been ParamILS [26]. A fundamental difference between the tuning problem and search in optimization is that the evaluation of a single tentative configuration implies running computationally intensive simulations, and this may be time consuming. Hence one of the main features of search heuristics, namely the possibility of quickly visiting many solutions, can hardly be achieved.
Recent results indeed seem to indicate that model-based approaches coupled with principled sampling of design points achieve better results than model-free search heuristics [24]. Other approaches along these lines come from the field of evolutionary algorithms [45]. The method we put forward in this work can be seen as a Bayesian optimization algorithm (BOA) on the space of algorithm parameters. BOA [36, 37] has so far been used to solve optimization problems directly, by modelling the variables that compose a solution as nodes of the network and the dependencies among the variables as arcs. The algorithm uses inference from the network to sample new solutions and learns both the parameters of the probability distributions at the nodes and the structure of the network. In fact, most emphasis is put on learning the latter [35]. A related approach is the estimation of distribution algorithm [31]. All applications of these methods of which we are aware have focused on discrete variables and not on mixed variables as we intend to pursue here. Another related approach, whose main idea of rare event simulation is reused here, is the cross entropy method [38]. However, neither of these methods has been applied to the problem of algorithm tuning as we present it here.
3 Problem Definition and Proposed Approach
We denote random variables by upper-case letters (e.g., $X$, $Y$, $X_i$, $\Theta$) and their realized values by lower-case letters (e.g., $x$, $y$, $x_i$, $\theta$). Sets or vectors of variables and their realizations are denoted by bold-face letters (e.g., $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{X}_i$, $\boldsymbol{\Theta}$ and $\mathbf{x}$, $\mathbf{y}$, $\mathbf{x}_i$, $\boldsymbol{\theta}$). We use $p(X = x)$ or its shorthand $p(x)$ to denote the
probability of an event $X = x$. We also use $p(x)$ to denote the probability distribution of $X$ (both mass and density function). Whether $p(x)$ refers to a probability or a probability distribution will be clear from the context. Let $x$ be a vector of algorithm parameters corresponding to a point in the design space $\mathcal{X}$ of all possible combinations of parameter values. Each element $x_i$ can be a categorical parameter indicating which algorithm component to use, a discrete parameter, or a continuous (i.e., real-valued) parameter. The performance measure $y(\cdot)$ is a real-valued function whose domain is the subset $\mathcal{X}$ of the $d$-dimensional Euclidean space $\mathbb{R}^d$. Under parameter setting $x$ an algorithm produces a single real-valued response that depends on some random variables. Among these random variables we distinguish at least $\Pi$, indicating the instance, and $E$, indicating the pseudo-random seed and other possible noisy factors such as the computational environment, hardware components, etc. For a fixed point $x$ the performance measure is a random variable induced by the distribution of $\Pi$ and $E$. To emphasize this dependency we write the performance as a random function $Y(x, \Pi, E)$. We indicate observed data by $d = \{(x_1, \pi_1, \epsilon_1, y(x_1, \pi_1, \epsilon_1)), \ldots, (x_n, \pi_n, \epsilon_n, y(x_n, \pi_n, \epsilon_n))\}$. In the tuning and configuration problem we aim at determining the design point $x^*$ that minimizes³ the mean performance over the instances and random seeds, i.e.,
$$\mu(x^*) = \min_{x \in \mathcal{X}} \mathrm{E}_{\pi,\epsilon}[y(x, \pi, \epsilon)] \tag{1}$$
Given the independence between $\Pi$ and $E$ we can rewrite:
$$\mu(x) = \mathrm{E}_{\pi,\epsilon}[y(x, \pi, \epsilon)] = \int_{\Pi} \int_{E} y(x, \pi, \epsilon)\, p(\pi)\, p(\epsilon)\, d\pi\, d\epsilon$$
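As an illustration of the objective in Equation (1), the following Python sketch estimates the mean performance and a lower quantile of a configuration by averaging over sampled instances and seeds. The function `run_algorithm` is a hypothetical stand-in for executing the algorithm under a configuration (here a single numeric parameter); it is an assumption made only for this example and is not part of the method described in this report.

```python
import math
import random
import statistics


def run_algorithm(x, instance, seed):
    """Hypothetical stand-in for y(x, pi, eps): a noisy quadratic cost."""
    rng = random.Random(1000 * instance + seed)  # noise tied to (instance, seed)
    return (x - 0.3) ** 2 + 0.1 * instance + rng.gauss(0.0, 0.05)


def responses(x, instances, seeds):
    """Run configuration x once for every (instance, seed) pair."""
    return [run_algorithm(x, pi, eps) for pi in instances for eps in seeds]


def estimate_mean(x, instances, seeds):
    """Monte Carlo estimate of mu(x), the mean over instances and seeds."""
    return statistics.fmean(responses(x, instances, seeds))


def estimate_quantile(x, instances, seeds, alpha):
    """Empirical lower alpha-quantile of the observed responses."""
    ys = sorted(responses(x, instances, seeds))
    k = max(0, math.ceil(alpha * len(ys)) - 1)
    return ys[k]


instances, seeds = range(5), range(20)
for x in (0.1, 0.3, 0.9):
    mean = estimate_mean(x, instances, seeds)
    median = estimate_quantile(x, instances, seeds, 0.5)
    print(x, round(mean, 3), round(median, 3))
```

Because the same instances and seeds are reused for every configuration, the comparison between configurations is paired; this is a design choice of the sketch, not a requirement stated in the text.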
³ We assume minimization problems, the conversion to maximization problems being straightforward.
⁴ In the literature, $Y(x, \Pi, E)$ has sometimes been assumed to be a Gaussian random function or Gaussian stochastic process for all $x \in \mathcal{X}$ [44] (although the focus of this thread of research has been rather on one single instance, that is, $Y(x, E \mid \pi)$ [24]). In [3, 17], we modeled $Y(x, \Pi, E)$ as a hierarchical Gaussian function, which removes the need for specifying the correlation function of the Gaussian random function. In particular, we defined a linear model $Y(x) = x\beta + T + E$ where $T$ and $E$ have distribution $N(0, \sigma_\tau)$ and $N(0, \sigma)$, respectively.
More generally, we can define (implicitly) $\xi^\alpha(x)$, the lower $\alpha$-quantile of the distribution of $Y(x, \Pi, E)$, by
$$\Pr[y(x, \pi, \epsilon) \le \xi^\alpha(x)] = \alpha$$
If the distribution of $y(x, \pi, \epsilon)$ were skewed, a better choice than the mean would be to minimize the median $\xi^{0.5}(x)$. Similar definitions are possible for the $\alpha$-quantiles $\xi^\alpha(x^*)$. Given the random nature of $y(\cdot)$, problem (1) is a stochastic optimization problem.⁴ One way to solve the stochastic optimization problem (1) is by repeatedly sampling in the design space $\mathcal{X}$. Then, by a technique known as rare event simulation, the joint probability distribution is updated in such a way that sampling important points becomes more likely. To apply this technique, we model the design point as a vector of random variables, one element for each parameter in the design space, which we denote
$X = (X_1, X_2, \ldots, X_d)$. Further, let $\Pr(X = x)$ be a joint probability distribution on the design space. We use this distribution to repeatedly sample design points. More precisely, we are interested in the distribution $p(\mu(X) \le \gamma, \Theta = \theta)$, which is a family of joint probability density and mass functions on $\mathcal{X}$. The parameters $\Theta$ of the probability distribution are uncertain variables and $\theta$ are the values that correspond to the possible true value of the probability of drawing exactly $x^*$. If the probability of the event $\{\mu(x) \le \gamma\}$ is very small, as when we are trying to estimate the minimum of $\mu(x)$, we call this a rare event. The cross entropy method [38] estimates the extreme value by minimizing over $\theta$ the Kullback-Leibler divergence between probability density functions. Computing the joint probability distribution in the presence of dependencies between the parameters by means of the cross entropy is not straightforward, because the calculations to derive the values of $\theta$ in closed form are rather involved. Instead, we approach this learning within the Bayesian framework, which can efficiently handle interdependencies between variables. To make this task computationally affordable, the common practice is to define a graphical model of dependencies among variables. A graphical model consists of a directed acyclic graph $G = (V, A)$ that encodes a set of conditional independence assertions about the variables in $X$ representing the algorithm parameters. Each vertex corresponds to a parameter, and links are established between vertices if the corresponding parameters are supposed to interact. Associated with the vertices is a set $P$ of local probability distributions, one for each variable. Together these components define the joint probability distribution $p(\mu(X) \le \gamma, \Theta = \theta)$.⁵ The set $P$ is expressed by means of conditional probability tables that define the probability distribution of each parameter given the values of the parameters on which it depends.
We use $X_i$ to denote both the variable and its corresponding vertex, and $Pa_i$ to denote the parents of node $X_i$ in $G$ as well as the variables corresponding to those parents. The absence of arcs in $G$ encodes the conditional independencies. In particular, given the structure $G$, the joint probability distribution for $X$ is given by
$$p(x) = \prod_{i=1}^{d} p(X_i = x_i \mid Pa_i = pa_i) \tag{2}$$
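The factorization in Equation (2) can be illustrated with a small discrete network. The sketch below is a minimal Python example under assumed data: the parameter names, their values and the conditional probability tables are hypothetical and chosen only for illustration.

```python
import itertools

# Hypothetical two-parameter network: neighborhood -> tabu_length.
PARENTS = {"neighborhood": (), "tabu_length": ("neighborhood",)}

# Conditional probability tables, indexed by the tuple of parent values.
CPT = {
    "neighborhood": {(): {"swap": 0.6, "insert": 0.4}},
    "tabu_length": {
        ("swap",): {"short": 0.7, "long": 0.3},
        ("insert",): {"short": 0.2, "long": 0.8},
    },
}


def joint_probability(x):
    """Product over the nodes of p(X_i = x_i | Pa_i = pa_i), as in Equation (2)."""
    prob = 1.0
    for var, parents in PARENTS.items():
        parent_values = tuple(x[p] for p in parents)
        prob *= CPT[var][parent_values][x[var]]
    return prob


# Enumerate all configurations; their probabilities sum to one.
total = 0.0
for nb, tl in itertools.product(["swap", "insert"], ["short", "long"]):
    p = joint_probability({"neighborhood": nb, "tabu_length": tl})
    total += p
    print(nb, tl, round(p, 2))
print("total", round(total, 10))
```

Only one table row per parent configuration has to be stored, which is what keeps the representation compact when the graph is sparse.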
The local probability distributions $P$ are the distributions corresponding to the factors in the product of Equation (2). Hence, the pair $(G, P)$ encodes the joint probability $p(X)$. We then need to determine the local probability distributions $p(x_i \mid Pa_i)$. For example, in the case of discrete variables we need to define one distribution for $X_i$ for every configuration of $Pa_i$. We wish to learn these probabilities from the observations. Hence, we treat their values as uncertain, and model the parameters $\theta$ of the probability distribution as random variables $\Theta$. Given a random sample $d$ from the true probability distribution (or the one we wish to learn) and a prior probability $p(\theta)$, we can determine the posterior $p(\Theta = \theta \mid D = d)$ from the observations collected by means of Bayes' theorem:
$$p(\theta \mid d) = \frac{p(\theta)\,p(d \mid \theta)}{p(d)} = \frac{p(\theta)\,p(d \mid \theta)}{\int p(d, \theta)\,d\theta} = \frac{p(\theta)\,p(d \mid \theta)}{\int p(d \mid \theta)\,p(\theta)\,d\theta} \tag{3}$$
⁵ To simplify the notation we use $p(X)$ in place of $p(\mu(X) \le \gamma)$.
The probability $p(d)$ is a normalization constant that does not depend upon $\theta$ and we will often omit it in the rest of the paper. In Bayesian terms, the tuning problem of predicting the best configuration corresponds to computing the joint probability distribution of the network that results from the observations collected, using the a priori knowledge on $p(\Theta)$. If we let $d = (x_1, \ldots, x_n)$ be the previously seen design points, the probability distribution that we wish to compute is $p(X_{n+1} = x_{n+1} \mid D = d)$. Then, to make predictions on $x_{n+1}$ we consider the expected value over all possible configurations of $\theta$:
$$p(x_{n+1} \mid d) = \mathrm{E}_{\theta}[p(x_{n+1} \mid d, \theta)] = \int p(x_{n+1} \mid d, \theta)\, p(\theta \mid d)\, d\theta. \tag{4}$$
We can finally sketch the proposed algorithm for parameter tuning by learning graphical models in Algorithm 1. The input is a Bayesian network defined by a graph structure and by the probability distributions. We assume a priori knowledge of both as input to the algorithm. We wish to learn the probability parameters from the data while keeping the structure of the network fixed. We learn sequentially by repeating two steps: (i) inferring from the network to sample good configurations and executing them; (ii) updating the parameters from the best of the sampled configurations. To obtain probabilistic inference on high-performing configurations in (i), it is enough to fix one of the possibly many topological orderings of the network $G$ and then, for each vertex in that order, draw a value for the corresponding variable, given the outcome of the parent nodes and the local probability distribution. In the second step, at each iteration of the loop (1-9) a batch of $n$ configurations constitutes the data base on which the network parameters are learned. Note that we allow configurations to be present multiple times both in the sample of line 2 and in the data base of line 6, as this might be representative of the fact that they attain good performance. However, when executing the configurations in line 3 we use different seeds. In the following sections we explain in detail how to carry out the learning phase, introducing the probability distribution families to be used in (4) and (3).
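The two alternating steps described above can be sketched in Python for a toy network with two categorical parameters. Everything specific in this sketch is an assumption made for illustration: the parameter names, the `cost` function standing in for one run of the tuned algorithm, and the use of Dirichlet pseudo-counts (reviewed in Section 4.2) as the learned probability parameters. It is not the R implementation mentioned in this report.

```python
import itertools
import math
import random

ORDER = ["component", "strength"]  # one topological ordering of the graph
PARENTS = {"component": (), "strength": ("component",)}
VALUES = {"component": ["a", "b"], "strength": ["low", "high"]}
# Hypothetical mean costs; ("b", "high") is the best configuration by construction.
BASE_COST = {("a", "low"): 3.0, ("a", "high"): 2.0,
             ("b", "low"): 2.5, ("b", "high"): 1.0}


def init_counts(prior=1.0):
    """One table of pseudo-counts per node and per parent configuration."""
    counts = {}
    for var in ORDER:
        parent_configs = itertools.product(*(VALUES[p] for p in PARENTS[var]))
        counts[var] = {pa: {v: prior for v in VALUES[var]} for pa in parent_configs}
    return counts


def sample_config(counts, rng):
    """Inference step: draw each node given its parents, in topological order."""
    x = {}
    for var in ORDER:
        table = counts[var][tuple(x[p] for p in PARENTS[var])]
        values = list(table)
        x[var] = rng.choices(values, weights=[table[v] for v in values])[0]
    return x


def cost(x, instance, rng):
    """Hypothetical noisy response of one run on one instance."""
    mean = BASE_COST[(x["component"], x["strength"])]
    return mean + 0.1 * instance + rng.gauss(0.0, 0.3)


def tune(iterations=30, m=20, rho=0.25, seed=0):
    rng = random.Random(seed)
    counts = init_counts()
    for instance in range(iterations):  # a new instance at every iteration
        sample = [sample_config(counts, rng) for _ in range(m)]
        ranked = sorted((cost(x, instance, rng), i) for i, x in enumerate(sample))
        n = math.floor(rho * m)  # size of the data base of best configurations
        for _, i in ranked[:n]:
            # Learning step: posterior pseudo-counts become the new prior.
            for var in ORDER:
                pa = tuple(sample[i][p] for p in PARENTS[var])
                counts[var][pa][sample[i][var]] += 1.0
    return counts


counts = tune()
for var in ORDER:
    for pa, table in counts[var].items():
        total = sum(table.values())
        print(var, pa, {v: round(c / total, 2) for v, c in table.items()})
```

Repeated configurations are kept both in the sample and among the best ones, and each run draws fresh noise, in line with the remarks above.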
4 Bayesian Learning
In the learning phase of Algorithm 1 we wish to learn the values of the probability parameters $\theta$ that make it more likely to sample configurations that achieve the best average results. We carry out this task using well-known Bayesian theory. This section does not include new material; rather, it is a review of this theory. Wider and more accurate presentations can be found in various textbooks on the subject [20, 29]. We first discuss the single variable case without dependencies and consider three types of variables: binomial, discrete and continuous. The binomial case is a subcase of the discrete type but we treat it separately as an introductory example. We then extend the theory to multiple variables and conditional dependencies encoded in a network. In this case we treat first the case with only discrete variables and then the mixed case of discrete and continuous variables, whose theory is more involved [12]. A central aspect of the theory is choosing the probability distributions in such a way that Equations (3) and (4) can be solved analytically. This can be achieved for posterior distributions $p(\theta \mid x)$ that are in the same family as the prior probability distribution $p(\theta)$. Probability distributions with this property are called conjugate distributions.
Algorithm 1: Parameter tuning by learning graphical models
input : A Bayesian network $G = (V, A)$, prior probability parameters $\theta$, a class of instances $\Pi$
output: Probability parameters $\theta$ for $G$ learned from data
1 while computational resource budget not exceeded do
      take an unseen instance $\pi \in \Pi$;
2     inference step: sample $m$ design points $(x_1, \ldots, x_m)$ from the graph model $G$ using the prior local conditional probabilities $p(x \mid d)$ of Equation (4);
3     run all design points once on instance $\pi$;
4     order the results from best to worst, $y(x_{(1)}) < \ldots < y(x_{(m)})$;
5     evaluate the empirical $\rho$-quantile of the distribution of $y$ derived from the observed data, i.e., $\hat{\gamma}_t = y(x_{(\lfloor \rho m \rfloor)})$;
6     let $d = (x_{(1)}, \ldots, x_{(n)})$ with $n = \lfloor \rho m \rfloor$ be the data base of best configurations on which we wish learning to occur;
7     learning step: calculate the posterior distributions of $\theta$ by means of Equation (3), the dependency structure of $G$ and the data base $d$;
8     make the posterior parameters $\theta$ the new prior parameters;
4.1 Binomial parameters
We consider a single variable $X$ with domain $\{0, 1\}$. Let $\theta$ represent the probability of sampling 1 each time the parameter is sampled, and $1 - \theta$ the probability of sampling 0. Then, we can write $p(X = 1 \mid \theta) = \mathrm{Bern}(\theta) = \theta$, which is known as the Bernoulli distribution. Further, for a set of $n$ observed outcomes $d = (x_1, \ldots, x_n)$ of which $s$ are 1s, we have the binomial sampling model. It states that
$$p(D = d \mid \theta) = p(s \mid \theta) = \mathrm{Bin}(s \mid \theta) = \binom{n}{s} \theta^{s} (1 - \theta)^{n - s} \tag{5}$$
Here $\theta$ corresponds to the exact probability of having outcome 1. The prior distribution of $\theta$ represents the population from which the $\theta$ of current interest has been drawn; in the subjective representation of knowledge, our level of uncertainty about $\theta$ is captured by the random realization of $\theta$ from the prior distribution. In the absence of any reason to favor particular values of $\theta$, a plausible prior distribution is the uniform distribution. A more flexible distribution that includes the previous case is the beta distribution. In the following we assume the prior distribution of $\Theta$ to be a beta distribution, that is,
$$p(\theta) = \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
The function $\Gamma(\cdot)$ is the Gamma function. The parameters $\alpha$ and $\beta$, with $\alpha > 0$ and $\beta > 0$, are hyperparameters of the prior distribution. We recall that the mean of a beta distribution is $\mathrm{E}[\theta] = \frac{\alpha}{\alpha + \beta}$. The beta distribution has the conjugacy property, that is, the posterior distribution has the same functional form as the prior. This property is convenient because the posterior can be derived in closed form and has a known parametric form:
$$p(\theta \mid d) = \frac{p(d \mid \theta)\,p(\theta)}{p(d)} = \frac{\mathrm{Bin}(s \mid \theta)\,p(\theta)}{p(d)} \propto \mathrm{Beta}(\theta \mid \alpha + s, \beta + (n - s)).$$
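We obtain the predictive posterior distribution for a new sample $x_{n+1}$ by averaging over all possible values of $\theta$. Using the probability expansion rule we have
$$p(X_{n+1} = 1 \mid d) = \int p(X_{n+1} = 1 \mid \theta)\, p(\theta \mid d)\, d\theta = \mathrm{E}_{p(\theta \mid d)}[\theta] = \int \theta\, \mathrm{Beta}(\theta \mid \alpha + s, \beta + n - s)\, d\theta = \frac{\alpha + s}{\alpha + \beta + n} \tag{6}$$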
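The conjugate update and the predictive probability of Equation (6) reduce to simple arithmetic on the hyperparameters. The following Python sketch shows this for a uniform prior; the function names and the example counts are assumptions made for illustration.

```python
def beta_posterior(alpha, beta, s, n):
    """Hyperparameters of the posterior Beta(alpha + s, beta + n - s)."""
    return alpha + s, beta + (n - s)


def predictive_one(alpha, beta, s, n):
    """Probability that the next draw is 1, as in Equation (6)."""
    a_post, b_post = beta_posterior(alpha, beta, s, n)
    return a_post / (a_post + b_post)


# Uniform prior Beta(1, 1); 7 of the 10 observed cases have the parameter set to 1.
print(beta_posterior(1.0, 1.0, s=7, n=10))
print(predictive_one(1.0, 1.0, s=7, n=10))
```

With these numbers the posterior is Beta(8, 4) and the predictive probability is 8/12, slightly closer to 1/2 than the observed frequency 7/10 because of the prior.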
The result is the posterior probability of drawing a 1 from the population. In the limit of large $n$, the hyperparameters of the prior distribution have no influence on the posterior. The hyperparameters of the beta distribution can be assessed using the data available, a technique known as imagined future data. Let $i = 1, \ldots, n$ identify a case in the data base. At the $i$th case, let $s_{i-1}$ be the number of 1s in the previous cases and $n_{i-1}$ the number of 0s. Then,
$$p(X_1 = 1) = \frac{\alpha}{\alpha + \beta}$$
$$p(X_2 = 1 \mid X_1 = 1) = \frac{\alpha + 1}{\alpha + \beta + 1}$$
$$p(X_i = 1 \mid s_{i-1}, n_{i-1}) = \frac{\alpha + s_{i-1}}{\alpha + \beta + i - 1}$$
and one can solve to find the values of $\alpha$ and $\beta$. With new data, the beta posterior distribution $p(\theta \mid d)$ can be updated using batch learning or sequential learning. In batch learning, $p(\theta \mid d)$ is found by updating the hyperparameters of $p(\theta)$ with all cases in $d$ at the same time, i.e., in a batch. In sequential learning, $p(\theta)$ is updated one case at a time, using the previous posterior distribution as the prior distribution for the next case to be considered. When the database $d$ is complete, batch learning and sequential learning lead to the
same posterior distribution and the final result is independent of the order in which the cases in $d$ are processed [8]. It is of course also possible to process batches sequentially, that is, a set of cases at a time, with each new set of cases added to an already processed database. In Algorithm 1 we follow the last approach. Let $t = 1, \ldots, T$ identify an iteration of the while loop of the algorithm. In the $t$th iteration let $n_t$ be the number of configurations below the threshold $\hat{\gamma}_t$ and $s_t$ the number of these configurations that had parameter $X$ set to 1. We model the $s_t$'s as independent binomial data, conditional on sample size $n_t$ and iteration-specific means $\theta_t$. We assume the means to have a beta distribution. We want to form a prior distribution for $\theta_{t+1}$ using the data up to $t$. Hence,
$$p(X_1 = 1) = \frac{\alpha}{\alpha + \beta}$$
$$p(X_2 = 1 \mid s_1, n_1) = \frac{\alpha + s_1}{\alpha + \beta + n_1}$$
$$p(X_{t+1} = 1 \mid s_1, n_1, \ldots, s_t, n_t) = \frac{\alpha + \sum_{j=1}^{t} s_j}{\alpha + \beta + \sum_{j=1}^{t} n_j}$$
At the beginning of this process, when no data are available, the equivalent samples technique can be used, which assigns a minimal equal sample size to each outcome and evaluates $\alpha$ and $\beta$ accordingly.
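The equivalence between processing all cases at once, one case at a time, and batch by batch can be checked directly on the hyperparameters. The Python sketch below uses three hypothetical batches of binary outcomes, one per iteration of the loop; the data and function name are assumptions made for illustration.

```python
def update(alpha, beta, cases):
    """Update beta hyperparameters with a list of binary cases."""
    s = sum(cases)
    return alpha + s, beta + len(cases) - s


# Hypothetical batches of best configurations from three iterations,
# recording whether the binary parameter X was set to 1.
batches = [[1, 0, 1], [1, 1], [0, 1, 1, 1]]
prior = (1.0, 1.0)

# Batches processed sequentially: each posterior is the prior of the next batch.
post_batches = prior
for batch in batches:
    post_batches = update(*post_batches, batch)

# All cases in a single batch.
all_cases = [case for batch in batches for case in batch]
post_all = update(*prior, all_cases)

# One case at a time.
post_seq = prior
for case in all_cases:
    post_seq = update(*post_seq, [case])

print(post_batches, post_all, post_seq)
alpha_t, beta_t = post_batches
print(alpha_t / (alpha_t + beta_t))  # predictive probability for the next iteration
```

All three schemes end with the same hyperparameters, so only the running totals of 1s and of cases need to be stored between iterations.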
4.2 Multinomial parameters
Let now $X$ be a discrete variable with $r$ possible outcomes $x^1, \ldots, x^r$ and let $\theta = (\theta_1, \ldots, \theta_r)$, $\sum_{k=1}^{r} \theta_k = 1$, be the parameters that correspond to the true probability of realization of each outcome. That is,

$$p(X = x^k \mid \theta) = \theta_k. \tag{7}$$
Further, for a set of $n$ outcomes $d = (X_1 = x_1, \ldots, X_n = x_n)$, let $(s_1, \ldots, s_r)$ be the number of times $X = x^k$, $k = 1, \ldots, r$, in $d$. The probability distribution of the sample is

$$p(D = d \mid \theta) = \frac{n!}{s_1! \cdots s_r!} \prod_{k=1}^{r} \theta_k^{s_k}.$$

The prior distribution assumed for $\theta$ is the Dirichlet distribution

$$p(\theta) = \mathrm{Dir}(\theta \mid \alpha_1, \ldots, \alpha_r) = \frac{\Gamma\left(\sum_{k=1}^{r} \alpha_k\right)}{\prod_{k=1}^{r} \Gamma(\alpha_k)} \prod_{k=1}^{r} \theta_k^{\alpha_k - 1} \tag{8}$$

which is a generalization of the beta distribution to the multinomial case. Here, the hyperparameters are $\alpha_k$, $k = 1, \ldots, r$. The posterior probability is thus

$$p(\theta \mid d) \propto \mathrm{Dir}(\theta \mid \alpha_1 + s_1, \ldots, \alpha_r + s_r) \tag{9}$$

and the predictive posterior distribution of $x_{n+1}$ is

$$p(X_{n+1} = x^k \mid d) = \int \theta_k \, \mathrm{Dir}(\theta \mid \alpha_1 + s_1, \ldots, \alpha_r + s_r) \, d\theta = \frac{\alpha_k + s_k}{\sum_{l=1}^{r} \alpha_l + n} \tag{10}$$
Assessing the Dirichlet distribution can be done as above by the imagined future data technique, which can be easily derived from (10).
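As a small illustration of the Dirichlet update and of the predictive probabilities in Equation (10), the following Python sketch may help. The outcome labels, the symmetric prior with all hyperparameters equal to 1 and the observed cases are hypothetical values chosen for the example; the function names are not part of any package.

```python
from collections import Counter


def dirichlet_posterior(alphas, counts):
    """Add the observed counts s_k to the prior hyperparameters alpha_k."""
    return {k: alphas[k] + counts.get(k, 0) for k in alphas}


def predictive(alphas_post):
    """Predictive probability of each outcome for the next case."""
    total = sum(alphas_post.values())
    return {k: a / total for k, a in alphas_post.items()}


outcomes = ["none", "2-opt", "linkern"]
prior = {k: 1.0 for k in outcomes}                 # assumed symmetric prior
data = ["linkern", "linkern", "2-opt", "linkern"]  # hypothetical cases

post = dirichlet_posterior(prior, Counter(data))
pred = predictive(post)  # (alpha_k + s_k) / (sum of alpha_l + n)
```

Outcomes that were never observed keep a non-zero predictive probability because of the prior hyperparameters.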
4.3 Continuous parameters
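Let $X$ be a variable with domain the real numbers. We assume $X$ has a Gaussian distribution

$$p(X = x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where $\theta = (\mu, \sigma^2)$. As prior probability density for the two-parameter vector $\theta$ we assume a conjugate prior that leads to a univariate sampling model. A convenient parameterization is [20]

$$\mu \mid \sigma^2 \sim \mathrm{N}(\mu_0, \sigma^2 / \kappa_0)$$

$$\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$$

which corresponds to the joint prior density

$$p(\mu, \sigma^2) \propto \sigma^{-1} (\sigma^2)^{-(\nu_0/2 + 1)} \exp\left( -\frac{1}{2\sigma^2} \left[ \nu_0 \sigma_0^2 + \kappa_0 (\mu_0 - \mu)^2 \right] \right).$$

The presence of $\sigma^2$ in the distribution of $\mu \mid \sigma^2$ indicates that $\mu$ and $\sigma^2$ are dependent in their conjugate prior density. In particular, the larger $\sigma^2$, the larger the variance of the prior distribution from which $\mu$ is drawn. An alternative choice to construct a prior distribution for $\theta$ would be to specify prior distributions for $\mu$ and $\sigma$ independently. Further, for a sample $d = (x_1, \ldots, x_n)$ the joint posterior distribution $p(\theta \mid d)$ can be computed via the conditional posterior density of $\mu$, given $\sigma^2$,

$$\mu \mid \sigma^2, d \sim \mathrm{N}\left( \frac{\frac{\kappa_0}{\sigma^2} \mu_0 + \frac{n}{\sigma^2} \bar{x}}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}}, \; \frac{1}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}} \right) \tag{11}$$

and the marginal posterior density of $\sigma^2$

$$\sigma^2 \mid d \sim \text{Inv-}\chi^2(\nu_n, \sigma_n^2) \tag{12}$$

with

$$\nu_n = \nu_0 + n$$

$$\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n - 1) s^2 + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{x} - \mu_0)^2$$

where $\bar{x}$ is the sample mean and $s^2$ the sample variance of $d$.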
Thus to sample from the joint posterior distribution one first draws $\sigma^2$ from its marginal posterior distribution (12) and then draws $\mu$ from its normal conditional posterior distribution (11) using the simulated value of $\sigma^2$. Prior information about the variance $\sigma^2$ could be expressed by an inverse-$\chi^2$ density, possibly fitting the parameters $\nu_0$ and $\sigma_0^2$ from available data. For a noninformative prior distribution for $\sigma^2$ one can set $\nu_0 = 0$, thus $p(\sigma^2) \propto \sigma^{-2}$ [20]. The predictive posterior distribution is

$$p(x_{n+1} \mid d) = \int\!\!\int p(x_{n+1} \mid \mu, \sigma^2, d) \, p(\mu, \sigma^2 \mid d) \, d\mu \, d\sigma^2. \tag{13}$$
The first of the two factors under the integral is simply the normal distribution for the future observation given $\mu$ and $\sigma^2$, while the second factor is the joint posterior distribution discussed above. Hence, to simulate the future observation $X_{n+1}$, one first draws $\mu$ and $\sigma^2$ from their joint posterior distribution and then draws $x_{n+1} \sim \mathrm{N}(\mu, \sigma^2)$. The estimation of $\mu$ and $\sigma^2$ can be done on the basis of $S$ simulations, e.g., $S = 1000$.
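The two-step simulation described above can be sketched in Python with the standard library only. This is an illustrative sketch under stated assumptions, not the implementation used in the report: the data, the prior hyperparameters and the function names are hypothetical, and the scaled inverse-chi-square draw is obtained as nu * scale / X with X a chi-square variate, itself drawn as a Gamma(nu/2, scale 2) variate.

```python
import random
import statistics


def posterior_hyperparameters(data, mu0, kappa0, nu0, sigma0_sq):
    """Return (mu_n, kappa_n, nu_n, sigma_n_sq) of the conjugate posterior."""
    n = len(data)
    xbar = statistics.mean(data)
    s_sq = statistics.variance(data)  # sample variance, needs n >= 2
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    nu_n = nu0 + n
    nu_sigma = (nu0 * sigma0_sq + (n - 1) * s_sq
                + kappa0 * n / kappa_n * (xbar - mu0) ** 2)
    return mu_n, kappa_n, nu_n, nu_sigma / nu_n


def draw_inv_chi2(rng, nu, scale_sq):
    """Draw from a scaled inverse-chi-square distribution."""
    return nu * scale_sq / rng.gammavariate(nu / 2.0, 2.0)


def draw_predictive(rng, mu_n, kappa_n, nu_n, sigma_n_sq):
    sigma_sq = draw_inv_chi2(rng, nu_n, sigma_n_sq)    # Equation (12)
    mu = rng.gauss(mu_n, (sigma_sq / kappa_n) ** 0.5)  # Equation (11)
    return rng.gauss(mu, sigma_sq ** 0.5)              # x_{n+1} ~ N(mu, sigma^2)


rng = random.Random(0)
data = [1.9, 2.1, 2.0, 2.2, 1.8]  # hypothetical sample
mu_n, kappa_n, nu_n, sigma_n_sq = posterior_hyperparameters(
    data, mu0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0)
draws = [draw_predictive(rng, mu_n, kappa_n, nu_n, sigma_n_sq)
         for _ in range(1000)]  # S = 1000 simulations
```

The conditional mean and variance used for `mu` are the simplified form of Equation (11), since the common factor 1/σ² cancels in the mean.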
4.4 Learning local probabilities in Bayesian networks with mixed variables
We consider the joint distribution of parameters X encoded in some network structure G of dependencies. We encode our uncertainty on the parameters θ of the given distributions in a prior distribution p(θ) and use empirical data and Bayes' theorem to obtain the posterior distribution p(θ|d). We assume that the structure of the directed graph is known and the distribution type is given, and we consider the specification of the parameters in the distributions. As in the single variable case this process is carried out through conjugate Bayesian analysis.

We also introduce other simplifying assumptions. We assume that the parameters associated with one variable are independent of the parameters associated with the other variables (global parameter independence) [46]. In addition to this, we assume that the parameters are independent for each configuration of the discrete parents (local parameter independence). A consequence of this parameter independence assumption is that, for each configuration of the discrete parents, we can update the parameters in the local distributions independently. This also means that if we have local conjugacy, i.e., the distributions $p(\theta_i \mid pa_i)$ for a variable i belong to a conjugate family, then because of parameter independence we have global conjugacy. It follows that the joint distribution of θ belongs to a conjugate family. Parameter independence in the prior distribution implies parameter independence in the posterior distribution. Further, we also assume that the data d is complete, that is, each case contains a value for every random variable in the network, with no missing data. These assumptions are common in Bayesian learning theory [21, 13].

4.4.1 Discrete part of the network
A common assumption in the literature [29] is that discrete variables do not have continuous parents. Hence we study only discrete nodes with discrete parents. For discrete parameters, assuming an (unrestricted; see footnote 6) multinomial distribution model, each local distribution function is a collection of multinomial distributions (7), one distribution for each configuration of $Pa_i$, that is,

$$p(x_i^k \mid pa_i^j, \theta_i) = \theta_{ijk},$$

where $pa_i^1, \ldots, pa_i^{q_i}$, $q_i = \prod_{X_l \in Pa_i} r_l$, denote the configurations of $Pa_i$, and $\theta_i = (\theta_{ijk})$, $k = 1, \ldots, r_i$, $j = 1, \ldots, q_i$, are the local parameters of variable i. For convenience, let $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijr_i})$ be the vector of parameters for each variable i and parent configuration j ($\sum_{k=1}^{r_i} \theta_{ijk} = 1$). In the case of no missing values, that is, all variables of the network have a value in the random sample d, and independence among parameters, the parameters remain independent given d, that is,

$$p(\theta \mid d) = \prod_{i=1}^{d} \prod_{j=1}^{q_i} p(\theta_{ij} \mid d)$$

Footnote 6: The term unrestricted is used to indicate that no low-dimensional function of $Pa_i$ is used, that is, a probability model is considered for each value of the parent.
In other terms, we can update each vector parameter $\theta_{ij}$ independently, just as in the one-variable case shown in Equation (9). Assuming each vector has the prior distribution $\mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ijr_i})$ as in Equation (8), we obtain the posterior distribution

$$p(\theta_{ij} \mid d) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1} + s_{ij1}, \ldots, \alpha_{ijr_i} + s_{ijr_i})$$

where $s_{ijk}$ is the number of cases in d in which $X_i = x_i^k$ and $Pa_i = pa_i^j$. To obtain predictions on $x_{n+1}$, the next case after d, we can average over θ. For $x_{n+1}$ with $X_i = x_i^k$ and $Pa_i = pa_i^j$, where k and j depend on i,

$$p(x_{n+1} \mid d) = \int \prod_{i=1}^{d} \theta_{ijk} \, p(\theta \mid d) \, d\theta = \prod_{i=1}^{d} \int \theta_{ijk} \, p(\theta_{ij} \mid d) \, d\theta_{ij} = \prod_{i=1}^{d} \frac{\alpha_{ijk} + s_{ijk}}{\sum_{l=1}^{r_i} (\alpha_{ijl} + s_{ijl})} \tag{14}$$

where we first used the independence of parameters and later Equation (10).

4.4.2 Mixed part of the network
The common assumption is that the local probability distributions are Gaussian linear regressions on the continuous parents, with parameters depending on the configuration of the discrete parents [23, 12, 13, 42]. The local density function of a continuous variable i is a collection of Gaussian distributions depending on the configurations of discrete and continuous parents $Pa_i$ and parameters $\theta_i$. In the following we write Pa = (Pad, Pac) to distinguish discrete and continuous parents, respectively. We also use $pad^j$ to indicate the realization of configuration j of the discrete parents and pac to indicate the realization of the continuous parents. There are four possible cases for continuous nodes in a mixed network, namely when the node i has no parents, when it has no discrete parents, when it has no continuous parents and when it has mixed parents.

Case (i) There are no parents. Then the distribution is

$$p(x_i \mid \theta_i) = \mathrm{N}(\mu_i, \sigma_i^2)$$

where $\theta_i = (\mu_i, \sigma_i^2)$.

Case (ii) There are only continuous parents. Then the distribution is

$$p(x_i \mid pac_i, \theta_i) = \mathrm{N}(\mu_i + \beta_i \, pac_i, \sigma_i^2)$$

and $\theta_i = (\mu_i, \beta_i, \sigma_i^2)$.

Case (iii) There are only discrete parents. Then we have

$$p(x_i \mid pad_i^j, \theta_i) = \mathrm{N}(\mu_{ij}, \sigma_{ij}^2)$$

and $\theta_i = (\mu_{ij}, \sigma_{ij}^2)$, i.e., the mean depends solely on the parent configuration j of $Pa_i$.

Case (iv) There are both continuous and discrete parents. This is the real mixed case. The distribution is

$$p(x_i \mid pad_i^j, pac_i, \theta_i) = \mathrm{N}(\mu_{ij} + \beta_{ij} \, pac_i, \sigma_{ij}^2) \tag{15}$$

and $\theta_i = (\mu_{ij}, \beta_{ij}, \sigma_{ij}^2)$.

We can rewrite (15) with matrix notation

$$p(x_i \mid pad_i^j, pac_i, \theta_i) = \mathrm{N}(Z_{ij}^T \beta_{ij}, \sigma_{ij}^2) \tag{16}$$

where

$$Z_{ij} = \begin{pmatrix} 1 \\ pac_{i1} \\ \vdots \\ pac_{ik} \end{pmatrix}, \qquad \beta_{ij} = \begin{pmatrix} \mu_{ij} \\ \beta_{ij1} \\ \vdots \\ \beta_{ijk} \end{pmatrix}$$
are such that $Z_{ij}^T$ and $\beta_{ij}$ are matrices of dimensions 1 × (k + 1) and (k + 1) × 1, respectively, k being here the number of continuous parents of i. Due to parameter independence we can update parameters for each continuous variable i and each configuration of discrete parents j independently. By Bayes' theorem it is

$$p(\theta_{ij} \mid d) \propto p(x_{ij} \mid pac_i^j, pad_i^j, \theta_i) \, p(\theta_{ij}). \tag{17}$$

Here $x_{ij}$ is a vector of $n_j$ observations in d for node i corresponding to configuration j of the discrete parents of i, i.e., $Pad_i = pad_i^j$. Similarly, $pac_i^j$ are the values of the continuous parents of i in the cases of d with configuration j for the discrete parents of i. Since observations in d are independent, $p(x_{ij} \mid pac_i^j, pad_i^j, \theta_i)$ is the probability function of a multivariate Gaussian distribution with mean vector $Z_{ij} \beta_{ij}$ and covariance $\sigma_{ij}^2 I$, where $Z_{ij}$ is now a matrix of size $n_j \times (k + 1)$, with one row per observation, and I is the identity matrix. As prior probability for the parameters θ, [12] assume a Gaussian-inverse-Gamma distribution that, besides having the conjugate prior property, also leads to a univariate sampling model. For parameters $(\beta_{ij}, \sigma_{ij}^2 \mid pad_i^j)$ it is

$$\beta_{ij} \mid pad_i^j, \sigma_{ij}^2 \sim \mathrm{N}_{k+1}\left( \nu_{ijt}, \sigma_{ij}^2 (\tau_{ijt})^{-1} \right)$$

$$\sigma_{ij}^2 \sim \text{Inv-}\Gamma(\rho_{ijt}/2, \phi_{ijt}/2)$$

where the hyperparameter $\nu_{ijt}$ is a vector of size k + 1, $\tau_{ijt}$ is a matrix of size (k + 1) × (k + 1), and $\rho_{ijt}$ and $\phi_{ijt}$ are scalars. In the subsequent stages, the posterior becomes the prior and a new posterior is determined from the prior and the new observations available. The posterior distribution p(θ|d), derived in closed form using the conjugate property of the prior, can also be expressed by a two-stage sampling process:

$$\begin{aligned} \beta_{ij} \mid \sigma_{ij}^2, pad_i^j, d &\sim \mathrm{N}_{k+1}\left( \nu_{ij,t+1}, \sigma_{ij}^2 (\tau_{ij,t+1})^{-1} \right) \\ \sigma_{ij}^2 \mid d &\sim \text{Inv-}\Gamma(\rho_{ij,t+1}/2, \phi_{ij,t+1}/2) \end{aligned} \tag{18}$$

with

$$\tau_{ij,t+1} = \tau_{ij,t} + (Z_{ij})^T Z_{ij}$$

$$\nu_{ij,t+1} = (\tau_{ij,t+1})^{-1} \left( \tau_{ij,t} \nu_{ij,t} + (Z_{ij})^T x_{ij} \right)$$

$$\rho_{ij,t+1} = \rho_{ij,t} + n_j$$

$$\phi_{ij,t+1} = \phi_{ij,t} + (x_{ij} - Z_{ij} \nu_{ij,t+1})^T x_{ij} + (\nu_{ij,t} - \nu_{ij,t+1})^T \tau_{ij,t} \nu_{ij,t}$$
The predictive posterior distribution can be computed in the same way as in the single variable case of Equation (13). The parameters β and σ are estimated by simulation from (18). Alternatively, for σ 2 one can use the mode which is given by φ/(ρ + 2) or the mean given by φ/(ρ − 2).
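The hyperparameter update for one node and one configuration of the discrete parents can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the code of the deal or lgmpt packages: the helper names are hypothetical, the linear system is solved by a small Gauss-Jordan elimination suitable only for a handful of continuous parents, and the data (one continuous parent, observations lying exactly on x = 1 + 2 pac) and the nearly flat prior are made up for the example.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def mat_vec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def conjugate_update(nu, tau, rho, phi, Z, x):
    """One update of (nu, tau, rho, phi) from observations x with design Z."""
    Zt = transpose(Z)
    ZtZ = [[sum(p * q for p, q in zip(ra, rb)) for rb in Zt] for ra in Zt]
    tau_new = [[t + z for t, z in zip(tr, zr)] for tr, zr in zip(tau, ZtZ)]
    tau_nu = mat_vec(tau, nu)                      # tau_t nu_t
    Ztx = mat_vec(Zt, x)                           # Z^T x
    nu_new = solve(tau_new, [a + b for a, b in zip(tau_nu, Ztx)])
    rho_new = rho + len(x)
    fitted = mat_vec(Z, nu_new)                    # Z nu_{t+1}
    phi_new = (phi
               + sum((xi - fi) * xi for xi, fi in zip(x, fitted))
               + sum((a - b) * c for a, b, c in zip(nu, nu_new, tau_nu)))
    return nu_new, tau_new, rho_new, phi_new


# Hypothetical data: rows of Z are (1, pac); x = 1 + 2 * pac exactly.
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
x = [1.0, 3.0, 5.0, 7.0]
nu0, tau0 = [0.0, 0.0], [[1e-6, 0.0], [0.0, 1e-6]]  # assumed weak prior
nu_new, tau_new, rho_new, phi_new = conjugate_update(nu0, tau0, 1.0, 1.0, Z, x)
sigma_sq_mode = phi_new / (rho_new + 2)  # mode of the inverse-Gamma posterior
```

With such a weak prior the updated mean vector is close to the ordinary least squares fit, here an intercept near 1 and a slope near 2.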
4.5 Parameters of the method
The parameters of the method put forward in Algorithm 1 are ρ, m and the initial prior distributions of the probability parameters in the graphical model. The computational budget is another parameter, which affects the number of learning iterations performed by the algorithm.

The definition of the prior probability hyperparameters for both discrete and continuous variables may reflect prior knowledge of the user on promising choices. By default, for the discrete variables we let each configuration be equally likely. Thus, we imagine having a database of cases, with one case for each combination of discrete parameters. The sample size of this imaginary database is l. Each $\theta_{ijk}$ can then be computed by marginalization as $1/r_i$. (These parameters can be stored in a multi-way array with the node itself occupying the first dimension and the parents each occupying one dimension.)

For the continuous variables we set user-defined values as an initial estimate of the mean and the variance. Alternatively, if a presample is available, the mean and the variance can be estimated by regression from the data. In the general mixed case the local probability distributions are Gaussian linear regressions on the continuous parents with parameters depending on the configuration of the discrete parents, as expressed in Equation (16). The parameters to keep at the nodes are therefore $(\theta_i \mid pa_i) = (\beta_{ij}, \sigma_{ij}^2 \mid pa_i)$. The vector $\beta_{ij}$ contains an intercept $\beta_{ij0}$ and a linear regression coefficient for each continuous parent of i. Initially, we set the user-defined values for the estimated mean and variance of the independent case, that is, $(\theta_{i0} \mid pa_i) = (\mu_{i0}, \mathbf{0}, \sigma_{i0}^2)$, with $\mathbf{0}$ a vector of length equal to the number of continuous parents. (These parameters can be stored in a matrix with one row for each configuration of the discrete variables. The first column contains $\sigma_{ij}^2$ and the remaining columns $\beta_{ij}$.)
5 Numerical Examples
We implemented Algorithm 1 in R [47] on the basis of the Bayesian learning procedures described in the previous sections, which are implemented by the package deal [14]. We make our implementation publicly available as an R package, lgmpt, that imports deal, at www.imada.sdu.dk/~marco/lgm.

In the software we adapt the continuous parameters after they have been sampled. Precisely, we ensure that they fall within the limits imposed by the user. If they exceed these limits they are adjusted to the closest limit. Moreover, integer parameters that are treated as continuous are rounded to the nearest integer. The learning part is carried out on the adjusted values.

We show the application of the method to three cases. The first is a simple example with only two discrete parameters on a combinatorial optimization problem. The remaining examples are applications similar to those typically faced in optimization. They involve mixed parameters and include deterministically conditional parameters (the term deterministic condition refers to the possibility that under some choices of the parents the parameter has no meaning and is not used). Cases 1 and 2 are carried out on a MacBook 2.4 GHz with Intel Core 2 Duo processor and 4 GB of RAM, while the experiments for case 3 are carried out on a cluster of computers, each node consisting of machines with 2 Intel Xeon E5410 (quad core 2.33 GHz, 2x 6 MB L2 cache) that share 8 GB of RAM. The package lgmpt uses Rmpi to distribute experiments among slave processing slots and run them in parallel. All scripts and programs used for these cases are available as examples in the R package lgmpt and documented in its vignette.
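The adjustment of sampled continuous values can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the code of lgmpt: the function name and its arguments are hypothetical, and the order of the two steps (clamping first, rounding second) is an assumption.

```python
def adjust(value, lower, upper, integer=False):
    """Move a sampled value to the closest limit if it falls outside
    [lower, upper]; round it if the parameter is integer-valued.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    clamped = min(max(value, lower), upper)
    return round(clamped) if integer else clamped


# Hypothetical samples for a real parameter in [0.01, 3.00] and for an
# integer parameter in [1, 20] that is treated as continuous.
too_large = adjust(3.7, 0.01, 3.0)
too_small = adjust(-1.0, 0.01, 3.0)
rounded = adjust(7.6, 1, 20, integer=True)
capped = adjust(25.2, 1, 20, integer=True)
```

Learning would then use the adjusted values rather than the raw samples.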
5.1 Case 1: Heuristics for the traveling salesman problem
The first application example considers a simple heuristic algorithm for the classical traveling salesman problem (TSP). The heuristic uses a construction procedure to determine an initial solution and a local search to improve it until a local optimum is reached. Both choices are modelled as categorical parameters with heur ∈ {nn, nearest insertion, farthest insertion, cheapest insertion, arbitrary insertion, random insertion} and ls ∈ {none, 2-opt, linkern}. See [27] for a description of the heuristics. The implementation is provided in R by the package TSP [22], while for the more involved Lin-Kernighan local search we used the implementation available in concorde [1]. The goal is to determine the best combination of these two components on a class of randomly generated Euclidean instances of size 100 and 200 (see footnote 7). In the graphical model we encode a dependency of the local search on the initial solution. See Figure 1, left.

We run the Bayesian learning algorithm for 20 stages (we call stage a full iteration of the loop 1-9 of Algorithm 1) with m = 75 sampled configurations at each stage and ρ set to 0.05. Hence, at each stage the ⌊0.05 · 75⌋ = 3 best configurations are used to learn the model parameters. The total number of single algorithm runs at the end of the process is 1500 and the number of training instances seen is 20 (see footnote 8).

Footnote 7: For this size the branch and price in concorde solves the problem easily to optimality; however, these sizes already make it possible to observe differences among the heuristic algorithms.

Footnote 8: In sampling the instances we make sure that the first 10 instances are those of size 100 and the
[Figure 1, panels. Left: graphical model with the nodes heur and ls. Center: sampled configurations (Config, the combinations of heur and ls) against Stage, 1 to 20. Right: boxplots of rank for each configuration.]
Figure 1: On the left the graphical model. In the center the profile of sampled configurations. On the right the boxplot of the rank transformed results including all configurations tried.

Note that in a simple case like this, with only categorical variables that amount to 15 different configurations and a sample size of 75, some configurations are sampled more than once. Since we mostly deal with stochastic algorithms, whenever a configuration is repeated in a stage it is re-run with a different seed. The presence of repeated configurations is interpreted within the current framework as a possible reinforcement for the most promising configurations. Obviously, with larger design spaces like those of the next examples, this resampling aspect is not present.

In Figure 1, center, we visualize the behavior of the learning process by plotting the selection density of different configurations sampled in line 2 of the algorithm. Darker grey circles indicate that the corresponding configuration is sampled more often. It is evident from the plot that the learning algorithm is converging towards a few configurations that perform better. In the right part of Figure 1 we report the boxplots of all results collected at the end of the 20 stages. The response variable is transformed into ranks within each instance to avoid scale issues due to the use of different instances. The fact that the configurations to which the algorithm converges are also those that rank best confirms that we are learning, as wished, the best configurations.

Figures 2 and 3 illustrate the changes in the probability distributions throughout all learning stages of the algorithm. We report the numerical results for the local conditional probabilities for the two nodes of the network in Table 1. As the graphical model indicates, for ls the distributions are conditional on the choice of the heur parameter.
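The rank transformation within each instance can be sketched as follows. The sketch is illustrative only: the function name and the tour lengths are hypothetical, and the use of average ranks for tied results is an assumption, since the treatment of ties is not specified here.

```python
def rank_within_instance(costs):
    """Average ranks (1 = lowest cost) of the results obtained on one instance."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    ranks = [0.0] * len(costs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block over all results tied with the one at position pos.
        while end + 1 < len(order) and costs[order[end + 1]] == costs[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


# Hypothetical tour lengths of four configurations on two instances whose
# costs live on different scales.
results = {
    "instance A": [105.0, 98.0, 98.0, 120.0],
    "instance B": [2100.0, 1900.0, 2000.0, 2500.0],
}
ranks = {name: rank_within_instance(costs) for name, costs in results.items()}
```

After the transformation the two instances contribute values on the same scale, which is what makes a joint boxplot meaningful.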
From this table we can decide the best configuration by taking, for each component in topological order, the choice that has the highest probability. In this case we have: arbitrary insertion + linkern. This result is consistent with the vast literature on TSP and we take it here as evidence that our tuning method produces sound results.

Footnote 8 (continued): second 10 are those of size 200. Although this is not going to have any relevance in this example, since each run lasts in any case less than one second and the computational cost is not an issue, this procedure might become useful in other cases when we might wish to learn from small instances some a-priori knowledge to simplify the learning task on the large instances.
Figure 2: Changes in the probability distribution of heur after each learning stage.

heur                   Freq
nearest insertion      0.17
farthest insertion     0.14
cheapest insertion     0.11
arbitrary insertion    0.33
nn                     0.25

ls         nearest insertion   farthest insertion   cheapest insertion   arbitrary insertion   nn
no         0.09                0.11                 0.14                 0.05                  0.06
2-opt      0.09                0.11                 0.14                 0.05                  0.06
linkern    0.82                0.78                 0.71                 0.91                  0.88

Table 1: Final local conditional probabilities for the node heur (above) and ls (below) of the network.

Moreover, we observe that while the winner for the ls parameter is by far linkern, the choice for the heur parameter is more uncertain. This fact may indicate that for this algorithm the starting solution is not very important if a good local search is used or, in other terms, that whatever the starting solution is, the local search linkern leads to the best results. We emphasize that this information provided by lgmpt can be relevant in practice and that it is an added value of this method that other methods in the literature do not exhibit.
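Reading the best configuration off Table 1 amounts to taking the most probable value of each node in topological order, conditioning each node on the values already chosen for its parents. A minimal Python sketch with the probabilities of Table 1 follows; the dictionary layout and variable names are assumptions made for the example.

```python
# Probabilities copied from Table 1.
p_heur = {
    "nearest insertion": 0.17,
    "farthest insertion": 0.14,
    "cheapest insertion": 0.11,
    "arbitrary insertion": 0.33,
    "nn": 0.25,
}
p_ls_given_heur = {
    "nearest insertion": {"no": 0.09, "2-opt": 0.09, "linkern": 0.82},
    "farthest insertion": {"no": 0.11, "2-opt": 0.11, "linkern": 0.78},
    "cheapest insertion": {"no": 0.14, "2-opt": 0.14, "linkern": 0.71},
    "arbitrary insertion": {"no": 0.05, "2-opt": 0.05, "linkern": 0.91},
    "nn": {"no": 0.06, "2-opt": 0.06, "linkern": 0.88},
}

# heur has no parents, so it is chosen first; ls is chosen given heur.
best_heur = max(p_heur, key=p_heur.get)
best_ls = max(p_ls_given_heur[best_heur], key=p_ls_given_heur[best_heur].get)
best_configuration = (best_heur, best_ls)
```

Note that this greedy choice follows the network order; it is not guaranteed in general to coincide with the configuration of highest joint probability.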
5.2 Case 2: Least median of squares
The second case studies heuristics for linear regression by least median of squares [41]. Briefly, given a linear regression $y_t = \beta_0 + \beta_1 x_t + \epsilon_t$
Figure 3: Changes in the probability distribution of heur (left) and ls (right) after stage 1, 7, 13 and 20, when the algorithm finishes. The variable heur is independent while the variable ls is conditional on heur.
[Figure 4, panels. Left: surface z over the axes alpha and beta. Right: plots of Yt against Xt with panel titles median(ε_t²) = 0.00014, 5.2e−05, 6.9e−05 and 8.6e−05.]
Figure 4: On the left, an example of a function to minimize arising from the least median of squares problem in two dimensions. On the right, the blue line represents the regression corresponding to different local optima of the function depicted on the left. Solutions may differ drastically. The red line represents instead the classical least squares regression.

we wish to estimate the values of the intercept $\beta_0$ and the slope $\beta_1$. The minimization of the sum of squared errors

$$\min \sum_{t=1}^{n} \epsilon_t^2 = \sum_{t=1}^{n} (y_t - \beta_0 - \beta_1 x_t)^2$$

leads to an easy mathematical derivation of $\beta_0$ and $\beta_1$. An alternative way is minimizing the median of squares [41]

$$\min \; \mathrm{median}\, \epsilon_t^2$$

This latter criterion is likely to be more robust against outliers. Nevertheless, it requires minimizing a non-differentiable, nonlinear and multimodal function. An example from the data used later is shown in Figure 4. For the solution of this problem we need to resort to heuristics. As the figure emphasizes, it is nevertheless important to find good solutions. In this setting we may ask which are the best heuristics and in which way they relate to properties of the function to minimize.

In Table 2 we identify the controllable factors of the experimental setup. A basic algorithm that often achieves good results is the Nelder-Mead simplex algorithm [34], which implements a search in a neighborhood of a point of the function. The Nelder-Mead algorithm has a number of continuous parameters: alpha is the reflection factor, beta the contraction factor and gamma the expansion factor. The authors suggested as default setting alpha=1.0, beta=0.5 and gamma=2.0. Practical experience shows that better solutions can be found if this algorithm is restarted several times, either from a point reached after convergence or from a different point. In addition the way initial points are
selected may have a more or less relevant impact. We focus our analysis on: the number of reinforcements max.reinf, i.e., restarts from convergence points of the Nelder-Mead algorithm; the number of restarts max.restart; and the method to select random restart points init.method. In particular, for this last factor we consider a uniform random distribution of points and a quasi-Monte Carlo method that generates a low discrepancy sequence of points [15]. (See Figure 5 to appreciate the difference.)

Figure 5: Starting point generators in 2-D (axes β0 and β1). Left: uniform random distribution (pseudo random number generator). Right: quasi-Monte Carlo method (low discrepancy sequence generator).

Table 2 summarizes the parameters considered and their ranges. In the same figure we also give the graphical model. We decided to encode a dependency between init.method and max.restart since it seems plausible that as the number of restarts increases the impact of the low discrepancy sampling of points becomes less relevant (asymptotically it should disappear). Moreover, max.reinf might have an impact on the scaling of the simplex in the Nelder-Mead algorithm and hence on its three continuous parameters.

Least median of squares has been proposed as a more robust method for carrying out the linear regression in the capital asset pricing model (CAPM) [32, 48]. Our attempt to use real-life data from this application turned out to be uninteresting because most configurations of the parameters above returned the same results. Therefore we turned our attention to synthetic instances generated in a way similar to [41]. Let $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + e_i$ be the linear model with p input variables and p + 1 coefficients. We simulate the points $(y_i, x_{ij})$ as follows. Let $\beta_0 = 1, \beta_1 = 2, \ldots, \beta_p = p + 1$. The points $x_i$ are generated by a Gamma distribution with scale 1 and shape 1.5. Errors $e_i$ are normally distributed with mean 0 and variance 5.
A part of the $y_i$ values, corresponding to $0.3 \, \frac{1}{p+1} \, n$ of the total, are contaminated by sampling $\beta_j$, j = 0, p + 1 from a normal distribution centered in $\beta_j$ and with variance 3. We generated 100 instances made of N = 600 points and p = 16 regressors. Unfortunately, after several attempts the problem of many configurations attaining the same results was still there. At each stage, about half of the configurations ranked best and were selected for learning, hence the value of the parameter ρ is not reliable in this example. We use a maximum of 20 learning stages and a sample of 75 configurations at each stage.

Factor         Type        Levels/Intervals
max.restart    Discrete    {400,1600,3200}
init.method    Discrete    {random, quasi-random}
max.reinf      Discrete    {1;5}
alpha          Real        [0.01; 3.00]
beta           Real        [0.01; 3.00]
gamma          Real        [0.01; 4.00]

[Graphical model over the nodes max.restart, init.method, max.reinf, alpha, beta and gamma.]

Table 2: On the left, the parameters to tune for the LMS case. On the right, the graphical model.

In Figure 6 we visualize the trend of the sampling throughout the different stages for the combinations of categorical variables. Each circle includes different choices for the continuous parameters. A trend is evident to focus very soon on areas of the design space that the boxplot on the right confirms to be the best performing. In Figure 7, we try to represent the sampling process for the continuous parameters using darker colors for the latest stages of the race. It is hard in this case to distinguish any pattern.

In Table 3 we report the values of the parameters defining the local conditional probability distributions. We also report for comparison the prior distributions that we defined as input network. We note also in this case that most of the effect is produced by one discrete parameter, namely, init.method. The result shows that the quasi-Monte Carlo method is outperformed by usual uniform sampling. This result is clearly linked to the dimension of the problem p and it would be interesting in the future to include this parameter somehow in the analysis. The probabilities of the other two discrete parameters do not change appreciably. This has to be interpreted as a lack of influence and, since lower values for these parameters imply shorter running times, they should be preferred. As far as the continuous parameters are concerned, we see that they do not exhibit large variations in their mean parameters while there is a considerable reduction in variance. This result calls for more investigation to understand whether the little change in means is due to the fact that the prior values are those suggested by the authors, for which experience has already shown that they are good values, or to a drawback of the tuning method.
Finally, the most likely configuration of the learned network obtained by sampling the value of the variables chosen in topological order is given in Table 4.
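For reference, the criterion that the tuned heuristics minimise in this case can be written down in a few lines. The Python sketch below covers only the simple regression with one regressor introduced at the start of this section; the data, containing a single gross outlier, and the function names are hypothetical. It compares the sum of squared errors with the median of squared errors at the coefficients of the uncontaminated line.

```python
import statistics


def squared_residuals(beta0, beta1, xs, ys):
    return [(y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys)]


def sum_of_squares(beta0, beta1, xs, ys):
    """Classical least squares objective."""
    return sum(squared_residuals(beta0, beta1, xs, ys))


def median_of_squares(beta0, beta1, xs, ys):
    """Least median of squares objective."""
    return statistics.median(squared_residuals(beta0, beta1, xs, ys))


# Hypothetical data on the line y = 1 + 2 x, with the last point replaced
# by an outlier.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0 + 2.0 * x for x in xs]
ys[-1] = 100.0

sse_at_truth = sum_of_squares(1.0, 2.0, xs, ys)
lms_at_truth = median_of_squares(1.0, 2.0, xs, ys)
```

The outlier dominates the sum of squares but leaves the median untouched, which illustrates the robustness mentioned in the text; the price is an objective that is not differentiable and therefore calls for heuristics such as restarted Nelder-Mead.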
[Figure 6, panels. Left: sampled configurations (Config, the combinations of max.restart, init.method and max.reinf) against Stage, 1 to 20. Right: boxplots of rank for each configuration.]
Figure 6: On the left the profile of sampled configurations. On the right the boxplot of the rank transformed results including all configurations tried.
Figure 7: An attempt to represent the distribution of sampled points of continuous parameters for each combination of discrete parameters. Different shades of grey indicate the stage in which the corresponding point was sampled: the darker the point, the later the stage.
Discrete nodes (probabilities):

Node         Level         Condition           Prior  Learned
max.restart  400           -                   0.33   0.37
max.restart  1600          -                   0.33   0.28
max.restart  3200          -                   0.33   0.35
init.method  quasi-random  max.restart = 400   0.50   0.03
init.method  quasi-random  max.restart = 1600  0.50   0.04
init.method  quasi-random  max.restart = 3200  0.50   0.03
init.method  random        max.restart = 400   0.50   0.97
init.method  random        max.restart = 1600  0.50   0.96
init.method  random        max.restart = 3200  0.50   0.97
max.reinf    1             -                   0.20   0.27
max.reinf    5             -                   0.20   0.16
max.reinf    10            -                   0.20   0.17
max.reinf    15            -                   0.20   0.16
max.reinf    20            -                   0.20   0.25

Continuous nodes (s2 and Intercept; row labels :1 to :20 are the levels of max.reinf):

Node   Level  Prior s2  Prior Intercept  Learned s2  Learned Intercept
alpha  :1     1.00      1.00             0.34        1.33
alpha  :5     1.00      1.00             0.44        0.99
alpha  :10    1.00      1.00             0.27        1.15
alpha  :15    1.00      1.00             0.32        0.92
alpha  :20    1.00      1.00             0.38        1.10
beta   :1     0.50      0.50             0.20        0.55
beta   :5     0.50      0.50             0.15        0.50
beta   :10    0.50      0.50             0.15        0.58
beta   :15    0.50      0.50             0.15        0.54
beta   :20    0.50      0.50             0.13        0.55
gamma  :1     1.00      2.00             0.49        2.16
gamma  :5     1.00      2.00             0.52        1.95
gamma  :10    1.00      2.00             0.60        2.11
gamma  :15    1.00      2.00             0.41        1.93
gamma  :20    1.00      2.00             0.41        1.86
Table 3: Conditional probability tables for each variable in the LMS case. On the left, the parameters of the prior local distributions. On the right, the parameters of the local distributions after the execution of the learning algorithm.
Parameter    Value
max.restart  400
max.reinf    1
init.method  random
alpha        1.32
beta         0.55
gamma        2.17
Table 4: The configuration most likely to perform best in the LMS example.
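The idea of reading a configuration off the learned network by visiting the variables in topological order can be sketched as follows. This is only an illustrative sketch, not the implementation of the R package: the numbers are transcribed from the learned column of Table 3, the parent structure (init.method given max.restart; alpha, beta and gamma given max.reinf) is read off the layout of that table, and treating s2 as the variance of a Gaussian centred at the intercept is an assumption. The function names are hypothetical. Taking modes and means in this way gives values close to, but not identical to, those of Table 4, so the procedure used in the report may differ in detail.

```python
import math
import random

# Learned local distributions transcribed from Table 3 (right-hand side).
MAX_RESTART = {400: 0.37, 1600: 0.28, 3200: 0.35}
INIT_METHOD = {  # P(init.method | max.restart)
    400: {"quasi-random": 0.03, "random": 0.97},
    1600: {"quasi-random": 0.04, "random": 0.96},
    3200: {"quasi-random": 0.03, "random": 0.97},
}
MAX_REINF = {1: 0.27, 5: 0.16, 10: 0.17, 15: 0.16, 20: 0.25}
# (s2, intercept) for each level of max.reinf; s2 is ASSUMED to be a variance.
CONTINUOUS = {
    "alpha": {1: (0.34, 1.33), 5: (0.44, 0.99), 10: (0.27, 1.15),
              15: (0.32, 0.92), 20: (0.38, 1.10)},
    "beta": {1: (0.20, 0.55), 5: (0.15, 0.50), 10: (0.15, 0.58),
             15: (0.15, 0.54), 20: (0.13, 0.55)},
    "gamma": {1: (0.49, 2.16), 5: (0.52, 1.95), 10: (0.60, 2.11),
              15: (0.41, 1.93), 20: (0.41, 1.86)},
}


def mode(dist):
    """Most probable level of a discrete distribution given as {level: prob}."""
    return max(dist, key=dist.get)


def draw(dist, rng):
    """Draw one level; weights need not sum exactly to 1 (rounded table values)."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def most_likely_configuration():
    """Visit nodes in topological order, taking the mode (or mean) at each node."""
    restart = mode(MAX_RESTART)
    config = {
        "max.restart": restart,
        "init.method": mode(INIT_METHOD[restart]),
        "max.reinf": mode(MAX_REINF),
    }
    for name, table in CONTINUOUS.items():
        _s2, intercept = table[config["max.reinf"]]
        config[name] = intercept
    return config


def sample_configuration(rng):
    """Ancestral sampling: parents are drawn before their children."""
    restart = draw(MAX_RESTART, rng)
    config = {
        "max.restart": restart,
        "init.method": draw(INIT_METHOD[restart], rng),
        "max.reinf": draw(MAX_REINF, rng),
    }
    for name, table in CONTINUOUS.items():
        s2, intercept = table[config["max.reinf"]]
        config[name] = rng.gauss(intercept, math.sqrt(s2))
    return config


if __name__ == "__main__":
    print(most_likely_configuration())
    print(sample_configuration(random.Random(0)))
```

Discrete parents are handled before the continuous children because each continuous node needs the sampled level of max.reinf to select its row of the table.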
5.3 Case 3: Analysis of LGMPT parameters and assessment
The third case that we investigate is the tuning of ant colony optimization (ACO) algorithms for the traveling salesman problem (TSP). This case is very close to the typical applications that may interest practitioners. Moreover, it has already been used as a benchmark in the literature for similar studies [40, 10]. Our experimental goals in this case are threefold. First, we investigate the impact of the parameters of lgmpt. Second, we compare its performance against iterated F-race [10]. Finally, we analyze the results produced.
5.3.1 ACO for TSP
Several templates have been proposed for ant colony optimization. The ACOTSP software package (footnote 9) implements the templates MAX-MIN Ant System (MMAS), ant colony system (ACS), rank-based ant system (RAS), elitist ant system (EAS), and ant system (AS) [18]. The configuration of the software involves 12 mixed parameters, one of them being the ACO version to be used. We use Euclidean TSP instances where the nodes are uniformly distributed in a square of size 10,000 × 10,000. We generated 300 instances of 750 vertices using the DIMACS instance generator [28].

The parameters include the ACO template, algorithm, with possible choices MMAS, ACS, RAS, EAS and AS, and the local search type, localsearch, with four levels: no local search, 2-opt, 2.5-opt, and 3-opt. All ACO templates share the three continuous parameters alpha, beta, and rho, and the number of ants, ants. The heuristic information controlling the construction of solutions by ants is the nearest neighbor heuristic with a restricted candidate list. The size of the candidate list is controlled by nnants. Five further parameters are deterministically conditional on the ACO template or on the local search considered: the continuous parameter q0, indicating the probability of best choice in tour construction, used when ACS is selected; the number of ranks, rasrank, used when RAS is selected; the number of elitist ants, elitistants, in EAS; and the number of nearest neighbors, nnls, and the binary don't-look-bit parameter, dlb, which are in effect only when local search is used. The list of parameters is summarized in Figure 8.

In lgmpt, we model all integer parameters (except the binary one) as continuous parameters. The graphical model is reported in Figure 8, right. The only dependencies that we include are those due to the deterministic relationships between parameters described above. In this way we treat deterministic dependencies in the same way as probabilistic dependencies.
This is not a problem: the probability distributions learned under the conditions in which the parameter is not used will be erratic and will be ignored when selecting the best configuration. In the next experiments we use Euclidean TSP instances with 750 nodes uniformly distributed in a square of side 10,000.
5.3.2 LGMPT parameters
Factor       Type      Levels/Intervals
algorithm    Discrete  {as, mmas, eas, ras, acs}
localsearch  Discrete  {0, 1, 2, 3}
alpha        Real      [0.01; 5.00]
beta         Real      [0.01; 10.00]
rho          Real      [0.00; 1.00]
ants         Integer   [5; 100]
nnants       Integer   [5; 50]
q0           Real      [0.00; 1.00] | algorithm (acs)
rasrank      Integer   [1; 100] | algorithm (ras)
elitistants  Integer   [1; 750] | algorithm (eas)
nnls         Real      [5; 50] | localsearch (1, 2, 3)
dlb          Discrete  {0, 1} | localsearch (1, 2, 3)

[Graphical model over the nodes algorithm, localsearch, alpha, beta, rho, ants, nnants, q0, rasrank, elitistants, nnls and dlb.]
Figure 8: The parameters to tune and their ranges for the ACO-TSP case. On the right the graphical model.

Footnote 9: Available at http://iridia.ulb.ac.be/~mdorigo/ACO/downloads/

The parameters that might influence the performance of lgmpt are the number of configurations sampled and executed at each stage, N.new.configs; the number of best configurations used for learning, controlled by the parameter ρ; and the total number of stages carried out, max.stages. The number of stages times the number of sampled configurations gives the total budget, hence one of these three quantities is determined by the other two. We tested in a full factorial design all combinations of max.stages ∈ {10, 20, 30}, N.new.configs ∈ {50, 75, 150} and ρ ∈ {0.05, 0.1, 0.2}. Note that the total budget is therefore not the same across configurations. As time limit for each single run of the ACO algorithm we set 20 seconds. We ran LGM for each combination of its parameters, sampling instances from a set of 700 instances. Each single run of LGM, using around 50 processing slots working in parallel, lasted between 27 minutes and 4.5 hours, depending on its total budget. When finished, each run indicated a configuration of ACO parameters as the most probable to be the best. We took this best ACO configuration as representative of the results attainable by the LGM parameter configuration and ran it on a set of 30 new test instances sampled from a set of 300 instances with the same characteristics as the previous ones but generated with a different seed.

We analyze the final results in relation to the lgmpt parameters in Figure 9. The interaction plot does not suggest any clear pattern. Similarly, the distributions of the results of each LGM configuration, visualized in the boxplots, show that the differences are small. A statistical analysis by means of the Friedman test reveals that there is no statistically significant difference among the 18 best-ranked configurations. Hence, we could choose the one that uses the least total computation, that is, max.stages = 10, N.new.configs = {50, 75} and ρ = 0.05.
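The relation between the lgmpt parameters and the total budget can be made concrete with a short enumeration of the factorial design. This is a minimal sketch that assumes, as stated in the text, that the budget is counted as the number of stages times the number of configurations sampled per stage; the function names are hypothetical and are not part of the R package.

```python
from itertools import product

# Levels of the three lgmpt parameters, as listed in the text.
MAX_STAGES = (10, 20, 30)
N_NEW_CONFIGS = (50, 75, 150)
RHO = (0.05, 0.1, 0.2)


def total_budget(max_stages, n_new_configs):
    """Total number of runs: stages times configurations sampled per stage."""
    return max_stages * n_new_configs


def full_factorial():
    """All combinations of the three parameters, each with its total budget."""
    return [
        {"max.stages": s, "N.new.configs": n, "rho": r,
         "budget": total_budget(s, n)}
        for s, n, r in product(MAX_STAGES, N_NEW_CONFIGS, RHO)
    ]


def within_budget(design, limit):
    """Combinations whose budget does not exceed `limit`, cheapest first."""
    feasible = (d for d in design if d["budget"] <= limit)
    return sorted(feasible, key=lambda d: d["budget"])


if __name__ == "__main__":
    design = full_factorial()
    for d in within_budget(design, limit=1500):
        print(d)
```

Note that ρ does not enter the budget: it only controls how many of the sampled configurations are used for learning.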
5.3.3 Comparison with iterated F-race
To have an external assessment of the performance of LGM, we compare it with another tuning method from the literature, namely F-Race, or more precisely its enhanced version iterated F-race, which is able to deal with large parameter spaces including continuous parameters. We use a revised version of the iterated F-race described in [10], which is currently under development at IRIDIA. Iterated F-race has a number of parameters to be set as well. We use the default settings provided by the developers, on the grounds that these defaults are the result of their testing with the software. We controlled instead the total experimental budget, which was set to 1500 single runs of the ACO program to configure, the same as in [10].

[Figure 9 plots. Left: 3D interaction plots with axes labelled max.stages, new.configs, lgm.rho and rank. Right: boxplots of rank (ticks 0 to 25), one per LGM configuration. Configuration labels on the vertical axis, in the order extracted: 10-50-0.2, 20-50-0.2, 30-75-0.2, 20-50-0.05, 10-50-0.1, 20-150-0.2, 20-75-0.1, 30-50-0.2, 10-50-0.05, 30-75-0.1, 30-75-0.05, 20-75-0.2, 10-150-0.05, 10-150-0.1, 10-75-0.2, 10-75-0.1, 30-50-0.05, 20-75-0.05, 20-150-0.05, 30-50-0.1, 10-150-0.2, 20-50-0.1, 20-150-0.1, 30-150-0.05, 10-75-0.05, 30-150-0.2, 30-150-0.1.]
Figure 9: On the left, 3D interaction plots and, on the right, boxplot for the study of LGM parameters. Configurations are indicated by max.stages-N.new.configs-ρ and results are transformed into ranks within each single instance.

The run of iterated F-race took 4 hours and 35 minutes on the cluster where we ran our experiments. It stopped at a total of 1425 single ACO algorithm runs, when the number of best configurations still alive was 3 and hence below a control parameter whose default was 5. The program finishes by returning the 5 best configurations seen; they are reported in Table 5. Overall, iterated F-race performed 7 iterations (each iteration corresponding to a run of F-race with resampled configurations). The number of training instances, which is not a control parameter, was 26 at the end of iterated F-race. The 26 instances are extracted from the same pool of 700 training instances used for BPC. Iterated F-race was not able to discriminate among the three top-ranked configurations, hence it is fair to choose the first as representative of the outcome of the race.

Parameter    1     2     3     4     5
algorithm    acs   acs   acs   acs   acs
localsearch  3     3     3     3     3
alpha        4.4   2.9   2     2.3   4.7
beta         0.86  1.8   2.2   5     8.9
rho          0.86  0.65  0.37  0.79  0.77
ants         78    72    49    55    62
nnants       13    24    32    22    12
nnls         11    10    18    30    15
q0           0.9   0.73  0.92  0.7   0.78
dlb          0     0     0     0     0
rasrank      -     -     -     -     -
elitistants  -     -     -     -     -

Table 5: The outcome of iterated F-race. A dash means that the parameter is not meaningful under that configuration.

[Figure 10 plots. Left panel: sampled configurations per stage; vertical axis Config, horizontal axis Stage (ticks 2 to 10). Right panel: boxplots of rank (ticks 0 to 60) for each Config. Config labels have the form algorithm-localsearch-dlb, from acs-3-1 down to as-0-0.]
Figure 10: On the left the profile of sampled points for different configurations of categorical variables. On the right the boxplot of the rank-transformed results including all configurations in the combinations of categorical variables.

We compare this configuration against the most probable best configuration indicated by lgmpt when run with the same total computational budget, namely 1500 ACO algorithm runs. From Figure 9, the first lgmpt configuration that uses at most 1500 runs is max.stages = 10, N.new.configs = 75, ρ = 0.05. In fact, this configuration terminates when only 750 runs have been collected and, according to the previous analysis, it does not perform significantly worse than others that use a higher budget. We ran the best ACO configuration indicated by iterated F-race and the best ACO configuration indicated by LGM-(10-75-0.05) on a set of 30 test TSP instances sampled from 300 new instances with the same characteristics as those used for training. Finally, we compare the results of the two selected ACO configurations on the 30 instances by means of the Wilcoxon signed rank test with matched pairs, to account for blocking on the instances. The test returns a p-value of 0.9354, indicating that the null hypothesis of no difference between the two tuners cannot be rejected. Hence we can conclude that in this specific application LGM is at least as effective as the state of the art, while offering more insight into the analysis, as expanded in the following.
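A paired comparison of this kind can be sketched in a few lines. The following is a self-contained illustration of the Wilcoxon signed-rank test using the normal approximation with tie correction; it is not the code used for the report (which presumably relied on R), the function name is hypothetical, and the data in the usage example are synthetic and unrelated to the results reported above.

```python
import math
import random


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.

    Uses the normal approximation with tie correction and no continuity
    correction. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank |d_i|, giving tied values the average of the ranks they span.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while (end + 1 < n
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        size = end - start + 1
        tie_term += size**3 - size
        start = end + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic tour lengths on 30 instances (illustrative only).
    tuner_a = [rng.gauss(1000.0, 50.0) for _ in range(30)]
    tuner_b = [a + rng.gauss(0.0, 5.0) for a in tuner_a]
    w, p = wilcoxon_signed_rank(tuner_a, tuner_b)
    print(f"W+ = {w:.1f}, p = {p:.4f}")
```

Pairing the two results obtained on the same instance is what accounts for blocking: only the within-instance differences enter the statistic, so instance-to-instance variation in tour length does not mask the effect of the tuner.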
5.3.4 The results of lgmpt on the ACO-TSP case
Figure 10, left, shows the profile of selected configurations of categorical parameters. There are 40 possible combinations of categorical parameters and 75 configurations sampled at each stage. We see again a certain convergence towards a group of configurations that the boxplot on the right confirms to be good, although the presence of 9 further continuous parameters makes this convergence less pronounced. As in the previous example, we try to give an idea of the distribution of sampled points for the continuous parameters in Figure 11. We observe that the latest stages focus more on the same configurations that we recognize in Figure 10 as most promising. Besides this, however, it is hard to distinguish any relevant pattern, due to the large number of parameters.
Figure 11: An attempt to represent the distribution of sampled points of continuous parameters for each combination of discrete parameters in the ACO-TSP case. Different shades of grey indicate the stage in which the corresponding point was sampled: the darker the point, the later the stage.
Parameter    Value
algorithm    acs
localsearch  3
alpha        2.62
beta         4.71
rho          0.47
nnants       25.67
q0           0.34
elitistants  342.59
ants         49.06
nnls         26.59
dlb          1
rasrank      41.48

Table 6: The configuration for the ACO-TSP case that is most likely to perform best according to lgmpt.

In Table 7 we report the prior local conditional probability tables, which we declared as input data, and the final tables after the learning process. We note that the most influential parameter is the local search type. For the continuous parameters we only observe a decrease in the variance, as in the previous example. Finally, the most likely configuration, inferred by sampling values from the learned distributions with the variables taken in topological order, is given in Table 6.
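Because deterministic dependencies are modelled like probabilistic ones, the configuration in Table 6 carries values for rasrank and elitistants even though the selected template is acs. A small filter such as the one below could be used to discard values of parameters that are not in effect. This is a hedged sketch: the activation conditions are transcribed from Figure 8, while the function name and the dictionary layout are hypothetical and not part of the R package.

```python
# Activation conditions transcribed from Figure 8.
# Parameters that are not listed here are always active.
CONDITIONS = {
    "q0": ("algorithm", {"acs"}),
    "rasrank": ("algorithm", {"ras"}),
    "elitistants": ("algorithm", {"eas"}),
    "nnls": ("localsearch", {1, 2, 3}),
    "dlb": ("localsearch", {1, 2, 3}),
}

# Configuration reported in Table 6.
TABLE_6 = {
    "algorithm": "acs", "localsearch": 3, "alpha": 2.62, "beta": 4.71,
    "rho": 0.47, "nnants": 25.67, "q0": 0.34, "elitistants": 342.59,
    "ants": 49.06, "nnls": 26.59, "dlb": 1, "rasrank": 41.48,
}


def active_parameters(config):
    """Keep only the parameters whose activation condition is satisfied."""
    active = {}
    for name, value in config.items():
        condition = CONDITIONS.get(name)
        if condition is not None:
            parent, allowed = condition
            if config[parent] not in allowed:
                continue  # parameter is ignored under this parent value
        active[name] = value
    return active


if __name__ == "__main__":
    for name, value in active_parameters(TABLE_6).items():
        print(name, value)
```

Applied to Table 6, the filter keeps q0, nnls and dlb, since acs and local search level 3 are selected, and drops rasrank and elitistants.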
Discrete nodes (probabilities):

Node         Level  Condition        Prior  Learned
algorithm    as     -                0.20   0.13
algorithm    mmas   -                0.20   0.25
algorithm    eas    -                0.20   0.16
algorithm    ras    -                0.20   0.13
algorithm    acs    -                0.20   0.33
localsearch  0      -                0.25   0.17
localsearch  1      -                0.25   0.17
localsearch  2      -                0.25   0.19
localsearch  3      -                0.25   0.47
dlb          0      localsearch = 0  0.50   0.50
dlb          0      localsearch = 1  0.50   0.50
dlb          0      localsearch = 2  0.50   0.52
dlb          0      localsearch = 3  0.50   0.51
dlb          1      localsearch = 0  0.50   0.50
dlb          1      localsearch = 1  0.50   0.50
dlb          1      localsearch = 2  0.50   0.48
dlb          1      localsearch = 3  0.50   0.49

Continuous nodes (s2 and Intercept; row labels give the levels of the parent variables):

Node         Level    Prior s2  Prior Intercept  Learned s2  Learned Intercept
alpha        -        5.00      2.50             2.14        2.62
beta         -        16.00     5.00             7.08        4.74
rho          -        0.16      0.50             0.07        0.47
ants         :0       900.00    50.00            409.09      50.00
ants         :1       900.00    50.00            409.09      50.00
ants         :2       900.00    50.00            392.55      52.09
ants         :3       900.00    50.00            360.66      48.81
nnants       -        400.00    25.00            170.24      25.61
nnls         :0       400.00    25.00            181.82      25.00
nnls         :1       400.00    25.00            181.82      25.00
nnls         :2       400.00    25.00            193.92      26.16
nnls         :3       400.00    25.00            168.19      26.95
q0           :as      0.25      0.25             0.11        0.25
q0           :mmas    0.25      0.25             0.08        0.22
q0           :eas     0.25      0.25             0.11        0.27
q0           :ras     0.25      0.25             0.11        0.25
q0           :acs     0.25      0.25             0.10        0.33
rasrank      :as:0    900.00    50.00            300.00      50.00
rasrank      :mmas:0  900.00    50.00            300.00      50.00
rasrank      :eas:0   900.00    50.00            300.00      50.00
rasrank      :ras:0   900.00    50.00            300.00      50.00
rasrank      :acs:0   900.00    50.00            300.00      50.00
rasrank      :as:1    900.00    50.00            300.00      50.00
rasrank      :mmas:1  900.00    50.00            300.00      50.00
rasrank      :eas:1   900.00    50.00            300.00      50.00
rasrank      :ras:1   900.00    50.00            300.00      50.00
rasrank      :acs:1   900.00    50.00            300.00      50.00
rasrank      :as:2    900.00    50.00            300.00      50.00
rasrank      :mmas:2  900.00    50.00            261.26      48.80
rasrank      :eas:2   900.00    50.00            300.00      50.00
rasrank      :ras:2   900.00    50.00            300.00      50.00
rasrank      :acs:2   900.00    50.00            579.92      34.67
rasrank      :as:3    900.00    50.00            300.00      50.00
rasrank      :mmas:3  900.00    50.00            711.50      49.82
rasrank      :eas:3   900.00    50.00            367.75      47.43
rasrank      :ras:3   900.00    50.00            300.00      50.00
rasrank      :acs:3   900.00    50.00            260.39      41.76
elitistants  :as      10000.00  350.00           4444.44     350.00
elitistants  :mmas    10000.00  350.00           5663.26     358.17
elitistants  :eas     10000.00  350.00           4395.13     356.89
elitistants  :ras     10000.00  350.00           4444.44     350.00
elitistants  :acs     10000.00  350.00           4723.62     341.31

Table 7: Conditional probability tables for each variable in the ACO-TSP case. On the left, the parameters of the prior local distributions. On the right, the parameters of the local distributions after the execution of the learning algorithm.

6 Conclusions

We presented a novel application of graphical models and Bayesian learning to the automatic tuning of parameters of optimization algorithms. Graphical models allow us to encode prior knowledge about the parameters so that the tuning experiments can be carried out more effectively. Bayesian calculus provides the framework to learn from observed results while taking the prior knowledge into account, and hence to weight appropriately the new information collected without taking premature decisions that might merely reflect stochastic effects. The tuning algorithm also makes use of rare event simulation, in the sense that the probability learned through the graphical model is the probability of sampling the configuration with the best average performance on a class of instances.

The proposed method can treat all types of parameters arising in tuning: categorical, discrete and continuous. Moreover, it can deal with probabilistic dependencies as well as deterministic dependencies, that is, the nesting of parameters under choices of parent parameters. Graphical models that include mixed discrete and continuous parameters are less common in the literature and are still an active field of research.

We implemented and tested the method on a few examples. The results seem to indicate that the method behaves well and is competitive with other available methods, both in terms of quality of the final solutions and in terms of computational resources used. The algorithm can be halted by the user according to the computational resources that the user is willing to allocate. Tests have shown that the tuning algorithm is quite robust to changes of its main parameters. In addition, at the end of the process the learned graphical model provides more information than usual tuning methods. Besides the best configuration, it also provides a way to understand which parameters most affect the results and the importance of parameters in interaction with their parents.

We believe that this work opens an interesting new direction of research in algorithm tuning and configuration. Certainly, more experiments are needed to assess the performance of the proposed tuning algorithm, comparing it with other methods from the literature. In addition, more investigation is needed for the cases where the prior probability is deceiving.

It is easy to envision other applications of this method to algorithms for optimization. With exact algorithms, both in combinatorial optimization and in decision problems, running time is the performance to measure. In this context the parallel implementation of the tuning algorithm is very appealing, as it makes it possible to truncate runs of configurations as soon as the desired number of configurations on which we want learning to occur have found a solution. Further, the possibility to include a priori knowledge can be used to learn fast on small instances and then carry the inference over to larger instances that are computationally more costly. Instance parameters could be included in the graphical model, thus allowing us to infer the best algorithm given some specific features of the instances. Finally, in our work we restricted ourselves to learning the parameters of the probability distributions. It is, however, also possible to learn the structure of the network. A considerable amount of work exists on this topic, which constitutes the next step towards even better tuning methods.

Acknowledgements

Marco Chiarandini acknowledges support from The Danish Council for Independent Research | Natural Sciences. Mauro Birattari and Thomas Stützle acknowledge support from the fund for scientific research FRS-FNRS of the French Community of Belgium, of which they are research associates.
References

[1] D. L. Applegate, R. E. Bixby, V. Chvátal, and W. J. Cook. The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2006.
[2] P. Balaprakash, M. Birattari, and T. Stützle. Improvement strategies for the F-race algorithm: Sampling design and iterative refinement. In T. Bartz-Beielstein, M. J. B. Aguilera, C. Blum, B. Naujoks, A. Roli, G. Rudolph, and M. Sampels, editors, Hybrid Metaheuristics, volume 4771 of Lecture Notes in Computer Science, pages 108–122. Springer, 2007.
[3] J. Bang-Jensen, M. Chiarandini, Y. Goegebeur, and B. Jørgensen. Mixed models for the analysis of local search components. In T. Stützle, M. Birattari, and H. Hoos, editors, Proceedings of the first International Workshop on Engineering Stochastic Local Search Algorithms (SLS 2007), volume 4638 of Lecture Notes in Computer Science, pages 91–105. Springer, 2007.
[4] T. Bartz-Beielstein. Experimental Research in Evolutionary Computation. The New Experimentalism. Natural Computing Series. Springer Berlin Heidelberg, 2006.
[5] T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors. Experimental Methods for the Analysis of Optimization Algorithms. Springer, Germany, November 2010. (479 pp).
[6] T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors. Proceedings of the Workshop on Experimental Methods for the Assessment of Computational Systems, Krakow, Poland, September 2010. Algorithm Engineering Reports, TR10-2-007, Technische Universität Dortmund, Germany. (65 pp).
[7] R. Bellio, L. D. Gaspero, and A. Schaerf. Design and statistical analysis of a hybrid local search algorithm for course timetabling. Journal of Scheduling, 2011.
[8] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons, Chichester, 1994.
[9] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In W. B. Langdon, E. Cantú-Paz, K. Mathias, R. Roy, D. Davis, R. Poli, K. Balakrishnan, V. Honavar, G. Rudolph, J. Wegener, L. Bull, M. Potter, A. Schultz, J. Miller, E. Burke, and N. Jonoska, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002), pages 11–18, New York, 9-13 July 2002. Morgan Kaufmann Publishers.
[10] M. Birattari, Z. Yuan, P. Balaprakash, and T. Stützle. F-race and iterated F-race: An overview. In T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors, Experimental Methods for the Analysis of Optimization Algorithms, pages 311–336. Springer Berlin Heidelberg, 2010.
[11] B. Bischl, O. Mersmann, and H. Trautmann. Resampling methods in model validation. In T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors, Proceedings of Workshop on Experimental Methods for the Assessment of Computational Systems joint to PPSN2010, Krakow, Poland, number TR10-2-007 in Computer Science Series of the Dortmund University, 2010.
[12] S. G. Bøttcher. Learning Bayesian networks with mixed variables. Artificial Intelligence and Statistics, pages 149–156, 2001.
[13] S. G. Bøttcher. Learning Bayesian Networks with Mixed Variables. PhD thesis, Department of Mathematical Sciences, Aalborg University, 2004.
[14] S. G. Bøttcher and C. Dethlefsen. deal: Learning Bayesian Networks with Mixed Variables, 2009. R package version 1.2-33.
[15] P. Bratley, B. L. Fox, and H. Niederreiter. Algorithm 738 - programs to generate Niederreiter's low-discrepancy sequences. ACM Transactions on Mathematical Software, 20(4):494–495, Dec. 1994.
[16] M. Chiarandini, C. Fawcett, and H. H. Hoos. A modular multiphase heuristic solver for post enrolment course timetabling. In Proceedings of the 7th International Conference on the Practice and Theory of Automated Timetabling, pages 1–6, Montréal, 2008.
[17] M. Chiarandini and Y. Goegebeur. Mixed models for the analysis of optimization algorithms. In T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors, Experimental Methods for the Analysis of Optimization Algorithms, pages 225–264. Springer, Germany, 2010. Preliminary version available as Tech. Rep. DMF-2009-07-001 at The Danish Mathematical Society.
[18] M. Dorigo and T. Stützle. Ant Colony Optimization. MIT Press, Cambridge, MA, USA, 2004.
[19] M. Gagliolo. Online Dynamic Algorithm Portfolios. PhD thesis, IDSIA/USI, 2010.
[20] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall, Great Britain, first edition, 1995.
[21] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publisher, The Netherlands, 1998.
[22] M. Hahsler and K. Hornik. TSP: Traveling Salesperson Problem (TSP), 2010. R package version 1.0-1.
[23] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[24] F. Hutter, T. Bartz-Beielstein, H. H. Hoos, K. Leyton-Brown, and K. Murphy. Sequential model-based parameter optimisation: an experimental investigation of automated and interactive approaches. In Bartz-Beielstein et al. [5]. (479 pp).
[25] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration (extended version). Technical Report TR-2010-10, University of British Columbia, Department of Computer Science, 2010. Available online: http://www.cs.ubc.ca/~hutter/papers/10-TRSMAC.pdf.
[26] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, October 2009.
[27] D. Johnson, G. Gutin, C. McGeoch, A. Yeo, X. Zhang, and A. Zverovitch. Experimental analysis of heuristics for the ATSP. In G. Gutin and A. Punnen, editors, The Traveling Salesman Problem and Its Variations, pages 445–487. Kluwer Academic Publishers, Boston, MA, USA, 2002. Available as technical report at http://www2.research.att.com/~dsj/papers/stspchap.pdf.
[28] D. Johnson, L. McGeoch, C. Rego, and F. Glover. 8th DIMACS implementation challenge, 2001.
[29] M. I. Jordan, editor. Learning in Graphical Models. Kluwer Academic Publisher, The Netherlands, 1998.
[30] J. P. C. Kleijnen, S. M. Sanchez, T. W. Lucas, and T. M. Cioppa. State-of-the-art review: A user's guide to the brave new world of designing simulation experiments. INFORMS Journal on Computing, 17(3):263–289, 2005.
[31] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, October 2001.
[32] H. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
[33] M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.
[34] J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965. An Errata has been published in The Computer Journal 1965 8(1):27.
[35] M. Pelikan. Hierarchical Bayesian Optimization Algorithm: Toward a new Generation of Evolutionary Algorithms, volume 170 of Studies in Fuzziness and Soft Computing. Springer, 2005.
[36] M. Pelikan, D. E. Goldberg, and E. Cantú-Paz. BOA: The Bayesian optimization algorithm. In W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar, M. Jakiela, and R. E. Smith, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-1999), volume I, pages 525–532. Morgan Kaufmann Publishers, San Francisco, CA, USA, 13-17 1999.
[37] M. Pelikan, D. E. Goldberg, and F. G. Lobo. A survey of optimization by building and using probabilistic models. Computational Optimization and Applications, 21(1):5–20, Jan. 2002.
[38] P.-T. de Boer, D. Kroese, S. Mannor, and R. Rubinstein. A tutorial on the cross-entropy method, 2003.
[39] J. R. Rice. The algorithm selection problem. Technical Report CSD-TR 152, Purdue University, Computer Science Technical Reports, 1975. Published in Advances in Computers, Vol. 15, Academic Press, 1976.
[40] E. Ridge and D. Kudenko. Analyzing heuristic performance with response surface models: prediction, optimization and robustness. In H. Lipson, editor, Proceedings of GECCO, pages 150–157. ACM, 2007.
[41] P. J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.
[42] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, New Jersey, USA, second edition, 2003.
[43] P. Sanders. Algorithm engineering - an attempt at a definition. In S. Albers, H. Alt, and S. Näher, editors, Efficient Algorithms, volume 5760 of Lecture Notes in Computer Science, pages 321–340. Springer, 2009.
[44] T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Experiments. Springer, 2003.
[45] S. K. Smit and A. E. Eiben. Using entropy for parameter analysis of evolutionary algorithms. In T. Bartz-Beielstein, M. Chiarandini, L. Paquete, and M. Preuss, editors, Experimental Methods for the Analysis of Optimization Algorithms, pages 287–310. Springer Berlin Heidelberg, 2010.
[46] D. J. Spiegelhalter and S. L. Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579–605, 1990.
[47] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009.
[48] P. Winker, M. Lyra, and C. Sharpe. Least median of squares estimation by optimization heuristics with an application to the CAPM and a multi-factor model. Computational Management Science, pages 1–21, 2009.