Efficient Bayesian Optimisation Using Derivative Meta-Model

Ang Yang, Cheng Li, Santu Rana, Sunil Gupta, and Svetha Venkatesh

Center for Pattern Recognition and Data Analytics, Deakin University, Geelong
{leon.yang; cheng.l; santu.rana; sunil.gupta; svetha.venkatesh}@deakin.edu.au

Abstract. Bayesian optimisation is an efficient method for global optimisation of expensive black-box functions. However, the current Gaussian process based methods cater to functions with arbitrary smoothness, and do not explicitly model the fact that most real world optimisation problems are well-behaved functions with only a few peaks. In this paper, we incorporate such shape constraints through the use of a derivative meta-model. The derivative meta-model is built using a Gaussian process with a polynomial kernel, and derivative samples from this meta-model are used as extra observations in the standard Bayesian optimisation procedure. We provide a Bayesian framework to infer the degree of the polynomial kernel. Experiments on both benchmark functions and hyperparameter tuning problems demonstrate the superiority of our approach over baselines.

Keywords: Bayesian optimisation, Gaussian process, Meta learning, Derivative-based

1 Introduction

Bayesian optimisation (BO) of black-box functions [1] often uses a Gaussian process (GP) as the prior over the latent function. A GP is specified by a mean function and a covariance function, and the squared exponential (SE) kernel is a popular choice of covariance function [2]. The posterior distribution is computed by combining the likelihood of the observations with the GP prior. A utility function that combines the mean and variance of the posterior GP is then used to determine the next point at which to evaluate the black-box function.

Most real world functions, however, whether they arise from physical experiments or from hyperparameter tuning, are well behaved: they are smooth and have only a small number of local peaks. If such knowledge can be harnessed, BO may converge faster. BO algorithms for well-behaved functions have been addressed only in limited contexts, when either the function is monotonic [3] or it has a concave/convex shape [4]. BO methods for functions with more general shape properties, such as incorporating the knowledge that the function has only a few peaks, have not been addressed before, and this remains an open problem.

Addressing that, we propose a new method that can flexibly incorporate the shape of the function through a derivative meta-model. The derivative meta-model is built using a polynomial. To maintain the Bayesian flavour and the ability to estimate the meta-model from a few observations, we use a Gaussian process with a polynomial kernel (GPPK) for the meta-model. Based on the observed data we fit the GPPK and then sample derivative values for use in the main GP of the BO procedure. In effect, the main GP is built on a trade-off between the flexible model induced by the stationary kernel and the structure induced by the derivative information from the GPPK. We refrain from using samples of the function values from the GPPK because we only want to pass the shape information through derivatives, while keeping the function values guided mostly by the main GP.

The crucial step in this scheme is setting the degree of the polynomial kernel. We use a Bayesian formulation to estimate the degree from the observed data. We place a truncated geometric prior on the degree, cut off at degree 10 and then normalised, which encodes a preference for lower degrees. The posterior is then computed from the marginal likelihood of the GPPK on the observed data, and the mode of the posterior is used as the degree of our derivative meta-model.

We demonstrate our method on three synthetic examples and on hyperparameter tuning for two machine learning algorithms. We compare with BO without derivatives and BO with true derivatives on the synthetic examples, and only with BO without derivatives for hyperparameter tuning, since true derivatives are not available in that case. In all experiments our proposed method outperforms the baselines. In summary, our contributions are:

1. a new method to incorporate shape information in BO through a derivative meta-model;
2. a mechanism to estimate the parameter of the prior shape function through Bayesian inference;
3. validation on synthetic functions and on hyperparameter tuning applications.

2 Related Background

2.1 Bayesian optimisation

Bayesian optimisation has two main components. The first is modelling the unknown function using a GP as a prior. The second is searching for the next point at which to perform the experiment; this search is guided by a surrogate utility function called the acquisition function.

Gaussian Process. We briefly review GPs [2] here. A GP is a way of specifying prior distributions over the space of smooth functions. It is a distribution over functions, and the properties of the Gaussian distribution allow us to compute the predictive mean and variance in closed form. A GP is specified by its mean function µ(x) and covariance function k(x, x'). A sample from a GP is a function given as f(x) ∼ N(µ(x), k(x, x')), where N is a Gaussian distribution and x denotes a D-dimensional covariate vector. Without loss of generality, the prior mean function can be assumed to be the zero function, making the GP fully defined by the covariance function. A popular choice of kernel is the squared exponential function

k(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2}\frac{\|x - x'\|^2}{\rho_l^2}\right)

where ρ_l is the characteristic length scale and σ_f is the signal standard deviation.

We denote a set of observations D = {x_{1:t}, f_{1:t}}, where f_{1:t} = {f(x_i)}_{i=1}^t. The joint distribution of the observations D and a new observation {x_{t+1}, f_{t+1}} is still Gaussian. If the observation is a noisy estimate of the actual function value, then y = f(x) + ξ with ξ ∼ N(0, σ²_{noise}). The predictive distribution of f_{t+1} can then be written as P(f_{t+1} | D_{1:t}, x_{t+1}) = N(µ(x_{t+1}), σ²(x_{t+1})) with mean and variance

\mu(x_{t+1}) = \mathbf{k}^T [K + \sigma_{noise}^2 I]^{-1} \mathbf{f}_{1:t}, \qquad \sigma^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \mathbf{k}^T [K + \sigma_{noise}^2 I]^{-1} \mathbf{k}

where k = [k(x_{t+1}, x_1)  k(x_{t+1}, x_2)  ...  k(x_{t+1}, x_t)] and K is the kernel matrix

K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_t) \\ \vdots & \ddots & \vdots \\ k(x_t, x_1) & \cdots & k(x_t, x_t) \end{bmatrix}    (1)
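As a concrete illustration of these predictive equations, the following NumPy sketch (our illustration, not the authors' implementation; the SE hyperparameter values are the illustrative ones reported later in Section 4) computes the posterior mean and variance of a zero-mean GP with the SE kernel:

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, rho_l=0.1):
    """Squared exponential kernel k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 rho_l^2))."""
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dist / rho_l**2)

def gp_predict(X, f, x_new, sigma_noise=0.01):
    """Posterior mean and variance of a zero-mean GP at a single point x_new."""
    K = se_kernel(X, X) + sigma_noise**2 * np.eye(len(X))   # kernel matrix of Eq. (1) plus noise
    k = se_kernel(x_new[None, :], X).ravel()                # k = [k(x_new, x_1), ..., k(x_new, x_t)]
    K_inv = np.linalg.inv(K)
    mu = k @ K_inv @ f                                      # mu(x_{t+1})
    var = se_kernel(x_new[None, :], x_new[None, :])[0, 0] - k @ K_inv @ k  # sigma^2(x_{t+1})
    return mu, var

# Example: three 1D observations of an unknown function
X = np.array([[0.1], [0.4], [0.9]])
f = np.array([0.2, 0.7, -0.1])
print(gp_predict(X, f, np.array([0.5])))
```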

Acquisition Functions. The acquisition function is defined so that it effectively trades off exploitation and exploration. Exploitation refers to areas where the mean prediction of the function value is high, while exploration refers to areas where the epistemic uncertainty about the function value is high. In this paper we use expected improvement (EI) as the criterion. Assume that our optimisation problem is maximising f(x) and the current maximum is f(x⁺). The improvement function I(x) [5] is written as I(x) = max{0, f(x) − f(x⁺)}. The analytic form of E(I(x)) can be obtained as [5]

E(I(x)) = \begin{cases} (\mu(x) - f(x^+))\Phi(z) + \sigma(x)\phi(z) & \text{if } \sigma(x) > 0 \\ 0 & \text{if } \sigma(x) = 0 \end{cases}

where z = (µ(x) − f(x⁺))/σ(x), and Φ(z) and φ(z) are the CDF and PDF of the standard normal distribution.
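A minimal implementation of this EI criterion (a sketch, assuming the posterior mean and standard deviation come from a GP such as the one sketched above):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for maximisation: E[max(0, f(x) - f(x+))] under the GP posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        return 0.0
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```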

3 Framework

We propose a new method to incorporate shape information about the objective function through the use of a derivative meta-model. First, we describe the construction of the meta-model using a Gaussian process with a polynomial kernel (GPPK). Next, we describe how derivative information is sampled from the GPPK and used in the main Gaussian process. Finally, we present a Bayesian approach to estimate the polynomial degree from the observations.

3.1 Meta-model

Early research has investigated the use of polynomial curve fitting in optimisation. Specifically, in [6] the authors theoretically explained the mechanism of curve fitting in global optimisation of expensive black-box functions. Motivated by the usefulness of prior shape information [7], in our framework we use a Gaussian process with a polynomial kernel to fit the observed data and then estimate derivative information.

Gaussian process with polynomial kernel (GPPK). As noted above, a GP allows us to compute the predictive mean and variance in closed form and is fully defined by its covariance function. The kernel in the GPPK is defined as k(x, x') = (c + x · x')^d, where c is the kernel offset and d is the degree of the polynomial. The covariance matrix K can be computed following Eq. (1), and the mean function of the GPPK is

\mu(x) = \mathbf{k} K^{-1} \mathbf{y}    (2)

where k = [(c + x · x_1)^d  (c + x · x_2)^d  ...  (c + x · x_t)^d].

Derivative estimation. We can now estimate the derivative values at our observations by differentiating the mean function in Eq. (2):

\nabla f = \frac{\partial}{\partial x}\mu(x) = \mathbf{k}' K^{-1} \mathbf{y}    (3)

where k' = [d(c + x · x_1)^{d-1} x_1  d(c + x · x_2)^{d-1} x_2  ...  d(c + x · x_t)^{d-1} x_t].
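The following sketch implements Eqs. (2)-(3) to estimate gradients from the GPPK mean function (our illustration, not the authors' code; the kernel offset c = 0.5 matches the setting reported in Section 4, while the jitter term is an assumption added for numerical stability):

```python
import numpy as np

def gppk_gradient(X, y, c=0.5, d=2, jitter=1e-6):
    """Return a function that evaluates the gradient of the GPPK mean (Eq. 3) at any x.

    X : (t, D) observed inputs, y : (t,) observed function values.
    """
    K = (c + X @ X.T) ** d                                     # polynomial kernel matrix
    alpha = np.linalg.solve(K + jitter * np.eye(len(X)), y)    # K^{-1} y

    def grad(x):
        # d/dx (c + x . x_i)^d = d (c + x . x_i)^(d-1) x_i; weight each term by alpha_i
        factors = d * (c + X @ x) ** (d - 1)                   # (t,) scalar factors
        return (factors[:, None] * X).T @ alpha                # (D,) estimated gradient k' K^{-1} y
    return grad

# Example: 1D data with a roughly quadratic trend
X = np.array([[0.0], [0.5], [1.0], [1.5]])
y = np.array([0.0, 0.25, 1.0, 2.25])
print(gppk_gradient(X, y)(np.array([1.0])))   # estimated derivative at x = 1.0
```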

3.2 BO with estimated derivatives

Since the derivative of a Gaussian process is still a GP [8], the joint distribution of function values and derivatives is analytically tractable. For the squared exponential covariance function, the covariance between a function value and a partial derivative can be written as [3]

\mathrm{cov}\left(f^{(i)}, \frac{\partial f^{(j)}}{\partial x_g^{(j)}}\right) = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{b=1}^{D}\rho_l^{-2}\left(x_b^{(i)} - x_b^{(j)}\right)^2\right) \times \rho_l^{-2}\left(x_g^{(i)} - x_g^{(j)}\right)

and the covariance between partial derivatives is given as

\mathrm{cov}\left(\frac{\partial f^{(i)}}{\partial x_g^{(i)}}, \frac{\partial f^{(j)}}{\partial x_h^{(j)}}\right) = \sigma_f^2 \exp\left(-\frac{1}{2}\sum_{b=1}^{D}\rho_l^{-2}\left(x_b^{(i)} - x_b^{(j)}\right)^2\right) \times \rho_l^{-2}\left(\delta_{gh} - \rho_l^{-2}\left(x_h^{(i)} - x_h^{(j)}\right)\left(x_g^{(i)} - x_g^{(j)}\right)\right)

where δ_gh = 1 if g = h, and δ_gh = 0 if g ≠ h.

Using the GP we can now derive the posterior over a new function value f_{t+1} at x_{t+1}, given a set of observed function values and a set of derivative information. We use K̄_{[f_{1:t}, ∇f_{1:t}]} to denote the joint covariance matrix over the set of observed function values and the estimated derivatives. The joint distribution of [f_{1:t}, ∇f_{1:t}, f_{t+1}] is then

\begin{bmatrix} \mathbf{f}_{1:t} \\ \nabla\mathbf{f}_{1:t} \\ f_{t+1} \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} \bar{K}_{[\mathbf{f}_{1:t}, \nabla\mathbf{f}_{1:t}]} & \bar{\mathbf{k}} \\ \bar{\mathbf{k}}^T & k(x_{t+1}, x_{t+1}) \end{bmatrix}\right)    (4)

where k̄ is the vector of covariances between f_{t+1} and [f_{1:t}, ∇f_{1:t}]. The predictive distribution at x_{t+1} is a normal distribution N(µ̄, σ̄²), where µ̄(x_{t+1}) and σ̄²(x_{t+1}) are given by Eqs. (5) and (6).

Algorithm 1 Bayesian Optimisation using Derivative Meta-model (BODMM)
1: for n = 1, 2, ..., t do
2:   Fit the data D using the GPPK
3:   Estimate derivative values ∇f from the GPPK via Eq. (3)
4:   Build a GP with the function observations and the estimated derivatives
5:   Find x_{t+1} by maximising the acquisition function: x_{t+1} = argmax_x EI(x | D)
6:   Evaluate the objective function: y_{t+1} = f(x_{t+1}) + ξ
7:   Augment the observation set D = D ∪ {(x_{t+1}, y_{t+1})}
8: end for

\bar{\mu}(x_{t+1}) = \bar{\mathbf{k}}^T \bar{K}_{[\mathbf{f}_{1:t}, \nabla\mathbf{f}_{1:t}]}^{-1} [\mathbf{f}_{1:t}, \nabla\mathbf{f}_{1:t}]    (5)

\bar{\sigma}^2(x_{t+1}) = k(x_{t+1}, x_{t+1}) - \bar{\mathbf{k}}^T \bar{K}_{[\mathbf{f}_{1:t}, \nabla\mathbf{f}_{1:t}]}^{-1} \bar{\mathbf{k}}    (6)

We then use Eq. (5) and Eq. (6) to construct the acquisition function and perform BO. The proposed method is described in Algorithm 1.
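To make Eqs. (4)-(6) concrete, here is a simplified 1D sketch (our illustration, not the authors' code) that builds the joint covariance K̄ over function values and derivative observations for the SE kernel and evaluates the predictive mean and variance:

```python
import numpy as np

def se(a, b, sigma_f=1.0, rho_l=0.1):
    return sigma_f**2 * np.exp(-0.5 * (a - b)**2 / rho_l**2)

def joint_covariance(X, rho_l=0.1):
    """K_bar over [f_1:t, grad f_1:t] for 1D inputs X, using the SE derivative covariances."""
    diff = X[:, None] - X[None, :]
    Kff = se(X[:, None], X[None, :], rho_l=rho_l)               # cov(f_i, f_j)
    Kfd = Kff * diff / rho_l**2                                 # cov(f_i, df_j/dx)
    Kdd = Kff * (1.0 / rho_l**2 - diff**2 / rho_l**4)           # cov(df_i/dx, df_j/dx)
    return np.block([[Kff, Kfd], [Kfd.T, Kdd]])

def predict_with_derivatives(X, f, grads, x_new, rho_l=0.1, noise=1e-4):
    K_bar = joint_covariance(X, rho_l) + noise * np.eye(2 * len(X))
    k_f = se(x_new, X, rho_l=rho_l)                             # cov(f_new, f_i)
    k_d = k_f * (x_new - X) / rho_l**2                          # cov(f_new, df_i/dx)
    k_bar = np.concatenate([k_f, k_d])
    target = np.concatenate([f, grads])
    K_inv = np.linalg.inv(K_bar)
    mu = k_bar @ K_inv @ target                                 # Eq. (5)
    var = se(x_new, x_new, rho_l=rho_l) - k_bar @ K_inv @ k_bar # Eq. (6)
    return mu, var
```

In BODMM, the `grads` entries would come from the GPPK gradient sketch above rather than from true derivatives.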

3.3 Degree estimation

With degree d = 2 in the polynomial kernel, the posterior mean function of the GPPK is at most quadratic. While a quadratic meta-model can be sufficient in the majority of cases, we also provide a mechanism to estimate the degree of the polynomial if desired. To achieve this, we infer the degree through Bayesian inference, in which the posterior probability of a variable is proportional to the product of the prior and the likelihood. In our case, the posterior of the degree d is

p(d \mid X, \mathbf{y}) \propto p(d)\, p(\mathbf{y} \mid X, d)    (7)

The prior p(d) represents our belief about the degree. Since the degree is discrete, we choose the geometric distribution as our prior. The geometric distribution gives the probability that the first success requires m independent trials, each with success probability q:

p(d = m) = (1 - q)^{m-1} q    (8)

where m = 1, 2, 3, .... In practice we do not expect d to exceed a moderate value such as 10, so we use a truncated geometric distribution with normalisation. The likelihood p(y | X, d) in Eq. (7) is the marginal likelihood of the GP with the polynomial kernel, computed as

\log p(\mathbf{y} \mid X, d) = -\frac{1}{2}\mathbf{y}^T (K + \sigma^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log|K + \sigma^2 I| - \frac{n}{2}\log 2\pi    (9)

Given Eq. (8) and Eq. (9), we can compute the posterior as in Eq. (7), and we then use the mode of the posterior as the estimated degree. Once d is inferred, we use it directly in the meta-model and its derivative estimation (Eqs. (2) and (3)).
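A sketch of this degree-selection step (our illustration; the noise variance is an assumed value, and the truncation at d_max = 10 with renormalisation follows the description above):

```python
import numpy as np

def degree_posterior(X, y, q=0.5, d_max=10, c=0.5, sigma_noise=0.01):
    """Posterior over the polynomial degree d, Eq. (7): truncated geometric prior (8)
    times the GPPK marginal likelihood (9), renormalised over d = 1..d_max."""
    log_post = []
    n = len(X)
    for d in range(1, d_max + 1):
        log_prior = (d - 1) * np.log(1.0 - q) + np.log(q)           # geometric prior, Eq. (8)
        K = (c + X @ X.T) ** d + sigma_noise**2 * np.eye(n)
        _, logdet = np.linalg.slogdet(K)
        log_lik = (-0.5 * y @ np.linalg.solve(K, y)
                   - 0.5 * logdet - 0.5 * n * np.log(2.0 * np.pi))  # Eq. (9)
        log_post.append(log_prior + log_lik)
    log_post = np.array(log_post)
    post = np.exp(log_post - log_post.max())
    return np.arange(1, d_max + 1), post / post.sum()               # normalised posterior

# The mode of the posterior is taken as the degree of the meta-model:
# degrees, post = degree_posterior(X, y); d_hat = degrees[np.argmax(post)]
```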

4 Experiments

We first examine the capability of the GPPK to capture the shape of a function; the results show that the GPPK can approximately capture 1D and 2D functions. We then evaluate our method on three benchmark functions and on real world applications of hyperparameter tuning for two machine learning algorithms. We compare the proposed method BODMM with the following baselines on the benchmark functions:

– Bayesian optimisation without derivative observations (Standard BO).
– Bayesian optimisation using true derivative values (BOTD).

As true derivative values are not available in real applications, we only compare with Standard BO in that setting. In all experiments we use EI as the acquisition function and the SE kernel as the covariance function, and we use DIRECT [9] to optimise the acquisition function. For the kernel parameters, we use an isotropic length scale ρ_l = 0.1, signal variance σ_f² = 1 and noise variance σ²_noise = (0.01)². In the GPPK we use the kernel offset c = 0.5, and q = 0.5 in the truncated geometric distribution. We run each algorithm for 100 trials with different initialisations and report the simple regret and standard errors for the benchmark functions, and accuracy for the hyperparameter tuning tasks. Simple regret is defined as r_t = f(x*) − f(x_t⁺), where f(x*) is the global optimum and f(x_t⁺) = max_{x∈{x_{1:t}}} f(x) is the current best value.

4.1 Experiment with benchmark test functions

We test our algorithm on three benchmark functions:

1. 3D Hartmann (Hartmann-3D). The global minimum is f(x*) = −3.86278 at x* = (0.114614, 0.555649, 0.852547), with the search space in [0, 1] for each dimension.
2. 2D Branin's function (Branin-2D). The global minimum is f(x*) = 0.397887 at x* = (π, 2.275), with the search space in [0, 4] for each dimension (the standard form of this function is sketched below for reference).
3. Unnormalised 5D Gaussian PDF (GaussianPDF-5D). The global maximum is f(x*) = 1 at x* = (1, 1, 1, 1, 1), with the search space in [0, 2] for each dimension.

We set a convergence threshold of reaching within 10% of the optimum. We did not apply this threshold to Branin's function, since BO converges quickly on it.

We first examine our algorithm on Hartmann-3D. Fig. 1a plots the simple regret versus iterations for the three algorithms; for the proposed BODMM we run it with a fixed degree and with the estimated degree. BO using true derivative values performs best among the three algorithms and converges after 10 iterations, which is expected since true derivatives are incorporated in that algorithm. Our algorithm with both fixed and estimated degree outperforms Standard BO, and the estimated-degree setting performs better than the fixed-degree one. Fig. 1b shows the estimated degree at each iteration. We also obtain positive results on the other test functions: results for Branin's function and GaussianPDF-5D are shown in Fig. 1c and Fig. 1d respectively.
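For reference, the standard form of the Branin function mentioned in item 2 above is given in the sketch below (our illustration; the constants are the commonly used values, while the restricted search domain follows the paper):

```python
import numpy as np

def branin(x1, x2):
    """Standard Branin function; one of its global minima (value ~0.397887) lies at (pi, 2.275)."""
    a, b, c = 1.0, 5.1 / (4.0 * np.pi**2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8.0 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r)**2 + s * (1.0 - t) * np.cos(x1) + s

# Since BO is framed as maximisation, the objective would be the negated function over [0, 4]^2:
objective = lambda x: -branin(x[0], x[1])
```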

Fig. 1: Simple regret vs iterations for (a) Hartmann-3D function, (b) estimated degree d in the GPPK for optimising Hartmann-3D, (c) Branin's function, (d) GaussianPDF-5D.

4.2 Hyperparameter Tuning

We experiment with three real world datasets for tuning the hyperparameters of two classifiers: Support Vector Machines (SVM) and Elastic Net. For SVM we optimise two hyperparameters: the cost parameter C and the width of the RBF kernel γ. The search bounds are C = 10^λ with λ ∈ [−3, 3] and γ = 10^ω with ω ∈ [−5, 0]; to make the search bounds manageable, we optimise over λ and ω. For Elastic Net, the hyperparameters are the l1 and l2 penalty weights. The search bound for both is [10⁻⁵, 10⁻²], and we optimise in the range of exponents [−5, −2]. All three datasets, BreastCancer, LiverDisorders and Mushrooms, are publicly available from the UCI data repository [10].
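As an illustration of how such a tuning objective can be set up (a sketch using scikit-learn; the dataset loader and the cross-validation protocol here are our assumptions, not the authors' exact experimental pipeline):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_data, y_data = load_breast_cancer(return_X_y=True)   # stand-in for the UCI BreastCancer dataset

def svm_objective(params):
    """Black-box objective for BO: cross-validated accuracy of an RBF-kernel SVM.

    params = (lam, omega) are the exponents searched by BO,
    with C = 10^lam, lam in [-3, 3] and gamma = 10^omega, omega in [-5, 0].
    """
    lam, omega = params
    clf = SVC(C=10.0 ** lam, gamma=10.0 ** omega, kernel="rbf")
    return cross_val_score(clf, X_data, y_data, cv=5).mean()

# One evaluation of the black-box function at a point in the search space
print(svm_objective((0.0, -2.0)))
```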

Fig. 2: Accuracy vs iterations for (a) hyperparameter tuning for SVM on three datasets: BreastCancer, LiverDisorders and Mushrooms, and (b) hyperparameter tuning for Elastic Net on the same three datasets.

The results for hyperparameter tuning are shown in Fig. 2. In all cases our approach BODMM performs better than Standard BO. For example, in the leftmost plot of Fig. 2b, Standard BO reaches an accuracy of 0.89 after 40 iterations while our algorithm reaches 0.97.

5 Conclusion

We have proposed a novel Bayesian optimisation method for well-behaved functions with a small number of peaks. We incorporate this information through a derivative meta-model based on a Gaussian process with a polynomial kernel. By controlling the degree of the polynomial we control the shape of the main Gaussian process, which is built using the SE kernel and whose covariance matrix is computed from both the observed function values and the derivative values sampled from the meta-model. We also provide a Bayesian way to estimate the degree of the polynomial based on a truncated geometric prior. In experiments on both benchmark test functions and hyperparameter tuning of popular machine learning models, our proposed model converged faster than the baselines.

Acknowledgment. This research was partially funded by the Australian Government through the Australian Research Council (ARC) and the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning. Professor Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References

1. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4) (1998) 455-492
2. Rasmussen, C.E., Williams, C.K.: Gaussian Processes for Machine Learning. Volume 1. MIT Press, Cambridge (2006)
3. Riihimäki, J., Vehtari, A.: Gaussian processes with monotonicity information. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS). (2010)
4. Jauch, M., Peña, V.: Bayesian optimization with shape constraints. arXiv preprint arXiv:1612.08915 (2016)
5. Mockus, J.: Application of Bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization 4(4) (1994) 347-365
6. Denison, D.G.T., Mallick, B.K., Smith, A.F.M.: Automatic Bayesian curve fitting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60(2) (1998) 333-350
7. Abdolmaleki, A., Lioutikov, R., Peters, J.R., Lau, N., Reis, L.P., Neumann, G.: Model-based relative entropy stochastic search. In: NIPS. (2015)
8. Solak, E., Murray-Smith, R., Leithead, W.E., Leith, D.J., Rasmussen, C.E.: Derivative observations in Gaussian process models of dynamic systems. In: NIPS. (2003)
9. Finkel, D.E.: DIRECT optimization algorithm user guide. CRSC (2003)
10. Dheeru, D., Karra Taniskidou, E.: UCI machine learning repository (2017)
