Gaussian Process Regression for Structured Data Sets

Mikhail Belyaev 1,2,3, Evgeny Burnaev 1,2,3, and Yermek Kapushev 1,2

1 Institute for Information Transmission Problems, Bolshoy Karetny per. 19, Moscow 127994, Russia
2 DATADVANCE, llc, Pokrovsky blvd. 3, Moscow 109028, Russia
3 PreMoLab, MIPT, Institutsky per. 9, Dolgoprudny 141700, Russia
[email protected]
{mikhail.belyaev,evgeny.burnaev}@datadvance.net


Abstract. Approximation algorithms are widely used in many engineering problems. To obtain a data set for approximation, a factorial design of experiments is often used. In such a case the size of the data set can be very large. Therefore, one of the most popular algorithms for approximation, Gaussian Process regression, can hardly be applied due to its computational complexity. In this paper a new approach to Gaussian Process regression in the case of a factorial design of experiments is proposed. It allows one to efficiently compute exact inference and to handle large multidimensional and anisotropic data sets.

Keywords: Gaussian process · Structured data · Regularization

1 Introduction

Gaussian Processes (GP) have become a popular tool for regression, with many applications in engineering problems [13]. They combine the simple structure of Bayesian inference and interpretable parameters with the ability to approximate a wide range of functions. However, GP regression is inapplicable to large training sets because its time complexity is O(N^3) and its memory complexity is O(N^2), where N is the size of the training sample. A significant amount of research concerns sparse approximations of GP regression, which reduce the time complexity to O(M^2 N) for some M ≪ N [1,9,12]. Several papers are dedicated to Mixtures of GPs and the Bayesian Committee Machine [13,14]. However, these are approximations to the GP. An exact GP with reduced time and memory complexity can be obtained by taking into account the structure of the Design of Experiments (DoE). In engineering problems factorial DoEs are often used [10]. In such designs there are several groups of variables, called factors; in each factor the variables take values from a finite set. The size of this set is called the factor size and its values are called levels. The Cartesian product of the factors forms the training set. The size of a factorial DoE can be very large as it grows exponentially with the dimension of the input variables.


There are several efficient methods based on splines which take such structure of the data into account, e.g. [17]. Several papers describe approaches to GP regression on a lattice which exploit the structure of the covariance matrix, e.g. [3]. Such techniques have O(N log N) time and O(N) memory complexity. A disadvantage of these methods is that they cannot be used in the case of multidimensional factors. Another problem which we are likely to encounter is that the factor sizes can vary significantly. We will refer to this property of a data set as anisotropy. Engineers usually use large factor sizes if the corresponding input variables have a big impact on the function values; otherwise the factor sizes are likely to be small, i.e. the factor sizes are often selected using knowledge from the subject domain [15]. The difference between factor sizes can lead to degeneracy of the GP model.

In this paper we propose an algorithm that takes into account the factorial nature of the DoE and allows one to efficiently calculate exact inference for GP regression. The proposed algorithm is designed for the general case of a factorial DoE in which factors are multidimensional. It is also discussed how to introduce regularization to take into account possible anisotropy of the training data set.

1.1 Approximation Problem


Let f(x) be some unknown smooth function. The task is, given a data set D = \{(x_i, y_i), x_i \in R^d, y_i \in R\}_{i=1}^N of N pairs of inputs x_i and outputs y_i, to construct an approximation \hat{f}(x) of the function f(x), assuming that the outputs y_i are corrupted by additive i.i.d. Gaussian noise:

y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma_{noise}^2).   (1)

1.2 Factorial Design of Experiments

In this paper a special case of factorial DoE is considered. Let us refer to the sets of points s_k = \{x_{i_k}^k \in X_k\}_{i_k=1}^{n_k}, X_k \subset R^{d_k}, k = 1, \dots, K, as factors. A set of points S is referred to as a factorial DoE if it is a Cartesian product of factors

S = s_1 \times s_2 \times \cdots \times s_K = \{[x_{i_1}^1, \dots, x_{i_K}^K], \{i_k = 1, \dots, n_k\}_{k=1}^K\}.   (2)

The elements of S are vectors of dimension d = \sum_{i=1}^K d_i and the sample size is the product of the sizes of all factors, N = \prod_{i=1}^K n_i. If all the factors are one-dimensional then S is a full factorial design. In the more general case factors are multidimensional (see Figure 1). Note that in this paper a factorial design is defined over continuous real-valued features.
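To make the construction concrete, below is a minimal sketch (not from the original paper) of how a factorial DoE of the form (2) can be assembled in Python from a list of factors, where each factor is simply an array of its points and may be multidimensional; the function name and the example factors are illustrative.

```python
import itertools
import numpy as np

def factorial_doe(factors):
    """Build the Cartesian-product design S = s_1 x ... x s_K of eq. (2).

    factors: list of arrays, the k-th of shape (n_k, d_k) holding the
    points of factor s_k (d_k may differ between factors).
    Returns an array of shape (prod(n_k), sum(d_k)).
    """
    return np.array([np.concatenate(point)
                     for point in itertools.product(*factors)])

# Example: a 1-dimensional factor with 4 levels and a 2-dimensional
# factor with 3 points, as in Figure 1.
rng = np.random.default_rng(0)
s1 = np.linspace(0.0, 1.0, 4).reshape(-1, 1)   # shape (4, 1)
s2 = rng.random((3, 2))                        # shape (3, 2)
S = factorial_doe([s1, s2])
print(S.shape)                                 # (4 * 3, 1 + 2) = (12, 3)
```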

1.3 Gaussian Process Regression

GP regression is a Bayesian approach in which a prior distribution over continuous functions is assumed to be a Gaussian Process, i.e.

f \,|\, X \sim \mathcal{N}(\mu, K_f),   (3)


Fig. 1. DoE with 2 factors: x_1 is a 1-dimensional factor and (x_2, x_3) is a 2-dimensional factor.

where f = (f(x_1), f(x_2), \dots, f(x_N)) is the vector of outputs, X = (x_1^T, x_2^T, \dots, x_N^T)^T is the matrix of inputs, \mu = (\mu(x_1), \mu(x_2), \dots, \mu(x_N)) is the vector of the mean function \mu(x), and K_f = \{k(x_i, x_j)\}_{i,j=1}^N is the covariance matrix of an a priori selected covariance function k. Without loss of generality the standard assumption of zero-mean data is made. We also assume that the observations are corrupted by Gaussian noise, y_i = f(x_i) + \varepsilon_i, \varepsilon_i \sim \mathcal{N}(0, \sigma_{noise}^2). For the prediction of f(x_*) at a data point x_* the posterior mean, conditioned on the observations y = (y_1, y_2, \dots, y_N), is used:

\hat{f}(x_*) = k(x_*)^T K_y^{-1} y,   (4)

where k(x_*) = (k(x_*, x_1), \dots, k(x_*, x_N))^T, K_y = K_f + \sigma_{noise}^2 I, and I is the identity matrix. Approximation accuracy is evaluated by the posterior variance

cov(\hat{f}(x_*)) = k(x_*, x_*) - k(x_*)^T K_y^{-1} k(x_*).   (5)

Let us denote the vector of hyperparameters by \theta. To choose the hyperparameters of our model we consider the log likelihood

\log p(y \,|\, X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi   (6)

and optimize it over the hyperparameters [13]. The complexity of learning GP regression is O(N^3) as one needs to calculate the inverse and the determinant of K_y.
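For reference, the following sketch implements the standard, unstructured computations (4) and (6) for a squared exponential covariance function using a Cholesky factorization. It is a baseline illustration only (the function names and the kernel parameterization are our own), and its O(N^3) time and O(N^2) memory cost is exactly what the approach proposed below avoids.

```python
import numpy as np

def se_kernel(X1, X2, theta, sigma_f):
    """Squared exponential kernel k(x, x') = sigma_f^2 exp(-sum_j theta_j^2 (x_j - x'_j)^2)."""
    diff = X1[:, None, :] - X2[None, :, :]
    return sigma_f**2 * np.exp(-np.sum((theta * diff) ** 2, axis=-1))

def gp_fit_predict(X, y, X_star, theta, sigma_f, sigma_noise):
    """Posterior mean (4) and log likelihood (6) of a zero-mean GP.

    Uses a Cholesky factorization of K_y, which costs O(N^3) time and
    O(N^2) memory -- the bottleneck the paper sets out to remove.
    """
    N = X.shape[0]
    K_y = se_kernel(X, X, theta, sigma_f) + sigma_noise**2 * np.eye(N)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K_y^{-1} y
    log_lik = (-0.5 * y @ alpha
               - np.sum(np.log(np.diag(L)))                # 0.5 * log|K_y|
               - 0.5 * N * np.log(2 * np.pi))
    f_star = se_kernel(X_star, X, theta, sigma_f) @ alpha  # eq. (4)
    return f_star, log_lik

# Illustrative usage on random data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = np.sin(X @ np.array([3.0, 1.0]))
f_star, ll = gp_fit_predict(X, y, rng.random((5, 2)),
                            theta=np.array([1.0, 1.0]), sigma_f=1.0, sigma_noise=0.1)
print(f_star.shape, ll)
```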

2 Proposed Approach

2.1 Tensor and Related Operations

For the further discussion we use tensor notation, so let us introduce the definition of a tensor and some related operations. A tensor Y is a K-dimensional matrix of size n_1 \times n_2 \times \cdots \times n_K [8]:

Y = \{y_{i_1, i_2, \dots, i_K}, \{i_k = 1, \dots, n_k\}_{k=1}^K\}.   (7)


By Y^{(j)} we denote the matrix consisting of elements of the tensor Y whose rows are 1 \times n_j slices of Y with fixed indices i_{j+1}, \dots, i_K, i_1, \dots, i_{j-1} and running index i_j = 1, \dots, n_j. For a 2-dimensional tensor it holds that Y^{(1)} = Y^T and Y^{(2)} = Y. Let B be some matrix of size n_i \times n_i'. Then the product of the tensor Y and the matrix B along the direction i is a tensor Z of size n_1 \times \cdots \times n_{i-1} \times n_i' \times n_{i+1} \times \cdots \times n_K such that Z^{(i)} = Y^{(i)} B. We denote this operation by Y \otimes_i B. For a 2-dimensional tensor Y, multiplication along the first and the second directions corresponds to left and right multiplication by a matrix: Y \otimes_1 B = B^T Y, Y \otimes_2 B = Y B. Let us also consider the operation vec, which for every multidimensional matrix Y returns a vector containing all elements of Y. Then for every tensor Y of size n_1 \times n_2 \times \cdots \times n_K and matrices B_i of size n_i \times p_i, i = 1, \dots, K, it holds that [8]

(B_1 \otimes B_2 \otimes \cdots \otimes B_K)\, vec(Y) = vec(Y \otimes_1 B_1^T \otimes_2 \cdots \otimes_K B_K^T).   (8)

The complexity of calculating the left hand side of this equation is O(N^2), assuming that all the matrices B_i are square of size n_i \times n_i and N = \prod_i n_i. The complexity of the right hand side is only O(N \sum_i n_i).
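A quick numerical check of identity (8) for K = 2 factors is given below. It assumes a row-major (C-order) vec, under which the identity holds as written, and forms the explicit Kronecker product only for verification (the whole point of the identity is to avoid forming it).

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4
Y = rng.standard_normal((n1, n2))       # a 2-dimensional tensor
B1 = rng.standard_normal((n1, n1))
B2 = rng.standard_normal((n2, n2))

# Left-hand side of (8): requires the full N x N Kronecker product.
lhs = np.kron(B1, B2) @ Y.reshape(-1)   # row-major vec(Y)

# Right-hand side of (8): products along each direction;
# for a 2-dimensional tensor, Y x_1 B1^T x_2 B2^T = B1 Y B2^T.
rhs = (B1 @ Y @ B2.T).reshape(-1)

print(np.allclose(lhs, rhs))            # True
```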

2.2 Fast Exact Inference

To obtain the benefit of using tensor calculations for inference we consider a covariance function which can be represented as a product of covariance functions, each depending only on the variables from a particular factor. The covariance function and the corresponding covariance matrix take the form

k(x_p, x_q) = \prod_{i=1}^K k_i(x_p^i, x_q^i), \qquad K_f = \bigotimes_{i=1}^K K_i,

where x_p^i, x_q^i \in s_i and K_i is the covariance matrix defined by the covariance function k_i computed at the points from s_i. Such a function is still a valid covariance function, being the product of separate covariance functions. The most popular squared exponential function can also be represented as a product of squared exponential functions

k_i(x_p^i, x_q^i) = \sigma_{f,i}^2 \exp\Big(-\sum_{j=1}^{d_i} \big(\theta_i^{(j)}\big)^2 \big(x_p^{(j),i} - x_q^{(j),i}\big)^2\Big),

where x_p^{(j),i} is the j-th component of x_p^i. This covariance function is more general as it allows one to take into account features of the factors by choosing a different k_i for each factor. Note that the factors can be multidimensional.

Let K_i = U_i D_i U_i^T be a Singular Value Decomposition (SVD), where U_i is an orthogonal matrix of eigenvectors of the matrix K_i and D_i is a diagonal matrix of its eigenvalues. Using properties of the Kronecker product we obtain

K_y^{-1} = \Big(\bigotimes_{i=1}^K U_i\Big) \Big(\bigotimes_{i=1}^K D_i + \sigma_{noise}^2 I\Big)^{-1} \Big(\bigotimes_{i=1}^K U_i^T\Big).   (9)

Equations (4), (5), (6) do not require explicit inversion of K_y: in each equation it is multiplied by a vector. Calculation of (9) requires O(N^2) operations, but to


compute K_y^{-1} y a much more efficient expression can be derived. Let Y be a tensor such that vec(Y) = y. Applying identities (8) and (9), K_y^{-1} y can be written as

K_y^{-1} y = vec\big[\big((Y \otimes_1 U_1 \cdots \otimes_K U_K) * D^{-1}\big) \otimes_1 U_1^T \cdots \otimes_K U_K^T\big],   (10)

where D is the tensor constructed by transforming the diagonal of D = \bigotimes_k D_k + \sigma_{noise}^2 I into a tensor and * denotes the element-wise product. The elements of the tensor D are the eigenvalues of K_y, so

|K_y| = \prod_{i_1, \dots, i_K} D_{i_1, \dots, i_K}.   (11)
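The following sketch shows how (9)-(11) can be implemented: K_y^{-1} y and log|K_y| are obtained from per-factor eigendecompositions and products along each tensor direction, without ever forming the N x N matrix. The helper names and the row-major reshaping convention are our own, and the final comparison against a dense solve is only an illustration on a small example.

```python
import numpy as np
from functools import reduce

def mode_multiply(T, A, mode):
    """Multiply tensor T by matrix A along `mode`: out[..., j, ...] = sum_i A[j, i] T[..., i, ...]."""
    T = np.moveaxis(T, mode, 0)
    front, rest = T.shape[0], T.shape[1:]
    return np.moveaxis((A @ T.reshape(front, -1)).reshape((A.shape[0],) + rest), 0, mode)

def kron_gp_solve(K_list, sigma_noise, Y):
    """K_y^{-1} y and log|K_y| for K_y = K_1 (x) ... (x) K_K + sigma^2 I, cf. eqs. (9)-(11).

    K_list: per-factor covariance matrices K_i.
    Y: the vector y reshaped (row-major) to a tensor of shape (n_1, ..., n_K).
    """
    eig = [np.linalg.eigh(K) for K in K_list]                  # K_i = U_i D_i U_i^T
    U = [u for _, u in eig]
    D = reduce(np.multiply.outer, [d for d, _ in eig]) + sigma_noise**2
    logdet = np.sum(np.log(D))                                 # eq. (11)
    T = Y
    for k, Uk in enumerate(U):                                 # apply Kronecker product of U_i^T
        T = mode_multiply(T, Uk.T, k)
    T = T / D                                                  # divide by the eigenvalues of K_y
    for k, Uk in enumerate(U):                                 # apply Kronecker product of U_i
        T = mode_multiply(T, Uk, k)
    return T.reshape(-1), logdet

# Small check against a dense solve (illustrative factors and kernels).
rng = np.random.default_rng(1)
X1, X2 = rng.random((4, 1)), rng.random((3, 2))
K1 = np.exp(-(X1 - X1.T) ** 2)
K2 = np.exp(-np.sum((X2[:, None] - X2[None]) ** 2, axis=-1))
y = rng.standard_normal(12)
alpha, logdet = kron_gp_solve([K1, K2], 0.1, y.reshape(4, 3))
K_y = np.kron(K1, K2) + 0.01 * np.eye(12)
print(np.allclose(alpha, np.linalg.solve(K_y, y)),
      np.isclose(logdet, np.linalg.slogdet(K_y)[1]))           # True True
```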

Taking into account the complexity of computing the right hand side of (8) and of the SVDs, the following proposition is obtained.

Proposition 1. The computational complexity of the log likelihood (6), where K_y^{-1} y and |K_y| are calculated using (10) and (11), is O\big(\sum_{i=1}^K n_i^3 + N \sum_{i=1}^K n_i\big).

Supposing that n_i ≪ N (the number of factors is large and their sizes are close), we obtain that O(N \sum_i n_i) = O(N^{1 + 1/K}). This is much less than O(N^3).

For a gradient based method, which is used for optimization of the log likelihood over the parameters \theta, the derivatives are required. They take the form

\frac{\partial}{\partial \theta} \log p(y \,|\, X, \sigma_f, \sigma_{noise}) = -\frac{1}{2} Tr(K_y^{-1} K') + \frac{1}{2} y^T K_y^{-1} K' K_y^{-1} y,   (12)

where \theta is one of the parameters of the covariance function (a component of \theta_i, \sigma_{noise} or \sigma_{f,i}, i = 1, \dots, d) and K' = \frac{\partial K}{\partial \theta}. K' is also a Kronecker product, K' = K_1 \otimes \cdots \otimes K_{i-1} \otimes \frac{\partial K_i}{\partial \theta} \otimes K_{i+1} \otimes \cdots \otimes K_K, where \theta is a parameter of the i-th covariance function. Denoting by A a tensor such that vec(A) = K_y^{-1} y, the second term in (12) is efficiently computed using the same technique as in (10):

\frac{1}{2} y^T K_y^{-1} K' K_y^{-1} y = \frac{1}{2} \Big\langle A,\; A \otimes_1 K_1^T \otimes_2 \cdots \otimes_{i-1} K_{i-1}^T \otimes_i \frac{\partial K_i^T}{\partial \theta} \otimes_{i+1} K_{i+1}^T \otimes_{i+2} \cdots \otimes_K K_K^T \Big\rangle.   (13)

Using properties of the trace and the Kronecker product we can obtain

Tr(K_y^{-1} K') = \Big\langle vec(D^{-1}),\; \bigotimes_{i=1}^K diag\big(U_i^T K_i' U_i\big) \Big\rangle,   (14)

where diag(A) is the vector of diagonal elements of a matrix A. For the derivatives the following statement holds.

Proposition 2. The computational complexity of calculating the derivatives of the log likelihood is O\big(\sum_{i=1}^K n_i^3 + N \sum_{i=1}^K n_i\big).

2.3 Anisotropy

In engineering practice the sizes of factors often differ significantly. It is a common case for the GP regression to become degenerate in this situation (see Figures 3, 4). Suppose that the given DoE consists of two one-dimensional factors with sizes n_1, n_2 and n_1 ≪ n_2. Then the length-scale for the first factor is expected to be much greater than the length-scale for the second factor (or \theta_1 ≪ \theta_2). However, in practice the opposite is often observed. The reason is that the optimization algorithm gets stuck in a local maximum, as the log likelihood is a non-convex function with lots of local maxima.

Let us denote the length-scales by l_k^{(i)} = \big(\theta_k^{(i)}\big)^{-1}, where \theta_k^{(i)} is the i-th component of the vector of parameters of the k-th covariance function. To incorporate knowledge about the factor sizes into the GP we introduce a regularization by imposing a beta prior on \theta with parameters \alpha and \beta, rescaled to some interval [a_k^{(i)}, b_k^{(i)}]:

\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} \sim Be(\alpha, \beta), \quad \{i = 1, \dots, d_k\}_{k=1}^K.   (15)

Let B(\alpha, \beta) be the beta function. Then the log likelihood has the form

\log p(y \,|\, X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi - d \log B(\alpha, \beta) + \sum_{k,i} \Big[ (\alpha - 1) \log\Big(\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\Big) + (\beta - 1) \log\Big(1 - \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\Big) \Big].   (16)

It is not known a priori whether the length-scales should be large or small, so the prior should impose nearly the same penalty on all intermediate values of \theta. Setting \alpha = \beta = 2 leads to the desired shape of the prior (Figure 2).

Fig. 2. Logarithm of the Beta distribution probability density function, \alpha = \beta = 2.

Numerous references use a gamma prior, e.g. [11]. Preliminary experiments showed that GP models with such a prior often degenerate. A prior distribution with compact support restricts \theta_k^{(i)} to the interval [a_k^{(i)}, b_k^{(i)}], therefore prohibiting too small and too large length-scales and excluding any possibility of degeneracy (if the intervals [a_k^{(i)}, b_k^{(i)}] are chosen properly).


It seems reasonable that the length-scale does not need to be much smaller than the distance between points. That is why we choose the lower bound to be c_k \cdot \min_{x,y \in s_k,\, x^{(i)} \neq y^{(i)}} \|x^{(i)} - y^{(i)}\| and the upper bound to be C_k \cdot \max_{x,y \in s_k} \|x^{(i)} - y^{(i)}\|. The value of c_k should be close to 1: if c_k is too small we risk overfitting the data by allowing small length-scales, and if c_k is too large we are going to underfit the data by allowing only large length-scales. The constants C_k must be much greater than c_k to permit large length-scales and preserve flexibility. The constants c_k = 0.5 and C_k = 100 worked rather well in our test cases. Figure 5 illustrates GP regression with the introduced regularization. The parameters were chosen such that the approximation is non-degenerate.

Fig. 3. Degeneracy of the GP model.

Fig. 4. True function.

Fig. 5. The GP model with the proposed prior distribution.
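As an illustration of the regularization described above, the sketch below computes a distance-based interval for one coordinate of a factor (with the c_k, C_k constants from the text) and the corresponding beta log-prior term with \alpha = \beta = 2 that is added to the log likelihood in (16). Whether the interval is imposed on \theta itself or on the length-scale 1/\theta is a parameterization detail we leave open here; the helper names are illustrative.

```python
import numpy as np
from math import lgamma

def bounds_from_distances(s, dim, c=0.5, C=100.0):
    """Interval [a, b] for coordinate `dim` of factor s (shape (n_k, d_k)):
    a = c * smallest nonzero distance, b = C * largest distance."""
    d = np.abs(s[:, dim, None] - s[None, :, dim])
    return c * d[d > 0].min(), C * d.max()

def beta_log_prior(t, a, b, alpha=2.0, beta=2.0):
    """Log-density of a Beta(alpha, beta) prior rescaled to [a, b], as added in eq. (16)."""
    u = (t - a) / (b - a)
    if not 0.0 < u < 1.0:
        return -np.inf                      # outside the compact support
    log_B = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return (alpha - 1) * np.log(u) + (beta - 1) * np.log(1.0 - u) - log_B

# Example on a hypothetical 1-dimensional factor with 8 levels.
s = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
a, b = bounds_from_distances(s, dim=0)
print(a, b, beta_log_prior(0.3, a, b))
```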

2.4 Initialization

It is also important to choose reasonable initial values of the parameters in order to converge to a good solution during parameter optimization. The kernel widths for different factors should be proportional to the factor sizes, so it seems reasonable to use the average distance between points in a factor as an initial value:

\theta_k^{(i)} = \Big[\frac{1}{n_k}\Big(\max_{x \in s_k} x^{(i)} - \min_{x \in s_k} x^{(i)}\Big)\Big]^{-1}.   (17)
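A sketch of the initialization rule (17), applied per factor; the function name is illustrative.

```python
import numpy as np

def initial_theta(s):
    """Initial kernel parameters for one factor s of shape (n_k, d_k), eq. (17):
    theta is the inverse of the average spacing (range / n_k) in each coordinate."""
    n_k = s.shape[0]
    spacing = (s.max(axis=0) - s.min(axis=0)) / n_k
    return 1.0 / spacing

# Illustrative usage on a hypothetical 1-dimensional factor with 8 levels.
s = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
print(initial_theta(s))   # range 1.0, 8 levels -> theta about 8
```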

3 Experimental Results

The algorithms were tested on a set of functions from [5,18]. Sample sizes N varied from 100 to about 200000 and input dimensions from 2 to 6. For each function several factorial anisotropic DoEs were generated. The compared algorithms are the proposed approach without regularization (tensorGP) and with regularization (tensorGP-reg), Fully Independent Training Conditional GP (FITC) [16], Sparse Spectrum GP (SSGP) [9] and Multivariate Adaptive Regression Splines (MARS) [7]. For FITC and SSGP the number of inducing points (or spectral points for SSGP) M varied from 500 down to 70, as the complexity of these algorithms is O(M^2 N).


To assess the quality of approximation the mean squared error was used,

MSE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \big(\hat{f}(x_i) - f(x_i)\big)^2,   (18)

where N_{test} = 50000 is the size of a test set. The test sets were generated randomly.

To compare the performance of the algorithms on a number of problems it is convenient to use Dolan-Moré curves [4]. The idea of Dolan-Moré curves is as follows. Let t_{p,a} be a quality measure of the a-th algorithm on the p-th problem and r_{p,a} be the performance ratio, r_{p,a} = t_{p,a} / \min_s(t_{p,s}). Then the Dolan-Moré curve is the graph of

\rho_a(\tau) = \frac{1}{n_p}\, size\{p : r_{p,a} \leq \tau\},

which can be thought of as the probability for the a-th algorithm to have a performance ratio within a factor \tau \in R_+. The higher the curve \rho_a(\tau) is located, the better the a-th algorithm works. \rho_a(1) is the ratio of problems on which the a-th algorithm performed best.

As tensorGP is an exact GP, it performs better than the approximations of GP, i.e. FITC and SSGP. The tensorGP-reg algorithm has better quality due to the regularization (see Figure 6). Figure 7 presents Dolan-Moré curves for the training time.


Fig. 6. Dolan-Moré curves for approximation quality comparison.

Fig. 7. Dolan-Moré curves for run-time comparison.

The time performance of the proposed approach is comparable to that of the MARS algorithm and outperforms the FITC and SSGP techniques. For N = 1000 the training time of tensorGP-reg is about 1 sec, and for N = 400000 it is about 480 sec. The training time of the standard GP model for N = 1000 is about 120 sec; for N = 400000 it cannot be fitted due to memory limitations.¹

¹ Experiments were conducted on a PC with an Intel i7 2.8 GHz processor and 4 GB RAM.
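For completeness, here is a small sketch of how Dolan-Moré performance profiles can be computed from a matrix of quality measures; the data in the example are made up and the function names are our own.

```python
import numpy as np

def dolan_more(t):
    """Performance profiles from a matrix t of shape (n_problems, n_algorithms),
    t[p, a] = quality measure (e.g. MSE or run time) of algorithm a on problem p.
    Returns the performance ratios r and a function rho(a, tau)."""
    r = t / t.min(axis=1, keepdims=True)    # r_{p,a} = t_{p,a} / min_s t_{p,s}
    def rho(a, tau):
        return np.mean(r[:, a] <= tau)      # fraction of problems within factor tau
    return r, rho

# Hypothetical measures for 3 problems and 2 algorithms.
t = np.array([[1.0, 2.0],
              [0.5, 0.4],
              [3.0, 9.0]])
r, rho = dolan_more(t)
print(rho(0, 1.0), rho(1, 1.0))   # share of problems on which each algorithm is best
```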

3.1 Rotating Disc Problem

Let us consider a real-world problem of rotating disc shape design. Such problems often arise during aircraft engine design and in turbomachinery [2]. In this problem a disc of an impeller rotating around a shaft is considered.


Fig. 8. Rotating disc parametrization

Fig. 9. Rotating disc objectives

Fig. 10. 2D slice along the x_5 and x_6 variables for tensorGP-reg. MSE = 0.398.

Fig. 11. 2D slice along the x_5 and x_6 variables for FITC. MSE = 86.703.


The geometrical shape of the disc is parameterized by 6 variables x = (h_1, h_2, h_3, h_4, r_2, r_3) (r_1 and r_4 are fixed), see Figures 8 and 9. The task is to find the geometrical shape of the disc that minimizes the weight and the contact pressure p_1 between the disc and the shaft while constraining the maximum radial stress S_{r,max} to be less than some threshold. It is a common practice to build approximations of the objective functions in order to analyze and optimize them [6]. We applied the proposed algorithm and FITC to this problem. The DoE was full factorial and anisotropic, the sample size was 14400, and the factor sizes were [1, 8, 8, 3, 15, 5]. Figures 10 and 11 depict 2D slices of the contact pressure approximations along the x_5 and x_6 variables (the other variables are fixed). As can be seen, tensorGP-reg provides a much more accurate approximation than FITC.

4 Conclusion

Gaussian Processes are often used for building approximations for small data sets. However, the structure of the given data set contains important information which allows one to efficiently compute exact inference even for large data sets. The introduced regularization combined with reasonable initialization has proven to be an efficient way to combat degeneracy in the case of anisotropic data. The algorithm proposed in this paper takes into account the special factorial structure of the data set and is able to handle huge samples while preserving the power and flexibility of GP regression. Our approach has been successfully applied to toy and real problems.

Acknowledgments. The research was conducted in the IITP RAS and solely supported by the Russian Science Foundation grant (project 14-50-00150).


References


1. Abdel-Gawad, A.H., Minka, T.P., et al.: Sparse-posterior Gaussian processes for general likelihoods. arXiv preprint arXiv:1203.3507 (2012)
2. Armand, S.C.: Structural Optimization Methodology for Rotating Disks of Aircraft Engines. NASA technical memorandum, National Aeronautics and Space Administration, Office of Management, Scientific and Technical Information Program (1995)
3. Chan, G., Wood, A.T.: Algorithm AS 312: An algorithm for simulating stationary Gaussian random fields. Journal of the Royal Statistical Society: Series C (Applied Statistics) 46(1), 171–181 (1997)
4. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Mathematical Programming 91(2), 201–213 (2002)
5. Evolutionary computation pages - the function testbed: Lappeenranta University of Technology. http://www.it.lut.fi/ip/evo/functions/functions.html
6. Forrester, A.I.J., Sobester, A., Keane, A.J.: Engineering Design via Surrogate Modelling - A Practical Guide. J. Wiley (2008)
7. Friedman, J.H.: Multivariate adaptive regression splines. The Annals of Statistics 19(1), 1–67 (1991)
8. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Review 51(3), 455–500 (2009)
9. Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C.E., Figueiras-Vidal, A.R.: Sparse spectrum Gaussian process regression. The Journal of Machine Learning Research 11, 1865–1881 (2010)
10. Montgomery, D.C.: Design and Analysis of Experiments. John Wiley & Sons (2006)
11. Neal, R.M.: Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. arXiv preprint physics/9701026 (1997)
12. Quiñonero-Candela, J., Rasmussen, C.E.: A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research 6, 1939–1959 (2005)
13. Rasmussen, C.E., Williams, C.: Gaussian Processes for Machine Learning. MIT Press (2006)
14. Rasmussen, C.E., Ghahramani, Z.: Infinite mixtures of Gaussian process experts. In: Advances in Neural Information Processing Systems 14, pp. 881–888. MIT Press (2001)
15. Rendall, T., Allen, C.: Multi-dimensional aircraft surface pressure interpolation using radial basis functions. Proc. IMechE Part G: Aerospace Engineering 222, 483–495 (2008)
16. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems 18, pp. 1257–1264 (2005)
17. Stone, C.J., Hansen, M., Kooperberg, C., Truong, Y.K.: Polynomial splines and their tensor products in extended linear modeling. Ann. Statist. 25, 1371–1470 (1997)
18. Swiss International Institute of Technology. http://www.tik.ee.ethz.ch/sop/download/supplementary/testproblems/