SIAM J. SCI. COMPUT. Vol. 35, No. 6, pp. A2554–A2574

© 2013 Society for Industrial and Applied Mathematics

ON THE MAXIMUM LIKELIHOOD TRAINING OF GRADIENT-ENHANCED SPATIAL GAUSSIAN PROCESSES∗

R. ZIMMERMANN†

Abstract. Spatial Gaussian processes, alias spatial linear models or Kriging estimators, are a powerful and well-established tool for the design and analysis of computer experiments in a multitude of engineering applications. A key challenge in constructing spatial Gaussian processes is the training of the predictor by numerically optimizing its associated maximum likelihood function, which depends on so-called hyper-parameters. This is well understood for standard Kriging predictors, i.e., without considering derivative information. For gradient-enhanced Kriging predictors it is an open question whether to incorporate the cross-correlations between the function values and their partial derivatives in the maximum likelihood estimation. In this paper it is proved that, in consistency with the model assumptions, both the autocorrelations and the aforementioned cross-correlations must be considered when optimizing the gradient-enhanced predictor's likelihood function. The proof works by computational rather than probabilistic arguments and exposes as a secondary effect the connection between the direct and the indirect approach to gradient-enhanced Kriging, both of which are widely used in applications. The theoretical findings are illustrated on an academic example as well as on an aerodynamic engineering application.

Key words. design and analysis of computer experiments, gradient-enhanced Kriging, Gaussian process, spatial linear model, response surface, surrogate model, hyper-parameter training, maximum likelihood

AMS subject classifications. 62K20, 62P30, 65C20, 65D15

DOI. 10.1137/13092229X

∗Submitted to the journal's Methods and Algorithms for Scientific Computing section May 24, 2013; accepted for publication (in revised form) September 23, 2013; published electronically November 12, 2013. http://www.siam.org/journals/sisc/35-6/92229.html
†Institute of Computational Mathematics, TU Braunschweig, 38100, Germany (ralf. [email protected]).

1. Introduction. Spatial Gaussian processes, also known as best linear unbiased predictors, refer to a statistical data interpolation method which is applied in a wide range of scientific fields, including computer experiments in a modern engineering context; see, e.g., [22], [10]. As a powerful tool for geostatistics, it was pioneered by Krige in 1951 [13], and to pay tribute to his achievements, the method is also termed Kriging; see [9] for the geostatistical background. Given a finite set of sample points, Kriging predictors allow for interpolating both the given function values as well as their partial derivatives at the sample sites, resulting in a first-order accurate response surface approximation thereat. This approach is termed gradient-enhanced Kriging (GEK).

In order to construct Kriging predictors, it is a mandatory requirement to model the sample data's covariance structure. In the context of computer experiments, this is done via spatial correlation functions, where the level of correlation depends on the spatial distance of the sample points to each other, weighted coordinatewise by so-called hyper-parameters. These, in turn, are tuned by optimizing the likelihood function corresponding to the predictor. Maximum likelihood training for standard Kriging predictors has been investigated in [25], [18], [1], [5], [27]. In [3], Kriging has been utilized as a tool for adaptive sampling.

Despite the fact that GEK predictors have been considered by several authors,
especially in the context of computer experiments (see, e.g., [4], [14], [26], [17], [8]), to the best of the author's knowledge, no theoretical investigations on the maximum likelihood training for GEK predictors have yet been published. A fundamental question that arises in this context is whether to incorporate the cross-correlations between the function values and their partial derivatives in the maximum likelihood estimation (MLE). In the applications, there is some dissent on this issue: In [17], it is proposed not to consider the cross-correlations in the likelihood training, while in [26], [8] the opposite is proposed. In [4], [14], and [6] no details about the likelihood training process are given.

In this regard, the main original contributions of this paper are the following: (1) It will be proved by computational arguments that the cross-correlations must be considered when conducting likelihood training for gradient-enhanced spatial linear models. The proof works by tracing back the likelihood estimation problem corresponding to GEK to the likelihood estimation problem corresponding to standard gradient-free Kriging for a suitably augmented sample data set. (2) Along the way, the proof clarifies the connection between the direct GEK and the data augmentation approach (also known as indirect GEK). In this way, theoretical confirmations of experimental observations from [4, 14] are obtained. As a further consequence, theorems proved for gradient-free Kriging predictors transfer to GEK predictors. (3) The conditioning of the GEK approach is compared to the conditioning of the aforementioned data augmentation approach. It is shown numerically that the former may feature a worse condition number than the latter, while in the majority of cases the converse is true. (4) The theoretical findings are confirmed and illustrated on an academic example as well as on a real-life problem in the context of aerodynamic engineering.

The paper is organized as follows. In the next section, a short review of the basic theory of Kriging and GEK predictors is given. Section 3 features the main original contribution of this paper: It is derived theoretically how to perform maximum likelihood training for GEK predictors. In section 4 an illustrative academic example is discussed. An application to an aerodynamic engineering problem is presented in section 5. Conclusions are drawn in section 6. The focus of this paper is on computational aspects rather than statistical issues.

2. Theoretical background.

2.1. Kriging in a nutshell. In this section, the essentials of spatial Gaussian processes in the context of computer experiments, both with and without taking derivative information into account, are reviewed. For more details and proofs, the reader is referred to the textbooks [23], [21], and [6] and the survey articles [22] and [11].

The basic objective is to estimate an unknown function y : R^d ⊇ U → R based on a finite data set of sample locations x^1, ..., x^n ∈ U ⊆ R^d with corresponding responses y_1 = y(x^1), ..., y_n = y(x^n) ∈ R obtained from measurements or numerical computations. The collection of responses is denoted by the vector Y = (y_1, ..., y_n)^T ∈ R^n. By assumption, y is the composition of a regression model f(x) and a Gaussian random error ε(x) of zero mean,
\[
y(x) = f(x)^T\beta + \varepsilon(x) = (f_0(x),\ldots,f_p(x))\begin{pmatrix}\beta_0\\ \vdots\\ \beta_p\end{pmatrix} + \varepsilon(x), \qquad \varepsilon(x)\sim\mathcal{N}(0,\sigma^2),
\]
where β = (β_0, ..., β_p)^T is the vector of regression coefficients. The most common choices for the regression model are constant, linear, or higher-order multivariate polynomials. The regression design matrix is
\[
F = \begin{pmatrix} f(x^1)^T\\ \vdots\\ f(x^n)^T\end{pmatrix} = \begin{pmatrix} f_0(x^1) & f_1(x^1) & \cdots & f_p(x^1)\\ \vdots & \vdots & & \vdots\\ f_0(x^n) & f_1(x^n) & \cdots & f_p(x^n)\end{pmatrix} \in \mathbb{R}^{n\times(p+1)}.
\]

In the context of computer experiments (see [22]), covariances of the random errors are modeled by spatial correlation functions of the form
\[
\operatorname{cov}(\varepsilon(x^i), \varepsilon(x^j)) = \sigma^2 \rho(\theta, x^i, x^j),
\]
by convention parametrized such that for p ≠ q,
\[
\rho(\theta, p, q) \to \begin{cases} 1 & \text{for } \theta \to 0,\\ 0 & \text{for } \theta \to \infty.\end{cases}
\]
Here, θ = (θ_1, ..., θ_d) ∈ R^d is a vector of hyper-parameters and ρ is the correlation coefficient. For a selection of admissible spatial correlation functions, see, e.g., [16, Table 2.1]. The appendix features the explicit expressions for the Gaussian and the cubic correlation models. The correlation matrix and the correlation vector at a location x are defined respectively by
\[
R = R(\theta) = \bigl(\rho(\theta, x^i, x^j)\bigr)_{i,j} \in \mathbb{R}^{n\times n}, \qquad r(x) = r(\theta, x) = \bigl(\rho(\theta, x^1, x), \ldots, \rho(\theta, x^n, x)\bigr)^T \in \mathbb{R}^n.
\]
The Kriging predictor ŷ is the best linear unbiased estimator
\[
(2.1) \qquad \hat{y}(x) = \omega(x)^T Y = \sum_{i=1}^{n} \omega_i(x)\, y_i,
\]

where the weights ω(x) = (ω_1(x), ..., ω_n(x)) are determined by the Kriging equation system
\[
(2.2) \qquad \begin{pmatrix} R & F\\ F^T & 0 \end{pmatrix}\begin{pmatrix} \omega(x)\\ \mu(x)\end{pmatrix} = \begin{pmatrix} r(x)\\ f(x)\end{pmatrix} \in \mathbb{R}^{n+p+1}
\]
with Lagrange multipliers μ(x) = (μ_0(x), ..., μ_p(x))^T. The hyper-parameter training problem for Kriging models is to optimize the profile log-likelihood
\[
(2.3) \qquad \max_{\{\theta\in\mathbb{R}^d,\ \theta_j > 0\}} L(\theta) = -n\log(\sigma^2(\theta)) - \log(\det(R(\theta))),
\]
where the dependency on θ is as follows:
\[
(2.4) \qquad \beta(\theta) = (F^T R^{-1}(\theta) F)^{-1} F^T R^{-1}(\theta) Y,
\]
\[
(2.5) \qquad \sigma^2(\theta) = \frac{1}{n}\,\bigl(Y - F\beta(\theta)\bigr)^T R(\theta)^{-1}\bigl(Y - F\beta(\theta)\bigr).
\]
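To make the training problem concrete, the following short Python sketch (not part of the original paper) evaluates (2.4), (2.5), and the objective (2.3) for a given hyper-parameter vector. A Gaussian correlation model and a constant regression model (F the all-ones column) are assumed for simplicity; all function names are ad hoc.

```python
import numpy as np

def gauss_corr(X, theta):
    """Gaussian correlation matrix R(theta) for sample locations X of shape (n, d)."""
    diff = X[:, None, :] - X[None, :, :]                  # pairwise differences
    return np.exp(-np.einsum('ijk,k->ij', diff**2, np.asarray(theta)))

def profile_log_likelihood(theta, X, Y):
    """Concentrated log-likelihood (2.3); beta(theta) and sigma^2(theta)
    are computed as in (2.4)-(2.5) for the constant regression F = (1,...,1)^T."""
    n = len(Y)
    R = gauss_corr(X, theta)
    F = np.ones((n, 1))
    Rinv_Y = np.linalg.solve(R, Y)
    Rinv_F = np.linalg.solve(R, F)
    beta = np.linalg.solve(F.T @ Rinv_F, F.T @ Rinv_Y)    # (2.4)
    resid = Y - (F @ beta).ravel()
    sigma2 = resid @ np.linalg.solve(R, resid) / n        # (2.5)
    _, logdet = np.linalg.slogdet(R)
    return -n * np.log(sigma2) - logdet                   # (2.3)

# toy usage: 10 random samples of y(x) = sin(10x) + x^2 on [0, 1]
rng = np.random.default_rng(0)
X = rng.random((10, 1))
Y = np.sin(10 * X[:, 0]) + X[:, 0]**2
print(profile_log_likelihood([0.7], X, Y))
```

In practice this objective is handed to a bounded optimizer over θ, as discussed in sections 4 and 5.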

2.2. The basic behavior of the hyper-parameter training process. According to the convention introduced in section 2.1, small values of the hyperparameters θ correspond to a strong level of spatial correlation, while large values of θ indicate weak spatial correlation. The following simple examples demonstrate


Fig. 2.1. Ordinary Kriging estimation of the line y(x) = 2x + 3 based on 10 mutually distinct random sample points xi ∈ [0, 1] (black squares). The numerically optimized hyper-parameter θ = 0.001 indicates a strong spatial correlation.

that the likelihood estimation procedure is able to detect the level of correlation properly. Here, the cubic correlation model (see Appendix A) is applied and no gradient information is included. Hyper-parameter optimization is restricted to the interval [0.001, 1.0]. Details are omitted, since this section illustrates only the basic behavior of the likelihood training.

As a first test case, the Kriging predictor is applied to a data set stemming from the linear function y(x) = 2x + 3 sampled at 10 randomly chosen points x^i ∈ [0, 1], i = 1, ..., 10. Because of the linear relation, one would expect a very strong correlation, and indeed, the likelihood optimization process outputs the lower bound θ = 0.001 of the constraining interval. As a second test case, the Kriging predictor is applied to a data set stemming from the nonlinear function y(x) = sin(10x) + x² sampled at the same 10 randomly chosen points. Here, as expected, the likelihood optimization process detects a weaker spatial correlation, the optimized hyper-parameter being θ = 0.6883. The predictors corresponding to the above examples are displayed in Figures 2.1 and 2.2.

2.3. Direct GEK. As summarized in [11, section 4.5] and [6, section 7], given the sampled function values y_i = y(x^i) and their partial derivatives ∂_k y_i = ∂y/∂x_k(x^i), the Kriging predictor ŷ can be enhanced to meet
\[
\hat{y}(x^i) = y_i, \qquad \partial_k \hat{y}(x^i) = \partial_k y_i \qquad (i = 1, \ldots, n;\ k = 1, \ldots, d).
\]

To this end, the vector of responses is augmented to include the partial derivatives,
\[
(2.6) \qquad Y_D = \bigl(Y^T, \partial_1 Y^T, \ldots, \partial_d Y^T\bigr)^T \in \mathbb{R}^{n(d+1)}.
\]

The ansatz for the best linear unbiased estimator corresponding to (2.1) now becomes
\[
(2.7) \qquad \hat{y}(x) = \omega_D(x)^T Y_D = \sum_{i=1}^{n} \omega_i(x)\, y_i + \sum_{k=1}^{d}\sum_{i=1}^{n} \omega_{k,i}(x)\, \partial_k y_i,
\]


Fig. 2.2. Ordinary Kriging estimation of the function y(x) = sin(10x) + x2 based on 10 mutually distinct random sample points xi ∈ [0, 1] (black squares). The numerically optimized hyperparameter θ = 0.6883 indicates a weaker spatial correlation.

where the augmented weights vector ω_D is ordered as follows: ω_D = (ω_1, ..., ω_n, ω_{1,1}, ..., ω_{1,n}, ..., ω_{d,1}, ..., ω_{d,n})^T ∈ R^{n(d+1)}. Introducing the notation
\[
(2.8) \qquad \partial_{(k,\cdot)} R = \Bigl(\frac{\partial \rho}{\partial x^i_k}(\theta, x^i, x^j)\Bigr)_{i,j\le n} \in \mathbb{R}^{n\times n},
\]
\[
(2.9) \qquad \partial_{(\cdot,l)} R = \Bigl(\frac{\partial \rho}{\partial x^j_l}(\theta, x^i, x^j)\Bigr)_{i,j\le n} \in \mathbb{R}^{n\times n},
\]
\[
(2.10) \qquad \partial^2_{(k,l)} R = \Bigl(\frac{\partial^2 \rho}{\partial x^i_k\, \partial x^j_l}(\theta, x^i, x^j)\Bigr)_{i,j\le n} \in \mathbb{R}^{n\times n},
\]
\[
(2.11) \qquad \partial_{(k,\cdot)} r(x) = \Bigl(\frac{\partial \rho}{\partial x^i_k}(\theta, x^i, x)\Bigr)_{i\le n} \in \mathbb{R}^{n}
\]
(k, l = 1, ..., d), the auto- and cross-correlation matrix and the auto- and cross-correlation vector corresponding to GEK read
\[
(2.12) \qquad R_D = \begin{pmatrix} R & \partial_{(\cdot,1)} R & \cdots & \partial_{(\cdot,d)} R\\ \partial_{(1,\cdot)} R & \partial^2_{(1,1)} R & \cdots & \partial^2_{(1,d)} R\\ \vdots & \vdots & & \vdots\\ \partial_{(d,\cdot)} R & \partial^2_{(d,1)} R & \cdots & \partial^2_{(d,d)} R \end{pmatrix} \in \mathbb{R}^{n(d+1)\times n(d+1)}, \qquad r_D(x) = \begin{pmatrix} r(x)\\ \partial_{(1,\cdot)} r(x)\\ \vdots\\ \partial_{(d,\cdot)} r(x)\end{pmatrix}.
\]
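As an illustration (not from the paper), the block structure (2.12) can be assembled directly from analytic kernel derivatives. The sketch below does this for the one-dimensional Gaussian correlation model, whose derivatives correspond to (A.1)-(A.3); the function name and the restriction to d = 1 are choices made here purely for brevity.

```python
import numpy as np

def gek_corr_matrix(x, theta):
    """Direct GEK correlation matrix R_D of (2.12) for d = 1 and the
    Gaussian kernel rho(x, x') = exp(-theta * (x - x')**2)."""
    D = x[:, None] - x[None, :]                       # pairwise differences x_i - x_j
    R = np.exp(-theta * D**2)                         # autocorrelation block
    dR_first = -2.0 * theta * D * R                   # d rho / d x_i (first argument)
    dR_second = -dR_first                             # d rho / d x_j = -d rho / d x_i
    ddR = 2.0 * theta * R * (1.0 - 2.0 * theta * D**2)  # d^2 rho / d x_i d x_j
    return np.block([[R,        dR_second],
                     [dR_first, ddR      ]])

x = np.linspace(0.0, 1.0, 5)
RD = gek_corr_matrix(x, theta=2.0)
print(RD.shape, np.allclose(RD, RD.T))                # (10, 10) True: R_D is symmetric
```

The symmetry check in the last line mirrors the argument given in the following paragraph.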

For details on (2.8)-(2.12), see [11, section 4.5] or [20, section 9.6, pp. 316-317]. For any stationary correlation model depending on the distance vector d = (x^i − x^j) rather than the precise sample locations x^i, x^j, the first-order partial derivatives are odd functions; see [6, Figure 7.1] for an illustration and the appendix for the detailed expressions corresponding to the Gaussian and the cubic correlation models, respectively. This observation explains the common misunderstanding that the enhanced correlation matrix R_D is nonsymmetric; see [14, section 2.5.1]. However, R_D, as displayed in (2.12), is symmetric, since taking the transpose and taking the partial derivatives with respect to the second spatial argument both cause a sign switch,
\[
\bigl(\partial_{(k,\cdot)}\rho(\theta, x^i, x^j)\bigr)_{i,j\le n}^T = \bigl(\partial_{(k,\cdot)}\rho(\theta, x^j, x^i)\bigr)_{i,j\le n} = (-1)(-1)\bigl(\partial_{(\cdot,k)}\rho(\theta, x^i, x^j)\bigr)_{i,j\le n}.
\]
The second-order partial derivatives are symmetric by the Schwarz theorem; see any primer on calculus. For the general expressions of the partial derivatives of stationary correlation models, see [11, section 4.5]. The augmented regression design matrix F_D is
\[
F_D = \bigl(F^T, \partial_1 F^T, \ldots, \partial_d F^T\bigr)^T \in \mathbb{R}^{n(d+1)\times(p+1)}.
\]
The uniquely determined weights that lead to the best linear unbiased predictor are now given by the solution to the GEK system
\[
(2.13) \qquad \begin{pmatrix} R_D & F_D\\ F_D^T & 0\end{pmatrix}\begin{pmatrix} \omega_D(x)\\ \mu(x)\end{pmatrix} = \begin{pmatrix} r_D(x)\\ f(x)\end{pmatrix} \in \mathbb{R}^{n(d+1)+(p+1)}.
\]
Obviously, the direct GEK approach requires the choice of an at least twice differentiable correlation function ρ.

2.4. Indirect GEK. As an alternative approach to the direct method outlined in section 2.3, any available derivative information can be included in the Kriging predictor via finite difference approximations. To this end, a small positive step size ε > 0 is fixed and a number of nd new sample points x̃^{i,k} = x^i + ε e_k (i = 1, ..., n; k = 1, ..., d) is added to the data set with corresponding function values approximated via
\[
y(\tilde{x}^{i,k}) = y(x^i + \varepsilon e_k) \approx y(x^i) + \varepsilon\, \partial_k y(x^i) \qquad (i = 1, \ldots, n;\ k = 1, \ldots, d).
\]
Keeping the notation of (2.6), the indirect GEK response vector is

\[
(2.14) \qquad Y_I = \bigl(Y^T,\ Y^T + \varepsilon\,\partial_1 Y^T,\ \ldots,\ Y^T + \varepsilon\,\partial_d Y^T\bigr)^T \in \mathbb{R}^{n(d+1)}.
\]
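A minimal sketch (not from the paper) of this data-augmentation step: given sample locations, responses, and gradients, it generates the additional points x̃^{i,k} = x^i + ε e_k and the response vector Y_I of (2.14). The function name and the default step size are arbitrary choices.

```python
import numpy as np

def augment_indirect_gek(X, Y, dY, eps=1e-3):
    """Return the augmented sample locations and the response vector Y_I (2.14).
    X : (n, d) sample locations, Y : (n,) responses, dY : (n, d) gradients."""
    n, d = X.shape
    X_aug, Y_aug = [X], [Y]
    for k in range(d):
        shift = np.zeros(d)
        shift[k] = eps                      # eps * e_k
        X_aug.append(X + shift)             # x^i + eps*e_k, i = 1, ..., n
        Y_aug.append(Y + eps * dY[:, k])    # first-order Taylor values
    return np.vstack(X_aug), np.concatenate(Y_aug)

# toy usage with n = 3 samples in d = 2 dimensions
X = np.array([[0.1, 0.2], [0.4, 0.8], [0.9, 0.3]])
Y = np.array([1.0, 2.0, 3.0])
dY = np.array([[0.5, -1.0], [0.0, 2.0], [1.5, 0.5]])
XI, YI = augment_indirect_gek(X, Y, dY)
print(XI.shape, YI.shape)                   # (9, 2) (9,)
```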

After these settings, the standard Kriging approach as outlined in section 2.1 applies to the augmented data set consisting of n(d + 1) sample points and sample values. The resulting augmented Kriging system of the form of (2.2) is of exactly the same dimensions as the direct GEK system (2.13). The indirect approach is quite popular in the engineering community, because the standard Kriging predictor can be applied after a few very simple modifications; see [4], [14]. It is sometimes also referred to as "data augmentation"; see [15].

3. Maximum likelihood training for GEK predictors. This section features the main original contribution of the paper at hand. By establishing a one-to-one connection between the direct and the indirect GEK approach, a precise answer to the question raised in the introduction is obtained.


Lemma 3.1. Up to a first-order Taylor approximation, the direct and indirect GEK approaches are equivalent by a suitable transformation of the weights vector in (2.7).

Proof. Let ε > 0 be a fixed small positive step size and let I_n ∈ R^{n×n} denote the (n × n)-identity matrix. Introducing the n(d + 1) × n(d + 1)-transformation matrix
\[
\Phi = \begin{pmatrix} I_n & & & \\ -\tfrac{1}{\varepsilon} I_n & \tfrac{1}{\varepsilon} I_n & & \\ \vdots & & \ddots & \\ -\tfrac{1}{\varepsilon} I_n & 0 & \cdots & \tfrac{1}{\varepsilon} I_n \end{pmatrix}, \qquad
\Phi^{-1} = \begin{pmatrix} I_n & & & \\ I_n & \varepsilon I_n & & \\ \vdots & & \ddots & \\ I_n & 0 & \cdots & \varepsilon I_n \end{pmatrix},
\]
it holds that Y_D = Φ Y_I; see (2.6) and (2.14). Thus, the direct GEK ansatz (2.7) can be written as
\[
\hat{y}(x) = \omega_D(x)^T \Phi Y_I = \bigl(\Phi^T \omega_D(x)\bigr)^T Y_I = \tilde{\omega}_I(x)^T Y_I,
\]
which is the standard indirect GEK ansatz, but with transformed weights ω̃_I(x). Just as outlined for standard Kriging in section 2.1, the optimal weights are determined by the linear equation system
\[
(3.1) \qquad \begin{pmatrix}
(\rho(x^i, x^j))_{i,j} & (\rho(x^i, \tilde{x}^{j,1}))_{i,j} & \cdots & (\rho(x^i, \tilde{x}^{j,d}))_{i,j} & F\\
(\rho(\tilde{x}^{i,1}, x^j))_{i,j} & (\rho(\tilde{x}^{i,1}, \tilde{x}^{j,1}))_{i,j} & \cdots & (\rho(\tilde{x}^{i,1}, \tilde{x}^{j,d}))_{i,j} & F_1\\
\vdots & \vdots & & \vdots & \vdots\\
(\rho(\tilde{x}^{i,d}, x^j))_{i,j} & (\rho(\tilde{x}^{i,d}, \tilde{x}^{j,1}))_{i,j} & \cdots & (\rho(\tilde{x}^{i,d}, \tilde{x}^{j,d}))_{i,j} & F_d\\
F^T & F_1^T & \cdots & F_d^T & 0
\end{pmatrix}
\begin{pmatrix} \tilde{\omega}(x)\\ \mu(x)\end{pmatrix}
=
\begin{pmatrix} (\rho(x^i, x))_i\\ (\rho(\tilde{x}^{i,1}, x))_i\\ \vdots\\ (\rho(\tilde{x}^{i,d}, x))_i\\ f(x)\end{pmatrix}.
\]
Here,
\[
F_k = \begin{pmatrix} f(\tilde{x}^{1,k})^T\\ \vdots\\ f(\tilde{x}^{n,k})^T \end{pmatrix} = \begin{pmatrix} f_0(\tilde{x}^{1,k}) & f_1(\tilde{x}^{1,k}) & \cdots & f_p(\tilde{x}^{1,k})\\ \vdots & \vdots & & \vdots\\ f_0(\tilde{x}^{n,k}) & f_1(\tilde{x}^{n,k}) & \cdots & f_p(\tilde{x}^{n,k})\end{pmatrix} \in \mathbb{R}^{n\times(p+1)}
\]
for x̃^{i,k} = x^i + ε e_k. By Taylor's theorem,
\[
\rho(\tilde{x}^{i,l}, x^j) \approx \rho(x^i, x^j) + \varepsilon\,\partial_{(l,\cdot)}\rho(x^i, x^j),
\]
\[
\rho(x^i, \tilde{x}^{j,k}) \approx \rho(x^i, x^j) + \varepsilon\,\partial_{(\cdot,k)}\rho(x^i, x^j),
\]
\[
\rho(\tilde{x}^{i,l}, \tilde{x}^{j,k}) \approx \rho(x^i, x^j) + \varepsilon\,\partial_{(\cdot,k)}\rho(x^i, x^j) + \varepsilon\,\partial_{(l,\cdot)}\rho(x^i, x^j) + \varepsilon^2\,\partial^2_{(l,k)}\rho(x^i, x^j).
\]

Using the notation of (2.8)-(2.11), the auto- and cross-correlation matrix in the system (3.1) can be written as
\[
(3.2) \qquad \begin{pmatrix}
R & R + \varepsilon\,\partial_{(\cdot,1)} R & \cdots & R + \varepsilon\,\partial_{(\cdot,d)} R\\
R + \varepsilon\,\partial_{(1,\cdot)} R & R + \varepsilon\,\partial_{(1,\cdot)} R + \varepsilon\,\partial_{(\cdot,1)} R + \varepsilon^2\,\partial^2_{(1,1)} R & \cdots & R + \varepsilon\,\partial_{(1,\cdot)} R + \varepsilon\,\partial_{(\cdot,d)} R + \varepsilon^2\,\partial^2_{(1,d)} R\\
\vdots & \vdots & & \vdots\\
R + \varepsilon\,\partial_{(d,\cdot)} R & R + \varepsilon\,\partial_{(d,\cdot)} R + \varepsilon\,\partial_{(\cdot,1)} R + \varepsilon^2\,\partial^2_{(d,1)} R & \cdots & R + \varepsilon\,\partial_{(d,\cdot)} R + \varepsilon\,\partial_{(\cdot,d)} R + \varepsilon^2\,\partial^2_{(d,d)} R
\end{pmatrix}.
\]

For stationary correlation models, the first-order terms ε∂(k,·) R + ε∂(·,k) R appearing in the diagonal blocks of the cross-correlation submatrix cancel, being of opposite sign; see section 2.3.


As it corresponds to the standard indirect GEK correlation matrix, the above matrix will be referred to as R_I. Accordingly, the correlation part of the right-hand-side vector of (3.1) is
\[
r_I(x) = \begin{pmatrix} r(x)\\ r(x) + \varepsilon\,\partial_{(1,\cdot)} r(x)\\ \vdots\\ r(x) + \varepsilon\,\partial_{(d,\cdot)} r(x)\end{pmatrix} \in \mathbb{R}^{n(d+1)}.
\]
It holds that r_I(x) = Φ^{-1} r_D(x) (see (2.12)), so that the system (3.1) becomes
\[
(3.3) \qquad \begin{pmatrix} R_I & F_I\\ F_I^T & 0 \end{pmatrix}\begin{pmatrix} \Phi^T & 0\\ 0 & I_{(p+1)}\end{pmatrix}\begin{pmatrix} \omega_D(x)\\ \mu(x)\end{pmatrix} = \begin{pmatrix} \Phi^{-1} & 0\\ 0 & I_{(p+1)}\end{pmatrix}\begin{pmatrix} r_D(x)\\ f(x)\end{pmatrix},
\]
where F_I^T = (F^T, F_1^T, ..., F_d^T) is the regression design matrix corresponding to the augmented indirect GEK data set. A straightforward computation shows that
\[
\begin{pmatrix} \Phi & 0\\ 0 & I_{(p+1)}\end{pmatrix}\begin{pmatrix} R_I & F_I\\ F_I^T & 0 \end{pmatrix}\begin{pmatrix} \Phi^T & 0\\ 0 & I_{(p+1)}\end{pmatrix} = \begin{pmatrix} R_D & F_D\\ F_D^T & 0 \end{pmatrix},
\]
so that the system (3.3) is equivalent to the standard direct GEK system (2.13), where
\[
(3.4) \qquad R_D = \Phi R_I \Phi^T, \qquad F_D = \Phi F_I
\]
are the key relations.

Regarding the direct and the indirect GEK approaches, in [14, section 2.4] it was claimed that "when using the same correlation parameters both formulations give the same result." The above lemma gives a rigorous proof of this observation. As shown in the proof, both approaches match exactly up to a first-order Taylor approximation of the indirect GEK correlation matrix.
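The key relation (3.4) can also be checked numerically. The sketch below (not part of the paper) assembles R_D, the Taylor-approximated indirect matrix R_I of (3.2), and the transformation Φ for a one-dimensional Gaussian-kernel example and confirms that Φ R_I Φ^T reproduces R_D up to round-off; the test data and parameter values are arbitrary.

```python
import numpy as np

n, eps, theta = 6, 1e-3, 3.0
rng = np.random.default_rng(1)
x = np.sort(rng.random(n))

D = x[:, None] - x[None, :]
R = np.exp(-theta * D**2)                            # autocorrelation block
dR1 = -2.0 * theta * D * R                           # d rho / d x_i
dR2 = -dR1                                           # d rho / d x_j
ddR = 2.0 * theta * R * (1.0 - 2.0 * theta * D**2)   # d^2 rho / d x_i d x_j

# direct GEK matrix (2.12) and Taylor-approximated indirect matrix (3.2), d = 1
RD = np.block([[R, dR2], [dR1, ddR]])
RI = np.block([[R,             R + eps * dR2      ],
               [R + eps * dR1, R + eps**2 * ddR   ]])   # first-order terms cancel

# transformation Phi of Lemma 3.1 for d = 1
I = np.eye(n)
Phi = np.block([[I, np.zeros((n, n))], [-I / eps, I / eps]])

print(np.max(np.abs(Phi @ RI @ Phi.T - RD)))   # tiny: (3.4) holds up to round-off
```

Replacing R_I by the exact augmented correlation matrix built from the shifted points x_i + ε reproduces R_D only up to O(ε), in line with the Taylor argument of the proof.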

Theorem 3.2. The maximum likelihood optimization problem corresponding to direct GEK is equivalent to the maximum likelihood optimization problem corresponding to indirect GEK, up to a first-order Taylor approximation of the indirectly augmented correlation matrix.

Proof. The regression vector and the variance associated with indirect GEK are
\[
\beta_I(\theta) = \bigl(F_I^T R_I^{-1}(\theta) F_I\bigr)^{-1} F_I^T R_I^{-1}(\theta) Y_I, \qquad \sigma_I^2(\theta) = \frac{1}{n(d+1)}\,\bigl(Y_I - F_I\beta_I(\theta)\bigr)^T R_I(\theta)^{-1}\bigl(Y_I - F_I\beta_I(\theta)\bigr);
\]
see (2.4), (2.5). The hyper-parameter vector θ is fitted by optimizing the condensed log-likelihood function
\[
(3.5) \qquad \max_{\{\theta\in\mathbb{R}^d,\ \theta_j>0\}} L_I(\theta) = -n(d+1)\log(\sigma_I^2(\theta)) - \log(\det(R_I(\theta))),
\]

which is just (2.3) for indirect GEK. Because of (3.4), it holds that R_I = Φ^{-1} R_D (Φ^{-1})^T and F_I = Φ^{-1} F_D. As a consequence,
\[
\beta_I(\theta) = \bigl(F_D^T(\Phi^{-1})^T\, \Phi^T R_D^{-1}\Phi\, \Phi^{-1} F_D\bigr)^{-1} F_D^T(\Phi^{-1})^T\, \Phi^T R_D^{-1}\Phi\, \Phi^{-1} Y_D = \beta_D(\theta).
\]
Similarly, σ_I²(θ) = σ_D²(θ). Finally,
\[
\log\det(R_I(\theta)) = \log\det\bigl(\Phi^{-1} R_D (\Phi^{-1})^T\bigr) = \log\bigl(\varepsilon^{2dn}\det(R_D)\bigr) = 2dn\log(\varepsilon) + \log\det(R_D).
\]


Since 2dn log(ε) is a constant independent of θ, (3.5) is equivalent to
\[
(3.6) \qquad \max_{\{\theta\in\mathbb{R}^d,\ \theta_j>0\}} L_D(\theta) = -n(d+1)\log(\sigma_D^2(\theta)) - \log(\det(R_D(\theta))),
\]

which is precisely the hyper-parameter fitting problem for direct GEK.

Corollary 3.3. The cross-correlations must be considered when conducting maximum likelihood training for GEK predictors.

Proof. A sampling plan constructed via the indirect GEK data augmentation approach may be considered as the realization of a (very unlikely) random sampling plan for standard (gradient-free) Kriging. Hence, all sample points are considered in the MLE, just as for standard Kriging. By the one-to-one connection to the direct GEK established in Lemma 3.1 and Theorem 3.2, it follows that the complete auto- and cross-correlation matrix must be considered when conducting maximum likelihood training for direct GEK predictors.

The above corollary should not be interpreted to imply that gradients or augmented data at all base sample points in every coordinate direction are actually required in order to obtain high-quality GEK predictors. There may be some redundancy in the sampled information, and computational costs may be saved by excluding certain data points and/or partial derivatives. As exposed in [8], GEK predictors may be constructed including only some selected derivative information. This is in analogy to the reduced data augmentation schemes considered in [15, section 7.2]. The reasoning of Corollary 3.3 then implies that the cross-correlations between the selected derivatives and sample points must be included in the likelihood estimation process, since the corresponding augmented sample points would also appear naturally in the likelihood estimation of an indirect data augmentation scheme.

Remark 3.4. It seems likely that Corollary 3.3 can be obtained by probabilistic reasoning: Defining the likelihood as the probability of the occurrence of the data (based on the assumed underlying covariance structure) and inferring that this probability should be large since the data occurs in one trial, the objective is to identify the model's hyper-parameters by maximizing this probability, i.e., to find the model which is the most likely to produce the observed event. When considering the partial derivatives as auxiliary variables with known cross-correlations to the sampled function values, including the derivatives in the MLE process may be justified in the same way as done for the so-called two-variable cokriging predictors; see [9, p. 342 ff.]. However, the author was not able to find a rigorous probabilistic proof in the literature, and the original computational proof given above has the additional benefit that it equates explicitly the direct and the indirect GEK.

Remark 3.5. By Lemma 3.1 and Theorem 3.2, it is shown that the direct GEK approach is theoretically equivalent to the standard Kriging approach for an augmented data set. As a surprising consequence, the limit result derived in [27, Theorem 3.1] transfers to direct GEK: Under suitable conditions, ensuring that certain limits exist, it holds that optimally trained direct GEK predictors cannot feature arbitrarily ill-conditioned correlation matrices. Moreover, for hyper-parameters θ → 0, such that ρ(θ) → 1, there exist asymptotic sandwich bounds for the maximum likelihood function in terms of the condition number of the corresponding auto- and cross-correlation matrix. As a consequence, when the condition number blows up, so does the likelihood estimate. A basic requirement for this result to hold is that in the limit θ → 0, the corresponding correlation matrix converges to the singular matrix with all entries equal to one, which is also fulfilled by the Taylor approximation to the correlation matrix of the augmented indirect Kriging data set; see (3.2). The direct connection
between the worsening of the likelihood and the condition number blowup does hold when the complete auto- and cross-correlation matrix is considered for the likelihood training. Figures 4.3 and 4.4 depict this behavior. This theoretical result, however, does not prevent the correlation matrix in practical applications from being so badly conditioned that the prediction is spoiled by numerical errors.

It has been observed by several authors that the indirect GEK correlation matrix tends to be badly conditioned; see [14, section 2.5.1]. This is obvious, since the indirect GEK correlation matrix is a singular matrix disturbed by O(ε); see (3.2). Recall that for an invertible symmetric matrix A, the condition number with respect to the Euclidean norm is κ₂(A) = ‖A‖₂‖A⁻¹‖₂ = |λ_max(A)|/|λ_min(A)|, where λ_max(A), λ_min(A) denote the eigenvalues of maximum and minimum modulus of A, respectively. For hyper-parameters θ → 0, however, both the indirect and the direct GEK correlation matrices become singular, as does the standard Kriging autocorrelation block. The example considered in section 4 shows that the conditioning of the direct GEK correlation matrix can be even worse than the conditioning of the indirect GEK correlation matrix evaluated at the same hyper-parameters; see section 4.3 and Figure 4.3.

4. An academic example.

4.1. Setup. In this section, the theoretical facts established in section 3 are illustrated on an academic but nonartificial example. As a reference function,
\[
y : [0, 100] \times [0, 100] \to \mathbb{R}, \qquad x = (x_1, x_2) \mapsto 5 + 10^{-4}\, x_1^2\, x_2\, \sin(10^{-1} x_2),
\]
is chosen. The sample data set consists of n = 15 randomly selected sample points with corresponding sample values and partial derivatives, all of them listed in Table 4.1. As a preparation, the sample data is normalized by a transformation to the unit square,
\[
\psi : \mathbb{R}^2 \to [0,1]^2, \qquad \psi(x) = \begin{pmatrix}\psi_1(x)\\ \psi_2(x)\end{pmatrix}, \qquad \psi_k(x) = \frac{x_k - \min\nolimits_k}{\max\nolimits_k - \min\nolimits_k} \quad (k = 1, 2),
\]
where max_k = max{x_k^i | i = 1, ..., n}, min_k = min{x_k^i | i = 1, ..., n}, and the sampled partial derivatives are modified according to the chain rule so that y is replaced by y ∘ ψ⁻¹ : [0, 1]² → R. The purpose of such a data transformation is twofold: on the one hand, it standardizes the range of weak or strong influence of the hyper-parameters θ. On the other hand, it makes sure that the spatial factors (x_l^i − x_l^j), which appear in the partial derivatives (see (A.1)-(A.6)), are of modulus smaller than 1.

Table 4.1
Sample data set.

  i    x = (x1, x2)         y(x)       ∂1 y(x)     ∂2 y(x)
  1    (84.02, 39.44)     −15.01      −0.4764     −2.443
  2    (78.31, 79.84)      53.55       1.240      −0.0287
  3    (91.16, 19.76)      20.09       0.3311      0.1175
  4    (33.52, 76.82)      13.51       0.5075      0.2582
  5    (27.78, 55.40)       2.107     −0.2083      0.2624
  6    (47.74, 62.89)       5.079      0.0033      1.4345
  7    (36.48, 51.34)      −1.233     −0.3418      0.1582
  8    (95.22, 91.62)      26.58       0.4533     −7.787
  9    (63.57, 71.73)      27.52       0.7086      2.139
 10    (14.16, 60.70)       4.742     −0.0364      0.1147
 11    (1.630, 24.29)       5.004      0.0052     −0.0003
 12    (13.72, 80.42)       6.488      0.2168     −0.0098
 13    (15.67, 40.09)       4.249     −0.0959     −0.0824
 14    (12.98, 10.88)       5.162      0.0250      0.0234
 15    (99.89, 21.83)      22.83       0.3570     −0.4340

4.2. Comparison of different likelihood training approaches. In this section, two GEK predictors are compared which differ by the maximum likelihood training of the hyper-parameter vector θ = (θ1, θ2). On the one hand, the hyper-parameters are determined via solving the likelihood optimization problem (3.6) with respect to the complete auto- and cross-correlation matrix R_D. This approach is denoted by gradient-enhanced Kriging maximum likelihood estimation (GEK-MLE). On the other hand, the hyper-parameters are determined via solving the likelihood optimization problem (3.6) with respect to the autocorrelation block of the matrix R_D only, as suggested in [17]. This approach is denoted here by Kriging maximum likelihood estimation (Krig-MLE). Both approaches are applied to the direct GEK with analytic partial derivatives.

In order to model the spatial correlations, the Gaussian correlation function as stated in the appendix was applied and the constant regression model was chosen. It is well known that the Gaussian model tends to produce ill-conditioned correlation matrices (see [12]), yet it was chosen here deliberately, since the example at hand focuses on the comparison of the different likelihood training approaches. For improved numerical stability during the likelihood optimization, a small regularization surcharge of r = 10⁻⁷ is added to the diagonal of the correlation matrices in both approaches.

Table 4.2 displays the different starting values θ_init used to initialize the likelihood optimization problems as well as the optimal hyper-parameters θ_opt obtained via the GEK-MLE approach and the Krig-MLE approach, respectively. As an optimization procedure, the sequential quadratic programming method fmincon with the option sqp from MATLAB [19] was employed. The hyper-parameters are bounded by 10⁻³ ≤ θ1, θ2 ≤ 50.

Table 4.2
Initial and optimized hyper-parameters: (GEK-MLE) means considering the auto- and cross-correlations when conducting hyper-parameter training; (Krig-MLE) refers to discarding the cross-correlations when conducting hyper-parameter training.

  θ_init          θ_opt (GEK-MLE)      θ_opt (Krig-MLE)
  (0.2, 0.2)      (0.6228, 6.4706)     (3.0945, 27.2077)
  (0.02, 0.02)    (0.6285, 6.4849)     (0.001, 0.001)

The response surfaces corresponding to the different likelihood optimization approaches and the different starting solutions as well as the reference surface are displayed in Figures 4.1 and 4.2. For both choices of starting parameters, the GEK-MLE approach leads to the same (local) optimum and therefore the same predictor, while the Krig-MLE optimization hits the lower bound when started from θ_init = (0.02, 0.02). In this case, despite the regularization, the corresponding predictor degenerates numerically; see Figure 4.2.

4.3. Comparison of the condition number of direct and indirect GEK. In this section, the direct gradient-enhanced correlation matrix R_D (see (2.12)), the indirect gradient-enhanced correlation matrix as appearing in (3.1), here denoted by R_aug, and the Taylor approximation of the indirect gradient-enhanced correlation matrix R_I (see (3.2)) associated with the sample data set introduced in section 4.1 are compared in terms of their condition number depending on the hyper-parameters


Fig. 4.1. GEK estimation for a two-dimensional analytic test function based on 15 samples points. The likelihood optimization was started from θ = (0.2, 0.2). The graph shows the reference function (white), the GEK prediction with hyper-parameters determined by optimizing the likelihood with respect to the full auto- and cross-correlation matrix (light shaded), and the GEK prediction with hyper-parameters determined by optimizing the likelihood with respect to the autocorrelation matrix only (dark shaded).


Fig. 4.2. Same as Figure 4.1, but for a different starting point θ = (0.02, 0.02) used to initialize the likelihood optimization.

θ = (θ1 , θ2 ). For the indirectly augmented sample locations and thus the corresponding Taylor approximations, a step size of ε = 0.001 was chosen. In order to obtain one-dimensional graphs, the various condition numbers are evaluated along the bisecting line θ(τ ) = (τ, τ ) for 0.001 ≤ τ ≤ 10 at 1, 000 equidistant points. The resulting graphs are displayed in Figure 4.3. As can be seen in this figure, in the


Fig. 4.3. Left: Comparison of the condition number κ2 of the direct gradient-enhanced correlation matrix (DGEK, solid), the indirect gradient-enhanced correlation matrix (IGEK, dotted), and its Taylor approximation according to (3.2) (IGEK-Taylor, dashed) for hyper-parameters θ along the bisecting line θ = (τ, τ ), 0.001 ≤ τ ≤ 10.0. Right: Graphs of the associated maximum likelihood functions. The IGEK maximum likelihood estimations are shifted by the additive constant 2nd log(ε), ε = 10−3 ; cf. the proof of Theorem 3.2. The various likelihood functions behave very similarly, and discrepancies are due to matrix regularization. The dashed and dotted lines virtually coincide.

range 2 ≤ τ ≤ 10, the condition number of the direct gradient-enhanced correlation matrix R_D is lower by about four orders of magnitude than the condition numbers of R_aug and R_I. In the range 0.001 ≤ τ ≤ 1, however, the various correlation matrices are comparably ill-conditioned. It is a remarkable fact that for a rather moderate value of τ = 0.5, the direct gradient-enhanced correlation matrix features a larger condition number than its competitors, the numerical values at θ = (0.5, 0.5) being κ₂(R_D) = 2.25e+18, κ₂(R_aug) = 7.31e+17, κ₂(R_I) = 1.31e+18.

This effect may be partly explained as follows: for the Gaussian correlation model, the hyper-parameter vector θ = (0.5, 0.5) leads to entries exactly equal to 1.0 on the diagonal of the correlation block corresponding to the second-order derivatives; see (A.1)-(A.3). If all components of θ are smaller than 0.5, the diagonal entries become smaller than 1.0, and if all components of θ are larger than 0.5, the diagonal entries become larger than 1.0. In this case, the correlation matrix tends toward diagonal dominance and the conditioning improves. A similar effect can be observed for the diagonal entries of the Taylor approximation of the augmented matrix R_I, but decelerated by the square of the Taylor step size ε; see (3.2). Figure 4.3 also shows that the condition number of the indirect gradient-enhanced correlation matrix R_aug behaves very comparably to the condition number of its Taylor approximation R_I throughout the hyper-parameter range considered here.

The right-hand side of Figure 4.3 shows the graphs of the associated maximum likelihood functions along the bisecting line θ(τ) = (τ, τ), 0.001 ≤ τ ≤ 10.0, regularized by adding r = 10⁻¹¹ to each correlation matrix diagonal. As predicted by theory, when considering the shift by the additive constant 2nd log(ε), ε = 10⁻³ (cf. the proof of Theorem 3.2), the various likelihood functions behave very similarly. In fact, the differences in the graphs are solely caused by the regularization and would vanish otherwise. Yet, without regularization, numerical singularities occur along the line θ(τ) = (τ, τ). In order to confirm this claim, this exercise was repeated with the cubic correlation function replacing the Gaussian model (see Appendix A). Along the


Fig. 4.4. Same as Figure 4.3 but for cubic correlation instead of Gauss correlation. Left: Comparison of the condition number κ2 of the direct gradient-enhanced correlation matrix (DGEK, solid), the indirect gradient-enhanced correlation matrix (IGEK, dotted), and its Taylor approximation according to (3.2) (IGEK-Taylor, dashed) for hyper-parameters θ along the bisecting line θ = (τ, τ ), 0.01 ≤ τ ≤ 0.999. Right: Graphs of the associated maximum likelihood functions. When shifting the IGEK maximum likelihood estimations by the additive constant 2nd log(ε), ε = 10−3 (cf. the proof of Theorem 3.2), all the lines displayed virtually coincide. (In the θ-range chosen here, the cubic correlation model does not require regularization.)


Fig. 4.5. Same as Figure 4.4 but for a Taylor step size of ε = 10−4 instead of ε = 10−3 chosen for the indirect GEK approaches. While the likelihood (right) is not affected notably after the additive shift by 2nd log(ε), ε = 10−4 , the value of the condition number corresponding to the indirect GEK approaches is increased (left).

bisecting line θ(τ ) = (τ, τ ), 0.001 ≤ τ ≤ 0.999, again evaluated at 1, 000 equidistant points, it was possible to compute the corresponding likelihood estimations without applying any regularization. As a consequence, the function graphs virtually coincide; see Figure 4.4. Thus, Theorem 3.2 is confirmed by numerical experiment. Figure 4.5 displays the same graphs but for a step size of ε = 10−4 used in the Taylor expansion for the indirect GEK approaches. As can be seen in this figure, the likelihood graphs are not notably affected by the smaller Taylor step size, while the condition number of the indirect approaches has increased. Of course, the additive shift has been adapted in this case to the value 2nd log(ε), ε = 10−4 .
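The hyper-parameter training procedure used in sections 4.2 and 5.2 can be mimicked in a few lines of code. The sketch below (not part of the original study) minimizes the negative of the GEK likelihood (3.6) over a bounded interval, using scipy in place of MATLAB's fmincon/sqp. It is restricted to d = 1, the Gaussian kernel, and a constant regression model, and it borrows the bounds 10^-3 ≤ θ ≤ 50 and the small diagonal regularization from section 4.2; all names are ad hoc.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_gek_log_likelihood(theta, x, y, dy, reg=1e-7):
    """-(3.6): concentrated GEK log-likelihood for d = 1, Gaussian kernel,
    constant regression model; 'reg' is a small diagonal regularization."""
    n = len(x)
    D = x[:, None] - x[None, :]
    R = np.exp(-theta * D**2)
    dR1 = -2.0 * theta * D * R
    ddR = 2.0 * theta * R * (1.0 - 2.0 * theta * D**2)
    RD = np.block([[R, -dR1], [dR1, ddR]]) + reg * np.eye(2 * n)
    YD = np.concatenate([y, dy])
    FD = np.concatenate([np.ones(n), np.zeros(n)])[:, None]   # gradient of a constant is 0
    RDinv_Y = np.linalg.solve(RD, YD)
    RDinv_F = np.linalg.solve(RD, FD)
    beta = (FD.T @ RDinv_Y).item() / (FD.T @ RDinv_F).item()
    resid = YD - beta * FD.ravel()
    sigma2 = resid @ np.linalg.solve(RD, resid) / (2 * n)
    return 2 * n * np.log(sigma2) + np.linalg.slogdet(RD)[1]

# toy data: y(x) = sin(10x) + x^2 with analytic derivatives
x = np.linspace(0.05, 0.95, 8)
y, dy = np.sin(10 * x) + x**2, 10 * np.cos(10 * x) + 2 * x
res = minimize_scalar(neg_gek_log_likelihood, bounds=(1e-3, 50.0),
                      args=(x, y, dy), method='bounded')
print('optimized theta:', res.x)
```

The Krig-MLE variant discussed above would simply evaluate the same objective with the autocorrelation block R in place of the full matrix R_D.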


Fig. 5.1. Computational grid for the RAeS 2822 airfoil, detailed view close to the surface.


Fig. 5.2. Computational grid for the RAeS 2822 airfoil, detailed view close to the nose.

5. An engineering application.

5.1. Setup. In this section, the GEK method is applied to a real-life aerodynamic engineering problem. The objective is to construct a surrogate model for computing the lift coefficient C_L of the RAeS 2822 airfoil over the Mach number M (i.e., the ratio of the flow velocity to the speed of sound) and the angle of attack α at which the incoming flow hits the airfoil.

For computing a sample data point {(α_i, M_i), C_L(α_i, M_i)}, the turbulent Navier-Stokes equations must be solved. This is done by means of computational fluid dynamics (CFD) using the flow solver TAU [24] developed at the German Aerospace Center. The computational mesh, which represents the flow domain around the RAeS 2822 airfoil, features 43,590 elements and is displayed in extracts in Figures 5.1 and 5.2. Flow solutions are computed at a constant Reynolds number of 6.5 million. (The Reynolds number relates the inertial forces to the viscous forces.) The range of interest of the lift coefficient function is chosen as C_L : [2.4°, 3.2°] × [0.71, 0.75] → R, (α, M) ↦ C_L(α, M).

In order to have enough sample data and validation data, the aerodynamic C_L function was evaluated at 60 sample points in the range [2.4°, 3.2°] × [0.71, 0.75]. The sampling plan for the alpha-Mach range was obtained from a quasi-random Sobol point sequence, which provides a higher degree of uniformity than Monte Carlo sampling. The sample locations are displayed in Figure 5.3, right-hand side.
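A sampling plan of this kind can be generated, for instance, with scipy's quasi-Monte Carlo module. The sketch below (not from the paper) draws 60 Sobol points in the alpha-Mach box stated above; the seed and the use of scrambling are arbitrary choices.

```python
from scipy.stats import qmc

# 60 quasi-random sample locations in [2.4, 3.2] x [0.71, 0.75]
sampler = qmc.Sobol(d=2, scramble=True, seed=42)
unit_points = sampler.random(60)                      # points in the unit square
plan = qmc.scale(unit_points, l_bounds=[2.4, 0.71], u_bounds=[3.2, 0.75])
print(plan[:3])                                       # first three (alpha, Mach) pairs
```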


Fig. 5.3. The coarse set of 8 sample points for computing the GEK surrogate model (left) and the fine set of 60 sample points for computing the GEK reference surface (right).


Fig. 5.4. GEK reference lift prediction based on the fine 60-points sample data set with a constant Mach isoline (solid) and a constant alpha isoline (dashed) highlighted.

A TAU CFD computation of C_L at a single alpha-Mach combination took about 2,600 s ≈ 43.3 min on a machine with a 2.80-GHz CPU. The TAU adjoint flow solver [2, 7] provides the partial derivatives in both directions, ∂_α C_L(α_i, M_i) and ∂_M C_L(α_i, M_i), at a sample point (α_i, M_i) in a single computation. For the RAeS 2822 airfoil grid at hand, an adjoint flow solution took about 100 seconds.

For computing a reference surface, GEK prediction was performed based on the fine sample data set consisting of 60 sample points and 120 partial derivatives (60 in Mach direction, 60 in alpha direction). The cubic correlation model was applied; see Appendix A. As outlined in section 4.1, before the actual GEK interpolation, the sample locations are normalized by a transformation to the unit square. The resulting C_L-surface is displayed in Figure 5.4.

5.2. Comparison of the different likelihood training approaches. As in section 4.2, the two different approaches to GEK hyper-parameter training are compared in the following. To this end, a coarse subset of 8 sample points was

Table 5.1
Sample data set for RAeS 2822 lift prediction.

  i    x = (α, M)           C_L(x)     ∂α C_L(x)    ∂M C_L(x)
  1    (3.0000, 0.7200)     0.8406     0.1748        3.6254
  2    (2.7250, 0.7463)     0.7842     0.0961       −1.422
  3    (2.4625, 0.7194)     0.7368     0.1951        4.2627
  4    (2.5875, 0.7131)     0.7481     0.1929        4.3054
  5    (3.1875, 0.7231)     0.8735     0.1441        2.2179
  6    (2.8875, 0.7181)     0.8173     0.1851        3.9689
  7    (3.1375, 0.7356)     0.8549     0.0901       −0.6335
  8    (2.5375, 0.7456)     0.7628     0.1215       −0.2316


selected randomly out of the 60-point sample data set at hand. The precise sample data is listed in Table 5.1, and the sample locations are displayed in Figure 5.3, left-hand side. Again, the GEK predictor that is obtained by optimizing the hyper-parameters with respect to the complete auto- and cross-correlation matrix is referred to as being constructed via the GEK-MLE approach, while the GEK predictor that is obtained by optimizing the hyper-parameters with respect to the autocorrelation block only is referred to as being constructed via the Krig-MLE approach. For both predictors, the cubic correlation model (see Appendix A) is utilized. Since the cubic model generally leads to better conditioned correlation matrices when compared to the Gaussian correlation, adding a regularization constant to the correlation matrix diagonal is omitted.

The hyper-parameter training is restricted to the interval 10⁻³ ≤ θ1, θ2 ≤ 1.0 − 10⁻⁴. As a starting value, (θ1, θ2) = (0.2, 0.2) is chosen. Constrained optimization is performed in MATLAB [19] by the fmincon function using the sqp option for sequential quadratic programming. The GEK-MLE approach leads to optimized hyper-parameters θ = (0.2588, 0.5852), while the Krig-MLE approach leads to optimized hyper-parameters θ = (0.6152, 0.2289). The corresponding response surfaces are displayed in Figure 5.5. For comparing the accuracy of both GEK predictors,


Fig. 5.5. GEK lift prediction based on the coarse 8-points sample data set according to the GEK-MLE approach (left) and the Krig-MLE approach (right).


Fig. 5.6. Discrete errors of the GEK predictor trained via the GEK-MLE approach over the index of the validation points.


Fig. 5.7. Discrete errors of the GEK predictor trained via the Krig-MLE approach over the index of the validation points.

Table 5.2
Initial hyper-parameters, optimized hyper-parameters, norm of the vector of errors at the 52 validation points, and condition number of the auto- and cross-correlation matrix for the different likelihood training approaches.

  MLE approach    θ_init        θ_opt                ‖·‖₂-error    ‖·‖₁-error    κ₂(R_D)
  GEK-MLE         (0.2, 0.2)    (0.2588, 0.5852)     0.0336        0.1761        1.240e+5
  Krig-MLE        (0.2, 0.2)    (0.6152, 0.2289)     0.0632        0.3454        1.624e+4

the remaining 52 sample points of the 60-points data set, which were not used to construct the predictor, now serve as validation data. The errors over the point index are displayed in Figures 5.6 and 5.7, and the 1-norm and the 2-norm of the resulting error vectors are listed in Table 5.2. As can be seen in this table and the figures, the errors produced by the GEK-MLE approach are about half as large as the errors produced by the Krig-MLE approach. Apart from the superior accuracy, it is even more interesting to note that the GEK-MLE approach detects a stronger level of correlation in the alpha coordinate


Fig. 5.8. Isoline CL (·, M ) of the GEK reference surface at a constant scaled Mach number ˜ = 0.7995 corresponding to an actual value of M = 0.7416 (solid) and isoline CL (α, ·) at a of M constant scaled angle of α ˜ = 0.7995 corresponding to an actual value of α = 3.0293 (dashed) (cf. Figure 5.4).

direction (θ1 = 0.2588) than in the Mach coordinate direction (θ2 = 0.5852). The isolines highlighted in Figure 5.4 and also displayed in Figure 5.8 show that this is perfectly in line with the theoretical expectation motivated in section 2.2, since the Mach isolines are closer to being linear functions than the alpha isolines. (Remember that small hyper-parameter values correspond to strong spatial correlation and vice versa.) In comparison, the Krig-MLE approach leads to counterintuitive hyper-parameter values of θ = (0.6152, 0.2289), where the stronger level of spatial correlation is wrongly detected in the Mach coordinate direction. However, in this case the condition number of the correlation matrix obtained via the GEK-MLE approach is larger by one order of magnitude than the condition number of the correlation matrix obtained via the Krig-MLE approach; see Table 5.2. This observation may be explained in part by the fact reported in [1] that likelihood-optimized hyper-parameters may correspond to ill-conditioned correlation matrices. As an aside, this example shows that GEK predictors trained via the GEK-MLE ansatz are not necessarily better conditioned than their competitors.

6. Summary and conclusion. In this paper, it has been shown that the direct GEK and the indirect GEK are related by a one-to-one parameter transformation, up to a first-order Taylor approximation of the correlation functions. Moreover, the associated maximum likelihood functions coincide up to an additive constant. Since the indirect GEK approach is but the standard Kriging method for an augmented data set, it has been inferred that the complete auto- and cross-correlation matrix must be considered in the likelihood training approach, as it corresponds to the augmented data set when performing indirect GEK.

The resulting GEK predictors for two different likelihood training approaches (namely, considering auto- and cross-correlation versus considering autocorrelation only) have been compared with an academic example as well as with an aerodynamic engineering application. The academic example featured a confirmation of the theoretical findings by numerical computations. For the engineering problem, a superiority of the full maximum likelihood training approach in terms of prediction accuracy was observed.

A word of caution: The main result of this work shows that theoretically, the full correlation matrix must be considered when performing the maximum
likelihood training. This does not mean that the resulting GEK predictor is guaranteed to outperform the autocorrelation-trained GEK predictor.

Appendix A. Correlation models. The Gaussian correlation model is
\[
\rho(\theta, x^i, x^j) = \exp\Bigl(-\sum_{k=1}^{d} \theta_k (x^i_k - x^j_k)^2\Bigr).
\]
The partial derivatives read
\[
(A.1) \qquad \partial_{(k,\cdot)}\rho(\theta, x^i, x^j) = \frac{\partial\rho}{\partial x^i_k}(\theta, x^i, x^j) = \rho(\theta, x^i, x^j)\,(-2)\theta_k (x^i_k - x^j_k),
\]
\[
(A.2) \qquad \partial_{(\cdot,l)}\rho(\theta, x^i, x^j) = \frac{\partial\rho}{\partial x^j_l}(\theta, x^i, x^j) = \rho(\theta, x^i, x^j)\,2\theta_l (x^i_l - x^j_l) = -\partial_{(l,\cdot)}\rho(\theta, x^i, x^j),
\]
\[
(A.3) \qquad \partial^2_{(k,l)}\rho(\theta, x^i, x^j) = \frac{\partial^2\rho}{\partial x^i_k\,\partial x^j_l}(\theta, x^i, x^j) = \rho(\theta, x^i, x^j)\,2\theta_k\bigl(\delta_{k,l} - 2\theta_l (x^i_k - x^j_k)(x^i_l - x^j_l)\bigr).
\]
The cubic correlation model is ρ(θ, x^i, x^j) = ∏_{k=1}^{d} ρ_k(θ_k, (x^i_k − x^j_k)), where
\[
\rho_k\bigl(\theta_k, (x^i_k - x^j_k)\bigr) = \begin{cases} 1 - 3\bigl(\theta_k |x^i_k - x^j_k|\bigr)^2 + 2\bigl(\theta_k |x^i_k - x^j_k|\bigr)^3, & 1 > \theta_k |x^i_k - x^j_k|,\\ 0, & 1 \le \theta_k |x^i_k - x^j_k|.\end{cases}
\]
Writing d^{ij}_k = x^i_k − x^j_k, the partial derivatives of the cubic model read
\[
(A.4) \qquad \partial_{(k,\cdot)}\rho(\theta, x^i, x^j) = \begin{cases} \operatorname{sign}(d^{ij}_k)\bigl(6\theta_k^3 (d^{ij}_k)^2 - 6\theta_k^2 |d^{ij}_k|\bigr)\,\prod_{h\ne k}\rho_h(\theta_h, d^{ij}_h), & 1 > \theta_k |d^{ij}_k|,\\ 0, & 1 \le \theta_k |d^{ij}_k|,\end{cases}
\]
\[
(A.5) \qquad \partial_{(\cdot,l)}\rho(\theta, x^i, x^j) = -\partial_{(l,\cdot)}\rho(\theta, x^i, x^j),
\]
\[
(A.6) \qquad \partial^2_{(k,k)}\rho(\theta, x^i, x^j) = \begin{cases} \bigl(-12\theta_k^3 |d^{ij}_k| + 6\theta_k^2\bigr)\,\prod_{h\ne k}\rho_h(\theta_h, d^{ij}_h), & 1 > \theta_k |d^{ij}_k|,\\ 0, & 1 \le \theta_k |d^{ij}_k|,\end{cases}
\]
\[
(A.7) \qquad \partial^2_{(k,l)}\rho(\theta, x^i, x^j) = \begin{cases} -\prod_{h\in\{k,l\}}\operatorname{sign}(d^{ij}_h)\bigl(6\theta_h^3 (d^{ij}_h)^2 - 6\theta_h^2 |d^{ij}_h|\bigr)\,\prod_{h\ne k,l}\rho_h(\theta_h, d^{ij}_h), & 1 > \max_h \theta_h |d^{ij}_h|,\\ 0, & 1 \le \max_h \theta_h |d^{ij}_h|\end{cases}
\]
(k ≠ l).
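To make the cubic-model formulas concrete, the following Python sketch (not part of the paper) implements the correlation ρ and the first partial derivative (A.4) and verifies the latter against a central finite difference; all names and test values are ad hoc.

```python
import numpy as np

def rho_1d(t, dk):
    """One-dimensional cubic factor: 1 - 3u^2 + 2u^3 with u = min(t*|dk|, 1)."""
    u = min(abs(t * dk), 1.0)
    return 1.0 - 3.0 * u**2 + 2.0 * u**3

def rho_cubic(theta, d):
    """Cubic correlation: product of the one-dimensional factors over all coordinates."""
    u = np.minimum(np.abs(theta * d), 1.0)
    return np.prod(1.0 - 3.0 * u**2 + 2.0 * u**3)

def d_rho_cubic_first_arg(theta, d, k):
    """(A.4): derivative of rho w.r.t. the k-th coordinate of the first argument,
    evaluated at the difference vector d = x^i - x^j."""
    if theta[k] * abs(d[k]) >= 1.0:
        return 0.0
    factor = np.sign(d[k]) * (6.0 * theta[k]**3 * d[k]**2 - 6.0 * theta[k]**2 * abs(d[k]))
    others = [rho_1d(theta[h], d[h]) for h in range(len(d)) if h != k]
    return factor * np.prod(others)

# finite-difference check of (A.4) at an arbitrary point
theta = np.array([0.6, 1.2]); d = np.array([0.3, -0.4]); h = 1e-6
fd = (rho_cubic(theta, d + [h, 0.0]) - rho_cubic(theta, d - [h, 0.0])) / (2 * h)
print(d_rho_cubic_first_arg(theta, d, 0), fd)     # both approximately equal
```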

Acknowledgment. The author is very grateful to Dr. Dishi Liu (Institute of Aerodynamics and Flow Technology, German Aerospace Center (DLR), Braunschweig, Germany) for providing the data for the RAeS 2822 lift prediction test case discussed in section 5 and for fruitful discussions.

REFERENCES

[1] R. Ababou, A. B. Bagtzoglou, and E. F. Wood, On the condition number of covariance matrices in kriging, estimation and simulation of random fields, Math. Geol., 26 (1994), pp. 99-133.
[2] J. Brezillon and N. Gauger, 2D and 3D aerodynamic shape optimisation using adjoint approach, Aerospace Science Technology, 8 (2004), pp. 715-727.
[3] D. Busby, C. L. Farmer, and A. Iske, Hierarchical nonlinear approximation for experimental design and statistical data fitting, SIAM J. Sci. Comput., 29 (2007), pp. 49-69.
[4] H. S. Chung and J. J. Alonso, Using gradients to construct cokriging approximation models for high-dimensional design optimization problems, in Proceedings of the 40th AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV, 2002.
[5] G. J. Davis and M. D. Morris, Six factors which affect the condition number of matrices associated with kriging, Math. Geol., 29 (1997), pp. 669-683.
[6] A. I. J. Forrester, A. Sobester, and A. J. Keane, Engineering Design via Surrogate Modelling: A Practical Guide, John Wiley & Sons, New York, 2008.
[7] M. B. Giles and N. A. Pierce, An introduction to the adjoint approach to design, Flow Turbulence Combustion, 65 (2000), pp. 393-415.
[8] Z. H. Han, S. Görtz, and R. Zimmermann, Improving variable-fidelity surrogate modeling via gradient-enhanced kriging and a generalized hybrid bridge function, Aerospace Science Technology, 25 (2012), pp. 177-189.
[9] A. G. Journel and C. J. Huijbregts, Mining Geostatistics, 5th ed., Blackburn Press, Coldwell, NJ, 1991.
[10] M. C. Kennedy and A. O'Hagan, Predicting the output from a complex computer code when fast approximations are available, Biometrika, 87 (2000), pp. 1-13.
[11] J. R. Koehler and A. B. Owen, Computer experiments, in Design and Analysis of Experiments, S. Ghosh and C. R. Rao, eds., Handbook of Statist. 13, Elsevier, Amsterdam, 1996, pp. 261-308.
[12] A. B. Kostinski and A. C. Koivunen, On the condition number of Gaussian sample-covariance matrices, IEEE Trans. Geosci. Remote Sensing, 38 (2000), pp. 329-332.
[13] D. Krige, A statistical approach to some basic mine valuation problems on the Witwatersrand, J. Chemical Metallurgical Mining Engrg. Soc. South Africa, 52 (1951), pp. 119-139.
[14] J. Laurenceau and P. Sagaut, Building efficient response surfaces of aerodynamic functions with kriging and cokriging, AIAA J., 46 (2008), pp. 498-507.
[15] W. Liu, Development of Gradient-Enhanced Kriging Approximations for Multidisciplinary Design Optimization, Ph.D. thesis, University of Notre Dame, Notre Dame, IN, 2003.
[16] S. Lophaven, H. B. Nielsen, and J. Søndergaard, DACE—A MATLAB Kriging Toolbox, Version 2.0, Technical report IMM-TR-2002-12, Technical University of Denmark, Lyngby, Denmark, 2002.
[17] A. March, K. Willcox, and Q. Wang, Gradient-based multifidelity optimisation for aircraft design using Bayesian model calibration, in Proceedings of the 2nd Aircraft Structural Design Conference, Royal Aeronautical Society, London, 2010, p. 1720.
[18] K. V. Mardia and A. J. Watkins, On multimodality of the likelihood in the spatial linear model, Biometrika, 76 (1989), pp. 289-295.
[19] MATLAB, version 7.10.0 (R2010a), The MathWorks, Inc., Natick, MA, 2010.
[20] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, 1965.
[21] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, MA, 2006.
[22] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn, Design and analysis of computer experiments, Statist. Sci., 4 (1989), pp. 409-423.
[23] T. J. Santner, B. J. Williams, and W. I. Notz, The Design and Analysis of Computer Experiments, Springer, New York, 2003.
[24] D. Schwamborn, T. Gerhold, and R. Heinrich, The DLR TAU-code: Recent applications in research and industry, in Proceedings of the European Conference on Computational Fluid Dynamics, ECCOMAS CFD 2006, Egmond aan Zee, The Netherlands, 2006.
[25] J. J. Warnes and B. D. Ripley, Problems with likelihood estimation of covariance functions of spatial Gaussian processes, Biometrika, 47 (1987), pp. 640-642.
[26] W. Yamazaki, M. P. Rumpfkeil, and D. J. Mavriplis, Design optimization utilizing gradient/Hessian enhanced surrogate models, in Proceedings of the 28th AIAA Applied Aerodynamics Conference, Chicago, IL, 2010.
[27] R. Zimmermann, Asymptotic behavior of the likelihood function of covariance matrices of spatial Gaussian processes, J. Appl. Math., (2010), 494070.