Some relations between Gaussian process binary classification, Multivariate skew normal, and Maximising correlations (working note)

Kian Ming A. Chai
School of Informatics, University of Edinburgh
[email protected]
Third version: February 15, 2009. Second version: October 24, 2008. First version: January 31, 2008.
Abstract

We show that the posterior of the latent function in GP binary classification is related to the multivariate skew normal distribution, and suggest how classification may be achieved without explicitly using a latent function. A connection is made between ML-II estimation for GP binary classification and correlation maximisation. Two heuristics to approximate the ML-II are proposed.
1 GP Binary Classification and Multivariate skew normal

For an input x, let y ∈ {−1, 1} be its class label. Binary classification can be performed via a Gaussian process with the following model [Rasmussen and Williams, 2006, chapter 3]:

⟨f(x)⟩ = 0                          prior mean of latent process          (1)
⟨f(x), f(x′)⟩ = k(x, x′)            prior covariance of latent process    (2)
p(y_i | f_i) = Φ(y_i f_i)           probit likelihood                     (3)
where f_i = f(x_i) and Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−t²/2} dt is the cdf of the standard normal. Given an observed data set D = {(y_i, x_i)}_{i=1}^{n}, the posterior over the latent function values f = (f(x_1), . . . , f(x_n)) is

p(f | D) ∝ φ_n(f; 0, K) ∏_{i=1}^{n} Φ(y_i f_i)        posterior of latent values        (4)
where K is an n×n matrix with k(x_i, x_j) being its (i, j)th entry, and φ_n(·; µ, Σ) is the n-variate normal pdf with mean µ and covariance Σ. The product over the likelihoods can be expressed as

∏_{i=1}^{n} Φ(y_i f_i) = Φ_n(D_y f; 0, I_n)        (5)

where D_y is an n×n diagonal matrix with y_i being its (i, i)th entry, and Φ_n(·; µ, Σ) is the n-variate normal cdf with mean µ and covariance Σ, i.e.

Φ_n(z; µ, Σ) = [(2π)^n |Σ|]^{−1/2} ∫_{−∞}^{z_1} · · · ∫_{−∞}^{z_n} exp(−(t − µ)^T Σ^{−1} (t − µ)/2) dt_1 · · · dt_n        (6)

Note that Φ_1(z; 0, 1) ≡ Φ(z). Placing (5) into (4) immediately identifies the posterior as a multivariate skew normal distribution [MSN, Gupta et al., 2004]:

f | D ∼ SN_n(· | 0, K, D_y)        (7)

SN_n(z; µ, Σ, D) def= φ_n(z; µ, Σ) Φ_n(D(z − µ); 0, I_n) / Φ_n(0; 0, I + DΣD^T)        (8)
The mean and variance of the posterior are thus those of the corresponding MSN distribution. We note here that the MSN distribution proposed by Gupta et al. [2004] is but one of many such distributions [Arellano-Valle and Azzalini, 2006].
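As a quick numerical illustration (mine, not part of the note), the identities (5) and (8) can be checked for a small example; the sketch below assumes numpy and scipy are available and uses a brute-force grid integral of the unnormalised posterior (4) as a stand-in for the exact normaliser.

# Numerical check of Eqs. (5) and (8) for n = 2 (an illustration, assuming numpy/scipy).
import numpy as np
from scipy.stats import norm, multivariate_normal as mvn

K = np.array([[1.0, 0.6],
              [0.6, 1.5]])            # prior covariance K of the latent values
y = np.array([1.0, -1.0])             # class labels
Dy = np.diag(y)
f = np.array([0.3, -0.7])             # an arbitrary latent vector

# Eq. (5): the product of probit likelihoods is a normal cdf with identity covariance.
lhs = np.prod(norm.cdf(y * f))
rhs = mvn.cdf(Dy @ f, mean=np.zeros(2), cov=np.eye(2))
print(lhs, rhs)                        # the two values should agree

# Eq. (8): the normaliser of the MSN posterior.
Z = mvn.cdf(np.zeros(2), mean=np.zeros(2), cov=np.eye(2) + Dy @ K @ Dy.T)

# Brute-force check of Z: integrate the unnormalised posterior (4) on a grid.
grid = np.linspace(-8.0, 8.0, 401)
d = grid[1] - grid[0]
F1, F2 = np.meshgrid(grid, grid, indexing="ij")
pts = np.stack([F1.ravel(), F2.ravel()], axis=1)
unnorm = (mvn.pdf(pts, mean=np.zeros(2), cov=K)
          * norm.cdf(y[0] * pts[:, 0]) * norm.cdf(y[1] * pts[:, 1]))
print(Z, unnorm.sum() * d * d)         # should agree to a few decimal places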
2 Joint distribution via MSN construction

Similar to the construction of the MSN given by Gupta et al. [2004, Proposition 3.4], we shall give a construction of the joint distribution between the GP latent process and the probit data likelihood. We define an alternative class labelling scheme: define r_− def= (−∞, 0) and r_+ def= (0, ∞), and for an input x, let r ∈ {r_−, r_+} be its class label; conceptually, this means that instead of knowing the class labels via discrete (binary) values, we know them via the sign of the values. The data set corresponding to D is then D_r def= {(r_i, x_i) | r_i = r_{sgn(y_i)}}_{i=1}^{n}. We also introduce a vector of n real auxiliary random variables g def= (g_1, . . . , g_n)^T, and define an augmented data set D_g def= {(g_i ∈ r_i, x_i)}.

Proposition 2.1. Let

(g^T, f^T)^T ∼ N_{2n}( · | 0, [ I + K , K ; K , K ] )        (9)

where the covariance is written in blocks, rows separated by semicolons. Then

f ∼ N_n(· | 0, K)                     prior of latent values        (10)
p(g_i ∈ r_i | f_i) = Φ(y_i f_i)       likelihood                    (11)
f | D_g ∼ SN_n(· | 0, K, D_y)         posterior of latent values    (12)
Proof. One may prove this by considering the original likelihoods given in (3) and delaying the integration within the probit to as late as possible in all operations. The following is an alternative proof, following that of Gupta et al. [2004, Proposition 3.4].

That (10) holds is clear by considering only the f component of (9). For (11), consider the conditional distribution g | f ∼ N_n(· | f, I). The components are independent in this conditional, so we may also write g_i | f_i ∼ N(· | f_i, 1) for each i = 1 . . . n. When y_i = −1, then r_i = r_−, and we have p(g_i ∈ r_i | f_i) = Φ_1(0; f_i, 1) ≡ Φ(y_i f_i); similarly for y_i = 1. For (12), using Bayes’ rule gives

p(f | D_g) = p(f) p({g_i ∈ r_i}_{i=1}^{n} | f) / p({g_i ∈ r_i}_{i=1}^{n}).        (13)

Comparing this with (7) and (8), we are required to show

p({g_i ∈ r_i}_{i=1}^{n} | f) = Φ_n(D_y f; 0, I_n)        (14)
p({g_i ∈ r_i}_{i=1}^{n}) = Φ_n(0; 0, I + D_y K D_y^T).        (15)
The likelihood term (14) follows immediately from the definition of Φ_n and (11). For the normalisation term (15), note that D_y is orthogonal (indeed D_y^{−1} = D_y^T = D_y), so that

I + D_y K D_y^T = D_y (I + K) D_y^T,        (16)

giving

Φ_n(0; 0, I + D_y K D_y^T)
  = [(2π)^n |I + D_y K D_y^T|]^{−1/2} ∫_{−∞}^{0} · · · ∫_{−∞}^{0} exp(−t^T (I + D_y K D_y^T)^{−1} t / 2) dt_1 · · · dt_n
  = [(2π)^n |D_y (I + K) D_y^T|]^{−1/2} ∫_{−∞}^{0} · · · ∫_{−∞}^{0} exp(−t^T D_y^{−T} (I + K)^{−1} D_y^{−1} t / 2) dt_1 · · · dt_n
  = [(2π)^n |I + K|]^{−1/2} ∫_{−∞}^{0} · · · ∫_{−∞}^{0} exp(−(D_y^{−1} t)^T (I + K)^{−1} (D_y^{−1} t) / 2) dt_1 · · · dt_n
  = [(2π)^n |I + K|]^{−1/2} ∫_{u_1 ∈ r_1} · · · ∫_{u_n ∈ r_n} exp(−u^T (I + K)^{−1} u / 2) du_1 · · · du_n
  = p({g_i ∈ r_i}_{i=1}^{n}),        (17)

where |D_y (I + K) D_y^T| = |I + K| since |D_y| = ±1, and where we have used the change of dummy variables of integration t_i = −y_i u_i, i = 1 . . . n. This change has unit Jacobian in absolute value, leaves the quadratic form unchanged because the Gaussian has zero mean, and maps the lower orthant {t_i < 0, i = 1 . . . n} onto the region {u_i ∈ r_i, i = 1 . . . n}. The last equality then follows because, from (9), g ∼ N_n(· | 0, I + K) marginally.
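The identity (15)/(17) is also easy to check by simulation. The following Monte Carlo sketch (my own illustration, assuming numpy and scipy) draws g from its marginal N_n(· | 0, I + K) implied by (9) and compares the empirical frequency of the event {g_i ∈ r_i for all i} with the orthant probability Φ_n(0; 0, I + D_y K D_y^T).

# Monte Carlo check of Eq. (15)/(17) (an illustration, assuming numpy/scipy).
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(0)
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.2, 0.4],
              [0.2, 0.4, 0.8]])       # prior covariance K
y = np.array([1.0, -1.0, 1.0])        # class labels
Dy = np.diag(y)
n = len(y)

# Marginally, g ~ N(0, I + K), obtained by integrating f out of Eq. (9).
g = rng.multivariate_normal(np.zeros(n), np.eye(n) + K, size=2_000_000)

# g_i in r_i means sign(g_i) = y_i (the event g_i = 0 has probability zero).
mc = np.mean(np.all(np.sign(g) == y, axis=1))
exact = mvn.cdf(np.zeros(n), mean=np.zeros(n), cov=np.eye(n) + Dy @ K @ Dy.T)
print(mc, exact)                      # should agree to roughly three decimal places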
Remarks

1. In essence, the marginal probability p({g_i ∈ r_i}_{i=1}^{n}) chooses the relevant ranges for integration of the latent multivariate Gaussian distribution.

2. From (9), clearly g_i ∼ N(· | f_i, 1), or g_i = f_i + ε_i where ε_i ∼ N(· | 0, 1). That is, the noise N(· | 0, 1) that was previously associated with the likelihood Φ(y_i f_i) is now modelled in the latent function g. That the noise can be modelled either in the likelihood or in the latent process has been previously noted by Opper and Winther [2000]. This is also evident from the definition of the probit model, which involves a normally distributed threshold.

3 Binary classification without explicit latent values
Proposition 2.1 suggests that one may be able to achieve GP binary classification using the latent values only implicitly, by working directly with the auxiliary variables. Suppose one wishes to predict the class of a novel input x_∗, and we have

(g^T, g_∗)^T ∼ N_{n+1}( · | 0, [ I + K , k_∗ ; k_∗^T , 1 + κ ] )        (18)

where k_∗ = (k(x_1, x_∗), . . . , k(x_n, x_∗))^T and κ = k(x_∗, x_∗), with k the prior covariance function of f. Then the conditional probability that the class label y_∗ = −1 is

p(g_∗ ∈ r_− | D_g, x_∗) = p(g_∗ ∈ r_−, D_g | x_∗) / p(D_g) = Φ_{n+1}(0; 0, I + Σ_∗) / Φ_n(0; 0, I + D_y K D_y^T),        (19)

where

Σ_∗ = [ D_y , 0 ; 0 , −1 ] [ K , k_∗ ; k_∗^T , κ ] [ D_y , 0 ; 0 , −1 ]^T.        (20)
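The prediction in (19)-(20) therefore reduces to two Gaussian orthant probabilities. The sketch below (assuming numpy and scipy; the helper name predict_prob_negative and the toy numbers are mine, not from the note) computes p(y_∗ = −1 | D_g, x_∗) this way.

# Class prediction via Eqs. (19)-(20), using only Gaussian cdf evaluations
# (an illustrative sketch, assuming numpy/scipy).
import numpy as np
from scipy.stats import multivariate_normal as mvn

def predict_prob_negative(K, y, k_star, kappa):
    """Return p(y_* = -1 | D_g, x_*) for the probit GP model, Eq. (19)."""
    n = len(y)
    Dy = np.diag(y)
    # Denominator: the evidence Phi_n(0; 0, I + D_y K D_y^T).
    denom = mvn.cdf(np.zeros(n), mean=np.zeros(n), cov=np.eye(n) + Dy @ K @ Dy.T)
    # Numerator: Phi_{n+1}(0; 0, I + Sigma_*), with Sigma_* as in Eq. (20).
    D_aug = np.diag(np.append(y, -1.0))
    K_aug = np.block([[K, k_star[:, None]],
                      [k_star[None, :], np.array([[kappa]])]])
    num = mvn.cdf(np.zeros(n + 1), mean=np.zeros(n + 1),
                  cov=np.eye(n + 1) + D_aug @ K_aug @ D_aug.T)
    return num / denom

# Toy example: two positively labelled training points and one nearby test point.
K = np.array([[1.0, 0.7],
              [0.7, 1.0]])
y = np.array([1.0, 1.0])
k_star = np.array([0.6, 0.6])
kappa = 1.0
p_neg = predict_prob_negative(K, y, k_star, kappa)
print(p_neg, 1.0 - p_neg)             # p(y_*=-1) should be well below 0.5 here

Both cdf values are obtained by numerical integration, so for larger n the ratio inherits some integration error; this is the cost of never representing the latent values explicitly.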
Remarks on prior work

Models of this form are also known as multivariate probit models or correlated probit regression models, first motivated and proposed by Ashford and Sowden [1970]. The focus on the latent variables f, à la GP binary classification, is proposed by Muthén [1979]; in a similar vein, Albert and Chib [1993] focus on the weights β of the linear composition f(x) = β^T x. In particular, Albert and Chib [1993] motivate their Gibbs sampling (of β) approach by citing the inappropriateness of normal (i.e. Laplace) approximations to the posterior, noted previously by Griffiths et al. [1987] and Zellner and Rossi [1984].

4 ML-II estimation and Correlation maximisation

For a kernel k(·, ·) parameterised by θ, the ML-II or empirical Bayes approach requires the search for the optimal parameters θ̂ given by

θ̂ = arg max_θ p(D_g | θ) = arg max_θ Φ_n(0; 0, I + D_y K_θ D_y^T) = arg max_θ Φ_n(0; 0, Σ),        (21)
where Σ def= I + D_y K_θ D_y^T, and we have subscripted K with the hyperparameters θ to emphasise its dependence on them. The function Φ_n involves the integration of a zero-mean multivariate Gaussian over the lower orthant t_i ≤ 0, i = 1 . . . n. The higher the mass the multivariate Gaussian has within that orthant, the higher the value of the integral. The value of this integral is also known as the (lower) orthant probability, and it depends only on the correlation of the zero-mean Gaussian, i.e. the variances do not matter. This suggests that the covariance Σ should be one that maximises the correlation of the random variables distributed under the Gaussian.

We give an example for n = 2 where higher correlation gives higher evidence. For the same volume |Σ| = 4, compare the case D_y K_θ D_y = I, giving zero correlation, with the case D_y K_θ D_y = (3/2) 1_{2×2}, giving maximum correlation (though degenerate). In the former case, the evidence is 1/4 by symmetry. In the latter case, it is (1/4)(1 + (2/π) arctan(3/4)) ≈ (1/4) × 1.4, as illustrated in Figure 1 and obtained by considering the additional area bounded between the negative t_1 axis and the half line given by the direction vector (−4, 3)^T [see also Ruben, 1961]. Below we shall make explicit what is meant by maximising the correlation.

Theorem 4.1. (Slepian’s inequality [Slepian, 1962], see e.g. Gupta [1963]) Let x ∼ N_n(0, Σ) and y ∼ N_n(0, C), and denote the elements of Σ by σ_ij and of C by c_ij. Let σ_ii = c_ii = 1, i = 1 . . . n. If σ_ij ≥ c_ij, i, j = 1 . . . n, then

P(x_1 < a_1, x_2 < a_2, . . . , x_n < a_n) ≥ P(y_1 < a_1, y_2 < a_2, . . . , y_n < a_n)        (22)

(Footnote: Gupta [1963] gives the inequality in terms of the upper orthant, while some other literature gives it in terms of the lower. We note that the two are equivalent using the transformation x′ = −I_n x.)
Figure 1: LEFT: The black circle has unit radius and is the contour p(t) = (2π)^{−1} e^{−1/2} of the bivariate Gaussian with covariance I; the ellipse is the corresponding contour p(t) = (4π)^{−1} e^{−1/2} of the correlated bivariate Gaussian with covariance (1/2)[ 5 , 3 ; 3 , 5 ]. The arrows show the transformation of the unit Gaussian into the correlated Gaussian, using the sampling rule t_unit ∼ N(· | 0, I) and t_correlated = (1/√10)[ 5 , 0 ; 3 , 4 ] t_unit. The ticks on the axes are 8/5 units from the origin. RIGHT: The thick lines bound the area under the isotropic Gaussian to be considered for the lower orthant probability of the correlated bivariate Gaussian with covariance (1/2)[ 5 , 3 ; 3 , 5 ]. The values given in the boxes, 1/4 and (1/4) × (2/π) arctan(3/4), are the probabilities within the associated areas.
or equivalently,

Φ_n(a; 0, Σ) ≥ Φ_n(a; 0, C).        (23)
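As a numerical aside (my own check, assuming numpy and scipy), the n = 2 example above and the monotonicity in (22)-(23) are easy to verify: the lower orthant probability of Σ = I + D_y K_θ D_y^T, which is exactly the evidence in (21), grows with the correlation and matches the closed form (1/4)(1 + (2/π) arctan(3/4)) in the maximally correlated case.

# Orthant probabilities for the n = 2 example and Slepian's inequality
# (an illustration, assuming numpy/scipy).
import numpy as np
from scipy.stats import multivariate_normal as mvn

def lower_orthant(cov):
    """P(t_1 < 0, t_2 < 0) for t ~ N(0, cov)."""
    return mvn.cdf(np.zeros(2), mean=np.zeros(2), cov=cov)

# Uncorrelated case: D_y K_theta D_y = I, so Sigma = 2I and the evidence is 1/4.
print(lower_orthant(2.0 * np.eye(2)))                     # ~0.25

# Maximally correlated case: D_y K_theta D_y = (3/2) 1_{2x2}.
Sigma = np.eye(2) + 1.5 * np.ones((2, 2))
closed_form = 0.25 * (1.0 + (2.0 / np.pi) * np.arctan(3.0 / 4.0))
print(lower_orthant(Sigma), closed_form)                  # both ~0.352

# Slepian's inequality: the orthant probability is monotone in the correlation.
for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, lower_orthant(np.array([[1.0, rho], [rho, 1.0]])))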
The following corollary allows us to work also with covariance functions.

Corollary 4.2. (See also Tong [1990, §5.1.4]) Let x ∼ N_n(0, Σ) and y ∼ N_n(0, C), and denote the elements of the correlation of x by ρ_ij and of y by r_ij, i.e. ρ_ij = σ_ij / √(σ_ii σ_jj) and r_ij = c_ij / √(c_ii c_jj). If ρ_ij ≥ r_ij, i, j = 1 . . . n, then

Φ_n(0; 0, Σ) ≥ Φ_n(0; 0, C).        (24)
Proof. Use the variance-correlation decomposition of Σ into ∆^{1/2} P ∆^{1/2}, where ∆ is a diagonal matrix of variances and P is the correlation matrix. Φ_n(0; 0, Σ) = Φ_n(0; 0, P) can be shown via a suitable scaling of the dummy variables of integration (by the reciprocals of the respective standard deviations). The same can be done for C, so that Φ_n(0; 0, Σ) = Φ_n(0; 0, P) ≥ Φ_n(0; 0, R) = Φ_n(0; 0, C), where R is the correlation matrix for y and the inequality is Theorem 4.1 applied to P and R.

Hence, ML-II maximisation is the same as maximising the correlations between the variables. Note, however, that both Slepian’s inequality and the above corollary give only a pre-ordering of the correlations {ρ_ij} based on the scalar field Φ_n(a; 0, {ρ_ij}). Hence their use is limited, since there is no single direction along which to optimise; that is, it is still necessary to try all valid increases in correlations, compute the corresponding Φ_n values, and choose the highest. To make this clearer, consider the case of 3 data points for GP classification, giving a centered trivariate Gaussian distribution parameterised by {ρ_12, ρ_13, ρ_23}. Figure 2 illustrates Slepian’s inequality: each point in the vector space of correlations gives a Gaussian more correlated than the points it dominates in the Cartesian sense. However, points such as ∗ and † in Figure 2 are not directly comparable.

5 Maximising projection of correlations and Empirical alignment

For maximising the correlation, one heuristic is to maximise the sum of correlations, i.e. given an n×n correlation matrix P = (ρ_ij), we maximise

1^T P 1 = ∑_{i=1}^{n} ∑_{j=1}^{n} ρ_ij = n + 2 ∑_{i<j} ρ_ij = n + 2 · 1^T (ρ_12, ρ_13, . . . , ρ_{(n−1)n})^T        (25)
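A minimal sketch of this heuristic follows (assuming numpy; the squared-exponential kernel and the toy data are illustrative assumptions of mine, not part of the note). It scores each candidate lengthscale by the summed correlations 1^T P 1, where P is the correlation matrix of Σ = I + D_y K_θ D_y^T from Section 4.

# Scoring hyperparameters by the summed correlations 1^T P 1 of Eq. (25)
# (an illustrative sketch, assuming numpy; the kernel choice is an assumption).
import numpy as np

def summed_correlation_score(X, y, lengthscale):
    """1^T P 1, where P is the correlation matrix of Sigma = I + D_y K D_y^T."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * sq_dists / lengthscale ** 2)        # assumed squared-exponential kernel
    Dy = np.diag(y)
    Sigma = np.eye(len(y)) + Dy @ K @ Dy.T
    d = np.sqrt(np.diag(Sigma))
    P = Sigma / np.outer(d, d)                            # variance-correlation decomposition
    return float(np.sum(P))                               # = n + 2 * sum_{i<j} rho_ij

# Score a few candidate lengthscales on toy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))
y = np.where(X[:, 0] > 0.0, 1.0, -1.0)
for ell in (0.1, 0.5, 1.0, 2.0):
    print(ell, summed_correlation_score(X, y, ell))

Since 1^T P 1 is increasing in every ρ_ij, this surrogate respects the ordering given by Corollary 4.2, although it remains only a heuristic for the orthant probability itself.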